Today we’ll finish the counting example from last time, and then expand on it. Then, we’ll talk about the merits of different data structures in Python. I’ve also added some optional reading (which may be helpful for setup, understanding Python, etc.) to the end of these notes.

Most-Common Words, Data Structures, and Setup

(You can find the link to the livecode here and here. The example test file, which we didn’t get to in class, is here.)

Logistics

The first drill is being released after class today. It contains some questions about how to use the data structures from today’s lecture. This first drill will be due Wednesday at noon. Drills are due at noon because I will use the answers from drills to spot issues I might need to call out in class.

Most Common Words

Last time, we wrote a function called count_words that took a string, broke that string down into words, and then returned a dictionary of how many times each word appeared in the string.
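
If you don’t have last time’s code handy, here is a minimal sketch of what such a function could look like. It is a reconstruction for reference, not necessarily the exact livecode: splitting on whitespace with split() and the sample sentence are both assumptions.

```python
def count_words(text: str) -> dict:
    """Count how many times each whitespace-separated word appears in text."""
    counts = {}
    for word in text.split():
        if word in counts:
            counts[word] += 1   # seen before: bump the count
        else:
            counts[word] = 1    # first time we've seen this word
    return counts

# A made-up sentence, just to see the shape of the result
example = count_words("the cat saw the other cat")
print(example)
```

Note that this sketch treats "The" and "the" as different words, and leaves punctuation attached to words.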

We could use this dictionary for many things, but today let’s try to figure out what the most common word in the text of Frankenstein is. I’d like to make this an exercise in assembling a function from small pieces. Here is a function header for what we’re trying to write:

def most_common(counts: dict):

and here is a collection of lines of Python code, in no particular order, and without any indentation:

most_common = ''
if counts[word] > most_common_count:
most_common = word
for word in counts:   
most_common_count = counts[word]
return most_common
most_common_count = 0

Your job is to assemble these lines into a working implementation of most_common. This time you can trust that there are no unnecessary lines—I’ve given you exactly what you need to build the function.

Think, then click!
def most_common(counts: dict):
    most_common = ''
    most_common_count = 0
    for word in counts:   
        if counts[word] > most_common_count:
            most_common = word
            most_common_count = counts[word]
    return most_common

The most-common word isn’t very surprising, in retrospect. Often, the most common word in an English document is “the”.
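
To try the function without loading all of Frankenstein, here is a small sketch that runs it on a made-up dictionary (sample_counts is hypothetical, not real counts from the book). It also shows Python’s built-in max with a key function, which is one alternative way to get the same answer.

```python
def most_common(counts: dict):
    most_common = ''
    most_common_count = 0
    for word in counts:
        if counts[word] > most_common_count:
            most_common = word
            most_common_count = counts[word]
    return most_common

# Hypothetical counts, for illustration only
sample_counts = {"the": 3, "monster": 2, "creature": 1}

print(most_common(sample_counts))                  # the
# max looks at every key and picks the one whose count is largest
print(max(sample_counts, key=sample_counts.get))   # the
```

One difference worth knowing: on an empty dictionary our loop returns the empty string, while max raises a ValueError.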

A Few Good Data Structures

I’m going to demo the examples below in the Python console, but if you want, you could put them all into a single Python file and run it.

Lists

Lists are used to represent sequences of items in a particular order. We can build and add to lists like this:

story = ["It", "was", "a", "dark", "and", "stormy"]
story.append("night")

We can access and modify particular elements of the list by index. The first element of the list is at index 0, so if we want to change the words above from past to present tense, we would use index 1:

print(story[1])
was
story[1] = "is"
print(story)
["It", "is", "a", "dark", "and", "stormy", "night"]

We can even add lists together (sometimes you’ll hear people calling this “concatenation”):

print(["NARRATOR:"] + story)
["NARRATOR:", "It", "is", "a", "dark", "and", "stormy", "night"]

And we can loop over the elements of a list:

for word in story:
    print(word.upper())
IT
IS
A
DARK
AND
STORMY
NIGHT

Notice that some of these operations modify the list, and others return a new list. This is an important distinction, and we’ll talk more about it soon.
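
Here is a short sketch of that distinction, using the same story list as above. Nothing here is new machinery; it just checks what append and + each leave behind.

```python
story = ["It", "was", "a", "dark", "and", "stormy"]

# append modifies the existing list in place, and gives back None
result = story.append("night")
print(result)          # None
print(len(story))      # 7

# + builds a brand-new list; story itself is unchanged
narrated = ["NARRATOR:"] + story
print(len(narrated))   # 8
print(len(story))      # 7
```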

Dictionaries

Dictionaries (often called “hashtables” or “maps” in other contexts) are used to represent mappings between keys and values.

status = {"brightness": "dark", "weather": "stormy"}
status["time"] = "night"

We access elements by key rather than (as in lists) by index:

status["weather"]
"stormy"
status["weather"] = "pleasant"
status
{"brightness": "dark", "weather": "pleasant", "time": "night"}

We can check whether the dictionary contains a key:

"weather" in status
True

We can loop over the keys:

for attribute in status:
    print(status[attribute].upper())
DARK
PLEASANT
NIGHT

A common issue

It’s very common to get the following error when you’re working with dictionaries in Python:

TypeError: unhashable type: 'list'

You might see something else there besides list.

What’s going on? The problem is that dictionary keys can only be specific types of data. Lists cannot be keys in a dictionary. We’ll learn why later; for now, just know what this error means: that the type you want to use as a key can’t be used that way.
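
If you’d like to see the error happen on purpose, here is a minimal sketch. The try/except is only there so the example keeps running after the error. Using a tuple as the key instead is one common workaround, but treat it as an illustration rather than the required fix; the names counts and error_message are made up for this example.

```python
counts = {}

try:
    counts[["dark", "stormy"]] = 1    # a list as a key: not allowed
except TypeError as err:
    error_message = str(err)
    print(error_message)              # the message mentions: unhashable type: 'list'

# A tuple holding the same words can be used as a key
counts[("dark", "stormy")] = 1
print(counts)
```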

Sets

Sets store unordered collections of elements:

night = {"dark", "stormy"}

We can add elements, but can never get duplicate elements in the set:

night.add("frightening")
len(night)
3
night.add("stormy")
len(night)
3

We can test whether elements are present:

"frightening" in night
True
"inauspicious" in night
False

Like with lists and dictionaries, we can loop over the elements:

for quality in night:
    print(quality.upper())
FRIGHTENING
DARK
STORMY

We can combine sets to get a new set, if we want, without modifying the original:

night | {"inauspicious"}
{"dark", "stormy", "frightening", "inauspicious"}

We can convert a list to a set, and vice versa. But remember that once you convert to a set, duplicates are gone:

monster = ["very", "very", "scary"]
set(monster)
{"very", "scary"}
list(set(monster))
["very", "scary"]

Sets are very useful when we care about which elements are present, but not about their order.

Optional: something I wonder about sets

If I can loop over the elements of a set, but sets are unordered, in what order will the elements be visited by the loop? (Put another way, should you rely on the order in which the elements will be printed, when the data structure offers no guarantees?)
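
One way to sidestep the question, sketched below, is to ask for an order explicitly whenever you need one. The built-in sorted returns a new list, so the set itself is untouched. This is a suggestion, not the only reasonable approach.

```python
night = {"dark", "stormy", "frightening"}

# sorted() gives back a list in alphabetical order, no matter what
# order the set happens to hand its elements to us
ordered = sorted(night)

for quality in ordered:
    print(quality.upper())
# DARK
# FRIGHTENING
# STORMY
```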

Reflecting

When should we use lists, dictionaries, and sets?

Let’s say we’re looking at the text of Frankenstein again and want to answer a few questions. Which data structure could we use in order to compute each of the following?

Think, then click!

Some thoughts:

  • the number of unique non-capitalized words in Frankenstein
    • List? This is more or less the format we’ve got to work with at the start. We can extract the data from the list, certainly.
    • Set? If we just wanted to find out what the unique words were, we could use a set (since we wouldn’t need to save duplicates). But we’re after counts, so a set isn’t going to give us what we need directly. However, we could use this set as an intermediate step on the way to producing the count.
    • Dictionary? It’s not clear what the keys would be, since we’re just after “the number of…”
  • all of the characters in Frankenstein, ordered by when they appear
    • Set? We don’t need duplicate appearances, so this seems initially promising. Unfortunately, sets don’t have order.
    • Dictionary? We could have the keys be the order of appearance, but this is a bit roundabout compared to…
    • List? Probably the most natural choice. We could get the 17th character via characters[16] (remember that indices start at 0), and we aren’t obligated to have duplicates.
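
Here is a rough sketch of both ideas on a tiny made-up word list (not the real text). Two assumptions are baked in: “non-capitalized” is taken to mean the word equals its own lowercase form, and we pretend we already have a set of character names (known_names) to look for.

```python
# A hypothetical, tiny stand-in for the words of the book
words = ["The", "monster", "saw", "the", "monster",
         "Victor", "saw", "Elizabeth", "Victor"]

# Question 1: a set as an intermediate step, then count its elements
unique_lowercase = set()
for word in words:
    if word == word.lower():          # assumption: this is "non-capitalized"
        unique_lowercase.add(word)
print(len(unique_lowercase))          # 3

# Question 2: a list keeps order; we skip names we've already recorded
known_names = {"Victor", "Elizabeth"}  # assumption: names known in advance
characters = []
for word in words:
    if word in known_names and word not in characters:
        characters.append(word)
print(characters)                     # ['Victor', 'Elizabeth']
```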

Optional Content: VSCode and Common Python Issues

I’ve put some common issues and fixes in the notes below.

Python 2 versus Python 3

It turns out that there are 2 major versions of Python currently in use: version 2 and version 3. These versions are different enough that you need to be careful and run the right version: some systems have both installed! For example, here’s the state of Tim’s laptop:

% python --version
Python 2.7.16
% python3 --version
Python 3.7.3

Make sure you run the right version; this class uses Python 3, which is why, in lecture, I’m usually careful to run python3 and, when I’m not, hilarity ensues.
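
If you’re ever unsure which version is running your file, one option is to ask Python itself. This is a small sketch using the standard sys module; the error message text is my own invention.

```python
import sys

# sys.version_info.major is 2 under Python 2 and 3 under Python 3
running_python3 = sys.version_info.major >= 3

if not running_python3:
    raise RuntimeError("This course uses Python 3; try running with python3 instead.")

# Print the full version string, e.g. the same number python3 --version reports
print("Python", sys.version.split()[0])
```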

Testing your Python Programs

We’ll be using pytest to help us organize tests this semester. Last time I showed you how to write tests with assert statements. But pytest makes things a bit better. Last time I wrote:

example_input: str = "Why do I feel bitter when I should be feeling sweet?"
example_output: dict = {"Why": 1, "do": 1, "I": 2, "bitter": 1, "when": 1,
                        "should": 1, "be": 1, "feel": 1, "sweet?": 1}
assert count_words(example_input) == example_output

When it failed (I’d forgotten a word), we just got an AssertionError without further information. In contrast, if we write a test function like this and then run pytest, the failure report shows where the two dictionaries differ:

def test_count_words():
    example_input: str = "Why do I feel bitter when I should be feeling sweet?"
    # "feeling" is missing from the expected output (the forgotten word), so this test fails
    example_output: dict = {"Why": 1, "do": 1, "I": 2, "bitter": 1, "when": 1,
                            "should": 1, "be": 1, "feel": 1, "sweet?": 1}
    assert count_words(example_input) == example_output

Disabling and Enabling Popups

When I’m programming, I like to have popups enabled in VSCode. If I mouse over a function or something like that, it will pop up some documentation for the function. But this is a distraction when I’m presenting! So I’ve turned off that feature. If you want, you can control this in Code -> Preferences -> Settings. Search for editor.hover and uncheck the Enabled entry. If I want to show a popup, I can hit (on a Mac) Cmd+K and then Cmd+I.

Running your programs at the terminal

You can run your programs from outside VSCode via the terminal (which you may also hear me call the “command line”). VSCode gives you a terminal window under your code file, but you can also get a terminal through various operating-system-specific means. On macOS, I can find it under Applications and then Utilities in the Finder.

Every terminal window will have a “current directory”, which is the folder it’s currently browsing. Right now, I have a Python file called files_prep.py (from preparing this lecture!) in my teaching/112/lectures/sep13 folder. But if my terminal isn’t browsing that folder, it won’t be able to see the Python file:

% python3 files_prep.py
/usr/local/bin/python3: can't open file '/Users/tim/repos/teaching/112/learning/files_prep.py': [Errno 2] No such file or directory

This is common when, for instance, VSCode thinks the directory you want to be working in is different. Here, I’ve previously told VSCode’s explorer that I wanted to work in a separate, learning folder! I can fix the problem by just changing directory:

% cd ../lectures/sep13
% python3 files_prep.py frankenstein.txt
the
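
You can also check the current directory from inside Python, which can help when you aren’t sure where VSCode has started your program. This sketch uses the standard os module; files_prep.py is just the file name from my example, so substitute your own.

```python
import os

# The folder Python is currently "browsing"
current_directory = os.getcwd()
print(current_directory)

# Is the file visible from here? (True or False depending on where you run this)
filename = "files_prep.py"
print(os.path.exists(filename))
```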

More on scripts and errors

You might have noticed the very end of your homework 1 stencil contained some odd text. You might have also, independently, noticed that if you try to run homework 1 via the run arrow in VSCode, you can sometimes get a strange error:

File "/Users/tim/repos/cs0112/materials/Lectures/live/courses.py", line 47, in <module>
   filename = sys.argv[1]
IndexError: list index out of range

It turns out these are related. The code at the bottom of the stencil lets the program be run as a script from the terminal. The list sys.argv contains all the arguments to python. We can see these by adding print(sys.argv) right after the import in that last block of code. If I run the stencil with that addition, I see this before the error:

['/Users/tim/repos/cs0112/materials/Lectures/live/courses.py']

This makes sense: the run arrow just executes python and gives it the Python source file to run. But the way we wrote the stencil, it expects two more arguments:

We’re getting the IndexError because the script is trying to access the second and third elements of that list, and they don’t exist! The script isn’t carefully written enough: the error is terrible, and doesn’t actually say what it needs to! So rather than just say sorry (although I am!) we’re going to fix this here, live.

What do you think the error should be? Let’s add it! Instead of allowing the bad indexing, we’ll raise a better error message first. Here are two lines we might have added:

    # the first argument to the script is the filename
    if len(sys.argv) < 2:
        raise ValueError('Missing the name of the CSV file to load. Add it after the Python file name.')
    filename = sys.argv[1]

So there’s a lesson here for your own work: consider the error messages that your users might see, and try to make sure they are given from the user’s perspective. YOU weren’t trying to get a list element that wasn’t there; you just wanted to run the script.
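
To experiment with this idea away from the stencil, here is a sketch that wraps the check in a small helper. The name get_filename and the file name courses.csv are hypothetical; in a real script you would pass in sys.argv rather than the hand-written lists used here.

```python
def get_filename(argv: list) -> str:
    """Return the filename argument, or raise a user-facing error if it's missing."""
    if len(argv) < 2:
        raise ValueError('Missing the name of the CSV file to load. '
                         'Add it after the Python file name.')
    return argv[1]

# Hand-written argument lists standing in for sys.argv
print(get_filename(["courses.py", "courses.csv"]))   # courses.csv

try:
    get_filename(["courses.py"])     # what the run arrow effectively does
except ValueError as err:
    print(err)                       # the friendlier message, not an IndexError
```

Putting the check in a function like this also makes it easy to test, since we can call it with whatever list we like.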