Today we’ll spend some time on setup as we learn to work with files. Then we’ll start writing a function that uses those files. If we don’t finish the function today, we’ll finish it next time.

Working with Files

I will always try to upload the livecode from class so you can read it. Once class is over, you’ll be able to find the livecode linked on the lectures page. E.g., this file should be visible on the lecture index as well.

If you find a link to such a file before class, it may be from last year, and not match what we’ll actually do in class. I leave these in the course materials as I’m preparing because it can be useful to remind myself what happened a year ago.

Logistical Comments

Please join the course EdStem, and turn on notifications. This is where most of our class communication will happen.
Required labs start next week. Make sure you attend; the first lab is about working with VSCode and Python. Since the lab isn’t until then, you will need to set up VSCode yourself according to the setup section on the website. If you have issues, go to hours or post on EdStem.

Working with files

In CSCI 0111, you learned how to work with data organized in several structures: tables, lists, trees, and dictionaries (which are sometimes called hashtables, for reasons we’ll talk about later in the semester). You also might have seen data loaded from Google Sheets, or CSV (comma-separated value) files. In this class, we’ll see how to load data from a couple of other sources, like text files and web pages. These sources are incredibly common, but not as convenient to work with right away as what you saw in 0111. Sometimes you’ll need to process the data structure you’re given so that it better fits the things you want to do with it.

Today we’ll cover text files. Web pages come later in the course, once we’ve learned a bit about how they are structured.

Let’s say we want to write a program that works with the complete text of Frankenstein, by Mary Wollstonecraft Shelley. The text is available via Project Gutenberg, an online collection of public-domain books. First, we’ll download the file and save it to disk somewhere. By default, it should be named 84-0.txt. This isn’t very descriptive, so I renamed mine to frankenstein.txt.

If I open up VSCode, I can start working with the file in Python. The first thing I might want to do is open the file:

frankenstein_file = open('frankenstein.txt', 'r', encoding='utf-8')

Notice that there are 3 different parameters given to open:

The first parameter is the name of the file. We could also give a file path, like books/frankenstein.txt, which would mean to look inside a folder called books for the file. There are two kinds of path:
- Relative paths, like the above, are always rooted in some current folder. Usually this will be the folder you open in VSCode. If you’re working in the terminal, you can change the current folder with the cd command; we’ll say more about this in the first lab.
- Absolute paths are rooted in the top-level folder of your computer’s filesystem. At time of writing, an absolute path on my laptop for the same file would be: /Users/tbn/repos/cs112/materials/Lectures/live/frankenstein.txt. But, unless somebody else has organized things exactly the same way (including having the user name tbn on a MacBook!) the path wouldn’t work for them.
The second parameter says what mode to open the file in. The r means we want to open the file for reading. Other options include w for writing (which replaces any existing contents of the file), a for appending to the end of the file, and so on.
Finally, the third parameter (which is named encoding) tells Python how the text is encoded. It turns out that there are multiple ways to store text in a computer, and different systems throughout history have used different encodings. Fortunately, utf-8 is now extremely common and widely supported, so you’re unlikely to need to worry about this in 112.
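To make the three parameters concrete, here is a minimal sketch that exercises the w, a, and r modes in turn. It assumes nothing about your own folders: the file name example.txt is made up for illustration, and the file lives in a temporary folder so that no real files are touched.

```python
import os
import tempfile

# Work in a throwaway folder so the sketch does not touch real files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")  # hypothetical file name

# 'w' creates the file, or replaces its contents if it already exists.
out_file = open(path, "w", encoding="utf-8")
out_file.write("first line\n")
out_file.close()

# 'a' keeps the existing contents and adds to the end.
out_file = open(path, "a", encoding="utf-8")
out_file.write("second line\n")
out_file.close()

# 'r' opens the file for reading only.
in_file = open(path, "r", encoding="utf-8")
contents = in_file.read()
in_file.close()

print(contents)
print(os.path.isabs(path))  # the temporary folder gives us an absolute path
```

Since the path here is absolute, it would work no matter which folder is current, but (as with my laptop path above) it would not mean anything on someone else's computer.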

Opening a file gives us a sort of token that we can use to work with it. Crucially, this isn’t the same thing as the text itself. The text is still stored in the file; the value in frankenstein_file is only the token for accessing the data! To get the text itself, we need to read the (now opened) file’s contents into a string:

> frankenstein = frankenstein_file.read()

Now we have the whole text as a really long string. You could see the text by asking Python to print frankenstein, but I won’t do that here because it would take up a lot of space in the notes! Something you might try is comparing what Python prints out for frankenstein against what it prints out for frankenstein_file. Remember:

The text of the book is represented as a string (frankenstein); and
The file, which provides access to the book’s data, is represented as a file object (frankenstein_file)
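If you don't have the book downloaded yet, the same comparison can be tried on a tiny stand-in file. This is only a sketch: the file name tiny_book.txt and its one-sentence contents are invented for the example.

```python
import os
import tempfile

# Create a tiny stand-in "book" in a temporary folder (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "tiny_book.txt")
setup_file = open(path, "w", encoding="utf-8")
setup_file.write("A very short book.")
setup_file.close()

# Open the file (this gives us the token), then read it (this gives us the text).
book_file = open(path, "r", encoding="utf-8")
book = book_file.read()
book_file.close()

print(type(book_file))  # a file object, not the text
print(type(book))       # a string
print(book)             # the text itself
```

Printing book_file directly shows a description of the file object rather than the words in the file.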

Now that we have the text, what are some things we might do with it?

Replacing words and writing files

Maybe we want to rewrite Frankenstein to instead be about Bruno the bear (Brown’s sports mascot). The easiest way to do this is probably just to take the text of Frankenstein and replace “Frankenstein” with “Bruno” everywhere:

> bruno = frankenstein.replace("Frankenstein", "Bruno")
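To see what replace does without scrolling through a whole novel, we can try it on a short made-up sentence. Two things this sketch is meant to show: replace is case-sensitive, and it returns a new string rather than changing the original.

```python
# A made-up sentence, just for experimenting.
sentence = "Frankenstein met frankenstein's creation. Frankenstein fled."

renamed = sentence.replace("Frankenstein", "Bruno")

print(renamed)   # the lowercase "frankenstein" is left alone
print(sentence)  # the original string is unchanged
```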

Now that we’ve done that, we could save the results in a new file:

> bruno_file = open('bruno.txt', 'w', encoding='utf-8')
> bruno_file.write(bruno)
> bruno_file.close()

If we look at that file, we can see that the text has been rewritten. Everything will be identical to the original, except that any occurrences of “Frankenstein” will have been replaced with “Bruno”. And since it’s saved into a file, you could share it easily: put it on your website, send it as an email attachment, or just archive it for future generations.
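One alternative you will often see in Python code, though it isn't what we did above, is the with statement, which closes the file automatically when the indented block ends. A minimal sketch, using a made-up file name in a temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bruno_sample.txt")  # hypothetical name
text = "Bruno the bear"

# The file is closed automatically when this block ends,
# so there is no separate close() call to forget.
with open(path, "w", encoding="utf-8") as sample_file:
    sample_file.write(text)

print(sample_file.closed)

# Reading works the same way.
with open(path, "r", encoding="utf-8") as sample_file:
    read_back = sample_file.read()

print(read_back)
```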

Word counts

Let’s say we wanted to create a count of the number of times every word appears in Frankenstein. How should we get started?

This sort of design problem is quite common in computing. No matter how experienced someone is, figuring out how to get from a general problem statement to a solution can be tough. But it helps to have some standard tools. I like to follow a process like this…

Write down the shape of the data you’ve already got (here, it’s a string containing the entire book).
Write down what you need to do with the data (here, it’s counting the number of times each word appears).
Decide what intermediate data structures we should use to help get from the data to the answer. (Here, a dictionary would probably make sense, since that way we can store a count for every word. Recall that a dictionary is like a lookup table: it stores values corresponding to keys.)

You’ll sometimes hear me say that “queries influence structure”, which is the idea that your choice of how to store and represent data should be governed by what you need to do with that data. Hence this strategy.
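As a refresher on the dictionary operations we are likely to need, here is a small sketch. The words and counts are made up; the point is only to show looking up a key, checking whether a key is present, and adding or updating an entry.

```python
# A small dictionary mapping words (keys) to counts (values).
counts = {"the": 2, "monster": 1}

print(counts["the"])     # look up the value stored for a key
print("bear" in counts)  # check whether a key is present

counts["bear"] = 1                 # add a fresh entry
counts["the"] = counts["the"] + 1  # update an existing entry

print(counts)
```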

Let’s try writing a helper function to do that. To get started here, we’ll write the function header (name it, say what inputs it takes, etc.) and add some documentation about what it needs to do.

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """
    pass

Notice that I’ve included type hints here, to help me remember what the input is and what kind of value the function should return. The pass keyword is a way to tell Python to do nothing—we haven’t yet written any code!

Now that we’ve got the shape of our function, it would make sense to write some examples. Here is one, taken from a song I like:

Input: "Why do I feel bitter when I should be feeling sweet?"
Output: {"Why": 1, "do": 1, "I": 2, "feel": 1, "bitter": 1, "when": 1, "should": 1, "be": 1, "feeling": 1, "sweet": 1}

We can write this as a test in Python by writing:

assert count_words("Why do I feel bitter when I should be feeling sweet?") == {
    "Why": 1, "do": 1, "I": 2, "feel": 1, "bitter": 1, "when": 1, 
    "should": 1, "be": 1, "feeling": 1, "sweet": 1}

but of course the test will fail until we’ve actually written the count_words function.
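To see what "fail" means here, this sketch runs a smaller version of the test against the unfinished function. A function whose body is only pass returns None, so the comparison is false and the assert raises an AssertionError. The try/except is there only so the sketch can report the failure instead of stopping.

```python
def count_words(s: str) -> dict:
    pass  # still just a stub


result = count_words("a b a")
print(result)  # the stub returns None

try:
    assert result == {"a": 2, "b": 1}
    test_passed = True
except AssertionError:
    test_passed = False

print(test_passed)
```

When an assert succeeds, nothing is printed at all; Python simply moves on to the next line.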

Now we’re ready to start filling in the code. We know we’re going to have to return a dictionary, so let’s make one and return it. It can be empty for now.

Python has “significant” whitespace, so when we write Python code, we need to watch out for things like indentation and line breaks. A generally helpful rule is that if a line ends while a brace, bracket, or parenthesis is still open, Python will understand that you mean to continue on the next line.
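Here is a small sketch of that line-continuation rule. The dictionary is a shortened version of our example output, and the sum is only there to show the same rule applying to parentheses.

```python
# The open brace lets the dictionary continue across several lines.
expected = {
    "Why": 1,
    "do": 1,
    "I": 2,
}

# The same rule applies inside parentheses.
total = sum(
    expected.values()
)

print(total)
```

With that in mind, here is the function with an empty dictionary to return: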

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """    
    counts = {}
    # Code goes here, I don't know what yet
    return counts

Now we just have to figure out how to convert the input string into the dictionary. To do that, we need to break the goal down into subtasks. We don’t need to immediately figure out how to do each of these, but we need to know what they are. We’ll fill in the details after. Here’s an example. We need to:

count each word; but to do that, we first need to
break up the single big string into words.

So we might write something like this:

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """    
    counts = {}
    words = [] # ??? Placeholder: I don't know how to get the real list yet.
    for word in words:
        pass # ??? We need to count this word in our dictionary.
    return counts

We can easily break up the string into words by using the split() method, which breaks the string up into pieces and returns them in a list. It takes an optional argument that says what to split the string on, but the default (any run of whitespace: spaces, tabs, or line breaks) will serve our purposes.
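A quick sketch of split() on a made-up string, with some extra spaces and a line break thrown in to show how the default behaves. The second call shows the optional argument.

```python
# A made-up string with repeated spaces and a line break.
line = "I beheld   the wretch\nthe miserable monster"

words = line.split()
print(words)

# With an argument, split() breaks on that text instead of on whitespace.
pieces = "a,b,c".split(",")
print(pieces)
```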

Now, how do we count each word? If we don’t have a recorded number for a given word yet, we should add a fresh entry to the dictionary with a count of 1. Otherwise, we can increment the existing entry by 1.

Here’s an example of where we might end up:

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """    
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

You might not have seen += before. When we write counts[word] += 1, it’s just shorthand for counts[word] = counts[word] + 1.
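As a quick check, we can run this version on the example sentence from earlier. The sketch repeats the function so that it runs on its own. One thing it shows is that split() leaves punctuation attached to words, so the last key comes out as "sweet?" rather than "sweet", and the assert we wrote earlier would not pass yet as written.

```python
def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times
    each word appears in the string.
    """
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts


result = count_words("Why do I feel bitter when I should be feeling sweet?")

print(result["I"])         # "I" appears twice
print("sweet" in result)   # the bare word is not a key...
print("sweet?" in result)  # ...because the question mark stays attached
```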

What can we do with this dictionary? Well, lots of things! But let’s reflect on how we got here. We didn’t try to write the entire function at once. Instead, we first figured out the shape we wanted it to be and wrote an example to use as a guide. Then we started sketching, without filling in the low-level details. Then we finally filled in the details.

This exercise continues in the next set of notes.