Today we’ll spend some time on setup as we learn to work with files. Then we’ll start writing a function that uses those files. (I’ll also spend some time going over fears, needs, and goals from the exercise we did last time.) If we don’t finish the function today, we’ll finish it next time.

Working with Files

Logistical Comments

I will always try to upload the livecode from class, so you can read it. Once class is over, you’ll be able to find the livecode linked on the lectures page. E.g., this file should be visible on the lecture index as well. But beware: if you find a link to such a file before class, it may be outdated!

Working with files

In CSCI 0111, you learned how to work with data organized in several structures: tables, lists, trees, and dictionaries (which are sometimes called hashtables, for reasons we’ll talk about later in the semester). You also might have seen data loaded from Google Sheets or from CSV (comma-separated values) files. In this class, we’ll see how to load data from a couple of other sources, like text files and web pages. These sources are incredibly common, but not as convenient to work with right away as what you saw in 0111. Sometimes you’ll need to process the data structure you’re given so that it better fits the things you want to do with it.

Today we’ll cover text files. Web pages come later in the course, once we’ve learned a bit about how they are structured.

Let’s say we want to write a program that works with the complete text of Frankenstein, by Mary Wollstonecraft Shelley. The text is available here via Project Gutenberg, an online collection of public-domain books. First, we’ll download the file and save it to disk somewhere. By default, it should be named 84-0.txt. This isn’t very descriptive, so I renamed mine to frankenstein.txt.

If I open up VSCode, I can start working with the file in Python. The first thing I might want to do is open the file:

> frankenstein_file = open('frankenstein.txt', 'r')

The r means we want to open the file for Reading. Opening a file gives us a sort of token that we can use to work with it. Crucially, this isn’t the same thing as the text itself. The text is still stored in the file! To get the text itself, we need to read the (now opened) file’s contents into a string:

> frankenstein = frankenstein_file.read()

Now we have the whole text as a really long string. You could see the text by asking Python to print frankenstein, but I won’t do that here because it would take up a lot of space in the notes! Something you might try is comparing what Python prints for frankenstein against what it prints for frankenstein_file. Remember: the opened file is only a token for working with the file, while the string holds the text itself.
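To make the difference between the file and its text concrete, here is a minimal sketch. It assumes a small stand-in file (sample_path and its contents are made up for illustration, not the real Frankenstein text) so that it runs anywhere. It also uses a with block, which these notes haven't introduced; that is just one common way to make sure the file gets closed.

```python
import os
import tempfile

# Hypothetical stand-in for frankenstein.txt, created here so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as setup_file:
    setup_file.write("A short sample line.\nA second line.\n")

# `with` closes the file automatically when the indented block ends.
with open(sample_path, "r", encoding="utf-8") as sample_file:
    sample_text = sample_file.read()

print(type(sample_file))   # a file object (the "token"), not a string
print(type(sample_text))   # <class 'str'>
print(sample_file.closed)  # True
```

Printing sample_file shows a description of the file object, while printing sample_text shows the contents themselves.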

Now that we have the text, what are some things we might do with it?

Replacing words and writing files

Maybe we want to rewrite Frankenstein to instead be about Bruno the bear (Brown’s sports mascot). The easiest way to do this is probably just to take the text of Frankenstein and replace “Frankenstein” with “Bruno” everywhere:

> bruno = frankenstein.replace("Frankenstein", "Bruno")

Now that we’ve done that, we could save the results in a new file:

> bruno_file = open('bruno.txt', 'w')
> bruno_file.write(bruno)
> bruno_file.close()

If we look at that file, we can see that the text has been rewritten. Everything will be identical to the original, except that any occurrences of “Frankenstein” will have been replaced with “Bruno”. And since it’s saved into a file, you could share it easily: put it on your website, send it as an email attachment, or just archive it for future generations.
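Here is a small, self-contained sketch of the same replace-then-save idea. The sentence and the output path are invented for illustration. Note that replace returns a new string and leaves the original untouched, and that reading the new file back is an easy way to check what was saved.

```python
import os
import tempfile

# Made-up sentence standing in for the full text of the book.
original = "Frankenstein met Frankenstein's creature."
rewritten = original.replace("Frankenstein", "Bruno")

# Hypothetical output location; the notes use 'bruno.txt' in the working folder.
out_path = os.path.join(tempfile.mkdtemp(), "bruno_sample.txt")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(rewritten)

# Read the file back to confirm what was saved.
with open(out_path, "r", encoding="utf-8") as in_file:
    round_trip = in_file.read()

print(original)    # unchanged
print(round_trip)  # the rewritten sentence
```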

Word counts

Let’s say we wanted to create a count of the number of times every word appears in Frankenstein. How should we get started?

This sort of design problem is quite common in computing. No matter how experienced someone is, figuring out how to get from a general problem statement to a solution can be tough. But it helps to have some standard tools. I like to follow a process like this…

You’ll sometimes hear me say that “queries influence structure”, which is the idea that your choice of how to store and represent data should be governed by what you need to do with that data. Hence this strategy.

Let’s try writing a helper function that does the counting. To get started here, we’ll write the function header (name it, say what inputs it takes, etc.) and add some documentation about what it needs to do.

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """
    pass

Notice that I’ve included type hints here, to help me remember what the input is and what kind of value the function should return. The pass keyword is a way to tell Python to do nothing—we haven’t yet written any code!

Now that we’ve got the shape of our function, it would make sense to write some examples. Here’s one:

Input: "Why do I feel bitter when I should be feeling sweet?"
Output: {"Why": 1, "do": 1, "I": 2, "feel": 1, "bitter": 1, "when": 1, "should": 1, "be": 1, "feeling": 1, "sweet": 1}

We can write this as a test in Python by writing:

assert count_words("Why do I feel bitter when I should be feeling sweet?") == {
    "Why": 1, "do": 1, "I": 2, "feel": 1, "bitter": 1, "when": 1, 
    "should": 1, "be": 1, "feeling": 1, "sweet": 1}

Python can sometimes make line breaks difficult. A generally helpful rule is that if a line ends while a brace, bracket, or parenthesis is still open, Python understands that the expression continues on the next line.
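As a quick illustration of that rule (the names here are made up for the example), both the dictionary and the function call below span several lines without any special continuation character:

```python
# The open brace lets the dictionary continue across lines.
small_counts = {
    "a": 1,
    "b": 2,
}

# The open parenthesis does the same for a function call.
total = sum(
    small_counts.values()
)

print(total)
```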

Now we’re ready to start filling in the code. We know we’re going to have to return a dictionary, so let’s make one and return it. It can be empty for now.

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """    
    counts = {}
    # Code goes here, I don't know what yet
    return counts

Now we just have to figure out how to convert the input string into the dictionary. To do that, we need to break the goal down into subtasks. What are they?

We need to:

  • break up the input string into words; and
  • count each word.

To break up the input, we’ll use the string method split(), which converts a string into a list of strings, broken up wherever there is whitespace (spaces, tabs, newlines, etc.).

To count the words, we’ll loop over that list!
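Before writing the loop, it can help to see what split() produces. In this small sketch (the sentence is made up), runs of spaces, a tab, and a newline are all treated as separators:

```python
# Two spaces, a tab, and a newline all count as whitespace.
words = "the  cat\tsat\non the mat".split()

print(words)

# Looping over the list visits one word at a time.
for word in words:
    print(word)
```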


Here’s an example of where we might end up:

def count_words(s: str) -> dict:
    """
    Accepts a string and returns a dictionary that maps words to the number of times 
    each word appears in the string. 
    """    
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

You might not have seen += before. When we write counts[word] += 1, it’s just shorthand for counts[word] = counts[word] + 1.
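It is worth running this version on the example sentence from earlier. The sketch below repeats the function so it can run on its own. One thing to watch for: split() only breaks on whitespace, so punctuation stays attached to the word next to it. With this version the last key comes out as "sweet?" rather than "sweet", so the earlier assert would not pass unless the punctuation is handled somehow.

```python
def count_words(s: str) -> dict:
    """Map each whitespace-separated word in s to how many times it appears."""
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

result = count_words("Why do I feel bitter when I should be feeling sweet?")
print(result)

# The question mark is still attached to the final word.
print("sweet?" in result)
print("sweet" in result)
```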

What can we do with this dictionary? Well, lots of things! But we’ll do that in the next class.