Web Scraping

2023 note: These notes are in draft form, and are released for those who want to skim them in advance. Tim is likely to make changes before and after class.

The livecode prep.

The in-class livecode.

2022 livecode for prep 2022 livecode from prep 2

Lab and Project 3

This week we’re introducing a new way to get data: web scraping. Since HTML is tree-based, this is a great chance to get more practice working with tree-shaped data.

When Project 3 is released, you’ll analyze data obtained in the wild! That is to say, you’ll be scraping data directly from the web. Once you’ve got that data, the tasks are structurally similar to what you did for Homework 1.

Web scraping involves more work than reading in data from a file. Sites are structured in a general way, but the specifics of each site you have to discover yourself (often with some trial and error involved). And, since these are live websites, someone might change the shape of the data at any time!

Looking Ahead to APIs

There’s another way to get data from the web that we’ll offer a lab on in the near future: APIs. Don’t think of web scraping as the only way to obtain data. More on that soon.

Setup

I’m going to run you through a demo, that’s colored by my own experiences learning how to web scrape. We’ll use 2 libraries in Python that you might not have installed yet:

You should be able to install both of these via pip3. On my Mac, this took 2 commands at the terminal:

pip3 install bs4
pip3 install requests

If you have multiple installations of Python, you might use:

python3 -m pip install bs4
python3 -m pip install requests

Goals

Today we’ll write a basic web-scraping program that extracts data from the CSCI 0112 course webpage. In particular, we’re going to get the names of all the homeworks, along with their due-dates, available to work with in Python. If time permits, we’ll do the same (with help from you!) for the list of staff members.

Testing

Depending on how complex your web-scraping program is, testing it can sometimes be challenging. Why?

Think, then click! Because you're testing your script against a moving target! Sites can change, be reformatted, go down, etc. So you'll likely want to test on a static HTML page that you've stored locally. But even that can be challenging if you need to scrape data from pages that change in response to input or other events. There are advanced testing techniques that address these challenges, but for this class we won't expect you to use them. Testing on static pages saved in a file is a good idea, though!

Documentation

Throughout this class, we’ll consult the BS4 Documentation, and you should definitely use the docs to help you navigate the HTML trees that you scrape from the web.

General sketch

We’ll use the following recipe for web scraping:

Step 1: Getting the data

Getting the content of a static web page is pretty straightforward. We’ll use the requests library to ask for the page, like this:

# Give a convenient name to the URL
assignments_url = "https://cs0112.github.io/Pages/assignments.html"
# Send a GET request, obtain the text
assignments_text = requests.get(assignments_url).content

Note that .content at the end; that’s what gets the text; if we leave that off we’ll have an object containing more information about the response to the request. If you forget this, you’re likely to get a confusing and annoying error message: TypeError: object of type 'Response' has no len().

Step 2: Parsing the data

Now we’ll give that text to beautiful soup, to be converted into a (professional-grade, robust) HTML tree:

assignments_page = BeautifulSoup(assignments_text, features='html.parser')

This returns a tree node, just like our more lightweight HTMLTree class from before. What’s with 'html.parser'? This is just a flag telling the library how to interpret the text (the library can also parse other tree-shaped data, like XML).

If we print out assignments_page it will look very much like the original text. This is because the class implements __repr__ in that way; under the hood, assignments_page really is an object representing the root of an HTML tree. And we can look at the docs to find out how to use it. For instance: assignments_page.title ought to be the title of the page.

Unlike our former HTMLTrees, which added a special text tag for text, BeautifulSoup objects can give you the text of any element directly. E.g., assignments_page.text.

Step 3: Extracting Information (The Hard Part)

We’d like to extract the contents of some of these table cells.

Concretely, we want the name and due-date of each assignment. Let’s aim to construct a dictionary with those as the keys and values. But these cells are buried deep down inside an HTML Tree. Yeah, we have a BeautifulSoup object for that tree, but how should we navigate to the information we need?

Modern browsers usually have an “inspector” tool that lets you view the source of a page. Often, these will highlight the area on the page corresponding to each element. Here’s what the inspector looks like in Firefox, highlighting the first <table> element on the page:

This is a great way to explore the structure of the HTML document and identify the location you want to isolate. Using the inspector, we can discover that there are 2 <table> elements in the document, and that the first table is the one we’re interested in.

Beautiful Soup provides a find_all method that is useful for this sort of goal:

hw_table = assignments_page.find_all('table')[0]

Looking deeper into the HTML tree via Firefox’s inspector, we see that we’d like to extract all the rows from that first table:

hw_rows = hw_table.find_all('tr')

Unfortunately, there is a snag: the header of the table is itself a row. We’d like to remove that row from consideration, otherwise we’ll get an entry in the dictionary whose key is 'Assignment' and whose value is 'Due'.

There are a few ways to fix this. Here, we might use a list comprehension to retain only the rows which contain <td> elements. It would also work to remove the rows which contain <th> (table header) elements. We could also simply look only within the <tbody> element:

homework_rows = first_table.find_all('tbody')[0].find_all('tr')

Next, we’ll use a dictionary comprehension to concisely build a dictionary pairing for each row, containing the assignment name and date:

cleaned_hw_cells = {row.find_all('td')[1].text : row.find_all('td')[3].text 
                    for row in homework_rows}

Notice that getting text from each table cell was easy. But, for clarity, I sometimes like to make a helper function for each. Like this:

    def scrape_homework_row_due(row):
        '''Nested helper to processes a table row and return a _value_ to place in our dictionary'''
        return row.find_all('td')[3].text.strip()

Our old HTMLTree library added special <text> tags for text, but Beautiful Soup just lets us get the text inside an element directly.

What’s the general pattern?

Here, we found the right table, then isolated it. Then we built and cleaned a list of rows in that table. And finally we extracted the columns we wanted.

This is a common pattern when you’re building a web scraper. You’ll find yourself repeatedly alternating between Python and the inspector, gradually refining the data you’re scraping.

How could this break?

I might add new assignments, but fortunately the script is written independent of the number of rows in the table.

If I swap the order of tables, or add another table before the homework one, the script would break. Likewise, if I change column orders, add new columns, remove a column, etc. that might also break the script.

These are all realistic problems in web-scraping, because you’re writing code against an object that the site owner or web dev might change at any moment. But at least we’d like to write a script that won’t break under common, normal modifications (like adding a new homework).

A Cleaned-Up End Result

Here’s a somewhat cleaner implementation. It also contains something we didn’t get to in class: a scraper to extract staff names from the staff webpage. (It turned out that this was more complicated, because the the staff names are best identified by looking at the element class)

I’ve also added, at the end, an example of how to test with a static, locally-stored HTML file.

from bs4 import BeautifulSoup 
from requests import get

# pip3 install bs4
# pip3 install requests

# I like to have a variable name for the URL, because otherwise the code below can get verbose.
assignments_url = "https://cs0112.github.io/Pages/assignments.html"
staff_url       = "https://cs0112.github.io/Pages/staff.html" 

# send a GET request and obtain the response (we imported this function)
assignments_response = get(assignments_url)
# get the HTML content of the response as text
assignments_text = assignments_response.content
# create a BeautifulSoup object tree that we can easily explore
# the 'html.parser' parameter says what language we're parsing. This parser
# is much more sophisticated than the toy one we used a few weeks ago. 
assignments_page = BeautifulSoup(assignments_text, features='html.parser')

# By default, this only prints a response code; 200 means successful for us
print(f'assignments_response: {assignments_response}')
# But the content of the reply contains more info: the HTML as a raw string 
print(f'assignments_text: {assignments_text}')
# Finally, the object containing the explorable tree
#   (This class has a very nice __repr__ method! Note the indentation.)
print(f'assignments_page: {assignments_page}')

def scrape_homeworks(page: BeautifulSoup) -> dict:
    '''Accepts a BeautifulSoup tree of the 0112 homeworks page, and returns a homework dictionary'''
    # We noticed that the homeworks were in the _first_ HTML Table
    first_table = page.find_all('table')[0]
    # We noticed that the table body contained rows, one row per homework
    homework_rows = first_table.find_all('tbody')[0].find_all('tr')
    print(f'Found {len(homework_rows)} rows.')

    def scrape_homework_row_name(row):
        '''Nested helper to process a table row and return a _key_ to place in our dictionary'''
        # Eliding type hints for some of this, because they get difficult. the .text field of 
        #   an element returns its text, and then we use Python's string.strip() method to remove
        #   blank spaces, etc.
        cells = row.find_all('td')
        print(cells)
        print()
        return cells[1].text.strip()
    def scrape_homework_row_due(row):
        '''Nested helper to processes a table row and return a _value_ to place in our dictionary'''
        return row.find_all('td')[3].text.strip()
    
    homework_assignments = {scrape_homework_row_name(row): scrape_homework_row_due(row) 
                            for row in homework_rows}
    return homework_assignments

print(f'scraped homeworks: {scrape_homeworks(assignments_page)}')

########################################################
## staff names

staff_page = BeautifulSoup(get(staff_url).content, features='html.parser')

def scrape_staff_names(page: BeautifulSoup) -> list:
    # Doesn't deliver anything
    # names = [name.find('span').strip() for name in page.find_all('staff-name')]
    # Look for a header-3 element with "class name" 
    names = [name.text.strip() for name in page.find_all('h3', class_='staff-name')]
    return names

print(f'scraped staff names: {scrape_staff_names(staff_page)}')

###################################################
### How can I test this with a local HTML file? ###

# test_file = open('downloaded.html', 'r')
# test_text = test_file.read()
# test_file.close()
# test_page = BeautifulSoup(test_text, features='html.parser')
# Now use test_page as you would something you scraped from the web

Optional: Classes in BS4, Anticipated Error Message

Last year in class, we tried chaining find_all calls like this:

>>> assignments_page.find_all('table')[0].find_all('tr').find_all('td')

and got a very nice error message, like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/bs4/element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
>>> 

I was surprised, because the list that find_all produces has no find_all attribute, sure, but Beautiful Soup would have no way to interpose this luxurious and informative error on lists.

The answer to how they did this is in the text of the error. It’s just that find_all doesn’t actually return a list. It returns a ResultSet, which is where this error is implemented. Python lets us treat a ResultSet like a list because various methods are implemented, like __iter__ and others, which tell Python how to (say) loop over elements of the ResultSet.