Lab 5: Web Scraping

View presentation

Note: We expect this lab to take around 2 hours to complete, largely because it also forms part of the Project 3 stencil.

Section 1: Background

As an avid reader, Tim is looking for a break from industry articles and is turning to students’ theses and dissertations. However, with the vast number of theses written over the years, Tim wants to streamline his process for finding his next paper to read. The solution? Tim wants you to scrape just the information about each paper (title, publishing year, authors, etc.).

In this lab, we’ll be focusing on web scraping techniques to extract information from the Brown Digital Repository (BDR), specifically the Theses and Dissertations collection.

Section 2: Setup

Download the stencil file above and open it in VSCode. Install required libraries:

$ pip3 install bs4 requests
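
If you want to check that both libraries installed correctly, a quick smoke test along these lines should run without errors (the BDR homepage is used here only as an example URL):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it; if this prints a <title> tag, both libraries are working.
response = requests.get("https://repository.library.brown.edu/")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)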

Part 1: Understanding the Website

What is the Brown Digital Repository?

The Brown Digital Repository (BDR) is maintained by the Brown University Library as “a place to gather, index, store, preserve, and make available digital assets produced via the scholarly, instructional, research, and administrative activities at Brown”. Throughout this lab and Project 3, you will see the terms “item” and “collection”. An item is any file uploaded to the BDR, such as an image, PDF, or audio file. A collection is a related group of items. We’ll be focusing on the collections of Brown’s academic departments, specifically their Theses and Dissertations.

Task 1: Open the Brown Digital Repository and navigate to the American Studies Theses and Dissertations collection. Take a look at the structure of the website and the information available for each paper. See if you can identify the HTML tags that contain the information you need to scrape.

Tip: If you’re having trouble finding the information you need, you can use the Inspect Element tool in your browser to view the HTML structure of the page.

Part 2: Storing the Data

Now that we understand the structure of the website, the next step is to decide how to structure the scraped data in your code. For each paper, you will need to store the title, year, contributor(s), subject(s), abstract, and notes. Multiple approaches will work, but some are easier to use than others; think about how you will want to access the data when writing each function!

Task 2: Define the dataclass BDRItem with the information you want to scrape from each item page.
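
If you are unsure where to start, here is a minimal sketch of one possible layout. The field names and types below are just one reasonable choice, not a requirement; pick whatever will be easiest to work with in the later functions.

from dataclasses import dataclass

@dataclass
class BDRItem:
    # One possible set of fields; a paper can have several contributors and
    # subjects, so lists are used for those here.
    title: str
    year: int
    contributor: list[str]
    subject: list[str]
    abstract: str
    notes: str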

Part 3: Data Scraping

Our first step is to make a mini function that will retrieve links to the items from only the first page of the collection. We will worry about getting links from multiple pages later.

Task 3: Complete the get_items_from_url function, which takes in the URL of a collection page and should return a list of BeautifulSoup objects, one for each item page. Your TA will guide you through the process of scraping the information from the item pages.

But you might notice that there are items marked as “Available to Brown-affiliated users only”. We will need to handle these items differently. Think about the ethical implications of scraping this data and how you might handle this situation.
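
If it helps to see the overall flow, here is a rough sketch of one way get_items_from_url could be structured. The "/studio/item/" link pattern and the handling of relative links are assumptions; confirm the real structure with Inspect Element before relying on them.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_items_from_url(url: str) -> list[BeautifulSoup]:
    collection_page = BeautifulSoup(requests.get(url).text, "html.parser")
    item_soups = []
    seen = set()
    for link in collection_page.find_all("a", href=True):
        item_url = urljoin(url, link["href"])
        # Assumed pattern for item page links; confirm it with Inspect Element.
        if "/studio/item/" not in item_url or item_url in seen:
            continue
        seen.add(item_url)
        # Depending on how you decide to handle restricted items, you may also
        # want to skip links whose surrounding text says they are limited to
        # Brown-affiliated users.
        item_soups.append(BeautifulSoup(requests.get(item_url).text, "html.parser"))
    return item_soups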

Now that we can pull the item pages out of a collection page, we can move on to extracting the information we want from each individual item page.

Task 4: Complete the scrape_item function, which takes in a BeautifulSoup object representing an item page and should return a BDRItem object with the information from the page.

Try to use helper functions to make your code more readable and modular! TAs will guide you through the get_title, get_year, and get_abstract functions; a rough sketch of get_title appears after the code below.

def scrape_item(item_page: BeautifulSoup) -> BDRItem:
    """
    Extracts relevant information from an item page and returns a BDRItem object.
    
    :param item_page: BeautifulSoup object of the item page
    :return: BDRItem object containing the scraped information
    """
    # Each helper below pulls one field out of the parsed item page.
    title = get_title(item_page)
    year = get_year(item_page)
    contributor = get_contributor(item_page)
    subject = get_subject(item_page)
    abstract = get_abstract(item_page)
    notes = get_notes(item_page)
    return BDRItem(title, year, contributor, subject, abstract, notes)
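
As an example of the helper style, a get_title function might look roughly like the sketch below. The h1 tag is a placeholder for whatever element actually holds the title on a BDR item page.

from bs4 import BeautifulSoup

def get_title(item_page: BeautifulSoup) -> str:
    # Placeholder selector: inspect a real item page to find the element that
    # actually holds the title, and adjust the tag (and any class) accordingly.
    heading = item_page.find("h1")
    return heading.get_text(strip=True) if heading else ""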

Part 4: Advanced Data Scraping

Now that we have the basic data scraping functions, we can move on to more advanced scraping tasks. We will need to extract information from multiple pages of the collection and handle items that are only available to Brown-affiliated users.

Take a minute to navigate between the pages of the collections. How does the URL change depending on what page you are on?

Task 5: Complete the get_pages_from_id function, which takes in the id of a collection (a short string, ex: gwy9dgfq) and the number of pages to retrieve, and should return a list of BeautifulSoup objects, one for each page. Note that if more pages are requested than the collection has, this function should return however many pages the collection does have.

def get_pages_from_id(collection_id: str, num_pages: int = 1) -> list[BeautifulSoup]:
    """
    Retrieves and parses the specified number of pages from a collection.
    
    :param collection_id: ID of the collection (a short string, ex: gwy9dgfq)
    :param num_pages: Number of pages to retrieve (default is 1)
    :return: List of BeautifulSoup objects, one for each page
    """
    # TODO: Implement this function
    pass
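
As a hint about structure (not about the exact URL), one possible shape for this function is sketched below. The collection URL and the page-number query parameter are assumptions; watch how the address bar changes as you click between pages to find the real pattern.

import requests
from bs4 import BeautifulSoup

def get_pages_from_id(collection_id: str, num_pages: int = 1) -> list[BeautifulSoup]:
    # Assumed URL shape for a collection page; verify it against the real site.
    base_url = f"https://repository.library.brown.edu/studio/collections/{collection_id}/"
    pages = []
    for page_num in range(1, num_pages + 1):
        # Assumed query parameter; check how the URL actually changes in your browser.
        response = requests.get(base_url, params={"page": page_num})
        page = BeautifulSoup(response.text, "html.parser")
        # Heuristic early exit: if a page has no item links, we have gone past
        # the end of the collection.
        if not any("/studio/item/" in link["href"] for link in page.find_all("a", href=True)):
            break
        pages.append(page)
    return pages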

In the get_items_from_url function, we got a list of BeautifulSoup objects, one for each item page, from a single collection page. Now that we have multiple collection pages, we need to repeat this process for every collection page. Note that the return value is a flat list, rather than a list of lists, because we no longer want to group items by page; we want all of the items in the collection together.

Task 6: Complete the get_items_from_pages function, which takes in a list of BeautifulSoup objects representing collection pages and should return a list of BeautifulSoup objects, one for each item page.

def get_items_from_pages(collection_pages: list[BeautifulSoup]) -> list[BeautifulSoup]:
    """
    Extracts individual item pages from the collection pages.
    
    :param collection_pages: List of BeautifulSoup objects representing collection pages
    :return: List of BeautifulSoup objects, one for each item page
    """
    # TODO: Implement this function
    pass
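
Structurally, this is a flattening loop over the collection pages. The get_items_from_page helper used below is hypothetical (it is not part of the stencil); it stands in for however you choose to pull the item soups out of a single, already-parsed collection page.

from bs4 import BeautifulSoup

def get_items_from_pages(collection_pages: list[BeautifulSoup]) -> list[BeautifulSoup]:
    items = []
    for collection_page in collection_pages:
        # get_items_from_page is a hypothetical helper: given one already-parsed
        # collection page, it returns the item soups found on that page. However
        # you factor it, extend one flat list instead of building a list of lists.
        items.extend(get_items_from_page(collection_page))
    return items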

Section 3: You’re all set!

That’s all for lab this week! Be sure to ask any questions you have, since this is the same web scraping workflow we follow for Project 3, and it may come in handy for the final project as well.