Project 3: Web Scraping

Due Dates

Summary

In order to find more venues to host Tim Talks and increase accessibility, you’ve turned to Craigslist. You know there’s a lot of useful information on Craigslist listings, but your goal is to scrape data from the site and put it into helpful data structures that allow you to find the ideal venue more efficiently!

For this project, we’re asking you to scrape some data from Craigslist and write a few queries over your scraped data. Specifically, you’re going to examine rental housing listings for a number of major US cities, plus Providence.

Learning goals

This project will give you experience:

Project tasks

  1. Read through this document in its entirety!
  2. Setup: Create a new VSCode project and install some libraries (if not already installed!):
    • pip3 install pytest
    • pip3 install selenium
    • pip3 install bs4
    • pip3 install webdriver-manager (used by the stencil to download ChromeDriver automatically)

    Then, copy the starting point code below into a file called scraper.py.

  3. Design Check: Complete the Design Check questions due by Nov. 8th at 10pm EST
  4. Implement your web scraper!

When writing code, please follow the testing and style guide.

Implementation starting point

Put this stencil code in a file called scraper.py:

Implementation tasks

Your implementation will have two parts: the scraper and the query functions.

The Scraper

Your scraper will be a function called scrape_data. It takes in a dictionary where the keys are city names from the CITIES list and the values are BeautifulSoup objects. Your scraper should use BeautifulSoup methods to transform the data into a format that you can query in your query functions.

You’ll need to scrape the following data from each rental listing:

Some listings may be missing the number of bedrooms. You should skip these listings (i.e., you should not include them in your data).
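
To give you a sense of the overall shape, here is a minimal sketch of one possible scrape_data structure. The tag and class names, the fields pulled out, and the returned format are all placeholder assumptions; use the Web Inspector to find the real selectors and design a format that fits your queries.

    def scrape_data(soups: dict) -> dict:
        """Sketch only: the selectors and fields below are placeholders."""
        data = {}
        for city, soup in soups.items():
            listings = []
            for result in soup.find_all("li", "result-info"):   # hypothetical selector
                price_tag = result.find("span", "price")         # hypothetical selector
                housing_tag = result.find("span", "housing")     # hypothetical selector
                if housing_tag is None:
                    continue  # skip listings with no bedroom information
                listings.append({
                    "price": price_tag.text if price_tag else None,
                    "housing": housing_tag.text,
                })
            data[city] = listings
        return data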

Implementation hints

Some string methods may be useful; our implementation uses .split, .strip, .replace, and .endswith.
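
For instance, on made-up strings resembling listing text (the formatting on the real pages may differ), these methods can be combined like this:

    # Made-up example strings; real listing text may be formatted differently.
    price_text = "$1,250"
    price = int(price_text.strip("$").replace(",", ""))    # -> 1250

    housing_text = " 2br - 850ft2 "
    parts = housing_text.strip().split(" - ")               # -> ["2br", "850ft2"]
    if parts[0].endswith("br"):
        bedrooms = int(parts[0].replace("br", ""))           # -> 2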

In class, we saw the find_all method called with one argument (a tag name). Another form may be useful for this assignment: you can call soup.find_all("table", "cls") to find table tags with the class "cls". Classes are a way to indicate structure in HTML, and are used by browsers to determine how to display elements. The tag

<div class="assignments red">

has both "assignments" and "red" as classes; we could find it with soup.find_all("div", "assignments") or soup.find_all("div", "red").
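
Here is the same idea as a small, self-contained example you can run:

    from bs4 import BeautifulSoup

    html = '<div class="assignments red">Project 3</div><div class="blue">Other</div>'
    soup = BeautifulSoup(html, "html.parser")

    # Both calls match the first div, because it has both classes:
    print(soup.find_all("div", "assignments"))  # [<div class="assignments red">Project 3</div>]
    print(soup.find_all("div", "red"))          # [<div class="assignments red">Project 3</div>]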

Query functions

You will write several queries over your data:

Understanding Craigslist’s Website

In past years, Craigslist presented its information statically, meaning that content was embedded directly in the HTML your browser receives. That approach makes a site straightforward to scrape with tools such as BeautifulSoup.

However, like many modern websites, Craigslist has transitioned to dynamic content delivery: it uses JavaScript to load content asynchronously after the initial page load. This technique is often referred to as AJAX (Asynchronous JavaScript and XML), and it makes for faster, more responsive web applications. Instead of the whole page refreshing with every change, only specific sections are updated asynchronously, while the rest of the page remains static and uninterrupted.

Here is the problem: tools like BeautifulSoup are designed to parse HTML and cannot execute JavaScript. When you fetch a page and hand it to BeautifulSoup, you get only the initial HTML, without any of the content loaded asynchronously via JavaScript. As a result, there is a discrepancy between what you see in your web browser (where JavaScript runs by default) and the content BeautifulSoup retrieves.

To bridge the gap between what BeautifulSoup sees and what your browser displays, you can disable JavaScript in your web browser. This step will make the webpage’s content appear similar to the HTML content that BeautifulSoup extracts. Here’s how you can disable JavaScript in some web browsers:

Disabling JavaScript in Your Browser

Google Chrome:

Safari:

You may notice a significant amount of information is lost!

Selenium

We will be using Selenium in order to extract the data that is dynamically loaded via JavaScript on Craigslist.

In the stencil code, you may notice a couple of extra lines for the retrieval of data. Please read through the documentation to gain a better understanding of what is going on.

Google Chrome Requirement:

This script is designed to work with Google Chrome, which means you need to have the Chrome browser installed on your machine. The script uses webdriver_manager to handle the ChromeDriver installation automatically. This process requires an internet connection.
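
For reference, a Chrome driver set up with webdriver_manager and used to hand a rendered page to BeautifulSoup typically looks something like the sketch below; the stencil's actual get_driver may differ in its options and waits.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # Typical Selenium 4 setup; the stencil's get_driver may differ.
    def get_driver():
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # run without opening a browser window
        service = Service(ChromeDriverManager().install())
        return webdriver.Chrome(service=service, options=options)

    driver = get_driver()
    driver.get("https://providence.craigslist.org/search/apa")
    # Dynamically loaded listings may need a short wait before they appear.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()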

If You Don’t Have Google Chrome:

If you don’t have the Google Chrome browser installed, or if you prefer to use another browser, you’ll need a different driver.

You may replace the get_driver method in the stencil code with one for your browser, as sketched below:
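
For example, a Firefox-based replacement (assuming you have Firefox installed; webdriver_manager downloads geckodriver for you) might look like this sketch:

    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service
    from webdriver_manager.firefox import GeckoDriverManager

    # One possible Firefox-based replacement for get_driver.
    def get_driver():
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        service = Service(GeckoDriverManager().install())
        return webdriver.Firefox(service=service, options=options)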

Note: We will also provide HTML files with consistent city data, which can serve as an alternative if your driver is failing.

Design check

For the design check, only one submission, with all group members added, is required. scraper.py has block comments where you should place your answers to the questions below.

Before your design check meeting, you should read through this document and examine a Craigslist Rental Listings Page in a web browser using the Web Inspector (see lecture from Oct. 25 for more on how to do this). Then, you should answer the following questions:

Testing and clarity

You should write good tests for all of your query functions in a file called test_scraper.py.

You don’t need to write automated tests for your scraping function. You should convince yourself that it works on Craigslist data by calling scrape_craigslist_data.
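
A quick manual check might look like the following; the structure of the returned data depends entirely on the design you choose:

    # Sanity check only; the printed structure depends on your own design.
    if __name__ == "__main__":
        data = scrape_craigslist_data()
        print(type(data))
        # For example, if your data is a dict keyed by city name:
        # print(data["providence"][:3])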

Please follow the design and clarity guide; part of your grade will be for code style and clarity.

Local data

So that you can test your project against consistent data, we've saved a copy of the Craigslist listings for every city; you can download the data here. If you want to run your code without internet access, or on consistent data, you can save the contents of that file to a directory called localdata in your project. You can then run scrape_local_data to run your scraping function against the locally-saved files.
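
Under the hood, loading a saved page is just reading a file from disk and parsing it; for reference, reading one saved page (the file name inside localdata below is an assumption) might look like this:

    from bs4 import BeautifulSoup

    # The file name below is an assumption; use whatever names the
    # downloaded localdata files actually have.
    with open("localdata/providence.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")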

Handin

You may submit as many times as you want; only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day, so do NOT submit after the deadline unless you plan on using late days. As a reminder, you may only use late days if both you and your partner have late days available.

The README template can be found here.

In addition to the questions in the template, in your README, please tell us how you tested your scraper function.

After completing the homework, you will submit:

Because all we need are the files, you do not need to submit the whole project folder. As long as each file you create has the same name as what needs to be submitted, you're good to go! Only one partner should submit the project on Gradescope. Make sure to add your partner as a Group Member on your Gradescope submission so that you both can see it. Please DO NOT write your partner's name in the README when listing other collaborators' cslogins.

If you are using late days, make a note of that in your README. Remember, you may use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have any more late days), no credit will be given.

Please don't put your name anywhere in any of the handin files; we grade assignments anonymously!