Project 3: Web Scraping
Due Dates
- Design Checks November 6 – November 8
- Design Check Response Due November 8 by 10:00 PM EST (no late days allowed)
- Project Code Due November 14 at 10:00 PM EST (up to 3 late days allowed ONLY if both you and your partner have late days remaining).
Summary
In order to find more venues to host Tim Talks and increase accessibility, you’ve turned to Craigslist. You know there’s a lot of useful information on Craigslist listings, but your goal is to scrape data from the site and put it into helpful data structures that allow you to find the ideal venue more efficiently!
For this project, we’re asking you to scrape some data from Craigslist and write a few queries over your scraped data. Specifically, you’re going to examine rental housing listings for a number of major US cities, plus Providence.
Learning goals
This project will give you experience:
- Designing data structures
- Scraping web data
Project tasks
- Read through this document in its entirety!
- Setup: Create a new VSCode project and install some libraries (if not already installed!):
pip3 install pytest
pip3 install selenium
pip3 install bs4
Then, copy the starting point code below into a file called scraper.py.
- Design Check: Complete the Design Check questions, due by Nov. 8th at 10:00 PM EST
- Implement your web scraper!
When writing code, please follow the testing and style guide.
Implementation starting point
Put this stencil code in a file called scraper.py:
Implementation tasks
Your implementation will have two parts: the scraper and the query functions.
The Scraper
Your scraper will be a function called scrape_data. It takes in a dictionary where the keys are city names from the CITIES list and the values are BeautifulSoup objects. Your scraper should use BeautifulSoup methods to transform the data into a format that you can query in your query functions.
You’ll need to scrape the following data from each rental listing:
- Price
- Number of bedrooms
- Description (i.e., the text of the link to that listing)
Some listings may be missing the number of bedrooms. You should skip these listings (i.e., you should not include them in your data).
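As a very rough sketch of the overall shape (the tag and class names below are placeholders rather than the real Craigslist markup, and the tuple format is just one possible choice of data structure):

```python
def scrape_data(city_soups):
    """Map each city name to a list of (price, bedrooms, description) tuples.

    The tag/class names below are placeholders; use the Web Inspector to
    find the real ones on the Craigslist page.
    """
    data = {}
    for city, soup in city_soups.items():
        listings = []
        for row in soup.find_all("li", "result-row"):     # placeholder class
            price_tag = row.find("span", "result-price")  # placeholder class
            beds_tag = row.find("span", "housing")        # placeholder class
            link = row.find("a")
            if beds_tag is None:
                continue  # skip listings missing a bedroom count
            listings.append((price_tag.text, beds_tag.text, link.text))
        data[city] = listings
    return data
```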
Implementation hints
Some string methods may be useful; our implementation uses .split, .strip, .replace, and .endswith.
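For instance, hypothetical raw strings like " $1,250 " and "3br" could be cleaned up with these methods (a sketch; the exact formats on the live page may differ):

```python
def parse_price(raw):
    """Turn a raw price string like '$1,250' into the integer 1250."""
    return int(raw.strip().replace("$", "").replace(",", ""))

def parse_bedrooms(raw):
    """Turn a string like '3br' into the integer 3; return None when the
    string doesn't end with 'br' (e.g. the bedroom count is missing)."""
    raw = raw.strip()
    if raw.endswith("br"):
        return int(raw[:-len("br")])
    return None
```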
In class, we saw the find_all method called with one argument: a tag name. Another form may be useful for this assignment: you can call soup.find_all("table", "cls") to find table tags with the class "cls". Classes are a way to indicate structure in HTML, and are used by browsers to determine how to display elements. The tag <div class="assignments red"> has both "assignments" and "red" as classes; we could find it with soup.find_all("div", "assignments") or soup.find_all("div", "red").
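Running this form on the <div> example above shows the difference (a small self-contained demo):

```python
from bs4 import BeautifulSoup

html = '<div class="assignments red">HW3</div><div class="red">Note</div>'
soup = BeautifulSoup(html, "html.parser")

by_assignments = soup.find_all("div", "assignments")  # only the first div
by_red = soup.find_all("div", "red")                  # both divs
```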
Query functions
You will write several queries over your data:
- A function to find the average number of bedrooms across all of the cities. The function should take in the data and return a float representing the average number of bedrooms.
- A function to find the city with the highest average price for a given number of bedrooms. The function should take in your structured data and a number of bedrooms and return a city name.
- A function to find the most commonly occurring “interesting” word in the listings for a given city. The function should use the interesting_word function in the starter code to determine which words are interesting, and should count upper- and lower-case versions of a word as being the same. It should take in the data and a city name and return a single word.
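One possible shape for these queries, assuming listings are stored as a dictionary mapping each city to a list of (price, bedrooms, description) tuples (the function names are illustrative, and the default `interesting_word` below is a stand-in for the starter code's function, not its real definition):

```python
def average_bedrooms(data):
    """Average number of bedrooms across all listings in all cities."""
    counts = [beds for listings in data.values() for (_, beds, _) in listings]
    return sum(counts) / len(counts)

def priciest_city(data, bedrooms):
    """City with the highest average price among listings with `bedrooms` bedrooms."""
    best_city, best_avg = None, float("-inf")
    for city, listings in data.items():
        prices = [price for (price, beds, _) in listings if beds == bedrooms]
        if prices and sum(prices) / len(prices) > best_avg:
            best_city, best_avg = city, sum(prices) / len(prices)
    return best_city

def most_common_interesting_word(data, city, interesting_word=lambda w: len(w) > 4):
    """Most frequent 'interesting' word (case-insensitive) in a city's listings."""
    counts = {}
    for (_, _, description) in data[city]:
        for word in description.lower().split():
            if interesting_word(word):
                counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get)
```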
Understanding Craigslist’s Website
In past years, Craigslist served its pages statically, meaning that content was embedded directly in the HTML your browser receives. That makes a site much easier to scrape using tools such as BeautifulSoup.
However, like many modern websites, Craigslist has transitioned to a more dynamic content delivery method. This approach leverages JavaScript to load content asynchronously after the initial page load! This technique is often referred to as AJAX (Asynchronous JavaScript and XML), and it allows for faster, much more responsive web applications. Instead of the whole page refreshing with every change, only specific sections are updated asynchronously, allowing the rest of the page to remain static and uninterrupted.
Now here is the problem: tools like BeautifulSoup are designed to parse HTML and are not equipped to execute JavaScript. When BeautifulSoup makes a request to a website, it receives only the initial HTML content, which does not include any content loaded asynchronously via JavaScript. As a result, there will be a discrepancy between what you see in your web browser (where JavaScript runs by default) and the content that is retrieved by BeautifulSoup.
To bridge the gap between what BeautifulSoup sees and what your browser displays, you can disable JavaScript in your web browser. This step will make the webpage’s content appear similar to the HTML content that BeautifulSoup extracts. Here’s how you can disable JavaScript in some web browsers:
Disabling JavaScript in Your Browser
Google Chrome:
- Inspect the page: Right-click and select “Inspect”.
- Navigate to the command palette: CTRL/CMD + SHIFT + P
- Search for JavaScript.
- Click on Disable JavaScript.
- Refresh the page.
Safari:
- In Safari, go to the menu bar and select Safari
- Go to Preferences » Advanced
- Enable “Show Develop menu in menu bar”
- Return to the menu bar and select Develop
- Here you can select Disable JavaScript and Show Web Inspector
You may notice a significant amount of information is lost!
Selenium
We will be using Selenium in order to extract the data that is dynamically loaded via JavaScript on Craigslist.
In the stencil code, you may notice a couple of extra lines for the retrieval of data. Please read through the documentation to gain a better understanding of what is going on.
Google Chrome Requirement:
This script is designed to work with Google Chrome, which means you need to have the Chrome browser installed on your machine. The script uses webdriver_manager to handle the ChromeDriver installation automatically. This process requires an internet connection.
If You Don’t Have Google Chrome:
If you don’t have the Google Chrome browser installed, or if you prefer to use another browser, you’ll need a different driver.
You may replace the get_driver method in the stencil code with one suited to your browser:
Note: We will also provide HTML files with consistent city data, which also serve as an alternative if your driver is failing.
Design check
For the design check, only one submission with all group members added is required. scraper.py has block comments for you to place the answers to the questions asked below.
Before your design check meeting, you should read through this document and examine a Craigslist Rental Listings Page in a web browser using the Web Inspector (see lecture from Oct. 25 for more on how to do this). Then, you should answer the following questions:
- What do you notice about the way the data you need to scrape (price, bedrooms, description) are structured on the page?
- Can we directly use data that we scrape from the page? Is there any pre-processing that needs to take place before the data can be used?
- How will you structure your data internally in order to implement your query functions? It might be helpful to think about what classes you will need, if any.
- What are the names and signatures of your three query functions?
Testing and clarity
You should write good tests for all of your query functions in a file called test_scraper.py.
You don’t need to write automated tests for your scraping function. You should convince yourself that it works on Craigslist data by calling scrape_craigslist_data.
Please follow the design and clarity guide; part of your grade will be for code style and clarity.
Local data
In order to be able to test your project against consistent data, we’ve saved a copy of the Craigslist listings for every city; you can download these data here. If you want to be able to run your code without internet access, or on consistent data, you can save the contents of that file to a directory called localdata in your project. You can then run scrape_local_data to run your scraping function against the locally-saved files.
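A helper along these lines shows how such saved pages can be read back into BeautifulSoup objects (a sketch only; the real stencil's scrape_local_data may differ, and the `<city>.html` filename scheme here is an assumption):

```python
import os
from bs4 import BeautifulSoup

def load_local_soups(directory="localdata", cities=("providence",)):
    """Build the city -> BeautifulSoup dictionary from saved HTML files,
    assuming each city's page is saved as <city>.html in `directory`."""
    soups = {}
    for city in cities:
        path = os.path.join(directory, city + ".html")
        with open(path, encoding="utf-8") as f:
            soups[city] = BeautifulSoup(f.read(), "html.parser")
    return soups
```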
Handin
You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day, so do NOT submit after the deadline unless you plan on using late days. Reminder that you may only use late days if both you and your partner have late days available.
The README template can be found here.
In addition to the questions in the template, in your README, please tell us how you tested your scraper function.
After completing the homework, you will submit:
README.txt
scraper.py
test_scraper.py
- Any additional files you create.
Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go! Only one of your partners should submit the project on Gradescope! Make sure to add your partner as a Group Member on your Gradescope submission so that you both can see the submission. Please DO NOT write your partner’s name in the README when listing other collaborators’ cslogins.
If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have any more late days), no credit will be given.
Please don’t put your name anywhere in any of the handin files; we grade assignments anonymously!