Lab 5: Web Scraping
After Tim’s trips to various venues around the world, he is headed back to Brown! Unfortunately, his agent booked him for the same weekend as Spring Break. Tim’s fans who already bought plane tickets to leave for break are furious that they will miss his talk and unite to find another day for Tim to speak with no conflicts. They enlist your help, as a Tim superfan, to web scrape the Brown academic calendar for information on event names and dates, in order to rectify this wrong. The goal is to collect the event dates and names from the website to store in a dictionary that Tim’s agent can use in the future when booking Tim at Brown so that this mistake never happens again.
Part 1: Getting Started
Originally, we were supposed to scrape the academic calendar for Brown. However, due to a calendar update today, we will instead be choosing from this [website](https://www.scrapethissite.com/pages/). Navigate to the [Countries project](https://www.scrapethissite.com/pages/simple/) and try navigating through the elements on the page.
To start this lab, go to the Countries page linked above. In your browser, pull up the web inspector: right-click the page and select Inspect Element (we recommend that you use Firefox or Google Chrome). Take a second to go through the tag information and notice where information is located.
Also, please run `pip3 install bs4` and `pip3 install requests` to install the libraries needed to scrape the web.
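To confirm the install worked, you can parse a tiny hand-written HTML snippet (the string below is just an illustration, not the real page; a live page would normally come from `requests.get(url).text`):

```python
from bs4 import BeautifulSoup

# A small made-up snippet standing in for a downloaded page,
# so this sketch runs without touching the network
html = "<html><body><h1 class='mac'>Hello, scraper!</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find(tag, class) returns the first matching node; .text extracts its text
print(soup.find("h1", "mac").text)  # Hello, scraper!
```

If this prints without an `ImportError`, BeautifulSoup is ready to use.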
Part 2: Scraping the Site
Copy the stencil code below into a file called `lab5.py` and complete the following tasks:
- Write `scrape_events`. (We're expecting this function to return a dictionary containing the information from the specified tags.)
- Run `scrape_events` in your terminal, passing in the BeautifulSoup object for the page you are scraping.
- Verify that some of the values in the dictionary line up with the keys we saw in the inspector.
- Notify your TA that you have completed the lab.
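The tasks above can be sketched as follows. This is only one possible shape for `scrape_events`, written against class names (`country`, `country-name`, `country-capital`) that are assumptions from inspecting the Countries page; adjust them to whatever you actually see in your inspector:

```python
from bs4 import BeautifulSoup

def scrape_events(soup):
    """Return a dictionary mapping each country's name to its capital,
    based on the assumed class names on the Countries page."""
    results = {}
    for country in soup.find_all("div", "country"):
        name = country.find("h3", "country-name").text.strip()
        capital = country.find("span", "country-capital").text.strip()
        results[name] = capital
    return results

# A small hand-written snippet mimicking the assumed page structure,
# so the sketch can be tried without a network request:
html = """
<div class="country">
  <h3 class="country-name">Andorra</h3>
  <span class="country-capital">Andorra la Vella</span>
</div>
"""
print(scrape_events(BeautifulSoup(html, "html.parser")))
```

In the real lab you would build the soup object from the live page instead of a literal string, and pick whichever tags hold the information you want in your dictionary.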
A couple of tips as you are scraping:
- BeautifulSoup Documentation
- Some useful properties of BeautifulSoup's `find` and `find_all` methods:
  - They can take multiple arguments. If you give them two arguments, they will look not only for HTML nodes with the specified tag, but also with the specified id or class:
    - Find tags based on class: `find("h1", "mac")` will find a node with tag `h1` and class `mac`.
    - Find tags based on id: `find_all("div", id="cheese")` will find all nodes with tag `div` and id `cheese`.
  - They search deep levels of your page, i.e. if you say `node.find("div", "pretzel")`, it will not only search `node`'s children, it will also search `node`'s grandchildren, and their children, and so on through all levels of the HTML tree, until it finds a node with tag `div` and class `pretzel`.
- BeautifulSoup objects have a `.text` attribute, which you can use to get their text rather than all of their HTML.
- The `strip()` method can be called on strings to remove extra spaces at the beginning and end of a string.
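The tips above can all be seen in one small example. The HTML snippet here is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="cheese"><h1 class="mac">  Gouda  </h1></div>
<div id="cheese"><h1 class="mac">Brie</h1></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match; the second argument filters by class,
# and strip() removes the surrounding whitespace from .text
print(soup.find("h1", "mac").text.strip())     # Gouda

# find_all returns every match; id= filters by id
print(len(soup.find_all("div", id="cheese")))  # 2

# find searches all levels of the tree, not just direct children
nested = "<div><section><div class='pretzel'>deep</div></section></div>"
node = BeautifulSoup(nested, "html.parser")
print(node.find("div", "pretzel").text)        # deep
```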
Part 3: You’re all set!
That's all for lab this week! Be sure to ask any questions, as this is the same web scraping format that we follow for Project 3, and it may be helpful in the final project as well.
Practice Webscraping! Link Here