Project 3: Data on the Web
Checkin: October 23 - October 29
Design Check due: October 29 at 9:00 PM EST (no late days allowed)
Implementation: due November 5 at 9:00 PM EST
Note: This project is longer than the previous ones. We recommend starting early and working on it in chunks.
This project will be a continuation of the work you did in Lab 5. Please work with the same partners. You will be working with the Brown Digital Repository (BDR) to scrape data from the web and use the BDR API to get data.
Section 1: Design Check
The design check will be slightly different this time: it will act more like a soft deadline to keep you on track for this project. You will submit whatever code you have written so far. This will help us understand where you are in the project and give you feedback to help you move forward. It is also a chance to get help with any issues you are facing.
Note: Attempt to finish the web scraping and query parts of the project before the design check.
Section 2: Background
Tim’s taste for theses and dissertations has not been sated! He needs more. Using what you’ve learned, he wants to get the data from multiple collections. He also wants to know more about the collections themselves. There’s also the issue of the BDR API. Tim has heard that the BDR API can be used to get data about collections and items. He wants you to use the API to get data about the collections and items and see if it’s any different from the data you get from scraping.
Note: Working with data on the web inherently involves printing out data to see if it is correct. You should use print statements periodically as you implement functions to check the data you are getting from the BDR scraper, BDR API and the BDR website. This will help you understand the data you are working with and how to use it. Remember print statements are your friend!
Section 3: Web Scraping
We will be continuing and expanding upon the scraping we completed in Lab 5. The following parts of the project will be completed in bdrscrape.py.
Task 1: Review the Lab 5 implementation we've provided. You are free to use your own implementation if you prefer.
By this point, we assume that you are familiar with the BDR website and the structure of its pages, so we will skip the explanation of the BDR website structure. If you need more support, please check the Lab 5 instructions.
Part 1: Scraping
Task 2: Define the function get_collection_dictionary. This function should call functions from your scraper implementation to return a dictionary where the keys are collection names from the input collection dictionary and the values are lists of BeautifulSoup objects of all the item pages for the given collection. If a collection has 30 publicly accessible items, get_collection_dictionary should have a key:value pair of the collection name and a list of 30 BeautifulSoups, one for each item page.
For example, a possible key:value pair in collection_pages would be
{
    "Africana Studies": [BeautifulSoup(item page 1),
                         BeautifulSoup(item page 2),
                         BeautifulSoup(item page 3), ...]
}
Think carefully about which combination of your three functions you should use and which of them return data appropriate for this task.
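As a reference point, here is a minimal sketch of one possible shape for this function. It assumes your Lab 5 scraper exposes a helper, called get_item_soups here purely for illustration (your own function names will differ), that takes a collection's BDR id and returns the list of BeautifulSoup objects for that collection's item pages.

# Sketch only: get_item_soups is a hypothetical stand-in for whichever of
# your Lab 5 scraper functions returns the item-page soups for one collection.
def get_collection_dictionary(collections: dict) -> dict:
    collection_pages = {}
    for name, bdr_id in collections.items():
        # one BeautifulSoup per publicly accessible item page
        collection_pages[name] = get_item_soups(bdr_id)
    return collection_pages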
Task 3: Define the function scrape_data. This function takes in a dictionary collection_pages, which is a dictionary such as one returned by get_collection_dictionary. With this dictionary, use a function from your scraper implementation to create a BDRItem for every item's BeautifulSoup for every collection, and store this in a new dictionary.
For example, a possible key:value pair in data would be
{
    "Africana Studies": [BDRItem(item 1),
                         BDRItem(item 2),
                         BDRItem(item 3), ...]
}
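As a rough sketch, scrape_data can reuse the same dictionary-building pattern. Here soup_to_item is a hypothetical stand-in for whichever of your Lab 5 functions builds a BDRItem from one item page's BeautifulSoup.

# Sketch only: soup_to_item is a hypothetical name for your Lab 5 function
# that turns a single item page's BeautifulSoup into a BDRItem.
def scrape_data(collection_pages: dict) -> dict:
    data = {}
    for name, soups in collection_pages.items():
        data[name] = [soup_to_item(soup) for soup in soups]
    return data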
Part 2: Testing
Web scraping is inherently difficult to test because the data is constantly changing. You can’t write a test that will always pass because the data you are testing on will change.
Task 4: Write in README.txt how the pages you are web scraping could change, whether through actions by BDR users or by the Brown University Library staff who maintain the BDR. How would this affect any tests?
Besides this, you don't need to write automated tests for your scraping function. You should convince yourself that it works on BDR data by calling scrape_data. You can call scrape_data on the given COLLECTIONS dictionary by using the function run_scrape, which we have provided for you. You can also call scrape_data on your own collections dictionary by following the structure we used in run_scrape.
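For example, a quick sanity check might look like the following (this assumes run_scrape takes no arguments and returns the scraped data dictionary; adjust the call to match the provided code).

# Sanity check: print a little of the scraped data and compare it by eye
# against the BDR website. Assumes run_scrape() returns the data dictionary.
data = run_scrape()
for name, items in data.items():
    print(name, len(items))  # how many items were scraped per collection?
    print(items[:2])         # spot-check the first couple of BDRItems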
Section 4: Query Functions
Now that you have your data, Tim wants you to write some queries to get information about the collections and items. These will be functions that take in the data returned by scrape_data and use each BDRItem's information to answer questions about the collection data.
Notice how these functions are independent of the method used to get the data. This is a good example of how you can separate the data collection from the data analysis.
Task 1: Define the function most_common_subject to find the most common subject(s) of a collection. The function should take in the data and a collection name and return a list of str representing the most common subject(s) of the collection. The function should not be case-sensitive (i.e. "chemistry" == "Chemistry").
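One possible approach, shown as a sketch below, is to count normalized subject strings and return every subject tied for the highest count. It assumes each BDRItem stores its subjects in a list attribute named subjects; adjust for your actual field names.

from collections import Counter

def most_common_subject(data: dict, collection: str) -> list:
    # Count subjects case-insensitively across every item in the collection.
    counts = Counter()
    for item in data[collection]:
        for subject in item.subjects:  # assumed attribute name
            counts[subject.lower()] += 1
    if not counts:
        return []
    best = max(counts.values())
    # Return every subject tied for the top count (lowercased here).
    return [subject for subject, count in counts.items() if count == best]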
Task 2: Define the function year_after to find all items with a year after the input year. The function should take in the data, a collection name, and a year and return a list of BDRItems that meet the criterion.
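A sketch, assuming each BDRItem stores its year in an int attribute named year (None when missing):

def year_after(data: dict, collection: str, year: int) -> list:
    # Keep items whose year is present and strictly after the given year.
    return [item for item in data[collection]
            if item.year is not None and item.year > year]  # assumed attribute name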
Task 3: Define the function top_contributor to find the top contributor(s) of a given role. The top contributor is the person who has contributed to the most items, i.e. the most common contributor. The function should take in the data, a collection name, and a position and return a list of str representing the top contributor(s) of that role. The function should not be case-sensitive (i.e. Tim Nelson (reader) == Tim Nelson (Reader) == tim nelson (ReAdER)).
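A sketch, assuming each BDRItem stores its contributors as a list of strings that include the role in parentheses, e.g. "Tim Nelson (Reader)" (your representation may differ):

from collections import Counter

def top_contributor(data: dict, collection: str, role: str) -> list:
    # Count contributors case-insensitively, restricted to the given role.
    counts = Counter()
    for item in data[collection]:
        for contributor in item.contributors:  # assumed attribute name
            normalized = contributor.lower()
            if f"({role.lower()})" in normalized:
                counts[normalized] += 1
    if not counts:
        return []
    best = max(counts.values())
    return [person for person, count in counts.items() if count == best]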
Notice how even though web scraping is inherently difficult to test, you can still write tests for your query functions.
Task 4: Write tests for all of your query functions in test_query.py. These tests should work on the data generated by both of your implementations: Web Scraping and API Use (which you will implement later).
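Because the query functions only care about the shape of the data, you can test them on a small hand-built dictionary instead of live BDR data. A sketch (the import locations and the BDRItem constructor order are assumptions; match your actual code):

from common import BDRItem              # assumed location of BDRItem
from query import most_common_subject   # assumed module name for your query functions

# Hand-built data: two fake items in one fake collection. The constructor
# order (title, year, subjects, contributors, abstract, notes) is assumed.
FAKE_DATA = {
    "Test Collection": [
        BDRItem("Thesis A", 2020, ["Chemistry"], ["Tim Nelson (Reader)"], "", ""),
        BDRItem("Thesis B", 2021, ["chemistry", "Biology"], ["Tim Nelson (reader)"], "", ""),
    ]
}

def test_most_common_subject():
    # "chemistry" appears twice (case-insensitively); "biology" only once.
    result = most_common_subject(FAKE_DATA, "Test Collection")
    assert [s.lower() for s in result] == ["chemistry"]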
Section 5: Running the Code
Check demo.py. You will find two functions that you can use to run your code; add print statements to see the results of your queries.
Section 6: BDR API
Remember how we mentioned that the query functions are independent of the method used to get the data? Now we will be using the BDR API to get the data instead of scraping it. You will still use the query functions you wrote in the previous section to analyze the data, but you will need to write new functions to get the data from the API.
Part 1: BDR API
Task 1: You will be accessing the BDR API in order to get information about collections and items. Read through the BDR API documentation and preview some of the JSON data returned by the examples.
The BDR maintains different types of ids for collections. While web scraping, you interacted with ids that follow the format bdr:xxxxxx; collections also have database ids. These database ids are integer values and are used to make collection requests.
Part 2: API Use
Task 2: Define the function get_collectionids. This function should make a request and return a dictionary where the keys are the names of collections and the values are the database ids of the collections.
Think about whether the Item API, Collection API, or Search API is the proper API to use in this task. Additionally, think about the level at which you should make a request. Recall that JSON syntax is very similar to Python dictionary syntax.
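As a sketch, get_collectionids might look like the following. The URL and the "collections", "name", and "db_id" field names are assumptions; print the JSON returned by the API and adjust them to match what you actually see.

import requests

# Assumed endpoint for the Collection API; confirm against the documentation.
COLLECTIONS_URL = "https://repository.library.brown.edu/api/collections/"

def get_collectionids() -> dict:
    response = requests.get(COLLECTIONS_URL).json()
    # "collections", "name", and "db_id" are assumed key names.
    return {c["name"]: c["db_id"] for c in response["collections"]}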
Task 3: Define the function collection_request. This function takes in a collection database id, makes a request for that collection, and returns the JSON response.
What URL would you need to use to make a collection request? How do these URLs change depending on the collection you are requesting? You can use a formatted string to build the URL string.
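A sketch using a formatted string to build the URL (the URL pattern is an assumption; confirm it against the Collection API documentation):

import requests

def collection_request(db_id: int) -> dict:
    # Assumed URL pattern for a single collection request.
    url = f"https://repository.library.brown.edu/api/collections/{db_id}/"
    return requests.get(url).json()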
While get_collectionids returns a dictionary of every single collection, we are only interested in the Theses and Dissertations collections. You are given the function get_thesis_collections in bdrapi.py. You should call this function to get the Theses and Dissertations collections. Notice that it calls get_collectionids and collection_request and returns a list of all the theses collections in the BDR that have fewer than 50 items.
Task 4: Define the function item_request. This function takes in an item id, makes a request for that item, and returns the JSON response.
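A sketch, mirroring collection_request (again, the URL pattern is an assumption to check against the Item API documentation):

import requests

def item_request(item_id: str) -> dict:
    # item_id is a bdr:xxxxxx style id; the URL pattern below is assumed.
    url = f"https://repository.library.brown.edu/api/items/{item_id}/"
    return requests.get(url).json()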
Task 5: Define the function scrape_item. This function interacts with an item's JSON data and returns a BDRItem whose data reflects the information of the item. If an item does not have a piece of information, it should set the corresponding BDRItem variable to an empty string, an empty list, or None for int values.
Hint: Look at an example item JSON! Make sure the JSON you look at is for a thesis/dissertation, such as this JSON for an American Studies thesis. Is there any one field in the JSON that stores all of the information you want (full title, year, subjects, contributors, abstract, and notes) as subfields you can access?
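In that spirit, a sketch of the defensive style this task calls for, using dict.get with default values so missing fields become empty strings, empty lists, or None. The key names and the BDRItem constructor order below are placeholders; replace them with what you find by printing an item's JSON.

from common import BDRItem  # assumed location of BDRItem

def scrape_item(item_json: dict) -> BDRItem:
    # "title", "year", etc. are placeholder key names; the real keys may be
    # nested inside one field of the item JSON.
    return BDRItem(
        item_json.get("title", ""),
        item_json.get("year", None),
        item_json.get("subjects", []),
        item_json.get("contributors", []),
        item_json.get("abstract", ""),
        item_json.get("notes", ""),
    )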
Task 6: Define the function get_items_from_collection. This function interacts with a collection's JSON data to return a list of BDRItems for all the items in the collection. When writing this function, consider how you can find the items in the collection's JSON data and use the functions you have written to make requests and scrape item information to create your returned list.
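A sketch, assuming the collection JSON exposes its items as a list of entries that each contain an item id (the "items" and "pid" key names are assumptions to verify against the real JSON):

def get_items_from_collection(collection_json: dict) -> list:
    items = []
    # "items" and "pid" are assumed key names; print the collection JSON to
    # find where the item ids actually live.
    for entry in collection_json.get("items", []):
        items.append(scrape_item(item_request(entry["pid"])))
    return items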
Task 7: Define the function api_data. It takes in a dictionary where the keys are collection names and the values are collection ids. This function should make requests for the given collections and get the items for each collection.
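Putting the pieces together, api_data can mirror the dictionary shape returned by scrape_data so that the same query functions work on both. A sketch:

def api_data(collections: dict) -> dict:
    # collections maps collection names to database ids.
    data = {}
    for name, db_id in collections.items():
        collection_json = collection_request(db_id)
        data[name] = get_items_from_collection(collection_json)
    return data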
Part 3: Testing
We don't expect you to write automated tests for your API functions. You should convince yourself that they work on BDR data by calling api_data, search_data, and get_collectionids.
However, we will introduce a new type of testing: model-based testing. Model-based testing is a type of testing where you test your functions against a model of the data. In this case, you could test your API functions against the data you scraped, which would help you ensure that your API functions are working as expected. We will be doing this in a later project, but you can start thinking about how you would do it now.
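For instance, one lightweight version of this idea looks like the sketch below, where the scraped data acts as the "model" that the API data is checked against. It assumes run_scrape() returns the scraped data and that thesis_ids is a hypothetical dictionary of collection names to database ids; run it manually rather than as an automated test, since the live data can change between the two calls.

# Model-based comparison: the scraped data is the model, the API data is
# the implementation under test. thesis_ids is a hypothetical dict of
# collection names to database ids.
scraped = run_scrape()
api = api_data(thesis_ids)

for collection in scraped:
    if collection in api:
        print(collection,
              most_common_subject(scraped, collection),
              most_common_subject(api, collection))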
Section 7: Web Scraping + API Use
Now that you have both the web scraping and API use implementations, you can compare the two. You can use the query functions you wrote in the previous section to analyze the data from both implementations.
In bdrscrape.py, we provided you with the COLLECTIONS dictionary, which stores the collection name and BDR id for some select collections. You can manually obtain the BDR id for a collection by looking at its URL:
https://repository.library.brown.edu/studio/collections/bdr:tkz6xrdc
However, as you did in Part 2, you can also access the BDR ids of collections by making a collection request and finding the ids in the JSON response. Answer the following in README.txt.
- How could you use your bdrapi.py implementation in order to get a dictionary like COLLECTIONS? What functions could you use or modify?
Section 8: README
In addition to late day information, your final README.txt should answer the following prompts.
- The BDR allows any member of the Brown community to upload items and set them as either private, public only to Brown, or public to anyone. In our previous implementations, we simply ignored any items that were not publicly accessible. However, the public still has access to some of the information via the item thumbnails on collection pages, even if we can't access the item pages themselves. Explore a collection page using Web Inspector and make a collection request in order to glean as much information as possible. Compare and contrast these two processes and describe the information available to you through each.
- BDR items have varying levels of publicity. Public APIs, such as the BDR's API, allow developers to interact with applications, though also with varying levels of publicity. For example, some APIs require authentication with an API key, a unique identifier generated to authenticate a user's access to the API. Do we have a moral obligation to share information and make it public? To what extent? Why might someone want to keep something private? Of the public APIs, did any surprise you? Note that not every program provides a public API.
- In cases where an API is not public, why is web scraping still a viable option? How does web scraping differ in “required permissions” from API use? Read this article about Memex, a program that collects data from the Internet to identify human trafficking. What are some moral/ethical concerns about web scraping? Note that we informed the Brown University Library that we would be web scraping the BDR. What do you think would happen if a web scraper was operating on a larger scale? (Hint: what happens when you overwhelm a server with traffic/requests?)
- Web scraping and APIs have different circumstances in which each method of gathering data can be used. What are some advantages and disadvantages of using each? In which cases would web scraping be more applicable, and in which cases would APIs be more applicable?
Section 9: Submission
Only one of you and your partner needs to submit your code on Gradescope. You may submit as many times as you want; only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day, so do NOT submit after the deadline unless you plan on using late days. Again, BOTH you and your partner must have late days in order to use them.
Please follow the design and clarity guide; part of your grade will be for code style and clarity. After completing the project, you will submit:
README.txt
bdrscrape.py
bdrapi.py
common.py
test_query.py
- Any additional files you create.
Because all we need are the files, you do not need to submit the whole project folder. As long as the files you create have the same names as what needs to be submitted, you're good to go! If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late and you do NOT have any more late days, no credit will be given.
Submit your work on Gradescope. Please don't put your name anywhere in any of the handin files; we grade assignments anonymously!