Project 1: Song Analysis

Due Dates

Project Learning Goals

In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing written algorithms.

Summary

Tim has decided to unleash a special episode of Tim Talk called Tim Sings! Tim is running out of time, and he needs your help. Tim has a list of songs whose genres are unknown. It is up to you to build a program that takes in the lyrics of a song with an unknown genre and finds the most similar song with a known genre.

You will build a program that takes in lyrics for a song and determines its genre. Rather than hard-coding a set of rules, you’ll use machine learning to classify songs.

We’ll provide you with a set of training data: song lyrics with known genres. When you want to classify a new song into its genre, you will find the most similar song in the training data and return its genre.

In order to find the most similar song in the training data, you will implement the nearest neighbor algorithm. This will require calculating term-frequency-inverse document frequency (tf-idf) weights for every song. These weights are essentially word counts, modified by how common various words are. You’ll calculate the same weights for the input song. Then, you’ll compare the input song’s weights to each song in the training data using cosine similarity, a measure of the similarity of two sets of weights.

TF-IDF

tf-idf, or Term Frequency-Inverse Document Frequency, is a metric that represents a given term’s importance within a corpus (i.e., a set of documents, such as the training data). tf-idf is often used in text search algorithms (like Google!), keyword extraction, information retrieval, and natural language processing.

Term Frequency (tf)

The Term Frequency of a word i is the number of times the word appears in a certain document d. Given the document: “Tik Tok on the clock, the party don’t stop”. The term frequencies would be as follows:

Tik Tok on the clock party don’t stop
1 1 1 2 1 1 1 1

Inverse Document Frequency (idf)

Inverse Document Frequency is used to scale up the importance of words that occur with less frequency in the data (and to scale down less important words). To compute Inverse Document Frequency, we first need to get the fraction of the total number of documents in the corpus n over the number of documents in which a particular term i appears in the training data n~i~. Once this fraction is found, idf can be computed by taking the natural logarithm of the fraction.

Consider this corpus:

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7
Apple Banana Watermelon Mango Strawberry Strawberry Kiwi Kiwi Strawberry Strawberry Banana

To calculate the idf score of “strawberry,” we would first identify n, the total number of documents (in this case 7). Then, we would divide by the number of documents “strawberry” appears in, n~i~. To find idf, we take the natural logarithm of n/n~i~, so ln (7/3) = .85. In order to get the natural log, make sure you use Python’s math.log function.

tf-idf

For our purposes, we want to use both term frequency and inverse document frequency in order to understand how important a single term is to a document in the context of the corpus. The tf-idf value is the product of the term frequency and inverse document frequency:

Where tf~ij~ represents the frequency of the term i in the document j and idf~i~ represents the inverse document frequency of the term i in the corpus as a whole.

For the above corpus, the tf-idf weights for Doc 6 would be:

Kiwi Strawberry
2 * ln(7/1) 1 * ln(7/3)

Cosine Similarity

In order to compare the tf-idf weights of two songs, you’ll use cosine similarity (for which we have provided an implementation). The cosine similarity of two identical documents is 1, and the cosine similarity of documents that share no words in common is 0. Beyond that, it will be higher when documents share higher-weighted words in common.

Nearest Neighbor

Finally, you will implement the nearest neighbor algorithm, which outputs the song in our training data that is most similar to the input song using cosine similarity. This will allow us to classify the input song to have the same genre as the song most similar to it.

Requirements

  1. Fill out the Partner Form. In this class, all projects will be in a group format, where you will have one or more partners to collaborate with, depending on the project. For this project, you will have one group partner to work through this project with.

    Fill out this form by Tuesday, 9/19 at 9 PM so we can make the groups and assign you a mentor. TAs will email you to set up a Design Check. Make sure you and your project partner both attend the Design Check! (We need this information for grading purposes).

  2. Setup: Either you or your partner should create a new project directory on your computer, create a new file on VSCode in that directory, and copy in the stencil code pasted in the “Implementation Phase” section below to a file called song_analysis.py. Make sure you can run PyTest in this project!
  3. Design Check: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization.
  4. Data/Coding: See below.
  5. Style and Design: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide.
  6. Testing: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail.

Design Check

For the design check, only one of you and your partner will be expected to hand in song_analysis.py with block comments answering the questions asked below.

Design Questions:

  1. After reading through the handout, demonstrate your understanding of tf-idf by answering the following questions:
    • The Term Frequency of a word i is the number of times the word appears in a certain document d. Given the document: “the dog bit the man” — what are the term frequencies?
    • What do tfij and idfi represent?
    • What is the cosine similarity of two identical documents?
    • Consider the following corpus.
      • What is the idf score of “mango”?
      • What are the tf-idf weights for Doc 3?
    Doc 1 Doc 2 Doc 3 Doc 4
    Mango Strawberry Mango Starfruit Watermelon Watermelon Strawberry Banana Mango
  2. Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you could test the program and its accuracy.

The Implementation Phase

In working through this assignment, there are several components the algorithms can be broken into: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequential order and build on each other. We recommend that you complete the project in the following order:

Three additional functions have been provided to you: create_corpuscosine_similarity, and main.

You can start with this stencil code:

Details: create_corpus

This function, which has been provided to you in the skeleton code, reads a CSV file and sets up a data set (i.e., a corpus) of songs. You should read it and understand how it works!

Details: compute_tf

In order to calculate the Term Frequency, you need to fill out the compute_tf() function in the stencil code. This function is designed to calculate how many times a term appears in a specifc song. It takes in a list of strings representing the lyrics for a song and produces a dictionary containing the term frequency where the key is the term and the value is the frequency for that term.

Details: compute_idf

In order to calculate the Inverse Document Frequency for each song in your corpus, you need to fill out the compute_idf() function in the stencil code. This function takes in a corpus of songs, as generated via the create_corpus function, and outputs a dictionary from words to idf values (floats).

Details: compute_tf_idf

In order to calculate the Term Frequency - Inverse Document Frequency, you need to fill out the compute_tf_idf() function in the stencil code. This function takes in a list of strings representing song lyrics and a dictionary of idf values and outputs a dictionary where the keys are words from the song and the values are tf-idf weights. This function should call the compute_tf function.

Details: compute_corpus_tf_idf

You’ll need to calculate tf-idf values for every song in your corpus. This song takes a corpus and a dictionary of idf values and produces a dictionary where the keys are song ids and the values are tf-idf dictionaries. This function should call compute_tf_idf.

Details: cosine_similarity

This function is provided in the stencil code and is used to find a measure of similarity. It takes in two dictionaries that are the result of tf-idf calculations (effectively the lyrics have been vectorized to have a magnitude and direction so that each song could be plotted on a plane). We can find the similarity of two vectors (defined by the proximity of the vectors in their vector space) by dividing the dot product of the two vectors by the product of the magnitude of the two vectors.

Details: nearest_neighbor

In order to calculate the nearest neighbor of a novel song, you need to fill out the nearest_neighbor() function in the stencil code. This function takes in a string representing song lyrics, a corpus of songs, a dictionary of tf-idf dictionaries (as produced by compute_corpus_tf_idf) and a dictionary of idf weights and outputs the Song that is most similar to the lyrics (as measured by cosine similarity). This function should call compute_tf_idf and cosine_similarity.

Details: main

This function, which we have written for you, is what glues all the pieces together. It creates a corpus, computes tf-idf weights, and returns the genre of a given song.

Testing

For this project, we would like you to put all of your tests in test_song_analysis.py. You should test all of the functions you write; you don’t need to write tests for create_corpus or cosine_similarity. Please follow the testing section of the design and clarity guide.

You can also run your code on a real data set, which you can find here.

Handin

Design check

For the design check, hand in song_analysis.py with the design check questions answered. You may not use late days on the design check.

Code

Only one of you and your partner needs to submit your code on Gradescope. You may submit as many times as you want – only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days. Again, BOTH you and your partner must have late days to use them.

The README template can be found here.

Please follow the design and clarity guide–part of your grade will be for code style and clarity. After completing the project, you will submit:

Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go!

If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have anymore late days) no credit will be given.

Please don’t put your name anywhere in any of the handin files–we grade assigments anonymously!

Submit your work on Gradescope as usual with all three files.