Data Collection and Modeling

Lab 7: Data Collection and Modeling

Section 1: Background

Now that you have completed Project 2 and have experience with modeling and simulating systems, we’d like you to revisit the idea and reflect.

Suppose that Brown is looking to add a new course in the Urban Studies department and has asked you to compile a report on the state of education in New York City. They requested three datasets that provide pertinent information on the most comprehensive and pressing issues for students, parents, educators and policymakers in the New York City public school system. Tim was able to gather a list of ten datasets, but needs your help to narrow this list down to just three.

Section 2: Learning Goals

Data collection and modeling, as we saw in Project 2 (and as we are about to explore in this lab), are never neutral or objective. While data (and models built from them) are often presented as unbiased, cold, hard facts, they are inherently subjective. Data are collected by researchers or other stakeholders who, consciously or not, have a particular goal in mind. The same holds for those who curate the data over time, or those who use them for modeling or making policy decisions.

For example, in Project 2, we provided you with five example countries but allowed you to choose from a larger set of countries if you liked. Both we, as a staff, and you, as the programmer, made decisions about which countries to represent. You may have chosen the five countries that held a particular interest for you, five countries close together, or five countries that were listed on the handout because they were the most easily available. Whatever the reasoning, picking those five countries centered some and excluded others. If you lived in a country that was going to lose 50% of its crop production due to global warming in the next year, or that would be underwater due to sea-level rise, you would likely be concerned and upset that your home (or factors important to your home) didn’t appear in this model.

All data collection and modeling works in a similar way. We can’t possibly collect every single piece of pertinent data or model every single situation because of time, space, or energy constraints. As Tim likes to say, “The only accurate model is the entire universe.” The goal of this lab is to have you consider various implications of data collection and modeling.

Importantly, we are not arguing that data collection (because of its biases) or modeling exercises (because they can’t account for all variables) are useless! On the contrary, we _are_ saying that, because of the importance and power of data gathering and modeling, we have an obligation to think deeply and carefully about whom we choose to represent and exclude when we engage in these activities.

Notice: Throughout this lab, you will be paired with various classmates. Be prepared to engage and move throughout the room. The answers you submit in the Google Form will count as your attendance and will be graded for participation. You can find the Google Form here.

Take a look at NYC OpenData, a project that aims to collect and publicize data on various aspects of New York City. The following ten datasets come from the education category on their website:

School Safety Report (2010-2016) This dataset was collected by the New York City Police Department (NYPD) and provides crime data for incidents that occurred within NYC public schools. It includes the overall number of incidents, in categories such as criminal, non-criminal, and violent crimes, among others, organized by school and borough. Collected 2010-2016.
Graduation Outcomes by School (2005-2010) This dataset provides information on the graduation outcomes of students enrolled in each school. Graduation is measured by those who earned a local or Regents diploma. Collected from 2005-2010.
Average SAT Results by School (2012) This dataset provides average SAT results across math, reading, and writing, organized by school. Collected in 2012.
List of Bilingual Programs (2020-2021) This dataset provides information on the type of support programs for bilingual students and the language in which the programs are offered. Collected from the 2020-2021 school year.
Student Demographic Snapshot by School (2015-2020) This dataset provides information on student demographics, such as grade, race, gender, first language, among others, collected from 2015 to 2020.
Students with COVID-19 Vaccinations (March, 2022) This dataset keeps track of the percentage of students with at least one COVID-19 vaccination and students fully vaccinated by school. Updated March, 2022.
School Food Report (2019) This dataset keeps track of the type of meals offered throughout and before the school day, if applicable. Collected in 2019.
Fair Student Funding Budget (2015) This dataset keeps track of funding allocated to schools and where that money is going throughout the school year. This dataset includes the growth or decrease in teacher salaries. Collected in 2015.
Aid Projects Submitted to the State (updated 2022) This dataset keeps track of all the requests schools have submitted to New York state for additional funding. The requests range from full modernizations to bathroom renovations. It also includes the anticipated cost of the project. Last updated September 2015.
School Bus Breakdowns and Delays (updated 2022) This dataset keeps track of all bus breakdowns and delays that occur when bringing students to and from schools. It includes the type of breakdown that occurred, the length of the delay, and how many students were on the bus, among other data points. Last updated November 2022.

Remember, the goal is to compile three datasets that provide pertinent information on the most comprehensive and pressing issues for students, parents, educators and policymakers in the New York City public school system. Pay less attention to the years the data were collected and more to the information each dataset conveys.

Task 1: Choose five datasets you think best serve the goal above. Be prepared to articulate your choices and the reasoning behind them. Submit this list of five to the Google Form.

Task 2: Pair up with someone in the class, and compare the datasets you chose. Working together, narrow your list down to just three datasets. Disagreements should be resolved charitably; this is the kind of (professional) discussion that data scientists and modelers have. Submit this list of three to the Google Form. Each partner should fill out the form.

Task 3: Final move! You and your partner should find another pair. Working in this group of four, come up with a final list of three datasets to submit as your final proposal. Each group member should fill out the form.
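If your group wants a starting point for the discussion in Task 3, one option is to tally which datasets both pairs already agree on. The sketch below is only an illustration: the function name `tally_picks`, the labels `pair_1` and `pair_2`, and the picks themselves are hypothetical, and the tally is not a substitute for talking through your reasoning.

```python
from collections import Counter

def tally_picks(picks_by_pair):
    """Count how many pairs chose each dataset."""
    counts = Counter()
    for picks in picks_by_pair.values():
        counts.update(set(picks))  # set() so a repeated name counts once per pair
    return counts

# Hypothetical picks, for illustration only.
group = {
    "pair_1": ["School Safety Report", "School Food Report",
               "Graduation Outcomes by School"],
    "pair_2": ["School Food Report", "Fair Student Funding Budget",
               "Graduation Outcomes by School"],
}

counts = tally_picks(group)
shared = sorted(name for name, n in counts.items() if n == len(group))
contested = sorted(name for name, n in counts.items() if n < len(group))

print("Chosen by both pairs:", shared)
print("Needs discussion:", contested)
```

Datasets that land in the "needs discussion" list are where the conversation matters most: ask each pair whose interests their pick centers and whose it leaves out.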

Task 4: Discuss the following reflection questions in your groups. Individually submit your responses to the Google Form.

Section 3: Reflection Questions

Part 1: Projection

Projection is the process of seeing one’s own inner circumstances reflected onto external circumstances. Projection is a very common process for us when filtering and choosing information. In other words, what matters personally to us oftentimes shapes what we view as necessary or important.

Task 5: While reading through the dataset list, were there particular datasets you did not consider at all, or that you considered more important than others? Did projection play any role in your decision making?

Part 2: Stakeholders

Take a look at the final datasets you came up with and consider the stakeholders that might be impacted by your set. Who might have been left out and in what ways might they be impacted? For example, a number of schools have 95% of their student populations in poverty. For these students, the school meal program would highly impact their ability to learn and engage in their classes. Thus, including the school meal program in an overview of schools would be of high importance to these students and parents. What are other examples of stakeholders impacted by the data you chose, and in what ways?

Task 6: Reflect on the stakeholders that might be impacted by the datasets you chose. Who might have been left out and in what ways might they be impacted?

Task 7: Were there any datasets that were left out of the list of ten? (There were many.) Try to think of one or two we missed.