Program Performance (Part 2); Rainfall
These notes aren’t entirely synchronized with lectures; notice that we covered `distinct_set` last time. So in class, we’ll pick up with the formal definition and then give the rainfall comparison the time it deserves.
A note on the drill
There’s a drill today.
One of our multiple-choice answers was “logarithmic”. We haven’t introduced this concept yet; we’ll talk about what it means in a couple of weeks. So far, everything has been constant ($O(1)$), linear ($O(n)$), or quadratic ($O(n^2)$).
Big-O (or asymptotic) Notation
Last time, we talked about analyzing the performance of programs. When deciding whether a function would run in constant, linear, or quadratic time, we ignored the constants and just looked at the fastest-growing (i.e., worst-scaling) term. We can define this more formally using asymptotic notation. This is also often called big-O notation (because a capital “O” is involved in the usual way of writing it).
Let’s look again at the last function we considered last time:
```python
def distinct(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:     # 0 + 1 + 2 + ... + (len(seen)-1)
            seen.append(x)    # once for every new element
    return seen
```
We decided that this function’s running time is quadratic in the length of its input, in the worst case. For a list of length $n$, we can calculate the worst-case number of operations as $\frac{n(n-1)}{2}+n+3$:
- $\frac{n(n-1)}{2}$: when every element in `l` is unique, the inner membership check has to loop through the entire `seen` list, which grows by 1 element with every iteration of the outer loop (and algebra gets us from $0 + 1 + \dots + (n-1)$ to this form);
- $n$: the `append` has to run once for every element of the input list in the worst case (we glossed over this last time); and
- $3$ (or some other constant): constant setup time for creating the `seen` list and returning from the function (whether or not we count certain operations as the same or different, it’s still a constant).
A computer scientist would say that this function runs in $O(n^2)$ time. Said out loud, this might be pronounced: “Big Oh of n squared”.
Recall that we’re trying to collapse out factors like how fast my computer is, and focus on scaling: is there a way to formalize the idea that one algorithm is faster than another?
Why does this matter?
Question: Why do we get to treat `append` as something that runs in constant time? Does it necessarily do so? Might there be different ways that `append` works, which have different worst-case runtimes?

You’ve noticed already that the data structures you use can influence the runtime of your program. In fact, soon we’ll talk about the difference between Pyret lists and Python lists. If `append` were implemented with a Pyret list, it would be worst-case linear, rather than constant.
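As a rough sketch of why: a Pyret-style list is a chain of “cons” cells, and appending means walking to the end of the chain first. Here’s a minimal (hypothetical) model in Python, not Pyret’s actual implementation:

```python
# A minimal linked-list model: each Cons cell holds one value and a
# reference to the rest of the list.
class Cons:
    def __init__(self, first, rest=None):
        self.first = first
        self.rest = rest

def linked_append(head, value):
    """Append value at the end of the chain, walking every cell to get there."""
    if head is None:
        return Cons(value)
    cell = head
    while cell.rest is not None:  # one step per existing element: O(n)
        cell = cell.rest
    cell.rest = Cons(value)
    return head

chain = None
for v in [1, 2, 3]:
    chain = linked_append(chain, v)
```

Each `linked_append` has to traverse the whole chain, so appending is linear in the list’s current length, unlike Python’s `list.append`.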
But where else do these differences appear? Let’s run a quick experiment.

```python
def distinct_list(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:
            seen.append(x)
    return seen

def distinct_set(l: list) -> list:
    seen = set()
    for x in l:
        if x not in seen:
            seen.add(x)
    return list(seen)  # convert back to a list
```
Try running each of these on Frankenstein. On my laptop, the set version took `0.054` total and the list version took `2.153` total; both numbers are in seconds.
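If you don’t have the Frankenstein text handy, you can reproduce the effect with random data. This is a sketch (the two definitions are repeated so the snippet runs on its own), and the exact timings will vary by machine:

```python
import random
import time

def distinct_list(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:
            seen.append(x)
    return seen

def distinct_set(l: list) -> list:
    seen = set()
    for x in l:
        if x not in seen:
            seen.add(x)
    return list(seen)

# 20,000 readings drawn from 1,000 possible values, so there are many repeats.
data = [random.randrange(1000) for _ in range(20000)]

start = time.perf_counter()
distinct_list(data)
list_time = time.perf_counter() - start

start = time.perf_counter()
distinct_set(data)
set_time = time.perf_counter() - start
```

The set version should finish far faster, even though the two functions do the “same” work.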
It seems like there’s a huge impact on performance. In fact, the version that uses sets performs around the same as `count_the` did last time, and `count_the` had worst-case runtime linear in the input size. That is, its worst-case performance was in $O(n)$. It seems like using a set has somehow eliminated the linear cost of the `x not in seen` check.
Next time we’ll talk about how sets and dictionaries make this kind of fast lookup possible, and whether there are any limitations. (There are limitations.) For now, think of big-O notation as a way to quickly capture how different tools (like data structures or algorithms) scale for different purposes. If you have the option to use something in $O(n)$ instead of something in $O(n^2)$, and there are no other factors involved, you should pick the better-scaling option.
I wonder…
Is there a potential bug lurking in the switch from `distinct_list` to `distinct_set`? Hint: it depends on what you might need to use the output for.
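One way to probe the hint (a sketch; how a set orders its elements is an implementation detail): both functions return the same *elements*, but only the list version promises to keep them in first-seen order.

```python
def distinct_list(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:
            seen.append(x)
    return seen

def distinct_set(l: list) -> list:
    seen = set()
    for x in l:
        if x not in seen:
            seen.add(x)
    return list(seen)

words = ["the", "monster", "the", "doctor", "monster"]
from_list = distinct_list(words)  # guaranteed: ["the", "monster", "doctor"]
from_set = distinct_set(words)    # same elements, but order is NOT guaranteed
```

If downstream code depends on the order of the output, the two versions are not interchangeable.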
Formally Speaking
The formal definition looks like this:
If we have (mathematical, not Python!) functions $f(x)$ and $g(x)$, then $f(x)$ is in $O(g(x))$ if and only if there are constants $x_0$ and $C$ such that for all $x > x_0$, $f(x) < C g(x)$.
That is, $f(x)$ is in $O(g(x))$ if we can pick a constant $C$ so that after some point ($x_0$), $f(x)$ is always less than $C g(x)$. There may be many such $x_0$ and $C$ values that work. But if even one such pair exists, $f$ is in $O(g)$.
If we want to show that $\frac{n(n-1)}{2} + n + 3$ is in $O(n^2)$, our constants could be, say, $x_0 = 5$ and $C = 1$. Once the list is more than $5$ elements long, $\frac{n(n-1)}{2} + n + 3$ will always be less than $1 \cdot n^2$. We could prove that using algebra, but here’s a picture motivating the argument:
We could have picked some other values. For example, $x_0 = 2$ and $C = 3$:
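We can also spot-check those witness pairs numerically. This is just a sketch using the operation count from above, $f(n) = \frac{n(n-1)}{2} + n + 3$; checking finitely many values is not a proof, but it would catch a wrong guess:

```python
def f(n: int) -> int:
    # Worst-case operation count for distinct: n(n-1)/2 + n + 3
    return n * (n - 1) // 2 + n + 3

def witnesses(x0: int, C: int, upto: int = 1000) -> bool:
    # Check that f(n) < C * n^2 for all x0 < n < upto
    return all(f(n) < C * n * n for n in range(x0 + 1, upto))
```

Both `witnesses(5, 1)` and `witnesses(2, 3)` hold, while a pair like `(0, 1)` fails: at $n = 1$, $f(1) = 4$ is not less than $1$.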
Important Note: We’re using Wolfram Alpha for generating plots. Wolfram Alpha is an excellent tool which is free for personal use. Its terms allow academic use in settings such as this provided that we credit them and ideally link to the original query, which you can explore here. Note that they don’t allow, for instance, web scraping of their tool, so please use it responsibly!
In this class, we won’t expect you to rigorously prove that a function’s running time is in a particular big-O class. We will, though, use the notation. E.g., we’ll use $O(n)$ as a shorthand for “linear” and $O(n^2)$ as shorthand for “quadratic”. And we’ll try to be precise about whether we’re talking about “worst case”, “best case”, “average case”, etc. Don’t conflate the two ideas! “Upper bound” does not mean “worst case”. We can also put an upper bound on the best-case behavior of a program!
Also, notice that, by this definition, if a function is in $O(n)$ it is also in $O(n^2)$ and so on. The notation puts an upper limit on a function, that’s all. However, by convention, we tend to just say $O(n^2)$ by itself to mean “quadratic” rather than also saying “and not $O(n)$”.
The quick trick of looking for the largest term ($n$, $n^2$, etc.) will usually work on everything we give you this semester, and in fact is a good place to start regardless. If you decide to take an algorithms course, things become both more complicated and, potentially, more interesting.
Group Programming Exercise: Rainfall
Let’s say we are tracking daily rainfall around Brown University. We want to compute the average rainfall over the period for which we have useful sensor readings. Our rainfall sensor is a bit unreliable, and reports data in a strange format. Both of these factors are things you sometimes encounter when dealing with real-world data!
In particular, our sensor data arrives as a list of numbers like:

```python
sensor_data = [1, 6, -2, 4, -999, 4, 5]
```
The `-999` represents the end of the period we’re interested in. This might seem strange: why not just end the list after the first `4` value? But, for good reasons, real-world raw data formats sometimes use a “terminator” symbol like this one. It’s also possible for the `-999` not to be present, in which case the entire list is our dataset.
The other negative numbers represent sensor error; we can’t really have a negative amount of rainfall. These should be scrubbed out of the dataset before we take the average.
In summary, we want to take the average of the non-negative numbers in the input list up to the first `-999`, if one appears. How would we solve this problem? What are the subproblems?
Think, then click!
One decomposition might be:
- Finding the list segment before the `-999`
- Filtering out the negative values
- Computing the average of the non-negative rainfall readings
This time, you will drive the entire process of building the function:
- note what your input and output look like, and write a few examples to sketch the shape of the data you have and the data you need to produce;
- brainstorm the steps you might use to solve the problem (without worrying about how to actually perform them—we just did some of that above); then
- create a function skeleton, and gradually fill it in.
Since these notes are being written before lecture, it’s tough to anticipate the solutions you’ll come up with, but here are two potential solutions:
Think, then click!
```python
def average_rainfall(sensor_input: list) -> float:
    number_of_readings = 0
    total_rainfall = 0
    for reading in sensor_input:
        if reading == -999:
            return total_rainfall / number_of_readings
        elif reading >= 0:
            number_of_readings += 1
            total_rainfall += reading
    return total_rainfall / number_of_readings
```
In this solution, we loop over the list once. The first two subproblems are solved by returning early from the loop and by ignoring the negative values as we go. The final subproblem is solved with the `number_of_readings` and `total_rainfall` variables.
Another approach might be:
```python
def list_before(l: list, item) -> list:
    result = []
    for element in l:
        if element == item:
            return result
        result.append(element)
    return result

def average_rainfall(sensor_input: list) -> float:
    readings_in_period = list_before(sensor_input, -999)
    good_readings = [reading for reading in readings_in_period if reading >= 0]
    return sum(good_readings) / len(good_readings)
```
In this solution, the first subproblem is solved with a helper function, the second by filtering the remaining elements of the list to remove negative values, and the third by calling the built-in `sum` and `len` functions on the final list.
These two solutions have very different feels. One accumulates a value and count while passing through the function once, with different behaviors for different values in the list. The other produces intermediate lists that are steps on the way to the solution, before finally just computing the average value.
The first version is, in my experience, much easier to make mistakes with—and harder to build up incrementally, testing while you go. But it only traverses the list once! That means it’s faster—right?
Well, maybe.
Runtime vs. Scaling
What are the worst-case running times of each solution?
Think, then click!
For inputs of size $n$, both solutions run in $O(n)$ time. This is true even though version 2 contains multiple `for` loops. The reason is that these `for` loops are sequential, not nested one within the other: the program loops through the input, then through the (truncated) input, and then through the (truncated, filtered) input, separately.

To see why this is the case, let’s look at a smaller example. This program has worst-case performance linear in the length of `l`:
```python
total = 0
for i in l:
    total = total + i
for i in l:
    total = total + i
```
It loops through the list twice.
But this program has quadratic performance in the length of `l`:
```python
total = 0
for i in l:
    for j in l:
        total = total + j
```
For every element in the list, it loops again through the list.
So if you’re “counting loops” to estimate how efficient a program is, pay attention to the structure of nesting, not just the number of times you see a loop.
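To make “count the structure, not the loops” concrete, here’s a sketch that tallies iterations directly: sequential loops scale like $2n$ (still linear), while nested loops scale like $n^2$.

```python
def sequential_iterations(l: list) -> int:
    count = 0
    for _ in l:      # first pass: n iterations
        count += 1
    for _ in l:      # second pass: another n iterations
        count += 1
    return count     # 2n total: still linear

def nested_iterations(l: list) -> int:
    count = 0
    for _ in l:      # for each of the n elements...
        for _ in l:  # ...loop over all n elements again
            count += 1
    return count     # n * n total: quadratic

small = list(range(100))
seq = sequential_iterations(small)   # 200
nest = nested_iterations(small)      # 10000
```

Doubling the input doubles `seq` but quadruples `nest`, which is exactly the difference between $O(n)$ and $O(n^2)$.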