Program Performance (Part 2); Rainfall

These notes aren’t entirely synchronized with lectures; notice that we covered distinct_set last time. So in class, we’ll pick up with the formal definition and then give the rainfall comparison the time it deserves.

A note on the drill

There’s a drill today.

One of our multiple-choice answers was “logarithmic”. We haven’t introduced this concept yet; we’ll talk about what it means in a couple of weeks. So far, everything has been constant ($O(1)$), linear ($O(n)$), or quadratic ($O(n^2)$).

Big-O (or asymptotic) Notation

Last time, we talked about analyzing the performance of programs. When deciding whether a function would run in constant, linear, or quadratic time, we ignored the constants and just looked at the fastest-growing (i.e., worst-scaling) term. We can define this more formally using asymptotic notation. This is also often called big-O notation (because a capital “O” is involved in the usual way of writing it).

Let’s look again at the last function we considered last time:

def distinct(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:  # worst case: scans all of seen; 0 + 1 + ... + (n-1) checks in total
            seen.append(x) # runs once for every new element
    return seen

We decided that this function’s running time is quadratic in the length of its input, in the worst case. For a list of length $n$, we can count the worst-case number of operations as $\frac{n(n-1)}{2} + n + 3$.
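
Where does that count come from? In the worst case, every element of the list is distinct, so the membership check on the $i$-th element scans all $i$ elements already in seen. Summing the checks, the appends, and a few constant-cost setup steps:

$$\underbrace{0 + 1 + 2 + \cdots + (n-1)}_{\text{membership checks}} + \underbrace{n}_{\text{appends}} + 3 = \frac{n(n-1)}{2} + n + 3$$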

A computer scientist would say that this function runs in $O(n^2)$ time. Said out loud, this might be pronounced: “Big Oh of n squared”.

Recall that we’re trying to collapse out factors like how fast my computer is, and focus on scaling: is there a way to formalize the idea that one algorithm is faster than another?

Why does this matter?

Question: Why do we get to treat append as something that runs in constant time? Does it necessarily do so? Might there be different ways that append works, which have different worst-case runtimes?

You’ve noticed already that the data structures you use can influence the runtime of your program. In fact, soon we’ll talk about the difference between Pyret lists and Python lists. If append were implemented with a Pyret list, it would be worst-case linear, rather than constant.
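
To see how an append could be linear, here’s a minimal sketch, assuming a Pyret-style linked list represented as nested (first, rest) pairs. This is a hypothetical illustration, not how Python’s built-in append actually works:

def append_linked(node, value):
    # A linked list as nested pairs: (first, rest), with None as the empty list.
    if node is None:
        return (value, None)
    first, rest = node
    # We have to walk past every existing element to reach the end: O(n) steps.
    return (first, append_linked(rest, value))

For example, append_linked((1, (2, None)), 3) returns (1, (2, (3, None))), and it visits both existing elements along the way.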

But where else do these differences appear? Let’s run a quick experiment:

def distinct_list(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:
            seen.append(x)
    return seen

def distinct_set(l: list) -> list:
    seen = set()
    for x in l:
        if x not in seen:
            seen.add(x)
    return list(seen) # convert back to a list    

Try running each of these on the text of Frankenstein. On my laptop, the set version takes 0.054 seconds total and the list version takes 2.153 seconds total.
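
If you’d like to reproduce the experiment, here’s a minimal timing sketch using the two functions defined above. The file name and the whitespace tokenization are assumptions on my part; any sufficiently large word list will show the same gap:

import time

# Assumed setup: read the text of Frankenstein and split it into words.
with open("frankenstein.txt") as f:
    words = f.read().split()

for func in (distinct_list, distinct_set):
    start = time.perf_counter()
    func(words)
    print(f"{func.__name__}: {time.perf_counter() - start:.3f} seconds")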

It seems like there’s a huge impact on performance. In fact, the version that uses sets performs around the same as count_the did last time, and count_the had worst-case runtime linear in the input size. That is, its worst-case performance was in $O(n)$. It seems like using a set has somehow eliminated the linear cost of the x not in seen check.

Next time we’ll talk about how sets and dictionaries make this kind of fast lookup possible, and whether there are any limitations. (There are limitations.) For now, think of big-O notation as a way to quickly capture how different tools (like data structures or algorithms) scale for different purposes. If you have the option to use something in $O(n)$ instead of something in $O(n^2)$, and there are no other factors involved, you should pick the better-scaling option.

I wonder…

Is there a potential bug lurking in the switch from distinct_list to distinct_set? Hint: it depends on what you might need to use the output for.

Formally Speaking

The formal definition looks like this:

If we have (mathematical, not Python!) functions $f(x)$ and $g(x)$, then $f(x)$ is in $O(g(x))$ if and only if there are constants $x_0$ and $C$ such that for all $x > x_0$, $f(x) < C g(x)$.

That is, $f(x)$ is in $O(g(x))$ if we can pick a constant $C$ so that after some point ($x_0$), $f(x)$ is always less than $C g(x)$. There may be many such $x_0$ and $C$ values that work. But if even one such pair exists, $f$ is in $O(g)$.

If we want to show that $\frac{n(n-1)}{2} + n + 3$ is in $O(n^2)$, our constants could be, say, $x_0 = 5$ and $C = 1$. Once the list is more than $5$ elements long, $\frac{n(n-1)}{2} + n + 3$ will always be less than $1 \cdot n^2$. We could prove that using algebra, but here’s a picture motivating the argument:

[Plot: the two functions plotted against each other]

We could have picked some other values. For example, $x_0 = 2$ and $C = 3$:

[Plot: the two functions plotted against each other with the new constants]
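
Neither picture is a proof, but we can at least sanity-check both pairs of constants numerically. (The range below is finite and chosen just for illustration; the algebraic argument is what actually covers all $x > x_0$.)

def f(n: int) -> float:
    return n * (n - 1) / 2 + n + 3

# Check f(n) < C * n**2 for every tested n beyond each x_0.
assert all(f(n) < 1 * n**2 for n in range(6, 10000))  # x_0 = 5, C = 1
assert all(f(n) < 3 * n**2 for n in range(3, 10000))  # x_0 = 2, C = 3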

Important Note: We’re using Wolfram Alpha to generate these plots. Wolfram Alpha is an excellent tool which is free for personal use. Its terms allow academic use in settings such as this, provided that we credit them and ideally link to the original query, which you can explore here. Note that they don’t allow, for instance, web scraping of their tool, so please use it responsibly!

In this class, we won’t expect you to rigorously prove that a function’s running time is in a particular big-O class. We will, though, use the notation. E.g., we’ll use $O(n)$ as a shorthand for “linear” and $O(n^2)$ as shorthand for “quadratic”. And we’ll try to be precise about whether we’re talking about “worst case”, “best case”, “average case”, etc. Don’t conflate big-O with worst-case analysis: an “upper bound” does not mean “worst case”. We can also put an upper bound on the best-case behavior of a program!

Also, notice that, by this definition, if a function is in $O(n)$ it is also in $O(n^2)$, and so on: since $x \le x^2$ whenever $x \ge 1$, any constants that witness $f(x) < Cx$ also witness $f(x) < Cx^2$. The notation puts an upper limit on a function, that’s all. However, by convention, we tend to just say $O(n^2)$ by itself to mean “quadratic” rather than also saying “and not $O(n)$”.

The quick trick of looking for the largest term ($n$, $n^2$, etc.) will usually work on everything we give you this semester, and in fact is a good place to start regardless. If you decide to take an algorithms course, things become both more complicated and, potentially, more interesting.

Group Programming Exercise: Rainfall

Let’s say we are tracking daily rainfall around Brown University. We want to compute the average rainfall over the period for which we have useful sensor readings. Our rainfall sensor is a bit unreliable, and reports data in a strange format. Both of these factors are things you sometimes encounter when dealing with real-world data!

In particular, our sensor data arrives as a list of numbers like this:

sensor_data = [1, 6, -2, 4, -999, 4, 5]

The -999 represents the end of the period we’re interested in. This might seem strange: why not just end the list after the first 4? But, for good reasons, real-world raw data formats sometimes use a “terminator” symbol like this one. It’s also possible for the -999 not to be present, in which case the entire list is our dataset.

The other negative numbers represent sensor error; we can’t really have a negative amount of rainfall. These should be scrubbed out of the dataset before we take the average.

In summary, we want to take the average of the non-negative numbers in the input list up to the first -999, if one appears. How would we solve this problem? What are the subproblems?

Think, then click!

One decomposition might be:

  • Finding the list segment before the -999
  • Filtering out the negative values
  • Computing the average of the non-negative rainfall readings


This time, you will drive the entire process of building the function.

Since these notes are being written before lecture, it’s tough to anticipate the solutions you’ll come up with, but here are two potential solutions:

Think, then click!
def average_rainfall(sensor_input: list) -> float:
  number_of_readings = 0
  total_rainfall = 0
  for reading in sensor_input:
    if reading == -999:
      # Stop at the terminator: average only the readings before it.
      return total_rainfall / number_of_readings
    elif reading >= 0:
      number_of_readings += 1
      total_rainfall += reading
  # Assumes at least one valid reading; an empty or all-negative
  # dataset would raise ZeroDivisionError here.
  return total_rainfall / number_of_readings

In this solution, we loop over the list once. The first two subproblems are solved by returning early from the loop and by ignoring the negative values as we go. The final subproblem is solved with the number_of_readings and total_rainfall variables.

Another approach might be:

def list_before(l: list, item) -> list:
  result = []
  for element in l:
    if element == item:
      return result
    result.append(element)
  return result

def average_rainfall(sensor_input: list) -> float:
  readings_in_period = list_before(sensor_input, -999)
  good_readings = [reading for reading in readings_in_period if reading >= 0]
  # As above, this assumes good_readings is non-empty.
  return sum(good_readings) / len(good_readings)

In this solution, the first subproblem is solved with a helper function, the second subproblem by filtering the remaining elements of the list to remove negative values, and the third subproblem by calling the built-in sum and len functions on the final list.


These two solutions have very different feels. One accumulates a value and count while passing through the function once, with different behaviors for different values in the list. The other produces intermediate lists that are steps on the way to the solution, before finally just computing the average value.
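
Whichever style you prefer, a few quick checks can confirm that the two versions agree. (Here I’ve hypothetically renamed them average_rainfall_loop and average_rainfall_lists so that both definitions can coexist; the expected values follow directly from the problem statement.)

for avg in (average_rainfall_loop, average_rainfall_lists):
    assert avg([1, 6, -2, 4, -999, 4, 5]) == 11 / 3  # keeps 1, 6, 4; stops at -999
    assert avg([2, 4]) == 3.0                        # no -999: the whole list counts
    assert avg([0, -1, 3]) == 1.5                    # 0 is a valid reading; -1 is not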

The first version is, in my experience, much easier to make mistakes with—and harder to build up incrementally, testing while you go. But it only traverses the list once! That means it’s faster—right?

Well, maybe.

Runtime vs. Scaling

What are the worst-case running times of each solution?

Think, then click!

For inputs of size $n$, both solutions run in $O(n)$ time. This is true even though version 2 contains multiple for loops. The reason is that these for loops are sequential, and not nested one within the other: the program loops through the input, and then loops through the (truncated) input, and then through the (truncated, filtered) input, separately.

To see why this is the case, let’s look at a smaller example. This program has worst-case performance linear in the length of l:

total = 0
for i in l:
    total = total + i
for i in l:
    total = total + i

It loops through the list twice.

But this program has quadratic performance in the length of l:

total = 0
for i in l:
    for j in l:
        total = total + j

For every element in the list, it loops again through the list.

So if you’re “counting loops” to estimate how efficient a program is, pay attention to the structure of nesting, not just the number of times you see a loop. The trap runs in the other direction, too: distinct_list above contains only one visible for loop, yet the x not in seen check hides a second pass over seen, which is exactly what made it quadratic.