How are sets and dictionaries so fast?

Setting the stage

Last time we saw that switching from a list to a set in our distinct function produced a massive performance improvement. Today we’ll learn why. Or, rather, we’ll invent for ourselves the trick that Python sets use!

Next time, we’ll actually implement this idea to help you experiment with it.

Two Inefficient Set Structures

Inefficient Structure 1

If we really wanted to, we could use Python lists to store sets. Adding an element would just mean appending a new value to the list, and checking for membership would involve a linear search through the list.
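For concreteness, here's a minimal sketch of what that might look like (the helper functions and their names are our own, just for illustration):

```python
# A "set" backed by a plain Python list.
def add(s, element):
    s.append(element)      # adding: just append to the end of the list

def contains(s, element):
    for item in s:         # membership: a linear search through the list
        if item == element:
            return True
    return False
```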

Python lists are built so that accessing any particular index takes constant time. For example, suppose we have a list called staff that stores the first names of the staff of a course:
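For example (only 'Ben' at index 2 is fixed by the text below; the other names are made-up placeholders):

```python
staff = ['Alice', 'Bob', 'Ben', 'Carol']   # placeholder names, except 'Ben' at index 2

staff[2]   # evaluates to 'Ben'
```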

How long does it take for Python to return 'Ben' for staff[2]? It depends on how lists are implemented! In Python, lists are contiguous blocks in memory.

This is different from how Pyret lists work, and we’ll contrast the two in more detail soon. For now, imagine Python lists as being written on a sheet of lined paper in memory. Element 0 is a certain distance from element 1, which is that same distance from element 2, and so on. As a result, Python should be able to get the value of staff[2] in constant time: it doesn’t have to count down the list elements one by one; it can just do some arithmetic and skip ahead.

The trouble is that this kind of fast lookup by index doesn’t help us check for membership in the set. There’s no connection between 2 and 'Ben' except the arbitrary order in which elements were added. So let’s simplify the problem.

Inefficient Structure 2

What if, instead, the element we were searching for was the index? Suppose that we wanted to check whether a certain ID number was in our set. Then we could take advantage of how lists work in Python, and draw a picture like this, where the list elements are just True and False:
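In code rather than a picture, the idea might look like this (a tiny ID range of 0 through 9, just to keep the sketch small):

```python
# One list cell per possible ID (here only IDs 0 through 9, to keep it small).
# members[i] is True exactly when ID i is in the set.
members = [False] * 10

members[3] = True      # add ID 3 to the set
members[7] = True      # add ID 7 to the set

members[7]             # membership check: True, in constant time
members[4]             # False: ID 4 is not in the set
```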

We store a True if that ID number is in the set, and False otherwise. Since Python can look up an index of a list in constant time, we can use this data structure to report on set membership in constant time—if the elements are numbers.

Aside: If we stored more than just booleans in that list, we could easily map numeric keys to arbitrary values, much like a dictionary does. Access would remain constant time.
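For instance (again with a tiny, illustrative ID range):

```python
# The same trick, but storing names instead of booleans: a dictionary-like
# mapping from numeric keys to values, still with constant-time access.
names = [None] * 10

names[3] = 'Tim'       # map key 3 to 'Tim'
names[7] = 'Ashley'    # map key 7 to 'Ashley'

names[7]               # lookup: 'Ashley'
```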

Aside: Even better, there are tricks we can use to encode most data types as numbers. So this idea isn’t limited to looking up non-negative integers. We could use it to solve the lookup-by-name problem above by deciding to encode, say, 'Ben' as $2 \cdot 26^2 + 5 \cdot 26 + 14$. In practice, there are far better encodings.
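Here's a tiny sketch of that encoding, treating each letter as a base-26 "digit" given by its alphabet position; the function is our own invention, not something Python does for you:

```python
# Naively encode a word as a number: a=1, b=2, ..., z=26, read in base 26.
# Just to show the idea; real encodings are more careful than this.
def naive_encode(word):
    total = 0
    for ch in word.lower():
        total = total * 26 + (ord(ch) - ord('a') + 1)
    return total

naive_encode('Ben')    # 2*26**2 + 5*26 + 14 = 1496
```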

This all sounds great: it really does get us worst-case constant time lookup.

But there’s a cost. Unless the range of possible values is small, this technique needs a colossally big list, probably bigger than we could fit in memory! Worse, most of it is going to be spent on False values that just say "nope, nothing here!". We’re making an extreme trade-off of space for time: allocating a list cell for every potential search key in exchange for constant-time lookup.

Making a Reasonable Tradeoff: Hash Functions

Can we find a happy medium between these solutions? Maybe by spending a little bit of extra space, we can somehow end up with constant-time lookup? And even if we don’t always manage that, maybe we can still improve on both the extremes above.

Let’s start with another design question. Suppose we want to map the IDs of CSCI 0112 staff to their corresponding names. Further, let’s say that IDs are non-negative numbers ranging from 0 to 999999, but that there are far fewer than a million staff members: there are fewer than 10, so you’ll never have more than 10 datapoints to store in a given semester.

So I’m going to give you a Python list with just 10 elements to work with. Strictly speaking, that ought to be enough space, but nowhere near enough to make a list cell for every possible ID. Here’s what it looks like:
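In code, the empty ten-slot list might be (using None to mark empty slots; that choice is ours):

```python
table = [None] * 10    # ten slots, indices 0 through 9, all empty for now
```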

How can we convert a 6-digit student ID to a unique index between 0 and 9?

We can’t.

But what if we were willing to give up the “unique” part of that question? Do we know a way to convert an ID to a number between 0 and 9 (while still making a reasonable effort at uniqueness)?

Every ID has a remainder when divided by 10. For instance, 1234 has remainder 4 when divided by 10. Let’s try using this as our index function. We’ll insert two datapoints: {1234: 'Tim'} and {5678: 'Ashley'}.

Now if we want to look up the key 5678 we can do so in constant time! Just take its remainder when divided by 10 to get 8, and do a lookup at index 8 in this Python list.
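Here's that idea as a small sketch (the helper name index_of is our own):

```python
table = [None] * 10        # the ten-slot list from above

def index_of(key):
    return key % 10        # our "index function": the remainder mod 10

# Insert the two datapoints.
table[index_of(1234)] = 'Tim'      # 1234 % 10 == 4, so 'Tim' goes in slot 4
table[index_of(5678)] = 'Ashley'   # 5678 % 10 == 8, so 'Ashley' goes in slot 8

# Constant-time lookup:
table[index_of(5678)]              # 'Ashley'
```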

This sounds great, but do we yet have a usable data structure? What’s the problem?

Think, then click!

We gave up uniqueness! For every index in the list, there are a hundred thousand potential keys with the same remainder when divided by 10.

Just look up the key 0004 in the above list to see it. This isn’t Tim’s ID, but it still gets mapped to the string 'Tim'. This is often called a collision: two actually-used keys get mapped to the same index, with potentially destructive results.
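Continuing the little sketch from above (Python writes the ID 0004 as just 4):

```python
table[index_of(4)]     # 'Tim' -- a collision: 4 and 1234 share the same index
```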

So: this is a nice start, but we’d better do a little bit more work.

A Note on Terminology

This kind of key-to-index transformation is often called a hash function, where “hash” here means a digest or summary of the original data. Hash functions are usually fast, lossy, and (in applications beyond this lecture!) built to have a relatively uniform distribution of hashes over the key population to reduce the chance of collision.
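Python even ships with a built-in hash function, which we could combine with the remainder trick from above. A quick illustration (for strings, hash is randomized per run of Python, so the exact numbers will differ on your machine):

```python
hash('Ben')          # some large (possibly negative) integer
hash('Ben') % 10     # an index between 0 and 9
```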

The Challenge: Collisions

You’ll learn more about the details if you take CSCI 0200. In short, there are a few ways of handling collisions: my favorite is storing a separate list for every index, but there are more sophisticated approaches.
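Here's a minimal sketch of that approach: every index holds a list (a "bucket") of key/value pairs, and keys that collide simply share a bucket. All the names and the table size are our own choices for illustration, not what Python actually does internally:

```python
TABLE_SIZE = 10

def make_table():
    # one (initially empty) bucket per index
    return [[] for _ in range(TABLE_SIZE)]

def put(table, key, value):
    bucket = table[hash(key) % TABLE_SIZE]
    for pair in bucket:
        if pair[0] == key:        # key already present: update its value
            pair[1] = value
            return
    bucket.append([key, value])   # otherwise, add a new key/value pair

def get(table, key):
    bucket = table[hash(key) % TABLE_SIZE]
    for pair in bucket:           # usually a very short list...
        if pair[0] == key:
            return pair[1]
    return None                   # ...unless many keys happen to collide here

staff_ids = make_table()
put(staff_ids, 1234, 'Tim')
put(staff_ids, 5678, 'Ashley')
get(staff_ids, 5678)              # 'Ashley'
```

Even in this sketch, a lookup only scans the (usually tiny) bucket at one index rather than the whole table, which is where the speed comes from.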

The important takeaway is that, in the worst case (all keys hash to the same index, nothing but collisions), searching a hash table is as slow as a list: linear in the number of elements being stored. With a well-chosen hash function and enough space in the table, collisions turn out to be rare. With some probability theory, we could prove that the average case is constant time. It’s usually incredibly fast, and anyway never scales worse than a list.

Takeaway

That’s how sets and dictionaries are so fast in Python, and that’s why data structure choice matters. You’ll be choosing between lists, dictionaries, sets, etc. on the final project, so keep these scaling differences in mind.

Another Note on Terminology

This kind of data structure is called a hash table. For myself, I like to make a distinction between a hash table and a dictionary: the hash table is low level, and the dictionary is high level. A dictionary (also called a map in some other languages) is an abstract description of the shape of the data (a key-value store) and certain common operations (like lookup). You could implement a dictionary using any of the above data structures! Hash tables are by far the most common (and usually the wisest) way to implement a dictionary, and Python’s dictionaries are backed by hash tables. The two terms are often used interchangeably.