Loose Ends, Building a Simple Hash Table

Project 1 Q&A: Continuing to Learn

Some of you may have (very reasonably!) wondered why we don’t ask you to add new points to the dataset once your code has classified them. It’s certainly true that more data can lead to better classification! However, there are some risks. Consider a situation like this:

Continuing to Learn (Wrongly)

We’ve classified an incoming data point as a cat, and then added the point to our training set. Then, because of that point, we classified another incoming point as a cat. And then another. Soon, we’re at risk of our dataset being biased by the specific queries we’ve received up to that point. So we avoid the problem by not saving a history of classifications made.

More advanced machine-learning techniques exist, and you can learn about them if you take more CS.

Hashing More Complicated Keys

In principle, you can use any datatype you’d like as either the key or value in a hash table (and thus in a Python set or dictionary). There’s one snag: the key needs to be something that can be reliably hashed. And if the key is a mutable datatype, like a list, it might hash to one value today and a different value tomorrow. Python avoids this problem by forcing us to use immutable datatypes as keys in sets and dictionaries. Numbers, strings, etc. are all OK. But if we try to use a list:

class_time = ['BH 141', '1:00pm']
classes = {class_time: 'CSCI 0112'}

we’ll get a TypeError: unhashable type: 'list' error.

If you want to use something like a list, check out the tuple datatype instead, which is always immutable (and fixed-length):

class_time = ('BH 141', '1:00pm')
classes = {class_time: 'CSCI 0112'}

Dataclasses

Recall that Python’s analogue to Pyret’s data definitions is called a dataclass. (See this 0111 textbook chapter.) E.g.:

 @dataclass
 class Location:
      lat: int
      long: int

If you want to use a dataclass as a key, you need to tell Python that the data cannot be changed by using the frozen=True annotation, like this:

 @dataclass(frozen=True)
 class Location:
      lat: int
      long: int

There is a lot more to hash tables than we have time to discuss in 0112. If you want to learn more (without official 0112 support) about how hashing works in Python, check out the documentation on the __hash__ method of objects. If you want to learn more in the context of a course, check out 0200.

Hashing Perspective

In CS, we often make tradeoffs between memory and time. The most common example is caching (storing work done to avoid re-doing it in the future). But in hashing we saw a new kind of tradeoff: giving up correctness in a very careful way, in exchange for performance.

The “very careful way” has to do with both the selection of the hash function and how collisions get resolved. My favorite way to resolve collisions is to store another list at every location in the overall list. Then, if two elements get hashed to the same location, Python can store them both in the sub-list. Searching now needs to loop through the entire sub-list, and with a good hash function elements are usually (but not necessarily) uniformly distributed between locations in the top-level list. It might look something like this:

A "chained" hash table

Tim’s Homework

Last time someone asked whether a program could have runtime inversely proportional to the size of the input. Here’s an example of how factors like that can appear when we’re thinking about scaling.

Imagine a program tries to find a random element of a set built using this technique. It generates a random key, hashes it, and checks for an element at that location. If no element is present, it keeps searching. In the best case, the entire set is full. In the worst case, there’s only one thing in the set.

Practice: A Simple Hash Table

Let’s do another code-arrangement exercise. This time we’ll be writing a real hash-table, using the sublist method of collision-handling. You’ll fill in 3 functions:

def add(table: list, element: str):
    '''Add an element to the hash-table set'''
    
def search(table: list, element: str):
     '''search for an element in the hash-table set'''

def demo(table_size: int, num_elements: int, element_length: int):
    '''Add a large number of random elements to the hash-table set, then
        show how well-distributed the random elements were'''

Using these lines:

table[idx].append(element)
table = [[] for _ in range(table_size)]
print([len(sublist) for sublist in table])
idx = hash(element) % len(table)
if element not in table[idx]:
idx = hash(element) % len(table)
return element not in table[idx]
idx = hash(element)
if element in table[idx]:
random_element = ''.join(choices(ascii_letters + digits, k=element_length))
for _ in range(num_elements):
return element in table[idx]
add(table, random_element)

Note that, unlike last time, you might not need all of these. I’ve also used list-comprehension and a random-choice library you won’t have seen before. What’s the point? To show you that you don’t need to know exactly how to do a low-level thing in Python to be able to know how to start solving the problem.

You can find a solution in the livecode.

Testing Our Hash Table

We’ve got 2 functions to test (the demo function is just a demonstration).

Our add function modifies the state of objects in memory. Specifically, it modifies the table list, and potentially all sublists contained in it.
Our search function doesn’t modify the state, but it does depend on that state.

Remember that, when we’re testing side effects (e.g., adding elements to our table) we can’t just write assertions by themselves. We need to assert something about the effects on the state. E.g., for add maybe we could create an empty table and then see if it contains what we expect.

But how should we check what the table contains? Since the hash of every element, changes between program runs, we can’t rely on the location of an element being consistent. And just checking to see if any sublist contains the element is too broad, since search will only check a single sublist. Let’s just use search. E.g.,

def test_single_element():
    table = [[] for _ in range(10)]
    add(table, 'Hi, everybody!')
    assert search(table, 'Hi, everybody!') == True    
    # ...once this function returns, `table` disappears
    # future tests can make their own `table` list

What other tests would you write?