Random Testing

On Homework 5, you briefly experimented with creating a random list and using it to test your implementation of tree sort. Let’s bring that idea more into focus.

Today we’ll introduce one more sorting algorithm and then use random testing to check it.

The presentation of the new sort, Quicksort, is deliberately rushed to showcase the value of random testing.

A final sorting algorithm

Suppose we liked the overall structure of merge sort, but we didn’t like the fact that the actual sorting is done in merge. We could create a variant that splits the list in a way that lets us combine sublists with +, instead.

def quick_sort(l: list) -> list:
    if len(l) <= 1:
        return l[:]
    # ???
    side1_sorted = quick_sort(side1)
    side2_sorted = quick_sort(side2)
    return side1_sorted + side2_sorted

If we want to avoid a more complicated merge, however, we need to split the list by some more elaborate method than just slicing it in half. How about we pick some arbitrary element in the list and use that value as a dividing line? Concretely, we’ll separate out the elements that are less than this element (which we’ll call the pivot) and those greater than it.

def quick_sort(l: list) -> list:
    if len(l) <= 1:
        return l[:]
    pivot = l[0] # many other choices possible
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    smaller_sorted = quick_sort(smaller)
    larger_sorted = quick_sort(larger)
    return smaller_sorted + [pivot] + larger_sorted

How confident are we that this implementation works?

Random Testing (Continued)

We very briefly saw random testing last time. Let’s talk more about it.

Motivation: Humans

Think of a country. What country are you thinking of?

Think, then click! Chances are, the country you thought of was:

- close to home;
- large; or
- in the news often.

I'd bet that the country you thought of was also in existence today. You probably didn't say [the USSR](https://en.wikipedia.org/wiki/Soviet_Union) or [Austria-Hungary](https://en.wikipedia.org/wiki/Austria-Hungary). And note that my choices there were all limited by my own historical knowledge. I went and [looked up more](https://en.wikipedia.org/wiki/List_of_former_sovereign_states) after writing that sentence. Even if we only count nations that existed after the U.N. was created, there are many: the [Republic of Egypt (1953-1958)](https://en.wikipedia.org/wiki/History_of_Egypt_under_Gamal_Abdel_Nasser#Republic_of_Egypt_(1953–1958)), the [Fourth Brazilian Republic (1946-1964)](https://en.wikipedia.org/wiki/Fourth_Brazilian_Republic), etc.

Why does that impact software testing?

Think, then click! You can only test what you imagine needing to test. This is heavily influenced by what you have loaded into your "mental cache"---this is sometimes called availability bias or the [availability heuristic](https://en.wikipedia.org/wiki/Availability_heuristic). If you haven't been thinking of it recently, you likely won't test it. Although it's somewhat mitigated by careful and disciplined thought, the problem persists.

But that’s dangerous. If humans are innately poor at testing (and even those with training aren’t always great at it) then is testing just doomed in general?

Computers as assistive devices

When confronted with our limitations, humans create assistive devices. These have included (e.g.) the pole lathe, the slide rule, and so on.

Maybe we can use our computers as an assistive device to help us with testing?

What’s the hard part of writing a test case? Usually it’s coming up with a creative input that exercises something special about the program. What if we let a computer come up with inputs for us? We could do that in a variety of different ways; random generators are just one. (For more on this topic, see CSCI 0320 or especially CSCI 1710.)

(1) Building random lists

We’d write a function that produces random lists (in this case, of integers):

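from random import randint

def random_list(max_length: int, min_value: int, max_value: int) -> list:
    # randint includes both endpoints, so a length of 0 (the empty list) is possible
    length = randint(0, max_length)
    return [randint(min_value, max_value) for _ in range(length)]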
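As a quick sanity check on the generator itself, here is a minimal sketch that draws many lists and confirms they respect the requested bounds. The seed value and the small bounds are arbitrary choices made for illustration, not part of the original notes; seeding just makes a run repeatable, which helps when a random test fails and you want to see the same input again.

```python
import random
from random import randint

def random_list(max_length: int, min_value: int, max_value: int) -> list:
    length = randint(0, max_length)
    return [randint(min_value, max_value) for _ in range(length)]

# Arbitrary seed (an assumption for this sketch) so the run can be repeated.
random.seed(1710)
samples = [random_list(5, -10, 10) for _ in range(1000)]

# Every list should respect the length bound and the value bounds.
lengths_ok = all(0 <= len(s) <= 5 for s in samples)
values_ok = all(-10 <= x <= 10 for s in samples for x in s)
# Because randint(0, max_length) can return 0, empty lists show up too.
has_empty = any(len(s) == 0 for s in samples)
print(lengths_ok, values_ok, has_empty)
```

The empty list is worth noticing: it is exactly the kind of edge case a person might forget to write by hand.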

(2) Running our sort on those random lists

MAX_LENGTH = 100
MIN_VALUE = -100000
MAX_VALUE =  100000
NUM_TRIALS = 100
def test_quicksort():
    for i in range(NUM_TRIALS):
        test_list = random_list(MAX_LENGTH, MIN_VALUE, MAX_VALUE)
        quick_sort(test_list)

This may look like it’s not doing anything. While it’s true that there’s no assert statement yet, we are testing something: that the quick_sort function doesn’t crash!

This sort of output-less testing is called fuzzing or fuzz testing and it’s used a lot in industry. After all, there’s nothing that says we need to stop at 100 trials! I could run this millions of times overnight, and gain some confidence that my program doesn’t crash (even if it might not produce a correct result overall).
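A fuzzing loop becomes more useful if it records which inputs caused a crash. The sketch below is one possible way to do that. The names `fuzz` and `fragile_max` are hypothetical (they are not from the notes), and `random_list` here is a variant that takes its own random generator so the run is repeatable.

```python
import random

def random_list(rng, max_length: int, min_value: int, max_value: int) -> list:
    # Variant of random_list that draws from an explicit generator.
    length = rng.randint(0, max_length)
    return [rng.randint(min_value, max_value) for _ in range(length)]

def fragile_max(l: list) -> int:
    # Hypothetical function under test: max() raises ValueError on an empty list.
    return max(l)

def fuzz(fn, trials: int, rng) -> list:
    """Call fn on random lists and collect every input that makes it raise."""
    crashes = []
    for _ in range(trials):
        test_list = random_list(rng, 5, -10, 10)
        try:
            fn(test_list[:])  # pass a copy so fn cannot alter the saved input
        except Exception as err:
            crashes.append((test_list, err))
    return crashes

crashes = fuzz(fragile_max, 200, random.Random(0))
for bad_input, err in crashes[:3]:
    print(repr(bad_input), type(err).__name__)
```

Here every recorded crash comes from the empty list, which tells us precisely which case the function forgot to handle.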

(3) Dealing with the output question

Of course, we’d still like to actually test the outputs on these randomly generated inputs. It wouldn’t really be feasible to produce a million inputs automatically, and then manually go and figure out the corresponding outputs. Instead, we should ask: do we have another source of the correct answer? Indeed we do: our own merge sort, or better yet, Python’s built-in sorting function.

MAX_LENGTH = 100
MIN_VALUE = -100000
MAX_VALUE =  100000
NUM_TRIALS = 100
def test_quicksort():
    for i in range(NUM_TRIALS):
        test_list = random_list(MAX_LENGTH, MIN_VALUE, MAX_VALUE)
        # We could also compare the result of Python's sort here...
        assert quick_sort(test_list) == merge_sort(test_list)

As it turns out, this procedure quickly finds a bug in our quick_sort implementation: if a list has duplicate elements, those duplicates will be deleted! Sadly, we’re only retaining one copy of the pivot in the above code. Instead, we’ll collect every element equal to the pivot into its own sublist:

def quick_sort(l: list) -> list:
    if len(l) <= 1:
        return l[:]
    pivot = l[0] # many other choices possible
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    same = [x for x in l if x == pivot]
    smaller_sorted = quick_sort(smaller)
    larger_sorted = quick_sort(larger)
    return smaller_sorted + same + larger_sorted
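To see the difference between the two versions, here is a sketch that uses Python's built-in `sorted` as the source of correct answers and reports the first random list on which a sort disagrees with it. The helper name `find_counterexample`, the seed, and the deliberately narrow value range (0 to 5, so duplicates are very likely) are assumptions made for this illustration.

```python
import random

def quick_sort_buggy(l: list) -> list:
    # First version from the notes: keeps only one copy of the pivot.
    if len(l) <= 1:
        return l[:]
    pivot = l[0]
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    return quick_sort_buggy(smaller) + [pivot] + quick_sort_buggy(larger)

def quick_sort(l: list) -> list:
    # Corrected version: keeps every element equal to the pivot.
    if len(l) <= 1:
        return l[:]
    pivot = l[0]
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    same = [x for x in l if x == pivot]
    return quick_sort(smaller) + same + quick_sort(larger)

def find_counterexample(sort_fn, trials: int = 200, seed: int = 0):
    """Return the first random list where sort_fn disagrees with sorted(), else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 10)
        test_list = [rng.randint(0, 5) for _ in range(length)]
        if sort_fn(test_list) != sorted(test_list):
            return test_list
    return None

print(find_counterexample(quick_sort_buggy))  # a list containing a repeated value
print(find_counterexample(quick_sort))        # None
```

Returning the failing list, rather than just failing an assert, makes it easy to copy that input into a hand-written test afterwards.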

Perspective

It doesn’t always make sense to use random testing, but when it does it’s a powerful technique: it literally allows you to mine for bugs while you sleep!

Hopefully it goes without saying, but there’s nothing specific to sorting about this technique: in fact, most of you could use it to help test your final projects, if you wanted! The key is not to rely on it completely: always use random testing to augment an existing, cleverly-chosen set of manual test cases. And if your random testing finds a new bug, as we did today, just add that input and output to your manual suite: you’ll never let that bug sneak by you again!

There’s a lot more to this story. If you want to explore it more, try using this approach on randomly generated lists of records, rather than randomly generated lists of numbers.

Sorting more than numbers

Let’s define a class whose job is to represent a record that combines someone’s age and name. We could use dataclasses for this (or tuples, or lists, or many other options) but we’ll use a plain class for two reasons.

class Record: 
    def __init__(self, age, name):
        self.age = age
        self.name = name
    def __eq__(self, other):
        return self.age == other.age and self.name == other.name
    def __ne__(self, other):
        return self.age != other.age or self.name != other.name
    def __lt__(self, other):
        return self.age < other.age
    def __gt__(self, other):
        return self.age > other.age
    def __ge__(self, other):
        return self.age >= other.age
    def __le__(self, other):
        return self.age <= other.age
    def __repr__(self):
        # !r shows the name with quotes, so the output reads like a constructor call
        return f'Record({self.age}, {self.name!r})'

The first is that dataclasses won’t let you use different subsets of fields for == and <. Here, we want the records to be sorted by age alone, but equality should use both fields. This means that not (x < y) and not (x > y) together don’t imply x == y.

The second reason is that I’d like to demonstrate how to define <, ==, etc. in Python manually, which is what the above code does.
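A short sketch makes the first point concrete. The class below is a trimmed copy of the one above, keeping only the methods this demonstration needs; the two example records are made up for illustration.

```python
class Record:
    # Trimmed copy of the class above: only the methods this demo needs.
    def __init__(self, age, name):
        self.age = age
        self.name = name
    def __eq__(self, other):
        return self.age == other.age and self.name == other.name
    def __lt__(self, other):
        return self.age < other.age
    def __gt__(self, other):
        return self.age > other.age

x = Record(41, "Tim")
y = Record(41, "Nim")

neither_less_nor_greater = not (x < y) and not (x > y)
print(neither_less_nor_greater)  # True: the ages tie
print(x == y)                    # False: the names differ
```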

Alternative: via a dataclass

We could also do something like this via a dataclass, but then we’ve got to either include or exclude each field from all comparisons:

from dataclasses import dataclass, field

# Immutable, but also auto-generate <
@dataclass(frozen=True, order=True)
class Record:
    age: int
    # exclude "name" from use in ==, <, >, etc.
    name: str = field(compare=False)

The order=True parameter tells Python to automatically create ordering functions (<, >, etc.) that work over instances of this dataclass. The field declaration excludes the name field from these comparisons. As a result, Records will be ordered entirely by their age field.
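The trade-off shows up as soon as we compare two records that share an age. In this sketch (the example names are made up), excluding `name` from comparisons affects `==` as well as `<`, so the two records count as equal:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True, order=True)
class Record:
    age: int
    # compare=False removes this field from ==, <, >, and the other comparisons
    name: str = field(compare=False)

tim = Record(41, "Tim")
nim = Record(41, "Nim")

print(tim == nim)               # True: name is ignored by == as well
print(Record(30, "Zed") < tim)  # True: ordered by age alone
print(tim < nim, tim > nim)     # False False
```

With the hand-written class, by contrast, these two records would be unequal even though neither is less than the other.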

Sorting records

If we wanted to sort lists of Records, how would we need to change our merge sort function from last time?

Think, then Click! We won't need to make any changes at all. Because the comparison functions are defined for `Record` (whether we write them by hand, as above, or let `dataclass` generate them), our existing code will handle `Record`s just fine. That's the power of polymorphism! Of course, if we tried to sort a list that contained both numbers and records, we'd get an error, since `<` is only defined for comparing a number to a number, a `Record` to a `Record`, a string to a string, and so on.

Do different correct sorting algorithms always agree perfectly on their output, now?

Think, then click!

No. Not all sorting algorithms will produce the same ordering for elements with identical keys. If the values being sorted are just numbers, this is immaterial. If the values are more complex, we may see disagreement.

I say “may” because it depends on the low-level specifics of the sorting code. For example, this list may or may not be sorted differently by two different functions, even though those functions are correct, because only the age field matters for sorting:

[Record(41, "Tim"), Record(41, "Nim")]
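Here is a small sketch of two correct sorts disagreeing. To keep it short it uses plain `(age, name)` tuples and a `key` function in place of the `Record` class, which is a simplification made for this example. Both results are in nondecreasing order of age, but they order the tied entries differently.

```python
def by_age(person: tuple) -> int:
    # Sort key: compare people by age only, ignoring the name.
    return person[0]

people = [(41, "Tim"), (41, "Nim"), (30, "Ann")]

# Python's sorted() is stable: tied elements keep their original order.
stable = sorted(people, key=by_age)

# Sorting in descending order and then reversing also yields ascending ages,
# but it flips the relative order of the tied elements.
tie_reversed = sorted(people, key=by_age, reverse=True)[::-1]

print(stable)        # [(30, 'Ann'), (41, 'Tim'), (41, 'Nim')]
print(tie_reversed)  # [(30, 'Ann'), (41, 'Nim'), (41, 'Tim')]
print(stable == tie_reversed)  # False
```

A test that simply asserts the two outputs are equal would fail here, even though neither sort has a bug.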

There are ways to work around this: we can do more than just compare results. What if we wrote a function that recognized what correctness meant? It might look something like this:

def verify_sorted_correctly(input: list, output: list) -> bool:
    # code to check that the output is in order
    # code to check that the output contains a permutation of the input
    ...
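One way to fill in that outline is sketched below; it is an illustration rather than the only reasonable design. The parameters are renamed to `original` and `result` to avoid shadowing Python's built-in `input`. The permutation check relies only on `==`, because a class that defines `__eq__` without `__hash__` (like the hand-written `Record`) cannot be placed in a set or used as a dictionary key. The cost is quadratic time, which is acceptable for test-sized lists.

```python
def verify_sorted_correctly(original: list, result: list) -> bool:
    # In order: no element is smaller than the one before it.
    in_order = all(
        not (result[i + 1] < result[i]) for i in range(len(result) - 1)
    )
    # Permutation: each result element must match a distinct original element.
    remaining = original[:]
    for item in result:
        if item not in remaining:
            return False  # result contains something the input did not
        remaining.remove(item)
    # Anything left over was in the input but is missing from the result.
    return in_order and not remaining

print(verify_sorted_correctly([3, 1, 2], [1, 2, 3]))  # True
print(verify_sorted_correctly([3, 1, 3], [1, 3]))     # False: a duplicate was dropped
print(verify_sorted_correctly([3, 1, 2], [1, 3, 2]))  # False: not in order
```

Because it checks a property of the output instead of comparing against one particular answer, this function accepts any correct ordering of tied elements.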

A final bug

There is a subtle error in our quicksort function that only appears when we’re testing with records, not integers. Above, we said that we’d built the Record class so that:

not (x < y) and not (x > y) together don’t imply x == y.

Given this, why are we using == to produce the pivot sublist? If two records are generated with the same age, but different names, won’t one of them be dropped?

Here’s the fixed code:

def quick_sort(l: list) -> list:
    if len(l) <= 1:
        return l[:]
    pivot = l[0] # many other choices possible
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    # don't assume that not > and not < implies ==:
    same = [x for x in l if not (x < pivot) and not (x > pivot)] 
    smaller_sorted = quick_sort(smaller)
    larger_sorted = quick_sort(larger)
    return smaller_sorted + same + larger_sorted
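To watch the bug happen, the sketch below runs both versions on two records that share an age. The `Record` class is a trimmed copy of the one defined earlier, and the name `quick_sort_eq` is a label invented here for the earlier, `==`-based version.

```python
class Record:
    # Trimmed copy of the Record class from earlier in these notes.
    def __init__(self, age, name):
        self.age = age
        self.name = name
    def __eq__(self, other):
        return self.age == other.age and self.name == other.name
    def __lt__(self, other):
        return self.age < other.age
    def __gt__(self, other):
        return self.age > other.age
    def __repr__(self):
        return f'Record({self.age}, {self.name!r})'

def quick_sort_eq(l: list) -> list:
    # Earlier version: builds the pivot sublist with ==.
    if len(l) <= 1:
        return l[:]
    pivot = l[0]
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    same = [x for x in l if x == pivot]
    return quick_sort_eq(smaller) + same + quick_sort_eq(larger)

def quick_sort(l: list) -> list:
    # Fixed version: "same" means neither smaller nor larger than the pivot.
    if len(l) <= 1:
        return l[:]
    pivot = l[0]
    smaller = [x for x in l if x < pivot]
    larger = [x for x in l if x > pivot]
    same = [x for x in l if not (x < pivot) and not (x > pivot)]
    return quick_sort(smaller) + same + quick_sort(larger)

records = [Record(41, "Tim"), Record(41, "Nim")]
print(quick_sort_eq(records))  # [Record(41, 'Tim')]
print(quick_sort(records))     # [Record(41, 'Tim'), Record(41, 'Nim')]
```

The second record is neither smaller than, larger than, nor equal to the pivot, so the `==` version puts it in none of the three sublists.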

Equality is challenging.