Tree Structured Data (Web Documents)

HTML Livecode

HTMLTree library

Web documents

There’s a lot of useful data out there on the web. For example, this disambiguation page on Wikipedia for “Python”, or this list of courses offered by the CSCI department.

Is there any data out there on the web that you’d like to be able to analyze, do computation over, or otherwise make use of? Then you’re in luck: today we’re going to start a portion of the class devoted to working with the kind of data you’ll find on a webpage.

You’re probably viewing these notes in your web browser. Most browsers have a “View Source” menu item somewhere.

Use this option to open up the source of this page. What does the source look like? There’s a lot here, so instead let’s start with a simpler page: this one. (You can also find it on the CS department’s page here.) Its source looks like this:

<html>
  <head>
    <title>This is a page</title>
  </head>
  <body>
    <p>Here is a paragraph of text</p>
    <p>Another paragraph, with <strong>bold</strong> text</p>
    <p>And a list:
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </p>
    <p>And a bolded list:
      <strong>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
        </ul>
      </strong>
    </p>
  </body>
</html>

What do we notice about the structure of this source?

Webpages are generally written using this language, which is called HTML (short for Hyper-Text Markup Language). HTML is made up of tags (the things in angle brackets). A tag can contain text; it can also contain other tags, which can contain still more tags, and so on.

Why does HTML look like this? Because HTML documents have structure. They aren’t just plain text: they have a header and a document body, formatting like paragraphs and boldface text, and even formatted lists.

We won’t really learn how to write HTML in this course. We will, however, learn how to do computations over HTML documents in Python, because web pages can be an excellent source of data. If we want to extract that data and work with it, we need to be able to turn it into a useful form.

So: how might we represent the HTML document above in a data structure?

The problem is that the tags are nested: tags contain tags which contain tags, and so on. So far we haven’t learned any data structures in Python that are good for working with that sort of document.

Trees

Data with nested structure is quite common in computer science. In CSCI 0111, we saw one example: ancestry trees, recording the parents of individuals (and their parents, and their parents, and so on). Because this sort of data can resemble a family tree, we’ll call it tree structured. Let’s build a tree-shaped data structure for storing HTML documents.

Our tree type will look like this:

from dataclasses import dataclass

@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""

Each instance of the HTMLTree datatype represents a single tag. The name of the tag ('p', 'strong', etc.) is in the tag field. The tag’s children are in the children field.

In the HTML document above, there’s text in addition to tags. In our HTMLTree class, a run of text is represented as a special node whose tag field is 'text'; the text itself goes in the text field. These text nodes never have children.

If we wanted to represent the HTML linked above using this data structure, it might look something like this in memory:
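One way to picture that shape is as an indented outline. The sketch below builds, by hand, a tree for just the first two paragraphs of the example page and prints one line per node. The helper name `outline` is our own invention for illustration and is not part of the course library; the dataclass is repeated only so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


# The first two paragraphs of the example page, built by hand.
body = HTMLTree("body", [
    HTMLTree("p", [HTMLTree("text", [], "Here is a paragraph of text")]),
    HTMLTree("p", [
        HTMLTree("text", [], "Another paragraph, with "),
        HTMLTree("strong", [HTMLTree("text", [], "bold")]),
        HTMLTree("text", [], " text"),
    ]),
])


def outline(tree: HTMLTree, indent: int = 0) -> list:
    """Return one line per node, indented by how deep the node sits.

    Hypothetical helper for illustration only.
    """
    label = tree.tag if tree.tag != "text" else f'text: "{tree.text}"'
    lines = ["  " * indent + label]
    for child in tree.children:
        lines.extend(outline(child, indent + 1))
    return lines


print("\n".join(outline(body)))
```

Notice that the word "bold" sits three levels below body: inside a text node, inside a strong tag, inside a p tag.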

Working with HTML

Here’s how we’d create a very basic document in Python:

HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])

This corresponds to the HTML:

<p>Text in a paragraph</p>

If our documents get much bigger, defining them by hand in Python like this is going to get pretty annoying. We’ve written a little HTML library for the class (which is also where HTMLTree is defined). You can download the Python file from this link.

There’s a function in the library to take a string of HTML and turn it into a tree. Here’s how you might use it to produce HTMLTrees from strings:

from htmltree import *

# directly via constructor
tree1 = HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
print(tree1)
# parsed into an HTMLTree via our helper code
tree2 = parse('<p>Some other text</p>')
print(tree2)
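If you’re curious how a string could be turned into a tree at all, here is a rough sketch built on Python’s standard html.parser module. This is not the course library’s actual code, and the names `TreeBuilder` and `parse_sketch` are made up for this example. It assumes well-formed HTML in which every opening tag has a matching closing tag.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds an HTMLTree by tracking the currently open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.root = HTMLTree("root", [])  # placeholder parent for top-level tags
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # A new tag becomes a child of the innermost open tag, then opens itself.
        node = HTMLTree(tag, [])
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes the closing tag matches the innermost open tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Skip whitespace-only runs (indentation between tags).
        if data.strip():
            self.stack[-1].children.append(HTMLTree("text", [], data))


def parse_sketch(html: str) -> HTMLTree:
    """Hypothetical stand-in for the library's parse function."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    top = builder.root.children
    # Return the single top-level tag if there is exactly one.
    return top[0] if len(top) == 1 else builder.root


print(parse_sketch("<p>Some other text</p>"))
```

The stack is the key design choice: at any moment it holds the chain of tags we are currently "inside", so a new tag or piece of text always attaches to the tag on top. A real parser also has to cope with tags that are never closed (such as br), which this sketch ignores.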

We can also print out the HTML string a tree corresponds to:

> print_html(tree1)
<p>Text in a paragraph</p>
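Going from a tree back to a string is a nice first taste of how we’ll process trees. Below is a minimal sketch, assuming tags have no attributes and text needs no escaping. The function name `to_html` is hypothetical, and the library’s print_html may format its output differently.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


def to_html(tree: HTMLTree) -> str:
    """Rebuild an HTML string from a tree (illustrative sketch only)."""
    if tree.tag == "text":
        # Text nodes have no children: just produce their text.
        return tree.text
    # Otherwise, convert every child, then wrap the result in this tag.
    inner = "".join(to_html(child) for child in tree.children)
    return f"<{tree.tag}>{inner}</{tree.tag}>"


tree1 = HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
print(to_html(tree1))
```

The function calls itself on each child, so it handles nesting of any depth without needing to know in advance how deep the document goes.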

This is enough to start working with the HTML source of webpages. Over the next few days, we’ll build the skills needed to search, generate statistics over, and even edit HTML documents. And these skills aren’t just about HTML; they apply to any tree-structured format.

I wonder…

What other tree-structured data formats have you used a lot already?

A Note on Python vs. Pyret lists

This isn’t strictly about trees, but it will come up in a lab soon.

Earlier this week, we used the fact that Python’s lists are contiguous blocks of memory to build very fast set and dictionary data structures. If you took 0111, you might recall that Pyret’s lists don’t work that way. Instead, they are built up recursively, much like the way we built the HTMLTree dataclass earlier today.
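To make the comparison concrete, here is a rough Python imitation of a Pyret-style list. The class name `Link` and the use of None to stand in for Pyret’s empty are our own choices for this sketch; Pyret itself spells these link and empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    first: int
    rest: Optional["Link"]  # None plays the role of Pyret's `empty`


# Roughly the Pyret list [list: 1, 2, 3]
nums = Link(1, Link(2, Link(3, None)))


def length(lst: Optional[Link]) -> int:
    """Count elements by following `rest` until we run out of links."""
    if lst is None:
        return 0
    return 1 + length(lst.rest)


print(length(nums))
```

Each Link holds one value plus a reference to another list, just as each HTMLTree holds one tag plus references to other trees. The difference is that a Link has exactly one "child" (its rest), while an HTMLTree can have many.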

You haven’t yet been able to exercise all the skills you developed to work with Pyret lists. But those skills are about to become valuable, because, computationally speaking, trees look an awful lot like Pyret lists.

What does that mean?

Consider: how would I write a function that measures the depth of an HTMLTree?
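Try it yourself before reading on. One possible answer is sketched below, under the assumption that "depth" means the number of nodes on the longest path from the root down to a leaf, with text nodes counted like any other node. Other conventions (for example, counting a lone node as depth 0) are equally reasonable.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


def depth(tree: HTMLTree) -> int:
    """Number of levels in the tree, counting the root as level 1."""
    if not tree.children:
        # A node with no children is a single level on its own.
        return 1
    # Otherwise: this node, plus the deepest of its children.
    return 1 + max(depth(child) for child in tree.children)


doc = HTMLTree("p", [
    HTMLTree("text", [], "Another paragraph, with "),
    HTMLTree("strong", [HTMLTree("text", [], "bold")]),
])
print(depth(doc))
```

Compare this with how you would compute the length of a Pyret list: a base case for the smallest structure, and a recursive case that combines results from the smaller pieces. The only new wrinkle is that a tree has several smaller pieces to recur on rather than one.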