TSE Crawler design and implementation
Goals
- To understand the specifications for the Tiny Search Engine (TSE)
- To develop an implementation plan for the TSE Crawler
Specifications
A good software implementation requires a good software design, and a good software design must be based on a clear set of requirements. We think about these specifications as a sort of “contract” between the programmer (who writes the code) and the customer (the ultimate user of the software). Thus, we need three specs:
- Requirements spec: specifies what the software must do
- Design spec: specifies the structure of the software in a language-independent, machine-independent way
- Implementation spec: specifies the language-dependent, machine-dependent details of the implementation.
We’ll take a closer look at the design process in the next lecture. For today, let’s apply this approach to the TSE.
Tiny Search Engine (TSE)
Our Tiny Search Engine (TSE) design is inspired by the material in the paper Searching the Web, by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan (Stanford University); ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1 (August 2001).
TSE Requirements Spec
The Tiny Search Engine (TSE) shall consist of three subsystems:
- crawler, which crawls the web from a given seed to a given maxDepth and caches the content of the pages it finds, one page per file, in a given directory.
- indexer, which reads files from the given directory, builds an index that maps from words to pages (URLs), and writes that index to a given file.
- querier, which reads the index from a given file, and a query expressed as a set of words optionally combined by the operators (AND, OR), and outputs a ranked list of pages (URLs) in which the given combination of words appears.
Each subsystem is a standalone program executed from the command line, but they inter-connect through files in the file system.
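For example, the three programs might be invoked like this (the argument order shown here is illustrative; each subsystem’s Requirements Spec gives its exact usage):

```bash
# hypothetical command lines; see each subsystem's Requirements Spec for exact usage
./crawler seedURL pageDirectory maxDepth   # cache pages under pageDirectory
./indexer pageDirectory indexFilename      # build the index file
./querier pageDirectory indexFilename      # answer queries read from stdin
```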
In a spec document, we write “shall do” to mean “must do”.
We’ll look deeper at the requirements for each subsystem later. First, let’s go to the next level on the overall TSE: the Design Spec.
TSE Design Spec
The overall architecture presented below shows the modular decomposition of the system:
The above diagram is consistent with the Requirements Spec: we can clearly see three sub-systems, their interconnection through files, and the user interface for submitting queries to the querier. The querier subsystem has an internal ranking module, which we anticipate might be separate from the query processor module; we’ll look more closely when we come to the querier design.
Next, we describe each sub-system and its high-level design.
The crawler crawls a website and retrieves webpages, starting with a specified URL. It parses the initial webpage, extracts any embedded href URLs and retrieves those pages, and crawls the pages found at those URLs, but limits itself to maxDepth hops from the seed URL.
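The crawling loop itself can be sketched with the Lab3 data structures: a bag of pages not yet explored, and a hashtable of URLs already seen so no page is fetched twice. This is a minimal sketch assuming libcs50’s bag, hashtable, and webpage modules (check their headers for the authoritative prototypes); page saving, URL normalization, the restriction to internal URLs, and error handling are elided.

```c
/* Minimal sketch of the crawler loop, assuming libcs50's bag, hashtable,
 * and webpage modules; page-saving, URL normalization, and the restriction
 * to internal URLs are elided. */
#include <stdlib.h>
#include <string.h>
#include "bag.h"
#include "hashtable.h"
#include "webpage.h"

static void crawl(char *seedURL, const int maxDepth)
{
  bag_t *toCrawl = bag_new();             // pages not yet explored
  hashtable_t *seen = hashtable_new(200); // URLs already added to the bag

  // the webpage module takes ownership of a heap-allocated URL
  bag_insert(toCrawl, webpage_new(strdup(seedURL), 0, NULL));
  hashtable_insert(seen, seedURL, "");

  webpage_t *page;
  while ((page = bag_extract(toCrawl)) != NULL) {
    if (webpage_fetch(page)) {            // retrieve the page's HTML
      // ... save the page to the pageDirectory here ...
      if (webpage_getDepth(page) < maxDepth) {
        int pos = 0;
        char *url;
        // scan the fetched HTML for embedded href URLs
        while ((url = webpage_getNextURL(page, &pos)) != NULL) {
          if (hashtable_insert(seen, url, "")) {  // false if already seen
            bag_insert(toCrawl, webpage_new(url, webpage_getDepth(page) + 1, NULL));
          } else {
            free(url);                    // duplicate URL; discard it
          }
        }
      }
    }
    webpage_delete(page);
  }
  bag_delete(toCrawl, NULL);
  hashtable_delete(seen, NULL);
}
```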
When the crawler process is complete, the indexing of the collected documents can begin.
The indexer extracts all the keywords for each stored webpage and creates a lookup table that maps each word found to all the documents (webpages) where the word was found.
The query engine responds to requests (queries) from users. The query processor module loads the index and searches for pages that include the search keywords. Because there may be many hits, we need a ranking module to rank the results (e.g., high to low number of instances of a keyword on a page).
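Looking ahead, one natural realization of the indexer’s lookup table uses the Lab3 data structures: a hashtable keyed by word, whose items are counters mapping a document ID to the number of occurrences of that word in that document. A minimal sketch, assuming libcs50’s hashtable and counters modules (the helper index_add is hypothetical, not part of the starter code):

```c
/* Sketch: the index as hashtable(word -> counters(docID -> count)),
 * built from libcs50's hashtable and counters modules.
 * index_add is a hypothetical helper, not part of the starter code. */
#include "hashtable.h"
#include "counters.h"

static void index_add(hashtable_t *index, const char *word, const int docID)
{
  counters_t *ctrs = hashtable_find(index, word);
  if (ctrs == NULL) {              // first occurrence of this word anywhere
    ctrs = counters_new();
    hashtable_insert(index, word, ctrs);
  }
  counters_add(ctrs, docID);       // bump the count for (word, docID)
}
```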
TSE Crawler Specs
We’ll look deeper at the requirements for the indexer and querier later. Right now, let’s focus on the crawler:
- Requirements Spec (provided by the starter kit).
- Design Spec (provided by the starter kit).
- Implementation Spec:
The Design Spec describes abstractions: abstract data structures and pseudocode. The same design could be implemented in C or Java or another language. The Implementation Spec gets more specific: it is language, operating system, and hardware dependent.
The implementation spec includes many or all of these topics:
- Detailed pseudocode for each of the objects/components/functions,
- Definition of detailed APIs, interfaces, function prototypes and their parameters (see the sketch after this list),
- Data structures (e.g., struct names and members),
- Security and privacy properties,
- Error handling and recovery,
- Resource management,
- Persistent storage (files, database, etc.).
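As a taste of that level of detail, an implementation spec might pin down a module’s interface with concrete prototypes. The pagedir prototypes below are hypothetical, for illustration only; your own names and parameters may differ.

```c
/* Hypothetical pagedir interface, at the level of detail an implementation
 * spec would pin down; your own names and parameters may differ. */
#include <stdbool.h>
#include "webpage.h"

// create or validate the pageDirectory; return true on success
bool pagedir_init(const char *pageDirectory);

// write one page to pageDirectory/docID: URL, depth, then the HTML
void pagedir_save(const webpage_t *page, const char *pageDirectory, const int docID);
```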
The Lab3 assignment is written like an implementation spec, right down to the specific function prototypes.
You need to write the Implementation spec for the Crawler in Lab 4.
Organization of the TSE code
Let’s take a look at the structure of the TSE solution - so you can see what you’re aiming for.
Directory structure
My TSE comprises six subdirectories:
- libcs50 - a library of code we provide
- common - a library of code you write
- crawler - the crawler
- indexer - the indexer
- querier - the querier
- data - with subdirectories where the crawler and indexer can write files, and the querier can read files.
My top-level .gitignore file excludes data from the repository, because the data files are big, change often, and don’t deserve to be saved.
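A minimal version of that entry is just the directory name (your file may list other patterns too):

```
# don't commit the large, frequently-changing data files
data/
```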
The full tree looks like this:
```
├── common
│ ├── index.c
│ ├── index.h
│ ├── Makefile
│ ├── pagedir.c
│ ├── pagedir.h
│ ├── word.c
│ └── word.h
├── crawler
│ ├── crawler.c
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── REQUIREMENTS.md
│ └── testing.sh
├── indexer
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── indexer.c
│ ├── indextest.c
│ ├── indextest.sh
│ ├── Makefile
│ ├── README.md
│ ├── REQUIREMENTS.md
│ └── testing.sh
├── libcs50
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── Makefile
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── README.md
│ ├── set.c
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
├── Makefile
├── querier
│ ├── DESIGN.md
│ ├── fuzzquery.c
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── querier.c
│ ├── README.md
│ ├── REQUIREMENTS.md
│ ├── testing.sh
│ └── testing.txt
└── README.md
```
Source files
My crawler, indexer, and querier each consist of just one .c file. They share some common code, which I keep in the common directory:
- pagedir - a suite of functions to help the crawler write pages to the pageDirectory and help the indexer read them back in
- index - a suite of functions that implement the “index” data structure; this module includes functions to write an index to a file (used by indexer) and read an index from a file (used by querier).
- word - a function NormalizeWord used by both the indexer and the querier (see the sketch after this list).
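NormalizeWord maps a word to a canonical form; a plausible version lowercases it in place, though the exact behavior is a design decision left to you:

```c
#include <ctype.h>

/* Sketch: normalize a word by lowercasing it in place; whether NormalizeWord
 * does exactly this is a design decision left to you. */
void NormalizeWord(char *word)
{
  if (word != NULL) {
    for (char *p = word; *p != '\0'; p++) {
      *p = tolower((unsigned char) *p);
    }
  }
}
```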
Each of the program directories (crawler, indexer, querier) includes a few files related to testing, as well.
You’ll recognize the Lab3 data structures - they’re all in the libcs50 library.
Note the flatter organization - there’s not a separate subdirectory (with Makefile or test code) for each data structure.
Activity
In today’s activity, we aim to become familiar with the webpage module: we will use it to fetch the web page at a given URL and save it to a file.
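As a starting point, a program along these lines fetches one page and writes its HTML to a file. It assumes the webpage module’s webpage_new, webpage_fetch, webpage_getHTML, and webpage_delete functions (see webpage.h for the real prototypes); error handling is minimal.

```c
/* Sketch: fetch the page at a given URL and save its HTML to a file.
 * Assumes libcs50's webpage module; error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "webpage.h"

int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "usage: %s url outputFile\n", argv[0]);
    return 1;
  }

  // the webpage module takes ownership of a heap-allocated copy of the URL
  webpage_t *page = webpage_new(strdup(argv[1]), 0, NULL);
  if (page == NULL || !webpage_fetch(page)) {
    fprintf(stderr, "failed to fetch %s\n", argv[1]);
    return 2;
  }

  FILE *fp = fopen(argv[2], "w");
  if (fp == NULL) {
    fprintf(stderr, "cannot open %s for writing\n", argv[2]);
    return 3;
  }
  fputs(webpage_getHTML(page), fp);  // write out the fetched HTML
  fclose(fp);
  webpage_delete(page);
  return 0;
}
```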