Searching The Web
In the next series of Labs we will design, implement, and test a command-line web-search tool called Tiny Search Engine (TSE), so named because it can be written in under 2000 lines of student-written C code (about 500 lines in each of the four Labs 3-6). Today we begin to discuss the concepts behind web search and the top-level design of TSE, including its decomposition into three major components: crawler, indexer, and querier.
In this lecture, we discuss some of the foundational issues associated with searching the web. We also discuss the overall architectural design of a more comprehensive search engine than TSE.
You should skim this classic paper about a web search engine:
Searching the Web, Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan (Stanford University). ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1 (August 2001).
This paper gives insights into search-engine design. You can skip the plots and deeper research material, but do your best to understand the text on the main components and design issues of a search engine.
Goals
We plan to learn the following from today’s lecture:
- How does a search engine like Google search the web?
- A general search-engine architecture.
- URLs, webpages, HTML, HTTP, and keywords.
- The requirements for a crawler, and a demonstration of a crawler implementation.
Searching the Web
How do you get information from the Web? Searching the web is something we do every day with ease, but it’s technically challenging to implement because of the scale of the web and because pages change at dramatically different rates. As of April 20, 2022, there were 2.97 billion publicly available webpages on the “indexed web” according to one estimate. Even that number is likely to be an underestimate, because many organizations have lots of internal webpages that aren’t indexed by the public search engines.
To get information about hiking in New Hampshire, I can use a search engine (such as Google) as an information retrieval system; it returns a list of links (URLs) to sites that have the keywords I specified embedded in them. Conveniently, the search engine orders (ranks) the links so the most-relevant pages are near the top of the list.
Google responded to my query in 0.61 seconds with 22,900,000 matches found! How is that possible? How does Google search more than 5 billion web pages in a fraction of a second? Surely, Google does not actually search those pages for each user’s request. Instead, it looks into a vast ‘index’ of the web, built earlier. Given a new query, it checks the index for pages known to have the word “hiking”, and again for those with the phrase “new hampshire”, and then intersects the two results to come up with a list.
How does Google rank the pages in the resulting list? The solution is actually Google’s ‘secret sauce’, the “page-rank algorithm” developed by Brin and Page when they were grad students. (Although the original algorithm was published, Google continues to refine and tweak it and keeps the details secret.)
When and how does Google build that index? And how does it find all the pages on the web? Google’s servers are constantly “crawling” the web: given one link (URL), download that page, find all the links in that page, and then go examine those pages - recursively. As new (or updated) pages are found, they “index” each page by extracting the list of words used on that page, and then building a data structure that maps from words to URLs.
Later, a search query is broken into words, and each word is sought in the index, returning a set of URLs where that word appears. For a multi-word query, they intersect the sets to find a set where all words appear. Then they apply the page-rank algorithm to the set to produce a ranked list.
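To make that “intersect the sets” step concrete, here is a minimal sketch in C (hypothetical data and names, not Google’s or TSE’s code): each word’s index entry is represented as a sorted array of document IDs, and we walk the two arrays in step to find the documents that contain both words.

#include <stdio.h>

/* Minimal sketch: intersect two sorted lists of document IDs, one per query
 * word, printing the IDs of documents that contain both words.
 * In a real index, each word maps to such a list of documents (or URLs). */
static void intersect(const int *a, const int na, const int *b, const int nb)
{
  int i = 0, j = 0;
  while (i < na && j < nb) {
    if (a[i] < b[j]) {
      i++;                        // this document has only the first word
    } else if (a[i] > b[j]) {
      j++;                        // this document has only the second word
    } else {
      printf("doc %d\n", a[i]);   // this document contains both words
      i++; j++;
    }
  }
}

int main(void)
{
  const int hiking[] = { 2, 5, 9, 14, 21 };    // documents containing "hiking"
  const int hampshire[] = { 5, 8, 14, 30 };    // documents containing "new hampshire"
  intersect(hiking, 5, hampshire, 4);          // prints doc 5 and doc 14
  return 0;
}

A real engine would, of course, rank the resulting list rather than simply print it.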
In April 2014, Google’s website said its index filled over 100 million gigabytes! Check out this nice video from Google explaining how its search engine works.
General search engine architecture [Arasu et al., 2001]
Search engines like Google are complex, sophisticated, highly distributed systems. Below we reproduce the general search engine architecture discussed in Searching the Web.
The main components include parallel crawlers, crawler control (when and where to crawl), a page repository, an indexer, analysis, a collection of data structures (index tables, structure, utility), and a query engine and ranking module. Such a general architecture would take a significant amount of time to code. In our TSE, we will implement a stripped-down version of the main components.
URLs, HTML, and keywords
Some terminology:
- URL, short for Uniform Resource Locator, is used to specify addresses of webpages and other resources on the web. An example is `http://www.dartmouth.edu/index.html`, which refers to the `HTTP` network protocol, the `www.dartmouth.edu` server, and the `index.html` file on that server (see the short sketch after this list).
- HTML. Most web pages are written in HyperText Markup Language (HTML). For a quick tutorial on HTML see this Introduction to HTML. An HTML file is a text file with an `htm` or `html` file extension. HTML pages can be created by tools or simply in an editor like emacs. You will not need to write any HTML for this course.
- tags. HTML uses “tags” to mark up the text; for example, `<b>this text would be bold</b>`. Most tags are enclosed in angle brackets, like `<b>`, and most come in matching pairs marking the beginning and ending of a region of text to which the tag applies; note the `<b>` and `</b>` pair.
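Referring back to the URL bullet above: as a small illustration of the three parts (protocol, server, and file), here is one way a C program might split such a URL. This is only a sketch; real URLs may also carry port numbers, query strings, and fragments.

#include <stdio.h>
#include <string.h>

/* Sketch: split a URL of the form protocol://server/path into its parts.
 * Assumes both separators are present; this is not a general URL parser. */
int main(void)
{
  const char *url = "http://www.dartmouth.edu/index.html";

  const char *sep = strstr(url, "://");       // end of the protocol name
  const char *slash = strchr(sep + 3, '/');   // start of the file path

  printf("protocol: %.*s\n", (int) (sep - url), url);
  printf("server:   %.*s\n", (int) (slash - (sep + 3)), sep + 3);
  printf("path:     %s\n", slash);
  return 0;
}

This prints http, www.dartmouth.edu, and /index.html on three lines.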
We are interested in collecting URLs from HTML files.
The HTML tag that forms a link and references a URL is called an ‘anchor’, or ‘a’ for short.
The `<a>` tag takes parameters, most importantly the `href` parameter:
<a href="http://www.dartmouth.edu/index.html">Dartmouth home page</a>
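To give a feel for how a crawler might pull those URLs out of a page (a minimal sketch only; the parser we provide is far more careful about real-world HTML), here is one way to scan a string of HTML for `href` attributes:

#include <stdio.h>
#include <string.h>

/* Sketch: print the value of every double-quoted href attribute in an HTML
 * string. Real pages (single quotes, spacing, relative URLs) need more care. */
static void printLinks(const char *html)
{
  const char *p = html;
  while ((p = strstr(p, "href=\"")) != NULL) {
    p += strlen("href=\"");                 // skip past href="
    const char *end = strchr(p, '"');       // find the closing quote
    if (end == NULL) {
      break;                                // malformed; give up
    }
    printf("%.*s\n", (int) (end - p), p);   // print the URL between the quotes
    p = end + 1;
  }
}

int main(void)
{
  printLinks("<a href=\"http://www.dartmouth.edu/index.html\">Dartmouth home page</a>");
  return 0;
}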
For the purpose of indexing the page, we need to find the ‘words’ in the page. In most web pages, most of the content is outside the tags because the tags are there to format the content. For TinySearchEngine, we define keywords as being outside of tags.
So when TinySearchEngine downloads a webpage’s HTML source, it needs to parse the page to extract URLs (so it can crawl those URLs) and to identify the words that users might be interested in running queries for.
Parsing HTML can be challenging, especially because so many pages on the web don’t follow the HTML standard cleanly. We will provide you with a C function to parse the HTML. (Feel free to write your own if you prefer).
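To give a rough idea of what such a parser does (this is not the provided function, just an illustration of the “words are outside the tags” rule), the sketch below skips everything inside `<...>` and prints the remaining runs of letters as lower-case words:

#include <stdio.h>
#include <ctype.h>
#include <stdbool.h>

/* Sketch: print the 'words' of an HTML string by skipping everything inside
 * <...> tags and splitting the rest into runs of letters.
 * Real HTML (comments, scripts, entities) needs more care than this. */
static void printWords(const char *html)
{
  bool inTag = false;
  for (const char *p = html; *p != '\0'; p++) {
    if (*p == '<') {
      inTag = true;                          // entering a tag; skip until '>'
    } else if (*p == '>') {
      inTag = false;                         // left the tag; back to content
    } else if (!inTag && isalpha((unsigned char) *p)) {
      while (isalpha((unsigned char) *p)) {  // print one run of letters
        putchar(tolower((unsigned char) *p));
        p++;
      }
      putchar('\n');
      p--;                                   // step back; the for loop advances p
    }
  }
}

int main(void)
{
  printWords("<b>Hiking</b> in <i>New Hampshire</i>");  // hiking, in, new, hampshire
  return 0;
}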
For more information about HTML, check out the old HTML 4 specification or the newer HTML 5 specification.
Demonstration
Check the requirements spec for the Crawler in the starter kit, available after accepting the Lab 4 assignment; it is located under the `crawler/` directory.
Crawler execution and output
Below is a snippet of output from when the program starts to crawl the CS50 website to a depth of 2. The crawler prints status information as it goes along.
Note, you might consider covering this debugging print-out code in an `#ifdef` block that can be triggered by a compile-time switch.
See the Lecture extra about this trick.
$ ./crawler http://cs50tse.cs.dartmouth.edu/tse/letters/index.html data 2
0 Fetched: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
0 Scanning: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
0 Found: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
0 Added: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
1 Fetched: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
1 Scanning: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
1 Found: https://en.wikipedia.org/wiki/Algorithm
1 IgnExtrn: https://en.wikipedia.org/wiki/Algorithm
1 Found: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
1 Added: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
1 Found: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
1 IgnDupl: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
2 Fetched: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
$
Notice how I printed the depth of the current crawl at left, then indented slightly based on the current depth, then printed a single word meant to indicate what is being done, then printed the URL.
By ensuring a consistent format, and choosing a simple/unique word for each type of line, I can post-process the output with `grep`, `awk`, and so forth, enabling me to run various checks on the output of the crawler.
Much better than a mish-mash of arbitrary output formats!
To make this easy, I wrote a simple function to print those lines:
// log one word (1-9 chars) about a given url
inline static void logr(const char *word, const int depth, const char *url)
{
  printf("%2d %*s%9s: %s\n", depth, depth, "", word, url);
}
I thus have just one `printf` call, and if I want to tweak the format, I just need to edit one place and not every log-type `printf` in the code.
Notice the `inline` modifier. This means that C is allowed to compile this code ‘in line’ where the function call occurs, rather than compiling code that actually jumps to a function and returns. Syntactically, in every way, it’s just like a function - but more efficient. Great for tiny functions like this one, where it’s worth duplicating the code (making the executable bigger) to save time (making the program run slightly faster).
Anyway, at strategic points in the code, I make a call like this one:
logr("Fetched", page->depth, page->url);
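Tying this back to the earlier note about `#ifdef`: one possible way (a sketch, not a requirement) to make all of this logging disappear unless the program is compiled with a switch such as `-DAPPTEST` (the flag visible in the compile commands below) is to wrap the definition of `logr`:

// in crawler.c, replacing the logr definition shown above
#ifdef APPTEST
// log one word (1-9 chars) about a given url
inline static void logr(const char *word, const int depth, const char *url)
{
  printf("%2d %*s%9s: %s\n", depth, depth, "", word, url);
}
#else
// without APPTEST, each call like logr("Fetched", depth, url) compiles to nothing
#define logr(word, depth, url) ((void) 0)
#endif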
Contents of pageDirectory after crawler has run
For each URL crawled the program creates a file and places in the file the URL and the depth at which it was found, followed by all the contents of the webpage. But for a maxDepth = 2 as in this example there are only a few webpages crawled and files created. Below is a peek at the files created during the above crawl. Notice how each page file starts with the URL, then a number (the depth of that page during the crawl), then the contents of the page (here I printed only the first line of the content).
$ cd data/
$ ls
1 2 3
$ head -3 1
http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
0
<html>
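Those per-page files follow directly from that format. Here is a minimal sketch of how such a file might be written (the names `savePage` and `pageDirectory` are illustrative, not the required TSE interface):

#include <stdio.h>
#include <stdbool.h>

/* Sketch: write one crawled page to pageDirectory/id, with the URL on the
 * first line, the depth on the second line, and the page HTML after that. */
static bool savePage(const char *pageDirectory, const int id,
                     const char *url, const int depth, const char *html)
{
  char filename[256];
  snprintf(filename, sizeof(filename), "%s/%d", pageDirectory, id);

  FILE *fp = fopen(filename, "w");
  if (fp == NULL) {
    return false;                          // could not create the file
  }
  fprintf(fp, "%s\n%d\n%s", url, depth, html);
  fclose(fp);
  return true;
}

int main(void)
{
  // example call; assumes the directory "data" already exists
  savePage("data", 1, "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html", 0,
           "<html>\n");
  return 0;
}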
Makefiles
The starter kit includes a `Makefile` at the top level and another to build the `libcs50` library.
The top-level `Makefile` recursively calls Make on each of the source directories.
The `libcs50/Makefile` demonstrates how you can build a library, in this case `libcs50.a`, from a collection of object (`.o`) files.
Study this Makefile, because you’ll need to write something similar for your `common` directory.
You can then link these libraries into your other programs without having to list all the individual `.o` files on the command line.
For example, when I build my crawler Make runs commands as follows:
make -C crawler
gcc -Wall -pedantic -std=c11 -ggdb -DAPPTEST -I../libcs50 -I../common -c -o crawler.o crawler.c
gcc -Wall -pedantic -std=c11 -ggdb -DAPPTEST -I../libcs50 -I../common crawler.o ../common/common.a ../libcs50/libcs50.a -o crawler
The `crawler/Makefile` is written in good style, with appropriate use of variables, so the rule that causes Make to run the above commands is actually much shorter:
crawler: crawler.o $(LLIBS)
	$(CC) $(CFLAGS) $^ -o $@
We’ll work more with Makefiles in upcoming classes.
Activity
In today’s activity your group will start envisioning a design for the TSE Crawler.