CS 50 Software Design and Implementation

Lab4

TinySearch Engine: Crawler Lab

Lab4 starts a series of three assignments that will build the TinySearch Engine. This lab will design, implement and test the crawler module.

Tarball of files you need

Here is tarball of all the files you need for this lab rather than copying them over from my webpage. untar the file:

tar xvfz lab4handout.tar.gz

lab4handout.tar.gz .

The design of the crawler will be discussed in class as a collective design. We will develop the design and implementation in class. All students will implement the same DESIGN SPEC developed in class.

This is a challenging assignment. It includes data structures of link lists and a hash table and hash function. We will discuss both of these data structures in class.

This lab will cover many aspects of the C language discussed in class: structures, pointers, string processing, malloc/free, file operations and management, and interacting with the shell to execute wget via the system call (which you used in the last lab).

Importantly, for this lab you will use multiple files. Single functions or groups of related functions will go in their own .c file (e.g., crawler.c, list.c, html.c). You will also write header files for this lab which be included in the your various .c files.

We also use the GNU make command to help manage and compile these multiple files. Here is a snippet from man:

NAME
       make - GNU make utility to maintain groups of programs

SYNOPSIS
       make [ -f makefile ] [ option ] ...  target ...

DESCRIPTION
       The purpose of the make utility is  to  determine  automatically  which
       pieces of a large program need to be recompiled, and issue the commands
       to recompile them.

Grading: Grading rubric for lab.

Submitting assignment:

Use svn as normal.

Please make sure that the lab4 directory contains a simple text file, named README, describing anything “unusual” about how your solutions should be located, executed, and considered.

Coding style: Please put comments in your programs to help increase its understanding. Include a header in each C file including the file name; brief description of the program; inputs; and outputs. It is important you write C programs defensively; that is, you need to check the program’s input and provide appropriate warning or error messages back to the user in terms of the program usage. If you call functions from your code make sure you check the return status (if available) and print out meaningful error messages and take suitable action. See testing below.

Multiple files and make. Your directory will include a make file that will be used to do a build of your system. We will discuss the make utility in class this week. As part of grading we will do a build using the make command. There should be no warnings from the compiler.

Writing a bash test script to automatically test your crawler program. You will also be asked to write a test.sh that calls your crawler program multiple times with different parameters to check the operation of the program. For example, how does your program deals with various input some of which will be erroneous (e.g., bad URL, depth to large, etc.). Writing a test.sh script has the benefit that if you have changed your code you can quickly rerun the new build (output of the make) against your test script as a sanity check that nothing has been broken. While we will discuss unit testing in more detail for Lab5 we will start with Lab4 to get students to write simple bash shells (these are tools that are specific to testing the program) to test their code. Note, we expect that the test.sh script and the log from the tests will be included in the tarball. Call the test script crawler_test.sh and the log of what the test prints out should be directed to a file called crawler_testlog.‘date‘ (i.e., crawler_testlog.Wed Jan 30 02:17:20 EST 2008). Again, these two files must be included in your tarball. As part of your grade we will run you script against your code and look at the new log produced. Please make sure that your test script writes out the name of the test, expected results, and what the system outputs.

Testing at depth 3. You should incrementally test your crawler at depth 1, 2 and 3 to build confidence in your code. Note, that the final submitted code must work at depth 3. It will be tested at depth 3 as part of grading.

We will not grade labs that segfault. You would not buy a product that crashed. So please do not submit a lab that has segfaults. We will not grade it. You will have to fix the segfaults and resubmit the lab. There will be penalties. This is in fairness to students that submited labs without segfaults.

____________________________________________________________________________

Crawler Requirements

Design, implement, and test (but not exhaustively at this stage) a standalone crawler. The design of crawler will be done in class. In the next lecture we will start to develop a DESIGN SPEC for the crawler module. We will also develop an IMPLEMENTATION SPEC. Based on these two specs you will have a blueprint of the system to develop your own implementation that you can test.

The crawler is a standalone program that crawls the web and retrieves webpage starting with a seed URL. It parses the seed webpage extracts any embedded URLs that are tagged as discussed above and retrieves those pages, and so on. Once the crawler has completed at least one complete crawling cycle (i.e., it has visited a target number of Web pages which is defined by a depth parameter on the crawler command line) then the crawler process will complete its operation.

The crawler REQUIREMENTS are as follows. The crawler SHALL (a term here that means requirement):

receives a SEED_URL as input - considered as the initial URL;
only needs one good (i.e., valid/page found) URL to start the crawler process;
retrieves the seed page from the Web using the SEED_URL;
uses wget to transfer webpages to the TARGET_DIRECTORY;
parses the embedded URL links inside the seed page;
stores theses URL links in the URLsList The URL is only added to the URL List iff it is not already present in the list. Each element of this list is associated with a boolean flag visited that is initialized to false;
saves the seed webpage as a unique document ID starting at one and incrementing by one;
saves the SEED_URL and current depth in the file too. The SEED_URL is put on the first line and the depth on the second line. The HTML follows.
The crawler sets the seed URL as visited;
The crawler repeats the crawl cycle getting a new URL for a webpage not visited (not the seed because that has been visited). The crawler gets a new address from the URLsList, retrieves the page, parses for new URL links inside page; stores these new URL links in the URL list if not already in the URLsList; and sets the current URL as visited in the URLsList (i.e., visited=true).

The TinySearch architecture is shown in Figure 1. Observe the crawled webpages saved with unique document ID as the file name. The URL and the current depth of the search is stored in each file.

Figure 1: Crawler: webpages are saved as file with unique document IDs; the URL and depth are save in first two lines.

When does it complete. The crawler cycle completes when either all the URLs in the URL list are visited or an external stop command is issued. Note, the crawler stops retrieving new webpages once its as reached the depth of the depth parameter. For example, if the depth = 1 then only the seed page is retrieved and the pages of the URLs embedded in the seed page. If the depth = 2 then all the pages pointed to by the pages with embedded URL in the seed page are retrieved. The depth parameter tunes the number of pages that the crawler will retrieve.

Need to sleep Because webservers DO NOT like crawlers (think about why) they will block your crawler based on its address. THIS is a real problem. Why? Imagine you launch your spiffy TinySearch crawler and crawl the New York Times webpage continuously and fast. The New York Times server will try and serve your pages as fast as it can. Imagine 100s of crawlers launched against the server? Yes, it would spend an increasing amount of time serving crawlers and not customers.

But, wait. What would the New York Times do if it detects you crawling heavily from a domain dartmouth.edu? It would likely block the domain, i.e., the complete dartmouth community! What would that mean? Probably, Jim Kim wouldn’t be able to read his New York Times and I’m toast. So what should we do? Well let’s try and not look like a crawler to the New York Times website. Let’s introduce a delay. Just like spy - recall? - lets sleep for a period INTERVAL_PER_FETCH. Sneaky hey.

Command-line execution

The crawler command takes the following input:

./crawler [SEED_URL] [TARGET_DIRECTORY WHERE TO PUT THE DATA] [CRAWLING_DEPTH]

Example command input

crawler www.cs.dartmouth.edu ./data/ 2

[SEED_URL]
Requirement: The URL must be valid.
Usage: The crawler needs to inform the user if URL is not found

[TARGET_DIRECTORY]
Requirement: The directory must exist and be already created by the user.
Usage: The crawler needs to inform the user if the directory can not be found

[CRAWLING_DEPTH]
Requirement: The crawl depth cannot exceed 4
Usage: The crawler needs to inform the user the user exceeds the maximum depth

For the example command line:

crawler www.cs.dartmouth.edu ./data/ 2

[SEED_URL] www.cs.dartmouth.edu
[TARGET_DIRECTORY] ./data/
[CRAWLING_DEPTH] 2

Tip: You need to determine if the TARGET_DIRECTORY is a valid directory. You can use the stat() function and macros to help here.

Crawler output

The output of your crawler program should be:

For each webpage crawled the crawler program will create a file in the
[TARGET_DIRECTORY]. The name of the file will start a 1 for the [SEED_URL]
and be incremented for each subsequent HTML webpage crawled.

Each file (e.g., 10) will include the URL associated with the saved webpage and the
current depth of the search (e.g., 1, 2, .. N) in the file. The URL will be on the
first line of the file and the depth on the second line. The HTML for the webpage
will start on the third line.

Once the crawler starts to run it gets wget to download the SEED_URL then it starts to process each webpage hunting for new URLs. A parser function (which we will provide to you) runs through the stored webpage looking for URLs. These are stored in a URLList for later processing. The crawler must remove duplicate URLs that it finds (or better it marks that it has visited a webpage (URL) and does not visit again even if it finds the same URL again. There are also conditions of stale URLs that give “Page Not Found”.

Below is a snipped when the program starts to crawl the CS webserver to a depth of 2. Meaning it will attempt to visit all URLs in the main CS webpage and then all URLs in those pages. The crawler prints status information as it goes along (this could be used in debug mode to observe the operation of the crawler as it moves through its workload). Note, you should use a LOGSTAUS macro that can be set when compiling to switch these status print outs on or off. In addition, you should be able to write the output to a logfile to look at later should you wish.

In the snippet, the program get the SEED_URL page then prints out all the URLs it finds and then crawls http://www.cs.dartmouth.edu/index.php next. PHP is a scripting language that produces HTML - .php provides a valid webpage just like .html. In the snippet it only get two webpages.

atc@dhcp-212-163 crawler] crawler www.cs.dartmouth.edu ./data/ 2
Crawler]Crawlingwww.cs.dartmouth.edu
--02:36:10--  http://www.cs.dartmouth.edu/
           => ‘temp’
Resolving www.cs.dartmouth.edu... done.
Connecting to www.cs.dartmouth.edu[129.170.213.101]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,679 [text/html]

100%[===================================================>] 7,679          7.32M/s    ETA 00:00

02:36:10 (7.32 MB/s) - ‘temp’ saved [7679/7679]

[crawler]:Parser find link:http://www.cs.dartmouth.edu/index.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/about.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/news.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/people.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/jobs.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/contact.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/internal/
[crawler]:Parser find link:http://www.cs.dartmouth.edu/research.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/seminar.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/books.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/reports
[crawler]:Parser find link:http://www.cs.dartmouth.edu/ug.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/gr.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/ug_courses.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/gr_courses.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/robotcamp
[crawler]:Parser find link:http://www.dartmouth.edu/apply
[crawler]:Parser find link:http://www.cs.dartmouth.edu/admit_ms.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/admit_phd.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/~mdphd
[crawler]:Parser find link:http://www.cs.dartmouth.edu/gr_life.php
[crawler]:Parser find link:http://www.cs.dartmouth.edu/~sws
[crawler]:Parser find link:http://www.ams.org/bull
[crawler]:Parser find link:http://www.cs.dartmouth.edu/~afra
[crawler]:Parser find link:http://www.ams.org/bull/2008-45-01/home.html
[crawler]:Parser find link:http://www.cs.dartmouth.edu/~farid
[crawler]:Parser find link:http://www.cs.dartmouth.edu/farid/press/todayshow07.html
[crawler]:Parser find link:www.cs.dartmouth.edu/~robotics
[crawler]:Parser find link:http://www.maa.org/mathland/mathtrek_07_25_05.html
[crawler]:Parser find link:http://www.cs.dartmouth.edu/~robotics/
[crawler]:Parser find link:http://www.cs.dartmouth.edu/
[Crawler]Crawlinghttp://www.cs.dartmouth.edu/index.php
--02:36:11--  http://www.cs.dartmouth.edu/index.php
           => ‘temp’
Resolving www.cs.dartmouth.edu... done.
Connecting to www.cs.dartmouth.edu[129.170.213.101]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,527 [text/html]

100%[============================================>] 7,527          7.18M/s    ETA 00:00

02:36:11 (7.18 MB/s) - ‘temp’ saved [7527/7527]

What does the TARGET_DIRECTORY look after the crawler has run

For each URL crawled the program creates a file and places in the file the URL and filename. But for a CRAWLING_DEPTH = 2 as in this example there are a large amount of webpages are crawled and files created. For example, if we do a look at the files created in the [TARGET_DIRECTORY] pages directory in this case, then crawler creates 184 files (184 webpages) of 3.2 Megabtes. That means for a depth of 2 on the departmental webpage there are 184 unique URLs. In fact there might be more - some could be stale URLs aka deadlinks - something to check for in your crawler. For example, using wget on a deadlink will return the following:

$ wget www.cs.dartmouth.edu/deadlink.html
--12:05:41-- http://www.cs.dartmouth.edu/deadlink.html
=> ‘deadlink.html’
Resolving www.cs.dartmouth.edu... done.
Connecting to www.cs.dartmouth.edu[129.170.213.101]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
12:05:41 ERROR 404: Not Found.

Here are the files from crawling:

[atc@dhcp-212-163 pages] ls | sort -n
1
2
3
4
5
6
7
8
9
=====
SNIP
=====
174
175
176
177
178
179
180
181
182
183
184

OK. You are done.

Looking at the format of the save files

Note, that each webpage is saved as a unique document ID starting at 1 and incrementing by one. Below we less three files (viz. 1, 5, 139). As you can see the crawler has stored the URL and the current depth value when the page was crawled.

[atc@dhcp-212-163 pages] less 1

www.cs.dartmouth.edu
0
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

====
SNIP
====

atc@dhcp-212-163 pages] less 5

http://www.cs.dartmouth.edu/people.php
1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

====
SNIP
====

[atc@dhcp-212-163 pages] less 139

http://www.cs.dartmouth.edu/ug_honors.php
2
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

Tip: Make sure you always logout when you are done and see the prompt to login again before you leave the terminal.