## DESCRIPTION

This is the solution for lab 4 of CS23, instructed by Andrew T. Campbell. The source code implements the crawler component of the TinySearch Engine.

PLEASE DO NOT DISTRIBUTE THIS CODE TO ANYONE.

Please take a close look at the code and its decomposition into:

1) crawler.[hc]    The main crawler controller code
2) file.[hc]       Common file processing functions and data structures
3) list.[hc]       Common URL list processing functions and data structures
4) html.[hc]       Parsing and HTML processing functions and data structures
5) hash.[hc]       Hash table functions and data structures
6) dictionary.[hc] Dictionary data structure and processing functions
7) header.h        Some useful macros

How did we come up with this decomposition? When you think about the data flow and the data structures needed to implement the crawler, and then write the pseudocode, you can clearly see that there are components that manage and manipulate files, URL lists, and strings. There is a component that deals with the dictionary. Another component handles HTML file and buffer processing.

There is no such thing as the one right decomposition of a software system into components; what we have above is a reasonable decomposition. You may have bundled all the code into crawler.c as part of your solution - don't worry. You may also have written long functions that fold many responsibilities together - don't worry. *In lab5 and lab6 we start to care about good code and decomposition.*

Take a look at the coded solution and learn from it. Can you refactor your code into list, file, dictionary, html, and hash components? Please do so. Refactor, and then rerun your refactored crawler to make sure you have not introduced errors. The reason for doing this, beyond the learning itself, is that you will use these list, file, and dictionary components in the next two assignments. What you are doing here is building up a set of common utilities that can be used for the remaining parts of TinySearch. This will make your final TinySearch smaller and less buggy: you will reuse code that you have already written and debugged rather than writing new code. So refactor, refactor, refactor. It is not required and is not graded, but if you refactor your crawler and pull out common code as shown in ./lab4/src/util/ below, you will become a better C programmer - or your money back. You will also save a lot of time in this course.

## SOURCE CODE DETAILS

The lab4 solution includes:

./lab4/README    - this file; it explains the source code and how to build, run, and test it.
./lab4/data/     - the directory where crawled files are stored [1, 2, 3 ... N].

The source code is arranged in the following directories:

./lab4/src/util/

    Description: Includes general-purpose utilities and the HTML parsers.

    What's in the util directory? These are source files that we think have general usage for the whole search engine project, so we put them in one directory. When we build applications such as the crawler, we link this code in via the Makefile.

    file.c file.h             (file access utilities)
    html.c html.h             (HTML parsers, where getNextUrl() hides)
    header.h                  (some useful MACROS)
    hash.h hash.c             (hash functions; a sketch of such a helper follows this listing)
    dictionary.h dictionary.c (a general-purpose library implementing a dictionary data structure)
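The actual interfaces of hash.[hc] and dictionary.[hc] are not reproduced in this README, but to give a flavor of what a util component looks like, here is a minimal, hypothetical sketch of a URL-hashing helper of the kind the dictionary could use to pick a bucket. The function name hashURL and its signature are assumptions made for illustration, not the solution's actual API.

```c
/* Illustrative sketch only -- the real hash.[hc] interface may differ.   */
/* A classic DJB2-style string hash, reduced modulo the table size so the */
/* dictionary can use the result as a bucket index for a URL.             */

/* Hypothetical helper: hash a URL string into the range [0, modSize). */
unsigned long hashURL(const char *url, unsigned long modSize)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *url++) != '\0') {
        hash = ((hash << 5) + hash) + (unsigned long)c;   /* hash * 33 + c */
    }
    return hash % modSize;
}
```

Keeping a helper like this in util/ rather than buried in crawler.c is exactly the payoff of the refactoring described above: the indexer and query components in the later labs can link against the same hash and dictionary code instead of rewriting it.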
./lab4/src/crawler/

    Description: Main crawler controller, URL list processing, Makefile, and a bash script to run and test the code.

    crawler.c                 (the main entry point; the crawler's main logic is here)
    list.h list.c             (all functions related to the URL list are here)
    Makefile                  (the Makefile for the crawler)
    test_and_start_crawl.sh   (the bash script to test and start a crawl)

## TO BUILD

To build the crawler, go to src/crawler/ and type "make".

## TO RUN and TEST

To test the crawler, go to src/crawler/ and type "./test_and_start_crawl.sh".

This script will:

1. Test the argument checking of the crawler (see the sketch after this list).
2. Test that crawling terminates when using a small depth.
3. Crawl http://www.cs.dartmouth.edu and save the fetched pages in ../../data.
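The first check exercises the crawler's argument validation. This README does not spell out the exact command line, so the sketch below assumes a usage of the form crawler [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]; the argument order, the MAX_DEPTH_ALLOWED cap, and the checkArgs helper are all illustrative assumptions rather than the solution's actual code.

```c
/* Illustrative sketch only -- the real crawler.c may parse and validate */
/* its arguments differently. Assumed usage:                             */
/*     crawler [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]                 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define MAX_DEPTH_ALLOWED 4   /* hypothetical cap on the crawl depth */

/* Return 0 if the arguments look sane, nonzero otherwise. */
static int checkArgs(int argc, char *argv[])
{
    struct stat dirInfo;

    if (argc != 4) {
        fprintf(stderr, "usage: %s [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]\n", argv[0]);
        return 1;
    }
    /* The seed URL should at least look like an http URL. */
    if (strncmp(argv[1], "http://", 7) != 0) {
        fprintf(stderr, "error: seed URL must start with http://\n");
        return 1;
    }
    /* The target directory must already exist and be a directory. */
    if (stat(argv[2], &dirInfo) != 0 || !S_ISDIR(dirInfo.st_mode)) {
        fprintf(stderr, "error: %s is not a usable directory\n", argv[2]);
        return 1;
    }
    /* The depth must be a small non-negative integer. */
    int depth = atoi(argv[3]);
    if (depth < 0 || depth > MAX_DEPTH_ALLOWED) {
        fprintf(stderr, "error: depth must be between 0 and %d\n", MAX_DEPTH_ALLOWED);
        return 1;
    }
    return 0;
}

int main(int argc, char *argv[])
{
    if (checkArgs(argc, argv) != 0) {
        return EXIT_FAILURE;
    }
    /* ... the real crawler's fetch/parse/store loop would follow here ... */
    return EXIT_SUCCESS;
}
```

Checks of this kind are what the first step of test_and_start_crawl.sh probes: the crawler should refuse bad URLs, missing directories, and out-of-range depths before it ever touches the network.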