## DESCRIPTION

This is the solution for lab 4 of CS23, instructed by Andrew T. Campbell. The source code implements the crawler component of the TinySearch Engine.

PLEASE DO NOT DISTRIBUTE THIS CODE TO ANYONE.

Please take a close look at the code and its decomposition into:

1) crawler.[hc]    The main crawler controller code
2) file.[hc]       Common file processing functions and data structures
3) list.[hc]       Common URL list processing functions and data structures
4) html.[hc]       Parsing and HTML processing functions and data structures
5) hash.[hc]       Hash table functions and data structures
6) dictionary.[hc] Dictionary data structure and processing functions
7) header.h        Some useful macros

How did we come up with this decomposition? When you think about the data flow and the data structures needed to implement the crawler, and then write the pseudocode, you can clearly see that there are components that manage and manipulate files, URL lists, and strings. There is a component that deals with the dictionary. Another component handles HTML file and buffer processing.

There is no such thing as the one right decomposition of a software system into components; what we have above is a reasonable decomposition. You may have bundled all the code into crawler.c as part of your solution - don't worry. You may also have written long functions that fold many responsibilities together - don't worry. *In lab5 and lab6 we start to care about good code and decomposition.*

Take a look at the coded solution and learn from it. Can you refactor your code into list, file, dictionary, html, and hash components? Please do so. Refactor, and then rerun your refactored crawler to make sure you have not introduced errors. The reason for doing this, beyond the learning itself, is that you will use these list, file, and dictionary components in the next two assignments. What you are doing here is building up a set of common utilities that can be used for the remaining parts of TinySearch. This will make your final TinySearch smaller and less buggy: you will reuse code that you have already written and debugged rather than writing new code. So refactor, refactor, refactor. It is not required and is not graded, but if you refactor your crawler and pull out common code as shown in ./lab4/src/util/ below, you will become a better C programmer - or your money back. You will also save a lot of time in this course.

## SOURCE CODE DETAILS

The lab4 solution includes:

./lab4/README    - this file; it explains the source code and how to build, run, and test it.
./lab4/data/     - the directory where crawled files are stored [1, 2, 3 ... N].

The source code is arranged in the following directories:

./lab4/src/util/

    Description: Includes general-purpose utilities and the HTML parsers.

    What's in the util directory? These are source files that we think have general usage for the whole search engine project, so we put them in one directory. When we build applications such as the crawler, we link this code in via the Makefile.

    file.c file.h             (file access utilities)
    html.c html.h             (HTML parsers, where getNextUrl() hides)
    header.h                  (some useful MACROS)
    hash.h hash.c             (hash functions; a sketch of such a helper follows this listing)
    dictionary.h dictionary.c (a general-purpose library implementing a dictionary data structure)
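The actual interfaces of hash.[hc] and dictionary.[hc] are not reproduced in this README, but to give a flavor of what a util component looks like, here is a minimal, hypothetical sketch of a URL-hashing helper of the kind the dictionary could use to pick a bucket. The function name hashURL and its signature are assumptions made for illustration, not the solution's actual API.

```c
/* Illustrative sketch only -- the real hash.[hc] interface may differ.   */
/* A classic DJB2-style string hash, reduced modulo the table size so the */
/* dictionary can use the result as a bucket index for a URL.             */

/* Hypothetical helper: hash a URL string into the range [0, modSize). */
unsigned long hashURL(const char *url, unsigned long modSize)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *url++) != '\0') {
        hash = ((hash << 5) + hash) + (unsigned long)c;   /* hash * 33 + c */
    }
    return hash % modSize;
}
```

Keeping a helper like this in util/ rather than buried in crawler.c is exactly the payoff of the refactoring described above: the indexer and query components in the later labs can link against the same hash and dictionary code instead of rewriting it.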
./lab4/src/crawler/

    Description: Main crawler controller, URL list processing, Makefile, and a bash script to run and test the code.

    crawler.c                 (the main entry point; the crawler's main logic is here)
    list.h list.c             (all functions related to the URL list are here)
    Makefile                  (the Makefile for the crawler)
    test_and_start_crawl.sh   (the bash script to test and start a crawl)

## TO BUILD

To build the crawler, go to src/crawler/ and type "make".

## TO RUN and TEST

To test the crawler, go to src/crawler/ and type "./test_and_start_crawl.sh".

This script will:

1. Test the argument checking of the crawler (see the sketch after this list).
2. Test that crawling terminates when using a small depth.
3. Crawl http://www.cs.dartmouth.edu and save the fetched pages in ../../data.
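The first check exercises the crawler's argument validation. This README does not spell out the exact command line, so the sketch below assumes a usage of the form crawler [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]; the argument order, the MAX_DEPTH_ALLOWED cap, and the checkArgs helper are all illustrative assumptions rather than the solution's actual code.

```c
/* Illustrative sketch only -- the real crawler.c may parse and validate */
/* its arguments differently. Assumed usage:                             */
/*     crawler [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]                 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define MAX_DEPTH_ALLOWED 4   /* hypothetical cap on the crawl depth */

/* Return 0 if the arguments look sane, nonzero otherwise. */
static int checkArgs(int argc, char *argv[])
{
    struct stat dirInfo;

    if (argc != 4) {
        fprintf(stderr, "usage: %s [SEED_URL] [TARGET_DIRECTORY] [MAX_DEPTH]\n", argv[0]);
        return 1;
    }
    /* The seed URL should at least look like an http URL. */
    if (strncmp(argv[1], "http://", 7) != 0) {
        fprintf(stderr, "error: seed URL must start with http://\n");
        return 1;
    }
    /* The target directory must already exist and be a directory. */
    if (stat(argv[2], &dirInfo) != 0 || !S_ISDIR(dirInfo.st_mode)) {
        fprintf(stderr, "error: %s is not a usable directory\n", argv[2]);
        return 1;
    }
    /* The depth must be a small non-negative integer. */
    int depth = atoi(argv[3]);
    if (depth < 0 || depth > MAX_DEPTH_ALLOWED) {
        fprintf(stderr, "error: depth must be between 0 and %d\n", MAX_DEPTH_ALLOWED);
        return 1;
    }
    return 0;
}

int main(int argc, char *argv[])
{
    if (checkArgs(argc, argv) != 0) {
        return EXIT_FAILURE;
    }
    /* ... the real crawler's fetch/parse/store loop would follow here ... */
    return EXIT_SUCCESS;
}
```

Checks of this kind are what the first step of test_and_start_crawl.sh probes: the crawler should refuse bad URLs, missing directories, and out-of-range depths before it ever touches the network.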