Lab 4 - Crawler
Due Friday April 29, at 10pm
In this lab you’ll begin the Tiny Search Engine (TSE) by coding the Crawler according to the Requirements Spec and Design Spec in the starter kit we provided on GitHub.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail some or all of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL and take some appropriate (safe) action. Write solid unit tests and test drivers, and use regression testing as your development proceeds.
If valgrind reports memory errors or memory leaks when crawler exits normally, you will lose points.
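For example, here is a minimal sketch of that defensive style, using a hypothetical helper (`countSlashes` is not part of the assignment) that refuses to dereference a NULL parameter:

```c
#include <stdio.h>

/* Hypothetical helper: count slashes in a URL string.
 * It checks its pointer parameter before using it and returns a safe
 * default rather than crashing on NULL. */
static int countSlashes(const char *url)
{
  if (url == NULL) {
    return 0;               // defensive: refuse to dereference NULL
  }
  int count = 0;
  for (const char *p = url; *p != '\0'; p++) {
    if (*p == '/') {
      count++;
    }
  }
  return count;
}

int main(void)
{
  printf("%d\n", countSlashes("http://cs50tse.cs.dartmouth.edu/tse/"));
  printf("%d\n", countSlashes(NULL));   // safe: prints 0 instead of segfaulting
  return 0;
}
```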
Preparation
- Accept the assignment, which includes a starter kit.
- (Optional) Copy your `counters.c`, `set.c`, and `hashtable.c` from your Lab 3 solution to the `libcs50` subdirectory in your brand-new repository; try `make` in that directory to ensure everything works.
Assignment
Write the first sub-system of the Tiny Search Engine: the crawler. Your implementation must follow the Crawler Requirements Spec and the Crawler Design Spec (aka “the Specs”, linked at the top), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with `make` and removed by `make clean`.
Commit and push at least once a day.
In your `tse-xxxx` directory,
- Edit the `README.md` file to include your name; later, add any top-level comments you need to communicate to the grader.
- Review `Makefile`, which calls Make recursively in `crawler` and `common`; `make` should build everything and `make clean` should clean everything. You may need to comment out the lines for `indexer` and `querier`, for now.
In the `crawler` directory,
- Write a `README.md` file to provide any special information about how to compile or test your crawler, or any assumptions you made while writing the crawler.
- Write an `IMPLEMENTATION.md` file to describe the implementation of your crawler. See the lecture notes about what should be in an implementation spec.
- Write a program `crawler.c` according to the Specs (a sketch of its command-line handling appears after this list).
- Write a `Makefile` that will, by default, build the `crawler` executable program.
- Add a `make clean` target that removes files produced by Make or your tests.
- Add a `make test` target that tests your crawler.
- Write a `testing.sh` bash script that can be run by `make test`. Read about testing below. This script must include good comments describing your tests.
- Save the output of your tests with `make test &> testing.out`.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
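As a starting point for `crawler.c`, here is a rough sketch of command-line handling; it assumes the usual `crawler seedURL pageDirectory maxDepth` interface, so check the Requirements Spec for the exact arguments, validation rules, and exit statuses.

```c
#include <stdio.h>

int main(const int argc, char *argv[])
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s seedURL pageDirectory maxDepth\n", argv[0]);
    return 1;
  }
  const char *seedURL = argv[1];
  const char *pageDirectory = argv[2];

  int maxDepth;
  char excess;                           // catches trailing junk such as "2x"
  if (sscanf(argv[3], "%d%c", &maxDepth, &excess) != 1 || maxDepth < 0) {
    fprintf(stderr, "maxDepth must be a non-negative integer\n");
    return 2;
  }

  // ... validate seedURL and pageDirectory, then start the crawl ...
  printf("crawling %s into %s to depth %d\n", seedURL, pageDirectory, maxDepth);
  return 0;
}
```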
In the `common` directory, assemble code that will eventually be shared by the crawler, indexer, and querier.
For now, that comprises code for initializing the “pageDirectory” and saving webpages there.
(You will find reason to expand this file and this directory when you write the other subsystems.)
- Create files `pagedir.c` and `pagedir.h` (one possible interface is sketched after this list).
- Write a `Makefile` that will, by default, build the `common.a` library. It must also have a `clean` target that removes files produced by Make.
- Write a `README.md` file to provide any special information about how to compile or use this library. This file is likely to be short.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
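To make that division of labor concrete, here is one possible sketch of `pagedir.h`; the names `pagedir_init` and `pagedir_save` and their signatures are illustrative assumptions, not requirements.

```c
/* pagedir.h - one possible interface (names and signatures are
 * illustrative only; design your own to match your Specs). */
#ifndef __PAGEDIR_H
#define __PAGEDIR_H

#include <stdbool.h>
#include "webpage.h"

/* Verify that pageDirectory exists and is writable, and mark it as a
 * crawler-produced directory (e.g., by creating pageDirectory/.crawler);
 * return true on success, false on any failure. */
bool pagedir_init(const char *pageDirectory);

/* Write the given page to a file pageDirectory/docID in whatever format
 * the Specs require (typically URL, depth, then the page HTML). */
void pagedir_save(webpage_t *page, const char *pageDirectory, const int docID);

#endif // __PAGEDIR_H
```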
Submission - what
Your submitted repo should include the following files, plus any others you created for testing:
├── .gitignore
├── Makefile
├── README.md
├── common # added by you
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── pagedir.c
│ └── pagedir.h
├── crawler # added by you
│ ├── .gitignore
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── testing.sh
│ ├── testing.out
│ └── crawler.c
├── libcs50
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c # if you decided to add your Lab3 solution
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c # if you decided to add your Lab3 solution
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── set.c # if you decided to add your Lab3 solution
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
You shall NOT modify any of the code we provide in `libcs50`.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,
- Add all required files using an appropriate `git add` command line.
- Commit all changes using an appropriate `git commit` command line.
- Tag: `git tag lab4submit`
- Push: `git push --tags` to ensure the tags are pushed to the remote. See more about tags.
Use your web browser to view your remote on GitHub and make sure there is a tag `lab4submit` and that the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged `lab4submit`, so it is safe for you to continue work (e.g., to start on Lab 5) even if you commit or push after tagging.
We will start grading when we first see a commit tagged `lab4submit`, and will judge it late if that tag was added after the deadline.
To avoid confusion, please email the professor and graduate TA if you want a late submission to be graded.
Your entire code base must compile with `make` from the top-level directory and produce no compiler warnings.
We will run `valgrind` on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Hints and tips
libcs50
We provide several modules in the `libcs50` directory, which compiles to a library `libcs50.a` you can link with your crawler.
You should not change any of our code.
You may drop in your `set.[ch]`, `counters.[ch]`, and `hashtable.[ch]` files from Lab 3.
If you’d prefer to use our versions of set, counters, and hashtable, that’s totally fine; the starter kit includes a pre-compiled library you can use instead; see `libcs50/README.md`.
Several of the modules in this directory are documented with a corresponding Markdown file; all of their interfaces are documented by their corresponding `.h` files.
Testing Crawler
It is your responsibility to test your crawler thoroughly. Consider following some of the testing plan discussed in Lecture.
Some safe seeds you might test include
- http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
- http://cs50tse.cs.dartmouth.edu/tse/wikipedia/index.html
- http://cs50tse.cs.dartmouth.edu/tse/toscrape/index.html
The first two are good at depth 0 and higher, but the third is likely too slow to run past maxDepth=2.
The output of our crawler can be found at this link. Keep in mind that our crawler may process URLs in a different order, so your directory may not be identical to ours.
You may pick other test sites, but your seed URL must begin with `http://cs50tse.cs.dartmouth.edu/`.
Use valgrind and gdb
We’ve provided information about gdb and valgrind; make use of them.
Memory allocation
You may find the `memory.c` module useful. Its use is optional, but it is ready to use as part of the `libcs50.a` library.
In Lab 3 we tried to recover gracefully from memory-allocation failures.
In the TSE application programs, we’ll be lazier: on a NULL return from malloc/calloc you can print an error and exit with a non-zero exit status.
(You may find the `assertp()` function, defined by the `memory` module, to be useful here.)
In such cases, it is ok if valgrind reports your program as having leaks.
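For example, here is a minimal sketch of that print-and-exit policy; the explicit check below does by hand what `assertp()` wraps up for you (see `memory.h` for its exact interface).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  // allocate a copy of a URL; give up immediately if malloc fails
  const char *url = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";
  char *copy = malloc(strlen(url) + 1);
  if (copy == NULL) {
    fprintf(stderr, "crawler: out of memory\n");
    exit(99);               // any non-zero exit status satisfies the policy above
  }
  // (assertp() in the memory module wraps this same check-print-exit pattern;
  //  see memory.h for its exact interface)
  strcpy(copy, url);
  printf("%s\n", copy);
  free(copy);
  return 0;
}
```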
Hashtable of URLs
Our design calls for use of a hashtable to store the URLs already seen.
Our hashtable implementation stores a `void* item` with each key. But if you’re just storing URLs (as keys), what item do you store?
Our hashtable disallows NULL items.
Suggestion: just pass a constant string as the item; even `""` will do.
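Here is a small sketch of that trick, assuming the Lab 3 hashtable interface (`hashtable_new`, `hashtable_insert`, `hashtable_delete`) and its refusal to insert a duplicate key:

```c
#include <stdio.h>
#include "hashtable.h"

int main(void)
{
  // track URLs we have already seen; only the key (the URL) matters,
  // so the item is a throwaway constant string
  hashtable_t *seen = hashtable_new(200);    // slot count is a tuning choice
  char *url = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";

  if (hashtable_insert(seen, url, "")) {
    printf("new URL: %s\n", url);            // insert succeeded: first sighting
  } else {
    printf("already saw: %s\n", url);        // insert refused: duplicate key (or error)
  }

  hashtable_delete(seen, NULL);              // items are string literals; nothing to free
  return 0;
}
```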
Pseudocode
Your implementation does not need to follow our pseudocode; this is one reason why you need to write about your implementation choices in `IMPLEMENTATION.md`.
Our pseudocode refers to a pagefetcher; note this is effectively `webpage_fetch()`.
Note that `webpage_fetch()` inserts a 1-second delay after every fetch attempt; your Crawler thus gets this behavior for free.
Our pseudocode refers to a pagescanner; yours should leverage `webpage_getNextURL()`.
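To illustrate how those two pieces fit together, here is a hedged sketch of fetching one page and scanning it for URLs; double-check `webpage.h` for the exact signatures and memory contract, and remember that a real crawler must also normalize and filter URLs and track depth per the Specs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "webpage.h"

int main(void)
{
  // webpage_new takes ownership of the URL string, so pass it a malloc'd copy
  const char *seed = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";
  char *url = malloc(strlen(seed) + 1);
  if (url == NULL) {
    return 1;
  }
  strcpy(url, seed);

  webpage_t *page = webpage_new(url, 0, NULL);     // depth 0; HTML is fetched below
  if (page == NULL) {
    free(url);                                     // webpage_new did not take ownership
    return 1;
  }
  if (webpage_fetch(page)) {                       // the "pagefetcher"; pauses ~1 second per fetch
    int pos = 0;
    char *next;
    while ((next = webpage_getNextURL(page, &pos)) != NULL) {  // the "pagescanner"
      printf("found URL: %s\n", next);             // a real crawler would normalize, filter, and enqueue it
      free(next);                                  // returned URLs are malloc'd; caller frees them
    }
  }
  webpage_delete(page);                            // also frees the URL and HTML
  return 0;
}
```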
pageDirectory
The Crawler is supposed to determine whether the pathname provided as pageDirectory is indeed a directory, and is writable (meaning, you can create files there).
How?
The simplest method is to create a file there.
If you construct a string `filename` that is the directory name, followed by a slash `'/'`, followed by a file name like `.crawler`, you could call `fopen(filename, "w")`; if it returns NULL (failure), then either the directory does not exist or is not writable.
If it returns non-NULL (success), close the file.
It’s fine to leave `.crawler` in the page directory.
(Actually, the presence of this file will later help your indexer confirm the pageDirectory is indeed a Crawler-produced directory.)
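Here is a minimal sketch of that test, wrapped in a hypothetical helper (`pagedir_validate` is an illustrative name; fold the logic into whatever function your pagedir module defines):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical helper: return true if pageDirectory exists and is writable,
 * using the ".crawler" trick described above. */
static bool pagedir_validate(const char *pageDirectory)
{
  // build the string "pageDirectory/.crawler"
  char *filename = malloc(strlen(pageDirectory) + strlen("/.crawler") + 1);
  if (filename == NULL) {
    return false;
  }
  sprintf(filename, "%s/.crawler", pageDirectory);

  FILE *fp = fopen(filename, "w");   // NULL if the directory is missing or unwritable
  free(filename);
  if (fp == NULL) {
    return false;
  }
  fclose(fp);                        // success; leave the empty .crawler file behind
  return true;
}

int main(const int argc, char *argv[])
{
  const char *dir = (argc > 1) ? argv[1] : ".";
  printf("%s is %s\n", dir, pagedir_validate(dir) ? "usable" : "not usable");
  return 0;
}
```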
This sort of detail is one reason we recommend writing a separate pagedir module in the `common` sub-directory, because its functions will need to be shared by crawler, indexer, and querier, and the details about the structure and format of that directory should not be scattered across those three programs, but rather contained within one clean module.