Lab 4 - Crawler
Due Friday April 29, at 10pm
In this lab you’ll begin the Tiny Search Engine (TSE) by coding the Crawler according to the Requirements Spec and Design Spec in the starter kit we provided on GitHub.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail some or all of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL and take some appropriate (safe) action. Write solid unit tests and test drivers, and use regression testing as your development proceeds.
If valgrind reports memory errors or memory leaks when crawler exits normally, you will lose points.
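For example, here is a minimal sketch of that defensive style, using a hypothetical helper (`countSlashes` is not part of the assignment) that refuses to dereference a NULL parameter:

```c
#include <stdio.h>

/* Hypothetical helper: count slashes in a URL string.
 * It checks its pointer parameter before using it and returns a safe
 * default rather than crashing on NULL. */
static int countSlashes(const char *url)
{
  if (url == NULL) {
    return 0;               // defensive: refuse to dereference NULL
  }
  int count = 0;
  for (const char *p = url; *p != '\0'; p++) {
    if (*p == '/') {
      count++;
    }
  }
  return count;
}

int main(void)
{
  printf("%d\n", countSlashes("http://cs50tse.cs.dartmouth.edu/tse/"));
  printf("%d\n", countSlashes(NULL));   // safe: prints 0 instead of segfaulting
  return 0;
}
```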
Preparation
- Accept the assignment, which includes a starter kit.
- (Optional) Copy your `counters.c`, `set.c`, and `hashtable.c` from your Lab 3 solution to the `libcs50` subdirectory in your brand-new repository; try `make` in that directory to ensure everything works.
Assignment
Write the first sub-system of the Tiny Search Engine: the crawler. Your implementation must follow the Crawler Requirements Spec and the Crawler Design Spec (aka “the Specs”, linked at the top), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with `make` and removed by `make clean`.
Commit and push at least once a day.
In your `tse-xxxx` directory,
- Edit the `README.md` file to include your name; later, add any top-level comments you need to communicate to the grader.
- Review `Makefile`, which calls Make recursively in `crawler` and `common`; `make` should build everything and `make clean` should clean everything. You may need to comment out the lines for `indexer` and `querier`, for now.
In the `crawler` directory,
- Write a `README.md` file to provide any special information about how to compile or test your crawler, or any assumptions you made while writing the crawler.
- Write an `IMPLEMENTATION.md` file to describe the implementation of your crawler. See the lecture notes about what should be in an implementation spec.
- Write a program `crawler.c` according to the Specs (a sketch of its command-line handling appears after this list).
- Write a `Makefile` that will, by default, build the `crawler` executable program.
- Add a `make clean` target that removes files produced by Make or your tests.
- Add a `make test` target that tests your crawler.
- Write a `testing.sh` bash script that can be run by `make test`. Read about testing below. This script must include good comments describing your tests.
- Save the output of your tests with `make test &> testing.out`.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
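As a starting point for `crawler.c`, here is a rough sketch of command-line handling; it assumes the usual `crawler seedURL pageDirectory maxDepth` interface, so check the Requirements Spec for the exact arguments, validation rules, and exit statuses.

```c
#include <stdio.h>

int main(const int argc, char *argv[])
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s seedURL pageDirectory maxDepth\n", argv[0]);
    return 1;
  }
  const char *seedURL = argv[1];
  const char *pageDirectory = argv[2];

  int maxDepth;
  char excess;                           // catches trailing junk such as "2x"
  if (sscanf(argv[3], "%d%c", &maxDepth, &excess) != 1 || maxDepth < 0) {
    fprintf(stderr, "maxDepth must be a non-negative integer\n");
    return 2;
  }

  // ... validate seedURL and pageDirectory, then start the crawl ...
  printf("crawling %s into %s to depth %d\n", seedURL, pageDirectory, maxDepth);
  return 0;
}
```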
In the `common` directory, assemble code that will eventually be shared by the crawler, indexer, and querier.
For now, that comprises code for initializing the “pageDirectory” and saving webpages there.
(You will find reason to expand this file and this directory when you write the other subsystems.)
- Create files `pagedir.c` and `pagedir.h` (one possible interface is sketched after this list).
- Write a `Makefile` that will, by default, build the `common.a` library. It must also have a `clean` target that removes files produced by Make.
- Write a `README.md` file to provide any special information about how to compile or use this library. This file is likely to be short.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
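To make that division of labor concrete, here is one possible sketch of `pagedir.h`; the names `pagedir_init` and `pagedir_save` and their signatures are illustrative assumptions, not requirements.

```c
/* pagedir.h - one possible interface (names and signatures are
 * illustrative only; design your own to match your Specs). */
#ifndef __PAGEDIR_H
#define __PAGEDIR_H

#include <stdbool.h>
#include "webpage.h"

/* Verify that pageDirectory exists and is writable, and mark it as a
 * crawler-produced directory (e.g., by creating pageDirectory/.crawler);
 * return true on success, false on any failure. */
bool pagedir_init(const char *pageDirectory);

/* Write the given page to a file pageDirectory/docID in whatever format
 * the Specs require (typically URL, depth, then the page HTML). */
void pagedir_save(webpage_t *page, const char *pageDirectory, const int docID);

#endif // __PAGEDIR_H
```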
Submission - what
Your submitted repo should include the following files, plus any others you created for testing:
├── .gitignore
├── Makefile
├── README.md
├── common # added by you
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── pagedir.c
│ └── pagedir.h
├── crawler # added by you
│ ├── .gitignore
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── testing.sh
│ ├── testing.out
│ └── crawler.c
├── libcs50
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c # if you decided to add your Lab3 solution
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c # if you decided to add your Lab3 solution
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── set.c # if you decided to add your Lab3 solution
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
You shall NOT modify any of the code we provide in `libcs50`.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,
- Add all required files using an appropriate `git add` command line.
- Commit all changes using an appropriate `git commit` command line.
- Tag: `git tag lab4submit`
- Push: `git push --tags` to ensure the tags are pushed to the remote. See more about tags.
Use your web browser to view your remote on GitHub and make sure there is a tag `lab4submit` and that the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged `lab4submit`, so it is safe for you to continue work (e.g., to start on Lab 5) even if you commit or push after tagging.
We will start grading when we first see a commit tagged `lab4submit`, and will judge it late if that tag was added after the deadline.
To avoid confusion, please email the professor and graduate TA if you want a late submission to be graded.
Your entire code base must compile with `make` from the top-level directory and produce no compiler warnings.
We will run `valgrind` on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Hints and tips
libcs50
We provide several modules in the `libcs50` directory, which compiles to a library `libcs50.a` you can link with your crawler.
You should not change any of our code.
You may drop in your `set.[ch]`, `counters.[ch]`, and `hashtable.[ch]` files from Lab 3.
If you’d prefer to use our versions of set, counters, and hashtable, that’s totally fine; the starter kit includes a pre-compiled library you can use instead; see `libcs50/README.md`.
Several of the modules in this directory are documented with a corresponding Markdown file; all of their interfaces are documented by their corresponding `.h` files.
Testing Crawler
It is your responsibility to test your crawler thoroughly. Consider following some of the testing plan discussed in Lecture.
Some safe seeds you might test include
- http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
- http://cs50tse.cs.dartmouth.edu/tse/wikipedia/index.html
- http://cs50tse.cs.dartmouth.edu/tse/toscrape/index.html
The first two are good at depth 0 and higher, but the third is likely too slow to run past maxDepth=2.
The output of our crawler can be found at this link. Keep in mind that our crawler may process URLs in a different order, so your directory may not be identical to ours.
You may pick other test sites, but your seed URL must begin with `http://cs50tse.cs.dartmouth.edu/`.
Use valgrind and gdb
We’ve provided information about gdb and valgrind; make use of them.
Memory allocation
You may find the `memory.c` module useful. Its use is optional, but it is ready to use as part of the `libcs50.a` library.
In Lab 3 we tried to recover gracefully from memory-allocation failures.
In the TSE application programs, we’ll be lazier: on a NULL return from malloc/calloc you can print an error and exit with a non-zero exit status.
(You may find the `assertp()` function, defined by the `memory` module, to be useful here.)
In such cases, it is ok if valgrind reports your program as having leaks.
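For example, here is a minimal sketch of that print-and-exit policy; the explicit check below does by hand what `assertp()` wraps up for you (see `memory.h` for its exact interface).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  // allocate a copy of a URL; give up immediately if malloc fails
  const char *url = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";
  char *copy = malloc(strlen(url) + 1);
  if (copy == NULL) {
    fprintf(stderr, "crawler: out of memory\n");
    exit(99);               // any non-zero exit status satisfies the policy above
  }
  // (assertp() in the memory module wraps this same check-print-exit pattern;
  //  see memory.h for its exact interface)
  strcpy(copy, url);
  printf("%s\n", copy);
  free(copy);
  return 0;
}
```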
Hashtable of URLs
Our design calls for use of a hashtable to store the URLs already seen.
Our hashtable implementation stores a `void* item` with each key. But if you’re just storing URLs (as keys), what item do you store?
Our hashtable disallows NULL items.
Suggestion: just pass a constant string as the item; even `""` will do.
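Here is a small sketch of that trick, assuming the Lab 3 hashtable interface (`hashtable_new`, `hashtable_insert`, `hashtable_delete`) and its refusal to insert a duplicate key:

```c
#include <stdio.h>
#include "hashtable.h"

int main(void)
{
  // track URLs we have already seen; only the key (the URL) matters,
  // so the item is a throwaway constant string
  hashtable_t *seen = hashtable_new(200);    // slot count is a tuning choice
  char *url = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";

  if (hashtable_insert(seen, url, "")) {
    printf("new URL: %s\n", url);            // insert succeeded: first sighting
  } else {
    printf("already saw: %s\n", url);        // insert refused: duplicate key (or error)
  }

  hashtable_delete(seen, NULL);              // items are string literals; nothing to free
  return 0;
}
```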
Pseudocode
Your implementation does not need to follow our pseudocode; this is one reason why you need to write about your implementation choices in `IMPLEMENTATION.md`.
Our pseudocode refers to a pagefetcher; note this is effectively `webpage_fetch()`.
Note that `webpage_fetch()` inserts a 1-second delay after every fetch attempt; your Crawler thus gets this behavior for free.
Our pseudocode refers to a pagescanner; yours should leverage `webpage_getNextURL()`.
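To illustrate how those two pieces fit together, here is a hedged sketch of fetching one page and scanning it for URLs; double-check `webpage.h` for the exact signatures and memory contract, and remember that a real crawler must also normalize and filter URLs and track depth per the Specs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "webpage.h"

int main(void)
{
  // webpage_new takes ownership of the URL string, so pass it a malloc'd copy
  const char *seed = "http://cs50tse.cs.dartmouth.edu/tse/letters/index.html";
  char *url = malloc(strlen(seed) + 1);
  if (url == NULL) {
    return 1;
  }
  strcpy(url, seed);

  webpage_t *page = webpage_new(url, 0, NULL);     // depth 0; HTML is fetched below
  if (page == NULL) {
    free(url);                                     // webpage_new did not take ownership
    return 1;
  }
  if (webpage_fetch(page)) {                       // the "pagefetcher"; pauses ~1 second per fetch
    int pos = 0;
    char *next;
    while ((next = webpage_getNextURL(page, &pos)) != NULL) {  // the "pagescanner"
      printf("found URL: %s\n", next);             // a real crawler would normalize, filter, and enqueue it
      free(next);                                  // returned URLs are malloc'd; caller frees them
    }
  }
  webpage_delete(page);                            // also frees the URL and HTML
  return 0;
}
```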
pageDirectory
The Crawler is supposed to determine whether the pathname provided as pageDirectory is indeed a directory, and is writable (meaning, you can create files there).
How?
The simplest method is to create a file there.
If you construct a string `filename` that is the directory name, followed by a slash `'/'`, followed by a file name like `.crawler`, you could call `fopen(filename, "w")`; if it returns NULL (failure), then either the directory does not exist or is not writable.
If it returns non-NULL (success), close the file.
It’s fine to leave `.crawler` in the page directory.
(Actually, the presence of this file will later help your indexer confirm the pageDirectory is indeed a Crawler-produced directory.)
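Here is a minimal sketch of that test, wrapped in a hypothetical helper (`pagedir_validate` is an illustrative name; fold the logic into whatever function your pagedir module defines):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical helper: return true if pageDirectory exists and is writable,
 * using the ".crawler" trick described above. */
static bool pagedir_validate(const char *pageDirectory)
{
  // build the string "pageDirectory/.crawler"
  char *filename = malloc(strlen(pageDirectory) + strlen("/.crawler") + 1);
  if (filename == NULL) {
    return false;
  }
  sprintf(filename, "%s/.crawler", pageDirectory);

  FILE *fp = fopen(filename, "w");   // NULL if the directory is missing or unwritable
  free(filename);
  if (fp == NULL) {
    return false;
  }
  fclose(fp);                        // success; leave the empty .crawler file behind
  return true;
}

int main(const int argc, char *argv[])
{
  const char *dir = (argc > 1) ? argv[1] : ".";
  printf("%s is %s\n", dir, pagedir_validate(dir) ? "usable" : "not usable");
  return 0;
}
```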
This sort of detail is one reason we recommend writing a separate pagedir module in the `common` sub-directory, because its functions will need to be shared by crawler, indexer, and querier, and the details about the structure and format of that directory should not be scattered across those three programs, but rather contained within one clean module.