Lab 5 - Indexer

Due Monday, May 9, at 10pm

In this lab you’ll continue the Tiny Search Engine (TSE) by coding the Indexer and a program for testing a unit that saves and loads index files, according to the Requirements Spec in the starter kit.

Preparation
Assignment
- Submission - what
- Submission - how
Hints and tips

Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.

Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.

If your submitted code fails to compile, or triggers a segmentation fault, you will fail all/some of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL, and take some appropriate (safe) action. Write solid unit tests, test drivers, and use regression testing as your development proceeds.

If valgrind reports memory errors or memory leaks, when indexer exits normally, you will lose points.

Preparation

Start with the same repository you used for Lab 4. Before you begin, make sure you submitted Lab 4 correctly and pushed the tag to your GitHub remote.
Check to ensure your local repo is clean with make clean and everything has been committed and pushed according to git status.
Do not proceed if you have uncommitted changes or unpushed commits. Seek help if you need to sort out your repo or GitHub.

Assignment

Design and code the second sub-system of the Tiny Search Engine, the indexer. Your design and implementation must follow the Indexer Requirements Spec (aka “the Specs”), and make good use of our abstract data structures.

As you create files and directories, add them to your git repository. Important: do not add any compiled files to git. In short: your repo should not contain any data files, editor backup files, core dumps, or files built with make and removed by make clean.

Commit and push at least once a day.

Now is a good time to read Section 4 in Searching the Web, the paper about search engines.

In your tse-xxxx directory,

Review the README.md file; add any top-level comments you need to communicate to the grader.
Review Makefile; you may need to uncomment the lines for indexer.

In the indexer subdirectory,

Write a file DESIGN.md to provide the Design Spec for indexer (not index tester), drawing inspiration from the lecture notes.
Write a file IMPLEMENTATION.md to provide the implementation spec and testing plan for indexer (not index tester). See lecture notes about what should be in an implementation spec.
Write a README.md file to provide any special information about how to compile or test your indexer, or any assumptions you made while writing the indexer.
Write a program indexer.c according to the Specs and your Design.
Write a program indextest.c according to the Specs.
Both programs shall leverage common code from common.a in the ../common directory:
- a module index.c that contains all the logic for saving and loading index files (this module is common between the Indexer, Querier, and indextest);
- a module pagedir.c that contains all the logic for saving pages to a crawler output directory, for loading pages from a crawler output directory (this module is common between the Crawler, Indexer, and Querier);
- a module word.c that implements NormalizeWord (this module is common between the Indexer and Querier);
- a Makefile that builds common.a for use by the programs;
- a short README.md file explaining how to build common.a;
- a .gitignore file telling Git to ignore the binary files (object files) in this directory.
Write a Makefile that will, by default, build the indexer and indextest executable programs.
Add a make clean target that removes files produced by Make or your tests.
Add a make test target that tests your indexer.
Write a testing.sh bash script that can be run by make test. This script must include good comments describing your tests. For best results, make test should run bash -v testing.sh.
Save the output of your tests with make test &> testing.out.
Write a .gitignore file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.

Submission - what

Your submitted repo should now include the following files, plus any others you created for testing:

├── .gitignore
├── Makefile
├── README.md
├── common			# added by you (Lab4)
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── index.c		# added by you (Lab5)
│   ├── index.h		# added by you (Lab5)
│   ├── pagedir.c
│   ├── pagedir.h
│   ├── word.c		# added by you (Lab5)
│   └── word.h		# added by you (Lab5)
├── crawler			# added by you (Lab4) except REQUIREMENTS, DESIGN
│   ├── .gitignore
│   ├── REQUIREMENTS.md
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── testing.sh
│   ├── testing.out
│   └── crawler.c
├── indexer				# added by you (Lab5) except REQUIREMENTS
│   ├── .gitignore
│   ├── REQUIREMENTS.md	
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── indexer.c
│   ├── indextest.c
│   ├── testing.sh
│   └── testing.out
├── libcs50
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── bag.c
│   ├── bag.h
│   ├── counters.c	# if you decided to add your Lab3 solution
│   ├── counters.h
│   ├── file.c
│   ├── file.h
│   ├── file.md
│   ├── hashtable.c	# if you decided to add your Lab3 solution
│   ├── hashtable.h
│   ├── jhash.c
│   ├── jhash.h
│   ├── libcs50-given.a
│   ├── memory.c
│   ├── memory.h
│   ├── memory.md
│   ├── set.c		# if you decided to add your Lab3 solution
│   ├── set.h
│   ├── webpage.c
│   ├── webpage.h
│   └── webpage.md

You shall NOT modify any of the code we provide in libcs50.

You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.

Submission - how

When you are ready for final submission,

Add all required files using an appropriate git add commandline.
Commit all changes using an appropriate git commit commandline.
Tag: git tag lab5submit
Push: git push --tags to ensure the tags are pushed to the remote. See more about tags.

Use your web browser to view your remote on GitHub and make sure there is a tag lab5submit and the files associated with that tagged commit include everything we’ll need to grade your lab.

We grade the version tagged lab5submit so it is safe for you to continue work (e.g., so you can start work on Lab 6), even if you commit or push.

We will start grading when we first see one with tag lab5submit, and judge it late if that tag was added after the deadline. To avoid confusion, please email the lecturer and the graduate TA if you want a late submission to be graded.

Your entire codebase must compile with make from the top-level directory and produce no compiler warnings.

We will run valgrind on your program to look for memory leaks; you will lose points for memory leaks or memory errors.

Hints and tips

Many of the Lab4 hints are still relevant.

Review the lecture notes

There are some design tips in the lecture notes. The reading is great too!

Testing

If your crawler never quite worked, never fear! You do not need a working crawler to write or test your indexer. Try your indexer on our crawler’s output, which we provided earlier.

Hashtable

“How big should the hashtable be?”

In the indexer/querier, we use a hashtable to store an inverted index (words –> documents), and thus the hashtable is keyed by words. The answer to the above question, then, depends on how many words will be in the index.

When building the inverted index, it’s impossible to know in advance how many words you will find in all those pages encountered by the crawler. Pick a reasonable size, perhaps something in the range of 200..900 slots.

When loading an inverted index from a file, though, you can know how many words… because there is one word per line in the index file, and it’s easy to count the number of lines (see file.c). Once your code can obtain the number of words, think about how it can compute a good size for your hashtable.

pageDirectory

Recall that the Crawler fills a directory with saved pages; in Lab 4 we recommended that you also create and leave a file .crawler in the page directory. If you followed that advice, you now have a great opportunity! You might include a function isCrawlerDirectory(char *dir) in your pagedir module, which can easily verify whether dir is indeed a Crawler-produced directory. If it can open a file dir/.crawler for reading, then dir is a Crawler-produced directory; if not, either the dir is not a valid pathname, not a directory, not a readable directory, it’s not a Crawler produced directory, or some other error - all easily caught by this simple test.

Index files

To read each line of the index file, it works well to read the word, then loop calling fscanf with format "%d %d " to pull off one pair at a time, checking the return value of fscanf.

To write the index file, use the _iterate methods built into your data structures. (Indeed you may need to have iterators call iterators!) Do not use your _print methods for this purpose; those are meant for producing pretty human-readable output for debugging purposes.

The functions found in file.c should be helpful.

Makefile

Your indexer/Makefile needs to build both indexer and indextest. It may be most convenient to add a phony top-most rule:

all: indexer indextest

so that make or make all will build both programs; if you want to build just one, run make indexer or make indextest.

You may also find it useful to add phony rules so you can make test or make valgrind.