Lab 5 - Indexer
Due Monday, May 9, at 10pm
In this lab you’ll continue the Tiny Search Engine (TSE) by coding the Indexer and a program for testing a unit that saves and loads index files, according to the Requirements Spec in the starter kit.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail all/some of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL, and take some appropriate (safe) action. Write solid unit tests, test drivers, and use regression testing as your development proceeds.
If valgrind reports memory errors or memory leaks, when indexer exits normally, you will lose points.
Preparation
-
Start with the same repository you used for Lab 4. Before you begin, make sure you submitted Lab 4 correctly and pushed the tag to your GitHub remote.
-
Check to ensure your local repo is clean with
make clean
and everything has been committed and pushed according togit status
. -
Do not proceed if you have uncommitted changes or unpushed commits. Seek help if you need to sort out your repo or GitHub.
Assignment
Design and code the second sub-system of the Tiny Search Engine, the indexer. Your design and implementation must follow the Indexer Requirements Spec (aka “the Specs”), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with make
and removed by make clean
.
Commit and push at least once a day.
Now is a good time to read Section 4 in Searching the Web, the paper about search engines.
In your tse-xxxx
directory,
-
Review the
README.md
file; add any top-level comments you need to communicate to the grader. -
Review
Makefile
; you may need to uncomment the lines forindexer
.
In the indexer
subdirectory,
-
Write a file
DESIGN.md
to provide the Design Spec for indexer (not index tester), drawing inspiration from the lecture notes. -
Write a file
IMPLEMENTATION.md
to provide the implementation spec and testing plan for indexer (not index tester). See lecture notes about what should be in an implementation spec. -
Write a
README.md
file to provide any special information about how to compile or test your indexer, or any assumptions you made while writing the indexer. -
Write a program
indexer.c
according to the Specs and your Design. -
Write a program
indextest.c
according to the Specs. -
Both programs shall leverage common code from
common.a
in the../common
directory:- a module
index.c
that contains all the logic for saving and loading index files (this module is common between the Indexer, Querier, and indextest); - a module
pagedir.c
that contains all the logic for saving pages to a crawler output directory, for loading pages from a crawler output directory (this module is common between the Crawler, Indexer, and Querier); - a module
word.c
that implementsNormalizeWord
(this module is common between the Indexer and Querier); - a
Makefile
that buildscommon.a
for use by the programs; - a short
README.md
file explaining how to buildcommon.a
; - a
.gitignore
file telling Git to ignore the binary files (object files) in this directory.
- a module
- Write a
Makefile
that will, by default, build theindexer
andindextest
executable programs. - Add a
make clean
target that removes files produced by Make or your tests. - Add a
make test
target that tests your indexer. - Write a
testing.sh
bash script that can be run bymake test
. This script must include good comments describing your tests. For best results,make test
should runbash -v testing.sh
. -
Save the output of your tests with
make test &> testing.out
. - Write a
.gitignore
file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
Submission - what
Your submitted repo should now include the following files, plus any others you created for testing:
├── .gitignore
├── Makefile
├── README.md
├── common # added by you (Lab4)
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── index.c # added by you (Lab5)
│ ├── index.h # added by you (Lab5)
│ ├── pagedir.c
│ ├── pagedir.h
│ ├── word.c # added by you (Lab5)
│ └── word.h # added by you (Lab5)
├── crawler # added by you (Lab4) except REQUIREMENTS, DESIGN
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── testing.sh
│ ├── testing.out
│ └── crawler.c
├── indexer # added by you (Lab5) except REQUIREMENTS
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── indexer.c
│ ├── indextest.c
│ ├── testing.sh
│ └── testing.out
├── libcs50
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c # if you decided to add your Lab3 solution
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c # if you decided to add your Lab3 solution
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── set.c # if you decided to add your Lab3 solution
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
You shall NOT modify any of the code we provide in libcs50
.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,
- Add all required files using an appropriate
git add
commandline. - Commit all changes using an appropriate
git commit
commandline. - Tag:
git tag lab5submit
- Push:
git push --tags
to ensure the tags are pushed to the remote. See more about tags.
Use your web browser to view your remote on GitHub and make sure there is a tag lab5submit
and the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged lab5submit
so it is safe for you to continue work (e.g., so you can start work on Lab 6), even if you commit or push.
We will start grading when we first see one with tag lab5submit
, and judge it late if that tag was added after the deadline.
To avoid confusion, please email the lecturer and the graduate TA if you want a late submission to be graded.
Your entire codebase must compile with make
from the top-level directory and produce no compiler warnings.
We will run valgrind
on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Hints and tips
Many of the Lab4 hints are still relevant.
Review the lecture notes
There are some design tips in the lecture notes. The reading is great too!
Testing
If your crawler never quite worked, never fear! You do not need a working crawler to write or test your indexer. Try your indexer on our crawler’s output, which we provided earlier.
Hashtable
“How big should the hashtable be?”
In the indexer/querier, we use a hashtable to store an inverted index (words –> documents), and thus the hashtable is keyed by words. The answer to the above question, then, depends on how many words will be in the index.
When building the inverted index, it’s impossible to know in advance how many words you will find in all those pages encountered by the crawler. Pick a reasonable size, perhaps something in the range of 200..900 slots.
When loading an inverted index from a file, though, you can know how many words… because there is one word per line in the index file, and it’s easy to count the number of lines (see file.c
).
Once your code can obtain the number of words, think about how it can compute a good size for your hashtable.
pageDirectory
Recall that the Crawler fills a directory with saved pages; in Lab 4 we recommended that you also create and leave a file .crawler
in the page directory.
If you followed that advice, you now have a great opportunity!
You might include a function isCrawlerDirectory(char *dir)
in your pagedir
module, which can easily verify whether dir
is indeed a Crawler-produced directory.
If it can open a file dir/.crawler
for reading, then dir
is a Crawler-produced directory; if not, either the dir
is not a valid pathname, not a directory, not a readable directory, it’s not a Crawler produced directory, or some other error - all easily caught by this simple test.
Index files
To read each line of the index file, it works well to read the word, then loop calling fscanf
with format "%d %d "
to pull off one pair at a time, checking the return value of fscanf
.
To write the index file, use the _iterate
methods built into your data structures.
(Indeed you may need to have iterators call iterators!)
Do not use your _print
methods for this purpose; those are meant for producing pretty human-readable output for debugging purposes.
The functions found in file.c
should be helpful.
Makefile
Your indexer/Makefile
needs to build both indexer
and indextest
.
It may be most convenient to add a phony top-most rule:
all: indexer indextest
so that make
or make all
will build both programs; if you want to build just one, run make indexer
or make indextest
.
You may also find it useful to add phony rules so you can make test
or make valgrind
.