Activity - Fetch and save a web page
In this activity, we investigate the webpage module provided by the Lab 4 starter kit. We call its APIs to fetch the web page at a given URL and save it to a file, where the first line of the file is the URL and the second line is the page's HTML.
Grab the starter code:
$ cp -r /thayerfs/courses/22fall/cosc050/workspace/activities/day17/webpage .
$ cd webpage/
fetchweb.c is the skeleton code we will work on. Our task is to fill in the to-dos and finish the program.
To compile the code, first generate the libcs50.a library under libcs50/, then use the library to generate the executable:
$ cd libcs50/
$ make
cp libcs50-given.a libcs50.a
$ cd ..
$ mygcc -o fetchweb fetchweb.c libcs50/libcs50.a
Your program should take the following command line arguments (see usage example below):
- arg1: seedURL
- arg2: file in which to save the HTML of the page at seedURL
Use the following pseudo code:
1. get a new pointer to a webpage_t struct using webpage_new() passing in:
- url from the command line
- depth = 0
- html = NULL (you will fill this in when you fetch the page next)
2. fetch the web page using webpage_fetch(), passing the new pointer to the webpage_t struct from the previous step
- Note: webpage_fetch fills a webpage_t struct with the html from the page
3. if the fetch succeeded, save the page using webpage_save(), which you write, passing in:
- the newly filled webpage_t struct
- the filename in which to store the results (arg 2 from the command line)
The webpage_save() function should open a file named by its second parameter, then write the page's URL on the first line and its HTML on the second line
4. print an error to stderr if the fetch or save fails
5. free the webpage_t struct with webpage_delete()
Note: the webpage_XXX functions, except webpage_save() (which you write in fetchweb.c), are declared in libcs50/webpage.h and are described on the webpage module page.
Usage
An example run of the program (make sure to use our test Internet at http://cs50tse.cs.dartmouth.edu/tse/):
$ ./fetchweb http://cs50tse.cs.dartmouth.edu/tse/letters/index.html letter
This command fetches the letters index page and saves its URL and HTML to a file named letter.
Note: if you are using VS Code, you may need to refresh to see the newly created file (letter in the example above).
Work in groups to finish the program. Use gdb and valgrind for debugging if necessary.