Webpage module
This module defines the opaque webpage_t type and a suite of functions for creating and manipulating “web pages”. A “web page” is really a struct holding:
- the URL of this web page, in canonical form
- the depth at which the crawler found this page
- the HTML for the page - may be NULL.
Sometimes you just want to keep track of web pages without HTML, perhaps because you have not yet fetched that HTML; in that case, the webpage object has a null HTML pointer.
Other times, you have fetched the HTML and want to work with it; then the webpage object has a non-null HTML pointer.
For a complete and up-to-date description of the interfaces, read webpage.h.
webpage_t
A webpage_t object is a struct and exists in one of two states:
- initialized, but with no HTML;
- initialized, and with HTML fetched from the site.
Initially, on return from webpage_new() with no HTML supplied, the struct is in the first state. After a successful call to webpage_fetch(), the struct has HTML content and is in the second state.
Definition
typedef struct webpage {
  char* url;        // url of the page
  char* html;       // html code of the page
  size_t html_len;  // length of html code
  int depth;        // depth of crawl
} webpage_t;
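The two states described above can be told apart by checking the html pointer. The sketch below repeats the struct definition so that it compiles on its own; page_has_html is a hypothetical helper used only for illustration, not a function of this module.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same layout as the definition above, repeated so this sketch is self-contained. */
typedef struct webpage {
  char* url;        // url of the page
  char* html;       // html code of the page
  size_t html_len;  // length of html code
  int depth;        // depth of crawl
} webpage_t;

/* Hypothetical helper (not part of the module):
 * true if the page is in the second state (HTML present),
 * false if it is in the first state or the page pointer is NULL. */
bool page_has_html(const webpage_t *page)
{
  return page != NULL && page->html != NULL;
}
```

Because the type is opaque to clients, code outside the module would more likely make the same check through webpage_getHTML() than by reading the struct directly.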
webpage_new
Creates a new webpage object, given the URL for the web page, the depth at which the crawler encountered this URL, and (optionally) the HTML for this page.
webpage_t *webpage_new(char *url, const int depth, char *html);
webpage_delete
Deletes a webpage object and frees its memory.
It takes a void* to make it easy to call this from a generic data structure like bag_delete().
void webpage_delete(void *data);
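To illustrate why a void* parameter is convenient, here is a minimal sketch of a generic container teardown that accepts a delete callback. Everything in it is a stand-in invented for this example: page_t, page_new, page_delete, items_delete, and the pages_freed counter are hypothetical, and the sketch assumes (without confirming it from webpage.h) that deleting a page frees its URL and HTML strings.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for webpage_t, for illustration only. */
typedef struct page {
  char *url;
  char *html;
  int depth;
} page_t;

int pages_freed = 0;  // demonstration counter, not something the module has

/* Hypothetical constructor: copies the URL so the page owns its memory. */
page_t *page_new(const char *url, int depth)
{
  if (url == NULL) {
    return NULL;
  }
  page_t *page = malloc(sizeof(page_t));
  if (page == NULL) {
    return NULL;
  }
  page->url = malloc(strlen(url) + 1);
  if (page->url == NULL) {
    free(page);
    return NULL;
  }
  strcpy(page->url, url);
  page->html = NULL;
  page->depth = depth;
  return page;
}

/* Same shape as webpage_delete: takes void* so that it matches the
 * callback type a generic container expects, with no cast at the call site. */
void page_delete(void *data)
{
  page_t *page = data;
  if (page != NULL) {
    free(page->url);
    free(page->html);  // free(NULL) is a harmless no-op
    free(page);
    pages_freed++;
  }
}

/* Hypothetical generic teardown: knows nothing about the item type. */
void items_delete(void **items, size_t n, void (*itemdelete)(void *item))
{
  for (size_t i = 0; i < n; i++) {
    if (itemdelete != NULL && items[i] != NULL) {
      itemdelete(items[i]);
    }
    items[i] = NULL;
  }
}
```

With this shape, a call such as items_delete(items, n, page_delete) compiles without a function-pointer cast, which is presumably the convenience the void* parameter of webpage_delete is meant to provide.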
webpage_fetch
Downloads the HTML for the page at the given URL and saves it in the html portion of the webpage struct.
bool webpage_fetch(webpage_t *page);
webpage_getNextWord
Starts (or continues) a scan of the HTML from index pos for the given page, returning the next word in the page.
char *webpage_getNextWord(webpage_t *page, int *pos);
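The "start or continue" pattern can be sketched with a stand-in scanner that works on a plain string. The functions next_word and count_words below are hypothetical and only mimic the calling convention; the rules they use (skip anything between '<' and '>', treat a run of letters as a word, return a freshly allocated string, return NULL at the end) are assumptions for illustration and may differ from what webpage_getNextWord actually does.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for webpage_getNextWord, operating on a string.
 * Scans from *pos, returns the next word as a malloc'd string that the
 * caller must free, and advances *pos past it. Returns NULL when no
 * words remain or on bad arguments. */
char *next_word(const char *html, int *pos)
{
  if (html == NULL || pos == NULL || *pos < 0) {
    return NULL;
  }
  if ((size_t)*pos > strlen(html)) {
    return NULL;
  }

  int i = *pos;
  // skip tags and any other non-letter characters
  while (html[i] != '\0' && !isalpha((unsigned char)html[i])) {
    if (html[i] == '<') {
      // advance to the closing '>' (or the end of the string)
      while (html[i] != '\0' && html[i] != '>') {
        i++;
      }
    }
    if (html[i] != '\0') {
      i++;
    }
  }

  if (html[i] == '\0') {
    *pos = i;  // nothing left to scan
    return NULL;
  }

  // collect the run of letters that starts here
  int start = i;
  while (isalpha((unsigned char)html[i])) {
    i++;
  }
  size_t len = (size_t)(i - start);
  char *word = malloc(len + 1);
  if (word == NULL) {
    return NULL;
  }
  memcpy(word, html + start, len);
  word[len] = '\0';
  *pos = i;  // the next call continues from here
  return word;
}

/* The typical loop: start at pos = 0 and call until NULL comes back. */
int count_words(const char *html)
{
  int pos = 0;
  int count = 0;
  char *word;
  while ((word = next_word(html, &pos)) != NULL) {
    count++;
    free(word);
  }
  return count;
}
```

A loop over webpage_getNextWord(page, &pos) would have the same shape, with pos initialized to 0 before the first call. Whether the caller must free each returned word is something to confirm in webpage.h.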
webpage_getNextURL
Starts (or continues) a scan of the HTML from index pos for the given page, returning the next URL in the page.
char *webpage_getNextURL(webpage_t *page, int *pos);
normalizeURL
Normalizes a URL to canonical form.
bool normalizeURL(char *url);
isInternalURL
Determines whether a given URL is “internal” to the CS50 playground (e.g., on plank).
bool isInternalURL(char *url);
getter methods
If you must access the contents of the struct, you can request them via one of three getter methods:
int webpage_getDepth(const webpage_t *page);
char *webpage_getURL(const webpage_t *page);
char *webpage_getHTML(const webpage_t *page);