In this class we focus on the design of reusable modules for common data structures.

Goals

  • Understanding the importance of reusable modules
  • Using header files to enable modules
  • Reiterating struct and typedef
  • General-purpose data structures with void*
  • Function pointers.

For a look at the C preprocessor, see another lecture extra.

For more about C header files, see another lecture extra.

Activity

In today’s class activity, we add a counter to track the number of items in the bag.

Conditional compilation

Recall the process of compiling a C program.

The preprocessor understands various directives, all beginning with #. We’ve already seen one common directive, #include. Another common use is for conditional compilation, that is, to include (or exclude) a section of code based on a condition - most commonly, whether or not a given preprocessor symbol is defined.

One use of preprocessor conditional compilation is to support debugging. Consider this very common example:

#define DEBUG
...
#ifdef DEBUG
 ... code for use when DEBUG is defined
#else // DEBUG
 ... code for use when DEBUG is not defined
#endif // DEBUG

The C preprocessor literally includes some code, and strips out other code, before passing it to the C compiler. In the above example, we define the preprocessor constant called DEBUG, and later test its value with #ifdef. Notice the use of comments on the #else and #endif lines to help readability.

The above example is a common approach for debugging. Even better, remove the #define DEBUG line and let that be determined by the programmer at compile time:

#ifdef DEBUG
 ... code for use when DEBUG is defined
#else // DEBUG
 ... code for use when DEBUG is not defined
#endif // DEBUG
  mygcc -DDEBUG program.c -o program

The program compiles (and behaves) differently with and without the compile-time switch -DDEBUG. This particular approach is very common and vastly better than editing the code to add/remove debugging code, or to comment in/out the debugging code; just flip a switch on the commandline!

MEMTEST

In CS50 you’ll see a specific example of conditional compilation in support of debugging. It is in the bag module, part of the starter kit for Labs 3-6. For example,

bag_insert(bag_t* bag, void* item)
{
...

#ifdef MEMTEST
  mem_report(stdout, "After bag_insert");
#endif
}

When -DMEMTEST is added to the gcc command-line, to define the preprocessor symbol MEMTEST, the call to mem_report() becomes part of the compiled code; otherwise it does not. The source file never changes - but the person compiling the source file can ‘flip a switch’ to enable this debugging code when needed.

Header files

The C language does not, itself, have an ‘import’ or ‘include’ mechanism; instead, the C preprocessor supports an #include directive to wholly include one file within another. Although the C preprocessor has many powerful capabilities, we will use it only for very specific purposes. Most of the other uses for preprocessor directives are either beyond the scope of this course, or unnecessary (and bad style) in modern C.

Header file inclusion - how?

The #include directive, often pronounced “pound include”, should appear at the beginning of a C program. It is used to literally copy the entire contents of another file at the point of the #include directive. A common #include directive, seen at the beginning of most C files, is

#include <stdio.h>

This directive indicates that the contents of the file named stdio.h should be included at this point (the directive is replaced with the contents). There is no limit to the number of lines that may be included with this directive and, in fact, the contents of the included file may contain other #include directives which are handled in the same way. We say that the inclusions are nested and, of course, care should be taken to avoid duplicate or, even worse, recursive nestings!

The example using <stdio.h>, above, demonstrates two important points. The filename itself appears between the characters < >. The use of these characters indicates that the enclosed filename should be found in the standard include directory /usr/include. The required file is then /usr/include/stdio.h.

The standard include files are used to consistently provide system-wide data structures or declarations that are required in many different files. By having the standard include files centrally located and globally available, all C programmers are guaranteed of using the same data structures and declarations that they (all) require.

Alternatively, the " " characters may also be used, as in the following example:

#include "readlinep.h"

to include the contents of the file readline.h at the correct point in the C program. Because double-quotes are used, the file is sought in the present working directory, that is ./readline.h. The filename in double-quotes can actually be a pathname, if your source code is spread over several directories; for example,

#include "includes/readlinep.h"

In both of the above examples the indicated filename had the .h extension. Whereas we have previously said that the extension of .c is expected by the C compiler, the use of .h is only a convention within UNIX. The .h indicates that the file is a header file, because they generally contain information required at the head (beginning) of a C program. Header files typically (and should) contain only declarations of C constructs, like data types, data structures, and constants used throughout the C program. In CS50, include files should contain only declarations, and no definitions, of variables or functions.

You can also give the compiler a command-line flag to identify a directory where it can search for non-standard include files. If, for example, you moved your includes directory up a level, you might write

#include "readlinep.h"

but then tell the compiler where to look for such files:

mygcc -I ../includes    -c readline.c

Header file inclusion - why?

Ok, all that’s great, but why did we need this header file, readlinep.h, in the first place? Because the compiler processes names.c and readlinep.c separately; names.c invokes the function, but readlinep.c defines the function. When it compiles names.c it needs to see the prototype (declaration) for the function readlinep, just as we would have if readlinep() had been defined at the bottom of names.c. When compiling names.c into names.o the C compiler leaves the readlinep symbol unresolved, and hopes the linker will later be able to connect it to the function’s actual definition when the linker links names.o with other object file.

So: we write a prototype (declaration) for readlinep() into readlinep.h. That file is #included in at least two places: first, it is included in readlinep.c, allowing the compiler to compare the declaration (prototype) with the definition (implementation), and warn us if they somehow differ. Second, it is included in names.c, allowing the compiler to compare the declaration (prototype) with the invocation (function call), and warn us if they somehow differ.

In CS50, every source file that does not include main() will need a corresponding header file, a header file that must be included both by that source file and by any other source file needing to interface with that source file.

Header file protection

As a matter of style, and sometimes necessity, we use these same directives in every header file. Take another look at readline.h; all of its content is wrapped in an #ifdef:

#ifndef __READLINEP_H__
#define __READLINEP_H__
...
#endif //  __READLINEP_H__

If the constant __READLINEP_H__ is not yet defined (#ifndef) then include the remainder of the file up to #endif. The first thing it does is to define that __READLINEP_H__. Thus, if a C file happens to include readlinep.h twice, which can happen in complex programs when header files include other include files, the C compiler won’t see multiple copies of the included code. That is,

#include "readlinep.h"
#include "readlinep.h"

does not translate to multiple declarations

extern char *freadlinep(FILE *fp);
static inline char *readlinep(void) { return freadlinep(stdin); }
extern char *freadlinep(FILE *fp);
static inline char *readlinep(void) { return freadlinep(stdin); }

but just one time to

extern char *freadlinep(FILE *fp);
static inline char *readlinep(void) { return freadlinep(stdin); }

The standard file stdbool protects itself this way, which is good, because a programmer may well write into her .c file something like

#include <stdio.h>
#include <stdbool.h>
#include <readlinep.h>

and, because readlinep.h also includes stdbool.h, the latter might get imported several times without that protection.

Why do we care? Some include files can safely be repeated and the C compiler won’t care; actually, readlinep.h is one like that. But others, e.g., if they declare global constants, will cause the compiler to complain about multiply-defined variables.

Testing header files

How can you be sure your header file has included all the relevant header files? Compile it! To compile the header file foo.h by itself,

mygcc -c foo.h

If there are any errors, it’s probably because you forgot one of the include files.

When done, rm foo.gch to clean up the compiler’s output.

Modules

In CS10 we learned that some ADTs are so common that it is valuable to code them once - lists, queues, stacks, trees, and hash tables - and then re-use that code for multiple programs (or multiple purposes within a program). Code re-use saves time because you don’t need to write everything from scratch. By using robust, well-tested modules rather than fresh (buggy) code, your program is more reliable. Finally, by separating ‘business logic’ from ‘data structures’, the code is clearer and more flexible (e.g., if you later want to switch to a more-efficient data structure you can do so without rewriting all the business logic).

Object-oriented languages make this easier, because they make it simple to define a ‘class’ of objects and then to create new ‘instances’ as needed. Many such languages such as Java go further by providing ‘generic’ and ‘subclasses’ as a way to derive new variants of the base class.

C has none of these capabilities. But we can approximate some of these concepts through good style and careful design.

Pointers to anything

In developing a general-purpose data-structure module, we would really like the module to be able to store arbitrary “things” – not just predetermined types – but anything. For example, instead of a linked-list of strings (as we built in the names program) we’d like a linked-list that could store pointers to anything. Today we want a list of string things; tomorrow we might want a list of struct student things.

The solution is for the module to store pointers to things, rather than the things themselves. The user’s contract with the module is thus “I allocate and initialize a thing, and give you a pointer to remember; later, when I need that thing, I’ll ask and you give me back that pointer. When I ask you to print things, or delete things, I’ll give you customized functions that know how to print and delete things of this type.”

Java and other object-oriented languages do this with generics. C has no support for generics, but it does have a type for “pointer to anything”: void*. Consider:

char *p = "hello";     // pointer to a char, in this case, a string
int x = 42, *xp = &x;  // pointer to an int
struct student * sp;   // pointer to a struct
sp = malloc(sizeof(struct student)); // initialize the pointer
sp->name = "Tim";  // initialize the struct at that pointer
sp->house = "South";  // ...initialize
sp->class = 2018;    // ...initialize

void *anything;       // a pointer to any type
anything = p;         // here, a pointer to a char on the stack
anything = &x;        // here, a pointer to an int on the stack
anything = sp;        // here, a pointer to a struct in the heap

Thus, our modules will accept and store void* pointers, and return them to the caller when asked.

Pointers to functions

As noted above, the module may need help when it needs to print, compare, or delete “things”. The module’s user must provide the module with special helper functions. In an object-oriented language, like Java, “things” are objects and objects know how to print themselves (via toString), compare themselves (via compare), or delete themselves (via a user provided delete method). C has no such support.

Thus, the caller may pass a function to the module, when needed. We can’t actually pass the function - we have to pass the module a pointer to the function.

The concept of a pointer to a function can be confusing. Like any other pointer, it is an address in memory. Recall that the compiled code of the program lives in the code (aka text) segment in memory so every function resides at some address. A function pointer is simply the function’s address in memory.

We can refer to the function’s address simply by naming the function, without the syntax of calling a function. That is, foo is a function pointer, whereas foo(a, b, c) calls that function and passes arguments a, b, and c. In our pointer2.c example, we passed the address of functions main and change to printf so it could print those addresses for us to examine.

If I have a function called myprint, like this:

void myprint(FILE *fp, char *key, void *item)
{
    int *valuep = item; // in this case, the "things" are integers
    fprintf(fp, "(%s, %d)", key, *valuep); // note *valuep derefs pointer
}

If I have a function pointer printfunc, I can initialize the function pointer and call the function through the pointer as follows:

printfunc = myprint;  //store address of myprint function in printfunc
(*printfunc)(fp, key, thing); //deference function address to call myprint function

In other words, I dereference the pointer to get a function, and then call that function. Notice the * and parens, (*printfunc).

How would I declare that variable printfunc? pretty gnarly syntax:

void (*printfunc)(FILE *fp, char *key, void *item);

declares variable printfunc to be a pointer to a function whose prototype looks like that of myprint above, that is, it takes the given three arguments and returns void. Indeed, it looks almost like a common function prototype, except for that (*variablename) notation. For this pointer to be useful, it must be initialized to the address of a function.

This approach is extremely powerful as it lets us define a function (print in this example), for each data type. If we need the same functionality later, but with a different data type, we can pass a pointer to a function for that data type – very similar to generics in Java.

typedefs

C allows us to give a new name for a type; we most often see this for structure types. In CS50 our naming convention appends _t to the name of a type. Thus, for example,

struct student {
    char *name;
    char *house;
    int class;
};
typedef struct student student_t;
student_t st;

Here we define a type called student_t, identical to the type called struct student, and a variable (of that type) called st. It is so common to define a struct type and immediately give it a typedef alias that one often sees it written in shorter form:

typedef struct {
    char *name;
    char *house;
    int class;
} student_t;
student_t st;

Here the struct type itself is not even named, but the typedef is formed and can be used to define variables of that type.

Example module - bags

In class we will look at the bag data structure and use it to make a bag version of our names series. A bag is an unordered collection of (items). The bag starts empty, grows as the caller adds one item at a time, and shrinks as the caller extracts one item at a time. It could be empty, or could contain hundreds of items. Items are indistinguishable, so the extract function is free to return any item from the bag.

We start from names5.c and modify it with the goal of coding functions that manipulate a generic “bag of things”.

The first example demonstrates

  • structure types (like struct bag_t and bagnode_t).
  • typedefs to create new type names.
  • Heap memory (created via malloc in bag_new).
  • The use of pointers to build and manipulate a bag data structure.
  • The use of void* to represent “a pointer to anything” and its use to build a bag of generic things.
  • A reminder that strings - even constant strings, in double quotes - are stored in memory as arrays of characters and referenced by a char* pointer to their first character; thus even a constant string has an address and is passed as a pointer.

We extend the example to split the bag functions out as to a separate set of files, demonstrating

  • a module, which is as close as we get to a class in C.
  • A set of functions exported via bag.h to other C files.
  • Public types (like struct bag_t) - and opaque types.
  • Private types (like struct bagnode_t).
  • Private functions (like bagnode_new).
  • No need for global variables. (We always try to avoid them!)

The result, after some cleanup:

A complete bag module is included in the Lab 3 starter kit, which you will see once you accept the lab3 assignment invite.

Another example - binary trees

For a more complex example, demonstrating the use of binary trees as a “dictionary” data structure – that is, one that stores (key,item) pairs – study the binary-tree module.

Other examples

In the table below, we compare data structures with the two we explored in class. Most of these data structures allow you to store a collection of “items”. Both the set and hashtable are examples of an abstract data structure called a dictionary, which provide methods like insert(key, item) and item = retrieve(key); the key allows the structure to distinguish among the items it stores.

Behavior linked list bag set counters hashtable
stores an item yes yes yes no yes
uses a key no no yes yes yes
keeps items in order yes no no no no
retrieval first item any item by key by key by key
insertion of duplicates allowed allowed error increment count error

Notice that

  • a linked list keeps items in order, but a bag or a set does not.
  • a set and hashtable allow you to retrieve a specific item (indicated by its key) whereas a bag might return any item
  • because the bag and list don’t distinguish among items they store, they can hold duplicates; the others cannot
  • the counters data structure maintains a set of counters, each identified by a key, but it stores no items. Instead, it keeps a counter for each key; inserting a duplicate key increments the counter.