CS 50 | Software Design and Implementation

In this lecture, we continue our accelerated introduction to the C programming language, focusing on file processing.

Goals

We learn several things today:

opening and reading from files
compiling code from multiple files into one program

You may also want to check three examples in lecture extra, which are built upon the things we learn and provide a useful readline() function. This is the same extra code from last class.

Activity

In the activity we will parse command-line arguments and do simple file operations.

Files

Unlike Java and Python, the C language itself does not define any syntax for input or output. Instead, the C library provides a standard set of functions to perform file-based input and output. The standard I/O library (aka stdio) functions provide efficient, buffered I/O to and from both terminals and files.

Every C file using functions in the standard I/O library must include this line near the top of the source file:

#include <stdio.h>

All transactions through the standard I/O functions require a file pointer:

  FILE* fp;

  fp = fopen("file.txt", "r");
  ...
  fclose(fp);

Although a ‘file pointer’ is, strictly speaking, a C pointer, we don’t care much about what it is - we simply pass this pointer to functions in the standard C library.

Some texts refer to this pointer as a file stream (a term that C++ muddied further), but it should not be confused with, or described as akin to, Java’s “streams”.

The stdio library predefines three file pointers: stdin, stdout, and stderr, which are already opened for you (by the shell or other program that executed your C program) and which you can read, write, or manipulate using stdio functions.

Several functions are provided to check the status of file operations on a given file pointer, including:

  feof(fp)      /* is the file (fp) at the 'end of file'? */
  ferror(fp)    /* was there an error in the last action on file (fp)? */ 

The standard I/O functions signal errors through their return values, typically NULL (from functions that return a pointer) or EOF (from many of the functions that return an int). Here is an example of opening a file, checking for success, and later closing the file:

#include <stdio.h>

int
main(int argc, char *argv[])
{
  char* filename = "/etc/passwd";
  FILE* fp;
  int exit_code = 0;

  if ((fp=fopen(filename, "w")) == NULL) {
    fprintf(stderr,"*** could not open file!\n");
    exit_code = 1;
  } 
  else {
    printf( "File opened!\n");
    /* ... print to the file ... */
    fclose(fp);
  }
  return exit_code;
}

In the preceding example, notice how an assignment statement is embedded inside the condition of the if statement! That’s ok, because an assignment is an expression that itself has a value - the value that is assigned to the variable on the left-hand side - and that value is then used in the outer expression (here, a conditional expression testing equality). Thus, the if statement could have been written in my preferred form as

  fp = fopen("/etc/passwd", "w");
  if(fp == NULL) {
  ...

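but such a construct appears so often that the two statements are often combined. When the assignment is included in the if condition, it is wrapped in parentheses so that it is treated as a whole, as the left-hand side of the == comparison operator. The parentheses matter: == has higher precedence than =, so without them fp would be assigned the result of the comparison.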

For details, man fopen, man fclose.
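To tie together fopen(), ferror(), and fclose(), here is a minimal sketch of a function that opens a file, reads it to the end, and reports failure through its return value. The name count_chars and the choice of -1 as its error value are our own assumptions for illustration; fgetc(), which reads one character from a FILE, is described later in these notes.

```c
#include <stdio.h>

/* Count the characters in the named file.
 * Returns the count, or -1 if the file cannot be opened or a read fails.
 * (count_chars is a hypothetical helper, not part of stdio.) */
long count_chars(const char* filename)
{
  FILE* fp = fopen(filename, "r");
  if (fp == NULL) {
    return -1;                 /* fopen failed; there is nothing to close */
  }

  long count = 0;
  while (fgetc(fp) != EOF) {
    count++;
  }

  /* fgetc returns EOF at end of file and on error; ferror tells them apart */
  if (ferror(fp)) {
    count = -1;
  }

  fclose(fp);
  return count;
}
```

A caller would test the result, for example printing a message to stderr when count_chars(filename) returns a negative value.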

printf, scanf

The most frequently used functions in the C standard I/O library perform output of formatted data.

  fprintf(FILE* fp, char* format, arg1, arg2, ...);

for example,

  int class = 50;
  char *department = "Computer Science";

  fprintf(fp, "Course: %s %02d\n", department, class);

prints Course: Computer Science 50 to the file pointed to by fp.

The fprintf function is a generalization of printf. Put another way, printf(format, arg...) is shorthand for fprintf(stdout, format, arg...).

#include <stdio.h>
  
int main() {
        int class = 50;
        char *department = "Computer Science";

        fprintf(stdout, "Course: %s %02d\n", department, class);
        fprintf(stderr, "Message to stderr for class %02d\n",class);
        return 0;
}

Format specifiers

Many standard I/O functions accept a format string, which contains format specifiers describing how the following arguments are to be interpreted. This mechanism is in contrast to Java’s toString facility, in which each object knows how to output/display itself as a String object. There are many possible format specifiers, the most common ones being %c for character values, %d for integer values, %f for floating-point values, and %s for character strings. A format specifier may include modifiers, placed between the % and the conversion letter, which may further specify the data type and may indicate the minimum width of the output (in characters). For example, %5.2f prints a floating-point number at least 5 characters wide, with 2 decimal places. You can also specify a string’s minimum width, as in %10s for a string of width 10 (padded on the left with spaces). For details, man 3 printf (note the 3 to ensure man prints the page about the printf functions rather than the printf shell command).
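A small sketch may make the width and precision modifiers concrete. The function name format_sample is hypothetical; it uses snprintf(), a relative of printf that writes into a character array of a stated size, and adds %-10s to show padding on the right.

```c
#include <stdio.h>

/* Format a float, two strings, and an int into out, which holds size bytes.
 * Returns whatever snprintf returns: the length of the formatted text.
 * (format_sample is a hypothetical name used only for this illustration.) */
int format_sample(char* out, size_t size)
{
  return snprintf(out, size, "[%5.2f] [%10s] [%-10s] [%02d]",
                  3.14159,    /* %5.2f : at least 5 wide, 2 decimal places */
                  "hello",    /* %10s  : at least 10 wide, padded on the left */
                  "hello",    /* %-10s : at least 10 wide, padded on the right */
                  7);         /* %02d  : at least 2 wide, padded with zeros */
}
```

With a large enough array, the result is the string "[ 3.14] [     hello] [hello     ] [07]"; the brackets are there only to make the padding visible.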

The C standard I/O library provides efficient “buffering”. This means that although it appears that the output has gone to the FILE pointer, it may still be held in an internal character buffer in the library (and hence will not actually be written to the disk or to the screen until more output accumulates and the buffer becomes full). When we need pending output to be written out right away, rather than whenever the buffer happens to fill, we “flush” it:

  /* ... format some output ...*/
  fflush(fp);

As you might expect, FILE pointers are automatically flushed when a file is closed or the process exits.
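One common place to flush is right after a prompt that does not end in a newline, so the prompt is not left waiting in the buffer while the program waits for input. The sketch below assumes a hypothetical helper named show_prompt; only fprintf() and fflush() come from stdio.

```c
#include <stdio.h>

/* Write a prompt (no trailing newline) to fp and flush it at once.
 * Returns 0 on success, EOF on failure.
 * (show_prompt is a hypothetical helper for illustration.) */
int show_prompt(FILE* fp, const char* text)
{
  if (fprintf(fp, "%s", text) < 0) {
    return EOF;               /* fprintf returns a negative value on error */
  }
  return fflush(fp);          /* fflush returns 0 on success, EOF on error */
}
```

A program might call show_prompt(stdout, "Name: ") just before reading the user's reply from stdin.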

A related function, sprintf, allows us to ‘output’ to a character array (a string):

  int class = 50;
  char *department = "Computer Science";
  char buffer[BUFSIZ];

  sprintf(buffer, "Course: %s %02d\n", department, class);

Security alert! What is the potential exposure in the code above? Read man snprintf.
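The exposure is that sprintf() is never told how big the array is, so a long enough department string would be written past its end. As a minimal sketch of the alternative, snprintf() takes the size of the array and returns the length the full output would have needed, which lets us detect truncation. The wrapper name format_course and its return convention are assumptions made for this example.

```c
#include <stdio.h>

/* Format a course title into buf, which holds size bytes.
 * Returns 0 if the whole string fit, -1 if it was truncated or an error
 * occurred. (format_course is a hypothetical wrapper for illustration.) */
int format_course(char* buf, size_t size, const char* department, int number)
{
  int needed = snprintf(buf, size, "Course: %s %02d\n", department, number);

  /* snprintf never writes more than size bytes, including the null;
   * a result >= size means some of the output did not fit */
  if (needed < 0 || (size_t)needed >= size) {
    return -1;
  }
  return 0;
}
```

Even when the output is truncated, the array still holds a properly terminated string, which is what makes this form safe to use with input of unknown length.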

C’s standard I/O library may also be used to input values from FILE pointers and character arrays using fscanf() and sscanf(). Because we want the contents of C’s variables to be modified by the standard I/O functions, we need to pass them by reference, which in C is accomplished by passing the ‘address’ of the variables using the & operator:

  fscanf(fp, format, &arg1, &arg2, ...);

like this:

  int i, res;
  char buffer[BUFSIZ];
  fscanf(fp, "%d %d", &i, &res);
  sscanf(buffer, "%d %d", &i, &res);

We’ll talk more about addresses and pointers next week.
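The scanf family returns the number of items it successfully converted, and checking that number is the only way to know whether the variables were actually filled in. Here is a small sketch; parse_pair is a hypothetical helper. Its parameters first and second are already addresses, so they are passed to sscanf() without the & operator.

```c
#include <stdio.h>

/* Parse two integers from text into *first and *second.
 * Returns 1 if both were found, 0 otherwise.
 * (parse_pair is a hypothetical helper for illustration.) */
int parse_pair(const char* text, int* first, int* second)
{
  /* sscanf returns the number of items converted, or EOF if the
   * input ran out before anything could be converted */
  return sscanf(text, "%d %d", first, second) == 2;
}
```

A caller with two int variables a and b would write parse_pair(line, &a, &b) and use a and b only when the result is 1.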

getchar, putchar

Below is a common idiom that reads a file character by character and, in this case, simply prints it right back out. (It’s like cat without command-line arguments.) From getput.c:

 /*
 * getput: a short demo of getchar() and putchar()
 *
 * CS 50, Fall 2022
 */

#include <stdio.h>
#include <stdlib.h>

int
main()
{
  int c;  /* int, not char, so that EOF is distinct from every valid character */

  while ((c = getchar()) != EOF) {
    putchar(c);
  }
}

Here, getchar() reads a single character from stdin and putchar() writes a single character to stdout; you could of course include other logic within the loop.

Similar functions fgetc() and fputc() allow you to specify a FILE for input or output.

Notice how the variable c is assigned the return value of getchar() within a parenthetical expression (c = getchar()); the value of that expression is the same assigned value, and thus it is that value compared against the stdio-defined constant EOF. Because getchar() returns EOF on end of file or error, this loop eventually terminates. Note that c is declared int rather than char: getchar() returns an int so that EOF can be a value different from every character, and storing the result in a char could make the comparison against EOF misbehave.

For details on this family of functions, man 3 getchar.
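The same idiom works on any pair of FILE pointers with fgetc() and fputc(). The following is a sketch only; the name copy_stream and its return convention are our own assumptions.

```c
#include <stdio.h>

/* Copy everything from in to out, one character at a time.
 * Returns the number of characters copied, or -1 on a read or write error.
 * (copy_stream is a hypothetical helper for illustration.) */
long copy_stream(FILE* in, FILE* out)
{
  int c;             /* int, not char: must hold every character and EOF */
  long copied = 0;

  while ((c = fgetc(in)) != EOF) {
    if (fputc(c, out) == EOF) {
      return -1;     /* write failed */
    }
    copied++;
  }

  /* the loop ends at end of file or on a read error; ferror distinguishes */
  return ferror(in) ? -1 : copied;
}
```

Calling copy_stream(stdin, stdout) behaves like the getchar()/putchar() loop above, with the added benefit that the caller learns whether the copy succeeded.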

Choose the right approach

Sometimes, it is convenient to read input line by line; for that we recommend the readline() function from readline.c.

Other times, it is more convenient for the logic of the program to read input character by character, as above.

Finally, there are times when the input can be assumed to be clean, such as a sequence of numbers or words; in those rare cases, it may be convenient to loop over calls to scanf().

Check for errors

When using the C standard I/O functions, we must pay attention to the particular return values from those functions. The functions themselves may return NULL file pointers, the number of input items read, the number of characters written, or special non-zero values that indicate the end of a file or some other error condition. Always check the man page to be sure. Some of the special return values have constants defined in the system include files for convenience, improved code readability, and consistency. Here are some sample code snippets:

#include <stdio.h>  
const int  MAXLINE = 80;

int i, sum;
char line[MAXLINE];

for(;;) {
  if (fgets(line, sizeof(line), fp) == NULL) {
     break;  /* end of file or read error */
  }

  /* ... process the line just read ...*/

}
fclose(fp);
...
/* count lines: stop if the file cannot be opened, or at end of file or error */
for (numlines=0, fp=fopen("thefile","r"); fp != NULL && fgets(buffer, sizeof(buffer), fp) != NULL; numlines++ ) {
    ...
}
...
sum = 0;
while(fscanf(fp, "%d", &i) == 1)
   sum += i;
fclose(fp);

Here is a code snippet that uses fopen(), fgets(), strlen(), printf(), sscanf(), and fclose(). It reads from the file into a character array, and then applies sscanf() to that array to extract information into various variables.

Example: files.c
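As a rough illustration of the fgets()-then-sscanf() pattern (this is a sketch of our own, not the contents of files.c), the function below reads lines of the assumed form "word integer" and adds up the integers. The name sum_values, the line format, and the buffer sizes are all assumptions.

```c
#include <stdio.h>

/* Read lines of the form "<word> <integer>" from fp and add up the integers.
 * Stores the total in *total. Returns the number of well-formed lines,
 * or -1 if a read error occurred. Malformed lines are skipped.
 * (sum_values is a hypothetical helper for illustration.) */
int sum_values(FILE* fp, int* total)
{
  char line[100];
  char word[50];
  int value;
  int good = 0;

  *total = 0;
  while (fgets(line, sizeof(line), fp) != NULL) {
    /* %49s leaves room for the terminating null in word[50] */
    if (sscanf(line, "%49s %d", word, &value) == 2) {
      *total += value;
      good++;
    }
  }

  /* fgets returned NULL: find out whether that was an error */
  return ferror(fp) ? -1 : good;
}
```

Reading a whole line first and parsing it afterwards means that one malformed line cannot leave the input in an unexpected position, as can happen when calling fscanf() directly on the file.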

gets and fgets — do not use gets!

There is a saying - you learn from your mistakes, so make lots of them. There is another one: don’t make the same mistake twice. The use of the stdio function gets() is a mistake. Lots of programmers have made this mistake, and caused headaches for millions of computer users around the world. The lesson: never use gets()!

The reason: gets() reads a line from stdin into a character buffer - an array of characters provided by the caller - but has no idea how big the buffer is. If the input contains more characters than fit in the buffer, gets() will happily continue to write characters into memory beyond the end of the buffer. But there is likely something else important stored in that region of memory! The resulting buffer overflow has been the mechanism for many cyberattacks, in which hackers craft clever strings that will overflow an input buffer and write just the right sort of data or code into adjacent memory, causing the program to do something its programmer never intended - but which serves the hacker’s interest.

The stdio library includes a safer function, fgets(), which reads from any FILE into a character buffer… but requires the caller to provide the size of the buffer. As long as the programmer supplies the right size with the right buffer, fgets will never overflow the buffer.

I won’t go into more detail here, because it’s too much of a digression… in this class, we won’t use gets (because it’s so dangerous nobody should ever use it) or fgets (because we developed more convenient alternatives, readline() and readlinep()).

Buffer overflows

Let’s look at an example that could have been named really-bad-code.c.

Example: buffer-overflow.c

/*
 * buffer-overflow.c - This is a bad program! But its fun. The basic 
 * idea of the program to input and manipulate strings using arrays 
 * of chars is fun. However, there is a serious flaw in the program.
 * The book uses the function gets(). This is a seriously dangerous 
 * function call. DONT USE IT. Revised code taken from pg. 457 (
 * Program 9.5) (Bronson) "First Book on ANSI C"
 * 
 * CS50, Fall 2022
 */

#include  <stdio.h>
#include <string.h> /* required for the string function library */

#define MAXELS 50

int main()
{
  char string1[MAXELS] = "Hello";
  char string2[MAXELS] = "Hello there";
  int n;

  n = strcmp(string1, string2);

  if (n < 0)
    printf("%s is less than - %s\n\n", string1, string2);
  else if (n == 0)
    printf("%s is equal to - %s\n\n", string1, string2);
  else
    printf("%s is greater than -  %s\n\n", string1, string2);

  printf("The length of string1 is %ld characters\n", strlen(string1));
  printf("The length of string2 is %ld characters\n\n", strlen(string2));

  strcat(string1," there World!");

  printf("After concatenation, string1 contains the string value\n");
  printf("%s\n", string1);
  printf("The length of this string is %ld characters\n\n",
                                                   strlen(string1));

  printf("Please enter a line of text for string2, max %ld characters: ", sizeof(string2));

  /* In the code below comment and uncomment the gets() code */
  gets(string2);

  /* In the code below comment and uncomment the fgets() code segment */

  // fgets(string2, sizeof(string2), stdin); 

  printf ("Thanks for entering %s\n", string2);
 
  /* Warning: The gets() function cannot be used securely. Because of its lack of 
  bounds checking, and the inability for the calling program to reliably determine 
  the length of the next incoming line, the use of this function enables 
  malicious users to arbitrarily change a  running program's functionality 
  through a buffer overflow attack. It is strongly suggested that the fgets() 
  function be used in all cases. OK let's type in more than 50 chars and see
  what happens - segfault!*/ 

  strcpy(string1, string2);

  printf("After copying string2 to string1");
  printf(" the string value in string1 is:\n");
  printf("%s\n", string1);
  printf("The length of string1 is %ld characters\n\n",             
   strlen(string1));
  printf("\nThe starting address of the string1 string is: %p\n",
   (void*) string1);
  printf("\nThe starting address of the string2 string is: %p\n",
   (void*) string2);
  
  return 0;
}

Warning! this does not follow CS50 style! Even the compiler warns us about the danger:

$ mygcc -o buffer-overflow buffer-overflow.c 
buffer-overflow.c: In function ‘main’:
buffer-overflow.c:45:3: warning: implicit declaration of function ‘gets’; did you mean ‘fgets’? [-Wimplicit-function-declaration]
   45 |   gets(string2);
      |   ^~~~
      |   fgets
/usr/bin/ld: /tmp/ccH2E989.o: in function `main':
/thayerfs/home/d84xxxx/cs50/activities/day8/buffer-overflow.c:45: warning: 
the `gets' function is dangerous and should not be used.
$ 

Let’s look at the output when running the program first with gets() and then with the safer fgets(). If we run the code with gets() and enter 60 characters, the program is aborted with a “stack smashing detected” error.

$ ./buffer-overflow 
Hello is less than - Hello there

The length of string1 is 5 characters
The length of string2 is 11 characters

After concatenation, string1 contains the string value
Hello there World!
The length of this string is 18 characters

Please enter a line of text for string2, max 50 characters: 012345678901234567890123456789012345678901234567890123456789
Thanks for entering 012345678901234567890123456789012345678901234567890123456789
After copying string2 to string1 the string value in string1 is:
012345678901234567890123456789012345678901234567890123456789
The length of string1 is 60 characters


The starting address of the string1 string is: 0x7ffeb9b92d90

The starting address of the string2 string is: 0x7ffeb9b92dd0
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from './overflow', real uid: 23925, effective uid: 23925, real gid: 168108, effective gid: 168108, execfn: './overflow', platform: 'x86_64'
$

This is a bad program! The basic idea of the program is to accept and manipulate strings using arrays of chars. However, there is a serious flaw in the program. Some older books use the function gets(); it is a seriously dangerous function call. Do not use gets()!

The program defines two buffers, string1 and string2, each 50 chars in length. The characters the user types at the keyboard are written to string2, which is later copied to string1.

The input parameter to gets() is the name of the array (which is a pointer - more on pointers later). The function does not know how long the array is! It is impossible to determine the length of string1 and string2 from a pointer alone.

If we run the program and type in 50 characters, including the newline, all is safe. But if we type 51 or 60 or more characters, we overrun or ‘overflow’ the buffer. We end up writing past the end of the array! Fortunately, modern C compilers insert some code to detect the worst cases of such overflows, which tend to “smash the stack”; above you see the program exited when it detected that result, and “dumped core” (saved a copy of the memory from the running program, for later analysis and debugging). Unfortunately, such detection methods are not perfect … and even when they work, they crash your program.

This overflow can happen even without calling an unsafe function such as gets(), so it’s an important lesson to learn. Buffer overflows can have rather spectacular results!

Bugs often happen at boundary conditions, and one important boundary is the end of the array. If we write past the end of string1, we might write into string2. Recall that, by convention, C strings are terminated by \0 (aka null character). If this character is overwritten, then a piece of code operating on the array will keep on scanning until it finds a \0.

If we run this code and type in more than 50 chars (as we did above) anything can happen; for example: 1) the code could work with no visible effect of the bug; 2) immediate segfault; 3) segfault later in the code stream; 4) mistakes happen in unrelated functions (e.g., strcat() in our code).
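Functions such as strcat() and strcpy() share the weakness of gets(): they are not told the size of the destination array. One way to guard against that, sketched below under our own assumptions, is to check the lengths before copying. The helper name append_checked is hypothetical, and it assumes dst already holds a properly terminated string within dstsize bytes.

```c
#include <string.h>

/* Append src to the string in dst only if the result, including the
 * terminating null, fits in dstsize bytes.
 * Returns 0 on success, -1 if it would not fit (dst is left unchanged).
 * (append_checked is a hypothetical helper for illustration.) */
int append_checked(char* dst, size_t dstsize, const char* src)
{
  size_t used = strlen(dst);
  size_t extra = strlen(src);

  if (used + extra + 1 > dstsize) {
    return -1;                          /* would write past the end of dst */
  }
  memcpy(dst + used, src, extra + 1);   /* +1 copies the null terminator too */
  return 0;
}
```

In the program above, append_checked(string1, sizeof(string1), " there World!") would succeed, because the 18 characters plus the null fit comfortably in 50 bytes.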

Some books use gets() and promote its use. Just Say NO! Instead, use the safe fgets() as it is a buffer-safe function. Its prototype is:

    char *fgets(char *s, int size, FILE *stream);

It requires you to identify which file, yes, but more importantly, it requires you to identify the size of the character buffer into which it will write characters; fgets will not write more characters than the size of the buffer.

Example:

    fgets(buf, sizeof(buf), stdin);

The fgets() function shall read bytes from stream into the array pointed to by buf, until sizeof(buf)-1 bytes are read, or a newline is read and transferred to buf, or an end-of-file condition is encountered. The string is then terminated with a null byte.

We replace gets() with fgets() in the above code and now we are safe.
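Two details of fgets() are worth handling in practice: it keeps the newline in the buffer, and a line longer than the buffer is returned in pieces. The sketch below shows one way to deal with both; get_line is a hypothetical helper written for this illustration (it is not the course's readline() function), and its return values are our own convention.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from fp into buf (size bytes), without the trailing newline.
 * Returns 1 if the whole line fit, 0 if the line was too long (the excess
 * is discarded), and -1 on end of file or error.
 * (get_line is a hypothetical helper for illustration.) */
int get_line(FILE* fp, char* buf, int size)
{
  if (fgets(buf, size, fp) == NULL) {
    return -1;
  }

  size_t len = strlen(buf);
  if (len > 0 && buf[len - 1] == '\n') {
    buf[len - 1] = '\0';        /* complete line: strip the newline */
    return 1;
  }

  /* no newline in buf: the input ended, or the line did not fit;
   * discard the rest of the line so the next call starts on a new line */
  int c;
  int discarded = 0;
  while ((c = fgetc(fp)) != EOF && c != '\n') {
    discarded++;
  }
  return discarded == 0 ? 1 : 0;
}
```

A caller would typically loop while get_line(stdin, buf, sizeof(buf)) is not -1, and decide what to do with a line reported as too long.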