CS 50 | Software Design and Implementation

In this lecture, we continue our accelerated introduction to the C programming language.

Goals

Today’s topics include:

command-line arguments
arrays and arrays of strings

You may also want to look at the three examples in the lecture extra, which build on what we learn today and provide a useful readline() function. We will cover some of this material in the next class, but you should have a good head start on it after this one.

Activity

In the activity we will parse command-line arguments.

Strings in C

C does not have a “string” type, so C programs represent strings as an array of characters. Recall from CS 10 that an array is a contiguous block of memory. For example, consider the following:

char* CS = "Computer Science";

This code declares a variable named CS, whose type is char*, that is, a pointer to a character. We’ll dig into ‘pointers’ more deeply soon, but for now think of it as pointing at a single byte in memory, which holds the first character of the string; the other characters of the string appear in the following bytes of memory.

diagram of a C string in memory

After the last character of the string there is a byte with value zero; C represents that as '\0'.

When you provide a string constant like "Computer Science", C will add that null character for you. All other string operations in C are done by library functions, such as printf() or strcmp(). All those functions expect a string parameter to be a pointer to the first character (char*) and then expect to read through memory until they hit a null character, indicating end of string.

String variables

There are two ways to declare a string variable.

First, you can declare a new string (an array of characters):

char dept[30];     // an array of characters, with uninitialized content
dept[0] = '\0';    // initialize it to the empty string

The first line defines a new variable dept and allocates room for a string of up to 29 characters (not 30!) because we need to allow one spot for the terminating null character; the second line sets the first character to null, effectively initializing the string to be the empty string.

Alternatively, we could skip the initialization and immediately fill the new string with a copy of an existing string:

strcpy(dept, "Computer Science");

Here we actually copy the string “Computer Science”, one character at a time, into the space allocated for dept. Because the C language does not provide any way to manipulate strings, we depend on a library function called strcpy().

Note the strcpy() parameters are like strcpy(to, from), not strcpy(from, to).

Second, you can just declare a string pointer, if you want a variable that will point to an existing string, e.g.,

char* CS = "Computer Science";  // initialized to point at a constant string
char* department = NULL;  // initialized to "null pointer"
department = CS;          // now pointing to the same string.

When defining a new string pointer variable, it is always good practice to immediately initialize it; in this case, to the null pointer, that is, pointing at location zero in memory, which by convention is used to represent an unassigned pointer.

After the third line, both CS and department point at the same string – it copies the pointer, not the string itself.

You may sometimes see the [] syntax, which implies an array of unspecified size:

char   firstName[];  // this syntax...
char  *firstName;    // is equivalent to this syntax, and
char * firstName;    // is equivalent to this syntax, and
char*  firstName;    // is equivalent to this syntax.

The first form is most often seen as a function parameter, and is equivalent to the other forms: in all cases, firstName is a pointer to a character, but the first form makes it more clear to a reader that it is pointing to an array of characters (likely to a string) – not to just one character. We generally use the fourth syntax in CS 50.

A final note: although NULL and '\0' are really both just names for the number zero, NULL is a pointer (null pointer) and '\0' is a character (null character).

String library

The C library contains many useful string functions; see man string. To use them, include the following at the top of your C code,

#include <string.h>

The C/Operating system interface

As we’ve seen, the user of a shell (like bash) can specify zero or more command-line arguments for each command - and most commands are simply programs. Some programs are scripts interpreted by other programs (like bash or python), and other programs are machine-code binaries compiled from another language (like C or C++). In any case, the shell asks the operating system (OS) to execute the program, and passes along the arguments. When the OS executes a compiled C program, it calls the function main() with two parameters:

an integer argument count (conventionally called argc),
an array of pointers to character strings (conventionally called argv)

Notice that in many previous examples we’ve provided a main() without any parameters at all. C does not check the length and types of parameter lists of functions for which it does not know a prototype.

In addition, the function main() has no special significance to the C compiler. Only the linker requires main() as the apparent starting point of any program.

Most C programs you see will look like this:

int main(int argc, char* argv[])

Some people prefer to declare them as constants, and so let the C compiler help avoid modifying these input parameters:

int main(const int argc, const char* argv[])

So how do you get “an array of pointers to char” out of a mouthful like char *argv[]? It’s all about operator precedence.

The highest precedence of everything is variable names and literals. Then the next highest precedence thing is the subscripting operator [ ]. About halfway down the list of operators is the indirection operator *, so its precedence is lower than the subscripting operator (see precedence from last class). Thus, the above declaration is read as: argv is an array of pointers to char.

The following program prints out its command line. Note that argv[0] is the command name and argv[1] … argv[argc-1] are the command-line arguments (after any expansion or substitutions done by the shell).

Example: arguments.c

Look at the following snippet:

#include <stdio.h>

int main(int argc, char *argv[]) 
{
  int i;

  printf("%d items were input on the command line\n", argc);
  for (i = 0; i < argc; i++)
    printf("argument %d is %s\n", i, argv[i]);

  return 0;
}

$ mygcc arguments.c -o arguments
$ ./arguments 1 two 3.1415 4 "F i v e"
6 items were input on the command line
argument 0 is ./arguments
argument 1 is 1
argument 2 is two
argument 3 is 3.1415
argument 4 is 4
argument 5 is F i v e
$

We declared argv as an array of pointers to char. For any given argument i, argv[i] is one of those pointers; that is, argv[i] is of type char*. We can pass that pointer to functions like printf, wherever it expects a string. We will discuss pointers in more detail soon.

A more interesting snippet of code shows that the command line is stored as a set of string arguments in memory and that the address of the location of the first character for each string argument is stored in the argv[] array.

Example: command.c

Let’s look at the following snippet:

#include <stdio.h>

int main(int argc, char *argv[])
{
  int i;

  printf("\nThe number of items on the command line is %d\n\n",argc);
  for (i = 0; i < argc; i++)
  {
    printf("argument %d is \"%s\"\n", i, argv[i]);
    printf("The address stored in argv[%d] is %p\n", i, (void*)argv[i]);
    printf("The first character pointed to there is \'%c\'\n", *argv[i]);
  }

  return 0;
}

If you run the program you will see the following output. Note that the hexadecimal memory address of the first character of each argument is printed too, using %p in the format string. Also notice I escaped the apostrophe (') characters with \'; this tells the C compiler that we intend an apostrophe character to appear (inside a double-quoted string the escape is optional, but harmless).

$ mygcc command.c -o command
$ ./command hello cs50 ready to go skating?

The number of items on the command line is 7

argument 0 is "./command"
The address stored in argv[0] is 0x7ffd1f27eabf
The first character pointed to there is '.'
argument 1 is "hello"
The address stored in argv[1] is 0x7ffd1f27eac9
The first character pointed to there is 'h'
argument 2 is "cs50"
The address stored in argv[2] is 0x7ffd1f27eacf
The first character pointed to there is 'c'
argument 3 is "ready"
The address stored in argv[3] is 0x7ffd1f27ead4
The first character pointed to there is 'r'
argument 4 is "to"
The address stored in argv[4] is 0x7ffd1f27eada
The first character pointed to there is 't'
argument 5 is "go"
The address stored in argv[5] is 0x7ffd1f27eadd
The first character pointed to there is 'g'
argument 6 is "skating?"
The address stored in argv[6] is 0x7ffd1f27eae0
The first character pointed to there is 's'
$

Command line switches

A common activity at the start of a C program is to search the argument list for command-line switches commencing with a dash character. The remaining command-line parameters are often assumed to be filenames.

Study the example nosort.c, which shows one way to parse the command line of a hypothetical replacement for the sort command.

The program nosort.c (no sorting code is included in this example, only the command line parsing) looks like this:

/*
 * nosort - an example about handling of command-line switches;
 *  supports command lines such as "sort -r -u -n", but not "sort -run".
 * It also displays an old-style approach to looping over (argc,argv),
 * stepping through the arguments by changing each variable.
 */

#include<stdio.h>
#include<stdbool.h>

int
main(int argc, char* argv[])
{
  char* progname = argv[0];   // name of this program
  bool unique =  false;       // -u
  bool reverse = false;       // -r
  bool numsort = false;       // -n
  
  // this loop changes argc and argv as it loops
  while((argc > 1) && (argv[1][0] == '-')) {

    // argv[1][0] is the '-'
    // argv[1][1] is the first letter
    char opt = argv[1][1];
    
    switch (opt) {

    case 'r': 
      reverse = true;
      break;
    case 'u':
      unique = true;
      break;
    case 'n':
      numsort = true;
      break;
    case '\0':
      fprintf(stderr, "Error: missing option '-'\n");
      fprintf(stderr, "Usage:  %s [-r] [-u] [-n] filename...\n", progname);
      return 1;
      
    default:
      fprintf(stderr, "Error: bad option '-%c'\n", opt);
      fprintf(stderr, "Usage:  %s [-r] [-u] [-n] filename...\n", progname);
      return 1;
    }

    // decrement the number of arguments left
    // increment the argv pointer to the next argument
    argc--; argv++;
  }

  if (numsort)  printf("numeric sort\n");
  if (unique)   printf("unique sort\n");
  if (reverse)  printf("reverse sort\n");
  if (argc > 1) printf("next argument '%s'\n", argv[1]);
  
  // ...other processing

  return 0;
}

Its command line should look like nosort [-r] [-u] [-n] filename... as follows:

$ mygcc nosort.c -o nosort
$ ./nosort
$ ./nosort x
next argument 'x'
$ ./nosort -r x
reverse sort
next argument 'x'
$ ./nosort -r -n -u x y z
numeric sort
unique sort
reverse sort
next argument 'x'
$ 

The switches can be listed in any order, but not combined as in

nosort -run

Note the example’s defensive programming: if the user enters a missing or bad option then the user is informed with a usage message:

$ ./nosort -
Error: missing option '-'
Usage:  ./nosort [-r] [-u] [-n] filename...
$ ./nosort -x
Error: bad option '-x'
Usage:  ./nosort [-r] [-u] [-n] filename...
$ 

The example also demonstrates several things about C, and C idioms:

the switch statement and its component case and break statements.
the use of argc--; argv++ as a way of stepping through an array. Note: each time argv is incremented, it changes the base address on which a subscript like [1] is interpreted.
the syntax for subscripting a two-dimensional array, like argv[1][0].

Note argv is not (strictly speaking) a two-dimensional array, and C does not (strictly speaking) support multi-dimensional arrays; the first subscript selects one of the char* pointers in the array-of-pointers that is argv; the second subscript selects one of the characters in the array of characters to which that pointer refers. Thus, argv[1] is a char* (a pointer to a character string), and by subscripting that further, in argv[1][0], we refer to the character itself.

The standard I/O library

The C language itself does not define any particular file or character-based input or output routines (nor any windowing routines), unlike Java. Instead any program may provide its own. Clearly this is a daunting task, and so the standard C library provides a collection of functions to perform file-based input and output. The standard I/O library functions provide efficient, buffered I/O to and from both terminals and files.

C programs requiring standard I/O should include the line:

#include <stdio.h>

NOTE: the #include preprocessor directive simply instructs the compiler to go find the stdio.h file in the usual system place (e.g., /usr/include in Linux) and include its contents at this point in the source stream. We’ll learn more about the preprocessor later.

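The following functions are declared in stdio.h on one platform. Most are standard C, but a few (clrmemf(), fdelrec(), fldata(), flocate(), fupdate(), and svc99()) are vendor-specific extensions that are not part of standard C: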

clearerr()	clrmemf()	fclose()	fdelrec()	feof()
ferror()	fflush()	fgetc()	fgetpos()	fgets()
fldata()	flocate()	fopen()	fprintf()	fputc()
fputs()	fread()	freopen()	fscanf()	fseek()
fseeko()	fsetpos()	ftell()	ftello()	fupdate()
fwrite()	getc()	getchar()	gets()	perror()
printf()	putc()	putchar()	puts()	remove()
rename()	rewind()	scanf()	setbuf()	setvbuf()
sprintf()	sscanf()	svc99()	tmpfile()	tmpnam()
ungetc()	vfprintf()	vprintf()	vsprintf()

Many standard I/O functions accept a format string, which contains format specifiers describing how the following arguments are to be interpreted. This mechanism is in contrast to Java’s toString facility, in which each object knows how to display itself as a String object. There are many possible format specifiers; the most common ones are:

%c for character values
%d for decimal integer values
%f for floating-point values
%s for character strings
%p for pointer values (memory addresses)

Format specifiers may include a number of format modifiers between the % and the conversion letter, which may further specify the data type and the minimum width of the output (in characters). For example, "%7.2f" will print a floating-point value using at least 7 characters in total, with two digits after the decimal point. Shorter values are left-padded with spaces by default (e.g., using "%7.2f" with 123.456789 will result in _123.46 where _ is a space). Longer values (e.g., using "%2.2f" with 123.456789) will still display the whole number in addition to the specified decimal places (e.g., 123.46).

Consider the following C code. Note the * around the numbers to show any padding.

#include <stdio.h>
  
int main() {
        float a = 123.456789;

        printf("Normal floating point: *%f*\n",a);
        printf("Seven wide, with two decimal places *%7.2f*\n",a);

        return 0;
}

The code produces this output:

$ mygcc print_width.c -o print_width
$ ./print_width 
Normal floating point: *123.456787*
Seven wide, with two decimal places * 123.46*

Note the space that pads 123.46 on the left in the second line of output.