10.7. Strings

It is often the case that a seemingly simple idea or design turns out to be surprisingly complicated. We saw one example of this previously in the discussion of pointers. Equating a pointer with an address seems straightforward; the implications for dynamic memory allocation, call-by-reference parameters, variable sizes, and so forth quickly become challenging for the programmer. The same can be said for strings in C. As with pointers, we start with a basic definition:

A string is an array of characters ending in the null byte.

To interpret the situation in a slightly different way, the C programming language does not actually have a string type in the intuitive sense. Instead, C just provides a thin veneer of interface for working with fixed-size arrays of char data. A string in the C sense consists of the array of chars that are (typically) observable to a human reader, with one additional char, the null byte, added to the end of the array. Code Listing A.36 illustrates this fact by defining the string "Hello" in a very unusual manner: as an array of six uint8_t values. One key idea here is that everything in the machine is just a number. The meaning and interpretation of those bytes as the string "Hello" is created by the %s format specifier, which tells printf() to present the ASCII interpretation of the bytes to the user instead of the numeric values.

/* Code Listing A.36:
   Printing "Hello" and turning it into "Ha!"
 */

uint8_t string[] = { 72, 101, 108, 108, 111, 0 };
printf ("The string is '%s'\n", string);

string[1] = 'a';
string[2] = 0x21;
string[3] = '\0';
printf ("The string is '%s'\n", string);

Since the string is an array, its individual elements can be accessed and modified; line 8 changes the 'e' to 'a', line 9 changes the first 'l' to '!', and line 10 changes the second 'l' to the null byte '\0' (literally the number 0). These changes cause line 11 to print the string as "Ha!" instead of the original "Hello". These lines did not change the 'o' byte stored as string[4], nor did the original '\0' stored in string[5] change; both bytes are still there in memory as part of the original array. The only reason they do not get printed by line 11 is, again, because of the %s format specifier, which tells printf() to stop printing at the first null byte. Table A.5 illustrates the memory content of this array of chars from before and after the modifications, based on three different interpretations for formatting. Note that the ASCII interpretation '\0' is not displayed to the screen, but is shown here for completeness.

Before the modifications:
ASCII interpretation `%c` or `%s`	`H`	`e`	`l`	`l`	`o`	`\0`
Hexadecimal format `%x`	`48`	`65`	`6c`	`6c`	`6f`	`00`
Decimal format `%d`	`72`	`101`	`108`	`108`	`111`	`0`

After the modifications:
ASCII interpretation `%c` or `%s`	`H`	`a`	`!`	`\0`	`o`	`\0`
Hexadecimal format `%x`	`48`	`61`	`21`	`00`	`6f`	`00`
Decimal format `%d`	`72`	`97`	`33`	`0`	`111`	`0`

Table A.5: Three interpretations of the bytes that make up the strings from Code Listing A.36

It is important to observe that there are two things missing from this representation. First, the quotation marks " used to begin and end the string appear only in the program’s source code. They are a construct of the C programming language (and other languages, as well), but they do not exist in the memory representation of the string. C needs the quotes to know where the string begins and ends in the source code. The machine does not; the string begins at the address of the first character and ends at the null byte.

Second, there is no explicit storage of the string length. This fact follows from the design structure that strings are null-terminated arrays. The designers of the language made the choice that the string length could always be dynamically determined by traversing through memory until the null byte was found. This design choice—using a single extra byte for a null terminator instead of four bytes to store an explicit length field—is a quintessential example of the space-time tradeoff that system designers face. By requiring extra time to search the string manually, the language could save three bytes of space per string; as programs could store and work with many, many strings, the cumulative space savings of three bytes per string could be potentially very large. At the time the language was designed, execution time was cheap but memory space was prohibitively expensive; thus, this design decision was a good tradeoff at the time, given the circumstances.

Given this understanding of strings as arrays of chars, we can now focus on issues related to using them in practice. Specifically, we can simply use the more conventional and readable notation of "Hello" rather than the (equivalent and perhaps more accurate) { 'H', 'e', 'l', 'l', 'o', '\0' }, assuming that the reader has the correct mental model of the computer’s internal representation. The first important consideration to highlight at this point, then, is the question of where these six bytes are actually stored in memory.

Code Listing A.37 illustrates this point by creating three different versions of the string "Hello". The differences between lines 5, 6, and 7 are small but significant, which the other lines of the code reveal. Assuming this code is run inside the body of a function, all three variables (array, pointer, and heap) are local and associated with the idea of storage on the stack. Line 5, by declaring a local array, behaves in the intuitive manner in this regard; the array variable indicates an array of six bytes that are placed in the function’s stack frame. That is, line 5 operates in a similar manner to declaring an array of int values or any other such local array. Line 6, in contrast, places a pointer variable on the stack; the actual bytes of the string are placed into the program’s read-only data section (.rodata). Line 7 also places a pointer on the stack, but the strdup() function returns a pointer to a dynamically allocated copy of the string’s bytes on the heap. In short, these three lines illustrate how we can determine which memory segment (stack, data, or heap) will contain the bytes of the string.

/* Code Listing A.37:
   Three different ways to create the string "Hello"
 */

char array[] = "Hello";
char *pointer = "Hello";
char *heap = strdup ("Hello");

/* None of these is the string length */
printf ("Sizes: %zu %zu %zu\n", sizeof (array),
        sizeof (pointer), sizeof (heap));

array[1] = 'a';
printf ("Array version: %s\n", array);
heap[1] = 'a';
printf ("Heap version: %s\n", heap);
pointer[1] = 'a'; // run-time exception

Lines 10 and 11 illustrate a very common source of confusion for those new to the intricacies of C strings. Recall that the sizeof() operator returns the number of bytes required to store a particular variable. In the cases of the pointer and heap variables, sizeof() will always return the same answer regardless of the string: 8 (assuming this is a 64-bit architecture). Both of these variables are pointers, so sizeof() returns the size of an address; sizeof() never dereferences a pointer to determine the size (or length) of the object being pointed to. In the case of the array variable, sizeof() returns the total number of bytes allocated for the variable on the stack: 6. That is, applying sizeof() to the array version of the declaration will include the null byte. Furthermore, assume that we had modified the array variable as in Code Listing A.36, changing the string from "Hello" to "Ha!"; sizeof() would still return an answer of 6 (not 4), because that is how much storage space the compiler statically associated with the variable named array. In short, sizeof() should never be used to determine the length of a string; it does not ever examine the actual contents. Instead, if you need to compute the length of a string, you should always use strlen() or strnlen().

Bug Warning

The use of pointers to declare strings leads to a number of subtle misunderstandings that end up as bugs in programs. One misunderstanding is overlooking the difference between initializing a pointer to the empty string ("") and initializing it to NULL. The empty string is a char array that consists of a single char: the null byte '\0'. As such, initializing a char* to the empty string makes the pointer point to a valid memory location (the address of the null byte). In contrast, setting the char* variable to NULL makes it point to nothing; dereferencing the pointer is undefined behavior and typically produces a segmentation fault. This point of confusion leads to potential errors when the strings are used. Consider the following example:

char *empty = "";
char *null = NULL;
printf ("Empty: %s; null: %s\n", empty, null);

Although there is no * on line 3, this code involves two pointer dereferences. That is, when printf() processes the %s format specifiers, it needs to get the contents of the string by dereferencing the empty and null pointers. When the empty string is processed, nothing interesting happens; it is a valid string, but it has no characters to print.

In contrast, when printf() encounters the null pointer, there is a problem; processing %s involves dereferencing the pointer (which is NULL), so this line would traditionally cause a segmentation fault. Some implementations of the C library (glibc, for example) have modified printf() to detect and avoid such crashes by printing the string (null) when given a NULL pointer, but the C standard does not require this behavior and portable code should not rely on it. This accommodation is made only for NULL exactly. If the pointer is not NULL, but the value is not a valid address (e.g., try changing the code above to use char *null = (char *)1;), printf() will most likely cause a segmentation fault.

C library functions – <string.h>

char * strdup(const char *s1);: Dynamically allocate a copy on the heap of the string pointed to by s1.
size_t strlen(const char *s);: Compute the length of the string pointed to by s, measured in bytes (not including the null byte).
size_t strnlen(const char *s, size_t maxlen);: Compute the length like strlen(), but never scan more than maxlen bytes.

Note

Many of the C string library functions have a version that starts with str and a version that starts with strn. The strn versions take an additional parameter (n) that specifies a maximum number of bytes to operate on. The n parameter provides a safety termination of the operation in case the null byte that is supposed to terminate the string has been overwritten. For instance, strlen() would continue scanning the bytes following the intended string until some null byte happens to be encountered. Consequently, calling strlen() on such a string would return a length that is (possibly significantly) larger than the actual length. If we started with the string "hello" and the null terminator was overwritten, we might end up with strlen() indicating that the string is 2500 bytes in length. This incorrect response might cause a crash or some other problem later, but the call to strlen() itself will usually not cause direct harm.

On the other hand, some functions are considered so dangerous that the str version should never be used. In fact, many projects scan for these functions and automatically reject code submissions that contain them. The most famous example of this is the strcpy() function that copies one string into a buffer that has already been allocated. If the buffer is not big enough, strcpy() will write beyond the end of it anyway, potentially corrupting other parts of memory after the buffer. For instance, if you allocate a buffer that can store only four bytes of data, using strcpy() to copy the string "Hello world from your evil hacker friend!" will write 42 bytes of data (41 characters plus the null byte); the first four will go into the buffer, and the remaining 38 will clobber the contents of memory (i.e., other variables) after the end of the buffer. Over the past several decades, this one programming error has been one of the most common and persistent sources of security vulnerabilities.

10.7.1. Investigating String Contents

Given a pointer to a string, particularly an input string, it is common to investigate the string’s contents for a variety of purposes. The C standard library provides several functions that can be used to examine a string. One of the most common is strcmp(), which takes the pointers to two strings, dereferences them, and compares their contents. The return value for strcmp() is negative, zero, or positive, with 0 indicating the strings are identical. Many implementations return exactly -1 or 1 for a mismatch, but the C standard guarantees only the sign. The sign indicates the lexicographic [1] ordering (i.e., how they would appear in an alphabetized list) if there is a mismatch; strcmp ("hello", "goodbye") would return a positive value to indicate that the first argument should be ordered after the second. Switching the order of the arguments would flip the result to a negative value. Two additional common functions are strchr() and strstr(), which are used for searching within the contents of the string; strchr() looks for a specified character in the string (passed as an int rather than a char), while strstr() looks for a substring. If the character or substring is found, these functions return a pointer to the first location; otherwise, they return NULL.

C library functions – <string.h>

int strcmp(const char *s1, const char *s2);: Compare two strings for the same content.
char * strchr(const char *s, int c);: Search for the first occurrence of a character c in a larger string s.
char * strstr(const char *haystack, const char *needle);: Search for one string (needle) as a substring of another (haystack).

Code Listing A.38 demonstrates some common uses of these functions. Lines 5 and 6 specify two strings to work with. Line 9 then compares them using strcmp(), implicitly relying on a convention in C that 0 indicates false and anything non-zero indicates true. Since these strings do not match, strcmp() returns a non-zero value (negative in this particular case, because 'b' precedes 'e'); C interprets any non-zero value as true, so the assertion is satisfied. (Note that it is a common practice to write !strcmp(s1, s2) to evaluate if the strings are identical. If they match, strcmp() returns 0 (false) and the logical negation (!) operator negates this value to true; if they do not match, the ! converts the non-zero value returned into false.)

/* Code Listing A.38:
   Comparing strings and searching for substring/character occurrences
 */

char *longer = "breathe";
char *shorter = "eat";

/* Assertion holds because they are not the same */
assert (strcmp (longer, shorter));

char *substr = strstr (longer, shorter);
printf ("Substring starting at \"%s\" is %s\n", shorter, substr);

size_t count = 0;
char *walker = strchr (longer, 'e');
while (walker != NULL)
  {
    count++;
    walker = strchr (walker + 1, 'e');
  }

printf ("There are %zu occurrences of 'e' in %s\n", count, longer);

Line 11 checks if the string "eat" occurs anywhere as a substring in the longer string "breathe". Since it does, strstr() returns a pointer to the first 'e' in the string. Note that strstr() does not alter the original string in any way; it simply returns a pointer to the middle of the existing string. Because of this, line 12 will print the substr variable as the string "eathe", as printf() processes %s by traversing through the characters until the null byte is encountered. That is, if strstr() finds a substring anywhere, printing the returned pointer will print the contents from the first occurrence of the substring all the way to the end of the original string.

Lines 14 – 20 use strchr() in a loop to count the number of occurrences of a particular character, 'e' in this case. Line 15 initializes the walker variable to point to the first 'e', the third byte of the string. If line 15 had searched for 'q' instead, walker would be initialized to NULL. Within each iteration of the while-loop, line 19 finds the next location of an 'e' (if one exists). In this case, the call to strchr() indicates that it needs to start looking at walker+1, the first byte after an 'e' that has already been found. (Calling strchr(walker, 'e'); on line 19 would create an infinite loop, since it would repeatedly find the same 'e'!) Assuming the original string longer is null-terminated (as it is), the while-loop is guaranteed to terminate as written. The function strchr() will stop and return NULL once it encounters the null byte. Even if walker ends up pointing to the last character of the string (as it does in this case), walker+1 can never accidentally skip over the null terminator, because walker is always set to point to an 'e'.

Bug Warning

Many languages have built-in string types that allow easy comparison with the standard equality operator. Again, C is not one of those languages. The only safe way to check if two strings have the same contents is to use strcmp(). Using other comparisons, such as the == operator, can lead to erroneous results if not interpreted correctly. With primitives like int and char, this operator compares the values and returns true if the values match. The same is true of strings (and pointers in general), but this fact does not match our intuitions. Specifically, the value of a string (char*) variable or any other pointer is the address being pointed to. That is, the == operator checks if the pointers are pointing to the same location, not that the strings themselves match. The following example illustrates key features of this distinction:

char *first = "hello";
char *second = "hello";
char third[] = "hello";
char fourth[] = "hello";

printf ("Comparing first and second:\n");
printf ("Same contents? %s\n", (! strcmp (first, second) ? "yes" : "no"));
printf ("Same string? %s\n\n", (first == second ? "yes" : "no"));

printf ("Comparing first and third:\n");
printf ("Same contents? %s\n", (! strcmp (first, third) ? "yes" : "no"));
printf ("Same string? %s\n\n", (first == third ? "yes" : "no"));

printf ("Comparing third and fourth:\n");
printf ("Same contents? %s\n", (! strcmp (third, fourth) ? "yes" : "no"));
printf ("Same string? %s\n", (third == fourth ? "yes" : "no"));

Lines 1 – 4 declare the string "hello" four times, twice with a char* and twice with a char array. These declarations influence the equality checks that follow. In all three cases, the strcmp() function will return 0 to indicate that they match; this should not be surprising since they are all initialized with the same string. It should also not be surprising that the equality check on line 12 returns false. Recall that the char* initialization style puts the contents of the string in .rodata, whereas the char array style places the contents on the stack. In other words, first and third are pointing to different memory segments.

The equality checks on lines 8 and 16 are somewhat less predictable initially. Line 8 typically returns true, indicating that the first and second pointers are pointing to the same place, despite the fact that they are both initialized with what appears to be a separate copy of the string. Compilers commonly determine that the strings are the same, which makes it redundant to store two copies in .rodata; by definition, the strings in .rodata cannot change, so one shared copy is sufficient. (The C standard permits this sharing but does not require it.) Line 16, on the other hand, returns false. The array declaration style must create two distinct instances, because each one can be modified independently of the other; it does not matter that the initial contents are the same. In fact, many compilers produce a warning on this line to indicate that such array comparisons always evaluate to false.

Another common task with strings is to determine if the characters fit into particular classes, such as numeric, alphanumeric, whitespace, printable, upper- or lower-case, etc. The functions defined in the ctype.h file provide these tests without requiring the programmer to recreate the pattern-matching required. Code Listing A.39 illustrates how these class tests could be used to validate the strength [2] of a password. Line 9 performs a standard safety check. Functions that take a pointer as input, particularly from user-supplied input, need to check explicitly for NULL arguments. Line 12 then throws out passwords that are shorter than 16 characters in length.

/* Code Listing A.39:
   Using ctype.h tests to determine if a password uses multiple classes
 */

bool
is_strong (char *password)
{
  /* Safety check: Don't accept a NULL pointer */
  assert (password != NULL);

  /* Short passwords are bad */
  if (strlen (password) < 16)
    return false;

  char *walker = password;
  bool digit = false, lower = false, upper = false, punct = false;
  while (*walker != '\0')
    {
      digit |= isdigit ((unsigned char) *walker);
      lower |= islower ((unsigned char) *walker);
      upper |= isupper ((unsigned char) *walker);
      punct |= ispunct ((unsigned char) *walker++);
    }
  /* Return true only if all are true */
  return digit && lower && upper && punct;
}

Lines 15 – 23 perform the bulk of the checking. The four bool variables are all initialized to false, indicating that we have not yet encountered a digit ('0' – '9'), lower-case letter ('a' – 'z'), upper-case letter ('A' – 'Z'), or a punctuation mark (see ispunct(3) for the full list). The walker variable is set to traverse through each byte of the string until the null byte is encountered (observe that line 22 advances walker after all checks have been done for one character). Within the while-loop, each bool variable is bit-wise ORed (|) with the result of applying the isX functions to the current character *walker. The first time that a character passes one of the tests (e.g., when walker points to 'Z' and isupper(*walker) is called), the corresponding bool variable will be set to 1 (true). From then on, that variable can never become false, because applying bit-wise OR of 1 with any value will always produce a non-zero result. Consequently, line 25 will return true only if the password contains at least one character from each of the four classes.

C library functions – <ctype.h>

int isalnum(int c);: Determines if c is alphanumeric.
int isalpha(int c);: Determines if c is an alphabetical letter.
int isdigit(int c);: Determines if c is a numeric digit.
int isspace(int c);: Determines if c is a whitespace character (including tab, newline, etc.).
int islower(int c);: Determines if c is a lower-case alphabetical character.
int isupper(int c);: Determines if c is an upper-case alphabetical character.
int ispunct(int c);: Determines if c is a punctuation mark.

10.7.2. Common String Manipulations

Most modern programming languages provide a simple mechanism for a very common task: merging strings. Some languages use the + operator, such as string1 + string2 to concatenate the two strings; others use a . operator instead. Unfortunately, C is not such a language. There are various functions that can be used for this purpose, with strncpy() and strncat() being two of the first encountered. Both of these functions copy the contents of one string (s2, passed as the second argument) into a portion of memory identified by the first argument (s1). The key difference between the two is that strncpy() will copy the bytes starting at the exact location that s1 points to; strncat() appends the strings by copying the bytes starting at the first null byte at or after s1. In other words, strncpy() replaces the contents of the first string, whereas strncat() concatenates the two.

Unlike their unsafe cousins strcpy() and strcat() (which should NEVER be used), strncpy() and strncat() take a third argument that specifies a maximum number of bytes to copy. If the length of s2 is less than n, then the function stops reading from s2 at its null byte rather than after n bytes; in that case, strncpy() fills the rest of the n bytes in s1 with null bytes. The memcpy() function shown below behaves similarly to strncpy(), except that it ignores the null byte; that is, memcpy() is used to copy an arbitrary memory buffer from one location to another, regardless of whether that buffer contains a string. In that regard, memcpy() will always copy exactly n bytes, unless some unusual circumstance occurs (such as the dst and src buffers overlapping, which is undefined behavior in the C specification).

C library functions – <string.h>

char * strncpy(char *s1, const char *s2, size_t n);: Copy string s2 into the buffer s1; stops after copying n bytes or at the first '\0'.
char * strncat(char *s1, const char *s2, size_t n);: Appends string s2 after the string s1; stops after copying n bytes or at the first '\0'.
void * memcpy(void *dst, const void *src, size_t n);: Copy n bytes of memory from src to dst; does not stop at '\0'.

Although it is certainly fair to refer to strcpy() or strcat() as an unsafe version of strncpy() or strncat(), it would be a mistake to consider the latter two functions truly safe. One key aspect of this is whether or not these functions guarantee that the result is null-terminated. Code Listing A.40 demonstrates two examples of this problem. The n argument on line 6 ensures that only the 'h' and 'e' characters get copied into the buffer array. That is, the n argument for strncpy() sets a maximum number of bytes to copy, and the function does not guarantee that one of these is a null byte. Line 7, then, is likely to print additional characters after the string "he", because there is no null byte in buffer. As such, the %s causes printf() to continue traversing through memory until a null byte is encountered. The second example has the opposite problem. Line 10 copies "hello" into trouble and, because the source is shorter than n, fills the remaining five bytes with null bytes. Line 11 then appends the five characters " worl" and, unlike strncpy(), always writes a null byte after them; since n does not count that null byte, it lands in an eleventh byte, one past the end of the 10-byte array.

/* Code Listing A.40:
   strncpy() and strncat() do not agree on null-termination of strings
 */

char buffer[2];
strncpy (buffer, "hello", 2);
printf ("buffer: %s\n", buffer);

char trouble[10];
strncpy (trouble, "hello", 10);
strncat (trouble, " world", 5);
printf ("trouble: %s\n", trouble);

Bug Warning

The strncpy() and strncat() functions are a frequent source of errors. As described above, they differ on their interpretation of the n parameter and whether or not null-termination is guaranteed (yes for strncat(), no for strncpy()). Besides the confusion around these issues, they still leave plenty of room for errors on the part of the programmer. One common mistake is to switch the order of the first two arguments, mistaking the source and the destination of the copy operation. Another common mistake with these functions can be illustrated in the following line of code:

strncpy (destination, source, strlen (source));

This line of code, in essence, re-creates the functionality of the banned strcpy() function. When strcpy() runs, it stops only when it encounters a null byte in source; in the process, it copies the strlen(source) bytes of the string, plus the null byte, over to the destination. By making the n parameter the same as the number of bytes in the string, this line of code sets a redundant maximum length (strcpy() would already stop after strlen(source) bytes of data) and, worse, leaves no room for the null byte to be copied. The third parameter must always be based on how much space is available in the destination, never the source.

Another common mistake that occurs with these functions is due to confusion regarding the sizeof() operator as discussed previously. Consider the following example:

char *buffer = calloc (100, sizeof (char));
strncpy (buffer, "This is a string", sizeof (buffer));

The first line of this example creates a dynamically allocated buffer of 100 bytes of space. Since it uses calloc(), all of the bytes are set to null bytes (0 = '\0'). On the second line, the source argument string is 16 characters in length; clearly the buffer has sufficient space for the entire string. The problem with this line of code is that only the bytes "This is " will be copied over, due to the use of sizeof(). As described previously, sizeof() can never check how much space a pointer is pointing to. Instead, sizeof() returns the amount of space required for the variable itself. Since buffer is a char* (i.e., it is a pointer), its size is the size of an address: 8 bytes on a 64-bit architecture. It does not matter that buffer is pointing to 100 bytes allocated on the heap. Based on the first line of code (with a hard-coded size of 100), the last argument to strncpy() would need to be the hard-coded value 99 (keeping the 100th byte as 0 to guarantee a null-terminated string).

A third common mistake occurs when the programmer forgets about the implications of memory segment permissions. In the following example, message is declared as a char* that points to the hard-coded string "Hello, ", which resides in the read-only global data segment (.rodata). Line 2, then, is an attempt to write into read-only memory. The result would be a segmentation fault or an abort trap, depending on the architecture.

char *message = "Hello, ";
strncat (message, username, 20);

While strncpy() and strncat() focus on building or merging strings, another common task is to split a string into smaller parts, a procedure known as tokenizing. C provides two functions, strtok() and strtok_r(), for this purpose. In both cases, when the function is first called, the str parameter points to the string to tokenize; on subsequent calls, str is set to NULL to indicate that the function is continuing to process the previous string. The sep parameter is a pointer to a string of separator characters; whenever one or more of these characters is encountered in a row, strtok() or strtok_r() would return a pointer to the token ending at that character. The difference between the two functions is that strtok_r() is reentrant, while strtok() is not. (Reentrancy is discussed in Chapter 7.) In short, strtok() uses a static variable to keep track of where to continue within the string. This approach fails when there are multiple threads calling strtok() on distinct strings; the threads might accidentally receive each other’s tokens. If there are multiple threads in execution, the strtok_r() version is needed to avoid this dilemma; the third parameter, lasts, keeps track of the tokenization of the string, thus eliminating the race conditions that can occur with static variables.

C library functions – <string.h>

char * strtok(char *str, const char *sep);: Split the string str at an occurrence of the separator sep.
char * strtok_r(char *str, const char *sep, char **lasts);: Thread-safe version of strtok(); sets lasts to the beginning of the next token.

As an example of tokenization, consider a comma-separated value (CSV) file, a common format for sharing collections of data. Each line in a CSV file consists of a number of data fields with a comma to separate them. As an example, consider a CSV file of holidays for the year 2020. One line of the file might look as follows:

Wed,Jan,01,2020,New Year's Day

Once the file contents have been read into memory, the lines might be tokenized to retrieve the individual fields of the line. Code Listing A.41 splits this line one token at a time, storing each token in the corresponding field of a struct declared as the holiday_t type.

/* Code Listing A.41:
   Using strtok_r() to split a CSV file line
 */

/* Assume line contains "Wed,Jan,01,2020,New Year's" */
holiday_t nyd;
char *save = NULL;
nyd.wkd = strtok_r (line, ",", &save); // set weekday to "Wed"
nyd.mon = strtok_r (NULL, ",", &save); // set month to "Jan"
nyd.day = strtok_r (NULL, ",", &save); // set day to "01"
nyd.yer = strtok_r (NULL, ",", &save); // set year to "2020"
nyd.nam = strtok_r (NULL, ",", &save); // set name "New Year's"

In each of lines 8 – 12, the call-by-reference parameter &save changes the save pointer to keep track of the continuation point that immediately follows the separator instance. For instance, line 8 sets save to point to the 'J' in "Jan" and returns a pointer to the string "Wed". When the first parameter to strtok_r() is NULL, this continuation point determines where the function will look for the next delimiter. On line 9, then, strtok_r() starts looking at the 'J' and finds the comma just after "Jan"; strtok_r() then updates save to point to the first '0' of "01" and returns the token "Jan".

There are subtle aspects to the behavior of strtok() and strtok_r() that require consideration. First, these functions do not return copies of the tokens; they modify the original string and return a pointer into it. Specifically, the first separator character (any of the characters in the separator string sep) that follows each token is replaced with the null byte. Table A.6 illustrates two snapshots of the string pointed to by line in Code Listing A.41, both the original version and after two calls to strtok_r(). After two calls, the first two commas in the line have been overwritten with a null byte. Because of this modification, the pointer that is returned is a complete string. The first call to strtok_r() returns a pointer to the 'W' at the beginning of the line, but the token returned is the string "Wed". The fact that the original string gets modified means that string constants cannot be tokenized. Since string constants are stored in .rodata, tokenizing one would require writing a null byte into read-only memory.

Original string contents:
`W`	`e`	`d`	`,`	`J`	`a`	`n`	`,`	`0`	`1`	`,`	`2`	`0`	`2`	`0`	`,`	`N`	`e`	`w`		`Y`	`e`	`a`	`r`	`'`	`s`	`\0`

After line 9 of Code Listing A.41:
`W`	`e`	`d`	`\0`	`J`	`a`	`n`	`\0`	`0`	`1`	`,`	`2`	`0`	`2`	`0`	`,`	`N`	`e`	`w`		`Y`	`e`	`a`	`r`	`'`	`s`	`\0`

Table A.6: The contents of the line variable before line 8 and after line 9 (the first two calls to strtok_r()) of Code Listing A.41
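
Because the tokenizer writes null bytes into its input, one way to work with a string constant is to copy it into a writable array first. The following is a minimal sketch of that idea; the helper name count_fields() and the 64-byte limit are assumptions made for this illustration and do not come from the listings above.

```c
#define _POSIX_C_SOURCE 200809L /* expose strtok_r() */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the comma-separated tokens in text
   without modifying it. The 64-byte limit is an arbitrary choice
   for this sketch; longer input is truncated. */
size_t
count_fields (const char *text)
{
  char copy[64];
  /* Copy into a writable array; text itself may be in .rodata */
  strncpy (copy, text, sizeof (copy) - 1);
  copy[sizeof (copy) - 1] = '\0';

  size_t count = 0;
  char *save = NULL;
  char *token = strtok_r (copy, ",", &save);
  while (token != NULL)
    {
      count++;
      token = strtok_r (NULL, ",", &save);
    }
  return count;
}
```

Calling count_fields ("Wed,Jan,01,2020,New Year's") is safe here because only the local copy is overwritten with null bytes.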

Second, because the pointers returned are to the original string, freeing or modifying the original data can corrupt the tokens. In the CSV example above, we assumed that the entire file contents were read into memory. What if this were not the case? Instead, suppose the program reads one line of the file into memory at a time, repeatedly overwriting the buffer variable line. This approach would allow the first line to be tokenized successfully, and the fields of the holiday_t struct would be pointing to their tokens. But the next line of the file would get read into this exact same memory. As such, the fields of the holiday_t would now be pointing to characters in the second line of data, not the first. On the other hand, perhaps the line variable points to a dynamically allocated buffer that is created anew for each line of the file. In this case, the holiday_t fields could be corrupted if line is freed; the fields would still be pointing to the heap where the contents of line were stored, but that part of the heap would now be invalid.
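
One common way to avoid this corruption is to store a copy of each token rather than the pointer that the tokenizer returns. The sketch below shows the idea for a single field; the helper name copy_first_field() is hypothetical, and the caller is assumed to free() the result.

```c
#define _POSIX_C_SOURCE 200809L /* expose strtok_r() and strdup() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of the first
   comma-separated field in line, or NULL if there is none. The
   caller must free() the result. */
char *
copy_first_field (char *line)
{
  char *save = NULL;
  char *token = strtok_r (line, ",", &save);
  if (token == NULL)
    return NULL;
  /* strdup() allocates new storage, so the copy does not point
     into line and survives later changes to that buffer */
  return strdup (token);
}
```

With this approach, reading the next line of the file into the same buffer leaves the stored copy intact, at the cost of one allocation per token.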

Third, strtok() and strtok_r() ignore repeated instances of separators. This behavior can be problematic for CSV files, as fields can be blank. For example, assume that the CSV file from above was modified to include a location field between the year and name of the holiday. If this field were missing for the New Year’s Day holiday, that line (and another) of the file might look like:

Wed,Jan,01,2020,,New Year's Day
Fri,Feb,14,2020,Charlottesville,Valentine's Day

One call to strtok_r() would get the string "2020". The next call would then get the string "New Year's Day", rather than an empty string to indicate the missing location field. There are times when skipping over repeated separators is advantageous (consider skipping over repeated whitespace in a C source code file), but there are also times where it can lead to incorrect results. The strtok() and strtok_r() functions work well for the former cases, but other approaches are needed for the latter.
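
As one possible approach for the latter cases, the fields can be split by hand with strchr(), which treats every separator as a field boundary. The following is a sketch under stated assumptions: the helper name split_keep_empty() and its parameters are hypothetical, and, like strtok_r(), it modifies the line in place.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place on sep, storing up to
   max pointers in fields. Unlike strtok_r(), two adjacent
   separators produce an empty string. Returns the number of fields
   stored; any fields beyond max are ignored. The sep character must
   not be '\0'. */
size_t
split_keep_empty (char *line, char sep, char *fields[], size_t max)
{
  size_t count = 0;
  char *start = line;
  while (count < max)
    {
      fields[count++] = start;
      char *end = strchr (start, sep);
      if (end == NULL)
        break;          /* last field runs to the end of the line */
      *end = '\0';      /* terminate this field */
      start = end + 1;  /* next field begins after the separator */
    }
  return count;
}
```

For the line "Wed,Jan,01,2020,,New Year's Day", this sketch yields six fields, with the fifth being the empty string that marks the missing location.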

Code Listing A.42 demonstrates two techniques for splitting a file’s contents based on lines. To start, assume that the file’s contents have been read into file_contents and (for simplicity) this buffer is null-terminated. Lines 5 – 17 store copies of the lines without using strtok() or strtok_r(). Instead, these lines use start_of_line to keep track of where a line begins (initially, the start of the file contents). Line 6 then uses strchr() to identify the end of the first line by looking for the '\n' character. Line 10 uses strndup() to make a dynamically allocated copy of the line. The semantics of strndup() are like strncat(); it will copy up to the specified number of bytes and it will add on a null terminator. Since end_of_line - start_of_line is exactly the number of bytes in the line (not counting the '\n'), line 10 makes a complete null-terminated copy and stores the address of this copy into an array. Line 12 then moves start_of_line just past the newline character so that it points to the beginning of the next line. Line 14, then, starts looking for the next end_of_line after that point. The additional copy on line 17 is necessary because the while-loop will terminate when there are no more newline characters; when this occurs, start_of_line is pointing to the last line (which has no '\n' after it). Line 17 can use the standard strdup() instead of strndup(), because the original file_contents are null terminated.

/* Code Listing A.42:
   Tokenizing and storing file lines without and with strtok()
 */

char *start_of_line = file_contents;
char *end_of_line = strchr (start_of_line, '\n');
while (end_of_line != NULL)
  {
    line_copies[lineno++] =
      strndup (start_of_line, end_of_line - start_of_line);
    /* Next line starts after the '\n' */
    start_of_line = end_of_line + 1;
    /* Find the next end of line */
    end_of_line = strchr (start_of_line, '\n');
  }
/* Copy the last line */
line_copies[lineno] = strdup (start_of_line);

lineno = 0;
char *line = strtok (file_contents, "\n");
while (line != NULL)
  {
    /* Dynamically allocate a copy and store the pointer */
    all_lines[lineno++] = strdup (line);
    line = strtok (NULL, "\n");
  }

When line 19 begins processing, the original file_contents have not been altered in any way. The use of strchr() and strndup() in lines 6 – 17 does not write anything into this buffer. Consequently, we can begin to use strtok() and start over. Through each iteration of the while-loop in lines 21 – 26, the line variable points to the current (null-terminated) line. To keep copies of the lines, again, we use strdup(). In practice, it does not make sense to perform both of these loops, since they are keeping track of the same data. The purpose of combining them in Code Listing A.42 is to show that they ultimately end up as two equivalent ways to accomplish the same goal; the only difference is that the strtok() approach modifies the original file_contents, whereas the strchr() approach does not. The while-loop structure in lines 21 – 26 is a common approach for using strtok().

10.7.3. Converting Between Strings and Integers¶

One final common task in relation to strings involves converting numeric values back and forth between representations. When reading user input or data from a file, numeric text data ("123") might need to be converted to one of C’s integer primitive representations (123) for easy manipulation or compact storage. On the other hand, integers often need to be converted to their string format to append to other text data (e.g., writing the HTTP header line "Content-Length: 123\r\n" when the length has been stored as a size_t variable). Code Listing A.43 illustrates the difference in the internal representations of 123 (as a uint8_t) and "123" (as a string).

/* Code Listing A.43:
   Printing the byte contents of an integer and string
 */

uint8_t integer = 123;
char string[] = "123";

uint8_t *walker = &integer;
for (size_t i = 0; i < sizeof (integer); i++)
  printf ("%02" PRIx8 "  ", *walker++);
printf ("\n");

walker = (uint8_t *) &string;
for (size_t i = 0; i < sizeof (string); i++)
  printf ("%02" PRIx8 "  ", *walker++);
printf ("\n");

The particular for-loops here might appear odd, but they are used to show that the two variables are being handled in the same way. Specifically, the loop on line 9 will only have one iteration, because integer is only one byte in size (as a uint8_t). At the same time, the loop on line 14 deliberately uses sizeof() on a string (instead of strlen()), which is a common source of bugs; however, this approach lets us examine all four bytes in the char array.

The first loop demonstrates that the internal representation of integer is the single byte 0x7b. The second loop demonstrates that the representation of string is the four consecutive bytes 0x31, 0x32, 0x33, and 0x00. (Interpreting these four bytes at once as a 32-bit integer would produce the value 0x00333231 on a little-endian machine.) The issue of conversion focuses on ways to translate automatically between these two byte representations, which do not appear to be similar.
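
To make the relationship between the two representations concrete, the translation can be written out by hand: each ASCII digit is turned into its numeric value by subtracting '0' (0x30), and the running total is multiplied by 10 before the next digit is added. The helper below is a hypothetical sketch for illustration only; it handles unsigned decimal digits and performs no overflow checking.

```c
#include <stdint.h>

/* Hypothetical helper: converts a string of decimal digits such as
   "123" to its integer value, stopping at the first non-digit. No
   overflow checking is performed in this sketch. */
uint32_t
digits_to_int (const char *text)
{
  uint32_t value = 0;
  while (*text >= '0' && *text <= '9')
    {
      /* '1' is 0x31 and '0' is 0x30, so '1' - '0' is the number 1 */
      value = value * 10 + (uint32_t) (*text - '0');
      text++;
    }
  return value;
}
```

For "123", the running total goes from 1 to 12 to 123, which is the single byte 0x7b.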

C library functions – <stdlib.h>

long strtol(const char *str, char **endptr, int base);: Translate the numeric string str into a long representation for the provided base.

The strtol() function handles the conversion from string to integer. [3] The str parameter points to a string containing the number, and base indicates the numeric base (10 for decimal, 16 for hexadecimal, or an arbitrary base such as 13 for base-13). The string can contain multiple numeric values separated by non-numbers (e.g., "123 456 -42" for the three values 123, 456, and -42); each call converts one of them. When endptr is non-NULL (i.e., it is used as a call-by-reference parameter), the pointer it refers to will be set to point to the first character after the current number.

Code Listing A.44 demonstrates multiple ways that strtol() can be used. Line 5 starts by creating a string "123 -32 alpha". This can be broken down into the integer values 123 and -32, but the "alpha" cannot be interpreted as an integer. Lines 9, 14, and 20 use strtol() to parse this string into these numeric components. Line 9 uses the original numbers string as the first argument, whereas lines 14 and 20 use end for this parameter. Using end on the subsequent calls is necessary, because strtol() does not keep track of any prior progress; using numbers each time would repeatedly return the first value, 123.

/* Code Listing A.44:
   Converting from string to integer representations
 */

char *numbers = "123 -32 alpha";
char *end = NULL;

/* strip off the 123 and make end point to " -32 alpha" */
long result = strtol (numbers, &end, 10);
assert (errno != EINVAL); // errno is not EINVAL: no error reported
printf ("Result = %ld; end = '%s'\n", result, end);

/* continue from " -32 alpha" to get the -32 */
result = strtol (end, &end, 10);
assert (errno != EINVAL); // errno is not EINVAL: no error reported
printf ("Result = %ld; end = '%s'\n", result, end);

/* continue from " alpha", which cannot be processed */
char *final = NULL;
result = strtol (end, &final, 10);
assert (result == 0); // strtol() returns 0 when nothing is converted
printf ("Result = %ld\n", result);
printf ("end = '%s'; final = '%s'\n", end, final);
assert (final == end);

/* use a bizarre base-11 format, ignoring endptr */
numbers = "60a1";
result = strtol (numbers, NULL, 11);
printf ("Result = %ld\n", result);

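The strtol() function provides two ways to check for errors in the processing. The first is the errno global variable. When no conversion can be performed (such as for " alpha"), strtol() returns 0, and POSIX permits, but does not require, the implementation to set errno to the positive constant EINVAL; some C libraries (such as the one on macOS) do so, while others (such as glibc) leave errno unchanged, so an assertion that errno equals EINVAL after line 20 holds only on the former. In addition, strtol() sets errno to ERANGE when the value does not fit in a long. Note that strtol() never resets errno to 0 on success, so code that relies on errno should set it to 0 before the call. The assert() calls on lines 10 and 15 pass, as the calls to strtol() on lines 9 and 14 succeed and report no error. The other mechanism, which is portable, is the endptr parameter. Line 20 uses end as the input, pointing to the string " alpha". After strtol() runs, final is also set to this location. If the endptr ends up at the beginning of the string (i.e., final after the call matches end, which hasn’t changed), then strtol() was unable to process any data successfully; the assert() on line 24 confirms that this is the case for line 20.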

Lines 27 – 29 demonstrate other features of strtol(). First, the endptr parameter can be (and is often) ignored by passing NULL. Even with a NULL endptr, we could still check errno, with the caveat that not every implementation reports EINVAL when nothing is converted. Second, strtol() supports any base from 2 to 36. Conventionally, C numeric constants use 0x as a prefix to indicate hexadecimal format (e.g., 0x7ff) and a leading 0 to indicate octal (e.g., 0644); otherwise, the number is interpreted as decimal. Importantly, this means that C (prior to the C23 standard) has no convention for writing binary constants. The strtol() function fills this gap for string input by accepting 2 as the base parameter.
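
The checks described above can be combined into a small wrapper. The following is a sketch rather than a definitive implementation; the name parse_long() and its policy of rejecting trailing characters are assumptions made for this example. It relies on the endptr comparison to detect that nothing was converted, and it clears errno first so that a stale value is not mistaken for a new error.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parses all of text in the given base (0 or
   2 through 36) and stores the value in *out. Returns false if no
   digits were found, if extra characters follow the number, or if
   strtol() reported an error such as ERANGE. */
bool
parse_long (const char *text, int base, long *out)
{
  char *end = NULL;
  errno = 0; /* strtol() never clears errno on its own */
  long value = strtol (text, &end, base);
  if (errno != 0 || end == NULL) /* out of range or invalid base */
    return false;
  if (end == text)               /* nothing was converted */
    return false;
  if (*end != '\0')              /* trailing characters remain */
    return false;
  *out = value;
  return true;
}
```

As an example of base 2, parse_long ("101101", 2, &value) stores 45 in value, while parse_long (" alpha", 10, &value) returns false without touching value.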

Converting values in the opposite direction, from integers to strings, is mostly intuitive, because it is very similar to one of the first functions novices learn in C: printf(). The main difference is that the snprintf() function takes two parameters before the format string to indicate the destination and the maximum number of bytes. (The sprintf() function does not take a maximum number of bytes, which makes this function unsafe in the same ways as strcpy() or strcat(). As such, sprintf() should never be used.)

C library functions – <stdio.h>

int snprintf(char *str, size_t size, const char *format, ...);: Format a string in memory similar to printing to the screen.

Code Listing A.45 highlights the similarities between snprintf() and the more familiar printf(). Both functions take a format string ("%d" or "%d\n") to indicate how the number should be formatted, along with the number as an additional argument. The primary difference is that snprintf() also indicates a destination to write the formatted value into (the buffer). Once the value has been written into the buffer, it can be printed (if needed) using the %s format specifier.

/* Code Listing A.45:
   Converting from integer to string is similar to printing to standard I/O
 */

int number = 42;
char buffer[3];

/* Print the number to the screen */
printf ("%d\n", number);

/* "Print" the number into the buffer */
snprintf (buffer, 3, "%d", number);

/* Print the string */
printf ("%s\n", buffer);

Recall from the discussion of strncpy() and strncat() that the two functions had different interpretations of the respective maximum size parameter, n. Specifically, strncpy() would copy a maximum of n bytes, potentially leaving the string un-terminated if those n bytes did not contain the null byte. In contrast, strncat() would write a maximum of n+1 bytes, because it always appends the null byte. The snprintf() function adds a third interpretation: it will print up to n-1 bytes and then append the null byte. Frustrating! Code Listing A.46 summarizes this situation. Since both strncat() and snprintf() guarantee null termination, they both end up writing a null byte; however, strncat() appends this after the two bytes 'h' and 'e', whereas snprintf() does so after only one byte '4'. Unlike the other two, strncpy() does not guarantee null termination.

/* Code Listing A.46:
   Three interpretations of the maximum size parameter
 */

strncat (buffer_1, "hello", 2); // copies 3 bytes 'h', 'e', '\0'
strncpy (buffer_2, "hello", 2); // copies 2 bytes 'h' and 'e'
snprintf (buffer_3, 2, "%d", 42); // copies 2 bytes '4' and '\0'
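
To observe these differences directly, the three calls can be run on buffers that have been filled with a visible marker byte beforehand. This is a hypothetical demonstration, not part of the original listing: the function name fill_three_ways() and the '#' filler are assumptions, and each buffer is assumed to hold at least four bytes. Some compilers may warn about the deliberate truncation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical demonstration: every buffer starts filled with '#'
   so the bytes that each call writes are easy to spot. Each buffer
   must hold at least 4 bytes. */
void
fill_three_ways (char *cat_buf, char *cpy_buf, char *fmt_buf)
{
  memset (cat_buf, '#', 4);
  memset (cpy_buf, '#', 4);
  memset (fmt_buf, '#', 4);

  cat_buf[0] = '\0';               /* strncat() appends to a string */
  strncat (cat_buf, "hello", 2);   /* writes 'h', 'e', '\0' */
  strncpy (cpy_buf, "hello", 2);   /* writes 'h', 'e' only */
  snprintf (fmt_buf, 2, "%d", 42); /* writes '4', '\0' */
}
```

Afterward, the first four bytes of the buffers are 'h' 'e' '\0' '#', then 'h' 'e' '#' '#', and then '4' '\0' '#' '#', respectively.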

Bug Warning

The snprintf() function, once again, creates a very common vector for buffer overflow vulnerabilities. One of the challenges (and common mistakes) arises from anticipating what is a likely integer value as compared to what is a possible one. The buffer from Code Listing A.45 is not a safe size for the format specifier %d. If the size argument matches the buffer, a value that is too long is silently truncated; if the size argument is larger than the buffer, the write overflows it. As an int is typically four bytes, its string form can require as many as 12 bytes (for example, INT_MIN is "-2147483648", which is 11 characters including the negative sign, plus the null byte). As such, the buffer should generally be sized for the largest possible value rather than the expected one. One simple way to do this (and to ensure the bytes are all initialized to 0) is to use calloc() to allocate enough space. If needed, realloc() could then be used to shrink the buffer.

char *buffer = calloc (12, sizeof (char));
snprintf (buffer, 12, "%d", 35);

/* Shrink it down to size, keeping an extra byte for '\0' */
char *shrunk = realloc (buffer, strlen (buffer) + 1);
if (shrunk != NULL)
  buffer = shrunk;
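
An alternative to guessing the worst-case size is to ask snprintf() how much space the output needs. Its return value is the number of characters the full output requires (not counting the null byte), and standard C allows a NULL destination when the size is 0. The helper below is a sketch of this two-pass technique; the name int_to_string() is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a heap-allocated string holding the
   decimal form of value, sized exactly. The caller must free() it. */
char *
int_to_string (int value)
{
  /* With a size of 0, snprintf() writes nothing and returns the
     number of characters the output needs, excluding '\0' */
  int needed = snprintf (NULL, 0, "%d", value);
  if (needed < 0)
    return NULL;

  char *text = malloc ((size_t) needed + 1);
  if (text == NULL)
    return NULL;
  snprintf (text, (size_t) needed + 1, "%d", value);
  return text;
}
```

The same return value can be used to detect truncation after the fact: if snprintf() returns a value greater than or equal to the size that was passed in, the output did not fit.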

Since snprintf() takes a normal format string (which can contain a mix of string data and multiple format specifiers), it creates an easier mechanism to concatenate multiple values together into a single string. Code Listing A.47 demonstrates a simple example of this practice to build the string "5 + 10 = 15\n" using int variables.

/* Code Listing A.47:
   Using snprintf() to build a string that mixes integer and string data
 */

int x = 5, y = 10;
char *sum = calloc (100, sizeof (char));
snprintf (sum, 100, "%d + %d = %d\n", x, y, x + y);

Chapter 4 introduces the structure of HTTP headers. These headers consist of a series of lines, each ending in "\r\n". As one example, consider the following header snippet:

HTTP/1.0 200 OK\r\n
Content-Length: 37\r\n
Connection: close\r\n
Content-Type: text/html\r\n
\r\n

Assuming some of these fields are stored in variables, this could be constructed with a single snprintf() call, as shown in Code Listing A.48.

/* Code Listing A.48:
   Building an HTTP response header with one snprintf()
 */

snprintf (header, MAX_HEADER_LENGTH,
          "HTTP/%d.%d %d %s\r\n" // version, code, status
          "Content-Length: %d\r\n" // length
          "Connection: close\r\n"
          "Content-Type: %s\r\n\r\n", // type
          vers_major, vers_minor, code, status, length, type);

Code Listing A.48 relies on the fact that adjacent string constants are concatenated by the C compiler. As such, lines 6 – 9 all build a single format string. Displaying them as separate lines in the code makes the organization easier for the programmer to understand. This string could also be built a line at a time with repeated calls to strncat(), but this version simplifies the processing as a single function call.
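
For experimenting with this technique, the call can be wrapped in a function that also reports whether the header fit. This is a simplified sketch, not the listing's exact code: the name build_header() is hypothetical, the HTTP version is hard-coded as 1.0, and the length is assumed to be a size_t (hence the %zu specifier).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper around a single snprintf() call: returns
   the header length, or -1 if it did not fit in size bytes.
   "HTTP/1.0" and "Connection: close" are hard-coded to keep the
   sketch short. */
int
build_header (char *header, size_t size, int code, const char *status,
              size_t length, const char *type)
{
  int written = snprintf (header, size,
                          "HTTP/1.0 %d %s\r\n"        // code, status
                          "Content-Length: %zu\r\n"   // length
                          "Connection: close\r\n"
                          "Content-Type: %s\r\n\r\n", // type
                          code, status, length, type);
  if (written < 0 || (size_t) written >= size)
    return -1; /* formatting error or truncated output */
  return written;
}
```

Calling build_header (header, sizeof (header), 200, "OK", 37, "text/html") with a sufficiently large buffer produces the header snippet shown earlier.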

Bug Warning

Another common use of snprintf() is to inject formatted numbers into the middle of an existing string. For instance, consider an event logging mechanism that uses a common reporting form for events. The snprintf() function could be used to fill these in, but requires special care as shown in the example below.

char record[] = "Month [   ] Day [  ] Year [    ]";
snprintf (record + 7, 4, "%s", month);  // write [mon]
snprintf (record + 17, 3, "%-2d", day); // write [da]
snprintf (record + 27, 5, "%4d", year); // write [year]

/* Restore the ] characters that snprintf() overwrote with the
   null byte */
record[10] = ']';
record[19] = ']';
record[31] = ']';

The problem is that snprintf() always null terminates what it writes. As such, if line 2 writes the month as "Jan", the record variable would become the string "Month [Jan". The rest of the string would still be there in memory, but snprintf() overwrote the first closing bracket with the null byte; printing the string at that point would stop there instead of showing the full record. Lines 8 – 10 fix this problem by restoring the brackets to their original locations, overwriting the null bytes that snprintf() had added.

Also note that the original record string created three spaces between the brackets for the month, two for the day, and four for the year. The size parameter for lines 2 – 4 added one to each of these values (four, three, and five, respectively), because snprintf() includes the null byte in this count. Thus, writing the string "Jan" into the month field requires writing four bytes, not three.
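
When the whole record can be regenerated, a different technique avoids the problem entirely: format every field in one snprintf() call so that the only null byte written is the one that ends the string. The following sketch illustrates this alternative and is not the in-place approach shown above; the name format_record() is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical alternative: format the whole record at once, so
   the only null byte snprintf() writes is the one ending the
   string. The precision in %-3.3s caps the month at three
   characters. Returns the value returned by snprintf(). */
int
format_record (char *record, size_t size, const char *month, int day,
               int year)
{
  return snprintf (record, size,
                   "Month [%-3.3s] Day [%-2d] Year [%4d]",
                   month, day, year);
}
```

The field widths keep the brackets at the same indexes as before (10, 19, and 31) for a one- or two-digit day and a year of up to four digits.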

[1]

The use of lexicographic instead of alphabetic is common and intentional in computing, as the former is more general and works with non-alphabetical characters. For instance, it does not make sense to characterize the alphabetical ordering of “15” as compared to “3” since neither contains letters in the alphabet. However, “15” comes before “3” in lexicographical ordering.

[2]

Determining if a password is strong is significantly more complicated than this function, and this should not be used for real security purposes. For instance, the password also needs to be compared with common dictionary words, previous passwords, easily guessed patterns, etc. This example just illustrates how these character class checks can be used as part of this procedure.

[3] C also has an older function atoi() for this purpose, though its use is discouraged in new code. The strtol() function adds explicit support for multiple bases, whereas atoi() only interprets the string as decimal; strtol() also returns a long rather than the int returned by atoi(), supporting larger values. Finally, and most importantly, atoi()’s error handling is weak, returning 0 for bad input; as such, it is not possible to distinguish between "0" and truly bad input.