It is often the case that a seemingly simplistic idea or design can turn out to be surprisingly complicated. We saw one example of this previously in the discussion of pointers. The definition of equating a pointer with an address seems straightforward; the implications of their usage for dynamic memory allocation, call-by-reference parameters, variable sizes, and so forth quickly become challenging for the programmer. The same can be said for strings in C. As with pointers, we start with a basic definition:
A string is an array of characters ending in the null byte.
To interpret the situation in a slightly different way, the C programming language does not actually have a string type in the intuitive sense that humans expect. Instead, C just provides a thin veneer of interface for working with fixed-size arrays of char data. A string in the C sense consists of the array of chars that are (typically) observable to a human reader, with one additional char added to the end of the array. Code Listing A.36 illustrates this fact by defining the string "Hello" in a very unusual manner: as an array of six uint8_t values. One key idea here is that everything in the machine is just a number. The meaning and interpretation of those bytes as the string "Hello" is created by the %s format specifier, which tells printf() to present the ASCII interpretation of the bytes to the user instead of the numeric values.
 1 | /* Code Listing A.36:
 2 |    Printing "Hello" and turning it into "Ha!"
 3 |  */
 4 |
 5 | uint8_t string[] = { 72, 101, 108, 108, 111, 0 };
 6 | printf ("The string is '%s'\n", (char *) string);
 7 |
 8 | string[1] = 'a';
 9 | string[2] = 0x21;
10 | string[3] = '\0';
11 | printf ("The string is '%s'\n", (char *) string);
Since the string is an array, its individual elements can be accessed and modified; line 8 changes the 'e' to 'a', line 9 changes the first 'l' to '!', and line 10 changes the second 'l' to the null byte '\0' (literally the number 0). These changes cause line 11 to print the string as "Ha!" instead of the original "Hello". These lines did not change the 'o' byte stored as string[4], nor did the original '\0' stored in string[5] change; both bytes are still there in memory as part of the original array. The only reason they do not get printed by line 11 is, again, because of the %s format specifier, which tells printf() to stop printing at the first null byte. Table A.5 illustrates the memory content of this array of chars from before and after the modifications, based on three different interpretations for formatting. Note that the ASCII interpretation '\0' is not displayed to the screen, but is shown here for completeness.
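Before the modifications:
| Interpretation | string[0] | string[1] | string[2] | string[3] | string[4] | string[5] |
|---|---|---|---|---|---|---|
| ASCII interpretation %c or %s | H | e | l | l | o | \0 |
| Hexadecimal format %x | 48 | 65 | 6c | 6c | 6f | 00 |
| Decimal format %d | 72 | 101 | 108 | 108 | 111 | 0 |

After the modifications:
| Interpretation | string[0] | string[1] | string[2] | string[3] | string[4] | string[5] |
|---|---|---|---|---|---|---|
| ASCII interpretation %c or %s | H | a | ! | \0 | o | \0 |
| Hexadecimal format %x | 48 | 61 | 21 | 00 | 6f | 00 |
| Decimal format %d | 72 | 97 | 33 | 0 | 111 | 0 |

Table A.5: Three interpretations of the bytes that make up the strings from Code Listing A.36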
It is important to observe that there are two things missing from this representation. First, the quotation marks " used to begin and end the string appear only in the program’s source code. They are a construct of the C programming language (and other languages, as well), but they do not exist in the memory representation of the string. C needs the quotes to know where the string begins and ends in the source code. The machine does not; the string begins at the address of the first character and ends at the null byte.
Second, there is no explicit storage of the string length. This fact follows from the design structure that strings are null-terminated arrays. The designers of the language made the choice that the string length could always be dynamically determined by traversing through memory until the null byte was found. This design choice—using a single extra byte for a null terminator instead of four bytes to store an explicit length field—is a quintessential example of the space-time tradeoff that system designers face. By requiring extra time to search the string manually, the language could save three bytes of space per string; as programs could store and work with many, many strings, the cumulative space savings of three bytes per string could be potentially very large. At the time the language was designed, execution time was cheap but memory space was prohibitively expensive; thus, this design decision was a good tradeoff at the time, given the circumstances.
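To make the cost of this design concrete, the traversal can be sketched in a few lines. This is only an illustration of the idea, not the C library's actual implementation; the name count_until_null() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the C library): counts the bytes
   that precede the null terminator, the same job strlen() performs. */
size_t
count_until_null (const char *s)
{
  assert (s != NULL);
  size_t length = 0;
  while (s[length] != '\0')   /* the terminator itself is not counted */
    length++;
  return length;
}
```

Every call walks the whole string again, so the time required grows with the length of the string; that is the "time" half of the tradeoff described above.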
Given this understanding of strings as arrays of chars, we can now focus on issues related to using them in practice. Specifically, we can simply use the more conventional and readable notation of "Hello" rather than the (equivalent and perhaps more accurate) { 'H', 'e', 'l', 'l', 'o', '\0' }, assuming that the reader has the correct mental model of the computer’s internal representation. The first important consideration to highlight at this point, then, is the question of where these six bytes are actually stored in memory.
Code Listing A.37 illustrates this point by creating three different versions of the string "Hello". The differences between lines 5, 6, and 7 are small but significant, which the other lines of the code reveal. Assuming this code is run inside the body of a function, all three variables (array, pointer, and heap) are local and associated with the idea of storage on the stack. Line 5, by declaring a local array, behaves in the intuitive manner in this regard; the array variable indicates an array of six bytes that are placed in the function’s stack frame. That is, line 5 operates in a similar manner to declaring an array of int values or any other such local array. Line 6, in contrast, places a pointer variable on the stack; the actual bytes of the string are placed into the program’s read-only data section (.rodata). Line 7 also places a pointer on the stack, but the strdup() function returns a pointer to a dynamically allocated copy of the string’s bytes on the heap. In short, these three lines illustrate how we can determine which memory segment (stack, data, or heap) will contain the bytes of the string.
 1 | /* Code Listing A.37:
 2 |    Three different ways to create the string "Hello"
 3 |  */
 4 |
 5 | char array[] = "Hello";
 6 | char *pointer = "Hello";
 7 | char *heap = strdup ("Hello");
 8 |
 9 | /* None of these is the string length */
10 | printf ("Sizes: %zu %zu %zu\n", sizeof (array),
11 |         sizeof (pointer), sizeof (heap));
12 |
13 | array[1] = 'a';
14 | printf ("Array version: %s\n", array);
15 | heap[1] = 'a';
16 | printf ("Heap version: %s\n", heap);
17 | pointer[1] = 'a'; // run-time exception
The printf() call that prints the sizes (lines 10 and 11) illustrates a very common source of confusion for those new to the intricacies of C strings. Recall that the sizeof() operator returns the number of bytes required to store a particular variable. In the cases of the pointer and heap variables, sizeof() will always return the same answer regardless of the string: 8 (assuming this is a 64-bit architecture). Both of these variables are pointers, so sizeof() returns the size of an address; sizeof() never dereferences a pointer to determine the size (or length) of the object being pointed to. In the case of the array variable, sizeof() returns the total number of bytes allocated for the variable on the stack: 6. That is, calling sizeof() on the array version of the declaration will include the null byte. Furthermore, assume that we had modified the array variable as in Code Listing A.36, changing the string from "Hello" to "Ha!"; sizeof() would still return an answer of 6 (not 4), because that is how much storage space the compiler statically associated with the variable named array. In short, sizeof() should never be used to determine the length of a string; it does not ever examine the actual contents. Instead, if you need to compute the length of a string, you should always use strlen() or strnlen().
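As a small sketch of this distinction, the function below repeats the "Ha!" modification on a local array and reports both numbers. The function name storage_and_length() and its out-parameter are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical demo: returns the storage size of a local array and
   stores the current string length in *length_out. */
size_t
storage_and_length (size_t *length_out)
{
  assert (length_out != NULL);
  char array[] = "Hello";        /* six bytes, including '\0' */
  array[1] = 'a';
  array[2] = '!';
  array[3] = '\0';               /* the string is now "Ha!" */
  *length_out = strlen (array);  /* counts bytes before the first '\0' */
  return sizeof (array);         /* fixed by the declaration */
}
```

The return value is 6 regardless of the contents, while the length stored through length_out is 3 after the modification.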
Bug Warning
The use of pointers to declare strings leads to a number of subtle misunderstandings that end up as bugs in programs. One misunderstanding is to overlook the difference between initializing a pointer to the empty string ("") as opposed to NULL. The empty string is a char array that consists of a single char: the null byte '\0'. As such, initializing a char* to the empty string makes the pointer point to a valid memory location (the address of the null byte). In contrast, setting the char* variable to NULL makes it point to nothing; dereferencing the pointer would produce a segmentation fault. This point of confusion leads to potential errors when the strings are used. Consider the following example:
1 | char *empty = "";
2 | char *null = NULL;
3 | printf ("Empty: %s; null: %s\n", empty, null);
Although there is no * on line 3, this code involves two pointer dereferences. That is, when printf() processes the %s format specifiers, it needs to get the contents of the string by dereferencing the empty and null pointers. When the empty string is processed, nothing interesting happens; it is a valid string, but it has no characters to print.
In contrast, when printf() encounters the null pointer, there is a problem; processing %s involves dereferencing the pointer (which is NULL), so this line would traditionally cause a segmentation fault. Passing NULL for %s is undefined behavior in the C standard, but some implementations of the C library (glibc, for example) detect this case and avoid the crash by printing the string (null). This exception is made only for NULL exactly. If the pointer is not NULL, but the value is not a valid address (e.g., try changing the code above to use char *null = (char *)1;), printf() will most likely cause a segmentation fault.
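Rather than relying on how a particular C library reacts to NULL, the check can be made explicit before printing. The following is a minimal sketch; the helper name printable() is hypothetical, and the "(null)" placeholder simply mimics the text described above.

```c
#include <stddef.h>

/* Hypothetical helper: returns s unchanged when it is a valid pointer,
   or a placeholder string when s is NULL, so the result is always
   safe to pass to %s. */
const char *
printable (const char *s)
{
  return (s != NULL) ? s : "(null)";
}
```

With this helper, the earlier example could be written as printf ("Empty: %s; null: %s\n", printable (empty), printable (null)); without depending on library-specific behavior. Note that it only guards against NULL; it cannot detect other invalid addresses.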
C library functions – <string.h>
char * strdup(const char *s1);
size_t strlen(const char *s);
size_t strnlen(const char *s, size_t maxlen);
Note
Many of the C string library functions have a version that starts with str and a version that starts with strn. The strn versions take an additional parameter (n) that specifies a maximum number of bytes to operate on. The n parameter provides a safety termination of the operation in case the null byte that is supposed to terminate the string has been overwritten. For instance, strlen() would continue scanning the bytes following the intended string until a random null byte is encountered. Consequently, calling strlen() on such a string would return a length that is (possibly significantly) larger than the actual length. If we started with the string "hello" and the null terminator was changed, we might end up with strlen() indicating that the string is 2500 bytes in length. This incorrect response might cause a crash or some other problem later, but the call to strlen() itself will not cause direct harm.
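The effect of the n parameter can be sketched with the standard memchr() function, which searches at most a given number of bytes. This is an illustration with the same intent as strnlen(), not its actual implementation; the name bounded_length() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: looks for the null terminator in the first
   max_len bytes only.  If none is found, max_len is returned instead
   of scanning further into unrelated memory. */
size_t
bounded_length (const char *s, size_t max_len)
{
  assert (s != NULL);
  const char *end = memchr (s, '\0', max_len);
  return (end != NULL) ? (size_t) (end - s) : max_len;
}
```

If a five-byte array holds 'h', 'e', 'l', 'l', 'o' with no terminator, calling bounded_length() with a limit of 5 returns 5 rather than reading past the end of the array.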
On the other hand, some functions are considered so dangerous that the str version should never be used. In fact, many projects scan for these functions and automatically reject code submissions that contain them. The most famous example of this is the strcpy() function that copies one string into a buffer that has already been allocated. If the buffer is not big enough, strcpy() will write beyond the end of it anyway, potentially corrupting other parts of memory after the buffer. For instance, if you allocate a buffer that can store only four bytes of data, using strcpy() to copy the string "Hello world from your evil hacker friend!" will write 42 bytes of data (41 characters plus the null terminator); the first four will go into the buffer, and the remaining 38 will clobber the contents of memory (i.e., other variables) after the end of the buffer. Over the past several decades, this one programming error has been one of the most common and persistent sources of security vulnerabilities.
Given a pointer to a string, particularly an input string, it is common to investigate the string’s contents for a variety of purposes. The C standard library provides several functions that can be used to examine a string. One of the most common is strcmp(), which takes the pointers to two strings, dereferences them, and compares their contents. The return value for strcmp() can be negative, 0, or positive, with 0 indicating the strings are identical. (Some implementations always return -1 or 1 for a mismatch, but the C standard only guarantees the sign.) The negative and positive values are used to indicate the lexicographic [1] ordering (i.e., how they would appear in an alphabetized list) if there is a mismatch; strcmp ("hello", "goodbye") would return a positive value to indicate that the first argument should be ordered after the second. Switching the order of the arguments would flip the result to a negative value. Two additional common functions are strchr() and strstr(), which are used for searching within the contents of the string; strchr() looks for a specified character in the string (passed as an int rather than a char), while strstr() looks for a substring. If the character or substring is found, these functions return a pointer to the first location; otherwise, they return NULL.
C library functions – <string.h>
int strcmp(const char *s1, const char *s2);
char * strchr(const char *s, int c);
char * strstr(const char *haystack, const char *needle);
Code Listing A.38 demonstrates some common uses of these functions. Lines 5 and 6 specify two strings to work with. Line 9 then compares them using strcmp(), implicitly relying on a convention in C that 0 indicates false and anything non-zero indicates true. Since these strings do not match, strcmp() would return a non-zero value (a negative one in this particular case, because 'b' is ordered before 'e'); C interprets this value as true, so the assertion is satisfied. (Note that it is a common practice to write !strcmp(s1, s2) to evaluate if the strings are identical. If they match, strcmp() returns 0 (false) and the logical negation (!) operator negates this value to true; if they do not match, the ! would convert the non-zero value returned into false.)
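Because the sense of strcmp()'s return value is easy to misread, some programmers wrap it in small helpers. The two functions below are a sketch of that idea; the names strings_equal() and ordering() are hypothetical and are not part of the C library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true only when both strings have the same
   contents. */
bool
strings_equal (const char *s1, const char *s2)
{
  assert (s1 != NULL && s2 != NULL);
  return strcmp (s1, s2) == 0;
}

/* Hypothetical helper: reduces strcmp()'s result to exactly -1, 0,
   or 1, since only the sign of the result is meaningful. */
int
ordering (const char *s1, const char *s2)
{
  assert (s1 != NULL && s2 != NULL);
  int result = strcmp (s1, s2);
  return (result > 0) - (result < 0);
}
```

Comparing against 0 explicitly, as strings_equal() does, states the intent more directly than the !strcmp() idiom, although both behave identically.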
 1 | /* Code Listing A.38:
 2 |    Comparing strings and searching for substring/character occurrences
 3 |  */
 4 |
 5 | char *longer = "breathe";
 6 | char *shorter = "eat";
 7 |
 8 | /* Assertion holds because they are not the same */
 9 | assert (strcmp (longer, shorter));
10 |
11 | char *substr = strstr (longer, shorter);
12 | printf ("Substring starting at \"%s\" is %s\n", shorter, substr);
13 |
14 | size_t count = 0;
15 | char *walker = strchr (longer, 'e');
16 | while (walker != NULL)
17 |   {
18 |     count++;
19 |     walker = strchr (walker + 1, 'e');
20 |   }
21 |
22 | printf ("There are %zu occurrences of 'e' in %s\n", count, longer);
Line 11 checks if the string "eat" occurs anywhere as a substring in the longer string "breathe". Since it does, strstr() would return a pointer to the first 'e' in the string. Note that strstr() does not alter the original string in any way; it simply returns a pointer to the middle of the existing string. Because of this, line 12 will print the substr variable as the string "eathe", as printf() processes %s by traversing through the characters until the null byte is encountered. That is, if strstr() finds a substring anywhere, printing the returned pointer will print the contents from the first occurrence of the substring all the way to the end of the original string.
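Since the pointer returned by strstr() points into the original string, subtracting the start of the string from it gives the position of the match. The sketch below illustrates this pointer arithmetic; the name substring_offset() and the use of -1 to signal "not found" are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first occurrence of
   needle within haystack, or -1 if needle does not occur. */
long
substring_offset (const char *haystack, const char *needle)
{
  assert (haystack != NULL && needle != NULL);
  const char *match = strstr (haystack, needle);
  if (match == NULL)
    return -1;
  return (long) (match - haystack);   /* both point into haystack */
}
```

For the strings in Code Listing A.38, substring_offset ("breathe", "eat") yields 2, the index of the first 'e'.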
Lines 14 – 20 use strchr() in a loop to count the number of occurrences of a particular character, 'e' in this case. Line 15 initializes the walker variable to point to the first 'e', the third byte of the string. If line 15 had searched for 'q' instead, walker would be initialized to NULL. Within each iteration of the while-loop, line 19 finds the next location of an 'e' (if one exists). In this case, the call to strchr() indicates that it needs to start looking at walker+1, the first byte after an 'e' that has already been found. (Calling strchr(walker, 'e'); on line 19 would create an infinite loop, since it would repeatedly find the same 'e'!) Assuming the original string longer is null-terminated (as it is), the while-loop is guaranteed to terminate as written. The function strchr() will stop and return NULL once it encounters the null byte. Even if walker ends up pointing to the last character of the string (as it does in this case), walker+1 can never accidentally skip over the null terminator, because walker is always set to point to an 'e'.
Bug Warning
Many languages have built-in string types that allow easy comparison with the standard equality operator. Again, C is not one of those languages. The only safe way to check if two strings have the same contents is to use strcmp(). Using other comparisons, such as the == operator, can lead to erroneous results if not interpreted correctly. With primitives like int and char, this operator compares the values and returns true if the values match. The same is true of strings (and pointers in general), but this fact does not match our intuitions. Specifically, the value of a string (char*) variable or any other pointer is the address being pointed to. That is, the == operator checks if the pointers are pointing to the same location, not that the strings themselves match. The following example illustrates key features of this distinction:
 1 | char *first = "hello";
 2 | char *second = "hello";
 3 | char third[] = "hello";
 4 | char fourth[] = "hello";
 5 |
 6 | printf ("Comparing first and second:\n");
 7 | printf ("Same contents? %s\n", (! strcmp (first, second) ? "yes" : "no"));
 8 | printf ("Same string? %s\n\n", (first == second ? "yes" : "no"));
 9 |
10 | printf ("Comparing first and third:\n");
11 | printf ("Same contents? %s\n", (! strcmp (first, third) ? "yes" : "no"));
12 | printf ("Same string? %s\n\n", (first == third ? "yes" : "no"));
13 |
14 | printf ("Comparing third and fourth:\n");
15 | printf ("Same contents? %s\n", (! strcmp (third, fourth) ? "yes" : "no"));
16 | printf ("Same string? %s\n", (third == fourth ? "yes" : "no"));
Lines 1 – 4 declare the string "hello" four times, twice with a char* and twice with a char array. These declarations influence the equality checks that follow. In all three cases, the strcmp() function will return 0 to indicate that they match; this should not be surprising since they are all initialized with the same string. It should also not be surprising that the equality check on line 12 returns false. Recall that the char* initialization style puts the contents of the string in .rodata, whereas the char array style places the contents on the stack. In other words, first and third are pointing to different memory segments.
The equality checks on lines 8 and 16 are somewhat less predictable initially. Line 8 typically returns true, indicating that the first and second pointers are pointing to the same place, despite the fact that they are both initialized with what appears to be a separate copy of the string. In this case, the compiler determines that the strings are the same, which makes it redundant to store two copies in .rodata; by definition, the strings in .rodata cannot change, so one shared copy is sufficient. (The C standard permits this sharing but does not require it, so portable code should not depend on the result.) Line 16, on the other hand, returns false. The array declaration style must create two distinct instances, because each one can be modified independently of the other; it does not matter that the initial contents are the same. In fact, many compilers produce a warning on this line to indicate that such array comparisons always evaluate to false.
Another common task with strings is to determine if the characters fit into particular classes, such as numeric, alphanumeric, whitespace, printable, upper- or lower-case, etc. The functions defined in the ctype.h file provide these tests without requiring the programmer to recreate the pattern-matching required. Code Listing A.39 illustrates how these class tests could be used to validate the strength [2] of a password. Line 9 performs a standard safety check. Functions that take a pointer as input, particularly from user-supplied input, need to check explicitly for NULL arguments. Line 12 then throws out passwords that are shorter than 16 characters in length.
 1 | /* Code Listing A.39:
 2 |    Using ctype.h tests to determine if a password uses multiple classes
 3 |  */
 4 |
 5 | bool
 6 | is_strong (const char *password)
 7 | {
 8 |   /* Safety check: Don't accept a NULL pointer */
 9 |   assert (password != NULL);
10 |
11 |   /* Short passwords are bad */
12 |   if (strlen (password) < 16)
13 |     return false;
14 |
15 |   const char *walker = password;
16 |   bool digit = false, lower = false, upper = false, punct = false;
17 |   while (*walker != '\0')
18 |     {
19 |       digit |= isdigit ((unsigned char) *walker);
20 |       lower |= islower ((unsigned char) *walker);
21 |       upper |= isupper ((unsigned char) *walker);
22 |       punct |= ispunct ((unsigned char) *walker++);
23 |     }
24 |   /* Return true only if all are true */
25 |   return digit && lower && upper && punct;
26 | }
Lines 15 – 23 perform the bulk of the checking. The four bool variables are all initialized to false, indicating that we have not yet encountered a digit ('0' – '9'), lower-case letter ('a' – 'z'), upper-case letter ('A' – 'Z'), or a punctuation mark (see ispunct(3) for the full list). The walker variable is set to traverse through each byte of the string until the null byte is encountered (observe that line 22 advances walker after all checks have been done for one character). Within the while-loop, each bool variable is bit-wise ORed (|) with the result of applying the isX functions to the current character *walker. The first time that a character passes one of the tests (e.g., when walker points to 'Z' and isupper(*walker) is called), the corresponding bool variable will be set to 1 (true). From then on, that variable can never become false, because the bit-wise OR of 1 with any value always produces a non-zero result. Consequently, line 25 will return true only if the password contains at least one character from each of the four classes.
C library functions – <ctype.h>
int isalnum(int c);
int isalpha(int c);
int isdigit(int c);
int isspace(int c);
int islower(int c);
int isupper(int c);
int ispunct(int c);
Most modern programming languages provide a simple mechanism for a very common task: merging strings. Some languages use the + operator, such as string1 + string2 to concatenate the two strings; others use a . operator instead. Unfortunately, C is not such a language. There are various functions that can be used for this purpose, with strncpy() and strncat() being two of the first encountered. Both of these functions copy the contents of one string (s2, passed as the second argument) into a portion of memory identified by the first argument (s1). The key difference between the two is that strncpy() will copy the bytes starting at the exact location that s1 points to; strncat() appends the strings by copying the bytes starting at the first null byte at or after s1. In other words, strncpy() replaces the contents of the first string, whereas strncat() concatenates the two.
Unlike their unsafe cousins strcpy() and strcat() (which should NEVER be used), strncpy() and strncat() take a third argument that specifies a maximum number of bytes to copy. If the length of s2 is less than n, then the function will stop copying from s2 before processing n bytes (strncpy() then fills the rest of the n bytes in s1 with null bytes). The memcpy() function shown below behaves similarly to strncpy(), except that it ignores the null byte; that is, memcpy() is used to copy an arbitrary memory buffer from one location to another, regardless of whether that buffer contains a string. In that regard, memcpy() will always copy exactly n bytes, unless some unusual circumstance occurs (such as the dst and src buffers overlapping, which is undefined behavior in the C specification).
C library functions – <string.h>
char * strncpy(char *s1, const char *s2, size_t n);
char * strncat(char *s1, const char *s2, size_t n);
void * memcpy(void *dst, const void *src, size_t n);
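The difference between strncpy() and memcpy() shows up when the source bytes contain a null byte in the middle. The sketch below copies the same five bytes with both functions; the function name copy_both_ways() and the '#' fill character are hypothetical choices made so that untouched bytes would be visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical demo: copies { 'a', 'b', '\0', 'c', 'd' } into two
   five-byte buffers, once with strncpy() and once with memcpy(). */
void
copy_both_ways (char via_strncpy[5], char via_memcpy[5])
{
  assert (via_strncpy != NULL && via_memcpy != NULL);
  const char source[5] = { 'a', 'b', '\0', 'c', 'd' };

  memset (via_strncpy, '#', 5);
  memset (via_memcpy, '#', 5);

  /* Stops reading at the first '\0' and pads the remainder of the
     five bytes with '\0'; 'c' and 'd' are never copied. */
  strncpy (via_strncpy, source, 5);

  /* Copies all five bytes verbatim, including the embedded '\0'. */
  memcpy (via_memcpy, source, 5);
}
```

Afterward, the strncpy() buffer holds 'a', 'b', and three null bytes, while the memcpy() buffer holds an exact copy of the source.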
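Although it is certainly fair to refer to strcpy() or strcat() as an unsafe version of strncpy() or strncat(), it would be a mistake to consider the latter two functions truly safe. One key aspect of this is whether or not these functions guarantee that the result is null terminated. Code Listing A.40 demonstrates two examples of this problem. The n argument of the first strncpy() call (line 6) ensures that only the 'h' and 'e' characters get copied into the buffer array. That is, the n argument for strncpy() places a maximum on the number of bytes copied, and the function does not guarantee that one of these is a null byte. Line 7, then, is likely to print additional characters after the string "he", because there is no null byte in buffer. As such, the %s causes printf() to continue traversing through memory until a null byte is encountered. The second example shows the opposite surprise. Line 10 copies "hello" into the 10-byte trouble array, and strncpy() pads the unused bytes with null bytes. Line 11 then appends at most five characters (" worl"), but strncat() always writes a null byte after the characters it appends; the result needs 5 + 5 + 1 = 11 bytes, so the terminator lands one byte past the end of trouble.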
 1 | /* Code Listing A.40:
 2 |    strncpy() and strncat() do not agree on null-termination of strings
 3 |  */
 4 |
 5 | char buffer[2];
 6 | strncpy (buffer, "hello", 2);
 7 | printf ("buffer: %s\n", buffer);
 8 |
 9 | char trouble[10];
10 | strncpy (trouble, "hello", 10);
11 | strncat (trouble, " world", 5);
12 | printf ("trouble: %s\n", trouble);
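One common way to work around the missing terminator is to copy at most one byte less than the size of the destination and then write the null byte explicitly. The following is a minimal sketch of that pattern; the name copy_terminated() is hypothetical, and the caller is assumed to pass the true size of the destination array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst (which has room for
   dst_size bytes), truncating if necessary, and guarantees that the
   result is null-terminated. */
void
copy_terminated (char *dst, size_t dst_size, const char *src)
{
  assert (dst != NULL && src != NULL && dst_size > 0);
  strncpy (dst, src, dst_size - 1);   /* never touches the last byte */
  dst[dst_size - 1] = '\0';           /* strncpy() may not have added one */
}
```

With a three-byte destination, copy_terminated (buffer, sizeof (buffer), "hello") produces the string "he" followed by a null byte, so printing it with %s stops where expected.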
Bug Warning
The strncpy() and strncat() functions are a frequent source of errors. As described above, they differ on their interpretation of the n parameter and whether or not null-termination is guaranteed (yes for strncat(), no for strncpy()). Besides the confusion around these issues, they still leave plenty of room for errors on the part of the programmer. One common mistake is to switch the order of the first two arguments, mistaking the source and the destination of the copy operation. Another common mistake with these functions can be illustrated in the following line of code:
strncpy (destination, source, strlen (source));
This line of code, in essence, re-creates the functionality of the banned strcpy() function. When strcpy() runs, it will only stop when it encounters a null byte in source; in the process, it has copied strlen(source) bytes (plus the null terminator) over to the destination. By making the n parameter the same as the number of bytes in the string, this line of code sets a redundant maximum length check (strcpy() would already stop after strlen(source) bytes of data), and it even leaves out the null terminator that strcpy() would have written. The third parameter must always be based on how much space is available in the destination, never the source.
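For strncat(), basing n on the destination means computing how much room is left after the string that is already there, while reserving one byte for the terminator. The sketch below shows one way to do this; the name append_bounded() is hypothetical, and dst is assumed to already contain a null-terminated string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dst, where dst
   has room for dst_size bytes in total.  Returns true if all of src
   fit, false if it had to be truncated. */
bool
append_bounded (char *dst, size_t dst_size, const char *src)
{
  assert (dst != NULL && src != NULL);
  size_t used = strlen (dst);          /* dst must be terminated */
  assert (used < dst_size);
  size_t room = dst_size - used - 1;   /* reserve one byte for '\0' */
  strncat (dst, src, room);            /* strncat() adds the '\0' itself */
  return strlen (src) <= room;
}
```

With a 10-byte array that holds "hello", appending " world" leaves "hello wor" in the array and returns false, rather than writing past the end.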
Another common mistake that occurs with these functions is due to confusion regarding the sizeof() operator as discussed previously. Consider the following example:
1 | char *buffer = calloc (100, sizeof (char));
2 | strncpy (buffer, "This is a string", sizeof (buffer));
The first line of this example creates a dynamically allocated buffer of 100 bytes of space. Since it uses calloc(), all of the bytes are set to null bytes (0 = '\0'). On the second line, the source argument string is 16 characters in length; clearly the buffer has sufficient space for the entire string. The problem with this line of code is that only the bytes "This is " will be copied over, due to the use of sizeof(). As described previously, sizeof() can never check how much space a pointer is pointing to. Instead, sizeof() returns the amount of space required for the variable itself. Since buffer is a char* (i.e., it is a pointer), its size is the size of an address: 8 bytes. It does not matter that buffer is pointing to 100 bytes allocated on the heap. Based on the first line of code (with a hard-coded size of 100), the last argument to strncpy() would need to be the hard-coded value 99 (keeping the $100^{th}$ byte as 0 to guarantee a null-terminated string).
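One way to avoid repeating hard-coded values such as 100 and 99 is to give the allocation size a name and use it in both places. The following sketch assumes a fixed capacity; the names BUFFER_CAPACITY and copy_to_heap() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_CAPACITY 100   /* hypothetical name for the allocation size */

/* Hypothetical helper: returns a zero-filled heap buffer that holds a
   (possibly truncated) copy of text, or NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *
copy_to_heap (const char *text)
{
  assert (text != NULL);
  char *buffer = calloc (BUFFER_CAPACITY, sizeof (char));
  if (buffer == NULL)
    return NULL;
  /* Use the allocation size here; sizeof (buffer) would only be the
     size of the pointer.  The last byte stays 0 from calloc(). */
  strncpy (buffer, text, BUFFER_CAPACITY - 1);
  return buffer;
}
```

Because the size of a heap allocation cannot be recovered from the pointer, code that works with dynamically sized buffers generally has to carry the size alongside the pointer in some form like this.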
A third common mistake occurs when the programmer forgets about the implications of memory segment permissions. In the following example, message is declared as a char* that points to the hard-coded string "Hello, ", which resides in the read-only global data segment (.rodata). Line 2, then, is an attempt to write into read-only memory. The result would be a segmentation fault or an abort trap, depending on the architecture.
1 | char *message = "Hello, ";
2 | strncat (message, username, 20);
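A safer version builds the combined string in writable memory supplied by the caller. The sketch below swaps in snprintf() rather than strncat(), since snprintf() both limits the number of bytes written and always terminates the result when the size is non-zero; the function name build_greeting() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "Hello, " followed by username into
   out, which has room for out_size bytes.  Returns true if the whole
   message fit, false if it was truncated. */
bool
build_greeting (char *out, size_t out_size, const char *username)
{
  assert (out != NULL && username != NULL && out_size > 0);
  /* snprintf() returns the length the full message would have had. */
  int needed = snprintf (out, out_size, "Hello, %s", username);
  return needed >= 0 && (size_t) needed < out_size;
}
```

A caller would declare a local array such as char message[32]; and pass it along with sizeof (message), so the string constant is only ever read, never written.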
While strncpy() and strncat() focus on building or merging strings, another common task is to split a string into smaller parts, a procedure known as tokenizing. Two functions are available for this purpose: strtok(), which is part of the standard C library, and strtok_r(), which is specified by POSIX. In both cases, when the function is first called, the str parameter points to the string to tokenize; on subsequent calls, str is set to NULL to indicate that the function is continuing to process the previous string. The sep parameter is a pointer to a string of separator characters; whenever one or more of these characters is encountered in a row, strtok() or strtok_r() would return a pointer to the token ending at that character. The difference between the two functions is that strtok_r() is reentrant, while strtok() is not. (Reentrancy is discussed in Chapter 7.) In short, strtok() uses a static variable to keep track of where to continue within the string. This approach fails when there are multiple threads calling strtok() on distinct strings; the threads might accidentally receive each other’s tokens. If there are multiple threads in execution, the strtok_r() version is needed to avoid this dilemma; the third parameter, lasts, keeps track of the tokenization of the string, thus eliminating the race conditions that can occur with static variables.
C library functions – <string.h>
char * strtok(char *str, const char *sep);
Tokenize str at an occurrence of the separator sep.
char * strtok_r(char *str, const char *sep, char **lasts);
Reentrant version of strtok(); sets lasts to the beginning of the next token.
As an example of tokenization, consider a comma-separated value (CSV) file, a common format for sharing collections of data. Each line in a CSV file consists of a number of data fields with a comma to separate them. As an example, consider a CSV file of holidays for the year 2020. One line of the file might look as follows:
Wed,Jan,01,2020,New Year's Day
Once the file contents have been read into memory, the lines might be tokenized to retrieve the individual fields of the line. Code Listing A.41 splits this line one token at a time, storing the tokens in the fields of a struct declared as the holiday_t type.
 1 | /* Code Listing A.41:
 2 |    Using strtok_r() to split a CSV file line
 3 |  */
 4 |
 5 | /* Assume line contains "Wed,Jan,01,2020,New Year's" */
 6 | holiday_t nyd;
 7 | char *save = NULL;
 8 | nyd.wkd = strtok_r (line, ",", &save); // set weekday to "Wed"
 9 | nyd.mon = strtok_r (NULL, ",", &save); // set month to "Jan"
10 | nyd.day = strtok_r (NULL, ",", &save); // set day to "01"
11 | nyd.yer = strtok_r (NULL, ",", &save); // set year to "2020"
12 | nyd.nam = strtok_r (NULL, ",", &save); // set name to "New Year's"
In each of lines 8 – 12, the call-by-reference parameter &save changes the save pointer to keep track of the continuation point that immediately follows the separator instance. For instance, line 8 sets save to point to the 'J' in "Jan" and returns a pointer to the string "Wed". When the first parameter to strtok_r() is NULL, this continuation point determines where the function will look for the next delimiter. On line 9, then, strtok_r() starts looking at the 'J' and finds the comma just after "Jan"; strtok_r() then updates save to point to the first '0' and returns the token "Jan".
There are subtle aspects to the behavior of strtok() and strtok_r() that require consideration. First, these functions do not return copies of the tokens; they modify the original string and return a pointer into it. Specifically, the first occurrence of any of the characters in the separator string sep is replaced with the null byte. Table A.6 illustrates two snapshots of the string pointed to by line in Code Listing A.41, both the original version and after two calls to strtok_r(). After two calls, the first two commas in the line have been overwritten with a null byte. Because of this modification, the pointer that is returned is a complete string. The first call to strtok_r() returns a pointer to the 'W' at the beginning of the line, but the token returned is the string "Wed". The fact that the original string gets modified means that string constants cannot be tokenized. Since string constants are stored in .rodata, tokenizing one would require writing a null byte into read-only memory.
Original string contents:
| W | e | d | , | J | a | n | , | 0 | 1 | , | 2 | 0 | 2 | 0 | , | N | e | w |   | Y | e | a | r | ' | s | \0 |

After line 9 of Code Listing A.41:
| W | e | d | \0 | J | a | n | \0 | 0 | 1 | , | 2 | 0 | 2 | 0 | , | N | e | w |   | Y | e | a | r | ' | s | \0 |

Table A.6: The contents of the line variable before line 8 and after line 9 of Code Listing A.41
Second, because the pointers returned are to the original string, freeing or modifying the original
data can corrupt the tokens. In the CSV example above, we assumed that the entire file contents were
read into memory. What if this were not the case? Instead, suppose the program reads one line of the
file into memory at a time, repeatedly overwriting the buffer variable line. This approach would
allow the first line to be tokenized successfully, and the fields of the holiday_t struct would be
pointing to their tokens. But the next line of the file would get read into this exact same memory.
As such, the fields of the holiday_t would now be pointing to characters in the second line of
data, not the first. On the other hand, perhaps the line variable points to a dynamically
allocated buffer that is created anew for each line of the file. In this case, the holiday_t
fields could be corrupted if line is freed; the fields would still be pointing to the heap where
the contents of line were stored, but that part of the heap would now be invalid.
Third, strtok() and strtok_r() ignore repeated instances of separators; a run of consecutive
separators is treated as a single one. This behavior can be problematic for CSV files, as fields can
be blank. For example, assume that the CSV file from above was modified to include a location field
between the year and name of the holiday. If this field were missing for the New Year’s Day holiday,
that line (and another) of the file might look like:
Wed,Jan,01,2020,,New Year's Day
Fri,Feb,14,2020,Charlottesville,Valentine's Day
One call to strtok_r() would get the string "2020". The next call would then get the string
"New Year's Day", rather than an empty string to indicate the missing location field. There are
times when skipping over repeated separators is advantageous (consider skipping over repeated
whitespace in a C source code file), but there are also times where it can lead to incorrect
results. The strtok() and strtok_r() functions work well for the former cases, but other approaches
are needed for the latter.
Code Listing A.42 demonstrates two techniques for splitting a file’s contents based on
lines. To start, assume that the file’s contents have been read into file_contents and (for
simplicity) this buffer is null-terminated. Lines 5 – 17 store copies of the lines without using
strtok() or strtok_r(). Instead, these lines use start_of_line to keep track of where a
line begins (initially, the start of the file contents). Line 6 then uses strchr() to identify
the end of the first line by looking for the '\n' character. Line 10 uses strndup() to make
a dynamically allocated copy of the line. The semantics of strndup() are like strncat(); it will
copy up to the specified number of bytes and it will add on a null terminator. Since
end_of_line - start_of_line is exactly the number of bytes in the line, line 10 makes a complete
null-terminated copy and stores the address of this copy into an array. Line 12 then moves
start_of_line just past the newline character so that it points to the beginning of the next
line. Line 14, then, starts looking for the next end_of_line after that point. The additional
copy on line 17 is necessary because the while-loop will terminate when there are no more
newline characters; when this occurs, start_of_line is pointing to the last line (which has no
'\n' after it). Line 17 can use the standard strdup() instead of strndup(), because the
original file_contents are null terminated.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | /* Code Listing A.42:
Tokenizing and storing file lines without and with strtok()
*/
char *start_of_line = file_contents;
char *end_of_line = strchr (start_of_line, '\n');
while (end_of_line != NULL)
{
line_copies[lineno++] =
strndup (start_of_line, end_of_line - start_of_line);
/* Next line starts after the '\n' */
start_of_line = end_of_line + 1;
/* Find the next end of line */
end_of_line = strchr (start_of_line, '\n');
}
/* Copy the last line */
line_copies[lineno] = strdup (start_of_line);
lineno = 0;
char *line = strtok (file_contents, "\n");
while (line != NULL)
{
/* Dynamically allocate a copy and store the pointer */
all_lines[lineno++] = strdup (line);
line = strtok (NULL, "\n");
}
|
When line 19 begins processing, the original file_contents have not been altered in any way. The
use of strchr() and strndup() in lines 6 – 17 does not write anything into this buffer.
Consequently, we can begin to use strtok() and start over. Through each iteration of the
while-loop in lines 21 – 26, the line variable points to the current (null-terminated) line.
To keep copies of the lines, again, we use strdup(). In practice, it does not make sense to
perform both of these loops, since they are keeping track of the same data. The purpose of combining
them in Code Listing A.42 is to show that they are two largely equivalent ways to accomplish the
same goal. One difference is that the strtok() approach modifies the original file_contents,
whereas the strchr() approach does not. Another is that strtok() skips blank lines (consecutive
'\n' characters count as repeated separators), whereas the strchr() approach stores them as empty
strings. The while-loop structure in lines 21 – 26 is a common approach for using strtok().
One final common task in relation to strings involves converting numeric values back and forth
between representations. When reading user input or data from a file, numeric text data ("123"
)
might need to be converted to one of C’s integer primitive representations (123) for easy
manipulation or compact storage. On the other hand, integers often need to be converted to their
string format to append to other text data (e.g., writing the HTTP header line
"Content-Length: 123\r\n"
when the length has been stored as a size_t
variable). Code Listing A.43
illustrates the difference in the internal representations of 123 (as a uint8_t
) and "123"
(as a string).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | /* Code Listing A.43:
Printing the byte contents of an integer and string
*/
uint8_t integer = 123;
char string[] = "123";
uint8_t *walker = &integer;
for (size_t i = 0; i < sizeof (integer); i++)
printf ("%02" PRIx8 " ", *walker++);
printf ("\n");
walker = (uint8_t *) &string;
for (size_t i = 0; i < sizeof (string); i++)
printf ("%02" PRIx8 " ", *walker++);
printf ("\n");
|
The particular for
-loops here might appear odd, but they are used to show that the two variables
are being handled in the same way. Specifically, the loop on line 9 will only have one iteration,
because integer is only one byte in size (as a uint8_t
). At the same time, the loop on line 14
deliberately uses sizeof()
on a string (instead of strlen()
), which is a common source of
bugs; however, this approach lets us examine all four bytes in the char
array.
The first loop demonstrates that the internal representation of integer is the single byte
0x7b. The second loop demonstrates that the representation of string is the four consecutive
bytes 0x31, 0x32, 0x33, and 0x00. (Interpreting these four bytes at once as a single 32-bit integer
would produce the value 0x00333231 on a little-endian machine.) The issue of conversion focuses on
ways to translate automatically between these two byte representations, which do not appear to be
similar.
C library functions – <stdlib.h>
long strtol(const char *str, char **endptr, int base);
The strtol() function handles the conversion from string to integer. [3] The str
parameter points to a string containing the number, and the base indicates the numeric base
(10 for decimal, 16 for hexadecimal, or an arbitrary base such as 13 for base-13). The string can
contain multiple numeric values separated by non-numbers (e.g., "123 456 -42" for the three
values 123, 456, and -42). When endptr is non-NULL (i.e., it is a call-by-reference value),
the pointer it refers to will be set to point to the first character after the current number.
Code Listing A.44 demonstrates multiple ways that strtol()
can be used. Line 5
starts by creating a string "123 -32 alpha"
. This can be broken down into the integer values 123
and -32, but the "alpha"
cannot be interpreted as an integer. Lines 9, 14, and 20 use
strtol()
to parse this string into these numeric components. Line 9 uses the original
numbers
string as the first argument, whereas lines 14 and 20 use end
for this parameter.
Using end
on the subsequent calls is necessary, because strtol()
does not keep track of any
prior progress; using numbers
each time would repeatedly return the first value, 123.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | /* Code Listing A.44:
Converting from string to integer representations
*/
char *numbers = "123 -32 alpha";
char *end = NULL;
errno = 0; // strtol() never resets errno to 0 on its own
/* strip off the 123 and make end point to " -32 alpha" */
long result = strtol (numbers, &end, 10);
assert (errno != EINVAL); // no match indicates success
printf ("Result = %ld; end = '%s'\n", result, end);
/* continue from " -32 alpha" to get the -32 */
result = strtol (end, &end, 10);
assert (errno != EINVAL); // no match indicates success
printf ("Result = %ld; end = '%s'\n", result, end);
/* continue from " alpha", which cannot be processed */
char *final = NULL;
result = strtol (end, &final, 10);
assert (errno == EINVAL || final == end); // either sign means strtol() failed
printf ("Result = %ld\n", result);
printf ("end = '%s'; final = '%s'\n", end, final);
assert (final == end);
/* use a bizarre base-11 format, ignoring endptr */
numbers = "60a1";
result = strtol (numbers, NULL, 11);
printf ("Result = %ld\n", result);
|
The strtol() function provides two ways to check for errors in the processing. The first is to use
the errno global variable. On some platforms, a failure (such as " alpha") causes strtol() to set
errno to EINVAL, which is a positive constant. POSIX permits this behavior but does not require it,
so some C libraries leave errno untouched when no digits are found. In addition, strtol() never sets
errno to 0 on success, which is why the listing clears errno before the first call. The assert()
calls on lines 10, 15, and 21 all pass, indicating the calls to strtol() on lines 9 and 14 succeed,
while line 20 fails. The other mechanism, and the one that works on every platform, is the endptr
parameter. Line 20 uses end as the input, pointing to the string " alpha". After strtol() runs,
final is also set to this location. If the endptr ends up at the beginning of the string (i.e.,
final after the call matches end, which hasn’t changed), then strtol() was unable to process any
data successfully. Line 21 accepts either signal for this reason.
Lines 27 – 29 demonstrate other features of strtol(). First, the endptr parameter can be (and is
often) ignored by passing NULL. Even with a NULL endptr, we could still check errno (after clearing
it to 0 before the call) to detect a value that is out of range; input with no digits at all can be
detected this way only on platforms that report EINVAL. Second, strtol() supports generally
arbitrary base values (2 – 36 are allowed). Conventionally, C numeric constants use 0x as a prefix
to indicate hexadecimal format (e.g., 0x7ff) and a leading 0 to indicate octal (e.g., 0644);
otherwise, the number is interpreted as decimal. Importantly, this means that C traditionally had no
convention to declare binary constants (the 0b prefix was only standardized in C23). The strtol()
function fills this gap by taking 2 as the base parameter.
Converting values in the opposite direction, from integers to strings, is mostly intuitive, because
it is very similar to one of the first functions novices learn in C: printf()
. The main
difference is that the snprintf()
function takes two parameters before the format string to
indicate the destination and the maximum number of bytes. (The sprintf()
function does not take
a maximum number of bytes, which makes this function unsafe in the same ways as strcpy()
or
strcat()
. As such, sprintf()
should never be used.)
C library functions – <stdio.h>
int snprintf(char *str, size_t size, const char *format, ...);
Code Listing A.45 highlights the similarities between snprintf()
and the more
familiar printf()
. Both functions take a format string ("%d"
or "%d\n"
) to indicate how
the number should be formatted, along with the number as an additional argument. The primary
difference is that snprintf()
also indicates a destination to write the formatted value into
(the buffer
). Once the value has been written into the buffer, it can be printed (if needed)
using the %s
format specifier.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | /* Code Listing A.45:
Converting from integer to string is similar to printing to standard I/O
*/
int number = 42;
char buffer[3];
/* Print the number to the screen */
printf ("%d\n", number);
/* "Print" the number into the buffer */
snprintf (buffer, 3, "%d", number);
/* Print the string */
printf ("%s\n", buffer);
|
Recall from the discussion of strncpy()
and strncat()
that the two functions had different
interpretations of the respective maximum size parameter, n
. Specifically, strncpy()
would
copy a maximum of n
bytes, potentially leaving the string un-terminated if those n
bytes did
not contain the null byte. In contrast, strncat()
would write a maximum of n+1
bytes,
because it always appends the null byte. The snprintf()
function adds a third interpretation: it
will print up to n-1
bytes and then append the null byte. Frustrating! Code Listing A.46 summarizes this situation. Since both strncat()
and snprintf()
guarantee null
termination, they both end up writing a null byte; however, strncat()
appends this after the two
bytes 'h'
and 'e'
, whereas snprintf()
does so after only one byte '4'
. Unlike the
other two, strncpy()
does not guarantee null termination.
1 2 3 4 5 6 7 | /* Code Listing A.46:
Comparing the size semantics of strncat(), strncpy(), and snprintf()
*/
strncat (buffer_1, "hello", 2); // copies 3 bytes 'h', 'e', '\0'
strncpy (buffer_2, "hello", 2); // copies 2 bytes 'h' and 'e'
snprintf (buffer_3, 2, "%d", 42); // copies 2 bytes '4' and '\0'
|
Bug Warning
The snprintf() function, once again, creates a very common vector for buffer overflow
vulnerabilities when the size argument does not match the buffer. One of the challenges (and a
common mistake) arises from anticipating what is a likely integer value rather than what is a
possible one. The buffer from Code Listing A.45 is not a safe size for the format specifier %d. As
an int is typically four bytes, its string form can require as many as 12 bytes (for example, the
INT_MIN constant prints as "-2147483648", which is 11 characters including the negative sign, plus
the null byte). As such, the buffer should generally be sized for the largest possible value rather
than the expected one. One simple way to do this (and to ensure the bytes are all initialized to 0)
is to use calloc() to allocate enough space. If needed, realloc() could then be used to
shrink the buffer.
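char *buffer = calloc (12, sizeof (char));
assert (buffer != NULL);
snprintf (buffer, 12, "%d", 35);
/* Shrink it down to size, keeping an extra byte for '\0' */
char *shrunk = realloc (buffer, strlen (buffer) + 1);
if (shrunk != NULL) // on failure, the original buffer remains valid
  buffer = shrunk;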
Since snprintf()
takes a normal format string (which can contain a mix of string data and
multiple format specifiers), it creates an easier mechanism to concatenate multiple values together
into a single string. Code Listing A.47 demonstrates a simple example of this practice
to build the string "5 + 10 = 15\n"
using int
variables.
1 2 3 4 5 6 7 | /* Code Listing A.47:
Using snprintf() to build a string that mixes integer and string data
*/
int x = 5, y = 10;
char *sum = calloc (100, sizeof (char));
snprintf (sum, 100, "%d + %d = %d\n", x, y, x + y);
|
Chapter 4 introduces the structure of HTTP headers. These headers consist of a series of lines, each
ending in "\r\n"
. As one example, consider the following header snippet:
HTTP/1.0 200 OK\r\n
Content-Length: 37\r\n
Connection: close\r\n
Content-Type: text/html\r\n
\r\n
Assuming some of these fields are stored in variables, this could be constructed with a single
snprintf()
call, as shown in Code Listing A.48.
1 2 3 4 5 6 7 8 9 10 | /* Code Listing A.48:
Building an HTTP response header with one snprintf()
*/
snprintf (header, MAX_HEADER_LENGTH,
"HTTP/%d.%d %d %s\r\n" // version, code, status
"Content-Length: %d\r\n" // length
"Connection: close\r\n"
"Content-Type: %s\r\n\r\n", // type
vers_major, vers_minor, code, status, length, type);
|
Code Listing A.48 relies on the fact that adjacent string literals are concatenated by the C
compiler. As such, lines 6 – 9 all build a single format string. Displaying them as separate lines
in the code makes the organization easier to understand from the programmer’s perspective. This
string could also be built a line at a time with repeated calls to strncat(), but this version
simplifies the processing as a single function call.
Bug Warning
Another common use of snprintf()
is to inject formatted numbers into the middle of an existing
string. For instance, consider an event logging mechanism that uses a common reporting form for
events. The snprintf()
function could be used to fill these in, but requires special care as
shown in the example below.
1 2 3 4 5 6 7 8 9 10 | char record[] = "Month [   ] Day [  ] Year [    ]";
snprintf (record + 7, 4, "%s", month); // write [mon]
snprintf (record + 17, 3, "%-2d", day); // write [da]
snprintf (record + 27, 5, "%4d", year); // write [year]
/* Restore the ] characters that snprintf() overwrote with the
null byte */
record[10] = ']';
record[19] = ']';
record[31] = ']';
|
The problem is that snprintf() always null terminates what it writes. As such, if line 2 writes
the month as "Jan", the record variable would become the string "Month [Jan". The rest of
the string would still be there in memory, but snprintf() would have overwritten the first closing
bracket with the null byte; printing the string at that point would stop there instead of showing
the full record. Lines 8 – 10 fix this problem by restoring the brackets to their original
locations, overwriting the null bytes that snprintf() had added.
Also note that the original record string created three spaces between the brackets for the month,
two for the day, and four for the year. The size
parameter for lines 2 – 4 added one to each of
these values (four, three, and five, respectively), because snprintf()
includes the null byte in
this count. Thus, writing the string "Jan"
into the month field requires writing four bytes, not
three.
[1] | The use of lexicographic instead of alphabetic is common and intentional in computing, as the former is more general and works with non-alphabetical characters. For instance, it does not make sense to characterize the alphabetical ordering of “15” as compared to “3” since neither contains letters in the alphabet. However, “15” comes before “3” in lexicographical ordering. |
[2] | Determining if a password is strong is significantly more complicated than this function, and this should not be used for real security purposes. For instance, the password also needs to be compared with common dictionary words, previous passwords, easily guessed patterns, etc. This example just illustrates how these character class checks can be used as part of this procedure. |
[3] | C also has an older function atoi() for this purpose, though its use is
discouraged in new code. The strtol() function adds explicit support for
multiple bases, whereas atoi() handles only decimal input; strtol() also returns
a long rather than the int returned by atoi(), supporting larger values. Finally, and most
importantly, atoi()’s error handling is weak, returning 0 for bad input; as such, it is not
possible to distinguish between "0" and truly bad input. |