Tuesday, June 15, 2010

Objective-C Tuesdays: C strings

Welcome to the start of a new topic: strings. We'll cover both plain old C strings and the much nicer NSStrings of Objective-C (and their CFString siblings). Today we start at the beginning with C strings.

C strings are also called null (or nul) terminated strings, zero terminated strings or sometimes z-strings. A C string is simply a block of memory where bytes represent characters. The last byte in the block contains a zero value (the nul character) to mark the end of the string.

Here is the memory layout of the string "iPhone" in ASCII encoding. The string starts at memory address 48:

Address      48    49    50    51    52    53    54
Value       105    80   104   111   110   101     0
Character   'i'   'P'   'h'   'o'   'n'   'e'  '\0'
Notice that "iPhone" is six characters long but uses seven bytes of memory, since it has a zero value after the last character to mark the end of the string. Functions that work with C strings depend on the zero terminator being there to know how long the string is. Forgetting to write a zero at the end, or overwriting it by accident, is a common programming error when working with C strings. Without the terminator, string functions keep reading (or writing) past the end of the memory block. This is a type of buffer overrun error that can lead to security breaches and program crashes.
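To make the role of the terminator concrete, here's a small sketch of how a function like strlen() finds the end of a string. The names c_string_length and c_string_storage are made up for this illustration; they aren't part of the standard library.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bytes before the zero terminator, the way strlen() does. */
size_t c_string_length(const char *s)
{
    assert(s != NULL);
    size_t length = 0;
    while (s[length] != '\0') {
        length++;
    }
    return length;
}

/* Bytes of memory the string occupies, including the zero terminator. */
size_t c_string_storage(const char *s)
{
    return c_string_length(s) + 1;
}
```

Given "iPhone", c_string_length() returns 6 and c_string_storage() returns 7, matching the table above. If the terminator were missing, the loop would simply keep walking through whatever memory comes next.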

Since C strings are just memory blocks, you declare C string variables as pointers to type char for mutable C strings or type const char or char const for constant (immutable) C strings. (The const has the same meaning before or directly after the char.)
// example C string variable declarations
char *mutable_c_string;
const char *immutable_c_string1;
char const *immutable_c_string2;
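Since C string literals must not be modified (doing so is undefined behavior), variables that point to literals should be declared const. C++ compilers warn when you forget; C and Objective-C compilers usually warn only when asked, for example with the -Wwrite-strings option in GCC and Clang:
char const *string1 = "foobar"; // okay
char *string2 = "barfoo";       // compiles, but should be const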
When you need some temporary storage to receive a C string, it's common to declare a char array.
char buffer[81];
sprintf(buffer, "The answer is %d\n", 42);
Here we use the sprintf() function to write formatted data to a string that's placed in buffer.
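sprintf() trusts you to make the buffer big enough. When the formatted text could vary in length, the standard snprintf() function is a safer choice because it takes the buffer size and never writes past it. Here's a minimal sketch; the helper name format_answer is hypothetical, chosen just for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "The answer is <n>\n" into buffer.
   Returns true only if the whole string, terminator included, fit. */
bool format_answer(char *buffer, size_t size, int answer)
{
    assert(buffer != NULL && size > 0);
    /* snprintf() returns the length the full string would have had */
    int needed = snprintf(buffer, size, "The answer is %d\n", answer);
    return needed >= 0 && (size_t)needed < size;
}
```

With an 81 byte buffer the call succeeds. With a buffer that's too small, snprintf() truncates the text, still writes the zero terminator, and the helper returns false so the caller knows the output is incomplete.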

You may also sometimes see a C string declared like this:
char const name[] = "foo";
This is almost the same as:
char const *name = "foo";
There's a subtle and mostly unimportant difference between these two declarations. The first one declares an array that holds the characters themselves; the second one declares a pointer to the first character of an array stored elsewhere. We'll look at the difference between these two in the future when we cover arrays.
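One place the difference does show up is sizeof. The sketch below is only an illustration; the names name_array, name_pointer and declaration_sizes are invented for it.

```c
#include <assert.h>
#include <stddef.h>

/* Array: the four bytes 'f', 'o', 'o', '\0' are the variable itself. */
static char const name_array[] = "foo";

/* Pointer: the variable holds only the address of a literal stored elsewhere. */
static char const *name_pointer = "foo";

/* Reports how many bytes each declaration occupies. */
void declaration_sizes(size_t *array_bytes, size_t *pointer_bytes)
{
    assert(array_bytes != NULL && pointer_bytes != NULL);
    *array_bytes = sizeof name_array;     /* 4: three characters plus terminator */
    *pointer_bytes = sizeof name_pointer; /* size of a pointer, platform dependent */
}
```

The array always reports 4 bytes here, while the pointer reports the size of a pointer on your platform no matter how long the string is. That's why sizeof only works as a "buffer size" when applied to a real array.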

Character Encodings
C doesn't mandate any specific character encoding for strings. C strings frequently contain single byte encodings like ASCII or ISO-8859-1 (Latin-1). In a single byte encoding, each byte in the C string corresponds to a character, and the encoding defines 256 characters (or fewer -- some byte values may not be valid characters). C strings can also contain multibyte encodings such as UTF-8 or Shift JIS where some characters are represented by two or more bytes. It's up to the application programmer to keep track of character encoding issues when using C strings.

Most encodings used today are ASCII compatible, meaning that character values from zero to 127 represent the same characters defined by the ASCII encoding. If your program only ever processes ASCII text, you're in luck: you can ignore most encoding issues (at least until some pesky user decides to enter "San José" or "Björk"). In the real world, people use many more characters than the measly 128 in the ASCII set, so it's necessary to pay a little attention to character encodings when working with C strings. When you have mismatched encodings, you get data corruption and unhappy users.

For example, here is the word "Γεεκ" ("Geek" in Greek letters) at memory address 64 using the ISO-8859-7 single byte encoding:
Address      64    65    66    67    68
Value       195   229   229   234     0
Character   'Γ'   'ε'   'ε'   'κ'  '\0'
And here is the word "Γεεκ" at memory address 64 using the multibyte UTF-8 encoding:
Address      64    65    66    67    68    69    70    71    72
Value       206   147   206   181   206   181   206   186     0
Character   'Γ'         'ε'         'ε'         'κ'        '\0'
Even though they represent the same text, the two strings have very different representations in memory. If you fed one string into a function expecting the other encoding, the best you could hope for is an error. More often, the result is silently corrupted text.
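A practical consequence of multibyte encodings is that the byte count and the character count no longer agree. As a rough sketch, and assuming the input is valid UTF-8, you can count characters (strictly speaking, Unicode code points) by skipping the continuation bytes, which always have the bit pattern 10xxxxxx. The function name utf8_code_point_count is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a UTF-8 C string by counting every byte
   that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_code_point_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "Γεεκ" shown above, strlen() reports 8 bytes while utf8_code_point_count() reports 4. For pure ASCII text like "iPhone" the two numbers are the same.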

Converting Between Encodings
The standard C library doesn't provide support for converting between encodings. On modern Unix and Unix-derived systems, the iconv() function is commonly used to convert between encodings. Objective-C programs usually use the facilities provided by NSString or CFString. Since NSString and CFString objects are represented internally as Unicode, they can store text from any encoding. To translate a C string to an NSString:
char const *c_string = "foobar";
NSString *ns_string = [NSString stringWithCString:c_string 
                                encoding:NSASCIIStringEncoding];
And to translate an NSString to a C string:
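NSString *ns_string = @"foobar";
char const *c_string = [ns_string cStringUsingEncoding:NSASCIIStringEncoding];
if (c_string == NULL) {
  // ... ns_string contains characters that ASCII can't represent
}
The -cStringUsingEncoding: method returns a nul terminated C string, or NULL if the string can't be converted to the requested encoding without losing information. NSString manages the memory for the returned C string, so you don't free it yourself. (The -dataUsingEncoding: method also produces the encoded bytes, but the NSData object it returns is not nul terminated, so its -bytes pointer can't be used directly as a C string.)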

If you're using the UTF-8 encoding, you can do this in one step using the convenience method -UTF8String.
char const *c_string = ns_string.UTF8String;
Note that the returned C string is nul terminated and its memory is managed for you: it's guaranteed to stay valid no longer than the NSString it came from, and possibly not even that long. (This is great if you just need to pass a C string along to a C function, but you'll need to copy the returned C string if you want to keep it around.)
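Copying a C string so you can keep it around is plain C. Here's a minimal sketch with a hypothetical helper named copy_c_string; on POSIX systems, strdup() does the same job.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of s that the caller owns and must free(),
   or NULL if the allocation fails. */
char *copy_c_string(const char *s)
{
    assert(s != NULL);
    size_t size = strlen(s) + 1;   /* include the zero terminator */
    char *copy = malloc(size);
    if (copy != NULL) {
        memcpy(copy, s, size);
    }
    return copy;
}
```

You would pass the temporary pointer to copy_c_string() right away, hold on to the result for as long as you need it, and call free() on it when you're done.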

Core Foundation provides similar C functions. You use CFStringCreateWithCString() to create a CFString from a C string:
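char const *c_string = "foobar";
CFStringRef cf_string = CFStringCreateWithCString(kCFAllocatorDefault, c_string, kCFStringEncodingASCII);
// ... use cf_string, then release it: you own objects returned by Create functions
CFRelease(cf_string);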
Converting from a CFString to a C string requires that you provide a buffer to receive the converted C string.
CFStringRef cf_string = (CFStringRef)@"foobar";
char buffer[7]; // six characters plus the zero terminator
Boolean result = CFStringGetCString(cf_string, buffer, sizeof buffer, kCFStringEncodingASCII);
if (result) {
  // ... conversion succeeded, okay to use string in buffer
  printf("%s\n", buffer);
}
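For plain C code that doesn't use Foundation or Core Foundation, the iconv() function mentioned earlier can do the conversion. What follows is only a sketch under a few assumptions: the helper name greek_to_utf8 is invented, the encoding names "ISO-8859-7" and "UTF-8" must be known to your system's iconv, and some platforms need you to link with -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Converts a nul terminated ISO-8859-7 string to a nul terminated
   UTF-8 string in output. Returns true on success. */
bool greek_to_utf8(const char *input, char *output, size_t output_size)
{
    assert(input != NULL && output != NULL && output_size > 0);

    iconv_t converter = iconv_open("UTF-8", "ISO-8859-7"); /* to, from */
    if (converter == (iconv_t)-1) {
        return false; /* this iconv doesn't support the conversion */
    }

    char *in = (char *)input;          /* iconv() doesn't modify the input bytes */
    size_t in_left = strlen(input);
    char *out = output;
    size_t out_left = output_size - 1; /* reserve room for the terminator */

    size_t result = iconv(converter, &in, &in_left, &out, &out_left);
    iconv_close(converter);
    if (result == (size_t)-1) {
        return false; /* invalid input or output buffer too small */
    }

    *out = '\0'; /* iconv() doesn't write the terminator for us */
    return true;
}
```

Feeding it the four ISO-8859-7 bytes 195 229 229 234 from the earlier table should produce the eight UTF-8 bytes 206 147 206 181 206 181 206 186, followed by the zero terminator that the helper adds itself.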
There's a lot more to cover. Computers may be all about numbers, but it seems to me that programming is 90% text processing. Next time, we'll look at C string literals and NSString literals.
