NSStrings of Objective-C (and their CFString siblings). Today we start at the beginning with C strings.
C strings are also called null (or nul) terminated strings, zero terminated strings or sometimes z-strings. A C string is simply a block of memory where bytes represent characters. The last byte in the block contains a zero value (the nul character) to mark the end of the string.
Here is the memory layout of the string "iPhone" in ASCII encoding. The string starts at memory address 48:
Address | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
---|---|---|---|---|---|---|---|
Value | 105 | 80 | 104 | 111 | 110 | 101 | 0 |
Character | 'i' | 'P' | 'h' | 'o' | 'n' | 'e' | '\0' |
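To see this layout from code, here is a minimal sketch that walks a C string one byte at a time and prints each value, including the terminating nul. The function name dump_c_string is made up for this illustration; the addresses on your machine will of course not be 48 through 54, so the sketch prints offsets instead.

```c
#include <stdio.h>
#include <stddef.h>

/* Print every byte of a C string as "offset: value", including the
   terminating nul. Returns the number of bytes printed.
   (dump_c_string is a hypothetical helper, not a library function.) */
size_t dump_c_string(const char *s)
{
    size_t i = 0;
    for (;;) {
        unsigned char byte = (unsigned char)s[i];
        printf("%zu: %3u\n", i, (unsigned)byte);
        i++;
        if (byte == 0)   /* the nul byte marks the end of the string */
            break;
    }
    return i;
}
```

Calling dump_c_string("iPhone") visits seven bytes, while strlen("iPhone") reports six, because strlen() does not count the terminating nul.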
Since C strings are just memory blocks, you declare C string variables as pointers to type char for mutable C strings, or type const char or char const for constant (immutable) C strings. (The const has the same meaning before or directly after the char.)

    // example C string variable declarations
    char *mutable_c_string;
    const char *immutable_c_string1;
    char const *immutable_c_string2;

Since C string literals are immutable, variables that point to literals should be declared const, or the compiler may complain (in plain C this warning usually has to be enabled, for example with -Wwrite-strings in GCC and Clang):

    char const *string1 = "foobar"; // okay
    char *string2 = "barfoo";       // WARNING! should be const

When you need some temporary storage to receive a C string, it's common to declare a char array.

    char buffer[81];
    sprintf(buffer, "The answer is %d\n", 42);

Here we use the sprintf() function to write formatted data to a string that's placed in buffer. The buffer has to be large enough for the result plus the terminating nul; snprintf() takes the buffer size as an argument and is the safer choice when the length isn't known in advance.
You may also sometimes see a C string declared like this:

    char const name[] = "foo";

This is almost the same as:

    char const *name = "foo";

There's a subtle and mostly unimportant difference between these two declarations. The first one declares an array that holds its own copy of the characters; the second one declares a pointer to the first character of the string literal. We'll look at the difference between these two in the future when we cover arrays.
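One place where the two declarations visibly differ is sizeof. The sketch below is only an illustration; the names name_array, name_pointer, array_size and pointer_size are invented for it.

```c
#include <stddef.h>

/* An array: the object itself holds 'f', 'o', 'o', '\0'. */
char const name_array[] = "foo";

/* A pointer: the object holds only the address of the literal's first char. */
char const *name_pointer = "foo";

/* sizeof on the array counts its bytes, including the terminating nul. */
size_t array_size(void)
{
    return sizeof name_array;
}

/* sizeof on the pointer is the size of a pointer, whatever it points at. */
size_t pointer_size(void)
{
    return sizeof name_pointer;
}
```

Here array_size() is 4, while pointer_size() is whatever a char pointer occupies on the platform (commonly 4 or 8). Both names can still be passed to functions such as strlen() or printf() in exactly the same way.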
Character Encodings
C doesn't mandate any specific character encoding for strings. C strings frequently contain single byte encodings like ASCII or ISO-8859-1 (Latin-1). In a single byte encoding, each byte in the C string corresponds to a character, and the encoding defines 256 characters (or fewer -- some byte values may not be valid characters). C strings can also contain multibyte encodings such as UTF-8 or Shift JIS where some characters are represented by two or more bytes. It's up to the application programmer to keep track of character encoding issues when using C strings.
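As a small illustration of the bookkeeping a multibyte encoding requires, the following sketch counts characters (code points) in a UTF-8 C string by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The function name utf8_length is hypothetical, and the sketch assumes its input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the code points in a UTF-8 encoded C string.
   Assumption: s is valid UTF-8. Every byte that is not a continuation
   byte (10xxxxxx) starts a new code point.
   (utf8_length is a hypothetical helper, not a library function.) */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 string "Jos\xC3\xA9" ("José"), strlen() reports 5 bytes while utf8_length() reports 4 characters.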
Most encodings used today are ASCII compatible, meaning that character values from zero to 127 represent the same characters defined by the ASCII encoding. If your program only ever processes ASCII text, you're in luck: you can ignore most encoding issues (at least until some pesky user decides to enter "San José" or "Björk"). In the real world, people use many more characters than the measly 128 in the ASCII set, so it's necessary to pay a little attention to character encodings when working with C strings. When you have mismatched encodings, you get data corruption and unhappy users.
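If you want to know whether a given C string stays inside the ASCII range, a check along these lines is enough. This is a sketch, and is_ascii is a name invented here.

```c
#include <stdbool.h>

/* Return true if every byte before the terminating nul is in the
   ASCII range 0..127.
   (is_ascii is a hypothetical helper, not a library function.) */
bool is_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed, making high bytes negative. */
        if ((unsigned char)*s > 127)
            return false;
    }
    return true;
}
```

A string such as "San Jose" passes, while "San Jos\xC3\xA9" (UTF-8) or "San Jos\xE9" (Latin-1) does not, which is the point where you need to know which encoding you are holding.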
For example, here is the word "Γεεκ" ("Geek" in Greek letters) at memory address 64 using the ISO-8859-7 single byte encoding:
Address | 64 | 65 | 66 | 67 | 68 |
---|---|---|---|---|---|
Value | 195 | 229 | 229 | 234 | 0 |
Character | 'Γ' | 'ε' | 'ε' | 'κ' | '\0' |
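And here is the same word at the same address using the UTF-8 multibyte encoding, where each Greek letter takes two bytes (the Character row shows each letter under its first byte):
Address | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
---|---|---|---|---|---|---|---|---|---|
Value | 206 | 147 | 206 | 181 | 206 | 181 | 206 | 186 | 0 |
Character | 'Γ' | | 'ε' | | 'ε' | | 'κ' | | '\0' |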
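The byte values in the two tables are related by a simple mapping. The sketch below is a deliberately limited, hand-rolled converter, not a general solution: it handles only ASCII bytes and the ISO-8859-7 letters in the range 0xC1 to 0xFE (which sit at a fixed offset from Unicode U+0391 to U+03CE) and rejects everything else. The name greek_to_utf8 and its error convention are assumptions made for this example.

```c
#include <stddef.h>

/* Convert an ISO-8859-7 C string to UTF-8 (illustrative subset only).
   Handles ASCII and the Greek letters 0xC1..0xFE (0xD2 is unassigned).
   Returns the number of bytes written, not counting the nul, or
   (size_t)-1 if a byte is unsupported or dst is too small.
   (greek_to_utf8 is a hypothetical helper, not a library function.) */
size_t greek_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;
    for (; *src != '\0'; src++) {
        unsigned char b = (unsigned char)*src;
        if (b < 0x80) {
            /* ASCII is copied unchanged; keep room for the final nul. */
            if (out + 1 >= dst_size)
                return (size_t)-1;
            dst[out++] = (char)b;
        } else if (b >= 0xC1 && b <= 0xFE && b != 0xD2) {
            /* Greek letter: code point is the byte value plus 0x2D0. */
            unsigned cp = b + 0x2D0u;
            if (out + 2 >= dst_size)
                return (size_t)-1;
            dst[out++] = (char)(0xC0 | (cp >> 6));    /* lead byte 110xxxxx */
            dst[out++] = (char)(0x80 | (cp & 0x3F));  /* continuation 10xxxxxx */
        } else {
            return (size_t)-1;  /* outside the subset this sketch covers */
        }
    }
    if (out >= dst_size)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

Feeding it the four bytes 195, 229, 229, 234 from the first table produces the eight bytes 206, 147, 206, 181, 206, 181, 206, 186 from the second, followed by the terminating nul.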
Converting Between Encodings
The standard C library doesn't provide support for converting between encodings. On modern Unix and Unix-derived systems, the iconv() function is commonly used to convert between encodings. Objective-C programs usually use the facilities provided by NSString or CFString. Since NSString and CFString objects are represented internally as Unicode, they can store text from any encoding. To translate a C string to an NSString:

    char const *c_string = "foobar";
    NSString *ns_string = [NSString stringWithCString:c_string encoding:NSASCIIStringEncoding];

And to translate an NSString to a C string:

    NSString *ns_string = @"foobar";
    char const *c_string = [ns_string cStringUsingEncoding:NSASCIIStringEncoding];

The -cStringUsingEncoding: method returns a nul-terminated C string, or NULL if the string can't be represented in the requested encoding. (The -dataUsingEncoding: method also converts to a given encoding, but the bytes of the NSData object it returns are not nul-terminated, so -bytes alone does not give you a valid C string.)
If you're using the UTF-8 encoding, you can use the convenience method -UTF8String.

    char const *c_string = ns_string.UTF8String;

Note that this is effectively a shortcut for calling -cStringUsingEncoding: with NSUTF8StringEncoding. In both cases the returned C string is managed for you and is only guaranteed to stay valid for a limited time. (This is great if you just need to pass a C string along to a C function, but you'll need to copy the returned C string if you want to keep it around.)
Core Foundation provides similar C functions. You use CFStringCreateWithCString() to create a CFString from a C string:

    char const *c_string = "foobar";
    CFStringRef cf_string = CFStringCreateWithCString(kCFAllocatorDefault, c_string, kCFStringEncodingASCII);
    // ... release with CFRelease(cf_string) when you're done with it

Converting from a CFString to a C string requires that you provide a buffer to receive the converted C string. The buffer needs room for the terminating nul; the function returns false if the buffer is too small or the conversion fails.

    CFStringRef cf_string = (CFStringRef)@"foobar";
    char buffer[7]; // 6 characters plus the terminating nul
    Boolean result = CFStringGetCString(cf_string, buffer, sizeof buffer, kCFStringEncodingASCII);
    if (result) {
        // ... conversion succeeded, okay to use string in buffer
        printf("%s\n", buffer);
    }

There's a lot more to cover. Computers may be all about numbers, but it seems to me that programming is 90% text processing. Next time, we'll look at C string literals and NSString literals.
4 comments:
Hey Don, this is a good start to what looks like another great series of articles. I find that these Tuesdays with your tutorials are immensely helpful. I'm looking forward to next week's continuation of "string" theory. (And you can use that.)
Thanks! Glad that you're finding these useful.
Great!
Thanks.
This is a great series. Thank you for taking the time to put these together. I'm looking forward to the next installment.