Tuesday, July 13, 2010

Objective-C Tuesdays: string literals

Last time we started our new topic, strings, by looking at memory organization and character encodings of C strings. Today we'll look at C string and NSString literals.

Most programs do string processing, or at least print out a status message or two. It's convenient to define some strings directly in the program's code. Since strings are really just lists of numbers, you could certainly define your strings "by the numbers" using the raw character codes:
// what does this print out?
char message[7] = { 105, 80, 104, 111, 110, 101, 0 };
printf("%s\n", message);
Geek points if you recognized that message is a null-terminated string. Super ultra mega geek points if you can read what it says (hint: it's ASCII).
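To see that a "by the numbers" array and a string literal really are the same bytes, here is a minimal sketch using a different, shorter string (so the puzzle above stays unspoiled). The function name same_as_literal is made up for this illustration, and the character codes assume an ASCII-compatible character set.

```c
#include <assert.h>
#include <string.h>

// Returns 1 if a hand-coded byte array and a string literal hold the same bytes.
int same_as_literal(void)
{
    // 72 = 'H', 105 = 'i', 0 = null terminator (assuming ASCII)
    char by_the_numbers[3] = { 72, 105, 0 };
    char const *literal = "Hi";

    // sizeof counts the terminator; strlen does not
    assert(sizeof by_the_numbers == 3);
    assert(strlen(literal) == 2);

    // compare all three bytes, terminator included
    return memcmp(by_the_numbers, literal, 3) == 0;
}
```

Calling same_as_literal() from your own main should return 1: the compiler did the same translation you would otherwise do by hand.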

So obviously writing the raw character codes isn't that convenient for the programmer. Since the compiler has to translate your code into machine instructions anyway, it's a no-brainer to make it translate strings into the correctly encoded bytes. A string literal is a representation of a string in your program that the compiler translates into the corresponding character codes and stores in the program's data section. There are two kinds of string literals in Objective-C: plain old C string literals and NSString literals. They look like this:
// C string literal
char const *s1 = "Hello, world!";

// NSString literal
NSString *s2 = @"Hello, world!";
The double quote characters (") mark the beginning and the end of the string literal. NSString literals are prefixed with @ to distinguish them from C string literals. It's important not to mix the two up; they're not directly compatible.

Line Breaks
String literals are not allowed to span multiple lines. Actually, that's not exactly true, so I'll illustrate what I mean; this string is not a legal string literal:
// not a legal string literal
char const *s1 = "Hello, world!
How are you?";
Line breaks are not allowed inside the double quotes in a string literal. To include a line break, you use the new line (\n) escape sequence. We'll talk more about escape sequences below, but using the new line escape sequence, the string literal becomes:
// string literal containing a new line
char const *s1 = "Hello, world!\nHow are you?";
Notice that the new line escape sequence takes the place of an actual line break in the code. When the compiler sees "\n" in a string literal, it replaces it with ASCII character code 10, the line feed (or new line) character.
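If you want to convince yourself that the two source characters really collapse into one, a small check like this should do it. The function name newline_code is hypothetical, and the value 10 assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <string.h>

// Returns the character code the compiler stored for the \n escape.
int newline_code(void)
{
    char const *s1 = "Hello, world!\nHow are you?";

    // "Hello, world!" is 13 characters, so the escape lands at index 13
    assert(s1[12] == '!');
    assert(s1[14] == 'H');

    // 13 + 1 + 12: the backslash and the 'n' became a single character
    assert(strlen(s1) == 26);

    return s1[13];
}
```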

But sometimes you don't want to add line breaks to your string literal, but simply to break a long string literal across several lines to make your code more readable. One way is to use a backslash (\) before the line break to tell the compiler to ignore the line break; this is often used to format long preprocessor macros. These two string literals are identical:
char const *error1 = "Unable to complete request: please wait a few minutes and try again.";

char const *error2 = "Unable to complete request: \
please wait a few minutes and try again.";
This works for NSString literals also, but note that any leading space in the continuation line will be interpreted as part of the string. Also note that the backslash (\) must be directly before the line break in the code; if you have any space or tab characters between the backslash and the line break, the compiler will complain.
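Here is a rough sketch of the leading-space gotcha in plain C. The variable and function names are invented for the example; the only thing that differs between the two literals is the indentation of the continuation line.

```c
#include <assert.h>
#include <string.h>

// continuation line starts in column one: nothing extra ends up in the string
char const *flush_left = "wait a few minutes \
and try again.";

// continuation line is indented: the four spaces become part of the string
char const *indented = "wait a few minutes \
    and try again.";

// Returns how many extra characters the indentation added.
size_t extra_characters(void)
{
    assert(strcmp(flush_left, "wait a few minutes and try again.") == 0);
    return strlen(indented) - strlen(flush_left);
}
```

With four spaces of indentation, extra_characters() should come back as 4, which is exactly why this style is awkward in nicely indented code.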

A better way to do this is by simply breaking the string literal into two or more string literals that are separated only by whitespace. The following two string literals are identical:
char const *error1 = "Unable to complete request: please wait a few minutes and try again.";

char const *error2 = "Unable to complete request: "
                     "please wait a few minutes and try again.";
This also works for NSString literals; only the first part of an NSString literal is prefixed with @:
NSString *error1 = @"Unable to complete request: please wait a few minutes and try again.";

NSString *error2 = @"Unable to complete request: "
                    "please wait a few minutes and try again.";
Only spaces, tabs and line breaks are allowed between sections of a string literal. If the string is supposed to have a line break at the end of each section, you need to add new line escapes:
char const *error_page = 
  "<html>\n"
  "  <head><title>404 Not Found</title></head>\n"
  "  <body>\n"
  "    <h1>404 Not Found</h1>\n"
  "  </body>\n"
  "</html>\n"; 

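As a quick sanity check of the two points above (adjacent pieces are joined into one string, and line breaks only appear where you write \n), here is a small sketch. The helper names count_line_feeds and pieces_match are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

// Counts the line feed characters in a string.
size_t count_line_feeds(char const *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '\n') {
            ++n;
        }
    }
    return n;
}

// Returns 1 if a literal split into pieces matches the one-piece version.
int pieces_match(void)
{
    char const *one_piece = "<h1>404 Not Found</h1>\n";
    char const *pieces = "<h1>"
                         "404 Not Found"
                         "</h1>\n";

    // the line breaks in the source code add nothing to the string
    assert(count_line_feeds(pieces) == 1);

    return strcmp(one_piece, pieces) == 0;
}
```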
Escape Sequences
There are other escape sequences like the new line (\n) escape sequence. The most commonly used ones are:

escape sequence   name                     ASCII value
\n                new line or line feed    10
\r                carriage return          13
\t                tab                      9
\"                double quote             34
\\                backslash                92
Each of these escape sequences requires two characters in the string literal, but becomes only one character in the string when the program is compiled.
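The table can be checked mechanically with a sketch along these lines. The names escapes and escapes_match_table are made up for the example, and the expected values assume an ASCII-compatible character set.

```c
#include <assert.h>
#include <string.h>

// One of each escape from the table: ten characters in the source, five in the string.
char const *escapes = "\n\r\t\"\\";

// Returns 1 if every escape compiled to the ASCII value listed in the table.
int escapes_match_table(void)
{
    static unsigned char const expected[5] = { 10, 13, 9, 34, 92 };

    assert(strlen(escapes) == 5);

    return memcmp(escapes, expected, sizeof expected) == 0;
}
```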

Octal Escape Sequences
If you wish to specify an arbitrary byte value in a string literal, you can use an octal escape sequence. Octal escape sequences begin with a backslash (\) like normal escapes, but the backslash is followed by an octal (base 8) number instead of a letter or punctuation mark.
// octal escape sequence examples
char const *bell = "\7";  // ASCII code 7 (bell)
char const *bs = "\10";   // ASCII code 8 (backspace)
char const *del = "\177"; // ASCII code 127 (delete)
The octal numbers in escape sequences are limited to three digits; you can pad short octal numbers with leading zeros:
char const *bell = "\007";
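which is handy for formatting a long sequence of octal escapes. Also note that the value of an octal escape must be between 0 and 255 (377 octal). The compiler stops reading an escape at the first character that isn't an octal digit (0 through 7), so something that looks like a larger number can be interpreted in a surprising way: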
// max octal value is 377 (255 decimal)
char const *two55 = "\377";
NSLog(@"length = %zu", strlen(two55));
NSLog(@"first char = %u", (unsigned char)two55[0]);
// prints: 
//   length = 1
//   first char = 255

// '8' isn't an octal digit, so "\378" is the escape \37 followed by the character '8'
char const *two56 = "\378";
NSLog(@"length = %zu", strlen(two56));
NSLog(@"first char = %u", (unsigned char)two56[0]);
NSLog(@"second char = %u", (unsigned char)two56[1]);
// prints:
//   length = 2
//   first char = 31 (octal 37)
//   second char = 56 (ASCII code for '8')
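One way to picture what happens with "\378" is to imitate the rule in a few lines of C. This is only a sketch of the rule as I understand it, not how any particular compiler is implemented; read_octal_escape and agrees_with_compiler are hypothetical names.

```c
#include <assert.h>

// Hypothetical helper: reads the characters that follow a backslash the way
// an octal escape is read, taking at most three octal digits (0 through 7).
// Stores the number of digits used in *used and returns the value.
int read_octal_escape(char const *digits, int *used)
{
    int value = 0;
    int n = 0;
    while (n < 3 && digits[n] >= '0' && digits[n] <= '7') {
        value = value * 8 + (digits[n] - '0');
        ++n;
    }
    *used = n;
    return value;
}

// Returns 1 if the sketch agrees with what the compiler produced for "\378".
int agrees_with_compiler(void)
{
    char const *two56 = "\378";
    int used = 0;
    int value = read_octal_escape("378", &used);

    // only "37" is consumed; the '8' is left over as an ordinary character
    assert(used == 2);

    return value == (unsigned char)two56[0] && two56[1] == '8';
}
```

The same helper stops after three digits even when a fourth octal digit follows, which is the behavior that makes zero-padded escapes unambiguous.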
Because the compiler will try to read up to three octal digits, an octal escape with fewer than three digits can sometimes have an unexpected interpretation. For example, embedding a form feed character (ASCII code 12, octal 14) at the start of this string produces the expected string:
char const *heading = "\14Preface";

NSLog(@"first char = %u", (unsigned char)heading[0]);
NSLog(@"second char = %u", (unsigned char)heading[1]);
// prints:
//   first char = 12 (form feed, octal value 14)
//   second char = 80 (ASCII code for 'P')
But if the character directly after '\14' is a valid octal digit, the compiler produces something unintended:
char const *heading = "\141. Introduction";

NSLog(@"first char = %u", (unsigned char)heading[0]);
NSLog(@"second char = %u", (unsigned char)heading[1]);
// prints:
//   first char = 97 (octal value 141)
//   second char = 46 (ASCII code for '.')
The heading number '1' is a valid octal digit, so the compiler assumes it's part of the octal escape. There are several ways to prevent this. You can use an escape sequence to specify the ambiguous character, break the string into parts, or simply pad the octal number with leading zeros.
// dealing with ambiguous octal escapes

// replace possible octal characters with escapes
char const *heading1 = "\14\61. Introduction"; // '\61' is octal escape for '1'

// pad octal escape to three digits
char const *heading2 = "\0141. Introduction"; // unambiguous

// break string into parts
char const *heading3 = "\14" "1. Introduction"; // easier to read
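If you'd rather not take my word that all three fixes produce the same string, a check like this should confirm it. The function name headings_identical is made up for the example; the literals are the ones shown above.

```c
#include <assert.h>
#include <string.h>

char const *heading1 = "\14\61. Introduction";  // escape for the '1'
char const *heading2 = "\0141. Introduction";   // padded to three digits
char const *heading3 = "\14" "1. Introduction"; // broken into parts

// Returns 1 if all three spellings compile to the same string.
int headings_identical(void)
{
    // form feed first, then the heading number
    assert((unsigned char)heading1[0] == 12);
    assert(heading1[1] == '1');

    return strcmp(heading1, heading2) == 0
        && strcmp(heading2, heading3) == 0;
}
```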

Hexadecimal Escape Sequences
Hexadecimal numbers can also be used in escape sequences to specify an arbitrary byte value. Hexadecimal escape sequences begin with a backslash (\) followed by 'x' and one or more hexadecimal (base 16) digits. Note that the 'x' must be lower case. Like octal escapes, you can pad hexadecimal escapes with leading zeros.
// hexadecimal escape sequence examples
char const *tab = "\x09";    // ASCII code 9 (horizontal tab)
char const *newline = "\xA"; // ASCII code 10 (new line/line feed)
char const *del = "\x7f";    // ASCII code 127 (delete)
The upper hexadecimal digits (represented by A through F) can be upper or lower case.
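A short sketch to confirm that case and leading zeros make no difference to the byte you get. The function name hex_forms_agree is an assumption for this example, and the comparison with \t assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <string.h>

// Returns 1 if different spellings of the same escape give the same string.
int hex_forms_agree(void)
{
    // upper and lower case hex digits give the same byte
    assert(strcmp("\x7f", "\x7F") == 0);

    // leading zeros don't change the value
    assert(strcmp("\xA", "\x0a") == 0);

    // and hex escapes match the equivalent named and octal escapes
    return strcmp("\x09", "\t") == 0
        && strcmp("\x7f", "\177") == 0;
}
```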

Like octal escapes, hexadecimal escapes have a gotcha: the compiler will interpret every valid hex digit after the "\x" as part of the hexadecimal escape. For example, embedding a form feed character (ASCII code 12) at the start of this string works as expected:
char const *title = "\xcThe C Language";

NSLog(@"first char = %u", (unsigned char)title[0]);
NSLog(@"second char = %u", (unsigned char)title[1]);
// prints:
//   first char = 12 (form feed, hex value c)
//   second char = 84 (ASCII code for 'T')
Since 'T' isn't a valid hex digit, the compiler figures out that the first character is '\xc'. The following string doesn't work as expected:
char const *title = "\xcC Language Primer";

NSLog(@"first char = %u", (unsigned char)title[0]);
NSLog(@"second char = %u", (unsigned char)title[1]);
// prints:
//   first char = 204 (hex value cc)
//   second char = 32 (ASCII code for space)
Since 'C' is a valid hex digit, the compiler sees the first character as '\xcC' (cc in hex, 204 in decimal) and the second character as the space after the 'C'. To prevent this, you can replace any ambiguous character with an escape sequence, or better yet simply break the string into parts.
// dealing with ambiguous hexadecimal escapes

// replace possible hex characters with escapes
char const *title1 = "\xc\103 Language Primer"; // '\103' is octal escape for 'C'

// break string into parts
char const *title2 = "\xc" "C Language Primer"; // much easier to read
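Here is a small check of the two fixes against the ambiguous version. The names ambiguous and titles_fixed are invented for the example; the literals are the ones from above.

```c
#include <assert.h>
#include <string.h>

char const *ambiguous = "\xcC Language Primer";   // 'C' swallowed by the escape
char const *title1 = "\xc\103 Language Primer";   // escape for the 'C'
char const *title2 = "\xc" "C Language Primer";   // broken into parts

// Returns 1 if both fixes produce a form feed followed by "C Language Primer".
int titles_fixed(void)
{
    // the ambiguous literal lost its 'C', so it is one character shorter
    assert(strlen(ambiguous) + 1 == strlen(title2));
    assert((unsigned char)ambiguous[0] == 0xCC);

    return strcmp(title1, title2) == 0
        && title2[0] == '\xc'
        && title2[1] == 'C';
}
```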
As in octal escapes, the hexadecimal number in a hexadecimal escape is limited to the range of 0 through 255. If you specify a hexadecimal escape sequence larger than 255, the compiler will emit a "hex escape sequence out of range" warning.

Next time we will continue our look at string literals by examining Unicode escape sequences.

5 comments:

Kevin Bomberry said...

Great post! I'm looking forward to the Unicode escaped sequences as unicode should be the standard for text content served up by web sites, applications and services.

Also, if you can cover in a future segment create a filter for characters when parsing a string that would be great. I know that some RSS feeds are propagated with content that has been copy and pasted into a CMS and that there are control characters that sometimes cause bad and hard to find/debug things to happen.

Looking forward to your next article!

Mike said...

Learned a lot of stuff... Some of it I hope I never have to use, though. :P

By the way, the first time you mentioned the backslash, you actually used a forward slash. You may want to correct it, as enough people mix them up already.

Don McCaughey said...

@Mike Doh! Thanks, fixed the typo. Glad you found this useful.

Felix said...

What does this produce?


@"firstbit",@"secondbit"

Don McCaughey said...

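If you do something like this:

NSString *s = (@"firstbit", @"secondbit");

the parentheses make the comma act as the comma operator, so this will assign the result of (@"firstbit", @"secondbit") to s. If I remember correctly, the comma operator returns the result of the second expression, so s will point to @"secondbit". (Without the parentheses, the comma in a declaration separates declarators, and the line won't compile.)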

If you omit the comma, the compiler concatenates the adjacent strings together, so

NSString *s = @"firstbit" @"secondbit";

assigns the string @"firstbitsecondbit" to s.