Tuesday, July 20, 2010

Objective-C Tuesdays: Unicode string literals

Last week we started looking into C string and NSString literals. Today we'll continue this topic by looking at embedding Unicode characters in literals using Unicode escape sequences.

Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5. The C99 standard actually refers to these escape sequences as universal character names since C doesn't require that the compiler use a particular character set or encoding scheme, but iOS and most modern systems use the Unicode character set so we'll continue to call them "Unicode escapes".

There are two flavors of Unicode escapes. The first begins with a backslash (\) followed by a lower case 'u' and four hexadecimal digits, allowing the encoding of Unicode characters from 0 to 65535. This Unicode range encodes the basic multilingual plane, which includes most characters in common use today. The second Unicode escape type begins with a backslash (\) followed by an upper case 'U' and eight hexadecimal digits, which can encode every possible Unicode character, including historical languages and special character sets such as musical notation.
// Examples of Unicode escapes

char const *gamma1 = "\u0393";    // capital Greek letter gamma (Γ)
NSString *gamma2 = @"\U00000393"; // also Γ
Unlike hexadecimal escape sequences, Unicode escapes are required to have exactly four or eight hexadecimal digits after the 'u' or 'U' respectively. If you supply too few digits, the compiler generates an "incomplete universal character name" error.
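
Here's a quick illustration; the first declaration is commented out because its escape is one digit short and won't compile:
// char const *broken = "\u039"; // error: incomplete universal character name
char const *fixed = "\u0393";    // OK: exactly four hex digits after 'u'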

If you're familiar with character encoding issues, you're probably wondering how Unicode characters get encoded in plain old C strings. Since the char data type can only hold a character value from zero to 255, what does the compiler do when it encounters a capital gamma (Γ) with a Unicode character value of 915 (or 393 in hex)? The C99 standard leaves this up to the compiler. In the version of GCC that ships with Xcode and the iOS SDK, the answer is UTF-8 encoding.
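
As a quick sanity check (the variable names are just for illustration, and this assumes the usual <string.h> header is in scope), a Unicode escape and the equivalent hand-written UTF-8 bytes should produce identical C strings with this compiler:
// "\u0393" and the explicit UTF-8 bytes 0xCE 0x93 come out the same
char const *escaped = "\u0393";     // capital Greek letter gamma (Γ)
char const *raw     = "\xce\x93";   // the same two UTF-8 bytes, spelled out
NSLog(@"%d", strcmp(escaped, raw)); // prints 0 (the strings match)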

This is one potential gotcha when using Unicode escape sequences. Even though the string literal in our example specifies a single logical character, capital gamma (Γ)
char const *gamma1 = "\u0393";
the compiler has no way to encode that logical character in a single char. We would expect that
NSLog(@"%u", strlen(gamma1));
would print 1 for the length of the string, but it actually prints 2.
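
If you want to see those bytes for yourself, a little sketch like this (assuming <stdio.h> is included) dumps the raw char values of gamma1:
// Walk the bytes of gamma1 and print each one as an unsigned value
for (unsigned char const *p = (unsigned char const *)gamma1; *p != 0; p++) {
    printf("%u ", (unsigned int)*p); // prints: 206 147
}
printf("\n");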

If you read the first post in the strings series, you might remember this table showing the memory layout of the word "Geek" in Greek letters (Γεεκ) in the UTF-8 encoding:

Address     64    65    66    67    68    69    70    71    72
Value      206   147   206   181   206   181   206   186     0
Character  'Γ'         'ε'         'ε'         'κ'         '\0'

In UTF-8, letters in the Greek alphabet take up two bytes (or chars) each. (And other characters may use three or four bytes.) The standard C strlen() function actually counts chars (or bytes) in the string rather than logical characters, which made perfect sense in 1970 when computers used ASCII or another single byte character set like Latin-1.
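
If you need the number of logical characters (code points) rather than bytes, one rough approach is to count only the bytes that start a character, skipping UTF-8 continuation bytes. This is just a sketch that assumes well-formed UTF-8, and the function name is made up for illustration:
// Count Unicode code points in a UTF-8 string by ignoring continuation
// bytes, which always have the bit pattern 10xxxxxx.
size_t utf8_code_point_count(char const *s) {
    size_t count = 0;
    for (; *s != 0; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) { // not a continuation byte
            count++;
        }
    }
    return count;
}
// utf8_code_point_count("\u0393\u03b5\u03b5\u03ba") returns 4; strlen() returns 8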

NSString literals suffer from a similar problem. Internally, NSString uses 16 bit words to encode each character. This made sense when NSString was created, since early versions of the Unicode standard only encoded up to 65,535 characters, so a 16 bit word value could hold any Unicode character (at the time).

Unfortunately, the Unicode consortium discovered there was a strong desire to encode historical scripts and special character sets like music and math notation along with modern languages, and 16 bits wasn't large enough to accommodate all the symbols. The Unicode code space was expanded well beyond 16 bits (code points now run up to hexadecimal 10FFFF) and the UTF-16 encoding was created. In UTF-16, "characters" in the hexadecimal ranges D800-DBFF (the high surrogates) and DC00-DFFF (the low surrogates) are used in pairs to encode Unicode characters with values greater than 65,535. This is analogous to how UTF-8 uses multiple bytes to encode a single logical character.
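
The arithmetic behind a surrogate pair is simple enough to sketch out; the function name here is made up for illustration:
// Split a code point above 0xFFFF into a UTF-16 surrogate pair.
void make_surrogate_pair(unsigned int codePoint,
                         unsigned short *high, unsigned short *low) {
    unsigned int v = codePoint - 0x10000;           // now a 20 bit value
    *high = (unsigned short)(0xD800 + (v >> 10));   // top 10 bits
    *low  = (unsigned short)(0xDC00 + (v & 0x3FF)); // bottom 10 bits
}
// make_surrogate_pair(0x1D11E, &high, &low) gives high = 0xD834, low = 0xDD1E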

So the musical G clef symbol (𝄞), which has Unicode value 1D11E in hex (119,070 in decimal), is encoded as two "characters" in an NSString.
// NSString sometimes has a misleading "length"
NSString *gClef = @"\U0001d11e"; // musical G clef symbol (𝄞)
NSLog(@"%u", [gClef length]);
The log statement prints out 2 instead of 1.

In memory, the NSString data looks like this:
Address    64      65      66      67      68      69
Value      0xD834          0xDD1E          0
Character  '𝄞'                             '\0'
Like the strlen() function for C strings, the -length method actually returns the number of 16 bit words (UTF-16 code units) in the NSString, which is usually, but not always, the same as the number of logical characters in the NSString object.
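
If you need to count user-visible characters rather than 16 bit words, one option (available on iOS 4 and Mac OS X 10.6 and later) is to enumerate composed character sequences; here's a sketch:
// Count composed character sequences instead of UTF-16 code units
NSString *gClef = @"\U0001d11e"; // musical G clef symbol (𝄞)
__block NSUInteger composedCount = 0;
[gClef enumerateSubstringsInRange:NSMakeRange(0, [gClef length])
                          options:NSStringEnumerationByComposedCharacterSequences
                       usingBlock:^(NSString *substring, NSRange substringRange,
                                    NSRange enclosingRange, BOOL *stop) {
    composedCount++;
}];
NSLog(@"%lu", (unsigned long)composedCount); // prints 1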

Next time, we'll continue our dive into Unicode string madness by looking at wide character strings.
