Tuesday, September 21, 2010

Objective-C Tuesdays: slicing and dicing strings

Last time, we looked at C string and NSString comparison and equality. Today we'll examine functions and methods for creating substrings of C strings and NSStrings.

Substrings of C strings
Creating a C string requires you to explicitly manage the memory the string lives in. Depending on how long you need to keep the C string around, you can use either a fixed buffer or a dynamically allocated one. As always with C strings, you need to be careful not to write past the end of the buffer.

Creating a substring that starts at the beginning of the source string is straightforward: use the strncpy() function. There's a big gotcha when using strncpy() to copy a substring: if the source has at least as many characters as you ask it to copy, it doesn't add a null terminator to the destination. Here's an example of copying the first three characters of a C string into a fixed buffer:
// copy substring from start of source
// using a fixed buffer
char const *source = "foobar";
char buffer[4];                // make sure buffer includes
                               // space for null terminator

strncpy(buffer, source, 3);    // copy first 3 chars from source
buffer[3] = '\0';              // remember to add null terminator
Using a dynamic buffer is similar, but requires explicit memory management.
// copy substring from start of source
// using a dynamic buffer
char const *source = "foobar";
char *buffer = malloc(4 * sizeof(char)); // make sure buffer includes
                                         // space for null terminator

if ( ! buffer) {
  // must handle allocation failure
}

strncpy(buffer, source, 3); // copy first 3 chars from source
buffer[3] = '\0';           // remember to add null terminator

// use buffer ...

// don't forget to free() buffer when done
free(buffer);
You can make this a little more compact by using calloc() instead of malloc(). The calloc() function allocates memory just as malloc() does, then clears all the bytes to zero. As long as you make sure to include an extra byte at the end, your new substring will be null terminated:
// copy substring from start of source
// using a dynamic buffer 
// allocated with calloc()
char const *source = "foobar";
char *buffer = calloc(4, sizeof(char)); // make sure buffer includes
                                        // space for null terminator

if ( ! buffer) {
  // handle allocation failure
}

strncpy(buffer, source, 3); // copy first 3 chars from source
                            // last char in buffer is already '\0'

// use buffer ...

// don't forget to free() buffer when done
free(buffer);
There's not a huge difference between malloc() and calloc(), so choose whichever one you're more used to using, or use calloc() if you don't have a strong preference. The cost of clearing a range of memory to zeros is so tiny as to not be worth considering in most circumstances, and knowing that your buffer is initialized to zeros can be handy.
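The examples above assume the source holds at least three characters and that the buffer is big enough. Here's a minimal sketch of a helper that checks both before copying. The name copy_prefix() and its parameters are made up for illustration; this is not a standard C function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Copy at most `count` chars from the start of `source` into `dest`,
// never writing more than `dest_size` bytes, and always null terminating.
// Returns the number of chars copied (not counting the terminator).
size_t copy_prefix(char *dest, size_t dest_size,
                   char const *source, size_t count) {
  assert(dest != NULL && source != NULL && dest_size > 0);

  size_t sourceLength = strlen(source);
  if (count > sourceLength) {
    count = sourceLength;       // source is shorter than requested
  }
  if (count > dest_size - 1) {
    count = dest_size - 1;      // leave room for the null terminator
  }

  strncpy(dest, source, count); // copies count chars, no terminator
  dest[count] = '\0';           // so always add it explicitly
  return count;
}
```

With a four byte buffer, copy_prefix(buffer, sizeof buffer, "foobar", 3) leaves "foo" in buffer and returns 3, while asking for three characters of "hi" copies just two.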

There's no standard C function for getting a substring that starts somewhere in the middle of the source string, because one isn't needed -- you simply move the pointer from the start of the string. Here's an illustration:
// C strings are pointers
char const *string = "foobar";

NSLog(@"'%s'", string);
// prints out 'foobar'

char const *substring = string + 3;
NSLog(@"'%s'", substring);
// prints out 'bar'
You can add an integer value to the C string pointer to get a pointer to the middle of the source string -- just be careful not to go off the end of the string! If you only need the substring for a short period of time, or if you know that the source string will live longer than the substring and never change, it's safe to simply create a substring this way. However, you can introduce weird bugs if you get this wrong. When in doubt, copy the substring to a new buffer:
// create a substring from the middle of a string
char const *source = "foobar";
char const *substringSource = source + 3;
size_t charCount = strlen(substringSource) + 1;
char *buffer = calloc(charCount, sizeof(char));

if ( ! buffer) {
  // handle allocation failure
}

strcpy(buffer, substringSource);

// use buffer ...

free(buffer);
Here we calculate the starting point by simply adding 3 to the string pointer source. Then we figure out the number of chars we need to allocate using the strlen() function, remembering to add 1 for the null terminator character. After allocating memory, the strcpy() function copies all the characters from substringSource into buffer. Unlike strncpy(), strcpy() will copy the null terminator, so this code will be the same whether we use calloc() or malloc() to allocate the buffer.
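To see the kind of weird bug that a borrowed pointer can cause, here's a small sketch. The function names are hypothetical: duplicate_tail() just wraps the copy-to-a-new-buffer steps described above, and the second function compares a borrowed substring with a real copy after the source changes:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Return a newly allocated copy of `source` starting at `offset`,
// or NULL if offset is past the end or allocation fails.
// The caller must free() the result.
char *duplicate_tail(char const *source, size_t offset) {
  assert(source != NULL);
  if (offset > strlen(source)) {
    return NULL;                      // offset would run off the end
  }
  char const *start = source + offset;
  char *copy = malloc(strlen(start) + 1);
  if (copy) {
    strcpy(copy, start);              // strcpy() copies the terminator
  }
  return copy;
}

// Returns 1 if a borrowed substring sees a later change to its source
// while an independent copy does not, 0 otherwise.
int borrowed_pointer_sees_change(void) {
  char source[] = "foobar";           // mutable array, not a literal
  char const *borrowed = source + 3;  // points into source's memory
  char *copy = duplicate_tail(source, 3);
  if ( ! copy) {
    return 0;                         // allocation failed
  }

  source[3] = 'B';                    // source is now "fooBar"

  int result = strcmp(borrowed, "Bar") == 0  // borrowed changed too
            && strcmp(copy, "bar") == 0;     // the copy did not
  free(copy);
  return result;
}
```

The borrowed pointer is only as stable as the memory it points into, which is why copying is the safer default when you're unsure how long the source will stay unchanged.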

If you need to grab a substring that falls between the beginning and end of a longer string, you combine these two techniques: use pointer arithmetic to get a pointer to the start of the substring, then use strncpy() to copy just the characters you need.
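Combining the two techniques might look something like the sketch below. The function name copy_substring() and its bounds handling (clamping the length, returning NULL for a bad start index) are assumptions chosen for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Return a newly allocated copy of up to `length` chars of `source`,
// beginning at index `start`. Returns NULL if start is past the end of
// source or if allocation fails. The caller must free() the result.
char *copy_substring(char const *source, size_t start, size_t length) {
  assert(source != NULL);

  size_t sourceLength = strlen(source);
  if (start > sourceLength) {
    return NULL;                       // don't go off the end
  }
  size_t available = sourceLength - start;
  if (length > available) {
    length = available;                // clamp to what's actually there
  }

  // calloc() zeros the buffer, so the extra char is the terminator
  char *buffer = calloc(length + 1, sizeof(char));
  if ( ! buffer) {
    return NULL;
  }

  // pointer arithmetic finds the start, strncpy() limits the count
  strncpy(buffer, source + start, length);
  return buffer;
}
```

For example, copy_substring("foobar", 2, 2) returns a buffer containing "ob", which you free() when you're done with it.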

Warning: beware encoding issues!
Slicing and dicing C strings is easy when you're using a single byte encoding like ASCII. If you're using a multibyte encoding like UTF-8, you need to be aware that one logical character may require two or more bytes. If you want to omit the first three logical characters in a string, you need to examine each byte from the start of a string to determine if it's part of a multibyte sequence, and adjust your string pointer accordingly. If you need to work with multibyte encodings, I recommend finding an appropriate library for the encoding, such as the International Components for Unicode for working with Unicode encodings. Or better yet, transform your C strings into NSStrings.
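To give a feel for the byte examination involved, here's a rough sketch that skips a number of logical characters in a UTF-8 string. It assumes the input is valid, null terminated UTF-8, it treats each Unicode code point as one character (so combining sequences are not handled), and the name utf8_skip() is made up for illustration. A real library will be more thorough:

```c
#include <assert.h>
#include <stddef.h>

// Return a pointer to the byte just past the first `count` code points
// of `string`. Stops early if the null terminator is reached.
// Assumes `string` is valid, null terminated UTF-8.
char const *utf8_skip(char const *string, size_t count) {
  assert(string != NULL);

  unsigned char const *p = (unsigned char const *)string;
  while (count > 0 && *p != '\0') {
    p++;                            // step past the lead byte
    while ((*p & 0xC0) == 0x80) {   // continuation bytes look like
      p++;                          // 10xxxxxx, so skip those too
    }
    count--;
  }
  return (char const *)p;
}
```

Given the UTF-8 bytes for "héllo", where é takes two bytes, utf8_skip(string, 2) returns string + 3 rather than string + 2.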

Substrings of NSStrings
There are three ways to get a substring of an NSString. First we'll look at taking a substring from the start of an NSString:
// create a substring from the start of source
NSString *source = @"foobar";

NSString *substring = [source substringToIndex:3];
// substring is "foo"
The substring returned by -substringToIndex: is autoreleased. You should -retain or -copy it if you need to hold on to it.

Similarly, to get a substring that starts in the middle of an NSString and goes to the end:
// create a substring to the end of source
NSString *source = @"foobar";

NSString *substring = [source substringFromIndex:3];
// substring is "bar"
Finally, the general purpose way to create a substring of an NSString is the -substringWithRange: method. It takes an NSRange structure, which is defined something like this:
// NSRange structure
typedef struct _NSRange {
  NSUInteger location;
  NSUInteger length;
} NSRange;
When used with the -substringWithRange: method, the NSRange's location field is the zero-based index of the first character to be included in the substring, and the length field is the number of characters to include in the substring. Here are some examples:
// -substringWithRange: examples
NSString *source = @"foobar";
NSRange range;

range.location = 0;
range.length = 3;
NSString *frontHalf = [source substringWithRange:range];
// frontHalf is "foo"

range.location = 3;
range.length = 3;
NSString *backHalf = [source substringWithRange:range];
// backHalf is "bar"

range.location = 2;
range.length = 2;
NSString *middle = [source substringWithRange:range];
// middle is "ob"
One word of caution: if the range you give falls outside the receiver (the source string), this method will raise an NSRangeException.

Setting the fields of NSRange is fairly verbose; it's generally more convenient to use the NSMakeRange() function to create the NSRange structure instead.
// NSMakeRange() example
NSString *source = @"foobar";

NSString *frontHalf = [source substringWithRange:NSMakeRange(0, 3)];
// frontHalf is "foo"

NSString encoding mostly not a worry
Internally, NSString uses UTF-16 encoding. Although UTF-16 is a variable length encoding like UTF-8, characters from the basic multilingual plane are all two bytes (one 16-bit word) in length. If you're certain that your NSString contains only basic multilingual plane characters, then methods like -length and -substringWithRange: will work exactly as you expect. However, if your NSString includes characters outside the basic multilingual plane, it will contain surrogate pairs, which are two-word sequences that represent a single character. You'll find that -length tells you the number of 16-bit words rather than logical characters, and if you're not careful, methods like -substringWithRange: can split a surrogate pair in half, leaving you with an invalidly encoded string.
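The difference between words and logical characters is easier to see on raw UTF-16 data. The sketch below is plain C working on an array of 16-bit words, not NSString code, and all the function names are hypothetical. It counts code points and checks whether cutting at a given index would split a surrogate pair:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// High (lead) surrogates are 0xD800-0xDBFF.
bool is_high_surrogate(uint16_t unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

// Low (trail) surrogates are 0xDC00-0xDFFF.
bool is_low_surrogate(uint16_t unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Count code points in `length` UTF-16 words.
// A well formed surrogate pair counts once.
size_t utf16_code_point_count(uint16_t const *units, size_t length) {
  assert(units != NULL || length == 0);

  size_t count = 0;
  for (size_t i = 0; i < length; i++) {
    if (is_high_surrogate(units[i]) && i + 1 < length
        && is_low_surrogate(units[i + 1])) {
      i++;                      // skip the low half of the pair
    }
    count++;
  }
  return count;
}

// True if cutting the data just before `index` would separate the
// two halves of a surrogate pair.
bool utf16_index_splits_pair(uint16_t const *units, size_t length,
                             size_t index) {
  assert(units != NULL || length == 0);

  return index > 0 && index < length
      && is_high_surrogate(units[index - 1])
      && is_low_surrogate(units[index]);
}
```

For the four words 'a', 0xD834, 0xDD1E, 'b' (the middle two encode U+1D11E, a musical G clef), the code point count is 3, and index 2 is the one place where a cut would split the pair. In Objective-C, NSString's -rangeOfComposedCharacterSequenceAtIndex: method is a good place to start when you need to adjust a range to avoid this.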

Unless your application needs to work with characters outside the basic multilingual plane, the easiest solution is to filter out such characters when you accept data from a source outside your app. Since the basic multilingual plane contains all the characters in common use in most modern languages, this is sufficient for many applications. The standard iOS input keyboards limit the user to characters in the basic multilingual plane, but if your app reads data from the network, such as an RSS feed you don't control, you need to watch out for this.

Next time, we'll look at searching in C strings and NSStrings.
