Tuesday, September 28, 2010

Objective-C Tuesdays: searching in strings

Last week we looked at creating substrings of C strings and NSStrings. Today we look at another common string operation: searching within a string.

Find a character in a C string
As with all operations on C strings, searching requires you to deal with pointers. To find the first occurrence of a character in a C string, use the strchr() function. If the character is found, a pointer to that character is returned. If the character isn't present in the string, NULL is returned.
// find a character in a C string
char const *s = "foobar";

char const *character = strchr(s, 'b');
if (character) {
  NSLog(@"Found b");
} else {
  NSLog(@"Didn't find b");
}
// prints "Found b"
As we saw last week when we looked at substrings, the pointer returned by strchr() is effectively a substring of the source string starting at the first occurrence of the character you were searching for:
char const *s = "foobar";

char const *substring = strchr(s, 'b');
if (substring) {
  NSLog(@"The substring is %s", substring);
}
// prints "The substring is bar"
Once you find the character you're looking for, it's common to want to create a substring containing everything up to that position in the string:
char const *filename = "myfile.txt";

char const *dot = strchr(filename, '.');
if (dot) {
  size_t length = dot - filename;
  char *baseFilename = calloc(length + 1, sizeof(char));
  if (baseFilename) {
    strncpy(baseFilename, filename, length);
    NSLog(@"The base filename is %s", baseFilename);
    free(baseFilename); // release the buffer once you're done with it
  }
}
// prints "The base filename is myfile"
You use the difference between the two string pointers to calculate the number of chars up to (but not including) the character you searched for. After allocating a buffer to hold the new substring (and the null terminator), you use the strncpy() function to copy the first part of the source string. Because we called calloc(), the last char in our buffer is already set to zero; if you use malloc() or a fixed buffer instead, you need to remember to set the null terminator since strncpy() isn't guaranteed to do it for you.
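To make the malloc() case concrete, here is a minimal sketch of the same prefix copy with the terminator written by hand. The helper name copy_prefix_before is hypothetical and used only for illustration; it is not a standard library function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of everything in
   `source` before the first occurrence of `ch`, or NULL if `ch` is not
   found or the allocation fails. The caller must free() the result. */
char *copy_prefix_before(char const *source, int ch) {
  char const *found = strchr(source, ch);
  if (!found) {
    return NULL;
  }

  size_t length = (size_t)(found - source);
  char *prefix = malloc(length + 1); /* contents are uninitialized */
  if (!prefix) {
    return NULL;
  }

  strncpy(prefix, source, length); /* copies length chars, no terminator */
  prefix[length] = '\0';           /* so write the terminator by hand */
  return prefix;
}
```

Calling copy_prefix_before("myfile.txt", '.') should hand back a buffer containing "myfile", which you release with free() when you're done with it.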

Very often, you want to find the last occurrence of a character; you can use the strrchr() function to search in reverse:
// find a character in reverse
char const *filename = "myfile.txt";

char const *extension = strrchr(filename, '.');
if (extension) {
  NSLog(@"The extension is %s", extension);
}
// prints "The extension is .txt"

Find one C string in another
To find the first occurrence of one C string in another, use the strstr() function. Like strchr(), it returns a pointer to the first occurrence of the string, or NULL if it wasn't found.
// find one C string in another
char const *s1 = "The quick brown fox";

char const *s2 = strstr(s1, "ick");
if (s2) {
  NSLog(@"Found ick");
} else {
  NSLog(@"Didn't find ick");
}
// prints "Found ick"
Unfortunately, the C standard library doesn't have a strrstr() function to search for the last occurrence of one string in another. You'll need to roll your own by calling strstr() in a loop, remembering the most recent match, until it finds no further match. (The implementation is left as an exercise for the reader; better yet, convert your C string to an NSString and keep reading :-)
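If you'd rather skip the exercise, one possible sketch of the loop looks like this. The name find_last is hypothetical (chosen to avoid clashing with any platform-specific strrstr()), and returning the end of the string for an empty search string is an assumption made here to mirror how strstr() treats an empty needle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the C standard library): returns a
   pointer to the last occurrence of `needle` in `haystack`, or NULL if
   there is none. */
char const *find_last(char const *haystack, char const *needle) {
  if (*needle == '\0') {
    /* An empty needle matches everywhere; report the end of the string. */
    return haystack + strlen(haystack);
  }

  char const *last = NULL;
  char const *match = strstr(haystack, needle);
  while (match) {
    last = match;
    /* Resume one char past the match so overlapping matches are found. */
    match = strstr(match + 1, needle);
  }
  return last;
}
```

Each pass restarts the search just after the previous hit, so the worst case rescans much of the string; for short strings like filenames that cost is unlikely to matter.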

C string encoding issues
The standard library functions for searching C strings work great with ASCII and similar single-byte encodings. If you need to search inside UTF-8 encoded C strings, you'll quickly realize that strchr() and strrchr() are only useful for finding the basic ASCII characters (which are also valid UTF-8 characters). If you need to find non-ASCII characters like 'é', you'll need to use strstr() to search for the byte sequence that UTF-8 uses to represent it ("\xc3\xa9" for 'é'). Even then, Unicode characters like 'é' can be represented two ways: as the single Unicode character 'é' or as the base character 'e' followed by the combining character '´', and a byte-for-byte search for one form won't find the other. In general, it's better to use a C library designed to deal with the encoding, such as the International Components for Unicode (ICU), for handling UTF-8 encoded strings. Or if you're developing for iOS or Mac OS X, use NSString instead.
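As a small illustration of the byte-sequence approach and its limits, the sketch below searches for the precomposed 'é' only. The helper name contains_precomposed_e_acute is hypothetical, and the code assumes its input is valid UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 bytes for the precomposed character U+00E9 (e with acute). */
static char const E_ACUTE_PRECOMPOSED[] = "\xc3\xa9";

/* Hypothetical helper: reports whether `utf8` contains the precomposed
   form of e-acute. It compares raw bytes only, so the decomposed form
   ('e' followed by U+0301, bytes 0x65 0xCC 0x81) is NOT matched. */
bool contains_precomposed_e_acute(char const *utf8) {
  return strstr(utf8, E_ACUTE_PRECOMPOSED) != NULL;
}
```

A string spelled "caf\xc3\xa9" matches, while the visually identical "cafe\xcc\x81" does not, which is exactly the kind of surprise a Unicode-aware library is meant to handle for you.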

Find one NSString in another
The NSString class doesn't have separate methods to search for a single character or a string; you use -rangeOfString: to do either:
// find a character in an NSString
NSString *s = @"foobar";

NSRange range = [s rangeOfString:@"b"];
if (range.location != NSNotFound) {
  NSLog(@"Found b at %lu", (unsigned long)range.location);
}
// prints "Found b at 3"
Searching for the last occurrence of a string is done using the related method -rangeOfString:options: with the NSBackwardsSearch option.
// find last occurrence in an NSString
NSString *s = @"The rain in Spain falls mainly on the plain";

NSRange range = [s rangeOfString:@"ain" options:NSBackwardsSearch];
if (range.location != NSNotFound) {
  NSLog(@"Found ain at %lu", (unsigned long)range.location);
}
// prints "Found ain at 40"
The options are a combination of the following bit flags: NSCaseInsensitiveSearch, NSLiteralSearch, NSBackwardsSearch and NSAnchoredSearch. You use the bitwise OR (|) operator to combine them, or pass in zero for no options.

Use the NSCaseInsensitiveSearch option to find the first match, ignoring the case of both strings. The NSLiteralSearch option is used when you want to match a specific Unicode string form, such as the single character 'é' (Unicode character U+00E9) and not match equivalent character sequences like 'e' + '´' (Unicode characters U+0065 and U+0301). Most applications won't care about this option, but it's really handy when you need it.

NSAnchoredSearch checks for a match only at the start of the string (or the end if combined with NSBackwardsSearch). This option is occasionally handy, but the methods -hasPrefix: and -hasSuffix: are easier-to-read equivalents.
// anchored search
NSString *s = @"The rain in Spain falls mainly on the plain";

NSRange range = [s rangeOfString:@"ain"
                         options:NSAnchoredSearch];
if (range.location == NSNotFound) {
  NSLog(@"Doesn't start with ain");
}
// prints "Doesn't start with ain"

// same thing using -hasPrefix:
if ( ! [s hasPrefix:@"ain"]) {
  NSLog(@"Doesn't have prefix ain");
}
// prints "Doesn't have prefix ain"

// now from the end
range = [s rangeOfString:@"ain"
                 options:NSAnchoredSearch | NSBackwardsSearch];
if (range.location != NSNotFound) {
  NSLog(@"Ends with ain");
}
// prints "Ends with ain"

// same thing using -hasSuffix:
if ([s hasSuffix:@"ain"]) {
  NSLog(@"Has suffix ain");
}
// prints "Has suffix ain"
There are two other variations of -rangeOfString:. The first, -rangeOfString:options:range:, allows you to search within a section of a larger string without having to create a substring.

The second, -rangeOfString:options:range:locale:, allows you to specify a locale as well as a range. In most cases you want the current locale, which is taken from the language setting on the user's device; pass [NSLocale currentLocale] to request it. According to Apple's documentation, passing nil for the locale selects the system locale rather than the current one. Sometimes you know that the string contains text in a particular language, in an app that teaches German for instance. In this case you should specify a locale when searching the string; the locale can affect how text is matched, especially when using the NSCaseInsensitiveSearch option.

Next week, we'll look at replacing characters in C strings and NSStrings.

