Tuesday, July 12, 2011

Objective-C Tuesdays: regular expressions

Welcome back to Objective-C Tuesdays after a long hiatus. In the last couple of entries, we looked at searching and replacing in C strings and NSStrings. Today we'll look at a more powerful way to search and replace in strings: regular expressions.

A mini language
Regular expressions is a small, specialized programming language for matching text. In addition to being specialized in scope, regular expressions are also very compact, which can make them very hard to read. I won't cover regular expression grammar in dept here—there are plenty of regular expression tutorials, references and cheat sheets out there. If you're looking for a good reference, I recommend O'Reilly's Mastering Regular Expressions and the Regular Expressions Cookbook.

Some programming languages have first class support for regular expressions, including Perl, Ruby and JavaScript. Modern languages like Python and Java support regular expressions as part of the language's standard library. Unfortunately C doesn't have regular expression support as part of its standard library. For Objective-C, regular expression support first appeared with iOS 3.2 on iPad and 4.0 on iPhone. Before that, developers needed to use a third party library to use regular expressions in their apps.

Some simple examples
Most characters in regular expressions match themselves, but some characters have special meaning. Here's a simple regular expression that matches the word "foobar":
foobar
To match "foo" followed by one 'b' character, you could use:
foob?
Here, the '?' character is a special character that modifies the expression before it, (in this case the character 'b') and matches it zero or one times. Thus foob? will match "foo" or "foob" but not "foe".

To match "foo" followed by zero or more 'b' characters, you could use:
foob*
The '*' special character matches the preceding expression zero or more times. foob* will match "foo", "foob" and "foobbbbbb" but not "fod".

To match "foo" followed by one or more 'b' characters, you could use:
foob+
The '+' special character matches the preceding expression one or more times. foob+ will match "foob" and "foobbbbbb" but not "foo" or "food".

Regular expression languages have many more capabilities. Though different regular expression implementations have different capabilities, most share a large set of common operations,.

C libraries for regular expressions
The PCRE or Perl Compatible Regular Expressions library is a widely used C library that implements the same regular expression language that's used in Perl 5. The PCRE library is open source and distributed under a BSD license. It's a substantial library, and due to the lack of support on iOS for OS X style frameworks, it can be a bit challenging to include in an iOS project. In addition, because it's a C library, working with PCRE requires more low-level code than many Cocoa Touch programmers are comfortable with.

Here's a PCRE code snippet that compiles the regular expression foo(bar|fy) and tests it against the string "foofy".
#include 
#include 
#include 

#define OUTPUT_VECTOR_COUNT 30    /* should be a multiple of 3 */

/* ... */

char const *pattern = "foo(bar|fy)";
int compileOptions = 0;
char const *error;
int errorOffset;
unsigned char const *characterTable = NULL;
pcre *regularExpression = pcre_compile(pattern, compileOptions, 
                                       &error, &errorOffset, characterTable);

if ( ! regularExpression) {
  NSLog(@"ERROR: regular expression <%s> failed to compile\n", pattern);
  NSLog(@"  Error at offset %i: %s\n", errorOffset, error);
  exit(EXIT_FAILURE);
}

pcre_extra *extraData = NULL;
char const *subject = "foofy";
int subjectLength = strlen(subject);
int subjectOffset = 0;
int execOptions = 0;
int outputVector[OUTPUT_VECTOR_COUNT];
int execResultCount = pcre_exec(regularExpression, extraData, 
                                subject, subjectLength, subjectOffset, 
                                execOptions, outputVector, OUTPUT_VECTOR_COUNT);

if (execResultCount == PCRE_ERROR_NOMATCH) {
  NSLog(@"The subject <%s> did not match the pattern <%s>\n", subject, pattern);
} else if (execResultCount < 0) {
  NSLog(@"Unexpected pcre_exec() result %i\n", execResultCount);
} else if (execResultCount == 0) {
  NSLog(@"Output vector only has room for %i captured substrings\n", 
        execResultCount - 1);
} else {
  int resultIndex;
  
  NSLog(@"Found match for pattern <%s> in subject <%s> at offset %i\n", 
        pattern, subject, outputVector[0]);
  for (resultIndex = 0; resultIndex < execResultCount; ++resultIndex) {
    int substringIndex = 2 * resultIndex;
    int substringStartingOffset = outputVector[substringIndex];
    int substringEndingOffset = outputVector[substringIndex + 1];
    char const* substringStart = subject + substringStartingOffset;
    int substringLength = substringEndingOffset - substringStartingOffset;
    NSLog(@"match %i: <%.*s>\n", resultIndex, substringLength, substringStart);
  }
}

pcre_free(regularExpression);

/* ... */
As you can see, it takes a lot of boilerplate code to do even a simple regular expression search using PCRE, but it's a very powerful library used by many well knows open source and commercial projects, including Apple's Safari browser. The PCRE project also includes a set of C++ wrappers that make the library easier to use from that language. If you only plan to target OS X, there is also the very excellent RegexKit open source library that provides a nice Objective-C wrapper around PCRE. The PCRE library can be built with Unicode support, but is limited to scanning UTF-8 encoded Unicode strings, since it's a byte oriented library.

The POSIX Regular Expressions specification describes a regular expression dialect and C interface that is supported on many variants of Unix and Linux, including OS X and iOS. Like many specifications, POSIX regular expressions are a "lowest common denominator" solution and lack some of the advanced features found in regular expression variants like Perl and PCRE. The POSIX regular expression interface is much smaller than what PCRE provides: four functions, two structs and a bunch of constants. Here's a PCRE code snippet that compiles the regular expression foo(bar|fy) and tests it against the string "foofy" using the POSIX interface.
#import 

#define MATCH_COUNT 10

/* ... */

regex_t regularExpression;
char const *pattern = "foo(fy|bar)";
int compileFlags = REG_EXTENDED;

int compileResult = regcomp(&regularExpression, pattern, compileFlags);

if (compileResult) {
  size_t bufferSize = regerror(compileResult, &regularExpression, NULL, 0);
  char *buffer = malloc(bufferSize);
  if ( ! buffer) {
    NSLog(@"Memory allocation failed");
    return;
  }
  regerror(compileResult, &regularExpression, buffer, bufferSize);
  NSLog(@"%s", buffer);
  free(buffer);
  return;
}

char const *string = "foofy";
regmatch_t matches[MATCH_COUNT];
int executeFlags = 0;

int executeResult = regexec(&regularExpression, string, 
                            MATCH_COUNT, matches, executeFlags);

if (executeResult == REG_NOMATCH) {
  NSLog(@"Pattern <%s> doesn't match string <%s>\n", pattern, string);
} else {  
  NSLog(@"Pattern <%s> matches string <%s>\n", pattern, string);
  NSLog(@"Found %lu submatches\n", (unsigned long)regularExpression.re_nsub);
  for (size_t i = 0; i < regularExpression.re_nsub + 1; ++i) {
    int substringLength = matches[i].rm_eo - matches[i].rm_so;
    char const *substringStart = string + matches[i].rm_so;
    NSLog(@"submatch %lu: <%.*s>\n", (unsigned long)i, substringLength, substringStart);
  }
}

regfree(&regularExpression);

/* ... */
This is a little more concise than the PCRE version, mainly because the POSIX functions have fewer options, but is still very low level. POSIX regular expression implementations are generally regarded as slower than PCRE, and don't provide any explicit Unicode support (though you may be able to do regex matching against UTF-8 encoded strings if you're careful to keep your UTF-8 strings normalized).

The ICU or International Components for Unicode library is included as part of iOS. (ICU is open source and carries a non-restrictive license.) The ICU is a general purpose library for working with Unicode text which includes Unicode-aware regular expression support. It has two versions: one for Java and another for C and C++. The C/C++ version features a low level C function interface and a set of higher level C++ classes. I'll only look at the C interface to ICU here.

Because the ICU library is a general purpose Unicode library, most operations require that C strings be converted into ICU's Unicode text type, UChar *, which makes the ICU example the longest yet:
// compile regular expression
char const *pattern = "foo(bar|fy)";
uint32_t compileFlags = 0;
UParseError parseError;
UErrorCode errorCode = U_ZERO_ERROR;

URegularExpression* regularExpression = uregex_openC(pattern, compileFlags, 
                                                     &parseError, &errorCode);
if (errorCode) {
  NSLog(@"uregex_openC() failed: %li: %s", 
        (long)errorCode, u_errorName(errorCode));
  NSLog(@"  parse error at line %li, offset %li" ,
        (long)parseError.line, (long)parseError.offset);
  return;
}

// determine size of search text as ICU Unicode string
UChar *unicodeText = NULL;
int32_t unicodeTextCapacity = 0;
int32_t unicodeTextLength;
char const *utf8Text = "foofy";
int32_t utf8TextLength = -1; /* null terminated */

errorCode = U_ZERO_ERROR;
u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength, 
              utf8Text, utf8TextLength, &errorCode);

if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
  NSLog(@"Conversion to Unicode string failed: %li: %s", 
        (long)errorCode, u_errorName(errorCode));
  uregex_close(regularExpression);
  return;
}

// allocate buffer for search text ICU Unicode string
unicodeTextCapacity = unicodeTextLength + 1;
unicodeText = calloc(sizeof(UChar), unicodeTextLength);
if ( ! unicodeText) {
  NSLog(@"Memory allocation failed");
  uregex_close(regularExpression);
  return;
}

// convert search text to ICU Unicode string
errorCode = U_ZERO_ERROR;
u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength, 
              utf8Text, utf8TextLength, &errorCode);

uregex_setText(regularExpression, unicodeText, unicodeTextLength, &errorCode);
if (errorCode) {
  NSLog(@"uregex_setText() failed: %li: %s", 
        (long)errorCode, u_errorName(errorCode));
  free(unicodeText);
  uregex_close(regularExpression);
  return;
}

// search for regular expression
int32_t startIndex = 0;
errorCode = U_ZERO_ERROR;
BOOL matchFound = uregex_find(regularExpression, startIndex, &errorCode);
if (errorCode) {
  NSLog(@"uregex_find() failed: %li: %s", 
        (long)errorCode, u_errorName(errorCode));
  free(unicodeText);
  uregex_close(regularExpression);
  return;
}

if (matchFound) {
  // get number of subgroup matches
  NSLog(@"Pattern <%s> matched string <%s>", pattern, utf8Text);
  errorCode = U_ZERO_ERROR;
  int32_t subgroupCount = uregex_groupCount(regularExpression, &errorCode);
  if (errorCode) {
    NSLog(@"uregex_groupCount() failed: %li: %s", 
          (long)errorCode, u_errorName(errorCode));
    free(unicodeText);
    uregex_close(regularExpression);
    return;
  }
  
  // enumerate subgroup matches
  NSLog(@"Matched %li subgroups", (long)subgroupCount);
  for (int32_t i = 0; i <= subgroupCount; ++i) {
    // get size of the subgroup
    UChar *subgroup = NULL;
    int32_t subgroupCapacity = 0;
    errorCode = U_ZERO_ERROR;
    int32_t subgroupLength = uregex_group(regularExpression, i, 
                                          subgroup, subgroupCapacity, 
                                          &errorCode);
    if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
      NSLog(@"uregex_group() failed: %li: %s", 
            (long)errorCode, u_errorName(errorCode));
      break;
    }
    
    // allocate buffer to hold the subgroup
    subgroupCapacity = subgroupLength + 1;
    subgroup = calloc(sizeof(UChar), subgroupCapacity);
    if ( ! subgroup) {
      NSLog(@"Memory allocation failed");
      return;
    }
    
    // copy subgroup into buffer
    errorCode = U_ZERO_ERROR;
    uregex_group(regularExpression, i, subgroup, subgroupCapacity, &errorCode);
    
    // determine size of buffer to hold subgroup as UTF8 string
    char *utf8Subgroup = NULL;
    int32_t utf8SubgroupCapacity = 0;
    int32_t utf8SubgroupLength;
    errorCode = U_ZERO_ERROR;
    u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength, 
                subgroup, subgroupLength, &errorCode);
    if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
      NSLog(@"u_strToUTF8() failed: %li: %s", 
            (long)errorCode, u_errorName(errorCode));
      free(subgroup);
      break;
    }
    
    // allocate buffer to hold subgroup as UTF8 string
    utf8SubgroupCapacity = utf8SubgroupLength + 1;
    utf8Subgroup = calloc(sizeof(char), utf8SubgroupCapacity);
    if ( ! utf8Subgroup) {
      NSLog(@"Memory allocation failed");
      free(subgroup);
      break;
    }
    
    // convert subgroup to UTF8 string
    errorCode = U_ZERO_ERROR;
    u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength, 
                subgroup, subgroupLength, &errorCode);
    if (errorCode) {
      NSLog(@"u_strToUTF8() failed: %li: %s", 
            (long)errorCode, u_errorName(errorCode));
      free(subgroup);
      free(utf8Subgroup);
      break;
    }
    
    // print subgroup UTF8 string
    NSLog(@"submatch %lu: <%.*s>\n", 
          (unsigned long)i, utf8SubgroupLength, utf8Subgroup);
    free(subgroup);
    free(utf8Subgroup);
  }
} else {
  NSLog(@"Pattern <%s> did not match string <%s>", pattern, utf8Text);
}

free(unicodeText);
uregex_close(regularExpression);
I wouldn't be surprised if there's a memory leak in there somewhere. I'm also no expert on this library, and I may be doing some things the hard way. If you're using C++, I'm sure a lot of this boilerplate code goes away. If you don't need or want to mix C++ into your iOS app, there are some Objective-C alternatives that are much more satisfying.

Regular expressions in Objective-C
In iOS 3.2, Apple added a new NSStringCompareOptions value for use in the various -rangeOfString:options: methods of NSString: NSRegularExpressionSearch. This option uses the regular expression support of the ICU library to do simple regular expression matching on an NSString object. Unfortunately, Apple only implemented a very minimal regular expression interface on NSString. You can search, but not replace, within an NSString and powerful regular expression features like subgroup matching are not exposed.
NSString *string = @"foofy";
NSString *pattern = @"foo(bar|fy)";

NSRange match = [string rangeOfString:pattern 
                              options:NSRegularExpressionSearch];

if (match.location == NSNotFound) {
  NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
} else {
  NSLog(@"Pattern <%@> matches string <%@> starting at location %lu",
        pattern, string, (unsigned long)match.location);
}

In iOS 4.0, Cocoa Touch gained the NSRegularExpression class, a full-featured regular expression processor built on top of the ICU regular expression library.
NSError *error;
NSString *pattern = @"foo(bar|fy)";

NSRegularExpression *regularExpression = [NSRegularExpression regularExpressionWithPattern:pattern 
                                                                                   options:0 
                                                                                     error:&error];
if ( ! regularExpression) {
  NSLog(@"Error in pattern <%@>: %@", pattern, error);
  return;
}

NSString *string = @"foofy";
NSRange range = NSMakeRange(0, [string length]);
NSArray *matches = [regularExpression matchesInString:string 
                                              options:0 
                                                range:range];
if ([matches count]) {
  NSTextCheckingResult *firstMatch = [matches objectAtIndex:0];
  NSLog(@"Found %lu submatches", (unsigned long)[firstMatch numberOfRanges]);
  for (NSUInteger i = 0; i < [firstMatch numberOfRanges]; ++i) {
    NSRange range = [firstMatch rangeAtIndex:i];
    NSString *submatch = [string substringWithRange:range];
    NSLog(@"submatch %lu: <%@>", (unsigned long)i, submatch);      
  }
} else {
  NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
}
This is much easier to use than the underlying ICU C library and a good choice for iOS apps that target 4.0 and later.

The RegexKitLite library provides an alternate Objective-C wrapper around the low level ICU regular expression library that supports both Mac OS X and iOS, including iOS 3.0. RegexKitLite provides a full featured set of regular expression methods in a category that extends NSString and can do matching, searching, replacing and supports subgroups. RegexKitLite is open source and available under a BSD license.
NSString *string = @"foofy";
NSString *pattern = @"foo(bar|fy)";

NSArray *submatches = [string captureComponentsMatchedByRegex:pattern];
if ([submatches count]) {
  NSLog(@"Found %lu submatches", (unsigned long)[submatches count]);
  for (NSUInteger i = 0; i < [submatches count]; ++i) {
    NSLog(@"submatch %lu: <%@>", (unsigned long)i, [submatches objectAtIndex:i]);      
  }
} else {
  NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
}
RegexKitLite code is a little more concise than the NSRegularExpression class, but comparable in power, since they're both built on top of ICU.

Which one to use?
Unless you're writing cross-platform code in C or need to target very old versions of iOS, I recommend you avoid using the C regular expression libraries. If you must, the POSIX regex functions are the easiest to get started with, but if you're targeting non-Apple platforms, watch out for implementation differences across platforms. Both PCRE and ICU are good cross-platform choices; choose ICU if you also need robust Unicode support, PCRE if you're mainly working with eight bit encodings.

On the Objective-C side, choosing NSRegularExpression or RegexKitLite is largely up to personal preference. NSRegularExpression is a no-brainer if you're targeting iOS 4.0 and later (but currently not supported on OS X). Adding RegexKitLite to your project is as easy as adding two files to Xcode and currently works on both Apple operating systems. And the -rangeOfString:options: method on NSString is handy for simple searches.

<eot /> (end of topic)
That concludes our look at strings in Objective-C. Data structures is our next topic.

No comments: