Tuesday, September 14, 2010

Objective-C Tuesdays: string comparison and equality

Welcome back after an end-of-Summer hiatus. Last time we looked at concatenating strings in Objective-C. Today we look at another common string operation: comparison.

Identity
Comparing two variables or objects can sometimes be a tricky proposition. There are several different senses of equality. The most fundamental type of equality is identity: do two variables represent the same thing in memory? Identity only makes sense for reference types, like C strings, NSStrings and other pointer types. Value types like ints always designate separate things in memory. In C and Objective-C, identity equality is determined by comparing pointer values using the == operator.
// comparing two strings for identity
char const *s1 = "foo";
char const *s2 = s1;
if (s1 == s2) {
  NSLog(@"s1 is identical to s2");
}

NSString *s3 = @"foo";
NSString *s4 = @"bar";
if (s3 != s4) {
  NSLog(@"s3 is not identical to s4");
}

Equivalence
A more useful type of equality is equivalence of value: do two variables represent equivalent data? Equivalence is useful when comparing value types as well as reference types, and is usually what programmers think of when comparing two strings.

For C strings, the primary equivalence test is done with the strcmp() function. The strcmp() function compares the data of two C strings char by char; if two C strings represent the same sequence of char values in memory, they are equivalent and strcmp() returns zero.
// checking two C strings for equal value
char const *s1 = "foo";
char const *s2 = "bar";

if (strcmp(s1, s2) == 0) {
  NSLog(@"s1 is equivalent to s2");
} else {
  NSLog(@"s1 is not equivalent to s2");
}
In addition to checking for equivalence, strcmp() also categorizes the sort order of the two C strings. If the first argument comes before the second, a negative value is returned; if the first argument comes after the second, a positive value is returned. The strcmp() function uses a lexicographic comparison, which means that the comparison is strictly on the basis of the integer values of the chars in the C strings. For ASCII strings, the string "2" (ASCII code 50) comes before "A" (ASCII code 65), which precedes "a" (ASCII code 97). Many sorting algorithms, including the qsort() function in the C standard library, require a function like strcmp().
// using strcmp() result

int compareResult = strcmp(s1, s2);
if (compareResult < 0) {
  NSLog(@"s1 comes before s2");
} else if (compareResult > 0) {
  NSLog(@"s1 comes after s2");
}
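
Since qsort() came up, here is a minimal sketch of sorting an array of C strings with strcmp(). The names compare_c_strings() and sort_c_strings() are made up for this example. The wrinkle is that qsort() hands the comparison function pointers to the array elements, not the elements themselves, so strcmp() can't be passed in directly.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() passes pointers to two array elements. Each element is a
   char const *, so each argument really points at a char const *. */
static int compare_c_strings(void const *a, void const *b) {
  char const *const *s1 = a;
  char const *const *s2 = b;
  return strcmp(*s1, *s2);
}

/* Sorts an array of C strings in place, in strcmp() order.
   (Hypothetical helper for illustration.) */
static void sort_c_strings(char const **strings, size_t count) {
  qsort(strings, count, sizeof strings[0], compare_c_strings);
}
```

Sorting the array { "foo", "bar", "Foo", "2" } this way gives "2", "Foo", "bar", "foo", because digits come before upper case letters, which come before lower case letters in ASCII.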

Sometimes you only want to see if two strings have a common prefix, or you're working with character buffers that aren't null terminated. The strncmp() function will compare a limited number of characters, stopping early if it encounters a null terminator in either string. Thus these two strings are equivalent when the first three characters are compared:
if (strncmp("foo", "fooey", 3) == 0) {
  NSLog(@"both start with foo");
}
// prints "both start with foo"

When sorting with strncmp(), short strings come first:
if (strncmp("foo", "fooey", 5) < 0) {
  NSLog(@"foo comes before fooey");
}
// prints "foo comes before fooey"
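
The prefix test is common enough that it's worth wrapping up. This is only a sketch; has_prefix() is a name invented here, not a standard library function. It uses the length of the prefix as the limit for strncmp().

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when s begins with prefix. Both arguments must be
   null terminated C strings. (Hypothetical helper for illustration.) */
static bool has_prefix(char const *s, char const *prefix) {
  return strncmp(s, prefix, strlen(prefix)) == 0;
}
```

With this helper, has_prefix("fooey", "foo") is true, while has_prefix("fo", "foo") is false because strncmp() stops at the null terminator of the shorter string and finds a difference there.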

Case Insensitive
In languages that have upper and lower case letters, you often need to do a case insensitive comparison. The C standard library doesn't define a case insensitive string comparison function, but one is part of the POSIX standard, and most compiler vendors and operating systems include one. The POSIX version is called strcasecmp(). Most modern Unix and Linux systems (including iOS and Mac OS X) have strcasecmp() available in the standard library. Older Unix systems and other operating systems may call this function stricmp() or strcmpi(). There is usually also a length limited version called strncasecmp() or strnicmp().

The case insensitive comparison functions usually compare only ASCII characters, which limits their usefulness.
// case insensitive comparison
char const *s = "<HTML><HEAD>...";

if (strncasecmp(s, "<html>", 6) == 0) {
  NSLog(@"looks like HTML");
}
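
If you're stuck on a platform without any of these functions, a rough equivalent can be built from tolower() in the C standard library. The sketch below is an assumption about how you might do it, not how any particular vendor implements strcasecmp(); ascii_casecmp() is a made-up name. In the default "C" locale it only folds the ASCII letters.

```c
#include <ctype.h>

/* Case insensitive comparison built on tolower(). Returns zero when
   the strings match ignoring case, otherwise a negative or positive
   value like strcmp(). (Hypothetical helper for illustration.) */
static int ascii_casecmp(char const *s1, char const *s2) {
  /* tolower() expects values in the unsigned char range */
  unsigned char const *p1 = (unsigned char const *)s1;
  unsigned char const *p2 = (unsigned char const *)s2;

  while (*p1 != '\0' && tolower(*p1) == tolower(*p2)) {
    ++p1;
    ++p2;
  }
  return tolower(*p1) - tolower(*p2);
}
```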

Encoding Issues
The strcmp() function was created in the era when most computers used ASCII or other simple single byte encodings. In ASCII, there is only one byte sequence that represents any particular character sequence. This isn't true of many modern encodings, including Unicode. The Unicode character set contains both accented characters such as "é" as well as a combining accent character "´", so there are two ways to represent "é" in UTF-8 encoding:

Address     64    65    66

Character   'é'
Value       195   169

Character   'e'   '´'
Value       101   204   129
Obviously a lexicographic comparison function like strcmp() will not see these two strings as equivalent. Accounting for this requires performing normalization on the Unicode characters in the string before doing the comparison. Unicode has several different types of normalization, which we won't dive into here. If you need to do a lot of low level processing of UTF-8 or other Unicode encoded text, you should look at the International Components for Unicode, a library of C functions for Unicode processing that is included as part of iOS. Better yet, in most cases you should use NSStrings when working with text.
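
To see the problem concretely, here is a small sketch that spells out both byte sequences from the table above with hex escapes and hands them to strcmp(). The variable and function names are invented for this example.

```c
#include <string.h>

/* 'é' as a single precomposed character: bytes 195 169 */
static char const precomposed[] = "\xC3\xA9";

/* 'e' followed by the combining acute accent: bytes 101 204 129 */
static char const decomposed[] = "e\xCC\x81";

/* Returns nonzero only when the two byte sequences are identical,
   which is all that strcmp() can tell us. */
static int utf8_forms_match(void) {
  return strcmp(precomposed, decomposed) == 0;
}
```

Here utf8_forms_match() returns zero: the strings have different lengths (two bytes versus three) and differ in the very first byte, even though both display as "é".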

NSString equality
The NSString class defines the -isEqualToString: instance method for testing if an NSString is equivalent to another NSString:
// compare two NSStrings
NSString *s1 = @"foo";
NSString *s2 = @"bar";

if ( [s1 isEqualToString:s2] ) {
  NSLog(@"The strings are equivalent.");
}
You can also use the -isEqual: instance method defined by NSObject to compare two NSStrings, or to compare an NSString with any other object:
// compare two NSStrings using -isEqual:
NSString *s1 = @"foo";
NSString *s2 = @"bar";

if ( [s1 isEqual:s2] ) {
  NSLog(@"The strings are equivalent.");
}
The difference between the two methods is in their declarations. The -isEqualToString: method is only for comparing one NSString to another; its declaration looks like:
// declaration of -isEqualToString:
- (BOOL)isEqualToString:(NSString *)aString
The -isEqual: method is for comparing any kind of NSObject to another object; its declaration looks like:
// declaration of -isEqual:
- (BOOL)isEqual:(id)anObject
It's possible to use -isEqual: to compare an NSString with an object of a different type, such as an NSNumber:
NSString *fiveString = @"5";
NSNumber *fiveNumber = [NSNumber numberWithInt:5];

if ( [fiveString isEqual:fiveNumber] ) {
  NSLog(@"fiveString equals fiveNumber");
} else {
  NSLog(@"Strings aren't equivalent to numbers, silly!");
}
You might hope that the NSString "5" is equivalent to the NSNumber "5" but unfortunately they are not; the code above will print out "Strings aren't equivalent to numbers, silly!". In general, objects of different classes aren't considered to be equivalent with one common exception: immutable classes like NSString can be equivalent to their mutable subclasses (NSMutableString in this case) and vice versa.
NSString *fiveString = @"5";
NSMutableString *fiveMutableString = [NSMutableString stringWithString:@"5"];

if ( [fiveString isEqual:fiveMutableString] ) {
  NSLog(@"immutable and mutable strings can be equivalent");
}
And since NSMutableString is a subclass of NSString, you can also use -isEqualToString: to compare them:
if ( [fiveString isEqualToString:fiveMutableString] ) {
  NSLog(@"immutable and mutable strings can be equivalent");
}

-compare:
In addition to testing for equivalence using -isEqual: or -isEqualToString:, you can also discover the relative order of two NSString objects using the -compare: family of methods. The -compare: method is very similar to the strcmp() function in C. The -compare: method returns an NSComparisonResult value, which is simply an integer value. Similar to strcmp(), -compare: will return zero if the two NSStrings are equivalent, though you can also use the constant NSOrderedSame instead of zero:
// compare two NSStrings
NSString *s1 = @"foo";
NSString *s2 = @"bar";

if ( [s1 compare:s2] == NSOrderedSame ) {
  NSLog(@"s1 is equivalent to s2");
} else {
  NSLog(@"s1 is not equivalent to s2");
}
As with strcmp(), the sign of the result gives the ordering, but where strcmp() may return any negative or positive value, -compare: returns exactly negative one or positive one. If the receiver of the -compare: message (the first NSString) comes before the first argument (the second NSString), negative one is returned; if the receiver comes after the first argument, positive one is returned. The constants NSOrderedAscending and NSOrderedDescending can be used instead of -1 and 1 respectively.
// using NSComparisonResult

NSComparisonResult comparisonResult = [s1 compare:s2];
if (comparisonResult == NSOrderedAscending) {
  NSLog(@"s1 comes before s2");
} else if (comparisonResult == NSOrderedDescending) {
  NSLog(@"s1 comes after s2");
}
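
If you're writing plain C and like the fixed -1, 0 and 1 results that -compare: returns, here is a minimal sketch that clamps the result of strcmp() the same way. The ordering type and compare_c_strings_ordered() are names invented for this example; they aren't part of Foundation or the C standard library.

```c
#include <string.h>

/* Hypothetical C counterpart to NSComparisonResult */
typedef enum {
  ORDERED_ASCENDING = -1,
  ORDERED_SAME = 0,
  ORDERED_DESCENDING = 1
} ordering;

/* Compares two C strings with strcmp() and reduces the result
   to exactly -1, 0 or 1. */
static ordering compare_c_strings_ordered(char const *s1, char const *s2) {
  int result = strcmp(s1, s2);
  /* (result > 0) and (result < 0) each evaluate to 0 or 1 */
  return (ordering)((result > 0) - (result < 0));
}
```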

Case Insensitive -compare:
To test the equivalence of two NSString objects in a case insensitive manner, use -compare:options: with the NSCaseInsensitiveSearch flag.
// case insensitive compare
NSString *s1 = @"foo";
NSString *s2 = @"FOO";

if ( [s1 compare:s2 options:NSCaseInsensitiveSearch] == NSOrderedSame) {
  NSLog(@"s1 is equivalent to s2");
}
Since case insensitive comparison is a common operation, NSString has a convenience method, -caseInsensitiveCompare: which does the same thing.
// case insensitive compare
NSString *s1 = @"foo";
NSString *s2 = @"FOO";

if ( [s1 caseInsensitiveCompare:s2] == NSOrderedSame) {
  NSLog(@"s1 is equivalent to s2");
}

Unicode and -compare:
By default, the -compare: family of methods is pretty smart about Unicode and automatically understands things like Unicode combining characters. For instance, you can represent é two ways, but -compare: knows that they represent equivalent strings:
// comparing equivalent Unicode strings
NSString *eAcute = @"\u00e9";      // single character 'é'
NSString *ePlusAcute = @"e\u0301"; // 'e' + combining '´'

if ( [eAcute compare:ePlusAcute] == NSOrderedSame ) {
  NSLog(@"'é' is equivalent to 'e' + '´'");
}
Note that -isEqualToString: does not do this: it performs a literal comparison and reports these two strings as different. The behavior of -compare: can be surprising if you've only worked with ASCII or other single byte encodings. With -compare:, you can't assume that equivalent strings have the same length and character sequence. Usually you don't care about the Unicode representation, but occasionally it's important. You can use the NSLiteralSearch flag along with -compare:options: to do a lexicographic comparison that compares strings character value by character value.

// lexicographic comparison of Unicode strings

if ( [eAcute compare:ePlusAcute options:NSLiteralSearch] != NSOrderedSame) {
  NSLog(@"'é' is not lexicographically equivalent to 'e' + '´'");
}

Combining options
The options constants used in the -compare:options: method are bit flags. You combine them using the bitwise or operator (|).
// using multiple options
NSString *eAcute = @"\u00e9";        // 'é'
NSString *capitalEAcute = @"\u00c9"; // 'É'

if ( [eAcute compare:capitalEAcute
             options:NSCaseInsensitiveSearch | NSLiteralSearch]
         == NSOrderedSame)
{
  NSLog(@"'é' is equivalent to 'É'");
}
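
If bit flags are new to you, here is a minimal sketch in plain C of how they work. The OPTION_ constants and has_option() are hypothetical stand-ins made up for this example, not the real Foundation constants. Each flag occupies its own bit, so several can be packed into one integer with | and tested individually with &.

```c
#include <stdbool.h>

/* Hypothetical option flags; each one occupies its own bit. */
enum {
  OPTION_CASE_INSENSITIVE = 1u << 0, /* binary 01 */
  OPTION_LITERAL = 1u << 1           /* binary 10 */
};

/* Returns true when flag is set in options. */
static bool has_option(unsigned options, unsigned flag) {
  return (options & flag) != 0;
}
```

Combining both flags with OPTION_CASE_INSENSITIVE | OPTION_LITERAL gives binary 11, and has_option() reports each flag as set.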

Comparing substrings
If you only want to compare part of an NSString object, you can use the -compare:options:range: method and specify an NSRange structure. The NSRange structure is composed of two parts: a starting location field named location and a length field named length. Usually it's convenient to use the NSMakeRange() function to generate the NSRange.
// compare substrings
NSString *s1 = @"fooey";
NSString *s2 = @"foo";

// the range selects part of the receiver (s1); all of s2 is compared
if ( [s1 compare:s2
         options:0
           range:NSMakeRange(0, 3)] == NSOrderedSame)
{
  NSLog(@"both strings start with 'foo'");
}
You pass in zero for the options to use the default comparison. -compare:options:range: is similar to strncmp() with two important differences: the range applies only to the receiver (the first string), which is then compared against the whole of the argument, and the NSRange you give must fall completely inside the receiver or an NSRangeException will be thrown.

Comparing using a specific locale
The plain -compare: methods don't consult the user's locale; to order strings the way the user expects, use -localizedCompare:, which uses the current locale. The current locale is controlled by the user when they set the language and region for their iOS device. Most of the time you should respect the user's settings, but sometimes it's appropriate to compare strings using a fixed locale. Perhaps your app teaches French vocabulary and you want your French word list to sort in standard French order whether the user's phone is set to English, German or Japanese. In French, accent differences near the end of a word outweigh accent differences earlier in the word, thus "côte" should come before "coté". If you don't specify a French locale, the result of comparing "coté" and "côte" varies but will probably not give you the correct ordering.
// compare without specifying a locale
NSString *coteAcute = @"cot\u00e9";      // "coté"
NSString *coteCircumflex = @"c\u00f4te"; // "côte"

if ( [coteAcute compare:coteCircumflex] == NSOrderedAscending) {
  NSLog(@"Not using a French locale");
}
To remedy this, you can set the locale explicitly when you do your comparison:
// compare using specific locale
NSLocale *frenchLocale = [[[NSLocale alloc] initWithLocaleIdentifier:@"fr_FR"] autorelease];
NSComparisonResult comparisonResult = [coteAcute compare:coteCircumflex 
                                                 options:0 
                                                   range:NSMakeRange(0, 4)
                                                  locale:frenchLocale];
if (comparisonResult == NSOrderedDescending) {
  NSLog(@"Using a French locale");
}

That sums up the options for comparing C strings and NSStrings. Next time, we'll look at slicing and dicing strings by creating substrings.

9 comments:

Kevin Bomberry said...

Very good article with an insightful look at string comparison operations. Also I really appreciate that you explore both C and Objective-C syntax and their respective methods. I'm looking forward to the next Objective-C Tuesday as you explain how to slice and dice strings. (^_^)

Mike said...

Thanks for the post, I missed these!

As the designated proofreader, I feel I should point out you probably meant "precedes" instead of "proceeds".

Don McCaughey said...

@Mike Doh! (fixed :-)

Pavel Gnatyuk (Павел Гнатюк) said...

Thanks. Nice articles. Please continue

Don McCaughey said...

@Pavel Thanks! Glad you found it useful.

launch-mailinator-com said...

For the "Case Insensitive -compare:" you should use the -caseInsensitiveCompare: method

Don McCaughey said...

Thanks, I didn't realize that I overlooked the -caseInsensitiveCompare: method. I added an example of its use after the example for -compare:options: with the NSCaseInsensitiveSearch flag.

Rob said...

// comparing equivalent Unicode strings

this snippet does not seem to be true in my testing (snow leopard and iOS4)

it does not print the ..is equivalent.. message

Rob said...

I figured out that to make the
// comparing equivalent Unicode strings
snippet work, you first have to normalise each string with

string=[string decomposedStringWithCanonicalMapping];