Handling Ampersand Codes in ThrededTweet

December 21, 2009

One of the architectural issues I’ve been having with ThrededTweet is handling HTML ampersand codes, such as &amp;, &#8257;, etc. ThrededTweet uses a UIWebView instance to display whole tweets, and that handles ampersand codes automatically. But the summaries of tweets that are displayed, such as on the main Feed page, use a UILabel, which displays the codes literally.

For release 1.0.0, I manually added handling of what I assumed were the common cases, but then immediately found an example of a tweet with an em dash (&mdash;), which I didn’t handle. Version 1.0.1 expanded the set of ampersand codes that are handled, but as you can probably guess, as soon as 1.0.1 came out, I found an example of another code that wasn’t handled. So I needed a comprehensive solution.

For version 1.0.2, I decided to use a program called gperf to generate code that handles all ampersand codes automatically. gperf is the GNU perfect hash function generator:

For a given list of strings, [gperf] produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
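To make the "single string comparison" idea concrete, here is a hand-built toy version for just four of the codes. This is only a sketch of the technique, not actual gperf output: the names toy_hash(), toy_lookup() and first_char_offset(), and the offsets themselves, are hypothetical and were picked by hand so that each keyword lands in its own slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct code { const char *name; int val; };

/* Hand-picked per-character offsets (hypothetical; gperf computes its own). */
static unsigned int first_char_offset(char c)
{
    switch (c) {
    case 'l': return 0;
    case 'g': case 'a': case 'q': return 1;
    default:  return 100;   /* pushes unknown words out of table range */
    }
}

/* hash = length + offset of the first character.
   Distinct for all four keywords: lt -> 2, gt -> 3, amp -> 4, quot -> 5. */
static unsigned int toy_hash(const char *str, size_t len)
{
    return (unsigned int)len + first_char_offset(str[0]);
}

/* Return the table entry for str, or NULL if str is not a keyword. */
static const struct code *toy_lookup(const char *str, size_t len)
{
    static const struct code wordlist[] = {
        { "", 0 }, { "", 0 },                 /* slots 0 and 1 are unused */
        { "lt", 0x003C }, { "gt", 0x003E },
        { "amp", 0x0026 }, { "quot", 0x0022 },
    };
    unsigned int key;

    assert(str != NULL);
    if (len == 0)
        return NULL;
    key = toy_hash(str, len);
    if (key >= sizeof wordlist / sizeof wordlist[0])
        return NULL;
    /* The hash has no collisions among keywords, so one comparison
       against the single candidate slot settles the question. */
    if (strlen(wordlist[key].name) == len &&
        memcmp(str, wordlist[key].name, len) == 0)
        return &wordlist[key];
    return NULL;
}
```

A non-keyword such as "gq" hashes to the same slot as "gt", and the one comparison rejects it. gperf does the same kind of thing, but it searches for suitable offsets automatically and scales to the full list of codes.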

The first step was to get a list of all HTML ampersand codes, along with their corresponding Unicode code points, which is available on the “List of XML and HTML character entity references” page on Wikipedia. I cut-and-pasted the code list, separated out the parts I wanted with a one-line Perl script, and stored the results in a file called codelist:

quot,0x0022
amp,0x0026
apos,0x0027
lt,0x003C
gt,0x003E
nbsp,0x00A0
[…]

The default behavior of gperf is to create a function that just verifies whether or not a given string is in the hash table, but I needed it to return the Unicode code point as well. So I added the definition of the struct that would be stored in the hash table at the start of the codelist file:

struct code { char *name; int val; };
%%
quot,0x0022
amp,0x0026
apos,0x0027
lt,0x003C
gt,0x003E
nbsp,0x00A0
[…]

I then fed this to gperf (the “-t” means that I’m providing my own struct definition):

gperf -t codelist > amp.c

That created a file I could import into Xcode. The important function in that file is:

struct code *
in_word_set (str, len)
register const char *str;
register unsigned int len;
{
[…]

Feed it a string and its length, and it returns a pointer to the instance of “code” that corresponds to that string, or NULL if none was found. If the pointer was not NULL, I still compared code->name to the string I was looking for, in order to validate that I had found the correct entry. Note that, in this case, the only reason it would not match is that the input was a garbage ampersand code (e.g. “&iowfdna;”).
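The calling pattern described above might look roughly like the sketch below. The in_word_set() here is a hypothetical linear-search stand-in with only a handful of entries, included so the example compiles without the generated amp.c. The lookup_code_point() wrapper is likewise an illustrative name, not something from ThrededTweet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same layout as the struct declared at the top of the codelist file. */
struct code { const char *name; int val; };

/* Hypothetical stand-in for the gperf-generated in_word_set():
   a plain linear search over a few entries. */
static const struct code *in_word_set(const char *str, size_t len)
{
    static const struct code table[] = {
        { "quot", 0x0022 }, { "amp", 0x0026 }, { "apos", 0x0027 },
        { "lt",   0x003C }, { "gt",  0x003E }, { "nbsp", 0x00A0 },
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strlen(table[i].name) == len &&
            memcmp(table[i].name, str, len) == 0)
            return &table[i];
    }
    return NULL;
}

/* Return the Unicode code point for an ampersand-code name (without the
   leading '&' and trailing ';'), or -1 if the name is not recognized. */
static int lookup_code_point(const char *name, size_t len)
{
    const struct code *entry;

    assert(name != NULL);
    entry = in_word_set(name, len);
    if (entry == NULL)
        return -1;
    /* Defensive re-check that the entry really is the name we asked for. */
    if (strlen(entry->name) != len || strncmp(entry->name, name, len) != 0)
        return -1;
    return entry->val;
}
```

With this stand-in table, lookup_code_point("nbsp", 4) yields 0x00A0, and a garbage name such as "iowfdna" yields -1.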

I then did some cleanup. “in_word_set()” became “is_amp_code()”, and I changed the function declaration to something more modern than ye olde K&R style generated by gperf:

struct AmpCode *
is_amp_code (const char* str, unsigned int len)

The structure definition changed from this:

struct code { char *name; int val; };

to this:

typedef struct AmpCode { char *name; int val; } AmpCode_t;

Finally, “amp.c” became “Amp.m” (perhaps not strictly necessary), and I created an “Amp.h” that contained the struct definition and the prototype for is_amp_code().

I then imported Amp.m and Amp.h into Xcode, removed most of my old ampersand-code parsing code (keeping the part that found substrings that started with “&” and ended with “;”), and put a call to is_amp_code() in its place.
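For illustration, the remaining scanning step could be sketched in plain C as follows. This is not the actual ThrededTweet code: decode_amp_codes() and put_utf8() are hypothetical helper names, the is_amp_code() below is a small linear-search stand-in for the generated function, and writing the result as UTF-8 is an assumption about how the replacement text gets produced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct AmpCode { const char *name; int val; } AmpCode_t;

/* Hypothetical stand-in for the gperf-generated is_amp_code(). */
static const AmpCode_t *is_amp_code(const char *str, unsigned int len)
{
    static const AmpCode_t table[] = {
        { "amp",  0x0026 }, { "lt",    0x003C }, { "gt", 0x003E },
        { "nbsp", 0x00A0 }, { "mdash", 0x2014 },
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strlen(table[i].name) == len &&
            memcmp(table[i].name, str, len) == 0)
            return &table[i];
    }
    return NULL;
}

/* Encode a code point as UTF-8 and return the number of bytes written.
   Assumes cp <= 0xFFFF, which covers the named HTML 4 codes. */
static size_t put_utf8(unsigned int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Copy src to dst, replacing each recognized "&name;" with its character.
   Unrecognized codes are copied through unchanged. A replacement consumes
   at least 3 input bytes and emits at most 3, so dst never needs more
   than strlen(src) + 1 bytes. */
static void decode_amp_codes(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src == '&') {
            const char *end = strchr(src + 1, ';');
            if (end != NULL) {
                unsigned int len = (unsigned int)(end - (src + 1));
                const AmpCode_t *code = is_amp_code(src + 1, len);
                if (code != NULL) {
                    dst += put_utf8((unsigned int)code->val, dst);
                    src = end + 1;   /* skip past the ';' */
                    continue;
                }
            }
        }
        *dst++ = *src++;             /* ordinary character, or no match */
    }
    *dst = '\0';
}
```

Copying unrecognized codes through untouched means a tweet containing a stray “&” or a garbage code is displayed as written rather than being mangled.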

Voila! An ampersand-parser that handled anything I threw at it. If you were following my Twitter test account, you would have seen a string of tweets with all sorts of random ampersand codes, some legal, some not.