So, you know everything about text, right? – part VII

If you’re a reader of this blog, then you probably know that I’m Portuguese. Aside from accented chars and the notorious ç, there really aren’t any issues associated with the fact that .NET stores chars as 16-bit values. In other words, I’m a lucky bastard :)

Before going on, a disclaimer: generally, I prefer to blog about areas which I’ve used in my daily activities. Unicode surrogates aren’t really one of those things. However, since I’m writing about text, I believe that the series wouldn’t really be complete without mentioning surrogates. If you do have experience in this area and you do detect a nasty error or an erroneous assumption, then please do use the comment section for correcting me ;) Having said that, let’s proceed…

If I had to write text with characters that live outside Unicode’s Basic Multilingual Plane (some rarer CJK ideographs, historic scripts, musical symbols or emoji, for instance), I wouldn’t be so lucky because 16 bits aren’t enough for representing those characters. In these cases, two 16-bit code values are combined to represent a single Unicode char: a high surrogate (the first 16-bit code value) followed by a low surrogate (the second 16-bit code value). If you need to iterate through the Unicode chars of such a string, then you’ll probably need to resort to the StringInfo class (typically, you’ll refer to each Unicode char as a text element or grapheme). Besides surrogate pairs, a text element can also be a base char followed by one or more combining chars, and that is the case exercised by the sample string used in this post (an a followed by the combining macron \u0304). The easiest way to use a StringInfo object is to pass it a string during instantiation. The following snippet illustrates the typical use of this class to enumerate its text elements:

var sb = new StringBuilder();
var s = "a\u0304\u308bc";
var si = new StringInfo(s);
for( var i = 0; i < si.LengthInTextElements; i++ ) {
    sb.AppendFormat("element at {0} is {1}\n",
        i,
        si.SubstringByTextElements(i, 1));
}
MessageBox.Show(sb.ToString());//console won't show it correctly!

It’s important to notice that the StringInfo class is defined in the System.Globalization namespace. After importing that namespace with a using directive, you can use the LengthInTextElements property to get the number of text elements. With that number in hand, you can use the SubstringByTextElements method to extract the desired text elements (the first argument is the index of the first text element and the second is the number of text elements to return).
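
The sample string above relies on a combining char rather than on a surrogate pair. As a rough sketch of what an actual surrogate pair looks like, the following snippet uses U+1D11E (the musical G clef symbol), which I’ve picked just for illustration. It’s written as top-level statements, so it assumes C# 9 or later; the variable names are mine and aren’t part of any API:

```csharp
using System;
using System.Globalization;

// U+1D11E lies outside the BMP, so .NET stores it as a surrogate pair:
// high surrogate 0xD834 followed by low surrogate 0xDD1E.
var clef = "\U0001D11E";
var text = "a" + clef + "c";

// Length counts 16-bit code values; LengthInTextElements counts text elements.
var codeUnits = text.Length;
var textElementCount = new StringInfo(text).LengthInTextElements;

// The char helpers let you inspect each half of the pair.
var isHigh = char.IsHighSurrogate(text[1]);
var isLow = char.IsLowSurrogate(text[2]);

// ConvertToUtf32 combines the pair starting at index 1 into one code point.
var codePoint = char.ConvertToUtf32(text, 1);

Console.WriteLine("code values: {0}, text elements: {1}", codeUnits, textElementCount);
Console.WriteLine("high: {0}, low: {1}, code point: U+{2:X}", isHigh, isLow, codePoint);
```

The string holds four 16-bit code values but only three text elements, which is why indexing it char by char would split the clef in half.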

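If you want, you can also get a TextElementEnumerator instance from the static GetTextElementEnumerator method: after that, it’s really easy to iterate through the text elements. Notice that the ElementIndex property returns the index of the string where the current text element starts, not the ordinal position of that text element. Here’s a snippet which illustrates this strategy: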

var sb = new StringBuilder();
var s = "a\u0304\u308bc";
var unicodeEnum = StringInfo.GetTextElementEnumerator(s);
while( unicodeEnum.MoveNext()) {
    sb.AppendFormat("element at {0} is {1}\n",
        unicodeEnum.ElementIndex,
        unicodeEnum.GetTextElement());
}
MessageBox.Show(sb.ToString());//console won't show it correctly!

Finally, you can also use the static ParseCombiningCharacters method to obtain an Int32 array. The length of the array gives you the number of text elements and each entry holds the index of the string where the corresponding text element starts (that is, the position of its base char or high surrogate). Here’s a small snippet which shows how to use this method:

var sb = new StringBuilder();
var s = "a\u0304\u308bc";
var textElements = StringInfo.ParseCombiningCharacters(s);
for( var i = 0; i < textElements.Length;i++) {
    sb.AppendFormat("char {0} starts at pos {1}\n",
                    i,
                    textElements[i]);
}
Console.WriteLine(sb.ToString());//only indexes are printed, so the console is fine here
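
Since the array only gives you start positions, you need a bit of extra work if you want the text elements themselves. The following is just a sketch of one way of doing it (the starts and pieces names are mine, and it’s written as top-level statements, so it assumes C# 9 or later): each text element runs from its start index up to the next start index, or up to the end of the string for the last one.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

var s = "a\u0304\u308bc";

// Start index of each text element; for this string: 0, 2, 3.
var starts = StringInfo.ParseCombiningCharacters(s);

var pieces = new List<string>();
for (var i = 0; i < starts.Length; i++) {
    // A text element ends where the next one starts (or at the end of the string).
    var end = i + 1 < starts.Length ? starts[i + 1] : s.Length;
    pieces.Add(s.Substring(starts[i], end - starts[i]));
}

Console.WriteLine("{0} text elements in {1} chars", pieces.Count, s.Length);
```

In practice, SubstringByTextElements or the enumerator will probably be simpler; the indexes are mostly handy when you need to map text elements back to positions in the original string.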

Not really as interesting as the first two approaches, but still useful…And I guess this wraps it up for today! Stay tuned for more.

~ by Luis Abreu on May 2, 2011.
