So, you know everything about text, right?–part VII
If you’re a reader of this blog, then you probably know that I’m Portuguese. Aside from accented characters and the notorious ç, there really aren’t any issues associated with the fact that .NET stores each char as a 16-bit value. In other words, I’m a lucky bastard.
Before going on, a disclaimer: generally, I prefer to blog about areas which I’ve used in my daily activities. Unicode surrogates aren’t really one of those things. However, since I’m writing about text, I believe that the series wouldn’t really be complete without mentioning surrogates. If you do have experience in this area and you detect a nasty error or an erroneous assumption, then please use the comment section to correct me. Having said that, let’s proceed…
If I had to write text with characters that live outside Unicode’s Basic Multilingual Plane (for instance, some of the rarer CJK ideographs, historic scripts, musical symbols or emoji), I wouldn’t be so lucky, because 16 bits aren’t enough to represent every Unicode character. In these cases, a single Unicode character is encoded as two 16-bit code values: a high surrogate (the first 16-bit code value) followed by a low surrogate (the second 16-bit code value). If you need to work with surrogates, then you’ll probably resort to the StringInfo class to iterate through the text. StringInfo works with text elements (or graphemes): a text element is what the user perceives as a single character, whether it’s stored as one char, as a surrogate pair or as a base character followed by combining characters. The easiest way to use a StringInfo object is to pass it a string during instantiation. The following snippet illustrates the typical use of this class to enumerate its text elements:
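A minimal sketch of that loop follows. The sample string and the `Split` helper are assumptions made up for illustration; only `StringInfo`, `LengthInTextElements` and `SubstringByTextElements` come from the framework.

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: returns each text element of the input as its own string.
    public static string[] Split(string text)
    {
        var info = new StringInfo(text);
        var elements = new string[info.LengthInTextElements];
        for (int i = 0; i < elements.Length; i++)
        {
            // Extract exactly one text element, starting at text element i.
            elements[i] = info.SubstringByTextElements(i, 1);
        }
        return elements;
    }

    static void Main()
    {
        // Assumed sample: 'a', U+1D11E (stored as a surrogate pair), 'b'.
        string text = "a\U0001D11Eb";

        Console.WriteLine(text.Length); // 4 (16-bit code values)

        string[] elements = Split(text);
        Console.WriteLine(elements.Length); // 3 (text elements)

        for (int i = 0; i < elements.Length; i++)
        {
            Console.WriteLine("{0}: {1} code value(s)", i, elements[i].Length);
        }
    }
}
```

The second element reports two code values because the surrogate pair is kept together, which is exactly what indexing the string char by char would not give you.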
It’s important to notice that the StringInfo class is defined in the System.Globalization namespace. After importing that namespace with a using directive, you can use the LengthInTextElements property to get the number of text elements. Once you know that number, you can use the SubstringByTextElements method to extract the text elements you’re interested in.
If you want, you can also get a TextElementEnumerator instance from the static StringInfo.GetTextElementEnumerator method. After that, it’s really easy to iterate through the text elements. Here’s a snippet which illustrates this strategy:
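What follows is only a sketch of the enumerator approach, under the assumption of a sample string that mixes a combining accent with a surrogate pair; the `ElementStarts` helper is a hypothetical name used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class EnumeratorDemo
{
    // Hypothetical helper: collects the string index where each text element starts.
    public static List<int> ElementStarts(string text)
    {
        var starts = new List<int>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // ElementIndex is the position of the current text element in the string.
            starts.Add(enumerator.ElementIndex);
        }
        return starts;
    }

    static void Main()
    {
        // Assumed sample: 'e' + combining acute accent, U+1D11E (surrogate pair), 'z'.
        string text = "e\u0301\U0001D11Ez";

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement returns the current text element as a string.
            string element = enumerator.GetTextElement();
            Console.WriteLine("index {0}: {1} code value(s)",
                enumerator.ElementIndex, element.Length);
        }
    }
}
```

Notice that the accented letter comes back as one text element even though it takes two chars, just like the surrogate pair does.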
Finally, you can also use the static StringInfo.ParseCombiningCharacters method to obtain an Int32 array. The length of the array gives you the number of text elements, and each item in the array holds the index in the string where the corresponding text element starts. Here’s a small snippet which shows how to use this method:
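As a rough sketch (again, the sample string and the `Starts` wrapper are assumptions for illustration only), this is how the returned array could be inspected:

```csharp
using System;
using System.Globalization;

static class ParseDemo
{
    // Hypothetical wrapper around the framework call, kept only to make the sketch easy to reuse.
    public static int[] Starts(string text)
    {
        return StringInfo.ParseCombiningCharacters(text);
    }

    static void Main()
    {
        // Assumed sample: 'a', U+1D11E (surrogate pair), 'b'.
        string text = "a\U0001D11Eb";
        int[] starts = Starts(text);

        Console.WriteLine(starts.Length);              // 3 text elements
        Console.WriteLine(string.Join(", ", starts));  // 0, 1, 3
    }
}
```

The jump from 1 to 3 shows that the second text element occupies two chars in the string.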
Not really as interesting as the first two approaches, but still useful… And I guess this wraps it up for today! Stay tuned for more.