So, you know everything about text, right? – part IX

In the previous post, we started looking at encodings. In this post, we’ll finally put those theoretical concepts into practice and take a look at some code. Let’s start with a simple example. Suppose you’re building an application which will only handle English strings. In this case, UTF-8 will probably be a good option because it uses a single byte for each of the chars found in plain English text (the ASCII range). And that’s why you might decide to use it whenever you need to send or receive strings across the network:

var str = "This is a string which is going to be encoded";
var bytes = Encoding.UTF8.GetBytes( str );


Yes, it’s that easy! You start by getting a reference to an instance of an Encoding and then you encode an existing string into a byte array by calling its GetBytes method (the UTF8 static property returns an instance of the UTF8Encoding type, which extends the abstract Encoding type). If you only want to know how many bytes are needed to encode a string, then you should resort to the GetByteCount method:

var str = "This is a string which is going to be encoded";
Console.WriteLine(Encoding.UTF8.GetByteCount( str ));


Calling GetByteCount isn’t free, because the method has to analyze every char in the string in order to return the exact number of bytes required. If you need a cheaper call, then you can go with the GetMaxByteCount method. In this case, we end up with the number of bytes required for the worst case scenario: the method receives the number of chars in the string and computes an upper bound from the maximum number of bytes a single char may need. For UTF8Encoding, that upper bound is (charCount + 1) * 3, which is why our 45 char string yields 138:

var str = "This is a string which is going to be encoded";
Console.WriteLine(Encoding.UTF8.GetMaxByteCount( str.Length ));//138
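To make the trade-off between the two methods a bit more concrete, here's a minimal sketch which allocates a buffer sized for the worst case and then keeps only the bytes which were really written. The ByteCountSketch class and its EncodeWithWorstCaseBuffer helper are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

public static class ByteCountSketch
{
    // Hypothetical helper: encodes text into a buffer sized with
    // GetMaxByteCount and returns only the bytes that were written.
    public static byte[] EncodeWithWorstCaseBuffer(string text)
    {
        var encoding = Encoding.UTF8;

        // Upper bound: cheap to compute, but usually bigger than needed.
        var buffer = new byte[encoding.GetMaxByteCount(text.Length)];

        // This GetBytes overload returns the number of bytes actually written.
        int written = encoding.GetBytes(text, 0, text.Length, buffer, 0);

        var result = new byte[written];
        Array.Copy(buffer, result, written);
        return result;
    }

    public static void Main()
    {
        var str = "This is a string which is going to be encoded";
        Console.WriteLine(Encoding.UTF8.GetByteCount(str));           // 45
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(str.Length)); // 138
        Console.WriteLine(EncodeWithWorstCaseBuffer(str).Length);     // 45
    }
}
```

The worst case buffer avoids the extra pass over the string, at the cost of temporarily allocating more memory than the encoded text needs.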


Besides UTF8Encoding, the framework introduces several other specific Encoding implementations: ASCIIEncoding (encodes a string using the 7-bit ASCII character set), UnicodeEncoding (UTF-16, which uses one or two 16-bit code units per character), UTF32Encoding (encodes each code point as a 32-bit integer) and UTF7Encoding (represents Unicode text using only 7-bit ASCII characters). You can get an instance of any of these encodings in two ways:

  1. you can instantiate them through constructor calls (not a good option in most scenarios).
  2. you can access an instance through one of the static properties of the Encoding class. By the way, you should notice that Unicode and BigEndianUnicode both return a reference to a UnicodeEncoding. The difference is that BigEndianUnicode returns a UnicodeEncoding which encodes chars in the big endian format, while Unicode uses the little endian format.
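If you want to see how these encodings differ in practice, here's a small sketch which encodes the same char with several of them and prints the resulting bytes in hexadecimal. The EncodingComparisonSketch class and its Hex helper are hypothetical names introduced for this example:

```csharp
using System;
using System.Text;

public static class EncodingComparisonSketch
{
    // Hypothetical helper: returns the encoded bytes as a hex string ("41-00").
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(Hex(Encoding.ASCII, "A"));            // 41
        Console.WriteLine(Hex(Encoding.UTF8, "A"));             // 41
        Console.WriteLine(Hex(Encoding.Unicode, "A"));          // 41-00
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, "A")); // 00-41
        Console.WriteLine(Hex(Encoding.UTF32, "A"));            // 41-00-00-00
    }
}
```

Notice how Unicode and BigEndianUnicode produce the same two bytes in opposite order.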

The curious reader might also have noticed that the Encoding type offers a static Default property which also returns an Encoding object. This encoding encodes (or decodes) by using the current user’s code page (as defined in the control panel’s regional settings applet). As I’ve said in the previous post, you can also use encodings based on other code pages (at your own risk!). To achieve that, you need to use the static GetEncoding method. There are several overloads which let you get an Encoding object by code page identifier or by code page name. The following snippet presents 3 different ways to get a reference to a UTF8Encoding instance:

var enc1 = Encoding.UTF8;
var enc2 = Encoding.GetEncoding( "utf-8" );
var enc3 = Encoding.GetEncoding( 65001 );
Console.WriteLine(enc1 == enc2);//true
Console.WriteLine(enc2 == enc3);//true


The previous example also shows another interesting thing: each Encoding object is only created once when you get a reference from one of the Encoding type’s static methods or properties. So, whenever you request an Encoding which has already been created, you end up receiving a reference to that previously created instance. Notice that this rule does not apply if you use the constructor of one of the existing derived Encoding types:

var enc1 = Encoding.UTF8;
var enc2 = new UTF8Encoding();
Console.WriteLine(enc1 == enc2);//false


If you look at the types derived from Encoding, you’ll notice that they offer several constructor overloads which allow you to fine tune the way encodings work. For instance, UTF8Encoding introduces constructors which allow us to specify if an encoding operation should generate a preamble (aka BOM) and if an exception should be thrown when invalid bytes are detected.
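Here's a minimal sketch of those two UTF8Encoding options in action. The Utf8OptionsSketch class and the ThrowsOnInvalidBytes helper are hypothetical names, and the single 0xFF byte is just one example of a sequence which isn't valid UTF-8:

```csharp
using System;
using System.Text;

public static class Utf8OptionsSketch
{
    // Hypothetical helper: tries to decode an invalid UTF-8 byte and
    // reports whether the encoding threw instead of replacing it.
    public static bool ThrowsOnInvalidBytes(Encoding encoding)
    {
        try
        {
            encoding.GetString(new byte[] { 0xFF });
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // First argument: emit the BOM; second argument: throw on invalid bytes.
        var withBom = new UTF8Encoding(true);
        var strict = new UTF8Encoding(false, true);

        Console.WriteLine(withBom.GetPreamble().Length);        // 3
        Console.WriteLine(strict.GetPreamble().Length);         // 0
        Console.WriteLine(ThrowsOnInvalidBytes(Encoding.UTF8)); // False
        Console.WriteLine(ThrowsOnInvalidBytes(strict));        // True
    }
}
```

Keep in mind that GetBytes never writes the preamble by itself: you get it through GetPreamble, and it's up to you (or to a type like StreamWriter) to write it before the encoded bytes.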

As you might have guessed, you can also recover a string from a previously encoded byte array. To perform a decoding, you need to 1.) get an instance of the appropriate Encoding and 2.) call its GetString method. Here’s how we can recover our previously encoded string:

var anotherString = Encoding.UTF8.GetString( bytes );

So, decoding is not that hard either, right? If you’re interested in knowing how many chars will result from a decoding operation, then you should use the GetCharCount method:

Console.WriteLine( Encoding.UTF8.GetCharCount( bytes ));
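The byte and char counts only differ when the text contains chars outside the ASCII range, so here's a small round trip sketch which uses one of those. The RoundTripSketch class and its RoundTrip helper are hypothetical names, and the sample string ("café", written with an escape sequence so that it doesn't depend on the source file's encoding) is just an illustration:

```csharp
using System;
using System.Text;

public static class RoundTripSketch
{
    // Hypothetical helper: encodes text as UTF-8 and decodes it back.
    public static string RoundTrip(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        var text = "caf\u00E9"; // the last char needs 2 bytes in UTF-8
        var bytes = Encoding.UTF8.GetBytes(text);

        Console.WriteLine(text.Length);                       // 4
        Console.WriteLine(bytes.Length);                      // 5
        Console.WriteLine(Encoding.UTF8.GetCharCount(bytes)); // 4
        Console.WriteLine(RoundTrip(text) == text);           // True
    }
}
```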


Once again, if you’re interested in speed, then you can resort to the GetMaxCharCount method. Before ending, there’s still time to mention that the Encoding type offers several properties which allow you to get info about it. For instance, you can access the CodePage property to get its code page identifier. There are other interesting properties, but I’ll redirect you to the docs since this is becoming a rather large post.
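As a quick illustration of those properties, the following sketch prints the code page identifier and the web name of some of the built-in encodings. The EncodingInfoSketch class and its Describe helper are hypothetical names used only for this example:

```csharp
using System;
using System.Text;

public static class EncodingInfoSketch
{
    // Hypothetical helper: combines the code page identifier and the web name.
    public static string Describe(Encoding encoding)
    {
        return encoding.CodePage + " " + encoding.WebName;
    }

    public static void Main()
    {
        var encodings = new[]
        {
            Encoding.ASCII,
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.BigEndianUnicode,
            Encoding.UTF32
        };

        // One line per encoding; UTF8, for instance, prints "65001 utf-8".
        foreach (var encoding in encodings)
        {
            Console.WriteLine(Describe(encoding));
        }
    }
}
```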

That’s it for now. Stay tuned for more.


~ by Luis Abreu on May 12, 2011.
