Byte order mark

Ever tried to put together a neat little XML document with the XDocument class?

It's quite nice, but when you're not writing the XML straight to disk (for example, when saving it to a stream in memory) you're sure to run into the BOM.

BOM is an abbreviation for Byte Order Mark, a short byte sequence at the beginning of a file or stream that signals which Unicode encoding (and byte order) is used. Unfortunately, when writing XML in memory, or to streams, this BOM might be added.
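
To make the "short byte sequence" concrete, here is a minimal sketch that prints the BOM bytes .NET associates with a few encodings. It is written as a standalone program with top-level statements (an assumption about your C# version), and the variable names are mine:

```csharp
using System;
using System.Text;

// GetPreamble returns the BOM bytes an encoding writes at the start of a stream.
byte[] utf8Bom = Encoding.UTF8.GetPreamble ();
byte[] utf16LeBom = Encoding.Unicode.GetPreamble ();
byte[] utf16BeBom = Encoding.BigEndianUnicode.GetPreamble ();

Console.WriteLine (BitConverter.ToString (utf8Bom));    // EF-BB-BF
Console.WriteLine (BitConverter.ToString (utf16LeBom)); // FF-FE
Console.WriteLine (BitConverter.ToString (utf16BeBom)); // FE-FF
```

Note that the UTF-8 BOM is three bytes long, while the UTF-16 ones are two.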

For example, this short snippet for writing an XDocument might give us some pain:

// Encoding.UTF8 emits a BOM at the start of the output.
var writerSettings = new XmlWriterSettings { Encoding = Encoding.UTF8 };
using (var stream = new MemoryStream ()) {
    using (var xmlWriter = XmlWriter.Create (stream, writerSettings)) {
        document.Save (xmlWriter);
    }

    // The writer has been disposed (and flushed) here, so the stream holds the whole document.
    return stream.ToArray ();
}

When looking at the output (decoded as UTF-8) it's all good, but if I were to read this byte array manually or compare the result to an actual string, I'd notice an extra character at the beginning of the string. For UTF-8 the BOM is the three bytes EF BB BF (displayed as "ï»¿" when misread as Latin-1), which decode to the invisible character U+FEFF. This is our BOM. It doesn't really hurt us if we're just writing the XML to disk, but if we handle it in any other way in memory it could potentially screw things up.
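
One way to sidestep the problem, rather than cleaning up afterwards, is to hand the writer a UTF-8 encoding that does not emit a BOM. The sketch below assumes top-level statements and uses a made-up sample document; only the `new UTF8Encoding (false)` part matters:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical sample document, only here to make the sketch runnable.
var document = new XDocument (new XElement ("root", "hello"));

// UTF8Encoding (false) is UTF-8 without the BOM preamble.
var writerSettings = new XmlWriterSettings { Encoding = new UTF8Encoding (false) };

byte[] bytes;
using (var stream = new MemoryStream ()) {
    using (var xmlWriter = XmlWriter.Create (stream, writerSettings)) {
        document.Save (xmlWriter);
    }
    bytes = stream.ToArray ();
}

// The output now starts with '<' (0x3C) rather than EF BB BF.
Console.WriteLine (bytes[0] == 0x3C);
```

The XML declaration still says encoding="utf-8", so readers that rely on the declaration are unaffected.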

To strip it, just use this snippet:

// 65279 (U+FEFF) is the BOM character once the bytes have been decoded to a string,
// whatever the original encoding was
xml = xml.TrimStart((char) 65279);
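
If you are still holding the raw bytes, the BOM can also be skipped while decoding. This is a rough sketch under the assumption that the data is UTF-8; `DecodeWithoutBom` is a hypothetical helper name, not something from the framework:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: decodes UTF-8 bytes and skips a leading BOM if one is present.
static string DecodeWithoutBom (byte[] buffer)
{
    byte[] bom = Encoding.UTF8.GetPreamble ();
    bool hasBom = buffer.Length >= bom.Length && buffer.Take (bom.Length).SequenceEqual (bom);
    int offset = hasBom ? bom.Length : 0;
    return Encoding.UTF8.GetString (buffer, offset, buffer.Length - offset);
}

// Build a sample buffer: BOM followed by a tiny XML fragment.
byte[] withBom = Encoding.UTF8.GetPreamble ().Concat (Encoding.UTF8.GetBytes ("<root />")).ToArray ();
string decoded = DecodeWithoutBom (withBom);
Console.WriteLine (decoded);
```

Buffers without a BOM pass through unchanged, so the helper is safe to call either way.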

For more information, check out the Wikipedia article on the byte order mark.
