#PSCXTip How to determine the byte order mark of a text file

Text files created by Windows PowerShell with the redirection operator (>) or Out-File are little endian Unicode (UTF-16LE) by default.  You can see this by inspecting the first few bytes of a text file for a BOM, i.e. a byte order mark.  BOMs are not required, but PowerShell usually creates one when it creates a text file.  Typical BOMs you’ll encounter with Windows and PowerShell are:

UTF-8 		: 0xEF 0xBB 0xBF
UTF-16LE 	: 0xFF 0xFE
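
The two signatures above are easy to check in code. Here is a minimal sketch, assuming the leading bytes of the file are already in a byte array; Get-BomName is a hypothetical helper name, not a PowerShell or PSCX command, and it only knows the two BOMs listed above:

```powershell
# Hypothetical helper (not part of PowerShell or PSCX): name the BOM, if any,
# at the start of a byte array. Only the two BOMs listed above are checked.
function Get-BomName {
    param([byte[]] $Bytes)

    # Check the three-byte UTF-8 signature first.
    if ($Bytes.Length -ge 3 -and
        $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) {
        return 'UTF-8'
    }
    # Then the two-byte UTF-16LE signature.
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) {
        return 'UTF-16LE'
    }
    return 'None'   # no recognized BOM
}

Get-BomName ([byte[]] (0xFF, 0xFE, 0x0D))         # UTF-16LE
Get-BomName ([byte[]] (0xEF, 0xBB, 0xBF, 0x0D))   # UTF-8
```

One known limitation of this sketch: the UTF-32LE BOM also begins with 0xFF 0xFE, so a UTF-32LE file would be reported as UTF-16LE.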

You can’t use code like [System.IO.File]::ReadAllText() to view a BOM because the BOM bytes aren’t part of the output; only the decoded text is returned.  Get-Content works the same way except when you use the -Encoding Byte parameter.  Given a file created in PowerShell:

PS> Get-Date > date.txt

You can see the encoding using Get-Content like so:

PS> Get-Content .\date.txt -Encoding Byte -TotalCount 3
255
254
13
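
Note that -Encoding Byte is a Windows PowerShell parameter value; PowerShell 6 and later use -AsByteStream instead. If you want something that doesn't depend on the Get-Content parameter set, the raw bytes can also be read through .NET. This is only a sketch, and Get-LeadingBytes is a hypothetical helper name introduced here for illustration:

```powershell
# Hypothetical helper: output the first $Count raw bytes of a file using .NET,
# so the result does not depend on Get-Content's parameter set.
function Get-LeadingBytes {
    param(
        [string] $Path,
        [int] $Count = 3
    )

    # Resolve against PowerShell's current location, not the process directory.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
        # Output only the bytes actually read (the file may be shorter than $Count).
        if ($read -gt 0) { $buffer[0..($read - 1)] }
    }
    finally {
        $stream.Dispose()
    }
}
```

Calling Get-LeadingBytes .\date.txt -Count 3 on the file above should produce the same three values as the Get-Content call.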

However, unless you’re quick with your decimal to hex conversions, this output isn’t ideal. The PowerShell Community Extensions comes with a command called Format-Hex that will format its input or a specified file in hex format. This utility is much like the od command from UNIX. The output from the Format-Hex command for the same file as above would be:

PS> Format-Hex .\date.txt -Count 16
Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 FF FE 0D 00 0A 00 53 00 75 00 6E 00 64 00 61 00 ......S.u.n.d.a.

Here we can see the first two bytes are 0xFF 0xFE, which is UTF-16LE or little endian Unicode.  If we saved date.txt as UTF-8:

PS> Get-Date | Out-File date.txt -Encoding Utf8
PS> Format-Hex .\date.txt -Count 16
Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 EF BB BF 0D 0A 53 75 6E 64 61 79 2C 20 44 65 63 .....Sunday, Dec

Here we can see the UTF-8 BOM 0xEF 0xBB 0xBF.  This tip is most useful when you’re using PowerShell to process a file created by another program and you need to leave the file in the same encoding it started with.
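
To illustrate that last point, here is a rough sketch of rewriting a file while keeping its original BOM. Both function names are hypothetical helpers made up for this example, only the two BOMs from this tip are recognized, and treating a file with no recognized BOM as UTF-8 without a BOM is an assumption rather than a rule:

```powershell
# Hypothetical helper: choose a System.Text.Encoding from a file's BOM.
# Assumption: a file with no recognized BOM is treated as UTF-8 without a BOM.
function Get-FileEncodingFromBom {
    param([string] $Path)

    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $bytes = New-Object byte[] 3
        $read = $stream.Read($bytes, 0, 3)
    }
    finally {
        $stream.Dispose()
    }

    if ($read -ge 3 -and
        $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        # UTF-8 that emits a BOM when writing
        return New-Object System.Text.UTF8Encoding -ArgumentList $true
    }
    if ($read -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
        # UTF-16, little endian, emitting a BOM when writing
        return New-Object System.Text.UnicodeEncoding -ArgumentList $false, $true
    }
    # No recognized BOM: UTF-8 without a BOM (assumption)
    return New-Object System.Text.UTF8Encoding -ArgumentList $false
}

# Hypothetical helper: apply a text transformation and save the result
# using the encoding (and BOM) the file started out with.
function Edit-TextPreservingEncoding {
    param(
        [string] $Path,
        [scriptblock] $Transform
    )

    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $encoding = Get-FileEncodingFromBom $fullPath
    $text = [System.IO.File]::ReadAllText($fullPath, $encoding)
    $newText = & $Transform $text
    # WriteAllText writes the encoding's preamble (BOM), if it has one.
    [System.IO.File]::WriteAllText($fullPath, [string] $newText, $encoding)
}
```

A call might look like Edit-TextPreservingEncoding .\date.txt { param($t) $t.TrimEnd() }, after which Format-Hex should show the same BOM as before the edit.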

Note: There are many more useful PowerShell Community Extensions (PSCX) commands. If you are interested in this great community project led by PowerShell MVPs Keith Hill and Oisin Grehan, give PSCX a try at http://pscx.codeplex.com.
