Using the ConvertFrom-String cmdlet to parse structured text
One of PowerShell strengths has always been string manipulation. PowerShell has very good support for regular expressions–using both cmdlets and operators. There are cmdlets to import and export several text file formats: XML, JSON, CSV. I use regular expressions in PowerShell almost every day. There is one problem with regular expressions though: it’s a language that is easier to write than read. And, it can be very complex when you try to process structured text files, especially if structure is not limited to single line and is not in one of natively supported formats.
New kid on the block
On September 4, a new version of Windows Management Framework 5.0 Preview was announced. This release comes with a cmdlet, ConvertFrom-String, that simplifies processing of any structured text. It does this in a relatively straightforward fashion. You don’t have to be a regex expert to get what you want, you just need to understand output of your command, a content of your log file, a structure of data exported from your database. As long as you can see a pattern and ‘explain it’ to PowerShell, PowerShell will turn a structured text into objects.
ConvertFrom-String has two modes. In the first one (basic delimited parsing) it’s not much different from ConvertFrom-Csv. Processing is performed line by line, and there is no way to identify input header other than using Select-Object -Skip to ignore the first few lines. You can specify delimiter, names of properties, and you are good to go. But that’s not scenario in which this new cmdlets shines. I would argue that using ConvertFrom-Csv is as easy, and is possible in PowerShell 2.0.
Second mode (auto generated example-driven parsing) is much more interesting. Release notes mention FlashExtract is used to get the results.
Note: You can find more on FlashExtract in publication “FlashExtract: A Framework for Data Extraction by Examples”, PLDI 2014, Vu Le, Sumit Gulwani, that can be found here.
In this mode, we hand ConvertFrom-String a template of data that we want to process. We need to follow a few rules when we prepare the template, but once we are done, PowerShell will look at any structured data as one of the well known formats. This article focuses entirely on this mode. We will take a look at real-world example of processing structured file (export from Opera mail client address book), how the workflow is simplified by using this new cmdlet in contrast with the one that depends on regular expressions. We will also take a look at ConvertFrom-String limitations and ‘gotchas’. And last but not least, we will use the new cmdlet to process more complex input data to create complex objects.
Story of one address book transfer
Almost two years ago, I had to help my wife to move her address book from Opera mail client to Outlook. After testing different options, I ended up with ‘Opera Hotlist version 2.0’ file as my input and needed to figure out a way to convert it to CSV, format that Outlook was able to ‘consume’. Luckily, file retrieved from Opera was human-readable. And after looking at the content I immediately came up with idea how to get what I wanted. Here is an example data file.
Opera Hotlist version 2.0
Options: encoding = utf8, version=3
#CONTACT
ID=11
NAME=Justynka
CREATED=1195505237
MAIL=JUSTYNA66@gmail.com
ICON=Contact0
#CONTACT
ID=12
NAME=Leszek
CREATED=1195677687
MAIL=Leszek@domena.pl
ICON=Contact0
#CONTACT
ID=13
NAME=Iwona Kwiatkowska
CREATED=1196277590
MAIL=iwon.kwiat@op.pl
ICON=Contact0
#CONTACT
ID=14
NAME=JUSTYNA66@gmail.com
CREATED=1347061687
MAIL=JUSTYNA66@gmail.com
ICON=Contact0
#FOLDER
ID=15
NAME=Kosz
CREATED=1195505227
TRASH FOLDER=YES
UNIQUEID=EAF22324295C86499476802CC76DE41E
-
#CONTACT
ID=16
NAME=Ania
CREATED=1195505237
MAIL=Ania.Nowak@poczta.com
ICON=Contact0
#CONTACT
ID=58
NAME=Bartek Bielawski
CREATED=1381258759
MAIL=bartek.bielawski@live.com
ICON=Contact0
#CONTACT
ID=20
NAME=Poczta Grupowa
URL=
CREATED=1347221208
DESCRIPTION=
ACTIVE=YES
MAIL=xyz.abc.grupa.foo@gmail.com
PHONE=
FAX=
POSTALADDRESS=
PICTUREURL=
ICON=Contact0
How did I solve this problem back then? I’ve used regular expressions and cmdlet with name that is similar to the one I would use today–ConvertFrom-StringData. You can see that each entry starts with the same string (#CONTACT). So there it was: easy way to -split records. Next thing I noticed–each record defines properties using ‘key=value’ syntax, something that ConvertFrom-StringData can translate into a hash table. When we have the hash table, creating an object is very easy. My work here was finished or so I thought. The actual workflow from idea to actual implementation is:
- Read file as single string (-Raw), split on ‘#CONTACT’, process each record with ConvertFrom-StringData
- OK, there is header, so let’s replace it
- Replace with ‘.*?’ failed… need to use ‘[\s\S]*?’ instead to cover multiline pattern
- Converting to hash table still fails–some records are prefixed with #FOLDER rather than #CONTACT
- Works for the most part, except there is ‘-‘ somewhere that causing it to fail, need to replace it
- All hash tables look OK, time to convert them to objects using New-Object
Final code looks simple!
(Get-Content .\opera.adr -Raw) -replace '^Opera[\s\S]*?#CONTACT' -split '#CONTACT|#FOLDER' |
ForEach-Object {
$Props = ConvertFrom-StringData -StringData ($_ -replace '\n-\s+')
New-Object PSOBject -Property $Props | Select-Object Name, Mail
}
How workflow changes when we start to use ConvertFrom-String cmdlet? First of all, we don’t have to use regular expressions. Even better, we don’t need to know anything about it. All we need to do is describe structure of our file. We can either use a file, or string (preferably here-string) to define our template. To sum it up:
- Copy few records from file to here-string that we will use as a template
- Add curly brackets to identify name and mail of contact
- Make sure that you highlight property that starts new ‘set’ with ‘*’ suffix
Code is longer (mainly because of template definition), but it’s also easier to write. I didn’t have to extract data on my own, everything happened ‘automagically’.
Here is the final code:
$TemplateAdr = @'
#CONTACT
ID=11
NAME={Name*:Justynka}
CREATED=1195505237
MAIL={Mail:JUSTYNA66@gmail.com}
ICON=Contact0
#CONTACT
ID=20
NAME={Name*:Poczta Grupowa}
URL=
CREATED=1347221208
DESCRIPTION=
ACTIVE=YES
MAIL={Mail:xyz.abc.grupa.foo@gmail.com}
PHONE=
FAX=
POSTALADDRESS=
PICTUREURL=
ICON=Contact0
'@
Get-Content .\opera.adr | ConvertFrom-String -TemplateContent $TemplateAdr |
Format-Table -AutoSize Name, Mail
Name Mail
---- ----
Justynka JUSTYNA66@gmail.com
Leszek Leszek@domena.pl
Iwona Kwiatkowska iwon.kwiat@op.pl
Ewa Nowak EWA22@gmail.com
Kosz
Ania Ania.Nowak@poczta.com
Bartek Bielawski bartek.bielawski@live.com
Poczta Grupowa xyz.abc.grupa.foo@gmail.com
It looks fine, but you have to be aware of limitations and ‘gotchas’.
There is always ‘but’…
ConvertFrom-String cmdlet can do wonders for us but you can walk into problems if you are not careful.
First and foremost, examples! You have to see patterns and make sure that PowerShell is aware of all possibilities. If your template won’t cover certain scenario you can end up with partial results (worse, as you may not see that something is missing at first) or with following error message:
_[28,27: ConvertFrom-String] ConvertFrom-String appears to be having trouble parsing your data using the template you’ve provided. We’d love to take a look at what went wrong, if you’d li
ke to share the data and template used to parse it. We’ve saved these files to C:\Users\Bartek\AppData\Local\Temp\smt5asdi.1×1.input.txt and C:\Users\Bartek\AppData\Local\Temp\smt5asdi.1
x1.template.txt – feel free to attach them in a mail to psdmfb@microsoft.com. We will review all submissions, although we can’t guarantee a response._
Partial results can have two flavours. Either an entire item is missing or a property of the item is missing. The former is very hard to spot. Both are result of the fact that examples you provided in your template are too specific. For example, if I have items in address book with the name that is same as e-mail address and both examples in my template suggest ‘FirstName LastName’ or ‘Name’ format, this record will be ignored. If an e-mail is provided in a format different than the one seen in examples, we may end up with records that don’t have ‘Mail’ property. Last but not least, at times you can get very unpredictable results. For example, when I worked on this article I ‘cleaned’ address book entries and my template. I noticed that in one case ‘Mail’ property was there but value contains only part of e-mail address.
Here is the sample data, template, and results I retrieved.
$otherTemplate = @'
{First*:Jan} {Last:Fasola}
MAIL={Mail:Jan@Fasola.com}
{First*:Not} {Last:Real}
MAIL={Mail:Just@Similar.to}
'@
$otherData = @'
Jan Fasola
MAIL=Jan@Fasola.com
not there
MAIL=all@lower.case
Ewa Kowalska
MAIL='Ewa' <Ewa@Kowalska.com>
Missing2 Cause
MAIL=used@number
Silly Cut
MAIL=causeIt@Expects.up
'@
$otherData | ConvertFrom-String -TemplateContent $otherTemplate |
Format-Table -AutoSize First, Last, Mail
First Last Mail
----- ---- ----
Jan Fasola Jan@Fasola.com
Ewa Kowalska
Silly Cut It@Expects.up
As you can see, examples in my template are very specific. Both the first and last names start with upper case letter. In both records ‘Mail’ starts with upper case too. There are no numbers in name. Another thing that you probably noticed is that it does not really have to be actual example from input data, just need to follow same pattern.
In the output two items are missing. One has full name in all lower case and the second contains a number in the ‘First’ field. Our template doesn’t ‘allow’ such items, so both were discarded. Next, Mail for ‘Ewa Kowalska’ doesn’t follow the pattern we see in template. So, it’s not visible on this object. And, last but not least, in the last record Mail is ’causeIt@Expects.up’ but in our output first part (starting with lower case) was removed. How to fix it? Just modify one of examples:
$fixedTemplate = @'
{First*:Jan} {Last:Fasola}
MAIL={Mail:Jan@Fasola.com}
{First*:not2} {Last:real}
MAIL={Mail:'Just' <similar@to.ewa>}
'@
$otherData | ConvertFrom-String -TemplateContent $fixedTemplate
Alternatively, we could just add records that were missing/incomplete to the template. In my opinion, this method is following the ‘KISS’ principle. It may take more time but we are not forced to figure out what went wrong.
There are other potential issues, e.g. here François-Xavier Cat identified a problem when you want to capture netstat -na output and ‘Status’ property is not present for UDP connections (and suggested a nice workaround). Also, he shows another real-life example of ConvertFrom-String use case, so it’s definitely worth reading.
Complex is possible
So far, we worked on the flat data. ConvertFrom-String however is able to process more complex data too. We can have nested properties, that are collections of objects. In other words: any file (or other input string data) can now be seen as XML/JSON. We just need to identify nested properties and cover them in our template. We will use output from Sysinternals handle -u command. You can download it here. Example results:
In PowerShell it would map to single object for each process, with property that would hold all handles, each being object with three properties. With that in mind we can start building our template. Because we already know that data does not have to be “real” we will try to understand output and create template based on that information, leaving only lines that seem necessary for PowerShell to understand all input (and process it correctly):
$template = @'
------------------------------------------------------------------------------
{ProcessName*:Testing} pid: {PID:4} {User:\<unable to open process>}
{Id*:18}: {Type:File} (R--) {Name:E:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001}
{Id*:90}: {Type:File} (R--) {Name:\clfs}
------------------------------------------------------------------------------
{ProcessName*:someprocess.exe} pid: {PID:6} {User:NT AUTHORITY\SYSTEM}
{Id*:C}: {Type:File} (---) {Name:C:\Windows\System32}
{Id*:A4}: {Type:Section} {Name:\Sessions\1\Windows\SharedSection}
------------------------------------------------------------------------------
{ProcessName*:Extend64.exe} pid: {PID:2512} {User:EMIS\Bartek}
{Id*:C}: {Type:File} (RW-) {Name:C:\Windows}
{Id*:28}: {Type:File} (R-D) {Name:C:\Windows\System32\en-US\conhost.exe.mui}
'@
None of these processes is running on my system. I copied lines with handles information, but changed them to make sure I cover all scenarios. With this template I could parse output from handle and convert it to complex objects. The only problem is that PowerShell adds ExtentText to all objects generated (including nested objects). And even though for testing/ debugging it may be handy, it’s usually not needed in the ‘final product’. Another problem I had was the fact that my handles were stored in automatically named property ‘Items’. I didn’t manage to find a way to change this name within template, so I’ve used Select-Object to do that for me.
handle -u | ConvertFrom-String -TemplateContent $Template |
Select-Object ProcessName, PID, User, @{
Name = 'Handles'
Expression = {
$_.Items | Select-Object * -ExcludeProperty ExtentText
}
}
I would love to have parameter that would disable adding ‘ExtentText’ property (or better yet: leave it out as a default behaviour, and adding it only if user requests it with switch parameter). Another problem that I had when I tried to figure out correct template was lack of verbose messages. The only way to get some feedback on ‘how’ is to use -Debug parameter, and even than we only get output if operation succeeds:
DEBUG: Property: ProcessName
Program: ESSL((PrecedingEndsWith(Hyphen(\-), Hyphen(\-), Hyphen(\-))): 0, 1, ...: ε...ε, 1 + Alphabet([\p{Lu}\p{Ll}\-.]+)...Dynamic Token(\ pid:\ )(\ pid:\ ), Number([0-9]+(\,[0-9]{3})*(
\.[0-9]+)?), WhiteSpace(( )+), 1)
-------------------------------------------------
Property: PID
Program: ESSL((Contains(Dynamic Token(\ pid:\ )(\ pid:\ ), Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?), WhiteSpace(( )+), 1)): 0, 1, ...: Alphabet([\p{Lu}\p{Ll}\-.]+), Dynamic Token(\ pid:\ )
(\ pid:\ )...Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?), WhiteSpace(( )+), 1 + Alphabet([\p{Lu}\p{Ll}\-.]+), Dynamic Token(\ pid:\ )(\ pid:\ ), Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?)...White
Space(( )+), 1)
-------------------------------------------------
Property: User
Program: ESSL((Contains(all lower((?<![\p{Lu}\p{Ll}])(\p{Ll})+), Colon(\:), WhiteSpace(( )+), 1)): 0, 1, ...: WhiteSpace(( )+), Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?), WhiteSpace(( )+)..
.ε, 1 + ε...ε, 0)
-------------------------------------------------
Property: Id
Program: ESSL((Contains(Colon(\:), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+), 1)): 0, 1, ...: WhiteSpace(( )+)...ε, 1 + ε...Colon(\:), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+),
1)
-------------------------------------------------
Property: Type
Program: ESSL((Contains(Colon(\:), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+), 1)): 0, 1, ...: Colon(\:), WhiteSpace(( )+)...Camel Case(\p{Lu}(\p{Ll})+), WhiteSpace(( )+), 1 + Colon(\
:), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+)...WhiteSpace(( )+), 1)
-------------------------------------------------
Property: Name
Program: ESSL((Contains(Colon(\:), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+), 1)): 0, 1, ...: WhiteSpace(( )+)...ε, -1 + ε...ε, 0)
-------------------------------------------------
It would be great to get some feedback when parsing fails, for example which part of file/rule was responsible for it. Sending input and template to authors every time it fails seems like overkill. On the other hand: fixing these errors looks like shooting in the dark. In my opinion this cmdlet should help a lot of administrators to parse even most complex command results/log files without deep knowledge of regular expressions. And even though I’m huge regex fan, I’m happy that we got yet another tool to parse text in PowerShell.
Share on: