I'll be honest: I'm not the best at RegEx, but it's an incredible tool for breaking down data. If I used it every day I would be much more familiar with it. As it happens, my current role is heavily involved with data extraction and reporting, so I'll be picking it up more and more.
If you want to learn, play. Create real-world examples of challenges and solve them.
The Cause:
This particular challenge comes from a need to parse, extract, and validate email addresses. Most of the time we think of the basic yourname@emailservice.com, but applications often format addresses to be friendlier, in the form:
John E. Doe <jdoe@doeindustries.com>
To parse this, my original approach was to tear the string apart by the logical sections:
[John E. Doe ]<jdoe@doeindustries.com>
Everything between the brackets ( < > ) is important to the cause, the rest is not (in this application).
In plain PowerShell, with no RegEx in play, I used this code:
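$cAddr = "John E. Doe <jdoe@doeindustries.com>"
# A "<" means the address is wrapped in angle brackets after a display name.
if($cAddr.indexof("<") -ge 0){
    # Keep what follows "<", then what precedes ">", and lowercase it.
    $eAddr=$cAddr.split("<")[1]
    $eAddr=$($eAddr.split(">")[0]).ToLower()
} else {
    # No brackets: treat the whole string as the address.
    $eAddr=$cAddr.ToLower()
}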
First off the address is in $cAddr (stating the obvious is a habit for me).
We look for the position of the "<" character to be sure we can (and need to) proceed. If the character is not found, IndexOf returns -1. In that case there is nothing to extract, so we skip to the else branch (the second-to-last line) and lowercase the string as it is.
That's the easy way out, so let's continue. I like the use of split to find the breakpoints. As we know that the < character is present now, we can split on that and choose the second segment (1 - because arrays begin at zero):
$eAddr=$cAddr.split("<")[1]
Now we can use split again, this time on the ">" character, and grab the first segment (0), which is everything ahead of it:
$eAddr=$($eAddr.split(">")[0])
We use the .ToLower() function to switch the result to a nice consistent lowercase string.
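To make the intermediate values visible, here is a minimal sketch that runs the same steps one at a time on the sample string. The variable names $pos, $parts, and $inner are added here for illustration only; they are not part of the script above.

```powershell
# Same sample string as in the walkthrough.
$cAddr = "John E. Doe <jdoe@doeindustries.com>"

# IndexOf returns the zero-based position of "<", or -1 when it is absent.
$pos = $cAddr.IndexOf("<")          # 12 for this string

# Splitting on "<" yields two segments: the display name and the rest.
$parts = $cAddr.Split("<")          # "John E. Doe " and "jdoe@doeindustries.com>"

# Split the second segment on ">" and keep what comes before it.
$inner = $parts[1].Split(">")[0]    # "jdoe@doeindustries.com"

# Normalize to lowercase, as the original code does.
$eAddr = $inner.ToLower()
$eAddr
```

Note that the first segment keeps its trailing space ("John E. Doe "), which does not matter here because only the second segment is used.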
We're done... but there's a better way. RegEx - Regular Expressions.
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
This single pattern stands in for the whole process above. Let's try it out on RegEx101.com:
Now, can I break this down for you? Let's see.
No. As I said, I'm still learning, but RegEx101 has you covered: expand the EXPLANATION pane in the top-right corner and have a look.
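For anyone reading without RegEx101 open, the pattern can also be read piece by piece. The sketch below annotates each part in comments and tries the pattern with PowerShell's -match operator. The second address (curator@gallery.museum) is a made-up example, included only to show one consequence of the {2,4} limit.

```powershell
# The pattern from above, read left to right:
#   \b                  word boundary, so the match starts at the edge of a word
#   [A-Za-z0-9._%-]+    one or more letters, digits, or . _ % - (the local part)
#   @                   a literal at sign
#   [A-Za-z0-9.-]+      one or more letters, digits, dots, or hyphens (the domain)
#   \.                  a literal dot before the top-level domain
#   [A-Za-z]{2,4}       two to four letters (the top-level domain)
#   \b                  word boundary at the end
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

$cAddr = "John E. Doe <jdoe@doeindustries.com>"

# -match returns $true or $false and fills the automatic $Matches variable.
$found = $cAddr -match $regex
$eAddr = $Matches[0].ToLower()      # jdoe@doeindustries.com

# Because of {2,4}, a top-level domain longer than four letters is rejected.
$longTld = 'curator@gallery.museum' -match $regex   # $false
```

If longer top-level domains need to be accepted, widening the {2,4} quantifier is one option, though that changes the pattern from the one discussed here.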
So, how does this relate to PowerShell?
This code will process a file and extract the email addresses from it:
$input_path = 'emailList.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
Select-String -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }
I didn't write this code; I found it with one of the best resources any of us have at our fingertips, Google (Source: https://techtalk.gfi.com/...). It is much more elegant than my code, but until you learn RegEx it may be a tad intimidating. Spend some time playing on RegEx101.
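To see what each stage of that pipeline hands to the next without needing a file, here is a small sketch that pipes strings straight into Select-String. The $lines array is hypothetical sample data standing in for emailList.txt, and % is spelled out as ForEach-Object for readability.

```powershell
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

# Hypothetical sample lines standing in for the contents of emailList.txt.
$lines = @(
    'John E. Doe <jdoe@doeindustries.com>',
    'jane.roe@example.org',
    'no address on this line'
)

# Select-String emits one MatchInfo object per line that matches.
# Its Matches property holds every regex match found on that line
# (because of -AllMatches), and each match exposes its text through Value.
$found = $lines |
    Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$found
```

The third line produces no output at all, because Select-String only passes along lines that match the pattern.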
Compare the above to my full script:
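# Read one address entry per line.
$emailList = Get-Content -Path emailList.txt
foreach($cAddr in $emailList){
    if($cAddr.indexof("<") -ge 0){
        # Bracketed form: keep only the text between "<" and ">".
        $eAddr=$cAddr.split("<")[1]
        $eAddr=$($eAddr.split(">")[0]).ToLower()
    } else {
        # Bare address: use the whole line.
        $eAddr=$cAddr.ToLower()
    }
    # Print the original entry and the extracted address in aligned columns.
    write-host "$($cAddr.PadRight(65," ")) :: $($eAddr.PadRight(45," "))"
}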
The Function:
The script reads the text file emailList.txt and extracts the email address from each line.
The Output:
The email addresses were generated using this site:
https://fauxid.com/tools/fake-email-list
None of these should be real except by coincidence.
I used LibreOffice Calc to make the list on the left from the email addresses provided.
I hope this helps.