Pages

Monday, 1 March 2010

RegEx and Alteryx (Part 2)

In part 1 we saw the Match Method of the RegEx tool and the REGEX_MATCH() function; in part 2 we will take a look at the Parse and Tokenise Methods of the RegEx tool.


Parse Output Method


So continuing the e-mail validation example from part 1, imagine now that having validated our e-mail field we now want to extract the domain names (everything after the @) to see how many of our customers use which e-mail domains.  To do this we can use the RegEx tool with the Parse output method.  We can use exactly the same regular expression as we used to validate the e-mail in part 1 with the addition of a set of parentheses (shown in red below) around the part of the RegEx which we wish to extract:


^[^\.].*@([^@]+\.[^@]+)$


In RegEx this part of the expression is called a marked group and the Parse output method will extract each marked group in the regular expression as a new field.




This example can be seen in full in the sample module RegEx_Example_Part2.yxmd downloadable from the bottom of the article.

Admittedly this could be done with non-RegEx functions in the formula tool, but having already written our RegEx validation rule this is a much easier and quicker solution.  Plus if we now want to extend it to extract everything before the @ into a separate field we can just add two more parentheses:

^([^\.].*)@([^@]+\.[^@]+)$

Tokenise Output Method

From the examples I have put together for this article it appears that there is a slight mismatch between what the help file says this output method does and what it actually does.  The help file states:

"The Tokenize Method behaves like the Text to Columns tool but can also separate on punctuation. 
...will split a single column of data at each instance of the specified regular expression pattern into multiple [rows/columns]."

from which description I would expect to be able to specify a regular expression which I wish to delimit the data on, for example a comma or a pipe, with a regular expression such as (,|\|) and the data would be split into columns wherever there is a comma or a pipe.

What the tool actually does is pull out all of the data which matches the regular expression into rows or columns, so in the above example I get lots of columns either containing a comma or a pipe rather than the data between them, which wasn't what I expected.  However it is not a major problem as either functionality can in fact be reproduced using the "replace" RegEx method (which we will see in part 3) and the text to columns tool (I won't go into the details here, but mail me if you need to do this and can't see how after reading the rest of this article).

So what I will show here is an example of how we can use the functionality the tool actually implements.  Imagine that you have a data set that is a series of messages or e-mails from which you wish to extract all the e-mail addresses which appear in those messages.  For example the message might be:

"You can contact me at nospam@fictionalcompany.co.uk or nospam@myhome.com"

and you want to extract those 2 e-mail addresses and any others in your data into columns or rows.  The solution: a modified version of our regular expression from earlier:

[^\s\.][^\s]*@[^@\s]+\.[^@\s\.,]+

which only matches exactly the e-mail addresses in the string as explained below:



[^\s\.] is any character that is not a '.' or space exactly once
[^\s]* is any character that is not a space zero or more times
@ is the '@' character exactly once
[^@\s]+ is any character that is not a '@' or a space one or more times
\. is the '.' character exactly once
[^@\s\.,]+ is any character that is not a '@', space, '.' or ',' one or more times

and the tokenise output method as so:


This will take an input such as:


and gives us an output of:


Now, try to do that without using RegEx!

This example again can be seen in full in the module RegEx_Example_Part2.yxmd downloadable here:


In part 3 I will look at the last and in my opinion the most useful of the RegEx tools methods: replace.