Sunday, 28 February 2010

RegEx and Alteryx (Part 2)

In part 1 we saw the Match Method of the RegEx tool and the REGEX_MATCH() function; in part 2 we will take a look at the Parse and Tokenise Methods of the RegEx tool.


Parse Output Method


So continuing the e-mail validation example from part 1, imagine now that having validated our e-mail field we now want to extract the domain names (everything after the @) to see how many of our customers use which e-mail domains.  To do this we can use the RegEx tool with the Parse output method.  We can use exactly the same regular expression as we used to validate the e-mail in part 1 with the addition of a set of parentheses (shown in red below) around the part of the RegEx which we wish to extract:


^[^\.].*@([^@]+\.[^@]+)$


In RegEx this part of the expression is called a marked group and the Parse output method will extract each marked group in the regular expression as a new field.




This example can be seen in full in the sample module RegEx_Example_Part2.yxmd downloadable from the bottom of the article.

Admittedly this could be done with non-RegEx functions in the formula tool, but having already written our RegEx validation rule this is a much easier and quicker solution.  Plus if we now want to extend it to extract everything before the @ into a separate field we can just add two more parentheses:

^([^\.].*)@([^@]+\.[^@]+)$

Tokenise Output Method

From the examples I have put together for this article it appears that there is a slight mismatch between what the help file says this output method does and what it actually does.  The help file states:

"The Tokenize Method behaves like the Text to Columns tool but can also separate on punctuation. 
...will split a single column of data at each instance of the specified regular expression pattern into multiple [rows/columns]."

from which description I would expect to be able to specify a regular expression which I wish to delimit the data on, for example a comma or a pipe, with a regular expression such as (,|\|) and the data would be split into columns wherever there is a comma or a pipe.

What the tool actually does is pull out all of the data which matches the regular expression into rows or columns, so in the above example I get lots of columns either containing a comma or a pipe rather than the data between them, which wasn't what I expected.  However it is not a major problem as either functionality can in fact be reproduced using the "replace" RegEx method (which we will see in part 3) and the text to columns tool (I won't go into the details here, but mail me if you need to do this and can't see how after reading the rest of this article).

So what I will show here is an example of how we can use the functionality the tool actually implements.  Imagine that you have a data set that is a series of messages or e-mails from which you wish to extract all the e-mail addresses which appear in those messages.  For example the message might be:

"You can contact me at nospam@fictionalcompany.co.uk or nospam@myhome.com"

and you want to extract those 2 e-mail addresses and any others in your data into columns or rows.  The solution: a modified version of our regular expression from earlier:

[^\s\.][^\s]*@[^@\s]+\.[^@\s\.,]+

which only matches exactly the e-mail addresses in the string as explained below:



[^\s\.] is any character that is not a '.' or space exactly once
[^\s]* is any character that is not a space zero or more times
@ is the '@' character exactly once
[^@\s]+ is any character that is not a '@' or a space one or more times
\. is the '.' character exactly once
[^@\s\.,]+ is any character that is not a '@', space, '.' or ',' one or more times

and the tokenise output method as so:


This will take an input such as:


and gives us an output of:


Now, try to do that without using RegEx!

This example again can be seen in full in the module RegEx_Example_Part2.yxmd downloadable here:


In part 3 I will look at the last and in my opinion the most useful of the RegEx tools methods: replace.


Saturday, 20 February 2010

RegEx and Alteryx (Part 1)


RegEx (or Regular Expressions) are a method for processing text which has its background in UNIX and perl. They are a fairly complex but incredibly powerful way of manipulating text strings, and what more, support for using them is built directly into Alteryx!

Firstly I would like to start with the basics of regular expressions with a simple example: e-mail address format validation. Consider the basic validation rules for an e-mail address: It must consist of a series of characters followed by a '@' symbol followed by another series of characters which must contain at least one '.' symbol (In actual fact you can have more complex rules than this for e-mail validation, but to keep this example simple we will stick with this simple definition).

I began to think how I would produce this validation in Alteryx without using RegEx. The most logical approach I came up with was the Alteryx formula tool:

First we look for the '@' symbol using the FINDSTRING function, so our first validation rule is FINDSTRING([emailaddress], '@') > 0 which checks there exists at least one @ in the address. Next we need to check for at least one '.' symbol after that @. Again we can use the FINDSTRING function after first removing everything before the '@' like so FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0. Which gives us a formula to implement this basic validation like so

FINDSTRING([emailaddress], '@') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0

So not too bad, but what if we now want to check that there isn't a second @ after the first and it doesn't start with a '.'? That formula is staring to look more complex and will only continue to do so as we add more rules.

FINDSTRING([emailaddress], '@') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')+1),'@') == -1 and left([emailaddress],1) != '.'

OK so how does it work the RegEx way? First we need a regular expression which will match our rule which is this

^.+@.+\..+$

At which point, for those who don't know RegEx we come to our first drawback: it is an entire new topic in its own right which needs to be learned; and until you have spent some time learning the rules (and probably for a while after you have learnt them) regular expressions look like gobbledygook. There's no getting round it: they come from a time in computing when memory and storage were extremely limited and developers had to make every character count, but once you get your head round them they can be incredibly useful.

I can explain the one above as follows:

^ indicates the start of the string
.+ is any character one or more times
@ is the '@' character exactly once
.+ is any character one or more times
\. is the '.' character exactly once
.+ is any character one or more times
$ indicates the end of the string

then we can use this in Alteryx, again still using the formula tool with the following expression:

REGEX_MATCH([emailaddress], '^.+@.+\..+$')

So this shows us the first way of using RegEx in Alteryx. The REGEX_MATCH function returns true if the string matches the regular expression, false if it does not.

So now how do we add our 2 extra rules of checking that there isn't a second @ after the first or it doesn't start with a '.'? Well, a slightly modified regular expression like so:

^[^\.].*@[^@]+\.[^@]+$

Which broken into its components is

^ indicates the start of the string
[^\.] is any character that is not a '.' exactly once
.* is any character zero or more times
@ is the '@' character exactly once
[^@]+ is any character that is not a '@' one or more times
\. is the '.' character exactly once
[^@]+ is any character that is not a '@' one or more times
$ indicates the end of the string

So our formula becomes

REGEX_MATCH([emailaddress], '^[^\.].*@[^@]+\.[^@]+$')

Alternatively we can use the Alteryx tool RegEx to do the same thing using the output method 'Match' as shown below:

The three different validation methods can be seen in the example Alteryx module that can be downloaded from here:



If that was all you could do with RegEx then it really wouldn't be worth the effort of learning a new language for.  But that is just the start of what we can do and the real power of RegEx becomes apparent when we start using some of the other RegEx methods available in Alteryx such as the REGEX_REPLACE function and the parse method, which I will look at in part 2 of this blog post.


Welcome

Taking inspiration from Ron House I thought I might offer some musings on Alteryx from this side of the Atlantic. I've got a couple of topics already that I want to post and Alteryx has a habit of making me come up with new ideas and techniques, so I hope there will be plenty more.