Pages

Saturday, 20 February 2010

RegEx and Alteryx (Part 1)


RegEx (or Regular Expressions) are a method for processing text which has its background in UNIX and perl. They are a fairly complex but incredibly powerful way of manipulating text strings, and what more, support for using them is built directly into Alteryx!

Firstly I would like to start with the basics of regular expressions with a simple example: e-mail address format validation. Consider the basic validation rules for an e-mail address: It must consist of a series of characters followed by a '@' symbol followed by another series of characters which must contain at least one '.' symbol (In actual fact you can have more complex rules than this for e-mail validation, but to keep this example simple we will stick with this simple definition).

I began to think how I would produce this validation in Alteryx without using RegEx. The most logical approach I came up with was the Alteryx formula tool:

First we look for the '@' symbol using the FINDSTRING function, so our first validation rule is FINDSTRING([emailaddress], '@') > 0 which checks there exists at least one @ in the address. Next we need to check for at least one '.' symbol after that @. Again we can use the FINDSTRING function after first removing everything before the '@' like so FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0. Which gives us a formula to implement this basic validation like so

FINDSTRING([emailaddress], '@') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0

So not too bad, but what if we now want to check that there isn't a second @ after the first and it doesn't start with a '.'? That formula is staring to look more complex and will only continue to do so as we add more rules.

FINDSTRING([emailaddress], '@') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')),'.') > 0 and FINDSTRING(SUBSTRING([emailaddress],FINDSTRING([emailaddress], '@')+1),'@') == -1 and left([emailaddress],1) != '.'

OK so how does it work the RegEx way? First we need a regular expression which will match our rule which is this

^.+@.+\..+$

At which point, for those who don't know RegEx we come to our first drawback: it is an entire new topic in its own right which needs to be learned; and until you have spent some time learning the rules (and probably for a while after you have learnt them) regular expressions look like gobbledygook. There's no getting round it: they come from a time in computing when memory and storage were extremely limited and developers had to make every character count, but once you get your head round them they can be incredibly useful.

I can explain the one above as follows:

^ indicates the start of the string
.+ is any character one or more times
@ is the '@' character exactly once
.+ is any character one or more times
\. is the '.' character exactly once
.+ is any character one or more times
$ indicates the end of the string

then we can use this in Alteryx, again still using the formula tool with the following expression:

REGEX_MATCH([emailaddress], '^.+@.+\..+$')

So this shows us the first way of using RegEx in Alteryx. The REGEX_MATCH function returns true if the string matches the regular expression, false if it does not.

So now how do we add our 2 extra rules of checking that there isn't a second @ after the first or it doesn't start with a '.'? Well, a slightly modified regular expression like so:

^[^\.].*@[^@]+\.[^@]+$

Which broken into its components is

^ indicates the start of the string
[^\.] is any character that is not a '.' exactly once
.* is any character zero or more times
@ is the '@' character exactly once
[^@]+ is any character that is not a '@' one or more times
\. is the '.' character exactly once
[^@]+ is any character that is not a '@' one or more times
$ indicates the end of the string

So our formula becomes

REGEX_MATCH([emailaddress], '^[^\.].*@[^@]+\.[^@]+$')

Alternatively we can use the Alteryx tool RegEx to do the same thing using the output method 'Match' as shown below:

The three different validation methods can be seen in the example Alteryx module that can be downloaded from here:



If that was all you could do with RegEx then it really wouldn't be worth the effort of learning a new language for.  But that is just the start of what we can do and the real power of RegEx becomes apparent when we start using some of the other RegEx methods available in Alteryx such as the REGEX_REPLACE function and the parse method, which I will look at in part 2 of this blog post.