Chaos Reigns Within: March 2010

Sunday, 28 March 2010

The King is dead. Long live the King.

At the Extend 2010 conference on Wednesday Dean Stoecker, then president of SRC, announced that SRC going forwards will just be called Alteryx. So no more SRC, no more www.ExtendTheReach.com, just Alteryx and www.Alteryx.com. The full announcement can be read here, but from a personal viewpoint I think this is a great move by the company and I think the new branding looks amazing. I'm excited about what I'm sure is going to be a great year for the company and product.

Extend 2010 before and after the re-branding:

Look out for a more detailed report from the conference later in the week and some blog posts on a couple of new macros written collaboratively at the conference.

Monday, 22 March 2010

RSS Readers (or how to keep up with all the Alteryx news in one place)

It might just be me who is behind the times, but until the end of last year I had never used an RSS reader. For those of you who don't know what an RSS reader is (or, like me, have vaguely heard of one but never quite known what the point of it is), it is basically a piece of software (web or Windows based) which pulls a number of news feeds together into one place. Still confused? Then click on the link below:

https://www.google.com/reader/bundle/user/08686630194161749278/bundle/Alteryx

This is a bundle of Alteryx feeds created using Google's RSS reader (cleverly named Google Reader). What you see on that page is all the recent posts from my blog; Ron's blog; the official SRC Alteryx blog and questions posted on the SRC forums. So all you need to do is subscribe to that bundle and you can keep up with all the Alteryx hints, tips and news in one place (http://www.google.com/reader). And what's more if new Alteryx news feeds become available in the future (or if anyone knows of any now that I am missing) then I will add them to that bundle and as a subscriber you will automatically start receiving those feeds as well.

So what are you waiting for? Subscribe now!

Reading KML Files

On the SRC Alteryx forum this week ashokbhatt asked if Alteryx can read and write KML files (http://srcgroup.extendthereach.com/showthread.php?t=1287). Coincidentally I also received an e-mail from a colleague on Friday asking about Alteryx reading GML files. While the answer to both of these questions is that there is no in-built support for reading or writing these file formats, they are both of course xml documents which can be digested by Alteryx using techniques similar to those I used in part 3 of my posts on RegEx here.

The below module shows how a simple placemark file could be read using Alteryx:

and this one how a simple placemark file could be written:

Obviously these only pick out a very specific section of the kml file, but hopefully will give some ideas of what could be done.
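As a rough illustration of the same idea outside Alteryx, here is a minimal Python sketch that treats the kml as plain text and uses regular expressions to pull the name and coordinates out of a simple placemark, and to write one back out. The sample document and the names read_placemarks and write_placemark are my own assumptions for illustration; they are not the contents of the modules above.

```python
import re

# Hypothetical sample: a minimal kml document holding one simple point placemark.
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Office</name>
    <Point>
      <coordinates>-0.1276,51.5072,0</coordinates>
    </Point>
  </Placemark>
</kml>"""

# Marked groups: name, longitude, latitude (kml stores longitude first).
PLACEMARK = re.compile(
    r"<Placemark>.*?<name>(.*?)</name>.*?"
    r"<coordinates>\s*([-\d.]+),([-\d.]+)(?:,[-\d.]+)?\s*</coordinates>.*?</Placemark>",
    re.DOTALL,
)

def read_placemarks(kml):
    """Return (name, longitude, latitude) for each simple point placemark."""
    return [(name, float(lon), float(lat)) for name, lon, lat in PLACEMARK.findall(kml)]

def write_placemark(name, lon, lat):
    """Build the text of one simple point placemark (name is not xml-escaped here)."""
    return (
        f"<Placemark><name>{name}</name>"
        f"<Point><coordinates>{lon},{lat},0</coordinates></Point></Placemark>"
    )

print(read_placemarks(SAMPLE_KML))
print(write_placemark("Home", 1.5, 52.0))
```

Like the modules, this only handles one very specific shape of placemark; anything richer would need a proper xml parser.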

The difficulty in writing a generic kml reader is that kml is such a rich data format (http://code.google.com/apis/kml/documentation/kmlreference.html) that some of the data stored in it would have little meaning in Alteryx.

Saturday, 20 March 2010

Alteryx Plays Chess

"OK, the first thing I have to admit before I start and anyone gets too excited is that my wizard doesn't actually play chess (I haven't got the time to write a chess engine at the moment ;-) ). What it does do is let two users play chess against each other. But with that caveat said, on with the post..."

I'm not quite sure how this particular wizard came into being. I had just read Bill Flanagan's post (here) on using the mapping tool to create a polar graph, which I thought was great, and wanted to have a go myself at creating something custom using the mapping tool. It turned out I couldn't really think of any pressing business problem to apply this use of the mapping tool to, so just to have something to experiment with I thought I would render a chess board and pieces. So one lunchtime, using the Alteryx spatial tools and some custom chess piece icons I got from the net, I built a simple module which renders a chess board and pieces.

Once I'd got that far I started wondering how difficult it would be to make a wizard which allowed a user to move those pieces around the board. It turns out it didn't need many more tools to allow a user to enter a start and end co-ordinate in a wizard front end and get the piece to move between those locations. Which gives us a wizard which allows two people to play chess against each other!

What I haven't done is program Alteryx with the rules of the game: although I think it would be possible to program that logic within Alteryx, that one is going to have to wait until I have more time. Then the next step of getting the wizard to actually make its own moves in response I think is going to be beyond the abilities of Alteryx (although you could always run an external engine via the command tool...)

But having said that the wizard as it stands is a fun demonstration of what can be done using the mapping tool. It is available for download here:

Monday, 15 March 2010

\0 delimited

Ever wondered what the point of \0 delimited files would be? Ever wished you could output delimited files with no quote qualifiers? I had done both and having found the answers thought I would share them here.

Last week I ran into a problem I have had on a few occasions over the years I've worked with Alteryx. This time round some of the tricks I had used before weren't going to work so I decided to post the problem on the SRC forum. A big thanks to Margarita at SRC, as the same day I got a reply which let me build a working solution!

My problem is a seemingly simple one to do with the delimited output file which Alteryx produces. What I wanted to output was a delimited file with a pipe delimiter and no qualifiers. Alteryx does pipe delimited no problem, but when you come to the options on qualifiers all you have is the tick box "Always qualify fields", which is either true or false. There is no option to "Never qualify fields", which is what I wanted.

So as you might have guessed from the title of the post the solution comes from using \0 as the delimiter. What \0 does is produce an output file with no delimiters and no qualifiers, which at first glance might not seem so useful. But what it allows you to do is build up your output file exactly as you want it using the functionality of Alteryx.

For example if you have a dataset with three fields some of which contain quotes and you want to output a pipe delimited, non-qualified file; then you can just use the formula tool to build a new field like so:

[Field1] + "|" + [Field2] + "|" + [Field3]

and then a select to output only that new field to a \0 delimited file.
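For anyone who finds it easier to see in code, the same idea can be sketched in Python: build each output line yourself by joining the field values with a pipe and adding no qualifiers, then write that single string out as is. The build_line name and the sample rows are assumptions made for illustration.

```python
def build_line(fields, delimiter="|"):
    """Join field values with the delimiter, adding no quote qualifiers."""
    return delimiter.join(fields)

# Hypothetical rows of three fields, some of which contain quotes.
rows = [
    ['He said "hello"', "42", "plain"],
    ["a", 'b "quoted"', "c"],
]

# Each line plays the role of the single new field sent to the \0 delimited output.
lines = [build_line(row) for row in rows]
print("\n".join(lines))
```

The quotes in the data pass straight through untouched, which is exactly what the "Always qualify fields" tick box cannot give you.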

The below module shows a few techniques of how you could use this:

Wednesday, 3 March 2010

RegEx and Alteryx (Part 3)

For those of you who haven't read the comments on part 2, Ned Harding has posted some extra information on how the Tokenise Method works here. This shows how you can use the tool to split a field on a RegEx matched delimiter as the help file describes.

On with part 3...

Replace Output Method

The last and I think possibly the most powerful of the four methods in the RegEx tool. For this method I am going to leave the e-mail example and take a look at something I seem to be needing to do more and more recently: xml processing. So many applications today are outputting xml files or requiring xml input, that being able to process it efficiently needs to be a staple tool of any data developer.

Before I began to learn about RegEx I struggled to process xml using Alteryx as from what I have seen, there are no specific tools for reading or writing it.

(this is the stage where somebody says to me: "why didn't you just use the xxx tool?" and I say because I didn't know it existed. Oh the time I could have saved if I'd only known about that multi-field formula tool a year earlier...)

I was using complex formulas to search for parts of xml strings and then more formulas to extract the piece of data I wanted. With RegEx finding and extracting parts of xml suddenly becomes rather easy.

For this example I will use a piece of xml which I hope that anyone who has used Alteryx much will have at least seen, if not manipulated themselves: an Alteryx module. Yes, for those of you that weren't aware, when you save an Alteryx module it is saved as an xml document which, with due care, can be modified in any text editor. Great for those times when you need to rename about 100 fields using a file layout you have been supplied in Excel.

This example can be followed in the module RegEx_Example_Part3.yxmd downloadable from here:

The example module uses its own xml as an input and outputs a list of all of the tool ids in the module; along with their types and their x, y display positions. Run the module and take a look at the output browse, then add a new tool, save the module and notice how the new tool you added is now included in the output.

So how does it work? Well there are 3 stages to the module:

1. Get the xml into a single field and single row.
2. Use our RegEx replace to extract the required information from the xml.
3. Use Text to Columns tools and a select to format the output.

Now as this article is focusing on the RegEx tools I will leave you to work out how stages 1 and 3 work for yourselves and only explain stage 2. The module actually uses the RegEx tool with the replace output method 3 times:

The first use is to remove all of the carriage return line feeds from the data:

So \r\n is the Regular Expression for a carriage return line feed combination; we have specified that we will replace it with nothing and that all the other unmatched text is passed through to the output as well. I think this could be done using the standard formula REPLACE function, but I think you would need to use the CHARFROMINT function and know the character codes for a carriage return and line feed (13 and 10), which isn't really any easier.

The second use is to remove all white space from between the xml tags:

This time we are searching for one or more white space characters between a > and a < character and replacing the whole lot with ><. This is another example of something which I believe would be very difficult without RegEx; you could remove all spaces from the data easily enough, but that would remove spaces inside the tags which we wish to keep. You could also remove a fixed number of spaces between the ends of tags, but to remove an unknown number of spaces I think you would struggle.
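The two preparation steps described above can be sketched in Python as a pair of replacements. This is only an equivalent under my own assumptions: the pattern >\s+< is my reading of the description rather than a copy of the module's setting, and flatten_xml and the sample xml are hypothetical.

```python
import re

def flatten_xml(xml):
    """Remove CRLF line breaks, then any white space sitting between two tags."""
    no_breaks = re.sub(r"\r\n", "", xml)
    return re.sub(r">\s+<", "><", no_breaks)

# Hypothetical fragment; note the space inside the Plugin attribute value.
sample = (
    '<Node ToolID="52">\r\n'
    '    <GuiSettings Plugin="Some Plugin">\r\n'
    "  </GuiSettings>\r\n"
    "</Node>"
)

print(flatten_xml(sample))
```

The space inside "Some Plugin" survives because it is not sitting between a > and a <, which is the whole point of matching on the tag boundaries rather than on spaces alone.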

The final use, having prepared the xml, is to extract the data elements from the xml corresponding to the Alteryx tools.

From looking at the data I identified that the block of data I was interested in always looked like this:

which I can quite easily write a regular expression to match like so:

<Node ToolID="(\d+)"><GuiSettings Plugin="([^"]+)"><Position x="(\d+)" y="(\d+)"

Notice that I have four marked groups (the parts of the regular expression in parentheses) representing the parts of the xml data I wish to extract. The power of a RegEx replace is that I can reference these marked groups in the replacement text.

For example if I made my replacement text $1 then the whole expression would be replaced by 52 (the first marked group) in the above example.

What I choose for my replacement in this example is

$1¬$2¬$$3,$4$|

Which for the above example results in

52 ¬AlteryxBasePluginsGui.BrowseV2.BrowseV2¬$99, 779$|

This pulls out the 4 pieces of data I want in a nicely delimited format which I can then split into cells by using the text to columns tool, which is what happens in stage 3 of the module.
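The extraction step can be sketched in Python using the same regular expression. Python writes group references as \1 where the RegEx tool uses $1, and I have used a simplified replacement of my own (id¬plugin¬x,y) rather than the exact replacement text above. The sample xml is a hypothetical fragment shaped to fit the expression, and extract_tools is an illustrative name.

```python
import re

NODE = re.compile(
    r'<Node ToolID="(\d+)"><GuiSettings Plugin="([^"]+)"><Position x="(\d+)" y="(\d+)"'
)

# Hypothetical flattened module xml, shaped to fit the expression above.
sample = (
    '<Nodes><Node ToolID="52">'
    '<GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2">'
    '<Position x="99" y="779" /></GuiSettings></Node></Nodes>'
)

def extract_tools(xml):
    """Return one 'id¬plugin¬x,y' record per tool matched in the xml."""
    return [m.expand(r"\1¬\2¬\3,\4") for m in NODE.finditer(xml)]

# Splitting each record on the delimiters stands in for stage 3 (Text to Columns).
for record in extract_tools(sample):
    tool_id, plugin, position = record.split("¬")
    x, y = position.split(",")
    print(tool_id, plugin, x, y)
```

Choosing a delimiter such as ¬ that is unlikely to appear in the data is what makes the later split safe.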

I think that the RegEx replace tool is possibly the most useful of the RegEx tools available in Alteryx and definitely worth spending the time getting to understand.

Other RegEx Functions

The last thing to complete our overview of the RegEx capabilities of Alteryx is to briefly mention the remaining two functions available in the formula tool.

REGEX_COUNTMATCHES takes a string and a regular expression and returns the number of times the regular expression matches the string.

REGEX_REPLACE works in a similar manner to the replace method of the RegEx tool, but within the formula tool. One exciting use of this is in the actions of macros and wizards, an example of which I plan to cover in a future post.
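To give a feel for what these two functions do, here are rough Python analogues. They are sketches only: count_matches and regex_replace are hypothetical names, and details such as case sensitivity may well differ from the Alteryx functions.

```python
import re

def count_matches(text, pattern):
    """Count the non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

def regex_replace(text, pattern, replacement):
    """Replace every match of pattern in text; \\1, \\2... refer to marked groups."""
    return re.sub(pattern, replacement, text)

print(count_matches("a@b.com, c@d.org", r"@"))
print(regex_replace("2010-03-28", r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1"))
```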

Conclusion

That concludes my overview of the RegEx capabilities of Alteryx, though my examples are just some starting ideas of what is possible. I hope I have inspired you to learn some basic RegEx, and that next time you are struggling with a complex string manipulation you will think about using the RegEx toolset.

If you have any specific questions or comments on anything I have posted about then feel free to e-mail me or post a comment here.

Monday, 1 March 2010

RegEx and Alteryx (Part 2)

In part 1 we saw the Match Method of the RegEx tool and the REGEX_MATCH() function; in part 2 we will take a look at the Parse and Tokenise Methods of the RegEx tool.

Parse Output Method

So continuing the e-mail validation example from part 1, imagine that having validated our e-mail field we now want to extract the domain names (everything after the @) to see how many of our customers use which e-mail domains. To do this we can use the RegEx tool with the Parse output method. We can use exactly the same regular expression as we used to validate the e-mail in part 1, with the addition of a set of parentheses (the pair wrapped around everything after the @ below) around the part of the RegEx which we wish to extract:

^[^\.].*@([^@]+\.[^@]+)$

In RegEx this part of the expression is called a marked group and the Parse output method will extract each marked group in the regular expression as a new field.

This example can be seen in full in the sample module RegEx_Example_Part2.yxmd downloadable from the bottom of the article.

Admittedly this could be done with non-RegEx functions in the formula tool, but having already written our RegEx validation rule this is a much easier and quicker solution. Plus if we now want to extend it to extract everything before the @ into a separate field we can just add two more parentheses:

^([^\.].*)@([^@]+\.[^@]+)$
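Marked groups behave the same way in most RegEx engines, so the parse step can be sketched in Python with the expression above. The parse_email name and the sample addresses are assumptions for illustration; the Counter at the end stands in for the "how many customers use which domain" summary.

```python
import re
from collections import Counter

EMAIL = re.compile(r"^([^\.].*)@([^@]+\.[^@]+)$")

def parse_email(address):
    """Return (local part, domain) from the two marked groups, or None if no match."""
    m = EMAIL.match(address)
    return m.groups() if m else None

# Hypothetical customer data, including one value that fails validation.
emails = [
    "nospam@fictionalcompany.co.uk",
    "nospam@myhome.com",
    "other@myhome.com",
    "not an e-mail",
]

parsed = [p for p in map(parse_email, emails) if p]
domain_counts = Counter(domain for _, domain in parsed)
print(domain_counts)
```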

Tokenise Output Method

From the examples I have put together for this article it appears that there is a slight mismatch between what the help file says this output method does and what it actually does. The help file states:

"The Tokenize Method behaves like the Text to Columns tool but can also separate on punctuation.

...will split a single column of data at each instance of the specified regular expression pattern into multiple [rows/columns]."

from which description I would expect to be able to specify a regular expression which I wish to delimit the data on, for example a comma or a pipe, with a regular expression such as (,|\|) and the data would be split into columns wherever there is a comma or a pipe.

What the tool actually does is pull out all of the data which matches the regular expression into rows or columns, so in the above example I get lots of columns either containing a comma or a pipe rather than the data between them, which wasn't what I expected. However it is not a major problem as either functionality can in fact be reproduced using the "replace" RegEx method (which we will see in part 3) and the text to columns tool (I won't go into the details here, but mail me if you need to do this and can't see how after reading the rest of this article).
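The difference between the two behaviours is easy to see in Python, where findall returns the text that matches the expression and split returns the text between the matches. This is an analogy under my own assumptions, not a description of how the Alteryx tool is implemented, and the sample data is made up.

```python
import re

data = "red,green|blue"

# Returns what the expression matches: the delimiters themselves.
tokens = re.findall(r"(,|\|)", data)

# Returns what lies between the matches: the values.
pieces = re.split(r",|\|", data)

print(tokens)
print(pieces)
```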

So what I will show here is an example of how we can use the functionality the tool actually implements. Imagine that you have a data set that is a series of messages or e-mails from which you wish to extract all the e-mail addresses which appear in those messages. For example the message might be:

"You can contact me at nospam@fictionalcompany.co.uk or nospam@myhome.com"

and you want to extract those 2 e-mail addresses and any others in your data into columns or rows. The solution: a modified version of our regular expression from earlier:

[^\s\.][^\s]*@[^@\s]+\.[^@\s\.,]+

which matches only the e-mail addresses in the string, as explained below:

[^\s\.] is any character that is not a '.' or space exactly once

[^\s]* is any character that is not a space zero or more times

@ is the '@' character exactly once

[^@\s]+ is any character that is not a '@' or a space one or more times

\. is the '.' character exactly once

[^@\s\.,]+ is any character that is not a '@', space, '.' or ',' one or more times

and the tokenise output method like so:

This will take an input such as:

and gives us an output of:

Now, try to do that without using RegEx!
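The same expression can be tried outside Alteryx as well. A minimal Python sketch, using the example message from earlier, with findall standing in for the tokenise method:

```python
import re

ADDRESS = re.compile(r"[^\s\.][^\s]*@[^@\s]+\.[^@\s\.,]+")

message = "You can contact me at nospam@fictionalcompany.co.uk or nospam@myhome.com"

# One list entry per match, much like tokenising to rows.
addresses = ADDRESS.findall(message)
print(addresses)
```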

This example again can be seen in full in the module RegEx_Example_Part2.yxmd downloadable here:

In part 3 I will look at the last and in my opinion the most useful of the RegEx tools methods: replace.