Thursday 17 December 2020

CReW Data Cleanse

Yet another new macro is about to be released!  Alteryx has "AMP"ed up their game and while not all tools are AMP-Compatible, surely more speed is coming.  I've long thought about taking a stab at a CReW Data Cleanse tool, but since the release of an ALTERYX macro that delivers that functionality I have stayed in the peanut gallery.  I'll be releasing a CReW macro before Christmas 2020.  On Friday, December 18th I will surprise the Dallas-Fort Worth User Group (Virtual) with a presentation on the macro instead of (or in addition to) their expected presentation on CReW Generate Dummy Variables.



When you use one (or more) Data Cleanse macros in your workflow, you do sacrifice performance.  Think of it as cleaning the house with a toothbrush and a cup of water.  I am offering you a mop and a bucket.  Make your workflows run faster and give others a chance to use the shared resources.  If you're working with large sets of data, this macro upgrade is essential.

Alteryx Community Post:  https://bit.ly/34mfA0U

Presentation (YouTube) on this macro can be found here:  https://youtu.be/kIDqnwT4g5Q  

Walk-thru (YouTube) of macro can be found here:  https://youtu.be/uA5oyfmE2uc

Cheers,

Mark






Thursday 10 December 2020

Building CReW Generate Dummy Variables

Background:

This story goes back to the 1980's when I was asked to test new bank software and to document what errors I uncovered.  I completed an Error Incident Report and walked that form to the next building for review by the development manager.  Finding errors meant that I got to leave my windowless cubicle and walk in the California sunshine (at least temporarily).  The developers needed to know what process they would need to perform to recreate the incident and what the expected results of such actions should be.  Upon return to my desk, I would have to summarize the days findings and explain back to the project management what issues were found and what their potential impacts would be on the bank's production environment if left unresolved.

The tried and true waterfall development process left testing until after the requirements and design were completed and I eagerly awaited software to find what were sure to be plenty of bugs.  The biggest contention was whether the "expected" results were well defined in the requirements because I would report "errors" that could be considered as missed requirements.

Creating a CReW macro does not include requirement documents, design specifications, use cases, or an approval process.  In my, Mark's, POV I create what I think captures the essence of a design pattern (assembly of Alteryx tools) into a single tool.  Some of these patterns are fairly simple and straight-forward and reduce the time to configure multiple-tools and some patterns are difficult for beginners to gain the concepts for.  In either case, the tool obfuscates the pattern and makes something easier for the user.

As an Alteryx ACE, I'm able to attend developer days where we have an opportunity to share feedback with the development (sometimes with product managers) team and discuss our views and needs as well as to give our feedback on their ideas.  Observant as I am to the NDA, that's where that story ends and this macro blog begins.

Let's talk Cross Tab"

Who doesn't like a Cross Tab tool?  Frankly, I don't know why it is a "Cross Tab" tool and isn't a "Crosstab" tool.  There are more than 200 ideas in the Alteryx Community returned when you query that term where 1/3 of these ideas are authored by Alteryx ACEs.  Just because there are literally hundreds of ideas posted, doesn't mean that the tool is broken.  It means that improvements are requested.  The cross tab tool is at the heart of the create dummy variables process and this process was the topic of my latest rant to Alteryx.

Cross Tab - Areas of potential improvement

Here are a few of my personal grievances:
  • Configuration is not saved as some new tools do
  • Dropdown and select are not searchable
  • Domain/headers replace non-alphanumeric characters with underscore
  • Output columns are in alphabetical order
  • You need data that supports all possible values otherwise Output columns are not generated
  • Downstream tools fail if not run with sufficient data

Cross Tab - Work-arounds

  • Use a find replace tool to prepend a sort sequence to each field value (to change the incoming data) so that the column name sorts data (e.g. 00_Sunday, 01_Monday).  Later I can dynamically rename the columns
  • Use an Ensure Fields (CReW) macro to create needed columns of data (after Cross Tab). 
  • Place a select in front of the cross tab.  Only bring a minimum amount of fields into cross tab.  Join back on either record sequence or recordid depending on need (e.g. use of AMP or if output records aren't 1:1 to input)
  • Sample the output data (0 records) and save to YXDB. This saves the metadata for a good output.  You can later Union this data within the workflow and ensure order and presence of data 

Let's talk Dummy Variables

After my short (not so) rant, I realized that I should take on the challenge to improve Alteryx myself through the creation of the Generate Dummy Variables tool.  It won't improve the Cross Tab, but does address the issues that take the greatest amount of my time when creating workflows that either model or score data.  With 30 or more community posts on the subject, maybe it will help folks in the same boat as I am in.



Defining Scope:

Since I don't apply requirement documents or use cases to my CReW activities, I need to set boundaries for functionality and time to build.  What can I create in a day?  Yes, when I build a macro I hope to have it created in a day.  That day generally spans for more than a week, but that includes blogs, builds, testing, feedback, documentation and icon selection.  

Scope for Dummy Variables:

The scope is for a single variable (despite my knowledge that you may have multiple source fields to convert).  I want to accept categorical data elements and provide the ability to rename the output columns to include the column name.  If possible, I want to add the functionality of an ENSURE FIELDS macro (CReW), but not require the user to have that macro downloaded.

Scope Creep:

That idea to include the variable name as the first part of the output column name (e.g. Color_G) created a thought that a custom prefix might be preferred to the actual field name.  Then I ran into problems with spaces (thank you Cross Tab) in the output column name field.  When I added the ensure fields capability (to pretend that data values not present were accounted for), I had space problems and NULL value issues that needed more creativity.  Finally, I was ready for testing and went to fellow ACE, Dan Languedoc for testing support.  He suggested (insisted) that I allow for INT fields to be supported as categorical variables.  "Fine!", I'll add them.  He later said that my macro was indestructible.

Development:

I was unhappy with my final product.  I felt that it wasn't easy to walk-through to explain to others and didn't feel satisfied with the macro.  I had added functionality and had to ensure compliance in multiple locations of the workflow.  It was not pretty.  So I salvaged what I could and re-wrote the macro.  It took me about an hour to come up with the current version.

Testing:

While I was building the macro, I was actively testing the macro in another workflow.  I was using different configuration settings as well as different incoming data (e.g. things with and without spaces).  When I want to see what's happening inside of the macro, I use a "debug" trick to place a Macro Output tool before and/or after a tool and look at the results in the secondary workflow.  I can see what the data looks like going into and out of a dynamic rename or other tools that are in the middle of the workflow and can quickly reconnect them to another point of interest.

Defensive Configuration or "APA Readiness":

Will this macro fail in the future?  I hope not.  I hope that it works and that I never have to maintain or support it again.  But what might be worse is to have consistently wrong results.  The biggest risk that I saw was dropping or muliplying records.  I use a TEST tool to ensure that the macro input record count exactly matches the output record count.  If you drop a record (e.g. Create Points) and don't let the user know about it, I think that it is an error.  I also added a check that if you ask to ENSURE field valuess that the macro will error if you don't actually input a field value.  Also, if you copy/paste a field value into the ensured fields, if you don't mark it as required, that we will ignore the request.  

Now if the macro fails, we can all blame Dan.  He said that it was good to go.

P.S.:  Here's a link to the macro!