Throughout my career working with data I have had numerous conversations about what we mean when we talk about "Big Data". I have long argued that the vast majority of companies, data analysts and data engineers do not have "Big Data". But before we can qualify that claim we need to agree on what we mean by "Big Data", and of course, as with anything we try to measure, size is largely relative to what we are used to working with. Many years ago I met a customer who, prior to my visit, told me they had a big data set; when I got there I found a million-row table that they couldn't open in Excel (back in the days of the old 65,536-row Excel limit).
My personal rough definition of "Big Data" is data that is too big to store and/or process on a single machine. When data gets that big, the rules of the game fundamentally change: you need different tools, different skillsets, almost a completely different mindset for how you work with your data. Enso Analytics is not one of those specialised tools, and again I will assert that most companies do not have, or need, those kinds of volumes of data.
My Favourite CSV File
But that leaves a gap: data that is too big for tools like Excel, but small enough to fit on a single machine. (Our CPO Ned Harding once described this as "Medium Data", but it never stuck with the sales team he pitched it to.) A good example of this size of data is my personal favourite CSV file: the UK Companies House dataset!
Available to download for free from the Companies House website (https://download.companieshouse.gov.uk/en_output.html), it is a snapshot of all the registered companies in the UK, and a perfect example of a medium-sized dataset. It is about 470 MB zipped, and I can download it in a couple of minutes on my modestly fast broadband connection. Unzipped it is 2.8 GB, which fits perfectly well on my hard drive, and indeed in the 36 GB of RAM that I have on this laptop. But at 5.6 million rows, Excel is not going to be able to do anything with this file.
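As a quick sanity check, the figures from the paragraph above can be lined up against Excel's modern worksheet limit of 1,048,576 rows:

```python
# Approximate figures from the paragraph above
EXCEL_MAX_ROWS = 1_048_576   # modern Excel worksheet row limit
rows = 5_600_000             # rows in the Companies House snapshot
unzipped_gb = 2.8            # unzipped file size
ram_gb = 36                  # RAM on my laptop

print(rows > EXCEL_MAX_ROWS)   # True: far too many rows for Excel
print(unzipped_gb < ram_gb)    # True: but it fits comfortably in memory
```

The dataset is more than five times over Excel's row limit, yet an order of magnitude smaller than the RAM on an ordinary laptop: the definition of "medium data".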
So why do I call it my favourite CSV file? Well, for a couple of reasons:
1. It is a great example of a medium-sized dataset that needs a more heavyweight tool than Excel to process.
2. At the previous analytics software company I worked for, this was a test dataset when we completely rewrote the data engine that underpinned the product. We were very proud of how fast we could read that file, and even got a patent around some of the techniques we used: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10552452 (Spoiler: Enso Analytics can also read this file very fast, with help from our friends at DuckDB.)
Medium Data and Enso Analytics
So what is the story with "medium data" and Enso Analytics? If you had come to me this time last year and asked how Enso copes with datasets in the GB size range, I would have told you, with full honesty, "not great": Enso gives you a live view of your data as you build your workflow, and it is an in-memory data engine. But I would have gone on to say that if you have data of that size, you really don't want it sitting in random CSV files on your hard drive. Data that size belongs in a database (I would have told you), such as Snowflake, SQL Server or Postgres, all of which Enso not only supports but has built-in pushdown capabilities for, meaning you can use our easy visual coding tools while all the processing is done in the database. That was all true, and good advice, but in reality what do you do when you want to analyse some data that isn't in your data warehouse?
Again, this is where the Companies House dataset is a good example. You are unlikely to have it in your data warehouse: day to day you probably don't need it, and keeping it updated is a job you don't want. But what do you do when you get an ad-hoc request?
How does DuckDB help?
Imagine you are working for a fictional coffee company. Your CEO has just called you up: she has had a call from a large multi-national chain telling her she can't use the word "star" in the name of her coffee company. "We can't be the only coffee company that has used the word star," she says. "There must be others. Can you find out for me?"
Of course, the Companies House dataset is perfect for this question, so how do we solve it today with Enso?
Well, actually, my advice is still the same: that data needs to be in a database! But that database is going to be DuckDB, and it is built directly into Enso!
1. We download the Companies House CSV file (about 2 minutes).
2. We connect to our built-in version of DuckDB.
3. We load the dataset directly into the DuckDB database; on my machine this takes a few seconds.
4. Then we use Enso's visual components to quickly query the data.
We find there are 23,182 companies whose names contain the letters STAR, and only 31 that contain both STAR and COFFEE, which we can quickly export to Excel and send over to our CEO.
At the end of this post, I will leave you with a question: how would you solve this today? What tools would you use, and how confident would you be in your answer? More on that in my next post.
I think a skilled Enso user could answer this question in 10 minutes and be confident in their answer being correct.