Tuesday, 24 February 2026

Working with Big(ger) Data Using Enso Analytics and DuckDB

Throughout my career working with data I have had numerous conversations about what we mean when we talk about "Big Data". I have long argued that the vast majority of companies, data analysts and data engineers do not have "Big Data". But before we can qualify that claim we need to agree on what we mean by "Big Data", and of course, as with anything we try to measure, size is largely relative to what we are used to working with. Many years ago I met a customer who, prior to my meeting with them, told me they had a big data set; when I got there I found they had a million-row table that they couldn't open in Excel (back in the days of the old 65k Excel row limit). My personal rough definition of "Big Data" is data that is too big to store and/or process on a single machine. When your data gets that big, the rules of the game fundamentally change: you need different tools, different skillsets, almost a completely different mindset for how you work with your data. Enso Analytics is not one of those specialised tools. Again, I will assert that most companies do not have, or need, these kinds of volumes of data.

My Favourite CSV File

That leaves a gap: data that is too big for tools like Excel, but small enough to fit on a single machine. (Our CPO Ned Harding once described this as "Medium Data", but with the sales team he pitched it to, the name never stuck.) A good example of this size of data is my personal favourite CSV file: the UK Companies House dataset!

It is available to download free from the Companies House website at https://download.companieshouse.gov.uk/en_output.html, and is a snapshot of all the registered companies in the UK. It is a perfect example of a medium-sized dataset. It is about 470 MB zipped, and I can download it in a couple of minutes on my modestly fast broadband connection. Unzipped it is 2.8 GB, which fits perfectly well on my hard drive, and indeed in the 36 GB of RAM I have on this laptop. But at 5.6 million rows, Excel is not going to be able to do anything with this file.

So why do I call it my favourite CSV file? Well for a couple of reasons:

1. It is a great example of a medium-sized dataset that needs a more heavyweight tool than Excel to process.

2. At the previous analytics software company I worked for, this was a test dataset when we completely rewrote the data engine that underpinned the product. We were very proud of how fast we could read that file, and even got a patent around some of the techniques we used: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10552452 (Spoiler: Enso Analytics can also read this file very fast, with help from our friends at DuckDB.)

Medium Data and Enso Analytics

So what is the story with "medium data" and Enso Analytics? Well, if you had come to me this time last year and asked how Enso copes with datasets in the GB size range, I would have told you, in full honesty, "not great": Enso gives you a live view of your data as you build your workflow, and it is an in-memory data engine. But... I would have gone on to say that if you have data in that size range, you really don't want to be storing it in random CSV files lying around on your hard drive. Data that size belongs in a database (I would have told you), such as Snowflake, SQL Server or Postgres, all of which Enso not only supports but has built-in pushdown capabilities for, meaning you can use our easy visual coding tools while all the processing is done in the database. That was all very true and good advice, but in reality what do you do when you want to analyse some data that isn't in your data warehouse?

Again, this is where the Companies House dataset is a good example. You are unlikely to have it in your data warehouse. Day to day you probably don't need it, and keeping it updated is a job you don't need to do. But what do you do when you get an ad hoc request?

How does DuckDB help?

Imagine you are working for a fictional coffee company. Your CEO has just called you up: she has had a call from a large multi-national chain telling her she can't use the word "star" in the name of her coffee company. "We can't be the only coffee company who has used the word star," she says, "there must be others. Can you find out for me?"

Of course the Companies House dataset is perfect for this question, so how do we solve it today with Enso?

Well actually my advice is still the same, that data needs to be in a database! But that database is going to be DuckDB and it is built directly into Enso!

We download the Companies House CSV file (2 minutes).

Connect to our built-in version of DuckDB.


And load the dataset directly into the DuckDB database. On my machine this takes a few seconds.
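For the curious, the load step is roughly equivalent to the following DuckDB SQL. This is an illustrative sketch, not the exact statement Enso generates, and the file and table names here are my own placeholders:

```sql
-- Create a DuckDB table straight from the CSV file.
-- read_csv_auto sniffs the delimiter, header row and column types.
CREATE TABLE companies AS
SELECT * FROM read_csv_auto('BasicCompanyDataAsOneFile.csv');
```

Because DuckDB's CSV reader is parallel and type-inferring, a multi-gigabyte file like this loads in seconds rather than minutes.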


Then we can use Enso's visual components to quickly query the data.


Finding there are 23,182 companies containing the letters STAR.


And only 31 that contain STAR and COFFEE. 

Which we can quickly export to Excel and send over to our CEO.
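If you are curious what gets pushed down to DuckDB here, it is roughly SQL like the following. The table and column names are my illustrative guesses, not the actual Companies House schema:

```sql
-- How many company names contain STAR?
SELECT COUNT(*) AS star_count
FROM companies
WHERE upper(CompanyName) LIKE '%STAR%';

-- And which contain both STAR and COFFEE?
SELECT CompanyName
FROM companies
WHERE upper(CompanyName) LIKE '%STAR%'
  AND upper(CompanyName) LIKE '%COFFEE%';
```

The point of Enso's visual components is that you never have to write this by hand, but it is reassuring to know a standard, inspectable query sits underneath.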



At the end of this post I will leave you with a question: how would you solve this today? What tools would you use, and how confident would you be in your answer? More on that in my next post.

I think a skilled Enso user could answer this question in 10 minutes and be confident in their answer being correct.



Friday, 30 January 2026

Spatial Capabilities in Enso Analytics - Creating Spatial Points and Spatial Joins

In our last post on spatial capabilities we looked at visualisations, inputting spatial files and spatial expressions. In this post we will look at creating points and spatial joins.

A Spatial Question?

Let's do something more interesting and answer a spatial question that needs a join. The question we are going to answer is which county in the UK has the most castles? I will start by showing you the answer and then talk through the steps.



With the dataset that I have the answer is Lincolnshire with 13 castles. 

How did we get there? Let's walk through it step by step.

Creating Spatial Objects Using st_point

As in the last blog post, we got the castle points using the Overpass API. Looking inside the get_castles component in our main workflow we see this.



This calls the Overpass API with an Overpass query to get the castles in the UK, and then extracts the castle name from the JSON that is returned from that API.

Then back in the parent workflow we trim the data down to the three columns we are interested in: Name, lat and long.


Right now this is still just a standard table of data. To do some spatial processing on it we need to load it into DuckDB and convert those lat/longs to spatial objects, which we do with the new st_point component. This component takes a longitude and a latitude column and returns a DuckDB spatial object, as we can see below:
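Under the hood this maps onto DuckDB's ST_Point function. A sketch of the equivalent SQL, with illustrative table and column names (note that ST_Point takes the x coordinate, i.e. longitude, first):

```sql
LOAD spatial;  -- the spatial extension provides ST_Point

CREATE TABLE castle_points AS
SELECT Name, ST_Point("long", lat) AS geom
FROM castles;
```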


Reading In the Boundary Data from a Shapefile

I found the UK county boundaries as a shapefile on the geoBoundaries website (https://www.geoboundaries.org/countryDownloads.html), which I downloaded to my local machine and unzipped.

We then read this in using the read_spatial_file component on our DuckDB connection.
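In raw DuckDB this corresponds to the spatial extension's ST_Read function, which reads any GDAL-supported format. A sketch, with an illustrative shapefile name (yours will be whatever you unzipped):

```sql
INSTALL spatial;  -- one-off setup; Enso handles this for you
LOAD spatial;

CREATE TABLE county_boundaries AS
SELECT * FROM ST_Read('county_boundaries.shp');
```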


I wrapped this into a User Defined Component called get_county_boundaries to keep our parent workflow looking clean and easy to read.


Spatial Join

Now that we have both our boundaries and our points as spatial objects in DuckDB, we can use the DuckDB spatial engine to do a spatial join. We do this using the same join component that we would use for a regular data join, but because these are spatial tables we get new spatial join options!


Here we choose Contains, as we want to match a county to a castle when the castle is contained by the county's boundary. We choose a Left_Outer join so we keep all of our counties even if they don't contain any castles.
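In SQL terms the join looks roughly like this sketch (column names are illustrative; ST_Contains(a, b) is true when geometry a contains geometry b):

```sql
SELECT c.county_name AS county, k.Name AS castle
FROM county_boundaries AS c
LEFT JOIN castle_points AS k
  ON ST_Contains(c.geom, k.geom);
```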


After the join we now have all of our castle data with their containing county attached.

The Answer

The last step to answer our question is to aggregate the data to county level, count how many castles each county has, and finally sort:
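The aggregate and sort correspond to a standard GROUP BY. Counting the castle column (rather than COUNT(*)) means counties with no castles, which came through the outer join as NULLs, count as zero. Illustrative names again:

```sql
SELECT county, COUNT(castle) AS castle_count
FROM castles_with_county  -- the joined table from the previous step
GROUP BY county
ORDER BY castle_count DESC;
```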


Other Questions

Of course, now we have built the workflow we can re-use it to answer other questions. Like: which county has the most pubs?

This time I found a data file at https://www.getthedata.com/open-pubs and loaded it directly into DuckDB using read_file


The rest of the workflow is the same as the castle one.

In the next post we will look at what DuckDB gives us in terms of handling larger data files, including my favourite CSV file: the UK Companies House dataset!

Friday, 23 January 2026

Spatial Capabilities in Enso Analytics - Visualisations, Spatial File Formats and Spatial Expressions

This is the third post in my series on DuckDB and Enso Analytics. In the last post we talked about how bringing spatial capabilities to Enso was a big motivation for adding DuckDB. In this post we will take a look at what spatial capabilities we now have and how they work.

Spatial Visualisation

This has actually existed in Enso for a long time, and the capabilities here haven't changed in this release. Enso can visualise points on a map when a dataset has latitude and longitude columns.

For example, if I want to use the Overpass API to visualise restaurants in Cambridge (or any other location), I can pull the data and quickly visualise it on a map:


Future releases will expand on this capability to allow more spatial types to be visualised and extra options around the visualisation itself.

Spatial Data Formats

This is some new functionality brought to us via DuckDB, so we can start by looking at their help page.

https://duckdb.org/docs/stable/core_extensions/spatial/gdal

Of interest here, following on from my last post about Enso using DuckDB's spatial capabilities, we can see that DuckDB uses the GDAL library to implement its spatial file reading.

And following through we can see that running this query

SELECT * FROM ST_Drivers();

will give us all of the supported file formats. So let's run that in Enso and see what spatial file support we gained from DuckDB:


That is a huge 54 supported spatial file formats, complete with clickable links to help pages describing each format.

To actually load a spatial file you again need to connect to DuckDB using Database.connect, but don't worry: you don't need to set up a DuckDB instance, Enso will just create one in memory for you.

And then you can use read_file to read in your spatial file.
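For a spatial format, what read_file does corresponds roughly to DuckDB's ST_Read table function. A sketch with an illustrative file name:

```sql
LOAD spatial;  -- Enso loads the extension for you; shown for completeness
SELECT * FROM ST_Read('my_points.shp');
```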


The geom column is the spatial object.

It is worth noting that today the spatial visualisation doesn't know how to render these spatial objects (that will come in a future release), so to visualise these points on a map we will have to make use of some of DuckDB's spatial capabilities!

Spatial Expressions

Now that we have some spatial data, what can we do with it? Well, we now have spatial functions in the expression language, and this is a great opportunity to show off another new feature of the latest Enso release: expression autocomplete!

If we add a set component, select expression, and start typing st_, we get a list of the spatial functions, complete with a description of what they do:


To be able to visualise the points on the map we need two columns with the latitude and longitude values. We can do that using the st_latitude and st_longitude functions in two set components, and then we are able to see the points from the shapefile we loaded in previously:
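In raw DuckDB SQL the equivalent point accessors are ST_X and ST_Y, which for lon/lat data give longitude and latitude respectively. A sketch with an illustrative table name:

```sql
SELECT ST_Y(geom) AS latitude,
       ST_X(geom) AS longitude
FROM my_points;
```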



In the next blog post we will take a look at spatial joins and writing out spatial files.



Wednesday, 14 January 2026

Standing on the Shoulders of Ducks

 "If I have seen further, it is by standing on the shoulders of ducks..." 

--Isaac Newton (probably... I may have misremembered the quote...)


Why DuckDB?

In the first article in this series about Enso Analytics and DuckDB we looked at why those two pieces of technology worked and fitted so well together. But why did we want to add DuckDB to Enso Analytics at all? Why go to the effort of adding a whole new complex piece of technology to the product?

And the answer is an easy one: New features!

DuckDB has functionality and capabilities that, prior to the last release, Enso Analytics did not. And not only do we gain all of the functionality DuckDB has built so far, we will gain all of the innovation and new features that DuckDB adds in the future. This is why this is such an accelerator for what Enso can do, both today and tomorrow.

So in this article we will look, at a high level, at what some of that functionality looks like.

Spatial

This is the big one, and in many ways what started us on the journey to DuckDB. Our users have been asking us to add spatial capabilities to Enso for a while now (after all, everything happens somewhere), and it was going to take us too long to build that functionality from scratch. Bringing in a piece of technology that has a rich set of existing functionality, and is open source so we can build on top of it, allows us to accelerate both our and our users' spatial journeys.

If you want to read more about the details, I'd suggest reading some of the DuckDB documentation here: https://duckdb.org/2023/04/28/spatial

Because Enso is a dual textual and visual language, you don't have to write raw SQL to use the spatial capabilities: you can build your workflow using Enso's easy-to-use visual programming language.

More on what this looks like in a future post in the series dedicated to spatial.

File Formats

Another feature request we have had is reading and writing Parquet files. Again, with DuckDB we get this functionality for free (and with some rather nice performance too!).

Compressed CSV files? ✅

From the DuckDB website:

"CSV files are often distributed in compressed format such as GZIP archives (.csv.gz). DuckDB can decompress these files on the fly. In fact, this is typically faster than decompressing the files first and loading them due to reduced IO."

Sounds good! And it is now available in Enso too!
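Both of these come through as ordinary reads. In DuckDB SQL they look like this (file names illustrative):

```sql
-- Read a Parquet file via the read_parquet table function
SELECT * FROM read_parquet('data.parquet');

-- Read a gzipped CSV directly; DuckDB decompresses it on the fly
SELECT * FROM read_csv_auto('data.csv.gz');
```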

Fast CSV Reader

And while we are on file formats, the DuckDB CSV reader is rather good too. It is less forgiving than the built-in Enso CSV reader, but for well-formatted CSV files it is very fast. Take, for example, my favourite CSV file: the UK Companies House file (https://download.companieshouse.gov.uk/en_output.html). This is a 2.7 GB CSV file with 5.6 million rows. Prior to this release I would have said files this size belong in a database. Today I still do, but that database is DuckDB, and it ships directly with Enso.


What Does Enso Bring to DuckDB?

Now all this is great, but you might be saying to yourself, "I can do all of this in DuckDB already. Why do I need Enso?"

Well, in the same way that DuckDB has features that Enso doesn't, Enso has features that DuckDB doesn't. Using the two pieces of technology together means you get to use *all* of that functionality, including:

- a rich visual programming environment with live updates showing the results of your changes as you make them

- easy interaction with web APIs: use Enso to query a web API and then combine the results of that with data in DuckDB

- version control and workflow sharing with your team. Easily keep track of and share your analytical pipeline with your team members.

And more! Check out what makes Enso Analytics unique at our website www.ensoanalytics.com

Friday, 9 January 2026

Enso Analytics and DuckDB - A Union That Was Meant To Be

Enso Analytics 2025.3 introduces built-in DuckDB functionality, which is a game changer for what the product can do.

In this series of blog posts I will discuss what this means for you as a user and why Enso and DuckDB are such a good union.

But before we get to what this means for Enso, first some background on DuckDB.

Back in 2022, when I was working at another data analytics company and we were in the middle of a number of company acquisitions, the CTO from one of those acquisitions came to me (as one of the principal engineers for the data engines) and said he thought we should consider replacing our internal data engine with DuckDB. Now, if you haven't heard of DuckDB (or have heard of it but aren't quite sure what it is), we should take a quick diversion into what DuckDB is.

From the duckdb.org website: DuckDB is a 

  • Fast
  • Analytical
  • In-process
  • Open-source
  • Portable
database system.

A lot of words there, but what do they all mean? Let's break it down, starting at the end.

Database System


It is a database system. Great! So like Snowflake? Or Postgres? Fundamentally yes, with some key differences that we will get to later. At its core it is a database: it stores data in tables, and you can query data from those tables using the standard language of the database world, SQL.
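If you have never touched a database before, that means you can do things like this. A minimal, purely illustrative snippet of DuckDB SQL:

```sql
-- Create a table, add some rows, and query them back
CREATE TABLE coffee_shops (name VARCHAR, city VARCHAR);
INSERT INTO coffee_shops VALUES
    ('Star Brew', 'Cambridge'),
    ('Bean There', 'London');
SELECT name FROM coffee_shops WHERE city = 'Cambridge';
```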

So nothing too interesting here: Enso already had support for running its processes inside Postgres and Snowflake, so this is just another DB that it now supports? Kind of. Let's look at some of the other features from the DuckDB website to see why DuckDB might be more than that.

Fast


Great! Everyone wants a fast database, right? But equally, fast is relative, and no database vendor is going to say their database is slow. The real question is: "fast at what?"


Analytical


Now this one is more interesting. Here at Enso we are all about analytics (it is in our name!), but why do DuckDB call this out as the second point on their home page? What DuckDB mean is that their database is designed for analytical use cases; taken together with the "fast" bullet, their engine is designed to be fast at analytical workloads.

You can read more about that over at DuckDB https://duckdb.org/why_duckdb#fast.


In-process


Now this comes back to why DuckDB is different to Snowflake or Postgres. When they say DuckDB is in-process, they mean it can run as part of the hosting application (in our case, Enso Analytics), as opposed to, say, Snowflake, which runs in the cloud, or Postgres, which is typically installed on a server somewhere.


Open-Source and Portable


Both Enso Analytics and DuckDB are open source and can run on Windows, Mac and Linux machines: another checkmark on our list of why we are a perfect match.


Back to 2022 and my meeting with the CTO from the acquired company. My first question was "why?", to which the answer was: all the great features, performance and capabilities. Which sounded great, but we had just spent the last few years building our own highly performant data engine. The second question was "how?", and this is where it got harder for that other company: their data engine didn't directly translate to SQL, and at times deliberately worked differently to other database engines out there, so compatibility was going to be a nightmare. That, coupled with the fact that we were already trying to combine a lot of different acquired technologies, meant we really didn't need to throw yet another one into the mixing pot.

So why is it a different story here at Enso Analytics?

Well, the biggest reason is that when we built the Enso product, we built it to be compatible with other database technologies, not different from them. This means we can use the same set of components to run in-memory, or in Snowflake or Postgres... and now in DuckDB. Of course, not all databases support the same functionality: the Enso application lets you do as much in-database as possible, and gracefully tells you when something is not possible in-database and it is time to move the data in-memory. This meant the framework for integrating DuckDB was there and ready to go. The how was obvious.

And then to the why. For that we go back to the list of DuckDB features and see how each matches against the Enso product.

✅ A fast analytics database - Enso is designed to do analytical work with data, so a fast analytics database is a perfect fit.

✅ In-process - This means DuckDB can be packaged as part of the Enso installation. You don't need to set up anything: install Enso and you have the power of DuckDB behind an easy-to-use visual data analytics programming language.

✅ Open-Source and Portable - DuckDB and Enso Analytics both share the same philosophy of being open source and work on any platform.

Oh, and did I mention that DuckDB has a rich set of spatial capabilities? So as of 2025.3, Enso Analytics has spatial support for the first time. But more on that in a future post. In fact, the next post looks at what new features DuckDB brings to Enso.

If you want to try it yourself download a free copy today at EnsoAnalytics.com