We outline the “trillion dollar” problem of materialised views, and the solution of differential dataflow. We then describe our novel approach to bi-directional materialised views, and how it enables a seamless and collaborative product.
Materialised views
One of the biggest open problems in databases is materialised views.
Imagine you are repeatedly querying a database, re-running the same queries after every change.
The result of your queries is the materialised view. This view must update as new data streams in. You shouldn’t have to pay the cost of a full query each time. The database should be clever enough to compute a minimal incremental change from the current state of your view.
It seems reasonable, yet it’s too much to ask of current databases.
None of the major databases used in production today support this for arbitrary queries.
Here’s an illustrative quote on the extent of this issue:
Software engineers worldwide continue to waste $trillions of person-hours on incidental complexity that can be fundamentally attributed to the lack of [materialised views]
— Liron Shapira, Data denormalization is broken
Right now, a data engineering team must set up pipelines and caches. Decoupling this problem from business logic would enable developers to write more robust application-layer code, as they’d no longer need to worry about manually tracking down all the potential updates that an arbitrary query depends on – which are required to refresh the materialised view.
In general, this isn’t a solved engineering problem.
The solution would be a mechanism to efficiently keep state up-to-date by performing minimal incremental changes – this is also known as differential dataflow.
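To make "minimal incremental change" concrete, here is a small Python sketch. It is ordinary incremental view maintenance for a single grouped sum, not the full differential dataflow model, and the names (`SumByKeyView`, `apply`, `full_query`) are made up for illustration. The view only ever touches the groups mentioned in the incoming changes, yet it ends up identical to a full recomputation.

```python
class SumByKeyView:
    """Keeps the result of SELECT key, SUM(amount) ... GROUP BY key up to date."""

    def __init__(self):
        self.sums = {}    # key -> running sum
        self.counts = {}  # key -> number of live rows, so empty groups can be dropped

    def apply(self, delta):
        """Apply (key, amount, diff) changes, where diff is +1 for an insert
        and -1 for a delete. Returns only the groups that changed."""
        changed = {}
        for key, amount, diff in delta:
            self.counts[key] = self.counts.get(key, 0) + diff
            self.sums[key] = self.sums.get(key, 0) + amount * diff
            if self.counts[key] == 0:
                del self.counts[key], self.sums[key]
                changed[key] = None  # the group disappeared from the view
            else:
                changed[key] = self.sums[key]
        return changed


def full_query(rows):
    """Recompute the same view from scratch, for comparison."""
    sums = {}
    for key, amount in rows:
        sums[key] = sums.get(key, 0) + amount
    return sums


rows = [("eu", 10), ("us", 5), ("eu", 7)]
view = SumByKeyView()
view.apply([(key, amount, +1) for key, amount in rows])

# New data streams in: one row is inserted and one is deleted.
changed = view.apply([("us", 3, +1), ("eu", 10, -1)])
rows_after = [("us", 5), ("eu", 7), ("us", 3)]
assert view.sums == full_query(rows_after)
```

The cost of `apply` depends on the size of the change, not on the size of the table. Doing this by hand for one query is easy; doing it automatically for arbitrary queries is the hard part.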
Materialize
Materialised views are a hugely important and, as is often the case, equally hard problem.
Thankfully, smart people are actively working on it.
Frank McSherry founded the company Materialize to solve this problem and build the first database capable of maintaining materialised views – their solution allows you to express materialised views in SQL, so you don’t have to worry about dependencies, pipelines, or caches.
He won the Gödel Prize for academic contributions to theoretical computer science. Truth be told, I haven’t even finished wrapping my head around the McSherry et al. paper that introduced the differential dataflow model behind Materialize. (See here for a great introduction by Adrian Colyer.)
At Tably, we’re a small early-stage startup of 6 engineers. Materialize has raised over $100M and is building some amazing tech, guided by Frank McSherry’s formidable intellect. So what contributions could we possibly make in this space?
If you are looking to build your own data streaming application with differential dataflow today, you can stop reading now. Look no further than Materialize. They’re doing great work and we are huge fans.
However, if like us you’re curious and suspect that’s just the tip of the iceberg, then please read on. This is a huge space, differential dataflow technology has far-reaching implications, and so much of it still needs to be built.
Bi-directional materialised views
Although we are working to solve the same problem of efficiently maintaining materialised views, our approach to differential dataflow is quite unique – we are taking techniques from real-time collaborative editing and applying them to data analytics.
In a nutshell, we are using a technique called Operational Transformation (OT).
OT is closely related to Conflict-Free Replicated Data Types (CRDTs), which are becoming popular nowadays for collaborative applications. Both can be seen as a more powerful version of Git where conflicts between branches are resolved automatically.
Assume we have a main pipeline of data operations. The final state is our materialised view. Other branches represent updates to our view, or as a special case, new data streaming in when branching occurs at the beginning of history.
OT allows us to rebase an operation to the tip of our main pipeline by transforming it. For an appropriately designed OT algorithm, the rebased operation is a minimal incremental change to our materialised view – therefore this rebasing gives us differential dataflow.
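As a rough illustration of rebasing by transformation, here is a toy Python sketch in which the "view" is just a list of rows. The operation types (`Insert`, `Delete`) and the rules in `transform` are hypothetical and far simpler than anything a real analytics pipeline needs. They are not our actual algorithm, only the textbook index-shifting idea behind OT.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    index: int
    row: str


@dataclass(frozen=True)
class Delete:
    index: int


def apply(rows, op):
    """Return a new list with `op` applied; `None` means nothing to do."""
    rows = list(rows)
    if isinstance(op, Insert):
        rows.insert(op.index, op.row)
    elif isinstance(op, Delete):
        del rows[op.index]
    return rows


def transform(op, against):
    """Rewrite `op` so that it can be applied after `against`.
    Both operations were originally made against the same state."""
    if isinstance(against, Insert):
        # Rows at or after the inserted position moved one step down.
        # On a tie, the operation already on main goes first.
        if against.index <= op.index:
            return replace(op, index=op.index + 1)
        return op
    # `against` is a Delete.
    if isinstance(op, Delete) and op.index == against.index:
        return None  # both branches deleted the same row
    if against.index < op.index:
        return replace(op, index=op.index - 1)
    return op


def rebase(op, main_ops):
    """Carry a branch operation to the tip of main, one main operation at a time."""
    for main_op in main_ops:
        if op is None:
            break
        op = transform(op, main_op)
    return op


base = ["a", "b", "c"]
main_ops = [Insert(0, "x"), Delete(2)]  # main becomes ["x", "a", "c"]
tip = base
for main_op in main_ops:
    tip = apply(tip, main_op)

branch_op = Insert(1, "y")              # made against `base`: put "y" after "a"
rebased = rebase(branch_op, main_ops)
new_tip = apply(tip, rebased)
```

Here `rebased` is a single operation applied directly to the tip, so the view is updated without replaying history, and "y" still lands right after "a" as the branch intended.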
The flexibility of our OT approach to differential dataflow means we can do bi-directional materialised views too. That is, where this makes sense, we can stream operations backwards from a user’s view and have them correctly reflected on the original data source.
Time-travelling operations back and forth is very counterintuitive. You end up in situations where divergent branches of history are happening at the same time. Ultimately it makes sense, though, because this is what automatic conflict resolution is about: conflicting operations, which shouldn’t happen at the same time, happening simultaneously – and that’s okay.
Spreadsheets from the future
Why should anyone care about ideas such as conflict-free rebasing and bi-directional materialised views? Because they unlock incredible product features.
We are building a data table on top of this time-travelling data technology: a spreadsheet from the future if you will.
We know how inaccessible these abstract concepts may feel. Not everyone cares about internals, and nobody should have to worry about which database powers what they do. Using our product should be an experience that’s as seamless and collaborative as possible.
If Materialize is the “Databricks” of differential dataflow, we aim to be the “Notion” a.k.a. the no-code of differential dataflow. Tight integration by co-designing a product alongside our technology stack means everything works together to empower the user.
Let’s take a sneak peek at a couple of features we are building.
Fork and auto-rebase
Conflict-free rebasing brings version control – one of the greatest productivity practices available to software engineers – to the spreadsheet everybody uses and loves.
Paraphrasing our friends at GitLab:
Version control facilitates coordination, sharing, and collaboration across the entire software development team.
Right now, people’s data is scattered across messy spreadsheet copies here and there. You can build versioning into existing solutions, but it’s not the same thing as supporting it natively as part of a Git-like workflow.
Imagine simply being able to fork off someone else’s dataset and do your own analysis. All while data flows live into both your tables, courtesy of differential dataflow, so everybody can see changes as they are happening.
And you can take a look at where you forked off, skip forwards to a more recent version of that branch, and rebase your analysis on top of it. All in a way that’s interactive and playful, with no need to resolve conflicts, giving you immediate feedback at the click of a button.
Imagine this happening at the scale of a company.
It’s as if all the knowledge of your organisation were tied together. Every spreadsheet linked to every other through a history of changes. These individual pieces of knowledge become much easier to discover, they can be seamlessly combined at any moment, and they will never be stale – as live data continues to stream in and flow throughout the organisation.
Nothing like this currently exists.
These workflows can be asynchronous like Git, or even real-time like Google Docs because our rebasing is conflict-free. They are something everyone can benefit from.
Bi-directional views
Bi-directional views give you a more natural way to interact with your data.
I’m currently writing the draft of this blog post on Notion because its polished interface makes the task easier for me, even though my writing would eventually have made it to the web anyway had I chosen a clunkier editor that’s less fun to use.
If views are a window over your data, bi-directional views are the interface through which you modify your data.
We allow you to view your data in the shape and form that most makes sense to you, perhaps a dashboard customised to your particular needs. The magic is that you can change your data from here too, directly from the interface you prefer.
Most office workers need to interact with data, but they don’t need to know about transactions, say, or even that they’re interacting with a database at all. It would be unthinkable for them to manually insert rows into a database, for example. They should, instead, be able to just change some cells in a table.
We want this same convenience, backed by our streaming database rather than an Excel file.
Bi-directional views allow you to modify your data in the form you see it. Changes are then propagated backwards and reflected on the original source, sparing you from having to deal with the raw data.
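A minimal Python sketch of the idea, under heavy assumptions: the view is a plain filter plus column projection, so every view row corresponds to exactly one source row, and an edit can be routed back by remembering which row that was. The function names (`build_view`, `push_back`) and the sample data are hypothetical, and views such as aggregates do not invert this simply.

```python
def build_view(source, predicate, columns):
    """Forward direction: filter the rows and keep only some columns.
    Also returns, for each view row, the index of its source row."""
    view, row_ids = [], []
    for row_id, row in enumerate(source):
        if predicate(row):
            view.append({column: row[column] for column in columns})
            row_ids.append(row_id)
    return view, row_ids


def push_back(source, row_ids, view_row, column, value):
    """Backward direction: turn a cell edit on the view into an edit on the source.
    Returns a new source table and leaves the old one untouched."""
    updated = [dict(row) for row in source]
    updated[row_ids[view_row]][column] = value
    return updated


def is_eng(row):
    return row["team"] == "eng"


source = [
    {"name": "Ada", "team": "eng", "salary": 120},
    {"name": "Bob", "team": "ops", "salary": 90},
    {"name": "Cy", "team": "eng", "salary": 100},
]

view, row_ids = build_view(source, is_eng, ["name", "salary"])

# The user edits the second row of *their* view, not the raw table.
source_after = push_back(source, row_ids, 1, "salary", 110)
view_after, _ = build_view(source_after, is_eng, ["name", "salary"])
```

The user never sees Bob's row or the `team` column, yet their change lands on the right record in the source, and rebuilding the view from the updated source shows the edit they made.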
The future is seamless and collaborative
Differential dataflow is one of the most powerful technologies emerging out there. It solves the problem of materialised views, which will empower developers to build live-updating applications of a kind we’ve never seen before – with all that follows: seamless, interactive user experiences.
Real-time collaborative editing techniques are also reaching maturity, in the form of OT/CRDTs. They are the building block for applications enabling people to collaborate in pleasant ways that just work.
This is an exciting historical moment. We are at an inflection point: these technologies will finally come to fruition over the next few years.
Nobody else is working on both.
And notice how we aren’t combining two technologies that just happen to work together. There’s something much deeper going on: the flexibility of these collaborative techniques (OT/CRDTs), applied to a new context where nobody has tried them before (data analytics), is precisely what enables us to build powerful features (fork and auto-rebase on tables, bi-directional materialised views).
But why should these technologies be available only to developers, when they could be in everyone’s hands?
Combining the two and building a tightly-integrated product on top is the killer combo. This intersection of seamless and collaborative software, put into the hands of everyday users, is what excites us!
No-code is arguably the biggest opportunity of all, with far-reaching potential to impact every worker in a modern knowledge economy. Yet it’s also one of the most crowded spaces, with the least technical differentiation, where everybody recycles the same old ideas.
Our technology will be a breath of fresh air.
This post was originally published on tably.com.