In this post we’ll explore the concepts of data lake, data hub and data lab. There are many opinions and interpretations of these concepts, and they are broadly comparable. In fact, many might say they’re synonymous and we’re just splitting hairs. But let’s look again carefully. We can discern some subtle trends in the way people are doing things, and find distinctions in these expressions.
Welcome to the Data Lake
Lakes are tranquil, large pools of cool water, right? Well possibly. I grew up in Scotland, where lakes are called lochs, and rumours of monsters that lurk in the depths of ancient lochs abound. Scotland also has salt water sea lochs, full of stinging jellyfish. But one thing is for sure – lakes, lochs, call them what you will – they’re popular places to go fishing.
In current technology vernacular, a data lake is essentially a very large body of cool data, typically in the 100s of terabytes to petabytes in size. The data lake differentiates from other cool storage systems such as MAIDs (Massive Array of Idle Disks), storage vaults and tape archives, because the data remains online and fully accessible on a low-cost storage media like Apache HDFS, Ceph, or AWS Simple Storage Service (s3). This makes it an interesting and cost-effective solution for performing ad-hoc research, analysis and reporting on the aggregated data – essentially enabling data “fishing expeditions”, as well as being the feedstock for applications using deep learning or other data-intensive artificial intelligence approaches. The “big data” need not be restored from a tape or extracted from a vault or deep storage solution in order to be queried, which are tasks that usually come with a significant cost.
Data in the lake can take many forms, the most popular format is semi-structured machine data – for example telemetry data (system, application usage and activity logs, user tracking, things like that), log data (weblogs, crash logs, network element logs, application logs, firewall logs, industrial machine data and so on) and data feeds (like stock ticker data, weather data, etc.). Another popular format is system of record (SoR) data – operational database extracts, data warehouse change data capture, and so on. And many data lakes capture vast amounts of unstructured data (free-text – like chat or audio transcriptions, document scans, binary photographs and images like x-rays, binary audio – like call centre recordings, and binary video – like security camera recordings).
It’s also important to know that data lake managers often like to adopt the so-called “schema-on-read” strategy for the datasets forming the lake. Basically, this means the data is stored in the lake untreated, in full fidelity. This might seem to go against all data storage best practices, where data normalization for efficiency and integrity is one of the major tenets. However the reasoning is sound – the volumes of data involved make guaranteeing integrity via relational modelling hard to achieve whilst still assuring timely access to the data. And any storage efficiency induced savings are massively offset by the upfront labour cost of engineering the data. Finally, treating the data often implies discarding or summarizing data, which may be undesirable since it might preclude future applications and use cases (for example some data mining or AI use cases), so the value of the upfront data modelling and engineering exercise is uncertain.
Whilst residual processing data like weblogs and crashlogs might be considered low value at small scale, in aggregate and over long time spans this kind of data can be extremely valuable input. For example the data can be used to drive research, business excellence, as feedstock for new and innovative (eg. AI) products, and to guide informed business decisions.
It should be noted that data lakes are typically used to store so-called “cool” data – by which we mean data that is infrequently accessed and rarely modified; whilst “hot” data – that is, data that is frequently accessed and updated – is usually stored elsewhere (for example in an OLTP database).
I am fearless, and therefore powerful: the Data Lab
Since the storage cost per GB is pretty low, storage efficiency is less of a concern versus accessibility. Exposing data verbatim in a data lake for data scientists and analysts to do the feature engineering or modelling that they want to get the data in the shape they need for the given project or product drives agility at the cost of dataset duplication.
All of this lowers the upfront costs associated with advanced data experimentation and research; and thus places agility, innovation and the rigour of data-driven or empirical business practices within the reach of any organisation – large or small – that has the appetite to build up a data lake.
Cue the data science lab. Data labs are an emerging shared service paradigm – a kind of “knowledge services” team or division, focused on delivering advanced analysis, forecasting, war-gaming, digital twins, machine learning (ML) applications and artificial intelligence (AI) tools. These services are typically delivered as short projects, assisting all parts of the business that may need their services – from marketing to manufacturing, the executive team to the people team.
The data lab might therefore make use of a data lake, but it is, according to our definition, a different paradigm.
|Learn more about data labs – read the white paper:|
Data Lab Architectures – a field practitioner’s guide
The nerve center: the Data Hub
Data aggregated in large pools can be quite useful, as we have learned. Not without its share of asset management cost of course, but undoubtedly a useful resource with many opportunities. And making good use of these data lakes can be accelerated by assembling a skilled team in a data lab.
We’ve written about cool data. But what, you’re no doubt wondering, about hot data? What if we want to tap into the data feeds that we have, and use them to make predictions or take informed business decisions based on how we’re doing right now? In our vocabulary, this is the realm of the data hub.
A data hub is a high capacity, high-throughput integration point – such as an Apache Kafka messaging system, that can be used for monitoring, inspecting, routing, and acting upon data in motion. The idea is that all the evented data feeds that the organisation has are hooked up to the data hub, where data analytics or predictive models execute online on the data.
As the data hub is an online solution acting on data feeds, care should be taken to distinguish between batch data and data feeds. Data hubs are not well suited to processing batch data, and whilst it is possible to use change-data-capture techniques to turn system of record style batch processing-oriented data into a data feed – unless context is provided, this kind of data may deliver minimal business value in the data hub in exchange for a lot of hard work.
Data is or data are?
So there we have it. Subtle, nuanced and perhaps mildly contentious; our definitions of what a data lake, a data lab or a data hub is are noticeably different.
To sum it up:
- Use a data lake when you want to store big data long term but still want to be able to process it for analysis, reporting, research and ML/AI model training
- Use a data lab when you want an expert team of data scientists, engineers and analysts to help you quickly get value from your data
- Use a data hub when you want to have a more real-time operational view on your business and use it to drive automated analysis, predictions, reporting and decisions online using hot-path data