top | item 24718301

Machine Learning Engineer Guide: Feature Store vs. Data Warehouse

166 points | nathaliaariza | 5 years ago | logicalclocks.com

56 comments

[+] blauditore|5 years ago|reply
I get a strong buzzword bingo vibe from this post. On a related note, is there a good reason to ever have something like a "data lake" (and call it like that)? Whenever I've encountered someone bringing up the idea to "build a data lake", a few questions later it became clear they just had a messy pile of incoherent, poorly-understood data and wanted to twist it into something positive by giving that pile a fancy name.
[+] PeterisP|5 years ago|reply
The reason for data lakes appears in large enough organizations where it becomes exceedingly likely that there is some data that may be useful to you that's maintained by people you'll never meet in a department you don't know about, where it's impractical or even impossible to get them involved in your project that would consume this data.

It's not so much about data itself as an attempt to solve a communications and coordination organizational problem; you decouple sources of data and consumers of data (not the technical systems/databases, but the people and organizational units) to a 'hub-and-spoke' model where the providers of data just supply raw data without getting into a multinational project that takes a year just to identify the potential stakeholders for that data throughout a distributed organization with tens of thousands of employees.

[+] slotrans|5 years ago|reply
The simple way to understand the utility of a data lake is "S3 is phenomenally cheap".

Some people treat this as an excuse to throw whatever they want into it, without any organization or standardization... and the consequences of that are quite predictable.

But it doesn't have to be that way. You can accumulate diverse and large data sets, in cheap cloud storage, while knowing what everything is and where you can find it.

As a trivial example, let's say you have a typical OLTP database (or perhaps many), with useful data that is, unfortunately, mutable. You can store entire copies of those tables in your data lake for pennies a day, giving you the ability to recall a transactionally-consistent view of that data from various points in the past. This is something we've always been able to do using traditional tools; the difference is that storing the data in a "data lake" (i.e. S3) is orders of magnitude cheaper.
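That snapshot idea fits in a few lines. A sketch, with a local directory standing in for an S3 prefix (in production you'd write Parquet via boto3/s3fs; table and partition names here are made up for illustration):

```python
import csv
import datetime
import pathlib

def snapshot_table(rows, table_name, base_dir, as_of=None):
    """Write a full copy of a mutable table under a dated prefix.

    Each run lands in table_name/snapshot_date=YYYY-MM-DD/, so a
    transactionally-consistent view from any past date can be read back.
    """
    as_of = as_of or datetime.date.today().isoformat()
    out_dir = pathlib.Path(base_dir) / table_name / f"snapshot_date={as_of}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-0000.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_file
```

Partitioning by snapshot date is what lets query engines prune to a single day when you ask "what did this table look like last Tuesday".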

Another major use case, perhaps the most significant one, is storing the raw ingested data -- e.g. from telemetry collection, 3rd party exports, etc -- along with each stage of its transformation. By preserving the original input, along with all intermediate outputs, no information is ever lost. If a buggy transformation is discovered, it no longer means your output is irrevocably corrupted; fixed results can be re-computed from wherever in the transform pipeline the bug manifested. And again, this was always possible; a data lake just makes it cheap enough to actually do at large scale.
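A minimal sketch of that keep-every-stage pipeline; the stage functions and file layout are illustrative, with local JSON files standing in for lake storage:

```python
import json
import pathlib

def run_pipeline(raw_records, stages, out_dir, start_stage=0):
    """Apply each transform stage, persisting the raw input and every
    intermediate output. After fixing a buggy stage, re-run with
    start_stage=k to resume from the last good intermediate instead of
    reprocessing from the raw feed."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    if start_stage == 0:
        data = raw_records
        (out / "stage_0.json").write_text(json.dumps(data))  # preserve raw input
    else:
        # resume: load the intermediate produced just before the buggy stage
        data = json.loads((out / f"stage_{start_stage}.json").read_text())
    for i, stage in enumerate(stages[start_stage:], start=start_stage + 1):
        data = [stage(record) for record in data]
        (out / f"stage_{i}.json").write_text(json.dumps(data))
    return data
```

Because stage_0 (the raw data) is never overwritten, even a bug in the first transform is recoverable.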

[+] dcolkitt|5 years ago|reply
With the caveat that "data lake" means different things to different people, I think it's important to have a repository where the raw data exists un-manipulated in the form that it was ingested.

From the end-user standpoint that's not very useful. But that's why you have data marts that normalize the raw data into a standardized format. Ultimately though the raw data needs to remain the single source of truth. If you skip that step, and only store the post-normalized format, you're likely to run into problems down the road. This could either be because you want to change the normalization format. Or you want to use some aspect of the data that isn't captured in the normalized form. Or even you discover a pre-existing bug that affected all the previous post-ETL data.

[+] dkarl|5 years ago|reply
> Whenever I've encountered someone bringing up the idea to "build a data lake", a few questions later it became clear they just had a messy pile of incoherent, poorly-understood data and wanted to twist it into something positive by giving that pile a fancy name.

That's kind of what I understand as well, but the data science folks pitched it in a slightly more positive way, like, "Please don't limit us just to the data you have time to nicely structure and validate. We want all of it. It doesn't matter if a column is getting truncated to three characters or columns are mislabeled or there are amounts in dollars and euros mixed together; we can still get signal out of it."

[+] detaro|5 years ago|reply
Having one central place to find and access data, even if its messy, can be better for understanding and using the data than each business unit/team/... having their own different place for (still messy) data. It can of course also be utterly pointless if done badly, not maintained, not used, ...
[+] cpard|5 years ago|reply
I think one reason that leads to the need for something like a 'data lake', or anything that looks like a messy pile of incoherent data, is the difference between how data scientists and traditional data analysts deal with noise in the data. Most BI tasks require the data to be as clean as possible; it's important to be aware of the quality of the data before you calculate your MRR, for example. Data scientists, on the other hand, treat noise more as a parameter of the models they are building. This difference leads to different requirements on how the pipelines should operate. In such a context, having a messy pile of data might help increase the velocity and independence of some teams inside the company.
[+] sjg007|5 years ago|reply
Data is the new oil right so you want to decouple your data consumers from the data generators. This means you dump all the data into a lake. That way a reporting team or analytics team doesn’t need to connect to your production database.
[+] xapata|5 years ago|reply
Nah, it's a messy pile of data that you want to understand rather than throw away.
[+] ipiz0618|5 years ago|reply
In large organizations it is common to have multiple databases maintained by many different people. When a data scientist needs some data, he/she needs to talk to all these different people, wait for them or their managers to grant access, and join the inconsistently labelled tables. This takes a long time, and some people are extremely protective of their (messy anyway) tables, requiring higher-ups to straight up demand that they grant access.

Having nicely organized data is perfect, but I'd rather fetch the data myself from a pile of messy data than deal with all of these organizational nightmares.

[+] laichzeit0|5 years ago|reply
What you’re referring to is sometimes called a “data swamp”.
[+] LexSiga|5 years ago|reply
You are not wrong about the buzzword, but you are not entirely right if you suggest that it is merely that; this is a recurring question in that specific area.

The fact that it happens to be buzzworthy is a side effect: anything that plainly answers something many people are wondering about, and that has not yet been answered plainly, becomes a buzzword.

(And I mean, the first item on the front page at this very second has "We hacked Apple" in the title.)

[+] iammru|5 years ago|reply
Spot on. The data lake approach is a lazy approach: throw a pile of garbage into the storage layer and figure out how to use it later. People are just kicking the can down the road.
[+] sixdimensional|5 years ago|reply
I say this as someone who built data infrastructure before and after the invention of data lakes, and I have done it every way - the old and the new. For the record, for a lot of scenarios, the "old ways" actually do still work fine. But there are new opportunities/possibilities too.

I really understand what you are saying... I know the hype problem, I lived it. It makes me both frustrated and sad - because the hype is annoying, there is a lot of vaporware - but there is also something real that is happening too which is part of the story of the evolution of data architecture/infrastructure. My strong advice is, being open-minded is helpful - learn and take what was good/real and leave behind the stigma/hype. Something real and useful happened in terms of architecture, so take the benefits - but of course, don't compromise on delivering real, working solutions.

Regarding "data lakehouse", I struggle with the buzzword term also, but once again, I recommend looking at what is good/real and ignoring the stigma of the buzzwords. One way of looking at it is the literal translation - a data warehouse made from the components used to make data lakes. To be honest, it is a marketing term, but it is also an architectural pattern we had even before the term existed - for example, you could use a data warehouse product such as Vertica and back it with HDFS, and guess what, there's a data "lakehouse". Most of the big database vendors can do this trick now: a full traditional data warehouse engine sitting on top of lake storage infrastructure.

There are several "real" use cases for data lakes. Precursor architectures could be seen to be "operational data stores" [1]. Data lakes are real, they are one approach to solving some problems.

These use cases could include: 1) raw, long-term storage of large volumes of diversely structured data for staging/historical purposes; 2) data discovery/exploration of this data to identify patterns/models and relationships (this is both an AI and an analytics use case for power users, BI/analytics people, data scientists, etc.); and 3) an opportunity to change the paradigm of traditional ETL - instead of pulling from sources, one way to look at it is allowing many diverse/distributed sources to push their data into the lake for powering analytics, exploration, AI model building, etc. It makes sense as part of a lambda/kappa architecture as well - some of the "push" in can come from streaming sources.

Use case #1 is very much a "data infrastructure" kind of use case that we do anyway in data warehouses - especially those that do ELT (vs. ETL) - staging databases. If you want an architecture that actually helps make some sense of such a use case of data lakes more formally, one could look into Dan Linstedt's data vault architecture [2]. While data vault modelling doesn't necessarily require "data lakes", the "raw" part of the data vault architecture use case overlaps nicely with data lakes.

[1] https://en.wikipedia.org/wiki/Operational_data_store

[2] https://en.wikipedia.org/wiki/Data_vault_modeling

[+] maycotte|5 years ago|reply
We have been building a feature-first data store for seven years, and it feels like the feature store is about to become one of the more exciting ways to extract value from data. We see feature stores doing much more than becoming just another silo for ML; instead, they are a way to get a real-time, centralized view of fragmented data that would otherwise have to be federated or put in a data lake to be queried together. I share more of my thoughts in this blog post: https://www.molecula.com/why-moleculas-feature-based-approac... Molecula is based on the OSS platform Pilosa (https://www.pilosa.com/) and both Pilosa and Molecula are transitioning to reposition as feature stores over the coming months. Doing machine-scale analytics and ML on the data itself will be a thing of the past.
[+] moritzmeister|5 years ago|reply
This looks like a highly specialised tool. How is it going to integrate with a data scientist's favourite tools, such as Jupyter notebooks, Pandas, or Spark, and especially ML frameworks like TensorFlow, scikit-learn, etc.?
[+] jamesblonde|5 years ago|reply
Author here. I wrote this article because I keep getting the question from prospects - isn't this just a data warehouse? If I missed out on anything or got anything wrong, please let us know here.
[+] notsuoh|5 years ago|reply
I mean, maybe you're leaving this intentionally open ended to garner comments to get your post higher on the HN page, but perhaps you could answer the question you posted: Isn't this just a data warehouse?
[+] sradman|5 years ago|reply
jamesblonde's article tried to answer the question:

> isn't this just a data warehouse [DW]?

My understanding is that the addition of a RowStore to the DW/ColumnStore addresses the training phase of Machine Learning.

I come from a Data Engineering background. I struggled with the Data Science centric terminology. Uber's 2017 post [1] was helpful in establishing the motivations and terminology of their Michelangelo machine learning (ML) platform. The main distinguishing feature of a Feature Store seems to be that it supports efficient batch downloading of row data that is used as the training dataset. The discussion made more sense once I figured out that feature refers to a column or data field.

Figure 4 in the Logical Clocks whitepaper Hopsworks Feature Store [2] helped me understand the architecture better. The architecture appears to be what I would call a Hybrid Transactional/Analytical Processing (HTAP) [3] engine with MySQL cluster acting as the RowStore and Apache Hive acting as the ColumnStore. I'm assuming that the Hopsworks Feature Store periodically merges the MySQL updates into Hive and also provides a mechanism to perform federated queries.

The use of Hive seems outdated (vs. say Presto) and I wonder if the use of MySQL is required compared to directly accessing column oriented files like ORC/Parquet/Kudu.

[1] https://eng.uber.com/michelangelo-machine-learning-platform/

[2] (PDF) https://uploads-ssl.webflow.com/5e6f7cd3ee7f51d539a4da0b/5ef...

[3] https://en.wikipedia.org/wiki/Hybrid_transactional/analytica...
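A toy sketch of the RowStore/ColumnStore split described above. This is purely illustrative and not the Hopsworks API; plain dicts stand in for MySQL Cluster (online, row-oriented) and Hive (offline, column-oriented):

```python
class ToyFeatureStore:
    """Illustrative dual-store feature store: a row store for low-latency
    single-key lookups at serving time, and a columnar store that batch
    training jobs read whole features (columns) from."""

    def __init__(self):
        self.online = {}   # entity key -> {feature: value}, latest values only
        self.offline = {}  # feature name -> list of values (columnar history)

    def write(self, key, features):
        self.online[key] = dict(features)

    def merge_to_offline(self):
        # stand-in for the periodic job that merges row-store updates
        # into the columnar store
        for key, row in self.online.items():
            self.offline.setdefault("key", []).append(key)
            for name, value in row.items():
                self.offline.setdefault(name, []).append(value)

    def serve(self, key):
        return self.online.get(key)        # online path: one row, fast

    def training_column(self, name):
        return self.offline.get(name, [])  # offline path: a whole column
```

The point of the split is that serving reads one row by key while training reads entire columns, and no single storage layout is good at both.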

[+] arethuza|5 years ago|reply
I might quibble a bit about:

"They could derive historical insights into the business using BI tools."

A lot of work in some traditional BI tools (e.g. Oracle Hyperion) is actually about collecting and aggregating forward looking data (within forecasts, budgets, scenarios) that are then compared with actuals.

[+] streetcat1|5 years ago|reply
Can you please explain how the ONLINE feature store works? I.e. if the prediction request contains, let's say, a user id, and the user record is not in the ONLINE feature store (e.g. Redis), then you would need to go to the OFFLINE store and do a join, or a select?

To sum up, assuming that the online feature store is some sort of a cache, how do you know which objects to place in the cache?
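I.e. is the miss path something like this read-through sketch? (A dict stands in for Redis and a callable for the offline query; all names are illustrative, not any real feature store's API.)

```python
def get_features(user_id, online_store, offline_lookup):
    """Read-through pattern: serve from the online store; on a miss,
    fall back to the (slower) offline store and backfill the cache so
    the next request for the same key is fast."""
    row = online_store.get(user_id)
    if row is None:
        row = offline_lookup(user_id)    # slower query against the offline store
        if row is not None:
            online_store[user_id] = row  # backfill the cache
    return row
```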

[+] overfitted|5 years ago|reply
Great read! I can imagine that a Feature Store could have an impact regarding the challenge of provenance in ML as well. I couldn't find anything on that.

Appreciate the comparison table: Data Warehouse vs Feature Store. Might stea.. re-use that.

[+] dsaiztc|5 years ago|reply
Nice write-up!

I would challenge the assumption that "the Data Warehouse is an input to the Feature Store" though. I'm more inclined towards having a first stage of data cleanup/modeling that could be reused (as input) for both DWH and FS instead.

[+] jascii|5 years ago|reply
Oh, this article is not about ML (the language). It is about Machine Learning... Can we just call Machine Learning "Machine Learning", to avoid confusion?
[+] notsuoh|5 years ago|reply
Maybe you're being facetious. I had to look up what the ML language is; apparently it's a fifty-year-old programming language that hasn't had a stable release in 23 years, in case anyone else was wondering.

Respectfully, let's keep ML to be Machine Learning. ;)