We're currently looking into data lake implementations. Right now we only have one or two data sources. The current thinking is to read them on the fly, combine them into a pandas DataFrame, and query that. Has anyone done something similar?
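For concreteness, a minimal sketch of that on-the-fly approach: read each source, merge in memory, query the joined frame. The sources and column names here are hypothetical stand-ins.

```python
import io
import pandas as pd

# Stand-ins for two small data sources (e.g. CSV exports); hypothetical schema.
orders_csv = io.StringIO("order_id,customer_id,amount\n1,10,99.5\n2,11,15.0\n")
customers_csv = io.StringIO("customer_id,region\n10,EU\n11,US\n")

orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Combine the sources and query the joined frame in memory.
combined = orders.merge(customers, on="customer_id", how="left")
eu_total = combined.loc[combined["region"] == "EU", "amount"].sum()
```

This works fine while both sources fit comfortably in memory; the trade-offs the replies below discuss kick in as data and consumers grow.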
It depends on the type of data you're processing, your business goals, and the existing consumers you need to support.
At a minimum I'd suggest planning to load the data from the data lake into an RDBMS (OLAP/columnar preferably). Then it's accessible to more than just Python scripts (BI tools, users of other languages, etc).
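As a toy illustration of that load step, here's a sketch using SQLite purely as a stand-in for the warehouse (in practice you'd pick a columnar/OLAP store so BI tools and other languages can connect):

```python
import sqlite3
import pandas as pd

# Hypothetical data; SQLite stands in for an OLAP/columnar RDBMS here.
customers = pd.DataFrame({"id": [1, 2], "region": ["EU", "US"]})
con = sqlite3.connect(":memory:")
customers.to_sql("customers", con, index=False)

# Once loaded, any SQL-speaking consumer can query it, not just Python scripts.
rows = con.execute("SELECT region FROM customers WHERE id = 1").fetchall()
```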
Depending on how much data there is, you should also plan data summarization strategies. You can build some common rollups so that all consumers look at the same summaries, let consumers build their own transform/load pipelines from the raw data lake, or let them build their own transform pipelines on top of the data warehouse (using something like dbt).
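A common rollup can be as simple as a shared daily aggregate. A sketch with made-up event data:

```python
import pandas as pd

# Hypothetical raw events; in practice these would come from the lake.
events = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user": ["a", "b", "a"],
    "amount": [10.0, 5.0, 7.5],
})

# A shared rollup: one row per day, so every consumer reads the same summary.
daily = (
    events.groupby("day", as_index=False)
          .agg(total_amount=("amount", "sum"), users=("user", "nunique"))
)
```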
The benefits of a data lake architecture really appear when you have lots of sources, lots of disparate consumers, and lots of data, with some schema evolution and unstructured parts thrown in. If you only have one or two sources, data small enough to query raw in pandas, and consumers restricted to Python scripts, then you can skip a lot of the architectural headache of building a data lake for now (just make sure to archive your raw data somewhere if you want to be able to pull it into a data lake later).
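The "archive your raw data" advice can be this small to start. A sketch, assuming a made-up source name and a simple source/load-date directory layout so the files can be replayed into a lake later:

```python
import tempfile
from datetime import date
from pathlib import Path

def archive_raw(data: bytes, source: str, load_date: date, root: Path) -> Path:
    # Keep each raw extract under source/load_date so later backfills are easy.
    dest = root / source / f"load_date={load_date.isoformat()}" / "part-000.jsonl"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return dest

# Demo with a temp directory standing in for object storage.
root = Path(tempfile.mkdtemp())
path = archive_raw(b'{"id": 1}\n', "crm", date(2024, 1, 1), root)
```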
You can do this for small datasets and I built an app around exactly this use case [0].
As your data gets bigger (even just over time, if individual days don't emit more data) you'll end up building partitioning schemes yourself, or you'll move to a system that does it for you, like BigQuery or Snowflake.
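"Building partitioning schemes yourself" often starts as hand-rolled Hive-style directory layouts like this sketch (column names made up; warehouses manage this layout for you):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical events, partitioned by day into day=YYYY-MM-DD directories.
events = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [10.0, 5.0, 7.5],
})

root = Path(tempfile.mkdtemp())
for day, chunk in events.groupby("day"):
    part_dir = root / f"day={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    chunk.drop(columns="day").to_csv(part_dir / "data.csv", index=False)

# Reading a single partition avoids scanning the whole dataset.
jan1 = pd.read_csv(root / "day=2024-01-01" / "data.csv")
```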
It can seem like overkill, but I can't recommend Snowflake enough. It's simple to set up and manage, and its JSON support makes it easy to just drop JSON-line files into a table and query them on the fly.
Depending on your data/query volume, it can also be very cheap.
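To illustrate the drop-a-JSONL-file-and-query pattern locally (this is a pandas analogue, not Snowflake's actual VARIANT handling; the event data is made up):

```python
import io
import pandas as pd

# Stand-in for a JSON-lines file you'd drop into a table/stage.
jsonl = io.StringIO(
    '{"event": "click", "ms": 120}\n'
    '{"event": "click", "ms": 80}\n'
    '{"event": "view", "ms": 30}\n'
)

# Each line becomes a row; nested fields would land as columns/objects.
df = pd.read_json(jsonl, lines=True)
avg_click_ms = df.loc[df["event"] == "click", "ms"].mean()
```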
However, this article was about table formats, so maybe OP wasn't thinking about JSON. If so, it looks like Snowflake only supports read operations atm. Snowflake Summit is this week, so maybe that will change?
mason55|3 years ago
alexmerced|3 years ago
eatonphil|3 years ago
[0] https://github.com/multiprocessio/datastation
tomnipotent|3 years ago
chrisjc|3 years ago
I_Love_Data_22|3 years ago