Show HN: Data Engineering Book – An open source, community-driven guide
251 points | xx123122 | 17 days ago | github.com
The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers get up to speed faster.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
fudged71|16 days ago
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
To be honest, I've been a bit frustrated that the data tools don't seem to have any focus on code; their modalities are generic text and images. For synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum/filters that feed into training.
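For anyone in the same spot, here is a minimal sketch of the "custom classifying / custom cleaning" step for Python code, using only the standard library `ast` module. The function names (`tag_python_sample`, `keep`), the tag fields, and the thresholds are hypothetical choices for illustration, not something taken from the book or from any existing tool.

```python
import ast

def tag_python_sample(source):
    """Return coarse tags for one code sample, or None if it does not parse."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None  # unparsable samples get no tags and are dropped later
    nodes = list(ast.walk(tree))
    return {
        "num_functions": sum(
            isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes
        ),
        "num_classes": sum(isinstance(n, ast.ClassDef) for n in nodes),
        "num_lines": len(source.splitlines()),
    }

def keep(tags, min_functions=1, max_lines=500):
    """Hypothetical filter: keep parsable samples that define at least one function."""
    return (
        tags is not None
        and tags["num_functions"] >= min_functions
        and tags["num_lines"] <= max_lines
    )

samples = [
    'def add(a, b):\n    """Add two numbers."""\n    return a + b\n',
    "def broken(:\n    pass\n",  # syntax error, will be filtered out
    "x = 1\n",                   # parses, but defines no function
]
kept = [s for s in samples if keep(tag_python_sample(s))]
```

Storing the tags next to each sample would let the filter thresholds be tuned later without re-parsing the corpus.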
esafak|17 days ago
xx123122|17 days ago
hliyan|17 days ago
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...
xx123122|17 days ago
joshuaissac|17 days ago
dang|17 days ago
I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!
Edit: they did, and I've moved that post to the toptext.
xx123122|17 days ago
osamabinladen|17 days ago
xx123122|17 days ago
nimonian|17 days ago
Whether it's GPT or not, it needs rewriting.
cpard|16 days ago
Lance[1] (the format, not just LanceDB) is a great example, where you have columnar storage optimized for both analytical operations and vector workloads together with built-in versioning for dataset iteration.
Plus, very importantly, random access, which matters for things like sampling and efficient filtering during curation, but also for working with multimodal data, e.g. videos.
Lance is not alone: Vortex[2] is another one, Nimble[3] from Meta is yet another, and I might be missing a few more.
[1] https://github.com/lance-format/lance
[2] https://vortex.dev
[3] https://github.com/facebookincubator/nimble
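To illustrate why random access matters for sampling, here is a toy sketch using only the Python standard library. It is not Lance's (or Vortex's or Nimble's) actual on-disk layout or API: the fixed-width record layout, the `write_rows` and `take` helpers, and the file name are all assumptions made for the example. The point is only that when row offsets are cheap to compute, a sample of k rows costs k seeks rather than a full scan.

```python
import os
import random
import struct
import tempfile

# Hypothetical row layout: little-endian uint32 id followed by a float32 score.
RECORD = struct.Struct("<If")

def write_rows(path, rows):
    """Write rows back to back so that row i starts at byte i * RECORD.size."""
    with open(path, "wb") as f:
        for row_id, score in rows:
            f.write(RECORD.pack(row_id, score))

def take(path, indices):
    """Read only the requested rows by seeking to their byte offsets."""
    out = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * RECORD.size)
            out.append(RECORD.unpack(f.read(RECORD.size)))
    return out

path = os.path.join(tempfile.mkdtemp(), "rows.bin")
write_rows(path, [(i, i / 2) for i in range(1000)])

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample_idx = sorted(rng.sample(range(1000), 5))
sample = take(path, sample_idx)  # touches 5 records, not all 1000
```

Real formats have to solve the harder version of this problem, with variable-width and compressed columns, which is where the design differences between them come in.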
baalimago|17 days ago
Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil, you need to refine it"?
[0]: https://en.wikipedia.org/wiki/Petroleum
13pixels|17 days ago
We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.
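For readers unfamiliar with the pattern, one common way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below is a minimal illustration, not what the book or any particular production system does: the document IDs are made up, the two input rankings stand in for the outputs of a BM25 index and a vector index, and `k=60` is just a commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by either retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query that contains an exact product ID.
bm25_hits = ["sku-4417", "doc-a", "doc-b"]    # keyword search nails the ID
vector_hits = ["doc-b", "sku-4417", "doc-c"]  # vectors favor related concepts

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# fused == ["sku-4417", "doc-b", "doc-a", "doc-c"]
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first step before a heavier re-ranker.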
xx123122|17 days ago
guillem_lefait|17 days ago
xx123122|17 days ago
alexott|17 days ago
xx123122|17 days ago
Thanks for understanding, and Happy New Year!
dvrp|17 days ago
rafavargascom|17 days ago
How is it possible that a Chinese publication gets to the top of HN?
xx123122|17 days ago
We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a Work in Progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!
heliumtera|17 days ago
rafavargascom|17 days ago