Maintainable ETL Practices at Stitchfix (item 20058267)

99 points | rahimiali | 6 years ago | multithreaded.stitchfix.com

19 comments
[+] jdreaver | 6 years ago
Huge +1 to leveraging SQL as much as possible during ETL.

We recently moved our data warehouse to Snowflake. Along the way, we adopted an ELT paradigm instead of ETL, where you load your raw data into the warehouse and then perform any transformations in the warehouse. A tool called dbt (https://www.getdbt.com/) has made this a joy. You write SELECT queries in plain SQL files, and dbt builds a DAG from that SQL so it can efficiently create tables/views in parallel. dbt also comes with some nifty testing facilities to validate your data once it is transformed.
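The ELT pattern itself is easy to sketch. A minimal, hedged illustration using Python's sqlite3 as a stand-in warehouse (all table and column names here are invented; in real dbt the transform is a plain `.sql` file and dbt handles the DDL and dependency ordering):

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse

# L: raw data lands in the warehouse untransformed
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount_text TEXT, status TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "shipped"), (2, "5.00", "returned"), (3, "12.50", "shipped")],
)

# T: a "model" is just a SELECT, materialized as a view or table;
# dbt parses this SQL, infers dependencies, and builds models in DAG order
wh.execute("""
    CREATE VIEW fct_shipped_revenue AS
    SELECT status, ROUND(SUM(CAST(amount_text AS REAL)), 2) AS revenue
    FROM raw_orders
    WHERE status = 'shipped'
    GROUP BY status
""")

print(wh.execute("SELECT * FROM fct_shipped_revenue").fetchall())
```

The point of the paradigm is that both steps happen inside the warehouse, so every intermediate stage is queryable with SQL.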

This is a great article, thank you for writing it!

[+] systems | 6 years ago
I am not a big fan of the term ELT.

Just because you are using a database as your workspace doesn't mean you load your data before transformation.

Most complex DWs have multiple layers of transformation, and several workspaces along the way.

There is always a final step where you load the clean data into your clean schema, the one used for querying the data.

ETL

    E = fetch the data you want
    T = make all necessary changes
    L = load a clean copy into the clean schema
[+] teej | 6 years ago
dbt is fantastic; I very strongly recommend it.
[+] sv123 | 6 years ago
dbt has been a game-changer for us; such an awesome tool.
[+] atombender | 6 years ago
In discussions like this, I often wonder how there can be enough data to require a "big data" pipeline with things like Spark and Presto.

Stitch Fix seems to be one of those online services that send you sets of clothes that they think fit your style. That seems like a really narrow, low-data kind of industry. How much data can they possibly have? And why so big on the backend? In January 2018 they had 100 engineers. Presumably they're even larger now. Just for a service that sends out clothing.

Maybe I'm lacking in imagination or insight into what it takes to run a company like this. On the other hand, a single PostgreSQL instance can run complex ad-hoc queries, CTEs and all, over millions or even billions of rows.

[+] achompas | 6 years ago
Their service is high-touch. Stylists speak with clients constantly, so they record this information.

They also have mobile apps, they run product experiments, they source and sell clothing and manage inventory, they build and iterate on algorithmic approaches to recommend and design clothing (many of which help stylists and never reach the screen of an external client).

You can skim through their Algorithms blog for some more detail. I find them impressive in how they scale the impact of relatively few stylists to about 3M users.

https://www.kleinerperkins.com/files/INTERNET_TRENDS_REPORT_...

[+] threeseed | 6 years ago
Spark is not a competitor to PostgreSQL.

It is a distributed compute engine that has a lot of capabilities, is rock solid, allows you to blend SQL with Python/R/Scala and can support ML use cases as your needs grow. You can easily store all of your data in PostgreSQL and run Spark on top.
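To make "run Spark on top" concrete, here is a rough sketch of reading a PostgreSQL table into Spark and blending SQL with Python (this assumes a local Spark installation plus the PostgreSQL JDBC driver; the connection details, table, and columns are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read a table straight out of PostgreSQL over JDBC
orders = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop")
    .option("dbtable", "orders")
    .option("user", "etl")
    .load())

# Blend SQL with Python: register the DataFrame and query it with SQL
orders.createOrReplaceTempView("orders")
daily = spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date")
daily.show()
```

The same DataFrame can then feed Python/R/Scala code or MLlib pipelines without leaving Spark.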

[+] afpx | 6 years ago
At what point does it become “big data”? 10s of billions, 100s of billions? I would love to get a single PostgreSQL server to index all of my data in a timely manner. Heck, I’d be happy with a sharded scheme, if it worked. But, I don’t have time to learn the low-level details of a RDBMS and my system IO throughput to make that work. Spark works, it’s easy, and it’s relatively cheap. So, why make it more complicated?
[+] reddickulous | 6 years ago
A transactional schema usually isn't ideal to run analytics on. So you create a data warehouse that transforms your data into a schema more amenable to reporting. They probably also have a ton of marketing data that they integrate into the data warehouse to run analysis on.
[+] piggybox | 6 years ago
I used to work for SF and managed one of the biggest tables in the DW. I won't say what it's used for, but to give you a clue how big it is: we added a few billion more rows to that table each day. I don't think a single PostgreSQL instance had any chance to survive that.
[+] tempguy9999 | 6 years ago
Too large a chunk of my life was burnt on the ETL from hell. Other than SQL, I can't see any other tool that would have helped.

I found the easiest way was just dump tables of data into long varchars - even the supposedly numeric stuff, because often enough it wasn't - then scrub & filter it from there. (NB, if you take anything from this post, make it that).
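A minimal sketch of that stage-as-text approach, using sqlite3 (the table, columns, and validation rule are invented): land everything as TEXT, then let rows that survive a numeric check through to the clean table and leave the filth behind for a human.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Stage everything as text, warts and all - even the "numeric" columns
db.execute("CREATE TABLE stage_prices (sku TEXT, price TEXT)")
db.executemany(
    "INSERT INTO stage_prices VALUES (?, ?)",
    [("A1", "10.50"), ("A2", "N/A"), ("A3", " 7.25 "), ("A4", "")],
)

# Scrub & filter: only rows matching a crude numeric pattern get cast
# and promoted; the rest stay in staging for manual inspection
db.execute("""
    CREATE TABLE clean_prices AS
    SELECT sku, CAST(TRIM(price) AS REAL) AS price
    FROM stage_prices
    WHERE TRIM(price) GLOB '[0-9]*.[0-9]*'
""")

print(db.execute("SELECT * FROM clean_prices ORDER BY sku").fetchall())
```

Because the raw text is still in staging, you can query it to see exactly which rows failed and why before tightening the rules.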

There were multiple challenges, but one that was most painful was the sheer filth mixed into the data. Too often even a human couldn't work out what the heck was supposed to be in some field. Often we could, but it ate a lot of time.

It doesn't sound like your ETLs are anything like mine; you start with pretty clean data. You are lucky.

[+] achompas | 6 years ago
Yikes, I’m sorry. What were your application developers doing to persist this messy data?
[+] ak39 | 6 years ago
“The primary benefit of SQL is that it’s understood by all data professionals: data scientists, data engineers, analytics engineers, data analysts, DB admins, and many business analysts.”

No, the primary benefit of using SQL is not SQL itself but the “free features” you get with the RDBMS you’re using: enforcement of primary keys, automatic data type conversions, free transactions and rollbacks ... and the sheer gift that is set-based operations on large chunks of data. It is just plain stupid to code the transformation or insertion of millions of rows iteratively rather than rely on the implicit rollback power of an INSERT statement.

Rule #1 in my projects: avoid getting sexy in code; use the RDBMS for all transformations post staging. Always stage to the RDBMS “as is” (data warts and all). Clean up with SQL and stored procedures. This way you can query the whole original data set with SQL and figure out the best cleanup and transformations needed.

[+] iblaine | 6 years ago
One note: in some databases CTEs act as optimization fences (PostgreSQL materialized them before version 12, for example), so predicates from the outer query can't be pushed down into them to use indexes. That's something to consider if you may need to port code to or from databases where this behavior differs.