Scripting data movement is easy only for small, simple jobs. With many thousands of tables and more than a few TB, all kinds of issues start popping up. I read somewhere that 85% of large data migration projects fail.
Data warehouses really need optimally sized Parquet files to ingest efficiently; for Snowflake that's roughly 100-200 MB per file.
A good way to copy that much data is to rely on DB statistics to determine the optimal number of records per chunk. Then, to have the job finish in a reasonable time, you read a number of chunks in parallel and stream them into Parquet files on S3 (or Azure Blob Storage). Once the data is up, Snowflake can ingest it.

Shameless plug: my company created a commercial solution which does exactly that (https://www.spectralcore.com/omniloader). Happy to answer any questions.
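To make the chunking idea concrete, here's a minimal Python sketch. The row count, average row size, and compression ratio are made-up placeholder numbers standing in for what you'd pull from the source database's catalog, and fetch_and_upload is a hypothetical stub for the actual range query + Parquet encode + upload step:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical statistics, as reported by the source DB's catalog.
TOTAL_ROWS = 50_000_000
AVG_ROW_BYTES = 120                     # average raw row size from DB stats
COMPRESSION_RATIO = 3.0                 # assumed Parquet compression vs. raw rows
TARGET_FILE_BYTES = 150 * 1024 * 1024   # aim for ~150 MB per Parquet file

def rows_per_chunk():
    # Rows whose compressed Parquet output lands near the target file size.
    raw_bytes_per_file = TARGET_FILE_BYTES * COMPRESSION_RATIO
    return max(1, int(raw_bytes_per_file / AVG_ROW_BYTES))

def chunk_ranges(total_rows, chunk_rows):
    # Half-open [start, end) row ranges covering the whole table.
    return [(start, min(start + chunk_rows, total_rows))
            for start in range(0, total_rows, chunk_rows)]

def fetch_and_upload(chunk):
    # Placeholder: a real pipeline would run a range query here
    # (e.g. keyset pagination), encode the rows to Parquet, and
    # stream the file to S3 or Azure Blob Storage.
    start, end = chunk
    return end - start

if __name__ == "__main__":
    chunks = chunk_ranges(TOTAL_ROWS, rows_per_chunk())
    # Read several chunks concurrently so the copy finishes in
    # reasonable time; tune max_workers to what the source DB can take.
    with ThreadPoolExecutor(max_workers=8) as pool:
        copied = sum(pool.map(fetch_and_upload, chunks))
    assert copied == TOTAL_ROWS
```

The point is that chunk size is derived from statistics up front, so every worker produces files in the warehouse's sweet spot instead of one giant file or millions of tiny ones.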