anshumankmr | 1 month ago
That being said, yeah, I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need Spark, so the team decision was to fit the smaller square peg into a larger square hole and call it a day. In fact, I had spent time refactoring one particularly pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.
johndough | 1 month ago
anshumankmr | 1 month ago
Was not aware of this, but it seems it's not available natively in Python. Seems cool, though. Will try it out in the future.
jesse__ | 1 month ago
Regardless, storage and retroactive processing weren't part of the problem. The problem was explicitly "parse JSON records as they come in, in a big batch, and increment some integers in a database".
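For concreteness, a job like that fits in a few lines of plain Python. A minimal sketch (the `event_type` field and the sqlite schema are assumptions for illustration, not part of the original problem):

    import json
    import sqlite3
    from collections import Counter

    def process_batch(lines, db_path="counters.db"):
        # Aggregate in memory first so the database sees one upsert
        # per key instead of one write per record.
        counts = Counter()
        for line in lines:
            record = json.loads(line)
            counts[record["event_type"]] += 1  # hypothetical field name

        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counters (key TEXT PRIMARY KEY, count INTEGER)"
        )
        # Upsert syntax requires SQLite >= 3.24.
        conn.executemany(
            "INSERT INTO counters (key, count) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET count = count + excluded.count",
            counts.items(),
        )
        conn.commit()
        conn.close()

    process_batch(['{"event_type": "click"}', '{"event_type": "view"}'])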
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1 GB/day. You can do a lot with 10 TB of memory and 256 cores.
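Back-of-envelope for why that rate is tiny (my numbers, assuming roughly 1 KB per record):

    # 1 GB/day expressed as a per-second rate
    bytes_per_sec = 1e9 / 86_400             # ~11.6 KB/s
    records_per_sec = bytes_per_sec / 1_000  # ~12 records/s at ~1 KB each
    print(f"{bytes_per_sec / 1e3:.1f} KB/s, ~{records_per_sec:.0f} records/s")

Even an interpreted language on a single core parses JSON orders of magnitude faster than that.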
wongarsu | 1 month ago
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and to about four orders of magnitude if we also need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work.
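Putting numbers on that, using the 1 GB/day figure from upthread: five orders of magnitude is 100 TB/day of pure ingest, and four orders is 10 TB/day while also serving queries over the accumulated history.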
ahoka | 1 month ago
Then you would need an enormous 2TB storage device. \s