anshumankmr | 1 month ago
That being said, yeah, I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need Spark, so the team decision was to fit the smaller square peg into a larger square hole and call it a day. In fact, I had spent time refactoring one particularly pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.
johndough | 1 month ago
anshumankmr | 1 month ago
Was not aware of this, but it seems it's not available natively in Python. Seems cool, though. Will try it out in the future.
jesse__ | 1 month ago
Regardless, storage and retroactive processing weren't part of the problem. The problem was explicitly "parse JSON records as they come in, in a big batch, and increment some integers in a database".
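For concreteness, a job like that fits in a few lines of plain Python. A minimal sketch (the `event_type` field and the sqlite schema are assumptions for illustration, not part of the original problem):

    import json
    import sqlite3
    from collections import Counter

    def process_batch(lines, db_path="counters.db"):
        # Aggregate in memory first so the database sees one upsert
        # per key instead of one write per record.
        counts = Counter()
        for line in lines:
            record = json.loads(line)
            counts[record["event_type"]] += 1  # hypothetical field name

        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counters (key TEXT PRIMARY KEY, count INTEGER)"
        )
        # Upsert syntax requires SQLite >= 3.24.
        conn.executemany(
            "INSERT INTO counters (key, count) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET count = count + excluded.count",
            counts.items(),
        )
        conn.commit()
        conn.close()

    process_batch(['{"event_type": "click"}', '{"event_type": "view"}'])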
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1 GB/day. You can do a lot with 10 TB of memory and 256 cores.
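Back-of-envelope for why that rate is tiny (my numbers, assuming roughly 1 KB per record):

    # 1 GB/day expressed as a per-second rate
    bytes_per_sec = 1e9 / 86_400             # ~11.6 KB/s
    records_per_sec = bytes_per_sec / 1_000  # ~12 records/s at ~1 KB each
    print(f"{bytes_per_sec / 1e3:.1f} KB/s, ~{records_per_sec:.0f} records/s")

Even an interpreted language on a single core parses JSON orders of magnitude faster than that.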
wongarsu | 1 month ago
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and to about four orders of magnitude if we also need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work.
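Putting numbers on that, using the 1 GB/day figure from upthread: five orders of magnitude is 100 TB/day of pure ingest, and four orders is 10 TB/day while also serving queries over the accumulated history.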
ahoka | 1 month ago
Then you would need an enormous 2TB storage device. \s