
Emerging Architectures for Modern Data Infrastructure

431 points | soumyadeb | 5 years ago | a16z.com

105 comments

[+] jandrewrogers|5 years ago|reply
This article has a large gap in the story: it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude. They have become ubiquitous in diverse, medium-sized industrial enterprises and it has turned them into some of the largest customers of cloud providers due to the data intensity. Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially. Almost no one provides tooling and platforms that address it. (This is not idle speculation, I’ve run just about every platform you can name through lab tests in anger. They are uniformly inadequate for these data models, everyone relies on bespoke platforms designed by specialists if they can afford the tariff.)

If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.

First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while making that data fully online for basic queries. You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second; even at that rate, you’ll need a fantastic cluster architecture. Most of the open source platforms tap out at 100k events per second per server for these kinds of mixed workloads and no one can afford to run 20k+ servers because the software architecture is throughput limited (never mind the cluster management aspects at that scale).
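Run as back-of-envelope arithmetic, the gap the parent describes is stark. A quick sketch using the comment's own figures (the per-server rates are the commenter's claims, not measurements):

```python
import math

# Cluster sizing from the ingest rates quoted in the comment above.
def servers_needed(events_per_sec: int, per_server_rate: int) -> int:
    """Minimum number of servers to sustain a given ingest rate."""
    return math.ceil(events_per_sec / per_server_rate)

RAW_RATE = 1_000_000_000  # 1B events/sec at the raw source

# Typical open source platform on a mixed ingest+query workload (~100k/s/server):
open_source_cluster = servers_needed(RAW_RATE, 100_000)    # 10,000 servers

# The efficiency target argued for above (~10M events/sec/server):
efficient_cluster = servers_needed(RAW_RATE, 10_000_000)   # 100 servers
```

A 100x difference in per-server throughput is the difference between a plausible cluster and one nobody can afford to operate.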

Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in these dimensions, and when you routinely operate on endless petabytes of data, it makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se, they were never designed for workloads where storage and latency costs were critical for viability. It can be done, but it was never a priority and you would design the software very differently if it was.

I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren’t targeted at sensor data for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model, there simply hasn’t been an existential economic incentive to build it for those other data models. I've seen this happen several times; someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for it. And it works.

[+] dima_vm|5 years ago|reply
> 10M events per second

Disclaimer: I work at VictoriaMetrics open source.

VictoriaMetrics ingest rates are around 300k events per second PER CORE. So theoretically you should be fine with just a single n1-standard-32 or *.8xlarge node. Though I would recommend the cluster version for reliability, of course, and to scale storage/ingestion/querying independently.

Here's the benchmarks with charts: https://medium.com/@valyala/measuring-vertical-scalability-f...

[+] defen|5 years ago|reply
I don't doubt you, but it's surprising that there's really that much value / inefficiency lying around that many "medium-sized" industrial enterprises can justify spending tens or hundreds of millions of dollars a year just to collect an insane amount of sensor data (and presumably take some action based on it). How big is medium-sized, and what kinds of industries?
[+] sradman|5 years ago|reply
> it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude.

This has long been the main marketing message used to promote Complex Event Processing (CEP) [1] systems. There is no shortage of enterprise and Open Source solutions for this space; what is missing is strong demand/adoption which in itself undermines the next-big-thing claim.

One can argue that sensor data is included in the ETL category.

[1] https://en.m.wikipedia.org/wiki/Complex_event_processing

[+] ianeliot|5 years ago|reply
Interesting. This may be a naive question — this is very far from my area of expertise — but is there a reason sensor data can't be sampled? It seems gratuitous to store that many events.
[+] moab|5 years ago|reply
Thanks for this great comment. What kinds of workloads are people trying to run on sensor data that arrives at such high velocity? Time series analysis? Anomaly detection? I wish I had a better idea of the specific problems that the users you've run into are trying to solve, and that fail on the existing software stack.
[+] maximilianburke|5 years ago|reply
That's one of the big challenges we've been running into at UrbanLogiq. We've built bespoke storage and processing pipelines for this data because existing options in this space didn't fit our needs and also would have bankrupted our company while we tried to sort it out.

Having "cost" on the board as a factor we were actively trying to optimize for during design pulled us in a direction that is quite foreign compared to off-the-shelf solutions.

That last paragraph rings true -- one of our big challenges specifically was ingesting and indexing data that needs to be queried across multiple dimensions, things like aircraft or drone position telemetry. But once we found a workable solution for that, it specializes to simpler workloads very well.
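One common technique for indexing telemetry across multiple dimensions (a generic approach, not necessarily what UrbanLogiq built) is a space-filling curve: interleave the bits of the quantized dimensions into a single Morton (Z-order) key, so that one ordinary sorted index preserves multi-dimensional locality. A minimal sketch:

```python
# Morton (Z-order) key: interleave the bits of several quantized dimensions
# into one integer, so points that are close in (x, y, ...) get close keys
# and a range scan over the key covers a multi-dimensional neighbourhood.
# Generic illustration only, not any particular vendor's pipeline.

def morton_key(*coords: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of each coordinate into one integer."""
    key = 0
    for bit in range(bits):
        for dim, value in enumerate(coords):
            key |= ((value >> bit) & 1) << (bit * len(coords) + dim)
    return key

# x = 3 (0b011) and y = 5 (0b101) interleave to 0b100111 = 39.
a = morton_key(3, 5)
b = morton_key(3, 6)
```

Real systems layer refinements on top (handling the time dimension separately, for instance, since it grows without bound), but the core trick is that a plain B-tree or sorted file can then serve multi-dimensional range queries.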

[+] StreamBright|5 years ago|reply
>> Almost no one provides tooling and platforms that address it

I think this is because the companies mentioned are not too common (yet?). There are tools and systems that you can use, especially from high-frequency trading, which has somewhat similar challenges. KDB+ and co. would be my first stop to check if there is something I could use. The question is the financial structure and scaling of the problem, to determine whether these tools are in the game. There are other interesting projects in the space:

- https://github.com/real-logic/aeron

- https://lmax-exchange.github.io/disruptor/

Of course these are not exactly what you need; long-term storage and querying (what KDB does) is largely unsolved.

The other tools that you might be referring to by "most of the open source platforms" are indeed not capable of doing this. I spent the last 10 years optimizing such platforms, and it is not even remotely close to what you need; you (or anybody who thinks these could be optimized) are wasting your time.

[+] ransom1538|5 years ago|reply
"You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second;"

We do this. We have a load balancer with a fleet of nginx machines inserting into BigQuery. Inserts scale well, and the large queries work since it is columnar. The issue is price. It's terribly expensive.

[+] _pmf_|5 years ago|reply
One thing people seem to be doing is putting incredible effort into the timeliness of data nobody ever looks at. (Plus, creating hundreds of bytes of TCP/IP + JSON overhead for single-bit events.)

I've used the following pattern in the past:

- generally only send batched data, in as large an interval as possible

- if somebody looks at a device, immediately (well, it might take some seconds) query the batched data and switch the device to a "live" mode that provides live data instead of "wait and batch"

This will be a bad idea for scenarios where there's a reasonable expectation of surges of people needing "live" access, but for our use cases of industrial data, it works very well. We only watch our own devices, which are in the lower tens of thousands, but I don't see why this should not scale to more, under the restrictions mentioned above.
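The wait-and-batch / live-mode switch described above can be sketched roughly like this (a toy model; `flush_threshold` stands in for the batching interval, and all names are made up for illustration):

```python
# Toy "batch by default, go live on demand" reporter: buffer readings and
# flush in bulk, but drain the backlog and stream per-reading once somebody
# is actually watching the device.

class DeviceReporter:
    def __init__(self, flush_threshold: int = 100):
        self.flush_threshold = flush_threshold  # stand-in for a time interval
        self.buffer = []
        self.live = False
        self.sent_batches = []  # what "the cloud" has received

    def record(self, reading):
        if self.live:
            self.sent_batches.append([reading])  # stream each reading
            return
        self.buffer.append(reading)
        if len(self.buffer) >= self.flush_threshold:
            self._flush()

    def watch(self):
        """Somebody opened the device view: drain the backlog, go live."""
        self._flush()
        self.live = True

    def unwatch(self):
        self.live = False

    def _flush(self):
        if self.buffer:
            self.sent_batches.append(self.buffer)
            self.buffer = []
```

The appeal of the pattern is that the expensive per-event path only exists while a human is looking; the steady state is cheap bulk uploads.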

[+] junon|5 years ago|reply
> Almost no one provides tooling and platforms that address it.

As a systems engineer with a good track record and an interest in starting an endeavor, this is a very attractive statement to me.

Where can I read more about how the sensor networks are configured, the use-cases, etc? I'd like to read into this a bit more.

[+] johnrgrace|5 years ago|reply
I've got 15 million connected cars, the data they can generate is large and you care about each specific car. Sampling the data doesn't work.
[+] simo7|5 years ago|reply
I work with sensor data and although not explicitly mentioned I thought you could locate it in the "event streaming" and "stream processing" boxes.

What piece of architecture you think is left out?

[+] kortilla|5 years ago|reply
> Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially.

Let’s step back for a second and just acknowledge that you’re in a very narrow slice of the market. The number of companies that are paying $100M/year to store sensor data is probably countable with 8 bits.

So it might seem like a large gap for you, but it’s honestly not relevant for 99.99% of developers.

[+] fleetingmoments|5 years ago|reply
Sounds like you need to move more processing and storage to the edge.
[+] peterwwillis|5 years ago|reply
Sounds like you need 1,000 nodes to do 1B events/sec without edge computing. With some compression at the edge, it'd be closer to 150-250. The limits of a conventionally architected network make it more annoying than it needs to be.
[+] zaptheimpaler|5 years ago|reply
Can you provide some examples of the kinds of sensors you're talking about?
[+] throwaway189262|5 years ago|reply
The vast majority of sensor data compresses well. Delta encoding with Huffman is pretty standard.

I guess you could have a time series database that used compression but I don't know of databases that do
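For illustration, a minimal delta-encoding sketch (the Huffman stage would then entropy-code the resulting stream of small deltas; names here are illustrative):

```python
# Delta encoding for a sensor series: store the first value, then successive
# differences. Slowly-changing readings yield mostly small deltas, which an
# entropy coder (Huffman, etc.) compresses far better than the raw values.

def delta_encode(samples):
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

readings = [1000, 1001, 1001, 1003, 1002]
encoded = delta_encode(readings)   # [1000, 1, 0, 2, -1]
assert delta_decode(encoded) == readings
```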

[+] ethanwillis|5 years ago|reply
While this is an article about data infrastructure I feel like we're missing the forest for the trees.

What is most important here, in my opinion, is that the underlying data is useful. If your underlying data wasn't collected, wasn't collected properly, or, even worse, the wrong data was collected... then setting up data infrastructure will be a boondoggle that will cause your organization to be data-hostile.

Just as much effort, if not more, needs to go into collecting the right data in the right way to fill your data infrastructure. Most of the projects I've seen or heard of are just people taking the same old data that Ted in accounting, Jill in BI, etc. are already pretty proficient at using. So the gains you get by moving that into a modern infrastructure are marginal. How many more questions can you really ask of the same data that people have decades of experience with and an intuitive sense for?

[+] teej|5 years ago|reply
The biggest shift has been towards data lake (store everything) away from data cubes (store aggregates). This makes it orders of magnitude easier to diagnose, debug, and assert the correctness of data.

So these trends aren’t in a vacuum, they directly support the issues you discuss.

> Most of the projects I've seen or heard of are just people taking the same old data ...

I don’t disagree with you here. But in my experience it’s about getting Frank in marketing to use the same numbers as everyone else.

When you have 5 different ads platforms that all take revenue credit for a single conversion and have conflicting attribution models, and none of them add up to what accounting says is in the bank account. That’s a hairy problem.

There are different flavors of that class of problem at lots of companies.

[+] huy|5 years ago|reply
I think you have a point, but there are more nuances than that.

There are typically 2 types of data to collect: Transactional data and behavioural data.

Most transactional data, due to its important nature, is already generated and captured by the production applications. Since the logic is coded by application engineers, it's usually hard to get this data wrong. This data is then ETL-ed (or EL-ed) over to a DW, as described by the article.

Behavioural data is where your statement most applies. This is where tools like Snowplow, Posthog, Segment, etc. come in to set up a proper event data collection engine. This is also where it's important to "collect data properly", as these kinds of event data change structure fast and are hard to keep track of over time. I'd admit this space (data collection management) is still nascent, with only tools like iterative.ly on the market.

[+] NightMKoder|5 years ago|reply
I completely agree - there are only so many ways to slice the data. The caveat is that the type of data matters quite a bit for the data architecture. There's another thread that mentions sensor data as a source of complexity, since the data has a theoretical delay between events (i.e. period) of 0 - something few systems are built to handle, even if you sample approximations at some fixed frequency. Algorithmic trading is a similar domain that still has a huge barrier to entry - a sign that _this isn't easy_.

The fidelity of the data is of course important, but I would claim it's not a blocker. Yes, you need to trust the data you collect. That's table stakes - if you can't collect data correctly at all, even without worrying about the past, you're in for a world of hurt. It's P0. That said, a lot of people assume you also need to do this historically - and that's not the case - at least for ML.

Reinforcement learning has been making great strides in recent years. If you're in this situation - you have a flow where you want to use a model without having any past data to train with - use something like VW's contextual bandits [1]. You don't need historical data to build your model, just real-time decision point & reward signals. Once deployed, the model converges over time to the optimal model using real-time feedback.

All that said - baby steps are important. If you're in this situation, start by getting fidelity right and then expand scope slowly without sacrificing it. It's a lot easier to backfill than to "fix" data - get that right and it gets easier from there. You'll need fixups regardless - mistakes happen and requirements change - but you have to start with something you trust, at least in the moment it's deployed.

[1] https://vowpalwabbit.org/tutorials/contextual_bandits.html
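As a toy illustration of the core idea above - learning from real-time reward signals with no historical training data - here is a bare epsilon-greedy bandit. This is NOT VW's contextual-bandit algorithm (it ignores context entirely); it only shows the explore/exploit-and-update loop:

```python
import random

# Toy epsilon-greedy bandit: pick an action, observe a reward, update a
# running value estimate. No historical data is needed; the policy improves
# purely from live feedback. Names and numbers are illustrative.

class EpsilonGreedyBandit:
    def __init__(self, n_actions, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * n_actions
        self.values = [0.0] * n_actions  # running mean reward per action

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, action, reward):
        self.counts[action] += 1
        # incremental mean: v_n = v_{n-1} + (r - v_{n-1}) / n
        self.values[action] += (reward - self.values[action]) / self.counts[action]

# Hypothetical simulation: action 1 pays off more often, and the policy
# converges toward it using only real-time feedback.
bandit = EpsilonGreedyBandit(n_actions=2, epsilon=0.1, seed=42)
payoff = [0.2, 0.8]  # made-up true success probabilities
for _ in range(2000):
    action = bandit.choose()
    reward = 1.0 if bandit.rng.random() < payoff[action] else 0.0
    bandit.update(action, reward)
```

VW's contextual bandits add context features and better exploration strategies on top of this loop, but the "deploy first, learn from rewards" shape is the same.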

[+] an_opabinia|5 years ago|reply
Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

Let’s say I deleted every time series whose Y axis isn’t measuring US dollars in every tech company’s database everywhere. Maybe for all those time series you just store the most recent value. Describe to me what would be lost.

You’re onto something but you’re not going far enough! Most, if not all, historic metadata, analytics and behavioral data collection - when it is not measuring literal dollar amounts - is completely worthless.

[+] malisper|5 years ago|reply
For a post detailing the modern data infrastructure I'm surprised they intentionally leave out SaaS analytics tools. I find this especially surprising given a16z has invested >$65M into Mixpanel.

Based on my experience working at an analytics company and running one myself, what this post misses is that an increasing number of people working with data today are not engineers. These people range from product managers trying to figure out what features the company should focus on building, to marketers trying to figure out how to drive more traffic to their website, to the CEO trying to understand how the business as a whole is doing.

For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.

[+] whoisjuan|5 years ago|reply
I don’t think Mixpanel fits here. Mixpanel is just one end-to-end suite, mostly for behavioral data captured from user sessions or user-derived events/sub-events. Basically web analytics.

The whole point of data infrastructure is that sometimes you’re collecting data from the most random places. Much of that data is not necessarily user behavior. Sometimes it’s things like temperatures, latencies, CPU usage, or instrument tallies. Sometimes it’s a stream of minute-to-minute weather data, or timings, or anything, really. Besides, many companies have been collecting data for decades, but it all lives in silos where it can’t be used for anything.

Mixpanel can’t capture all that data, or query it, or analyze it. Mixpanel captures just a small subset of web event data, and it happens to provide an analysis suite on top of the data it collects.

That’s why Segment shows up in this list instead. They help move a lot of siloed data into a common system. Mixpanel is just another source of data. You need something like Snowflake to put everything together and be able to run queries across multiple datasets.

[+] soumyadeb|5 years ago|reply
That's a great point. In a similar vein, marketing teams too are increasingly data-driven and would use tools like Braze, CustomerIO, etc. to run personalized, data-driven campaigns. Support teams are using tools like GainSight.

All these tools need to be fed data about user behavior - from apps, server backends, other tools, etc. It's a messy data connection problem, not just one-way from SaaS to warehouse: Mobile App->SaaS; SaaS->SaaS; Warehouse->SaaS; SaaS->Warehouse; and so on.

[+] huy|5 years ago|reply
For those who're interested in learning more about the history and evolution of data infrastructure/BI - basically why and how it has come to this stage - check out this short guidebook [1] that my colleagues and I put together a few months back.

It goes into detail on how much relevance the practices of the past (OLAP, Kimball's modeling) have amid the changes brought in by the cloud era (MPP, cheap storage/compute, etc.). Chapter 4 will be most interesting for the HN audience: it walks through the different waves of data adoption since BI was invented in the '60s and '70s.

https://holistics.io/books/setup-analytics/

[+] sradman|5 years ago|reply
This sounds like an in-depth discussion of what the a16z document calls Blueprint 1: Modern Business Intelligence. I don’t know if the other two blueprints for Multimodal and AI are explored.
[+] tuckerconnelly|5 years ago|reply
The ELT (rather than ETL) insight was really cool, hadn't heard of that before.

Unless you're at a massive, massive scale, though, Just Use Postgres, and write your ETL (ELT now?) queries normally. Keep It Simple, Stupid.
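To make the ELT-vs-ETL distinction concrete in that "Just Use Postgres" spirit, here is a tiny sketch with stdlib sqlite3 standing in for Postgres: load the raw rows untouched first, then transform with SQL inside the database (table and column names are made up for illustration):

```python
import sqlite3

# ELT: (E)xtract and (L)oad the raw rows as-is, then (T)ransform with SQL
# inside the database -- instead of transforming before loading (ETL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INT, amount_cents INT)")

# Load: dump source rows in untouched, no cleanup yet.
rows = [(1, 250), (1, 1000), (2, 499)]
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)

# Transform: derive the analytics table with plain SQL, where the data lives.
conn.execute("""
    CREATE TABLE user_revenue AS
    SELECT user_id, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_events
    GROUP BY user_id
""")
result = conn.execute(
    "SELECT user_id, revenue_dollars FROM user_revenue ORDER BY user_id"
).fetchall()
```

The advantage at modest scale: the raw table stays available for re-deriving the transform when requirements change, with no separate pipeline machinery.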

[+] cageface|5 years ago|reply
While I think data science is a very interesting field with a lot of beneficial applications it also seems to be the one that's right at the heart of a lot of the negative impact some tech is having on society right now. I seriously considered specializing in it for a while but ultimately decided it was too likely I'd be asked to work on things that make me uncomfortable.
[+] malux85|5 years ago|reply
Power(ful tools) can be wielded for good or evil, the courageous thing to do is to learn it AND act ethically, not shy away from it.

Otherwise the spoils of war go to the unethical evil because they are now unchallenged.

[+] dm03514|5 years ago|reply
I'm really excited about the state of data infrastructure and the emergence of the data lake. I feel like the technical aspects of data engineering are reduced to getting data into some cloud storage (S3) as Parquet. Transforms are "solved" using ELT from the data lake, or streaming using Kafka/Spark.

I think executing this in orgs with legacy data technologies is hard, but it is much more a people problem than a tech problem. In orgs that have achieved this foundation, it's really cool to see the business and analytic impact on the company.

[+] chrisweekly|5 years ago|reply
"it is much more a people problem than a tech problem"

^ This holds true for nearly every aspect of nearly every company.

[+] spullara|5 years ago|reply
Snowflake (and others) will let you either pull that in and query it, or query it in place as an external table. You can, if it makes sense for your use case, now do just the T from the data lake.
[+] m3kw9|5 years ago|reply
I wonder how many of those companies in the proposed architecture have A16z as investors?
[+] fouc|5 years ago|reply
The recent HN threads about Excel made me think there's definitely room for a new kind of Excel that works well for big data.
[+] fluffy87|5 years ago|reply
Citation needed?

We connect all our sensors to an edge AI Server that handles sensor data, and only uploads to the cloud what’s actually relevant.

It works quite well, and there are many OEMs that offer such systems, with accelerators for inference, sensor data compression, 5G, etc.

[+] nicholast|5 years ago|reply
I read this piece as a sort of loose validation that the Automunge library is filling an unmet need for data scientists. It's intended for tabular data preprocessing in the steps immediately preceding the application of machine learning.
[+] cblconfederate|5 years ago|reply
What's the point of data hoarding? Intelligent systems in nature ingest data, learn, and discard it.