Stack Overflow is a cacheless, 9-server on-prem monolith

atonse|3 years ago

Even though I love their simplicity as an example of how to be pragmatic and not over-engineer, do remember that they’ve tuned their code to the point that they built an ORM that is one of the fastest in the .NET world. I used it and it was awesomely lightweight.

It’s as much an example of how far world class talent can go, as it is about doing more with less.

jameshart|3 years ago

Right - Marc Gravell and Nick Craver, who worked on the core architecture of Stack Overflow, were both so obsessive about extracting performance from .NET web applications that when they couldn’t do any more from the outside, they both quit and went to work for Microsoft on performance improvements in the framework itself.

I feel like it’s similar to how people point to Craigslist as evidence that you can still build sites in Perl - ignoring the fact that Craigslist has Larry Wall on a retainer.

Running highly scalable monoliths is easy! As long as you’re willing to hire some of the five to ten people in the world who are capable of advancing the state of the art of development on that technology stack…

didntreadarticl|3 years ago

Dapper! I used it a while back and it was a single class that bundled query results straight into a list of objects by emitting low level CLR bytecode
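For anyone who hasn't seen the pattern, here is a minimal Python sketch of the "run SQL, get a list of typed objects back" idea. It is not Dapper's implementation: Dapper emits IL for its row readers, while this just matches column names to dataclass fields. The `Post` class, the `query` helper and the `posts` table are made up for illustration.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Post:
    id: int
    title: str
    score: int


def query(conn, cls, sql, params=()):
    """Run sql and map each row onto cls by matching column names to field names."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    wanted = {f.name for f in fields(cls)}
    # Work out once per query which column positions feed which fields.
    positions = [(name, i) for i, name in enumerate(columns) if name in wanted]
    return [cls(**{name: row[i] for name, i in positions}) for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO posts (id, title, score) VALUES (?, ?, ?)",
    [(1, "How do I exit Vim?", 5000), (2, "What is a monad?", 120)],
)

top_posts = query(
    conn, Post, "SELECT id, title, score FROM posts WHERE score > ? ORDER BY id", (1000,)
)
print(top_posts)
```

You still write the SQL yourself; the library only takes the mapping boilerplate off your hands.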

Looks like it's expanded a little since then

https://github.com/DapperLib/Dapper

eqvinox|3 years ago

You can also see this the other way around — it's a testament to how slow some other stuff is.

Which, to be clear, is not intended to be a negative statement about that "other stuff". It really depends. Some is. But I've also seen things just done poorly by applying tools wrong, e.g. ORM misuse leading to thousands of queries that should have been one OUTER JOIN.
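To make the "thousands of queries vs. one OUTER JOIN" point concrete, here is a small sketch using Python's built-in sqlite3. The `questions`/`answers` schema and the `CountingConnection` wrapper are invented for the example; the wrapper only counts how many statements each approach sends.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper (hypothetical) that counts statements sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


raw = sqlite3.connect(":memory:")
raw.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE answers (id INTEGER PRIMARY KEY, question_id INTEGER, body TEXT);
    INSERT INTO questions VALUES (1, 'q1'), (2, 'q2'), (3, 'q3');
    INSERT INTO answers VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'a3');
""")


def load_n_plus_one(db):
    """What a lazily-loading ORM often does: one query per parent row."""
    result = {}
    for qid, title in db.execute("SELECT id, title FROM questions ORDER BY id").fetchall():
        rows = db.execute(
            "SELECT body FROM answers WHERE question_id = ? ORDER BY id", (qid,)
        ).fetchall()
        result[title] = [body for (body,) in rows]
    return result


def load_with_join(db):
    """Same data in a single round trip."""
    result = {}
    rows = db.execute("""
        SELECT q.title, a.body
        FROM questions q
        LEFT OUTER JOIN answers a ON a.question_id = q.id
        ORDER BY q.id, a.id
    """)
    for title, body in rows:
        result.setdefault(title, [])
        if body is not None:  # questions without answers yield a NULL body
            result[title].append(body)
    return result


slow, fast = CountingConnection(raw), CountingConnection(raw)
slow_result, fast_result = load_n_plus_one(slow), load_with_join(fast)
print(slow.count, fast.count)
```

Both functions return the same dictionary. The first sends one statement plus one per question, the second sends exactly one, and against a networked database each extra statement is another round trip.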

But I don't think you need engineers of their unique calibre to get most of what they got. It's probably an exponential thing, if you have some merely good engineers you could maybe achieve 80% of their performance. The last 20% are just much more costly.

KyeRussell|3 years ago

Yep. Following some of the SO folks on Twitter a while back, I remember watching them do all sorts of things with .NET that didn’t feel remotely “necessary” for a Q&A website. It’s not like you can pull people off the street and have them get away with infrastructure this simple.

mdasen|3 years ago

Not to take anything away from Dapper (it's an excellent library), but it isn't really that much faster than EntityFramework anymore.

> EF Core 6.0 performance is now 70% faster on the industry-standard TechEmpower Fortunes benchmark, compared to 5.0.

> This is the full-stack perf improvement, including improvements in the benchmark code, the .NET runtime, etc. EF Core 6.0 itself is 31% faster executing queries.

> Heap allocations have been reduced by 43%.

> At the end of this iteration, the gap between Dapper and EF Core in the TechEmpower Fortunes benchmark narrowed from 55% to around a little under 5%.

https://devblogs.microsoft.com/dotnet/announcing-entity-fram...

Again, this isn't to take anything away from Dapper. It's a wonderful query library that lets you just write SQL and map your objects in such a simple manner. It's going to be something that a lot of people want. Historically, Entity Framework performance wasn't great and that may have motivated StackOverflow in the past. At this point, I don't think EF's performance is really an issue.

If you look at the TechEmpower Framework Benchmarks, you can see that the Dapper and EF performance is basically identical now: https://www.techempower.com/benchmarks/#section=data-r21&l=z.... One fortunes test is 0.8% faster for Dapper and the other is 6.6% faster. For multiple queries, one is 5.6% faster and the other is 3.8% faster. For single queries, one is 12.2% faster and the other 12.9% faster. So yes Dapper is faster, but there isn't a huge advantage anymore - not to the point that one would say StackOverflow has tuned their code to such an amazing point that they need substantially less hardware. If they swapped EF in, they probably wouldn't notice much of a difference in performance. In fact, in real-world apps, the gap between them is probably going to end up being less.

If we look at some other benchmarks in the community, they tell a similar story: https://github.com/FransBouma/RawDataAccessBencher/blob/mast...

In some tests, EF actually edges past Dapper since it can compile queries in advance (which just means calling `EF.CompileQuery(myQuery)` and assigning that to a static variable that will get reused).

Again, none of this is to take away from Dapper. Dapper is a wonderful, simple library. In a world where there's so many painful database libraries, Dapper is great. It shows wonderful care in its design. Entity Framework is great too and performance isn't really an interesting distinction. I love being able to use both EF and Dapper and having such amazing database access options.

eduction|3 years ago

The best cache is the one built into the database. People seem to forget that the major rdbmses have sophisticated cache strategies of their own and that handing them more RAM (and ensuring they are configured to use it for query or other cache) is usually a good first strategy before trying to second guess and reinvent the cache outside the db.

Thread says SO allocates 1.5TB RAM to SQL Server. Sounds wise.

MrFoof|3 years ago

Makes sense. Traditional RDBMSs are basically a buffer cache and a query optimization engine.

If the data is sitting in memory, and you've tuned extracting the data from memory as fast as possible, job done.

likeabbas|3 years ago

It's all about the load though. SO is probably 95% read-only, which makes removing the cache layer reasonable. If they had more writes, they would need an external cache to offset the read load.

PaulKeeble|3 years ago

Microservices remain mostly an organisational pattern to scale development teams, not necessarily system performance. Microservices add a lot of complexity and overhead.

mupuff1234|3 years ago

"Normal" sized services should be adequate enough for that purpose.

sebazzz|3 years ago

Besides, microservices don't guarantee horizontal scaling just like a monolith does not imply no ability to do horizontal scaling.

lifeisstillgood|3 years ago

The main takeaway is that the questions searched for are so widely distributed that there is no need for a cache layer - they are nothing but long tail.

At that point there is no 'cloud' design that can help. It's either one database, or maybe just shard everything onto thousands of distributed nodes.

But the point I am trying to make is that kubernetes and microservices etc are based on the idea of winners - power laws. One tweet everyone wants to read. One search term, one viral video.

Then again, this is just a question of taste - the taste of the dev lead, and what (s)he feels is the best approach. Take another company doing the same thing and a different approach might emerge.

pickledish|3 years ago

I mean, kubernetes or microservices don’t care how the data reads are distributed, right? That problem is a database-level thing whereas k8s is infrastructure, you can run any kind of database with any kind of sharding you want on it. I feel like it might be more accurate to say something like “the value of caching is based on the idea of winners” for example
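A quick simulation shows why the value of a cache depends on there being "winners". This is only a sketch: the item count, request count and cache size are made-up numbers, and the 1/rank popularity curve is an assumed stand-in for a power law, not measured Stack Overflow traffic.

```python
import random
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay requests through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(requests)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_items, n_requests, capacity = 100_000, 200_000, 1_000  # hypothetical sizes

# "Winners": popularity falls off as 1/rank, a Zipf-like power law.
weights = [1 / rank for rank in range(1, n_items + 1)]
skewed = rng.choices(range(n_items), weights=weights, k=n_requests)

# "Nothing but long tail": every item is equally likely to be requested.
uniform = [rng.randrange(n_items) for _ in range(n_requests)]

skewed_rate = lru_hit_rate(skewed, capacity)
uniform_rate = lru_hit_rate(uniform, capacity)
print(f"power-law hit rate: {skewed_rate:.1%}")
print(f"uniform hit rate:   {uniform_rate:.1%}")
```

With the same cache holding 1% of the items, the skewed workload gets a far higher hit rate than the uniform one, where the hit rate stays close to the fraction of items that fit in the cache.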

selcuka|3 years ago

It is ironic that many questions on Stack Overflow are about various cloud services, hyped-up technologies, and problems caused by over-engineering.

didntreadarticl|3 years ago

various cloud services

This question does not appear to be about programming, Closed.

hyped-up technologies

subjective, Closed

problems caused by over-engineering

Opinion-based, Closed.

docandrew|3 years ago

I’m always puzzled when I’m using SO to help diagnose some obscure problem in my tech stack and I see a bunch of “hot questions” in the sidebar about whether dwarf armor can deflect magic bullets, or what the energy capacity of a Stormtrooper’s laser rifle is, etc.

oconnore|3 years ago

"The medium is the message" wins again.

cntainer|3 years ago

Imagine trying to present this kind of architecture to a room full of executives already sold on the "benefits" of kubernetes, big data, serverless, etc.

prng2021|3 years ago

Hah. I get your point but it would be an easy sell for them. The impossible sell would be to engineers. Executives would just compare operating costs estimates.

ElectricalUnion|3 years ago

What would prevent you from running 9 "web server pods" with 64GB ram each? Just implement the whole thing on top of Kubernetes, why not?

threeseed|3 years ago

The use case is simple i.e. web front end, thin app layer, database.

So if you were to implement this same architecture using Kubernetes or Serverless it would be just as simple as a bunch of Ansible or Puppet scripts.

ctvo|3 years ago

The folks over at SO picked a stack (C#, SQL Server, IIS), and optimized the heck out of it to keep this "simplicity". Much of SO is custom built from the ground up to push performance and stay within the purity of the canonical .net stack.

It isn't clear to me this is a model that would work elsewhere, or should be held up as something to be replicated.

Did they save time? Did they save money? Did this help make SO a wildly successful company? Did it allow them to deliver features to customers faster?

Yeroc|3 years ago

It's worth reminding people what is actually possible with a relatively simple architecture. There's a vast number of websites and services with a very small fraction of the traffic of Stack Overflow with a much more complicated architecture simply because everyone thinks you need Kubernetes etc to scale out.

cosmotic|3 years ago

It's not cacheless. There are countless caches throughout (including what appears to be ~1TB of memory in the database server), just not a dedicated cache machine.

Sammi|3 years ago

I think OP is only referring to server architecture. And as you say there is no cache server. So cacheless server architecture.

ElectricalUnion|3 years ago

By this definition almost all non-toy applications under non-toy OSes have caches, because of CPU caches and registers.

tylergetsay|3 years ago

I don't think it's that much more complicated than Wikimedia, which does 5x the traffic: https://meta.wikimedia.org/wiki/Wikimedia_servers

bluedino|3 years ago

Not that long ago (2016) they had:

  Servers:

  SQL Servers (Stack Overflow Cluster)
   2 Dell R720xd Servers
  SQL Servers (Stack Exchange “…and everything else” Cluster)
   2 Dell R730xd Servers
  Web Servers
   11 Dell R630 Servers
  Service Servers (Workers)
   2 Dell R630 Servers
   1 Dell R620 Server
  Elasticsearch Servers (Search)
   3 Dell R620 Servers
  HAProxy Servers (Load Balancers)
   2 Dell R620 Servers
  Redis Servers (Cache)
   2 Dell R630 Servers
  VM Servers (VMWare, Currently)
   2 Dell FX2s Blade Chassis, each with 2 of 4 blades populated
   4 Dell FC630 Blade Servers (2 per chassis)
   2 Equalogic SAN PS6000-series
  Machine Learning Servers (Providence)
   2 Dell R620 Servers
  Machine Learning Redis Servers (Still Providence)
   3 Dell R720xd Servers
  LogStash Servers
   6 Dell R720xd Servers
  HTTP Logging SQL Server
   1 Dell R730xd 
  Development SQL Server
   1 Dell R620 

  Network:

  2x Cisco Nexus 5596UP core switches (96 SFP+ ports each)
  10x Cisco Nexus 2232TM Fabric Extenders (2 per rack)
  2x Fortinet 800C Firewalls
  2x Cisco ASR-1001 Routers
  2x Cisco ASR-1001-x Routers
  6x Cisco 2960S-48TS-L Management network switches (1 Per Rack)

https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...

Fire-Dragon-DoL|3 years ago

Isn't stackoverflow, incidentally, one of the websites that would benefit the most from caching, given their content is supposedly going to be static the majority of the time?

infomaniac|3 years ago

This is addressed in one of the linked tweets.

bitwize|3 years ago

That defies the laws of physics. How can they be web scale without cloud and microservices?

another2another|3 years ago

I want to upvote you, but you forgot MongoDB, which is the most fundamental law of web scale.

tony-allan|3 years ago

In the diagram [1], I can see why you might design it that way if starting from scratch but it works as is so why change it.

Is there a particular reason to suggest a change to the architecture?

[1] https://twitter.com/sahnlam/status/1629713954225405952/photo...

borland|3 years ago

Diagram 1 has the comment "What I think it should be".

It's easy to interpret that as "stackoverflow should change to be like this", but I think it was meant to be more like "If I had to guess how stackoverflow works, this is what I think it would look like".

It's amazing how much performance and scalability you can get out of computers, if you don't burden them with 100x overhead caused by shoveling data between microservices all the time :-)

default-kramer|3 years ago

The word "should" might be confusing here. I didn't read it as the author recommending a change; rather the author first proposes "Given what I know about Stack Overflow, they must be doing something like this, right?" Then boom comes the surprising revelation.

kichik|3 years ago

Is there a website that tracks outages of other websites like Stack Overflow over years? I know some that tell you if it's down right now, but not over years.

I have a subjective feeling that Stack Overflow is down a lot more than other websites. I don't see that ever mentioned in the discussion of cloud vs on-prem which makes the discussion seem lacking.

didntreadarticl|3 years ago

http://stats.pingdom.com/w2oc4thvox7s/73676/history

Spooky23|3 years ago

That’s an engineering choice, not cloud vs. on-prem. How many services are down when AWS us-east has a problem?

tyingq|3 years ago

Not caching the questions and answers makes sense to me, as I imagine the hit rate wouldn't be terribly good. I would guess, though, that they somehow cache things like the sidebar list of blog articles, featured items, "Hot Network Questions", etc.

banana_giraffe|3 years ago

They do in fact cache some things like that, they've had caching issues in the past (and again recently, I think) with the wrong cache being used in some situations:

https://meta.stackexchange.com/a/235277

jonas-w|3 years ago

The linked url [0] is also a great visualization with a bit more data than the twitter image.

[0] https://stackexchange.com/performance

hoseja|3 years ago

Only 450 peak reqs/s? Doesn't that seem low?

tiffanyh|3 years ago

> Removed Redis 4 years ago; average latency remained unchanged at 20ms.

A hidden takeaway is that NVMe storage databases are so fast, they are comparable to in-memory (redis) databases these days.

kkielhofner|3 years ago

Throwing 1.5TB of RAM in the SQL Server (server) has to help too!

foobazzy|3 years ago

Please ignore my lack of understanding a bit here. I'm genuinely trying to learn.

I've always heard (and it made sense to me) that to reduce latency of requests from across the globe, you might want to have read replicas or caches spread on global infrastructure. Then how is it that stack overflow is fast here when the db is on-prem, 7 seas across from me? Any amount of RAM should not account for the distance, right?

spiffytech|3 years ago

You can put a big dent in the impact of the speed of light if you keep round-trips to a minimum.
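Back-of-the-envelope version, as a sketch: the 250 ms round-trip time is an assumed figure for a far-away user, the 20 ms server time is the average latency quoted elsewhere in this thread, and the three-step client-rendered page is hypothetical. Connection setup (TCP/TLS handshakes) is ignored.

```python
def page_latency_ms(round_trips, rtt_ms, server_ms):
    """Rough time until content shows: sequential round trips plus server work."""
    return round_trips * rtt_ms + server_ms


RTT_MS = 250     # assumed intercontinental round-trip time
SERVER_MS = 20   # average server latency quoted in the thread

# Server-rendered HTML on a warm connection: one request, one response.
server_rendered = page_latency_ms(1, RTT_MS, SERVER_MS)

# Hypothetical client-rendered page: HTML shell, then script, then an API call,
# each waiting on the previous one, each costing server time.
client_rendered = page_latency_ms(3, RTT_MS, 3 * SERVER_MS)

print(server_rendered, client_rendered)
```

The distance costs the same per trip either way; what changes is how many times you pay it before the user sees something.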

This is one advantage of server-rendered HTML (though that's not the only option you have).

It also helps that StackOverflow is light on interactivity. You load a page, read for a minute, then maybe click a vote button or open a textarea to discuss. As long as the text and styles load quickly, you won't notice if progressive enhancement scripts take a little more time to load.

wlonkly|3 years ago

When I look up www.stackoverflow.com, I get Fastly IPs. I feel like using a CDN has to count as some cache?

bryancoxwell|3 years ago

It’s also one of the few sites I use that regularly goes down for maintenance.

unknown|3 years ago

[deleted]

ThatMedicIsASpy|3 years ago

steam would be the biggest for me

ec109685|3 years ago

Source material is from 2022, so title should include that disclaimer.

ksec|3 years ago

And somehow Wikipedia requires thousands of servers.

ElectricalUnion|3 years ago

Wikipedia serves much heavier multimedia content around 20x more often (in page views), with a vastly higher write load.

didntreadarticl|3 years ago

And runs on .NET

One of the only well known sites to do so, I think?

profile53|3 years ago

I think most things Microsoft run on .net incl. parts of bing and office online.

mytailorisrich|3 years ago

Joel Spolsky used to work for Microsoft and all his products were developed using the MS ecosystem, I believe.

mike_hearn|3 years ago

It's a useful reality check. Dedicated machines are fast and you can do a lot without much software complexity. People mention the StackOverflow guys optimizing their software, but their CPU utilization is 5% so they have a lot of headroom to be less optimized. Probably they just enjoyed it and could spend time on that, so why not?

At KotlinConf in April I'll be giving a talk on two-tier architecture, which is the StackOverflow simplicity concept pushed even further. Although not quite there yet for social "web scale" apps like StackOverflow, it can be useful for many other kinds of database backed services where the users are a bit more committed and you're less dependent on virality. For example apps where users sign a contract, internal apps, etc.

The gist is that you scrap the web stack entirely and have only two tiers: an app that acts as your frontend (desktop, mobile) and an RDBMS. The frontend connects directly to the DB using its native protocols and drivers, the user authentication system is that of the database. There is no REST, no JSON, no GraphQL, no OAuth, no CORS, none of that. If you want to do a query, you do it and connect the resulting result stream directly to your GUI toolkit's widgets or table view controls. If what you want can't be expressed as SQL you use a stored procedure to invoke a DB plugin e.g. implemented with PL/Java or PL/v8. This approach was once common - the thread on Delphi the other day had a few people commenting who still maintain this type of app - but it fell out of favor because Microsoft completely failed to provide good distribution systems, so people went to the web to get that. These days distributing apps outside the browser is a lot easier so it makes sense to start looking at this design again.
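A rough sketch of the shape this takes, with heavy caveats: sqlite3 stands in for a networked RDBMS (so there is no per-user database login here), and `TableView` is a hypothetical placeholder for a real GUI table control, not any toolkit's API.

```python
import sqlite3


class TableView:
    """Hypothetical stand-in for a GUI table widget; it only collects rows."""

    def __init__(self):
        self.headers = []
        self.rows = []

    def bind(self, cursor, page_size=2):
        """Fill the view from a live cursor, fetching a page at a time."""
        self.headers = [column[0] for column in cursor.description]
        while True:
            page = cursor.fetchmany(page_size)
            if not page:
                break
            self.rows.extend(page)


# In a real two-tier app this would be a network connection opened with the
# end user's own database credentials; sqlite3 has no users, so that is elided.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
db.executemany(
    "INSERT INTO invoices (customer, total) VALUES (?, ?)",
    [("acme", 120.0), ("globex", 75.5), ("acme", 30.0)],
)

view = TableView()
# Values travel as bound parameters, never spliced into the SQL text.
view.bind(db.execute(
    "SELECT id, customer, total FROM invoices WHERE customer = ? ORDER BY id",
    ("acme",),
))
print(view.headers, view.rows)
```

There is no serialization layer between the query and the widget: the cursor's column names become the headers and its rows become the table contents.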

The disadvantages are that it requires a couple more clicks up front for end users, and if they have very restrictive IT departments it may be harder for them to get access to your app. In some contexts that doesn't matter much, in others it's fatal. The tech for blocking DoS attacks isn't as good, and you may require a better RDBMS (Postgres is great but just not as scalable as SQL Server/Oracle). There are some others I'll cover in my talk along with proposed solutions.

The big advantage is simplicity with consequent productivity. A lot of stuff devs spend time designing, arguing about, fighting holy wars over etc just disappears. E.g. one of the benefits of GraphQL over plain REST is that it supports batching, but SQL naturally supports even better forms of batching. Results streaming happens for free, there's no need to introduce new data formats and ad-hoc APIs between frontend and DB, stored procedures provide a typed RPC protocol that can integrate properly with the transaction manager. It can also be more secure as SQL injection is impossible by design, and if you don't use HTML as your UI then XSS and XSRF bugs also become impossible. Also because your UI is fully installed locally, it can provide very low latency and other productivity features for end users. In some cases it may even make sense to expose the ability to do direct SQL queries to the end user, e.g. if you have a UI for browsing records then you can allow business analysts to supply their own SQL query rather than flooding the dev's backlog with requests for different ways to slice the data.

fatnoah|3 years ago

When my startup was acquired a few years ago, our infra was hosted at AWS, but most of our "cloud features" were used more for monitoring, alerting, and dashboarding. The real work was done by Windows/SQL and .NET app code. Ours was a messaging application that we tested to support about 350 messages/second, and we had to integrate with the "big co" backend after we were acquired. The bigco back-end could handle about 3-5 messages/second.

Our main production "infra" was a load-balanced pair of medium CPU front-end servers and a high-memory back-end for the SQL server. Theirs was approximately 20x the size, and a more "traditional" cloud microservices, etc. infrastructure. Optimization makes all the difference. So many of the "extras" just add unnecessary complexity, just like avoiding those "extras" probably does when they actually are required.

mwcampbell|3 years ago

On the topic of Postgres versus MS SQL Server or Oracle, I wonder if any of the newer Postgres-compatible databases, like Cockroach or Materialize, solve the scalability issue you raise with Postgres, while not having quite the stigma of MS SQL Server or (especially) Oracle.

yamrzou|3 years ago

Is it hosted on the cloud?

didntreadarticl|3 years ago

Nope, on-prem

https://twitter.com/alexcwatt/status/1544876135711916035?lan...

faizmokhtar|3 years ago

"What I think it should be"

That's a little bit arrogant no?

KyeRussell|3 years ago

Quite the opposite. It’s what mere mortals think it’d be, vs what the extraordinary talent has gotten away with.

didntreadarticl|3 years ago

They mean preconception

118 comments