Reading this article it seems like yet another example of "you don't have big data". Most of the features that are unique to Spark (or Spark-like setups) were not needed, so in the end it's mostly...just an app talking to Postgres?
I'm not sure, but reading other articles[0] on the blog, it seems like they've jumped on bandwagons before, so it's probably good to revisit those decisions every now and again.
Edit: Not trying to come off as too snarky, though I've found that this type of thing is pretty common in startups where everyone from the CTO on down has some experience, but not a lot of it. I've fallen into that trap too, at some point saying "Sure, Scala will work great! It's future-proof and everyone will love it!" ...cue crickets.
There's that, but Spark is also a little bit tricky because it has such a big feature set that it's not just attractive as a big data tool. On paper, it can be attractive as a tool for easy single-node data parallelism, for easy streaming data processing, for easy machine learning on the Java platform, stuff like that. I'm looking at migrating off of Spark, too, and finding that Spark is still the only way to get a data table library that's decently ergonomic (for developers) on a JVM language.
So there's all of that stuff, and then you think, "Oh, and it gives us an easy scaling path if we ever find our data volumes growing at an unexpectedly rapid clip." And at first you think it's just a cherry on top.
It may seem too good to be true at first, but with everyone using it, and with most of the alternatives looking like they'll involve at least as much dev effort during the initial analysis, it's pretty easy to miss that.
And it's only a while later, when you're already pretty heavily invested in the ecosystem, that you start to really understand some things that the bloggers and book authors don't talk about in public. Like how Spark makes Big Data and scale-out a self-fulfilling prophecy. You will need to scale out, even if your data should fit in memory, because Spark wastes memory like a 22-year-old football player wastes money.
And maybe you're an appropriately cynical jerk, and can therefore spot that it couldn't possibly live up to the hype from a mile away. Congratulations. Hopefully you're the technical lead. Even if you are, though, too bad, because, nobody else in the meeting room is as jaded about tech as you are. Certainly not the management folks who don't program or have been out of the game for years. And, since you are cynical and jaded, that means there's still a good chance you'll go with Spark anyway. Because that's a hard argument to win - unless you've already been burned by Spark in the past, you aren't going to have enough intimate knowledge to make an argument that's concrete enough to sound convincing. And because this isn't the hill you want to die on. And because, being appropriately cynical, you realize that, at the end of the day, it's not really your problem. Spark will get the job done. It'll be inefficient, sure, but the extra server and development costs aren't coming out of your pocket, they're coming out of the pocket of the person who wants to be using Spark.
I think it is, but it's a worthwhile point to reiterate. It's also worth pointing out that it can be worth doing a rewrite to a less scalable architecture if it simplifies the architecture and provides more flexibility.
The company I work for is currently undergoing a similar rewrite from Firebase to Postgres (although we're doing a gradual migration), and it's amazing how much code we have been able to throw away, and how much more quickly we can move in the parts of the code that have been migrated.
Trying Haskell if you happen to land a Haskell expert on your team sounds entirely reasonable. It's an old language by now and is even used in the conservative FAANG companies. And if you're stuck on the JVM, checking out the competition to Java sounds reasonable too. Lots of people are happy with Clojure or Scala and don't look back.
I had a similar thought. Postgres is a great choice, but then they also went with Haskell... I look forward to another blog post in 2-3 years detailing all the ways that Haskell failed them and how, at the end of the day, they should have just gone with an industry-standard language.
Granted, for some reason, it seems that "I retired a legacy system and rolled out a brand new one" looks better on a resume than "I refactored a legacy system into a better system," so YMMV.
I'm not sure about the resume part. My experience refactoring legacy stuff into more modern stuff is actually a big talking point for my resume in my experience. People are always interested in why I did it, how I did it, why didn't I just build anew, etc.
Maybe that's because most jobs I apply for tend to have at least one 'legacy' system laying around that needs some help. But this is somewhat common outside of the startup scene.
Legacy in this case means an app that's maybe 7-10 years old, often written like everything happened on the back of a napkin. Sprawling, inconsistent, monolithic stuff. But not true legacy where it's written in Fortran or something in the 1500s.
At any rate, I think it's worth having refactors on your resume. I know I'd be interested in people who have done it - it's always such an educational experience.
I guess if you are replacing a system part by part, as opposed to doing a big-bang rewrite, you will be more likely to make sure interfaces are well defined and components are more modular.
This doesn't really sound like a big-bang rewrite as such but an incremental development process. The sort of rewrite that would be a major strategic mistake is where you start with a new repository and begin re-implementing the entire product, but this seems all perfectly sensible to me.
This just sounds like the sort of incremental Ship of Theseus [0] development that many of us are doing. The product I'm working on has had enough key internal portions rewritten over a long enough time (including interestingly the job management system) that you could say it's a rewrite compared to the product from 2 years ago.
I'd like to see some idea of what they mean by big.
An order of magnitude in terms of lines of code, the number of engineers they had for the job, how long it all took versus how long the original took to write, etc.
One of the things that scares me about something like this is all the seemingly illogical bits of code that were added over time and are actually that way for a reason, because they fix some edge case or whatever. It's hard to see those not getting swept away in a rewrite, along with a whole bunch of new, rarely occurring issues getting added.
A complete rewrite strikes me as something that's almost guaranteed to be buggier than what it replaces, at least for a while.
The company where I work is doing (and is close to finishing) a big rewrite of one of its products. The base version is deployed on-prem at our clients; the new one will be a cloud service with the same features. But the on-prem version is still worked on and supported (I'm working on it) alongside the cloud one; the cloud version is meant to offer a less cumbersome alternative.
> Prematurely designing systems “for scale” is just another instance of premature optimization
> Examples abound: (...) using a distributed database when Postgres would do
This is the only part of the article that bugged me a little, because in my experience the choice between single-machine and distributed databases is not so much about scale as it is about availability and avoiding a single point of failure.
Even if your database server is fairly stable (a VM in a robust cloud, for instance), if you use Postgres or MySQL and you need to upgrade to a newer version of the database (let's say for an urgent security update), you have no choice but to completely stop the service for a few seconds or minutes (assuming the service cannot work without its database).
Depending on the service and its users, this mandatory down-time might or might not be acceptable.
Anecdotally I suspect services requiring high SLAs are more common than ones requiring petabyte scale storage.
Re availability: We had a hard time keeping the system based on Spark available. There were days when the cluster would freak out multiple times in a single day. The 'fix' would be: restart a bunch of spark workers. We spent a lot of time debugging/finding this out (some parts documented in [1]) but couldn't work out what the problem was. (EDIT: Assuming there even was a single problem.)
In this particular case, I'd take the single point of failure over the previous situation.
That being said: we have successfully used PostgreSQL's fail-overs multiple times. In my experience, they work quite alright.
High-availability Postgres setups are a minimum for a production system, along with a staging system for understanding how your system behaves during a failure event. These failure scenarios should be tested, not necessarily on every commit, but often enough that there's confidence that during a failover you're not going to drop queries on the floor and pretend it's all good, and that your monitoring systems actually report on the event.
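The client-side half of such a failover drill doesn't have to be elaborate; it can start as a retry wrapper like the sketch below. Everything here is a made-up stand-in (a real client would also re-resolve the primary's address between attempts), not any particular driver's API:

```python
import time

def with_failover_retry(query_fn, retries=3, backoff_s=0.0):
    """Retry a query across a failover window instead of dropping it
    on the floor. query_fn is any zero-argument callable that raises
    on a broken connection."""
    last_error = None
    for attempt in range(retries):
        try:
            return query_fn()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(backoff_s * (2 ** attempt))  # wait out the failover
    raise last_error

# Simulated drill: the first two attempts hit the dying primary,
# the third lands on the promoted replica.
attempts = {"n": 0}

def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("primary is failing over")
    return "ok"

print(with_failover_retry(flaky_query))  # → ok
```

Running exactly this shape against staging, on a schedule, is what tells you whether queries survive a promotion or silently vanish.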
I had the same reaction... this statement seems to be an over-generalization, but it can be resolved by being careful about what 'premature' means. In this case, "we can trivially shard datasets of different projects over different servers, because they are all independent of one another", so it seems the scaling issue had a solution from the outset.
Warnings about the risks of premature optimization should not stop people thinking about issues that are likely to arise in future, and what they might do about them. On the other hand, this does not mean that you should necessarily implement that solution now.
Downtime is not mandatory while upgrading to a new PostgreSQL version. I've upgraded from 9.0 to 11 and most versions in between without downtime on large busy databases.
There's not a single approach that works for every case, but they all involve a replica and upgrading one at a time.
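The replica-at-a-time ordering can be sketched roughly like this. Every callable is a stand-in for real orchestration (pg_upgrade or logical replication, a load-balancer drain, a promotion), and the node names and version numbers are invented; the point is only the ordering, which never takes the whole service down:

```python
def rolling_upgrade(nodes, is_primary, upgrade, switchover):
    """Upgrade a primary + replicas one node at a time."""
    old_primary = next(n for n in nodes if is_primary(n))
    replicas = [n for n in nodes if not is_primary(n)]
    for node in replicas:     # 1. upgrade the replicas in place
        upgrade(node)
    switchover(replicas[0])   # 2. fail over to an upgraded replica
    upgrade(old_primary)      # 3. the old primary rejoins, upgraded
    return replicas[0]

# Toy in-memory "cluster".
cluster = {
    "pg1": {"version": 11, "primary": True},
    "pg2": {"version": 11, "primary": False},
}

def switchover(node):
    # Stand-in for promotion + repointing clients.
    for state in cluster.values():
        state["primary"] = False
    cluster[node]["primary"] = True

new_primary = rolling_upgrade(
    nodes=list(cluster),
    is_primary=lambda n: cluster[n]["primary"],
    upgrade=lambda n: cluster[n].__setitem__("version", 13),
    switchover=switchover,
)
print(new_primary, cluster)
```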
You don't need a distributed database to have replication and a hot backup. Stackoverflow runs this kind of configuration -- if it works for them, it can work for you.
Years ago I was building SEO software. One of the products was originally written as an internal tool, and handled our workload without skipping a beat. Then we decided to release it to the public, so I did a small refactor to implement accounts. We launched with it hosted on a small Dell PC under my desk (where it had been running as our internal tool). Within 2 hours of launch, it was completely overwhelmed and shutting down due to overheating.
It was "rewrite" time.
While doing that, I had to come up with SOME workaround. So I opened the case and stuck a box fan on it to try and exhaust some of the heat. That lasted about 8 more hours before the server shut down and I got a call from the boss.
I went into the office in the middle of the night and started profiling the application. I found a VPS host, quickly spun up the largest Windows VM they offered, and that helped for a few days while I rewrote large swaths of the application. Even after a ~80% rewrite and splitting the application in two, we had more users than we'd ever anticipated, and I was out of my depth with scaling. So we got a few (much larger) physical servers at Softlayer.
This was the setup that this website ran under for the next couple of years with minor tweaks: more space with an iSCSI array, more RAM, migrating to more CPUs, etc., but all staying at Softlayer. Eventually, when the hosting bill was getting into the high four figures a month, we reevaluated and decided a rewrite was in order to switch to Microsoft Azure, utilizing Azure SQL, Azure Table Storage, Azure Queue Service, and offloading all of the complicated tasks from the web server onto the Azure infrastructure. For all I know, it is still on Azure.
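For anyone in the same midnight spot: the profiling step is mostly just the stdlib. A minimal sketch, where the hot function is invented (not the actual SEO code), showing how cProfile points at the calls worth rewriting:

```python
import cProfile
import pstats

def slow_render(rows):
    # Stand-in for the real hot path; in a web app this might be
    # the code assembling a response for every account.
    html = ""
    for r in rows:
        html += f"<td>{r}</td>"
    return html

profiler = cProfile.Profile()
profiler.enable()
slow_render(range(50_000))
profiler.disable()

# Print the five most expensive calls by cumulative time; this is
# usually enough to decide which part of the app deserves the rewrite.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```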
Is the current system so badly architected that it cannot be refactored gradually or rewritten piecemeal? Then the forces which caused these problems will also be in effect during the rewrite, so it will end up in the same place by the time it reaches feature parity.
I can only think of a few places where a full rewrite is justified:
* You lost the source code
* The application is almost purely integration with some third-party platform or component, and you need to replace that platform. (E.g. you are developing a registry cleaner and need to port it to Mac.)
* You don't have any customers or users yet and time-to-market is not a concern.
* You are not a business and are writing the code purely for your own enjoyment
But these are business level considerations. For individual developers there may be compelling reasons to push for a rewrite:
* You find it more fun to work on green-field projects than to perform maintenance development.
* The new platform is more exciting or looks better on the CV than the old
That's not necessarily true. Those forces will still be there but the learning of what not to do has also been had. People can improve and learn over time.
I have also found Spark (and Hadoop before that) a little clunky to prototype and develop on, but when you need to handle very large data sets with good throughput performance, systems like Spark/Hadoop are great. One problem they had was maintaining infrastructure, and to be honest, when I used MapReduce as a contractor at Google, or AWS Elastic MapReduce as a consultant, I didn't have to deal too much with infrastructure.
Anyway, it makes sense that they backed off from using Spark and HDFS, given the size of their datasets.
The original poster mentioned that their data analytics software is written in Haskell. I would like to see a write up on that.
EDIT: I see that they do have two articles on their blog on their use of Haskell.
>>> One of our main reasons for choosing Apache Spark had been its ability to handle very large datasets (larger than what you can fit into memory on a single node) and its ability to distribute computations over a whole cluster of machines ... We cannot fit all of our datasets in memory on one node, but that is also not necessary, since we can trivially shard datasets of different projects over different servers, because they are all independent of one another.
So this seems to be the massive takeaway - if you need to operate on a whole dataset that is larger than one node's memory capacity, then you have to go distributed. Otherwise it still seems like an overhead barely worth the effort.
So Google: dataset is all web pages on the internet - yes, that's too large, go distributed.
Tesco / Walmart: dataset might be all the sales for a year. Probably too large. But could you make do with sales per week? Per day?
Having the raw data of all your transactions etc. lying around waiting for your spiffo business query sounds good, but... is it?
I would be interested in hearing folks' cut-off points for going full Big Data vs "we don't really need this"
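My rough cut-off is back-of-the-envelope arithmetic plus the article's own observation that independent datasets shard trivially. A sketch of both checks, where the server names, bytes-per-row, and RAM figures are all made up for illustration:

```python
import hashlib

SERVERS = ["db1", "db2", "db3"]  # hypothetical shard hosts

def server_for(project_id: str) -> str:
    """Stable project -> server mapping; no coordination needed,
    because no query ever spans two projects."""
    digest = hashlib.sha256(project_id.encode()).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

def fits_in_memory(rows: int, bytes_per_row: int = 200,
                   ram_bytes: int = 64 * 2**30) -> bool:
    """Back-of-the-envelope check before reaching for a cluster."""
    return rows * bytes_per_row < ram_bytes

# A year of sales at ~100M rows is still a single-node problem
# at these (assumed) row sizes: 100M * 200 B = 20 GB < 64 GiB.
print(fits_in_memory(100_000_000))  # → True
print(server_for("project-42"))     # same server every time for this project
```

If the arithmetic says "fits, with an order of magnitude to spare," I don't go full Big Data.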
We decided to go for a Big Rewrite for a completely different reason. The initial license for the proprietary NoSQL database we had negotiated was about to expire, and the company was going to charge us an order of magnitude more to renew.
So we immediately set out redesigning our system to use other, fully open source technologies. Also gave us an opportunity to reconsider architecture decisions that had not scaled well. In our case, moving from a monolith to microservices has had major benefits. Maybe the biggest being able to quickly see which microservice is the bottleneck and needs to be scaled up to handle the load. With the monolith, if it got slow, it was very difficult to figure out which part of the workload was making it slow.
Our company runs on a pile of VBA/Access. At least it talks to a MariaDB server on Linux.
The biggest problems are trying to run/develop this code on machines that were made in the last ten years, and the fact that it's a horrible, horrible codebase. Code practices from the early '90s.
To make things worse, objects are 'evil', all HTML/SQL/XML is built by appending strings, and there are no data sanity checks...
I started a proof of concept replacement system that was written in Python and ran on the web.
It was met with "We can't go with a web-based system, since if anything changes in the browser we'll be up shit creek."
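The string-appended SQL is the part I'd fix first, and it doesn't even need a framework; every driver does parameter binding. A small demo using the stdlib sqlite3 module (table and values invented) of why appending is dangerous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.execute("INSERT INTO customers VALUES ('Alice'), ('Robert')")

evil = "Robert' OR '1'='1"

# String appending: the input is interpreted as SQL and the injected
# condition matches every row.
appended = conn.execute(
    "SELECT name FROM customers WHERE name = '" + evil + "'"
).fetchall()

# Parameter binding: the driver treats the input as a value, not SQL.
bound = conn.execute(
    "SELECT name FROM customers WHERE name = ?", (evil,)
).fetchall()

print(appended)  # → [('Alice',), ('Robert',)]
print(bound)     # → []  (nobody is literally named "Robert' OR '1'='1")
```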
I am currently evaluating whether we should rewrite a large chunk of our embedded controller, which handles motion control right now, so this is a timely write-up! I think the lessons here are the same in embedded code - we can keep the existing black-box (consultant-written!) code for now while the new motion control code is written in a parallel branch. The more modular the project, the easier this is, luckily!
This raises the question: in which situations is it appropriate to decide on a full rewrite?
In theory, there is an easy answer to this question: If the cost of the rewrite, in terms of money, time, and opportunity cost, is less than the cost of fixing the issues with the old system, then one should go for the rewrite.
In our case, the answer to all of these questions was yes.
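That rule of thumb is just a comparison once you put rough numbers on the costs; a toy sketch (all figures invented):

```python
def should_rewrite(rewrite_cost, fix_cost, opportunity_cost=0):
    """The rule of thumb above: rewrite only when it is cheaper,
    counting the roadmap that slips while you rewrite."""
    return rewrite_cost + opportunity_cost < fix_cost

# Say fixing the old system costs 12 dev-months over its remaining
# life, the rewrite takes 6, and 2 dev-months of features slip:
print(should_rewrite(rewrite_cost=6, fix_cost=12, opportunity_cost=2))  # → True
```

The hard part, of course, is that none of those three numbers can actually be estimated with confidence.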
One of our original mistakes (back in 2014) had been that we had tried to “future-proof” our system by trying to predict our future requirements. One of our main reasons for choosing Apache Spark had been its ability to handle very large datasets (larger than what you can fit into memory on a single node) and its ability to distribute computations over a whole cluster of machines [4]. At the time, we did not have any datasets that were this large. In fact, 5 years later, we still do not. Our datasets have grown by a lot for sure, both in size and quantity, but we can still easily fit each individual dataset into memory on a single node, and this is unlikely to change any time soon [5]. We cannot fit all of our datasets in memory on one node, but that is also not necessary, since we can trivially shard datasets of different projects over different servers, because they are all independent of one another.
With hindsight, it seems obvious that divining future requirements is a fool’s errand. Prematurely designing systems “for scale” is just another instance of premature optimization, which many development teams seem to run into at one point or another.
royjacobs | 6 years ago
[0] https://tech.channable.com/posts/2017-02-24-how-we-secretly-...
duijf | 6 years ago
Yes definitely. This thing was a big exercise in "You aren't going to need it".
> so in the end it's mostly...just an app talking to Postgres?
Yes. We solve problems for our customers. It's nice if we can do that without also creating problems for ourselves :)
> seems like they've been jumping on bandwagons before
Just wanted to point out that this system is also written in Haskell. We didn't really switch bandwagons.
mumblemumble | 6 years ago
nicoburns | 6 years ago
fulafel | 6 years ago
macspoofing | 6 years ago
kod | 6 years ago
Scala is still the least-bad option for a JVM language.
Anyone who can't be productive in Scala (not "better java", not "worse haskell", Scala) isn't someone you want on your team anyway.
alexpotato | 6 years ago
1. "We are going to start over from scratch and rewrite the whole thing!"
Joel Spolsky famously said to "Never do this!"
2. "We are going to slowly refactor the whole codebase."
This can, eventually, lead you to a place where none of the original code is there so it's like a rewrite but much simpler.
3. "We are going to slowly add new pieces to replace the old system till there is no old system left."
This is called the "Strangler App" model as described by Martin Fowler (https://martinfowler.com/bliki/StranglerFigApplication.html)
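A strangler setup like #3 can start as nothing more than a routing table in front of both systems. A minimal sketch (the feature and handler names here are made up):

```python
# Route each feature to the new system as it is migrated; whatever is
# not listed still falls through to the legacy code path.
def legacy_handler(request):
    return f"legacy:{request}"

def new_orders_handler(request):
    return f"new:{request}"

MIGRATED = {"orders": new_orders_handler}

def route(feature, request):
    handler = MIGRATED.get(feature, legacy_handler)
    return handler(request)

print(route("orders", "o-1"))    # → new:o-1
print(route("invoices", "i-9"))  # → legacy:i-9  (not strangled yet)
```

Each migration then becomes a one-line change to the routing table, which is also a one-line rollback.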
steve_adams_86 | 6 years ago
collyw | 6 years ago
gtsteve | 6 years ago
[0] https://en.wikipedia.org/wiki/Ship_of_Theseus
barking | 6 years ago
baud147258 | 6 years ago
Darkstryder | 6 years ago
duijf | 6 years ago
[1]: https://tech.channable.com/posts/2018-04-10-debugging-a-long...
devonkim | 6 years ago
mannykannot | 6 years ago
StreamBright | 6 years ago
Also, Spark has some single points of failure that are not obvious at first.
https://gist.github.com/aseigneurin/3af6b228490a8deab519c6ae...
smilliken | 6 years ago
wvenable | 6 years ago
nicoburns | 6 years ago
True. But if it is acceptable, then it may well be a good trade-off to make.
jermaustin1 | 6 years ago
goto11 | 6 years ago
ryanelfman | 6 years ago
mark_l_watson | 6 years ago
z3t4 | 6 years ago
Don't wait for a big rewrite. Constantly keep deleting and rewriting. Just make sure you are solving real problems while doing it.
barking | 6 years ago
lifeisstillgood | 6 years ago
jimbokun | 6 years ago
bluedino | 6 years ago
:-/
viraptor | 6 years ago
sandGorgon | 6 years ago
It is also natively integrated with K8s - https://github.com/dask/dask-kubernetes - and YARN: https://yarn.dask.org/en/latest/
roland35 | 6 years ago
sebastianconcpt | 6 years ago
AzzieElbab | 6 years ago
apta | 6 years ago
duijf | 6 years ago