We're not nearly at Stripe's scale, but my startup (Spreemo) has achieved pretty amazing parallelism using the commercial SaaS CircleCI. We have 3907 expects across 372 RSpec and Cucumber files. Our tests complete in ~14 minutes when run across 8 containers.
One of the great strengths of CircleCI is that it auto-discovers our test types, calculates how long each file takes to run, and then auto-allocates the files in future runs to try to equalize run times across containers. The only work we had to do was split up our slowest test file when we found that it was taking longer to complete than a combination of files on the other machines.
I also like that I can run pronto https://github.com/mmozuras/pronto to post RuboCop, Rails Best Practices, and Brakeman errors as comments on GitHub.
We simply added the linters/code analysis to the CI itself. Reasoning: we try to have as little "code style" discussion in PRs as possible.
I'm super curious how Stripe approaches end-to-end testing (like Selenium/browser testing, but maybe something more bespoke too).
My understanding is that they have a large external dependency (my term: "the money system"), and running integration tests against it might be tricky or even undependable. Do they have a mock banking infrastructure they integrate against?
This is a great question, and it's definitely a problem we have.
We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.
Back-testing is also incredibly valuable: we have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
[1] https://en.wikipedia.org/wiki/ISO_8583
I'm very curious about that as well. I worked on a big project that had a (perhaps analogous) large external dependency on networks of embedded devices in homes and businesses, and integration testing it was …difficult. I'd love to hear how Stripe solves that problem.
Not a Stripe member, but I would assume that anything that involves intense security auditing, PCI, etc. would live in separate codebases that rarely change.
(e.g. card handling could anonymize the card numbers in a service before they reach the main app)
The integration with third parties is a separate issue that exists whether it's banks or not - I would guess they abstracted that as well, as services or libraries, and decide case by case.
I am tired of this technology having to be re-invented time and time again.
The best I ever saw was an internal tool at Microsoft. It could run tests on devices (Windows Mobile phones, but it really didn't care), had a nice reservation and pool system and a nice USB-->Ethernet-->USB system that let you route any device to any of the test benches.
This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.
The test recovery was the best I've ever seen. The back end was wonky as anything: every single function returned a BOOL indicating whether it had run correctly or not, and every function call was wrapped in an IF statement. That was silly, but the end result was that every layer of the app could be restarted independently, and after enough failures either a device would be automatically removed from the pool and its tests rerun on another device, or a host machine could be pulled out and the test package sent down to another host machine.
The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en masse.
In comparison, this test system just took a path to a set of source files on a machine and the compilation and execution command line; if the program returned 0 the test was marked as a pass, and if it returned anything else it was marked as a fail.
All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.
Everything I've used since then has either been watching people poorly reimplement this system (frequently with error recovery that isn't as good) or just downright inferior tools.
(For reference, a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement; network bandwidth alone was a huge bottleneck.)
All that said, the test system in the link sounds pretty sweet.
I agree. We already have projects like http://test-load-balancer.github.io but I have a feeling I will see five more posts on HN in the next year about re-inventing this wheel, and yet not see a single contribution to existing solutions like tlb.
It must be a little depressing to build a really useful product you know many people need, give it away only hoping people will use it and be happy, then find out everyone would rather build their own.
But we do like to build things, it is in our nature. Plus, what looks better on your resume:
1) I migrated my team's test suite to test load balancer in two days, saving hours every test run.
2) I contributed improvements to the open source test load balancer project.
3) I designed and implemented my own distributed test load balancing tool!
>I am tired of this technology having to be re-invented time and time again.
So did Microsoft open source this? If not, quit complaining. Just because you saw a massive software engineering company doing something better doesn't mean everyone else who doesn't have access to it sucks for not reaching parity.
https://github.com/tmm1/test-queue
One thing that really sped up our test suite was creating an NGINX proxy that served up all the static files instead of making Rails do it. This saved about 10 minutes off our 30-minute tests.
test-queue supports minitest too, and follows the same basic design outlined in this article: a central queue sorted by slowest tests first, with test runners forked off either locally or on other machines to consume off the queue.
We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.
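tmm1's design above can be sketched in a few lines of Ruby. This is a toy model, not test-queue's actual code: the timings hash is invented, and threads stand in for the forked worker processes to keep it short.

```ruby
# Per-file timings as they might be recorded from a previous run
# (file names and numbers are made up for illustration).
timings = {
  "billing_spec.rb" => 120.0,
  "user_spec.rb"    => 45.0,
  "webhook_spec.rb" => 30.0,
  "util_spec.rb"    => 5.0,
}

# Slowest first, so long files start immediately and short ones
# fill in the gaps at the end of the run.
queue = Queue.new
timings.sort_by { |_, secs| -secs }.each { |file, _| queue << file }

# Each worker pops files until the queue is drained. A real runner
# would fork and exec the test file here instead of collecting names.
def drain(queue)
  taken = []
  loop do
    begin
      taken << queue.pop(true) # non-blocking pop
    rescue ThreadError         # raised once the queue is empty
      break
    end
  end
  taken
end

workers = 2.times.map { Thread.new { drain(queue) } }
processed = workers.flat_map(&:value)
```

The sorted queue is the whole trick: whichever worker is free takes the next-slowest remaining file, so no worker sits idle while another grinds through a long tail.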
How much cheaper (in time, code, effort, complexity) would it be if:
- Their language runtime supported thread-based concurrency, which would drastically reduce implementation complexity and actual per-task overhead, improving machine-usage efficiency AND eliminating the process-tree management concerns that introduce a requirement for things like Docker.
- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.
- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?
Only if the early engineers could write in a language that they were as productive in as Ruby. Getting Stripe launched was the key thing Stripe needed to accomplish. Everything else follows from that.
Is it fair to assume that time, code, effort, and complexity would be some degree of cheaper?
May very well be more expensive. Language choice isn't a silver bullet.
Would it be cheap enough to encourage a re-write, re-engineer the server stack, and re-train employees? I doubt it.
I think it is an interesting question, but a bad one most of the time unless you take into account all the other external factors beyond the language itself.
It would run the tests 10-100x faster per test, and also require fewer tests (due to having a real type system).
I do giggle a little when I see huge engineering hurdles people have to overcome because of the language that was chosen. Building an app that is going to scale to millions of users? May not want to use Ruby...
(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).
One thing I've noticed since coding with immutable data structures & functions (rather than mutable OOP programs) is how fast tests run, and how easy they are to run in parallel.
I/O only happens in a few functions, and most other code just takes data in -> transforms -> returns data out. This means only a few functions need to 'wait' on something outside themselves to finish, and there are far fewer delays in the code.
This is coding in Clojure for me, but you can do it in any language that has functions (preferably with efficient persistent data structures, like the tree-based PersistentVector in Clojure).
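As a toy illustration of that data-in/data-out style in Ruby (all names here are invented): because the transform below touches no shared state, checks against it can run in parallel threads with no locking at all.

```ruby
# A pure transform: reads only its argument and a frozen constant,
# mutates nothing, returns new data.
PRICE_TABLE = { apple: 3, pear: 5 }.freeze

def total(order)
  order[:items].sum { |item, qty| PRICE_TABLE.fetch(item) * qty }
end

orders = 100.times.map do |n|
  { id: n, items: { apple: n, pear: 1 } }
end

# Each thread checks one order; results collect without coordination
# because nothing is shared or mutated.
results = orders.map do |order|
  Thread.new { [order[:id], total(order)] }
end.map(&:value)
```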
Immutable data structures give you easy parallelism, however there's a hidden runtime cost: you have to allocate way more objects. For example, I was able to save a ton of object allocations here: https://github.com/mime-types/ruby-mime-types/pull/93 mostly by mutating. For tasks that are not easily parallelizable it may be slower to use immutable structures.
I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.
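One way to see the allocation gap described above on a stock Ruby, using only `GC.stat`: building a string by copying allocates a new object per step, while mutating reuses one buffer. This is a rough sketch, not a rigorous benchmark.

```ruby
def immutable_build(parts)
  parts.reduce("") { |acc, part| acc + part } # fresh String per step
end

def mutable_build(parts)
  parts.each_with_object(String.new) { |part, acc| acc << part } # one buffer
end

def allocations
  GC.stat(:total_allocated_objects)
end

parts = ["x"] * 1_000

before = allocations
immutable_build(parts)
immutable_cost = allocations - before

before = allocations
mutable_build(parts)
mutable_cost = allocations - before
```

(Functional languages like Clojure soften this cost with persistent data structures that share structure between versions instead of copying, which is one of the "tricks" the parent is wondering about.)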
Love this. Sometimes testing can be a huge pain in the ass. I know more than one project I work on where getting them to run is a lot of effort in itself.
There is something to be said about code quality and having tests run in under a few seconds. The ideal situation is when you can have a barrage of tests run as fast as you are making changes to code. If we ever got to the point of instant feedback that didn't suck I'd think we'd change a lot about how we think about tests.
We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance... In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)
This made me long for a unit test framework as simple as:
You get test parallelism and efficient use of compute resources "for free" (well, from make -j, because it already has a job queue implementation internally). This setup closely resembles the "rts" unit test approach you'll find in a number of djb-derivative projects.
The defining obstacle for Stripe seems like Ruby interpreter startup time though. I'm not sure how to elegantly handle preforked execution in a Makefile-based approach. Drop me a line if you have ideas or have tackled this in the past, I've got a couple projects stalled out on it.
On a previous project, I had built a shell script that essentially created n mysql databases and just distributed the test files under n rails processes.
We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.
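That script might look roughly like this in Ruby. Everything here is a placeholder: the file names, the `TEST_ENV_NUMBER` convention (borrowed from the parallel_tests gem), and the commented-out rspec command.

```ruby
# Round-robin the spec files into n buckets, then fork one runner per
# bucket, pointing each process at its own database via an env var.
def partition(files, workers)
  buckets = Array.new(workers) { [] }
  files.each_with_index { |file, i| buckets[i % workers] << file }
  buckets
end

files = (1..10).map { |n| "spec/model_#{n}_spec.rb" } # stand-in file list
buckets = partition(files, 4)

pids = buckets.each_with_index.map do |group, n|
  fork do
    ENV["TEST_ENV_NUMBER"] = (n + 1).to_s # selects e.g. app_test_1 .. app_test_4
    # exec("bundle", "exec", "rspec", *group) # a real runner would exec here
    exit! 0
  end
end
statuses = pids.map { |pid| Process.wait2(pid).last }
```

Static round-robin like this doesn't balance by runtime the way the dynamic-queue approaches do, which is why it stays a 1-2 day project.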
"This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits."
But, then how do you catch bugs where shared mutable state is not compatible with multiple changes?
You write tests specifically for testing multiple changes. You shouldn't be testing changes to global state by seeing how multiple supposedly independent tests interact.
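The throwaway-process trick the article describes can be sketched like this; `$config` here is just a stand-in for whatever global state a badly behaved test might clobber.

```ruby
$config = { mode: "production" }

# Run a test body in a forked child; any damage to global state
# dies with the child process.
def run_isolated(&test)
  pid = fork do
    test.call
    exit! 0 # exit! skips at_exit hooks so the child dies cleanly
  end
  Process.wait2(pid).last.success?
end

# This "test" trashes global state, but only inside its child process.
passed = run_isolated do
  $config[:mode] = "test"
  raise "bad mode" unless $config[:mode] == "test"
end
```

After the child exits, the parent's `$config` is untouched, which is exactly the isolation property the quoted passage is after.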
Congratulations on what looks like a very challenging task. I'm assuming part of those tests hit a database. How have you dealt with that? I assume that a single instance, even on a powerful bare-metal server, could be a roadblock in this situation. A few insights on the Docker/containerization part of it would also be nice!
Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spinup and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
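A hedged sketch of that pattern - one database per worker, tables created lazily and dropped at teardown. `FakeDatabase` is an in-memory stand-in for illustration, not Stripe's actual access layer.

```ruby
class FakeDatabase
  attr_reader :name, :tables

  def initialize(name)
    @name   = name
    @tables = {}
  end

  def ensure_table(table) # lazy: created only when a test touches it
    @tables[table] ||= []
  end

  def drop_table(table)
    @tables.delete(table)
  end
end

# One database instance per worker process on this machine.
POOL = 4.times.map { |n| FakeDatabase.new("app_test_#{n}") }

def with_tables(db, tables)
  tables.each { |t| db.ensure_table(t) } # set up only what this test uses
  yield db
ensure
  tables.each { |t| db.drop_table(t) }   # leave the database clean
end

with_tables(POOL[0], [:charges]) do |db|
  db.tables[:charges] << { amount_cents: 100 }
end
```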
This is an interesting and possibly overlooked problem with using slow languages like Ruby: your unit tests take forever to run (unless you spend a lot of engineering effort on making them run faster, in which case they may run somewhat acceptably fast).
Things like using the same setup function for every test, and setting up/tearing down for every test regardless of dependencies. Also tests that could have been consolidated into one. Then people wonder why it takes so much time? Also helpful is if you can skip database setup for tests that don't need it.
These are embarrassingly parallel problems; we just need better tools to fully saturate every core on every node in the test cluster.
mocha and jasmine (in the node/javascript space) support nested setup and teardown methods, and it's been really challenging for me to go back to other frameworks and languages.
Not only does the nesting help limit the amount of setup and teardown you do, but when broad-reaching functional changes hit you in version 2, 3, it's so much easier to reorganize your tests to get the pre- and post-conditions right when they are already grouped that way.
The sad thing is that it takes a few release cycles before you feel any difference at all, and a couple more before you're absolutely sure that there are qualitative differences between the conventions. So it seems like a pretty arbitrary selection process instead of an obvious choice.
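RSpec offers the same nested describe/before structure in Ruby, and the reason it composes well can be modeled in a few lines of plain Ruby: a test runs every ancestor's setup block, outermost first, so shared setup lives in exactly one place. This is a toy model, not any framework's real implementation.

```ruby
class TestContext
  attr_reader :parent, :setup

  def initialize(parent = nil, &setup)
    @parent = parent
    @setup  = setup
  end

  def context(&setup) # nest a child context, like an inner describe
    TestContext.new(self, &setup)
  end

  def run(state = {}, &test)
    chain = [self]
    chain.unshift(chain.first.parent) while chain.first.parent
    chain.each { |ctx| ctx.setup&.call(state) } # outermost setup first
    test.call(state)
    state
  end
end

users  = TestContext.new { |s| s[:user] = "alice" } # shared by all tests
admins = users.context   { |s| s[:admin] = true }   # only for admin tests

result = admins.run { |s| s[:seen] = [s[:user], s[:admin]] }
```

Reorganizing tests after a broad functional change then becomes a matter of moving whole contexts around, since the pre-conditions travel with them.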
Oh hey, we have the same sort of system here. It's 60,000 Python tests which take ~28 hours if run serially, but we keep it around 30-40 minutes. We wrote a UI & scheduler & artifact distribution system (which we're probably going to replace with S3). We run selenium & unit tests as well as the integration tests.
We've noticed that starting and stopping a ton of Docker containers in rapid succession really hoses dockerd, and also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.
Have you considered Mesos?
Have you considered another containerization solution like LXD? I feel like testing like this fits the "container hypervisor" use case, which is what LXD is designed for.
Not sure if this has already been answered, but would Stripe's methods only work with unit tests where tests are not dependent on each other?
How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order? Finding the optimal ordering/distribution of tests between workloads would certainly be more complicated. Maybe it could be calculated with directed-graph algorithms?
> How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order?
I reckon that would be solving the wrong problem. End-to-end tests should be independent of each other, and tests should never be dependent on the order in which they are run. End-to-end tests might be longer as a result, but managing the complexity of test dependencies will quickly cripple any system that uses this approach, I imagine.
I'd love to know if their integration tests use a database or reference external services of any sort.
We ended up making a compromise where each test can never expect another test to have run...but some tests expect certain test data to be present and in a known state. To handle that, every test cleans up the data of the previous run (Entity Framework has a nice change tracker where we can keep track of the unit of work before it is persisted). We wouldn't be able to parallelize everything though...we can only accept a single test to be active on the DB at a single point in time.
I think those would not be considered "unit" tests. Often the definition of unit tests includes the ability to run those tests in any order. Any tests that have to be run in a particular order (i.e. "stateful" tests) should be considered a single test, and likely an integration test at that.
They have an average of 9 assertions per test case. I think I may see part of their problem.
I'm not sure if you are talking from a performance perspective or a conceptual perspective, but this provides a useful discussion on multiple assertions: http://programmers.stackexchange.com/questions/7823/is-it-ok...
My 2 cents is that multiple assertions are legitimate, as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:
    [Test]
    public void ValueIsInRange()
    {
        int value = GetValueToTest();
        Assert.That(value, Is.GreaterThan(10), "value is too small");
        Assert.That(value, Is.LessThan(100), "value is too large");
    }
Any reason why a financial infrastructure provider like Stripe would run CI tests on someone else's infrastructure? Isn't that a no-go from a security point of view? Or - how do you trust the hosted CI company not to look at your code?
Contracts, not firewalls, make the world go round.
One can probably assume that they are not relying upon the secrecy of their code for security.
I wrote a rubygem called cloudspeq (http://github.com/meesterdude/cloudspeq) that distributes Rails RSpec specs across a bunch of DigitalOcean machines to reduce test execution time for slow test suites in dev.
One of the things I did that may be of interest is to break up the spec files themselves to help reduce hotspots (or dedicate a machine to one specifically).
Not as complex or as robust as what they did, but it works!
> Initially, we experimented with using Ruby's threads instead of multiple processes
Why, to be cool? Tests are a classic case of things that should run in isolation - you don't want tests interfering with each other or crashing the whole test suite. Using separate processes would have been the sensible approach to start with.