We're not nearly at Stripe's scale, but my startup (Spreemo) has achieved pretty amazing parallelism using the commercial SaaS CircleCI. We have 3907 expects across 372 RSpec and Cucumber files. Our tests complete in ~14 minutes when run across 8 containers.
One of the great strengths of CircleCI is that it auto-discovers our test types, calculates how long each file takes to run, and then auto-allocates the files in future runs to try to equalize run times across containers. The only work we had to do was split up our slowest test file when we found that it was taking longer to complete than a combination of files on the other machines.
I also like that I can run pronto https://github.com/mmozuras/pronto to post RuboCop, Rails Best Practices, and Brakeman errors as comments on GitHub.
We simply added the linters/code analysis to the CI itself. Reasoning: we try to have as little "code style" discussion in PRs as possible.
I'm super curious how Stripe approaches end-to-end testing (like Selenium/browser testing, but maybe something more bespoke too).
My understanding is that they have a large external dependency (my term: "the money system"), and running integration tests against it might be tricky or even undependable. Do they have a mock banking infrastructure they integrate against?
This is a great question, and it's definitely a problem we have.
We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.
Back-testing is also incredibly valuable: we have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
[1] https://en.wikipedia.org/wiki/ISO_8583
I'm very curious about that as well. I worked on a big project that had a (perhaps analogous) large external dependency on networks of embedded devices in homes and businesses, and integration testing it was …difficult. I'd love to hear how Stripe solves that problem.
Not a Stripe member, but I would assume that anything that involves intense security auditing, PCI, etc. would live in separate codebases that rarely change.
(e.g. card handling could anonymize the card numbers in a service before they reach the main app)
The integration with third parties is a separate issue that exists whether it's banks or not - I would guess they abstracted that as well, as services or libraries, and decide case by case.
I am tired of this technology having to be re-invented time and time again.
The best I ever saw was an internal tool at Microsoft. It could run tests on devices (Windows Mobile phones, but it really didn't care), had a nice reservation and pool system and a nice USB-->Ethernet-->USB system that let you route any device to any of the test benches.
This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.
The test recovery was the best I've ever seen. The back end was wonky as anything: every single function returned a BOOL indicating whether it had run correctly or not, and every function call was wrapped in an IF statement. That was silly, but the end result was that every layer of the app could be restarted independently, and after enough failures either a device would be automatically removed from the pool and its tests rerun on another device, or a host machine could be pulled out and the test package sent down to another host machine.
The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en masse.
In comparison, this test system just took a path to a set of source files on a machine and the compilation and execution command line; if the program returned 0 the test was marked as a pass, and if it returned anything else it was marked as a fail.
All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.
Everything I've used since then has either been watching people poorly reimplement this system (frequently with error recovery that isn't as good) or just downright inferior tools.
(For reference, a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement; network bandwidth alone was a huge bottleneck.)
All that said, the test system in the link sounds pretty sweet.
I agree. We already have projects like http://test-load-balancer.github.io but I have a feeling I will see five more posts on HN in the next year about re-inventing this wheel, and yet not see a single contribution to existing solutions like tlb.
It must be a little depressing to build a really useful product you know many people need, give it away only hoping people will use it and be happy, then find out everyone would rather build their own.
But we do like to build things, it is in our nature. Plus, what looks better on your resume:
1) I migrated my team's test suite to test load balancer in two days, saving hours every test run.
2) I contributed improvements to the open source test load balancer project.
3) I designed and implemented my own distributed test load balancing tool!
>I am tired of this technology having to be re-invented time and time again.
So did Microsoft open source this? If not, quit complaining. Just because you saw a massive software engineering company doing something better doesn't mean everyone else who doesn't have access to it sucks for not reaching parity.
https://github.com/tmm1/test-queue
One thing that really sped up our test suite was creating an NGINX proxy that served up all the static files instead of making Rails do it. This saved about 10 minutes off our 30-minute tests.
test-queue supports minitest too, and follows the same basic design outlined in this article: a central queue sorted by slowest tests first, with test runners forked off either locally or on other machines to consume off the queue.
We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.
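tmm1's design above can be sketched in a few lines of Ruby. This is a toy model, not test-queue's actual code: the timings hash is invented, and threads stand in for the forked worker processes to keep it short.

```ruby
# Per-file timings as they might be recorded from a previous run
# (file names and numbers are made up for illustration).
timings = {
  "billing_spec.rb" => 120.0,
  "user_spec.rb"    => 45.0,
  "webhook_spec.rb" => 30.0,
  "util_spec.rb"    => 5.0,
}

# Slowest first, so long files start immediately and short ones
# fill in the gaps at the end of the run.
queue = Queue.new
timings.sort_by { |_, secs| -secs }.each { |file, _| queue << file }

# Each worker pops files until the queue is drained. A real runner
# would fork and exec the test file here instead of collecting names.
def drain(queue)
  taken = []
  loop do
    begin
      taken << queue.pop(true) # non-blocking pop
    rescue ThreadError         # raised once the queue is empty
      break
    end
  end
  taken
end

workers = 2.times.map { Thread.new { drain(queue) } }
processed = workers.flat_map(&:value)
```

The sorted queue is the whole trick: whichever worker is free takes the next-slowest remaining file, so no worker sits idle while another grinds through a long tail.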
How much cheaper (in time, code, effort, complexity) would it be if:
- Their language runtime supported thread-based concurrency, which would drastically reduce implementation complexity and actual per-task overhead, improving machine-usage efficiency AND eliminating the process-tree management concerns that introduce a requirement for things like Docker.
- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.
- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?
Only if the early engineers could write in a language that they were as productive in as Ruby. Getting Stripe launched was the key thing Stripe needed to accomplish. Everything else follows from that.
Is it fair to assume that time, code, effort, and complexity would be some degree of cheaper?
May very well be more expensive. Language choice isn't a silver bullet.
Would it be cheap enough to encourage a re-write, re-engineer the server stack, and re-train employees? I doubt it.
I think it is an interesting question, but a bad one most of the time unless you take into account all the other external factors beyond the language itself.
It would run the tests 10-100x faster per test, and also require fewer tests (due to having a real type system).
I do giggle a little when I see huge engineering hurdles people have to overcome because of the language that was chosen. Building an app that is going to scale to millions of users? May not want to use Ruby...
(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).
One thing I've noticed since coding with immutable data structures & functions (rather than mutable OOP programs) is how fast tests run, and how easy they are to run in parallel.
I/O only happens in a few functions, and most other code just takes data in -> transforms -> returns data out. This means only a few functions need to 'wait' on something outside themselves to finish, and there are far fewer delays in the code.
This is coding in Clojure for me, but you can do it in any language that has functions (preferably with efficient persistent data structures, like the tree-based PersistentVector in Clojure).
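As a toy illustration of that data-in/data-out style in Ruby (all names here are invented): because the transform below touches no shared state, checks against it can run in parallel threads with no locking at all.

```ruby
# A pure transform: reads only its argument and a frozen constant,
# mutates nothing, returns new data.
PRICE_TABLE = { apple: 3, pear: 5 }.freeze

def total(order)
  order[:items].sum { |item, qty| PRICE_TABLE.fetch(item) * qty }
end

orders = 100.times.map do |n|
  { id: n, items: { apple: n, pear: 1 } }
end

# Each thread checks one order; results collect without coordination
# because nothing is shared or mutated.
results = orders.map do |order|
  Thread.new { [order[:id], total(order)] }
end.map(&:value)
```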
Immutable data structures give you easy parallelism, however there's a hidden runtime cost: you have to allocate way more objects. For example, I was able to save a ton of object allocations here: https://github.com/mime-types/ruby-mime-types/pull/93 mostly by mutating. For tasks that are not easily parallelizable it may be slower to use immutable structures.
I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.
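One way to see the allocation gap described above on a stock Ruby, using only `GC.stat`: building a string by copying allocates a new object per step, while mutating reuses one buffer. This is a rough sketch, not a rigorous benchmark.

```ruby
def immutable_build(parts)
  parts.reduce("") { |acc, part| acc + part } # fresh String per step
end

def mutable_build(parts)
  parts.each_with_object(String.new) { |part, acc| acc << part } # one buffer
end

def allocations
  GC.stat(:total_allocated_objects)
end

parts = ["x"] * 1_000

before = allocations
immutable_build(parts)
immutable_cost = allocations - before

before = allocations
mutable_build(parts)
mutable_cost = allocations - before
```

(Functional languages like Clojure soften this cost with persistent data structures that share structure between versions instead of copying, which is one of the "tricks" the parent is wondering about.)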
Love this. Sometimes testing can be a huge pain in the ass. I know more than one project I work on where getting them to run is a lot of effort in itself.
There is something to be said about code quality and having tests run in under a few seconds. The ideal situation is when you can have a barrage of tests run as fast as you are making changes to code. If we ever got to the point of instant feedback that didn't suck I'd think we'd change a lot about how we think about tests.
We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance... In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)
This made me long for a unit test framework as simple as:
You get test parallelism and efficient use of compute resources "for free" (well, from make -j, because it already has a job queue implementation internally). This setup closely resembles the "rts" unit test approach you'll find in a number of djb-derivative projects.
The defining obstacle for Stripe seems like Ruby interpreter startup time though. I'm not sure how to elegantly handle preforked execution in a Makefile-based approach. Drop me a line if you have ideas or have tackled this in the past, I've got a couple projects stalled out on it.
On a previous project, I had built a shell script that essentially created n mysql databases and just distributed the test files under n rails processes.
We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.
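That script might look roughly like this in Ruby. Everything here is a placeholder: the file names, the `TEST_ENV_NUMBER` convention (borrowed from the parallel_tests gem), and the commented-out rspec command.

```ruby
# Round-robin the spec files into n buckets, then fork one runner per
# bucket, pointing each process at its own database via an env var.
def partition(files, workers)
  buckets = Array.new(workers) { [] }
  files.each_with_index { |file, i| buckets[i % workers] << file }
  buckets
end

files = (1..10).map { |n| "spec/model_#{n}_spec.rb" } # stand-in file list
buckets = partition(files, 4)

pids = buckets.each_with_index.map do |group, n|
  fork do
    ENV["TEST_ENV_NUMBER"] = (n + 1).to_s # selects e.g. app_test_1 .. app_test_4
    # exec("bundle", "exec", "rspec", *group) # a real runner would exec here
    exit! 0
  end
end
statuses = pids.map { |pid| Process.wait2(pid).last }
```

Static round-robin like this doesn't balance by runtime the way the dynamic-queue approaches do, which is why it stays a 1-2 day project.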
"This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits."
But, then how do you catch bugs where shared mutable state is not compatible with multiple changes?
You write tests specifically for testing multiple changes. You shouldn't be testing changes to global state by seeing how multiple supposedly independent tests interact.
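The throwaway-process trick the article describes can be sketched like this; `$config` here is just a stand-in for whatever global state a badly behaved test might clobber.

```ruby
$config = { mode: "production" }

# Run a test body in a forked child; any damage to global state
# dies with the child process.
def run_isolated(&test)
  pid = fork do
    test.call
    exit! 0 # exit! skips at_exit hooks so the child dies cleanly
  end
  Process.wait2(pid).last.success?
end

# This "test" trashes global state, but only inside its child process.
passed = run_isolated do
  $config[:mode] = "test"
  raise "bad mode" unless $config[:mode] == "test"
end
```

After the child exits, the parent's `$config` is untouched, which is exactly the isolation property the quoted passage is after.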
Congratulations on what looks like a very challenging task. I'm assuming part of those tests hit a database. How have you dealt with that? I assume that a single instance, even on a powerful bare-metal server, could be a roadblock in this situation. A few insights on the Docker/containerization part of it would also be nice!
Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spinup and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
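A hedged sketch of that pattern - one database per worker, tables created lazily and dropped at teardown. `FakeDatabase` is an in-memory stand-in for illustration, not Stripe's actual access layer.

```ruby
class FakeDatabase
  attr_reader :name, :tables

  def initialize(name)
    @name   = name
    @tables = {}
  end

  def ensure_table(table) # lazy: created only when a test touches it
    @tables[table] ||= []
  end

  def drop_table(table)
    @tables.delete(table)
  end
end

# One database instance per worker process on this machine.
POOL = 4.times.map { |n| FakeDatabase.new("app_test_#{n}") }

def with_tables(db, tables)
  tables.each { |t| db.ensure_table(t) } # set up only what this test uses
  yield db
ensure
  tables.each { |t| db.drop_table(t) }   # leave the database clean
end

with_tables(POOL[0], [:charges]) do |db|
  db.tables[:charges] << { amount_cents: 100 }
end
```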
This is an interesting and possibly overlooked problem with using slow languages like Ruby: your unit tests take forever to run (unless you spend a lot of engineering effort on making them run faster, in which case they may run somewhat acceptably fast).
Things like using the same setup function for every test, and setting up/tearing down for every test regardless of dependencies. Also tests that could have been consolidated into one. Then people wonder why it takes so much time? Also helpful is if you can skip database setup for tests that don't need it.
These are embarrassingly parallel problems; we just need better tools to fully saturate every core on every node in the test cluster.
mocha and jasmine (in the node/javascript space) support nested setup and teardown methods, and it's been really challenging for me to go back to other frameworks and languages.
Not only does the nesting help limit the amount of setup and teardown you do, but when broad-reaching functional changes hit you in version 2, 3, it's so much easier to reorganize your tests to get the pre- and post-conditions right when they are already grouped that way.
The sad thing is that it takes a few release cycles before you feel any difference at all, and a couple more before you're absolutely sure that there are qualitative differences between the conventions. So it seems like a pretty arbitrary selection process instead of an obvious choice.
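RSpec offers the same nested describe/before structure in Ruby, and the reason it composes well can be modeled in a few lines of plain Ruby: a test runs every ancestor's setup block, outermost first, so shared setup lives in exactly one place. This is a toy model, not any framework's real implementation.

```ruby
class TestContext
  attr_reader :parent, :setup

  def initialize(parent = nil, &setup)
    @parent = parent
    @setup  = setup
  end

  def context(&setup) # nest a child context, like an inner describe
    TestContext.new(self, &setup)
  end

  def run(state = {}, &test)
    chain = [self]
    chain.unshift(chain.first.parent) while chain.first.parent
    chain.each { |ctx| ctx.setup&.call(state) } # outermost setup first
    test.call(state)
    state
  end
end

users  = TestContext.new { |s| s[:user] = "alice" } # shared by all tests
admins = users.context   { |s| s[:admin] = true }   # only for admin tests

result = admins.run { |s| s[:seen] = [s[:user], s[:admin]] }
```

Reorganizing tests after a broad functional change then becomes a matter of moving whole contexts around, since the pre-conditions travel with them.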
Oh hey, we have the same sort of system here. It's 60,000 Python tests which take ~28 hours if run serially, but we keep it around 30-40 minutes. We wrote a UI & scheduler & artifact distribution system (which we're probably going to replace with S3). We run selenium & unit tests as well as the integration tests.
We've noticed that starting and stopping a ton of Docker containers in rapid succession really hoses dockerd, and also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.
Have you considered Mesos?
Have you considered another containerization solution like LXD? I feel like testing like this fits the "container hypervisor" use case, which is what LXD is designed for.
Not sure if this has already been answered, but would Stripe's methods only work with unit tests where tests are not dependent on each other?
How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order? Finding the optimal ordering/distribution of tests between workloads would certainly be more complicated. Maybe it could be calculated with directed-graph algorithms?
> How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order?
I reckon that would be solving the wrong problem. End-to-end tests should be independent of each other, and tests should never be dependent on the order in which they are run. End-to-end tests might be longer as a result, but managing the complexity of test dependencies will quickly cripple any system that uses this approach, I imagine.
I'd love to know if their integration tests use a database or reference external services of any sort.
We ended up making a compromise where each test can never expect another test to have run...but some tests expect certain test data to be present and in a known state. To handle that, every test cleans up the data of the previous run (Entity Framework has a nice change tracker where we can keep track of the unit of work before it is persisted). We wouldn't be able to parallelize everything though...we can only accept a single test to be active on the DB at a single point in time.
I think those would not be considered "unit" tests. Often the definition of unit tests includes the ability to run those tests in any order. Any tests that have to be run in a particular order (i.e. "stateful" tests) should be considered a single test, and likely an integration test at that.
They have an average of 9 assertions per test case. I think I may see part of their problem.
I'm not sure if you are talking from a performance perspective or a conceptual perspective, but this provides a useful discussion on multiple assertions: http://programmers.stackexchange.com/questions/7823/is-it-ok...
My 2 cents is that multiple assertions are legitimate, as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:
    [Test]
    public void ValueIsInRange()
    {
        int value = GetValueToTest();
        Assert.That(value, Is.GreaterThan(10), "value is too small");
        Assert.That(value, Is.LessThan(100), "value is too large");
    }
Any reason why a financial infrastructure provider like Stripe would run CI tests on someone else's infrastructure? Isn't that a no-go from a security point of view? Or - how do you trust the hosted CI company not to look at your code?
Contracts, not firewalls, make the world go round.
One can probably assume that they are not relying upon the secrecy of their code for security.
I wrote a rubygem called cloudspeq (http://github.com/meesterdude/cloudspeq) that distributes Rails RSpec specs across a bunch of DigitalOcean machines to reduce test execution time for slow test suites in dev.
One of the things I did that may be of interest is to break up the spec files themselves to help reduce hotspots (or dedicate a machine to one specifically).
Not as complex or as robust as what they did, but it works!
> Initially, we experimented with using Ruby's threads instead of multiple processes
Why, to be cool? Tests are a classic case of things that should run in isolation - you don't want tests interfering with each other or crashing the whole test suite. Using separate processes would have been the sensible approach to start with.