meteorfox's comments

meteorfox | 9 years ago | on: Web Service Efficiency at Instagram with Python

This is the part I'm skeptical about:

  "Compared to CPU time, CPU instruction is a better metric,
  as it reports the same numbers regardless of CPU models
  and CPU loads for the same request."
CPU instructions will be more stable than CPU time, for sure, and their chart does show that the metric is stable. But a single CPU instruction can take multiple cycles, especially if there are stalls in the pipeline or if other processes are "polluting" the cache. Depending on the CPU model, the number of uops that can be issued concurrently varies, and instruction latencies also vary with the size of, and access pattern to, the memory hierarchy.

Also, what about a change in compiler version? That can also change the instruction count. Unless they are referring to Python bytecode instructions rather than CPU instructions.

Would measuring CPI (cycles per instruction) be a better indicator of their efficiency? They could also track both; there's no need to settle for one.
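To make the CPI idea concrete, here's a minimal sketch (my own, not from the article) of deriving CPI from the machine-readable CSV output of `perf stat -x,`; the counter values below are made up for illustration:

```python
# Sketch: compute CPI (cycles per instruction) from `perf stat -x,`
# CSV output, whose fields are value,unit,event-name,...
def parse_cpi(perf_csv):
    counts = {}
    for line in perf_csv.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[2] in ("instructions", "cycles"):
            counts[fields[2]] = int(fields[0])
    return counts["cycles"] / counts["instructions"]

# Made-up counters: 1.2e9 cycles retiring 0.8e9 instructions -> CPI 1.5
sample = "1200000000,,cycles\n800000000,,instructions"
print(parse_cpi(sample))  # 1.5
```

A CPI near 1.0 or below suggests the pipeline is well fed; a rising CPI at constant instruction count would flag exactly the stalls and cache pollution mentioned above.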

meteorfox | 10 years ago | on: Nomad Million Container Challenge

It's really impressive that it can handle that many container placements.

But, honest question: what's the value of determining how fast we can schedule a million containers? This question is not just for Nomad but for the other cluster managers that have recently published similar benchmarks.

I see the value of scheduling thousands, or perhaps hundreds of thousands, of containers across many nodes, but millions seems excessive.

I think it is more valuable to measure what happens after you have 1 million containers running on your cluster. Such as:

- What is the overhead of keeping track of that many containers?
- How do they impact the responsiveness of other API calls (list, delete)?
- What happens when nodes go down and you suddenly lose a considerable number of containers? Can it recover quickly?
- How does it impact the performance of running containers in the cluster?

Also, there are other important factors to test for:

- What about image size? How does it impact scheduling time when the image is not cached?
- Container density per node.
- Number of nodes.
- What about scheduling the other workloads Nomad supports, like VMs and other runtimes?

meteorfox | 10 years ago | on: How Not to Measure Latency [pdf]

The explanations here gave a lot of detail on the effect but, IMHO, not as much detail on the cause of Coordinated Omission (CO). Most of what I'll say here comes from a CMU paper titled "Open vs Closed: A Cautionary Tale"[1] and from Gil Tene's talk.

First, some terminology I think is important for the discussion. When I say 'job', it could be a user, an HTTP request, an RPC call, a network packet, or any task the system is asked to do and can accomplish in some finite amount of time.

Closed-loop system, aka closed system - a system where new job arrivals are triggered only by job completions. Examples: an interactive terminal, or batch systems like a CI build system.

Open-loop system, aka open system - a system where new job arrivals are independent of job completions. Examples: requests for the front page of Hacker News, or packets arriving at a network switch.

Partly-open system - a system where new jobs arrive from some outside process as in an open system, but every time a job completes there is a probability p that it makes a follow-up request, and a probability (1 - p) that it leaves the system. Examples: web applications, where users request a page and then make follow-up requests, but each user is independent, and new users arrive and leave on their own.
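As a toy illustration of the partly-open model (my own sketch, not from the paper): each completed request triggers a follow-up with probability p, so a single user's session length is geometric with mean 1/(1 - p).

```python
import random

def session_length(p, rng):
    """Requests one user makes in a partly-open system: after each
    completion they follow up with probability p, else they leave."""
    n = 1
    while rng.random() < p:
        n += 1
    return n

rng = random.Random(42)
lengths = [session_length(0.5, rng) for _ in range(10_000)]
print(sum(lengths) / len(lengths))  # close to 1 / (1 - 0.5) = 2
```

Setting p = 0 recovers a pure open system (every job is independent); p close to 1 behaves almost like a closed system, which is why the paper treats partly-open as the realistic middle ground.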

Second, workload generators (e.g. JMeter, ab, Gatling) can be classified the same way. Generators that issue a request and then block, waiting for the response before making the next request, are based on a closed system (e.g. JMeter[2], ab). Generators that continue to issue requests independently of the response rate, regardless of the system's throughput, are based on an open system (e.g. Gatling, wrk2[3]).
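The difference between the two generator types can be sketched in a few lines (the `send_request` callbacks here are hypothetical placeholders, not any particular tool's API):

```python
import time

def closed_loop(send_request, duration_s):
    """Closed-system generator: block on each response before sending
    the next request, so the injection rate follows the service rate."""
    deadline = time.monotonic() + duration_s
    latencies = []
    while time.monotonic() < deadline:
        start = time.monotonic()
        send_request()                    # blocks until the response
        latencies.append(time.monotonic() - start)
    return latencies

def open_loop(send_request_async, rate_per_s, duration_s):
    """Open-system generator: fire at a fixed rate no matter how
    slowly the system responds (roughly what wrk2 does)."""
    interval = 1.0 / rate_per_s
    next_send = time.monotonic()
    deadline = next_send + duration_s
    while next_send < deadline:
        send_request_async()              # must not block on the response
        next_send += interval
        time.sleep(max(0.0, next_send - time.monotonic()))
```

Note that if responses start taking longer than the intended interval, `closed_loop` silently sends fewer requests, while `open_loop` keeps the schedule, so the queueing delay shows up in the measured latencies.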

Now, CO happens whenever a workload generator based on a closed system is used against an open system or partly open system, and the throughput of the system under load is slower than the injection rate of the workload generator.

For the sake of simplicity, assume we have an open system, say a simple web page, where users arrive according to some probability distribution, request the page, and then 'leave'. To keep the math simple, assume arrivals are deterministic: exactly one request arrives every second.

In this example, if we use a workload generator based on a closed system to simulate this workload for 100 seconds, and the system under load never slows down, so it continues to serve every response in under 1 second, say always in 500 ms, then there's no CO here. In the end we will have 100 response-time samples of 500 ms, and all the statistics (min, max, avg, etc.) will be 500 ms.

Now, say we use the same workload generator at an injection rate of 1 request/s, but this time the system under load behaves as before for the first 50 seconds, with responses taking 500 ms, and then stalls for the last 50 seconds.

Since the system under load is an open system, we should expect 50 samples with response times of 500 ms, and 50 samples whose response times decrease linearly from 50 s down to 1 s (the requests that arrive during the stall are all served when it ends at t=100 s). The statistics then would be:

min=500ms, max=50s, avg=13s, median=0.75s, 90%ile≈40s

But because we used a closed-system workload generator, our samples are skewed. Instead, we get 50 samples of 500 ms and only 1 sample of 50 seconds! This happens because the injection rate is slowed down by the response rate of the system. As you can see, this isn't even the workload we intended, because our workload generator essentially backed off when the system stalled. The stats now look like this:

min=500ms, max=50s, avg=1.47s, median=500ms, 90%ile=500ms.
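A quick sanity check of these numbers (my own sketch of the scenario above):

```python
import statistics

# Open-loop view: one request per second for 100 s. The first 50 take
# 0.5 s; the 50 that arrive during the stall are all served when it
# ends at t=100 s, so they wait 50, 49, ..., 1 seconds.
open_samples = [0.5] * 50 + [float(s) for s in range(50, 0, -1)]

# Closed-loop view: the generator blocks on each response, so the
# whole stall collapses into a single 50-second sample.
closed_samples = [0.5] * 50 + [50.0]

for name, samples in (("open", open_samples), ("closed", closed_samples)):
    print(f"{name}: min={min(samples)}s max={max(samples)}s "
          f"avg={statistics.mean(samples):.2f}s "
          f"median={statistics.median(samples)}s")
# open:   avg=13.00s, median=0.75s
# closed: avg=1.47s,  median=0.5s
```

Same system, same stall; only the sampling discipline differs, and the closed-loop average understates the experienced latency by almost 10x.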

[1] [pdf] http://repository.cmu.edu/cgi/viewcontent.cgi?article=1872&c...
[2] http://jmeter.512774.n5.nabble.com/Coordinated-Omission-CO-p...
[3] https://github.com/giltene/wrk2

meteorfox | 10 years ago | on: The Curse of the First-In First-Out Queue Discipline (2012) [pdf]

I'm trying to relate this back to computer systems, if that's even possible; say, comparing it to scheduling block requests from multiple processes to a block device. If the LIFO discipline maximizes welfare, I assume welfare in such a system would be average response time, where response time = queue time + service time of the block request. When the block device is saturated and starts queuing, I guess one benefit would be that the block requests with the smallest waiting time are served first, improving responsiveness; but unless some kind of deadline is added, you might have a long tail where certain block requests never get serviced.

But since the paper assumes there's an opening time, perhaps it's not applicable to the block-device example I gave above. Maybe a more comparable example would be a traffic spike to a web application after some announcement, and how an HTTP framework/library might 'choose' which HTTP requests to service. My understanding is that most frameworks/libraries just implicitly delegate to the OS process scheduler.
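To see the trade-off, here's a toy single-server simulation (my own sketch, not from the paper): jobs arrive every second and each takes 2 seconds to serve, so the queue grows. Because the server is work-conserving and service times are constant, the mean response time comes out identical under FIFO and LIFO; LIFO only redistributes it, giving most jobs a fast response and a few a very long one.

```python
import statistics

def simulate(discipline, n_jobs=20, interarrival=1.0, service=2.0):
    """Single overloaded server; returns per-job response times
    (queue time + service time)."""
    arrivals = [i * interarrival for i in range(n_jobs)]
    waiting, responses = [], []
    t, i = 0.0, 0
    while len(responses) < n_jobs:
        while i < n_jobs and arrivals[i] <= t:   # admit new arrivals
            waiting.append(arrivals[i])
            i += 1
        if not waiting:
            t = arrivals[i]                      # idle until next arrival
            continue
        arr = waiting.pop() if discipline == "lifo" else waiting.pop(0)
        t += service
        responses.append(t - arr)
    return responses

fifo, lifo = simulate("fifo"), simulate("lifo")
print(statistics.mean(fifo), statistics.mean(lifo))      # equal means
print(statistics.median(fifo), statistics.median(lifo))  # LIFO far lower
print(max(fifo), max(lifo))                              # LIFO longer tail
```

This matches the intuition above: without a deadline, LIFO's welfare gain for the many is paid for by a small number of jobs stuck at the bottom of the stack.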

meteorfox | 10 years ago | on: Google: 90% of our engineers use the software you wrote (Homebrew), but...

I understand your point, and I'm not trying to be a troll.

Humans are more complex than that. I don't think you can assume that candidates will perform the same all the time. Sometimes an excellent candidate can perform badly for multiple reasons (e.g. nervousness, poor preparation, bad interviewer, personal problems, etc).

It seems to me that rejecting a good candidate and having them interview again after some time would, if that candidate really was a good hire, increase the chance of hiring them, since they will most likely prepare better and know what to expect.

meteorfox | 11 years ago | on: How to Generate Millions of HTTP Requests (2012)

Check out Gatling[1], which is based on Scala, Akka, and Netty and, last time I checked, works on Windows.

The only thing missing would be an out-of-the-box solution for distributed load generation, which I believe is being developed. For now, you can use a 'scale-out' approach[2], which lets you combine the data from multiple Gatling instances into a single report, but only as a post-processing step.

[1] http://gatling.io [2] http://gatling.io/docs/2.1.5/cookbook/scaling_out.html

meteorfox | 12 years ago | on: VPS Disk Performance, Digital Ocean vs. Linode

It seems to me that the workloads on Digital Ocean are not fully utilizing the block device. fio shows about 50% util for DO and 81% for Linode on the queue-depth=1 test, and 70%/82%, respectively, on the queue-depth=8 test, which seems to indicate a bottleneck somewhere else (at least on DO). That aligns with the OP's statement at the end that DO appears to be capped. Also, as was pointed out, the sizes of these VPSes are not even mentioned, and the working-set size of 128 MB might be either too small or too big for any of them, making it even harder to see the value of these results.

Brendan Gregg has excellent info on these topics:

http://www.joyent.com/blog/benchmarking-the-cloud
http://dtrace.org/blogs/brendan/2011/05/11/file-system-laten...
http://dtrace.org/blogs/brendan/2012/10/23/active-benchmarki...
