Post Mortem: A single whitespace character

pilif|11 years ago

Likely "Cowboy" is a transparent proxy added by your mobile service provider. I had a similar thing happening a year ago when the mobile provider used by most of our barcode scanners decided to add a transparent proxy into the loop (without telling anybody).

The solution for this problem: Use SSL.

I mean: There are already many good reasons to use SSL, but whenever you need to send any kind of mission-critical data over the mobile network, you practically must use SSL if you want any kind of guarantee that the data you send to the server is what actually reaches the server (and vice versa).

Here's my war story from last year: http://pilif.github.io/2013/09/when-in-doubt-ssl/

rubiquity|11 years ago

Cowboy is the name of an Erlang web server and Heroku uses Erlang for their routing. I imagine the reason Cowboy is showing up is due to Heroku's routing layer.

goleksiak|11 years ago

We would really like to use HTTPS but it's not supported by the Arduino chipset as I understand it. Though I'm not the hardware guy here at eatabit...

pavel_lishin|11 years ago

Who was replacing numbers with asterisks, and for what purpose?

ldng|11 years ago

What's wrong with a transparent proxy? Isn't that how HTTP caching is supposed to work? I would think the cache headers are the solution rather than SSL.

It feels like you are kind of throwing the baby out with the bathwater. IMHO, a badly configured transparent proxy does not mean the concept is bad, does it?

ctz|11 years ago

This would require a vast, vast upgrade of client power to achieve the same communications performance. If you could achieve it at all, SSL would also likely decrease reliability over a spotty GSM link.

goleksiak|11 years ago

Heroku came back and said:

Looking through the system, I see that you were sent two emails (in August and September) as several of your apps were migrated to the new routing stack (https://devcenter.heroku.com/articles/heroku-improved-router). As mentioned in the documentation, the new router follows stricter adherence to the RFC specification, including sensitivity to spaces.

...and sure enough, there is a line that says:

The request line expects single spaces to separate between the verb, the path, and the HTTP version.

So the lesson is: RTFM

-G

sixwing|11 years ago

The team at Heroku (where I currently PM) is constantly trying to improve our communication and documentation. We're definitely sorry that this caused problems, and we'll work even harder to make sure that our communication calls out any potential issues. Again - thanks for reaching out to us, and let us know if we can help.

jrochkind1|11 years ago

This very example -- requests were technically illegal all the time without devs realizing, but something in the stack changed to start rejecting them -- demonstrates the fallacy of the "be liberal in what you accept, strict in what you issue" principal. If all the web servers involved had been strict in rejecting the illegal request from the start, they would have noticed the bug in development before deploying to firmware in the field.

Xylakant|11 years ago

I don't agree that "be liberal in what you accept, strict in what you issue" is a fallacy. The client actually failed to adhere to the "be strict in what you issue" principal, just as the Cowboy was not liberal in accepting. All software will sooner or later exhibit bugs or be stricter or more lenient about a standard.

I think the fallacy is to assume that once stuff works in production, only your changes can trigger a bug. There's way too much software involved in a standard webserver stack to assume anything about it. Any patch, any update to software or devices not under your control has the potential to break your stack. The thing the OP did was the right thing: Monitor, monitor, monitor.

codehero|11 years ago

I have to agree. I developed a proprietary embedded web server using a streaming HTTP parser. Complying with the HTTP parsing rules is a headache to say the least. Variable amounts of whitespace; 2 variants of line terminators (\r\n or \n) with the provision that the latter SHOULD be accepted by the server and line continuations make complying with the whole specification a real pain if you only have 100 bytes to parse pieces of your request.

Maybe for a server with massive resources (I am talking about megabytes of RAM compared to kilobytes I work with) being liberal in what you accept works, but not when you are on a budget.

acdha|11 years ago

I think Postel's law should be read in the context of “when you cannot control the outside”. It's probably the least-bad option when you are forced to support unknown clients – see e.g. http://daniel.haxx.se/blog/2014/10/26/stricter-http-1-1-fram... for a very recent example – but that clearly doesn't apply in this case where they control both sides, or in many other cases where the number of clients is small and/or there's a solid communication mechanism to tell developers when they need to fix something.

robomartin|11 years ago

Not "principal", "principle".

Not being critical, just pointing out a common mistake.

http://blog.oxforddictionaries.com/2011/08/principle-or-prin...

Principal: Main, most important

Principle: A rule, a system of belief

ajanuary|11 years ago

Doesn't it just demonstrate that you shouldn't switch from being liberal to being strict?

For it to hold up, you need to provide the further argument that you frequently need to switch from liberal to strict.

jmount|11 years ago

I agree http://www.win-vector.com/blog/2010/02/postels-law-not-sure-... . Correct code remains correct under various compositions and transformations (that may happen in the future). Code that is working only due to pity often does not have this property. Some Netflix style chaos-monkey that turns on and off strictness during testing would be cool.

ams6110|11 years ago

In particular this philosophy is rejected in the Erlang community, where they prefer "crash if anything is not what you expect it to be"

akerl_|11 years ago

This doesn't demonstrate a fallacy in "be liberal in what you accept" any more than closed source software demonstrates fallacies in Linus's Law.

The problem wasn't liberal acceptance, it was that liberal acceptance ended when Cowboy was added to the mix.

Strict acceptance would have shown the error earlier, but continued liberal acceptance would have allowed continued functionality.

lmm|11 years ago

The right thing, I think, is to "accept but warn". Like those web browsers that used to show a yellow exclamation mark in the status bar when something was off; web devs could check for this and fix it, but normal users were unaffected. More protocols should include a way to indicate "nonfatal errors".
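A rough sketch of what "accept but warn" could look like for the request line discussed in this thread. The names `line_status`, `skip_spaces`, `skip_token`, and `check_request_line_lenient` are made up for illustration and are not taken from Cowboy, Heroku, or any real server. The idea is only that a run of extra spaces is still parsed but reported separately from a clean line, so a developer has something to notice.

```c
#include <stddef.h>

/* Outcome of a lenient request-line check (hypothetical type). */
typedef enum {
    LINE_OK,      /* exactly: METHOD SP TARGET SP VERSION */
    LINE_WARN,    /* three tokens, but more than one space somewhere between them */
    LINE_REJECT   /* not three space-separated tokens */
} line_status;

/* Skips a run of spaces and returns how many were skipped. */
static size_t skip_spaces(const char **p)
{
    size_t n = 0;
    while (**p == ' ') { (*p)++; n++; }
    return n;
}

/* Skips a run of non-space characters and returns its length. */
static size_t skip_token(const char **p)
{
    size_t n = 0;
    while (**p != '\0' && **p != ' ') { (*p)++; n++; }
    return n;
}

/* line is the request line with the trailing CRLF already removed. */
line_status check_request_line_lenient(const char *line)
{
    const char *p = line;
    int warn = 0;
    size_t gap;

    if (skip_token(&p) == 0) return LINE_REJECT;   /* method */
    gap = skip_spaces(&p);
    if (gap == 0) return LINE_REJECT;
    if (gap > 1) warn = 1;

    if (skip_token(&p) == 0) return LINE_REJECT;   /* request target */
    gap = skip_spaces(&p);
    if (gap == 0) return LINE_REJECT;
    if (gap > 1) warn = 1;

    if (skip_token(&p) == 0) return LINE_REJECT;   /* HTTP version */
    if (*p != '\0') return LINE_REJECT;            /* trailing space or a fourth token */

    return warn ? LINE_WARN : LINE_OK;
}
```

A caller could serve the request on `LINE_WARN` and log it or count it somewhere visible. How the warning would reach the client developer is left open here, which is really the hard part.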

hyperpape|11 years ago

I recall reading that Postel's law did not mean "accept input that flagrantly ignores the standard", but merely wherever the standard might be read differently, accept all conceivable interpretations of the standard. Unfortunately, I can't remember for sure where I read this, or how authoritative it was.

Postel's original formulation is not written in an essay, but an RFC, and does not elaborate on what he meant: https://tools.ietf.org/html/rfc761

Here's one discussion that suggests this interpretation, without precisely ascribing it to Postel: http://cacm.acm.org/magazines/2011/8/114933-the-robustness-p...

mqsiuser|11 years ago

> the fallacy of the "be liberal in what you accept, strict in what you issue" principal

The market (players) (can) manipulate it to create a (perceived) competitive advantage.

It's also a source where "evil" in IT comes from.

MichaelGG|11 years ago

SIP takes this to the next level. http://tools.ietf.org/html/rfc4475 is a spec for "torture tests", where the SIP authors revel in the hideously complex parsing rules they've come up with (which is basically HTTP parsing).

They even suggest that code should infer the meaning of messages. So I suppose you need some sort of AI to really handle things well.

Binary protocols would be a better choice. Or, a well-defined text format. JSON, XML, anything, really, would eliminate this class of bugs.

rtpg|11 years ago

I think the core issue here is that we're directly manipulating strings instead of using DSLs and tooling based around grammars to build our responses (this has been a solved problem for more than 10 years!)

I'm a strong proponent of "do not manipulate strings". Having library writers be the only one doing that would greatly reduce the attack surface/bug potential.

spydum|11 years ago

The Server: cowboy tag is from an Erlang web server:

https://github.com/ninenines/cowboy/blob/master/src/cowboy_p...

I'm guessing somewhere around here would be an interesting place to add a test case for this.

As far as whose server this is? I'd guess Heroku or AWS, though it's plenty possible T-Mobile could have devised some proxy to inspect traffic, but seems unlikely they would do so with Cowboy?

mischanix|11 years ago

It's simple enough to single out Heroku:

  $ cat <<EOF | nc example.herokuapp.com 80
  GET /test  HTTP/1.1
  
  EOF
  ----
  HTTP/1.1 505 HTTP Version Not Supported

lostcolony|11 years ago

T-mobile is known to have, in the past at least, used Erlang. And Cowboy is one of the most popular web servers within that community at this point.

jimrhoskins|11 years ago

Didn't Heroku turn off legacy routing last week? I was getting emails warning me to update to their new routing rules by Monday. Seems like it could be related.

ams6110|11 years ago

Erlang is from Ericsson, a telecom company. T-Mobile is a telecom company. Doesn't seem a stretch that they would use Erlang.

asveikau|11 years ago

      strcpy( ( char * ) commsOrderBuffer, "GET /v1/printer/");
  
      strcat( ( char * ) commsOrderBuffer, ( char * ) settings.getIMEI());
      strcat( ( char * ) commsOrderBuffer, "/orders.txt  HTTP/1.1\r\n");
      strcat( ( char * ) commsOrderBuffer, "HOST: ");
      strcat( ( char * ) commsOrderBuffer, SERVER_NAME);
      strcat( ( char * ) commsOrderBuffer, "\r\n");
      strcat( ( char * ) commsOrderBuffer, "Authorization: Basic ");

What the.... O(n) string concatenations, unnecessary pointer casts, no bounds checking... I think extra whitespace in an HTTP request is not their only problem.

ams6110|11 years ago

Those would be "safe" (assuming that settings.getIMEI() is completely under your control, everything else is string literals) but yeah snprintf seems way better here (though it's been well over 20 years since I wrote any significant C code).
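For comparison, a minimal sketch of the snprintf approach, assuming the target's C library provides `snprintf`. The function name `build_order_request` and the `auth` parameter are invented here. The quoted snippet stops at "Authorization: Basic ", so the credential and the final blank line are assumptions about how the request ends. The path and header layout follow the quoted code, with a single space before `HTTP/1.1`.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Builds the request head into buf.
 * Returns the number of bytes written (not counting the terminating NUL),
 * or -1 if the result did not fit in cap bytes or formatting failed.
 * imei, host and auth are caller-supplied strings (assumed parameters).
 */
int build_order_request(char *buf, size_t cap,
                        const char *imei, const char *host, const char *auth)
{
    /* Adjacent string literals are concatenated by the compiler, so the
     * whole template is one format string and the buffer is walked once. */
    int n = snprintf(buf, cap,
                     "GET /v1/printer/%s/orders.txt HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Authorization: Basic %s\r\n"
                     "\r\n",
                     imei, host, auth);

    if (n < 0 || (size_t)n >= cap) {
        return -1;   /* encoding error, or the output would have been truncated */
    }
    return n;
}
```

Having the whole request line in one literal also makes a stray double space easier to spot than when it is hidden in the middle of a chain of strcat calls.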

linuxlizard|11 years ago

The Arduino embedded C library (which I'm assuming they're using) isn't as rich as a Glibc or uclibc. Sometimes I have to fall back on very old school methods to build complex strings.

Dylan16807|11 years ago

O what now? O(7) is the same as O(1)

userbinator|11 years ago

I saw it right away - "that HTTP/1.1 looks a bit farther away than it should be..." - and confirmed it by selecting the spaces. I thought it would be a bit more subtle than that... I remember working with a server that violated the HTTP spec by not accepting allowed extra spaces in headers.

According to the new HTTP/1.1 RFC 7230, it should be a single space - the previous RFC didn't specify this clearly in the wording, although it is implied by the grammar (SP and not 1*SP).

https://tools.ietf.org/html/rfc7230#section-3.1.1

"A request-line begins with a method token, followed by a single space (SP), the request-target, another single space (SP), the protocol version, and ends with CRLF."

I'm surprised there doesn't seem to be any widely-used and easily available HTTP conformance checker - unlike the well-known HTML validators.

This is also why monospace fonts are ideal for seeing small but significant differences like this.

antoncohen|11 years ago

> I'm surprised there doesn't seem to be any widely-used and easily available HTTP conformance checker - unlike the well-known HTML validators.

There is one called Co-Advisor [1] that can be used to test web proxies. It is commercial and pretty expensive, but the online version might be free for open source projects. Squid and Apache Traffic Server are tested with it [2][3]. There was a USENIX talk that showed some Co-Advisor results [4].

1. http://coad.measurement-factory.com/details.html

2. http://wiki.squid-cache.org/Features/HTTP11

3. http://trafficserver.apache.org/acknowledgements

4. https://www.usenix.org/conference/lisa12/rolling-d2o-choosin... (at 31:16 into the video).

michaelmior|11 years ago

That's an interesting idea. It would be useful to have a Web server where the output is just a conformance check of the request. That might be a fun project for a rainy day :)

jlouis|11 years ago

This proves a very important pet peeve of mine: Your modern application has a highly dynamic operating point. There is no way you can deploy a system and expect it to be static for eternity. Back in the day with low interconnectivity you could. But today it is impossible.

When you build stacks on top of systems over which you have no direct control, you must be able to adapt your system. This means you can't statically deploy code without an upgrade path of one kind or another.

quesera|11 years ago

You're combining two issues.

Yes, if you let other people run your infrastructure, you are beholden to their operations decisions and schedules.

It is not impossible to design around that (new) problem, but it is sometimes expensive.

The trick is to know what external dependencies you have, and that is almost impossible to fully quantify in the XaaS and cloud model.

goleksiak|11 years ago

True but that doesn't bother me. Nothing is static on the web these days and everyone plays under the same rule set. Keeps things interesting...

mml|11 years ago

Cowboy is quite a well respected web server of the Erlang flavor. I'd guess Heroku rejiggered something in their stack, perhaps adding Cowboy as a reverse proxy or load balancer in front of their junk.

Cowboy apparently shot your no-good dirty sidewinding web requests in the face.

mqsiuser|11 years ago

It is well known that you can't (should not) rely on bugs (or internal APIs)

kirab|11 years ago

It's technically correct, according to the HTTP spec there must be a single "SP" character between the elements in the Request-Line:

Request-Line = Method SP Request-URI SP HTTP-Version CRLF

Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1
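To make the grammar concrete, here is a small strict check that only looks at the separators. It is a sketch under assumptions: `request_line_is_strict` is a made-up name, and it does not validate the method, the Request-URI, or the version string themselves, only that exactly one SP sits between each pair and that the line ends in CRLF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if line has the shape
 *   Method SP Request-URI SP HTTP-Version CRLF
 * with exactly one space (0x20) in each SP position and nothing after CRLF.
 */
bool request_line_is_strict(const char *line)
{
    const char *sp1 = strchr(line, ' ');
    if (sp1 == NULL || sp1 == line) {
        return false;                      /* no space at all, or empty method */
    }

    const char *sp2 = strchr(sp1 + 1, ' ');
    if (sp2 == NULL || sp2 == sp1 + 1) {
        return false;                      /* missing or empty Request-URI */
    }

    const char *crlf = strstr(sp2 + 1, "\r\n");
    if (crlf == NULL || crlf == sp2 + 1 || crlf[2] != '\0') {
        return false;                      /* no CRLF, empty version, or trailing data */
    }

    /* Any further space before CRLF means an extra separator somewhere. */
    return memchr(sp2 + 1, ' ', (size_t)(crlf - (sp2 + 1))) == NULL;
}
```

With this check, "GET /test HTTP/1.1\r\n" passes and the double-space variant from the article fails, because the second space of the pair ends up inside what should be the version token.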

Animats|11 years ago

Another broken network device which takes it upon itself to mess with TCP connections passing through.

I ran into this a few years ago with Coyote Point load balancers. It turns out that if you send HTTP headers to a Coyote Point load balancer, and the last header field is "User-agent", and that field ends with "m" but does not otherwise contain "m", the connection does not go through the load balancer.

Complaining to Coyote Point produced typical clueless responses such as "Upgrade your software". (The problem wasn't at my end, but at sites with Coyote Point devices. Fortunately, I knew someone who had a Coyote Point unit, and we were able to force the situation there.) I had our system ("Sitetruth.com site rating system", note the "m") put an unnecessary "Accept" header field at the end of the header to work around the problem.

Coyote Point's filtering software is regular-expression based, and I suspect that somewhere, there is a rule with a "\m" instead of "\n".

A current issue: there are some sites where, if you make three HTTP requests for the same URL from the same IP address in a short period, further requests are ignored for about 15 seconds. You can make this happen with three "wget" requests. Try "wget http://bitcointalk.org" three times in quick succession. Amusingly, this limiter only applies for HTTP sessions, not HTTPS.

Danieru|11 years ago

That series of strcat's caught my eye as bad practice. Fine in this case since the destination string is short but horrible in general. Every single one of those calls needs to iterate over the entire existing string to find the string size. The code could be much cleaner with a small macro hiding the incrementation and the casts.

userbinator|11 years ago

Basically it's an O(n^2) algorithm... well-known story about that here:

http://www.joelonsoftware.com/articles/fog0000000319.html

The design of strcat() itself is partially to blame for this - the return value could've been more useful, like the number of characters in the resulting string or a pointer to the end of the appended string so it could be used to chain concatenations, but instead they chose to return the exact same pointer that was passed in as the destination.
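A sketch of the "return a pointer to the end" idea, with a bounds check added. `append_str` is a hypothetical helper written for this example, not a standard function (POSIX `stpcpy` returns the end pointer in a similar way but does no bounds checking). Each call continues from where the previous one stopped, so the existing string is never rescanned.

```c
#include <stddef.h>

/*
 * Copies src to position cur inside a buffer whose one-past-the-end
 * address is end. Returns a pointer to the new terminating NUL so calls
 * can be chained, or NULL if src does not fit. A NULL cur is passed
 * through, so a chain of calls only needs one check at the end.
 */
char *append_str(char *cur, const char *end, const char *src)
{
    if (cur == NULL || cur >= end) {
        return NULL;               /* earlier failure, or no room even for the NUL */
    }
    while (*src != '\0') {
        if (cur + 1 >= end) {      /* need room for this byte plus the NUL */
            *cur = '\0';           /* leave the buffer terminated */
            return NULL;
        }
        *cur++ = *src++;
    }
    *cur = '\0';
    return cur;
}
```

Usage would look like `p = append_str(buf, buf + sizeof buf, "GET ");` followed by `p = append_str(p, buf + sizeof buf, path);` and so on, with a single `if (p == NULL)` after the last call.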

krakensden|11 years ago

Yeah- implicit concatenation + snprintf seems like the way to go. Although you'd have to calculate a length, I suspect avoiding that is the primary virtue of this approach.

mikeash|11 years ago

A sufficiently smart compiler could optimize a string of strcat calls to remove the redundant length finding. I have no idea if real compilers actually would....

kyberias|11 years ago

What's the deal with all the scrollbars on this page?

spindritf|11 years ago

Yeah, why is every image, heading, and paragraph on that page surrounded by scrollbars where most don't work and are not necessary?

chippy|11 years ago

I initially thought that it was because of a single whitespace character... but it's because

p { overflow: scroll; }

colinbartlett|11 years ago

It scares me to think all of these requests run over unencrypted HTTP.

davidrusu|11 years ago

Why? it's just pizza

robomartin|11 years ago

Kudos for sorting this out quickly. Problems like this one can be really difficult to debug.

I remember one case where the coefficient table for a polyphase FIR filter we implemented in an FPGA caused huge instability problems in a design. The coefficient table, if I remember correctly, was 32 wide (32 multipliers) and 128 phases long. That's 4096 numbers. The design had about 40 of these tables that would be loaded from firmware into FPGA registers in real time as needed. We built a tool in Excel to be able to compute these tables of FIR coefficients.

We got word from a customer that things were not behaving correctly under certain circumstances. We were able to reproduce the problem in the lab but could not find anything wrong with the FPGA, microcontroller or Excel code after about three weeks of work by three engineers. This quickly became a nightmare as it threatened several lucrative contracts and failed to service our existing customer base adequately.

I had to put our other two hardware engineers back to work on their existing projects so I took on the debugging process. This was the most intense debugging I've had to do in thirty years of software and hardware development. Lots at stake. The very reputation and financial well being of my business was at stake. Enter 18 hour days, 7 days a week.

FOUR MONTHS LATER, at 2:00 AM on a fine Sunday morning, without having slept for three days looking at code, the bug jumped out at me. We've all had that moment but this one was, well, "one of those". The problem? We used "ROUND()" instead of "ROUNDUP()" in a calculation that had nothing to do with the FIR filter coefficients but rather affected the programming of counters related to them. This caused timing errors in a state machine that drove the FIR filters. If this were software this would be exactly like having the wrong count in a loop counter. Yup.

I re-calculated after making the change and everything worked as advertised. That was the best Monday I've had in years. And I took a long vacation after that.

Over four months to find a bug.

That's why sometimes it is impossible and even unreasonable to create budgets for software development. One little bug can set you back weeks, if not months.

vvpan|11 years ago

Way to abuse :first-letter.

peterwwillis|11 years ago

Assuming the problem originates from something relating to eatabit's infrastructure, the important takeaway (for me) would be: Depend as little on 3rd parties as possible.

I know this is not a popular opinion among the HN crowd, mainly due to the entire web's love of linking to some other site's js/css to offload cost from their own site. But this makes no sense; you're not really reducing costs, you're just delaying them.

People talk about how 3rd parties speed up development or (potentially) reduce costs. But if the success of your business depends on providing a service all the time that has to be reliable, the reliability of your product is directly proportional to the reliability of the 3rd party. And each 3rd party adds additional points of failure. If you don't control whatever service or product the 3rd party is giving you, you will be unable to even attempt to isolate and fix it yourself.

Typically the answer to this problem is 'buy a better service contract'. But if the 3rd party doesn't provide 24/7 365 support along with multiple contact methods and harsh penalties for failing to supply you with timely service, you're wasting your money. You don't want to be the guy who has to tell the CIO "Sorry, I can't get a hold of our service provider or they aren't giving me timely updates, so I do not know when our product will be up again."

stevewilhelm|11 years ago

> Depend as little on 3rd parties as possible.

This attitude has many a startup reinventing and supporting commodity infrastructure instead of focusing on developing unique products and value for their customers.

KMag|11 years ago

When learning OCaml, I decided to write a little web client that would brute-force the password on my own home router. I wrote a client, and my router wasn't responding, so I tried having my client fetch pages from Yahoo, and it worked fine.

I fired up wireshark and saw that everything looked fine... except that all of my line terminators were shift-in-formfeed instead of carriage-return-newline. It turns out that OCaml uses decimal character escapes instead of octal. (This was back when I was under the impression that portable code avoided use of \n in string literals because someone who misunderstood text mode file handles had told me that Microsoft compilers expanded \n to \015\012.)

Apparently someone at Yahoo had experienced enough terribly terribly written web clients that they wrote their HTTP server to accept any two non-space whitespace characters as a line ending.
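The radix mix-up is easy to show from the C side, where `\ooo` escapes are octal. In this sketch the constant names and `is_crlf` are invented for illustration. `DECIMAL_READING` holds the two bytes you get when the same digits 015 and 012 are read as decimal, which is what the story above describes.

```c
#include <stdbool.h>

/* In C, \015 and \012 are octal: 13 (CR) and 10 (LF). */
const char CRLF_OCTAL[] = "\015\012";

/* Named or hex escapes sidestep the octal-versus-decimal question. */
const char CRLF_NAMED[] = "\r\n";
const char CRLF_HEX[]   = "\x0d\x0a";

/* The same digits read as decimal: 15 (shift-in) and 12 (form feed). */
const char DECIMAL_READING[] = { 15, 12, 0 };

/* True only for exactly CR LF followed by the terminating NUL (ASCII assumed). */
bool is_crlf(const char *s)
{
    return s[0] == 13 && s[1] == 10 && s[2] == '\0';
}
```

The first three constants all pass `is_crlf`; `DECIMAL_READING` does not, and a strict server would have every reason to reject it as a line terminator.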

jameshart|11 years ago

"our cellular printing api has printed over 9300 food orders for our client restaurants, stadiums and golf courses"

Am I the only one who read this as a system using 3D printing to print food? Disappointed to discover it's not that kind of cellular.

unknown|11 years ago

[deleted]

justinsb|11 years ago

Tangentially, why didn't curl escape the trailing space to %20?

jim_lawless|11 years ago

I experienced a similar problem with a POP3 utility that I had written years ago. I had been appending an extra space to the end of each text line (before the CRLF ).

There were a few people using this utility with no problems until one day a particular POP3 server no longer tolerated my utility's malformed requests.

weissadam|11 years ago

I have some advice. Hire a real C programmer. This code is _awful_ and probably full of vulns.

rcconf|11 years ago

I've had the same issues when developing with Flask in Python. I forgot to URL encode some query parameters and it worked fine with the local HTTP server.

But when I put nginx in front as a proxy, it denied all requests.
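For a hand-rolled client like the one in the article, the encoding step might look roughly like this. `percent_encode` is a hypothetical helper, not part of any library mentioned here. It leaves only the RFC 3986 unreserved characters (ASCII letters, digits, `-`, `.`, `_`, `~`) untouched and escapes every other byte, which is stricter than necessary but safe for a single query parameter value.

```c
#include <stddef.h>

/*
 * Percent-encodes src into dst (capacity cap, including the NUL).
 * Returns the encoded length, or -1 if dst is too small.
 */
int percent_encode(char *dst, size_t cap, const char *src)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        /* Explicit ASCII ranges, so the result does not depend on the locale. */
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') ||
                         c == '-' || c == '.' || c == '_' || c == '~';
        size_t need = unreserved ? 1 : 3;

        if (n + need + 1 > cap) {
            return -1;                 /* no room for this char plus the NUL */
        }
        if (unreserved) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        }
    }
    if (n + 1 > cap) {
        return -1;                     /* covers cap == 0 */
    }
    dst[n] = '\0';
    return (int)n;
}
```

A space becomes `%20` and an ampersand becomes `%26`, so a value such as "a b&c" can no longer be mistaken for a separator or for the end of the request target.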

stevekemp|11 years ago

The thttpd webserver doesn't handle requests with too many slashes either, which I only found out recently.

This is treated as an invalid request:

      http://example.com//robots.txt

ericcholis|11 years ago

Slightly off-topic, but this is why dev posts like this are important. I didn't know eatabit.com was a thing, and it sounds like a great service.

robogeek78|11 years ago

Dev #2 here, thanks for the compliment. Where are you? Maybe we should expand to your area. :)

kstrauser|11 years ago

If this were my team, I would be unsettled by the fact that we never caught it in testing. Did no one write tests to exercise this part of the app - the one where we're handcrafting HTTP requests?

Objectively, you need to write more tests. At the minimum, this bug should have a regression test so that it can never accidentally happen again (say when a dev merges an old branch in for whatever reason).

shortstuffsushi|11 years ago

What test would you have written to catch this? One that checks the exact contents of headers passed along? It's possible they even had tests around this, but were expecting the same output that they were inputting (copy+pasta). Perhaps they had a more "integration"-ee test that actually hit the web with that bad header. At the point they wrote it, that test would have been passing. It wasn't until the parsing server changed (to Cowboy, it seems) that the test would have started to fail.

imanaccount247|11 years ago

>Objectively, you need to write more tests

That is precisely the opposite of objective. Personal thoughts, feelings and opinions are subjective by definition. 2+2=4 is objective. "You need to put more cheese on that pizza" is subjective.

unknown|11 years ago

[deleted]

cleanCodeAtWork|11 years ago

Are there any languages out there that handle scale and many connections like Erlang does, but with an easier to swallow syntax?

hlieberman|11 years ago

Erlang. The syntax really isn't that bad, once you get over the initial shock. In all honesty, grasping that the variables are immutable and how you need to change your thinking is much more difficult than the syntax itself.

angersock|11 years ago

http://elixir-lang.org/

It's the Erlang VM you love, but with the Ruby syntax we all enjoy!

lmm|11 years ago

I've been very happy writing these things in Scala using Spray. Honestly there are plenty of event-driven I/O frameworks in many languages, and almost as many green-threading systems. The Erlang supervision system and the ability to replace code on the fly, not so much.

buster|11 years ago

http://elixir-lang.org/

mikeklaas|11 years ago

q

cofcdylan|11 years ago

i'm just glad my city made it to HN.

goleksiak|11 years ago

Charleston represent!

205 comments