top | item 6256193

Linux may have been causing USB disconnects

331 points | chalst | 12 years ago | plus.google.com | reply

116 comments

[+] bryanlarsen|12 years ago|reply
Note that this bug was found because the software engineer talked to a hardware engineer.

Props to Intel for hiring leading Linux developers and turning them loose.

[+] raverbashing|12 years ago|reply
Really

As an EE turned "software engineer" this bothers me, a lot.

I like the EE part of it, but I prefer things that change more easily and are more "playful" (not to mention that today hardware is at the mercy of software, so you take the reference design and go with it).

But I've run into situations where I uncovered a HW bug (in the chip's reference board implementation, no less) that only manifested itself because of something specific in software (in the HDMI standard - or rather, things the standard inherited from the likes of VESA).

The software engineer sees ports/memory to be written to and doesn't know what happens behind them.

The hardware engineer sees the "chip" and its connections but doesn't realise the rabbit hole goes deeper: "ah, this is a simple USB device, only 8 pins" - now try communicating with it.

[+] miga|12 years ago|reply
It is always a reason to celebrate when one engineer successfully communicates with another of a different specialty. Big kudos to Intel for actually encouraging them to do so!
[+] fixedd|12 years ago|reply
Sarah's pretty sharp. IIRC, she single-handedly built Linux's USB 3 support.
[+] makomk|12 years ago|reply
Wait, if I'm reading this correctly there's no safe resume recovery time which can be guaranteed not to cause devices to drop off the bus. The kernel could wait 10 minutes and devices could still require more than that. That seems like a pretty major issue with the USB specification.
[+] nknighthb|12 years ago|reply
If you issue a database query, you have no particular guarantee that it's going to complete in any finite amount of time. At some point, you simply throw up your hands and say it would be unreasonable to wait any longer, and accept the resulting error condition.
[+] JoeAltmaier|12 years ago|reply
The hub knows when the device is ready; just query it. A constant timeout is not needed. A give-up timeout might be employed, but there's no reason that can't be hundreds of ms; nobody is waiting on that, and it doesn't usually happen anyway.
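A minimal sketch of that approach (plain Python with a simulated device standing in for the real hub-status query, so all names here are hypothetical): honor the 10 ms spec minimum, then poll readiness with a generous give-up timeout rather than a single fixed wait.

```python
import time

RECOVERY_MIN_S = 0.010   # USB 2.0 spec: software must wait at least 10 ms
GIVE_UP_S = 0.500        # generous cap; a few hundred ms bothers nobody

def wait_for_resume(port_ready, min_wait=RECOVERY_MIN_S, give_up=GIVE_UP_S):
    """Wait out the mandatory recovery interval, then poll the hub until
    the device reports ready or the give-up timeout expires."""
    time.sleep(min_wait)                  # spec minimum: never ask earlier
    deadline = time.monotonic() + give_up
    while time.monotonic() < deadline:
        if port_ready():                  # e.g. query the hub's port status
            return True
        time.sleep(0.001)                 # poll once a millisecond
    return False                          # device really is gone

# Simulated device that needs 50 ms to wake up; a hard-coded 10 ms wait
# would declare it disconnected, while polling with a cap tolerates it.
t0 = time.monotonic()
slow_device = lambda: time.monotonic() - t0 > 0.050
```

The give-up timeout only matters on the rare failure path, which is why it can be so much larger than the common-case wait.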
[+] kalleboo|12 years ago|reply
Be sure to check out the mailing list post linked from the G+ post which contains more technical details and proposed fixes http://marc.info/?l=linux-usb&m=137714769606183&w=2
[+] milliams|12 years ago|reply
Why on earth is all the text on that website set to font-weight:600 and using Courier New of all fonts? Incredibly hard to read.
[+] alexchamberlain|12 years ago|reply
We should applaud them for standing up and saying "Hey, we cocked up, sorry!"
[+] valisystem|12 years ago|reply
Definitely yes. Them or others, whatever. And not for every issue. But I want to hear about issues that have been worrying people for a long time and/or have annoyed a lot of people, or that are very complex and shady. And the only way to encourage communication about bugs is to congratulate people for their fixes/problem isolation.
[+] scrrr|12 years ago|reply
Nothing wrong with that statement, except I think it should be a given. Do we need a pat on the back for doing the right thing? :)
[+] bluesign|12 years ago|reply
This is the wrong interpretation, actually.

There is no "maximum" for a reason: it should be read as "hey, hardware developer, you will have a guaranteed 10 ms from the system software to resume". If you don't wake up in 10 ms, you are clearly violating the spec.

9.2.6.2 states: After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval. After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.

[+] delinka|12 years ago|reply
Nothing there says that the hardware must be ready at or after 10ms. It simply says that software can't ask for anything before 10ms is up. Software has to wait 10ms, and then might have to wait longer.
[+] nly|12 years ago|reply
The true intention of the spec is academic at this point. There are millions upon millions of devices out there with one interpretation and they're not changing. Linux can either increase the grace period or be tarnished as having bad USB suspend.
[+] annnnd|12 years ago|reply
Congrats! But that nobody analysed this bug for 8+ years is a bit of a mystery to me...
[+] RyanZAG|12 years ago|reply
Well, the Linux USB maintainer has spent the last month or so trying to get Linus to be more polite, so I guess those kinds of things have a higher priority!

I kid, I kid...

The reason is that it is incredibly difficult to link the disconnect to the cause, as the 10ms is likely sufficient in 99% of cases - until it suddenly isn't. This means that you could be running test cases on a certain device for a year, and suddenly the test will fail the day after. When the test case mysteriously fails randomly like that on only a subset of devices, the assumption is that the hardware is faulty. These kinds of failures would likely be more frequent on lower-quality, less optimized hardware as well, furthering the perception.

As far as I can tell, the reason this is fixed now is because known good hardware from Intel started exhibiting the same error which got people at Intel to track it down directly, as they knew it wasn't their hardware at fault.

[+] JoeAltmaier|12 years ago|reply
Because nobody cares about suspend-resume power mgmt. If it doesn't work, curse it, pull it out and put it back in again, voila it works.

The people who really care about and study the spec, are those who have to support fixed devices i.e. USB devices internal to an appliance. They physically cannot be removed by the user. So suspend/resume has to work.

Embedded programmers have to deal with totally-broken drivers/specs all the time. There are probably 100s of folks who knew about this and dealt with it (bumped the timeout in their embedded kernel to match the devices they support) and never said anything to anybody.

[+] smackfu|12 years ago|reply
When cheap hardware acts like it doesn't follow the spec, no one digs too deep, because it's always going to be quite frequent, and there's nothing you can do about it. It's very rare that it turns out to actually have been following the spec, and you had the spec wrong. That's the practical reason.
[+] xradionut|12 years ago|reply
I can't speak for kernel developers, but when you have complex and large codebase running on a huge variety of hardware, you will have some edge cases that are rare or difficult to debug. And I don't envy the folks that have to interface directly with hardware, I have enough fun in database land...
[+] 16s|12 years ago|reply
Why is that variable set at 10? Who would question that?

The spec says 10 too. It's the "at least 10" part that was missed. That's very subtle, does not stand out, and is easily overlooked unless someone is really auditing code and reading specs carefully.

[+] kbart|12 years ago|reply
Take a look at Kernel USB source code. I did. Once.
[+] oakwhiz|12 years ago|reply
This is a very interesting type of bug that I have often seen cropping up around hardware interfaces in microcontrollers.
[+] ape4|12 years ago|reply
It's a good thing that Linux is open and transparent. Good to admit a bug (and exactly what it is) rather than silently denying it and then possibly fixing it.

Also, somebody uses Google+ ?

[+] davidw|12 years ago|reply
For whatever reason, there seem to be a number of Linux people on Google Plus, including Linus Torvalds.
[+] foobarqux|12 years ago|reply
What are the conditions where this problem manifests?

I have a Das Keyboard that sporadically becomes unresponsive until I unplug it and plug it back in. How do I know if my problem is caused by the issue described in the article?

[+] blaenk|12 years ago|reply
For what it's worth, I too have a Das Keyboard (Ultimate) and I don't experience this problem (Arch 64-bit).

Hopefully that helps narrow down your issue.

[+] baq|12 years ago|reply
Does it happen after a resume from sleep?
[+] miga|12 years ago|reply
Good that we have a fix for a bug that has been pestering me for quite a long time. As for the maximum timeout, I believe that a maximum timeout in sysfs with a default of 1s should satisfy most people (unless anyone wants a per-device max wait time?)
[+] Fuxy|12 years ago|reply
50ms should be quite enough, I think. That's 5x the minimum, more than any proper device should ask for. If you want to be extreme you can make it 100ms, but any more than that is way too extreme.
[+] codex|12 years ago|reply
This is alarming. If the issue really was that simple, it strongly indicates that Linux kernel developers don't put a lot of effort into investigating problems whenever a convenient scapegoat--faulty hardware--is available. For shame.
[+] T3RMINATED|12 years ago|reply
You are probably going to get the middle finger from Linus Torvalds, and he will say it was built like this by design and you're wrong.
[+] Jugurtha|12 years ago|reply
Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.
[+] exDM69|12 years ago|reply
> Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.

There was a mention about this in the OP. There were no interrupts for this state transition in USB prior to USB3. "The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state."

In addition, a lot of hardware initialization is based on delays and polling by design.
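A toy illustration of the two models (plain Python with hypothetical names, not actual xHCI/EHCI code): the interrupt-style host blocks until the "hardware" signals the transition, while the polling-style host has to pick both a loop interval and a give-up time itself.

```python
import threading
import time

# Interrupt style (xHCI-like): the 'hardware' signals an event, and the
# software simply blocks on it, with an optional timeout.
port_active = threading.Event()

def wait_interrupt(timeout=0.5):
    return port_active.wait(timeout)

# Polling style (EHCI-like): the software repeatedly reads a status flag
# and must decide for itself how often to look and when to give up.
def wait_polling(read_status, timeout=0.5, interval=0.001):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_status():
            return True
        time.sleep(interval)
    return False

# Simulate the port going active 20 ms from now.
threading.Timer(0.020, port_active.set).start()
```

With the event, the wake-up latency is bounded only by the timeout argument; with polling, it is also quantized by the poll interval, which is part of why fixed delays creep into this kind of code.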

[+] alexchamberlain|12 years ago|reply
That's not entirely fair; standards are full of hard coded values.
[+] zwdr|12 years ago|reply
USB 3.0 uses interrupts, but if you're trying that with 2.0 you're gonna have to wait for those interrupts a _long_ time.
[+] JoeAltmaier|12 years ago|reply
Agreed. There's no reason to slavishly imitate a spec, when you can be more generous or better yet just test and wait.

Embedded programmers know this; you can't ship working appliances without dealing with these issues.

[+] rustynails|12 years ago|reply
Now can someone fix the embarrassing network bug in Linux? You know, where if you access a link to a network, or access an open networked path after 255 seconds or so... you receive a network error. It still beggars belief that such a fundamental aspect of Linux is broken... When I show Linux to newbies and this fault occurs (i.e. 100% of the time), I simply say "Linux isn't perfect..." But inside, I cringe...

It occurs under all distros I've tried and it's been there for years. Even on different computers with different hardware...

[+] vidarh|12 years ago|reply
I have TCP connections that have stayed open for months so this is highly unlikely to be a kernel issue. I have no idea what you might be running into as I've never seen anything like that occurring.
[+] viraptor|12 years ago|reply
> You know, where if you access a link to a network, or access an open networked path after 255 seconds or so ... You receive a network error.

I don't even know what this means. What's the actual, reproducible scenario?

[+] joosters|12 years ago|reply
If you are connecting to a remote system, it could be NAT configured badly on your router.

The router provided by my ISP (Virgin Media) is ruthless at closing idle TCP connections after only a few minutes. I'd see this with idle SSH logins being closed all the time.

The solution (for me at least) was to ensure connections used TCP keepalives, and vastly decrease the keepalive times (various sysctl calls, I don't have the details to hand).
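For reference, a sketch of the per-socket version of that fix in Python (the parameter values are illustrative; the TCP_KEEP* constants are Linux-specific, hence the hasattr guards):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalives for one socket, overriding the very long
    system defaults so a NAT box doesn't silently drop the idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the net.ipv4.tcp_keepalive_* sysctls (Linux-only).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s, idle=60)
```

Setting it per socket has the advantage of not changing the sysctl defaults for every other connection on the machine.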

[+] username42|12 years ago|reply
Never had such an error. Even in Linux 0.99pl10 (in 1992), the network was working fine. I have always made heavy use of remote access to X servers. If the network were broken, Linux would have been unusable.
[+] octo_t|12 years ago|reply
backing the previous poster here, I've never experienced anything like this at all. "Network error" makes it sound like you're using KDE or GNOME or something, or Samba isn't liking your configuration...
[+] eksith|12 years ago|reply
That's odd.

Have you eliminated all other variables, e.g. a common utility/setting, cabling, switch/router, etc.? I've had paths open for much longer than that without access issues.

[+] mh-|12 years ago|reply
> where if you access a link to a network, or access an open networked path after 255 seconds or so

this terminology doesn't really make sense in the context of a TCP connection.

but, anyway, check:

    net.ipv4.tcp_keepalive_time
    net.ipv4.tcp_keepalive_probes
    net.ipv4.tcp_keepalive_intvl
docs for the values here: https://www.kernel.org/doc/Documentation/networking/ip-sysct...
[+] stinos|12 years ago|reply
Not sure what you mean exactly, but one thing that has caused tons of trouble here, with sometimes the sole solution being a restart (OK, our IT maintainer might be doing something wrong, yet...), is the opposite: take a bunch of workstations and a bunch of servers, put home directories and data on the servers, then tie everything together using NFS shares. Run analysis and whatnot on the data. Then make a server go down somehow and watch all the workstations get completely locked up, seemingly without ever generating any kind of timeout error, instead waiting endlessly on a dead connection.