top | item 6256193

Linux may have been causing USB disconnects

331 points | chalst | 12 years ago | plus.google.com | reply

116 comments

[+] bryanlarsen|12 years ago|reply
Note that this bug was found because the software engineer talked to a hardware engineer.

Props to Intel for hiring leading Linux developers and turning them loose.

[+] raverbashing|12 years ago|reply
Really

As an EE turned "software engineer" this bothers me, a lot.

I like the EE part of it, but I prefer things that change more easily and are more "playful" (not to mention that today hardware is at the mercy of software, so you take the reference design and go with it).

But I've run into situations where I uncovered a HW bug (in the chip's reference board implementation, no less) that only manifested itself because of something specific in software (in the HDMI standard - or rather, things the standard inherited from the likes of VESA).

The software engineer sees ports/memory to be written to and doesn't know what happens behind them.

The hardware engineer sees the "chip" and its connections but doesn't realise the rabbit hole goes deeper: "ah, this is a simple USB device, only 8 pins" - now try communicating with it.

[+] miga|12 years ago|reply
It is always a reason to celebrate when one engineer successfully communicates with another of a different specialty. Big kudos to Intel for actually encouraging them to do so!
[+] fixedd|12 years ago|reply
Sarah's pretty sharp. IIRC, she single-handedly built Linux's USB 3 support.
[+] makomk|12 years ago|reply
Wait, if I'm reading this correctly there's no safe resume recovery time which can be guaranteed not to cause devices to drop off the bus. The kernel could wait 10 minutes and devices could still require more than that. That seems like a pretty major issue with the USB specification.
[+] nknighthb|12 years ago|reply
If you issue a database query, you have no particular guarantee that it's going to complete in any finite amount of time. At some point, you simply throw up your hands and say it would be unreasonable to wait any longer, and accept the resulting error condition.
[+] JoeAltmaier|12 years ago|reply
The hub knows when the device is ready; just query it. A constant timeout is not needed. A give-up timeout might be employed, but there's no reason that can't be hundreds of ms; nobody is waiting on that, and it doesn't usually happen anyway.
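A minimal sketch of that approach (plain Python with a simulated device standing in for the real hub-status query, so all names here are hypothetical): honor the 10 ms spec minimum, then poll readiness with a generous give-up timeout rather than a single fixed wait.

```python
import time

RECOVERY_MIN_S = 0.010   # USB 2.0 spec: software must wait at least 10 ms
GIVE_UP_S = 0.500        # generous cap; a few hundred ms bothers nobody

def wait_for_resume(port_ready, min_wait=RECOVERY_MIN_S, give_up=GIVE_UP_S):
    """Wait out the mandatory recovery interval, then poll the hub until
    the device reports ready or the give-up timeout expires."""
    time.sleep(min_wait)                  # spec minimum: never ask earlier
    deadline = time.monotonic() + give_up
    while time.monotonic() < deadline:
        if port_ready():                  # e.g. query the hub's port status
            return True
        time.sleep(0.001)                 # poll once a millisecond
    return False                          # device really is gone

# Simulated device that needs 50 ms to wake up; a hard-coded 10 ms wait
# would declare it disconnected, while polling with a cap tolerates it.
t0 = time.monotonic()
slow_device = lambda: time.monotonic() - t0 > 0.050
```

The give-up timeout only matters on the rare failure path, which is why it can be so much larger than the common-case wait.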
[+] kalleboo|12 years ago|reply
Be sure to check out the mailing list post linked from the G+ post which contains more technical details and proposed fixes http://marc.info/?l=linux-usb&m=137714769606183&w=2
[+] milliams|12 years ago|reply
Why on earth is all the text on that website set to font-weight:600 and using Courier New of all fonts? Incredibly hard to read.
[+] alexchamberlain|12 years ago|reply
We should applaud them for standing up and saying "Hey, we cocked up, sorry!"
[+] valisystem|12 years ago|reply
Definitely yes. Them or others, whatever. And not for every issue. But I want to hear about issues that have been worrying people for a long time and/or have annoyed a lot of people, or that are very complex and shady. And the only way to encourage communication about bugs is to congratulate people for their fixes/problem isolation.
[+] scrrr|12 years ago|reply
Nothing wrong with that statement, except I think it should be a given. Do we need a pat on the back for doing the right thing? :)
[+] bluesign|12 years ago|reply
This is the wrong interpretation, actually.

There is no "maximum" for a reason: it should be read as "hey, hardware developer, you will have a guaranteed 10 ms from the system software to resume". If you don't wake up in 10 ms, you are clearly violating the spec.

9.2.6.2 states: After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval. After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.

[+] delinka|12 years ago|reply
Nothing there says that the hardware must be ready at or after 10ms. It simply says that software can't ask for anything before 10ms is up. Software has to wait 10ms, and then might have to wait longer.
[+] nly|12 years ago|reply
The true intention of the spec is academic at this point. There are millions upon millions of devices out there with one interpretation and they're not changing. Linux can either increase the grace period or be tarnished as having bad USB suspend.
[+] annnnd|12 years ago|reply
Congrats! But that nobody analysed this bug for 8+ years is a bit of a mystery to me...
[+] RyanZAG|12 years ago|reply
Well, the Linux USB maintainer has spent the last month or so trying to get Linus to be more polite, so I guess those kinds of things have a higher priority!

I kid, I kid...

The reason is that it is incredibly difficult to link the disconnect to the cause, as the 10ms is likely sufficient in 99% of cases - until it suddenly isn't. This means that you could be running test cases on a certain device for a year, and suddenly the test will fail the day after. When the test case mysteriously fails randomly like that on only a subset of devices, the assumption is that the hardware is faulty. These kinds of failures would likely be more frequent on lower-quality, less optimized hardware as well, furthering the perception.

As far as I can tell, the reason this is fixed now is because known good hardware from Intel started exhibiting the same error which got people at Intel to track it down directly, as they knew it wasn't their hardware at fault.

[+] JoeAltmaier|12 years ago|reply
Because nobody cares about suspend-resume power mgmt. If it doesn't work, curse it, pull it out and put it back in again, voila it works.

The people who really care about and study the spec, are those who have to support fixed devices i.e. USB devices internal to an appliance. They physically cannot be removed by the user. So suspend/resume has to work.

Embedded programmers have to deal with totally-broken drivers/specs all the time. There are probably 100s of folks who knew about this and dealt with it (bumped the timeout in their embedded kernel to match the devices they support) and never said anything to anybody.

[+] smackfu|12 years ago|reply
When cheap hardware acts like it doesn't follow the spec, no one digs too deep, because it's always going to be quite frequent, and there's nothing you can do about it. It's very rare that it turns out to actually have been following the spec, and you had the spec wrong. That's the practical reason.
[+] xradionut|12 years ago|reply
I can't speak for kernel developers, but when you have complex and large codebase running on a huge variety of hardware, you will have some edge cases that are rare or difficult to debug. And I don't envy the folks that have to interface directly with hardware, I have enough fun in database land...
[+] 16s|12 years ago|reply
Why is that variable set at 10? Who would question that?

The spec says 10 too. It's the "at least 10" part that was missed. That's very subtle, does not stand out, and is easily overlooked unless someone is really auditing code and reading specs carefully.

[+] kbart|12 years ago|reply
Take a look at Kernel USB source code. I did. Once.
[+] oakwhiz|12 years ago|reply
This is a very interesting type of bug that I have often seen cropping up around hardware interfaces in microcontrollers.
[+] ape4|12 years ago|reply
It's a good thing that Linux is open and transparent. Good to admit a bug (and exactly what it is) rather than silently denying it and then possibly fixing it.

Also, somebody uses Google+ ?

[+] davidw|12 years ago|reply
For whatever reason, there seem to be a number of Linux people on Google Plus, including Linus Torvalds.
[+] foobarqux|12 years ago|reply
What are the conditions where this problem manifests?

I have a Das Keyboard that sporadically becomes unresponsive until I unplug it and plug it back in. How do I know if my problem is caused by the issue described in the article?

[+] blaenk|12 years ago|reply
For what it's worth, I too have a Das Keyboard (Ultimate) and I don't experience this problem (Arch 64-bit).

Hopefully that helps narrow down your issue.

[+] baq|12 years ago|reply
Does it happen after a resume from sleep?
[+] miga|12 years ago|reply
Good that we have a fix for a bug that has been pestering me for quite a long time. As for the maximum timeout, I believe that a maximum timeout in sysfs with a default of 1s should satisfy most people (unless anyone wants a per-device max wait time?)
[+] Fuxy|12 years ago|reply
50ms should be quite enough, I think. That's 5x the minimum, more than any proper device should ask for. If you want to be extreme you can make it 100ms, but any more than that is way too extreme.
[+] codex|12 years ago|reply
This is alarming. If the issue really was that simple, it strongly indicates that Linux kernel developers don't put a lot of effort into investigating problems whenever a convenient scapegoat--faulty hardware--is available. For shame.
[+] T3RMINATED|12 years ago|reply
You are probably going to get the middle finger from Linus Torvalds, and he will say it was built like this by design and you're wrong.
[+] Jugurtha|12 years ago|reply
Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.
[+] exDM69|12 years ago|reply
> Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.

There was a mention about this in the OP. There were no interrupts for this state transition in USB prior to USB3. "The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state."

In addition, a lot of hardware initialization is based on delays and polling by design.
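A toy illustration of the two models (plain Python with hypothetical names, not actual xHCI/EHCI code): the interrupt-style host blocks until the "hardware" signals the transition, while the polling-style host has to pick both a loop interval and a give-up time itself.

```python
import threading
import time

# Interrupt style (xHCI-like): the 'hardware' signals an event, and the
# software simply blocks on it, with an optional timeout.
port_active = threading.Event()

def wait_interrupt(timeout=0.5):
    return port_active.wait(timeout)

# Polling style (EHCI-like): the software repeatedly reads a status flag
# and must decide for itself how often to look and when to give up.
def wait_polling(read_status, timeout=0.5, interval=0.001):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_status():
            return True
        time.sleep(interval)
    return False

# Simulate the port going active 20 ms from now.
threading.Timer(0.020, port_active.set).start()
```

With the event, the wake-up latency is bounded only by the timeout argument; with polling, it is also quantized by the poll interval, which is part of why fixed delays creep into this kind of code.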

[+] alexchamberlain|12 years ago|reply
That's not entirely fair; standards are full of hard coded values.
[+] zwdr|12 years ago|reply
USB 3.0 uses interrupts, but if you're trying that with 2.0 you're gonna have to wait for those interrupts a _long_ time.
[+] JoeAltmaier|12 years ago|reply
Agreed. There's no reason to slavishly imitate a spec, when you can be more generous or better yet just test and wait.

Embedded programmers know this; you can't ship working appliances without dealing with these issues.

[+] rustynails|12 years ago|reply
Now can someone fix the embarrassing network bug in Linux? You know, where if you access a link to a network, or access an open networked path after 255 seconds or so... you receive a network error. It still beggars belief that such a fundamental aspect of Linux is broken... When I show Linux to newbies and this fault occurs (i.e. 100% of the time), I simply say "Linux isn't perfect..." But inside, I cringe...

It occurs under all distros I've tried and it's been there for years. Even on different computers with different hardware...

[+] vidarh|12 years ago|reply
I have TCP connections that have stayed open for months so this is highly unlikely to be a kernel issue. I have no idea what you might be running into as I've never seen anything like that occurring.
[+] viraptor|12 years ago|reply
> You know, where if you access a link to a network, or access an open networked path after 255 seconds or so ... You receive a network error.

I don't even know what this means. What's the actual, reproducible scenario?

[+] joosters|12 years ago|reply
If you are connecting to a remote system, it could be NAT configured badly on your router.

The router provided by my ISP (Virgin Media) is ruthless at closing idle TCP connections after only a few minutes. I'd see this with idle SSH logins being closed all the time.

The solution (for me at least) was to ensure connections used TCP keepalives, and vastly decrease the keepalive times (various sysctl calls, I don't have the details to hand).
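For reference, a sketch of the per-socket version of that fix in Python (the parameter values are illustrative; the TCP_KEEP* constants are Linux-specific, hence the hasattr guards):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalives for one socket, overriding the very long
    system defaults so a NAT box doesn't silently drop the idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the net.ipv4.tcp_keepalive_* sysctls (Linux-only).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s, idle=60)
```

Setting it per socket has the advantage of not changing the sysctl defaults for every other connection on the machine.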

[+] username42|12 years ago|reply
Never had such an error. Even in Linux 0.99pl10 (in 1992), the network was working fine. I have always made heavy use of remote access to X servers. If the network were broken, Linux would have been unusable.
[+] octo_t|12 years ago|reply
backing the previous poster here, I've never experienced anything like this at all. "Network error" makes it sound like you're using KDE or GNOME or something, or Samba isn't liking your configuration...
[+] eksith|12 years ago|reply
That's odd.

Have you eliminated all other variables, e.g. a common utility/setting, cabling, switch/router, etc.? I've had paths open for much longer than that without access issues.

[+] mh-|12 years ago|reply
> where if you access a link to a network, or access an open networked path after 255 seconds or so

this terminology doesn't really make sense in the context of a TCP connection.

but, anyway, check:

    net.ipv4.tcp_keepalive_time
    net.ipv4.tcp_keepalive_probes
    net.ipv4.tcp_keepalive_intvl
docs for the values here: https://www.kernel.org/doc/Documentation/networking/ip-sysct...
[+] stinos|12 years ago|reply
Not sure what you mean exactly, but one thing that has caused tons of trouble here, with sometimes the sole solution being a restart (OK, our IT maintainer might be doing something wrong, yet...), is the opposite: take a bunch of workstations and a bunch of servers, put home directories and data on the servers, then tie everything together using NFS shares. Run analysis and whatnot on the data. Then make a server go down somehow and watch all the workstations get completely locked up, seemingly without ever generating any kind of timeout error, instead waiting endlessly on a dead connection.