As an EE turned "software engineer", this bothers me a lot.
I like the EE part of it, but I prefer things that change more easily and are more "playful" (not to mention that today hardware is at the mercy of software, so you take the reference design and go with it).
But I've run into situations where I uncovered a HW bug (in the chip's reference board implementation, no less) that only manifested itself because of something specific in software (in the HDMI standard - or rather, things the standard inherited from the likes of VESA).
The software engineer sees ports/memory to be written to and doesn't know what happens behind them.
The hardware engineer sees the "chip" and its connections but doesn't realise the rabbit hole goes deeper: "ah, this is a simple USB device, only 8 pins". Now try communicating with it.
It is always a reason to celebrate when one engineer successfully communicates with another of a different specialty. Big kudos to Intel for actually encouraging them to do so!
Wait, if I'm reading this correctly there's no safe resume recovery time which can be guaranteed not to cause devices to drop off the bus. The kernel could wait 10 minutes and devices could still require more than that. That seems like a pretty major issue with the USB specification.
If you issue a database query, you have no particular guarantee that it's going to complete in any finite amount of time. At some point, you simply throw up your hands and say it would be unreasonable to wait any longer, and accept the resulting error condition.
The hub knows when the device is ready; just query it. A constant timeout is not needed. A give-up timeout might be employed, but there's no reason that can't be hundreds of ms; nobody is waiting on that, and it doesn't usually happen anyway.
Definitely yes. Them or others, whatever. And not for every issue. But I want to hear about issues that have been worrying people for a long time and/or annoyed a lot of people, or that are very complex and shady. And the only way to encourage communication about bugs is to congratulate people for their fixes/problem isolation.
There is no "maximum" for a reason. Because it should be evaluated as "hey hardware developer, you will have guaranteed 10 ms from System Software to resume". If you don't wake up in 10ms, you are clearly violating the spec.
9.2.6.2 states:
After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval.
After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.
Nothing there says that the hardware must be ready at or after 10ms. It simply says that software can't ask for anything before 10ms is up. Software has to wait 10ms, and then might have to wait longer.
The true intention of the spec is academic at this point. There are millions upon millions of devices out there with one interpretation and they're not changing. Linux can either increase the grace period or be tarnished as having bad USB suspend.
Well, the Linux USB maintainer has spent the last month or so trying to get Linus to be more polite, so I guess those kinds of things have a higher priority!
I kid, I kid...
The reason is that it is incredibly difficult to link the disconnect to its cause, since 10 ms is sufficient in 99% of cases - until it suddenly isn't. You could be running test cases on a certain device for a year, and then the test fails the day after. When a test case mysteriously fails at random like that, on only a subset of devices, the assumption is that the hardware is faulty. These kinds of failures are also likely more frequent on lower-quality, less-optimized hardware, furthering that perception.
As far as I can tell, the reason this is fixed now is because known good hardware from Intel started exhibiting the same error which got people at Intel to track it down directly, as they knew it wasn't their hardware at fault.
Because nobody cares about suspend-resume power mgmt. If it doesn't work, curse it, pull it out and put it back in again, voila it works.
The people who really care about and study the spec are those who have to support fixed devices, i.e. USB devices internal to an appliance. They physically cannot be removed by the user, so suspend/resume has to work.
Embedded programmers have to deal with totally-broken drivers/specs all the time. There are probably hundreds of folks who knew about this, dealt with it (bumped the timeout in their embedded kernel to match the devices they support) and never said anything to anybody.
When cheap hardware acts like it doesn't follow the spec, no one digs too deep, because it's always going to be quite frequent, and there's nothing you can do about it. It's very rare that it turns out to actually have been following the spec, and you had the spec wrong. That's the practical reason.
I can't speak for kernel developers, but when you have complex and large codebase running on a huge variety of hardware, you will have some edge cases that are rare or difficult to debug. And I don't envy the folks that have to interface directly with hardware, I have enough fun in database land...
Why is that variable set at 10? Who would question that?
The spec says 10 too. It's the "at least 10" part that was missed. That's very subtle, does not stand out and is easily over-looked unless someone is really auditing code and reading specs carefully.
What are the conditions where this problem manifests?
I have a Das Keyboard that sporadically becomes unresponsive until I unplug it and plug it back in. How do I know if my problem is caused by the issue described in the article?
Good, we have a fix for a bug that has been pestering me for quite a long time. As for a maximum timeout, I believe a maximum timeout in sysfs with a default of 1 s should satisfy most people (unless anyone wants a per-device max wait time?).
50 ms should be quite enough, I think. That's 5x the minimum, more than any proper device should ask for.
If you want to be extreme you can make it 100 ms, but any more than that is way too extreme.
This is alarming. If the issue really was that simple, it strongly indicates that Linux kernel developers don't put a lot of effort into investigating problems whenever a convenient scapegoat--faulty hardware--is available. For shame.
Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.
> Well, that's why you don't hardcode a magic value, nor do you continuously poll the state of a device and rely instead on interrupts: That's what they're made for.
There was a mention about this in the OP. There were no interrupts for this state transition in USB prior to USB3. "The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state."
In addition, a lot of hardware initialization is based on delays and polling by design.
Now can someone fix the embarrassing network bug in Linux? You know, where if you access a link to a network, or access an open networked path after 255 seconds or so... you receive a network error. It still beggars belief that such a fundamental aspect of Linux is broken...
When I show Linux to newbies and this fault occurs (ie. 100% of the time), I simply say "Linux isn't perfect ..." But inside, I cringe...
It occurs under all distros I've tried and it's been there for years. Even on different computers with different hardware...
I have TCP connections that have stayed open for months so this is highly unlikely to be a kernel issue. I have no idea what you might be running into as I've never seen anything like that occurring.
If you are connecting to a remote system, it could be NAT configured badly on your router.
The router provided by my ISP (Virgin Media) is ruthless at closing idle TCP connections after only a few minutes. I'd see this with idle SSH logins being closed all the time.
The solution (for me at least) was to ensure connections used TCP keepalives, and vastly decrease the keepalive times (various sysctl calls, I don't have the details to hand).
Never had such an error. Even in Linux 0.99pl10 (in 1992), the network was working fine. I have always made heavy use of remote access to X servers. If the network were broken, Linux would have been unusable.
backing the previous poster here, I've never experienced anything like this at all. "Network error" makes it sound like you're using KDE or GNOME or something, or Samba isn't liking your configuration...
Have you eliminated all other variables, e.g. common utility/setting, cabling, switch/router, etc.? I've had paths open for much longer than that without access issues.
Not sure what you mean exactly, but one thing that has caused tons of trouble here, with sometimes the sole solution being a restart (OK, our IT maintainer might be doing something wrong, yet...), is the opposite: take a bunch of workstations and a bunch of servers, put home directories and data on the servers, then tie everything together using NFS shares. Run analysis and whatnot on the data. Then make the server go down somehow, and watch all the workstations get completely locked up, seemingly without ever generating any kind of timeout error, instead waiting endlessly on a dead connection.
Props to Intel for hiring leading Linux developers and turning them loose.
Also, somebody uses Google+?
Embedded programmers know this; you can't ship working appliances without dealing with these issues.
I don't even know what this means. What's the actual, reproducible scenario?
this terminology doesn't really make sense in the context of a TCP connection.
but, anyway, check:
docs for the values here: https://www.kernel.org/doc/Documentation/networking/ip-sysct...