top | item 23541383

ccostes | 5 years ago

Aside from the Rust aspect (which is cool!), I can't believe we've come this far and still don't have low-latency video conferencing. Maybe I'm overly sensitive, but people talking over each other and the lack of conversational flow drive me crazy with things like Hangouts.

Aengeuad|5 years ago

John Carmack always has an interesting point to make about latency: https://twitter.com/ID_AA_Carmack/status/193480622533120001

>I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?

and to relate to the other post about landlines: https://twitter.com/ID_AA_Carmack/status/992778768417722368

>I made a long internal post yesterday about audio latency, and it included “Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations.”

Artlav|5 years ago

> Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations

Is there somewhere to read about the changes in question?

I'm old enough to remember extensive use of analog landlines, and can't really think of any difference to a cellphone other than audio quality.

lostmsu|5 years ago

Isn't this mostly because actually showing a pixel requires a macroscopic change?

pedrocr|5 years ago

Cisco "telepresence" solved this 15 years ago. Standardized rooms on both sides with high quality cameras and low latencies. Polycom had a similar but worse setup at the time. The Cisco experience was very close to being in a shared meeting with the other people. It made meetings across continents work very well and was an actual competitor to flying everywhere. Between the hardware being too expensive and the link requirements being very high I only ever saw it implemented in multinational telecoms for whom it was an actual work tool but also something to impress their clients with.

Either Cisco needed to bring the cost down massively to expand access, or someone needed to build it in major cities and bill by the hour to compete with flying. Neither happened, so it stayed a niche. Compared to those experiences more than a decade ago, the common VC is still only slowly catching up. Part of it is setup, like installing VC rooms with two smaller TVs side by side instead of one large one, so you can see the document and the other people at decent sizes. But part of it is still the technology. Those "telepresence" systems were almost surely on dedicated links running over the telecom core network, which guaranteed quality, instead of routing through the internet and randomly failing. I suspect getting really low latency will require that kind of telecom-level QoS; otherwise you'll be increasing buffer sizes to avoid freezes.

ponker|5 years ago

Cisco and HP Halo were incredible, but the biggest problems they had were 1) the requirement to build out an actual room for it and 2) the shitty software setup experience. The big corporates that could afford to dedicate real estate to VC also bogged it down in "enterpriseyness" that made it impossible to use.

blahbhthrow3748|5 years ago

My first job out of school was doing product verification for the cameras used in those Cisco systems! It was pretty impressive - I think they managed to squeeze 1080p at 60fps over USB2. Had a lot of fun building jigs and test setups to measure MTBF on a tight time frame.
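Rough arithmetic (my pixel-format assumption, not a detail from the comment) shows why that was a squeeze: raw 1080p60 wouldn't come close to fitting in USB 2.0's bandwidth, so the camera must have compressed on-board.

```python
# Back-of-the-envelope: can raw 1080p60 fit over USB 2.0?
WIDTH, HEIGHT, FPS = 1920, 1080, 60
BYTES_PER_PIXEL = 1.5          # assuming YUV 4:2:0, i.e. 12 bits/pixel
USB2_MBPS = 480                # USB 2.0 theoretical signalling rate, Mbit/s

raw_mbps = WIDTH * HEIGHT * BYTES_PER_PIXEL * 8 * FPS / 1e6
print(f"raw 1080p60 (YUV 4:2:0): {raw_mbps:.0f} Mbit/s")   # ~1493 Mbit/s
print(f"compression needed: at least {raw_mbps / USB2_MBPS:.1f}x")
```

So even before protocol overhead, the stream needs roughly 3x compression to fit the bus.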

bob1029|5 years ago

The biggest problem is the video codecs, which ultimately boils down to their use of interframe compression. This technique requires that a certain number of video frames be received and buffered before a final image can be produced, which imposes a baseline latency that cannot be overcome by any means. It is a hard trade-off in information theory.

Something to consider is that there are alternatives to interframe compression. Intraframe compression (e.g. JPEG) can bring your encoding latency down to 0~10ms per frame at the cost of a dramatic increase in bandwidth. Other benefits include the ability to draw any frame the moment you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you often need some number of prior frames to reconstitute a complete one.
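The intra/inter distinction can be sketched in a few lines of toy Python (a stand-in, not a real codec): delta-coded frames are smaller, but decoding one means replaying everything back to the last keyframe, while full frames decode on their own.

```python
# Toy illustration (not a real codec): intra ('I') frames decode standalone,
# inter ('P', delta-from-previous) frames need every frame back to the last keyframe.

def encode(frames, keyframe_interval):
    """Encode as a mix of 'I' (full) and 'P' (delta) frames."""
    stream, prev = [], None
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            stream.append(('I', list(frame)))                      # self-contained
        else:
            stream.append(('P', [a - b for a, b in zip(frame, prev)]))
        prev = frame
    return stream

def decode_upto(stream, n):
    """Decoding frame n requires replaying from the preceding keyframe."""
    frame = None
    for kind, data in stream[:n + 1]:
        if kind == 'I':
            frame = list(data)
        else:
            frame = [a + b for a, b in zip(frame, data)]
    return frame

frames = [[10, 10, 10], [10, 11, 10], [10, 12, 11]]
stream = encode(frames, keyframe_interval=3)
assert decode_upto(stream, 2) == frames[2]
# With keyframe_interval=1 (intra-only), any frame decodes alone, at the
# cost of repeating all the data in every frame -- the JPEG trade-off.
```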

For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using libjpeg-turbo, and I am able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds at 1080p. Testing this approach at 60fps redraw with event feedback has shown that ideal localhost round-trip latency is nearly indistinguishable from native desktop applications.

The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to sacrifice bandwidth efficiency. My three-weekend project arguably already runs much better than Google Stadia in both latency and quality, but the market for streaming game & video conference services that require 50~100 Mbps of constant throughput (depending on resolution & refresh rate) is probably very limited for now. That said, it is also not entirely non-existent - think corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming-gaming dream alive, even if it can't be fully realized until 10gbps+ LAN/internet is the default everywhere.
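Back-of-the-envelope numbers line up with that 50~100 Mbps figure (the per-frame JPEG size is my assumption, not a measurement from the project above):

```python
# Rough bandwidth check for per-frame JPEG ("motion JPEG") streaming.
# 100-200 KB per 1080p frame is an assumed range for moderate-to-high quality.
FPS = 60
for kb_per_frame in (100, 200):
    mbps = kb_per_frame * 1000 * 8 * FPS / 1e6
    print(f"{kb_per_frame} KB/frame at {FPS} fps -> {mbps:.0f} Mbit/s")
# 48-96 Mbit/s, i.e. roughly the 50~100 Mbps quoted above
```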

phoboslab|5 years ago

Interframes are not a problem, as long as they only reference previous frames, not future ones.

I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of a 1-frame (16ms) delay for screen capture on the sender plus 2-3 frames of latency to get through the OS stack to the screen at the receiving end. Encoding and decoding took about 5ms. Plus of course the network latency, but I only tested this on a local wifi, so it didn't add much.

[1] https://phoboslab.org/log/2015/07/play-gta-v-in-your-browser...

vlovich123|5 years ago

You can also just configure your video encoder to not use B-frames. If you then make all subsequent frames P-frames, the sizes stay very manageable. It gets trickier if your transport is lossy, since a dropped P-frame is a problem, but it's not unsolvable if you use LTR (long-term reference) frames intelligently.

All the benefits of efficient codecs, with more manageable latency downsides.
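A toy way to see why dropping B-frames removes a chunk of structural latency (the frame pattern below is hypothetical, not output from any real encoder):

```python
# Each tuple: (frame type, capture time, capture times of referenced frames).
# A frame can't be encoded until everything it references has been captured,
# so forward references (B-frames) force the encoder to sit on frames.
with_b = [("I", 0, []), ("B", 1, [0, 3]), ("B", 2, [0, 3]), ("P", 3, [0])]
p_only = [("I", 0, []), ("P", 1, [0]), ("P", 2, [1]), ("P", 3, [2])]

def added_latency_frames(stream):
    """Worst-case frames of delay introduced by the reference structure."""
    return max(max(refs, default=t) - t for _, t, refs in stream)

assert added_latency_frames(with_b) == 2   # ~33 ms extra at 60 fps
assert added_latency_frames(p_only) == 0   # past-only refs add nothing
```

The P-only stream pays for this in compression efficiency, since B-frames usually cost fewer bits, which is exactly the trade being described.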

The challenge you'll run into instantly with JPEG is that the file-size increase and the encoding/decoding time at large resolutions outstrip any benefit you see in limited tests. For video-game applications you have to figure out how to pipeline your streaming more efficiently than transferring a small ~10 KB image, as otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Finally, decode is the other side of the problem: HW video decoders are embarrassingly fast and super common, while your JPEG decode is going to be significantly slower.

* EDIT: For your weekend project, are you testing with cloud servers or locally? I would be surprised if, under equivalent network conditions, you're outperforming Stadia - so be careful that you're not benchmarking local-network performance against Stadia's production performance on public networks.

cossatot|5 years ago

I think a larger issue is the focus on video as opposed to audio. Audio may be less sexy but it is far and away more important for most interpersonal communication (I'm not discussing gaming or streaming or whatever, but teleconferencing). Most of us don't care that much if we get super crisp, uninterrupted views of our colleagues or clients, but audio problems really impede discussion.

jstrong|5 years ago

One technique to get high compression rates when compressing each frame independently is to train a compression "dictionary" on the first few seconds/minutes of the data stream, and then use that dictionary to compress/decompress every subsequent frame.
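This can be sketched with zlib's preset-dictionary support from the Python standard library. A real system would train the dictionary properly (e.g. with zstd's trainer); here the "dictionary" is just raw bytes from the first few frames, and the frame contents are made up for the demo.

```python
import zlib

# Stand-in for early stream data; zlib's match window tops out at 32 KB.
first_frames = bytes(range(256)) * 4
dictionary = first_frames[-32768:]

# A later frame that closely resembles the earlier ones:
frame = bytes(range(128)) + b"CHANGED" + bytes(range(135, 256))

def compressed_size(data, zdict=None):
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return len(c.compress(data) + c.flush())

plain = compressed_size(frame)
primed = compressed_size(frame, dictionary)
assert primed < plain          # the shared dictionary pays off

# The receiving side needs the same dictionary to decompress:
c = zlib.compressobj(level=9, zdict=dictionary)
blob = c.compress(frame) + c.flush()
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(blob) == frame
```

The catch is that both ends must agree on the dictionary, so it has to be transmitted (or re-trained) whenever the scene changes substantially.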

izacus|5 years ago

Well, all that effort is regularly defeated by poor hardware - you can have 40ms latency in the video-call stack, but when people attach Bluetooth headphones that buffer everything for 300ms, there's nothing really to be done.

(Be gentle on your coworkers and use cabled headphones.)

bufferoverflow|5 years ago

The LLAC (LHDC LL) Bluetooth codec adds only 30ms.

aptX Low Latency adds only 40ms max.

Just buy headphones with good low-latency support. They aren't even expensive anymore.

Filligree|5 years ago

Okay, but I want to wear wireless headphones.

Why can't I have both? Wifi doesn't seem to have this latency problem.

GuiA|5 years ago

There are hard limits at play. No matter what you do, you can't go from New York to London in less than ~20ms; add video/audio encoding, packet switching, decoding, etc. and it's easy to see why any latency under the 100ms mark at that spatial scale in a scalable, mainstream product would be close to a miracle.

The thing is that when we talk in a room, sound takes <10ms to reach my ears from your mouth. This is what "enables" all of the human turn-taking cues in conversation (eye contact, picking up whether a sentence is about to end/whether it's a good time to chime in/etc) - I've been looking for work from people who've tried to see at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. No matter what it is though, it's likely that long-distance digital communications just cannot match it.
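For concreteness, the physical floors work out roughly as follows (the NY-London distance is an approximate great-circle figure):

```python
# Physical floor on one-way latency, New York -> London, vs. sound in a room.
DIST_M = 5_570_000          # approximate great-circle distance, metres
C_VACUUM = 3.0e8            # speed of light in vacuum, m/s
C_FIBER = 2.0e8             # light in glass fibre: roughly 2/3 c
SOUND = 343                 # speed of sound in air, m/s

print(f"vacuum floor: {DIST_M / C_VACUUM * 1000:.0f} ms")   # ~19 ms
print(f"fiber floor:  {DIST_M / C_FIBER * 1000:.0f} ms")    # ~28 ms
print(f"3 m of air:   {3 / SOUND * 1000:.0f} ms")           # ~9 ms
```

So the ~20ms figure is already the vacuum speed-of-light bound; real fiber routes are longer and slower, before any encoding or switching is added.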

See also this interesting comment about the feeling of "closeness" from phone copper wires:

https://news.ycombinator.com/item?id=22931809

> Landlines were so fast and so "direct" in their latency (where distance correlates very directly with time, due to a lack of "hops") that local phone calls were faster than the speed of sound across a table, and for a bit after they came out--before people generally got used to seemingly random latency--local calls felt "intimate", like as if you were talking to someone in bed with their head right next to you; I also have heard stories of negotiators who had gotten really tuned to analyzing people's wait times while thinking that long distance calls were confusing and threw them off their game.

jokoon|5 years ago

> it's easy to see why any latency under the 100ms mark at that spatial scale in a scalable, mainstream product would be close to a miracle.

It seems normal phones are able to do it, though. At least, normal phones seem to suffer less from latency problems.

In a way, simplicity in technology often means better performance.

josh2600|5 years ago

The media lab has done a ton of research on this. I seem to remember people being able to notice visual latency at 30ms and audio latency at 80-120ms (this is because light is faster than sound).

eru|5 years ago

> The thing is that when we talk in a room, sound takes <10ms to reach my ears from your mouth. This is what "enables" all of the human turn-taking cues in conversation (eye contact, picking up whether a sentence is about to end/whether it's a good time to chime in/etc) - I've been looking for work from people who've tried to see at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. No matter what it is though, it's likely that long-distance digital communications just cannot match it.

Digital communication could cheat, though!

There's a lot of latency hiding you can do, if you can predict well enough what's coming next. Humans are fairly predictable most of the time.

sbierwagen|5 years ago

Where does Tonari actually put the camera? The perspective on the displayed image makes it look like the camera is ceiling mounted, but that would make the eye contact problem much worse than even Zoom.

wallflower|5 years ago

If I had to guess at a possible future, I can imagine edge-computing servers that connect over 5G or fiber to your device. On these servers, AI/ML predicts what you, as a participant, will do in the next 50-60ms or longer (video, including facial and hand gestures; audio, including Toastmasters-type fillers like "ahh" and "umm") and transmits that guess as rendered video frames and audio in time for the other videoconferencing participants to see "no latency" interaction. Done right, it would seem real. Done wrong, definite Max Headroom feel.