top | item 45744155

(no title)

> Getting 200 Gb/s of reliable in-order bytestream per core over a unreliable, out-of-order packet-switched network using standard ethernet is not very hard with proper protocol design.

You also suggested that this can be done using a single CPU core. It seems to me that this proposal involves custom APIs (not sockets), and even if viable with a single core in the common case, would blow up in case of loss/recovery/retransmission events. Falcon provides a mostly lossless fabric with loss/retransmits/recovery taken care of by the fabric: the host CPU never handles any of these tail cases.

Ultimately there are two APIs for networks: sockets and verbs. Former is great for simplicity, compatibility, and portability, and the latter is the standard for when you are willing to break compatibility for performance.

discuss

Veserv|4 months ago

You can use a single core to do 200 Gb/s of bytestream in the presence of loss/recovery/retransmission assuming you adequately size your buffers so you do not need to stall while waiting for retransmit. So ~1 bandwidth-delay product worth of buffering for the number of lost transmits and retransmits of the same chunk of data you want to survive at full speed.

You can use such a protocol as a simple write() and read() to a single bytestream if you so desire, though you would probably be better off using a better API to avoid that unnecessary copy. Usage does not need to be anymore complicated then using a TCP socket which also provides a reliable ordered bytestream abstraction. You make bytes go in, same bytes come out the other side.

jauntywundrkind|4 months ago

Now do that across thousands of connections. While retaining very low p99 latency.

Just the idea that using a bytestream is ok is leaving opportunity on the table. If you know what protocols you are sending, you can allow some out-of-order transmission.

Asking the kernel or dpdk or whatever to juggle contention sounds like a coherency nightmare on large scale system, is a very hard scheduling problem, that a hardware timing wheel is going to be able to just do. Getting reliability & stability at massive concurrency at low latencies feels like such an obvious place for hardware to shine, and it does here.

Maybe you can dedicate some cores of your system to maintain a low enough latency simulacra maybe, but you'd still have to shuffle all the data through those low latency cores, which itself takes time and system bandwidth. Leaving the work to hardware with its own buffers & own scheduling seems like an obviously good use of hardware. Especially with the incredibly exact delay based congestion control their close cycle timing feedback gives them: you can act way before the CPU would poll/interrupt again.

Then having own Upper-Level-Protocol processors offloads a ton more of the hard work these applications need.

You don't seem curious or interested at all. You seem like you are here to downput and belittle. There's so many amazing wins in so many dimensions here, where the NIC can do very smart things intelligently, can specialize and respond with enormous speed. I'd challenge you to try just a bit to see some upsides to specialization, versus just saying a CPU hypothetically can do everything (and where is the research showing what p99 latency the best of breed software stacks can do?).