Achieving 100Gbps intrusion prevention on a single server

[+] crispyambulance|5 years ago|reply

Just some really basic questions since I don't understand all the jargon in the article, can someone help me?

* They're talking about 100Gbps ethernet traffic. Where does one even see that line rate? It seems those rates would only occur at carriers, optical transport, or in very large data-centers.

* Intrusion detection, AFAIK, would mean that the packets have to be parsed and inspected. But most traffic is encrypted, right? How far beyond typical firewall capability can you get with this stuff if you can only look at the unencrypted part of the packet?

* Not sure exactly what "intrusion" means in this context? What are they looking for? What are some common rules that get applied? Every protocol is going to be wildly different and working at an application level. Don't you need specific application experts to even think about writing such specialized rules?

* There are commercial products which, I think, do this stuff from Gigamon and Netscout. These products have special FPGA-based switches that have many 10G, 40G or 100G ports, how does the tech in the article differ? What's the use case?

[+] nix23|5 years ago|reply

Sorry to be that harsh, but you can read everything in the net. For what a IDS is, start with that article it alone answers 3 of your questions:

https://en.wikipedia.org/wiki/Intrusion_detection_system

[+] sqren|5 years ago|reply

To learn about Intrusion Detection Systems (IDS) I found Unifi's documentation quite helpful. It's specific for their product but most of it is applicable elsewhere.

https://help.ui.com/hc/en-us/articles/360006893234-UniFi-USG...

> What are some common rules that get applied?

This is specifically answered under "Categories and Their Definitions":

> Compromised: This is a list of known compromised hosts, confirmed and updated daily as well.

> Scan: Things to detect reconnaissance and probing. Nessus, Nikto, portscanning, etc. Early warning stuff.

> SpamHaus: This ruleset takes a daily list of known spammers and spam networks as researched by Spamhaus.

> Web Apps: Rules for very specific web applications.

[+] yabones|5 years ago|reply

Some systems use a CA cert on endpoints and a CA key on the web proxy portion of the IDS, so it's able to intercept and decrypt all traffic that passes through. This isn't as common anymore since it has massive security risks associated with a 'godmode' TLS cert.

But there are still ways to inspect encrypted traffic. TLSv1.2 and lower expose the SNI in the handshake. Even on TLSv1.3 there is plenty of metadata exchanged during the TLS handshake. For example, the JA3 spec [1] is a standardized hash of selected connection parameters such as cipher suites, extensions, curves, etc. It can reveal a decent amount of information about the software initializing the connection, and when compared to a list of known good or known harmful JA3's can flag connections as malicious.

[1] https://engineering.salesforce.com/tls-fingerprinting-with-j...

[+] solotronics|5 years ago|reply

Core networks of cloud providers (FANG) are made up of many aggregated or multipath 100G/400G.

[+] hinkley|5 years ago|reply

For one, there are many variations of geographical redundancy where there is no single machine that touches all of the traffic. Having everything behind a pair of redundant load balancers is somewhat antiquated.

[+] unknown|5 years ago|reply

[deleted]

[+] PragmaticPulp|5 years ago|reply

> They're talking about 100Gbps ethernet traffic. Where does one even see that line rate?

The higher the peak throughout, the more margin you have for user applications.

If the peak throughout is 100Gbps then it’s likely that it will only consume 1% of resources at a more pedestrian 1Gbps throughout.

[+] mlmitch|5 years ago|reply

To answer your first question - I’ve personally achieved the advertised 100Gbps line rate on the bigger AWS instances.

[+] gjulianm|5 years ago|reply

It's a very interesting article! Achieving 25Mpps throughput with all of those tasks to do per packet is amazing, even more with the headroom they have in resources.

However, I miss more discussion on how to increase the number of concurrent flows, which would be the more complex part. Even with no rules, they only can process 500k concurrent flows, which is nowhere near enough for a 100Gbps network. For reference, in 10Gbps networks we're seeing between four and six million concurrent flows, depending on how much is that network exposed to the Internet.

[+] touisteur|5 years ago|reply

Still very interesting for internal networks and middleboxes. If you're trying to bring 10GbE to the end-user, you'll need to filter closer. And in a closed or somehow isolated network, you're less likely to go over the 10k connections threshold, I think. Especially if the high bandwidth (>1GbE) is used primarily for network file access.

[+] teleforce|5 years ago|reply

What a lovely fond memories of having to research and use the very first several versions of Snort IDS and its first IPS (Hogwash) about 20 years ago [1],[2].

I'd like to test this new system on the cheaper Xilinx ULtraScale+ FPGA using the provided software [3], but not sure will it even work with different FPGA set up with minimum change [4]?

Another thing is that it will be interesting to test and compare it with eBPF based IPS system bypassing the kernel without the need for FPGA? It seems that for Suricata IDS/IPS it has been proposed but no performance metrics are provided of the effort [5].

[1] https://www.snort.org [2] http://hogwash.sourceforge.net/docs/overview.html [3] https://github.com/cmu-snap/pigasus [4] https://www.xilinx.com/products/intellectual-property/cmac.h... [5] https://cdn2.hubspot.net/hubfs/6344338/Resources/Stamus_WP_I...

[+] gjulianm|5 years ago|reply

At those speeds the problem is that the CPUs are not fast enough. Say you need to process these 25 Mpps (and that's being generous, the maximum packet rate at 100Gbps is 162Mpps), in a single core at 3,6 GHz: you have 144 clock cycles per packet. A cache miss or a branch predictor miss will already eat quite a lot of the available time for that packet. Even if you split the workload among multiple cores you're still pretty limited in what you can do. And if you need to forward those packets, you might be close to the PCIe 3 x16 maximum efective bandwidth.

In my company we're developing traffic capture/analysis software at 100Gbps (which is orders of magnitude faster than IDS/IPS) and in order to achieve those speeds we need fast processors with a lot of cores and quite a lot of RAM, interact directly with the NIC buffers (we use DPDK now, we previously worked with modified drivers) avoiding the kernel entirely, and we have to limit the tasks per packet a lot to the point that flow state management is pretty difficult, and TCP reassembly looks impossible. I don't see a software IDS/IPS system getting anywhere close to the performance of an FPGA.

[+] Quarrel|5 years ago|reply

Interesting.

So, Snort 3.0 uses Intel’s Hyperscan library, as per the article.

Intel acquired this when they bought Sensory Networks, Inc. Guess what SN did before they moved to software? They made and sold hardware that essentially implemented Hyperscan in FPGAs (and SRAM, on PCI boards), for IDS & virus scanning at speed. They even had a patch to add support to Snort for it. SN eventually moved to a pure software model during the 2008 downturn (which smashed a bunch of their customers plans) and ultimately sold to Intel.

End of the day it is a real bitch to do right, and it was very hard at 10Gbps, and the problems are essentially identical now (only with more stuff over https).

I have no idea if they'd use them, but Intel also acquired a heap of patents on doing exactly this using the techniques that Hyperscan implements, and while they're probably happy with you doing it in software, they're probably more diligent in fighting off hardware adversaries.

[+] ianhowson|5 years ago|reply

Don't forget about the I-Series!

If you're still interested in hardware, there's the NVIDIA BlueField 2 and its regex accelerator: https://www.nvidia.com/en-us/networking/products/data-proces...

[+] londons_explore|5 years ago|reply

> The reassembler is responsible for ordering TCP packets. It needs to do this at line rate while maintaining state for 100K flows.

100k flows at a line rate of 100Gbps, suggests the average flow is 1Mbps. I find that hard to believe - A typical home or office computer might have hundreds of TCP connections open at any given point in time, yet barely be downloading anything. Most of those connections are just sitting idle.

Perhaps this hardware only bothers with keeping the 100k most active flows in the reassembler... In which case the obvious attack is simply to send some packets out of order by a few seconds to make sure you're out of the most-active reassembly window?

[+] stefan_|5 years ago|reply

Ding ding ding. Not only is the number of flows limited by the BRAM which is going to be impossible/very expensive to scale, it will drop particularly OOO flows:

If the OOO Engine detects that BRAM capacity for OOO flows exceeds 90% of its maximum capacity, it drops the flow with the longest linked list

[+] ashika|5 years ago|reply

> the average flow is 1Mbps. I find that hard to believe

hard to believe? I can confirm - tcp sessions with average flows of >1Mbps do exist.

you are right in identifying a weakness of the system but incorrect as to its significance. A drag racer cannot turn, because it has been optimized for something else.

[+] hnaccy|5 years ago|reply

>At the start of the pipeline we can afford to run lots of memory-cheap filters in parallel. Only a subset of incoming packets make it past these filters, so we need less of the more memory intensive filters running in parallel behind them to achieve the desired line rate.

What would happen if you manufactured lots of packets to trigger the expensive filters?

[+] rwmj|5 years ago|reply

I was wondering this too. I guess the trick would be how would you (the attacker) know which filters are being used / which are expensive?

If they're only using the Snort standard filters, that would be one thing, but Cloudflare or similar services which might actually use this hardware would probably also have a completely custom rule set. So could you somehow detect experimentally what packets trigger the expensive rules? Perhaps some kind of fuzzing attack could do that.

[+] mr__y|5 years ago|reply

>What would happen if you manufactured lots of packets to trigger the expensive filters?

then you would effectively DoS/DDoS the IPS. Now depending on how the system as a whole works it could be an efficient way to get through with a different attack that would normally be detected/blocked by IPS.

[+] cryptofistMonk|5 years ago|reply

Reading "FGPA" in the fourth paragraph confused me more than it should have.

[+] 90red|5 years ago|reply

I was going to comment something on the lines of that, I don't have any prior experience with the technology so my question might be stupid but does the FGPA take the place of a dedicated machine like a server where it handles the loads?

[+] louwrentius|5 years ago|reply

TLDR: How they did it? Use an FPGA to offload the pattern matching.

Very interesting though.

[+] tleb_|5 years ago|reply

Did you read the article? Your TLDR is misleading.

[+] snehesht|5 years ago|reply

Github Repo - https://github.com/cmu-snap/pigasus

[+] anticristi|5 years ago|reply

What a beast! I love the FPGA-first design logic.

[+] serialx|5 years ago|reply

This is just awesome. I wish there are open and arduino like dev env for FPGAs.

[+] shakna|5 years ago|reply

The ICE40 is about as close to this as you can get. A lot of open source tools work with them, and you can find boards being sold by companies that generally specialise in the Arduino ecosystem. Unfortunately not open hardware, but that's unlikely to happen anytime soon.

You'd never want to build anything production-quality with one, but they're decent enough for hobbyists.

[+] ryanmjacobs|5 years ago|reply

shameless plug... https://webfpga.com isn't bad if you're looking to get your feet wet with FPGA dev

[+] davidbrennerjr|5 years ago|reply

Snort3 will offload work to a second core only when the first core is overloaded, for the most part Snort3 uses only 1 core.

35 comments