kcudrevelc's comments

kcudrevelc | 2 years ago | on: Shaving 40% Off Google’s B-Tree Implementation with Go Generics (2022)

Hey, author of Google's Go B-tree implementation here. A few important things:

- I wrote this implementation while I worked at Google and needed a good ordered tree. It is not, and never was, supported by Google; it was just written on company time and open sourced by the company.

- it now supports generics! Actually, IIRC it did almost as soon as this article came out. In Go 1.18 and higher, the original API is a specialization of the generic implementation underneath.

kcudrevelc | 5 years ago | on: Design Docs at Google

I've worked at Google for a while, and a few years ago I wanted to release an open-source project. Since the code was going on GitHub, it made sense to put the design there too, so I just added a DESIGN.md at the top level of the project. This has had a remarkably nice effect: sometimes, when people make large changes to the project, they update the design to reflect the changes!

See this lovely commit as an example: https://github.com/google/stenographer/commit/d678531a3e5a87...

kcudrevelc | 9 years ago | on: Gryadka is not Paxos, so it's probably wrong

It's not a (provable, guaranteed) CP system. Bitcoin attempts to provide global consensus across a large number of actors, but that attempt is based on proof of work, specifically on it being computationally difficult to compute hashes with a certain property (leading zeros, last I checked). It in no way guarantees consistency in the face of partitions.

Consider the simplest case: half of Bitcoin users/miners are temporarily split off from the other half for, say, a week (all Atlantic/Pacific fibers are cut at once, all satellites fall out of the sky, other huge catastrophes occur simultaneously). Each half would happily append its own blocks to the blockchain, and depending on the chain lengths added, once the partition healed there would be no good way to reconcile. Thus: not consistent in the face of partitions.

kcudrevelc | 9 years ago | on: Getting Past C

> Under Linux, some SECCOMP initialization and capability dances having to do with dropping root and closing off privilege-escalation attacks as soon as possible after startup.

I was under the impression that these specific things were actually quite hard to do in Go. I believe that both setuid/setgid and seccomp_load affect only the current OS thread, and since Go multiplexes goroutines across multiple threads and gives programmers very little control over which threads run which goroutines, I'm not sure how you would, for example, apply a seccomp context across all threads in a Go program. setuid/setgid are currently unsupported for this reason, with the best workaround being "start a subprocess and pass it file descriptors" (https://github.com/golang/go/issues/1435).

I'd be interested to hear if others have found ways to actually do this reliably for all OS threads underlying a running Go process.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

I think the reason we're getting faster performance is that we tend to have packets clustered on disk, as you've surmised. Since packets with particular ports/IPs/etc tend to cluster in time, there's a good chance that at least a few will take advantage of disk caches. Even if we clear the disk cache before the query, the first packet read can cache some read-ahead, and a subsequent packet read may hit that cache entry without requiring an additional seek/read.

As far as compressing offsets: I haven't done any specific measurements, but my intuition is that snappy (really, any compression algorithm) gives us a huge benefit, since all offsets are stored in order: adjacent offsets tend to have at least two prefix bytes in common, so they're highly compressible.

I experimented with mmap'ing all files in stenographer when it sees them, and it turned out to have negligible performance benefits... I think because the kernel already does disk caching in the background.

I think compression is something we'll defer until we have an explicit need. It sounds super useful, but we tend not to really care about data after a pretty short time anyway... we try to extract "interesting" pcaps from steno pretty quickly (based on alerts, etc). It's a great idea, though, and I'm happy to accept pull requests ;)

Overall, I've been really pleased with how doing the simplest thing actually gives us good performance while maintaining understandability. The kernel disk caching means we don't need any in-process caching. The simplest offset encoding + built-in compression gives great compression and speed. O_DIRECT gives really good disk throughput by offloading all write decisions to the kernel. More often than not, more clever code gave little or even negative performance gains.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

I blame blissful ignorance: I'm not at all familiar with DPDK. I'll definitely read up, though!

I wonder if O_DIRECT writes can happen from DPDK memory space? If not, we don't gain anything, since we'd need to copy packets into RAM for writes anyway.

Supporting stock Linux is definitely a nice-to-have... I'd like to make this a relatively easy-to-install .deb. Currently, all dependencies are available via apt-get in stock Ubuntu.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

Hey, great questions!

Query Performance: Right now, we've got test machines deployed with 8 500GB disks for packets + 1 indexing disk (all 15K RPM spinning disks). They're kept at 90% full, roughly 460GB/disk, about 1K files/disk. Querying over the entire corpus (~4TB of packets) for something innocuous like 'port 65432' takes 25 seconds to return ~50K packets (that's after dropping all disk caches). The same query run again takes 1.5 sec, with disk caches in place. Of course, the number of packets returned is a huge factor in this... each packet requires a seek in the packets file. Searching for something that doesn't exist (host 0.0.0.1) takes roughly 5 seconds. Note that time-based queries, like "port 4444 and after 3h ago and before 1h ago", do choose to only query certain files, taking advantage of the fact that we name files by microsecond timestamp and we flush files every minute.

A big part of query performance is actually over-provisioning disks. We see disk throughput of roughly 160-180MB/s. If we write 160MB/s, our read throughput is awful. If we write 100MB/s, it's pretty good. Who would have thought: disks have limited bandwidth, and it's shared between reads and writes. :)

We actually don't use LevelDB... we use the SSTables that underlie LevelDB. Since we know we're write-once, we use https://github.com/google/leveldb/blob/master/include/leveld... directly for writes (and its Go equivalent for reads). I'm familiar with the file format (they're used extensively inside Google), so it was a simple solution. That said, it's been very successful... we tend to have indexes in the 10s of MBs for 2-4GB files. Of course, index size/compressibility is directly correlated with network traffic: more varied IPs/ports would be harder to compress. The built-in compression of LevelDB tables is also a boon here... we get prefix compression on keys, plus snappy compression on packet seek locations, for free.

We currently do no compression of packets. Doing so would definitely increase our CPU usage per packet, and I'm really scared of what it would do to reads. Consider that reading packets in compressed storage would require decompressing each block a packet is in. On the other hand, if someone wanted to store packets REALLY long term, they could easily compress the entire blockfile+index before uploading to more permanent storage. I expect this would be better than having to do it inline. Even if we did build it in, we'd probably do it tiered (initial write uncompressed, then compress later on as possible).

AF_PACKET is no better than PF_RING+DNA, but I also don't think it's any worse. They both have very specific trade-offs. The big draw for me for AF_PACKET is that it's already there... any stock Linux machine will already have it built in and working. Thus steno should "just work", while a PF_RING solution has a slightly higher barrier to entry. I think PF_RING+DNA should give similar performance to steno... but libzero currently probably gives better performance because packets can be shared across processes. This is a really interesting problem that I'm wondering if we could also solve with AF_PACKET... but that's a story for another day. Short story: I wanted this to work on stock Linux as much as possible.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

Offline jobs are an interesting idea, but they weren't what we were really thinking of. Instead, we use stenographer more like a database of recent traffic. Consider this as a simple use case for intrusion detection:

  set up snort and steno
  foreach snort alert
    request all packets in stream from steno: srcIP,srcPort,dstIP,dstPort match
    OR request all packets on that srcIP,dstIP, to get OTHER connections between those hosts
    store pcap to directory (or central DB, or whatever)
Then, when a human analyst wants to investigate the alert, instead of getting the very limited PCAP that comes out of snort, they get a ton of data they can use to build context, write new detection rules, etc.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

No corrections necessary, you're right on the money.

This is a 20% project. While it's one we plan to use internally, it's not a "supported" Google product. It's just another open-source project along with the many others we use to keep our networks secure.

Also, it's designed specifically to do one thing (packet history) and do it well. In no way is it a complete solution; this is a building block for network detection and response.

To reiterate some of the salient points:

1) Disk is REALLY cheap these days.

2) NIDS don't store lots of history, because they're optimized for detecting patterns and signatures. So they might find something in the middle of a TCP stream and send an alert, but you don't have much context around it. This allows you to build that context by requesting all packets from that stream during a (possibly very long) time range.

3) There's a ton of reasons why this isn't used to monitor users:

* it's wrong: I'd flat-out refuse to build something designed to monitor users

* it wouldn't work #1: most interesting user traffic is encrypted on the wire

* it wouldn't work #2: our production network architecture doesn't lend itself to single aggregation points

* it wouldn't work #3: there aren't enough disks in the world to handle our production network load

* it's redundant: applications can already do per-application, structured monitoring as necessary for debugging/auditing/etc.

kcudrevelc | 11 years ago | on: Stenographer – A full-packet-capture utility

Hey, thanks! If you have any additional questions about the design process, internals, etc, feel free to ask. I'm the primary author of the project, and I'll be refreshing the HN post for the next hour or so trying to answer questions as they come up, and/or updating the docs.