CPU Introspection: Intel Load Port Snooping

[+] drinfinity|6 years ago|reply

I find it interesting that somebody dedicates his or her life to figuring out how these CPUs work when exactly that information is just lying about in some vault in Santa Clara.

[+] shaklee3|6 years ago|reply

If we're talking about dedicating their life, that's likely to be Agner Fog: https://www.agner.org/optimize/

He's put out the most detailed third-party documentation on Intel and AMD processors that I've ever seen.

[+] vardump|6 years ago|reply

I think I'll disable hyperthreading [0]. :-)

This interesting novel technique [1] can provide a unique window inside the CPU core black box, helping us to better understand how the CPU works internally when it comes to otherwise invisible loads and stores.

While I can't think of any scenario immediately, intuitively I feel this could be useful for those of us seeking to squeeze everything out of a system.

Not the particular case about low level TLB miss / page walk mechanics (we already know TLB misses are bad), but perhaps there are other situations where existing performance counters don't provide as detailed information.

[0]: Yes, I'm aware INVLPG (Invalidate TLB Entries) used in this blog post is a privileged ring 0 instruction. But there's clearly a leak regardless.

[1]: Using one hyperthread to "spy" on the other hyperthread on the same core.

[+] gamozolabs|6 years ago|reply

Hehe, hyperthreading has some issues. This issue technically works on a single thread, but it's hard for sensitive data to survive a context switch. That being said, this issue is mitigated in all common OSes and the latest microcode.

I'll be curious as to what there is to learn from this. It's more of a longshot goal for me to learn how things work, develop accurate uarch models, and then learn from those models better than I could guess and check hardware results.

Hard to say if it'll go well....

[+] baybal2|6 years ago|reply

> I think I'll disable hyperthreading

It will not help you. Anything short of a specially built CPU architecture and OS is useless against hardware-level "attacks," if you can even call them that.

CPUs are simply not built with an idea that you have to protect one process from another, and do it on that level of sophistication. You normally don't have that concern if the only person who can run code on a CPU is its user.

But the whole paradigm breaks when you have multiple "tenants" in a single system whose entire setup is not managed by the host.

The same applies when you allow random untrusted code to be JITed or even run as is with WASM.

The only solution against that is to stop people from using "virtual" hosting, and remove JIT compilation of untrusted code.

[+] herendin2|6 years ago|reply

I read this. It's quite interesting, but I think it is short of a few clear basic definitions of terms.

Could anyone help explain, in a few sentences, what is a Load Port, and why is it interesting in this context?

It appears to be some type of indicator of the proportional time slice given to certain opaque internal processes which are not normally visible to users.

[+] bertr4nd|6 years ago|reply

CPUs issue instructions through a few ports, each of which services a set of instruction types (e.g., arithmetic, memory load/store, vector instructions, etc.). For example, here are Skylake’s ports: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

So getting a trace from the load ports is basically a trace of all memory accesses in the system. Something particularly cool about this work is that you can even see loads that are hidden from software, like a hardware page table walk.

[+] gamozolabs|6 years ago|reply

Sorry about that. This blog kinda just hopped into the meat of it as I was trying to keep it short. It's largely a followup to a previous blog of mine https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... where I go into the details and descriptions.

[+] naveen99|6 years ago|reply

I don’t understand why Intel and AMD can’t give us access to the CPU cache the same way NVidia does with CUDA for local and shared memory on the GPU. I just don’t buy the hand-waving that the CPU can magically do branch prediction better than a programmer with a static C compiler who actually knows what they want in the future. Maybe if they had offered cache control on the Itanium or Phi, they wouldn’t have had to cancel them, despite the need for reprogramming user software, which didn’t stop CUDA.
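
For context, the closest thing available from C today is a prefetch hint, not an explicitly managed scratchpad like CUDA shared memory. A minimal sketch using the GCC/Clang `__builtin_prefetch` builtin follows; `sum_with_prefetch` and `PREFETCH_DISTANCE` are hypothetical names and values picked for illustration, and the CPU is free to ignore the hint:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead to request. */
enum { PREFETCH_DISTANCE = 16 };

/* Sums an array while hinting that upcoming elements will be read soon.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang extension:
 *   rw = 0 means the data will be read, locality = 3 means keep it in
 *   all cache levels if possible. It is only a hint. */
long sum_with_prefetch(const int *values, size_t count)
{
    assert(values != NULL || count == 0);

    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_DISTANCE < count) {
            __builtin_prefetch(&values[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += values[i];
    }
    return total;
}
```

For a simple linear scan like this the hardware prefetcher usually does the job already, so the hint mostly matters for access patterns the hardware cannot guess.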

[+] loa_in_|6 years ago|reply

I agree that we should be able to tell the processor which branch is more likely. Even something as simple as a flag to select between "I prefer you take any jump you encounter" and "I prefer you skip all jumps in this macroblock for speculation purposes" (and also "try to predict smartly" to stick to how it's working now, aka what this article is about). This would give a determined programmer everything he needs to make sure his program is executed optimally.
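
As a point of comparison, GCC and Clang already let you state which way a branch usually goes via `__builtin_expect`. It is a hint to the compiler, which uses it to arrange code layout; it does not switch the hardware predictor into a different mode. A minimal sketch, where `sum_nonnegative` and the `likely`/`unlikely` macros are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang extension; these macro names are a common convention,
 * not part of ISO C. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums the non-negative entries, telling the compiler that negative
 * entries are the rare case so the common path stays straight-line. */
long sum_nonnegative(const int *values, size_t count)
{
    assert(values != NULL || count == 0);

    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (unlikely(values[i] < 0)) {
            continue; /* rare path: skip negative values */
        }
        total += values[i];
    }
    return total;
}
```

Calling `sum_nonnegative` on `{3, -1, 4, 1, -5, 9}` skips the two negative entries and adds up the rest. Whether the hint changes the generated code at all depends on the compiler and optimization level.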


19 comments