The JVM has a number of cool features that enable you to efficiently drop down into C (FFI) yourself: these tricks aren't just for built-ins.
A quick overview:
- First there was JNI. JNI means that you write method stubs with the "native" keyword. Then you run javah, which gives you some C glue code that you eventually need to compile. This is very fast, but it's annoying because now you need a tool chain everywhere. The Python equivalent of this is roughly writing CPython extensions.
- People thought JNI was annoying, so Sun developed JNA. JNA lets you bind a library directly: all the magic comes with the JVM, and you can just dlopen something and call some syms. This works fine, but it's very slow. The Python equivalent of this is roughly ctypes.
- Most recently, there's JNR and jnr-ffi. They do a very clever trick: you use JNI to get to libffi, and then you use libffi to call everything else performantly. You get roughly the performance of JNI with the convenience of JNA. The Python equivalent of this is roughly cffi.
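For concreteness, the ctypes approach (the Python analogue of JNA mentioned above) looks roughly like this. A minimal sketch, assuming a POSIX-style libc is loadable at runtime:

```python
import ctypes
import ctypes.util

# JNA-style binding: dlopen an existing library at runtime,
# no glue code to compile and no toolchain needed.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declaring argument/return types up front avoids calling the
# symbol with the wrong signature.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

assert libc.strlen(b"hello, world") == 12
```

The convenience is obvious; the cost is that every call goes through a generic dispatch layer, which is the slowness being described.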
JNR is way more usable than I thought it would be. I develop caesium[0], a Clojure libsodium binding, using jnr-ffi and a pile of macros. I gave a talk about this at Clojure Conj (recording [1], slides [2]) if you're interested.
To be fair: this code uses intrinsics, which means that it's implemented differently than the three methods shown above, so it's still slightly different. It's just not different in a way that's meaningful to you unless you're working on the JVM itself :)
[0]: https://github.com/lvh/caesium [1]: https://dev-videos.com/videos/Lf-M1ZH6KME/Using-Clojure-with... [2]: https://www.lvh.io/CCryptoClojure/#/sec-title-slide
If folks are interested in this, they might also like our C++ version of this (it's been around for roughly 5 years now):
https://github.com/bytedeco/javacpp
It also comes with tons of pre-cooked projects ready to go: https://github.com/bytedeco/javacpp-presets
We use it in production at Skymind, and it powers our whole CPU and GPU stack, with built-in memory management among other things.
It also has Maven and sbt plugins, so you can plug it right into your workflow while maintaining only the Java code.
One of the coolest features is the name matching, so you can get semi-automatic mapping.
We auto-generate our JNI bindings from this.
JNR (and JNA and cffi) strike me as unsafe due to the lack of type enforcement. Systems like this let you call any random pointer as if it were a C function of any type. That usually works, but sometimes doesn't. If you're lucky, mistakes make your program blow up right away. If you're unlucky, you get impossible-to-diagnose memory corruption.
I'd much rather write conventional bridges and have the system check that I'm right.
> Then you run javah, which gives you some C glue code that you eventually need to compile.
That's deprecated and no longer supported in Java 10. The recommended way is to use javac -h.
> This is very fast
There is a certain call overhead for JNI that JNA and JNR necessarily share as well. As a normal Java application you don't really have a way to get around this.
See https://stackoverflow.com/questions/36298111/is-it-possible-...
> but it's annoying because now you need a tool chain everywhere.
You only need a toolchain for building; you can just ship the .so. These days you get pretty far with 64-bit Linux, macOS, and Windows.
One thing I learned about pcmpxstrx is that it's surprisingly slow: latency of 10-11 cycles and reciprocal throughput of 3-5 cycles on Haswell according to Agner Fog's tables, depending on the precise instruction variant. The instructions are also limited in the ALU ports they can use. Since AVX2 has made SIMD on x86 fairly flexible, it is sometimes not worth using the string comparison instructions when simpler ones suffice: even a slightly longer sequence of simple SIMD instructions can beat a single string compare.
The SSE 4.2 string comparison instructions still have their uses, but it's always worth testing alternate instruction sequences when optimizing code that might use them.
I have the same experience. I've tried using pcmpestr in substring search a few times, and it has always turned out to not be worth it. However, I have never tried it in comparison functions, so I can't speak to that, but I wouldn't be surprised if the latency of the instruction made an impact there too.
This guy's blog is awesome. A lot of in-depth information, original research, nice and funny writing style. I hope he finds some time to write / speak more soon.
I find it a bit sad that, instead of optimising the existing REP CMPS instruction to do vectorised compares like they did with REP MOVS/STOS and block copies/writes, Intel introduced another even more complex instruction that itself requires a bunch of additional support code to use. I certainly don't think it's a "good use of CISC".
I'm lukewarm on pcmpxstrx too, but for a different reason: I'd prefer the effort to go into more general purpose, highly flexible SIMD instructions (which is thankfully happening now with AVX 512).
> But did you know there is also a secret second implementation? String.compareTo is one of a few methods that is important enough to also get a special hand-rolled assembly version.
oooooh.
Is there a list somewhere of all the functions that have had such special treatment?
I think it could be fairly profitable, performance-wise, to write a specialized version of compareTo (a simple memcmp) for cases where the result is directly compared to 0, i.e. the equal-or-not-equal case.
Or to have compareEqualityTo() version.
Current version supports greater and less as well, which is of course useful in many data structures and sorting, but might not represent bulk of the call sites. This feature does not come for free.
This alternative version (memcmp alias) could be written to run faster.
Only benchmarking could tell whether or not it's ultimately worth it.
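To make the distinction concrete, here's a toy sketch (in Python rather than Java, and not HotSpot's actual code; `compare_to` and `equals` are hypothetical names) of why an equality-only comparison can bail out earlier than a full three-way compare:

```python
def compare_to(a: bytes, b: bytes) -> int:
    # Three-way compare: must scan from the start to find the first
    # differing byte, even when the lengths already differ.
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return a[i] - b[i]
    return len(a) - len(b)

def equals(a: bytes, b: bytes) -> bool:
    # Equality-only: a length mismatch answers immediately, and the
    # body can be a straight memcmp with no ordering bookkeeping.
    return len(a) == len(b) and a == b
```

The three-way version has to track *which* side is greater; the equality version only has to find *any* difference, which gives the implementation more freedom (length pre-check, comparing chunks in any order).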
> The code that generates this, MacroAssembler::string_compare in macroAssembler_x86.cpp is well-documented for the curious.
I was expecting that this would be some C++ code, or maybe inline assembly, but it turns out it's just a bunch of macros that are a thin layer over assembly instructions. Is this done for portability, to abstract over differing instruction names on different architectures?
Thanks for the tl;dr. I think the "crazy" adjective refers to the instruction, and not to the use of it. As the article explains, the instruction is quite complicated and has a large number of parameters.
Agreed that using vector instructions is not crazy at all. I was quite interested when I saw the article title, and then deeply disappointed after I read it. I was expecting something like the strlen shown in the book Hacker's Delight, which loads 4 bytes into an int and then uses some bit masking, subtraction, and an nlz (number of leading zeros) to index the first 0 byte. Read that first; it's what's really meant by crazy.
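For reference, the Hacker's Delight zero-byte trick alluded to here can be sketched in a few lines (Python used purely for illustration; a real strlen would do this on machine words in C or assembly):

```python
def has_zero_byte(word: int) -> bool:
    """Hacker's Delight zero-byte test on a 32-bit word (4 packed bytes)."""
    # Subtracting 0x01010101 borrows out of any byte that was 0x00;
    # ANDing with ~word & 0x80808080 then keeps only the high bits of
    # bytes that really were zero (not bytes like 0x80 that merely
    # have their high bit set).
    return ((word - 0x01010101) & ~word & 0x80808080) & 0xFFFFFFFF != 0
```

A word-at-a-time strlen loops over 4-byte chunks with this test, then pinpoints the zero byte in the final chunk (e.g. with a leading/trailing zero count).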
UTF-16 is one of those ill-fated developments that curse some languages & platforms (Windows NT through Windows 10, the Win32 API, Java, Flash, JS, Python 3) to this day.
> compareTo uses 0x19, which means doing the "equal each" (aka string comparison) operation across 8 unsigned words (thanks UTF-16!) with a negated result. This monster of an instruction takes in 4 registers of input:
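The imm8 control byte quoted above can be decoded mechanically from the field layout documented in Intel's SDM; a small sketch (`decode_imm8` is a hypothetical helper name):

```python
def decode_imm8(imm8: int) -> dict:
    """Decode a pcmpestri/pcmpistri control byte per the Intel SDM fields."""
    fmts = ["unsigned bytes", "unsigned words", "signed bytes", "signed words"]
    aggs = ["equal any", "ranges", "equal each", "equal ordered"]
    pols = ["positive", "negative", "masked positive", "masked negative"]
    return {
        "format": fmts[imm8 & 0b11],               # bits 1:0 - element size/sign
        "aggregation": aggs[(imm8 >> 2) & 0b11],   # bits 3:2 - comparison mode
        "polarity": pols[(imm8 >> 4) & 0b11],      # bits 5:4 - result negation
        "index": "msb" if imm8 & 0x40 else "lsb",  # bit 6 - which set bit to return
    }

# 0x19 decodes to unsigned words, "equal each", negated result:
# exactly the combination the quote describes.
print(decode_imm8(0x19))
```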
UTF-16 is not really a curse for languages that require it. String operations in non-English languages are very fast because of it, and most software these days has to deal with localization.
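The size side of this tradeoff is easy to check from Python, which can encode to both formats:

```python
# Compare storage cost of UTF-8 vs UTF-16 for different scripts.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
    "Emoji": "🙂",
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{name}: utf-8={u8} bytes, utf-16={u16} bytes")
```

ASCII-heavy text is half the size in UTF-8, CJK text is smaller in UTF-16, and Cyrillic comes out even; neither encoding is uniformly better, which is why the argument keeps recurring.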
It has nothing to do with "frigid opinions"; we banned the account for egregious serial abuse of the site.
Please don't do such digressions in HN comments. If you don't think a user should be banned, you're welcome to email us. And of course you can vouch for their good comments.
I can see this post fine; it's not "dead".
If you ever browse /newest and/or have showdead on, you've probably seen the occasional genuine spam that turns up here. I think the HN admins have a genuinely hard time with the current volume.
To deal with the current state of affairs, Arc apparently has to be exceptionally aggressive, and I've seen it generate a small stream of false positives.
In future, open the post (click the timestamp) and click "vouch". If it stays [dead], hopefully enough other people also vouch for it.
dullgiulio | 8 years ago
I am convinced that similar tricks are employed by almost every language runtime or standard library (GNU libc also does this, etc.)
[1] https://github.com/golang/go/blob/master/src/runtime/asm_amd... [2] https://en.wikipedia.org/wiki/Duff%27s_device
nicoulaj | 8 years ago
(look for "java_lang_Math" for instance)
ghusbands | 8 years ago
(vpcmpestri (of the pcmpxstrx family) isn't an especially crazy instruction to use for string comparison. That's what it's designed for.)
zakk | 8 years ago
(I am just being ironic about the title; the content is actually great!)
oblio | 8 years ago
Or even better (Yay for Betteridge!):
"Has the JVM found the ultimate string comparison solution in this CRAZY instruction?"
vardump | 8 years ago
UTF-16 came only later, once it was clear that 65,536 code points were too few.
Had those languages been designed in the last 10 years, all of them would have picked UTF-8 as their string encoding.
[0]: https://en.wikipedia.org/wiki/Path_dependence
RyanZAG | 8 years ago
http://www.baeldung.com/java-9-compact-string