The JVM has a number of cool features that enable you to efficiently drop down into C (FFI) yourself: these tricks aren't just for built-ins.
A quick overview:
- First there was JNI. JNI means that you write method stubs with the "native" keyword. Then you run javah, which gives you some C glue code that you eventually need to compile. This is very fast, but it's annoying because now you need a tool chain everywhere. The Python equivalent of this is roughly writing CPython extensions.
- People thought JNI was annoying, so Sun developed JNA. JNA lets you bind a library directly: all the magic comes with the JVM, and you can just dlopen something and call some syms. This works fine, but it's very slow. The Python equivalent of this is roughly ctypes.
- Most recently, there's JNR and jnr-ffi. They do a very clever trick: you use JNI to get to libffi, and then you use libffi to call everything else performantly. You get roughly the performance of JNI with the convenience of JNA. The Python equivalent of this is roughly cffi.
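For concreteness, the ctypes approach (the Python analogue of JNA mentioned above) looks roughly like this. A minimal sketch, assuming a POSIX-style libc is loadable at runtime:

```python
import ctypes
import ctypes.util

# JNA-style binding: dlopen an existing library at runtime,
# no glue code to compile and no toolchain needed.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declaring argument/return types up front avoids calling the
# symbol with the wrong signature.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

assert libc.strlen(b"hello, world") == 12
```

The convenience is obvious; the cost is that every call goes through a generic dispatch layer, which is the slowness being described.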
JNR is way more usable than I thought it would be. I develop caesium[0], a Clojure libsodium binding, using jnr-ffi and a pile of macros. I gave a talk about this at Clojure Conj (recording [1], slides [2]) if you're interested.
To be fair: this code uses intrinsics, which means that it's implemented differently than the three methods shown above, so it's still slightly different. It's just not different in a way that's meaningful to you unless you're working on the JVM itself :)
[0]: https://github.com/lvh/caesium [1]: https://dev-videos.com/videos/Lf-M1ZH6KME/Using-Clojure-with... [2]: https://www.lvh.io/CCryptoClojure/#/sec-title-slide
If folks are interested in this, they might also like our C++ version of this (it's been around for roughly 5 years now):
https://github.com/bytedeco/javacpp
It also comes with tons of pre-cooked projects ready to go: https://github.com/bytedeco/javacpp-presets
We use it in production at Skymind, and it powers our whole CPU and GPU stack, with built-in memory management among other things.
It also has Maven and sbt plugins, so you can plug it right into your workflow while maintaining only the Java code.
One of the coolest features is the name matching, so you can get semi-automatic mapping.
We auto-generate our JNI bindings from this.
JNR (and JNA and cffi) strike me as unsafe due to the lack of type enforcement. Systems like this let you call any random pointer as if it were a C function of any type. That usually works, but sometimes doesn't. If you're lucky, mistakes make your program blow up right away. If you're unlucky, you get impossible-to-diagnose memory corruption.
I'd much rather write conventional bridges and have the system check that I'm right.
> Then you run javah, which gives you some C glue code that you eventually need to compile.
That's deprecated and no longer supported in Java 10. The recommended way is to use javac -h.
> This is very fast
There is a certain call overhead for JNI that JNA and JNR necessarily share as well. As a normal Java application you don't really have a way to get around this.
See https://stackoverflow.com/questions/36298111/is-it-possible-...
> but it's annoying because now you need a tool chain everywhere.
You only need a toolchain for building; you can just ship the .so. These days you get pretty far with 64-bit Linux, macOS, and Windows.
One thing I learned about pcmpxstrx is that it's surprisingly slow: latency of 10-11 cycles and reciprocal throughput of 3-5 cycles on Haswell according to Agner Fog's tables, depending on the precise instruction variant. The instructions are also limited in the ALU ports they can use. Since AVX2 has made SIMD on x86 fairly flexible, it is sometimes not worth using the string comparison instructions when simpler ones suffice: even a slightly longer sequence of simple SIMD instructions can beat a single string compare.
The SSE 4.2 string comparison instructions still have their uses, but it's always worth testing alternate instruction sequences when optimizing code that might use them.
I have the same experience. I've tried using pcmpestr in substring search a few times, and it has always turned out to not be worth it. However, I have never tried it in comparison functions, so I can't speak to that, but I wouldn't be surprised if the latency of the instruction made an impact there too.
This guy's blog is awesome. A lot of in-depth information, original research, nice and funny writing style. I hope he finds some time to write / speak more soon.
I find it a bit sad that, instead of optimising the existing REP CMPS instruction to do vectorised compares like they did with REP MOVS/STOS and block copies/writes, Intel introduced another even more complex instruction that itself requires a bunch of additional support code to use. I certainly don't think it's a "good use of CISC".
I'm lukewarm on pcmpxstrx too, but for a different reason: I'd prefer the effort to go into more general purpose, highly flexible SIMD instructions (which is thankfully happening now with AVX 512).
> But did you know there is also a secret second implementation? String.compareTo is one of a few methods that is important enough to also get a special hand-rolled assembly version.
oooooh.
Is there a list somewhere of all the functions that have had such special treatment?
I think it could be fairly profitable, performance-wise, to write a specialized version of compareTo (a simple memcmp) for cases where the result is directly compared to 0, i.e. the equal-or-not-equal case.
Or to have compareEqualityTo() version.
Current version supports greater and less as well, which is of course useful in many data structures and sorting, but might not represent bulk of the call sites. This feature does not come for free.
This alternative version (memcmp alias) could be written to run faster.
Only benchmarking could tell whether or not it's ultimately worth it.
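To make the distinction concrete, here's a toy sketch (in Python rather than Java, and not HotSpot's actual code; `compare_to` and `equals` are hypothetical names) of why an equality-only comparison can bail out earlier than a full three-way compare:

```python
def compare_to(a: bytes, b: bytes) -> int:
    # Three-way compare: must scan from the start to find the first
    # differing byte, even when the lengths already differ.
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return a[i] - b[i]
    return len(a) - len(b)

def equals(a: bytes, b: bytes) -> bool:
    # Equality-only: a length mismatch answers immediately, and the
    # body can be a straight memcmp with no ordering bookkeeping.
    return len(a) == len(b) and a == b
```

The three-way version has to track *which* side is greater; the equality version only has to find *any* difference, which gives the implementation more freedom (length pre-check, comparing chunks in any order).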
> The code that generates this, MacroAssembler::string_compare in macroAssembler_x86.cpp is well-documented for the curious.
I was expecting that this would be some C++ code, or maybe inline assembly, but it turns out it's just a bunch of macros that are a thin layer over assembly instructions. Is this done for portability, to abstract over differing instruction names on different architectures?
Thanks for the tl;dr. I think the "crazy" adjective refers to the instruction, and not to the use of it. As the article explains, the instruction is quite complicated and has a large number of parameters.
Agreed that using vector instructions is not crazy at all. I was quite interested when I saw the article title, and then deeply disappointed after I read it. I was expecting something like the strlen shown in the book Hacker's Delight, which loads 4 bytes into an int and then uses some bit masking, subtraction, and an nlz (number of leading zeros) to index the first 0 byte. Read that first; it's what's really meant by crazy.
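For reference, the Hacker's Delight zero-byte trick alluded to here can be sketched in a few lines (Python used purely for illustration; a real strlen would do this on machine words in C or assembly):

```python
def has_zero_byte(word: int) -> bool:
    """Hacker's Delight zero-byte test on a 32-bit word (4 packed bytes)."""
    # Subtracting 0x01010101 borrows out of any byte that was 0x00;
    # ANDing with ~word & 0x80808080 then keeps only the high bits of
    # bytes that really were zero (not bytes like 0x80 that merely
    # have their high bit set).
    return ((word - 0x01010101) & ~word & 0x80808080) & 0xFFFFFFFF != 0
```

A word-at-a-time strlen loops over 4-byte chunks with this test, then pinpoints the zero byte in the final chunk (e.g. with a leading/trailing zero count).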
UTF-16 is one of those ill-fated developments that curse some languages & platforms (Windows NT through Windows 10, the Win32 API, Java, Flash, JS, Python 3) to this day.
> compareTo uses 0x19, which means doing the "equal each" (aka string comparison) operation across 8 unsigned words (thanks UTF-16!) with a negated result. This monster of an instruction takes in 4 registers of input:
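The imm8 control byte quoted above can be decoded mechanically from the field layout documented in Intel's SDM; a small sketch (`decode_imm8` is a hypothetical helper name):

```python
def decode_imm8(imm8: int) -> dict:
    """Decode a pcmpestri/pcmpistri control byte per the Intel SDM fields."""
    fmts = ["unsigned bytes", "unsigned words", "signed bytes", "signed words"]
    aggs = ["equal any", "ranges", "equal each", "equal ordered"]
    pols = ["positive", "negative", "masked positive", "masked negative"]
    return {
        "format": fmts[imm8 & 0b11],               # bits 1:0 - element size/sign
        "aggregation": aggs[(imm8 >> 2) & 0b11],   # bits 3:2 - comparison mode
        "polarity": pols[(imm8 >> 4) & 0b11],      # bits 5:4 - result negation
        "index": "msb" if imm8 & 0x40 else "lsb",  # bit 6 - which set bit to return
    }

# 0x19 decodes to unsigned words, "equal each", negated result:
# exactly the combination the quote describes.
print(decode_imm8(0x19))
```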
UTF-16 is not really a curse for languages that require it. String operations in non-English languages are very fast because of it, and most software these days has to deal with localization.
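The size side of this tradeoff is easy to check from Python, which can encode to both formats:

```python
# Compare storage cost of UTF-8 vs UTF-16 for different scripts.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
    "Emoji": "🙂",
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{name}: utf-8={u8} bytes, utf-16={u16} bytes")
```

ASCII-heavy text is half the size in UTF-8, CJK text is smaller in UTF-16, and Cyrillic comes out even; neither encoding is uniformly better, which is why the argument keeps recurring.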
It has nothing to do with "frigid opinions"; we banned the account for egregious serial abuse of the site.
Please don't do such digressions in HN comments. If you don't think a user should be banned, you're welcome to email us. And of course you can vouch for their good comments.
I can see this post fine; it's not "dead".
If you ever browse /newest and/or have showdead on, you've probably seen the occasional genuine spam that turns up here. I think the HN admins have a genuinely hard time with the current volume.
To deal with the current state of affairs, Arc apparently has to be exceptionally aggressive, and I've seen it generate a small stream of false positives.
In future, open the post (click the timestamp) and click "vouch". If it stays [dead], hopefully enough other people also vouch for it.
dullgiulio | 8 years ago
I am convinced that similar tricks are employed by almost every language runtime or standard library (GNU libc also does this, etc.)
[1] https://github.com/golang/go/blob/master/src/runtime/asm_amd... [2] https://en.wikipedia.org/wiki/Duff%27s_device
nicoulaj | 8 years ago
(look for "java_lang_Math" for instance)
ghusbands | 8 years ago
(vpcmpestri (of the pcmpxstrx family) isn't an especially crazy instruction to use for string comparison. That's what it's designed for.)
zakk | 8 years ago
(I am just being ironic about the title; the content is actually great!)
oblio | 8 years ago
Or even better (Yay for Betteridge!):
"Has the JVM found the ultimate string comparison solution in this CRAZY instruction?"
vardump | 8 years ago
UTF-16 came only later, once it was clear that 65,536 code points were too few.
Had those languages been designed in the last 10 years, all of them would have picked UTF-8 as their string encoding.
[0]: https://en.wikipedia.org/wiki/Path_dependence
RyanZAG | 8 years ago
http://www.baeldung.com/java-9-compact-string