top | item 16407002

dlwh | 8 years ago

Thanks for the kind words.

Breeze does a large chunk of (dense) compute via netlib-java, which calls out to "real" lapack if you set it up. Are things really faster than that? Or are you referring to the non BLAS/non Lapack things?

agibsonccc | 8 years ago

A few things about netlib-java:

1. It's a read-only repository now; it's retired. Lack of maintenance will hurt its long-term prospects.

2. The license on netlib-java's native binaries is not commercially friendly.

3. netlib-java does everything on-heap with double arrays; we do everything off-heap. There's no copying to worry about, and our data buffers give us much lower latency and more flexibility.

4. Thanks to JavaCPP, we have better control over and interop with other C++ libraries like OpenCV. This makes it easier to write native code and use it from Java later on. It allowed us to write and maintain all of our own C/C++ code with the same API (see: libnd4j) - https://github.com/deeplearning4j/libnd4j
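To make the on-heap vs off-heap distinction in point 3 concrete, here is a minimal sketch using only the JDK's `ByteBuffer` (this is a generic illustration of the idea, not ND4J's actual buffer API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // On-heap: a plain double[] lives inside the GC-managed JVM heap,
        // so its contents generally must be copied before native BLAS code
        // can operate on them.
        double[] onHeap = {1.0, 2.0, 3.0};

        // Off-heap: a direct ByteBuffer is allocated outside the GC-managed
        // heap, so native code can read and write it with no copy.
        ByteBuffer raw = ByteBuffer.allocateDirect(3 * Double.BYTES)
                                   .order(ByteOrder.nativeOrder());
        DoubleBuffer offHeap = raw.asDoubleBuffer();
        for (double d : onHeap) offHeap.put(d);

        System.out.println(raw.isDirect());   // prints true
        System.out.println(offHeap.get(1));   // prints 2.0 (absolute index)
    }
}
```

The trade-off is that off-heap memory is invisible to the garbage collector, which is why a library taking this route needs its own memory management (as described below).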

So yes, it ends up being faster in practice in a lot of scenarios. Aside from that, we also have more control over the BLAS libraries we pick.

This means we also have access to cuBLAS, as well as (see below) more configuration and flexibility.

netlib-java tries to be "pure," which, while elegant, isn't practical if you want to benefit from GPUs and deep learning. We implemented the proper shims to make things "just work" from the user's perspective, on top of having more flexibility (see: MKL's OpenMP knobs, etc.)

ND4J has its own built-in garbage collector and memory management, which means we don't have to worry about any strange workarounds when working with CPUs/GPUs, and we can keep off-heap buffers in a managed manner.

See:

http://deeplearning4j.org/workspaces

http://deeplearning4j.org/native

In general, "just BLAS" isn't enough. I know from personal experience: I wrote ND4J after trying every Java library for matrix compute, and all of them fell flat in terms of speed and interop with other C++ libraries, and the need to use Java arrays was highly limiting. Over the years, we built up ND4J to handle harder scenarios.

This includes features like distributed parameter servers, among other things.

Other things aside: I like what Breeze attempted, but it ultimately didn't scratch the itch for me when I was looking hard at the various Java matrix libraries (I've tried all of them).

When I originally built out ND4J, it had this backend architecture:

http://nd4j.org/backend.html

The idea was that we could swap in whatever matrix backend we wanted. None of them worked well enough, given the flexibility we needed.

I also had an inherent problem with Java-based for loops in any setting. We wrote our own fork/join implementation as well, attempting to make it fast, and it just couldn't beat plain C.
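For readers unfamiliar with the approach being compared against plain C here, this is the general shape of a fork/join computation on the JDK: a task splits itself recursively until chunks are small enough for a plain loop. This is a textbook `RecursiveTask` sketch, not ND4J's actual implementation:

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum extends RecursiveTask<Double> {
    private static final int THRESHOLD = 1_000; // below this, loop sequentially
    private final double[] a;
    private final int lo, hi;

    public ForkJoinSum(double[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= THRESHOLD) {
            // Base case: a plain Java for loop over a small slice.
            double s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        // Recursive case: split the range and run both halves in parallel.
        int mid = (lo + hi) >>> 1;
        ForkJoinSum left = new ForkJoinSum(a, lo, mid);
        ForkJoinSum right = new ForkJoinSum(a, mid, hi);
        left.fork();                          // schedule left half asynchronously
        return right.compute() + left.join(); // compute right here, then combine
    }

    public static void main(String[] args) {
        double[] a = new double[1_000_000];
        Arrays.fill(a, 1.0);
        double sum = ForkJoinPool.commonPool()
                                 .invoke(new ForkJoinSum(a, 0, a.length));
        System.out.println(sum); // prints 1000000.0
    }
}
```

Even well-tuned versions of this pattern carry JVM overheads (bounds checks, object headers, task scheduling) that a vectorized C loop over a raw buffer doesn't pay, which is the point being made above.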

We've found that, especially beyond matrices of around 128 x 128, we hands-down beat every JVM library out there, no matter what language it's written in. The last bit we're working on is smaller matrices.

The other area we're working on is our sparse support, which could use some work. The basics are in there, but it's not quite ready for prime time yet.

After that (I'm obviously biased), I don't see how anything could compete with us, especially once we add our autodiff/PyTorch-like stack on top of all these primitives.

Hope that helps!