I'm extremely happy with Julia's performance. I'm using it right now to prototype some new algorithms for my PhD.
How fast is fast? Well, in my case I'm working on collision detection, and I'm getting <30 ns for SAT tests between OBBs (oriented bounding boxes) and <300 ns for minimum-distance computation with the GJK algorithm, also between OBBs.
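To give a flavor of what such a SAT test looks like, here's a minimal sketch (not the commenter's code; the `OBB` struct and function names are made up for the example): the core of the separating-axis test is projecting both boxes onto a candidate axis and checking whether the projected intervals overlap.

```julia
using LinearAlgebra

# Illustrative sketch of one separating-axis check between two
# oriented bounding boxes. All names here are hypothetical.
struct OBB
    center::Vector{Float64}
    half::Vector{Float64}    # half-extents along the box's local axes
    R::Matrix{Float64}       # columns are the box's local axes (orthonormal)
end

# Radius of the box's projection interval onto `axis`.
radius_along(b::OBB, axis) = sum(b.half .* abs.(b.R' * axis))

# True if `axis` separates the two boxes.
function separated_on(a::OBB, b::OBB, axis)
    dist = abs(dot(b.center - a.center, axis))
    return dist > radius_along(a, axis) + radius_along(b, axis)
end
```

For the sub-30 ns figures quoted above, one would use fixed-size, stack-allocated types (e.g. StaticArrays.jl) instead of `Vector`/`Matrix`, so the whole test inlines with zero heap allocations.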
In the context of recommender systems, I ran an experiment comparing Julia against Cython when I started learning the language and wrote about it, although I was a newbie at the time and should update it: https://plopezadeva.com/julia-first-impressions.html
So, the article does a bit of goalpost moving. OP provides an example of a massive speedup for real applications, which I don't doubt, but constantly compares between Python (which is really only fast when you dispatch to optimized C/C++) and C.
"Unless you are a C/C++ wizard normal Python developers cannot fix or optimize big libraries like NumPy or TensorFlow. In Julia its users tend to contribute more to the libraries they use themselves."
I am excited by Julia. It seems to have a good niche and community around it. It seems well suited for its job. That focus will probably make it better than more general alternatives in short order.
Well, further, Python's performance problems tend to come from its C/C++ interop. Python made a few mistakes early on that make JITing it really hard without backwards-incompatible changes.
Easy/Fast C interop is a blessing and a curse. Yes, you can do fast things from C with little overhead, but it also makes things like having a generational garbage collector or a JIT really hard.
The author gives an example of a little function written in a generic way, and shows how JIT compilation specializes it at runtime into highly optimized machine code. It’s a wonderful example of how Julia’s JIT compiler works, and how it can lead to significant speedups over even C or Fortran code in some cases.
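To make that concrete, here's a small sketch of the kind of thing being described (the function here is mine, not the article's): a generic function with no type annotations that Julia specializes per argument type on first call.

```julia
# A generic function: no type annotations anywhere.
poly(x) = 3x^2 + 2x + 1

# Julia compiles a separate, fully typed specialization for each
# concrete argument type the function is called with:
poly(2)       # specialized for Int
poly(2.0)     # specialized for Float64
poly(3//2)    # specialized for Rational{Int}

# You can inspect the generated machine code per specialization:
# julia> @code_native poly(2)
# julia> @code_native poly(2.0)
```

Each specialization is compiled as if the function had been written for that one type, which is how generic high-level code can end up matching hand-specialized C or Fortran.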
Looking at https://github.com/arturofburgos/Assessment-of-Programming-L... https://www.matecdev.com/posts/numpy-julia-fortran.html https://github.com/zyth0s/bench_density_gradient_wfn https://github.com/PIK-ICoNe/NetworkDynamicsBenchmarks
people do find Julia to be faster than Python/NumPy, but it is not uniformly faster than Fortran. And Julia's start-up time should not be ignored. Quoting the last link: "In fact the whole Fortran benchmark (300 integrations) finishes roughly in the time it takes to startup a Julia session and import all required libraries (Julia 1.5.1)."
I don’t think anyone is claiming that Julia is uniformly faster than well-written Fortran. But you can get comparable performance with code using a style that many people find easier to read, write, and reason about; and it’s interesting that sometimes this code does outperform optimized C or Fortran code.
Startup time is much improved in recent Julia versions¹, but is certainly not negligible for short calculations.
[1] https://lwn.net/Articles/856819/
Comparing different metrics is not valid across all use cases. Counting startup time but excluding development time is disingenuous. Having used both Fortran and Julia, I find the difference in development time staggering.
Different tools for different use cases is the best way to put it.
Of course it will not always beat Fortran, but don't you think it is damn impressive that a high-level dynamic language with much higher productivity frequently matches or beats Fortran? That is a crazy achievement if you ask me.
Real world systems today are going to use a lot of computing resources, such as clusters, GPUs, tensor processing units, multiple cores etc. In such a world, anything that makes that easy to deal with is going to have the performance edge in practice.
Doesn't matter how fast a Fortran program would be in theory, if the Julia program is delivered years ahead of it.
If Julia could become a native vector/GPU programming language that is usable by mere mortals, this could be a niche that might eventually grow into the mainstream. But I can't help noticing that e.g. Nvidia's GitHub mentions only Python, C++, C, Go, and CUDA (not sure the order matters).
Julia has excellent CUDA support [1]; I have no idea why Nvidia doesn't promote it more. It's fast, flexible, and very featureful. There was a recent thread about it here: https://news.ycombinator.com/item?id=27496679
The AMD [2] and Intel [3] support is younger, but developing quickly.
There's also KernelAbstractions.jl [4], a unified API that works across different GPU vendors to avoid lock-in.
[1] https://cuda.juliagpu.org/dev/
[2] https://github.com/JuliaGPU/AMDGPU.jl
[3] https://github.com/JuliaGPU/oneAPI.jl
[4] https://github.com/JuliaGPU/KernelAbstractions.jl
Well, Common Lisp or Racket are a lot faster than Python too, and both are general-purpose languages, which gives them many advantages over a language specific to numerical computing.
If only Julia had fluent static compilation, it would be in the top 10 of languages. It's too bad there is so little investment in the really important stuff that would benefit the world. All the investment goes to bubbles and financial pyramids.
One of the gripes that I have with Julia is that if you write linear algebra code naively, you will have tons of unnecessary temporary allocations, while in Eigen (a C++ library) you can avoid most of these without sacrificing too much readability. (It even optimizes how to run matrix kernels on the fly!) Sure, you can rewrite your Julia code in C-style to remove those temporary allocations, but then the code becomes even less readable than what you can achieve in C++.
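A minimal illustration of the allocation issue (array sizes are my choice, not taken from the benchmark): every `A * B` allocates a fresh result matrix, while `mul!` from the LinearAlgebra stdlib writes into a preallocated buffer.

```julia
using LinearAlgebra

A = rand(18, 18); B = rand(18, 18)
C = similar(A)                 # preallocated output buffer

# Warm up both paths so @allocated doesn't count compilation:
_ = A * B
mul!(C, A, B)

alloc_naive   = @allocated (A * B)        # allocates a new 18x18 matrix
alloc_inplace = @allocated mul!(C, A, B)  # reuses C: no result allocation
```

In a loop with tens of thousands of iterations, those per-iteration allocations and the GC pressure they create add up, which is what the benchmark below measures.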
Here's an example: https://ronanarraes.com/tutorials/julia/my-julia-workflow-re...
The naive Julia version has unnecessary allocations and is therefore 23% slower than the optimized version:
@inbounds for k = 2:60000
Pp .= Fk_1 * Pu * Fk_1' .+ Q
K .= Pp * Hk' * pinv(R .+ Hk * Pp * Hk')
aux1 .= I18 .- K * Hk
Pu .= aux1 * Pp * aux1' .+ K * R * K'
result[k] = tr(Pu)
end
In order for this loop to match the C++ version you need to use C-style functions:
for k = 2:60000
# Pp = Fk_1 * Pu * Fk_1' + Q
mul!(aux2, mul!(aux1, Fk_1, Pu), Fk_1')
@. Pp = aux2 + Q
# K = Pp * Hk' * pinv(R + Hk * Pp * Hk')
mul!(aux4, Hk, mul!(aux3, Pp, Hk'))
mul!(K, aux3, pinv(R + aux4))
# Pu = (I - K * Hk) * Pp * (I - K * Hk)' + K * R * K'
mul!(aux1, K, Hk)
@. aux2 = I18 - aux1
mul!(aux6, mul!(aux5, aux2, Pp), aux2')
mul!(aux5, mul!(aux3, K, R), K')
@. Pu = aux6 + aux5
result[k] = tr(Pu)
end
... which is quite dirty. But you can write the same thing in C++ like this (and even be a bit faster than Julia!):
for(int k = 2; k <= 60000; k++) {
Pp = Fk_1*Pu*Fk_1.transpose() + Q;
aux1 = R + Hk*Pp*Hk.transpose();
pinv = aux1.completeOrthogonalDecomposition().pseudoInverse();
K = Pp*Hk.transpose()*pinv;
aux2 = I18 - K*Hk;
Pu = aux2*Pp*aux2.transpose() + K*R*K.transpose();
result[k-1] = Pu.trace();
}
which is much more readable than Julia's optimized version.
If Julia had a linear-algebra-aware optimizing compiler (without the sheer madness of the C++ template metaprogramming that Eigen uses), then Julia's standing in HPC would be much, much better. I admit that it's a hard goal to achieve, since I haven't seen any language try this; the closest I've seen is LLVM's matrix intrinsics (https://clang.llvm.org/docs/MatrixTypes.html), but they're only a proposal.
In this code, the main problem, I think, is that there are intermediate results being allocated, e.g. Fk_1 * Pu * Fk_1'. I would speculate that you could improve on the baseline code by preallocating these, in the same way that Pp, K, aux1, and Pu are initialized outside of the loop.
Are you sure the difference is due to the allocations? I would expect this to be dominated by the matrix multiplies or SVDs. Are you comparing with the same BLAS/LAPACK?
Edit: OK, I see those are small matrices. Then StaticArrays.jl should be a nice contender here, both for speed and readability.
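A sketch of that suggestion (sizes and values are made up, and it assumes the StaticArrays.jl package is installed): with `SMatrix`, small fixed-size matrices are stack-allocated, so the readable `F * P * F' + Q` style allocates nothing per iteration.

```julia
using StaticArrays, LinearAlgebra

# Covariance propagation in the naive, readable style, but with
# stack-allocated static matrices: no heap allocation in the loop.
function propagate(P::SMatrix, F::SMatrix, Q::SMatrix, n::Integer)
    for _ in 1:n
        P = F * P * F' + Q   # rebinds P; each result lives on the stack
    end
    return P
end

F  = @SMatrix [1.0 0.1; 0.0 1.0]
Q  = SMatrix{2,2}(0.01I)
P0 = SMatrix{2,2}(1.0I)
P  = propagate(P0, F, Q, 1000)
```

For the 18x18 matrices in the benchmark above, static arrays are near the upper end of where they pay off, so it would need measuring, but for genuinely small kernels this recovers both the readability and the speed.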