ffriend's comments
ffriend | 1 year ago | on: Async await: the worst thing to happen to programming?
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
Now compare it to the Hugging Face implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of the attention mask (which is not the same as the mask you read about in transformer tutorials), several cache classes, dozens of flags to control things like output format or serialization, etc.
It's not that Meta's implementation is good and HF's is bad - they pursue different goals, each in its own optimal way. But if you just want to learn how the model works, Meta's codebase is great.
[1]: https://github.com/huggingface/transformers/blob/main/src/tr...
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
[1]: https://github.com/meta-llama/llama3/blob/main/llama/model.p...
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
[1]: https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
ffriend | 7 years ago | on: Julia 1.0
I believe it's more complicated than most posters there realize, especially in the context of PyTorch (which uses a fork of autograd under the hood) with its dynamic graphs... Anyway, AD deserves its own discussion; that's why I didn't want to concentrate on it here.
> I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
Sounds good! Do you have a task at hand that would involve all the topics and could be implemented in limited time? Maybe some kind of Monte Carlo simulation or Gibbs sampling to get started?
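A Monte Carlo π estimate is roughly the size of task I have in mind - small, but it already exercises broadcasting, elementwise ops and reductions. A minimal NumPy sketch of such a starting task (the function name is mine, just for illustration):

```python
import numpy as np

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling uniform points in the unit square."""
    rng = np.random.default_rng(seed)
    xy = rng.random((n_samples, 2))          # n_samples points in [0, 1)^2
    inside = (xy ** 2).sum(axis=1) <= 1.0    # broadcasted elementwise ops + reduction
    return 4.0 * inside.mean()               # fraction inside quarter circle, times 4

print(estimate_pi(1_000_000))  # ~3.14
```

The Julia version would be a near-literal translation, which is what makes it a good side-by-side candidate.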
ffriend | 7 years ago | on: Julia 1.0
The library is for reverse-mode automatic differentiation, but let's put AD itself aside and talk about code generation. As input to the code generator, I have a computational graph (or "tape") - a list of functions connecting input and intermediate variables. As output, I want a compiled function for CPU/GPU. (Note: Theano used to do exactly this, but that's a separate huge system, not relevant to Numba or Cython.)
In Julia, I follow these steps:
1. Convert all operations on the tape to expressions (~1 line of code per operation type).
2. Eliminate common subexpressions.
3. Fuse broadcasting, e.g. rewrite:
c .= a .* b
e .= c .+ d
into e .= a .* b .+ d
A dot near an operation means that it is applied elementwise without creating intermediate arrays. On CPU, the Julia compiler / LLVM then generates code that reads and writes memory exactly once (unlike, e.g., what you would get with several separate operations on NumPy arrays). On GPU, CUDAnative generates a single CUDA kernel, which in my tests is ~1.5 times faster than several separate kernels. Note that `.=` also means that the result of the operation is written directly to a (buffered) destination, so no memory is allocated in the hot loop.
4. Rewrite everything I can into in-place operations. Notably, matrix multiplication `A * B` is replaced with its BLAS/CUBLAS alternative.
5. Add to the expression function header, buffers and JIT-compile the result.
In Python, I imagine using the `ast` module for code parsing and transformations like common subexpression elimination (how hard would it be?). Perhaps Numba could be used to compile the Python code to fast CPU and GPU code, but does it work on ASTs? Also, do Numba or Cython do optimizations like broadcast and kernel fusion? I'd love to see a side-by-side comparison of capabilities in such a scenario!
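For a taste of the Python side, here is a toy version of step 2 with `ast`: structurally identical subtrees can be found by comparing their dumps. This is only a sketch (the function name is mine) - real CSE must also prove the repeated subexpressions are pure, which for an algebraic tape of known operations is given:

```python
import ast
from collections import Counter

def common_subexpressions(src: str):
    """Return dumps of subexpressions that occur more than once."""
    tree = ast.parse(src, mode="eval")
    counts = Counter()
    for node in ast.walk(tree):
        # ast.dump() gives the same string for structurally equal nodes
        if isinstance(node, (ast.BinOp, ast.Call)):
            counts[ast.dump(node)] += 1
    return [expr for expr, n in counts.items() if n > 1]

# `a * b` appears twice, so it is a candidate for elimination:
print(common_subexpressions("(a * b) + sin(a * b)"))
```

Rewriting the tree to hoist the repeated node into a temporary would be the next step, e.g. with an `ast.NodeTransformer`.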
ffriend | 7 years ago | on: My favorite things that are coming with Julia 1.0
Python has a long history of web server development. I think aiohttp may be considered state of the art now. Let's measure its performance:
from aiohttp import web
async def handle(request):
return web.Response(text="Hello")
app = web.Application()
app.add_routes([web.get('/', handle),])
web.run_app(app)
Using `wrk` for testing:
$ wrk -t1 -c1000 -d30s http://127.0.0.1:8080/
Running 30s test @ http://127.0.0.1:8080/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 164.59ms 17.12ms 537.35ms 83.97%
Req/Sec 6.05k 0.95k 7.74k 74.33%
180749 requests in 30.08s, 26.72MB read
Requests/sec: 6008.75
Transfer/sec: 0.89MB
So we have ~6k rps on a single CPU core. As far as I remember, Tornado does ~4k rps, while the built-in Flask server can process only about 1k requests per second. Yes, you are unlikely to use the Flask dev server in production, but for aiohttp this is indeed a recommended way to run it. Now let's measure Julia's HTTP.jl server:
using HTTP
HTTP.listen() do request::HTTP.Request
return HTTP.Response("Hello")
end
which gives:
$ wrk -t1 -c1000 -d30s http://127.0.0.1:8081/
Running 30s test @ http://127.0.0.1:8081/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 105.95ms 108.35ms 1.99s 98.40%
Req/Sec 9.65k 1.69k 13.66k 80.21%
271917 requests in 30.09s, 16.10MB read
Socket errors: connect 0, read 0, write 0, timeout 274
Requests/sec: 9035.73
Transfer/sec: 547.67KB
So it's ~9k rps (with a few timed-out requests, though). This doesn't include any routing, input data parsing, header or cookie processing, etc., but it amazes me how good the server is given that web development is NOT considered a strong part of the language.
The downside of Julia web programming is the ecosystem of libraries and tools (e.g. routers, DB connectors, template engines, etc.) - they exist, but are quite far behind their Python equivalents, so gotchas are expected. Yet I'm quite positive about the future of web programming in Julia.
ffriend | 8 years ago | on: Matrix Calculus
The way I dealt with it was to first translate the vectorized expression into so-called Einstein notation [2] - an indexed expression with implicit sums over repeated indices. E.g. the matrix product `Z = X * Y` may be written in it as:
Z[i,j] = X[i,k] * Y[k,j] # implicitly sum over k
It worked pretty well, and I was able to get results in Einstein notation for element-wise functions, matrix multiplication and even convolutions. Unfortunately, the only way to calculate such expressions efficiently is to convert them back to vectorized notation, and that is not always possible (e.g. because of sparse structure) and very error-prone.
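Incidentally, NumPy's `einsum` evaluates exactly this kind of indexed expression, which makes it handy for sanity-checking derivations in Einstein notation. For the matrix product above:

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)
Y = np.arange(12, dtype=float).reshape(3, 4)

# Z[i,j] = X[i,k] * Y[k,j], with an implicit sum over the repeated index k
Z = np.einsum("ik,kj->ij", X, Y)

assert np.allclose(Z, X @ Y)  # agrees with ordinary matrix multiplication
```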
The good news is that if the result of the whole expression is a scalar, all the derivatives will have the same number of dimensions as corresponding inputs. E.g. in:
y = sum(W * X + b)
if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be a 3D tensor). This is the reason why the backpropagation algorithm (and symbolic/automatic differentiation in machine learning in general) works. So finally I ended up with another library [3], which can only deal with scalar outputs, but is much more stable. A theoretical description of the method behind the first library can be found in [4] (pages 1338-1343; caution - a 76M file), while the set of rules I've derived is in [5].
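The shape claim is easy to verify numerically. For `y = sum(W * X + b)`, the gradient is `dy/dW[i,k] = sum_j X[k,j]` - a matrix of the same shape as `W` whose rows all equal the row sums of `X`. A finite-difference check in NumPy (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
W, X, b = rng.random((2, 3)), rng.random((3, 4)), rng.random((2, 4))

def y(Wm):
    return float(np.sum(Wm @ X + b))

# Analytic gradient: dy/dW[i,k] = sum_j X[k,j], the same for every row i
analytic = np.broadcast_to(X.sum(axis=1), W.shape)

# Finite-difference gradient, one entry at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for k in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, k] = eps
        numeric[i, k] = (y(W + dW) - y(W)) / eps

assert numeric.shape == W.shape                   # gradient has the shape of W
assert np.allclose(numeric, analytic, atol=1e-4)  # and matches the analytic form
```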
[1]: https://github.com/dfdx/XDiff.jl
[2]: https://en.wikipedia.org/wiki/Einstein_notation
[3]: https://github.com/dfdx/XGrad.jl
[4]: http://docs.mipro-proceedings.com/proceedings/mipro_2017_pro...
[5]: https://github.com/dfdx/XDiff.jl/blob/master/src/trules.jl
ffriend | 8 years ago | on: Tensorflow sucks
You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
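A sketch of that approach in Python: write the model against the NumPy API and let the array type select the backend. The same function runs on CPU with `numpy` arrays and would run on GPU if passed, say, CuPy arrays (only the NumPy path is shown here; the function name is mine):

```python
import numpy as np

def dense_layer(W, x, b):
    # No graph, no placeholders: `@` and `np.maximum` dispatch to
    # whatever array implementation W, x and b actually are
    # (numpy.ndarray on CPU; a cupy.ndarray would run this on GPU).
    return np.maximum(W @ x + b, 0.0)

W = np.array([[1.0, -2.0], [0.5, 0.5]])
x = np.array([3.0, 1.0])
b = np.array([0.0, -1.0])
print(dense_layer(W, x, b))  # [1. 1.]
```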
> [...] and let’s tensorboard show you an awesome view of your computation
At the end of the post, the author shows his own API that lets you do the same things as TensorBoard, but for whatever framework you like.
All in all, expression graphs like those used in TF and Theano are great for symbolic differentiation of a loss function and for further expression optimization (e.g. simplification, operation fusion, etc.). But TF goes further and makes everything a node in the graph - even things that are not algebraic expressions, such as variable initialization or objective optimization.
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
No surprise, Julia gives nearly the same result:
julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD. Just noticed the 7.45 GiB of allocations. We can get rid of them as:
julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes)
or:
julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:
julia> using BenchmarkTools
julia> @benchmark C = A + B
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 2.359 ms (0.00% GC)
median time: 2.713 ms (0.00% GC)
mean time: 3.794 ms (28.81% GC)
maximum time: 62.708 ms (95.27% GC)
--------------
samples: 1314
evals/sample: 1
julia> @benchmark C .= A .+ B
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 1.232 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.356 ms (0.00% GC)
maximum time: 2.572 ms (0.00% GC)
--------------
samples: 3651
evals/sample: 1
Note that memory usage dropped from 7.63 MiB to 128 bytes.
[1]: https://docs.julialang.org/en/stable/manual/functions/#man-v...
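NumPy can express the same allocation-free update through ufunc `out=` arguments, though unlike Julia's dot syntax it won't fuse longer chains of operations into one pass:

```python
import numpy as np

A = np.ones((1000, 1000))
B = np.ones((1000, 1000))
C = np.empty_like(A)

# Like Julia's `C .= A .+ B`: writes directly into C, no temporary array
np.add(A, B, out=C)
```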
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
In this case, the way the author shows it isn't the best one: he modifies the Python code to be more realistic - that's ok, but why doesn't he do the same thing for Julia? Obviously, writing a recursive Fibonacci function isn't the best way to implement it. Obviously, caching can improve performance. But why not apply these changes to both implementations?
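To make the "apply it to both" point concrete: in Python, caching the recursive Fibonacci is one decorator away, and Julia has equally cheap equivalents, so this optimization says nothing about either language:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Naive recursion, but each value is computed only once thanks to the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(90))  # instant, where the uncached version would take ages
```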
ffriend | 8 years ago | on: Julia Computing Raises $4.6M in Seed Funding
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
[1]: http://www.ekmaninternational.com/paul-ekman-international-p...
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
My interest in offline applications comes from personal experience: recently we demonstrated our product (not emotion recognition, but it also captures the user's face) at an exhibition. People came to our stand, used the product (so they clearly opted in), asked questions, etc. After 2 days, we asked a girl at the stand: "What do people think about the product?" "Well, in general, they are interested," she answered. Not much info, right? Definitely less informative than "65% expressed mild interest, 20% had no reaction and 5% found it disgusting, especially this feature".
So I don't try to justify this use case - my life doesn't depend on it - but I find it stupid not to try to understand your clients better when it doesn't introduce a moral conflict.
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
But if that person just memorizes customer reactions to understand how people on average react to particular products or actions, that's ok, right? Because this is what sellers and business owners do to improve their products. So is it about the human-to-human interaction, or some more subtle detail? I'm biased here, so sorry if I'm missing something obvious in this situation.
ffriend | 1 year ago | on: Async await: the worst thing to happen to programming?
1) async programming vs. threading
2) infectious async/await syntax
Async programming is great. Coroutines are a powerful tool, both for expressing your ideas more clearly and for improving performance in IO-heavy systems.
async/await syntax may not be the best design for async programming, though. Consider an example in Julia:
`foo()` returns an asynchronous `Task`, `bar()` waits on this task, and you can invoke `bar()` from whatever context you want. Now look at the Python version with the async/await keywords: oops, we can't make `bar()` synchronous - it MUST be `async` now, as well as all functions that invoke `bar()`. This is what is meant by "infectious" behavior. Maybe we can wrap it into `asyncio.run()` then and stop the async avalanche?
Yes, that works in a synchronous context, but the path to an asynchronous context is now closed for us. So in practice, whenever you change one of your functions to `async`, you have to change all its callers up the stack to also be `async`. And it hurts a lot. Can we have asynchronous programming in Python without async/await? Well, prior to Python 3.5 we used generators, so it looks like at least technically it's possible.
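A minimal Python illustration of the dilemma (function names are mine): once `foo` is a coroutine, a plain `bar` cannot simply call it - either `bar` spins up its own event loop with `asyncio.run()`, which then fails if `bar` is ever invoked from an already-async context, or `bar` becomes `async` itself and every caller up the stack must follow:

```python
import asyncio

async def foo() -> int:
    await asyncio.sleep(0.01)
    return 42

# Option 1: bar stays synchronous by running its own event loop --
# but asyncio.run() raises if called while another loop is already running.
def bar_sync() -> int:
    return asyncio.run(foo())

# Option 2: bar goes async -- and the "infection" spreads to its callers.
async def bar_async() -> int:
    return await foo()

print(bar_sync())                 # fine from synchronous code
print(asyncio.run(bar_async()))   # callers must now deal with the coroutine
```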