ffriend's comments
ffriend | 1 year ago | on: Async await: the worst thing to happen to programming?
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
Now compare it to the Hugging Face implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of the attention mask (which is not the same as the mask you read about in transformer tutorials), several cache classes, dozens of flags to control things like output format or serialization, etc.
It's not that Meta's implementation is good and HF's is bad - they pursue different goals, each in its own optimal way. But if you just want to learn how the model works, Meta's codebase is great.
[1]: https://github.com/huggingface/transformers/blob/main/src/tr...
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
[1]: https://github.com/meta-llama/llama3/blob/main/llama/model.p...
ffriend | 1 year ago | on: Llama 3 implemented in pure NumPy
[1]: https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
ffriend | 7 years ago | on: Julia 1.0
I believe it's more complicated than most posters there realize, especially in the context of PyTorch (which uses a fork of autograd under the hood) with its dynamic graphs... Anyway, AD deserves its own discussion; that's why I didn't want to concentrate on it here.
> I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
Sounds good! Do you have a task at hand that would involve all the topics and could be implemented in limited time? Maybe some kind of Monte Carlo simulation or Gibbs sampling to get started?
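A Monte Carlo π estimate is roughly the size of task I have in mind - small, but it already exercises broadcasting, elementwise ops and reductions. A minimal NumPy sketch of such a starting task (the function name is mine, just for illustration):

```python
import numpy as np

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling uniform points in the unit square."""
    rng = np.random.default_rng(seed)
    xy = rng.random((n_samples, 2))          # n_samples points in [0, 1)^2
    inside = (xy ** 2).sum(axis=1) <= 1.0    # broadcasted elementwise ops + reduction
    return 4.0 * inside.mean()               # fraction inside quarter circle, times 4

print(estimate_pi(1_000_000))  # ~3.14
```

The Julia version would be a near-literal translation, which is what makes it a good side-by-side candidate.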
ffriend | 7 years ago | on: Julia 1.0
The library is for reverse-mode automatic differentiation, but let's put AD itself aside and talk about code generation. As input to the code generator, I have a computational graph (or "tape") - a list of functions connecting input and intermediate variables. As output, I want a compiled function for CPU/GPU. (Note: Theano used to do exactly this, but that's a separate huge system, not relevant to Numba or Cython.)
In Julia, I follow these steps:
1. Convert all operations on the tape to expressions (~1 line of code per operation type).
2. Eliminate common subexpressions.
3. Fuse broadcasting, e.g. rewrite:
c .= a .* b
e .= c .+ d
into e .= a .* b .+ d
A dot near an operation means that it is applied elementwise without creating intermediate arrays. On CPU, the Julia compiler / LLVM then generates code that reads and writes memory exactly once (unlike, e.g., what you would get with several separate operations on NumPy arrays). On GPU, CUDAnative generates a single CUDA kernel, which in my tests is ~1.5 times faster than several separate kernels. Note that `.=` also means that the result of the operation is written directly to a (buffered) destination, so no memory is allocated in the hot loop.
4. Rewrite everything I can into in-place operations. Notably, matrix multiplication `A * B` is replaced with its BLAS/CUBLAS alternative.
5. Add to the expression function header, buffers and JIT-compile the result.
In Python, I imagine using the `ast` module for code parsing and transformations like common subexpression elimination (how hard would it be?). Perhaps Numba could be used to compile the Python code to fast CPU and GPU code, but does it work on ASTs? Also, do Numba or Cython do optimizations like broadcast and kernel fusion? I'd love to see a side-by-side comparison of capabilities in such a scenario!
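For a taste of the Python side, here is a toy version of step 2 with `ast`: structurally identical subtrees can be found by comparing their dumps. This is only a sketch (the function name is mine) - real CSE must also prove the repeated subexpressions are pure, which for an algebraic tape of known operations is given:

```python
import ast
from collections import Counter

def common_subexpressions(src: str):
    """Return dumps of subexpressions that occur more than once."""
    tree = ast.parse(src, mode="eval")
    counts = Counter()
    for node in ast.walk(tree):
        # ast.dump() gives the same string for structurally equal nodes
        if isinstance(node, (ast.BinOp, ast.Call)):
            counts[ast.dump(node)] += 1
    return [expr for expr, n in counts.items() if n > 1]

# `a * b` appears twice, so it is a candidate for elimination:
print(common_subexpressions("(a * b) + sin(a * b)"))
```

Rewriting the tree to hoist the repeated node into a temporary would be the next step, e.g. with an `ast.NodeTransformer`.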
ffriend | 7 years ago | on: My favorite things that are coming with Julia 1.0
Python has a long history of web server development. I think aiohttp may be considered state of the art now. Let's measure its performance:
from aiohttp import web
async def handle(request):
return web.Response(text="Hello")
app = web.Application()
app.add_routes([web.get('/', handle),])
web.run_app(app)
Using `wrk` for testing:
$ wrk -t1 -c1000 -d30s http://127.0.0.1:8080/
Running 30s test @ http://127.0.0.1:8080/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 164.59ms 17.12ms 537.35ms 83.97%
Req/Sec 6.05k 0.95k 7.74k 74.33%
180749 requests in 30.08s, 26.72MB read
Requests/sec: 6008.75
Transfer/sec: 0.89MB
So we have ~6k rps on a single CPU core. As far as I remember, Tornado does ~4k rps, while the built-in Flask server can process only about 1k requests per second. Yes, you are unlikely to use the Flask dev server in production, but for aiohttp this is indeed a recommended way to run it. Now let's measure Julia's HTTP.jl server:
using HTTP
HTTP.listen() do request::HTTP.Request
return HTTP.Response("Hello")
end
which gives:
$ wrk -t1 -c1000 -d30s http://127.0.0.1:8081/
Running 30s test @ http://127.0.0.1:8081/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 105.95ms 108.35ms 1.99s 98.40%
Req/Sec 9.65k 1.69k 13.66k 80.21%
271917 requests in 30.09s, 16.10MB read
Socket errors: connect 0, read 0, write 0, timeout 274
Requests/sec: 9035.73
Transfer/sec: 547.67KB
So it's ~9k rps (with a few timed-out requests, though). This doesn't include any routing, input data parsing, header or cookie processing, etc., but it amazes me how good the server is given that web development is NOT considered a strong part of the language.
The downside of Julia web programming is the ecosystem of libraries and tools (e.g. routers, DB connectors, template engines, etc.) - they exist, but are quite far behind their Python equivalents, so gotchas are expected. Yet I'm quite positive about the future of web programming in Julia.
ffriend | 8 years ago | on: Matrix Calculus
The way I dealt with it was to first translate the vectorized expression into so-called Einstein notation [2] - an indexed expression with implicit sums over repeated indices. E.g. the matrix product `Z = X * Y` may be written in it as:
Z[i,j] = X[i,k] * Y[k,j] # implicitly sum over k
It worked pretty well, and I was able to get results in Einstein notation for element-wise functions, matrix multiplication and even convolutions. Unfortunately, the only way to calculate such expressions efficiently is to convert them back to vectorized notation, and that is not always possible (e.g. because of sparse structure) and very error-prone.
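Incidentally, NumPy's `einsum` evaluates exactly this kind of indexed expression, which makes it handy for sanity-checking derivations in Einstein notation. For the matrix product above:

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)
Y = np.arange(12, dtype=float).reshape(3, 4)

# Z[i,j] = X[i,k] * Y[k,j], with an implicit sum over the repeated index k
Z = np.einsum("ik,kj->ij", X, Y)

assert np.allclose(Z, X @ Y)  # agrees with ordinary matrix multiplication
```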
The good news is that if the result of the whole expression is a scalar, all the derivatives will have the same number of dimensions as corresponding inputs. E.g. in:
y = sum(W * X + b)
if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be a 3D tensor). This is the reason why the backpropagation algorithm (and symbolic/automatic differentiation in machine learning in general) works. So finally I ended up with another library [3], which can only deal with scalar outputs, but is much more stable. A theoretical description of the method behind the first library can be found in [4] (pages 1338-1343; caution - a 76M file), while the set of rules I've derived is in [5].
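The shape claim is easy to verify numerically. For `y = sum(W * X + b)`, the gradient is `dy/dW[i,k] = sum_j X[k,j]` - a matrix of the same shape as `W` whose rows all equal the row sums of `X`. A finite-difference check in NumPy (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
W, X, b = rng.random((2, 3)), rng.random((3, 4)), rng.random((2, 4))

def y(Wm):
    return float(np.sum(Wm @ X + b))

# Analytic gradient: dy/dW[i,k] = sum_j X[k,j], the same for every row i
analytic = np.broadcast_to(X.sum(axis=1), W.shape)

# Finite-difference gradient, one entry at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for k in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, k] = eps
        numeric[i, k] = (y(W + dW) - y(W)) / eps

assert numeric.shape == W.shape                   # gradient has the shape of W
assert np.allclose(numeric, analytic, atol=1e-4)  # and matches the analytic form
```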
[1]: https://github.com/dfdx/XDiff.jl
[2]: https://en.wikipedia.org/wiki/Einstein_notation
[3]: https://github.com/dfdx/XGrad.jl
[4]: http://docs.mipro-proceedings.com/proceedings/mipro_2017_pro...
[5]: https://github.com/dfdx/XDiff.jl/blob/master/src/trules.jl
ffriend | 8 years ago | on: Tensorflow sucks
You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
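A sketch of that approach in Python: write the model against the NumPy API and let the array type select the backend. The same function runs on CPU with `numpy` arrays and would run on GPU if passed, say, CuPy arrays (only the NumPy path is shown here; the function name is mine):

```python
import numpy as np

def dense_layer(W, x, b):
    # No graph, no placeholders: `@` and `np.maximum` dispatch to
    # whatever array implementation W, x and b actually are
    # (numpy.ndarray on CPU; a cupy.ndarray would run this on GPU).
    return np.maximum(W @ x + b, 0.0)

W = np.array([[1.0, -2.0], [0.5, 0.5]])
x = np.array([3.0, 1.0])
b = np.array([0.0, -1.0])
print(dense_layer(W, x, b))  # [1. 1.]
```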
> [...] and let’s tensorboard show you an awesome view of your computation
At the end of the post, the author shows his own API that lets you do the same things as TensorBoard, but for whatever framework you like.
All in all, expression graphs like those used in TF and Theano are great for symbolic differentiation of a loss function and for further expression optimization (e.g. simplification, operation fusion, etc.). But TF goes further and makes everything a node in the graph - even things that are not algebraic expressions, such as variable initialization or objective optimization.
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
No surprise, Julia gives nearly the same result:
julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD. Just noticed the 7.45 GiB of allocations. We can get rid of them as:
julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes)
or:
julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:
julia> using BenchmarkTools
julia> @benchmark C = A + B
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 2.359 ms (0.00% GC)
median time: 2.713 ms (0.00% GC)
mean time: 3.794 ms (28.81% GC)
maximum time: 62.708 ms (95.27% GC)
--------------
samples: 1314
evals/sample: 1
julia> @benchmark C .= A .+ B
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 1.232 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.356 ms (0.00% GC)
maximum time: 2.572 ms (0.00% GC)
--------------
samples: 3651
evals/sample: 1
Note that memory usage dropped from 7.63 MiB to 128 bytes.
[1]: https://docs.julialang.org/en/stable/manual/functions/#man-v...
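NumPy can express the same allocation-free update through ufunc `out=` arguments, though unlike Julia's dot syntax it won't fuse longer chains of operations into one pass:

```python
import numpy as np

A = np.ones((1000, 1000))
B = np.ones((1000, 1000))
C = np.empty_like(A)

# Like Julia's `C .= A .+ B`: writes directly into C, no temporary array
np.add(A, B, out=C)
```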
ffriend | 8 years ago | on: How to Make Python Run as Fast as Julia (2015)
In this case, the way the author shows it isn't the best one: he modifies the Python code to be more realistic - that's ok, but why doesn't he do the same thing for Julia? Obviously, writing a recursive Fibonacci function isn't the best way to implement it. Obviously, caching can improve performance. But why not apply these changes to both implementations?
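To make the "apply it to both" point concrete: in Python, caching the recursive Fibonacci is one decorator away, and Julia has equally cheap equivalents, so this optimization says nothing about either language:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Naive recursion, but each value is computed only once thanks to the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(90))  # instant, where the uncached version would take ages
```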
ffriend | 8 years ago | on: Julia Computing Raises $4.6M in Seed Funding
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
[1]: http://www.ekmaninternational.com/paul-ekman-international-p...
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
My interest in offline applications comes from personal experience: recently we demonstrated our product (not emotion recognition, but it also captures the user's face) at an exhibition. People came to our stand, used the product (so they clearly opted in), asked questions, etc. After 2 days, we asked a girl at the stand: "What do people think about the product?" "Well, in general, they are interested," she answered. Not much info, right? Definitely less informative than "65% expressed mild interest, 20% had no reaction and 5% found it disgusting, especially this feature".
So I don't try to justify this use case - my life doesn't depend on it - but I find it stupid not to try to understand your clients better when it doesn't introduce a moral conflict.
ffriend | 8 years ago | on: A crashed advertisement reveals logs of a facial recognition system
But if that person just memorizes customer reactions to understand how people on average react to particular products or actions, that's ok, right? Because this is what sellers and business owners do to improve their products. So is it about the human-to-human interaction, or some more subtle detail? I'm biased here, so sorry if I'm missing something obvious in this situation.
ffriend | 1 year ago | on: Async await: the worst thing to happen to programming?
1) async programming vs. threading
2) infectious async/await syntax
Async programming is great. Coroutines are a powerful tool, both for expressing your ideas more clearly and for improving performance in IO-heavy systems.
async/await syntax may not be the best design for async programming, though. Consider an example in Julia:
`foo()` returns an asynchronous `Task`, `bar()` waits on this task, and you can invoke `bar()` from whatever context you want. Now look at the Python version with the async/await keywords: oops, we can't make `bar()` synchronous - it MUST be `async` now, as well as all functions that invoke `bar()`. This is what is meant by "infectious" behavior. Maybe we can wrap it into `asyncio.run()` then and stop the async avalanche?
Yes, that works in a synchronous context, but the path to an asynchronous context is now closed for us. So in practice, whenever you change one of your functions to `async`, you have to change all its callers up the stack to also be `async`. And it hurts a lot. Can we have asynchronous programming in Python without async/await? Well, prior to Python 3.5 we used generators, so it looks like at least technically it's possible.
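A minimal Python illustration of the dilemma (function names are mine): once `foo` is a coroutine, a plain `bar` cannot simply call it - either `bar` spins up its own event loop with `asyncio.run()`, which then fails if `bar` is ever invoked from an already-async context, or `bar` becomes `async` itself and every caller up the stack must follow:

```python
import asyncio

async def foo() -> int:
    await asyncio.sleep(0.01)
    return 42

# Option 1: bar stays synchronous by running its own event loop --
# but asyncio.run() raises if called while another loop is already running.
def bar_sync() -> int:
    return asyncio.run(foo())

# Option 2: bar goes async -- and the "infection" spreads to its callers.
async def bar_async() -> int:
    return await foo()

print(bar_sync())                 # fine from synchronous code
print(asyncio.run(bar_async()))   # callers must now deal with the coroutine
```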