top | item 29576323

Awkward: Nested, jagged, differentiable, mixed type, GPU-enabled, JIT'd NumPy

144 points | pizza | 4 years ago | awkward-array.org | reply

44 comments

[+] radarsat1|4 years ago|reply
This looks incredibly useful for my current project! We have lots of arrays of objects having trajectories of different lengths to deal with, and I often end up having to use numpy with dtype=object, which feels, well, "awkward". It can be very hard to convince numpy arrays to have items that are actually lists, but I like using the vectorized syntax. I've found it sometimes necessary to have Pandas dataframes containing lists too, which I also find awkward.
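For reference, the dtype=object workaround described above looks something like this (a minimal sketch, not project code): NumPy stores each jagged row as a boxed Python list, so "vectorized" operations on it fall back to Python speed.

```python
import numpy as np

# The dtype=object trick for jagged data: each element is a boxed
# Python list, so NumPy can hold it but can't truly vectorize over it.
trajectories = np.empty(3, dtype=object)
trajectories[:] = [[1, 2], [3, 4, 5], [6]]

# Any per-trajectory work ends up as a plain Python loop:
print([len(t) for t in trajectories])  # [2, 3, 1]
```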

Just a note, I noticed that there is an unfortunate error in the page describing the bike route calculations. Right in the box where it's supposed to show how much faster things are after JIT, instead a traceback is displayed ending with:

> ImportError: Numba needs NumPy 1.20 or less

I don't think this is intentional. This is in the output box just above the phrase "But it runs 250× faster than the pure Python code:". The box below it shows no timings.

I guess this is a danger of having a "live" backend to render and serve documentation.

[+] ivirshup|4 years ago|reply
Don’t be too scared by all the TODOs in the docs; most of that stuff actually works – the docs just haven’t been written yet. Admittedly, those docs have had TODOs on them for over a year.

To see how it works, you’ve got to look at the repo or any of the talks. The developers are also quite active on the tracker, so do just ask if something is unclear.

[+] mlajtos|4 years ago|reply
We are getting closer and closer to differentiable APL running on GPU. I like this direction :)
[+] Certhas|4 years ago|reply
Julia is there now. It seems absurd to me to build this on top of Python. You are building everything as a DSL that needs to be carefully crafted to never touch the data structures of the host language. Changing from Python to Julia, and suddenly being able to use data structures other than NumPy arrays and to write things that should be for loops as for loops, was such a relief. But, I guess, as JavaScript has shown, with an infinite amount of engineering resources dedicated to it you can build anything in any language.
[+] cl3misch|4 years ago|reply
Not sure if you mean exactly this, but JAX is pretty close to "differentiable APL running on GPU" I'd say.
[+] FridgeSeal|4 years ago|reply
It’d be a lot cooler if so much of the work wasn’t so inextricably linked to Python.
[+] N1H1L|4 years ago|reply
jax.numpy can also do quite a bit of this.
[+] ByThyGrace|4 years ago|reply
Which GPU? On whose drivers? On which systems?
[+] jpivarski|4 years ago|reply
Hi! I'm the original author of Awkward Array (Jim Pivarski), though there are now many contributors with about five regulars. Two of my colleagues just pointed me here—I'm glad you're interested! I can answer any questions you have about it.

First, sorry about all the TODOs in the documentation: I laid out a table of contents structure as a reminder to myself of what ought to be written, but haven't had a chance to fill in all of the topics. From the front page (https://awkward-array.org/), if you click through to the Python API reference (https://awkward-array.readthedocs.io/), that site is 100% filled in. Like NumPy, the library consists of one basic data type, `ak.Array`, and a suite of functions that act on it, `ak.this` and `ak.that`. All of those functions are individually documented, and many have examples.

The basic idea starts with a data structure like Apache Arrow (https://arrow.apache.org/)—a tree of general, variable-length types, organized in memory as a collection of columnar arrays—but performs operations on the data without ever taking it out of its columnar form. (3.5 minute explanation here: https://youtu.be/2NxWpU7NArk?t=661) Those columnar operations are compiled (in C++); there's a core of structure-manipulation functions suggestively named "cpu-kernels" that will also be implemented in CUDA (some already have, but that's in an experimental stage).
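As an illustration only (plain NumPy, not Awkward's actual internal classes), a record type with a scalar field and a variable-length list field can be sketched as flat column arrays plus an offsets array, Arrow-style:

```python
import numpy as np

# The logical data [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]},
#                   {"x": 3.3, "y": [1, 2, 3]}]
# laid out as columns: one flat array per field, with an offsets
# array marking where each record's y-list begins and ends.
x_column  = np.array([1.1, 2.2, 3.3])
y_content = np.array([1, 1, 2, 1, 2, 3])
y_offsets = np.array([0, 1, 3, 6])  # record i's list is y_content[y_offsets[i]:y_offsets[i+1]]

# A "columnar operation": the length of every y-list, computed
# without ever materializing the lists themselves.
y_lengths = y_offsets[1:] - y_offsets[:-1]
print(y_lengths)  # [1 2 3]
```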

A key aspect of this is that structure can be manipulated just by changing values in some internal arrays and rearranging the single tree organizing those arrays. If, for instance, you want to replace a bunch of objects in variable-length lists with another structure, it never needs to instantiate those objects or lists as explicit types (e.g. `struct` or `std::vector`), and so the functions don't need to be compiled for specific data types. You can define any new data types at runtime and the same compiled functions apply. Therefore, JIT compilation is not necessary.
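In the same toy columnar representation (a sketch, not Awkward's real kernels), restructuring means producing new integer arrays; no list or record object is ever instantiated:

```python
import numpy as np

# Reorder the jagged data [[1], [1, 2], [1, 2, 3]] to [list 2, list 0]
# purely by arithmetic on the offsets/content arrays.
y_content = np.array([1, 1, 2, 1, 2, 3])
y_offsets = np.array([0, 1, 3, 6])
order = np.array([2, 0])

starts, stops = y_offsets[order], y_offsets[order + 1]
new_content = np.concatenate([y_content[a:b] for a, b in zip(starts, stops)])
new_offsets = np.concatenate([[0], np.cumsum(stops - starts)])

print(new_content)  # [1 2 3 1]   -> represents [[1, 2, 3], [1]]
print(new_offsets)  # [0 3 4]
```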

We do have Numba extensions so that you can iterate over runtime-defined data types in JIT-compiled Numba, but that's a second way to manipulate the same data. By analogy with NumPy, you can compute many things using NumPy's precompiled functions, as long as you express your workflow in NumPy's vectorized way. Numba additionally allows you to express your workflow in imperative loops without losing performance. It's the same way with Awkward Array: unpacking a million record structures or slicing a million variable-length lists in a single function call makes use of some precompiled functions (no JIT), but iterating over them at scale with imperative for loops requires JIT-compilation in Numba.
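To make the two styles concrete, here is a pure-NumPy sketch (not Awkward's API) of the same sum-over-variable-length-lists computed both ways on an offsets-based jagged array:

```python
import numpy as np

content = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
offsets = np.array([0, 1, 3, 6])  # lists [1], [2, 3], [4, 5, 6]

# Array-oriented style: one precompiled call sums every list at once.
vectorized = np.add.reduceat(content, offsets[:-1])

# Imperative style: an explicit loop over lists. This is the shape of
# code that Numba would JIT-compile when iterating over Awkward data.
imperative = np.array([content[offsets[i]:offsets[i + 1]].sum()
                       for i in range(len(offsets) - 1)])

print(vectorized)  # [ 1.  5. 15.]
```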

Just as we work with Numba to provide both of these programming styles—array-oriented and imperative—we'll also be working with JAX to add autodifferentiation (Anish Biswas will be starting on this in January; he's actually continuing work from last spring, but in a different direction). We're also working with Martin Durant and Doug Davis to replace our homegrown lazy arrays with industry-standard Dask, as a new collection type (https://github.com/ContinuumIO/dask-awkward/). A lot of my time, with Ianna Osborne and Ioana Ifrim at my university, is being spent refactoring the internals to make these kinds of integrations easier (https://indico.cern.ch/event/855454/contributions/4605044/). We found that we had implemented too much in C++ and need more, but not all, of the code to be in Python to be able to interact with third-party libraries.

If you have any other questions, I'd be happy to answer them!

[+] pizza|4 years ago|reply
Do you have some examples of slicing, masking, un-padding, and (I suppose) “Haskell-like” ops, e.g. fmap, but also e.g. the treemap, vmap, and pmap that are in JAX? Also grouping, cutting, and interweaving. And, this is kind of a weird ask, but suppose I had an extremely fast operation in pure assembly that takes two int64 parameters and outputs one int64: what’s the easiest path for me to get Awkward to apply that to two Arrays and give me one Array back as output?
[+] riskneutral|4 years ago|reply
Was there no way to do this in Apache Arrow, or with some modifications to Arrow?
[+] rich_sasha|4 years ago|reply
It would be nice to see some comparison to JAX.

Is it mostly that arrays in Awkward are, sort of, unstructured? JIT-ed, ADed JAX on JSON-like structures, on GPU?

[+] nmca|4 years ago|reply
This isn't really comparable to JAX, afaict: JAX offers only masked dense computation, while this is first-class ragged support, with no mention of autodiff or JIT (outside of Numba).
[+] ivirshup|4 years ago|reply
It’s a ragged array that you can work with in python code, but also Jax and numba code.
[+] dheera|4 years ago|reply
Is there a comparison to Numba?
[+] stevesimmons|4 years ago|reply
Awkward is a complement to Numba, not an alternative.

Awkward is the container for data. Like NumPy, it supports storing data as low-level arrays rather than everything being a "boxed" Python object. While NumPy is for regular arrays, Awkward adds irregular arrays of JSON-like objects.

Numba is the computation layer. Its JIT compiler builds faster code by specialising for specific concrete data types, rather than Python allowing everything to be dynamically changed.

If Numba code is fed arrays of NumPy/Awkward data, your computations get much faster.

So they are complementary, not alternatives.

[+] karjudev|4 years ago|reply
Can I define a deep neural network as an Awkward array of matrices of different sizes? It looks very promising to compute a backpropagation step without ping-ponging with the Python interpreter.
[+] jpivarski|4 years ago|reply
I hadn't considered an Awkward array being used as a neural network, and I would expect specialized neural network frameworks to optimize it better: don't they fuse the matrices representing each layer into a single outside-of-Python computation? If they don't, I wonder why not.

Nevertheless, my ears perked up at "matrices of different sizes," since that was something I implemented a while back thinking it would find a use-case if it were well advertised.

You can matrix-multiply two arrays of matrices, in which all the matrices have different shapes, as long as matrix "i" in the first array is compatible with matrix "i" in the second array. Like this:

    import awkward1 as ak

    lefts = ak.Array([
        [[1, 2],
         [3, 4],
         [5, 6]],

        [[1, 2, 3, 4],
         [5, 6, 7, 8]],

        [[1],
         [2],
         [3],
         [4]],
    ])

    rights = ak.Array([
        [[7, 8, 9],
         [10, 11, 12]],

        [[8, 10],
         [11, 12],
         [13, 14],
         [15, 16]],

        [[5, 6, 7]],
    ])
Matrix-multiplying them results in this:

    >>> (lefts @ rights).tolist()
    [
        [[ 27,  30,  33],
         [ 61,  68,  75],
         [ 95, 106, 117]],

        [[129, 140],
         [317, 348]],

        [[ 5,  6,  7],
         [10, 12, 14],
         [15, 18, 21],
         [20, 24, 28]]
    ]
lefts[0]'s 3×2 matrix times rights[0]'s 2×3 gives the output's 3×3, and similarly for i=1 and i=2. This is the kind of operation you wouldn't even be able to talk about if you didn't have jagged arrays.
[+] mathgenius|4 years ago|reply
How does this work? Is it a C++ lib with python interface? Does it use llvm for the JIT? ...
[+] jpivarski|4 years ago|reply
It's a C++ library with a Python interface. How much C++ and how much Python is currently in flux (https://indico.cern.ch/event/855454/contributions/4605044/): we're moving toward having the majority of the library be in Python with only the speed-critical parts in C++ behind a C interface.

The main library has no JIT, so no LLVM, but there is a secondary method of access through Numba, which has JIT and LLVM.

The main method of access is to call array-at-a-time ("vectorized") functions on the arrays, like NumPy. The types of the data in the arrays can be defined at runtime, and are stored in a columnar way, like Apache Arrow. This format allows operations to be performed on dynamically typed data at compiled speeds without specialized compilation—it can be done with a suite of precompiled functions. That's possible because these data structures are never instantiated in the normal "record-oriented" way; they're sets of integer arrays that are only interpreted as data structures, and it's not hard to precompile a suite of functions that only operate on integer arrays.
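A toy sketch of that last point (hypothetical function, not one of the real "cpu-kernels"): because the kernel only sees integer arrays, the same compiled code serves lists of floats, lists of records, or any content type defined at runtime.

```python
import numpy as np

def kernel_list_lengths(offsets):
    """Stand-in for a precompiled kernel: it operates only on an
    integer offsets array, so it is agnostic to the content type."""
    return offsets[1:] - offsets[:-1]

# Lists-of-floats and lists-of-records share the same offsets machinery;
# the kernel never needs to know what the content actually is.
float_list_offsets  = np.array([0, 2, 2, 5])
record_list_offsets = np.array([0, 3, 4])

print(kernel_list_lengths(float_list_offsets))   # [2 0 3]
print(kernel_list_lengths(record_list_offsets))  # [3 1]
```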

A 3.5 minute explanation of the columnar format and what a columnar operation looks like can be found here: https://youtu.be/2NxWpU7NArk?t=661

[+] visarga|4 years ago|reply
How fast is it at loading and saving compared to Python objects with JSON?
[+] jpivarski|4 years ago|reply
I've been doing some tests with JSON recently, so I have some exact numbers for a particular sample. Suppose you have JSON like the following:

    MULTIPLIER = int(10e6)
    json_string = b"[" + b", ".join([
        b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],' +
        b'[],' +
        b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
    ] * MULTIPLIER) + b"]"
It's a complex structure with many array elements (30 million), but those elements have a common type. It's 1.4 GB of uncompressed JSON. Converted to a Parquet file (a natural format for data like these), it becomes 1.6 GB of uncompressed Parquet. It can get much smaller with compression, but since the same numbers repeat in the above, compressing it would not be a fair comparison. (Note that I'm using 3 bytes per float in the JSON; uncompressed Parquet uses 8 bytes per float. I should generate something like the above with random numbers and then compress the Parquet.)

Reading the JSON into Python dicts and lists using the standard library `json` module takes 70 seconds and uses 20 GB of RAM (for the dicts and lists, not counting the original string).

Reading the Parquet file into an Awkward Array takes 3.3 seconds and uses 1.34 GB of RAM (for just the array).

Reading the JSON file into an Awkward Array takes 39 seconds and uses the same 1.34 GB of RAM. If you have a JSONSchema for the file, an experimental method (https://github.com/scikit-hep/awkward-1.0/pull/1165#issuecom...) would reduce the reading time to 6.0 seconds, since most of that time is spent discovering the schema of the JSON data dynamically.
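A scaled-down, stdlib-only version of that setup (my sketch, covering just the `json`-module side; the Awkward and Parquet readings are omitted) shows the shape of the data being benchmarked:

```python
import json

MULTIPLIER = 2  # the real test above used int(10e6)
json_string = b"[" + b", ".join([
    b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],' +
    b'[],' +
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
] * MULTIPLIER) + b"]"

# Each repetition contributes 3 sublists, so at MULTIPLIER = 10 million
# this is the 30 million elements mentioned above.
python_objects = json.loads(json_string)
print(len(python_objects))  # 6
```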

The main thing, though, is that you can compute on the already-loaded data in faster, more concise ways. For instance, if you wanted to slice and call a NumPy ufunc on the above data, like

    output = np.square(array["y", ..., 1:])
the equivalent Python would be

    output = []
    for sublist in python_objects:
        tmp1 = []
        for record in sublist:
            tmp2 = []
            for number in record["y"][1:]:
                tmp2.append(np.square(number))
            tmp1.append(tmp2)
        output.append(tmp1)
The Python code runs in 140 seconds and the array expression runs in 2.4 seconds (current version; in the unreleased version 2 it's 1.5 seconds).

For both loading data and performing operations on them, it's orders of magnitude faster than pure Python—for the same reasons the same can be said of NumPy. What we're adding here is the ability to do this kind of thing on non-rectangular arrays of numbers.