top | item 42893844

Sparrow, a modern C++ implementation of the Apache Arrow columnar format

116 points| SylvainCorlay | 1 year ago |johan-mabille.medium.com

25 comments

order

binary132|1 year ago

I am glad to see others in the comments share my concerns about the code quality considering that the offered samples are not really idiomatic or modern C++. In particular, there is really no reason to use pointers here in either the accessor or the mutator; both of those probably should have been references, although the case can be made for the mutator at least since it makes it explicit in the written code that we may mutate the arguments.

amluto|1 year ago

This is supposed to be idiomatic?!?

    namespace sp = sparrow;
    sp::primitive_array<int> ar = { 1, 3, 5, 7, 9 };
    // Caution: get_arrow_structures returns pointers, not values
    auto [arrow_array, arrow_schema] = sp::get_arrow_structures(std::move(ar));
    // Use arrow_array and arrow_schema as you need (serialization,
    // passing it to a third party library)
    // ...
    // do NOT release the C structures in the end, the "ar" variable will do it
   // for you
I’m sorry, resources are kept alive by an object that has been moved from?

senkora|1 year ago

This bothered me enough to check the source code, because I simply had to know:

    template <layout_or_array A>
    std::pair<ArrowArray*, ArrowSchema*> get_arrow_structures(A& a)
    {
        arrow_proxy& proxy = detail::array_access::get_arrow_proxy(a);
        return std::make_pair(&(proxy.array()), &(proxy.schema()));
    }
https://github.com/man-group/sparrow/blob/c01a768f590ebf3b22...

So the answer is that the `std::move` does nothing and should be omitted, because this function only has one overload, and that overload takes its argument by lvalue-reference.

(And as far as I can tell, `detail::array_access::get_arrow_proxy(a)` eventually just reads a member on `a`, so there's no copying anywhere)

It's a harmless mistake; I'm just surprised it wasn't caught. The authors seem pretty experienced with the language.

jandrewrogers|1 year ago

Not speaking to the specific design choices here, but in C++ moved-from objects are not destroyed and must be valid in their moved-from state (e.g. a sentinel value to indicate they’ve been moved) so that they can be destroyed in the indefinite future. This is useful even though “destroy on move” is the correct semantics for most cases. Making “move” and “destroy” distinct operations increases the flexibility and expressiveness.

A common case where this is useful is if the address space where the object lives is accessible, for read or write, by references exogenous to the process like some kinds of shared memory or hardware DMA. If the object is immediately destroyed on move, it implies the memory can be reused while things your process can’t control may not know you destroyed the object. This is essentially a use-after-free bug factory. Being able to defer destruction to a point in time when you can guarantee this kind of UAF bug is not possible is valuable.

SylvainCorlay|1 year ago

Excellent catch.

This std::move should not have been in this code snippet. It is a copy-paste mistake, carried over from the previous code snippet of the post, and should have been omitted.

(The `std::move` does nothing in this snippet, since `sp::get_arrow_structure` takes and lvalue reference).

In the previous example with `sp::extract_arrow_structures`, which takes an rvalue reference, std::move is required and the sparrow primitive array cannot be operated upon after.

juunpp|1 year ago

I think that comment is a copy-paste mistake. If you look at the next code snippet, the comment actually makes sense there.

That being said, I've also given up on C++ and learn it mostly to keep up with the job, if that's where you are coming from. I don't find Rust to be a satisfying replacement, though. No language scratches the itch for me right now.

arka2147483647|1 year ago

Why does sp::primitive_array<> even exist?

Should it not be std::array<>?

rubenvanwyk|1 year ago

Seems cool but I have questions about Arcticdb - is Polars, DuckDB etc really so limited for data science analysis that it's justified to write a new library specifically for time-series analysis on S3 files?

willdealtry|1 year ago

ArcticDB is more concerned with storage and persistence than with in-memory processing, so it's complementary to Polars/DuckDB etc, rather than an alternative.

nly|1 year ago

Trying to find a good C++ parquet library is equally hard