top | item 26020067

(no title)

wesm | 5 years ago

I challenge you to have a closer look at the project.

Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.

discuss

order

throwaway894345|5 years ago

Perhaps "zero-copy" is a more precise or well-defined term?

waynesonfire|5 years ago

CPUs come in many flavors. One area where they differ is in the way that bytes of a word are represented in memory. Two common formats are Big Endian and Little Endian. This is an example where a "C array of doubles" would be incompatible and some form of deserilaziation would be needed.

My understanding is that an apache arrow library provides an API to manipulate the format in a platform agnostic way. But to claim that it eliminates deserialization is false.

dlnovell|5 years ago

I wish I had your confidence, to argue with Wes McKinney about the details of how Arrow works

ska|5 years ago

You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.

But ... (a) this is way less common than it was decades ago (rare use cases we are talking about here ) and (b) it seems to be addressed in a sensible way (i.e. Arrow defaults to little-endian, but you could swap it on a big-endian network). I think it includes utility functions for conversion also.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you are complaining about, unless it's the lack of liberally sprinkling ("no deserialization in most use cases") or whatever around the comments?

johncolanduoni|5 years ago

Big endian is pretty rare among anything you’d be doing in-memory analytics on. Looks like you can choose the endianess of the format if you need to, but it’s little endian by default: https://arrow.apache.org/docs/format/Columnar.html. I’d suggest reading up on the format, it covers the properties it provides to be friendly to direct random access.