shadowwolf007 | 4 years ago
The edge cases are a hassle, but from a business perspective they don't become less of a hassle by switching to JSON or really any other format. We experimented with using more JSON and eventually gave it up: overall it wasn't saving any time, because the "data schema" conversations dominated the development and testing time.
Obviously being able to jam out some JSON helped quite a bit initially, but then on the QA side we started to run into problems with tooling not really being designed to handle massive JSON files. Basically, when something was invalid (such as the first time we encountered an invalid quote), it was not enjoyable to figure out where that was in a 15GB file.
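To make the "where is the bad quote" problem concrete, here is a minimal sketch of locating invalid records without loading the whole file. It assumes the data is newline-delimited JSON (one record per line), which may not match the files described above; a single 15GB JSON document would need a streaming parser instead. The function name `find_invalid_records` and the sample data are made up for illustration.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def find_invalid_records(stream: TextIO) -> Iterator[Tuple[int, int, str]]:
    """Yield (line_number, column, message) for each line that is not valid JSON.

    Assumes newline-delimited JSON, so only one record is in memory at a time.
    """
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            # colno and msg are attributes of json.JSONDecodeError
            yield line_number, err.colno, err.msg


# Tiny in-memory stand-in for a large file; the second record has a broken quote.
sample = io.StringIO(
    '{"id": 1, "name": "ok"}\n'
    '{"id": 2, "name": "bad quote}\n'
    '{"id": 3, "name": "ok"}\n'
)
errors = list(find_invalid_records(sample))
for line_number, column, message in errors:
    print(f"line {line_number}, column {column}: {message}")
```

For a real file you would pass `open(path, encoding="utf-8")` instead of the `StringIO` object; the file is then read one line at a time.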
That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really, really sucks. But to solve that, we would output to a columnar storage format like Parquet or something similar. This would let us fully encode and manage the data how we wanted while letting our clients continue working their processes.
What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all.
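One way to picture this is a toy sketch of a typed binary layout, where the header declares each column's type and the values are stored fixed-width. Everything here is hypothetical (the layout, `write_table`, `validate_from_header`); it is not Parquet or any real format. The point is only that a non-integer cannot be written into an `int64` column, so the structural check needs just the header and the total size, not a scan of the values.

```python
import json
import struct

# Hypothetical toy layout, for illustration only:
#   4-byte little-endian header length | JSON header | fixed-width rows
FORMATS = {"int64": "q", "float64": "d"}


def row_struct_for(columns) -> struct.Struct:
    # "<" means little-endian with no padding, so every field is 8 bytes here.
    return struct.Struct("<" + "".join(FORMATS[col_type] for _, col_type in columns))


def write_table(columns, rows) -> bytes:
    header = json.dumps({"columns": columns, "rows": len(rows)}).encode("utf-8")
    packer = row_struct_for(columns)
    # struct.pack raises struct.error if a value does not fit the declared type.
    body = b"".join(packer.pack(*row) for row in rows)
    return struct.pack("<I", len(header)) + header + body


def validate_from_header(blob: bytes) -> bool:
    """Check the header's claims against the total size only.

    len(blob) stands in for the file size; the row values are never inspected.
    """
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len])
    row_size = row_struct_for(header["columns"]).size
    return len(blob) == 4 + header_len + header["rows"] * row_size


columns = [("id", "int64"), ("price", "float64")]
blob = write_table(columns, [(1, 9.5), (2, 3.0)])
print(validate_from_header(blob))
```

This only covers type-level validity. Range checks, enums, and similar business rules would still need something richer in the header.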
anigbrowl|4 years ago
breck|4 years ago
Agreed. JSON lets me know something is a number. That's great, but I still have to check for min/max, zero, etc. A string? That's great, but I've got to check it against a set of enums, and so forth. Basically, the "types" JSON gives you are about 20% of the work, and you're going to have to parse things into your own types anyway.
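As a rough illustration of that remaining work, here is a minimal sketch of parsing a JSON record into application types. The `Account` record, its `age` bounds, and the `Status` values are all invented for the example.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    CLOSED = "closed"


@dataclass(frozen=True)
class Account:
    age: int
    status: Status


def parse_account(raw: str) -> Account:
    data = json.loads(raw)  # JSON only tells us "number" or "string"

    age = data["age"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError(f"age must be an integer, got {age!r}")
    if not 0 < age <= 150:  # hypothetical business bounds
        raise ValueError(f"age out of range: {age}")

    # Enum lookup by value raises ValueError for anything outside the set.
    status = Status(data["status"])
    return Account(age=age, status=status)


account = parse_account('{"age": 30, "status": "active"}')
print(account)
```

Both `{"age": 0, "status": "active"}` and `{"age": 30, "status": "bogus"}` are perfectly valid JSON, yet both are rejected here with a `ValueError`.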
> What I would really like to see is a file format where the validity of the file could be established by only using the header.
Do you mean something like a checksum, so that not only is a schema provided but also some method to verify that the data obeys it?
If you're talking about just some stronger shared ontology, I think that's a direction things will go. I call this concept "Type the world" or "World Wide Types". I'm starting to think something like GPT-N will be the primary author, rather than a committee of humans like Schema.org.
shadowwolf007|4 years ago
A checksum would be crude and user-hostile: it can only say "you did it wrong" and is not really good at telling you what it means to do it right.
If I understand the concepts correctly then it seems like a shared ontology could potentially solve the problem in a non-hostile way.
Plus, it makes me happy because I feel like types are a real-world problem, so it is always nice if the type system could enforce that real-world-ness and all the messiness that comes along for the ride.
radus|4 years ago
shadowwolf007|4 years ago
More a byproduct of decisions made 5-7 years ago, when the company was in raw startup mode, than of a more mature roadmap.