> The idea here is that instead of manually creating schemas, what if the schemas were automatically created for you? When something doesn’t fit in a table, how about automatically adding columns for the missing fields?
I've been experimenting with this approach against SQLite for a few years now, and I really like it.
My sqlite-utils package does exactly this. Try running an insert against a JSON file on the command line: it outputs the generated schema, and when you insert more data you can use the --alter flag to have it automatically create any missing columns. Full documentation here: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
It's also available as a Python library: https://sqlite-utils.datasette.io/en/stable/python-api.html
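A minimal sketch of what that looks like via the Python API (the database, table, and column names here are just illustrative):

    import sqlite_utils

    db = sqlite_utils.Database("example.db")

    # Inserting dicts creates the table and its columns automatically.
    db["dogs"].insert({"name": "Cleo", "age": 5})
    print(db["dogs"].schema)  # shows the generated CREATE TABLE statement

    # alter=True mirrors the CLI's --alter flag: new keys become new columns.
    db["dogs"].insert({"name": "Pancakes", "weight_kg": 4.5}, alter=True)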
Simon, brilliant observations here and kudos on sqlite-utils.
I'm all for layers, a fundamental approach in our field to tame complexity. And the SQL model and SQLite have stood the test of time and are solid foundations.
I'm just wondering: could we be stuck in a local maximum where the presumed answer is always the relational model? Maybe if we built the relational model on top of a different set of lower-level primitives (a type system instead of schemas and tables) we could escape the local maximum we're stuck in? Just a thought.
There are a few somewhat ad hoc perf measurements here regarding sqlite-utils and SQLite... https://zed.brimdata.io/docs/commands/zq/#73-performance-com...
I'm not a SQLite expert so if I did something wrong, please holler and let me know :)
Despite the claims, SQL is NOT "schema-fixed". You can 100% create new schemas, alter them and modify them.
What actually happens is that if you have a CENTRAL repository of data (aka a "source of truth"), then you bet you want to "freeze" your schemas (because it is like an API, where you need to fulfill contracts).
--
SQL's limitation is a lack of composability, and the biggest reason "NoSQL" works is this: a JSON value is composable; a "stringy" SQL statement is not. If SQL were really built around relations and tuples, like (stealing from my project, TablaM):
[Customer id:i32, name:Str; 1, "Jhon"]
then developers will have less reason to go elsewhere.
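To make the composability point concrete, here is a tiny illustration (my own sketch, not from the comment): nesting one JSON value inside another is purely structural, while embedding one SQL query inside another means splicing strings.

    # JSON values compose by construction: any value can be embedded in another.
    customer = {"id": 1, "name": "Jhon"}
    order = {"order_id": 10, "customer": customer}  # nesting is just nesting

    # "Stringy" SQL composes by splicing text, which is fragile and easy to get wrong.
    inner = "SELECT id FROM customer WHERE name = 'Jhon'"
    outer = f"SELECT * FROM orders WHERE customer_id IN ({inner})"
    print(order)
    print(outer)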
Author here. All good points. Yes, you can build a super-structured type system on top of tables; EdgeDB does this well. And you can put JSON into relational columns. Then you might ask what the "type" of that column is. Well, if you want deep types, the type varies from row to row as the JSON values vary, and you have to walk the JSON to determine the type. SQL implementations are beginning to try to deal with this mess by adding layers on top of tables. We're saying, maybe we should think differently about the problem and build tables on top of types, with tables as a special case of a type system. This also gives a very nice way to get data into and out of systems without having to go through the messiness of ODBC and special-casing tables vs tuples vs scalars, etc.
The words "anarchy" and "authoritarianism" seem unnecessarily emotional and pejorative, and because of their semantic baggage I personally wouldn't use them in a professional situation. The author counts on the emotional color of those words to attempt an argument that both are somehow bad.
Instead of those words I'd suggest something like "schema on write" vs. "schema on read", or "persisted structured" vs. "persisted unstructured". "Document" vs. "relational" doesn't quite capture it, since unstructured data can have late-binding relations applied at read time, and structured data doesn't have to be relational.
And of course, modern relational databases can store unstructured data as easily as structured data.
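For instance (a minimal sketch, assuming a SQLite build with the JSON1 functions, which current Python builds include), a document can sit in an ordinary column and have "schema on read" relations applied at query time:

    import json, sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, doc TEXT)")
    con.execute("INSERT INTO events (doc) VALUES (?)",
                (json.dumps({"s": "foo", "a": [1, "bar"]}),))

    # Late-binding "schema on read": extract a field from the stored document.
    row = con.execute("SELECT json_extract(doc, '$.s') FROM events").fetchone()
    print(row[0])  # -> foo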
The first few sections of this post nearly lost me, waffling on about NoSQL vs whatever.
Eventually we get to the meat:
> For example, the JSON value
{"s":"foo","a":[1,"bar"]}
> would traditionally be called “schema-less” and in fact is said to have the vague type “object” in the world of JavaScript or “dict” in the world of Python. However, the super-structured interpretation of this value’s type is instead:
> type record with field s of type string and field a of type array of type union of types integer and string
> We call the former style of typing a “shallow” type system and the latter style of typing a “deep” type system. The hierarchy of a shallow-typed value must be traversed to determine its structure whereas the structure of a deeply-typed value is determined directly from its type.
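To make the quoted distinction concrete, here is a rough Python sketch of deriving a "deep" type from a JSON-like value (the type names and syntax are made up for illustration; this is not Zed's actual algorithm):

    def deep_type(value):
        """Compute a 'deep' type descriptor for a JSON-like value."""
        if isinstance(value, bool):
            return "bool"
        if isinstance(value, int):
            return "int64"
        if isinstance(value, float):
            return "float64"
        if isinstance(value, str):
            return "string"
        if value is None:
            return "null"
        if isinstance(value, list):
            element_types = sorted({deep_type(v) for v in value})
            inner = (element_types[0] if len(element_types) == 1
                     else "union[" + ",".join(element_types) + "]")
            return "array[" + inner + "]"
        if isinstance(value, dict):
            fields = ",".join(k + ":" + deep_type(v) for k, v in value.items())
            return "record{" + fields + "}"
        raise TypeError(type(value))

    print(deep_type({"s": "foo", "a": [1, "bar"]}))
    # -> record{s:string,a:array[union[int64,string]]}

The "deep" style means such a descriptor is carried with (or derivable directly from) the value, so two values can be compared by type without re-walking their contents.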
This is a bit confusing, since JSON data commonly has an implicit schema, or "deep type system" as this post calls it, and if you consume data in any statically-typed language you will materialise the implicit "deep" types in your host language.
So it seems that ZSON is sort of like a TypeScript-ified version of JSON, where the implicit types are made explicit.
It seems the point is not to have an external schema that documents must comply with, so I guess at the end of the day it has a similar aim to other "self-describing" message formats like https://amzn.github.io/ion-docs/ ? i.e. each message has its own schema
So the interesting part is perhaps the new data tools to work with large collections of self-describing messages?
> The first few sections of this post nearly lost me, waffling on about NoSQL vs whatever.
Since the author of the blog post is here, I'll just jump in to agree with this part: there is a lot of unnecessary background text before we get to the meat of it. I don't think people need a history lesson on NoSQL and SQL, and IMO the "authoritarianism" metaphor is a stretch, and that word has pretty negative connotations.
I think there's some value in setting the scene, but I think you will lose readers before they get to the much more interesting content further down. I recommend revising it to be a lot shorter.
> This is a bit confusing, since JSON data commonly has an implicit schema, or "deep type system" as this post calls it, and if you consume data in any statically-typed language you will materialise the implicit "deep" types in your host language
That is an incredibly expensive operation to perform. Being able to look at two binary blobs of data and quickly determining whether or not they are the same type of data unlocks a whole host of functionality over large amounts of data that is otherwise prohibitively expensive and slow.
It looks like the use case is specifying types for dataflow operators (aka endpoints for dataflow pipes) [0] and I surmise composition should be super easy. I was surprised not to see any mention of XML or XML Schema as prior art, especially with their discussion of schema registries. Edit: Oh, the point of reference is Kafka [1]
[0] https://zed.brimdata.io/docs/language/overview/ [1] https://docs.confluent.io/platform/current/schema-registry/i...
Interesting discussion, but it's buried in a lot of legacy thinking about schemas, and personally I don't find a Yet-Another-Schema-Abstraction (YASA)™ layer very compelling when better solutions from functional programming and semantic ontologies are far ahead in this area.
Suggest looking into JSON-LD which was intended to solve many of the type and validation use-cases related to type and schema.
To pile on a bit here: JSON-LD is based on RDF, which is an abstract syntax for data as semantic triples (i.e. RDF statements). There is also RDF*, currently in development, which extends this basic data model to make statements about statements.
RDF has concrete syntaxes, one of them being JSON-LD, and it can be used to model relational databases fairly well with R2RML (https://www.w3.org/TR/r2rml/), which essentially turns relational databases into a concrete syntax for RDF.
schema.org is also based on RDF, and is essentially an ontology (one of many) that can be used for RDF and non-RDF data - mainly because almost all data can be represented as RDF, so non-RDF data is just data that does not have a formal mapping to RDF yet.
Ontologies are a concept used frequently in RDF but rarely outside of it; they are quite important for federated or distributed knowledge, or for descriptions of entities. The focus is heavily on modelling properties instead of modelling objects, so that whenever a property occurs it can be understood within the context of an ontology.
An example is the birth date of a person (https://schema.org/birthDate). When I get a semantic triple:
<example:JohnSmith> <https://schema.org/birthDate> "2000-01-01"^^<https://schema.org/Date>
This tells me that the entity identified by the IRI <example:JohnSmith> is a person - and their birth date is 2000-01-01. I however don't expect that I will get all other descriptions of this person at the same time; I won't necessarily get their <https://schema.org/nationality>, for example, even though this is a property of a <https://schema.org/Person> defined by schema.org
I can also combine https://schema.org/ based descriptions with other descriptions, and these descriptions can be merged from multiple sources and then queried together using SPARQL.
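As a rough illustration of that workflow (a sketch assuming rdflib 6+, which bundles a JSON-LD parser; the IRIs are made up), you can parse a JSON-LD description and query it with SPARQL:

    from rdflib import Graph

    doc = """
    {
      "@context": {"@vocab": "https://schema.org/"},
      "@id": "http://example.org/JohnSmith",
      "@type": "Person",
      "birthDate": "2000-01-01"
    }
    """

    g = Graph()
    g.parse(data=doc, format="json-ld")

    # Descriptions merged from several sources could be queried the same way.
    for row in g.query(
        "SELECT ?who ?dob WHERE { ?who <https://schema.org/birthDate> ?dob }"
    ):
        print(row.who, row.dob)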
Perhaps I don't understand their use case fully, but it seems to me that every schema can be defined as a child protobuf message, and each child can then be added to a oneof field of a parent protobuf message. This way, you get the strict/optional type checks that are required, and the efficiency and ecosystem around protobufs.
Author here. This totally makes sense. The challenge here is you need to store the type definitions somewhere (e.g., in the .proto files) and any system that processes protocol buffers needs to know which proto to apply to which messages. The theme of super-structured data is that this type structure should be native to the serialized data and our premise is this leads to better DX (though Zed is early and the jury is out). Perhaps flexbuffers is a closer analogy, which I should have mentioned in the article.
"Dynamic Relational" needs to be implemented. Columns and (optionally) tables are "create on write". If you issue "SELECT nonExistingColumn FROM myTable" you get nulls (if rows exist), not an error. One can incrementally "lock down" the schema as a project matures by adding constraints. Unlike the "NoSql" movement, it does not throw out most of RDBMS concepts, just tweaks them only enough to be dynamic-friendly. This reduces the learning curve.
So it sounds like one of the advantages of the Zed ecosystem is that its data can go into three file formats (zson, zng, zst), each designed for a specific use case, and can be converted between them easily and without loss.
And it seems like the newer "zed lake" format is like a large blob managed by a server. Can you also convert data to and from and the file formats to the lake format? What is the lake's main use case?
I love the idea of getting rid of tables, when developing application code I'm often thinking in terms of Maps/Sets/Lists--I wish I could just take that code and make it persistent. PRIMARY KEY is really like a map. Also I wish I had transactional memory in my application. Not sure what the future looks like, but I am loving all this development in the database space.
> EdgeDB is essentially a new data silo whose type system cannot be used to serialize data external to the system.
I think this implies that serializing external data to zson is easier than writing an INSERT into edgedb, but not sure why that would be.
Arrow has union types (as well as structs and dictionary types). Parquet doesn't, but I think it has an intentionally shallow type system to allow flexibility in encoding: basically everything is either numeric or binary, and the logical type for binary columns is defined in metadata. So you can use, for instance, Arrow as the encoding.
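For reference, a small pyarrow sketch of the types being mentioned (my own example, unrelated to Zed; the field names are arbitrary):

    import pyarrow as pa

    # A struct ("record") type and a dictionary-encoded string type.
    person = pa.struct([("name", pa.string()), ("age", pa.int64())])
    tag = pa.dictionary(pa.int32(), pa.string())

    # A union type covering the int-or-string case from the JSON example above.
    int_or_string = pa.union(
        [pa.field("i", pa.int64()), pa.field("s", pa.string())], mode="dense"
    )

    print(person)
    print(tag)
    print(int_or_string)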
This is a data serialization format, not a replacement for storing your business data. Your business data needs to have the same schema enforced everywhere, otherwise how are you going to reconcile your user data now and your user data 5 months ago if their schemas are radically different?
tldr; Don't use relational tables or unstructured document databases. Instead use structured types. The "schema" here is ultimately a collection of independent objects / classes with well-defined fields.
Ok, fine. But I'm not sure how this helps if you have six different systems with six different definitions of a customer, and more importantly, different relationships between customers and other objects like orders or transactions or locations or communications.
I don't see their approach as ground-breaking, but it is definitely worthy of discussion.
> Ok, fine. But I'm not sure how this helps if you have six different systems with six different definitions of a customer, and more importantly, different relationships between customers and other objects like orders or transactions or locations or communications.
If you have this problem, consider giving RDF a look - you can fairly easily use RDF-based technologies to map the data in these systems onto a common model. Some examples of tools that may be useful here are https://www.w3.org/TR/r2rml/ and https://github.com/ontop/ontop - you can also use JSON-LD to convert most JSON data to RDF. For more info ask in https://gitter.im/linkeddata/chat
It helps if this machinery can reject data and thus perform validation. Since recursive construction of union types (valid records can look like this, or also like that...) is trivial, a programmer somewhere has to draw the line between "loosen the schema to allow this record" and "reject this record to enforce the schema".
I got interested when the different spectrum points of JSON and relational were contrasted, so I read the whole thing. I got lost and disheartened when the new terminology was introduced, starting with the "super-structured" name, and it went completely downhill with the other z names. Maybe it's just me, and maybe it is like quantum mechanics and any other innovation where the new names don't make sense and feel ugly.
Wow, what a waste of time. I've been doing it correctly for so long that I forget that virtually everyone else on the planet has no idea how to build a good data model. What pisses me off is that I actually have the right answer on how to avoid all of this pain, but if I typed it out here I'd either waste my time and get ignored or (much, much less likely) get my idea poached. It takes hours to fully communicate anyway. What do you do when you know you're sitting on an approach & tech that could revolutionize the X-hundred-billion-dollar data management industry but you can barely even get your own fucking employer to take you seriously?
Anyway this article is crap and gets everything wrong, just like all of you do. Whatever, nothing to see here I guess.