
dmpetrov | 1 year ago

DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.

In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported - just `yield my_result` multiple times from map().
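
A rough sketch of the one-file-to-many-records idea, in plain Python with no DataChain calls: a generator that splits one document's text into chunks and yields a JSON-serializable dict per chunk. The name `split_document`, the `chunk_size` parameter, and the field names are all made up for illustration; how such a generator gets wired into a chain depends on the DataChain version you use.

```python
from typing import Iterator


def split_document(path: str, text: str, chunk_size: int = 200) -> Iterator[dict]:
    """Yield one JSON-serializable record per chunk of a single source file."""
    for index, start in enumerate(range(0, len(text), chunk_size)):
        yield {
            "source": path,   # which file the record came from
            "chunk": index,   # position of the chunk within that file
            "text": text[start:start + chunk_size],
        }


# Hypothetical input: one "file" with 450 characters of extracted text.
records = list(split_document("report.pdf", "a" * 450))
```

Each yielded dict becomes its own record, so one PDF can turn into several rows.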

Check out the video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...


nbbaier|1 year ago

> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Forgive my ignorance, but what is "json-pair"?

dmpetrov|1 year ago

It's not a format :)

It's simply about linking metadata from a JSON file to the corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach, e.g. open-image or LAION datasets.
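
To make the pairing concrete, here is a minimal sketch in plain Python (not DataChain's own implementation): it groups each data file with the `.json` file that shares its path stem. `pair_by_stem` and the `file`/`meta` keys are hypothetical names, and the file list is invented.

```python
from pathlib import PurePosixPath


def pair_by_stem(paths, meta_ext=".json"):
    """Group each data file with the metadata file that shares its stem."""
    meta, data = {}, {}
    for p in paths:
        path = PurePosixPath(p)
        key = str(path.with_suffix(""))  # e.g. "img/data003"
        if path.suffix == meta_ext:
            meta[key] = p
        else:
            data[key] = p
    # Keep only data files that actually have a matching metadata file.
    return [{"file": data[k], "meta": meta[k]} for k in sorted(data) if k in meta]


pairs = pair_by_stem(["img/data003.png", "img/data003.json", "img/data004.png"])
```

Here `data004.png` is dropped because it has no matching JSON; whether unpaired files should be kept or skipped is a design choice.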

spott|1 year ago

> DataChain has no assumptions about metadata format.

Could your metadata come from something like a Postgres SQL statement? Or an Iceberg view?

dmpetrov|1 year ago

Absolutely, that's a common scenario!

Just connect to the DB from your Python code (like the lambda in the example) and extract the necessary data.
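
A minimal sketch of that pattern, using the standard library's `sqlite3` as a stand-in for Postgres so it runs anywhere. The `labels` table, its columns, and `lookup_label` are hypothetical; with Postgres you would swap in your own driver and connection string.

```python
import sqlite3

# In-memory database standing in for a real Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (filename TEXT PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO labels VALUES (?, ?)",
    [("data003.png", "cat"), ("data004.png", "dog")],
)


def lookup_label(conn, filename):
    """Return the label stored for a file name, or None if there is no row."""
    row = conn.execute(
        "SELECT label FROM labels WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None


label = lookup_label(conn, "data003.png")
```

A function like `lookup_label` is the kind of thing you would call per file from your own code; for large datasets, fetching the rows in bulk up front is likely cheaper than one query per file.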