dmpetrov | 1 year ago
Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.
For PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map().
Check out the video: https://www.youtube.com/watch?v=yjzcPCSYKEo and the blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
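To illustrate the "multiple documents per file" idea, here is a minimal sketch in plain Python. It does not use the DataChain API; `Chunk` and `split_document` are hypothetical names, and splitting on blank lines is only an assumed stand-in for real PDF/HTML parsing.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass
class Chunk:
    """One extracted record; a Pydantic model could play the same role."""
    source: str
    index: int
    text: str


def split_document(path: str, content: str) -> Iterator[Chunk]:
    """Yield one record per non-empty paragraph of a single input file."""
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    for i, text in enumerate(paragraphs):
        # Each yield produces a separate output record for the same file.
        yield Chunk(source=path, index=i, text=text)


records = list(split_document("report.html", "Intro text.\n\nSecond section.\n\n"))
# Each record is JSON-serializable, so it can be stored or exported later.
as_json = [json.dumps(asdict(r)) for r in records]
```

The generator shape is the point: one input file goes in, and as many records come out as the parser finds.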
nbbaier|1 year ago
Forgive my ignorance, but what is "json-pair"?
dmpetrov|1 year ago
It's simply about linking metadata from a JSON file to a corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach: open-image or laion datasets.
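As a rough sketch of that pairing in plain Python (not DataChain's own implementation; `pair_by_stem` is a hypothetical helper), files can be matched on their name without the extension:

```python
from pathlib import PurePosixPath


def pair_by_stem(paths):
    """Return (media_file, json_file) pairs that share the same path stem."""
    media, meta = {}, {}
    for p in paths:
        path = PurePosixPath(p)
        key = str(path.with_suffix(""))  # "data003.png" -> "data003"
        if path.suffix.lower() == ".json":
            meta[key] = p
        else:
            media[key] = p
    # Keep only media files that have a matching JSON sidecar.
    return [(media[k], meta[k]) for k in sorted(media) if k in meta]


pairs = pair_by_stem(["data003.png", "data003.json", "data004.png"])
```

Here data004.png is dropped because it has no JSON sidecar; whether unmatched files should be kept with empty metadata instead is a design choice this sketch does not settle.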
spott|1 year ago
Could your metadata come from something like a Postgres SQL statement? Or an Iceberg view?
dmpetrov|1 year ago
Just connect to the DB from your Python code (like the lambda in the example) and extract the necessary data.
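A minimal sketch of that pattern, assuming a per-file lookup function. An in-memory SQLite database from the standard library stands in for Postgres here so the example is self-contained; the `labels` table and `lookup_label` are hypothetical.

```python
import sqlite3

# Stand-in database; with Postgres you would open a connection through
# your usual driver instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (file_name TEXT PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO labels VALUES (?, ?)",
    [("data003.png", "cat"), ("data004.png", "dog")],
)


def lookup_label(file_name):
    """Fetch the metadata for one file, or None if the DB has no row for it."""
    row = conn.execute(
        "SELECT label FROM labels WHERE file_name = ?", (file_name,)
    ).fetchone()
    return row[0] if row else None


labels = [lookup_label(name) for name in ["data003.png", "missing.png"]]
```

A function like this could be called once per file from the mapping step; for large datasets, fetching rows in batches would likely be cheaper than one query per file.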