

el_don_almighty | 8 months ago

I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that stores the elements and retrieves them easily. Amazing work!


pxc|8 months ago

Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
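For what it's worth, the pandoc half of that could look roughly like the sketch below. It assumes pandoc is installed and on PATH and that the input is a .docx file. The helper names `pandoc_command` and `convert_to_text` are made up for illustration, and the LLM cleanup pass is left out entirely.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src, out_dir):
    """Build the pandoc argument list that converts `src` to plain text in `out_dir`."""
    src = Path(src)
    out = Path(out_dir) / (src.stem + ".txt")
    # --to plain selects pandoc's plain-text writer; --output names the target file.
    return ["pandoc", str(src), "--to", "plain", "--output", str(out)]


def convert_to_text(src, out_dir):
    """Run pandoc on one file (hypothetical helper; raises if pandoc is missing or fails)."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pandoc_command(src, out_dir), check=True)
```

Looping `convert_to_text` over `Path(archive_dir).rglob("*.docx")` would cover a whole directory tree. Older binary .doc files would probably need a pass through unoconv or LibreOffice first.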

toledocavani|8 months ago

Which decade? DOCX and PPTX are just zipped XML files, which seems pretty standard to me.
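To illustrate the "zipped XML" point, here is a minimal sketch using only the Python standard library. It assumes the main document part lives at the conventional path `word/document.xml`, and it only applies to the XML-based formats: the older binary .doc and .ppt files are not zip archives. `docx_paragraphs` and `make_toy_docx` are hypothetical helper names, and the toy archive holds just that one part, so Word would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace, used by the <w:p> and <w:t> elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        # Assumption: the main part is at the conventional location.
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across the <w:t> runs inside it.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_toy_docx(texts):
    """Build an in-memory zip with only word/document.xml, for demonstration."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(t)}</w:t></w:r></w:p>" for t in texts
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf
```

Calling `docx_paragraphs("some_file.docx")` on a real document should work the same way, though real files also carry styles, tables, and embedded media that this sketch ignores.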