souvik3333 | 8 months ago

Actually, we have trained the model to convert documents to Markdown and do semantic tagging at the same time. For example, equations will be extracted as LaTeX, and images (plots, figures, and so on) will be described within `<img>` tags. The same goes for `<signature>`, `<watermark>`, and `<page_number>`.

Also, for complex tables, we extract them as HTML tables instead of Markdown tables.
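
As a rough illustration of how output in this style could be consumed downstream, here is a minimal Python sketch that pulls the tagged spans out of the Markdown. It assumes each tag is emitted as a matched open/close pair (e.g. `<img>...</img>`), which the description above suggests but does not spell out. The sample text and the helper names `extract_tags` and `strip_tags` are hypothetical, not part of the model or its tooling.

```python
import re

# Hypothetical sample in the style described above; not real model output.
SAMPLE = """# Quarterly report

The relation $E = mc^2$ is extracted as LaTeX.

<img>Bar chart of revenue by quarter</img>

<table><tr><td>Q1</td><td>10</td></tr></table>

<signature>J. Doe</signature>
<watermark>CONFIDENTIAL</watermark>
<page_number>3</page_number>
"""

# Semantic tags named in the comment; assumed to come in matched pairs.
TAGS = ("img", "signature", "watermark", "page_number")


def extract_tags(text, tags=TAGS):
    """Return a dict mapping each tag name to the list of its contents."""
    found = {}
    for tag in tags:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [match.strip() for match in pattern.findall(text)]
    return found


def strip_tags(text, tags=TAGS):
    """Remove the tagged spans, leaving the Markdown and HTML tables."""
    for tag in tags:
        text = re.sub(rf"<{tag}>.*?</{tag}>\n?", "", text, flags=re.DOTALL)
    return text


if __name__ == "__main__":
    print(extract_tags(SAMPLE))
    print(strip_tags(SAMPLE))
```

A regex is enough here only because the tags are assumed to be flat and never nested; anything more structured would call for a real parser.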

mgr86|8 months ago

Have you considered XML? TEI, for example, is very robust and mature for marking up documents.

esafak|8 months ago

First I've heard of it. https://en.wikipedia.org/wiki/Text_Encoding_Initiative

lukev|8 months ago

Yeah, this really hurts. If your goal is to precisely mark up a document with some structural elements, XML is strictly superior to Markdown.

The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.

jtbayly|8 months ago

What happens to footnotes?

souvik3333|8 months ago

They will be extracted on a new line as normal text. It will be the last line.