top | item 42415025


_rs | 1 year ago

That was the first thing I checked, and it looks like they’re using an existing Python package to parse docx files. I wonder if they contributed to it or vetted it thoroughly.


disgruntledphd2|1 year ago

Wow, I dunno if that's good or bad, certainly it's not what I expected.

wis|1 year ago

Looking at the code, it looks like they used existing Python packages to read and parse MS Office formats. That's not what I expected: since the repo is in Microsoft's org on GitHub, I expected them to have used Microsoft's "official" libraries for parsing these formats, through the Component Object Model (COM).

They used Mammoth for docx (Word) [1][2], python-pptx for pptx (PowerPoint) [3][4], and pandas for XLSX (Excel) [5].

[1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...

[2] https://pypi.org/project/mammoth/

[3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...

[4] https://pypi.org/project/python-pptx/

[5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
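
For anyone curious what that kind of per-format dispatch could look like, here is a rough sketch. To be clear, this is not markitdown's actual code: the function names, the `CONVERTERS` table, and the HTML/text output choices are my own assumptions for illustration. It only uses calls I know exist in the three packages (`mammoth.convert_to_html`, `pptx.Presentation`, `pandas.read_excel`).

```python
from pathlib import Path


def docx_to_html(path):
    """Word: let Mammoth turn the .docx into HTML."""
    import mammoth  # third-party: pip install mammoth

    with open(path, "rb") as f:
        return mammoth.convert_to_html(f).value


def pptx_to_text(path):
    """PowerPoint: collect the text of every shape that has a text frame."""
    from pptx import Presentation  # third-party: pip install python-pptx

    lines = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)


def xlsx_to_html(path):
    """Excel: read every sheet with pandas and render each as an HTML table."""
    import pandas as pd  # third-party; reading .xlsx also needs openpyxl

    sheets = pd.read_excel(path, sheet_name=None)  # dict: sheet name -> DataFrame
    return "\n".join(
        f"<h2>{name}</h2>\n{df.to_html(index=False)}" for name, df in sheets.items()
    )


# Hypothetical dispatch table: file extension -> converter function.
CONVERTERS = {
    ".docx": docx_to_html,
    ".pptx": pptx_to_text,
    ".xlsx": xlsx_to_html,
}


def convert(path):
    """Pick a converter by file extension; reject anything unknown."""
    suffix = Path(path).suffix.lower()
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return converter(path)
```

The third-party imports are done inside each function so a missing package only matters for the format that needs it. Note that none of this touches COM, so it would run on any OS with Python, which may well be why they went this route.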