Wow, I dunno if that's good or bad, but it's certainly not what I expected.

wis · on Dec 14, 2024

Looking at the code, it looks like they used existing Python packages to read and parse MS Office formats. That's not what I expected: since the repo is in Microsoft's org on GitHub, I assumed they would have used Microsoft's "official" libraries for parsing these formats, through the Component Object Model (COM).

They used Mammoth for DOCX (Word) [1][2], python-pptx for PPTX (PowerPoint) [3][4], and pandas for XLSX (Excel) [5].

[1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
[2] https://pypi.org/project/mammoth/
[3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
[4] https://pypi.org/project/python-pptx/
[5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...

jamwil · on Dec 14, 2024

COM requires you to interact with the files through the associated MS Office applications, whereas these libraries parse the OOXML file format directly.
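
To illustrate what "parsing directly" means: an OOXML file is a zip archive of XML parts, so the text can be read without Office installed. Below is a rough sketch using only the Python standard library. It is not how Mammoth or markitdown do it; the helper names `build_minimal_docx` and `extract_paragraphs` are made up for this example, and the generated archive is a stripped-down stand-in for a real .docx (which also carries `[Content_Types].xml` and relationship parts).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_minimal_docx(paragraphs):
    """Hypothetical helper: build an in-memory zip holding only word/document.xml.

    Not a complete, Word-openable .docx; just enough to demonstrate parsing.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


def extract_paragraphs(docx_bytes):
    """Hypothetical helper: return the text of each <w:p> paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more <w:t> runs
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


if __name__ == "__main__":
    data = build_minimal_docx(["Hello", "Tom & Jerry"])
    print(extract_paragraphs(data))
```

A COM-based approach would instead launch Word and ask it for the document contents, which ties you to Windows and an Office install.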