Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Pandoc (https://pandoc.org) can be used to convert a .docx file to markdown and other file formats like djot and typst. I don't think pandoc can convert powerpoint and excel files.


The hard part about document conversion is not finding a tool which can convert the formats but the tool which does it best. I wonder how MarkItDown ranks for the tasks for the various types.


The README of MarkItDown mentions "indexing and text analysis" as the two motivating features, whereas Pandoc is more interested in document preparation via conversion that maintains rich text formatting.

Since my personal use leans towards the latter, I'm hesitant to believe that this tool will work better for me but others may have other priorities.


MarkItDown feels like running strings; the output is great for text extraction and processing, not for reading by humans


Yeah that was the interesting part to me, at least. Plus, it's Microsoft so hopefully it will work for their files.


That was the first thing I checked, and it looks like they’re using some existing python package to parse docx files. I wonder if they contributed to it or vetted it strongly


Wow, I dunno if that's good or bad, certainly it's not what I expected.


Looking at the code, it looks like they used existing Python packages to read and parse MS Office formats, not what I expected, seeing that the repo is in Microsoft's org on GitHub I expected them to have used Microsoft's "official" libraries for parsing these formats, through Component Object Model (COM).

They used Mammoth for docx (Word) [1][2] Python-pptx for ppt (PowerPoint) [3][4] and Pandas for XSLX (Excel) [5]

[1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3... [2] https://pypi.org/project/mammoth/ [3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3... [4] https://pypi.org/project/python-pptx/ [5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...


COM requires you to interact with the files through the associated MS Office applications, whereas these libs parse the ooxml file format directly.


...I did not catch that it was from Microsoft. I was wondering why a random markdown converter was so notable.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: