Many document-processing libraries and services convert documents, images, audio, video, and other content into clean text that can be used with LLMs.
Some popular options are:
- Textractor
- unstructured.io
- LlamaParse
- llms.txt
MarkItDown is a newer open-source Python library that converts documents to clean text for use with LLMs.
MarkItDown – from Microsoft
A Python library that transforms most document types into clean Markdown. It supports a wide range of formats, including PDF, PPTX, DOCX, XLSX, images, audio, HTML, text-based formats (CSV, JSON, XML), and ZIP files. The library recognizes document structure, processes media such as images and audio, and supports OCR.
Installing MarkItDown for your project
$ pip install markitdown
Basic implementation
from markitdown import MarkItDown

# Create a converter and convert a Word document to Markdown
markdown = MarkItDown()
content = markdown.convert('demo.docx')
print(content.text_content)  # the converted Markdown text
Sample for demo.docx

MarkItDown output shown in a Markdown viewer

Fig: Markdown content (left), Preview of Markdown (right)
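The basic example above converts a single file. For a folder of mixed documents, a small helper can loop over the files and save each result. The following is only a sketch: the helper name convert_all, the output folder markdown_out, and the ".md" naming scheme are assumptions made for illustration, and the broad exception handling is a simplification rather than MarkItDown's documented error model.

```python
from pathlib import Path


def convert_all(converter, sources, out_dir):
    """Convert each source file to Markdown and write it into out_dir.

    `converter` is any object with a MarkItDown-style convert(path) method
    whose result exposes `.text_content`.
    Returns (written, failed): output paths, and (source, error) pairs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for source in map(Path, sources):
        try:
            result = converter.convert(str(source))
        except Exception as exc:  # broad on purpose: error types vary by format
            failed.append((source, exc))
            continue
        # Hypothetical naming scheme: demo.docx -> demo.docx.md
        target = out_dir / (source.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written, failed


def main():
    # Imported here so the helper can be reused without MarkItDown installed.
    from markitdown import MarkItDown

    written, failed = convert_all(MarkItDown(), ["demo.docx"], "markdown_out")
    for path in written:
        print("wrote", path)
    for source, exc in failed:
        print("skipped", source, "->", exc)
```

Calling main() runs the helper against demo.docx. Collecting failures instead of stopping at the first one lets a single unsupported file be reported without losing the rest of the batch.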
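MarkItDown also integrates with LLM providers such as OpenAI, for example to describe an image:
from markitdown import MarkItDown
from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
markdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
content = markdown.convert('screenshot.png')
print(content.text_content)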


Fig: Screenshot.png (left), Code Snippet (Right)
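Since every LLM call adds cost and latency, one possible pattern is to keep two converters and send only images to the LLM-backed one. This is a rough sketch, not part of MarkItDown itself: pick_converter and IMAGE_SUFFIXES are hypothetical names, and the list of image extensions is an assumption to adjust for your own files.

```python
from pathlib import Path

# Assumption: extensions treated as images in this sketch; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pick_converter(path, plain, with_llm):
    """Return `with_llm` for image files and `plain` for everything else."""
    if Path(path).suffix.lower() in IMAGE_SUFFIXES:
        return with_llm
    return plain


def main():
    # Imported here so pick_converter can be tested without these packages.
    from markitdown import MarkItDown
    from openai import OpenAI

    plain = MarkItDown()
    with_llm = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
    for path in ["demo.docx", "screenshot.png"]:
        converter = pick_converter(path, plain, with_llm)
        print(converter.convert(path).text_content)
```

Calling main() converts demo.docx with the plain converter and screenshot.png with the LLM-backed one.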