Unstructured data to clean text for LLMs

There are many document processing libraries and services that convert unstructured data (documents, images, audio, video, etc.) into clean text that can be used with LLMs.

Some popular tools and projects are:

  • Textractor
  • unstructured.io
  • LlamaParse
  • llms.txt

A newer open-source Python library that converts documents to clean text for use with LLMs is MarkItDown.

MarkItDown – from Microsoft

A Python library that transforms most documents into clean Markdown. It supports a large number of formats, including PDF, PPTX, DOCX, XLSX, images, audio, HTML, text-based formats (CSV, JSON, XML), and ZIP files. The library recognizes document structure, processes media such as images and audio, and supports OCR.

Installing MarkItDown for your project

$ pip install markitdown

Basic implementation

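from markitdown import MarkItDown

# Create a converter and turn the Word document into Markdown
markdown = MarkItDown()
content = markdown.convert('demo.docx')
# text_content holds the converted Markdown as a string
print(content.text_content)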

Sample for demo.docx

Result from MarkItDown in a Markdown viewer

Fig: Markdown content (left), Preview of Markdown (right)
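
The same convert call can be applied to a whole folder of files. The following is a minimal sketch, not part of the library: the helper names find_documents and convert_folder and the SUPPORTED extension set are assumptions made for illustration, and only MarkItDown().convert() and text_content come from the example above.

```python
from pathlib import Path

# Assumption: an illustrative subset of extensions, not the library's official list.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def find_documents(folder):
    """Return files under `folder` whose extension is in SUPPORTED, sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def convert_folder(folder, out_dir):
    """Convert each matching document and write <name>.md into `out_dir`."""
    # Imported here so find_documents can be used without markitdown installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in find_documents(folder):
        result = converter.convert(str(doc))
        target = out / (doc.stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_folder("docs", "markdown_out") would write one Markdown file per matching document. Note that two inputs with the same base name (for example report.pdf and report.docx) would overwrite each other in this sketch.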

MarkItDown can also integrate with an LLM, such as an OpenAI model, to generate text descriptions of images, as follows:

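from markitdown import MarkItDown
from openai import OpenAI

# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
# Pass the client and model name so MarkItDown can ask the LLM to describe images
markdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
content = markdown.convert('screenshot.png')
print(content.text_content)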

Fig: Screenshot.png (left), Code Snippet (Right)
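
If the same script has to run on machines with and without an OpenAI key, the LLM settings can be made optional. This is a rough sketch under that assumption: llm_options is a hypothetical helper name, and the only library parameters it relies on are the llm_client and llm_model arguments shown above.

```python
import os


def llm_options(env=None):
    """Return keyword arguments for MarkItDown; empty when no API key is set."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        # No key available: fall back to plain conversion without an LLM.
        return {}
    # Imported lazily so the openai package is only needed when a key exists.
    from openai import OpenAI

    return {"llm_client": OpenAI(), "llm_model": "gpt-4o"}
```

The converter can then be created with MarkItDown(**llm_options()), which behaves like the basic example when no key is configured.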
