Markdown Text Splitter #
MarkdownTextSplitter splits text along Markdown headings, code blocks, or horizontal rules. It’s implemented as a simple subclass of RecursiveCharacterSplitter with Markdown-specific separators. See the source code to see the Markdown syntax expected by default.
- How the text is split: by list of markdown specific characters
- How the chunk size is measured: by length function passed in (defaults to number of characters)
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = """
# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## Quick Install
```bash
# Hopefully this code block isn't split
pip install langchainAs an open source project in a rapidly developing field, we are extremely open to contributions. """ markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
docs
[Document(page_content='# 🦜️🔗 LangChain\n\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0), Document(page_content="Quick Install\n\n```bash\n# Hopefully this code block isn't split\npip install langchain", lookup_str='', metadata={}, lookup_index=0), Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)]