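LangExtract: Features and Workflow of an LLM-Based Text Extraction Tool

LangExtract: An LLM-Based Text Extraction Tool

Overview

What Is LangExtract

LangExtract is a Python library that uses large language models (LLMs) to extract structured information from unstructured text. It can process materials such as clinical notes, literary texts, and reports, identifying and organizing key details while keeping a precise mapping between each extracted item and its position in the source text.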

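Core Capabilities

Source grounding: maps every extraction to its exact character position in the source text.
Structured output: produces output that follows a consistent schema derived from a few examples.
Long-document handling: processes large texts efficiently through chunking and parallel processing.
Interactive visualization: generates an HTML file for reviewing extractions in the context of the original text.
Multi-provider support: works with cloud LLMs (such as Gemini and OpenAI) and locally hosted models (for example, through Ollama).
Domain adaptability: any extraction task can be configured with examples to fit a specific domain.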

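Practice

Installation

LangExtract can be installed from PyPI or built from source. The library requires Python 3.10 or later and offers optional dependencies for specific providers.

Standard installation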

pip install langextract

Development installation

git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"

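Basic Workflow

Figure: LangExtract workflow diagram

The LangExtract workflow consists of the following main steps:

Define the inputs: write a prompt (prompt_description) and examples (examples) to guide the model.
Call lx.extract() to process the input text.

Internally, lx.extract() proceeds as follows:
- Input handling: if fetch_urls=True and the input is a URL, the text is downloaded automatically.
- Prompt template creation: PromptTemplateStructured organizes the prompt and the examples.
- Model configuration: the language model is created according to parameter precedence (model > config > model_id).
- Text processing: Annotator coordinates text chunking, parallel processing, and result parsing.
- Result alignment: Resolver aligns the extractions to positions in the source text.
Visualize the results: save the results and generate an interactive HTML visualization.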
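The precedence rule in the model configuration step (model > config > model_id) can be illustrated with a small sketch. The function resolve_model_source below is hypothetical and is not LangExtract's implementation; it only shows which argument would win when several are supplied.

```python
from typing import Any, Optional


def resolve_model_source(
    model: Optional[Any] = None,
    config: Optional[Any] = None,
    model_id: Optional[str] = None,
) -> str:
    """Return the name of the argument that wins under model > config > model_id.

    Hypothetical helper for illustration only.
    """
    if model is not None:
        return "model"      # a ready-made model object overrides everything else
    if config is not None:
        return "config"     # a configuration object overrides a bare model_id
    if model_id is not None:
        return "model_id"   # a bare identifier is the last resort
    raise ValueError("one of model, config, or model_id must be provided")


print(resolve_model_source(model=object(), model_id="gemini-2.5-flash"))
print(resolve_model_source(config={"model_id": "qwen-plus"}, model_id="other"))
print(resolve_model_source(model_id="gemini-2.5-flash"))
```

In the Qwen example later in this article, a preconfigured model object is passed, so under this rule any model_id given alongside it would be ignored.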

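The returned AnnotatedDocument contains:
- The original text and a document_id.
- A list of Extraction objects, each carrying position information in char_interval.
- An AlignmentStatus for each extraction, indicating the quality of the match.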
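To show what a character interval and an alignment status mean in practice, here is a minimal sketch that locates extraction text in a source string. It is a simplified stand-in, not LangExtract's Resolver: the classes CharInterval and AlignedExtraction and the status labels "exact", "case_insensitive", and "unaligned" are assumptions made for this illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharInterval:
    start_pos: int
    end_pos: int  # exclusive, so source[start_pos:end_pos] is the matched text


@dataclass
class AlignedExtraction:
    extraction_class: str
    extraction_text: str
    char_interval: Optional[CharInterval]
    status: str  # illustrative labels: "exact", "case_insensitive", "unaligned"


def align(source: str, extraction_class: str, extraction_text: str) -> AlignedExtraction:
    """Find extraction_text in source and report how well it matched."""
    start = source.find(extraction_text)
    if start != -1:
        interval = CharInterval(start, start + len(extraction_text))
        return AlignedExtraction(extraction_class, extraction_text, interval, "exact")
    # Fall back to a case-insensitive search; offsets still refer to the source.
    match = re.search(re.escape(extraction_text), source, flags=re.IGNORECASE)
    if match:
        interval = CharInterval(match.start(), match.end())
        return AlignedExtraction(extraction_class, extraction_text, interval, "case_insensitive")
    return AlignedExtraction(extraction_class, extraction_text, None, "unaligned")


source = "Patient took 400 mg PO Ibuprofen q4h for two days."
for cls, text in [("dosage", "400 mg"), ("medication", "ibuprofen"), ("route", "IV")]:
    print(align(source, cls, text))
```

An extraction whose text cannot be found at all has no interval, which is why the example code in this article checks entity.char_interval before printing positions.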

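For long documents, you can pass a URL directly and enable parallel processing and multiple extraction passes to improve performance and accuracy. Multiple model providers are supported (Gemini, OpenAI, Ollama, and others), and a factory selects the appropriate provider automatically.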
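The combination of chunking, parallel processing, and repeated passes can be sketched with the standard library alone. This is an illustration under stated assumptions, not LangExtract's internals: fake_extract stands in for an LLM call (it just finds numbers), chunks are cut at spaces rather than sentence boundaries, and the parameter names are illustrative choices for this sketch.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, max_char_buffer: int) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs no longer than max_char_buffer."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_char_buffer, len(text))
        if end < len(text):
            # Prefer to break right after the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def fake_extract(chunk: str) -> list[tuple[int, int, str]]:
    """Stand-in for an LLM call: return (start, end, text) for each number."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+", chunk)]


def extract_long(text: str, max_char_buffer: int = 20, max_workers: int = 4,
                 extraction_passes: int = 2) -> list[tuple[int, int, str]]:
    """Run the extractor over all chunks in parallel and merge the passes."""
    chunks = chunk_text(text, max_char_buffer)
    found = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(extraction_passes):
            results = pool.map(fake_extract, [chunk for _, chunk in chunks])
            for (offset, _), hits in zip(chunks, results):
                for start, end, value in hits:
                    # Shift chunk-local positions back to document positions.
                    found.add((offset + start, offset + end, value))
    return sorted(found)


text = "Take 400 mg every 4 hours for 2 days, then 200 mg for 7 days."
for start, end, value in extract_long(text):
    print(start, end, value)
```

Because each chunk remembers its offset, positions reported per chunk can be shifted back into document coordinates, and merging passes through a set keeps a result that any single pass found without duplicating it.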

Example: Calling Qwen

import langextract as lx
from langextract import factory
from langextract.providers.openai import OpenAILanguageModel

# Text with a medication mention
input_text = "Patient took 400 mg PO Ibuprofen q4h for two days."

# Define extraction prompt
prompt_description = "Extract medication information including medication name, dosage, route, frequency, and duration in the order they appear in the text."

# Define example data with entities in order of appearance
examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg IV Cefazolin TID for one week.",
        extractions=[
            lx.data.Extraction(extraction_class="dosage", extraction_text="250 mg"),
            lx.data.Extraction(extraction_class="route", extraction_text="IV"),
            lx.data.Extraction(extraction_class="medication", extraction_text="Cefazolin"),
            lx.data.Extraction(extraction_class="frequency", extraction_text="TID"),  # TID = three times a day
            lx.data.Extraction(extraction_class="duration", extraction_text="for one week")
        ]
    )
]

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    fence_output=True,
    use_schema_constraints=False,
    model=OpenAILanguageModel(
        model_id='qwen-plus',
        base_url='',  # set to your OpenAI-compatible endpoint
        api_key='',   # set to your API key
        provider_kwargs={
            'connect_timeout': 60,  # allow 60 seconds for the SSL handshake
            'timeout': 120          # keep an overall request timeout of 120 seconds
        }
    )
)

# Display entities with positions
print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

# Save and visualize the results
lx.io.save_annotated_documents([result], output_name="medical_ner_extraction.jsonl", output_dir=".")

# Generate the interactive visualization
html_content = lx.visualize("medical_ner_extraction.jsonl")
with open("medical_ner_visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)
print("Interactive visualization saved to medical_ner_visualization.html")

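The goal of this code is to connect LangExtract to a large language model (Qwen), automatically extract structured medication information (dosage, route, name, and so on) from clinical text, and present the results by printing them, saving them to a file, and rendering an HTML visualization. The same flow applies to clinical text analysis, medication information extraction, and similar scenarios.

Visualization Output

Figure: LangExtract extraction results in the visualization interface

A UTF-8 charset declaration needs to be added to the generated HTML file to avoid garbled characters: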

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Medical Entity Extraction Visualization</title>
    <style>
        /* existing CSS styles unchanged */
    </style>
</head>
<body>
    <!-- existing HTML content unchanged -->
    <div class="lx-animated-wrapper lx-gif-optimized">
        <!-- ... existing content ... -->
    </div>
    <script>
        // existing JavaScript code unchanged
    </script>
</body>
</html>
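Rather than editing the file by hand each time, the charset declaration can be added programmatically before the file is written. The following is a minimal sketch: ensure_utf8_meta and save_html are hypothetical helper names, and the sketch assumes the visualization output is either a full HTML page or a fragment without a head element.

```python
import re
from pathlib import Path

META = '<meta charset="UTF-8">'


def ensure_utf8_meta(html: str) -> str:
    """Return html with a UTF-8 charset declaration, adding one if absent."""
    if re.search(r"<meta[^>]+charset", html, flags=re.IGNORECASE):
        return html  # a charset is already declared
    head = re.search(r"<head(?:\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head:
        # Insert the declaration right after the opening <head> tag.
        return html[:head.end()] + META + html[head.end():]
    # No <head>: treat the input as a fragment and wrap it in a minimal page.
    return f"<!DOCTYPE html><html><head>{META}</head><body>{html}</body></html>"


def save_html(html: str, path: str) -> None:
    """Write the page as UTF-8 so the bytes match the declared charset."""
    Path(path).write_text(ensure_utf8_meta(html), encoding="utf-8")


print(ensure_utf8_meta("<div>布洛芬 400 mg</div>"))
# Usage with the earlier example (not executed here):
# save_html(html_content, "medical_ner_visualization.html")
```

Writing with encoding="utf-8" matters as much as the meta tag: on platforms whose default encoding is not UTF-8, a plain open(path, "w") can produce bytes that do not match the declared charset.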

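Beyond that, the library's support for Chinese was initially limited.

Adding Chinese Tokenization Support

To support Chinese, modify the patterns in LangExtract's tokenizer. The updated matching patterns are as follows: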

import re  # already imported in the tokenizer module

# Updated to support Chinese characters (CJK Unified Ideographs, Extension A,
# Compatibility Ideographs) alongside ASCII letters

# Runs of consecutive Chinese or English letters (CJK basic block, Extension A, compatibility block)
_LETTERS_PATTERN = r"[A-Za-z\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
# ASCII digits and full-width digits
_DIGITS_PATTERN = r"[0-9\uff10-\uff19]+"
# Symbols: anything other than Chinese, English letters, digits, and whitespace (includes full-width symbols)
_SYMBOLS_PATTERN = r"[^A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\s]+"
# End-of-sentence punctuation (Chinese and English)
_END_OF_SENTENCE_PATTERN = re.compile(r"[.?!。？！]$")
# Slash-separated abbreviations or compounds such as '中/英/混合'
_SLASH_ABBREV_PATTERN = (
    r"[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
    r"(?:/[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)+"
)
# General token pattern: Chinese, English, digits, and symbols
_TOKEN_PATTERN = re.compile(
    rf"{_SLASH_ABBREV_PATTERN}|{_LETTERS_PATTERN}|{_DIGITS_PATTERN}|{_SYMBOLS_PATTERN}"
)
# A complete word (ends in letters or digits)
_WORD_PATTERN = re.compile(rf"(?:{_LETTERS_PATTERN}|{_DIGITS_PATTERN})\Z")

With these changes, the tokenizer handles Chinese text.
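The effect of the updated patterns can be checked in isolation with the standard re module. The sketch below rebuilds the token pattern outside the library (the names CJK, TOKEN, and tokenize are local to this example) and applies it to a mixed Chinese and ASCII sentence.

```python
import re

# Same character ranges as the updated tokenizer patterns.
CJK = r"\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff"
LETTERS = rf"[A-Za-z{CJK}]+"
DIGITS = r"[0-9\uff10-\uff19]+"
SYMBOLS = rf"[^A-Za-z0-9{CJK}\s]+"
SLASH_ABBREV = rf"[A-Za-z0-9{CJK}]+(?:/[A-Za-z0-9{CJK}]+)+"
TOKEN = re.compile(rf"{SLASH_ABBREV}|{LETTERS}|{DIGITS}|{SYMBOLS}")
END_OF_SENTENCE = re.compile(r"[.?!。？！]$")


def tokenize(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start, end) triples in order of appearance."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]


text = "患者服用布洛芬400mg，每日3次。"
tokens = tokenize(text)
for token, start, end in tokens:
    print(f"{start:2d}-{end:2d}  {token}")

# The final token is sentence-ending punctuation.
print(bool(END_OF_SENTENCE.search(tokens[-1][0])))
```

The sentence splits into 患者服用布洛芬, 400, mg, ，, 每日, 3, 次, and 。. Note that a run of consecutive Chinese characters becomes a single token: the patterns separate scripts, digits, and punctuation, but they do not perform Chinese word segmentation.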

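PDF Document Support

LangExtract currently works mainly on raw text strings. In real workflows, source files usually come as PDF, DOCX, or PPTX, which requires the following manual steps:

Manually convert the file to plain text (a step that can lose the original document's layout and provenance information).
Feed the plain text into LangExtract for processing.
Manually map the extractions back to the original document for verification.

A single-step flow would make LangExtract much easier to adopt and use.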

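Proposed Solution

The proposal is to integrate the Docling library as an optional front end for LangExtract:

Docling converts many document formats (such as PDF) into a unified DoclingDocument object.
During conversion it preserves provenance information from the original document (page, bounding box, and reading order).
The text chunks extracted by Docling are fed to LangExtract the same way text is today.
Provenance metadata is then used to map LangExtract's extractions back to the original document.

The integration would be optional (installed via pip install langextract[docling]), so the core package stays lightweight with no extra dependencies.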
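The last step of the proposal, mapping an extraction back to its place in the original document, comes down to tracking where each text block starts in the concatenated text. Here is a minimal sketch of that bookkeeping. It does not use Docling's API: the Block class and its page and bbox fields are hypothetical stand-ins for whatever provenance a converter supplies.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block with provenance from a document converter."""
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom)


def build_text(blocks: list[Block], sep: str = "\n") -> tuple[str, list[int]]:
    """Join block texts and record the start offset of every block."""
    offsets, pos = [], 0
    for block in blocks:
        offsets.append(pos)
        pos += len(block.text) + len(sep)
    return sep.join(block.text for block in blocks), offsets


def locate(blocks: list[Block], offsets: list[int], start_pos: int) -> Block:
    """Return the block that contains the character at start_pos."""
    return blocks[bisect_right(offsets, start_pos) - 1]


blocks = [
    Block("ROMEO. But soft!", page=1, bbox=(72.0, 700.0, 300.0, 720.0)),
    Block("Juliet is the sun.", page=2, bbox=(72.0, 650.0, 300.0, 670.0)),
]
text, offsets = build_text(blocks)

# Pretend an extraction came back with this character position.
start = text.find("Juliet is the sun")
source_block = locate(blocks, offsets, start)
print(source_block.page, source_block.bbox)
```

Because extractions already carry character positions, a lookup like this is enough to recover the page and bounding box, provided the text passed to the extractor is exactly the concatenation that the offsets describe.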

Proof of Concept

The following code is a proof-of-concept example; it has not yet been integrated into the LangExtract library:

import langextract as lx
import textwrap
# pdf_extract is the proof-of-concept module, not part of LangExtract
from pdf_extract import extract_with_file_support

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]
source = "<sample pdf file>.pdf"
result = extract_with_file_support(
    source=source,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
# result.extractions[0].extraction_text
# result.extractions[0].provenance
