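Milvus Analyzer: Solving the RAG Tokenization Problem and Improving Chinese Full-Text Search Accuracy

When building a RAG (Retrieval-Augmented Generation) system, many developers run into the same core problem: how to tokenize text so that full-text search is both efficient and accurate. If text is split in the wrong places, proper nouns are not recognized correctly and retrieval quality suffers badly.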

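Figure: An example of incorrect tokenization in a RAG system, where a segmentation error causes retrieval to fail

How does Milvus Analyzer keep a large model from misreading 《无线电法国别研究》 as “无线电，法国别研究” (that is, "Country-by-Country Studies of Radio Law" read as "Radio: France, Don't Study")?

Figure: A tokenization error leading to a shift in meaning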

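In practice, tokenization is often a nightmare for RAG systems and vector databases. Milvus introduced Full-text Search back in version 2.5, yet users deploying RAG may still find that place names, personal names, and domain-specific terms cannot be retrieved accurately.

For example, in 《鲁迅全集》 (The Complete Works of Lu Xun), a search for “藤野先生” (Mr. Fujino) may succeed while a search for “藤野” (Fujino) returns nothing. In the semiconductor domain, “EUV” and “光刻机” (lithography machine) may each be searchable while “EUV光刻机” is not.

The cause is not a poorly chosen Embedding model, nor a vector-search threshold set too high, and it has nothing to do with the performance of Milvus itself. The most likely culprit is the tokenization step: an unsuitable Analyzer was selected.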
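To see why the Analyzer, rather than the Embedding model, decides whether these queries hit, it helps to picture full-text search as a lookup in an inverted index keyed by exact tokens. The following is a minimal pure-Python sketch, not Milvus code: the `build_index` and `lookup` helpers are hypothetical, and the token lists are hand-written stand-ins for whatever a configured Analyzer would actually emit.

```python
from collections import defaultdict


def build_index(tokenized_docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def lookup(index, query_tokens):
    """Return IDs of documents sharing at least one exact token with the query."""
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return hits


# Hand-written segmentations (assumed, for illustration only).
tokenized_docs = {
    1: ["藤野先生", "是", "我", "的", "老师"],    # "藤野先生" kept as a single token
    2: ["EUV", "光刻机", "是", "核心", "设备"],   # "EUV" and "光刻机" split apart
}
index = build_index(tokenized_docs)

print(lookup(index, ["藤野先生"]))        # {1}
print(lookup(index, ["藤野"]))            # set(): no indexed token equals "藤野"
print(lookup(index, ["EUV光刻机"]))       # set(): query kept whole, document was split
print(lookup(index, ["EUV", "光刻机"]))   # {2}
```

In this toy model a query only matches when its tokens are identical to the indexed ones, which is why the same text can be findable under one segmentation and invisible under another.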

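Typical errors include splitting “武汉市长江大桥” (Wuhan Yangtze River Bridge) into “武汉市长” (Mayor of Wuhan) and “江大桥” (read as a personal name), and splitting “霍格沃兹魔法学院” (Hogwarts School of Witchcraft and Wizardry) into “霍格沃兹魔” and “法学院” (law school).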
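One way such a split can arise is greedy dictionary matching. The sketch below uses forward and backward maximum matching, a classic textbook segmentation strategy, over a tiny hand-made vocabulary. It is an illustration only: the vocabulary is an assumption, and neither function claims to reproduce how Jieba or Milvus segments text.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left segmentation: longest known word ending at the cursor."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


# Hypothetical vocabulary; "江大桥" is included as if it were a known personal name.
VOCAB = {"武汉", "武汉市", "武汉市长", "市长", "长江", "大桥", "长江大桥", "江大桥"}

print(forward_max_match("武汉市长江大桥", VOCAB))   # ['武汉市长', '江大桥']
print(backward_max_match("武汉市长江大桥", VOCAB))  # ['武汉市', '长江大桥']
```

The same sentence and the same vocabulary yield two different segmentations depending on the matching direction, which shows how sensitive downstream retrieval is to the segmentation strategy and dictionary in use.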

Figure: An example of a mis-segmented Guangzhou place name

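Besides over-segmentation, common tokenization problems include under-segmentation and language mismatch. Choosing the right Analyzer is essential to overcoming these problems and achieving efficient full-text search. In text processing, the Analyzer turns raw text into a structured, searchable form, and that choice directly determines final query quality.

This article therefore focuses on how an Analyzer works, how to choose one for different scenarios, and how to put it into practice in production.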

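What Is Milvus Analyzer?

In short, Milvus Analyzer is the text preprocessing and tokenization tool provided by Milvus. It breaks raw text into tokens, then normalizes and cleans them, so that full-text search and text match work better.

The figure below shows the overall architecture of Milvus Analyzer: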

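Figure: Overall architecture of Milvus Analyzer

As the figure shows, the overall processing flow of Milvus Analyzer can be summarized as: raw text → Tokenizer → Filter → Tokens.
An Analyzer has two core components, the Tokenizer and the Filter. Together they convert input text into tokens and refine those tokens so they are ready for efficient indexing and retrieval.

Tokenizer: splits text into basic tokens, for example by whitespace (Whitespace), by Chinese word segmentation (Jieba), or by multilingual segmentation (ICU).
Filter: applies a specific transformation to tokens. Milvus ships with a rich set of built-in filters, such as case normalization (Lowercase), punctuation removal (Removepunct), stop-word filtering (Stop), stemming (Stemmer), and regular-expression matching (Regex). Multiple filters can be configured and are applied in order, which covers complex token-processing needs.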
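The raw text → Tokenizer → Filter → Tokens flow can be pictured as plain function composition. The sketch below is a pure-Python mental model rather than Milvus internals: `simple_tokenizer`, `lowercase_filter`, `make_stop_filter`, and `analyze` are hypothetical names, and the regex-based splitting is an assumption chosen for brevity.

```python
import re


def simple_tokenizer(text):
    """Split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Normalize every token to lower case."""
    return [t.lower() for t in tokens]


def make_stop_filter(stop_words):
    """Build a filter that drops tokens found in stop_words."""
    stop_set = set(stop_words)

    def stop_filter(tokens):
        return [t for t in tokens if t not in stop_set]

    return stop_filter


def analyze(text, tokenizer, filters):
    """Run the tokenizer once, then each filter in the given order."""
    tokens = tokenizer(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens


tokens = analyze(
    "The Quick, Brown Fox!",
    tokenizer=simple_tokenizer,
    filters=[lowercase_filter, make_stop_filter(["the", "of", "to"])],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

Swapping the tokenizer or reordering the filter list changes the final tokens without touching the rest of the pipeline, which is the same flexibility the analyzer configuration exposes.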

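Figure: Tokenizer and Filter workflow

(1) Tokenizer

The Tokenizer is the first processing step in Milvus Analyzer. Its job is to split a piece of raw text into smaller tokens (words or subwords). Different languages and scenarios call for different Tokenizers. Milvus currently supports the following kinds:

Table: Tokenizer types supported by Milvus

In Milvus, the Tokenizer is configured when the Collection's Schema is created, specifically through analyzer_params when defining a VARCHAR field. In other words, the Tokenizer is not a standalone object but part of the field-level configuration, and Milvus tokenizes and preprocesses text automatically when data is inserted.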

from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # turn on text analysis for this field
    analyzer_params={
        "tokenizer": "standard"  # configure the Tokenizer here
    },
)

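(2) Filter

If the Tokenizer is the knife that cuts text apart, the Filter is the finishing step that follows. In Milvus Analyzer, a Filter further normalizes, cleans, or transforms the tokens produced by the Tokenizer so that the final tokens are better suited to retrieval.

Typical Filter tasks include normalizing case, removing stop words (such as "the" and "and"), removing punctuation, and stemming (for example, running → run).

Milvus includes a range of commonly used built-in Filters that cover most language-processing needs:

Table: Built-in Filter types in Milvus

The advantage of Filters is that developers can combine cleaning rules freely to fit the business scenario. For English search, a common combination is Lowercase + Stop + Stemmer, which normalizes case, removes words that carry little meaning, and reduces different word forms to a common stem.

For Chinese search, Cncharonly + Stop is a common pairing that keeps the token stream lean and precise. Like the Tokenizer, Filters are configured through analyzer_params in the FieldSchema. For example: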

from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # turn on text analysis for this field
    analyzer_params={
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            {
                "type": "stop",  # Specifies the filter type as stop
                "stop_words": ["of", "to", "_english_"],  # Custom stop words plus the built-in English stop word list
            },
            {
                "type": "stemmer",  # Specifies the filter type as stemmer
                "language": "english",
            },
        ],
    },
)
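Because filters run in the order they are listed, the order itself can change the result. The toy below makes that visible with two hypothetical pure-Python filters. It assumes the stop filter compares tokens by exact, case-sensitive equality; whether a given Milvus filter behaves the same way is best confirmed by inspecting real output with run_analyzer.

```python
def lowercase_filter(tokens):
    """Normalize every token to lower case."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=frozenset({"the", "of", "to"})):
    """Drop stop words using exact, case-sensitive comparison (toy assumption)."""
    return [t for t in tokens if t not in stop_words]


def run_filters(tokens, filters):
    """Apply each filter to the output of the previous one."""
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens


tokens = ["The", "History", "of", "Search"]

print(run_filters(tokens, [lowercase_filter, stop_filter]))
# ['history', 'search']
print(run_filters(tokens, [stop_filter, lowercase_filter]))
# ['the', 'history', 'search']  ("The" slipped past the stop list before lowercasing)
```

Under this assumption, placing the lowercase step first is what lets a lower-case stop list catch capitalized words.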

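Analyzer Types

Choosing the right Analyzer can significantly improve retrieval efficiency and reduce cost. To cover different scenarios, Milvus provides three kinds of Analyzer:

Built-in Analyzers: Standard, English, and Chinese
Custom Analyzer: assembled from a user-chosen Tokenizer and Filters
Multi-language Analyzer: especially useful for multilingual document collections

(1) Built-in Analyzer

Built-in Analyzers are standard configurations that ship with Milvus. They work out of the box and suit most common scenarios, with the Tokenizer and Filter combination already defined:

Table: Tokenizer and Filter combinations of the built-in Milvus Analyzers

If all you need is ordinary English or Chinese search, you can use a built-in Analyzer directly with no extra configuration.

Note that the Standard Analyzer targets English documents by default. Using it on Chinese text can leave full-text search returning no results, a problem that quite a few community users have already run into.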
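A rough way to see why this happens: a tokenizer that splits only on whitespace and punctuation finds word boundaries in English but almost none in Chinese. The sketch below uses a regex as an assumed stand-in for that behavior. It is not the Milvus Standard tokenizer, whose exact output should be checked with run_analyzer.

```python
import re


def whitespace_punct_tokenizer(text):
    """Assumed stand-in: split only where whitespace or punctuation occurs."""
    return re.findall(r"\w+", text)


english = whitespace_punct_tokenizer("Vector databases enable similarity search")
chinese = whitespace_punct_tokenizer("向量数据库提供快速的相似性搜索功能")

print(english)  # ['Vector', 'databases', 'enable', 'similarity', 'search']
print(chinese)  # ['向量数据库提供快速的相似性搜索功能']

# A query for one word finds nothing, because the only token is the whole sentence.
print("向量数据库" in chinese)  # False
```

With the whole sentence indexed as a single token, a query for any individual word has nothing to match, which is consistent with the empty results described above.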

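(2) Multi-language Analyzer

In a cross-language corpus, a single tokenizer rarely covers every language. For this case Milvus provides the Multi-language Analyzer, which picks a suitable analyzer for each piece of text according to its language. In the code example later in this article, that language comes from a dedicated field stored alongside the text.

The Tokenizers used for different languages are listed below:

Table: Tokenizer mapping for the Multi-language Analyzer

This means that if a dataset contains English, Chinese, Japanese, Korean, or even Arabic, Milvus can handle all of them within a single field, which greatly reduces the amount of manual preprocessing.

(3) Custom Analyzer

If neither the built-in nor the multi-language Analyzers fully meet your needs, Milvus also supports user-defined Analyzers. By freely combining a Tokenizer with Filters, you can build an Analyzer that fits your business.

For example: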

from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # turn on text analysis for this field
    analyzer_params={
        "tokenizer": "jieba",
        # Custom combination: in a mixed Chinese-English corpus, search Chinese only
        # and drop Chinese stop words such as “的”, “了”, and “在”
        "filter": [
            "cncharonly",
            {"type": "stop", "stop_words": ["的", "了", "在"]},
        ],
    },
)
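The effect of a "Chinese characters only, then stop words" chain can be previewed without a running server. The sketch below is a pure-Python approximation under stated assumptions: the token list is hand-written rather than produced by Jieba, `chinese_only_filter` approximates the idea behind cncharonly using the basic CJK Unified Ideographs range, and both helper names are hypothetical.

```python
import re

# Basic CJK Unified Ideographs block; an approximation of "Chinese characters".
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def chinese_only_filter(tokens):
    """Keep tokens made up entirely of Chinese characters."""
    return [t for t in tokens if CJK_ONLY.fullmatch(t)]


def make_stop_filter(stop_words):
    """Build a filter that drops tokens found in stop_words."""
    stop_set = set(stop_words)

    def stop_filter(tokens):
        return [t for t in tokens if t not in stop_set]

    return stop_filter


# Hand-written segmentation of a mixed Chinese-English sentence (assumed).
tokens = ["Milvus", "的", "全文", "检索", "在", "2.6", "版本", "增强", "了"]

chinese_tokens = chinese_only_filter(tokens)
final_tokens = make_stop_filter(["的", "了", "在"])(chinese_tokens)

print(chinese_tokens)  # ['的', '全文', '检索', '在', '版本', '增强', '了']
print(final_tokens)    # ['全文', '检索', '版本', '增强']
```

Note the trade-off this toy makes visible: "Milvus" and "2.6" disappear from the token stream, so a combination like this only fits corpora where the non-Chinese terms really should not be searchable.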

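Code Practice

The following examples use the Python SDK to show how to use an Analyzer in Milvus, first with a regular Analyzer and then with a Multi-language Analyzer.

This article uses Milvus v2.6.1 and PyMilvus v2.6.1.

(1) Regular Analyzer Example

Suppose we want to build a Collection for English text search and have tokenization and preprocessing happen automatically on insert. The example spells out, through analyzer_params, the combination that the built-in English Analyzer corresponds to (standard + lowercase + stop + stemmer), and adds "of" and "to" as custom stop words.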

import time

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(
    uri="http://localhost:19530",
)

schema = client.create_schema()
schema.add_field(
    field_name="id",            # Field name
    datatype=DataType.INT64,    # Integer data type
    is_primary=True,            # Designate as primary key
    auto_id=True,               # Auto-generate IDs (recommended)
)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params={
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            {
                "type": "stop",  # Specifies the filter type as stop
                "stop_words": ["of", "to", "_english_"],  # Custom stop words plus the English stop word list
            },
            {
                "type": "stemmer",  # Specifies the filter type as stemmer
                "language": "english",
            },
        ],
    },
    enable_match=True,
)
schema.add_field(
    field_name="sparse",                    # Field name
    datatype=DataType.SPARSE_FLOAT_VECTOR,  # Sparse vector data type
)

bm25_function = Function(
    name="text_to_vector",            # Descriptive function name
    function_type=FunctionType.BM25,  # Use BM25 algorithm
    input_field_names=["text"],       # Process text from this field
    output_field_names=["sparse"],    # Store vectors in this field
)
schema.add_function(bm25_function)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse",     # Field to index (our vector field)
    index_type="AUTOINDEX",  # Let Milvus choose optimal index type
    metric_type="BM25",      # Must be BM25 for this feature
)

COLLECTION_NAME = "english_demo"
if client.has_collection(COLLECTION_NAME):
    client.drop_collection(COLLECTION_NAME)
    print(f"Dropped existing collection: {COLLECTION_NAME}")
client.create_collection(
    collection_name=COLLECTION_NAME,  # Collection name
    schema=schema,                    # Our schema
    index_params=index_params,        # Our search index configuration
)
print(f"Created collection: {COLLECTION_NAME}")

# Prepare sample data
sample_texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning algorithms are revolutionizing artificial intelligence",
    "Python programming language is widely used for data science projects",
    "Natural language processing helps computers understand human languages",
    "Deep learning models require large amounts of training data",
    "Search engines use complex algorithms to rank web pages",
    "Text analysis and information retrieval are important NLP tasks",
    "Vector databases enable efficient similarity searches",
    "Stemming reduces words to their root forms for better searching",
    "Stop words like 'the', 'and', 'of' are often filtered out",
]

# Insert data
print("\nInserting data...")
data = [{"text": text} for text in sample_texts]
client.insert(
    collection_name=COLLECTION_NAME,
    data=data,
)
print(f"Inserted {len(sample_texts)} rows")

# Demonstrate what the analyzer does
print("\n" + "=" * 60)
print("Analyzer demo")
print("=" * 60)
test_text = "The running dogs are jumping over the lazy cats"
print(f"\nOriginal text: '{test_text}'")

# Use run_analyzer to show the tokenization result
analyzer_result = client.run_analyzer(
    texts=test_text,
    collection_name=COLLECTION_NAME,
    field_name="text",
)
print(f"Tokens: {analyzer_result}")
print("\nNotes:")
print("- lowercase: converts all letters to lower case")
print("- stop words: removes the stop words ['of', 'to'] and common English stop words")
print("- stemmer: reduces words to their stems (running -> run, jumping -> jump)")

# Full-text search demo
print("\n" + "=" * 60)
print("Full-text search demo")
print("=" * 60)

# Wait for the inserted data to be indexed
time.sleep(2)

# Sample search queries
search_queries = [
    "jump",            # Stem matching (should match "jumps")
    "algorithm",       # Exact matching
    "python program",  # Multi-word query
    "learn",           # Stem matching (should match "learning")
]
for i, query in enumerate(search_queries, 1):
    print(f"\nQuery {i}: '{query}'")
    print("-" * 40)
    # Run the full-text search
    search_results = client.search(
        collection_name=COLLECTION_NAME,
        data=[query],                    # Query text
        anns_field="sparse",             # Sparse field populated by the BM25 function
        search_params={"metric_type": "BM25"},
        output_fields=["text"],          # Return the original text
        limit=3,                         # Return the top 3 results
    )
    if search_results and len(search_results[0]) > 0:
        for j, result in enumerate(search_results[0], 1):
            score = result["distance"]
            text = result["entity"]["text"]
            print(f"  Result {j} (score: {score:.4f}): {text}")
    else:
        print("  No matching results")

print("\n" + "=" * 60)
print("Search complete!")
print("=" * 60)

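(2) Multi-language Analyzer Example

If a dataset contains several languages at once, such as English, Chinese, and Japanese, you can enable the Multi-language Analyzer. Milvus then selects the analyzer for each row according to its language, which in this example is read from a language field.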

import time

from pymilvus import MilvusClient, DataType, Function, FunctionType

# Configure the connection
client = MilvusClient(
    uri="http://localhost:19530",
)
COLLECTION_NAME = "multilingual_demo"

# Drop the collection if it already exists
if client.has_collection(COLLECTION_NAME):
    client.drop_collection(COLLECTION_NAME)

# Create the schema
schema = client.create_schema()

# Primary key field
schema.add_field(
    field_name="id",
    datatype=DataType.INT64,
    is_primary=True,
    auto_id=True,
)

# Language identifier field
schema.add_field(
    field_name="language",
    datatype=DataType.VARCHAR,
    max_length=50,
)

# Text field, configured with the multi-language analyzer
multi_analyzer_params = {
    "by_field": "language",  # Choose the analyzer based on the language field
    "analyzers": {
        "en": {
            "type": "english"  # English analyzer
        },
        "zh": {
            "type": "chinese"  # Chinese analyzer
        },
        "jp": {
            "tokenizer": "icu",  # Japanese uses the ICU tokenizer
            "filter": [
                "lowercase",
                {
                    "type": "stop",
                    "stop_words": ["は", "が", "の", "に", "を", "で", "と"],
                },
            ],
        },
        "default": {
            "tokenizer": "icu"  # Fall back to the general-purpose ICU tokenizer
        },
    },
    "alias": {
        "english": "en",
        "chinese": "zh",
        "japanese": "jp",
        "中文": "zh",
        "英文": "en",
        "日文": "jp",
    },
}
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=2000,
    enable_analyzer=True,
    multi_analyzer_params=multi_analyzer_params,
)

# Sparse vector field used by BM25
schema.add_field(
    field_name="sparse_vector",
    datatype=DataType.SPARSE_FLOAT_VECTOR,
)

# Define the BM25 function
bm25_function = Function(
    name="text_bm25",
    function_type=FunctionType.BM25,
    input_field_names=["text"],
    output_field_names=["sparse_vector"],
)
schema.add_function(bm25_function)

# Prepare index parameters
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",
    index_type="AUTOINDEX",
    metric_type="BM25",
)

# Create the collection
client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params,
)

# Prepare multilingual test data
multilingual_data = [
    # English
    {"language": "en", "text": "Artificial intelligence is revolutionizing technology industries worldwide"},
    {"language": "en", "text": "Machine learning algorithms process large datasets efficiently"},
    {"language": "en", "text": "Vector databases provide fast similarity search capabilities"},
    # Chinese
    {"language": "zh", "text": "人工智能正在改变世界各行各业"},
    {"language": "zh", "text": "机器学习算法能够高效处理大规模数据集"},
    {"language": "zh", "text": "向量数据库提供快速的相似性搜索功能"},
    # Japanese
    {"language": "jp", "text": "人工知能は世界中の技術産業に革命をもたらしています"},
    {"language": "jp", "text": "機械学習アルゴリズムは大量のデータセットを効率的に処理します"},
    {"language": "jp", "text": "ベクトルデータベースは高速な類似性検索機能を提供します"},
]
client.insert(
    collection_name=COLLECTION_NAME,
    data=multilingual_data,
)

# Wait for the BM25 function to generate vectors
print("Waiting for BM25 vectors to be generated...")
client.flush(COLLECTION_NAME)
time.sleep(5)
client.load_collection(COLLECTION_NAME)

# Demonstrate what each analyzer does
print("\nAnalyzer output:")
test_texts = {
    "en": "The running algorithms are processing data efficiently",
    "zh": "这些运行中的算法正在高效地处理数据",
    "jp": "これらの実行中のアルゴリズムは効率的にデータを処理しています",
}
for lang, text in test_texts.items():
    print(f"{lang}: {text}")
    try:
        analyzer_result = client.run_analyzer(
            texts=text,
            collection_name=COLLECTION_NAME,
            field_name="text",
            analyzer_names=[lang],
        )
        print(f"  → {analyzer_result}")
    except Exception as e:
        print(f"  → Analysis failed: {e}")

# Multilingual search demo
print("\nSearch test:")
search_cases = [
    ("zh", "人工智能"),
    ("jp", "機械学習"),
    ("en", "algorithm"),
]
for lang, query in search_cases:
    print(f"\n{lang} '{query}':")
    try:
        search_results = client.search(
            collection_name=COLLECTION_NAME,
            data=[query],
            anns_field="sparse_vector",
            # Analyze the query with the same analyzer used for rows in this language
            search_params={"metric_type": "BM25", "analyzer_name": lang},
            output_fields=["language", "text"],
            limit=3,
            filter=f'language == "{lang}"',
        )
        if search_results and len(search_results[0]) > 0:
            for result in search_results[0]:
                score = result["distance"]
                text = result["entity"]["text"]
                print(f"  {score:.3f}: {text}")
        else:
            print("  No results")
    except Exception as e:
        print(f"  Error: {e}")

print("\nDone")
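The by_field, analyzers, alias, and default entries in the configuration above can be read as a small lookup procedure. The sketch below is one plausible mental model in pure Python, not the Milvus implementation: `resolve_analyzer` is a hypothetical helper, and the assumption is that an alias is translated first, with anything unrecognized falling back to "default".

```python
def resolve_analyzer(language, config):
    """Return the analyzer key assumed to apply to a row with this language value."""
    analyzers = config["analyzers"]
    # Translate an alias such as "中文" to its analyzer key; otherwise keep the value.
    name = config.get("alias", {}).get(language, language)
    # Unknown languages fall back to the "default" analyzer.
    return name if name in analyzers else "default"


# Trimmed-down copy of the structure used above; analyzer bodies are shortened.
config = {
    "by_field": "language",
    "analyzers": {
        "en": {"type": "english"},
        "zh": {"type": "chinese"},
        "jp": {"tokenizer": "icu"},
        "default": {"tokenizer": "icu"},
    },
    "alias": {"english": "en", "中文": "zh", "日文": "jp"},
}

rows = [
    {"language": "en", "text": "Vector databases"},
    {"language": "中文", "text": "向量数据库"},
    {"language": "fr", "text": "Bases de données vectorielles"},
]
for row in rows:
    language = row[config["by_field"]]
    print(language, "->", resolve_analyzer(language, config))
# en -> en
# 中文 -> zh
# fr -> default
```

Seen this way, the language field acts as a routing key: rows with a mislabeled or missing language do not fail, but they are analyzed by the default analyzer, which may not segment them well.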

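In addition, Milvus now also supports searching with the language_identifier tokenizer. Its advantage is that you do not have to tell the system which language a text is in: Milvus detects it automatically, so the language field is no longer required. The community's earlier announcement post, “Milvus 2.6引入多语言分析器，全文搜索再升级，助力业务全球化” (Milvus 2.6 introduces multi-language analyzers, upgrading full-text search for global businesses), covers this in detail, so it is not repeated here.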
