Lindera

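The lindera tokenizer performs dictionary-based morphological analysis. It is a good choice for languages such as Japanese, Korean, and Chinese, in which words are not separated by spaces.

Configuration

The lindera tokenizer is built into Milvus. To use it, specify its name in the tokenizer section of analyzer_params.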

analyzer_params = {
    "tokenizer": {
      "type": "lindera",
      "dict_kind": "ipadic"
    }
}

Parameter	Description
`type`	The type of tokenizer. This is fixed to `"lindera"`.
`dict_kind`	The dictionary used to define the vocabulary. Possible values: `ipadic` (Japanese); `ko-dic` (Korean); `cc-cedict` (Mandarin Chinese, traditional and simplified).
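
Because the dictionary names listed above map directly to the `dict_kind` value used in the configuration, switching languages only changes one string. The following is a minimal sketch of a helper that builds the configuration and rejects names outside that list. The helper name `lindera_analyzer_params` and the `LINDERA_DICTS` mapping are hypothetical and are not part of Milvus.

```python
# Dictionary names come from the parameter table; the language labels are
# only used to produce a readable error message.
LINDERA_DICTS = {
    "ipadic": "Japanese",
    "ko-dic": "Korean",
    "cc-cedict": "Mandarin Chinese",
}


def lindera_analyzer_params(dict_kind: str) -> dict:
    """Return analyzer_params for the lindera tokenizer (hypothetical helper)."""
    if dict_kind not in LINDERA_DICTS:
        supported = ", ".join(sorted(LINDERA_DICTS))
        raise ValueError(f"unsupported dict_kind {dict_kind!r}; expected one of: {supported}")
    return {"tokenizer": {"type": "lindera", "dict_kind": dict_kind}}


# Build a configuration for Korean text.
korean_params = lindera_analyzer_params("ko-dic")
print(korean_params)
```

Validating the name locally catches a misspelled dictionary before the configuration is sent to Milvus.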

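After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This allows Milvus to process the text in that field with the specified analyzer for efficient tokenization and filtering. For details, see Example use.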

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.

Analyzer configuration

analyzer_params = {
    "tokenizer": {
      "type": "lindera",
      "dict_kind": "ipadic"
    }
}

Verification using `run_analyzer`

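from pymilvus import MilvusClient

# Connect to a Milvus instance (the URI is an assumption; adjust it to your deployment)
client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅で"

# Run the lindera analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print(result)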

Expected output

['こんにちは', 'かい', 'ぎょ', 'う']
