跳到主要内容

使用 Milvus 和 OpenAI 进行语义搜索

Open In Colab GitHub Repository

本指南展示了如何将 OpenAI 的嵌入 API 与 Milvus 向量数据库结合使用,在文本上进行语义搜索。

开始使用

在开始之前,请确保您已准备好 OpenAI API 密钥,或者从 OpenAI 网站 获取一个。

本示例中使用的数据是书名。您可以从这里下载数据集,并将其放在运行以下代码的同一目录中。

首先,安装 Milvus 和 OpenAI 的包:

pip install --upgrade openai pymilvus

如果您使用的是 Google Colab,要启用刚刚安装的依赖项,您可能需要重启运行时。(点击屏幕顶部的"Runtime"菜单,然后从下拉菜单中选择"Restart session")。

准备好后,我们就可以生成嵌入并使用向量数据库进行语义搜索了。

使用 OpenAI 和 Milvus 搜索书名

在以下示例中,我们从下载的 CSV 文件中加载书名数据,使用 OpenAI 嵌入模型生成向量表示,并将它们存储在 Milvus 向量数据库中进行语义搜索。

from openai import OpenAI
from pymilvus import MilvusClient

MODEL_NAME = "text-embedding-3-small" # Which model to use, please check https://platform.openai.com/docs/guides/embeddings for available models
DIMENSION = 1536 # Dimension of vector embedding

# Connect to OpenAI with API Key.
openai_client = OpenAI(api_key="<YOUR_OPENAI_API_KEY>")

docs = [
"Artificial intelligence was founded as an academic discipline in 1956.",
"Alan Turing was the first person to conduct substantial research in AI.",
"Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = [
vec.embedding
for vec in openai_client.embeddings.create(input=docs, model=MODEL_NAME).data
]

# Prepare data to be stored in Milvus vector database.
# We can store the id, vector representation, raw text and labels such as "subject" in this case in Milvus.
data = [
{"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
for i in range(len(docs))
]


# Connect to Milvus, all data is stored in a local file named "milvus_openai_demo.db"
# in current directory. You can also connect to a remote Milvus server following this
# instruction: https://milvus.io/docs/install_standalone-docker.md.
milvus_client = MilvusClient(uri="milvus_openai_demo.db")
COLLECTION_NAME = "demo_collection" # Milvus collection name
# Create a collection to store the vectors and text.
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)

# Insert all data into Milvus vector database.
res = milvus_client.insert(collection_name="demo_collection", data=data)

print(res["insert_count"])

关于 MilvusClient 的参数:

  • uri 设置为本地文件,例如 ./milvus.db,是最方便的方法,因为它会自动使用 Milvus Lite 将所有数据存储在此文件中。
  • 如果您有大规模数据,可以在 docker 或 kubernetes 上设置性能更高的 Milvus 服务器。在此设置中,请使用服务器 uri,例如 http://localhost:19530,作为您的 uri
  • 如果您想使用 Zilliz Cloud(Milvus 的完全托管云服务),请调整 uritoken,它们对应于 Zilliz Cloud 中的公共端点和 API 密钥

将所有数据存储在 Milvus 向量数据库中后,我们现在可以通过为查询生成向量嵌入并进行向量搜索来执行语义搜索。

queries = ["When was artificial intelligence founded?"]

query_vectors = [
vec.embedding
for vec in openai_client.embeddings.create(input=queries, model=MODEL_NAME).data
]

res = milvus_client.search(
collection_name=COLLECTION_NAME, # target collection
data=query_vectors, # query vectors
limit=2, # number of returned entities
output_fields=["text", "subject"], # specifies fields to be returned
)

for q in queries:
print("Query:", q)
for result in res:
print(result)
print("\n")

您应该看到以下输出:

[
{
"id": 0,
"distance": -0.772376537322998,
"entity": {
"text": "Artificial intelligence was founded as an academic discipline in 1956.",
"subject": "history",
},
},
{
"id": 1,
"distance": -0.58596271276474,
"entity": {
"text": "Alan Turing was the first person to conduct substantial research in AI.",
"subject": "history",
},
},
]