
Build RAG with Milvus and Firecrawl

Open In Colab GitHub Repository

Firecrawl empowers developers to build AI applications with clean data scraped from any website. With advanced scraping, crawling, and data-extraction capabilities, Firecrawl simplifies the process of converting website content into clean markdown or structured data for downstream AI workflows.

In this tutorial, we will show you how to build a Retrieval-Augmented Generation (RAG) pipeline with Milvus and Firecrawl. The pipeline integrates Firecrawl for web data scraping, Milvus for vector storage, and OpenAI for generating insightful, context-aware responses.

Preparation

Dependencies and environment

To get started, install the required dependencies by running the following command:

$ pip install firecrawl-py pymilvus openai requests tqdm

If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

Setting up API keys

To use Firecrawl to scrape data from the specified URL, you need to obtain a FIRECRAWL_API_KEY and set it as an environment variable. We will also use OpenAI as the LLM in this example, so you should prepare an OPENAI_API_KEY as an environment variable as well.

import os

os.environ["FIRECRAWL_API_KEY"] = "fc-***********"
os.environ["OPENAI_API_KEY"] = "sk-***********"

Preparing the LLM and the embedding model

We initialize the OpenAI client to prepare the embedding model.

from openai import OpenAI

openai_client = OpenAI()

Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.

def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )

Generate a test embedding and print its dimension and first few elements.

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]
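The helper above makes one API call per text. The OpenAI embeddings endpoint also accepts a list of inputs, so when embedding many sections you can cut the number of round-trips by batching. A minimal sketch, assuming a batch size of 64 (an arbitrary choice) and a client passed in explicitly:

```python
def batched(items, batch_size=64):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def emb_texts(texts, client, model="text-embedding-3-small"):
    """Embed many texts, sending up to 64 inputs per API request."""
    embeddings = []
    for chunk in batched(texts):
        resp = client.embeddings.create(input=chunk, model=model)
        # The API returns one embedding per input, in the same order.
        embeddings.extend(item.embedding for item in resp.data)
    return embeddings
```

The helper name emb_texts is our own; for this tutorial's small page, the per-section loop shown later works just as well.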

Scraping data with Firecrawl

Initializing the Firecrawl application

We will use the firecrawl library to scrape data from the specified URL in markdown format. Start by initializing the Firecrawl application:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

Scraping the target website

Scrape the content from the target URL. The website LLM-powered Autonomous Agents provides an in-depth exploration of autonomous agent systems built with large language models (LLMs). We will use this content to build a RAG system.

# Scrape the website:
scrape_status = app.scrape_url(
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    params={"formats": ["markdown"]},
)

markdown_content = scrape_status["markdown"]

Processing the scraped content

To make the scraped content easy to insert into Milvus, we simply split on "# ", which roughly separates the content of each main section of the scraped markdown file.

def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]


# Process the scraped markdown content
sections = split_markdown_content(markdown_content)

# Print the first few sections to understand the structure
for i, section in enumerate(sections[:3]):
    print(f"Section {i+1}:")
    print(section[:300] + "...")
    print("-" * 50)

Section 1: Table of Contents


Section 2: Agent System Overview #

In a LLM-powered autonomous agent system, LLM functions as the agent's brain, complemented by several key components:

  • Planning
    • Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling effi...

Section 3: Component One: Planning #

A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.

#...
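Splitting on "# " alone can leave some sections much longer than others, and very long sections may exceed what you want to put into a single embedding. If needed, oversized sections can be further split at paragraph boundaries; a rough sketch, assuming an arbitrary 2000-character cap:

```python
def split_long_section(section, max_chars=2000):
    """Split a section into chunks of at most max_chars characters,
    breaking at blank-line paragraph boundaries where possible."""
    if len(section) <= max_chars:
        return [section]
    chunks, current = [], ""
    for paragraph in section.split("\n\n"):
        candidate = (current + "\n\n" + paragraph) if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph falls back to hard cuts.
            while len(paragraph) > max_chars:
                chunks.append(paragraph[:max_chars])
                paragraph = paragraph[max_chars:]
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

This helper is our own addition and is not required for this tutorial's page, whose sections are all short enough to embed directly.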

Loading the data into Milvus

Create the collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

As for the argument of MilvusClient:

  • Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.

  • If you have a large scale of data, you can set up a more performant Milvus server on docker or kubernetes. In this setup, please use the server uri, e.g. http://localhost:19530, as your uri.

  • If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and Api key in Zilliz Cloud.

Check if the collection already exists and drop it if it does.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with the specified parameters.

If we do not specify any field information, Milvus will automatically create a default id field for the primary key and a vector field to store the vector data. A reserved JSON field is used to store fields not defined in the schema, along with their values.

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/consistency.md#Consistency-Level for details.
)
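We choose metric_type="IP" because OpenAI embeddings are returned with unit L2 norm, so inner product and cosine similarity produce the same ranking. A quick stdlib check of that identity on toy vectors:

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return inner_product(a, b) / (l2_norm(a) * l2_norm(b))


def normalize(v):
    n = l2_norm(v)
    return [x / n for x in v]


a = normalize([1.0, 2.0, 3.0])
b = normalize([4.0, 5.0, 6.0])
# For unit-norm vectors, inner product equals cosine similarity.
assert abs(inner_product(a, b) - cosine(a, b)) < 1e-12
```

If you index vectors that are not normalized, pick the metric deliberately, since IP and cosine then rank results differently.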

Insert data

Iterate through the text sections, create embeddings, and then insert the data into Milvus.

Here is a new field text, which is not defined in the collection schema. It will be automatically added to the reserved JSON dynamic field, and at a high level it can be treated as a normal field.

from tqdm import tqdm

data = []

text_embeddings = [
    emb_text(section) for section in tqdm(sections, desc="Creating embeddings")
]

for i, (section, embedding) in enumerate(zip(sections, text_embeddings)):
    data.append({"id": i, "vector": embedding, "text": section})

milvus_client.insert(collection_name=collection_name, data=data)

Creating embeddings: 100%|██████████| 18/18 [00:05<00:00, 3.20it/s]

{'insert_count': 18, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'cost': 0}

Build RAG

Retrieve data for a query

Let's specify a question about autonomous agents.

question = "What are the approaches to Task Decomposition?"

Search for the question in the collection and retrieve the top 3 semantically most relevant matches.

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Convert the question into an embedding vector
    limit=3,  # Return the top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

Let's take a look at the search results of the query.

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=2))

[
  [
    "Task Decomposition#\n\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\n\nTask decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.\n\nAnother quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) and treats the LLM as a translator to convert goals in natural language to PDDL definitions.",
    0.8536056280136108
  ],
  [
    "Agent System Overview#\n\nIn a LLM-powered autonomous agent system, LLM functions as the agent's brain, complemented by several key components:\n\n- Planning\n - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\n - Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n- Memory\n - Short-term memory: I would consider all the in-context learning (See Prompt Engineering as utilizing short-term memory of the model to learn.\n - Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n- Tool use\n - The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.",
    0.7669463157653809
  ],
  [
    "Self-Reflection#\n\nSelf-reflection is a vital capability for autonomous agents as it allows them to improve iteratively by refining past action decisions and correcting previous mistakes. This capability is crucial for real-world tasks where trial and error are inevitable.\n\nThere are several distinct approaches to incorporate self-reflection in LLM-based agents:\n\n1. Reasoning Shortcuts: ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to include language reasoning traces. The agent can perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with the external environment to incorporate additional information into reasoning (act to reason).\n2. Subgoal Decomposition: Self-reflection can guide the agent to break down complex tasks into smaller, more manageable subgoals. This approach helps in creating a structured plan and facilitates better error handling and recovery.\n3. Past Experience: The agent can leverage past experiences and mistakes to inform current actions. This includes learning from previous successes and failures to improve future decision-making processes.",
    0.7415370345115662
  ]
]
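Before feeding the retrieved chunks to the LLM, you may prefer to drop weak matches rather than always keeping the top 3. A minimal sketch that filters by distance (the 0.7 cutoff is an arbitrary assumption; with IP on normalized vectors, higher scores mean more similar):

```python
def filter_by_distance(lines_with_distances, min_distance=0.7):
    """Keep only retrieved (text, distance) pairs at or above the cutoff."""
    return [
        (text, dist) for text, dist in lines_with_distances if dist >= min_distance
    ]


def build_context(lines_with_distances, min_distance=0.7):
    """Join the surviving chunks into a single context string."""
    kept = filter_by_distance(lines_with_distances, min_distance)
    return "\n".join(text for text, _ in kept)
```

A sensible cutoff depends on your embedding model and data, so tune it against a few known-good queries before relying on it.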

Generating a response with the LLM

Convert the retrieved documents into a string format.

context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])

Define the system and user prompts for OpenAI. The user prompt is assembled from the retrieved documents.

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are given a user question, and please write clean, concise and accurate answer to the question. You will be given a set of related contexts to the question, please answer based on the context. Please say "information is not available" if the question cannot be answered based on the context.
"""

USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use OpenAI to generate a response based on the prompts.

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": USER_PROMPT.format(context=context, question=question),
        },
    ],
)

print(response.choices[0].message.content)

Based on the provided context, there are several approaches to Task Decomposition:

  1. LLM with Simple Prompting: This involves using a Large Language Model with straightforward prompts such as:

    • "Steps for XYZ.\n1."
    • "What are the subgoals for achieving XYZ?"
  2. Task-Specific Instructions: This approach uses specific instructions tailored to the task at hand. For example:

    • "Write a story outline." for writing a novel
  3. Human Inputs: Involving human guidance and input in the decomposition process.

  4. LLM+P Approach: This is a distinct method that relies on an external classical planner for long-horizon planning. It utilizes the Planning Domain Definition Language (PDDL) and treats the LLM as a translator to convert goals in natural language to PDDL definitions.

Additionally, the context mentions that Subgoal Decomposition can be used as part of self-reflection, where the agent breaks down complex tasks into smaller, more manageable subgoals to create a structured plan and facilitate better error handling and recovery.

Great! We have built a RAG pipeline with Milvus and Firecrawl.
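To reuse the pipeline for other questions, the retrieval and generation steps can be wrapped in a single function. The sketch below takes the embedding, search, and generation functions as parameters so it stays decoupled from any particular client; the name answer_question and this wiring are our own, not part of the Milvus or Firecrawl APIs:

```python
def answer_question(question, embed, search, generate, limit=3):
    """Run retrieve-then-generate for a single question.

    embed(text) -> vector; search(vector, limit) -> list of (text, distance);
    generate(context, question) -> answer string.
    """
    hits = search(embed(question), limit)
    context = "\n".join(text for text, _ in hits)
    return generate(context, question)
```

With the objects from this tutorial, embed would be emb_text, search a thin wrapper around milvus_client.search that extracts (text, distance) pairs, and generate a wrapper around openai_client.chat.completions.create using the prompts above.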

Quick deploy

To learn about how to start an online demo with this tutorial, please refer to the example application.