Wiring in Retrieval, Tools, and Skills

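The project can already extract anomaly evidence from the logs into outputs/evidence.json. Two pieces are still missing: finding the relevant troubleshooting entries in the Runbook, and having the Agent follow fixed rules to read the evidence, search the Runbook, and then assemble a report.

This part reuses concepts from the earlier articles, but each one lands on something concrete. search_runbook is the retrieval step of RAG, read_evidence and search_runbook are the tools of function calling, run_agent.py is the Agent loop, and skills/log-triage/SKILL.md is the fixed procedure that Skills describe.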

1. Current state

The project already contains these materials:

text

inputs/sample.log
outputs/evidence.json
runbooks/common.md
skills/log-triage/SKILL.md

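evidence.json holds the facts that Python extracted from the logs: title, reasons, request_id, path, status, cost_ms, and the raw log window. The Runbook is still a Markdown file and cannot be searched by query terms directly. SKILL.md is also just a rules file that the entry script does not read yet.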

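The file relationships at this layer become: scripts/read_logs turns inputs/sample.log into outputs/evidence.json, scripts/build_index turns runbooks/*.md into outputs/runbook-index.json, and scripts/run_agent reads both outputs together with skills/log-triage/SKILL.md.

report.md and run-trace.json are generated by the entry script. For now the code only wires up the index, the tools, and the rules, so that the full Agent script does not pile up in a single file all at once.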

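2. Runbook index

The Runbook is currently a single Markdown file. Before it can be searched, it has to be split into sections: ## 登录接口超时 (login endpoint timeout) and ## 订单同步失败 (order sync failure) each become one section. This goes into a new file, scripts/build_index.py.

New file: scripts/build_index.py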

python

#!/usr/bin/env python3
"""Build a simple index of the Runbook."""

from __future__ import annotations

import json
import re
from dataclasses import asdict, dataclass

from config import OUTPUT_DIR, RUNBOOK_DIR


@dataclass
class RunbookSection:
    source: str
    title: str
    content: str
    keywords: list[str]

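source records which Markdown file a section came from, so that once there are many Runbook files you can still get back to the original text. title and content are for the model to read; keywords are what the local search function scores against.

3. Splitting into sections

Section splitting also goes into scripts/build_index.py. It only recognizes level-two headings, that is, ## xxx. A level-one heading such as # 常见排查 (common troubleshooting) is treated as the document title and does not enter the index.

Continue editing: scripts/build_index.py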

python

def build_section(source: str, title: str, lines: list[str]) -> RunbookSection:
    content = "\n".join(line for line in lines if line.strip())
    words = re.findall(r"[A-Za-z0-9_/-]+", f"{title}\n{content}")
    return RunbookSection(
        source=source,
        title=title,
        content=content,
        keywords=sorted(set(words)),
    )


def split_sections(source: str, text: str) -> list[RunbookSection]:
    sections: list[RunbookSection] = []
    current_title = ""
    current_lines: list[str] = []

    for line in text.splitlines():
        if line.startswith("## "):
            if current_title:
                sections.append(build_section(source, current_title, current_lines))
            current_title = line.removeprefix("## ").strip()
            current_lines = []
            continue

        current_lines.append(line)

    if current_title:
        sections.append(build_section(source, current_title, current_lines))

    return sections

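keywords are the runs of ASCII letters, digits, underscores, slashes, and hyphens extracted from the title and body. That captures the main fragments of /orders, 502, 503, and task=sync_orders (which splits into task and sync_orders). Chinese text is never matched by this regex, so Chinese titles contribute no keywords, and a real Chinese Runbook would need embeddings or word segmentation. In the current sample the key clues are mostly paths, status codes, and task names, which is enough to observe the flow.

4. Index entry point

The index entry point reads runbooks/*.md and writes all sections to outputs/runbook-index.json. The search_runbook tool reads this file.

Continue editing: scripts/build_index.py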
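To see what the keyword regex keeps and drops, here is a minimal sketch that applies the same character class to a few lines. The sample lines are made up for illustration and are not taken from runbooks/common.md.

```python
import re

# Same character class as the keyword regex in build_section.
TOKEN_RE = re.compile(r"[A-Za-z0-9_/-]+")

# Hypothetical Runbook lines, for illustration only.
samples = [
    "GET /orders 502 upstream timeout",
    "task=sync_orders failed",
    "## 订单同步失败",
]

# One token list per sample line.
tokens_per_line = [TOKEN_RE.findall(line) for line in samples]

for line, tokens in zip(samples, tokens_per_line):
    print(f"{line!r} -> {tokens}")
```

The "=" acts as a separator, so task=sync_orders yields two tokens, and the Chinese heading yields none at all. Matching is also case-sensitive, so Timeout and timeout would be different keywords.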

python

def main() -> int:
    sections: list[RunbookSection] = []

    for path in sorted(RUNBOOK_DIR.glob("*.md")):
        text = path.read_text(encoding="utf-8")
        sections.extend(split_sections(path.name, text))

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    output_path = OUTPUT_DIR / "runbook-index.json"
    output_path.write_text(
        json.dumps([asdict(item) for item in sections], ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

    print(f"sections: {len(sections)}")
    print(f"output: {output_path}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Run it from the project root:

bash

uv run python -m scripts.build_index

The output should be:

text

sections: 2
output: .../ops-log-agent/outputs/runbook-index.json

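The index contains two sections: 登录接口超时 and 订单同步失败. When the evidence contains /orders 502 upstream timeout, the search should hit the second one.

5. Tool module

The Agent does not read file paths directly; it gets its material through tool functions. This is the same as the contact lookup in the function calling article: the model only knows tool names and parameters, and Python is what actually reads the local files.

For now, add two plain functions to scripts/run_agent.py: one reads the evidence, the other searches the Runbook index.

Edit file: scripts/run_agent.py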
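For a sense of what one entry in runbook-index.json looks like, the sketch below serializes a single section the same way main() does, with asdict and json.dumps. The section content and keywords are hypothetical; the real values come from runbooks/common.md.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunbookSection:
    source: str
    title: str
    content: str
    keywords: list[str]


# Hypothetical section; only the title is taken from the article.
section = RunbookSection(
    source="common.md",
    title="订单同步失败",
    content="/orders returns 502 upstream timeout; check task=sync_orders",
    keywords=["/orders", "502", "sync_orders", "task", "timeout", "upstream"],
)

# Default settings escape non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(asdict(section))

# ensure_ascii=False keeps the Chinese title readable in the index file.
readable = json.dumps(asdict(section), ensure_ascii=False, indent=2)

print(readable)
```

Both forms load back to the same data. ensure_ascii=False only matters when a person opens the index file to inspect it, which is exactly what this step asks you to do.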

python

import json
import re
from pathlib import Path

from config import BASE_DIR, MAX_TOP_K, OUTPUT_DIR


def read_evidence() -> dict:
    evidence_path = OUTPUT_DIR / "evidence.json"
    if not evidence_path.exists():
        # The error text means "missing outputs/evidence.json".
        return {"ok": False, "error": "缺少 outputs/evidence.json"}

    return {
        "ok": True,
        "items": json.loads(evidence_path.read_text(encoding="utf-8")),
    }

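read_evidence takes no parameters and reads the evidence file produced by the current run. The fewer parameters a tool has, the less likely the model is to fill them in wrong. The function returns ok together with either items or error, so success and failure share one shape, which also makes the trace easy to save.

Next, add the Runbook search function: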

python

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[A-Za-z0-9_/-]+", text))


def search_runbook(query: str, limit: int = MAX_TOP_K) -> dict:
    index_path = OUTPUT_DIR / "runbook-index.json"
    if not index_path.exists():
        # The error text means "missing outputs/runbook-index.json".
        return {"ok": False, "error": "缺少 outputs/runbook-index.json"}

    sections = json.loads(index_path.read_text(encoding="utf-8"))
    query_tokens = tokenize(query)

    results = []
    for section in sections:
        score = len(query_tokens & set(section["keywords"]))
        if score <= 0:
            continue
        results.append(
            {
                "score": score,
                "source": section["source"],
                "title": section["title"],
                "content": section["content"],
            }
        )

    results.sort(key=lambda item: item["score"], reverse=True)
    return {"ok": True, "items": results[:limit]}

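This is a very crude local search. The evidence contains /orders, 502, and upstream timeout, the Runbook sections contain the same tokens, and the larger the overlap, the higher the score. It is not embedding retrieval yet, but the function's input and output already have the shape RAG needs: a query goes in, relevant passages come out.

6. Verifying the search

Before connecting the tools to a model, verify the search on its own. You can temporarily print once from main() in scripts/run_agent.py: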
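The overlap scoring can be tried without any files. This is a minimal sketch with two in-memory index entries; the keyword lists are hypothetical stand-ins for what build_index would produce.

```python
import re


def tokenize(text: str) -> set[str]:
    # Same tokenizer as in run_agent.py: ASCII letters, digits, _, / and -.
    return set(re.findall(r"[A-Za-z0-9_/-]+", text))


# Hypothetical index entries, for illustration only.
sections = [
    {"title": "登录接口超时", "keywords": ["/login", "504", "timeout"]},
    {
        "title": "订单同步失败",
        "keywords": ["/orders", "502", "upstream", "timeout", "sync_orders"],
    },
]

query_tokens = tokenize("/orders 502 upstream timeout")

# Score = number of query tokens that also appear in the section keywords.
scores = {s["title"]: len(query_tokens & set(s["keywords"])) for s in sections}

# Highest score first, mirroring the sort in search_runbook.
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranked)
```

A generic word such as timeout still gives the login section one point, so an unrelated section can appear in the results with a low score. The score is returned alongside each item, which lets the caller judge how strong a match is.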

python

def main() -> int:
    print(read_evidence())
    print(search_runbook("/orders 502 upstream timeout"))
    return 0

Run in this order:

bash

uv run python -m scripts.read_logs
uv run python -m scripts.build_index
uv run python -m scripts.run_agent

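If the first two steps have not been run, the third step reports missing files. print shows each result as a Python dict; written as JSON, the first one looks like this, where the error text means "missing outputs/evidence.json":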

json

{"ok": false, "error": "缺少 outputs/evidence.json"}

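When both files exist, search_runbook("/orders 502 upstream timeout") should return 订单同步失败. Seeing that result means the log evidence and the Runbook index are connected. Whether the report is well written is a question for the next layer; whether the material can be found has to be confirmed here first.

7. Tool definitions

Once the plain functions work, describe them as tools the model can call. The definitions live in scripts/run_agent.py and follow the same style as articles 05 and 08.

Continue editing: scripts/run_agent.py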

python

TOOLS = [
    {
        "type": "function",
        "name": "read_evidence",
        # Description: "Read the evidence list produced by the current log analysis."
        "description": "读取当前日志分析产生的证据列表。",
        "parameters": {
            "type": "object",
            "properties": {},
            "required": [],
            "additionalProperties": False,
        },
        "strict": True,
    },
    {
        "type": "function",
        "name": "search_runbook",
        # Description: "Search Runbook sections based on clues from the logs."
        "description": "根据日志线索搜索 Runbook 章节。",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    # "Query terms extracted from the log evidence, for example ..."
                    "description": "从日志证据中提取的查询词，例如 /orders 502 upstream timeout",
                }
            },
            "required": ["query"],
            "additionalProperties": False,
        },
        "strict": True,
    },
]

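read_evidence corresponds to "first look at what evidence the logs contain". search_runbook corresponds to "take keywords from the evidence and search the troubleshooting material". The model does not need to know file paths or the structure of the index JSON; it gets material only through these two tools.

8. Tool routing

After the model returns a function_call, Python has to turn the tool name into a local function call. This again uses a whitelist rather than dynamically executing a function by its name string.

Continue editing: scripts/run_agent.py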

python

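def dispatch_tool(tool_call) -> dict:
    try:
        arguments = json.loads(tool_call.arguments or "{}")
    except json.JSONDecodeError as exc:
        # Error text: "tool arguments are not valid JSON".
        return {"ok": False, "error": f"工具参数不是合法 JSON: {exc}"}

    # json.loads can also return a list or scalar; treat anything but an
    # object as "no arguments" so the .get() below cannot raise.
    if not isinstance(arguments, dict):
        arguments = {}

    if tool_call.name == "read_evidence":
        return read_evidence()

    if tool_call.name == "search_runbook":
        query = arguments.get("query")
        if not isinstance(query, str) or not query.strip():
            # Error text: "invalid query".
            return {"ok": False, "error": f"query 不合法: {query!r}"}
        return search_runbook(query=query)

    # Error text: "unknown tool".
    return {"ok": False, "error": f"未知工具: {tool_call.name}"}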

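This also lines up with the MCP article: the tool entry point can be standardized, but what can actually run is still controlled by the local whitelist. If the model asks to call delete_log, there is no such branch locally, so all it gets back is 未知工具 (unknown tool).

9. Wiring in the Skill

The log-triage Skill is already in the project. The entry script can start by reading the local SKILL.md directly and appending it to instructions. A real Agent client may have its own Skills loading mechanism; the local script uses the most direct approach, reading the file, to observe the effect.

Continue editing: scripts/run_agent.py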
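The whitelist idea can also be written as a lookup table instead of if branches. This is a sketch of that variant, not the article's dispatch_tool: ALLOWED_TOOLS and dispatch are hypothetical names, the read_evidence here is a stand-in, and SimpleNamespace is assumed as a substitute for the model's function_call object.

```python
from types import SimpleNamespace


def read_evidence() -> dict:
    # Stand-in for the real tool; returns a fixed, empty result.
    return {"ok": True, "items": []}


# Whitelist: only names listed here can ever be executed.
ALLOWED_TOOLS = {"read_evidence": read_evidence}


def dispatch(tool_call) -> dict:
    handler = ALLOWED_TOOLS.get(tool_call.name)
    if handler is None:
        # Anything outside the whitelist gets an error result, never a call.
        return {"ok": False, "error": f"unknown tool: {tool_call.name}"}
    return handler()


# Fake function_call objects carrying only the fields the dispatcher reads.
allowed = dispatch(SimpleNamespace(name="read_evidence", arguments="{}"))
blocked = dispatch(SimpleNamespace(name="delete_log", arguments="{}"))

print(allowed)
print(blocked)
```

Either form gives the same guarantee: the tool name from the model is only ever compared against known names, never passed to eval, getattr, or an import.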

python

def load_skill_text() -> str:
    skill_path = BASE_DIR / "skills" / "log-triage" / "SKILL.md"
    return skill_path.read_text(encoding="utf-8")


def build_instructions() -> str:
    skill_text = load_skill_text()
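    # Prompt, in English: "Produce the analysis from the log evidence and the
    # Runbook. When material is needed, call read_evidence first, then call
    # search_runbook based on the evidence. Conclusions may only come from
    # tool results."
    return (
        "根据日志证据和 Runbook 输出分析结果。"
        "需要资料时先调用 read_evidence，再根据证据调用 search_runbook。"
        "结论只能来自工具结果。\n\n"
        f"{skill_text}"
    )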

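The point here is not to turn the Skill into a tool. A Skill performs no actions; it only supplies the rules for analysis. Tools fetch material, and the Skill defines the report structure and the boundaries of the evidence. If that division blurs, SKILL.md starts accumulating file paths, JSON structures, and search logic, and it gets messier from there.

10. Run trace

Once the Agent is connected, save the tool call record in addition to report.md. Often when a report is wrong, it is not because the model missed the rules, but because it never checked the evidence at all, or its query went off target when searching the Runbook.

run-trace.json should keep at least these kinds of information: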

json

{
  "tools": [
    {
      "tool": "read_evidence",
      "arguments": "{}",
      "ok": true
    },
    {
      "tool": "search_runbook",
      "arguments": "{\"query\":\"/orders 502 upstream timeout\"}",
      "ok": true
    }
  ],
  "report_path": "outputs/report.md"
}

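The value of this trace is direct. If a tool was never called, the problem is in the Agent's decision; if it was called with the wrong arguments, the problem is in the query; if the tool returned the right result but the report drifted, the problem is in the generation stage. Without a trace, looking only at the final report, it is hard to tell which layer failed.

11. Run order

The three commands are currently fixed: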
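One way to collect such a trace is to append an entry after every tool call and write the file at the end. The sketch below assumes two hypothetical helpers, record_call and save_trace, and writes to a temporary directory rather than the real outputs/ folder.

```python
import json
import tempfile
from pathlib import Path


def record_call(trace: dict, tool: str, arguments: str, result: dict) -> None:
    # Keep the raw argument string so a bad query is visible later.
    trace["tools"].append(
        {"tool": tool, "arguments": arguments, "ok": bool(result.get("ok"))}
    )


def save_trace(trace: dict, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(trace, ensure_ascii=False, indent=2), encoding="utf-8"
    )


trace = {"tools": [], "report_path": "outputs/report.md"}

# Simulated tool results; a real run would pass what dispatch_tool returned.
record_call(trace, "read_evidence", "{}", {"ok": True, "items": []})
record_call(
    trace,
    "search_runbook",
    '{"query":"/orders 502 upstream timeout"}',
    {"ok": True, "items": []},
)

with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "outputs" / "run-trace.json"
    save_trace(trace, trace_path)
    saved = json.loads(trace_path.read_text(encoding="utf-8"))

print(json.dumps(saved, indent=2))
```

Storing arguments as the raw string the model produced, rather than the parsed dict, means malformed JSON from the model also ends up in the trace.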

bash

uv run python -m scripts.read_logs
uv run python -m scripts.build_index
uv run python -m scripts.run_agent

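The first step generates outputs/evidence.json, the second generates outputs/runbook-index.json, and the third reads both files and prepares to enter Agent analysis. Once the model is connected, the third step will also generate: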

text

outputs/report.md
outputs/run-trace.json

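At this point, the pieces of the project map onto the basic concepts from earlier:

outputs/evidence.json is the structured facts that the tools read.
outputs/runbook-index.json is the local RAG retrieval material.
read_evidence and search_runbook are the functions the model can call.
log-triage/SKILL.md is the fixed rules for analyzing logs.
run-trace.json is for looking back at which tools the Agent actually called.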
