Week 7 | Lab | Upgrading Week3 document assets into Week8-ready document assets

Run the minimal closed loop: parse → section → chunk → anchor → quality → gate

This lab is not a standalone PDF demo.

You start from the Week3 raw document asset and upgrade it into document asset v1, which Week8 can safely index and cite.

What this lab covers

  • Confirm that the Week3 document input exists
  • Read the raw document from MinIO / source fallback
  • Run the Docling-first parse
  • Generate normalized sections
  • Run section-aware chunking
  • Generate evidence anchors
  • Write chunk_quality_report.md
  • Write week8_ready_gate.json

Reference lab time

90–120 minutes

Lab outputs

  • artifacts/week07/parsed_doc.json
  • artifacts/week07/sections.json
  • artifacts/week07/chunks.json
  • artifacts/week07/evidence_anchors.json
  • reports/week07/doc_ingest_input_report.json
  • reports/week07/chunk_quality_report.md
  • reports/week07/week8_ready_gate.json
  • reports/week07/week07_lab_execution_summary.md

First, the closed-loop diagram

flowchart LR
    INGEST["Week3 doc_ingest"] --> RAW["raw_doc_asset<br/>MinIO raw object"]
    RAW --> PARSE["Docling-first parse"]
    PARSE --> SECTION["normalized sections"]
    SECTION --> CHUNK["section-aware chunks"]
    CHUNK --> ANCHOR["evidence anchors"]
    ANCHOR --> QUALITY["chunk quality report"]
    QUALITY --> GATE["Week8 ready gate"]

0. Environment setup

This lab assumes the Docker-first path by default:

docker compose --env-file infra/env/.env.local -f infra/docker-compose.yml up -d --build
docker compose --env-file infra/env/.env.local -f infra/docker-compose.yml run --rm devbox bash

Run all subsequent commands from inside devbox.

1. Confirm the Week3 document input

python -m pipelines.ingestion.doc_ingest \
  --manifest data/seed_manifests/manifest_workspace_helpcenter_v1.json \
  --source-dir data/canonization/documents \
  --batch-id week07-lab-001 \
  --report-json reports/week07/doc_ingest_input_report.json

Confirm that the report contains at least:

  • raw_object_path
  • source_fingerprint
  • doc_id / source_id
  • ingest status
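
The field check above can also be scripted. The following is a minimal sketch, not the course's own tooling. It assumes the report is a JSON object with a top-level `documents` list and that the ingest status is stored under a key named `status`. Both are assumptions about the report layout, so adjust them to what `doc_ingest` actually writes.

```python
import json
from pathlib import Path
from typing import Any

# Field names follow the checklist above. The "status" key and the top-level
# "documents" list are ASSUMPTIONS about the report layout, not confirmed facts.
REQUIRED_FIELDS = ("raw_object_path", "source_fingerprint", "doc_id", "source_id", "status")


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or empty in one record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def check_report(report: dict[str, Any]) -> dict[str, list[str]]:
    """Map each incomplete record's doc_id (or its index) to its missing fields."""
    problems: dict[str, list[str]] = {}
    for index, record in enumerate(report.get("documents", [])):
        missing = missing_fields(record)
        if missing:
            problems[str(record.get("doc_id", f"#{index}"))] = missing
    return problems


if __name__ == "__main__":
    path = Path("reports/week07/doc_ingest_input_report.json")
    if path.exists():
        print(check_report(json.loads(path.read_text(encoding="utf-8"))))
```

An empty result means every record carries the fields the later steps depend on.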

2. Run the Week7 parse plan

python -m pipelines.parse_normalize.run_parse \
  --input-source postgres \
  --limit 50 \
  --parse-strategy docling_v1_no_ocr \
  --chunk-strategy section_aware_v1 \
  --out-dir artifacts/week07 \
  --report-dir reports/week07 \
  --plan-only

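The plan stage only validates inputs and configuration; it does not write any output. Check in particular:

  • Whether s3://omni-raw-documents/... is used first
  • Whether ocr=false is the default
  • Whether embedding / pgvector writes are off by default
  • Whether the strategy version is recorded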

3. Docling-first parse

python -m pipelines.parse_normalize.run_parse \
  --input-source postgres \
  --limit 50 \
  --parse-strategy docling_v1_no_ocr \
  --chunk-strategy section_aware_v1 \
  --out-dir artifacts/week07 \
  --report-dir reports/week07

If the Docling dependency is not installed, the CLI should print a clear installation hint instead of silently generating fake data.

4. Inspect sections

jq '.sections[0]' artifacts/week07/sections.json

You should see fields like these:

  • section_id
  • doc_id
  • section_path
  • page_no
  • bbox / bbox_missing_reason
  • source_fingerprint
  • doc_version
  • parse_strategy_version
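
To check every section rather than only the first one, a small validator can help. This is a sketch under one assumption: the bbox entry is read as an either-or rule, so a section must carry either a `bbox` value or a non-empty `bbox_missing_reason`. The function name `section_problems` is hypothetical.

```python
from typing import Any

# Fields taken from the list above; each key must be present in the section.
ALWAYS_REQUIRED = (
    "section_id", "doc_id", "section_path", "page_no",
    "source_fingerprint", "doc_version", "parse_strategy_version",
)


def section_problems(section: dict[str, Any]) -> list[str]:
    """Return human-readable problems for one section record."""
    problems = [f"missing:{name}" for name in ALWAYS_REQUIRED if name not in section]
    has_bbox = section.get("bbox") is not None
    has_reason = bool(section.get("bbox_missing_reason"))
    # ASSUMPTION: a section without a bbox must explain why it is missing.
    if not has_bbox and not has_reason:
        problems.append("bbox_or_bbox_missing_reason_required")
    return problems


if __name__ == "__main__":
    html_section = {
        "section_id": "s1", "doc_id": "d1", "section_path": "Help/Billing",
        "page_no": None, "source_fingerprint": "abc", "doc_version": "v1",
        "parse_strategy_version": "docling_v1_no_ocr",
        "bbox": None, "bbox_missing_reason": "non_paged_source",
    }
    print(section_problems(html_section))
```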

5. Inspect chunks

jq '.chunks[0]' artifacts/week07/chunks.json

You must see:

  • chunk_id
  • section_id
  • chunk_index
  • content
  • chunk_strategy_version
  • overlap_prev
  • overlap_next
  • quality_flags
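
This lab does not spell out how `section_aware_v1` works internally. To illustrate what these fields can mean, here is a deliberately simple stand-in: fixed-size character windows cut inside a single section, so no chunk crosses a section boundary. It is not the real strategy. The sketch assumes `overlap_prev` and `overlap_next` are character counts shared with the neighboring chunk. The `chunk_id` format, `max_chars`, `overlap`, and the version label are all hypothetical.

```python
from typing import Any


def chunk_section(section_id: str, text: str, max_chars: int = 200, overlap: int = 20) -> list[dict[str, Any]]:
    """Split one section's text into overlapping windows; never crosses sections."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks: list[dict[str, Any]] = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append({
            "chunk_id": f"{section_id}-c{len(chunks)}",  # hypothetical id format
            "section_id": section_id,
            "chunk_index": len(chunks),
            "content": text[start:end],
            "chunk_strategy_version": "sketch_fixed_window_v0",  # not section_aware_v1
            "overlap_prev": overlap if start > 0 else 0,
            "overlap_next": 0,  # filled in below once we know a next chunk exists
            "quality_flags": [],
        })
        if end == len(text):
            break
        start += step
    for current in chunks[:-1]:
        current["overlap_next"] = overlap
    return chunks


if __name__ == "__main__":
    for chunk in chunk_section("s1", "x" * 450):
        print(chunk["chunk_id"], len(chunk["content"]), chunk["overlap_prev"], chunk["overlap_next"])
```

Because the function takes one section at a time, the section boundary is enforced by construction: the caller loops over sections and concatenates the results.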

6. Inspect evidence anchors

jq '.anchors[0]' artifacts/week07/evidence_anchors.json

Every chunk should have at least one anchor. An anchor must be traceable back to:

  • source uri / raw object uri
  • doc version
  • source fingerprint
  • page number
  • bbox or missing reason
  • section path
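
The "at least one anchor per chunk" rule is easy to verify mechanically. A minimal sketch, assuming each anchor record refers to its chunk through a `chunk_id` key (an assumption; the anchor schema is not shown in this lab):

```python
from typing import Any


def chunks_without_anchor(chunks: list[dict[str, Any]], anchors: list[dict[str, Any]]) -> list[str]:
    """Return the chunk_ids that no anchor points to, in chunk order."""
    # ASSUMPTION: anchors reference chunks via a "chunk_id" key.
    anchored_ids = {anchor.get("chunk_id") for anchor in anchors}
    return [chunk["chunk_id"] for chunk in chunks if chunk["chunk_id"] not in anchored_ids]


if __name__ == "__main__":
    demo_chunks = [{"chunk_id": "c1"}, {"chunk_id": "c2"}]
    demo_anchors = [{"chunk_id": "c1", "page_no": 3}]
    print(chunks_without_anchor(demo_chunks, demo_anchors))
```

Any id in the returned list is a chunk that Week8 could not cite back to its source.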

7. Generate the quality report and the Week8 ready gate

python -m pipelines.parse_normalize.quality \
  --input-dir artifacts/week07 \
  --sample-size 50 \
  --out reports/week07/chunk_quality_report.md \
  --gate-out reports/week07/week8_ready_gate.json

The report contains at least:

  • metadata completeness
  • page reference rate
  • bbox coverage
  • anchor coverage
  • empty chunk rate
  • orphan chunk rate
  • overlap duplicate rate
  • PII leakage risk
  • sample_shortfall
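
The exact metric definitions belong to the `quality` module and are not given here. As an illustration only, three of them could be computed as follows. The sketch assumes an empty chunk is one whose `content` is blank, an orphan chunk is one whose `section_id` matches no section, and `sample_shortfall` is how far the available chunk count falls short of the requested sample size. All three definitions are assumptions.

```python
from typing import Any


def quality_metrics(
    sections: list[dict[str, Any]],
    chunks: list[dict[str, Any]],
    sample_size: int = 50,
) -> dict[str, Any]:
    """Compute a few illustrative chunk-quality metrics (assumed definitions)."""
    total = len(chunks)
    section_ids = {section["section_id"] for section in sections}
    # Empty: content missing, None, or whitespace only.
    empty = sum(1 for chunk in chunks if not (chunk.get("content") or "").strip())
    # Orphan: the chunk's section_id does not resolve to any known section.
    orphan = sum(1 for chunk in chunks if chunk.get("section_id") not in section_ids)

    def rate(count: int) -> float:
        return count / total if total else 0.0

    return {
        "empty_chunk_rate": rate(empty),
        "orphan_chunk_rate": rate(orphan),
        "sample_shortfall": max(0, sample_size - total),
    }


if __name__ == "__main__":
    demo_sections = [{"section_id": "s1"}]
    demo_chunks = [
        {"chunk_id": "c1", "section_id": "s1", "content": "Reset your password from Settings."},
        {"chunk_id": "c2", "section_id": "s9", "content": "   "},
    ]
    print(quality_metrics(demo_sections, demo_chunks, sample_size=50))
```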

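8. Common errors and troubleshooting

| Problem | Likely cause | Action |
| --- | --- | --- |
| s3:// raw object not found | MinIO is not running, or the bucket/key does not exist | Check doc_ingest_input_report.json and the MinIO bucket first |
| Fingerprint mismatch | The raw bytes read are not the file that was ingested | Hard fail or quarantine; do not continue |
| Docling import fails | The parse extra is not installed in devbox | Install the optional extra or use the fallback |
| PDF has no page_no | The parser output is missing key metadata | Hard fail; must not enter Week8 |
| HTML has no bbox | Expected; bbox is not applicable | Record non_paged_bbox_not_applicable_count in the report |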
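
The fingerprint-mismatch case in the troubleshooting table is the one most worth failing loudly on. A minimal sketch of that check, assuming `source_fingerprint` is the lowercase hex SHA-256 of the raw bytes (the actual fingerprint algorithm is not stated in this lab, so treat that as an assumption):

```python
import hashlib


def sha256_hex(raw: bytes) -> str:
    """Hex SHA-256 digest of the raw document bytes."""
    return hashlib.sha256(raw).hexdigest()


def verify_fingerprint(raw: bytes, expected: str) -> None:
    """Raise ValueError when the bytes do not match the ingest-time fingerprint."""
    # ASSUMPTION: the fingerprint recorded at ingest time is a hex SHA-256 digest.
    actual = sha256_hex(raw)
    if actual != expected.strip().lower():
        raise ValueError(f"fingerprint mismatch: expected {expected}, got {actual}")


if __name__ == "__main__":
    original = b"help center article v1"
    recorded = sha256_hex(original)
    verify_fingerprint(original, recorded)  # passes silently
    try:
        verify_fingerprint(b"help center article v2", recorded)
    except ValueError as error:
        print("hard fail:", error)
```

Raising instead of returning a boolean keeps the caller from accidentally ignoring a mismatch and parsing the wrong bytes.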

Submission checklist

  • reports/week07/doc_ingest_input_report.json
  • artifacts/week07/sections.json
  • artifacts/week07/chunks.json
  • artifacts/week07/evidence_anchors.json
  • reports/week07/chunk_quality_report.md
  • reports/week07/week8_ready_gate.json
  • reports/week07/week07_lab_execution_summary.md

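The most important takeaway

This lab is not about producing the largest number of chunks. It is about proving:

Week8 may only consume chunks that carry an evidence anchor and have passed the quality gate.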
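
On the consuming side, that rule amounts to a fail-closed filter. The sketch below is hypothetical: it assumes the gate file exposes a boolean `passed` key and that anchors reference chunks through `chunk_id`. Neither is confirmed by this lab, so check the real `week8_ready_gate.json` before relying on it.

```python
from typing import Any


def indexable_chunks(
    chunks: list[dict[str, Any]],
    anchors: list[dict[str, Any]],
    gate: dict[str, Any],
) -> list[dict[str, Any]]:
    """Return the chunks Week8 may index; nothing at all if the gate did not pass."""
    # ASSUMPTION: the gate JSON has a boolean "passed" key. A missing key fails closed.
    if gate.get("passed") is not True:
        return []
    anchored_ids = {anchor.get("chunk_id") for anchor in anchors}
    return [chunk for chunk in chunks if chunk.get("chunk_id") in anchored_ids]


if __name__ == "__main__":
    demo_chunks = [{"chunk_id": "c1"}, {"chunk_id": "c2"}]
    demo_anchors = [{"chunk_id": "c1"}]
    print(indexable_chunks(demo_chunks, demo_anchors, {"passed": True}))
    print(indexable_chunks(demo_chunks, demo_anchors, {}))
```

Treating a missing or non-boolean `passed` value as a failure means a malformed gate file blocks indexing instead of letting unvetted chunks through.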