flowchart LR
INGEST["Week3 doc_ingest"] --> RAW["raw_doc_asset<br/>MinIO raw object"]
RAW --> PARSE["Docling-first parse"]
PARSE --> SECTION["normalized sections"]
SECTION --> CHUNK["section-aware chunks"]
CHUNK --> ANCHOR["evidence anchors"]
ANCHOR --> QUALITY["chunk quality report"]
QUALITY --> GATE["Week8 ready gate"]
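Week 7 | Lab | Upgrade Week3 document assets into Week8-ready document assets
Run the minimal closed loop: parse → section → chunk → anchor → quality → gate
This lab is not a standalone PDF demo.
Starting from the Week3 raw document asset, you will upgrade it into document asset v1, which Week8 can safely index and cite.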
What this lab covers
- Confirm that the Week3 document input exists
- Read the raw document from MinIO, or from the source fallback
- Run the Docling-first parse
- Generate normalized sections
- Run section-aware chunking
- Generate evidence anchors
- Output chunk_quality_report.md
- Output week8_ready_gate.json
Reference lab time
90–120 minutes
Lab outputs
- artifacts/week07/parsed_doc.json
- artifacts/week07/sections.json
- artifacts/week07/chunks.json
- artifacts/week07/evidence_anchors.json
- reports/week07/doc_ingest_input_report.json
- reports/week07/chunk_quality_report.md
- reports/week07/week8_ready_gate.json
- reports/week07/week07_lab_execution_summary.md
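First, look at the closed-loop diagram (the flowchart at the top of this page)
0. Environment setup
This lab uses the Docker-first path by default: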
docker compose --env-file infra/env/.env.local -f infra/docker-compose.yml up -d --build
docker compose --env-file infra/env/.env.local -f infra/docker-compose.yml run --rm devbox bash
Run all subsequent commands from inside devbox.
1. Confirm the Week3 document input
python -m pipelines.ingestion.doc_ingest \
--manifest data/seed_manifests/manifest_workspace_helpcenter_v1.json \
--source-dir data/canonization/documents \
--batch-id week07-lab-001 \
--report-json reports/week07/doc_ingest_input_report.json
Confirm that the report contains at least:
- raw_object_path
- source_fingerprint
- doc_id or source_id
- ingest status
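The checklist above can be turned into a small script instead of reading the report by eye. The following is a minimal sketch, not the course's own tooling: the field names `raw_object_path`, `source_fingerprint`, `doc_id`, and `source_id` come from the checklist, while the top-level `documents` list and the `status` key are assumptions about the report layout that you should adapt to the real file.

```python
import json
from pathlib import Path

# Field names taken from the checklist; the report layout itself is assumed.
REQUIRED = ("raw_object_path", "source_fingerprint")
ID_FIELDS = ("doc_id", "source_id")


def missing_ingest_fields(record: dict) -> list[str]:
    """Return the checklist items that are absent or empty in one report record."""
    missing = [name for name in REQUIRED if not record.get(name)]
    if not any(record.get(name) for name in ID_FIELDS):
        missing.append("doc_id or source_id")
    if not record.get("status"):  # hypothetical key holding the ingest status
        missing.append("status")
    return missing


def load_records(report_path: str) -> list[dict]:
    """Load per-document records, assuming a top-level "documents" list (hypothetical)."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return report.get("documents", [])
```

A typical use would be to call `load_records("reports/week07/doc_ingest_input_report.json")` and stop the lab if `missing_ingest_fields` returns anything for any record.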
2. Run the Week7 parse plan
python -m pipelines.parse_normalize.run_parse \
--input-source postgres \
--limit 50 \
--parse-strategy docling_v1_no_ocr \
--chunk-strategy section_aware_v1 \
--out-dir artifacts/week07 \
--report-dir reports/week07 \
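--plan-only
The plan stage only confirms inputs and configuration; it does not write any output. Check in particular:
- Whether s3://omni-raw-documents/... is used first
- Whether ocr=false is the default
- Whether embedding / pgvector writes are off by default
- Whether the strategy version is recorded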
3. Docling-first parse
python -m pipelines.parse_normalize.run_parse \
--input-source postgres \
--limit 50 \
--parse-strategy docling_v1_no_ocr \
--chunk-strategy section_aware_v1 \
--out-dir artifacts/week07 \
--report-dir reports/week07
If the Docling dependency is not installed, the CLI should print a clear installation hint instead of silently generating fake data.
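The fail-fast behavior described above can be sketched as a guard that runs before any parsing starts. This is an illustration under assumptions, not the actual `run_parse` implementation: `require_module` is a hypothetical helper, and the hint text is a placeholder because the source does not name the optional extra.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Fail fast with an actionable message when an optional dependency is missing.

    Pass a top-level module name; find_spec returns None when it is not installed.
    """
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"Required module '{module_name}' is not installed. {install_hint}"
        )
```

Calling `require_module("docling", "Install the optional parse extra in devbox.")` at startup means a missing parser produces an error, never an empty or fabricated artifact.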
4. Inspect sections
jq '.sections[0]' artifacts/week07/sections.json
You must be able to see fields like:
- section_id
- doc_id
- section_path
- page_no
- bbox or bbox_missing_reason
- source_fingerprint
- doc_version
- parse_strategy_version
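The field list above, including the "bbox or bbox_missing_reason" rule, can be checked for every section rather than only the first one. A minimal sketch, assuming each section is a flat JSON object; `section_problems` is a hypothetical helper, and it only checks that keys are present, not value-level rules such as PDFs requiring a non-null `page_no`.

```python
# Keys listed in the lab instructions for one section record.
SECTION_FIELDS = (
    "section_id", "doc_id", "section_path", "page_no",
    "source_fingerprint", "doc_version", "parse_strategy_version",
)


def section_problems(section: dict) -> list[str]:
    """List missing keys; a section needs either a bbox or a stated reason for its absence."""
    problems = [f"missing {name}" for name in SECTION_FIELDS if name not in section]
    if section.get("bbox") is None and not section.get("bbox_missing_reason"):
        problems.append("missing bbox or bbox_missing_reason")
    return problems
```

Requiring an explicit reason when `bbox` is absent keeps "not applicable" (for example, HTML) distinguishable from "the parser lost it".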
5. Inspect chunks
jq '.chunks[0]' artifacts/week07/chunks.json
You must be able to see:
- chunk_id
- section_id
- chunk_index
- content
- chunk_strategy_version
- overlap_prev
- overlap_next
- quality_flags
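Beyond field presence, it is worth confirming that each chunk's `section_id` points at a section that actually exists in sections.json. The sketch below assumes flat JSON objects; `chunk_problems` is a hypothetical helper, and treating an unresolved `section_id` as a problem is an interpretation of "section-aware", not something the lab text states.

```python
# Keys listed in the lab instructions for one chunk record.
CHUNK_FIELDS = (
    "chunk_id", "section_id", "chunk_index", "content",
    "chunk_strategy_version", "overlap_prev", "overlap_next", "quality_flags",
)


def chunk_problems(chunk: dict, known_section_ids: set[str]) -> list[str]:
    """List missing keys and flag a section_id that does not resolve to a known section."""
    problems = [f"missing {name}" for name in CHUNK_FIELDS if name not in chunk]
    section_id = chunk.get("section_id")
    if section_id is not None and section_id not in known_section_ids:
        problems.append(f"unknown section_id {section_id}")
    return problems
```

Build `known_section_ids` once from sections.json, then run the check over every entry in chunks.json.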
6. Inspect evidence anchors
jq '.anchors[0]' artifacts/week07/evidence_anchors.json
Every chunk should have at least one anchor. An anchor must trace back to:
- source uri / raw object uri
- doc version
- source fingerprint
- page number
- bbox or a missing reason
- section path
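The "at least one anchor per chunk" rule can be verified by joining the two artifacts. This is a sketch under one important assumption: that each anchor carries a `chunk_id` key linking it to its chunk. The lab text does not name the anchor fields, so adjust the key to whatever evidence_anchors.json really uses.

```python
def chunks_without_anchor(chunks: list[dict], anchors: list[dict]) -> list[str]:
    """Return the chunk_ids that no anchor refers to.

    Assumes every anchor has a "chunk_id" key (hypothetical field name).
    """
    anchored_ids = {anchor.get("chunk_id") for anchor in anchors}
    return [chunk["chunk_id"] for chunk in chunks if chunk["chunk_id"] not in anchored_ids]
```

An empty result means every chunk can be cited; any returned id is a chunk that should not reach Week8.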
7. Generate the quality report and the Week8 ready gate
python -m pipelines.parse_normalize.quality \
--input-dir artifacts/week07 \
--sample-size 50 \
--out reports/week07/chunk_quality_report.md \
--gate-out reports/week07/week8_ready_gate.json
The report must contain at least:
- metadata completeness
- page reference rate
- bbox coverage
- anchor coverage
- empty chunk rate
- orphan chunk rate
- overlap duplicate rate
- PII leakage risk
- sample_shortfall
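To make a few of these metrics concrete, here is a rough sketch of how they could be computed over a sample. It is not the `pipelines.parse_normalize.quality` implementation. The definitions are assumptions: `sample_shortfall` is read as "requested sample size minus chunks actually available", an empty chunk is one whose `content` is blank, and anchors are assumed to carry a `chunk_id` key.

```python
def quality_metrics(chunks: list[dict], anchors: list[dict], sample_size: int) -> dict:
    """Compute a small subset of the report metrics over the first sample_size chunks."""
    sample = chunks[:sample_size]
    total = len(sample)
    anchored_ids = {anchor.get("chunk_id") for anchor in anchors}
    empty = sum(1 for chunk in sample if not str(chunk.get("content", "")).strip())
    covered = sum(1 for chunk in sample if chunk.get("chunk_id") in anchored_ids)
    return {
        "sampled": total,
        # Assumed meaning: how many fewer chunks were available than requested.
        "sample_shortfall": max(sample_size - total, 0),
        "empty_chunk_rate": empty / total if total else 0.0,
        "anchor_coverage": covered / total if total else 0.0,
    }
```

Reporting the shortfall next to the rates matters because a rate computed over 4 chunks is much weaker evidence than the same rate over 50.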
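8. Common errors and troubleshooting
| Problem | Likely cause | Action |
|---|---|---|
| s3:// raw object not found | MinIO is not running, or the bucket/key does not exist | Check doc_ingest_input_report.json and the MinIO bucket first |
| Fingerprint mismatch | The raw bytes read are not the file that was ingested | Hard fail or quarantine; do not continue |
| Docling import fails | The parse extra is not installed in devbox | Install the optional extra or use the fallback |
| PDF has no page_no | Parser output is missing key metadata | Hard fail; the document must not enter Week8 |
| HTML has no bbox | Expected; bbox is not applicable | Record non_paged_bbox_not_applicable_count in the report |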
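The fingerprint check behind the mismatch row can be sketched as follows. The hash algorithm is an assumption: the lab does not say how `source_fingerprint` is computed, so SHA-256 over the raw bytes is used here purely for illustration, and `fingerprint_matches` is a hypothetical helper.

```python
import hashlib


def fingerprint_matches(raw_bytes: bytes, expected_fingerprint: str) -> bool:
    """Compare the SHA-256 hex digest of the raw bytes with the recorded fingerprint.

    SHA-256 is an assumed algorithm; use whatever the ingest step actually recorded.
    """
    actual = hashlib.sha256(raw_bytes).hexdigest()
    return actual == expected_fingerprint.lower()
```

When this returns `False`, the bytes being parsed are not the bytes that were ingested, so the right response is to stop or quarantine rather than to produce chunks.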
Submission checklist
- reports/week07/doc_ingest_input_report.json
- artifacts/week07/sections.json
- artifacts/week07/chunks.json
- artifacts/week07/evidence_anchors.json
- reports/week07/chunk_quality_report.md
- reports/week07/week8_ready_gate.json
- reports/week07/week07_lab_execution_summary.md
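The most important takeaway
The goal of this lab is not to generate as many chunks as possible, but to prove one thing:
Week8 may only consume chunks that carry an evidence anchor and have passed the quality gate.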
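That rule can be expressed as a filter on the consumer side. This is a minimal sketch under stated assumptions: the gate file is assumed to expose a boolean `ready` key and anchors are assumed to carry a `chunk_id` key. Neither name is confirmed by the lab text, so treat both as placeholders.

```python
def week8_consumable_chunks(chunks: list[dict], anchors: list[dict], gate: dict) -> list[dict]:
    """Return only anchored chunks, and nothing at all unless the gate passed.

    "ready" (gate) and "chunk_id" (anchor) are hypothetical field names.
    """
    if not gate.get("ready", False):
        return []
    anchored_ids = {anchor.get("chunk_id") for anchor in anchors}
    return [chunk for chunk in chunks if chunk.get("chunk_id") in anchored_ids]
```

Defaulting to an empty result when the gate key is missing keeps the failure mode safe: an absent or malformed gate file blocks indexing instead of allowing it.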