aucourt-ingest/todo.md
slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline
Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-30 11:56:23 +10:00

# AuCourtIngest — TODO
## Blocked / Needs Decisions
- [ ] Pick email for User-Agent header and AustLII contact
- [ ] Decide: Neo4j Docker for dev, or NetworkX for unit tests + Neo4j for integration?
- [ ] Curate exoneration source data (RMIT register, Innocence Project) → `data/exoneration_sources.json`
- [ ] Email AustLII partnership inquiry (long lead time, do early)
- [ ] Inspect High Court `/cases-and-judgments` page structure (spec URL was wrong)
## Stage 1: Skeleton + Data Models ✅
- [x] Package structure: `aucourt_ingest/__init__.py`
- [x] `models.py` — RawDocument, CaseMeta, Chunk, JurorPersona, JurorContext, FetchQueueItem, SourceState
- [x] `config.py` — TOML loader (stdlib tomllib), typed config objects
- [x] `main.py` — argparse with bootstrap/watch/backfill/audit subcommands
- [x] `config.toml` — 5 source defs, rate limits, storage paths, API key placeholders
- [x] `pyproject.toml` — dependencies, entry points
- [x] `data/docs/` and `data/raw/` directories
- [x] Verify: `python -m aucourt_ingest --help` runs, all subcommands dispatch, all models import
## Stage 2: MetaDB + Utils ✅
- [x] `storage/meta_db.py` — async SQLite (aiosqlite), documents/fetch_queue/source_state CRUD, status transitions
- [x] `utils/mnc_parser.py` — MNC regex parser + court→jurisdiction mapping
- [x] `utils/rate_limiter.py` — async token bucket per source + registry
- [x] `utils/retry.py` — exponential backoff decorator with jitter
- [x] Tests: 27 passed (MNC parser 12, MetaDB 11, rate limiter 4)
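The MNC parser and court→jurisdiction mapping listed above can be pictured roughly as follows. This is a minimal sketch, not the contents of `utils/mnc_parser.py`: the names `parse_mnc`, `MNC`, `MNC_RE`, and `COURT_JURISDICTION` are hypothetical, and the mapping shows only a small illustrative subset of courts.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical subset of a court -> jurisdiction table (assumption for illustration).
COURT_JURISDICTION = {
    "HCA": "cth",
    "FCA": "cth",
    "NSWCCA": "nsw",
    "NSWSC": "nsw",
    "QCA": "qld",
    "VSCA": "vic",
}

# Medium neutral citation, e.g. "[2023] NSWCCA 45": [year] COURT number.
MNC_RE = re.compile(r"\[(?P<year>\d{4})\]\s+(?P<court>[A-Z][A-Za-z]*)\s+(?P<number>\d+)")


class MNC(NamedTuple):
    year: int
    court: str
    number: int
    jurisdiction: Optional[str]  # None when the court is not in the table


def parse_mnc(text: str) -> Optional[MNC]:
    """Return the first MNC found in text, or None if there is none."""
    match = MNC_RE.search(text)
    if match is None:
        return None
    court = match.group("court")
    return MNC(
        year=int(match.group("year")),
        court=court,
        number=int(match.group("number")),
        jurisdiction=COURT_JURISDICTION.get(court),
    )
```

Returning `None` for the jurisdiction of an unknown court, rather than raising, lets a caller store the citation and flag the document for later review.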
## Stage 3: Source Layer ✅
- [x] `sources/base.py` — BaseSource ABC (discover, fetch, rate limit, MetaDB integration)
- [x] `storage/doc_store.py` — save/load raw downloads to `data/raw/{source}/{year}/{mnc}.html`
- [x] `sources/nsw_caselaw.py` — JSON API at `/browse/list?filter=criminal`, 200/page, 630 pages, 125K total
- [x] Wire: `bootstrap --source nsw_caselaw --limit 3` fetched 3 real cases with MNCs, stored on disk + MetaDB
- [x] `sources/fedcourt.py` — RSS poll at `/rss/fca-judgments`, DOCX link discovery, bootstrap function
- [x] `sources/highcourt.py` — crawl from `/cases-and-judgments`, keyword filter, bootstrap function
- [x] `sources/qld_judgments.py` — search endpoint scraping, bootstrap stub (needs empirical testing)
- [x] `sources/auslaw_mcp.py` — stub with NotImplementedError, needs external MCP server
## Stage 4: Processing Pipeline ✅
- [x] `processing/doc_parser.py` — HTML (strip nav/header/footer/BS4), DOCX (python-docx headings), PDF (pdfminer.six). MNC extraction before stripping.
- [x] `processing/meta_extractor.py` — Claude Haiku via injectable LLMClient protocol. JSON parse + retry with fix prompt. Handles markdown-wrapped responses.
- [x] `processing/outcome_parser.py` — regex-based verdict detection (14 patterns). Priority: HELD → appeal → quash → acquittal → not_guilty → guilty → hung. Exoneration external source lookup. Appeal cross-reference.
- [x] `processing/chunk_engine.py` — LLM boundary detection via injectable client. 7 chunk types with token budgets. Oversized sections auto-split. `chunk_with_sections()` for sync testing.
- [x] `processing/embed_engine.py` — OpenAI text-embedding-3-small via injectable EmbeddingClient protocol. Batch processing with configurable size.
- [x] `storage/vector_index.py` — ChromaDB wrapper: store_chunks, query with chunk_type/doc_id filters, cosine similarity.
- [x] Tests: 39 passed (doc_parser 8, meta_extractor 6, outcome_parser 18, chunk_engine 7)
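The priority order noted for `outcome_parser.py` (HELD → appeal → quash → acquittal → not_guilty → guilty → hung) matters because a lower-priority pattern can match inside a higher-priority phrase, for example `guilty` inside `not guilty`. A minimal sketch of priority-ordered detection, assuming one illustrative pattern per label (these regexes and the name `detect_outcome` are hypothetical, not the 14 patterns in the pipeline):

```python
import re
from typing import Optional

# Checked top to bottom; the first matching label wins.
# Patterns are illustrative assumptions only.
OUTCOME_PATTERNS = [
    ("held", re.compile(r"\bHELD\b")),  # case-sensitive headnote marker
    ("appeal", re.compile(r"\bappeal (?:allowed|dismissed)\b", re.IGNORECASE)),
    ("quash", re.compile(r"\bquash(?:ed|es)?\b", re.IGNORECASE)),
    ("acquittal", re.compile(r"\bacquitt(?:al|ed)\b", re.IGNORECASE)),
    ("not_guilty", re.compile(r"\bnot guilty\b", re.IGNORECASE)),
    ("guilty", re.compile(r"\bguilty\b", re.IGNORECASE)),
    ("hung", re.compile(r"\bhung jury\b", re.IGNORECASE)),
]


def detect_outcome(text: str) -> Optional[str]:
    """Return the highest-priority outcome label whose pattern matches."""
    for label, pattern in OUTCOME_PATTERNS:
        if pattern.search(text):
            return label
    return None
```

Because `not_guilty` is checked before `guilty`, a sentence such as "the jury found the accused not guilty" is not misclassified as a conviction.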
## Stage 5: Graph Builder ✅
- [x] `storage/graph_db.py` — GraphDB Protocol + GraphNode/GraphRelationship + _node_id/_rel_id helpers
- [x] `storage/in_memory_graph_db.py` — NetworkX-backed InMemoryGraphDB for testing
- [x] `storage/neo4j_graph_db.py` — Neo4j async driver wrapper for production
- [x] `processing/graph_builder.py` — build_case, build_chunks, build_similarity_edges, build_full
- [x] Tests: 40 passed (cosine similarity 7, helpers 4, InMemoryGraphDB 8, build_case 7, build_chunks 4, similarity 7, build_full 2, protocol 1)
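As a rough illustration of what `build_similarity_edges` has to compute, the sketch below derives pairwise cosine similarity over chunk embeddings and keeps pairs above a threshold. It is written in plain Python under stated assumptions: the function names, the `threshold` default of 0.8, and the `(id_a, id_b, score)` edge shape are hypothetical and not taken from `graph_builder.py`.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(
    embeddings: dict[str, list[float]], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every unordered pair scoring >= threshold."""
    edges = []
    # Sorting by id makes the output order deterministic.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges
```

The all-pairs loop is quadratic in the number of chunks, so a real corpus would more plausibly ask the vector index for nearest neighbours; the sketch only shows the edge criterion.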
## Stage 6: Jury Interface ✅
- [x] `jury/personas.py` — 6 persona definitions (nurse, accountant, skeptic, ex_cop, empath, foreman) + traversal config
- [x] `jury/subgraph_query.py` — SubgraphQuery with get_juror_context(), RULED_INADMISSIBLE exclusion at traversal level, token budget, anchor spec parser
- [x] Tests: 27 passed (persona definitions 13, anchor parsing 5, subgraph query 9: inadmissibility wall, token budget, chunk type filter, sequence ordering, all personas valid, foreman traversal, global exclusion)
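Excluding `RULED_INADMISSIBLE` material at traversal level, rather than filtering results afterwards, means the walk never steps through an inadmissible chunk, so anything reachable only via that chunk stays out of the juror's context as well. The sketch below shows the idea with a breadth-first walk over plain dicts and a token budget. All names (`collect_context`, `adjacency`, `chunks`, `inadmissible`) are hypothetical, and the real `SubgraphQuery` may differ in how it treats chunks that exceed the budget.

```python
from collections import deque


def collect_context(
    adjacency: dict[str, list[str]],
    chunks: dict[str, dict],
    inadmissible: set[str],
    anchor: str,
    token_budget: int,
) -> list[str]:
    """Breadth-first walk from anchor, returning chunk ids in visit order.

    Inadmissible chunks are neither collected nor expanded. A chunk that
    would push the total over token_budget is skipped and not expanded.
    """
    seen = {anchor}
    queue = deque([anchor])
    selected: list[str] = []
    used = 0
    while queue:
        node = queue.popleft()
        if node in inadmissible:
            continue  # the wall: do not collect, do not traverse through
        cost = chunks[node]["tokens"]
        if used + cost > token_budget:
            continue  # over budget: skip this chunk and its neighbours
        selected.append(node)
        used += cost
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return selected
```

With `c4` reachable only through an inadmissible `c2`, a walk from `c1` returns neither of them, which is the behaviour the "inadmissibility wall" test name suggests.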
## Stage 7: Orchestrator + Watch Mode ✅
- [x] `orchestrator.py` — main loop, queue dispatch, stage transitions (fetched→parsed→embedded→graphed), error recovery, ProcessingPipeline protocol
- [x] `utils/telegram.py` — alert sender with templates (source_degraded, daily_summary, milestone, fetch_error), NoOpAlert fallback
- [x] Wire: watch mode (polling loop with SIGINT/SIGTERM handling), backfill mode (year range), audit mode (re-process by status), process mode (drain queue)
- [x] `main.py` — all 5 modes fully implemented: bootstrap, watch, backfill, audit, process
- [x] Tests: 27 passed (TelegramAlert 10, NoOpAlert 3, orchestrator 14: empty queue, single/multi doc, limit, missing doc, pipeline error, status transitions, graph built, vector index, single doc, doc error, meta stored, no pipeline, queue status)
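The stage transitions (fetched→parsed→embedded→graphed) with error recovery can be sketched as a small linear state machine. This is an assumption-laden illustration, not the code in `orchestrator.py`: `STAGE_ORDER`, `next_status`, `advance`, the dict-shaped document, and the `handlers` mapping are all hypothetical.

```python
from typing import Callable, Optional

# Linear status progression taken from the stage list above.
STAGE_ORDER = ("fetched", "parsed", "embedded", "graphed")


def next_status(current: str) -> Optional[str]:
    """Return the status that follows current, or None at the final stage."""
    idx = STAGE_ORDER.index(current)  # raises ValueError for an unknown status
    return STAGE_ORDER[idx + 1] if idx + 1 < len(STAGE_ORDER) else None


def advance(doc: dict, handlers: dict[str, Callable[[dict], None]]) -> dict:
    """Run one handler per stage until the doc is graphed or a handler fails.

    On failure the doc keeps its last successful status and records the
    error, so a later run can resume from that point.
    """
    while (target := next_status(doc["status"])) is not None:
        try:
            handlers[target](doc)
        except Exception as exc:
            doc["error"] = f"{target}: {exc}"
            break
        doc["status"] = target
    return doc
```

Updating the status only after a handler returns means a crash mid-stage leaves the document at its previous status, so re-processing by status (as in audit mode) can pick it up again.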
## Stage 8: Integration Tests ✅
- [x] `processing/pipeline.py` — FullPipeline wiring DocParser → MetaExtractor → OutcomeParser → ChunkEngine → EmbedEngine
- [x] Mock LLM client with deterministic meta extraction and chunk boundary detection
- [x] Mock embedding client with normalized deterministic vectors
- [x] 10 integration tests: parse, extract_meta, chunk, embed, 3-doc e2e to graph, meta stored, vector index, status transitions, graph queryable
- [x] End-to-end: 3 documents → FullPipeline → Orchestrator → InMemoryGraphDB with node dedup and edge traversal
- [x] Tests: 170 passed (pipeline 10, orchestrator 27, jury 27, graph 40, meta_extractor 6, chunk_engine 7, doc_parser 8, outcome_parser 18, mnc_parser 12, meta_db 11, rate_limiter 4)
---
## Spec Corrections Found
- High Court URL: `/cases/recent-judgments` → 404. Correct: `/cases-and-judgments`
- Architecture diagram lists "VICCatalogue" as a source but no source definition exists. VIC is AusLaw MCP targeted search only.
- Spec says Python 3.12 — Rog has 3.13. Using 3.13.
- AustLII direct access returns 403; it is only accessible via AusLaw MCP.