aucourt-ingest/plan.md
slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline
Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-30 11:56:23 +10:00

# AuCourtIngest — Implementation Plan
## Audit Notes (from research)
- **High Court URL is wrong in spec**: `/cases/recent-judgments` → 404. Real page is `/cases-and-judgments`. The crawl strategy will need to be discovered empirically — the page structure is unknown.
- **AustLII blocks direct access** (403). AusLaw MCP is the only viable path to AustLII content. Remove any "AustLII direct" assumptions.
- **QLD Judgments**: no documented API. The search endpoint at `/caselaw-search/query?queryStringSearchText={term}` exists but behaviour is undocumented. Treat as HTML scraping, not API calls.
- **AusLaw MCP**: repo exists and is active. Tools confirmed: search, legislation, OCR. But it's an MCP server — needs to be running as a separate process. Integration pattern differs from direct HTTP sources.
- **Python version**: spec says 3.12, Rog has 3.13. Use 3.13.
- **No VIC source defined**: spec architecture diagram mentions "VICCatalogue" but section 3 only has 5 sources (FedCourt, HighCourt, NSW, QLD, AusLaw MCP). VIC is only mentioned as a Phase 2 backfill item via AusLaw MCP targeted search. Not a direct source adapter.
---
## Build Order
The spec's 6-phase structure is fine conceptually but has dependency issues. The plan below fixes ordering and adds missing work the spec assumes but doesn't call out.
### Stage 1: Skeleton + Data Models (no external deps)
Goal: runnable `python -m aucourt_ingest --help` that does nothing yet.
- [ ] Create `aucourt_ingest/__init__.py` package
- [ ] Create `models.py` — all dataclasses: `RawDocument`, `CaseMeta`, `Chunk`, `JurorPersona`, `JurorContext`, `FetchQueueItem`, `SourceState`
- [ ] Create `config.py` — load `config.toml`, expose typed config objects (source configs, rate limits, API keys, paths)
- [ ] Create `main.py` — argparse entry point with `bootstrap`, `watch`, `backfill`, `audit` subcommands
- [ ] Create `config.toml` — all source definitions, rate limits, storage paths, API key placeholders
- [ ] Create `pyproject.toml` — dependencies, scripts entry point
- [ ] Create `data/` directories (docs/, raw/)
- [ ] Verify: `python -m aucourt_ingest --help` runs, all imports work
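
A minimal sketch of the Stage 1 entry point, assuming plain `argparse` with one subparser per mode. The flag names are taken from the `bootstrap` and `backfill` invocations that appear later in this plan; the `dest` names (`from_year`, `to_year`, `doc_format`) and the `build_parser` / `main` function names are illustrative assumptions, not part of the spec.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI with the four subcommands named in the plan."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    sub = parser.add_subparsers(dest="command", required=True)

    bootstrap = sub.add_parser("bootstrap", help="initial fetch from a source")
    bootstrap.add_argument("--source")
    bootstrap.add_argument("--limit", type=int)

    sub.add_parser("watch", help="continuous polling")

    backfill = sub.add_parser("backfill", help="fetch a historical year range")
    backfill.add_argument("--source")
    # "from" is a Python keyword, so store it under a different attribute name.
    backfill.add_argument("--from", dest="from_year", type=int)
    backfill.add_argument("--to", dest="to_year", type=int)
    backfill.add_argument("--format", dest="doc_format")

    sub.add_parser("audit", help="re-process stored documents")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments and report the chosen mode; no real work yet."""
    args = build_parser().parse_args(argv)
    print(f"{args.command}: not implemented yet")
    return 0
```

Keeping `build_parser()` separate from `main()` lets the tests check argument handling without running any pipeline code.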
### Stage 2: MetaDB + Utils (no external deps except aiosqlite)
Goal: SQLite is initialised, can track documents through pipeline states.
- [ ] Create `storage/meta_db.py` — init schema (documents, fetch_queue, source_state tables), CRUD methods
- [ ] Create `utils/mnc_parser.py` — MNC regex parser + validator
- [ ] Create `utils/rate_limiter.py` — async token bucket per source_id
- [ ] Create `utils/retry.py` — configurable exponential backoff decorator
- [ ] Tests: `test_mnc_parser.py` — parse valid/invalid MNCs, edge cases (missing brackets, non-standard courts)
- [ ] Tests: `test_meta_db.py` — init, insert doc, update status transitions, queue operations, source state
- [ ] Tests: `test_rate_limiter.py` — basic burst behavior, per-source isolation
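
A minimal sketch of what `utils/mnc_parser.py` could look like, assuming a medium neutral citation has the shape `[year] COURT number` (for example `[2023] NSWCCA 123`). The `MNC` dataclass and `parse_mnc` name are hypothetical, and the regex accepts any court code that starts with a capital letter rather than validating against a list of known courts.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape: "[2023] NSWCCA 123". Court codes may mix case and digits.
MNC_RE = re.compile(
    r"^\[(?P<year>\d{4})\]\s+(?P<court>[A-Z][A-Za-z0-9]*)\s+(?P<number>\d+)$"
)


@dataclass(frozen=True)
class MNC:
    year: int
    court: str
    number: int

    def __str__(self) -> str:
        # Canonical form with single spaces.
        return f"[{self.year}] {self.court} {self.number}"


def parse_mnc(text: str) -> Optional[MNC]:
    """Return a parsed MNC, or None if the text is not a well-formed citation."""
    match = MNC_RE.match(text.strip())
    if match is None:
        return None
    return MNC(int(match["year"]), match["court"], int(match["number"]))
```

Returning `None` instead of raising keeps the caller in charge of what to do with rows that carry a missing or malformed citation.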
### Stage 3: Source Layer (one source first, then extend)
Goal: Fetch a real document from one court and store it as raw HTML.
**Start with NSW Caselaw** — it has the clearest pagination, HTML-only (no DOCX parsing needed), and the `?filter=criminal` endpoint works.
- [ ] Create `sources/base.py` - `BaseSource` ABC: `discover()`, `fetch(url)`, `build_queue()`. Enforce rate limit. Log to MetaDB. Handle retry per exception type.
- [ ] Create `sources/nsw_caselaw.py` — browse pagination, extract MNC + URL per row, fetch HTML
- [ ] Create `storage/doc_store.py` — save raw downloads to `data/raw/{source_id}/{year}/{doc_id}.html`
- [ ] Wire: `main.py bootstrap --source nsw_caselaw --limit 3` fetches 3 real criminal cases
- [ ] Verify: raw HTML files on disk, MetaDB shows `fetch_status=fetched`
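
The rate limit that `BaseSource` enforces could be built on the per-source token bucket from Stage 2. The sketch below is one possible shape, not the spec's design: the class names, the `rates` mapping (tokens per second per `source_id`), and the injectable `clock` are all assumptions made so the behaviour can be tested without real sleeps.

```python
import asyncio
import time
from typing import Callable, Dict


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def seconds_until_token(self) -> float:
        self._refill()
        return max(0.0, (1.0 - self._tokens) / self.rate)


class RateLimiter:
    """One independent bucket per source_id."""

    def __init__(self, rates: Dict[str, float], capacity: int = 1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._buckets = {sid: TokenBucket(rate, capacity, clock)
                         for sid, rate in rates.items()}

    def try_acquire(self, source_id: str) -> bool:
        return self._buckets[source_id].try_acquire()

    async def acquire(self, source_id: str) -> None:
        """Wait until the given source has a token, then consume it."""
        bucket = self._buckets[source_id]
        while not bucket.try_acquire():
            await asyncio.sleep(bucket.seconds_until_token())
```

A source adapter would call `await limiter.acquire(source_id)` before each request. Because each source has its own bucket, a slow or throttled court site does not delay fetches from the others.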
Then extend to other sources:
- [ ] Create `sources/fedcourt.py` — RSS poll (`/rss/fca-judgments` is confirmed working), extract DOCX links from RSS items, download DOCX. Bootstrap pagination via `/digital-law-library/judgments/search` endpoint (needs empirical testing).
- [ ] Create `sources/highcourt.py` — crawl from `/cases-and-judgments` (NOT `/cases/recent-judgments` which 404s). Page structure is unknown — needs manual inspection first. Apply PRIORITY_KEYWORDS filter post-fetch.
- [ ] Create `sources/qld_judgments.py` — HTML scraping at `/caselaw-search/query?queryStringSearchText={term}`. No documented API. May need to reverse-engineer search parameters for criminal filter.
- [ ] Create `sources/auslaw_mcp.py` — MCP client wrapper. NOT a direct HTTP source — spawns/requires auslaw-mcp server process. Use the MCP Python SDK (published on PyPI as `mcp`). Handles: fetch by citation, OCR for scanned PDFs, targeted search.
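
For the FedCourt adapter, extracting DOCX links from RSS items might look like the sketch below. It assumes a standard RSS 2.0 layout in which the document URL appears either in an item's `<link>` or in an `<enclosure url="...">`; the real feed layout has not been inspected here, and the sample feed and `example.invalid` URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_docx_links(rss_xml: str) -> List[str]:
    """Return unique .docx URLs found in <item> links or enclosures, in feed order."""
    root = ET.fromstring(rss_xml)
    found: List[str] = []
    for item in root.iter("item"):
        candidates: List[str] = []
        link = item.findtext("link")
        if link:
            candidates.append(link.strip())
        for enclosure in item.findall("enclosure"):
            url = enclosure.get("url")
            if url:
                candidates.append(url.strip())
        for url in candidates:
            if url.lower().endswith(".docx") and url not in found:
                found.append(url)
    return found


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Case A</title><link>https://example.invalid/a.docx</link></item>
  <item><title>Case B</title><link>https://example.invalid/b.html</link>
        <enclosure url="https://example.invalid/b.DOCX" type="application/msword"/></item>
  <item><title>Case C</title><link>https://example.invalid/c.html</link></item>
</channel></rss>"""
```

If the feed turns out to link only to HTML landing pages, this step would need a second fetch to locate the DOCX download, which is one reason to test against the live feed before building the queue logic.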
### Stage 4: Processing Pipeline (format-agnostic)
Goal: Raw document → structured metadata + chunks + embeddings. Works regardless of source.
- [ ] Create `processing/doc_parser.py` — dispatch to format handler. HTML (BeautifulSoup, strip nav/header/footer), DOCX (python-docx), PDF (pdfminer.six text extraction). Output: `RawDocument`.
- [ ] Create `processing/meta_extractor.py` — call Claude Haiku with the extraction prompt from spec. Parse JSON response into `CaseMeta`. Handle malformed LLM output gracefully (retry once with "fix this JSON").
- [ ] Create `processing/outcome_parser.py` — separate verdict extraction pass. Resolution order: headnote → orders section → appeal cross-ref → external exoneration list. Load exoneration lookup table from `data/exoneration_sources.json` (to be curated manually).
- [ ] Create `processing/chunk_engine.py` — use Claude Haiku to identify structural boundaries, then split at boundaries within token budgets. Output: list of `Chunk` with type, sequence, speaker.
- [ ] Create `processing/embed_engine.py` — batch embed chunks via OpenAI text-embedding-3-small, store in ChromaDB collection `au_cases_{source_id}`.
- [ ] Create `storage/vector_index.py` — ChromaDB wrapper: `embed_and_store(chunks)`, `query(text, chunk_types, doc_ids, top_k)`
- [ ] Tests: `test_doc_parser.py` — fixture HTML (NSW-style), DOCX, PDF text. Verify clean extraction, char counts, no nav/footer leakage.
- [ ] Tests: `test_meta_extractor.py` — mock Haiku response, verify CaseMeta parsing, verify null handling for missing fields.
- [ ] Tests: `test_outcome_parser.py` — fixture with explicit headnote, fixture with orders-only, fixture requiring appeal cross-ref.
- [ ] Tests: `test_chunk_engine.py` — mock boundary detection, verify token budget enforcement, verify sequence ordering.
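
A sketch of the "split at boundaries within token budgets" step, assuming boundary detection has already produced a list of `(chunk_type, text)` segments. Token counting here is a crude whitespace split standing in for a real tokenizer, and this `Chunk` is a trimmed-down stand-in for the dataclass in `models.py`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    sequence: int
    chunk_type: str
    text: str
    speaker: Optional[str] = None  # not populated in this sketch


def split_at_boundaries(segments: List[Tuple[str, str]], max_tokens: int) -> List[Chunk]:
    """Turn typed segments into chunks, splitting any segment that exceeds the budget.

    Segments are never merged across a boundary, so each chunk keeps one type.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    chunks: List[Chunk] = []
    for chunk_type, text in segments:
        words = text.split()  # whitespace words as an approximate token count
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            chunks.append(Chunk(sequence=len(chunks), chunk_type=chunk_type, text=piece))
    return chunks
```

For example, `split_at_boundaries([("headnote", "a b c")], max_tokens=2)` yields two `headnote` chunks with sequence numbers 0 and 1.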
### Stage 5: Graph Builder
Goal: Build property graph from CaseMeta + Chunks. Enable juror queries.
- [ ] Set up Neo4j (Docker container on Lappy or local)
- [ ] Create `storage/graph_db.py` — Neo4j driver wrapper: `create_node(label, properties)`, `create_relationship(from, to, type, props)`, `query(cypher)`
- [ ] Create `processing/graph_builder.py` — from CaseMeta: create Case, Charge, Judge, Witness, Exhibit, Ruling, Timeline nodes and relationships. From Chunks: create Chunk nodes, FOLLOWS edges. Pairwise similarity: CORROBORATES (>0.85) and CONTRADICTS (<0.25 same topic).
- [ ] Tests: `test_graph_builder.py` - create case with known meta, verify graph shape (node counts, relationship types, no dangling references).
- [ ] Verify: small test case → Neo4j has correct node/edge counts
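
The pairwise similarity rule in the graph builder can be sketched with plain cosine similarity over chunk embeddings. The thresholds (0.85 and 0.25) come from the plan; the `(chunk_id, topic, embedding)` tuple layout and the `topic` label are assumptions, since the plan does not say how "same topic" is determined.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

CORROBORATES_MIN = 0.85  # similarity above this creates a CORROBORATES edge
CONTRADICTS_MAX = 0.25   # similarity below this, on the same topic, creates CONTRADICTS

ChunkVec = Tuple[str, str, Sequence[float]]  # (chunk_id, topic, embedding)
Edge = Tuple[str, str, str]                  # (from_id, relationship, to_id)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(chunks: List[ChunkVec]) -> List[Edge]:
    """Compare every pair of chunks once and emit the edges the thresholds allow."""
    edges: List[Edge] = []
    for (id_a, topic_a, emb_a), (id_b, topic_b, emb_b) in combinations(chunks, 2):
        sim = cosine(emb_a, emb_b)
        if sim > CORROBORATES_MIN:
            edges.append((id_a, "CORROBORATES", id_b))
        elif sim < CONTRADICTS_MAX and topic_a == topic_b:
            edges.append((id_a, "CONTRADICTS", id_b))
    return edges
```

Comparing all pairs is quadratic in the number of chunks, so restricting the comparison to chunks within one case keeps it tractable.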
### Stage 6: Juror Interface
Goal: Persona-driven subgraph queries work end-to-end.
- [ ] Create `jury/personas.py` - `JurorPersona` dataclass + `PERSONA_TRAVERSAL` dict (6 personas from spec)
- [ ] Create `jury/subgraph_query.py` - `get_juror_context(case_mnc, persona, max_tokens)`. Traverse from persona anchor nodes, follow specified edge types, collect chunks up to token budget. **MUST exclude RULED_INADMISSIBLE edges at traversal level.**
- [ ] Tests: `test_subgraph_query.py` - fixture graph with inadmissible evidence. Verify every persona excludes it. Verify token budget is respected. Verify persona-specific node selection (nurse sees medical exhibits, not financial).
- [ ] Tests: `test_inadmissibility_wall.py` - dedicated test that RULED_INADMISSIBLE exclusion cannot be bypassed regardless of persona config.
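
One way to make the inadmissibility wall hold regardless of persona config is to apply it inside the traversal function itself, as in this in-memory sketch. It interprets the rule as: never follow a `RULED_INADMISSIBLE` edge, and never visit a node that such an edge points at. That interpretation, the edge-list representation, and the `HAS_EXHIBIT` / `HAS_CHUNK` edge names in the usage note are assumptions; the real implementation would run against Neo4j.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source, edge_type, target)
INADMISSIBLE = "RULED_INADMISSIBLE"


def collect_chunks(edges: Iterable[Edge], anchors: Iterable[str],
                   allowed_edge_types: Iterable[str],
                   chunk_tokens: Dict[str, int], max_tokens: int) -> List[str]:
    """Breadth-first walk from the anchors, returning chunk ids within the token budget."""
    edges = list(edges)
    # The wall: targets of RULED_INADMISSIBLE edges are blocked for every caller,
    # and the edge type is stripped even if a persona config lists it.
    blocked = {dst for _, etype, dst in edges if etype == INADMISSIBLE}
    allowed = set(allowed_edge_types) - {INADMISSIBLE}

    adjacency: Dict[str, List[str]] = {}
    for src, etype, dst in edges:
        if etype in allowed and dst not in blocked:
            adjacency.setdefault(src, []).append(dst)

    queue = deque(a for a in anchors if a not in blocked)
    seen = set(queue)
    collected: List[str] = []
    used = 0
    while queue:
        node = queue.popleft()
        cost = chunk_tokens.get(node)  # None means the node is not a chunk
        if cost is not None and used + cost <= max_tokens:
            collected.append(node)
            used += cost
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return collected
```

In this sketch a chunk attached to a blocked exhibit is excluded only because the exhibit is its sole entry point. If the same chunk were reachable by another path (for example a `FOLLOWS` edge from an admissible chunk), it would still be collected, so the real query would also need to decide whether blocking propagates to a blocked node's chunks.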
### Stage 7: Orchestrator + Watch Mode
Goal: Autonomous continuous operation.
- [ ] Create `orchestrator.py` - main loop: poll sources → enqueue fetch_queue → dispatch workers (source → parse → extract → chunk → embed → graph) → update MetaDB at each stage transition
- [ ] Create `utils/telegram.py` - alert sender: source_degraded, daily_summary, milestone
- [ ] Wire: `python -m aucourt_ingest watch` runs APScheduler cron jobs (6h poll per spec)
- [ ] Wire: `python -m aucourt_ingest backfill --source fedcourt --from 1995 --to 2009 --format pdf`
- [ ] Wire: `python -m aucourt_ingest audit` re-processes all documents in MetaDB with current schema
- [ ] Integration test: full pipeline (NSW Caselaw, 3 docs, end to end → graph queryable)
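
The per-document part of the orchestrator, where each stage transition is written to MetaDB, can be sketched as an ordered stage runner. The stage names follow the worker chain in this plan (the fetch step is assumed to have already happened); the `handlers` mapping, the `record` callback, and the `"done"` / `"failed"` status strings are hypothetical stand-ins for the real processing modules and MetaDB calls.

```python
from typing import Callable, Dict

STAGES = ("parse", "extract", "chunk", "embed", "graph")

Handler = Callable[[str], None]             # takes a doc_id, raises on failure
Recorder = Callable[[str, str, str], None]  # (doc_id, stage, status)


def run_pipeline(doc_id: str, handlers: Dict[str, Handler], record: Recorder) -> bool:
    """Run each stage in order, recording the outcome after every transition.

    Stops at the first failing stage so later stages never see partial input.
    """
    for stage in STAGES:
        try:
            handlers[stage](doc_id)
        except Exception:
            record(doc_id, stage, "failed")
            return False
        record(doc_id, stage, "done")
    return True
```

Recording after every stage means a restart can resume a document from its last completed stage instead of re-running the whole chain.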
### Stage 8: Bootstrap Runs
- [ ] Bootstrap FedCourt: RSS 2010-2026 criminal filter (small test run first, then full)
- [ ] Bootstrap High Court: crawl 2000-2026 criminal appeals
- [ ] Bootstrap NSW Caselaw: browse criminal 2010-2026
- [ ] Validate corpus stats against spec targets (~5,000 cases for Phase 1)
- [ ] Set up Telegram alerts
- [ ] Enable watch mode on all 3 primary sources
---
## Key Risks
| Risk | Impact | Mitigation |
|------|--------|------------|
| High Court page structure unknown | Blocks source adapter | Manual inspection first, write adapter from real HTML |
| QLD search endpoint undocumented | Fragile scraping | Build defensive parser, expect breaking changes |
| Haiku extraction quality varies | Bad metadata | Validation layer, manual_review queue for low-confidence |
| Neo4j on Docker adds infra dep | Slows dev | Could start with NetworkX in-memory for testing |
| 5,000 cases × Haiku calls = cost | Burn rate | Estimate before bootstrap, consider batching |
| AusLaw MCP server as external dep | Availability | Document start-up procedure, health check |
| Court sites may block scrapers | Source degradation | Polite headers, rate limits, robots.txt respect, Telegram alerts |
## Decisions Needed Before Stage 3
1. **Graph backend for dev**: Neo4j Docker vs NetworkX in-memory for testing. Neo4j is spec target but adds Docker dependency. Consider: use NetworkX for unit tests, Neo4j for integration.
2. **Email AustLII now?** Long lead time. Low effort. Do it during Stage 1 skeleton work.
3. **Contact email in User-Agent**: spec has placeholder `your@email.com`. Need a real one.
4. **Exoneration data curation**: RMIT register, Innocence Project lists need manual collection into `data/exoneration_sources.json`. When?