slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline

Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-30 11:56:23 +10:00

AuCourtIngest — Implementation Plan

Audit Notes (from research)

High Court URL is wrong in spec: /cases/recent-judgments → 404. Real page is /cases-and-judgments. The crawl strategy will need to be discovered empirically — the page structure is unknown.
AustLII blocks direct access (403). AusLaw MCP is the only viable path to AustLII content. Remove any "AustLII direct" assumptions.
QLD Judgments: no documented API. The search endpoint at /caselaw-search/query?queryStringSearchText={term} exists but behaviour is undocumented. Treat as HTML scraping, not API calls.
AusLaw MCP: repo exists and is active. Tools confirmed: search, legislation, OCR. But it's an MCP server — needs to be running as a separate process. Integration pattern differs from direct HTTP sources.
Python version: spec says 3.12, Rog has 3.13. Use 3.13.
No VIC source defined: spec architecture diagram mentions "VICCatalogue" but section 3 only has 5 sources (FedCourt, HighCourt, NSW, QLD, AusLaw MCP). VIC is only mentioned as a Phase 2 backfill item via AusLaw MCP targeted search. Not a direct source adapter.

Build Order

The spec's 6-phase structure is fine conceptually but has dependency issues. The plan below fixes ordering and adds missing work the spec assumes but doesn't call out.

Stage 1: Skeleton + Data Models (no external deps)

Goal: runnable python -m aucourt_ingest --help that does nothing yet.

Create aucourt_ingest/__init__.py package
Create models.py — all dataclasses: RawDocument, CaseMeta, Chunk, JurorPersona, JurorContext, FetchQueueItem, SourceState
Create config.py — load config.toml, expose typed config objects (source configs, rate limits, API keys, paths)
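Create main.py — argparse entry point with bootstrap, watch, backfill, audit subcommands. Also add __main__.py that calls it; python -m aucourt_ingest only runs if the package has a __main__.py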
Create config.toml — all source definitions, rate limits, storage paths, API key placeholders
Create pyproject.toml — dependencies, scripts entry point
Create data/ directories (docs/, raw/)
Verify: python -m aucourt_ingest --help runs, all imports work

Stage 2: MetaDB + Utils (no external deps except aiosqlite)

Goal: SQLite is initialised, can track documents through pipeline states.

Create storage/meta_db.py — init schema (documents, fetch_queue, source_state tables), CRUD methods
Create utils/mnc_parser.py — MNC regex parser + validator
Create utils/rate_limiter.py — async token bucket per source_id
Create utils/retry.py — configurable exponential backoff decorator
Tests: test_mnc_parser.py — parse valid/invalid MNCs, edge cases (missing brackets, non-standard courts)
Tests: test_meta_db.py — init, insert doc, update status transitions, queue operations, source state
Tests: test_rate_limiter.py — basic burst behavior, per-source isolation

Stage 3: Source Layer (one source first, then extend)

Goal: Fetch a real document from one court and store it as raw HTML.

Start with NSW Caselaw: it has the clearest pagination, is HTML-only (no DOCX parsing needed), and its ?filter=criminal endpoint works.

Create sources/base.py — BaseSource ABC: discover(), fetch(url), build_queue(). Enforce rate limit. Log to MetaDB. Handle retry per exception type.
Create sources/nsw_caselaw.py — browse pagination, extract MNC + URL per row, fetch HTML
Create storage/doc_store.py — save raw downloads to data/raw/{source_id}/{year}/{doc_id}.html
Wire: main.py bootstrap --source nsw_caselaw --limit 3 fetches 3 real criminal cases
Verify: raw HTML files on disk, MetaDB shows fetch_status=fetched

Then extend to other sources:

Create sources/fedcourt.py — RSS poll (/rss/fca-judgments is confirmed working), extract DOCX links from RSS items, download DOCX. Bootstrap pagination via /digital-law-library/judgments/search endpoint (needs empirical testing).
Create sources/highcourt.py — crawl from /cases-and-judgments (NOT /cases/recent-judgments which 404s). Page structure is unknown — needs manual inspection first. Apply PRIORITY_KEYWORDS filter post-fetch.
Create sources/qld_judgments.py — HTML scraping at /caselaw-search/query?queryStringSearchText={term}. No documented API. May need to reverse-engineer search parameters for criminal filter.
Create sources/auslaw_mcp.py — MCP client wrapper. NOT a direct HTTP source — spawns/requires auslaw-mcp server process. Use mcp-python-sdk. Handles: fetch by citation, OCR for scanned PDFs, targeted search.
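
For the FedCourt adapter, extracting DOCX links from RSS items might look like the sketch below. The real feed layout has not been inspected here, so this assumes plain RSS 2.0 items where a DOCX URL appears either in link or in an enclosure url attribute; docx_links_from_rss is an illustrative name and the sample feed is invented for the example.

```python
import xml.etree.ElementTree as ET


def docx_links_from_rss(xml_text: str) -> list[str]:
    """Collect unique .docx URLs from RSS <item> elements, in feed order."""
    root = ET.fromstring(xml_text)
    links: list[str] = []
    for item in root.iter("item"):
        # Assumption: a DOCX URL may sit in <link> or in <enclosure url="...">.
        candidates = [el.text or "" for el in item.findall("link")]
        candidates += [el.get("url", "") for el in item.findall("enclosure")]
        for url in candidates:
            url = url.strip()
            if url.lower().endswith(".docx") and url not in links:
                links.append(url)
    return links


# Invented sample feed, not real Federal Court data.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>First judgment</title>
  <link>https://example.invalid/j/0001</link>
  <enclosure url="https://example.invalid/j/0001.docx" type="application/octet-stream"/>
</item>
<item><title>Second judgment</title>
  <link>https://example.invalid/j/0002.DOCX</link>
</item>
<item><title>No document attached</title>
  <link>https://example.invalid/j/0003</link>
</item>
</channel></rss>"""
```

If the real items only link to an HTML landing page, the adapter would need a second fetch to find the DOCX link, which is part of the empirical testing noted above.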

Stage 4: Processing Pipeline (format-agnostic)

Goal: Raw document → structured metadata + chunks + embeddings. Works regardless of source.

Create processing/doc_parser.py — dispatch to format handler. HTML (BeautifulSoup, strip nav/header/footer), DOCX (python-docx), PDF (pdfminer.six text extraction). Output: RawDocument.
Create processing/meta_extractor.py — call Claude Haiku with the extraction prompt from spec. Parse JSON response into CaseMeta. Handle malformed LLM output gracefully (retry once with "fix this JSON").
Create processing/outcome_parser.py — separate verdict extraction pass. Resolution order: headnote → orders section → appeal cross-ref → external exoneration list. Load exoneration lookup table from data/exoneration_sources.json (to be curated manually).
Create processing/chunk_engine.py — use Claude Haiku to identify structural boundaries, then split at boundaries within token budgets. Output: list of Chunk with type, sequence, speaker.
Create processing/embed_engine.py — batch embed chunks via OpenAI text-embedding-3-small, store in ChromaDB collection au_cases_{source_id}.
Create storage/vector_index.py — ChromaDB wrapper: embed_and_store(chunks), query(text, chunk_types, doc_ids, top_k)
Tests: test_doc_parser.py — fixture HTML (NSW-style), DOCX, PDF text. Verify clean extraction, char counts, no nav/footer leakage.
Tests: test_meta_extractor.py — mock Haiku response, verify CaseMeta parsing, verify null handling for missing fields.
Tests: test_outcome_parser.py — fixture with explicit headnote, fixture with orders-only, fixture requiring appeal cross-ref.
Tests: test_chunk_engine.py — mock boundary detection, verify token budget enforcement, verify sequence ordering.

Stage 5: Graph Builder

Goal: Build property graph from CaseMeta + Chunks. Enable juror queries.

Set up Neo4j (Docker container on Lappy or local)
Create storage/graph_db.py — Neo4j driver wrapper: create_node(label, properties), create_relationship(from, to, type, props), query(cypher)
Create processing/graph_builder.py — from CaseMeta: create Case, Charge, Judge, Witness, Exhibit, Ruling, Timeline nodes and relationships. From Chunks: create Chunk nodes, FOLLOWS edges. Pairwise similarity: CORROBORATES (>0.85) and CONTRADICTS (<0.25 same topic).
Tests: test_graph_builder.py — create case with known meta, verify graph shape (node counts, relationship types, no dangling references).
Verify: small test case → Neo4j has correct node/edge counts
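
The pairwise similarity step can be illustrated without Neo4j. The sketch below assumes that "similarity" means cosine similarity between chunk embedding vectors and that each chunk carries a topic label; both are assumptions, as are the dictionary keys vector and topic. The thresholds are the ones stated above.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(
    chunks: dict[str, dict],
    corroborates_above: float = 0.85,
    contradicts_below: float = 0.25,
) -> list[tuple[str, str, str]]:
    """Return (chunk_id, edge_type, chunk_id) triples for every chunk pair."""
    edges = []
    for (id_a, a), (id_b, b) in combinations(chunks.items(), 2):
        sim = cosine(a["vector"], b["vector"])
        if sim > corroborates_above:
            edges.append((id_a, "CORROBORATES", id_b))
        elif sim < contradicts_below and a["topic"] == b["topic"]:
            # Low similarity only counts as a contradiction on the same topic.
            edges.append((id_a, "CONTRADICTS", id_b))
    return edges


# Toy 2-dimensional vectors; real embeddings have far more dimensions.
CHUNKS = {
    "c1": {"vector": [1.0, 0.0], "topic": "alibi"},
    "c2": {"vector": [0.9, 0.1], "topic": "alibi"},
    "c3": {"vector": [0.0, 1.0], "topic": "alibi"},
    "c4": {"vector": [-1.0, 0.0], "topic": "weapon"},
}
```

Comparing every pair is quadratic in the number of chunks, so for a full corpus this would likely need to be limited to chunks within the same case.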

Stage 6: Juror Interface

Goal: Persona-driven subgraph queries work end-to-end.

Create jury/personas.py — JurorPersona dataclass + PERSONA_TRAVERSAL dict (6 personas from spec)
Create jury/subgraph_query.py — get_juror_context(case_mnc, persona, max_tokens). Traverse from persona anchor nodes, follow specified edge types, collect chunks up to token budget. MUST exclude RULED_INADMISSIBLE edges at traversal level.
Tests: test_subgraph_query.py — fixture graph with inadmissible evidence. Verify every persona excludes it. Verify token budget is respected. Verify persona-specific node selection (nurse sees medical exhibits, not financial).
Tests: test_inadmissibility_wall.py — dedicated test that RULED_INADMISSIBLE exclusion cannot be bypassed regardless of persona config.
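
The inadmissibility wall can be prototyped on a plain edge list before it is written as Cypher. This sketch makes several assumptions: a RULED_INADMISSIBLE edge points at the evidence node it excludes, token cost is approximated by whitespace word count, and the edge types HAS_EXHIBIT and SUPPORTED_BY in the usage note are invented for the example.

```python
from collections import deque

INADMISSIBLE = "RULED_INADMISSIBLE"


def juror_context(
    edges: list[tuple[str, str, str]],
    node_text: dict[str, str],
    anchors: list[str],
    follow: set[str],
    max_tokens: int,
) -> list[str]:
    """Breadth-first walk from anchors, never touching inadmissible nodes."""
    # Any node targeted by a RULED_INADMISSIBLE edge is blocked outright.
    blocked = {dst for _, etype, dst in edges if etype == INADMISSIBLE}
    # The wall holds even if a persona config lists the edge type itself.
    allowed = set(follow) - {INADMISSIBLE}

    adjacency: dict[str, list[str]] = {}
    for src, etype, dst in edges:
        if etype in allowed:
            adjacency.setdefault(src, []).append(dst)

    seen: set[str] = set()
    collected: list[str] = []
    used = 0
    queue = deque(a for a in anchors if a not in blocked)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        text = node_text.get(node)
        if text is not None:
            cost = len(text.split())  # crude token estimate (assumption)
            if used + cost > max_tokens:
                break
            collected.append(node)
            used += cost
        for nxt in adjacency.get(node, []):
            if nxt not in blocked and nxt not in seen:
                queue.append(nxt)
    return collected
```

For example, with edges case -HAS_EXHIBIT-> ex1, case -HAS_EXHIBIT-> ex2, ruling -RULED_INADMISSIBLE-> ex2 and ex2 -SUPPORTED_BY-> ch2, a walk from case never collects ex2 or ch2, whatever the persona's follow set contains. Because blocked nodes are never expanded, material reachable only through inadmissible evidence is excluded as well.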

Stage 7: Orchestrator + Watch Mode

Goal: Autonomous continuous operation.

Create orchestrator.py — main loop: poll sources → enqueue fetch_queue → dispatch workers (source→parse→extract→chunk→embed→graph) → update MetaDB at each stage transition
Create utils/telegram.py — alert sender: source_degraded, daily_summary, milestone
Wire: python -m aucourt_ingest watch runs APScheduler cron jobs (6h poll per spec)
Wire: python -m aucourt_ingest backfill --source fedcourt --from 1995 --to 2009 --format pdf
Wire: python -m aucourt_ingest audit re-processes all documents in MetaDB with current schema
Integration test: full pipeline (NSW Caselaw → 3 docs → end to end → graph queryable)

Stage 8: Bootstrap Runs

Bootstrap FedCourt: RSS 2010-2026 criminal filter (small test first, then full)
Bootstrap High Court: crawl 2000-2026 criminal appeals
Bootstrap NSW Caselaw: browse criminal 2010-2026
Validate corpus stats against spec targets (~5,000 cases for Phase 1)
Set up Telegram alerts
Enable watch mode on all 3 primary sources

Key Risks

Risk	Impact	Mitigation
High Court page structure unknown	Blocks source adapter	Manual inspection first, write adapter from real HTML
QLD search endpoint undocumented	Fragile scraping	Build defensive parser, expect breaking changes
Haiku extraction quality varies	Bad metadata	Validation layer, manual_review queue for low-confidence
Neo4j on Docker adds infra dep	Slows dev	Could start with NetworkX in-memory for testing
5,000 cases × Haiku calls = cost	Burn rate	Estimate before bootstrap, consider batching
AusLaw MCP server as external dep	Availability	Document start-up procedure, health check
Court sites may block scrapers	Source degradation	Polite headers, rate limits, robots.txt respect, Telegram alerts

Decisions Needed Before Stage 3

Graph backend for dev: Neo4j Docker vs NetworkX in-memory for testing. Neo4j is spec target but adds Docker dependency. Consider: use NetworkX for unit tests, Neo4j for integration.
Email AustLII now? — Long lead time. Low effort. Do it during Stage 1 skeleton work.
Contact email in User-Agent — spec has placeholder your@email.com. Need real one.
Exoneration data curation — RMIT register, Innocence Project lists need manual collection into data/exoneration_sources.json. When?
