AuCourtIngest — Implementation Plan
Audit Notes (from research)
- High Court URL is wrong in spec: `/cases/recent-judgments` → 404. Real page is `/cases-and-judgments`. The page structure is unknown, so the crawl strategy will need to be discovered empirically.
- AustLII blocks direct access (403). AusLaw MCP is the only viable path to AustLII content. Remove any "AustLII direct" assumptions.
- QLD Judgments: no documented API. The search endpoint at `/caselaw-search/query?queryStringSearchText={term}` exists but behaviour is undocumented. Treat as HTML scraping, not API calls.
- AusLaw MCP: repo exists and is active. Tools confirmed: search, legislation, OCR. But it's an MCP server, so it needs to be running as a separate process. Integration pattern differs from direct HTTP sources.
- Python version: spec says 3.12, Rog has 3.13. Use 3.13.
- No VIC source defined: spec architecture diagram mentions "VICCatalogue" but section 3 only has 5 sources (FedCourt, HighCourt, NSW, QLD, AusLaw MCP). VIC is only mentioned as a Phase 2 backfill item via AusLaw MCP targeted search. Not a direct source adapter.
Build Order
The spec's 6-phase structure is fine conceptually but has dependency issues. The plan below fixes ordering and adds missing work the spec assumes but doesn't call out.
Stage 1: Skeleton + Data Models (no external deps)
Goal: runnable python -m aucourt_ingest --help that does nothing yet.
- Create `aucourt_ingest/__init__.py` package
- Create `models.py`: all dataclasses (`RawDocument`, `CaseMeta`, `Chunk`, `JurorPersona`, `JurorContext`, `FetchQueueItem`, `SourceState`)
- Create `config.py`: load `config.toml`, expose typed config objects (source configs, rate limits, API keys, paths)
- Create `main.py`: argparse entry point with `bootstrap`, `watch`, `backfill`, `audit` subcommands
- Create `config.toml`: all source definitions, rate limits, storage paths, API key placeholders
- Create `pyproject.toml`: dependencies, scripts entry point
- Create `data/` directories (`docs/`, `raw/`)
- Verify: `python -m aucourt_ingest --help` runs, all imports work
Stage 2: MetaDB + Utils (no external deps except aiosqlite)
Goal: SQLite is initialised, can track documents through pipeline states.
- Create `storage/meta_db.py`: init schema (`documents`, `fetch_queue`, `source_state` tables), CRUD methods
- Create `utils/mnc_parser.py`: MNC regex parser + validator
- Create `utils/rate_limiter.py`: async token bucket per `source_id`
- Create `utils/retry.py`: configurable exponential backoff decorator
- Tests (`test_mnc_parser.py`): parse valid/invalid MNCs, edge cases (missing brackets, non-standard courts)
- Tests (`test_meta_db.py`): init, insert doc, update status transitions, queue operations, source state
- Tests (`test_rate_limiter.py`): basic burst behavior, per-source isolation
Stage 3: Source Layer (one source first, then extend)
Goal: Fetch a real document from one court and store it as raw HTML.
Start with NSW Caselaw: it has the clearest pagination, is HTML-only (no DOCX parsing needed), and its `?filter=criminal` endpoint works.
- Create `sources/base.py`: `BaseSource` ABC with `discover()`, `fetch(url)`, `build_queue()`. Enforce rate limit. Log to MetaDB. Handle retry per exception type.
- Create `sources/nsw_caselaw.py`: browse pagination, extract MNC + URL per row, fetch HTML
- Create `storage/doc_store.py`: save raw downloads to `data/raw/{source_id}/{year}/{doc_id}.html`
- Wire: `main.py bootstrap --source nsw_caselaw --limit 3` fetches 3 real criminal cases
- Verify: raw HTML files on disk, MetaDB shows `fetch_status=fetched`
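
A rough sketch of how `BaseSource` could enforce the rate limit in one place, so that adapters only supply discovery and download logic. The method names `discover()`, `fetch(url)` and `build_queue()` come from this plan; `TokenBucket`, the `_download()` hook and the constructor arguments are assumptions, and MetaDB logging and retries are left out.

```python
import asyncio
import time
from abc import ABC, abstractmethod


class TokenBucket:
    """Async token bucket: refills at `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        # The lock is held while waiting, so callers are served one at a time.
        async with self._lock:
            while True:
                now = time.monotonic()
                refill = (now - self._updated) * self.rate
                self._tokens = min(self.capacity, self._tokens + refill)
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                await asyncio.sleep((1 - self._tokens) / self.rate)


class BaseSource(ABC):
    source_id: str

    def __init__(self, bucket: TokenBucket) -> None:
        self._bucket = bucket

    @abstractmethod
    async def discover(self) -> list[str]:
        """Return candidate document URLs."""

    @abstractmethod
    async def _download(self, url: str) -> bytes:
        """Adapter-specific transfer (hypothetical hook, not in the spec)."""

    async def fetch(self, url: str) -> bytes:
        # Every fetch waits for a token, so no adapter can bypass the limit.
        await self._bucket.acquire()
        return await self._download(url)

    async def build_queue(self) -> list[str]:
        return await self.discover()
```

Giving each adapter its own `TokenBucket` instance is one way to get the per-source isolation that `test_rate_limiter.py` is meant to check.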
Then extend to other sources:
- Create `sources/fedcourt.py`: RSS poll (`/rss/fca-judgments` is confirmed working), extract DOCX links from RSS items, download DOCX. Bootstrap pagination via the `/digital-law-library/judgments/search` endpoint (needs empirical testing).
- Create `sources/highcourt.py`: crawl from `/cases-and-judgments` (NOT `/cases/recent-judgments`, which 404s). Page structure is unknown and needs manual inspection first. Apply `PRIORITY_KEYWORDS` filter post-fetch.
- Create `sources/qld_judgments.py`: HTML scraping at `/caselaw-search/query?queryStringSearchText={term}`. No documented API. May need to reverse-engineer search parameters for criminal filter.
- Create `sources/auslaw_mcp.py`: MCP client wrapper. NOT a direct HTTP source; it spawns/requires the auslaw-mcp server process. Use `mcp-python-sdk`. Handles: fetch by citation, OCR for scanned PDFs, targeted search.
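
The "extract DOCX links from RSS items" step for FedCourt might look roughly like the sketch below. It assumes a plain RSS 2.0 layout in which each `<item>` carries the DOCX URL in either `<link>` or an `<enclosure url="...">`; the real feed has not been inspected here, so the element names, the function name `extract_docx_links` and the sample XML are all assumptions.

```python
import xml.etree.ElementTree as ET


def extract_docx_links(rss_xml: str) -> list[tuple[str, str]]:
    """Return (title, url) for each RSS item that points at a .docx file."""
    root = ET.fromstring(rss_xml)
    results: list[tuple[str, str]] = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        # Check <link> first, then any <enclosure url="...">.
        candidates = [item.findtext("link") or ""]
        candidates += [enc.get("url", "") for enc in item.findall("enclosure")]
        for url in candidates:
            url = url.strip()
            if url.lower().endswith(".docx"):
                results.append((title, url))
                break  # one document per item is enough
    return results


# Made-up feed fragment for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Example v Example [2024] FCA 1</title>
<link>https://example.invalid/j/2024fca0001.docx</link></item>
<item><title>No document attached</title>
<link>https://example.invalid/j/page.html</link></item>
</channel></rss>"""
```

If the feed turns out to link to HTML landing pages instead, the same loop can collect those URLs and leave DOCX discovery to a second fetch.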
Stage 4: Processing Pipeline (format-agnostic)
Goal: Raw document → structured metadata + chunks + embeddings. Works regardless of source.
- Create `processing/doc_parser.py`: dispatch to format handler. HTML (BeautifulSoup, strip nav/header/footer), DOCX (python-docx), PDF (pdfminer.six text extraction). Output: `RawDocument`.
- Create `processing/meta_extractor.py`: call Claude Haiku with the extraction prompt from spec. Parse JSON response into `CaseMeta`. Handle malformed LLM output gracefully (retry once with "fix this JSON").
- Create `processing/outcome_parser.py`: separate verdict extraction pass. Resolution order: headnote → orders section → appeal cross-ref → external exoneration list. Load exoneration lookup table from `data/exoneration_sources.json` (to be curated manually).
- Create `processing/chunk_engine.py`: use Claude Haiku to identify structural boundaries, then split at boundaries within token budgets. Output: list of `Chunk` with type, sequence, speaker.
- Create `processing/embed_engine.py`: batch embed chunks via OpenAI text-embedding-3-small, store in ChromaDB collection `au_cases_{source_id}`.
- Create `storage/vector_index.py`: ChromaDB wrapper with `embed_and_store(chunks)`, `query(text, chunk_types, doc_ids, top_k)`
- Tests (`test_doc_parser.py`): fixture HTML (NSW-style), DOCX, PDF text. Verify clean extraction, char counts, no nav/footer leakage.
- Tests (`test_meta_extractor.py`): mock Haiku response, verify CaseMeta parsing, verify null handling for missing fields.
- Tests (`test_outcome_parser.py`): fixture with explicit headnote, fixture with orders-only, fixture requiring appeal cross-ref.
- Tests (`test_chunk_engine.py`): mock boundary detection, verify token budget enforcement, verify sequence ordering.
Stage 5: Graph Builder
Goal: Build property graph from CaseMeta + Chunks. Enable juror queries.
- Set up Neo4j (Docker container on Lappy or local)
- Create `storage/graph_db.py`: Neo4j driver wrapper with `create_node(label, properties)`, `create_relationship(from, to, type, props)`, `query(cypher)`
- Create `processing/graph_builder.py`: from CaseMeta, create Case, Charge, Judge, Witness, Exhibit, Ruling, Timeline nodes and relationships. From Chunks, create Chunk nodes and FOLLOWS edges. Pairwise similarity: CORROBORATES (>0.85) and CONTRADICTS (<0.25 same topic).
- Tests (`test_graph_builder.py`): create case with known meta, verify graph shape (node counts, relationship types, no dangling references).
- Verify: small test case → Neo4j has correct node/edge counts
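
The pairwise similarity rule can be sketched without Neo4j or ChromaDB. The thresholds (above 0.85 for CORROBORATES, below 0.25 on the same topic for CONTRADICTS) come from this plan; the use of cosine similarity, the dict-based chunk shape and the `topic` field are assumptions, since the plan does not say how similarity or topic is determined.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_edges(
    chunks: list[dict],
    corroborates_above: float = 0.85,
    contradicts_below: float = 0.25,
) -> list[tuple[str, str, str]]:
    """Return (from_id, edge_type, to_id) for every qualifying pair of chunks."""
    edges = []
    for left, right in combinations(chunks, 2):
        sim = cosine(left["embedding"], right["embedding"])
        if sim > corroborates_above:
            edges.append((left["id"], "CORROBORATES", right["id"]))
        elif sim < contradicts_below and left["topic"] == right["topic"]:
            # Low similarity only counts as a contradiction within one topic.
            edges.append((left["id"], "CONTRADICTS", right["id"]))
    return edges
```

Comparing every pair is quadratic in the number of chunks, so in practice this would probably run per case rather than across the whole corpus.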
Stage 6: Juror Interface
Goal: Persona-driven subgraph queries work end-to-end.
- Create `jury/personas.py`: `JurorPersona` dataclass + `PERSONA_TRAVERSAL` dict (6 personas from spec)
- Create `jury/subgraph_query.py`: `get_juror_context(case_mnc, persona, max_tokens)`. Traverse from persona anchor nodes, follow specified edge types, collect chunks up to token budget. MUST exclude RULED_INADMISSIBLE edges at traversal level.
- Tests (`test_subgraph_query.py`): fixture graph with inadmissible evidence. Verify every persona excludes it. Verify token budget is respected. Verify persona-specific node selection (nurse sees medical exhibits, not financial).
- Tests (`test_inadmissibility_wall.py`): dedicated test that RULED_INADMISSIBLE exclusion cannot be bypassed regardless of persona config.
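
To make the inadmissibility wall concrete, here is a small in-memory traversal sketch. It takes one reading of the requirement: any node targeted by a RULED_INADMISSIBLE edge is never visited, and that edge type is stripped from the allowed set even if a persona config lists it. The function name `collect_context`, the edge-tuple representation and every edge type other than RULED_INADMISSIBLE are hypothetical.

```python
from collections import deque

INADMISSIBLE = "RULED_INADMISSIBLE"


def collect_context(
    edges: list[tuple[str, str, str]],  # (source, edge_type, target)
    chunk_tokens: dict[str, int],       # token cost for nodes that are chunks
    anchors: list[str],
    edge_types: list[str],
    max_tokens: int,
) -> list[str]:
    """Breadth-first walk from the anchors, returning chunk ids within budget."""
    blocked = {dst for _, etype, dst in edges if etype == INADMISSIBLE}
    allowed = set(edge_types) - {INADMISSIBLE}  # persona config cannot re-enable it

    adjacency: dict[str, list[str]] = {}
    for src, etype, dst in edges:
        if etype in allowed and dst not in blocked:
            adjacency.setdefault(src, []).append(dst)

    collected: list[str] = []
    used = 0
    seen: set[str] = set()
    queue = deque(a for a in anchors if a not in blocked)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        cost = chunk_tokens.get(node, 0)
        if cost:
            if used + cost > max_tokens:
                continue  # chunk does not fit: skip it and do not expand from it
            collected.append(node)
            used += cost
        queue.extend(adjacency.get(node, []))
    return collected
```

This sketch only blocks nodes that are direct targets of a ruling. A chunk that describes an excluded exhibit could still be reached through another path such as a FOLLOWS edge, which is the kind of leak `test_inadmissibility_wall.py` would need to cover.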
Stage 7: Orchestrator + Watch Mode
Goal: Autonomous continuous operation.
- Create `orchestrator.py`: main loop runs poll sources → enqueue `fetch_queue` → dispatch workers (source→parse→extract→chunk→embed→graph) → update MetaDB at each stage transition
- Create `utils/telegram.py`: alert sender for `source_degraded`, `daily_summary`, `milestone`
- Wire: `python -m aucourt_ingest watch` runs APScheduler cron jobs (6h poll per spec)
- Wire: `python -m aucourt_ingest backfill --source fedcourt --from 1995 --to 2009 --format pdf`
- Wire: `python -m aucourt_ingest audit` re-processes all documents in MetaDB with current schema
- Integration test: full pipeline (NSW Caselaw → 3 docs → end to end → graph queryable)
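
The "update MetaDB at each stage transition" idea can be sketched as a loop over the stages after fetch. The stage names follow the pipeline listed above; the `handlers` mapping, the `*_done` / `*_failed` status strings and the plain dict standing in for MetaDB are assumptions for illustration.

```python
from typing import Any, Callable

# Stages that run after a document has been fetched from its source.
STAGES = ["parse", "extract", "chunk", "embed", "graph"]


def process_document(
    doc_id: str,
    payload: Any,
    handlers: dict[str, Callable[[Any], Any]],
    status: dict[str, str],  # stand-in for the MetaDB documents table
) -> bool:
    """Run each stage in order, recording the outcome after every transition."""
    for stage in STAGES:
        try:
            payload = handlers[stage](payload)
        except Exception:
            # Record where the document stopped so a later run can resume or audit.
            status[doc_id] = f"{stage}_failed"
            return False
        status[doc_id] = f"{stage}_done"
    return True
```

Because the status is written after every stage, a crash mid-pipeline leaves a record of the last completed step, which is what the `audit` mode would rely on when deciding what to re-process.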
Stage 8: Bootstrap Runs
- Bootstrap FedCourt: RSS 2010-2026 criminal filter (small test first, then full)
- Bootstrap High Court: crawl 2000-2026 criminal appeals
- Bootstrap NSW Caselaw: browse criminal 2010-2026
- Validate corpus stats against spec targets (~5,000 cases for Phase 1)
- Set up Telegram alerts
- Enable watch mode on all 3 primary sources
Key Risks
| Risk | Impact | Mitigation |
|---|---|---|
| High Court page structure unknown | Blocks source adapter | Manual inspection first, write adapter from real HTML |
| QLD search endpoint undocumented | Fragile scraping | Build defensive parser, expect breaking changes |
| Haiku extraction quality varies | Bad metadata | Validation layer, manual_review queue for low-confidence |
| Neo4j on Docker adds infra dep | Slows dev | Could start with NetworkX in-memory for testing |
| 5,000 cases × Haiku calls = cost | Burn rate | Estimate before bootstrap, consider batching |
| AusLaw MCP server as external dep | Availability | Document start-up procedure, health check |
| Court sites may block scrapers | Source degradation | Polite headers, rate limits, robots.txt respect, Telegram alerts |
Decisions Needed Before Stage 3
- Graph backend for dev: Neo4j Docker vs NetworkX in-memory for testing. Neo4j is spec target but adds Docker dependency. Consider: use NetworkX for unit tests, Neo4j for integration.
- Email AustLII now? Long lead time. Low effort. Do it during Stage 1 skeleton work.
- Contact email in User-Agent: spec has placeholder `your@email.com`. Need real one.
- Exoneration data curation: RMIT register, Innocence Project lists need manual collection into `data/exoneration_sources.json`. When?