# AuCourtIngest — Implementation Plan

## Audit Notes (from research)

- **High Court URL is wrong in spec**: `/cases/recent-judgments` → 404. Real page is `/cases-and-judgments`. The crawl strategy will need to be discovered empirically — the page structure is unknown.
- **AustLII blocks direct access** (403). AusLaw MCP is the only viable path to AustLII content. Remove any "AustLII direct" assumptions.
- **QLD Judgments**: no documented API. The search endpoint at `/caselaw-search/query?queryStringSearchText={term}` exists but behaviour is undocumented. Treat as HTML scraping, not API calls.
- **AusLaw MCP**: repo exists and is active. Tools confirmed: search, legislation, OCR. But it's an MCP server — needs to be running as a separate process. Integration pattern differs from direct HTTP sources.
- **Python version**: spec says 3.12, Rog has 3.13. Use 3.13.
- **No VIC source defined**: spec architecture diagram mentions "VICCatalogue" but section 3 only has 5 sources (FedCourt, HighCourt, NSW, QLD, AusLaw MCP). VIC is only mentioned as a Phase 2 backfill item via AusLaw MCP targeted search. Not a direct source adapter.

---

## Build Order

The spec's 6-phase structure is fine conceptually but has dependency issues. The plan below fixes ordering and adds missing work the spec assumes but doesn't call out.

### Stage 1: Skeleton + Data Models (no external deps)

Goal: runnable `python -m aucourt_ingest --help` that does nothing yet.
- [ ] Create `aucourt_ingest/__init__.py` package
- [ ] Create `models.py` — all dataclasses: `RawDocument`, `CaseMeta`, `Chunk`, `JurorPersona`, `JurorContext`, `FetchQueueItem`, `SourceState`
- [ ] Create `config.py` — load `config.toml`, expose typed config objects (source configs, rate limits, API keys, paths)
- [ ] Create `main.py` — argparse entry point with `bootstrap`, `watch`, `backfill`, `audit` subcommands
- [ ] Create `config.toml` — all source definitions, rate limits, storage paths, API key placeholders
- [ ] Create `pyproject.toml` — dependencies, scripts entry point
- [ ] Create `data/` directories (docs/, raw/)
- [ ] Verify: `python -m aucourt_ingest --help` runs, all imports work

### Stage 2: MetaDB + Utils (no external deps except aiosqlite)

Goal: SQLite is initialised, can track documents through pipeline states.

- [ ] Create `storage/meta_db.py` — init schema (documents, fetch_queue, source_state tables), CRUD methods
- [ ] Create `utils/mnc_parser.py` — MNC regex parser + validator
- [ ] Create `utils/rate_limiter.py` — async token bucket per source_id
- [ ] Create `utils/retry.py` — configurable exponential backoff decorator
- [ ] Tests: `test_mnc_parser.py` — parse valid/invalid MNCs, edge cases (missing brackets, non-standard courts)
- [ ] Tests: `test_meta_db.py` — init, insert doc, update status transitions, queue operations, source state
- [ ] Tests: `test_rate_limiter.py` — basic burst behavior, per-source isolation

### Stage 3: Source Layer (one source first, then extend)

Goal: Fetch a real document from one court and store it as raw HTML.

**Start with NSW Caselaw** — it has the clearest pagination, HTML-only (no DOCX parsing needed), and the `?filter=criminal` endpoint works.

- [ ] Create `sources/base.py` — `BaseSource` ABC: `discover()`, `fetch(url)`, `build_queue()`. Enforce rate limit. Log to MetaDB. Handle retry per exception type.
- [ ] Create `sources/nsw_caselaw.py` — browse pagination, extract MNC + URL per row, fetch HTML
- [ ] Create `storage/doc_store.py` — save raw downloads to `data/raw/{source_id}/{year}/{doc_id}.html`
- [ ] Wire: `main.py bootstrap --source nsw_caselaw --limit 3` fetches 3 real criminal cases
- [ ] Verify: raw HTML files on disk, MetaDB shows `fetch_status=fetched`

Then extend to other sources:

- [ ] Create `sources/fedcourt.py` — RSS poll (`/rss/fca-judgments` is confirmed working), extract DOCX links from RSS items, download DOCX. Bootstrap pagination via `/digital-law-library/judgments/search` endpoint (needs empirical testing).
- [ ] Create `sources/highcourt.py` — crawl from `/cases-and-judgments` (NOT `/cases/recent-judgments` which 404s). Page structure is unknown — needs manual inspection first. Apply PRIORITY_KEYWORDS filter post-fetch.
- [ ] Create `sources/qld_judgments.py` — HTML scraping at `/caselaw-search/query?queryStringSearchText={term}`. No documented API. May need to reverse-engineer search parameters for criminal filter.
- [ ] Create `sources/auslaw_mcp.py` — MCP client wrapper. NOT a direct HTTP source — spawns/requires auslaw-mcp server process. Use `mcp-python-sdk`. Handles: fetch by citation, OCR for scanned PDFs, targeted search.

### Stage 4: Processing Pipeline (format-agnostic)

Goal: Raw document → structured metadata + chunks + embeddings. Works regardless of source.

- [ ] Create `processing/doc_parser.py` — dispatch to format handler. HTML (BeautifulSoup, strip nav/header/footer), DOCX (python-docx), PDF (pdfminer.six text extraction). Output: `RawDocument`.
- [ ] Create `processing/meta_extractor.py` — call Claude Haiku with the extraction prompt from spec. Parse JSON response into `CaseMeta`. Handle malformed LLM output gracefully (retry once with "fix this JSON").
- [ ] Create `processing/outcome_parser.py` — separate verdict extraction pass. Resolution order: headnote → orders section → appeal cross-ref → external exoneration list. Load exoneration lookup table from `data/exoneration_sources.json` (to be curated manually).
- [ ] Create `processing/chunk_engine.py` — use Claude Haiku to identify structural boundaries, then split at boundaries within token budgets. Output: list of `Chunk` with type, sequence, speaker.
- [ ] Create `processing/embed_engine.py` — batch embed chunks via OpenAI text-embedding-3-small, store in ChromaDB collection `au_cases_{source_id}`.
- [ ] Create `storage/vector_index.py` — ChromaDB wrapper: `embed_and_store(chunks)`, `query(text, chunk_types, doc_ids, top_k)`
- [ ] Tests: `test_doc_parser.py` — fixture HTML (NSW-style), DOCX, PDF text. Verify clean extraction, char counts, no nav/footer leakage.
- [ ] Tests: `test_meta_extractor.py` — mock Haiku response, verify CaseMeta parsing, verify null handling for missing fields.
- [ ] Tests: `test_outcome_parser.py` — fixture with explicit headnote, fixture with orders-only, fixture requiring appeal cross-ref.
- [ ] Tests: `test_chunk_engine.py` — mock boundary detection, verify token budget enforcement, verify sequence ordering.

### Stage 5: Graph Builder

Goal: Build property graph from CaseMeta + Chunks. Enable juror queries.

- [ ] Set up Neo4j (Docker container on Lappy or local)
- [ ] Create `storage/graph_db.py` — Neo4j driver wrapper: `create_node(label, properties)`, `create_relationship(from, to, type, props)`, `query(cypher)`
- [ ] Create `processing/graph_builder.py` — from CaseMeta: create Case, Charge, Judge, Witness, Exhibit, Ruling, Timeline nodes and relationships. From Chunks: create Chunk nodes, FOLLOWS edges. Pairwise similarity: CORROBORATES (>0.85) and CONTRADICTS (<0.25 same topic).
- [ ] Tests: `test_graph_builder.py` — create case with known meta, verify graph shape (node counts, relationship types, no dangling references).
- [ ] Verify: small test case → Neo4j has correct node/edge counts

### Stage 6: Juror Interface

Goal: Persona-driven subgraph queries work end-to-end.
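
A minimal in-memory sketch of the traversal-level exclusion this stage requires of `get_juror_context`, written without Neo4j so it can run as a unit test. It assumes that a `RULED_INADMISSIBLE` edge points from a ruling to the material it excludes, and that the edge type is never followed even when a persona config lists it. The `TESTIFIED` edge type, the node ids and the `collect_chunks` helper are hypothetical; `FOLLOWS` and `RULED_INADMISSIBLE` are the only names taken from this plan.

```python
from collections import deque

INADMISSIBLE = "RULED_INADMISSIBLE"

# Hypothetical fixture: (source node, edge type, target node).
EDGES = [
    ("witness:nurse", "TESTIFIED", "chunk:1"),
    ("chunk:1", "FOLLOWS", "chunk:2"),
    ("ruling:1", INADMISSIBLE, "chunk:2"),
    ("chunk:2", "FOLLOWS", "chunk:3"),
    ("witness:nurse", "TESTIFIED", "chunk:4"),
]
# Token cost per chunk node; nodes absent from this map are not chunks.
TOKENS = {"chunk:1": 100, "chunk:2": 50, "chunk:3": 80, "chunk:4": 300}


def collect_chunks(edges, tokens, anchors, allowed_types, max_tokens):
    """Breadth-first walk from anchors, returning chunk ids within the budget."""
    # Anything targeted by an inadmissibility ruling is blocked outright.
    blocked = {dst for _, etype, dst in edges if etype == INADMISSIBLE}
    # The wall: strip the inadmissible type regardless of persona config.
    follow = set(allowed_types) - {INADMISSIBLE}

    adjacency = {}
    for src, etype, dst in edges:
        if etype in follow and dst not in blocked:
            adjacency.setdefault(src, []).append(dst)

    queue = deque(a for a in anchors if a not in blocked)
    seen, picked, used = set(), [], 0
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        cost = tokens.get(node)
        if cost is not None:
            if used + cost > max_tokens:
                continue  # chunk does not fit; skip it and do not expand from it
            used += cost
            picked.append(node)
        queue.extend(adjacency.get(node, []))
    return picked
```

In this fixture `chunk:2` is never returned, and `chunk:3` becomes unreachable because the only path to it runs through the blocked chunk. Whether material downstream of inadmissible evidence should also be cut off is a policy question for the spec, not something this sketch settles.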
- [ ] Create `jury/personas.py` — `JurorPersona` dataclass + `PERSONA_TRAVERSAL` dict (6 personas from spec)
- [ ] Create `jury/subgraph_query.py` — `get_juror_context(case_mnc, persona, max_tokens)`. Traverse from persona anchor nodes, follow specified edge types, collect chunks up to token budget. **MUST exclude RULED_INADMISSIBLE edges at traversal level.**
- [ ] Tests: `test_subgraph_query.py` — fixture graph with inadmissible evidence. Verify every persona excludes it. Verify token budget is respected. Verify persona-specific node selection (nurse sees medical exhibits, not financial).
- [ ] Tests: `test_inadmissibility_wall.py` — dedicated test that RULED_INADMISSIBLE exclusion cannot be bypassed regardless of persona config.

### Stage 7: Orchestrator + Watch Mode

Goal: Autonomous continuous operation.

- [ ] Create `orchestrator.py` — main loop: poll sources → enqueue fetch_queue → dispatch workers (source→parse→extract→chunk→embed→graph) → update MetaDB at each stage transition
- [ ] Create `utils/telegram.py` — alert sender: source_degraded, daily_summary, milestone
- [ ] Wire: `python -m aucourt_ingest watch` runs APScheduler cron jobs (6h poll per spec)
- [ ] Wire: `python -m aucourt_ingest backfill --source fedcourt --from 1995 --to 2009 --format pdf`
- [ ] Wire: `python -m aucourt_ingest audit` re-processes all documents in MetaDB with current schema
- [ ] Integration test: full pipeline (NSW Caselaw → 3 docs → end to end → graph queryable)

### Stage 8: Bootstrap Runs

- [ ] Bootstrap FedCourt: RSS 2010-2026 criminal filter (~small test first, then full)
- [ ] Bootstrap High Court: crawl 2000-2026 criminal appeals
- [ ] Bootstrap NSW Caselaw: browse criminal 2010-2026
- [ ] Validate corpus stats against spec targets (~5,000 cases for Phase 1)
- [ ] Set up Telegram alerts
- [ ] Enable watch mode on all 3 primary sources

---

## Key Risks

| Risk | Impact | Mitigation |
|------|--------|------------|
| High Court page structure unknown | Blocks source adapter | Manual inspection first, write adapter from real HTML |
| QLD search endpoint undocumented | Fragile scraping | Build defensive parser, expect breaking changes |
| Haiku extraction quality varies | Bad metadata | Validation layer, manual_review queue for low-confidence |
| Neo4j on Docker adds infra dep | Slows dev | Could start with NetworkX in-memory for testing |
| 5,000 cases × Haiku calls = cost | Burn rate | Estimate before bootstrap, consider batching |
| AusLaw MCP server as external dep | Availability | Document start-up procedure, health check |
| Court sites may block scrapers | Source degradation | Polite headers, rate limits, robots.txt respect, Telegram alerts |

## Decisions Needed Before Stage 3

1. **Graph backend for dev**: Neo4j Docker vs NetworkX in-memory for testing. Neo4j is spec target but adds Docker dependency. Consider: use NetworkX for unit tests, Neo4j for integration.
2. **Email AustLII now?** — Long lead time. Low effort. Do it during Stage 1 skeleton work.
3. **Contact email in User-Agent** — spec has placeholder `your@email.com`. Need real one.
4. **Exoneration data curation** — RMIT register, Innocence Project lists need manual collection into `data/exoneration_sources.json`. When?