aucourt-ingest/plan.md

# AuCourtIngest — Implementation Plan
## Audit Notes (from research)
- **High Court URL is wrong in spec**: `/cases/recent-judgments` → 404. Real page is `/cases-and-judgments`. The crawl strategy will need to be discovered empirically — the page structure is unknown.
- **AustLII blocks direct access** (403). AusLaw MCP is the only viable path to AustLII content. Remove any "AustLII direct" assumptions.
- **QLD Judgments**: no documented API. The search endpoint at `/caselaw-search/query?queryStringSearchText={term}` exists but behaviour is undocumented. Treat as HTML scraping, not API calls.
- **AusLaw MCP**: repo exists and is active. Tools confirmed: search, legislation, OCR. But it's an MCP server — needs to be running as a separate process. Integration pattern differs from direct HTTP sources.
- **Python version**: spec says 3.12, Rog has 3.13. Use 3.13.
- **No VIC source defined**: spec architecture diagram mentions "VICCatalogue" but section 3 only has 5 sources (FedCourt, HighCourt, NSW, QLD, AusLaw MCP). VIC is only mentioned as a Phase 2 backfill item via AusLaw MCP targeted search. Not a direct source adapter.
---
## Build Order
The spec's 6-phase structure is fine conceptually but has dependency issues. The plan below fixes ordering and adds missing work the spec assumes but doesn't call out.
### Stage 1: Skeleton + Data Models (no external deps)
Goal: runnable `python -m aucourt_ingest --help` that does nothing yet.
- [ ] Create `aucourt_ingest/__init__.py` package
- [ ] Create `models.py` — all dataclasses: `RawDocument`, `CaseMeta`, `Chunk`, `JurorPersona`, `JurorContext`, `FetchQueueItem`, `SourceState`
- [ ] Create `config.py` — load `config.toml`, expose typed config objects (source configs, rate limits, API keys, paths)
- [ ] Create `main.py` — argparse entry point with `bootstrap`, `watch`, `backfill`, `audit` subcommands
- [ ] Create `config.toml` — all source definitions, rate limits, storage paths, API key placeholders
- [ ] Create `pyproject.toml` — dependencies, scripts entry point
- [ ] Create `data/` directories (docs/, raw/)
- [ ] Verify: `python -m aucourt_ingest --help` runs, all imports work
### Stage 2: MetaDB + Utils (no external deps except aiosqlite)
Goal: SQLite is initialised, can track documents through pipeline states.
- [ ] Create `storage/meta_db.py` — init schema (documents, fetch_queue, source_state tables), CRUD methods
- [ ] Create `utils/mnc_parser.py` — MNC regex parser + validator
- [ ] Create `utils/rate_limiter.py` — async token bucket per source_id
- [ ] Create `utils/retry.py` — configurable exponential backoff decorator
- [ ] Tests: `test_mnc_parser.py` — parse valid/invalid MNCs, edge cases (missing brackets, non-standard courts)
- [ ] Tests: `test_meta_db.py` — init, insert doc, update status transitions, queue operations, source state
- [ ] Tests: `test_rate_limiter.py` — basic burst behavior, per-source isolation
### Stage 3: Source Layer (one source first, then extend)
Goal: Fetch a real document from one court and store it as raw HTML.
**Start with NSW Caselaw** — it has the clearest pagination, is HTML-only (no DOCX parsing needed), and its `?filter=criminal` endpoint works.
- [ ] Create `sources/base.py` — `BaseSource` ABC: `discover()`, `fetch(url)`, `build_queue()`. Enforce rate limit. Log to MetaDB. Handle retry per exception type.
- [ ] Create `sources/nsw_caselaw.py` — browse pagination, extract MNC + URL per row, fetch HTML
- [ ] Create `storage/doc_store.py` — save raw downloads to `data/raw/{source_id}/{year}/{doc_id}.html`
- [ ] Wire: `main.py bootstrap --source nsw_caselaw --limit 3` fetches 3 real criminal cases
- [ ] Verify: raw HTML files on disk, MetaDB shows `fetch_status=fetched`
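
A rough sketch of how `BaseSource` could enforce the rate limit itself, so individual adapters cannot bypass it. It assumes an asyncio token bucket per source; `TokenBucket`, its `rate` and `capacity` parameters, and the `_fetch_raw` hook are hypothetical names. `build_queue()`, MetaDB logging and retry handling are left out to keep it short.

```python
import asyncio
import time
from abc import ABC, abstractmethod


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                elapsed = now - self._updated
                self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                # Sleep roughly long enough for one token to accumulate.
                await asyncio.sleep((1 - self._tokens) / self.rate)


class BaseSource(ABC):
    source_id: str

    def __init__(self, bucket: TokenBucket) -> None:
        self._bucket = bucket

    @abstractmethod
    async def discover(self) -> list[str]:
        """Return candidate document URLs."""

    @abstractmethod
    async def _fetch_raw(self, url: str) -> bytes:
        """Source-specific download; subclasses implement only this part."""

    async def fetch(self, url: str) -> bytes:
        # Every fetch passes through the bucket before touching the network.
        await self._bucket.acquire()
        return await self._fetch_raw(url)
```

Giving each source its own `TokenBucket` instance is what provides the per-source isolation that `test_rate_limiter.py` is meant to check.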
Then extend to other sources:
- [ ] Create `sources/fedcourt.py` — RSS poll (`/rss/fca-judgments` is confirmed working), extract DOCX links from RSS items, download DOCX. Bootstrap pagination via `/digital-law-library/judgments/search` endpoint (needs empirical testing).
- [ ] Create `sources/highcourt.py` — crawl from `/cases-and-judgments` (NOT `/cases/recent-judgments` which 404s). Page structure is unknown — needs manual inspection first. Apply PRIORITY_KEYWORDS filter post-fetch.
- [ ] Create `sources/qld_judgments.py` — HTML scraping at `/caselaw-search/query?queryStringSearchText={term}`. No documented API. May need to reverse-engineer search parameters for criminal filter.
- [ ] Create `sources/auslaw_mcp.py` — MCP client wrapper. NOT a direct HTTP source — spawns/requires auslaw-mcp server process. Use the official MCP Python SDK (published on PyPI as `mcp`). Handles: fetch by citation, OCR for scanned PDFs, targeted search.
### Stage 4: Processing Pipeline (format-agnostic)
Goal: Raw document → structured metadata + chunks + embeddings. Works regardless of source.
- [ ] Create `processing/doc_parser.py` — dispatch to format handler. HTML (BeautifulSoup, strip nav/header/footer), DOCX (python-docx), PDF (pdfminer.six text extraction). Output: `RawDocument`.
- [ ] Create `processing/meta_extractor.py` — call Claude Haiku with the extraction prompt from spec. Parse JSON response into `CaseMeta`. Handle malformed LLM output gracefully (retry once with "fix this JSON").
- [ ] Create `processing/outcome_parser.py` — separate verdict extraction pass. Resolution order: headnote → orders section → appeal cross-ref → external exoneration list. Load exoneration lookup table from `data/exoneration_sources.json` (to be curated manually).
- [ ] Create `processing/chunk_engine.py` — use Claude Haiku to identify structural boundaries, then split at boundaries within token budgets. Output: list of `Chunk` with type, sequence, speaker.
- [ ] Create `processing/embed_engine.py` — batch embed chunks via OpenAI text-embedding-3-small, store in ChromaDB collection `au_cases_{source_id}`.
- [ ] Create `storage/vector_index.py` — ChromaDB wrapper: `embed_and_store(chunks)`, `query(text, chunk_types, doc_ids, top_k)`
- [ ] Tests: `test_doc_parser.py` — fixture HTML (NSW-style), DOCX, PDF text. Verify clean extraction, char counts, no nav/footer leakage.
- [ ] Tests: `test_meta_extractor.py` — mock Haiku response, verify CaseMeta parsing, verify null handling for missing fields.
- [ ] Tests: `test_outcome_parser.py` — fixture with explicit headnote, fixture with orders-only, fixture requiring appeal cross-ref.
- [ ] Tests: `test_chunk_engine.py` — mock boundary detection, verify token budget enforcement, verify sequence ordering.
### Stage 5: Graph Builder
Goal: Build property graph from CaseMeta + Chunks. Enable juror queries.
- [ ] Set up Neo4j (Docker container on Lappy or local)
- [ ] Create `storage/graph_db.py` — Neo4j driver wrapper: `create_node(label, properties)`, `create_relationship(from, to, type, props)`, `query(cypher)`
- [ ] Create `processing/graph_builder.py` — from CaseMeta: create Case, Charge, Judge, Witness, Exhibit, Ruling, Timeline nodes and relationships. From Chunks: create Chunk nodes, FOLLOWS edges. Pairwise chunk similarity: add a CORROBORATES edge when similarity > 0.85, and a CONTRADICTS edge when similarity < 0.25 between chunks on the same topic.
- [ ] Tests: `test_graph_builder.py` — create case with known meta, verify graph shape (node counts, relationship types, no dangling references).
- [ ] Verify: small test case → Neo4j has correct node/edge counts
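
The pairwise similarity rule can be sketched in plain Python as below. It assumes similarity means cosine similarity between chunk embeddings and that each chunk carries a `topic` label; both are assumptions, as are the dict keys `id`, `topic` and `embedding`. Only the 0.85 and 0.25 thresholds come from this plan.

```python
import math
from itertools import combinations

CORROBORATES_MIN = 0.85  # thresholds taken from the plan
CONTRADICTS_MAX = 0.25


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_edges(chunks: list[dict]) -> list[tuple[str, str, str]]:
    """Return (chunk_id, chunk_id, edge_type) for every qualifying pair."""
    edges = []
    for left, right in combinations(chunks, 2):
        score = cosine(left["embedding"], right["embedding"])
        if score > CORROBORATES_MIN:
            edges.append((left["id"], right["id"], "CORROBORATES"))
        elif score < CONTRADICTS_MAX and left["topic"] == right["topic"]:
            # Low similarity only counts as contradiction within one topic.
            edges.append((left["id"], right["id"], "CONTRADICTS"))
    return edges
```

The comparison is quadratic in the number of chunks, so in practice it would probably be run per case rather than across the whole corpus.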
### Stage 6: Juror Interface
Goal: Persona-driven subgraph queries work end-to-end.
- [ ] Create `jury/personas.py` — `JurorPersona` dataclass + `PERSONA_TRAVERSAL` dict (6 personas from spec)
- [ ] Create `jury/subgraph_query.py` — `get_juror_context(case_mnc, persona, max_tokens)`. Traverse from persona anchor nodes, follow specified edge types, collect chunks up to token budget. **MUST exclude RULED_INADMISSIBLE edges at traversal level.**
- [ ] Tests: `test_subgraph_query.py` — fixture graph with inadmissible evidence. Verify every persona excludes it. Verify token budget is respected. Verify persona-specific node selection (nurse sees medical exhibits, not financial).
- [ ] Tests: `test_inadmissibility_wall.py` — dedicated test that RULED_INADMISSIBLE exclusion cannot be bypassed regardless of persona config.
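
To illustrate what "at traversal level" could mean, here is an in-memory sketch over a plain edge list rather than Neo4j. It assumes a node is inadmissible when any `RULED_INADMISSIBLE` edge points at it, and it derives that set from the graph, not from persona config, so a persona that lists `RULED_INADMISSIBLE` among its edge types still cannot reach the node. The function name `collect_context`, its parameters, and the `HAS_EXHIBIT` edge type in the usage are hypothetical and differ from the planned `get_juror_context` signature.

```python
from collections import deque

INADMISSIBLE = "RULED_INADMISSIBLE"

Edge = tuple[str, str, str]  # (source_id, edge_type, target_id)


def collect_context(
    edges: list[Edge],
    chunk_tokens: dict[str, int],
    anchors: list[str],
    allowed_edge_types: set[str],
    max_tokens: int,
) -> list[str]:
    """Breadth-first walk from the anchors, collecting chunk ids within budget."""
    # Computed from the graph itself, never from persona config.
    blocked = {target for _, etype, target in edges if etype == INADMISSIBLE}

    neighbours: dict[str, list[str]] = {}
    for source, etype, target in edges:
        if etype in allowed_edge_types and etype != INADMISSIBLE:
            neighbours.setdefault(source, []).append(target)

    collected: list[str] = []
    used = 0
    seen: set[str] = set()
    queue = deque(a for a in anchors if a not in blocked)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        cost = chunk_tokens.get(node)  # None for non-chunk nodes
        if cost is not None and used + cost <= max_tokens:
            collected.append(node)
            used += cost
        for nxt in neighbours.get(node, []):
            if nxt not in blocked and nxt not in seen:
                queue.append(nxt)
    return collected
```

Because blocked nodes are never enqueued, anything reachable only through inadmissible material is also left out, which is the property `test_inadmissibility_wall.py` would want to assert.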
### Stage 7: Orchestrator + Watch Mode
Goal: Autonomous continuous operation.
- [ ] Create `orchestrator.py` — main loop: poll sources → enqueue fetch_queue → dispatch workers (source→parse→extract→chunk→embed→graph) → update MetaDB at each stage transition
- [ ] Create `utils/telegram.py` — alert sender: source_degraded, daily_summary, milestone
- [ ] Wire: `python -m aucourt_ingest watch` runs APScheduler cron jobs (6h poll per spec)
- [ ] Wire: `python -m aucourt_ingest backfill --source fedcourt --from 1995 --to 2009 --format pdf`
- [ ] Wire: `python -m aucourt_ingest audit` re-processes all documents in MetaDB with current schema
- [ ] Integration test: full pipeline (NSW Caselaw → 3 docs → end to end → graph queryable)
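
The per-document worker chain could be sketched as a simple ordered loop that records every stage transition. The stage order follows the orchestrator item above; the `run_pipeline` name, the `done` and `failed` status strings, and the `record` callback (standing in for a MetaDB update) are assumptions.

```python
from typing import Any, Callable

# Stage order from the plan's worker chain: source→parse→extract→chunk→embed→graph.
STAGES = ["source", "parse", "extract", "chunk", "embed", "graph"]

Handler = Callable[[Any], Any]
Recorder = Callable[[str, str, str], None]  # (doc_id, stage, status)


def run_pipeline(
    doc_id: str,
    payload: Any,
    handlers: dict[str, Handler],
    record: Recorder,
) -> bool:
    """Run each stage in order, recording the outcome after every transition."""
    for stage in STAGES:
        try:
            payload = handlers[stage](payload)
        except Exception:
            # Stop at the first failure so the document can resume from this stage.
            record(doc_id, stage, "failed")
            return False
        record(doc_id, stage, "done")
    return True
```

Recording after each stage, not once at the end, is what lets `audit` and later retries see exactly where a document stopped.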
### Stage 8: Bootstrap Runs
- [ ] Bootstrap FedCourt: RSS 2010-2026 criminal filter (small test run first, then full)
- [ ] Bootstrap High Court: crawl 2000-2026 criminal appeals
- [ ] Bootstrap NSW Caselaw: browse criminal 2010-2026
- [ ] Validate corpus stats against spec targets (~5,000 cases for Phase 1)
- [ ] Set up Telegram alerts
- [ ] Enable watch mode on all 3 primary sources
---
## Key Risks
| Risk | Impact | Mitigation |
|------|--------|------------|
| High Court page structure unknown | Blocks source adapter | Manual inspection first, write adapter from real HTML |
| QLD search endpoint undocumented | Fragile scraping | Build defensive parser, expect breaking changes |
| Haiku extraction quality varies | Bad metadata | Validation layer, manual_review queue for low-confidence |
| Neo4j on Docker adds infra dep | Slows dev | Could start with NetworkX in-memory for testing |
| 5,000 cases × Haiku calls = cost | Burn rate | Estimate before bootstrap, consider batching |
| AusLaw MCP server as external dep | Availability | Document start-up procedure, health check |
| Court sites may block scrapers | Source degradation | Polite headers, rate limits, robots.txt respect, Telegram alerts |
## Decisions Needed Before Stage 3
1. **Graph backend for dev**: Neo4j Docker vs NetworkX in-memory for testing. Neo4j is spec target but adds Docker dependency. Consider: use NetworkX for unit tests, Neo4j for integration.
2. **Email AustLII now?** — Long lead time. Low effort. Do it during Stage 1 skeleton work.
3. **Contact email in User-Agent** — spec has placeholder `your@email.com`. Need real one.
4. **Exoneration data curation** — RMIT register, Innocence Project lists need manual collection into `data/exoneration_sources.json`. When?