aucourt-ingest/todo.md
slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline
Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-30 11:56:23 +10:00

AuCourtIngest — TODO

Blocked / Needs Decisions

  • Pick email for User-Agent header and AustLII contact
  • Decide: Neo4j Docker for dev, or NetworkX for unit tests + Neo4j for integration?
  • Curate exoneration source data (RMIT register, Innocence Project) → data/exoneration_sources.json
  • Email AustLII partnership inquiry (long lead time, do early)
  • Inspect High Court /cases-and-judgments page structure (spec URL was wrong)

Stage 1: Skeleton + Data Models

  • Package structure: aucourt_ingest/__init__.py
  • models.py — RawDocument, CaseMeta, Chunk, JurorPersona, JurorContext, FetchQueueItem, SourceState
  • config.py — TOML loader (stdlib tomllib), typed config objects
  • main.py — argparse with bootstrap/watch/backfill/audit subcommands
  • config.toml — 5 source defs, rate limits, storage paths, API key placeholders
  • pyproject.toml — dependencies, entry points
  • data/docs/ and data/raw/ directories
  • Verify: python -m aucourt_ingest --help runs, all subcommands dispatch, all models import

Stage 2: MetaDB + Utils

  • storage/meta_db.py — async SQLite (aiosqlite), documents/fetch_queue/source_state CRUD, status transitions
  • utils/mnc_parser.py — MNC regex parser + court→jurisdiction mapping
  • utils/rate_limiter.py — async token bucket per source + registry
  • utils/retry.py — exponential backoff decorator with jitter
  • Tests: 27 passed (MNC parser 12, MetaDB 11, rate limiter 4)

Stage 3: Source Layer

  • sources/base.py — BaseSource ABC (discover, fetch, rate limit, MetaDB integration)
  • storage/doc_store.py — save/load raw downloads to data/raw/{source}/{year}/{mnc}.html
  • sources/nsw_caselaw.py — JSON API at /browse/list?filter=criminal, 200/page, 630 pages, 125K total
  • Wire: bootstrap --source nsw_caselaw --limit 3 fetched 3 real cases with MNCs, stored on disk + MetaDB
  • sources/fedcourt.py — RSS poll at /rss/fca-judgments, DOCX link discovery, bootstrap function
  • sources/highcourt.py — crawl from /cases-and-judgments, keyword filter, bootstrap function
  • sources/qld_judgments.py — search endpoint scraping, bootstrap stub (needs empirical testing)
  • sources/auslaw_mcp.py — stub with NotImplementedError, needs external MCP server
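
A sketch of how the data/raw/{source}/{year}/{mnc}.html layout might be built. Because an MNC contains brackets and spaces, this example slugifies it first; that slug rule and the raw_path / save_raw names are assumptions, not the behaviour of storage/doc_store.py.

```python
import re
from pathlib import Path


def raw_path(root: Path, source: str, year: int, mnc: str) -> Path:
    """Build root/{source}/{year}/{slug}.html from a citation string."""
    # Assumed slug rule: collapse every non-alphanumeric run to "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", mnc).strip("_")
    return root / source / str(year) / f"{slug}.html"


def save_raw(root: Path, source: str, year: int, mnc: str, html: str) -> Path:
    """Write the download to disk, creating parent directories as needed."""
    path = raw_path(root, source, year, mnc)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


example = raw_path(Path("data/raw"), "nsw_caselaw", 2023, "[2023] NSWCCA 45")
```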

Stage 4: Processing Pipeline

  • processing/doc_parser.py — HTML (strip nav/header/footer/BS4), DOCX (python-docx headings), PDF (pdfminer.six). MNC extraction before stripping.
  • processing/meta_extractor.py — Claude Haiku via injectable LLMClient protocol. JSON parse + retry with fix prompt. Handles markdown-wrapped responses.
  • processing/outcome_parser.py — regex-based verdict detection (14 patterns). Priority: HELD → appeal → quash → acquittal → not_guilty → guilty → hung. Exoneration external source lookup. Appeal cross-reference.
  • processing/chunk_engine.py — LLM boundary detection via injectable client. 7 chunk types with token budgets. Oversized sections auto-split. chunk_with_sections() for sync testing.
  • processing/embed_engine.py — OpenAI text-embedding-3-small via injectable EmbeddingClient protocol. Batch processing with configurable size.
  • storage/vector_index.py — ChromaDB wrapper: store_chunks, query with chunk_type/doc_id filters, cosine similarity.
  • Tests: 39 passed (doc_parser 8, meta_extractor 6, outcome_parser 18, chunk_engine 7)

Stage 5: Graph Builder

  • storage/graph_db.py — GraphDB Protocol + GraphNode/GraphRelationship + _node_id/_rel_id helpers
  • storage/in_memory_graph_db.py — NetworkX-backed InMemoryGraphDB for testing
  • storage/neo4j_graph_db.py — Neo4j async driver wrapper for production
  • processing/graph_builder.py — build_case, build_chunks, build_similarity_edges, build_full
  • Tests: 40 passed (cosine similarity 7, helpers 4, InMemoryGraphDB 8, build_case 7, build_chunks 4, similarity 7, build_full 2, protocol 1)
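
For reference, cosine similarity and threshold-based similarity edges can be sketched in pure Python as below. The similarity_edges name, the 0.8 default threshold, and the (id_a, id_b, score) edge shape are assumptions; build_similarity_edges in graph_builder.py may work differently.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(
    vectors: dict[str, list[float]], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    edges = []
    # Sorting by id makes the pair order deterministic.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges
```

The all-pairs loop is quadratic in the number of chunks, so a large corpus would more likely lean on the vector index for nearest-neighbour lookups.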

Stage 6: Jury Interface

  • jury/personas.py — 6 persona definitions (nurse, accountant, skeptic, ex_cop, empath, foreman) + traversal config
  • jury/subgraph_query.py — SubgraphQuery with get_juror_context(), RULED_INADMISSIBLE exclusion at traversal level, token budget, anchor spec parser
  • Tests: 27 passed (persona definitions 13, anchor parsing 5, subgraph query 9: inadmissibility wall, token budget, chunk type filter, sequence ordering, all personas valid, foreman traversal, global exclusion)
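
The inadmissibility wall, chunk type filter, sequence ordering, and token budget can be illustrated without a graph backend. In this sketch inadmissibility is a boolean flag on the chunk, whereas the real SubgraphQuery excludes RULED_INADMISSIBLE during traversal; ChunkNode, select_context, and the chunk type names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkNode:
    """Hypothetical flattened view of a chunk node in the graph."""
    chunk_id: str
    chunk_type: str
    sequence: int
    tokens: int
    inadmissible: bool = False


def select_context(
    chunks: list[ChunkNode], allowed_types: set[str], token_budget: int
) -> list[ChunkNode]:
    """Pick admissible chunks of allowed types, in order, within the budget."""
    selected: list[ChunkNode] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.sequence):
        if chunk.inadmissible:
            continue  # never reaches a juror, whatever the persona
        if chunk.chunk_type not in allowed_types:
            continue
        if used + chunk.tokens > token_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

Filtering before the budget check means excluded chunks never consume any of the juror's token allowance.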

Stage 7: Orchestrator + Watch Mode

  • orchestrator.py — main loop, queue dispatch, stage transitions (fetched→parsed→embedded→graphed), error recovery, ProcessingPipeline protocol
  • utils/telegram.py — alert sender with templates (source_degraded, daily_summary, milestone, fetch_error), NoOpAlert fallback
  • Wire: watch mode (polling loop with SIGINT/SIGTERM handling), backfill mode (year range), audit mode (re-process by status), process mode (drain queue)
  • main.py — all 5 modes fully implemented: bootstrap, watch, backfill, audit, process
  • Tests: 27 passed (TelegramAlert 10, NoOpAlert 3, orchestrator 14: empty queue, single/multi doc, limit, missing doc, pipeline error, status transitions, graph built, vector index, single doc, doc error, meta stored, no pipeline, queue status)

Stage 8: Integration Tests

  • processing/pipeline.py — FullPipeline wiring DocParser → MetaExtractor → OutcomeParser → ChunkEngine → EmbedEngine
  • Mock LLM client with deterministic meta extraction and chunk boundary detection
  • Mock embedding client with normalized deterministic vectors
  • 10 integration tests: parse, extract_meta, chunk, embed, 3-doc e2e to graph, meta stored, vector index, status transitions, graph queryable
  • End-to-end: 3 documents → FullPipeline → Orchestrator → InMemoryGraphDB with node dedup and edge traversal
  • Tests: 170 passed (pipeline 10, orchestrator 27, jury 27, graph 40, meta_extractor 6, chunk_engine 7, doc_parser 8, outcome_parser 18, mnc_parser 12, meta_db 11, rate_limiter 4)
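
One way to get normalized deterministic vectors for a mock embedding client is to derive them from a hash of the input text. This is a sketch of that idea only; the function name, the 8-dimension default, and the SHA-256 choice are assumptions, not the mock used in the integration tests.

```python
import hashlib
import math


def deterministic_embedding(text: str, dim: int = 8) -> list[float]:
    """Map text to a unit-length vector that is identical on every call."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Centre each byte around zero; cycle through the digest if dim > 32.
    raw = [digest[i % len(digest)] / 255.0 - 0.5 for i in range(dim)]
    # The norm is never zero: byte / 255 cannot equal exactly 0.5.
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]
```

Unit-length vectors make cosine similarity equal to the dot product, so similarity tests stay simple. The vectors carry no semantic meaning, so they suit wiring tests but not retrieval quality checks.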

Spec Corrections Found

  • High Court URL: /cases/recent-judgments → 404. Correct: /cases-and-judgments
  • Architecture diagram lists "VICCatalogue" as a source but no source definition exists. VIC is AusLaw MCP targeted search only.
  • Spec says Python 3.12 — Rog has 3.13. Using 3.13.
  • AustLII direct access returns 403 — only accessible via AusLaw MCP.