Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph), property graph with 8 node types, juror subgraph queries with 6 personas, orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AuCourtIngest — TODO
Blocked / Needs Decisions
- Pick email for User-Agent header and AustLII contact
- Decide: Neo4j Docker for dev, or NetworkX for unit tests + Neo4j for integration?
- Curate exoneration source data (RMIT register, Innocence Project) → data/exoneration_sources.json
- Email AustLII partnership inquiry (long lead time, do early)
- Inspect High Court /cases-and-judgments page structure (spec URL was wrong)
Stage 1: Skeleton + Data Models ✅
- Package structure:
  - aucourt_ingest/__init__.py
  - models.py: RawDocument, CaseMeta, Chunk, JurorPersona, JurorContext, FetchQueueItem, SourceState
  - config.py: TOML loader (stdlib tomllib), typed config objects
  - main.py: argparse with bootstrap/watch/backfill/audit subcommands
  - config.toml: 5 source defs, rate limits, storage paths, API key placeholders
  - pyproject.toml: dependencies, entry points
  - data/docs/ and data/raw/ directories
- Verify: python -m aucourt_ingest --help runs, all subcommands dispatch, all models import
Stage 2: MetaDB + Utils ✅
- storage/meta_db.py: async SQLite (aiosqlite), documents/fetch_queue/source_state CRUD, status transitions
- utils/mnc_parser.py: MNC regex parser + court→jurisdiction mapping
- utils/rate_limiter.py: async token bucket per source + registry
- utils/retry.py: exponential backoff decorator with jitter
- Tests: 27 passed (MNC parser 12, MetaDB 11, rate limiter 4)
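For orientation, a medium neutral citation (MNC) has the shape "[year] COURT number". The sketch below shows one way a regex parser with a court→jurisdiction lookup might work. The regex, the MNC tuple, and the small mapping table are assumptions, not the contents of utils/mnc_parser.py, and the sample citation string is made up.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "[2019] NSWCCA 45" -> year, court code, sequence number.
MNC_RE = re.compile(
    r"\[(?P<year>\d{4})\]\s+(?P<court>[A-Z][A-Za-z]+)\s+(?P<number>\d+)"
)

# Illustrative subset only; a real mapping would cover many more courts.
COURT_JURISDICTION = {
    "HCA": "cth",
    "FCA": "cth",
    "NSWCCA": "nsw",
    "NSWSC": "nsw",
    "QCA": "qld",
    "VSCA": "vic",
}


class MNC(NamedTuple):
    year: int
    court: str
    number: int
    jurisdiction: Optional[str]  # None when the court code is unknown


def parse_mnc(text: str) -> Optional[MNC]:
    """Return the first MNC found in text, or None if there is none."""
    match = MNC_RE.search(text)
    if match is None:
        return None
    court = match.group("court")
    return MNC(
        year=int(match.group("year")),
        court=court,
        number=int(match.group("number")),
        jurisdiction=COURT_JURISDICTION.get(court),
    )


print(parse_mnc("Example v Example [2019] NSWCCA 45"))
```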
Stage 3: Source Layer ✅
- sources/base.py: BaseSource ABC (discover, fetch, rate limit, MetaDB integration)
- storage/doc_store.py: save/load raw downloads to data/raw/{source}/{year}/{mnc}.html
- sources/nsw_caselaw.py: JSON API at /browse/list?filter=criminal, 200/page, 630 pages, 125K total
- Wire: bootstrap --source nsw_caselaw --limit 3 fetched 3 real cases with MNCs, stored on disk + MetaDB
- sources/fedcourt.py: RSS poll at /rss/fca-judgments, DOCX link discovery, bootstrap function
- sources/highcourt.py: crawl from /cases-and-judgments, keyword filter, bootstrap function
- sources/qld_judgments.py: search endpoint scraping, bootstrap stub (needs empirical testing)
- sources/auslaw_mcp.py: stub with NotImplementedError, needs external MCP server
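The {source}/{year}/{mnc}.html layout used by the doc store can be sketched as below. The function names and the filename sanitisation rule (dropping brackets, replacing spaces with underscores) are assumptions; the real storage/doc_store.py may encode the MNC differently.

```python
import tempfile
from pathlib import Path


def raw_path(root: Path, source: str, year: int, mnc: str) -> Path:
    """Build root/{source}/{year}/{mnc}.html with a filesystem-safe MNC."""
    # Assumed sanitisation: "[2019] NSWCCA 45" -> "2019_NSWCCA_45".
    safe = mnc.replace("[", "").replace("]", "").replace(" ", "_")
    return root / source / str(year) / f"{safe}.html"


def save_raw(root: Path, source: str, year: int, mnc: str, html: str) -> Path:
    """Write the raw download, creating parent directories as needed."""
    path = raw_path(root, source, year, mnc)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def load_raw(root: Path, source: str, year: int, mnc: str) -> str:
    """Read back a previously saved raw download."""
    return raw_path(root, source, year, mnc).read_text(encoding="utf-8")


# Round trip in a throwaway directory standing in for data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw(Path(tmp), "nsw_caselaw", 2019, "[2019] NSWCCA 45", "<html></html>")
    print(saved.relative_to(tmp))
```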
Stage 4: Processing Pipeline ✅
- processing/doc_parser.py: HTML (strip nav/header/footer/BS4), DOCX (python-docx headings), PDF (pdfminer.six). MNC extraction before stripping.
- processing/meta_extractor.py: Claude Haiku via injectable LLMClient protocol. JSON parse + retry with fix prompt. Handles markdown-wrapped responses.
- processing/outcome_parser.py: regex-based verdict detection (14 patterns). Priority: HELD → appeal → quash → acquittal → not_guilty → guilty → hung. Exoneration external source lookup. Appeal cross-reference.
- processing/chunk_engine.py: LLM boundary detection via injectable client. 7 chunk types with token budgets. Oversized sections auto-split. chunk_with_sections() for sync testing.
- processing/embed_engine.py: OpenAI text-embedding-3-small via injectable EmbeddingClient protocol. Batch processing with configurable size.
- storage/vector_index.py: ChromaDB wrapper: store_chunks, query with chunk_type/doc_id filters, cosine similarity.
- Tests: 39 passed (doc_parser 8, meta_extractor 6, outcome_parser 18, chunk_engine 7)
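The priority order matters because a phrase like "not guilty" also contains "guilty". A rough sketch of priority-ordered matching follows. The seven patterns are hypothetical stand-ins written for this example, not the 14 patterns in processing/outcome_parser.py.

```python
import re
from typing import Optional

# Checked top to bottom; the first matching label wins.
# Order follows the stated priority: HELD, appeal, quash, acquittal,
# not_guilty, guilty, hung. The regexes themselves are illustrative.
OUTCOME_PATTERNS = [
    ("held", re.compile(r"^\s*HELD\b", re.MULTILINE)),
    ("appeal", re.compile(r"\bappeal (?:is |was )?(?:allowed|dismissed)\b", re.I)),
    ("quash", re.compile(r"\bconvictions? (?:is|are|was|were|be) quashed\b", re.I)),
    ("acquittal", re.compile(r"\bacquitt(?:ed|al)\b", re.I)),
    ("not_guilty", re.compile(r"\bnot guilty\b", re.I)),
    ("guilty", re.compile(r"\bguilty\b", re.I)),
    ("hung", re.compile(r"\b(?:hung jury|jury (?:was )?unable to agree)\b", re.I)),
]


def detect_outcome(text: str) -> Optional[str]:
    """Return the highest-priority outcome label found in text, if any."""
    for label, pattern in OUTCOME_PATTERNS:
        if pattern.search(text):
            return label
    return None


print(detect_outcome("The jury found the accused not guilty."))
```

Placing not_guilty ahead of guilty is what stops the shorter pattern from claiming the match first.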
Stage 5: Graph Builder ✅
- storage/graph_db.py: GraphDB Protocol + GraphNode/GraphRelationship + _node_id/_rel_id helpers
- storage/in_memory_graph_db.py: NetworkX-backed InMemoryGraphDB for testing
- storage/neo4j_graph_db.py: Neo4j async driver wrapper for production
- processing/graph_builder.py: build_case, build_chunks, build_similarity_edges, build_full
- Tests: 40 passed (cosine similarity 7, helpers 4, InMemoryGraphDB 8, build_case 7, build_chunks 4, similarity 7, build_full 2, protocol 1)
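As an illustration of the similarity-edge step, the sketch below computes cosine similarity in pure Python and emits an edge for each chunk pair above a threshold. The function names, the 0.8 default threshold, and the tuple edge format are assumptions; build_similarity_edges in the project may work differently.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(
    embeddings: dict[str, list[float]], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for each unordered pair at or above threshold."""
    edges = []
    # Sorting by chunk id makes the output order deterministic.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


sample = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
print(similarity_edges(sample))
```

The all-pairs loop is quadratic in the number of chunks, which is fine for a sketch or a test graph but would likely need a vector index at corpus scale.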
Stage 6: Jury Interface ✅
- jury/personas.py: 6 persona definitions (nurse, accountant, skeptic, ex_cop, empath, foreman) + traversal config
- jury/subgraph_query.py: SubgraphQuery with get_juror_context(), RULED_INADMISSIBLE exclusion at traversal level, token budget, anchor spec parser
- Tests: 27 passed (persona definitions 13, anchor parsing 5, subgraph query 9: inadmissibility wall, token budget, chunk type filter, sequence ordering, all personas valid, foreman traversal, global exclusion)
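One way to read "exclusion at traversal level" is that inadmissible chunks are never visited or expanded, rather than filtered out afterwards. The sketch below takes that reading as an assumption. Only the RULED_INADMISSIBLE name comes from the list above; the graph representation, the HAS_CHUNK and NEXT edge types, and collect_context are hypothetical.

```python
from collections import deque


def collect_context(
    chunk_tokens: dict[str, int],
    edges: list[tuple[str, str, str]],
    anchor: str,
    token_budget: int,
) -> list[str]:
    """Breadth-first walk from anchor, returning chunk ids within the budget.

    chunk_tokens maps chunk id -> token count; edges are (src, rel_type, dst).
    """
    # Any node targeted by a RULED_INADMISSIBLE edge is walled off.
    blocked = {dst for _, rel, dst in edges if rel == "RULED_INADMISSIBLE"}
    neighbours: dict[str, list[str]] = {}
    for src, rel, dst in edges:
        if rel != "RULED_INADMISSIBLE":
            neighbours.setdefault(src, []).append(dst)

    selected: list[str] = []
    used = 0
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        if node in blocked:
            continue  # neither emitted nor expanded
        if node in chunk_tokens:
            if used + chunk_tokens[node] > token_budget:
                break  # budget exhausted, stop the walk
            selected.append(node)
            used += chunk_tokens[node]
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return selected


tokens = {"c1": 100, "c2": 100, "c3": 100, "c4": 100}
graph_edges = [
    ("case", "HAS_CHUNK", "c1"),
    ("case", "HAS_CHUNK", "c2"),
    ("case", "HAS_CHUNK", "c3"),
    ("ruling", "RULED_INADMISSIBLE", "c2"),
    ("c2", "NEXT", "c4"),
]
print(collect_context(tokens, graph_edges, "case", token_budget=250))
```

Because c2 is never expanded, c4 (reachable only through c2) is also kept out of the juror's context in this sketch.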
Stage 7: Orchestrator + Watch Mode ✅
- orchestrator.py: main loop, queue dispatch, stage transitions (fetched→parsed→embedded→graphed), error recovery, ProcessingPipeline protocol
- utils/telegram.py: alert sender with templates (source_degraded, daily_summary, milestone, fetch_error), NoOpAlert fallback
- Wire: watch mode (polling loop with SIGINT/SIGTERM handling), backfill mode (year range), audit mode (re-process by status), process mode (drain queue)
- main.py: all 5 modes fully implemented: bootstrap, watch, backfill, audit, process
- Tests: 27 passed (TelegramAlert 10, NoOpAlert 3, orchestrator 14: empty queue, single/multi doc, limit, missing doc, pipeline error, status transitions, graph built, vector index, single doc, doc error, meta stored, no pipeline, queue status)
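The fetched→parsed→embedded→graphed progression can be pictured as a small state machine. The sketch below is an assumption about the shape of that logic, not the code in orchestrator.py: the Status enum, the "error" status, and process_document are hypothetical names.

```python
from enum import Enum
from typing import Callable


class Status(str, Enum):
    FETCHED = "fetched"
    PARSED = "parsed"
    EMBEDDED = "embedded"
    GRAPHED = "graphed"
    ERROR = "error"  # hypothetical terminal state for failed documents


# Each non-terminal status maps to the status reached after its stage runs.
NEXT_STATUS = {
    Status.FETCHED: Status.PARSED,
    Status.PARSED: Status.EMBEDDED,
    Status.EMBEDDED: Status.GRAPHED,
}


def process_document(
    doc_id: str,
    status: Status,
    handlers: dict[Status, Callable[[str], None]],
) -> Status:
    """Run the handler for each stage until graphed, or stop on first failure."""
    while status in NEXT_STATUS:
        try:
            handlers[status](doc_id)
        except Exception:
            # A real orchestrator would probably log and record the failure.
            return Status.ERROR
        status = NEXT_STATUS[status]
    return status


calls: list[str] = []
demo_handlers = {s: (lambda doc_id, s=s: calls.append(s.value)) for s in NEXT_STATUS}
print(process_document("doc-1", Status.FETCHED, demo_handlers), calls)
```

Starting from a later status (for example Status.PARSED) resumes mid-pipeline, which is roughly what an audit-style re-process by status would need.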
Stage 8: Integration Tests ✅
- processing/pipeline.py: FullPipeline wiring DocParser → MetaExtractor → OutcomeParser → ChunkEngine → EmbedEngine
- Mock LLM client with deterministic meta extraction and chunk boundary detection
- Mock embedding client with normalized deterministic vectors
- 10 integration tests: parse, extract_meta, chunk, embed, 3-doc e2e to graph, meta stored, vector index, status transitions, graph queryable
- End-to-end: 3 documents → FullPipeline → Orchestrator → InMemoryGraphDB with node dedup and edge traversal
- Tests: 170 passed (pipeline 10, orchestrator 27, jury 27, graph 40, meta_extractor 6, chunk_engine 7, doc_parser 8, outcome_parser 18, mnc_parser 12, meta_db 11, rate_limiter 4)
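A mock embedding client with normalized deterministic vectors can be built from a hash of the input text, so tests get stable unit-length vectors without a network call. The sketch below is one possible approach; the class name, the embed() method, and the SHA-256 scheme are assumptions rather than the project's actual mock.

```python
import hashlib
import math


class MockEmbeddingClient:
    """Deterministic stand-in for an embedding API (illustrative only)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> list[float]:
        values: list[float] = []
        counter = 0
        # Hash with a counter prefix until enough bytes are available.
        while len(values) < self.dim:
            digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
            # Map each byte to [-0.5, 0.5]; never exactly 0, so the norm is > 0.
            values.extend(byte / 255.0 - 0.5 for byte in digest)
            counter += 1
        values = values[: self.dim]
        norm = math.sqrt(sum(v * v for v in values))
        return [v / norm for v in values]


client = MockEmbeddingClient(dim=8)
first, second = client.embed(["opening address", "opening address"])
print(first == second)
```

These vectors carry no semantic meaning, so similarity scores between different texts are arbitrary; they are only suitable for exercising plumbing.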
Spec Corrections Found
- High Court URL: /cases/recent-judgments → 404. Correct: /cases-and-judgments
- Architecture diagram lists "VICCatalogue" as a source but no source definition exists. VIC is AusLaw MCP targeted search only.
- Spec says Python 3.12 — Rog has 3.13. Using 3.13.
- AustLII direct access returns 403; it is only accessible via AusLaw MCP.