aucourt-ingest/CLAUDE.md
slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline
Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-30 11:56:23 +10:00

CLAUDE.md — AuCourtIngest

Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.

Spec: spec.md is the authoritative design document. All architecture decisions, schemas, and source definitions live there.

Tech Stack

Component       Choice
Language        Python 3.13 (C:\Python313\python.exe)
HTTP            httpx (async)
HTML parsing    BeautifulSoup4
DOCX            python-docx
PDF text        pdfminer.six
PDF OCR         via AusLaw MCP (pytesseract)
Embeddings      OpenAI text-embedding-3-small (1536-dim)
Vector store    ChromaDB (local)
Graph DB        Neo4j (Docker), primary for dev
Meta DB         SQLite via aiosqlite
LLM extraction  Claude Haiku 4.5 (structured metadata)
MCP client      mcp-python-sdk
Scheduling      APScheduler
Config          TOML

Architecture

Three-layer pipeline: Source Layer (court RSS/crawlers) → Processing Layer (parse → extract → chunk → embed → graph) → Storage Layer (DocStore + VectorIndex + PropertyGraph + MetaDB)
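
The Processing Layer ordering can be sketched as a chain of stage functions. This is a toy illustration only: the stage bodies below are placeholder stubs invented for the example (no real parsing, embedding, or graph writes), and the dict-based document shape is an assumption rather than the schema in spec.md.

```python
from collections.abc import Callable
from functools import reduce

# Each stage takes a document dict and returns an enriched copy (assumed shape).
Stage = Callable[[dict], dict]


def parse(doc: dict) -> dict:
    # Stub: the real parser would handle HTML, DOCX, and PDF input.
    return {**doc, "text": doc["raw"].decode("utf-8")}


def extract(doc: dict) -> dict:
    # Stub: stands in for structured metadata extraction.
    return {**doc, "meta": {"length": len(doc["text"])}}


def chunk(doc: dict) -> dict:
    # Stub: splits on blank lines instead of typed chunking.
    return {**doc, "chunks": doc["text"].split("\n\n")}


def embed(doc: dict) -> dict:
    # Stub: one fake 1-dim vector per chunk, not a real embedding.
    return {**doc, "vectors": [[float(len(c))] for c in doc["chunks"]]}


def graph(doc: dict) -> dict:
    # Stub: names one Chunk node per chunk instead of writing to a graph DB.
    return {**doc, "nodes": [f"Chunk:{i}" for i, _ in enumerate(doc["chunks"])]}


# Order mirrors the Processing Layer: parse, extract, chunk, embed, graph.
PROCESSING_STAGES: tuple[Stage, ...] = (parse, extract, chunk, embed, graph)


def run_pipeline(doc: dict, stages: tuple[Stage, ...] = PROCESSING_STAGES) -> dict:
    """Feed the document through every stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, doc)
```

Calling `run_pipeline({"raw": b"First para.\n\nSecond para."})` returns a dict carrying the outputs of all five stages, which makes it easy to test each stage in isolation.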

Four operational modes: bootstrap, watch, backfill, audit

Key Concepts

  • MNC (Medium Neutral Citation) — canonical case ID, e.g. [2019] NSWSC 1234
  • Chunk types: opening, testimony, exhibit, ruling, closing, judgment, sentence
  • Graph node types: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
  • Juror subgraph queries — persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
  • Exoneration sources — RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings

Critical Invariants

  1. RULED_INADMISSIBLE exclusion — enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude RULED_INADMISSIBLE edges.
  2. Suppression orders — do not ingest content flagged with suppression_order=True
  3. Verdict ground truth — OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag
  4. Rate limits per source — see spec section 7, enforced in orchestrator. Respect robots.txt.
  5. No shortcuts on parse errors — on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.
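
Invariant 1 can be illustrated with a toy in-memory traversal. This is a sketch under assumptions: the real queries run against the graph DB, edge names other than RULED_INADMISSIBLE are made up for the example, and it takes the stricter reading in which a node targeted by a RULED_INADMISSIBLE edge is unreachable by any route. spec.md remains authoritative on the exact semantics.

```python
from collections import deque

EXCLUDED_EDGE = "RULED_INADMISSIBLE"

# Toy representation: (source, edge_type, target) triples.
Edge = tuple[str, str, str]


def juror_reachable(edges: list[Edge], anchor: str) -> set[str]:
    """Breadth-first traversal from anchor that never uses excluded material."""
    # Assumption: anything an excluded edge points at is blocked outright.
    blocked = {dst for _, kind, dst in edges if kind == EXCLUDED_EDGE}

    adjacency: dict[str, list[str]] = {}
    for src, kind, dst in edges:
        # Drop the excluded edges themselves and every edge into a blocked node.
        if kind != EXCLUDED_EDGE and dst not in blocked:
            adjacency.setdefault(src, []).append(dst)

    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Because the filter is applied while building the adjacency map, excluded content never reaches the caller, so no prompt-level instruction is needed to keep it out of a juror's context.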

Planned File Structure

aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
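
As a rough idea of what the BaseSource ABC in sources/base.py could look like, here is a sketch. The method names `discover` and `fetch`, the two class attributes, and the `StaticSource` demo class are all assumptions made for illustration; the interface defined in spec.md takes precedence.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class BaseSource(ABC):
    """Assumed common interface for the court sources."""

    source_id: str
    rate_limit_rps: float

    @abstractmethod
    def discover(self) -> AsyncIterator[str]:
        """Yield locators (for example URLs) of decisions to fetch."""

    @abstractmethod
    async def fetch(self, locator: str) -> bytes:
        """Return the raw bytes of one decision."""


class StaticSource(BaseSource):
    """Hypothetical in-memory source, useful only for tests."""

    source_id = "static_demo"
    rate_limit_rps = 1.0

    def __init__(self, documents: dict[str, bytes]) -> None:
        self._documents = documents

    async def discover(self) -> AsyncIterator[str]:
        for locator in self._documents:
            yield locator

    async def fetch(self, locator: str) -> bytes:
        return self._documents[locator]


async def collect(source: BaseSource) -> dict[str, bytes]:
    # Works with any BaseSource, regardless of how it discovers documents.
    return {loc: await source.fetch(loc) async for loc in source.discover()}


if __name__ == "__main__":
    print(asyncio.run(collect(StaticSource({"doc-1": b"<html>...</html>"}))))
```

An in-memory implementation like `StaticSource` lets orchestrator tests run without network access.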

Development Commands

# Install dependencies (once project initialised)
# tomllib is in the standard library from Python 3.11, so tomli is not needed on 3.13
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic openai mcp apscheduler pytest

# Run ingestion modes
python -m aucourt_ingest bootstrap          # initial bulk ingest
python -m aucourt_ingest watch              # continuous RSS polling
python -m aucourt_ingest backfill           # fill gaps
python -m aucourt_ingest audit              # re-process for schema updates

# Run tests
python -m pytest tests/

# Start Neo4j (Docker): 7687 is the Bolt port, 7474 is the HTTP browser UI
docker run -p 7687:7687 -p 7474:7474 neo4j

Data Sources

Source         ID             Fetch Strategy                  Rate Limit
Federal Court  fedcourt       RSS poll → DOCX download        1 rps
High Court     highcourt      Index crawl                     0.5 rps
NSW Caselaw    nsw_caselaw    Browse pagination (MNC system)  1 rps
QLD Judgments  qld_judgments  Search pagination               0.5 rps
AusLaw MCP     auslaw_mcp     Targeted retrieval (not bulk)   0.3 rps
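
The per-source limits above translate into a minimum gap of 1/rps seconds between requests. A minimal asyncio sketch of that idea follows; the `RateLimiter` class and the `RATE_LIMITS_RPS` mapping are illustrative names, and the planned utils/rate_limiter.py may use a different strategy such as a token bucket.

```python
import asyncio
import time

# Requests per second, copied from the Data Sources table.
RATE_LIMITS_RPS = {
    "fedcourt": 1.0,
    "highcourt": 0.5,
    "nsw_caselaw": 1.0,
    "qld_judgments": 0.5,
    "auslaw_mcp": 0.3,
}


class RateLimiter:
    """Spaces successive wait() calls at least 1/rps seconds apart."""

    def __init__(self, rps: float) -> None:
        self.min_interval = 1.0 / rps
        self._next_allowed = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # The lock serialises callers so each one reserves its own time slot.
        async with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
            if delay:
                await asyncio.sleep(delay)
```

A fetcher would call `await limiter.wait()` before each request, with one limiter per source ID, for example `RateLimiter(RATE_LIMITS_RPS["highcourt"])` for a two-second gap.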

Windows Gotchas

  • Use forward slashes in all Python paths (in a normal string literal a backslash sequence such as \t is read as a tab character)
  • Neo4j runs in Docker — ensure Docker Desktop is running
  • OpenAI API key required for embeddings — set OPENAI_API_KEY env var
  • Anthropic API key required for LLM extraction — set ANTHROPIC_API_KEY env var
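
The path and API-key gotchas can be checked quickly in Python. In the sketch below the file name `data/temp.db` and the helper `missing_api_keys` are hypothetical, chosen only to show the escape problem and a startup check for the two environment variables.

```python
import os
from collections.abc import Mapping
from pathlib import PureWindowsPath

broken = "data\temp.db"   # "\t" is parsed as a tab, so the path is corrupted
safe = "data/temp.db"     # forward slashes need no escaping
raw = r"data\temp.db"     # a raw string also keeps the backslash literal

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# pathlib treats both separators the same way on Windows.
same_path = PureWindowsPath(safe) == PureWindowsPath(raw)
```

Running `missing_api_keys()` at startup and failing fast on a non-empty result avoids discovering a missing key halfway through an ingest run.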