# CLAUDE.md: AuCourtIngest

Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.

**Spec:** `spec.md` is the authoritative design document. All architecture decisions, schemas, and source definitions live there.

## Tech Stack

| Component | Choice |
|-----------|--------|
| Language | Python 3.13 (`C:\Python313\python.exe`) |
| HTTP | httpx (async) |
| HTML parsing | BeautifulSoup4 |
| DOCX | python-docx |
| PDF text | pdfminer.six |
| PDF OCR | via AusLaw MCP (pytesseract) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector store | ChromaDB (local) |
| Graph DB | Neo4j (Docker), primary for dev |
| Meta DB | SQLite via aiosqlite |
| LLM extraction | Claude Haiku 4.5 (structured metadata) |
| MCP client | mcp-python-sdk |
| Scheduling | APScheduler |
| Config | TOML |

## Architecture

Three-layer pipeline:

**Source Layer** (court RSS/crawlers) → **Processing Layer** (parse → extract → chunk → embed → graph) → **Storage Layer** (DocStore + VectorIndex + PropertyGraph + MetaDB)

Four operational modes: `bootstrap`, `watch`, `backfill`, `audit`

## Key Concepts

- **MNC** (Medium Neutral Citation): canonical case ID, e.g. `[2019] NSWSC 1234`
- **Chunk types**: opening, testimony, exhibit, ruling, closing, judgment, sentence
- **Graph node types**: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
- **Juror subgraph queries**: persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
- **Exoneration sources**: RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings

## Critical Invariants

1. **RULED_INADMISSIBLE exclusion**: enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude `RULED_INADMISSIBLE` edges.
2. **Suppression orders**: do not ingest content flagged with `suppression_order=True`.
3. **Verdict ground truth**: OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag.
4. **Rate limits per source**: see spec section 7, enforced in orchestrator. Respect robots.txt.
5. **No shortcuts on parse errors**: on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.

## Planned File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

## Development Commands

```bash
# Install dependencies (once project initialised)
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic mcp apscheduler tomli

# Run ingestion modes
python -m aucourt_ingest bootstrap   # initial bulk ingest
python -m aucourt_ingest watch       # continuous RSS polling
python -m aucourt_ingest backfill    # fill gaps
python -m aucourt_ingest audit       # re-process for schema updates

# Run tests
python -m pytest tests/

# Start Neo4j (Docker)
docker run -p 7687:7687 -p 7474:7474 neo4j
```

## Data Sources

| Source | ID | Fetch Strategy | Rate Limit |
|--------|----|----------------|------------|
| Federal Court | `fedcourt` | RSS poll → DOCX download | 1 rps |
| High Court | `highcourt` | Index crawl | 0.5 rps |
| NSW Caselaw | `nsw_caselaw` | Browse pagination (MNC system) | 1 rps |
| QLD Judgments | `qld_judgments` | Search pagination | 0.5 rps |
| AusLaw MCP | `auslaw_mcp` | Targeted retrieval (not bulk) | 0.3 rps |

## Windows Gotchas

- Use forward slashes in all Python paths (avoids `\t` tab interpolation)
- Neo4j runs in Docker: ensure Docker Desktop is running
- OpenAI API key required for embeddings: set `OPENAI_API_KEY` env var
- Anthropic API key required for LLM extraction: set `ANTHROPIC_API_KEY` env var
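
The path and API-key gotchas above could be caught by a small preflight check before any ingestion mode starts. What follows is a minimal sketch only, not part of the planned code base: `REQUIRED_ENV_VARS`, `missing_api_keys`, and the `data/test.db` path are hypothetical names chosen for illustration.

```python
import os
from pathlib import PureWindowsPath

# The two environment variables named in the gotchas above.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Backslash pitfall: in a normal string literal, "\t" is a tab character,
# so this is "data" + TAB + "est.db" rather than a path to test.db.
broken = "data\test.db"

# Forward slashes carry no escape sequences, and Windows path handling
# accepts them as separators.
safe = "data/test.db"

if __name__ == "__main__":
    print("tab hidden in broken path:", "\t" in broken)
    print("same path as raw backslash form:",
          PureWindowsPath(safe) == PureWindowsPath(r"data\test.db"))
    missing = missing_api_keys()
    if missing:
        print("missing environment variables:", ", ".join(missing))
```

Passing a plain dict to `missing_api_keys` instead of relying on `os.environ` keeps the check easy to unit-test.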