aucourt-ingest/CLAUDE.md

# CLAUDE.md — AuCourtIngest

Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.

**Spec:** `spec.md` is the authoritative design document. All architecture decisions, schemas, and source definitions live there.

## Tech Stack

| Component | Choice |
|-----------|--------|
| Language | Python 3.13 (`C:\Python313\python.exe`) |
| HTTP | httpx (async) |
| HTML parsing | BeautifulSoup4 |
| DOCX | python-docx |
| PDF text | pdfminer.six |
| PDF OCR | via AusLaw MCP (pytesseract) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector store | ChromaDB (local) |
| Graph DB | Neo4j (Docker) — primary for dev |
| Meta DB | SQLite via aiosqlite |
| LLM extraction | Claude Haiku 4.5 (structured metadata) |
| MCP client | MCP Python SDK (`mcp` on PyPI) |
| Scheduling | APScheduler |
| Config | TOML |

## Architecture

Three-layer pipeline: **Source Layer** (court RSS/crawlers) → **Processing Layer** (parse → extract → chunk → embed → graph) → **Storage Layer** (DocStore + VectorIndex + PropertyGraph + MetaDB)

Four operational modes: `bootstrap`, `watch`, `backfill`, `audit`
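
The four modes map naturally onto a single positional command-line argument. The sketch below shows one way the entry point might dispatch them with `argparse`. The handler functions (`run_bootstrap` and friends) and the `dispatch` helper are hypothetical placeholders, not names taken from `spec.md`, and each handler simply returns a description of what the mode does.

```python
import argparse
from typing import Callable


# Placeholder handlers (hypothetical): the real ones would start the pipeline.
def run_bootstrap() -> str:
    return "initial bulk ingest"


def run_watch() -> str:
    return "continuous RSS polling"


def run_backfill() -> str:
    return "fill gaps"


def run_audit() -> str:
    return "re-process for schema updates"


# Mode names come from this document; the mapping itself is an assumption.
HANDLERS: dict[str, Callable[[], str]] = {
    "bootstrap": run_bootstrap,
    "watch": run_watch,
    "backfill": run_backfill,
    "audit": run_audit,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    # argparse rejects any value outside the four known modes.
    parser.add_argument("mode", choices=sorted(HANDLERS))
    return parser


def dispatch(argv: list[str]) -> str:
    """Parse the command line and run the handler for the chosen mode."""
    args = build_parser().parse_args(argv)
    return HANDLERS[args.mode]()
```

Calling `dispatch(["watch"])` runs the watch handler, while an unknown mode makes `argparse` exit with a usage error rather than falling through silently.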

## Key Concepts

- **MNC** (Medium Neutral Citation) — canonical case ID, e.g. `[2019] NSWSC 1234`
- **Chunk types**: opening, testimony, exhibit, ruling, closing, judgment, sentence
- **Graph node types**: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
- **Juror subgraph queries** — persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
- **Exoneration sources** — RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings

## Critical Invariants

1. **RULED_INADMISSIBLE exclusion** — enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude `RULED_INADMISSIBLE` edges.
2. **Suppression orders** — do not ingest content flagged with `suppression_order=True`
3. **Verdict ground truth** — OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag
4. **Rate limits per source** — see spec section 7, enforced in orchestrator. Respect robots.txt.
5. **No shortcuts on parse errors** — on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.

## Planned File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

## Development Commands

```bash
# Install dependencies (once project initialised)
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic openai mcp apscheduler pytest

# Run ingestion modes (python -m on a package requires aucourt_ingest/__main__.py)
python -m aucourt_ingest bootstrap          # initial bulk ingest
python -m aucourt_ingest watch              # continuous RSS polling
python -m aucourt_ingest backfill           # fill gaps
python -m aucourt_ingest audit              # re-process for schema updates

# Run tests
python -m pytest aucourt_ingest/tests/

# Start Neo4j (Docker): 7687 is the Bolt driver port, 7474 the HTTP browser UI
docker run -p 7687:7687 -p 7474:7474 neo4j
```

## Data Sources

| Source | ID | Fetch Strategy | Rate Limit |
|--------|----|---------------|------------|
| Federal Court | `fedcourt` | RSS poll → DOCX download | 1 rps |
| High Court | `highcourt` | Index crawl | 0.5 rps |
| NSW Caselaw | `nsw_caselaw` | Browse pagination (MNC system) | 1 rps |
| QLD Judgments | `qld_judgments` | Search pagination | 0.5 rps |
| AusLaw MCP | `auslaw_mcp` | Targeted retrieval (not bulk) | 0.3 rps |
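
A limit in requests per second translates into a minimum gap between requests: 0.5 rps means one request every 2 seconds, and 0.3 rps roughly one every 3.3 seconds. The sketch below shows one way an async limiter could enforce the limits from the table above. `SourceRateLimiter` and `LIMITERS` are hypothetical names, and the actual design of `utils/rate_limiter.py` may differ.

```python
import asyncio
import time

# Requests per second per source ID, copied from the table above.
RATE_LIMITS_RPS = {
    "fedcourt": 1.0,
    "highcourt": 0.5,
    "nsw_caselaw": 1.0,
    "qld_judgments": 0.5,
    "auslaw_mcp": 0.3,
}


class SourceRateLimiter:
    """Spaces out requests so a source sees at most `rps` requests per second."""

    def __init__(self, rps: float) -> None:
        self.min_interval = 1.0 / rps
        self._next_slot = 0.0  # monotonic time of the next free request slot

    async def wait(self) -> None:
        now = time.monotonic()
        # Reserve a slot before sleeping. There is no await between the read
        # and the write, so concurrent tasks cannot claim the same slot.
        slot = max(now, self._next_slot)
        self._next_slot = slot + self.min_interval
        if slot > now:
            await asyncio.sleep(slot - now)


# One limiter per source (an assumption about how the orchestrator is wired).
LIMITERS = {
    source_id: SourceRateLimiter(rps)
    for source_id, rps in RATE_LIMITS_RPS.items()
}
```

A fetcher would call `await LIMITERS["highcourt"].wait()` immediately before each HTTP request to that source.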

## Windows Gotchas

- Use forward slashes (or raw strings) in Python path literals. In a regular string, a backslash starts an escape sequence, so `\t` becomes a tab.
- Neo4j runs in Docker — ensure Docker Desktop is running
- OpenAI API key required for embeddings — set `OPENAI_API_KEY` env var
- Anthropic API key required for LLM extraction — set `ANTHROPIC_API_KEY` env var
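
The path gotcha is easy to reproduce. This short sketch uses the package's `tests` directory as the example path. The helper `same_windows_path` is a hypothetical name, and `PureWindowsPath` is used only so that the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

broken = "aucourt_ingest\tests"    # "\t" is parsed as a tab character
raw = r"aucourt_ingest\tests"      # raw string keeps the backslash literally
forward = "aucourt_ingest/tests"   # forward slashes need no escaping


def same_windows_path(a: str, b: str) -> bool:
    """Compare two strings as Windows paths (accepts both separators)."""
    return PureWindowsPath(a) == PureWindowsPath(b)
```

Here `same_windows_path(raw, forward)` is true, while `broken` names a single component that contains a tab and matches neither of the other two.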