# CLAUDE.md — AuCourtIngest
Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.
**Spec:** `spec.md` is the authoritative design document. All architecture decisions, schemas, and source definitions live there.
## Tech Stack
| Component | Choice |
|-----------|--------|
| Language | Python 3.13 (`C:\Python313\python.exe`) |
| HTTP | httpx (async) |
| HTML parsing | BeautifulSoup4 |
| DOCX | python-docx |
| PDF text | pdfminer.six |
| PDF OCR | via AusLaw MCP (pytesseract) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector store | ChromaDB (local) |
| Graph DB | Neo4j (Docker) — primary for dev |
| Meta DB | SQLite via aiosqlite |
| LLM extraction | Claude Haiku 4.5 (structured metadata) |
| MCP client | MCP Python SDK (`mcp` package) |
| Scheduling | APScheduler |
| Config | TOML |
## Architecture
Three-layer pipeline: **Source Layer** (court RSS/crawlers) → **Processing Layer** (parse → extract → chunk → embed → graph) → **Storage Layer** (DocStore + VectorIndex + PropertyGraph + MetaDB).
Four operational modes: `bootstrap`, `watch`, `backfill`, `audit`.
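A minimal sketch of how the four modes could be dispatched from the command line. The handler functions and their return strings are hypothetical placeholders, not the orchestrator's real API; only the mode names come from this document.

```python
import argparse
from collections.abc import Callable


# Hypothetical handlers: real ones would hand off to the orchestrator.
def run_bootstrap() -> str:
    return "bootstrap: initial bulk ingest"


def run_watch() -> str:
    return "watch: continuous RSS polling"


def run_backfill() -> str:
    return "backfill: fill gaps"


def run_audit() -> str:
    return "audit: re-process for schema updates"


HANDLERS: dict[str, Callable[[], str]] = {
    "bootstrap": run_bootstrap,
    "watch": run_watch,
    "backfill": run_backfill,
    "audit": run_audit,
}


def dispatch(argv: list[str]) -> str:
    """Parse the mode argument and call the matching handler."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    # argparse rejects anything outside the four known modes.
    parser.add_argument("mode", choices=list(HANDLERS), help="operational mode")
    args = parser.parse_args(argv)
    return HANDLERS[args.mode]()
```

Calling `dispatch(sys.argv[1:])` from the entry point would wire this up. Note that `python -m aucourt_ingest` only works if the package contains a `__main__.py`, so `main.py` would need to be invoked from one.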
## Key Concepts
- **MNC** (Medium Neutral Citation) — canonical case ID, e.g. `[2019] NSWSC 1234`
- **Chunk types**: opening, testimony, exhibit, ruling, closing, judgment, sentence
- **Graph node types**: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
- **Juror subgraph queries** — persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
- **Exoneration sources** — RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings
## Critical Invariants
1. **RULED_INADMISSIBLE exclusion** — enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude `RULED_INADMISSIBLE` edges.
2. **Suppression orders** — do not ingest content flagged with `suppression_order=True`
3. **Verdict ground truth** — OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag
4. **Rate limits per source** — see spec section 7, enforced in orchestrator. Respect robots.txt.
5. **No shortcuts on parse errors** — on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.
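Invariant 1 can be illustrated with a pure-Python stand-in for the graph traversal. This is a sketch under assumptions, not the Neo4j implementation: it reads the invariant as "never follow an edge whose type is `RULED_INADMISSIBLE`", and the `Edge` tuple, the `reachable` function, and every edge type other than `RULED_INADMISSIBLE` are hypothetical.

```python
from collections import deque
from typing import NamedTuple

# Stripped from every traversal, regardless of what the caller asks for.
EXCLUDED_EDGE_TYPES = frozenset({"RULED_INADMISSIBLE"})


class Edge(NamedTuple):
    source: str
    edge_type: str
    target: str


def reachable(edges: list[Edge], anchor: str, allowed_types: set[str]) -> set[str]:
    """Breadth-first traversal from anchor that never follows an excluded edge."""
    # Enforce the exclusion here, in the traversal, not in the prompt.
    permitted = set(allowed_types) - EXCLUDED_EDGE_TYPES
    adjacency: dict[str, list[str]] = {}
    for edge in edges:
        if edge.edge_type in permitted:
            adjacency.setdefault(edge.source, []).append(edge.target)

    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Because the filter lives inside the traversal, a persona that mistakenly lists `RULED_INADMISSIBLE` among its edge types still cannot reach nodes that are only connected through such an edge.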
## Planned File Structure
```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```
## Development Commands
```bash
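# Install dependencies (once project initialised)
# tomllib is in the standard library from Python 3.11, so no separate TOML package is needed.
# openai (embeddings client) and pytest (test runner) are included on the assumption that both are used directly.
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic openai mcp apscheduler pytest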
# Run ingestion modes
python -m aucourt_ingest bootstrap # initial bulk ingest
python -m aucourt_ingest watch # continuous RSS polling
python -m aucourt_ingest backfill # fill gaps
python -m aucourt_ingest audit # re-process for schema updates
# Run tests
python -m pytest tests/
# Start Neo4j (Docker)
docker run -p 7687:7687 -p 7474:7474 neo4j
```
## Data Sources
| Source | ID | Fetch Strategy | Rate Limit |
|--------|----|---------------|------------|
| Federal Court | `fedcourt` | RSS poll → DOCX download | 1 rps |
| High Court | `highcourt` | Index crawl | 0.5 rps |
| NSW Caselaw | `nsw_caselaw` | Browse pagination (MNC system) | 1 rps |
| QLD Judgments | `qld_judgments` | Search pagination | 0.5 rps |
| AusLaw MCP | `auslaw_mcp` | Targeted retrieval (not bulk) | 0.3 rps |
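The rates in the table translate into a minimum gap between requests (0.5 rps means one request every 2 seconds). Below is a rough sketch of a per-source async limiter built on that idea. The `SourceRateLimiter` class is hypothetical and not the contents of `utils/rate_limiter.py`; the injectable `clock` and `sleep` parameters exist only to make the behaviour easy to test.

```python
import asyncio
import time

# Requests per second, copied from the Data Sources table.
RATE_LIMITS_RPS = {
    "fedcourt": 1.0,
    "highcourt": 0.5,
    "nsw_caselaw": 1.0,
    "qld_judgments": 0.5,
    "auslaw_mcp": 0.3,
}


def min_interval(source_id: str) -> float:
    """Seconds that must elapse between two requests to a source."""
    return 1.0 / RATE_LIMITS_RPS[source_id]


class SourceRateLimiter:
    """Spaces out requests to one source by its minimum interval."""

    def __init__(self, source_id: str, clock=time.monotonic, sleep=asyncio.sleep):
        self._interval = min_interval(source_id)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self) -> float:
        """Wait until the next request is allowed; return the seconds waited."""
        async with self._lock:
            now = self._clock()
            wait = max(0.0, self._next_allowed - now)
            if wait > 0:
                await self._sleep(wait)
            # Reserve the next slot relative to when this request may start.
            self._next_allowed = max(now, self._next_allowed) + self._interval
            return wait
```

A fetcher would call `await limiter.acquire()` before each `httpx` request. One limiter instance per source ID keeps a slow source from throttling the others.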
## Windows Gotchas
- Use forward slashes in all Python paths (a backslash sequence such as `\t` inside a normal string literal is read as a tab character)
- Neo4j runs in Docker — ensure Docker Desktop is running
- OpenAI API key required for embeddings — set `OPENAI_API_KEY` env var
- Anthropic API key required for LLM extraction — set `ANTHROPIC_API_KEY` env var
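The path and API-key gotchas above can be checked with a few lines of Python. This is an illustrative sketch: the `data/tmp` path and the `missing_env_vars` helper are hypothetical; only the two environment variable names come from this document.

```python
import os
from pathlib import PureWindowsPath

# In a normal string literal, "\t" is a tab, so this path is silently corrupted.
BROKEN = "data\tmp"
# Forward slashes are safe, and Windows path handling still splits them correctly.
SAFE = "data/tmp"

REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_env_vars(environ=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


print("tab inside BROKEN:", "\t" in BROKEN)
print("parts of SAFE:", PureWindowsPath(SAFE).parts)
print("missing keys:", missing_env_vars())
```

Running a check like `missing_env_vars()` at startup lets the pipeline fail early with a clear message rather than partway through an embedding or extraction call.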