CLAUDE.md — AuCourtIngest
Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.
Spec: spec.md is the authoritative design document. All architecture decisions, schemas, and source definitions live there.
Tech Stack
| Component | Choice |
|---|---|
| Language | Python 3.13 (C:\Python313\python.exe) |
| HTTP | httpx (async) |
| HTML parsing | BeautifulSoup4 |
| DOCX | python-docx |
| PDF text | pdfminer.six |
| PDF OCR | via AusLaw MCP (pytesseract) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector store | ChromaDB (local) |
| Graph DB | Neo4j (Docker) — primary for dev |
| Meta DB | SQLite via aiosqlite |
| LLM extraction | Claude Haiku 4.5 (structured metadata) |
| MCP client | mcp-python-sdk |
| Scheduling | APScheduler |
| Config | TOML |
Architecture
Three-layer pipeline: Source Layer (court RSS/crawlers) → Processing Layer (parse → extract → chunk → embed → graph) → Storage Layer (DocStore + VectorIndex + PropertyGraph + MetaDB)
Four operational modes: bootstrap, watch, backfill, audit
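The four modes could be dispatched from the entry point roughly as follows. This is a minimal sketch using `argparse`; the names `MODES` and `parse_mode` are hypothetical and are not taken from spec.md.

```python
import argparse

# The four operational modes listed above.
MODES = ("bootstrap", "watch", "backfill", "audit")


def parse_mode(argv: list[str]) -> str:
    """Return the requested mode; argparse exits with a usage error otherwise."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    parser.add_argument("mode", choices=MODES, help="operational mode to run")
    return parser.parse_args(argv).mode
```

Restricting the positional argument with `choices` means an unknown mode fails before any source is contacted.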
Key Concepts
- MNC (Medium Neutral Citation): canonical case ID, e.g. `[2019] NSWSC 1234`
- Chunk types: opening, testimony, exhibit, ruling, closing, judgment, sentence
- Graph node types: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
- Juror subgraph queries: persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
- Exoneration sources: RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings
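Because the MNC is the canonical case ID, a small parser for it is useful early on. The sketch below assumes the citation always has the shape `[year] COURT number`; the `MNC` dataclass and `parse_mnc` function are hypothetical names, and the real `mnc_parser.py` may handle more variants.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed shape: "[year] COURT number", e.g. "[2019] NSWSC 1234".
_MNC_RE = re.compile(
    r"\[(?P<year>\d{4})\]\s+(?P<court>[A-Za-z][A-Za-z0-9]*)\s+(?P<number>\d+)"
)


@dataclass(frozen=True)
class MNC:
    year: int
    court: str
    number: int

    def __str__(self) -> str:
        return f"[{self.year}] {self.court} {self.number}"


def parse_mnc(text: str) -> MNC | None:
    """Return the first MNC found in the text, or None if there is none."""
    m = _MNC_RE.search(text)
    if m is None:
        return None
    return MNC(int(m["year"]), m["court"], int(m["number"]))
```

Making the dataclass frozen keeps it hashable, so an `MNC` can serve directly as a dictionary key or set member when deduplicating cases.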
Critical Invariants
- RULED_INADMISSIBLE exclusion: enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude `RULED_INADMISSIBLE` edges.
- Suppression orders: do not ingest content flagged with `suppression_order=True`
- Verdict ground truth: OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag
- Rate limits per source: see spec section 7, enforced in orchestrator. Respect robots.txt.
- No shortcuts on parse errors: on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.
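The first invariant, excluding `RULED_INADMISSIBLE` edges during traversal rather than in the prompt, can be illustrated with an in-memory graph. This is only a sketch of the edge filter: the production graph lives in Neo4j, and the `Edge` class, the `reachable` function, and every edge type other than `RULED_INADMISSIBLE` are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass

EXCLUDED_EDGE = "RULED_INADMISSIBLE"


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    type: str


def reachable(edges: list[Edge], anchors: set[str], allowed_types: set[str]) -> set[str]:
    """Breadth-first traversal from the anchors that never follows an excluded edge."""
    # Strip the excluded type even if a persona's allowed set lists it by mistake.
    allowed = allowed_types - {EXCLUDED_EDGE}
    adjacency: dict[str, list[str]] = {}
    for edge in edges:
        if edge.type in allowed:
            adjacency.setdefault(edge.src, []).append(edge.dst)

    seen = set(anchors)
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Since the filter is applied while the adjacency map is built, a node that is linked only through an excluded edge never enters the result, so it cannot reach a prompt regardless of how the persona is configured.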
Planned File Structure
```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```
Development Commands
```bash
# Install dependencies (once project initialised)
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic mcp apscheduler tomli

# Run ingestion modes
python -m aucourt_ingest bootstrap   # initial bulk ingest
python -m aucourt_ingest watch       # continuous RSS polling
python -m aucourt_ingest backfill    # fill gaps
python -m aucourt_ingest audit       # re-process for schema updates

# Run tests
python -m pytest tests/

# Start Neo4j (Docker)
docker run -p 7687:7687 -p 7474:7474 neo4j
```
Data Sources
| Source | ID | Fetch Strategy | Rate Limit |
|---|---|---|---|
| Federal Court | `fedcourt` | RSS poll → DOCX download | 1 rps |
| High Court | `highcourt` | Index crawl | 0.5 rps |
| NSW Caselaw | `nsw_caselaw` | Browse pagination (MNC system) | 1 rps |
| QLD Judgments | `qld_judgments` | Search pagination | 0.5 rps |
| AusLaw MCP | `auslaw_mcp` | Targeted retrieval (not bulk) | 0.3 rps |
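
One way to enforce these per-source limits is a minimum-interval limiter, where each source waits at least `1 / rps` seconds between requests. The sketch below is an assumption about how `rate_limiter.py` might look, not its actual design; `SourceRateLimiter` and the injectable `clock` and `sleep` parameters are hypothetical, and spec section 7 remains authoritative for the numbers.

```python
import asyncio
import time

# Requests per second per source ID, copied from the table above.
RATE_LIMITS = {
    "fedcourt": 1.0,
    "highcourt": 0.5,
    "nsw_caselaw": 1.0,
    "qld_judgments": 0.5,
    "auslaw_mcp": 0.3,
}


class SourceRateLimiter:
    """Spaces calls at least 1/rps seconds apart for a single source."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=asyncio.sleep):
        self._interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # The lock serialises concurrent callers so each one gets its own slot.
        async with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            if start > now:
                await self._sleep(start - now)
            self._next_allowed = start + self._interval
```

A fetcher would call `await limiter.wait()` immediately before each request, with one limiter instance per source ID. Passing the clock and sleep function in as parameters keeps the class testable without real delays.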
Windows Gotchas
- Use forward slashes in all Python paths (avoids `\t` tab interpolation)
- Neo4j runs in Docker: ensure Docker Desktop is running
- OpenAI API key required for embeddings: set the `OPENAI_API_KEY` env var
- Anthropic API key required for LLM extraction: set the `ANTHROPIC_API_KEY` env var
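
The path and API-key gotchas can be checked with a few lines at startup. This is a minimal sketch; the `data/temp` path and the `missing_env_vars` helper are hypothetical and used only for illustration.

```python
import os
from pathlib import Path

# "\t" inside an ordinary string literal is a tab, so this path is silently corrupted.
broken = "data\temp"

# Forward slashes work on Windows too and contain no escape sequences.
safe = Path("data/temp")

REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_env_vars(env=os.environ) -> list[str]:
    """Return the names of required API keys that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_env_vars()` before the first embedding or extraction request lets the pipeline fail fast with a clear message instead of partway through an ingest run.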