# CLAUDE.md — AuCourtIngest

Australian legal case ingestion pipeline. Discovers, fetches, parses, normalises, and graph-ingests Australian court decisions for AI jury/judge analysis.

**Spec:** `spec.md` is the authoritative design document. All architecture decisions, schemas, and source definitions live there.

## Tech Stack

| Component | Choice |
|-----------|--------|
| Language | Python 3.13 (`C:\Python313\python.exe`) |
| HTTP | httpx (async) |
| HTML parsing | BeautifulSoup4 |
| DOCX | python-docx |
| PDF text | pdfminer.six |
| PDF OCR | via AusLaw MCP (pytesseract) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector store | ChromaDB (local) |
| Graph DB | Neo4j (Docker) — primary for dev |
| Meta DB | SQLite via aiosqlite |
| LLM extraction | Claude Haiku 4.5 (structured metadata) |
| MCP client | mcp-python-sdk |
| Scheduling | APScheduler |
| Config | TOML |

## Architecture

Three-layer pipeline: **Source Layer** (court RSS/crawlers) → **Processing Layer** (parse → extract → chunk → embed → graph) → **Storage Layer** (DocStore + VectorIndex + PropertyGraph + MetaDB).

Four operational modes: `bootstrap`, `watch`, `backfill`, `audit`
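
The stage order in the Processing Layer can be pictured as a simple chain of callables. The sketch below is illustrative only: `Document`, `make_stage`, and `process` are hypothetical names, and each stage merely records that it ran rather than doing real parsing or embedding.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Hypothetical carrier object passed from stage to stage."""
    mnc: str
    raw: str
    trace: list[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return stage


# Stage order taken from the Processing Layer description above.
PIPELINE: list[Stage] = [
    make_stage(name) for name in ("parse", "extract", "chunk", "embed", "graph")
]


def process(doc: Document) -> Document:
    """Run every stage in order, feeding each stage's output to the next."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Keeping stages as plain callables would make it easy to test each one in isolation, though the real design lives in `spec.md`.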

## Key Concepts

- **MNC** (Medium Neutral Citation) — canonical case ID, e.g. `[2019] NSWSC 1234`
- **Chunk types**: opening, testimony, exhibit, ruling, closing, judgment, sentence
- **Graph node types**: Case, Charge, Witness, Exhibit, Judge, Ruling, Timeline, Chunk
- **Juror subgraph queries** — persona-driven graph traversal with per-persona anchor nodes, edge types, and chunk type filters
- **Exoneration sources** — RMIT Wrongful Convictions Register, Innocence Project Australia, Royal Commission findings
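
Since the MNC is the canonical case ID, a small parser for the `[year] COURT number` shape is a natural utility. The following is a minimal sketch, assuming court codes are purely alphabetic; the `MNC` tuple, `parse_mnc`, and `format_mnc` are hypothetical names and not taken from `spec.md`.

```python
import re
from typing import NamedTuple, Optional


class MNC(NamedTuple):
    year: int
    court: str
    number: int


# Assumed shape: "[2019] NSWSC 1234" -> year in brackets, court code, sequence number.
_MNC_RE = re.compile(r"\[(\d{4})\]\s+([A-Za-z]+)\s+(\d+)")


def parse_mnc(text: str) -> Optional[MNC]:
    """Return the first MNC found in text, or None if there is none."""
    match = _MNC_RE.search(text)
    if match is None:
        return None
    return MNC(int(match.group(1)), match.group(2), int(match.group(3)))


def format_mnc(mnc: MNC) -> str:
    """Render an MNC back into its canonical string form."""
    return f"[{mnc.year}] {mnc.court} {mnc.number}"
```

Round-tripping through `format_mnc(parse_mnc(...))` gives a normalised key even when the input has irregular spacing.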
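
A persona, as described in the juror subgraph bullet, bundles anchor nodes, edge types, and chunk type filters. One possible shape is sketched below; the `JurorPersona` fields, the `filter_chunks` helper, and the dict-based chunk records are assumptions made for illustration.

```python
from dataclasses import dataclass

# Chunk types as listed under Key Concepts.
CHUNK_TYPES = frozenset(
    {"opening", "testimony", "exhibit", "ruling", "closing", "judgment", "sentence"}
)


@dataclass(frozen=True)
class JurorPersona:
    """Hypothetical persona record; field names are illustrative."""
    name: str
    anchor_node_types: frozenset[str]
    edge_types: frozenset[str]
    chunk_types: frozenset[str]

    def __post_init__(self) -> None:
        # Reject chunk types that are not part of the known set.
        unknown = self.chunk_types - CHUNK_TYPES
        if unknown:
            raise ValueError(f"unknown chunk types: {sorted(unknown)}")


def filter_chunks(persona: JurorPersona, chunks: list[dict]) -> list[dict]:
    """Keep only chunks whose 'type' the persona is allowed to see."""
    return [chunk for chunk in chunks if chunk["type"] in persona.chunk_types]
```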

## Critical Invariants

1. **RULED_INADMISSIBLE exclusion** — enforced at graph traversal level, never at prompt level. All juror subgraph queries must exclude `RULED_INADMISSIBLE` edges.
2. **Suppression orders** — do not ingest content flagged with `suppression_order=True`
3. **Verdict ground truth** — OutcomeParser resolution priority: headnote → orders section → appeal cross-reference → manual flag
4. **Rate limits per source** — see spec section 7, enforced in orchestrator. Respect robots.txt.
5. **No shortcuts on parse errors** — on ParseError after retry, log to MetaDB, move to manual_review queue. Never silently skip.
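
To make invariant 1 concrete, here is a minimal in-memory sketch of a traversal that drops `RULED_INADMISSIBLE` edges before walking the graph, so nothing reachable only through such an edge can ever reach a prompt. It uses plain tuples instead of Neo4j; the edge names other than `RULED_INADMISSIBLE`, the node IDs, and `reachable_nodes` are hypothetical.

```python
from collections import deque
from typing import Iterable, Optional

EXCLUDED_EDGE_TYPES = frozenset({"RULED_INADMISSIBLE"})

Edge = tuple[str, str, str]  # (source node, edge type, target node)

# Illustrative data only: exhibit_b is linked solely through an excluded edge.
EXAMPLE_EDGES: list[Edge] = [
    ("case_1", "HAS_EXHIBIT", "exhibit_a"),
    ("exhibit_a", "DESCRIBED_IN", "chunk_3"),
    ("case_1", "RULED_INADMISSIBLE", "exhibit_b"),
    ("exhibit_b", "DESCRIBED_IN", "chunk_9"),
]


def reachable_nodes(
    edges: Iterable[Edge],
    anchor: str,
    allowed_edge_types: Optional[set[str]] = None,
) -> set[str]:
    """Breadth-first walk from anchor that never follows an excluded edge."""
    adjacency: dict[str, list[str]] = {}
    for source, edge_type, target in edges:
        # The exclusion is applied first, so a caller's allow-list cannot re-enable it.
        if edge_type in EXCLUDED_EDGE_TYPES:
            continue
        if allowed_edge_types is not None and edge_type not in allowed_edge_types:
            continue
        adjacency.setdefault(source, []).append(target)

    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Applying the exclusion while building the adjacency map, rather than filtering results afterwards, mirrors the "traversal level, never prompt level" rule.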
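
The resolution priority in invariant 3 amounts to a first-match-wins chain. The sketch below assumes each earlier step has already produced an optional string per source; the dict keys, the `"UNRESOLVED"` sentinel, and `resolve_outcome` are hypothetical and not the actual `OutcomeParser` interface.

```python
# Priority order from invariant 3; manual flag is the fallback when all are empty.
PRIORITY = ("headnote", "orders_section", "appeal_cross_reference")


def resolve_outcome(candidates: dict[str, str]) -> tuple[str, str]:
    """Return (outcome, source) from the highest-priority non-empty candidate."""
    for source in PRIORITY:
        outcome = candidates.get(source)
        if outcome:
            return outcome, source
    # Nothing resolved automatically, so the case is flagged for a human.
    return "UNRESOLVED", "manual_flag"
```

For example, a case with no headnote outcome but a clear orders section resolves from the orders section even if an appeal cross-reference disagrees.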
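
Invariant 5 could be enforced with a small wrapper around the parser call. This is a sketch under stated assumptions: the `ParseError` class is defined locally, and plain lists stand in for the MetaDB error log and the manual_review queue.

```python
from typing import Callable, Optional


class ParseError(Exception):
    """Local stand-in for the pipeline's parse failure exception."""


def parse_with_review(
    parse: Callable[[str], dict],
    raw: str,
    doc_id: str,
    error_log: list[tuple[str, str]],
    manual_review: list[str],
    retries: int = 1,
) -> Optional[dict]:
    """Try parsing, retry on ParseError, then log and queue for manual review."""
    last_error: Optional[ParseError] = None
    for _ in range(retries + 1):
        try:
            return parse(raw)
        except ParseError as exc:
            last_error = exc
    # Never skip silently: record the failure and hand the document to a human.
    error_log.append((doc_id, str(last_error)))
    manual_review.append(doc_id)
    return None


def always_fails(raw: str) -> dict:
    """Demo parser that fails every time."""
    raise ParseError("no judgment body")
```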
## Planned File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

## Development Commands

```bash
# Install dependencies (once project initialised)
pip install httpx beautifulsoup4 python-docx pdfminer.six chromadb neo4j aiosqlite anthropic mcp apscheduler tomli

# Run ingestion modes
python -m aucourt_ingest bootstrap   # initial bulk ingest
python -m aucourt_ingest watch       # continuous RSS polling
python -m aucourt_ingest backfill    # fill gaps
python -m aucourt_ingest audit       # re-process for schema updates

# Run tests
python -m pytest tests/

# Start Neo4j (Docker)
docker run -p 7687:7687 -p 7474:7474 neo4j
```
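
The mode argument in the commands above could be dispatched with `argparse`. This is a sketch, not the project's entry point: the handlers are placeholders that return the one-line descriptions from the command comments, and `build_parser`, `run_mode`, and `main` are hypothetical names. Note that `python -m aucourt_ingest` executes the package's `__main__.py`, so the sketch assumes a thin `__main__.py` that calls a `main()` like this one.

```python
import argparse
from typing import Callable, Optional, Sequence

# Placeholder handlers; real ones would call into the orchestrator.
HANDLERS: dict[str, Callable[[], str]] = {
    "bootstrap": lambda: "initial bulk ingest",
    "watch": lambda: "continuous RSS polling",
    "backfill": lambda: "fill gaps",
    "audit": lambda: "re-process for schema updates",
}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts exactly one of the four modes."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    parser.add_argument("mode", choices=sorted(HANDLERS), help="operational mode to run")
    return parser


def run_mode(mode: str) -> str:
    """Invoke the handler registered for the given mode."""
    return HANDLERS[mode]()


def main(argv: Optional[Sequence[str]] = None) -> str:
    """Parse arguments (sys.argv when argv is None) and dispatch."""
    args = build_parser().parse_args(argv)
    return run_mode(args.mode)
```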

## Data Sources

| Source | ID | Fetch Strategy | Rate Limit |
|--------|----|---------------|------------|
| Federal Court | `fedcourt` | RSS poll → DOCX download | 1 rps |
| High Court | `highcourt` | Index crawl | 0.5 rps |
| NSW Caselaw | `nsw_caselaw` | Browse pagination (MNC system) | 1 rps |
| QLD Judgments | `qld_judgments` | Search pagination | 0.5 rps |
| AusLaw MCP | `auslaw_mcp` | Targeted retrieval (not bulk) | 0.3 rps |

## Windows Gotchas

- Use forward slashes in all Python paths (avoids `\t` tab interpolation)
- Neo4j runs in Docker — ensure Docker Desktop is running
- OpenAI API key required for embeddings — set `OPENAI_API_KEY` env var
- Anthropic API key required for LLM extraction — set `ANTHROPIC_API_KEY` env var
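
The first and last gotchas can be shown in a few lines. The path `data\temp` is a made-up example chosen because it contains `\t`, and `require_env` is a hypothetical helper for failing early when an API key is missing.

```python
import os
from pathlib import Path
from typing import Mapping

# In a normal (non-raw) string literal, "\t" is a TAB, so this path is corrupted.
broken = "data\temp"
# Forward slashes have no escape meaning and work on Windows too.
safe = "data/temp"


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the variable's value or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value


# pathlib splits the forward-slash form into the expected components.
parts = Path(safe).parts
```

Calling `require_env("OPENAI_API_KEY")` and `require_env("ANTHROPIC_API_KEY")` at startup would surface a missing key before any ingestion work begins.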