# AuCourtIngest — Agent Specification v0.1
**Project:** Australian Legal Case Ingestion Pipeline
**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis
**Author:** Aaron King / Slothitude Games
**Date:** 2026-05-30
---
## 1. Overview
AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.
The agent has four operational modes:
| Mode | Trigger | Description |
|------|---------|-------------|
| bootstrap | Manual / first run | Historical bulk ingest from all tier-1 sources |
| watch | Cron / RSS poll | Continuous ingestion of new decisions |
| backfill | Manual | Fill gaps in existing corpus by date/court/charge-type |
| audit | Manual | Re-process existing documents to update graph schema |
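As a rough illustration of how the four modes in the table could be selected at start-up, the sketch below maps a command-line argument onto a handler. The handler functions and the `run` helper are hypothetical placeholders, not names defined by this spec.

```python
import argparse

MODES = ("bootstrap", "watch", "backfill", "audit")


# Hypothetical handlers: in a real build each would hand off to the orchestrator.
def run_bootstrap() -> str:
    return "bootstrap: historical bulk ingest"


def run_watch() -> str:
    return "watch: continuous ingestion of new decisions"


def run_backfill() -> str:
    return "backfill: fill gaps in existing corpus"


def run_audit() -> str:
    return "audit: re-process existing documents"


HANDLERS = {
    "bootstrap": run_bootstrap,
    "watch": run_watch,
    "backfill": run_backfill,
    "audit": run_audit,
}


def run(argv: list[str]) -> str:
    """Parse the mode from argv and call the matching handler."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    parser.add_argument("mode", choices=MODES)
    args = parser.parse_args(argv)
    return HANDLERS[args.mode]()
```

Calling `run(["watch"])` would select the watch handler; an unknown mode is rejected by `argparse` before any handler runs.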
---
## 2. Architecture
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│           rate limiting, error recovery)            │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│  SOURCE LAYER   │             │  PROCESSING LAYER   │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │    STORAGE LAYER    │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
---
## 3. Source Definitions
### 3.1 Federal Court of Australia
source\_id: fedcourt
base\_url: https://www.judgments.fedcourt.gov.au
rss\_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc\_formats: \[html, docx, pdf]
coverage\_from: 1977
full\_text\_from: 1995
docx\_from: 1995
pdf\_range: \[1977, 1994]
rate\_limit\_rps: 1
tos\_status: public\_domain
fetch\_strategy: rss\_poll\_then\_docx\_download
**Fetch logic:**
1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS <link>
3. Fetch HTML page → find DOCX download link (pattern: /judgments/fca/YYYY/NNN.docx)
4. Download DOCX → hand to DocParser
5. For 1977–1994: fetch PDF instead → hand to PDFParser
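Steps 1 and 2 of the fetch logic can be sketched with the standard library alone. This is a minimal illustration, assuming the feed is plain RSS 2.0 with one `<link>` per `<item>`; the function name `extract_new_links` and the in-memory `seen` set are hypothetical (a real run would presumably consult MetaDB instead).

```python
import xml.etree.ElementTree as ET


def extract_new_links(rss_xml: str, seen: set[str]) -> list[str]:
    """Return judgment URLs from the feed that are not yet in `seen`.

    `seen` is updated in place so that a second poll of the same feed
    yields nothing new.
    """
    root = ET.fromstring(rss_xml)
    new_links: list[str] = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            new_links.append(link)
    return new_links
```

Each returned URL would then go through steps 3 to 5 (locate the DOCX or PDF and hand it to the parser).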
**DOCX URL pattern:**
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
**Bootstrap pagination:**
https://www.fedcourt.gov.au/digital-law-library/judgments/search
?NeedleType=all\&txtSearch=\&category=criminal\&dateFrom=1995-01-01
\&dateTo=2026-12-31\&resultsPerPage=50\&pageNum={N}
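The paginated search URL above can be generated per page with `urllib.parse.urlencode`. The sketch below only rebuilds the query string shown in this section; the function name `bootstrap_page_url` is hypothetical, and whether the portal accepts exactly these parameters is taken from the spec rather than verified.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"


def bootstrap_page_url(
    page_num: int,
    date_from: str = "1995-01-01",
    date_to: str = "2026-12-31",
) -> str:
    """Build the search URL for one results page of the criminal category."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": 50,
        "pageNum": page_num,
    }
    # Dicts preserve insertion order, so the query matches the layout above.
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

A bootstrap loop would call this with `page_num` starting at 1 and stop when a page returns no results.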
---
### 3.2 High Court of Australia
source\_id: highcourt
base\_url: https://www.hcourt.gov.au
transcript\_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage\_from: 1994
doc\_formats: \[html, pdf]
rate\_limit\_rps: 0.5
tos\_status: public\_domain
fetch\_strategy: index\_crawl
**Fetch logic:**
1. Crawl /cases/recent-judgments — paginated list with matter numbers
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (/transcripts/ path)
4. Store judgment + transcript as separate documents, linked by matter\_id
**Priority filter for MVP:**
PRIORITY\_KEYWORDS = \[
  "murder", "manslaughter", "sexual assault", "robbery",
  "appeal allowed", "conviction quashed", "miscarriage of justice"
]
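One simple way to apply the priority filter is a case-insensitive substring count over the judgment text. This is a sketch under that assumption; `priority_score` and `is_priority` are hypothetical helper names, and the spec does not say whether matching should be done on the headnote, the catchwords, or the full text.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def priority_score(text: str) -> int:
    """Count how many distinct priority keywords occur in the text."""
    lowered = text.lower()
    return sum(1 for keyword in PRIORITY_KEYWORDS if keyword in lowered)


def is_priority(text: str) -> bool:
    """True when at least one priority keyword is present."""
    return priority_score(text) > 0
```

The score could feed the `priority` column of the fetch queue, with higher scores mapped to lower (more urgent) priority numbers.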
---
### 3.3 NSW Caselaw
source\_id: nsw\_caselaw
base\_url: https://www.caselaw.nsw.gov.au
browse\_url: https://www.caselaw.nsw.gov.au/browse
coverage\_from: 1988
doc\_formats: \[html, pdf]
rate\_limit\_rps: 1
tos\_status: open\_reproduction\_authorised
fetch\_strategy: mnc\_systematic + browse\_pagination
mnc\_pattern: "\[YYYY] {COURT} {N}"
**Fetch logic:**
1. Browse /browse?filter=criminal → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system
**MNC parsing:**
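# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = re.compile(r'\[(\d{4})\]\s+([A-Z]+)\s+(\d+)')

def parse_mnc(mnc: str) -> dict:
    """Split a Medium Neutral Citation into year, court and number."""
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        # Fail loudly rather than raising TypeError on m[1] below
        raise ValueError(f"not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}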
---
### 3.4 Queensland Judgments
source\_id: qld\_judgments
base\_url: https://www.queenslandjudgments.com.au
coverage: variable\_by\_court
doc\_formats: \[html, pdf]
rate\_limit\_rps: 0.5
tos\_status: free\_access
fetch\_strategy: search\_pagination
**Fetch logic:**
1. Use search API with matter\_type=criminal filter
2. Paginate results → extract case URLs
3. Fetch HTML per case
---
### 3.5 AusLaw MCP (Gap-Fill + Search)
source\_id: auslaw\_mcp
type: mcp\_server
repo: russellbrenner/auslaw-mcp
tools:
  - search\_austlii
  - search\_jade
  - fetch\_document\_text
  - search\_jade\_by\_citation
use\_case: targeted\_retrieval\_not\_bulk
rate\_limit: conservative # case-by-case only
**Use cases:**
- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage
**Tool call pattern:**
async def example_calls(mcp):
    # Retrieve by citation
    result = await mcp.call("fetch_document_text", {
        "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
    })
    # Search for cases matching charge type
    results = await mcp.call("search_austlii", {
        "query": "murder conviction appeal",
        "jurisdiction": "nsw",
        "limit": 20,
        "sortBy": "date"
    })
    return result, results
---
## 4. Document Processing Pipeline
### 4.1 DocParser
Accepts: HTML, DOCX, PDF
Outputs: normalised RawDocument struct
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str          # e.g. "nsw_caselaw"
    doc_id: str             # MNC or internal UUID
    url: str
    fetch_timestamp: str    # ISO8601
    raw_text: str           # full extracted text
    format: str             # html | docx | pdf
    pages: int | None
    char_count: int
**Format handlers:**
- HTML: BeautifulSoup → strip nav/header/footer → extract .judgment-body or equivalent
- DOCX: python-docx → iterate paragraphs, preserve heading structure
- PDF: pdfminer.six for text PDFs; pytesseract via AusLaw MCP OCR for scanned
---
### 4.2 MetaExtractor
Input: RawDocument
Output: CaseMeta struct
Method: Claude claude-haiku-4-5-20251001 via API (cheap, fast, structured)
@dataclass
class CaseMeta:
    # Identity
    case_name: str                     # e.g. "R v Smith"
    mnc: str                           # Medium Neutral Citation
    court: str                         # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str                # ISO8601
    jurisdiction: str                  # NSW | CTH | QLD | VIC etc
    # Classification
    matter_type: str                   # criminal | civil | appeal | coronial
    charges: list[str]                 # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]       # ["homicide", "violence"]
    # Outcome
    verdict: str                       # guilty | not_guilty | appeal_allowed |
                                       # appeal_dismissed | conviction_quashed |
                                       # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None               # "18 years NMP 13"
    outcome_notes: str                 # free text summary of result
    # Flags
    is_appeal: bool
    appeal_of: str | None              # MNC of original decision
    was_appealed: bool                 # filled in later by back-reference
    exoneration_flag: bool             # manual or derived from appeal outcome
    inadmissible_evidence: list[str]   # evidence ruled out
    suppression_order: bool            # whether publication restrictions apply
**Extraction prompt (system):**
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative — use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not\_guilty | appeal\_allowed |
appeal\_dismissed | conviction\_quashed | sentence\_varied | hung | civil\_judgment | n/a
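Because the prompt asks the model for JSON only, the caller still has to check what comes back. The sketch below shows one possible validation step, assuming the response body is a single JSON object; `parse_case_meta` is a hypothetical helper and it checks only the verdict vocabulary listed in the prompt, not the full CaseMeta schema.

```python
import json

ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_case_meta(raw_json: str) -> dict:
    """Parse the extractor's reply and reject verdicts outside the allowed set.

    A null verdict is accepted, since the prompt tells the model to use
    null when it cannot decide with confidence.
    """
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict value: {verdict!r}")
    return data
```

A reply that fails this check could be retried once and then routed to manual review, in line with the ParseError policy in section 8.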
---
### 4.3 OutcomeParser
Separate pass specifically for verdict/outcome ground truth. This is the most critical field — it's what the divergence signal compares against.
**Resolution priority:**
1. Explicit headnote — "HELD: conviction quashed"
2. Orders section — "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference — if later appeal found, update original record
4. Manual flag — exoneration\_flag=True sourced from external list
**External exoneration sources:**
- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions
These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.
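A static lookup keyed by defendant name and year might look like the sketch below. The key format (`name|year`, lower-cased with collapsed whitespace), the JSON field names, and the helper names are all assumptions for illustration; the entry used in the usage note is fictional.

```python
import json


def exoneration_key(defendant: str, year: int) -> str:
    """Normalise name and year into a single lookup key."""
    name = " ".join(defendant.lower().split())
    return f"{name}|{year}"


def load_register(raw_json: str) -> dict[str, dict]:
    """Index a JSON array of {defendant, year, ...} entries by key."""
    entries = json.loads(raw_json)
    return {exoneration_key(e["defendant"], e["year"]): e for e in entries}


def is_exonerated(register: dict[str, dict], defendant: str, year: int) -> bool:
    """True when the defendant/year pair appears in the register."""
    return exoneration_key(defendant, year) in register
```

For example, loading `[{"defendant": "Jane Example", "year": 2001}]` (a made-up entry) would make `is_exonerated(register, "jane  example", 2001)` return True. Name collisions between different defendants in the same year are not handled by this key.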
---
### 4.4 ChunkEngine
Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.
**Chunk types and sizes:**
| Chunk Type | Content | Target Tokens | Overlap |
|------------|---------|---------------|---------|
| opening | Case intro, charges, parties | 400–600 | none |
| testimony | Each witness examination block | 300–500 | 50 |
| exhibit | Evidence description + ruling | 200–400 | none |
| ruling | Each evidentiary ruling by judge | 200–300 | none |
| closing | Closing submissions summary | 400–600 | 50 |
| judgment | Judicial reasoning blocks | 400–600 | 50 |
| sentence | Sentencing remarks | 300–500 | none |
**Chunk struct:**
@dataclass
class Chunk:
    chunk_id: str                          # UUID
    doc_id: str                            # parent document MNC
    chunk_type: str                        # from table above
    sequence: int                          # position in document
    text: str
    token_count: int
    speaker: str | None                    # witness name if testimony
    page_ref: str | None                   # "p.47" if available
    embedding: list[float] | None = None   # 1536-dim, filled by EmbedEngine
**Chunking strategy:**
Use Claude claude-haiku-4-5-20251001 to identify structural boundaries:
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section\_type, start\_char, end\_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
Then split at identified boundaries, keeping each chunk within token budget.
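The split step can be sketched as follows, assuming the model has already returned the section array. Token counts are approximated by whitespace-separated words, which is an assumption (a real tokenizer would differ), and the overlap column from the table above is not implemented here. The function names are hypothetical.

```python
def split_section(text: str, max_tokens: int) -> list[str]:
    """Cut one section into pieces of at most max_tokens words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def chunk_by_sections(
    raw_text: str,
    sections: list[dict],
    max_tokens: int = 600,
) -> list[dict]:
    """Split raw_text at the given boundaries, keeping each piece in budget.

    Each section dict carries section_type, start_char, end_char and an
    optional speaker, as in the prompt's JSON format.
    """
    chunks: list[dict] = []
    for section in sections:
        body = raw_text[section["start_char"]:section["end_char"]]
        for piece in split_section(body, max_tokens):
            chunks.append({
                "chunk_type": section["section_type"],
                "speaker": section.get("speaker"),
                "sequence": len(chunks),  # running position in the document
                "text": piece,
                "token_count": len(piece.split()),
            })
    return chunks
```

In practice `max_tokens` would be chosen per chunk type from the table rather than fixed at 600.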
---
### 4.5 EmbedEngine
model: text-embedding-3-small # OpenAI, 1536-dim, cheap
batch\_size: 100
store: chromadb # local, no infra
collection\_naming: "au\_cases\_{source\_id}"
Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.
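The `batch_size` and `collection_naming` settings translate into two small helpers. This sketch covers only the batching and naming logic; the actual embedding request and the ChromaDB write are left out, and both function names are hypothetical.

```python
from collections.abc import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def collection_name(source_id: str) -> str:
    """Apply the au_cases_{source_id} naming rule from the config above."""
    return f"au_cases_{source_id}"
```

With 250 chunk texts this yields batches of 100, 100 and 50, so each embedding request stays within the configured batch size.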
**Query interface:**
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10,
) -> list[Chunk]:
    ...
---
### 4.6 GraphBuilder
Builds the property graph from extracted metadata + chunk relationships.
**Node types:**
(:Case {mnc, court, date, jurisdiction, matter\_type, verdict, exoneration\_flag})
(:Charge {text, category, act, section})
(:Witness {name, role}) # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome}) # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk\_id, type, text\_preview})
**Relationship types:**
(:Case)-\[:CHARGED\_WITH]->(:Charge)
(:Case)-\[:HEARD\_BY]->(:Judge)
(:Case)-\[:INCLUDES\_TESTIMONY {credibility\_score}]->(:Witness)
(:Case)-\[:HAS\_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-\[:HAS\_RULING]->(:Ruling)
(:Case)-\[:APPEALS]->(:Case) # appeal chain
(:Witness)-\[:GAVE\_TESTIMONY]->(:Chunk)
(:Exhibit)-\[:DESCRIBED\_IN]->(:Chunk)
(:Ruling)-\[:CONCERNS]->(:Exhibit)
(:Ruling)-\[:CONCERNS]->(:Witness)
(:Chunk)-\[:FOLLOWS]->(:Chunk) # sequence chain
(:Chunk)-\[:CORROBORATES]->(:Chunk) # semantic similarity > 0.85
(:Chunk)-\[:CONTRADICTS]->(:Chunk) # semantic similarity < 0.25 on same topic
**Graph backend:**
- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability
**CORROBORATES / CONTRADICTS edge generation:**
# After all chunks are embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
for a, b in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(a.embedding, b.embedding)
    if sim > 0.85:
        graph.add_edge(a, b, "CORROBORATES", weight=sim)
    elif sim < 0.25 and same_topic(a, b):
        graph.add_edge(a, b, "CONTRADICTS", weight=sim)
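The edge generation above relies on a `cosine_similarity` helper that the spec does not define. A dependency-free version could look like this; returning 0.0 for a zero-length vector is a choice made here for illustration, not something the spec prescribes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero embedding as unrelated
    return dot / (norm_a * norm_b)
```

For 1536-dimensional embeddings and many chunks per case, a vectorised implementation would likely be preferable, since the pairwise loop grows quadratically with the number of chunks.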
---
## 5. Juror Subgraph Query Interface
This is the interface the jury agent calls at deliberation time.
def get\_juror\_context(
  case\_mnc: str,
  persona: JurorPersona,
  max\_tokens: int = 4000
) -> JurorContext:
  """
  Traverse the graph from the persona's anchor node types,
  follow relevant edge types, collect chunks up to token budget.
  Returns a compact context string + list of source node IDs.
  """
**Persona → graph traversal mapping:**
PERSONA\_TRAVERSAL = {
  "nurse": {
  "anchor\_nodes": \["Witness\[role=expert]", "Exhibit\[category=medical]"],
  "edge\_types": \["GAVE\_TESTIMONY", "DESCRIBED\_IN", "CORROBORATES"],
  "chunk\_types": \["testimony", "exhibit"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  },
  "accountant": {
  "anchor\_nodes": \["Exhibit\[category=financial]", "Witness\[role=expert]"],
  "edge\_types": \["DESCRIBED\_IN", "HAS\_RULING"],
  "chunk\_types": \["exhibit", "ruling"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  },
  "skeptic": {
  "anchor\_nodes": \["Timeline", "Witness"],
  "edge\_types": \["CONTRADICTS", "HAS\_RULING", "GAVE\_TESTIMONY"],
  "chunk\_types": \["testimony", "ruling"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  },
  "ex\_cop": {
  "anchor\_nodes": \["Ruling", "Exhibit\[category=forensic]"],
  "edge\_types": \["HAS\_RULING", "DESCRIBED\_IN", "CORROBORATES"],
  "chunk\_types": \["ruling", "exhibit", "judgment"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  },
  "empath": {
  "anchor\_nodes": \["Witness\[role=prosecution]", "Witness\[role=accused]"],
  "edge\_types": \["GAVE\_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
  "chunk\_types": \["testimony", "closing"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  },
  "foreman": {
  "anchor\_nodes": \["Charge", "Ruling", "Judge"],
  "edge\_types": \["CHARGED\_WITH", "HAS\_RULING", "HEARD\_BY"],
  "chunk\_types": \["opening", "judgment", "sentence"],
  "exclude\_edges": \["RULED\_INADMISSIBLE"]
  }
}
**Critical invariant:** RULED\_INADMISSIBLE edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
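One way to enforce this at traversal level is shown below on a plain in-memory edge list. It rests on an assumed reading of the invariant: excluded edge types are never followed, and any node that is the target of an excluded edge is dropped from the result even if another edge reaches it. The `(source, edge_type, target)` tuple format and the function name `traverse` are hypothetical; a real implementation would express the same filter in the graph query itself.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source node, edge type, target node)


def traverse(
    edges: list[Edge],
    anchors: list[str],
    allowed_edges: set[str],
    exclude_edges: set[str],
) -> list[str]:
    """Breadth-first walk from the anchors that never crosses the wall."""
    # Nodes marked by an excluded edge are blocked outright (assumption).
    blocked = {dst for _, etype, dst in edges if etype in exclude_edges}
    adjacency: dict[str, list[str]] = {}
    for src, etype, dst in edges:
        # Exclusion wins even if a type is listed in both sets.
        if etype in allowed_edges and etype not in exclude_edges:
            adjacency.setdefault(src, []).append(dst)

    visited: list[str] = []
    seen: set[str] = set()
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        if node in seen or node in blocked:
            continue
        seen.add(node)
        visited.append(node)
        queue.extend(adjacency.get(node, []))
    return visited
```

Because the filter runs before any chunk text is collected, a prompt-level instruction cannot reintroduce excluded material.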
---
## 6. MetaDB Schema (SQLite)
Tracks ingestion state — what's been fetched, processed, failed.
CREATE TABLE documents (
  doc\_id TEXT PRIMARY KEY, -- MNC or UUID
  source\_id TEXT NOT NULL,
  url TEXT NOT NULL,
  fetch\_status TEXT, -- pending|fetched|parsed|embedded|graphed|failed
  fetch\_timestamp TEXT,
  parse\_timestamp TEXT,
  embed\_timestamp TEXT,
  graph\_timestamp TEXT,
  error\_message TEXT,
  char\_count INTEGER,
  chunk\_count INTEGER,
  matter\_type TEXT,
  court TEXT,
  year INTEGER,
  verdict TEXT,
  exoneration\_flag INTEGER DEFAULT 0
);
CREATE TABLE fetch\_queue (
  queue\_id INTEGER PRIMARY KEY AUTOINCREMENT,
  source\_id TEXT,
  url TEXT,
  priority INTEGER DEFAULT 5, -- 1=highest
  queued\_at TEXT,
  attempts INTEGER DEFAULT 0
);
CREATE TABLE source\_state (
  source\_id TEXT PRIMARY KEY,
  last\_poll TEXT,
  last\_rss\_etag TEXT,
  docs\_fetched INTEGER DEFAULT 0,
  docs\_failed INTEGER DEFAULT 0
);
---
## 7. Rate Limiting & Politeness
RATE\_LIMITS = {
  "fedcourt": {"rps": 1.0, "concurrent": 2, "retry\_after": 60},
  "highcourt": {"rps": 0.5, "concurrent": 1, "retry\_after": 120},
  "nsw\_caselaw": {"rps": 1.0, "concurrent": 2, "retry\_after": 60},
  "qld\_judgments": {"rps": 0.5, "concurrent": 1, "retry\_after": 120},
  "auslaw\_mcp": {"rps": 0.3, "concurrent": 1, "retry\_after": 180},
}
# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # DOCX uses the OOXML media type; application/msword covers only legacy .doc
    "Accept": "text/html,application/xhtml+xml,application/pdf,"
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}
# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
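The `rps` values above imply a minimum gap between requests to each source. A minimal limiter for one source could look like the following sketch. It is synchronous for simplicity (the spec's HTTP client is async, where `asyncio.sleep` would take the place of `time.sleep`), it ignores the `concurrent` and `retry_after` settings, and the class name is hypothetical. The clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time
from collections.abc import Callable


class MinIntervalLimiter:
    """Space calls at least 1/rps seconds apart."""

    def __init__(
        self,
        rps: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rps <= 0:
            raise ValueError("rps must be positive")
        self.interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay used."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot relative to when this call proceeds.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

One limiter instance per `source_id` would be created from `RATE_LIMITS`, for example `MinIntervalLimiter(RATE_LIMITS["highcourt"]["rps"])`, and `wait()` called before each fetch.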
---
## 8. Error Handling & Recovery
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass
RETRY\_POLICY = {
  FetchError: {"max\_retries": 3, "backoff": "exponential", "base": 30},
  RateLimitError: {"max\_retries": 5, "backoff": "exponential", "base": 120},
  ParseError: {"max\_retries": 1, "backoff": "fixed", "base": 0},
}
\# On ParseError after retry: log to MetaDB, move to manual\_review queue
\# On persistent FetchError: mark source as degraded, alert via Telegram
**Telegram alerting** (reuse Lulu's Telegram integration):
ALERTS = {
  "source\_degraded": "⚠️ AuCourtIngest: {source\_id} failing for {minutes}min",
  "daily\_summary": "📊 Ingested: {new\_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
  "milestone": "🎯 Corpus reached {N} cases",
}
---
## 9. Bootstrap Run Plan
Priority order for initial corpus build:
Phase 1 — Week 1 (no permissions needed):
  \[ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  \[ ] High Court judgments 2000–2026, criminal appeals
  \[ ] NSW Caselaw criminal browse, 2010–2026
  Target: \~5,000 cases
Phase 2 — Week 2:
  \[ ] Federal Court DOCX backfill 1995–2009
  \[ ] QLD Judgments criminal, 2000–2026
  \[ ] VIC via AusLaw MCP targeted search
  Target: \~15,000 cases total
Phase 3 — Ongoing:
  \[ ] RSS watch mode all sources
  \[ ] AustLII partnership email → if approved, full backfill
  \[ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
---
## 10. Output Contracts
**DocStore:** /data/docs/{source\_id}/{year}/{mnc}.json
{
  "doc\_id": "\[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": \[ ...Chunk\[] ... ],
  "raw\_text\_path": "/data/raw/nsw\_caselaw/2019/NSWSC\_1234.html"
}
**VectorIndex:** ChromaDB collections named au\_cases\_{source\_id} (one per source, as configured in section 4.5), filterable by chunk\_type and doc\_id
**PropertyGraph:** Neo4j au\_legal database, accessible via Bolt protocol
**MetaDB:** SQLite at /data/meta.db
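An MNC such as `[2019] NSWSC 1234` contains brackets and spaces, so the DocStore path needs a file-safe form. The sketch below mirrors the `NSWSC_1234` style used in the `raw_text_path` example above; applying the same style to the `.json` file name is an assumption, as is the helper name `doc_store_path`.

```python
import re
from pathlib import PurePosixPath

MNC_RE = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")


def doc_store_path(source_id: str, mnc: str, root: str = "/data/docs") -> PurePosixPath:
    """Map a source and MNC to /data/docs/{source_id}/{year}/{COURT}_{N}.json."""
    match = MNC_RE.fullmatch(mnc.strip())
    if match is None:
        raise ValueError(f"not a valid MNC: {mnc!r}")
    year, court, number = match.groups()
    return PurePosixPath(root) / source_id / year / f"{court}_{number}.json"
```

For `doc_store_path("nsw_caselaw", "[2019] NSWSC 1234")` this gives `/data/docs/nsw_caselaw/2019/NSWSC_1234.json`. Documents identified by an internal UUID instead of an MNC would need a separate rule.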
---
## 11. Tech Stack
Language: Python 3.12
HTTP: httpx (async)
HTML parsing: BeautifulSoup4
DOCX: python-docx
PDF text: pdfminer.six
PDF OCR: via AusLaw MCP (pytesseract wrapper)
Embeddings: openai text-embedding-3-small
Vector store: chromadb (local)
Graph DB: neo4j (Docker) or ryu\_graph (Windows)
SQLite: aiosqlite
LLM extraction: anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client: mcp-python-sdk
Scheduling: APScheduler
Alerting: Telegram Bot API (reuse Lulu token)
Config: TOML (pyproject-style)
---
## 12. File Structure
aucourt\_ingest/
├── main.py # entry point, mode dispatch
├── config.toml # source configs, rate limits, paths
├── orchestrator.py # main loop, queue management
├── sources/
│ ├── base.py # BaseSource ABC
│ ├── fedcourt.py
│ ├── highcourt.py
│ ├── nsw\_caselaw.py
│ ├── qld\_judgments.py
│ └── auslaw\_mcp.py
├── processing/
│ ├── doc\_parser.py
│ ├── meta\_extractor.py
│ ├── outcome\_parser.py
│ ├── chunk\_engine.py
│ ├── embed\_engine.py
│ └── graph\_builder.py
├── storage/
│ ├── doc\_store.py
│ ├── vector\_index.py
│ ├── graph\_db.py
│ └── meta\_db.py
├── jury/
│ ├── personas.py # JurorPersona definitions
│ └── subgraph\_query.py # get\_juror\_context()
├── utils/
│ ├── rate\_limiter.py
│ ├── retry.py
│ ├── telegram.py
│ └── mnc\_parser.py
├── data/
│ ├── docs/
│ ├── raw/
│ └── meta.db
└── tests/
  ├── test\_parsers.py
  ├── test\_meta\_extractor.py
  └── test\_graph\_builder.py
---
## 13. Open Questions / Decisions Deferred
1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both, tag chunk source type.
3. **Suppression orders:** Some cases are subject to partial suppression. Track this with a suppression\_order flag in MetaDB and do not ingest suppressed content. Check AustLII suppression notices as a reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now — low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap but ada-002 or Voyage Law may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.
---
*End of spec v0.1*