# AuCourtIngest: Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline

**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis

**Author:** Aaron King / Slothitude Games

**Date:** 2026-05-30

---

## 1. Overview

AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.

The agent has four operational modes:

| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│              rate limiting, error recovery)         │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│  SOURCE LAYER   │             │  PROCESSING LAYER   │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │   STORAGE LAYER     │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
```

---

## 3. Source Definitions

### 3.1 Federal Court of Australia

```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```

**Fetch logic:**

1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from the RSS `<link>` element
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977–1994: fetch PDF instead → hand to DocParser (PDF handler)

**DOCX URL pattern:**

```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```

**Bootstrap pagination:**

```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
  ?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
  &dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```

---

### 3.2 High Court of Australia

```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```

**Fetch logic:**

1. Crawl `/cases/recent-judgments`, a paginated list with matter numbers
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```

---

### 3.3 NSW Caselaw

```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```

**Fetch logic:**

1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout the system

**MNC parsing:**

```python
# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = re.compile(r'\[(\d{4})\]\s+([A-Z]+)\s+(\d+)')

def parse_mnc(mnc: str) -> dict:
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        # Fail loudly instead of raising TypeError on m[1]
        raise ValueError(f"Not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```

---

### 3.4 Queensland Judgments

```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```

**Fetch logic:**

1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)

```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```

**Use cases:**

- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**

```python
# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---

## 4. Document Processing Pipeline

### 4.1 DocParser

Accepts: HTML, DOCX, PDF

Outputs: normalised `RawDocument` struct

```python
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str          # e.g. "nsw_caselaw"
    doc_id: str             # MNC or internal UUID
    url: str
    fetch_timestamp: str    # ISO8601
    raw_text: str           # full extracted text
    format: str             # html | docx | pdf
    pages: int | None
    char_count: int
```

**Format handlers:**

- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned PDFs

---

### 4.2 MetaExtractor

Input: `RawDocument`

Output: `CaseMeta` struct

Method: Claude Haiku (`claude-haiku-4-5-20251001`) via API (cheap, fast, structured)

```python
from dataclasses import dataclass

@dataclass
class CaseMeta:
    # Identity
    case_name: str                    # e.g. "R v Smith"
    mnc: str                          # Medium Neutral Citation
    court: str                        # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str               # ISO8601
    jurisdiction: str                 # NSW | CTH | QLD | VIC etc

    # Classification
    matter_type: str                  # criminal | civil | appeal | coronial
    charges: list[str]                # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]      # ["homicide", "violence"]

    # Outcome
    verdict: str                      # guilty | not_guilty | appeal_allowed |
                                      # appeal_dismissed | conviction_quashed |
                                      # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None              # "18 years NMP 13"
    outcome_notes: str                # free text summary of result

    # Flags
    is_appeal: bool
    appeal_of: str | None             # MNC of original decision
    was_appealed: bool                # filled in later by back-reference
    exoneration_flag: bool            # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool           # whether publication restrictions apply
```

**Extraction prompt (system):**

```
You are a legal metadata extractor. Given the full text of an Australian court
judgment, extract structured metadata in JSON. Output ONLY valid JSON matching
the CaseMeta schema. Be conservative: use null for fields you cannot determine
with confidence. For verdict, use only these values:
guilty | not_guilty | appeal_allowed | appeal_dismissed | conviction_quashed |
sentence_varied | hung | civil_judgment | n/a
```

---

### 4.3 OutcomeParser

Separate pass specifically for verdict/outcome ground truth. This is the most critical field: it is what the divergence signal compares against.

**Resolution priority:**

1. Explicit headnote, e.g. "HELD: conviction quashed"
2. Orders section, e.g. "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference: if a later appeal is found, update the original record
4. Manual flag: `exoneration_flag=True`, sourced from an external list

**External exoneration sources:**

- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.

---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk Type | Content | Target Tokens | Overlap (tokens) |
|------------|---------|---------------|------------------|
| `opening` | Case intro, charges, parties | 400–600 | none |
| `testimony` | Each witness examination block | 300–500 | 50 |
| `exhibit` | Evidence description + ruling | 200–400 | none |
| `ruling` | Each evidentiary ruling by judge | 200–300 | none |
| `closing` | Closing submissions summary | 400–600 | 50 |
| `judgment` | Judicial reasoning blocks | 400–600 | 50 |
| `sentence` | Sentencing remarks | 300–500 | none |

**Chunk struct:**

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str            # UUID
    doc_id: str              # parent document MNC
    chunk_type: str          # from table above
    sequence: int            # position in document
    text: str
    token_count: int
    speaker: str | None      # witness name if testimony
    page_ref: str | None     # "p.47" if available
    embedding: list[float]   # 1536-dim, filled by EmbedEngine
```

**Chunking strategy:**

Use Claude Haiku (`claude-haiku-4-5-20251001`) to identify structural boundaries:

```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```

Then split at the identified boundaries, keeping each chunk within its token budget.

---

### 4.5 EmbedEngine

```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb  # local, no infra
collection_naming: "au_cases_{source_id}"
```

Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.

**Query interface:**

```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```

---

### 4.6 GraphBuilder

Builds the property graph from extracted metadata + chunk relationships.

**Node types:**

```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})            # role: prosecution|defence|expert|accused
(:Exhibit {id, description, category, admitted})  # category: medical|financial|forensic
(:Judge {name, court})
(:Ruling {type, outcome})          # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**

```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)                # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)              # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)         # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)          # semantic similarity < 0.25 on same topic
```

**Graph backend:**

- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**

```python
# After all chunks are embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
for pair in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(pair[0].embedding, pair[1].embedding)
    if sim > 0.85:
        graph.add_edge(pair[0], pair[1], "CORROBORATES", weight=sim)
    elif same_topic(pair[0], pair[1]) and sim < 0.25:
        graph.add_edge(pair[0], pair[1], "CONTRADICTS", weight=sim)
```

---

## 5. Juror Subgraph Query Interface

This is the interface the jury agent calls at deliberation time.

```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**

```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```

**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.

---

## 6. MetaDB Schema (SQLite)

Tracks ingestion state: what has been fetched, processed, or failed.

```sql
CREATE TABLE documents (
    doc_id           TEXT PRIMARY KEY,   -- MNC or UUID
    source_id        TEXT NOT NULL,
    url              TEXT NOT NULL,
    fetch_status     TEXT,               -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp  TEXT,
    parse_timestamp  TEXT,
    embed_timestamp  TEXT,
    graph_timestamp  TEXT,
    error_message    TEXT,
    char_count       INTEGER,
    chunk_count      INTEGER,
    matter_type      TEXT,
    court            TEXT,
    year             INTEGER,
    verdict          TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id  TEXT,
    url        TEXT,
    priority   INTEGER DEFAULT 5,        -- 1=highest
    queued_at  TEXT,
    attempts   INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id     TEXT PRIMARY KEY,
    last_poll     TEXT,
    last_rss_etag TEXT,
    docs_fetched  INTEGER DEFAULT 0,
    docs_failed   INTEGER DEFAULT 0
);
```

---

## 7. Rate Limiting & Politeness

```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3, "concurrent": 1, "retry_after": 180},
}

# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # Cover every format the sources serve: HTML, PDF, and DOCX
    # (application/msword only matches legacy .doc files)
    "Accept": (
        "text/html,application/xhtml+xml,application/pdf,"
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ),
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```

---

## 8. Error Handling & Recovery

```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:     {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError: {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:     {"max_retries": 1, "backoff": "fixed", "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```

**Telegram alerting** (reuse Lulu's Telegram integration):

```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary": "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone": "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan

Priority order for initial corpus build:

```
Phase 1, Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  [ ] High Court judgments 2000–2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010–2026
  Target: ~5,000 cases

Phase 2, Week 2:
  [ ] Federal Court DOCX backfill 1995–2009
  [ ] QLD Judgments criminal, 2000–2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total

Phase 3, Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---

## 10. Output Contracts

**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`

```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```

**VectorIndex:** ChromaDB collections `au_cases_{source_id}` (one per source), filterable by `chunk_type` and `doc_id`

**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol

**MetaDB:** SQLite at `/data/meta.db`

---

## 11. Tech Stack

```
Language:        Python 3.12
HTTP:            httpx (async)
HTML parsing:    BeautifulSoup4
DOCX:            python-docx
PDF text:        pdfminer.six
PDF OCR:         via AusLaw MCP (pytesseract wrapper)
Embeddings:      openai text-embedding-3-small
Vector store:    chromadb (local)
Graph DB:        neo4j (Docker) or ryu_graph (Windows)
SQLite:          aiosqlite
LLM extraction:  anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:      mcp-python-sdk
Scheduling:      APScheduler
Alerting:        Telegram Bot API (reuse Lulu token)
Config:          TOML (pyproject-style)
```

---

## 12. File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---

## 13. Open Questions / Decisions Deferred

1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both and tag each chunk with its source type.
3. **Suppression orders:** Some cases have partial suppression. Track this with the `suppression_order` flag in `CaseMeta`. Do not ingest suppressed content. Check AustLII suppression notices as reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap, but ada-002 or Voyage Law may give better legal-domain performance. Evaluate on retrieval quality after the first 1,000 cases.

---

*End of spec v0.1*