Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph), property graph with 8 node types, juror subgraph queries with 6 personas, orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# AuCourtIngest: Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline
**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis
**Author:** Aaron King / Slothitude Games
**Date:** 2026-05-30

---

## 1. Overview

AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.

The agent has four operational modes:

| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |

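
The mode table can be read as a dispatch map. The sketch below is one possible shape for the entry point, not the project's actual `main.py`: the handler functions and their return strings are hypothetical placeholders, and only the four mode names come from the table.

```python
import argparse

# The four modes from the table above.
MODES = ("bootstrap", "watch", "backfill", "audit")


# Hypothetical handlers: each stands in for the real mode implementation.
def run_bootstrap() -> str:
    return "historical bulk ingest"


def run_watch() -> str:
    return "continuous ingestion of new decisions"


def run_backfill() -> str:
    return "gap fill by date/court/charge-type"


def run_audit() -> str:
    return "re-process existing documents"


HANDLERS = {
    "bootstrap": run_bootstrap,
    "watch": run_watch,
    "backfill": run_backfill,
    "audit": run_audit,
}


def main(argv: list) -> str:
    """Parse the mode from the command line and call its handler."""
    parser = argparse.ArgumentParser(prog="aucourt_ingest")
    parser.add_argument("mode", choices=MODES, help="operational mode to run")
    args = parser.parse_args(argv)
    return HANDLERS[args.mode]()
```

Restricting `mode` through `choices` means an unknown mode is rejected by the argument parser before any handler runs.
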

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│           rate limiting, error recovery)            │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│ SOURCE LAYER    │             │ PROCESSING LAYER    │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │ STORAGE LAYER       │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
```

---

## 3. Source Definitions

### 3.1 Federal Court of Australia

```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```

**Fetch logic:**
1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS `<link>`
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977–1994: fetch PDF instead → hand to DocParser's PDF handler

**DOCX URL pattern:**
```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```

**Bootstrap pagination:**
```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
  ?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
  &dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```

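
The two URL patterns above can be generated rather than hand-assembled. This is a minimal sketch, assuming the query parameters are exactly those shown in the pagination pattern; the function names are hypothetical, and whether `{index}` is zero-padded is not stated in the spec, so it is passed through unchanged.

```python
from urllib.parse import urlencode

DOCX_BASE = "https://www.judgments.fedcourt.gov.au/judgments/fca"
SEARCH_BASE = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"


def docx_url(year: int, index) -> str:
    """Fill the DOCX URL pattern; `index` is used as given (padding unknown)."""
    return f"{DOCX_BASE}/{year}/{index}.docx"


def bootstrap_page_url(
    page_num: int,
    date_from: str = "1995-01-01",
    date_to: str = "2026-12-31",
    per_page: int = 50,
) -> str:
    """Build one page of the bootstrap search, mirroring the pattern above."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": per_page,
        "pageNum": page_num,
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

Using `urlencode` keeps the parameter order of the dictionary and handles escaping if a search term is later added to `txtSearch`.
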

---

### 3.2 High Court of Australia

```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```

**Fetch logic:**
1. Crawl `/cases/recent-judgments`, a paginated list with matter numbers
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**
```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```

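
The spec does not say how the keyword list is applied. One simple reading is a case-insensitive substring match over the judgment text, sketched below; `priority_hits` and `is_priority` are hypothetical helper names, and substring matching is an assumption rather than the stated method.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def priority_hits(text: str) -> list:
    """Return the priority keywords found in `text`, in list order."""
    lowered = text.lower()
    return [kw for kw in PRIORITY_KEYWORDS if kw in lowered]


def is_priority(text: str) -> bool:
    """True when at least one priority keyword appears in the text."""
    return bool(priority_hits(text))
```

Returning the matched keywords, not only a boolean, leaves room to rank queue entries by how many keywords matched.
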

---

### 3.3 NSW Caselaw

```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```

**Fetch logic:**
1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system

**MNC parsing:**
```python
# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")

def parse_mnc(mnc: str) -> dict:
    """Split a medium neutral citation into year, court, and number."""
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        # re.match returns None on malformed input; fail loudly instead of crashing on m[1]
        raise ValueError(f"Not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```


---

### 3.4 Queensland Judgments

```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```

**Fetch logic:**
1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)

```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```

**Use cases:**
- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**
```python
# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---

## 4. Document Processing Pipeline

### 4.1 DocParser

Accepts: HTML, DOCX, PDF
Outputs: normalised `RawDocument` struct

```python
from dataclasses import dataclass


@dataclass
class RawDocument:
    source_id: str        # e.g. "nsw_caselaw"
    doc_id: str           # MNC or internal UUID
    url: str
    fetch_timestamp: str  # ISO8601
    raw_text: str         # full extracted text
    format: str           # html | docx | pdf
    pages: int | None
    char_count: int
```

**Format handlers:**
- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned PDFs

---

### 4.2 MetaExtractor

Input: `RawDocument`
Output: `CaseMeta` struct
Method: Claude Haiku (`claude-haiku-4-5-20251001`) via the Anthropic API (cheap, fast, structured)

```python
from dataclasses import dataclass


@dataclass
class CaseMeta:
    # Identity
    case_name: str                    # e.g. "R v Smith"
    mnc: str                          # Medium Neutral Citation
    court: str                        # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str               # ISO8601
    jurisdiction: str                 # NSW | CTH | QLD | VIC etc

    # Classification
    matter_type: str                  # criminal | civil | appeal | coronial
    charges: list[str]                # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]      # ["homicide", "violence"]

    # Outcome
    verdict: str                      # guilty | not_guilty | appeal_allowed |
                                      # appeal_dismissed | conviction_quashed |
                                      # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None              # "18 years NMP 13"
    outcome_notes: str                # free text summary of result

    # Flags
    is_appeal: bool
    appeal_of: str | None             # MNC of original decision
    was_appealed: bool                # filled in later by back-reference
    exoneration_flag: bool            # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool           # whether publication restrictions apply
```

**Extraction prompt (system):**
```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative: use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```

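
Because the prompt asks for JSON only and restricts `verdict` to a fixed vocabulary, the model's reply can be checked before it reaches the graph. The sketch below is one possible validation step, not part of the spec: `parse_case_meta` is a hypothetical helper, and treating `null` as acceptable follows the prompt's instruction to use null when unsure.

```python
import json

# Verdict vocabulary copied from the extraction prompt above.
ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_case_meta(raw: str) -> dict:
    """Parse the model's JSON reply and reject verdicts outside the vocabulary."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    # null is allowed: the prompt tells the model to use it when unsure
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict value: {verdict!r}")
    return data
```

A reply that fails this check could be routed through the `ParseError` retry path described in section 8.
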

---

### 4.3 OutcomeParser

Separate pass specifically for verdict/outcome ground truth. This is the most critical field: it is what the divergence signal compares against.

**Resolution priority:**
1. Explicit headnote: "HELD: conviction quashed"
2. Orders section: "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference: if later appeal found, update original record
4. Manual flag: `exoneration_flag=True` sourced from external list

**External exoneration sources:**
- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.

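
The resolution priority above amounts to "take the first source that produced a value". A minimal sketch of that rule and of the name + year lookup follows. The source labels, the `name|year` key format, and the function names are all assumptions for illustration; the spec only fixes the ordering and the idea of a static lookup table.

```python
# Sources in priority order, highest first (labels are hypothetical).
RESOLUTION_ORDER = ("headnote", "orders", "appeal_xref", "manual_flag")


def resolve_outcome(candidates: dict):
    """Return (source, value) for the highest-priority source that has a value."""
    for source in RESOLUTION_ORDER:
        value = candidates.get(source)
        if value:
            return source, value
    return None


def exoneration_key(defendant: str, year: int) -> str:
    """Normalise defendant name + year into a lookup key (format is an assumption)."""
    return f"{defendant.strip().lower()}|{year}"


def is_exonerated(table: dict, defendant: str, year: int) -> bool:
    """Check the static lookup table loaded from the external exoneration lists."""
    return bool(table.get(exoneration_key(defendant, year), False))
```

Keeping the winning source alongside the value makes it possible to audit later which records rest on a headnote and which on a manual flag.
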

---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk Type | Content | Target Tokens | Overlap (tokens) |
|------------|---------|---------------|------------------|
| `opening` | Case intro, charges, parties | 400–600 | none |
| `testimony` | Each witness examination block | 300–500 | 50 |
| `exhibit` | Evidence description + ruling | 200–400 | none |
| `ruling` | Each evidentiary ruling by judge | 200–300 | none |
| `closing` | Closing submissions summary | 400–600 | 50 |
| `judgment` | Judicial reasoning blocks | 400–600 | 50 |
| `sentence` | Sentencing remarks | 300–500 | none |

**Chunk struct:**
```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str           # UUID
    doc_id: str             # parent document MNC
    chunk_type: str         # from table above
    sequence: int           # position in document
    text: str
    token_count: int
    speaker: str | None     # witness name if testimony
    page_ref: str | None    # "p.47" if available
    embedding: list[float]  # 1536-dim, filled by EmbedEngine
```

**Chunking strategy:**
Use Claude Haiku (`claude-haiku-4-5-20251001`) to identify structural boundaries:
```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```
Then split at the identified boundaries, keeping each chunk within its token budget.

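
The final splitting step can be sketched as a sliding window over each identified section. This is an illustration under two stated assumptions: whitespace-separated words stand in for tokens (a real tokenizer would differ), and each chunk type is capped at the upper bound of its range in the table. `split_tokens` and `chunk_sections` are hypothetical names.

```python
# (max tokens, overlap) per chunk type: upper bounds and overlaps from the table.
BUDGETS = {
    "opening": (600, 0), "testimony": (500, 50), "exhibit": (400, 0),
    "ruling": (300, 0), "closing": (600, 50), "judgment": (600, 50),
    "sentence": (500, 0),
}


def split_tokens(text: str, max_tokens: int, overlap: int = 0) -> list:
    """Split text into windows of at most max_tokens words, sharing `overlap` words."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()  # assumption: one word approximates one token
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
    return pieces


def chunk_sections(raw_text: str, sections: list) -> list:
    """Apply the per-type budget to each {section_type, start_char, end_char} entry."""
    chunks = []
    for sec in sections:
        max_tokens, overlap = BUDGETS[sec["section_type"]]
        body = raw_text[sec["start_char"]:sec["end_char"]]
        for piece in split_tokens(body, max_tokens, overlap):
            chunks.append((sec["section_type"], piece))
    return chunks
```

Splitting only inside a section keeps every chunk within a single section type, which is what lets `chunk_type` act as a reliable retrieval filter.
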

---

### 4.5 EmbedEngine

```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb                # local, no infra
collection_naming: "au_cases_{source_id}"
```

Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.

**Query interface:**
```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```

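
The two optional filters in `query_chunks` have to become a metadata filter for the vector store. The sketch below builds a filter dictionary in the style of ChromaDB's `where` argument, using its `$in` and `$and` operators. It assumes chunk metadata is stored under the keys `chunk_type` and `doc_id`, which the spec does not state; `build_where` is a hypothetical helper.

```python
from typing import Optional


def build_where(
    chunk_types: Optional[list] = None,
    doc_ids: Optional[list] = None,
) -> Optional[dict]:
    """Translate the optional filters into a Chroma-style `where` dictionary."""
    clauses = []
    if chunk_types:
        clauses.append({"chunk_type": {"$in": list(chunk_types)}})
    if doc_ids:
        clauses.append({"doc_id": {"$in": list(doc_ids)}})
    if not clauses:
        return None        # no filter: search the whole collection
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no $and wrapper
    return {"$and": clauses}
```

The result would be passed as the `where` argument of a collection query, together with the query text or embedding and `n_results=top_k`.
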

---

### 4.6 GraphBuilder

Builds the property graph from extracted metadata + chunk relationships.

**Node types:**

```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})              # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome})            # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**

```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)              # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)            # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)       # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)        # semantic similarity < 0.2 on same topic
```

**Graph backend:**
- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**
```python
# After all chunks are embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
for pair in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(pair[0].embedding, pair[1].embedding)
    if sim > 0.85:
        graph.add_edge(pair[0], pair[1], "CORROBORATES", weight=sim)
    elif same_topic(pair[0], pair[1]) and sim < 0.2:
        # threshold matches the CONTRADICTS definition in the relationship list
        graph.add_edge(pair[0], pair[1], "CONTRADICTS", weight=sim)
```

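
The edge-generation loop relies on two helpers that the spec leaves undefined. The sketch below gives a dependency-free version of each for illustration: `cosine_similarity` is the standard formula, while `chunk_pairs` is a hypothetical stand-in for `chunk_pairs_within_case` that works on plain dictionaries instead of `Chunk` objects.

```python
import math
from itertools import combinations


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_pairs(chunks: list, types: list):
    """Yield every unordered pair of chunks whose chunk_type is in `types`."""
    selected = [c for c in chunks if c["chunk_type"] in types]
    return combinations(selected, 2)
```

For `n` selected chunks this produces `n * (n - 1) / 2` pairs, so restricting the comparison to one case and two chunk types, as the spec does, keeps the pairwise pass tractable.
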

---

## 5. Juror Subgraph Query Interface

This is the interface the jury agent calls at deliberation time.

```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**

```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```

**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.

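
To make the invariant concrete, the sketch below runs a breadth-first traversal over an in-memory edge list, a stand-in for the real graph backend. Excluded edge types are removed before the adjacency map is built, so no persona configuration can reach content behind them. The function name, the `(src, edge_type, dst)` edge format, and the `{node_id: (text, token_count)}` chunk map are assumptions for illustration.

```python
from collections import deque


def collect_chunks(edges, anchors, edge_types, exclude_edges, chunks, max_tokens):
    """Walk allowed edges from the anchors and collect chunk IDs within the budget.

    edges:  list of (src, edge_type, dst) tuples
    chunks: {node_id: (text, token_count)} for nodes that are chunks
    """
    # Enforce the wall first: an excluded type is dropped even if a persona lists it.
    allowed = set(edge_types) - set(exclude_edges)
    adjacency = {}
    for src, etype, dst in edges:
        if etype in allowed:
            adjacency.setdefault(src, []).append(dst)

    seen, queue = set(anchors), deque(anchors)
    picked, used = [], 0
    while queue:
        node = queue.popleft()
        if node in chunks:
            _, tokens = chunks[node]
            if used + tokens > max_tokens:
                continue  # over budget: skip this chunk and do not expand from it
            picked.append(node)
            used += tokens
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return picked
```

Filtering at adjacency-build time, instead of checking each hop, means the excluded edges never exist from the traversal's point of view, which matches the "graph traversal level, not the prompt level" wording.
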

---

## 6. MetaDB Schema (SQLite)

Tracks ingestion state: what has been fetched, processed, or failed.

```sql
CREATE TABLE documents (
    doc_id           TEXT PRIMARY KEY,   -- MNC or UUID
    source_id        TEXT NOT NULL,
    url              TEXT NOT NULL,
    fetch_status     TEXT,               -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp  TEXT,
    parse_timestamp  TEXT,
    embed_timestamp  TEXT,
    graph_timestamp  TEXT,
    error_message    TEXT,
    char_count       INTEGER,
    chunk_count      INTEGER,
    matter_type      TEXT,
    court            TEXT,
    year             INTEGER,
    verdict          TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id  TEXT,
    url        TEXT,
    priority   INTEGER DEFAULT 5,        -- 1=highest
    queued_at  TEXT,
    attempts   INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id      TEXT PRIMARY KEY,
    last_poll      TEXT,
    last_rss_etag  TEXT,
    docs_fetched   INTEGER DEFAULT 0,
    docs_failed    INTEGER DEFAULT 0
);
```

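
As an illustration of how the orchestrator might consume `fetch_queue`, the sketch below picks the highest-priority entry (lowest number, oldest first) and records the attempt. It uses the synchronous standard-library `sqlite3` module for brevity, not the `aiosqlite` named in the tech stack; `claim_next` and the `max_attempts` cut-off of 3 are assumptions.

```python
import sqlite3

# Same table definition as in the schema above, repeated so the sketch is self-contained.
FETCH_QUEUE_DDL = """
CREATE TABLE fetch_queue (
    queue_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id  TEXT,
    url        TEXT,
    priority   INTEGER DEFAULT 5,
    queued_at  TEXT,
    attempts   INTEGER DEFAULT 0
);
"""


def claim_next(conn: sqlite3.Connection, max_attempts: int = 3):
    """Return the next queue entry as a dict and bump its attempt counter."""
    row = conn.execute(
        "SELECT queue_id, source_id, url, attempts FROM fetch_queue "
        "WHERE attempts < ? ORDER BY priority ASC, queued_at ASC LIMIT 1",
        (max_attempts,),
    ).fetchone()
    if row is None:
        return None  # queue is empty or every entry has exhausted its attempts
    queue_id, source_id, url, attempts = row
    conn.execute(
        "UPDATE fetch_queue SET attempts = ? WHERE queue_id = ?",
        (attempts + 1, queue_id),
    )
    conn.commit()
    return {"queue_id": queue_id, "source_id": source_id, "url": url, "attempts": attempts + 1}
```

Ordering by `queued_at` works as a tie-breaker because ISO 8601 timestamps stored as text sort chronologically.
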

---

## 7. Rate Limiting & Politeness

```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3, "concurrent": 1, "retry_after": 180},
}

# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # DOCX uses the OOXML media type; application/msword covers legacy .doc only
    "Accept": (
        "text/html,application/xhtml+xml,application/pdf,"
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ),
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```

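
One way to honour the `rps` and `concurrent` settings is an async context manager that spaces requests evenly and caps parallelism, paired with a robots.txt check from the standard library. This is a sketch under assumptions, not the project's `rate_limiter.py`: the class and function names are hypothetical, and even spacing (one request per `1/rps` seconds) is only one possible interpretation of the limit.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Space requests 1/rps seconds apart and allow at most `concurrent` in flight."""

    def __init__(self, rps: float, concurrent: int):
        self._interval = 1.0 / rps
        self._sem = asyncio.Semaphore(concurrent)
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time of the next free request slot

    async def __aenter__(self):
        await self._sem.acquire()
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            # Reserve the slot after this one, whether or not we have to wait.
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait:
            await asyncio.sleep(wait)
        return self

    async def __aexit__(self, *exc):
        self._sem.release()


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A fetcher would wrap each request in `async with limiter:` using one limiter per `source_id`, built from the matching `RATE_LIMITS` entry.
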

---

## 8. Error Handling & Recovery

```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:     {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError: {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:     {"max_retries": 1, "backoff": "fixed",       "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```

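
The retry policy gives a base delay and a backoff style but not the growth factor. Assuming `base` is in seconds and "exponential" means doubling on each retry, the wait times can be derived as below; `backoff_delay` and `retry_schedule` are hypothetical helper names.

```python
def backoff_delay(policy: dict, attempt: int) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    if policy["backoff"] == "exponential":
        # Assumption: the delay doubles each time (base, 2*base, 4*base, ...)
        return policy["base"] * 2 ** (attempt - 1)
    return policy["base"]  # "fixed": the same delay every time


def retry_schedule(policy: dict) -> list:
    """All delays the policy allows, in order."""
    return [backoff_delay(policy, n) for n in range(1, policy["max_retries"] + 1)]
```

Under these assumptions the `FetchError` policy waits 30, 60, then 120 seconds before the source is treated as degraded.
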
**Telegram alerting** (reuse Lulu's Telegram integration):
```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary": "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone": "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan

Priority order for initial corpus build:

```
Phase 1, Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  [ ] High Court judgments 2000–2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010–2026
  Target: ~5,000 cases

Phase 2, Week 2:
  [ ] Federal Court PDF backfill 1995–2009
  [ ] QLD Judgments criminal, 2000–2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total

Phase 3, Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---

## 10. Output Contracts

**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`
```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```

**VectorIndex:** ChromaDB collection `au_cases`, filterable by `source_id`, `chunk_type`, `doc_id`

**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol

**MetaDB:** SQLite at `/data/meta.db`

---

## 11. Tech Stack

```
Language:        Python 3.12
HTTP:            httpx (async)
HTML parsing:    BeautifulSoup4
DOCX:            python-docx
PDF text:        pdfminer.six
PDF OCR:         via AusLaw MCP (pytesseract wrapper)
Embeddings:      openai text-embedding-3-small
Vector store:    chromadb (local)
Graph DB:        neo4j (Docker) or ryu_graph (Windows)
SQLite:          aiosqlite
LLM extraction:  anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:      mcp-python-sdk
Scheduling:      APScheduler
Alerting:        Telegram Bot API (reuse Lulu token)
Config:          TOML (pyproject-style)
```

---

## 12. File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---

## 13. Open Questions / Decisions Deferred

1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both and tag each chunk with its source type.
3. **Suppression orders:** Some cases have partial suppression. Track these with the `suppression_order` flag and do not ingest suppressed content. Check AustLII suppression notices as reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap but ada-002 or Voyage Law may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.

---

*End of spec v0.1*