# AuCourtIngest — Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline
**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis
**Author:** Aaron King / Slothitude Games
**Date:** 2026-05-30

---

## 1. Overview

AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.

The agent has four operational modes:

| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│              rate limiting, error recovery)         │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│  SOURCE LAYER   │             │  PROCESSING LAYER   │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │    STORAGE LAYER    │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
```

---

## 3. Source Definitions

### 3.1 Federal Court of Australia

```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```

**Fetch logic:**
1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS `<link>`
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977–1994: fetch PDF instead → hand to PDFParser

**DOCX URL pattern:**
```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```
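
The pattern and the 1977–1994 PDF rule can be expressed as two small helpers. This is a minimal sketch: the function names are hypothetical, and it assumes `{index}` is inserted as-is (no zero-padding), which the spec does not state.

```python
FEDCOURT_BASE = "https://www.judgments.fedcourt.gov.au"


def fedcourt_format_for_year(year: int) -> str:
    """Pick the download format from the year ranges in the source config."""
    if year < 1977:
        raise ValueError(f"no Federal Court coverage before 1977: {year}")
    # pdf_range is [1977, 1994]; DOCX is available from 1995 onwards.
    return "pdf" if year <= 1994 else "docx"


def fedcourt_docx_url(year: int, index: int) -> str:
    """Fill the DOCX URL pattern (assumes the index is not zero-padded)."""
    return f"{FEDCOURT_BASE}/judgments/fca/{year}/{index}.docx"
```

In practice the link scraped from the HTML page in step 3 should take precedence; a constructed URL is only a fallback.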

**Bootstrap pagination:**
```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
  ?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
  &dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```
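
A sketch of building the paginated search URL from the parameters above. The function name is hypothetical, and whether `pageNum` starts at 0 or 1 is an assumption to confirm against the live site.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"


def bootstrap_page_url(page_num: int,
                       date_from: str = "1995-01-01",
                       date_to: str = "2026-12-31",
                       per_page: int = 50) -> str:
    """Return the search URL for one results page of the criminal category."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",          # empty search term, as in the spec URL
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": per_page,
        "pageNum": page_num,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

The bootstrap loop would request increasing page numbers until a page returns no results.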

---

### 3.2 High Court of Australia

```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```

**Fetch logic:**
1. Crawl `/cases/recent-judgments` — paginated list with matter numbers
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**
```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```
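
One way the keyword list could be applied is a case-insensitive substring match over the judgment text. This is a sketch under that assumption; the spec does not say how the filter is evaluated, and the function names are hypothetical.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def matched_priority_keywords(text: str) -> list[str]:
    """Return the priority keywords found in text, in list order."""
    # Lowercase and collapse whitespace so phrases split across lines still match.
    normalised = " ".join(text.lower().split())
    return [kw for kw in PRIORITY_KEYWORDS if kw in normalised]


def is_priority(text: str) -> bool:
    """True when at least one priority keyword appears in the text."""
    return bool(matched_priority_keywords(text))
```

Substring matching is deliberately crude: it will also fire on "attempted murder" or on a judgment that merely cites a murder case, so it suits queue prioritisation rather than classification.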

---

### 3.3 NSW Caselaw

```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```

**Fetch logic:**
1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system

**MNC parsing:**
```python
# Example: [2019] NSWSC 1234
import re

# Court codes can mix case and digits (e.g. FamCA), so do not restrict to [A-Z].
MNC_PATTERN = r'\[(\d{4})\]\s+([A-Za-z][A-Za-z0-9]*)\s+(\d+)'

def parse_mnc(mnc: str) -> dict:
    m = re.match(MNC_PATTERN, mnc.strip())
    if m is None:
        # re.match returns None on failure; fail loudly instead of a TypeError
        raise ValueError(f"not a medium neutral citation: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```

---

### 3.4 Queensland Judgments

```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```

**Fetch logic:**
1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)

```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```

**Use cases:**
- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**
```python
# Retrieve the full text of a document by URL
# (for a lookup by citation, use search_jade_by_citation first)
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---

## 4. Document Processing Pipeline

### 4.1 DocParser

Accepts: HTML, DOCX, PDF
Outputs: normalised `RawDocument` struct

```python
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str          # e.g. "nsw_caselaw"
    doc_id: str             # MNC or internal UUID
    url: str
    fetch_timestamp: str    # ISO8601
    raw_text: str           # full extracted text
    format: str             # html | docx | pdf
    pages: int | None
    char_count: int
```

**Format handlers:**
- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned

---

### 4.2 MetaExtractor

Input: `RawDocument`
Output: `CaseMeta` struct
Method: Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) via API (cheap, fast, structured)

```python
from dataclasses import dataclass

@dataclass
class CaseMeta:
    # Identity
    case_name: str                    # e.g. "R v Smith"
    mnc: str                          # Medium Neutral Citation
    court: str                        # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str               # ISO8601
    jurisdiction: str                 # NSW | CTH | QLD | VIC etc

    # Classification
    matter_type: str                  # criminal | civil | appeal | coronial
    charges: list[str]                # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]      # ["homicide", "violence"]

    # Outcome
    verdict: str                      # guilty | not_guilty | appeal_allowed |
                                      # appeal_dismissed | conviction_quashed |
                                      # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None              # "18 years NMP 13"
    outcome_notes: str                # free text summary of result

    # Flags
    is_appeal: bool
    appeal_of: str | None             # MNC of original decision
    was_appealed: bool                # filled in later by back-reference
    exoneration_flag: bool            # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool           # whether publication restrictions apply
```

**Extraction prompt (system):**
```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative — use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```
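
Because the prompt restricts `verdict` to a closed set, the model's reply can be checked before it is turned into a `CaseMeta`. A minimal sketch, assuming the reply arrives as a plain string; the function name is hypothetical and only the verdict field is validated here.

```python
import json

ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_extraction(response_text: str) -> dict:
    """Parse the model reply and reject anything outside the verdict vocabulary."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    verdict = data.get("verdict")
    # null is permitted: the prompt asks for null when the model is unsure.
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict value: {verdict!r}")
    return data
```

A `ValueError` here maps naturally onto the `ParseError` path in section 8 (one retry, then manual review).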

---

### 4.3 OutcomeParser

Separate pass specifically for verdict/outcome ground truth. This is the most critical field — it's what the divergence signal compares against.

**Resolution priority:**
1. Explicit headnote — "HELD: conviction quashed"
2. Orders section — "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference — if later appeal found, update original record
4. Manual flag — `exoneration_flag=True` sourced from external list

**External exoneration sources:**
- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.

---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk Type | Content | Target Tokens | Overlap |
|------------|---------|---------------|---------|
| `opening` | Case intro, charges, parties | 400–600 | none |
| `testimony` | Each witness examination block | 300–500 | 50 |
| `exhibit` | Evidence description + ruling | 200–400 | none |
| `ruling` | Each evidentiary ruling by judge | 200–300 | none |
| `closing` | Closing submissions summary | 400–600 | 50 |
| `judgment` | Judicial reasoning blocks | 400–600 | 50 |
| `sentence` | Sentencing remarks | 300–500 | none |

**Chunk struct:**
```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str            # UUID
    doc_id: str              # parent document MNC
    chunk_type: str          # from table above
    sequence: int            # position in document
    text: str
    token_count: int
    speaker: str | None      # witness name if testimony
    page_ref: str | None     # "p.47" if available
    embedding: list[float]   # 1536-dim, filled by EmbedEngine
```

**Chunking strategy:**
Use Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) to identify structural boundaries:
```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```
Then split at the identified boundaries, keeping each chunk within its token budget.
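
Once a section's character range is known, splitting it to a token budget with overlap could look like the sketch below. It approximates tokens by whitespace-separated words, which is an assumption made for brevity; a real tokenizer would give different counts. The function name is hypothetical.

```python
def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_tokens words.

    Consecutive windows share `overlap` words, matching the Overlap column
    of the chunk table (use 0 for chunk types with no overlap).
    """
    if overlap < 0 or max_tokens <= overlap:
        raise ValueError("max_tokens must be greater than overlap (>= 0)")
    words = text.split()
    chunks: list[str] = []
    start = 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back so the next window repeats the last `overlap` words.
        start = end - overlap
    return chunks
```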

---

### 4.5 EmbedEngine

```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb                # local, no infra
collection_naming: "au_cases_{source_id}"
```

Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.

**Query interface:**
```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```

---

### 4.6 GraphBuilder

Builds the property graph from extracted metadata + chunk relationships.

**Node types:**

```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})        # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome})      # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**

```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)              # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)            # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)       # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)        # semantic similarity < 0.25 on same topic
```

**Graph backend:**
- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**
```python
# After all chunks embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
for pair in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(pair[0].embedding, pair[1].embedding)
    if sim > 0.85:
        graph.add_edge(pair[0], pair[1], "CORROBORATES", weight=sim)
    elif same_topic(pair[0], pair[1]) and sim < 0.25:
        graph.add_edge(pair[0], pair[1], "CONTRADICTS", weight=sim)
```

---

## 5. Juror Subgraph Query Interface

This is the interface the jury agent calls at deliberation time.

```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**

```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```

**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
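
A traversal that enforces this invariant might look like the sketch below, written against an in-memory edge list rather than a real graph backend. It assumes one particular reading of the rule: any node that is the target of a `RULED_INADMISSIBLE` edge is blocked outright, in addition to the edge never being followed. The function name and data shapes are hypothetical.

```python
from collections import deque

EXCLUDED_EDGES = frozenset({"RULED_INADMISSIBLE"})


def collect_chunks(edges: list[tuple[str, str, str]],
                   chunk_tokens: dict[str, int],
                   anchors: list[str],
                   edge_types: list[str],
                   max_tokens: int) -> list[str]:
    """Breadth-first walk from anchors; return chunk IDs that fit the budget.

    edges are (source, edge_type, target); chunk_tokens maps chunk node IDs
    to token counts (nodes absent from it are not chunks).
    """
    # The exclusion is applied here, so a caller cannot opt back in.
    allowed = set(edge_types) - EXCLUDED_EDGES
    blocked = {dst for _, etype, dst in edges if etype in EXCLUDED_EDGES}
    adjacency: dict[str, list[str]] = {}
    for src, etype, dst in edges:
        if etype in allowed:
            adjacency.setdefault(src, []).append(dst)

    seen: set[str] = set()
    queue = deque(a for a in anchors if a not in blocked)
    selected: list[str] = []
    used = 0
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        cost = chunk_tokens.get(node)
        if cost is not None and used + cost <= max_tokens:
            selected.append(node)
            used += cost
        for nxt in adjacency.get(node, []):
            if nxt not in blocked and nxt not in seen:
                queue.append(nxt)
    return selected
```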

---

## 6. MetaDB Schema (SQLite)

Tracks ingestion state — what's been fetched, processed, failed.

```sql
CREATE TABLE documents (
    doc_id           TEXT PRIMARY KEY,   -- MNC or UUID
    source_id        TEXT NOT NULL,
    url              TEXT NOT NULL,
    fetch_status     TEXT,               -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp  TEXT,
    parse_timestamp  TEXT,
    embed_timestamp  TEXT,
    graph_timestamp  TEXT,
    error_message    TEXT,
    char_count       INTEGER,
    chunk_count      INTEGER,
    matter_type      TEXT,
    court            TEXT,
    year             INTEGER,
    verdict          TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id  TEXT,
    url        TEXT,
    priority   INTEGER DEFAULT 5,        -- 1=highest
    queued_at  TEXT,
    attempts   INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id     TEXT PRIMARY KEY,
    last_poll     TEXT,
    last_rss_etag TEXT,
    docs_fetched  INTEGER DEFAULT 0,
    docs_failed   INTEGER DEFAULT 0
);
```

---

## 7. Rate Limiting & Politeness

```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3, "concurrent": 1, "retry_after": 180},
}

# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # DOCX uses the OOXML media type; application/msword only covers legacy .doc
    "Accept": "text/html,application/xhtml+xml,application/pdf,"
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```
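
The `rps` values and the robots.txt check could be enforced with helpers like these. This is a sketch: the class and function names are hypothetical, the limiter is synchronous and ignores the `concurrent` and `retry_after` fields, and the clock and sleep functions are injectable only so the behaviour can be tested without waiting.

```python
import time
from urllib.robotparser import RobotFileParser


class MinIntervalLimiter:
    """Space calls at least 1/rps seconds apart."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

One limiter instance per `source_id`, created from `RATE_LIMITS[source_id]["rps"]`, keeps the per-source budgets independent.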

---

## 8. Error Handling & Recovery

```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:     {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError: {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:     {"max_retries": 1, "backoff": "fixed",       "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```

**Telegram alerting** (reuse Lulu's Telegram integration):
```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary":   "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone":       "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan

Priority order for initial corpus build:

```
Phase 1 — Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  [ ] High Court judgments 2000–2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010–2026
  Target: ~5,000 cases

Phase 2 — Week 2:
  [ ] Federal Court DOCX backfill 1995–2009 (PDFs cover 1977–1994 only, see 3.1)
  [ ] QLD Judgments criminal, 2000–2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total

Phase 3 — Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---

## 10. Output Contracts

**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`
```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```
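
An MNC contains brackets and spaces, so `{mnc}` needs a filesystem-safe form. The sketch below borrows the `NSWSC_1234` form from the `raw_text_path` example; applying the same form to DocStore file names is an assumption, and the function name is hypothetical.

```python
import re
from pathlib import PurePosixPath

_MNC_RE = re.compile(r"\[(\d{4})\]\s+([A-Za-z][A-Za-z0-9]*)\s+(\d+)")


def doc_store_path(source_id: str, mnc: str, root: str = "/data/docs") -> PurePosixPath:
    """Map a source and MNC to the DocStore JSON path."""
    m = _MNC_RE.fullmatch(mnc.strip())
    if m is None:
        raise ValueError(f"not a medium neutral citation: {mnc!r}")
    year, court, number = m.groups()
    # "[2019] NSWSC 1234" -> <root>/<source_id>/2019/NSWSC_1234.json
    return PurePosixPath(root) / source_id / year / f"{court}_{number}.json"
```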

**VectorIndex:** ChromaDB collections named `au_cases_{source_id}` (one per source, as configured in 4.5), with `source_id`, `chunk_type`, and `doc_id` stored as filterable metadata

**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol

**MetaDB:** SQLite at `/data/meta.db`

---

## 11. Tech Stack

```
Language:        Python 3.12
HTTP:            httpx (async)
HTML parsing:    BeautifulSoup4
DOCX:            python-docx
PDF text:        pdfminer.six
PDF OCR:         via AusLaw MCP (pytesseract wrapper)
Embeddings:      openai text-embedding-3-small
Vector store:    chromadb (local)
Graph DB:        neo4j (Docker) or ryu_graph (Windows)
SQLite:          aiosqlite
LLM extraction:  anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:      mcp (official MCP Python SDK)
Scheduling:      APScheduler
Alerting:        Telegram Bot API (reuse Lulu token)
Config:          TOML (pyproject-style)
```

---

## 12. File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---

## 13. Open Questions / Decisions Deferred

1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both, tag chunk source type.
3. **Suppression orders:** Some cases are only partly suppressed. Record this in the `suppression_order` flag (currently defined on `CaseMeta`; the MetaDB `documents` table has no matching column yet) and do not ingest suppressed content. Use AustLII suppression notices as a reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap, but text-embedding-3-large or a legal-domain model such as Voyage Law may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.

---

*End of spec v0.1*