# AuCourtIngest: Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline

**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis

**Author:** Aaron King / Slothitude Games

**Date:** 2026-05-30

---

## 1. Overview

AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.

The agent has four operational modes:

| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│           rate limiting, error recovery)            │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│ SOURCE LAYER    │             │ PROCESSING LAYER    │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │ STORAGE LAYER       │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
```

---

## 3. Source Definitions

### 3.1 Federal Court of Australia

```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```

**Fetch logic:**

1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS `<link>`
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977-1994: fetch PDF instead → hand to DocParser's PDF handler

**DOCX URL pattern:**

```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```

**Bootstrap pagination:**

```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
&dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```
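
The two URL patterns above can be generated rather than hand-assembled. The following is a minimal sketch, assuming the query parameter names shown in the pagination URL are accepted as-is and that page numbering starts at 1 (neither is confirmed by this spec). The function names `bootstrap_page_url` and `docx_url` are illustrative only.

```python
from urllib.parse import urlencode

# Both URLs are copied from the patterns above.
SEARCH_URL = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"
DOCX_BASE = "https://www.judgments.fedcourt.gov.au/judgments/fca"


def bootstrap_page_url(page_num: int, date_from: str = "1995-01-01",
                       date_to: str = "2026-12-31", per_page: int = 50) -> str:
    """Build the search URL for one results page (1-based numbering is assumed)."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": per_page,
        "pageNum": page_num,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def docx_url(year: int, index: int) -> str:
    """Fill in the DOCX URL pattern for a given year and judgment index."""
    return f"{DOCX_BASE}/{year}/{index}.docx"
```

A bootstrap run would presumably call `bootstrap_page_url(n)` for increasing `n` until a page returns no results, queueing each discovered judgment for download.
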

---

### 3.2 High Court of Australia

```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```

**Fetch logic:**

1. Crawl `/cases/recent-judgments` (a paginated list with matter numbers)
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```
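
One way the keyword list might be applied is a case-insensitive substring match over a judgment's catchwords or summary. This is a sketch under assumptions: the spec does not say which text the filter runs against, and mapping a hit to priority 1 (versus a default of 5) is an illustrative choice, not a stated rule.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def priority_hits(text: str, keywords: list[str] = PRIORITY_KEYWORDS) -> list[str]:
    """Return the keywords found in text, case-insensitively, in list order."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def queue_priority(text: str) -> int:
    """Assumed mapping: 1 (highest) on any keyword hit, otherwise 5."""
    return 1 if priority_hits(text) else 5
```

Plain substring matching also fires on longer words such as "murderer"; a word-boundary regex would be stricter if that matters.
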

---

### 3.3 NSW Caselaw

```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```

**Fetch logic:**

1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system

**MNC parsing:**

```python
import re

# Example: [2019] NSWSC 1234
MNC_PATTERN = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")

def parse_mnc(mnc: str) -> dict:
    """Split a Medium Neutral Citation into year, court code, and decision number."""
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        raise ValueError(f"not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```

---

### 3.4 Queensland Judgments

```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```

**Fetch logic:**

1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)

```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative # case-by-case only
```

**Use cases:**

- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**

```python
# `mcp` is an already-connected MCP client; these calls run inside an async function.

# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---

## 4. Document Processing Pipeline

### 4.1 DocParser

Accepts: HTML, DOCX, PDF

Outputs: normalised `RawDocument` struct

```python
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str          # e.g. "nsw_caselaw"
    doc_id: str             # MNC or internal UUID
    url: str
    fetch_timestamp: str    # ISO8601
    raw_text: str           # full extracted text
    format: str             # html | docx | pdf
    pages: int | None
    char_count: int
```

**Format handlers:**

- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned PDFs

---

### 4.2 MetaExtractor

Input: `RawDocument`

Output: `CaseMeta` struct

Method: Claude (`claude-haiku-4-5-20251001`) via API (cheap, fast, structured output)

```python
from dataclasses import dataclass

@dataclass
class CaseMeta:
    # Identity
    case_name: str                    # e.g. "R v Smith"
    mnc: str                          # Medium Neutral Citation
    court: str                        # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str               # ISO8601
    jurisdiction: str                 # NSW | CTH | QLD | VIC etc
    # Classification
    matter_type: str                  # criminal | civil | appeal | coronial
    charges: list[str]                # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]      # ["homicide", "violence"]
    # Outcome
    verdict: str                      # guilty | not_guilty | appeal_allowed |
                                      # appeal_dismissed | conviction_quashed |
                                      # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None              # "18 years NMP 13"
    outcome_notes: str                # free text summary of result
    # Flags
    is_appeal: bool
    appeal_of: str | None             # MNC of original decision
    was_appealed: bool                # filled in later by back-reference
    exoneration_flag: bool            # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool           # whether publication restrictions apply
```

**Extraction prompt (system):**

```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative: use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```
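
Because the prompt restricts `verdict` to a closed set, the model's reply can be checked before it is turned into a `CaseMeta`. The sketch below assumes the reply arrives as a JSON string; coercing an unknown or null verdict to `"n/a"` (rather than rejecting the document) is an illustrative policy, and `parse_meta_response` is a hypothetical helper name.

```python
import json

# The closed set of verdict values listed in the extraction prompt.
ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_meta_response(raw: str) -> dict:
    """Parse the model's reply and coerce an out-of-set verdict to "n/a"."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("verdict") not in ALLOWED_VERDICTS:
        data["verdict"] = "n/a"
    return data
```

A `json.JSONDecodeError` here would be a natural trigger for the `ParseError` retry path described in section 8.
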

---

### 4.3 OutcomeParser

A separate pass dedicated to verdict/outcome ground truth. This is the most critical field: it is the ground truth that the divergence signal is compared against.

**Resolution priority:**

1. Explicit headnote: "HELD: conviction quashed"
2. Orders section: "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference: if a later appeal is found, update the original record
4. Manual flag: `exoneration_flag=True` sourced from an external list

**External exoneration sources:**

- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.
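
A lookup keyed by defendant name and year needs a stable key format. The following is a rough sketch, assuming names are normalised to lowercase ASCII letters and joined to the year with `|`; the key format, the helper names, and the sample entry are all hypothetical placeholders and not real register data.

```python
import json
import re

# Placeholder table; a real file would be built from the sources listed above.
SAMPLE_TABLE_JSON = '{"person example|2001": {"source": "placeholder entry"}}'


def exoneration_key(defendant: str, year: int) -> str:
    """Normalise a defendant name and year into a key such as "person example|2001"."""
    name = re.sub(r"[^a-z]+", " ", defendant.lower()).strip()
    return f"{name}|{year}"


def is_exonerated(table: dict, defendant: str, year: int) -> bool:
    """True when the normalised (name, year) key is present in the lookup table."""
    return exoneration_key(defendant, year) in table


table = json.loads(SAMPLE_TABLE_JSON)
```

Name-plus-year keys can collide for common surnames, so a hit is probably best treated as a candidate for `exoneration_flag` pending a check against the MNC.
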

---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk Type | Content | Target Tokens | Overlap (tokens) |
|------------|---------|---------------|------------------|
| `opening` | Case intro, charges, parties | 400-600 | none |
| `testimony` | Each witness examination block | 300-500 | 50 |
| `exhibit` | Evidence description + ruling | 200-400 | none |
| `ruling` | Each evidentiary ruling by judge | 200-300 | none |
| `closing` | Closing submissions summary | 400-600 | 50 |
| `judgment` | Judicial reasoning blocks | 400-600 | 50 |
| `sentence` | Sentencing remarks | 300-500 | none |

**Chunk struct:**

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str            # UUID
    doc_id: str              # parent document MNC
    chunk_type: str          # from table above
    sequence: int            # position in document
    text: str
    token_count: int
    speaker: str | None      # witness name if testimony
    page_ref: str | None     # "p.47" if available
    embedding: list[float]   # 1536-dim, filled by EmbedEngine
```

**Chunking strategy:**

Use Claude (`claude-haiku-4-5-20251001`) to identify structural boundaries:

```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```

Then split at the identified boundaries, keeping each chunk within its token budget.
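
The split step can be sketched as follows. This assumes the boundary list has already been parsed from the model's JSON reply, uses the upper end of each target range from the table as the budget, and approximates tokens by whitespace-separated words (a real tokenizer would replace `approx_tokens`). All function names are illustrative.

```python
# (max_tokens, overlap) per chunk type, taken from the upper bounds in the table above.
BUDGETS = {
    "opening": (600, 0), "testimony": (500, 50), "exhibit": (400, 0),
    "ruling": (300, 0), "closing": (600, 50), "judgment": (600, 50),
    "sentence": (500, 0),
}


def approx_tokens(text: str) -> int:
    """Crude whitespace token estimate."""
    return len(text.split())


def split_section(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split one section into word windows of at most max_tokens, sharing `overlap` words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
    return pieces


def chunk_document(raw_text: str, sections: list[dict], budgets: dict = BUDGETS) -> list[dict]:
    """Cut raw_text at the given section boundaries, then window each section."""
    chunks = []
    for sec in sections:
        max_tokens, overlap = budgets[sec["section_type"]]
        body = raw_text[sec["start_char"]:sec["end_char"]]
        for piece in split_section(body, max_tokens, overlap):
            chunks.append({
                "chunk_type": sec["section_type"],
                "sequence": len(chunks),
                "speaker": sec.get("speaker"),
                "text": piece,
                "token_count": approx_tokens(piece),
            })
    return chunks
```

Windowing by words keeps the sketch dependency-free; splitting on sentence or paragraph boundaries inside each section would fit the "semantically meaningful units" goal better.
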

---

### 4.5 EmbedEngine

```yaml
model: text-embedding-3-small # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb # local, no infra
collection_naming: "au_cases_{source_id}"
```

Each chunk gets an embedding, which is stored in ChromaDB with the chunk metadata as payload.

**Query interface:**

```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```

---

### 4.6 GraphBuilder

Builds the property graph from extracted metadata + chunk relationships.

**Node types:**

```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role}) # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome}) # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**

```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case) # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk) # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk) # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk) # semantic similarity < 0.25 on same topic
```

**Graph backend:**

- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**

```python
# Thresholds for similarity-derived edges
CORROBORATES_MIN = 0.85
CONTRADICTS_MAX = 0.25

# After all chunks are embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit.
# chunk_pairs_within_case, cosine_similarity, same_topic and graph are
# assumed to be defined elsewhere.
for a, b in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(a.embedding, b.embedding)
    if sim > CORROBORATES_MIN:
        graph.add_edge(a, b, "CORROBORATES", weight=sim)
    elif same_topic(a, b) and sim < CONTRADICTS_MAX:
        graph.add_edge(a, b, "CONTRADICTS", weight=sim)
```

---

## 5. Juror Subgraph Query Interface

This is the interface the jury agent calls at deliberation time.

```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**

```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```

**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
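
To illustrate what enforcing the wall at traversal level could look like, the sketch below runs a breadth-first walk over an in-memory edge list and strips `RULED_INADMISSIBLE` from the allowed edge set unconditionally, even if a persona config were to request it. The edge-tuple representation and the `traverse` function are assumptions for illustration; the real implementation would issue graph queries against Neo4j or RyuGraph.

```python
from collections import deque

# Hard-coded so that no persona configuration can re-enable these edges.
ALWAYS_EXCLUDED = frozenset({"RULED_INADMISSIBLE"})


def traverse(edges: list[tuple[str, str, str]], anchors: list[str],
             edge_types: list[str], exclude_edges: tuple[str, ...] = ()) -> list[str]:
    """Return node IDs reachable from anchors via allowed edge types, in BFS order.

    edges are (source_id, edge_type, target_id) tuples, followed in their stored direction.
    """
    allowed = set(edge_types) - ALWAYS_EXCLUDED - set(exclude_edges)
    adjacency: dict[str, list[str]] = {}
    for src, etype, dst in edges:
        if etype in allowed:
            adjacency.setdefault(src, []).append(dst)

    order = list(dict.fromkeys(anchors))  # de-duplicate while keeping order
    seen = set(order)
    queue = deque(order)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

This sketch filters edges only. Whether a node that is the target of a `RULED_INADMISSIBLE` edge should also be blocked when reachable by another path is not stated above and would need a decision.
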

---

## 6. MetaDB Schema (SQLite)

Tracks ingestion state: what has been fetched, processed, or failed.

```sql
CREATE TABLE documents (
    doc_id           TEXT PRIMARY KEY,   -- MNC or UUID
    source_id        TEXT NOT NULL,
    url              TEXT NOT NULL,
    fetch_status     TEXT,               -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp  TEXT,
    parse_timestamp  TEXT,
    embed_timestamp  TEXT,
    graph_timestamp  TEXT,
    error_message    TEXT,
    char_count       INTEGER,
    chunk_count      INTEGER,
    matter_type      TEXT,
    court            TEXT,
    year             INTEGER,
    verdict          TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT,
    url       TEXT,
    priority  INTEGER DEFAULT 5,         -- 1=highest
    queued_at TEXT,
    attempts  INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id     TEXT PRIMARY KEY,
    last_poll     TEXT,
    last_rss_etag TEXT,
    docs_fetched  INTEGER DEFAULT 0,
    docs_failed   INTEGER DEFAULT 0
);
```
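
The `fetch_status` column implies a linear progression through the pipeline stages. A sketch of how transitions might be guarded is shown below, using the synchronous `sqlite3` module for brevity (the tech stack lists `aiosqlite`). The rule that a document may only advance one stage at a time, and the helper names `advance` and `mark_failed`, are assumptions rather than something the schema itself enforces.

```python
import sqlite3
from datetime import datetime, timezone

# Trimmed copy of the documents table above, limited to the status columns.
SCHEMA = """
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY, source_id TEXT NOT NULL, url TEXT NOT NULL,
    fetch_status TEXT, fetch_timestamp TEXT, parse_timestamp TEXT,
    embed_timestamp TEXT, graph_timestamp TEXT, error_message TEXT
)
"""

STAGES = ["pending", "fetched", "parsed", "embedded", "graphed"]
STAGE_TIMESTAMP = {
    "fetched": "fetch_timestamp", "parsed": "parse_timestamp",
    "embedded": "embed_timestamp", "graphed": "graph_timestamp",
}


def advance(conn: sqlite3.Connection, doc_id: str, new_status: str) -> None:
    """Move a document to the next stage and stamp the matching timestamp column."""
    row = conn.execute(
        "SELECT fetch_status FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    current = row[0]
    if (current not in STAGES or new_status not in STAGE_TIMESTAMP
            or STAGES.index(new_status) != STAGES.index(current) + 1):
        raise ValueError(f"illegal transition {current!r} -> {new_status!r}")
    column = STAGE_TIMESTAMP[new_status]  # from a fixed whitelist, safe to interpolate
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        f"UPDATE documents SET fetch_status = ?, {column} = ? WHERE doc_id = ?",
        (new_status, now, doc_id),
    )
    conn.commit()


def mark_failed(conn: sqlite3.Connection, doc_id: str, message: str) -> None:
    """Record a failure and its error message for later review."""
    conn.execute(
        "UPDATE documents SET fetch_status = 'failed', error_message = ? WHERE doc_id = ?",
        (message, doc_id),
    )
    conn.commit()
```
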

---

## 7. Rate Limiting & Politeness

```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3, "concurrent": 1, "retry_after": 180},
}

# Always send. The Accept header covers HTML, PDF and DOCX downloads.
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    "Accept": "text/html,application/xhtml+xml,application/pdf,"
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```
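
The `rps` values can be enforced with a small per-source limiter that hands out evenly spaced request slots. This is a minimal sketch assuming one limiter instance per `source_id`; the class name and the injectable `clock` parameter (used so the behaviour can be tested without sleeping) are illustrative.

```python
import time


class MinIntervalLimiter:
    """Spaces requests at least 1/rps seconds apart for one source."""

    def __init__(self, rps: float, clock=time.monotonic):
        self.interval = 1.0 / rps
        self.clock = clock
        self.next_allowed = 0.0  # earliest time the next request may start

    def reserve(self) -> float:
        """Reserve the next slot and return how many seconds the caller should wait."""
        now = self.clock()
        start = max(now, self.next_allowed)
        self.next_allowed = start + self.interval
        return start - now
```

In an async fetcher, each request would be preceded by something like `await asyncio.sleep(limiter.reserve())`, with an `asyncio.Semaphore` sized from `concurrent` covering the in-flight cap.
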

---

## 8. Error Handling & Recovery

```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:     {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError: {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:     {"max_retries": 1, "backoff": "fixed", "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```

**Telegram alerting** (reuse Lulu's Telegram integration):

```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary": "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone": "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan

Priority order for initial corpus build:

```
Phase 1, Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010-2026, criminal filter
  [ ] High Court judgments 2000-2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010-2026
  Target: ~5,000 cases
Phase 2, Week 2:
  [ ] Federal Court PDF backfill 1995-2009
  [ ] QLD Judgments criminal, 2000-2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total
Phase 3, Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---

## 10. Output Contracts

**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`

```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```

**VectorIndex:** ChromaDB collection `au_cases`, filterable by `source_id`, `chunk_type`, `doc_id`

**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol

**MetaDB:** SQLite at `/data/meta.db`

---

## 11. Tech Stack

```
Language:        Python 3.12
HTTP:            httpx (async)
HTML parsing:    BeautifulSoup4
DOCX:            python-docx
PDF text:        pdfminer.six
PDF OCR:         via AusLaw MCP (pytesseract wrapper)
Embeddings:      openai text-embedding-3-small
Vector store:    chromadb (local)
Graph DB:        neo4j (Docker) or ryu_graph (Windows)
SQLite:          aiosqlite
LLM extraction:  anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:      mcp-python-sdk
Scheduling:      APScheduler
Alerting:        Telegram Bot API (reuse Lulu token)
Config:          TOML (pyproject-style)
```

---

## 12. File Structure

```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---

## 13. Open Questions / Decisions Deferred

1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for the eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both and tag each chunk with its source type.
3. **Suppression orders:** Some cases have partial suppression, recorded in the `suppression_order` flag on `CaseMeta`. Do not ingest suppressed content. Check AustLII suppression notices as reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap, but ada-002 or Voyage Law may give better legal domain performance. Evaluate on retrieval quality after the first 1,000 cases.

---

*End of spec v0.1*