aucourt-ingest/spec.md
slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline
Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-30 11:56:23 +10:00


# AuCourtIngest: Agent Specification v0.1
**Project:** Australian Legal Case Ingestion Pipeline
**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis
**Author:** Aaron King / Slothitude Games
**Date:** 2026-05-30

---
## 1. Overview
AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.
The agent has four operational modes:
| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |

---
## 2. Architecture
```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│             (main loop, mode dispatch,              │
│           rate limiting, error recovery)            │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│  SOURCE LAYER   │             │  PROCESSING LAYER   │
│                 │             │                     │
│ - FedCourtRSS   │──fetch─────▶│ - DocParser         │
│ - HighCourtRSS  │             │ - MetaExtractor     │
│ - NSWCaselaw    │             │ - OutcomeParser     │
│ - QLDJudgments  │             │ - ChunkEngine       │
│ - VICCatalogue  │             │ - EmbedEngine       │
│ - AusLawMCP     │             │ - GraphBuilder      │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                                ┌─────────────────────┐
                                │    STORAGE LAYER    │
                                │                     │
                                │ - DocStore (flat)   │
                                │ - VectorIndex       │
                                │ - PropertyGraph     │
                                │ - MetaDB (SQLite)   │
                                └─────────────────────┘
```

---
## 3. Source Definitions
### 3.1 Federal Court of Australia
```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```
**Fetch logic:**
1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS `<link>`
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977-1994: fetch PDF instead → hand to DocParser (PDF handler)

**DOCX URL pattern:**
```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```
**Bootstrap pagination:**
```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
  ?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
  &dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```
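
The RSS step and the bootstrap pagination URL above can be sketched with the standard library alone. This is a minimal illustration, not the project's fetcher: the function names `bootstrap_page_url` and `rss_item_links` are hypothetical, and it assumes the feed is plain RSS 2.0 with one `<link>` per `<item>`, which has not been checked against the live feed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SEARCH_URL = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"


def bootstrap_page_url(page_num: int,
                       date_from: str = "1995-01-01",
                       date_to: str = "2026-12-31") -> str:
    """Build the search URL for one page of the bootstrap listing."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": 50,
        "pageNum": page_num,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def rss_item_links(rss_xml: str) -> list[str]:
    """Return the <link> text of every <item> (assumes RSS 2.0 layout)."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link:
            links.append(link)
    return links
```

Each returned link would then be queued for the HTML fetch in step 3; the pagination loop would stop at the first page that yields no results.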

---
### 3.2 High Court of Australia
```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```
**Fetch logic:**
1. Crawl `/cases/recent-judgments` (paginated list with matter numbers)
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**
```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```
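
One way the keyword list could be applied is a case-insensitive substring match over the judgment text. The helper names `matched_keywords` and `is_priority` are hypothetical, and plain substring matching is an assumption: it will also match inflected forms such as "murdered".

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def matched_keywords(text: str, keywords: list[str] = PRIORITY_KEYWORDS) -> list[str]:
    """Return the keywords found in text, ignoring case and extra whitespace."""
    normalised = " ".join(text.lower().split())
    return [kw for kw in keywords if kw in normalised]


def is_priority(text: str) -> bool:
    """True when at least one priority keyword occurs in text."""
    return bool(matched_keywords(text))
```

Returning the matched keywords, rather than only a boolean, leaves room to map the number of hits onto the `priority` column of the fetch queue.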

---
### 3.3 NSW Caselaw
```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```
**Fetch logic:**
1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system

**MNC parsing:**
```python
# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = re.compile(r'\[(\d{4})\]\s+([A-Z]+)\s+(\d+)')

def parse_mnc(mnc: str) -> dict:
    """Split a medium neutral citation into year, court code, and number."""
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        # Fail loudly instead of raising TypeError on a None match
        raise ValueError(f"not a medium neutral citation: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```

---
### 3.4 Queensland Judgments
```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```
**Fetch logic:**
1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---
### 3.5 AusLaw MCP (Gap-Fill + Search)
```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```
**Use cases:**
- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**
```python
# `mcp` stands for an already-connected MCP client wrapper that exposes
# an async call(tool_name, arguments) method; run inside a coroutine.

# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---
## 4. Document Processing Pipeline
### 4.1 DocParser
Accepts: HTML, DOCX, PDF
Outputs: normalised `RawDocument` struct
```python
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str          # e.g. "nsw_caselaw"
    doc_id: str             # MNC or internal UUID
    url: str
    fetch_timestamp: str    # ISO8601
    raw_text: str           # full extracted text
    format: str             # html | docx | pdf
    pages: int | None
    char_count: int
```
**Format handlers:**
- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned PDFs
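
The HTML handler's "strip nav/header/footer" step can be illustrated as follows. To stay dependency-free, this sketch uses the standard library's `html.parser` as a stand-in for BeautifulSoup; the class name `JudgmentTextExtractor` and the exact set of skipped tags are assumptions, and it does not attempt the `.judgment-body` selection.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than judgment text
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class JudgmentTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the text of an HTML page with navigation chrome removed."""
    parser = JudgmentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The result would populate `RawDocument.raw_text`, with `char_count` taken from its length.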

---
### 4.2 MetaExtractor
Input: `RawDocument`
Output: `CaseMeta` struct
Method: Claude (`claude-haiku-4-5-20251001`) via API (cheap, fast, structured)
```python
from dataclasses import dataclass

@dataclass
class CaseMeta:
    # Identity
    case_name: str                    # e.g. "R v Smith"
    mnc: str                          # Medium Neutral Citation
    court: str                        # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str               # ISO8601
    jurisdiction: str                 # NSW | CTH | QLD | VIC etc
    # Classification
    matter_type: str                  # criminal | civil | appeal | coronial
    charges: list[str]                # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]      # ["homicide", "violence"]
    # Outcome
    verdict: str                      # guilty | not_guilty | appeal_allowed |
                                      # appeal_dismissed | conviction_quashed |
                                      # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None              # "18 years NMP 13"
    outcome_notes: str                # free text summary of result
    # Flags
    is_appeal: bool
    appeal_of: str | None             # MNC of original decision
    was_appealed: bool                # filled in later by back-reference
    exoneration_flag: bool            # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool           # whether publication restrictions apply
```
**Extraction prompt (system):**
```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative: use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```
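
Because the prompt asks for JSON only and a closed set of verdict values, the model's reply can be checked before it is trusted. The sketch below is one possible guard; `parse_meta_response` is a hypothetical name, and it validates only the `verdict` field, not the full `CaseMeta` schema.

```python
import json

# Verdict values permitted by the extraction prompt
ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_meta_response(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the verdict set."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    # null is allowed: the prompt tells the model to use it when unsure
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict value: {verdict!r}")
    return data
```

A `ValueError` here would be a natural trigger for the `ParseError` retry path in section 8.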

---
### 4.3 OutcomeParser
A separate pass dedicated to verdict/outcome ground truth. This is the most critical field because it is the value the divergence signal is compared against.
**Resolution priority:**
1. Explicit headnote: "HELD: conviction quashed"
2. Orders section: "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference: if a later appeal is found, update the original record
4. Manual flag: `exoneration_flag=True` sourced from an external list

**External exoneration sources:**
- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.

---
### 4.4 ChunkEngine
Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.
**Chunk types and sizes:**
| Chunk Type | Content | Target Tokens | Overlap (tokens) |
|------------|---------|---------------|------------------|
| `opening` | Case intro, charges, parties | 400-600 | none |
| `testimony` | Each witness examination block | 300-500 | 50 |
| `exhibit` | Evidence description + ruling | 200-400 | none |
| `ruling` | Each evidentiary ruling by judge | 200-300 | none |
| `closing` | Closing submissions summary | 400-600 | 50 |
| `judgment` | Judicial reasoning blocks | 400-600 | 50 |
| `sentence` | Sentencing remarks | 300-500 | none |

**Chunk struct:**
```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str           # UUID
    doc_id: str             # parent document MNC
    chunk_type: str         # from table above
    sequence: int           # position in document
    text: str
    token_count: int
    speaker: str | None     # witness name if testimony
    page_ref: str | None    # "p.47" if available
    embedding: list[float]  # 1536-dim, filled by EmbedEngine
```
**Chunking strategy:**
Use Claude (`claude-haiku-4-5-20251001`) to identify structural boundaries:
```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```
Then split at identified boundaries, keeping each chunk within token budget.
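
The split step can be sketched as below, taking the boundary array in the shape requested by the prompt. Two assumptions are made for the sake of a runnable example: a "token" is approximated by a whitespace-separated word (a real tokenizer would give different counts), and the function names `split_section` and `chunk_document` are hypothetical.

```python
def split_section(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Cut text into word windows of at most max_tokens, sharing `overlap` words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
    return pieces


def chunk_document(raw_text: str, sections: list[dict],
                   budgets: dict[str, tuple[int, int]]) -> list[dict]:
    """Turn boundary records into chunk dicts; budgets maps type -> (max, overlap)."""
    chunks = []
    for sec in sections:
        body = raw_text[sec["start_char"]:sec["end_char"]]
        max_tokens, overlap = budgets[sec["section_type"]]
        for piece in split_section(body, max_tokens, overlap):
            chunks.append({
                "chunk_type": sec["section_type"],
                "sequence": len(chunks),
                "speaker": sec.get("speaker"),
                "text": piece,
                "token_count": len(piece.split()),
            })
    return chunks
```

A budget table such as `{"opening": (600, 0), "testimony": (500, 50)}` would carry the upper bounds and overlaps from the table in this section.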

---
### 4.5 EmbedEngine
```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb                # local, no infra
collection_naming: "au_cases_{source_id}"
```
Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.
**Query interface:**
```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,  # filter by type
    doc_ids: list[str] | None = None,      # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```

---
### 4.6 GraphBuilder
Builds the property graph from extracted metadata + chunk relationships.
**Node types:**
```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})           # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome})         # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```
**Relationship types:**
```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)           # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)         # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)    # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)     # semantic similarity < 0.25 on same topic
```
**Graph backend:**
- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**
```python
# After all chunks are embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit.
# chunk_pairs_within_case, cosine_similarity, same_topic, and graph are
# pipeline helpers defined elsewhere.
for pair in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(pair[0].embedding, pair[1].embedding)
    if sim > 0.85:
        graph.add_edge(pair[0], pair[1], "CORROBORATES", weight=sim)
    elif same_topic(pair[0], pair[1]) and sim < 0.25:
        graph.add_edge(pair[0], pair[1], "CONTRADICTS", weight=sim)
```

---
## 5. Juror Subgraph Query Interface
This is the interface the jury agent calls at deliberation time.
```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```
**Persona → graph traversal mapping:**
```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```
**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
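
To make the invariant concrete, here is a toy traversal over an in-memory edge list that refuses to follow `RULED_INADMISSIBLE` even when a caller lists it by mistake. It is a sketch only: `collect_chunks` is a hypothetical name, the graph is a plain list of `(source, edge_type, target)` tuples rather than a Neo4j query, and the direction of the `RULED_INADMISSIBLE` edge (ruling to exhibit) is an assumption, since the spec does not define it.

```python
from collections import deque

# Edge types no juror query may follow, whatever the persona config says
FORBIDDEN_EDGES = frozenset({"RULED_INADMISSIBLE"})


def collect_chunks(edges: list[tuple[str, str, str]], chunks: dict[str, dict],
                   anchors: list[str], edge_types: list[str],
                   chunk_types: list[str], max_tokens: int) -> list[str]:
    """Breadth-first walk from anchors; return chunk ids that fit the budget."""
    allowed = set(edge_types) - FORBIDDEN_EDGES  # the wall is applied here
    adjacency: dict[str, list[str]] = {}
    for src, edge_type, dst in edges:
        if edge_type in allowed:
            adjacency.setdefault(src, []).append(dst)

    seen, queue = set(anchors), deque(anchors)
    picked, used = [], 0
    while queue:
        node = queue.popleft()
        chunk = chunks.get(node)
        if chunk and chunk["type"] in chunk_types:
            if used + chunk["token_count"] <= max_tokens:
                picked.append(node)
                used += chunk["token_count"]
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return picked
```

Filtering the edge set before building the adjacency map means content reachable only through a forbidden edge is never visited, independent of any prompt-level instruction.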

---
## 6. MetaDB Schema (SQLite)
Tracks ingestion state: what has been fetched, processed, or failed.
```sql
CREATE TABLE documents (
  doc_id TEXT PRIMARY KEY,          -- MNC or UUID
  source_id TEXT NOT NULL,
  url TEXT NOT NULL,
  fetch_status TEXT,                -- pending|fetched|parsed|embedded|graphed|failed
  fetch_timestamp TEXT,
  parse_timestamp TEXT,
  embed_timestamp TEXT,
  graph_timestamp TEXT,
  error_message TEXT,
  char_count INTEGER,
  chunk_count INTEGER,
  matter_type TEXT,
  court TEXT,
  year INTEGER,
  verdict TEXT,
  exoneration_flag INTEGER DEFAULT 0
);
CREATE TABLE fetch_queue (
  queue_id INTEGER PRIMARY KEY AUTOINCREMENT,
  source_id TEXT,
  url TEXT,
  priority INTEGER DEFAULT 5,       -- 1=highest
  queued_at TEXT,
  attempts INTEGER DEFAULT 0
);
CREATE TABLE source_state (
  source_id TEXT PRIMARY KEY,
  last_poll TEXT,
  last_rss_etag TEXT,
  docs_fetched INTEGER DEFAULT 0,
  docs_failed INTEGER DEFAULT 0
);
```
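
As an illustration of how the `fetch_queue` table might be driven, the sketch below uses the synchronous `sqlite3` module on an in-memory database (the spec lists `aiosqlite`; the SQL would be the same). The function names and the claim-then-delete lifecycle are assumptions, not part of the schema.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the fetch_queue table exists."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fetch_queue (
            queue_id INTEGER PRIMARY KEY AUTOINCREMENT,
            source_id TEXT,
            url TEXT,
            priority INTEGER DEFAULT 5,
            queued_at TEXT,
            attempts INTEGER DEFAULT 0
        )""")
    return conn


def enqueue(conn, source_id: str, url: str, priority: int = 5) -> None:
    conn.execute(
        "INSERT INTO fetch_queue (source_id, url, priority, queued_at) "
        "VALUES (?, ?, ?, datetime('now'))",
        (source_id, url, priority),
    )


def claim_next(conn):
    """Return the best job (lowest priority number, then oldest) and count the attempt."""
    row = conn.execute(
        "SELECT queue_id, source_id, url FROM fetch_queue "
        "ORDER BY priority ASC, queue_id ASC LIMIT 1"
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE fetch_queue SET attempts = attempts + 1 WHERE queue_id = ?",
            (row[0],),
        )
    return row


def finish(conn, queue_id: int) -> None:
    """Remove a job once its document has been fetched."""
    conn.execute("DELETE FROM fetch_queue WHERE queue_id = ?", (queue_id,))
```

Leaving a failed job in the table with its `attempts` count incremented is what lets the retry policy in section 8 decide when to give up.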

---
## 7. Rate Limiting & Politeness
```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0, "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5, "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3, "concurrent": 1, "retry_after": 180},
}
# Always send (Accept covers HTML, PDF, and DOCX, the formats the sources serve):
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    "Accept": (
        "text/html,application/xhtml+xml,application/pdf,"
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ),
}
# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```
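
A per-source throttle and the robots.txt check can both be sketched with the standard library. `Throttle` and `allowed_by_robots` are hypothetical names; the throttle takes the current time as an argument so that it can be tested without sleeping, and the `concurrent` limit is not modelled here.

```python
import urllib.robotparser


class Throttle:
    """Spaces requests to one source at least 1/rps seconds apart."""

    def __init__(self, rps: float) -> None:
        self.interval = 1.0 / rps
        self.next_allowed = 0.0

    def wait_time(self, now: float) -> float:
        """Return how long to sleep before sending, and reserve that slot."""
        delay = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate an already-downloaded robots.txt body for one URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In the fetch loop this would be used as `await asyncio.sleep(throttle.wait_time(time.monotonic()))` before each request, with one `Throttle` per entry in `RATE_LIMITS`.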

---
## 8. Error Handling & Recovery
```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:     {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError: {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:     {"max_retries": 1, "backoff": "fixed", "base": 0},
}
# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```
**Telegram alerting** (reuse Lulu's Telegram integration):
```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary": "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone": "🎯 Corpus reached {N} cases",
}
```

---
## 9. Bootstrap Run Plan
Priority order for initial corpus build:
```
Phase 1, Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010-2026, criminal filter
  [ ] High Court judgments 2000-2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010-2026
  Target: ~5,000 cases
Phase 2, Week 2:
  [ ] Federal Court PDF backfill 1995-2009
  [ ] QLD Judgments criminal, 2000-2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total
Phase 3, Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---
## 10. Output Contracts
**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`
```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```
**VectorIndex:** ChromaDB collections `au_cases_{source_id}` (one per source, as named in section 4.5), filterable by `chunk_type` and `doc_id`

**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol

**MetaDB:** SQLite at `/data/meta.db`
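
An MNC such as `[2019] NSWSC 1234` contains brackets and spaces, so the `{mnc}` part of the DocStore path needs a file-safe form. The sketch below assumes the same `COURT_NUMBER` form that the `raw_text_path` example uses; `doc_store_path` is a hypothetical helper, not a defined part of the contract.

```python
import re
from pathlib import PurePosixPath

MNC_RE = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")


def doc_store_path(source_id: str, mnc: str,
                   root: str = "/data/docs") -> PurePosixPath:
    """Map an MNC to /data/docs/{source_id}/{year}/{COURT}_{N}.json."""
    m = MNC_RE.fullmatch(mnc.strip())
    if m is None:
        raise ValueError(f"not a medium neutral citation: {mnc!r}")
    year, court, number = m.groups()
    return PurePosixPath(root) / source_id / year / f"{court}_{number}.json"
```

Deriving the year directory from the citation itself keeps the path stable even when `date_delivered` is missing from the extracted metadata.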

---
## 11. Tech Stack
```
Language:        Python 3.12
HTTP:            httpx (async)
HTML parsing:    BeautifulSoup4
DOCX:            python-docx
PDF text:        pdfminer.six
PDF OCR:         via AusLaw MCP (pytesseract wrapper)
Embeddings:      openai text-embedding-3-small
Vector store:    chromadb (local)
Graph DB:        neo4j (Docker) or ryu_graph (Windows)
SQLite:          aiosqlite
LLM extraction:  anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:      mcp-python-sdk
Scheduling:      APScheduler
Alerting:        Telegram Bot API (reuse Lulu token)
Config:          TOML (pyproject-style)
```

---
## 12. File Structure
```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---
## 13. Open Questions / Decisions Deferred
1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.
2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both, tag chunk source type.
3. **Suppression orders:** Some cases have partial suppression. Track this with the `suppression_order` flag on `CaseMeta` (the MetaDB schema in section 6 has no such column yet). Do not ingest suppressed content. Check AustLII suppression notices as reference.
4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.
5. **Embedding model:** text-embedding-3-small is cheap but ada-002 or Voyage Law may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.

---
*End of spec v0.1*