aucourt-ingest/spec.md at master

slothitude d77fe12cfc AuCourtIngest: complete 8-stage Australian legal case ingestion pipeline

Source layer (5 court sources), processing pipeline (parse/extract/chunk/embed/graph),
property graph with 8 node types, juror subgraph queries with 6 personas,
orchestrator with bootstrap/watch/backfill/audit/process modes, 170 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-30 11:56:23 +10:00


# AuCourtIngest — Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline

**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis

**Author:** Aaron King / Slothitude Games

**Date:** 2026-05-30

---

## 1. Overview

AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.

The agent has four operational modes:

| Mode | Trigger | Description |
|------|---------|-------------|
| bootstrap | Manual / first run | Historical bulk ingest from all tier-1 sources |
| watch | Cron / RSS poll | Continuous ingestion of new decisions |
| backfill | Manual | Fill gaps in existing corpus by date/court/charge-type |
| audit | Manual | Re-process existing documents to update graph schema |

---

## 2. Architecture


```
┌─────────────────────────────────────────────────────┐
│                   ORCHESTRATOR                      │
│           (main loop, mode dispatch,                │
│            rate limiting, error recovery)           │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│   SOURCE LAYER  │             │   PROCESSING LAYER  │
│                 │             │                     │
│  - FedCourtRSS  │──fetch─────▶│  - DocParser        │
│  - HighCourtRSS │             │  - MetaExtractor    │
│  - NSWCaselaw   │             │  - OutcomeParser    │
│  - QLDJudgments │             │  - ChunkEngine      │
│  - VICCatalogue │             │  - EmbedEngine      │
│  - AusLawMCP    │             │  - GraphBuilder     │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                               ┌─────────────────────┐
                               │   STORAGE LAYER     │
                               │                     │
                               │  - DocStore (flat)  │
                               │  - VectorIndex      │
                               │  - PropertyGraph    │
                               │  - MetaDB (SQLite)  │
                               └─────────────────────┘
```

---

## 3. Source Definitions

### 3.1 Federal Court of Australia


```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```

**Fetch logic:**

1. Poll RSS feed every 6 hours

2. For each new item: extract judgment URL from RSS `<link>`

3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)

4. Download DOCX → hand to DocParser

5. For 1977–1994: fetch PDF instead → hand to PDFParser

**DOCX URL pattern:**


```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```
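The format choice in the fetch logic and the DOCX URL pattern above can be sketched as two small pure functions. This is a minimal sketch, assuming the pattern and year boundaries stated in this section are accurate; `fedcourt_format` and `fedcourt_docx_url` are hypothetical names, `index` is assumed to be used unpadded, and the PDF URL layout for 1977–1994 is not given in this spec, so no PDF URL is built.

```python
COVERAGE_FROM = 1977   # coverage_from in the source definition
DOCX_FROM = 1995       # docx_from in the source definition
BASE_URL = "https://www.judgments.fedcourt.gov.au"


def fedcourt_format(year: int) -> str:
    """Return the document format to request for a given judgment year."""
    if year < COVERAGE_FROM:
        raise ValueError(f"no Federal Court coverage before {COVERAGE_FROM}")
    return "docx" if year >= DOCX_FROM else "pdf"


def fedcourt_docx_url(year: int, index: int) -> str:
    """Build the DOCX URL from the pattern given in this section."""
    if fedcourt_format(year) != "docx":
        raise ValueError(f"{year} is in the PDF-only range")
    return f"{BASE_URL}/judgments/fca/{year}/{index}.docx"
```

Keeping the year check separate from URL construction lets the orchestrator route pre-1995 items to the PDF path without guessing a URL.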

**Bootstrap pagination:**


```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
  ?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
  &dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```

---

### 3.2 High Court of Australia


```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```

**Fetch logic:**

1. Crawl /cases/recent-judgments — paginated list with matter numbers

2. For each matter: fetch judgment HTML page

3. Also fetch transcript if available (/transcripts/ path)

4. Store judgment + transcript as separate documents, linked by matter\_id

**Priority filter for MVP:**


```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```
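One way the keyword list could feed the fetch queue is sketched below. This is an illustration only: `priority_hits` and `queue_priority` are hypothetical helpers, matching is a plain case-insensitive substring test, and the mapping to priority 1 (highest) versus the default 5 is an assumption borrowed from the `fetch_queue` table in section 6.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def priority_hits(text: str) -> list[str]:
    """Return the priority keywords found in text, in list order."""
    lowered = text.lower()
    return [kw for kw in PRIORITY_KEYWORDS if kw in lowered]


def queue_priority(text: str, default: int = 5) -> int:
    """Map a title or catchwords string to a queue priority (1 = highest)."""
    return 1 if priority_hits(text) else default
```

Substring matching is deliberately crude; it will also match keywords that appear in passing, so it is better suited to ordering the queue than to excluding cases.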

---

### 3.3 NSW Caselaw


```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```

**Fetch logic:**

1. Browse /browse?filter=criminal → paginate through all pages

2. Extract MNC + decision URL per row

3. Fetch decision HTML → parse full text + metadata

4. MNC serves as canonical case ID throughout system

**MNC parsing:**


```python
# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = r'\[(\d{4})\]\s+([A-Z]+)\s+(\d+)'

def parse_mnc(mnc: str) -> dict:
    m = re.match(MNC_PATTERN, mnc)
    if m is None:
        # Fail with a clear message instead of a TypeError on m[1]
        raise ValueError(f"not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```

---

### 3.4 Queensland Judgments


```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```

**Fetch logic:**

1. Use search API with matter\_type=criminal filter

2. Paginate results → extract case URLs

3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)


```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```

**Use cases:**

- Fetch a specific case by citation when referenced in another judgment

- OCR for scanned pre-1995 PDFs

- Cross-reference citation graph to find related cases

- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**


```python
# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```

---

## 4. Document Processing Pipeline

### 4.1 DocParser

Accepts: HTML, DOCX, PDF

Outputs: normalised RawDocument struct


```python
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str           # e.g. "nsw_caselaw"
    doc_id: str              # MNC or internal UUID
    url: str
    fetch_timestamp: str     # ISO8601
    raw_text: str            # full extracted text
    format: str              # html | docx | pdf
    pages: int | None
    char_count: int
```

**Format handlers:**

- HTML: BeautifulSoup → strip nav/header/footer → extract .judgment-body or equivalent

- DOCX: python-docx → iterate paragraphs, preserve heading structure

- PDF: pdfminer.six for text PDFs; pytesseract via AusLaw MCP OCR for scanned

---

### 4.2 MetaExtractor

Input: RawDocument

Output: CaseMeta struct

Method: Claude claude-haiku-4-5-20251001 via API (cheap, fast, structured)


```python
from dataclasses import dataclass

@dataclass
class CaseMeta:
    # Identity
    case_name: str           # e.g. "R v Smith"
    mnc: str                 # Medium Neutral Citation
    court: str               # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str      # ISO8601
    jurisdiction: str        # NSW | CTH | QLD | VIC etc

    # Classification
    matter_type: str         # criminal | civil | appeal | coronial
    charges: list[str]       # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]  # ["homicide", "violence"]

    # Outcome
    verdict: str             # guilty | not_guilty | appeal_allowed |
                             # appeal_dismissed | conviction_quashed |
                             # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None     # "18 years NMP 13"
    outcome_notes: str       # free text summary of result

    # Flags
    is_appeal: bool
    appeal_of: str | None    # MNC of original decision
    was_appealed: bool       # filled in later by back-reference
    exoneration_flag: bool   # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool  # whether publication restrictions apply
```

**Extraction prompt (system):**


```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative: use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```
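Because the prompt restricts `verdict` to a closed set, the model's reply can be checked before it is written anywhere. The following is a minimal sketch, not the project's validator: `parse_case_meta` is a hypothetical helper, and mapping a `null` verdict to `"n/a"` is an assumption about how unknown outcomes should be stored.

```python
import json

# Allowed values copied from the extraction prompt above
ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_case_meta(raw: str) -> dict:
    """Parse the model's JSON reply and enforce the verdict vocabulary."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    if verdict is None:
        data["verdict"] = "n/a"  # assumption: null verdict is stored as n/a
    elif verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return data
```

Rejecting out-of-vocabulary verdicts here, rather than coercing them, keeps bad extractions visible so they can be routed to review.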

---

### 4.3 OutcomeParser

Separate pass specifically for verdict/outcome ground truth. This is the most critical field — it's what the divergence signal compares against.

**Resolution priority:**

1. Explicit headnote — "HELD: conviction quashed"

2. Orders section — "Appeal allowed. Convictions on counts 1 and 3 set aside."

3. Appeal cross-reference — if later appeal found, update original record

4. Manual flag — exoneration\_flag=True sourced from external list
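The priority list can be read as "the highest-ranked source that produced a result wins". A minimal sketch under that reading follows; `OutcomeEvidence`, `RESOLUTION_ORDER`, and `resolve_outcome` are hypothetical names, and the spec does not say whether a later appeal should instead override an earlier headnote.

```python
from dataclasses import dataclass
from typing import Optional

# Source labels in the priority order listed above (labels are assumptions)
RESOLUTION_ORDER = ["headnote", "orders", "appeal_xref", "manual_flag"]


@dataclass
class OutcomeEvidence:
    source: str   # one of RESOLUTION_ORDER
    verdict: str  # a value from the CaseMeta verdict vocabulary


def resolve_outcome(evidence: list[OutcomeEvidence]) -> Optional[OutcomeEvidence]:
    """Pick the evidence item whose source ranks highest in RESOLUTION_ORDER."""
    rank = {name: i for i, name in enumerate(RESOLUTION_ORDER)}
    known = [e for e in evidence if e.source in rank]
    # min() over rank picks the earliest entry in the priority list
    return min(known, key=lambda e: rank[e.source], default=None)
```

Returning the whole evidence item, not just the verdict, keeps a record of which source decided the outcome.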

**External exoneration sources:**

- RMIT Wrongful Convictions Register (AU)

- Innocence Project Australia case list

- Royal Commission findings referencing specific convictions

These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.
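A lookup keyed by defendant name and year needs a stable key, since names arrive with inconsistent case and spacing. This is a sketch only: `exoneration_key` and `is_exonerated` are hypothetical, the `name|year` key format is an assumption, and which year is meant (conviction or exoneration) is not stated in this spec.

```python
def exoneration_key(defendant: str, year: int) -> str:
    """Normalise case and whitespace so equivalent names share one key."""
    name = " ".join(defendant.lower().split())
    return f"{name}|{year}"


def is_exonerated(defendant: str, year: int, table: dict) -> bool:
    """Check the static lookup table (loaded from JSON) for a matching entry."""
    return exoneration_key(defendant, year) in table
```

Name-plus-year keys will collide for common names, so a hit is probably best treated as a candidate for confirmation rather than as ground truth.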

---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk type | Content | Target tokens | Overlap (tokens) |
|------------|---------|---------------|---------|
| testimony | Each witness examination block | 300–500 | 50 |
| closing | Closing submissions summary | 400–600 | 50 |
| judgment | Judicial reasoning blocks | 400–600 | 50 |

**Chunk struct:**


```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str            # UUID
    doc_id: str              # parent document MNC
    chunk_type: str          # from table above
    sequence: int            # position in document
    text: str
    token_count: int
    speaker: str | None      # witness name if testimony
    page_ref: str | None     # "p.47" if available
    embedding: list[float]   # 1536-dim, filled by EmbedEngine
```

**Chunking strategy:**

Use Claude claude-haiku-4-5-20251001 to identify structural boundaries:


```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```

Then split at identified boundaries, keeping each chunk within token budget.
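The split step can be sketched as follows, taking the JSON array described above as input. This is a rough illustration: `split_section` and `chunk_document` are hypothetical names, tokens are approximated by whitespace-separated words (a real tokenizer would give different counts), and the defaults of 500 tokens with 50 overlap are assumptions chosen from the ranges in the table.

```python
def split_section(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split one section into windows of at most max_tokens words with overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    chunks, start, step = [], 0, max_tokens - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the window already reached the end of the section
        start += step
    return chunks


def chunk_document(raw_text: str, sections: list[dict],
                   max_tokens: int = 500, overlap: int = 50) -> list[dict]:
    """Cut raw_text at section boundaries, then window each section."""
    out = []
    for sec in sections:
        body = raw_text[sec["start_char"]:sec["end_char"]]
        for piece in split_section(body, max_tokens, overlap):
            out.append({
                "chunk_type": sec["section_type"],
                "speaker": sec.get("speaker"),
                "sequence": len(out),  # position across the whole document
                "text": piece,
            })
    return out
```

Windows never cross a section boundary, so a testimony chunk cannot pick up text from the ruling that follows it.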

---

### 4.5 EmbedEngine


```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb  # local, no infra
collection_naming: "au_cases_{source_id}"
```

Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.

**Query interface:**


```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,   # filter by type
    doc_ids: list[str] | None = None,       # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```
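The two optional filters could be translated into a ChromaDB metadata filter along these lines. This sketch assumes chunk metadata is stored under the keys `chunk_type` and `doc_id`, and relies on ChromaDB's `$in` and `$and` filter operators; `build_where` is a hypothetical helper whose result would be passed as the `where=` argument of a collection query.

```python
from typing import Optional


def build_where(chunk_types: Optional[list[str]] = None,
                doc_ids: Optional[list[str]] = None) -> Optional[dict]:
    """Build a metadata filter dict, or None when no filter applies."""
    clauses = []
    if chunk_types:
        clauses.append({"chunk_type": {"$in": chunk_types}})
    if doc_ids:
        clauses.append({"doc_id": {"$in": doc_ids}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]  # $and needs at least two clauses
    return {"$and": clauses}
```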

---

### 4.6 GraphBuilder

Builds the property graph from extracted metadata + chunk relationships.

**Node types:**


```cypher
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})          # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome})        # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**


```cypher
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)           # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)         # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)    # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)     # semantic similarity < 0.25 on same topic
```

**Graph backend:**

- Primary: Neo4j (local Docker) for development

- Alternative: RyuGraph if staying in the existing Lulu/Windows stack

- Export: GraphML for portability

**CORROBORATES / CONTRADICTS edge generation:**


```python
# After all chunks embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
for pair in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(pair[0].embedding, pair[1].embedding)
    if sim > 0.85:
        graph.add_edge(pair[0], pair[1], "CORROBORATES", weight=sim)
    elif same_topic(pair[0], pair[1]) and sim < 0.25:
        graph.add_edge(pair[0], pair[1], "CONTRADICTS", weight=sim)
```
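The loop above depends on helpers that this spec does not define. A self-contained sketch of the same classification is shown below, using plain Python lists as embeddings. `classify_pairs` is a hypothetical name, `same_topic` is passed in as a callable because its definition is left open, and the thresholds are the ones used in the loop.

```python
import math
from itertools import combinations

CORROBORATES_MIN = 0.85  # threshold from the loop above
CONTRADICTS_MAX = 0.25   # threshold from the loop above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pairs(embeddings: dict[str, list[float]], same_topic) -> list[tuple]:
    """Return (chunk_a, chunk_b, edge_type, similarity) for every qualifying pair."""
    edges = []
    # combinations() visits each unordered pair once: O(n^2) in chunk count
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim > CORROBORATES_MIN:
            edges.append((id_a, id_b, "CORROBORATES", sim))
        elif same_topic(id_a, id_b) and sim < CONTRADICTS_MAX:
            edges.append((id_a, id_b, "CONTRADICTS", sim))
    return edges
```

Low embedding similarity only means two passages are about different things, so the `same_topic` gate is what carries most of the weight for the CONTRADICTS edges.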

---

## 5. Juror Subgraph Query Interface

This is the interface the jury agent calls at deliberation time.


```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**


```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```

**Critical invariant:** RULED\_INADMISSIBLE edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
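Enforcing the wall at traversal level could look like the sketch below, which runs a breadth-first walk over an in-memory edge list. Several things here are assumptions: `collect_chunks` is a hypothetical name, edges are `(source, edge_type, target)` tuples, and a `RULED_INADMISSIBLE` edge is taken to point at the node that must be hidden (the spec does not define that edge's endpoints).

```python
from collections import deque

EXCLUDED_EDGE = "RULED_INADMISSIBLE"


def collect_chunks(edges, chunk_tokens, anchors, edge_types, max_tokens=4000):
    """Walk allowed edges from the anchors and collect chunk IDs within budget.

    edges: list of (source, edge_type, target) tuples
    chunk_tokens: dict mapping chunk ID to its token count
    """
    # Any node targeted by an excluded edge is never visited or returned
    blocked = {dst for _src, etype, dst in edges if etype == EXCLUDED_EDGE}
    allowed = set(edge_types) - {EXCLUDED_EDGE}
    queue = deque(a for a in anchors if a not in blocked)
    seen = set(queue)
    selected, used = [], 0
    while queue:
        node = queue.popleft()
        for src, etype, dst in edges:
            if src != node or etype not in allowed:
                continue
            if dst in blocked or dst in seen:
                continue
            seen.add(dst)
            queue.append(dst)
            cost = chunk_tokens.get(dst)  # None for non-chunk nodes
            if cost is not None and used + cost <= max_tokens:
                selected.append(dst)
                used += cost
    return selected
```

Because blocked nodes are dropped before the walk starts, no choice of persona edge types can reach them, which is the property the invariant asks for.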

---

## 6. MetaDB Schema (SQLite)

Tracks ingestion state — what's been fetched, processed, failed.


```sql
CREATE TABLE documents (
    doc_id          TEXT PRIMARY KEY,  -- MNC or UUID
    source_id       TEXT NOT NULL,
    url             TEXT NOT NULL,
    fetch_status    TEXT,              -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp TEXT,
    parse_timestamp TEXT,
    embed_timestamp TEXT,
    graph_timestamp TEXT,
    error_message   TEXT,
    char_count      INTEGER,
    chunk_count     INTEGER,
    matter_type     TEXT,
    court           TEXT,
    year            INTEGER,
    verdict         TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id       TEXT,
    url             TEXT,
    priority        INTEGER DEFAULT 5,  -- 1=highest
    queued_at       TEXT,
    attempts        INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id       TEXT PRIMARY KEY,
    last_poll       TEXT,
    last_rss_etag   TEXT,
    docs_fetched    INTEGER DEFAULT 0,
    docs_failed     INTEGER DEFAULT 0
);
```
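How the orchestrator might pull work from `fetch_queue` is sketched below. For brevity it uses the standard-library `sqlite3` module rather than the `aiosqlite` listed in the tech stack; `enqueue` and `claim_next` are hypothetical helpers, and the `max_attempts` cutoff of 3 is an assumption.

```python
import sqlite3

# Same columns as the fetch_queue table above
FETCH_QUEUE_DDL = """
CREATE TABLE IF NOT EXISTS fetch_queue (
    queue_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id  TEXT,
    url        TEXT,
    priority   INTEGER DEFAULT 5,
    queued_at  TEXT,
    attempts   INTEGER DEFAULT 0
)
"""


def enqueue(conn, source_id, url, queued_at, priority=5):
    """Add a URL to the queue and return its queue_id."""
    cur = conn.execute(
        "INSERT INTO fetch_queue (source_id, url, priority, queued_at) "
        "VALUES (?, ?, ?, ?)",
        (source_id, url, priority, queued_at),
    )
    return cur.lastrowid


def claim_next(conn, max_attempts=3):
    """Return the next (queue_id, source_id, url) row and count the attempt."""
    row = conn.execute(
        "SELECT queue_id, source_id, url FROM fetch_queue "
        "WHERE attempts < ? ORDER BY priority ASC, queued_at ASC LIMIT 1",
        (max_attempts,),
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE fetch_queue SET attempts = attempts + 1 WHERE queue_id = ?",
            (row[0],),
        )
    return row
```

Ordering by `priority` first and `queued_at` second means priority-1 items always jump the queue, while items of equal priority are served oldest first.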

---

## 7. Rate Limiting & Politeness


```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0,  "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5,  "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0,  "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5,  "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3,  "concurrent": 1, "retry_after": 180},
}

# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # application/msword covers legacy .doc; .docx uses the OOXML type
    "Accept": "text/html,application/xhtml+xml,application/pdf,application/msword,"
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```
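The `rps` values above translate into a minimum gap between requests. A minimal single-threaded sketch follows; `MinIntervalLimiter` is a hypothetical class, it covers only the `rps` setting (not `concurrent` or `retry_after`), and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 1/rps seconds apart for one source."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0.0:
            self._sleep(delay)
        # Schedule the following slot relative to when this request may start
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

A source with `rps: 0.5` therefore gets one request every two seconds, with one limiter instance per `source_id`.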

---

## 8. Error Handling & Recovery


```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:      {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError:  {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:      {"max_retries": 1, "backoff": "fixed",       "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```
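The policy table can be turned into concrete wait times as sketched here. Two points are assumptions rather than part of the spec: "exponential" is read as doubling from `base` on each retry, and `base` is taken to be in seconds. `backoff_delay` is a hypothetical helper.

```python
from typing import Optional


class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass


RETRY_POLICY = {
    FetchError:      {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError:  {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:      {"max_retries": 1, "backoff": "fixed",       "base": 0},
}


def backoff_delay(exc_type: type, attempt: int) -> Optional[float]:
    """Delay before retry number `attempt` (1-based), or None to give up."""
    policy = RETRY_POLICY[exc_type]
    if attempt > policy["max_retries"]:
        return None
    if policy["backoff"] == "exponential":
        return float(policy["base"] * 2 ** (attempt - 1))
    return float(policy["base"])
```

Under this reading a `FetchError` waits 30, 60, then 120 seconds before the source is treated as failing.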

**Telegram alerting** (reuse Lulu's Telegram integration):


```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary":   "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone":       "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan

Priority order for initial corpus build:


```
Phase 1, Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  [ ] High Court judgments 2000–2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010–2026
  Target: ~5,000 cases

Phase 2, Week 2:
  [ ] Federal Court DOCX backfill 1995–2009
  [ ] QLD Judgments criminal, 2000–2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total

Phase 3, Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```

---

## 10. Output Contracts

**DocStore:** /data/docs/{source\_id}/{year}/{mnc}.json


```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```
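The DocStore path can be derived from the MNC as sketched below. The spec gives `{mnc}.json` as the filename without saying how brackets and spaces are handled, so this sketch assumes the same `COURT_NUMBER` form that appears in the `raw_text_path` example; `doc_store_path` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

MNC_RE = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")


def doc_store_path(source_id: str, mnc: str, root: str = "/data/docs") -> PurePosixPath:
    """Map an MNC such as '[2019] NSWSC 1234' to its DocStore JSON path."""
    m = MNC_RE.fullmatch(mnc.strip())
    if m is None:
        raise ValueError(f"not a valid MNC: {mnc!r}")
    year, court, number = m.groups()
    # Assumption: filename follows the COURT_NUMBER style of raw_text_path
    return PurePosixPath(root) / source_id / year / f"{court}_{number}.json"
```

Deriving the year directory from the MNC itself keeps the path stable even if the delivery date in the metadata is later corrected.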

**VectorIndex:** ChromaDB collections named `au_cases_{source_id}` (one per source, as configured in section 4.5), filterable by chunk\_type and doc\_id

**PropertyGraph:** Neo4j au\_legal database, accessible via Bolt protocol

**MetaDB:** SQLite at /data/meta.db

---

## 11. Tech Stack


```
Language:       Python 3.12
HTTP:           httpx (async)
HTML parsing:   BeautifulSoup4
DOCX:           python-docx
PDF text:       pdfminer.six
PDF OCR:        via AusLaw MCP (pytesseract wrapper)
Embeddings:     openai text-embedding-3-small
Vector store:   chromadb (local)
Graph DB:       neo4j (Docker) or ryu_graph (Windows)
SQLite:         aiosqlite
LLM extraction: anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:     mcp-python-sdk
Scheduling:     APScheduler
Alerting:       Telegram Bot API (reuse Lulu token)
Config:         TOML (pyproject-style)
```

---

## 12. File Structure


```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```

---

## 13. Open Questions / Decisions Deferred

1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.

2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both, tag chunk source type.

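3. **Suppression orders:** Some cases are partially suppressed. Track this with a suppression\_order flag (currently defined on CaseMeta in section 4.2; the MetaDB schema in section 6 has no such column yet) and do not ingest suppressed content. Use AustLII suppression notices as a reference.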

4. **AustLII partnership timing:** Email now or after MVP? Recommend now — low effort, long lead time for approval.

5. **Embedding model:** text-embedding-3-small is cheap but ada-002 or Voyage Law may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.

---

*End of spec v0.1*
