aucourt-ingest/spec.md

# AuCourtIngest: Agent Specification v0.1

**Project:** Australian Legal Case Ingestion Pipeline

**Purpose:** Autonomous agent to discover, fetch, parse, normalise, and graph-ingest Australian court decisions for AI jury/judge analysis

**Author:** Aaron King / Slothitude Games

**Date:** 2026-05-30

---

## 1. Overview


AuCourtIngest is a fully autonomous ingestion agent. It operates as a continuous pipeline with no human operator required after initial configuration. Its output is a populated property graph (RyuGraph or Neo4j-compatible) and a flat document store, both ready for per-juror RAG queries.


The agent has four operational modes:


| Mode | Trigger | Description |
|------|---------|-------------|
| `bootstrap` | Manual / first run | Historical bulk ingest from all tier-1 sources |
| `watch` | Cron / RSS poll | Continuous ingestion of new decisions |
| `backfill` | Manual | Fill gaps in existing corpus by date/court/charge-type |
| `audit` | Manual | Re-process existing documents to update graph schema |


---

## 2. Architecture

```
┌─────────────────────────────────────────────────────┐
│                   ORCHESTRATOR                      │
│           (main loop, mode dispatch,                │
│            rate limiting, error recovery)           │
└────────┬───────────────────────────────┬────────────┘
         │                               │
         ▼                               ▼
┌─────────────────┐             ┌─────────────────────┐
│   SOURCE LAYER  │             │   PROCESSING LAYER  │
│                 │             │                     │
│  - FedCourtRSS  │──fetch─────▶│  - DocParser        │
│  - HighCourtRSS │             │  - MetaExtractor    │
│  - NSWCaselaw   │             │  - OutcomeParser    │
│  - QLDJudgments │             │  - ChunkEngine      │
│  - VICCatalogue │             │  - EmbedEngine      │
│  - AusLawMCP    │             │  - GraphBuilder     │
└─────────────────┘             └──────────┬──────────┘
                                           │
                                           ▼
                               ┌─────────────────────┐
                               │   STORAGE LAYER     │
                               │                     │
                               │  - DocStore (flat)  │
                               │  - VectorIndex      │
                               │  - PropertyGraph    │
                               │  - MetaDB (SQLite)  │
                               └─────────────────────┘
```


---

## 3. Source Definitions

### 3.1 Federal Court of Australia

```yaml
source_id: fedcourt
base_url: https://www.judgments.fedcourt.gov.au
rss_feed: https://www.judgments.fedcourt.gov.au/rss/fca-judgments
doc_formats: [html, docx, pdf]
coverage_from: 1977
full_text_from: 1995
docx_from: 1995
pdf_range: [1977, 1994]
rate_limit_rps: 1
tos_status: public_domain
fetch_strategy: rss_poll_then_docx_download
```


**Fetch logic:**

1. Poll RSS feed every 6 hours
2. For each new item: extract judgment URL from RSS `<link>`
3. Fetch HTML page → find DOCX download link (pattern: `/judgments/fca/YYYY/NNN.docx`)
4. Download DOCX → hand to DocParser
5. For 1977–1994: fetch PDF instead → hand to DocParser's PDF handler


**DOCX URL pattern:**

```
https://www.judgments.fedcourt.gov.au/judgments/fca/{YYYY}/{index}.docx
```

**Bootstrap pagination:**

```
https://www.fedcourt.gov.au/digital-law-library/judgments/search
?NeedleType=all&txtSearch=&category=criminal&dateFrom=1995-01-01
&dateTo=2026-12-31&resultsPerPage=50&pageNum={N}
```
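
The two URL templates above could be filled in with small helpers like the following. This is a minimal sketch: the function names `docx_url` and `bootstrap_page_url` are hypothetical, and the query parameters are copied from the pattern above rather than verified against the live site.

```python
from urllib.parse import urlencode

DOCX_BASE = "https://www.judgments.fedcourt.gov.au/judgments/fca"
SEARCH_BASE = "https://www.fedcourt.gov.au/digital-law-library/judgments/search"


def docx_url(year: int, index: int) -> str:
    """Fill the DOCX URL pattern for one judgment."""
    return f"{DOCX_BASE}/{year}/{index}.docx"


def bootstrap_page_url(page: int,
                       date_from: str = "1995-01-01",
                       date_to: str = "2026-12-31") -> str:
    """Build the URL for one page of the bootstrap search listing."""
    params = {
        "NeedleType": "all",
        "txtSearch": "",
        "category": "criminal",
        "dateFrom": date_from,
        "dateTo": date_to,
        "resultsPerPage": 50,
        "pageNum": page,
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

Building the query string with `urlencode` keeps the parameters escaped correctly if the date range or category is later made configurable.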


---

### 3.2 High Court of Australia

```yaml
source_id: highcourt
base_url: https://www.hcourt.gov.au
transcript_base: https://www.hcourt.gov.au/cases/case-s{YYYY}-{N}
coverage_from: 1994
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: public_domain
fetch_strategy: index_crawl
```


**Fetch logic:**

1. Crawl `/cases/recent-judgments`, a paginated list with matter numbers
2. For each matter: fetch judgment HTML page
3. Also fetch transcript if available (`/transcripts/` path)
4. Store judgment + transcript as separate documents, linked by `matter_id`

**Priority filter for MVP:**

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice"
]
```
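
One way the keyword list might be applied is a case-insensitive substring check that maps a hit to a queue priority. This is a sketch only: `priority_hits` and `queue_priority` are hypothetical helpers, and mapping a hit to priority 1 (with 5 as the default) is an assumption borrowed from the `fetch_queue.priority` column in the MetaDB schema.

```python
PRIORITY_KEYWORDS = [
    "murder", "manslaughter", "sexual assault", "robbery",
    "appeal allowed", "conviction quashed", "miscarriage of justice",
]


def priority_hits(text: str, keywords: list[str] = PRIORITY_KEYWORDS) -> list[str]:
    """Return the keywords found in the text, in keyword-list order."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def queue_priority(text: str) -> int:
    """Assumed mapping: any keyword hit -> 1 (highest), otherwise the default 5."""
    return 1 if priority_hits(text) else 5
```

Substring matching is deliberately loose here ("murder" also matches "murdered"); a stricter word-boundary match may be preferable once false positives are measured.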


---

### 3.3 NSW Caselaw

```yaml
source_id: nsw_caselaw
base_url: https://www.caselaw.nsw.gov.au
browse_url: https://www.caselaw.nsw.gov.au/browse
coverage_from: 1988
doc_formats: [html, pdf]
rate_limit_rps: 1
tos_status: open_reproduction_authorised
fetch_strategy: mnc_systematic + browse_pagination
mnc_pattern: "[YYYY] {COURT} {N}"
```


**Fetch logic:**

1. Browse `/browse?filter=criminal` → paginate through all pages
2. Extract MNC + decision URL per row
3. Fetch decision HTML → parse full text + metadata
4. MNC serves as canonical case ID throughout system


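**MNC parsing:**

```python
# Example: [2019] NSWSC 1234
import re

MNC_PATTERN = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")


def parse_mnc(mnc: str) -> dict:
    """Split a Medium Neutral Citation into year, court and number."""
    m = MNC_PATTERN.match(mnc.strip())
    if m is None:
        # Fail loudly instead of crashing on m[1] with a TypeError
        raise ValueError(f"not a valid MNC: {mnc!r}")
    return {"year": m[1], "court": m[2], "number": m[3]}
```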


---

### 3.4 Queensland Judgments

```yaml
source_id: qld_judgments
base_url: https://www.queenslandjudgments.com.au
coverage: variable_by_court
doc_formats: [html, pdf]
rate_limit_rps: 0.5
tos_status: free_access
fetch_strategy: search_pagination
```


**Fetch logic:**

1. Use search API with `matter_type=criminal` filter
2. Paginate results → extract case URLs
3. Fetch HTML per case

---

### 3.5 AusLaw MCP (Gap-Fill + Search)

```yaml
source_id: auslaw_mcp
type: mcp_server
repo: russellbrenner/auslaw-mcp
tools:
  - search_austlii
  - search_jade
  - fetch_document_text
  - search_jade_by_citation
use_case: targeted_retrieval_not_bulk
rate_limit: conservative  # case-by-case only
```


**Use cases:**

- Fetch a specific case by citation when referenced in another judgment
- OCR for scanned pre-1995 PDFs
- Cross-reference citation graph to find related cases
- Fill gaps where direct court portals have incomplete coverage

**Tool call pattern:**

```python
# `mcp` is assumed to be an already-connected MCP client wrapper;
# these calls must run inside an async function.

# Retrieve by citation
result = await mcp.call("fetch_document_text", {
    "url": "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/..."
})

# Search for cases matching charge type
results = await mcp.call("search_austlii", {
    "query": "murder conviction appeal",
    "jurisdiction": "nsw",
    "limit": 20,
    "sortBy": "date"
})
```


---

## 4. Document Processing Pipeline

### 4.1 DocParser


Accepts: HTML, DOCX, PDF  

Outputs: normalised `RawDocument` struct


```python
from dataclasses import dataclass


@dataclass
class RawDocument:
    source_id: str           # e.g. "nsw_caselaw"
    doc_id: str              # MNC or internal UUID
    url: str
    fetch_timestamp: str     # ISO8601
    raw_text: str            # full extracted text
    format: str              # html | docx | pdf
    pages: int | None
    char_count: int
```


**Format handlers:**

- HTML: BeautifulSoup → strip nav/header/footer → extract `.judgment-body` or equivalent
- DOCX: `python-docx` → iterate paragraphs, preserve heading structure
- PDF: `pdfminer.six` for text PDFs; `pytesseract` via AusLaw MCP OCR for scanned

---

### 4.2 MetaExtractor

Input: `RawDocument`

Output: `CaseMeta` struct

Method: Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) via API (cheap, fast, structured)


```python
from dataclasses import dataclass


@dataclass
class CaseMeta:
    # Identity
    case_name: str           # e.g. "R v Smith"
    mnc: str                 # Medium Neutral Citation
    court: str               # NSWSC | HCA | FCA | QSC etc
    judge: list[str]
    date_delivered: str      # ISO8601
    jurisdiction: str        # NSW | CTH | QLD | VIC etc

    # Classification
    matter_type: str         # criminal | civil | appeal | coronial
    charges: list[str]       # ["murder s18 Crimes Act 1900 (NSW)"]
    charge_categories: list[str]  # ["homicide", "violence"]

    # Outcome
    verdict: str             # guilty | not_guilty | appeal_allowed |
                             # appeal_dismissed | conviction_quashed |
                             # sentence_varied | hung | civil_judgment | n/a
    sentence: str | None     # "18 years NMP 13"
    outcome_notes: str       # free text summary of result

    # Flags
    is_appeal: bool
    appeal_of: str | None    # MNC of original decision
    was_appealed: bool       # filled in later by back-reference
    exoneration_flag: bool   # manual or derived from appeal outcome
    inadmissible_evidence: list[str]  # evidence ruled out
    suppression_order: bool  # whether publication restrictions apply
```


**Extraction prompt (system):**

```
You are a legal metadata extractor. Given the full text of an Australian court judgment,
extract structured metadata in JSON. Output ONLY valid JSON matching the CaseMeta schema.
Be conservative: use null for fields you cannot determine with confidence.
For verdict, use only these values: guilty | not_guilty | appeal_allowed |
appeal_dismissed | conviction_quashed | sentence_varied | hung | civil_judgment | n/a
```
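
Because the prompt restricts `verdict` to a closed set, the model's reply can be checked before it reaches the graph. A minimal sketch, assuming the reply arrives as a raw JSON string; `parse_extraction` is a hypothetical helper and checks only the verdict field, not the full `CaseMeta` schema.

```python
import json

# Verdict values copied from the extraction prompt
ALLOWED_VERDICTS = {
    "guilty", "not_guilty", "appeal_allowed", "appeal_dismissed",
    "conviction_quashed", "sentence_varied", "hung", "civil_judgment", "n/a",
}


def parse_extraction(raw: str) -> dict:
    """Parse the model output and reject verdicts outside the allowed set."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    # null is allowed: the prompt asks for null when the model is unsure
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return data
```

A `ValueError` here would naturally map onto the `ParseError` path of the pipeline's error handling.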


---

### 4.3 OutcomeParser


Separate pass specifically for verdict/outcome ground truth. This is the most critical field — it's what the divergence signal compares against.


**Resolution priority:**

1. Explicit headnote: "HELD: conviction quashed"
2. Orders section: "Appeal allowed. Convictions on counts 1 and 3 set aside."
3. Appeal cross-reference: if later appeal found, update original record
4. Manual flag: `exoneration_flag=True` sourced from external list


**External exoneration sources:**

- RMIT Wrongful Convictions Register (AU)
- Innocence Project Australia case list
- Royal Commission findings referencing specific convictions


These are small enough to maintain as a static JSON lookup table keyed by defendant name + year.
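
A lookup table keyed by defendant name + year might look like the following. This is a sketch under assumptions: the JSON entry shape (`defendant`, `year`, `source`) and the key normalisation are illustrative choices, not a format any of the listed sources publish.

```python
import json


def exoneration_key(defendant: str, year: int) -> str:
    """Normalise 'defendant name + year' into a lookup key."""
    name = " ".join(defendant.lower().split())  # lowercase, collapse whitespace
    return f"{name}|{year}"


def load_exonerations(json_text: str) -> dict[str, dict]:
    """Index a JSON array of {"defendant", "year", "source"} entries by key."""
    return {
        exoneration_key(entry["defendant"], entry["year"]): entry
        for entry in json.loads(json_text)
    }


def is_exonerated(table: dict[str, dict], defendant: str, year: int) -> bool:
    """True if the defendant/year pair appears in the lookup table."""
    return exoneration_key(defendant, year) in table
```

Name + year keys will collide for common names, so a match is best treated as a prompt for setting `exoneration_flag` after checking the MNC, not as proof.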


---

### 4.4 ChunkEngine

Chunks documents for RAG retrieval. Chunks are **semantically meaningful units**, not arbitrary token windows.

**Chunk types and sizes:**

| Chunk Type | Content | Target Tokens | Overlap (tokens) |
|------------|---------|---------------|------------------|
| `opening` | Case intro, charges, parties | 400–600 | none |
| `testimony` | Each witness examination block | 300–500 | 50 |
| `exhibit` | Evidence description + ruling | 200–400 | none |
| `ruling` | Each evidentiary ruling by judge | 200–300 | none |
| `closing` | Closing submissions summary | 400–600 | 50 |
| `judgment` | Judicial reasoning blocks | 400–600 | 50 |
| `sentence` | Sentencing remarks | 300–500 | none |


**Chunk struct:**

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str            # UUID
    doc_id: str              # parent document MNC
    chunk_type: str          # from table above
    sequence: int            # position in document
    text: str
    token_count: int
    speaker: str | None      # witness name if testimony
    page_ref: str | None     # "p.47" if available
    embedding: list[float] | None = None  # 1536-dim, filled later by EmbedEngine
```


**Chunking strategy:**

Use Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) to identify structural boundaries:

```
Identify the structural sections of this Australian court judgment.
Return a JSON array of {section_type, start_char, end_char, speaker}.
Section types: opening | testimony | exhibit | ruling | closing | judgment | sentence
```

Then split at identified boundaries, keeping each chunk within token budget.
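
The split step could be sketched as follows. Two assumptions are made loudly here: whitespace-separated words stand in for real tokenizer counts, and the budgets are the upper bounds from the chunk table. `split_to_budget` and `chunk_sections` are hypothetical names.

```python
# (max_tokens, overlap) per section type, taken from the chunk table's upper bounds
BUDGETS = {
    "opening": (600, 0), "testimony": (500, 50), "exhibit": (400, 0),
    "ruling": (300, 0), "closing": (600, 50), "judgment": (600, 50),
    "sentence": (500, 0),
}


def split_to_budget(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split one section into windows of at most max_tokens whitespace tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the section
    return chunks


def chunk_sections(raw_text: str, sections: list[dict],
                   budgets: dict = BUDGETS) -> list[tuple[str, str]]:
    """Cut raw_text at the model-reported boundaries, then enforce budgets."""
    out = []
    for sec in sections:
        body = raw_text[sec["start_char"]:sec["end_char"]]
        max_tokens, overlap = budgets[sec["section_type"]]
        out.extend((sec["section_type"], piece)
                   for piece in split_to_budget(body, max_tokens, overlap))
    return out
```

In the real pipeline the word-based count would be swapped for the tokenizer that matches the embedding model, since legal text tends to run more tokens per word than ordinary prose.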


---

### 4.5 EmbedEngine

```yaml
model: text-embedding-3-small  # OpenAI, 1536-dim, cheap
batch_size: 100
store: chromadb  # local, no infra
collection_naming: "au_cases_{source_id}"
```


Each chunk gets an embedding. The embedding is stored in ChromaDB with chunk metadata as payload.


**Query interface:**

```python
def query_chunks(
    text: str,
    chunk_types: list[str] | None = None,   # filter by type
    doc_ids: list[str] | None = None,       # filter to specific cases
    top_k: int = 10
) -> list[Chunk]:
    ...
```


---

### 4.6 GraphBuilder


Builds the property graph from extracted metadata + chunk relationships.


**Node types:**

```
(:Case {mnc, court, date, jurisdiction, matter_type, verdict, exoneration_flag})
(:Charge {text, category, act, section})
(:Witness {name, role})          # role: prosecution|defence|expert|accused
(:Exhibit {id, description, admitted})
(:Judge {name, court})
(:Ruling {type, outcome})        # type: admissibility|direction|objection
(:Timeline {date, event})
(:Chunk {chunk_id, type, text_preview})
```

**Relationship types:**

```
(:Case)-[:CHARGED_WITH]->(:Charge)
(:Case)-[:HEARD_BY]->(:Judge)
(:Case)-[:INCLUDES_TESTIMONY {credibility_score}]->(:Witness)
(:Case)-[:HAS_EXHIBIT {admitted: bool}]->(:Exhibit)
(:Case)-[:HAS_RULING]->(:Ruling)
(:Case)-[:APPEALS]->(:Case)           # appeal chain
(:Witness)-[:GAVE_TESTIMONY]->(:Chunk)
(:Exhibit)-[:DESCRIBED_IN]->(:Chunk)
(:Ruling)-[:CONCERNS]->(:Exhibit)
(:Ruling)-[:CONCERNS]->(:Witness)
(:Chunk)-[:FOLLOWS]->(:Chunk)         # sequence chain
(:Chunk)-[:CORROBORATES]->(:Chunk)    # semantic similarity > 0.85
(:Chunk)-[:CONTRADICTS]->(:Chunk)     # semantic similarity < 0.2 on same topic
```

**Graph backend:**

- Primary: Neo4j (local Docker) for development
- Alternative: RyuGraph if staying in the existing Lulu/Windows stack
- Export: GraphML for portability


**CORROBORATES / CONTRADICTS edge generation:**

```python
# After all chunks embedded, run pairwise similarity within a case
# for chunks of type testimony and exhibit
CORROBORATES_MIN = 0.85
CONTRADICTS_MAX = 0.2   # matches the CONTRADICTS threshold in the relationship list

for a, b in chunk_pairs_within_case(doc_id, types=["testimony", "exhibit"]):
    sim = cosine_similarity(a.embedding, b.embedding)
    if sim > CORROBORATES_MIN:
        graph.add_edge(a, b, "CORROBORATES", weight=sim)
    elif same_topic(a, b) and sim < CONTRADICTS_MAX:
        graph.add_edge(a, b, "CONTRADICTS", weight=sim)
```


---

## 5. Juror Subgraph Query Interface


This is the interface the jury agent calls at deliberation time.


```python
def get_juror_context(
    case_mnc: str,
    persona: JurorPersona,
    max_tokens: int = 4000
) -> JurorContext:
    """
    Traverse the graph from the persona's anchor node types,
    follow relevant edge types, collect chunks up to token budget.
    Returns a compact context string + list of source node IDs.
    """
```

**Persona → graph traversal mapping:**

```python
PERSONA_TRAVERSAL = {
    "nurse": {
        "anchor_nodes": ["Witness[role=expert]", "Exhibit[category=medical]"],
        "edge_types": ["GAVE_TESTIMONY", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["testimony", "exhibit"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "accountant": {
        "anchor_nodes": ["Exhibit[category=financial]", "Witness[role=expert]"],
        "edge_types": ["DESCRIBED_IN", "HAS_RULING"],
        "chunk_types": ["exhibit", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "skeptic": {
        "anchor_nodes": ["Timeline", "Witness"],
        "edge_types": ["CONTRADICTS", "HAS_RULING", "GAVE_TESTIMONY"],
        "chunk_types": ["testimony", "ruling"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "ex_cop": {
        "anchor_nodes": ["Ruling", "Exhibit[category=forensic]"],
        "edge_types": ["HAS_RULING", "DESCRIBED_IN", "CORROBORATES"],
        "chunk_types": ["ruling", "exhibit", "judgment"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "empath": {
        "anchor_nodes": ["Witness[role=prosecution]", "Witness[role=accused]"],
        "edge_types": ["GAVE_TESTIMONY", "CORROBORATES", "CONTRADICTS"],
        "chunk_types": ["testimony", "closing"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    },
    "foreman": {
        "anchor_nodes": ["Charge", "Ruling", "Judge"],
        "edge_types": ["CHARGED_WITH", "HAS_RULING", "HEARD_BY"],
        "chunk_types": ["opening", "judgment", "sentence"],
        "exclude_edges": ["RULED_INADMISSIBLE"]
    }
}
```


**Critical invariant:** `RULED_INADMISSIBLE` edges are always excluded from juror queries. This wall is enforced at the graph traversal level, not the prompt level.
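
Enforcing the invariant at traversal level could mean stripping the excluded edge type inside the traversal function itself, so that no persona configuration can re-enable it. The sketch below works on an in-memory list of `(src, edge_type, dst)` triples as a stand-in for the real graph backend; `traverse` and `ALWAYS_EXCLUDED` are hypothetical names.

```python
from collections import deque

ALWAYS_EXCLUDED = frozenset({"RULED_INADMISSIBLE"})


def traverse(edges: list[tuple[str, str, str]],
             anchors: list[str],
             edge_types: list[str]) -> list[str]:
    """Breadth-first walk over directed (src, edge_type, dst) triples.

    Excluded edge types are removed here, so a persona config that lists
    one by mistake still cannot follow it.
    """
    allowed = set(edge_types) - ALWAYS_EXCLUDED
    adjacency: dict[str, list[str]] = {}
    for src, edge_type, dst in edges:
        if edge_type in allowed:
            adjacency.setdefault(src, []).append(dst)

    order = list(dict.fromkeys(anchors))  # de-duplicate, keep anchor order
    seen = set(order)
    queue = deque(order)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

With this shape, the per-persona `exclude_edges` entries become documentation of intent, while the hard guarantee lives in one place.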


---

## 6. MetaDB Schema (SQLite)


Tracks ingestion state — what's been fetched, processed, failed.


```sql
CREATE TABLE documents (
    doc_id           TEXT PRIMARY KEY,  -- MNC or UUID
    source_id        TEXT NOT NULL,
    url              TEXT NOT NULL,
    fetch_status     TEXT,              -- pending|fetched|parsed|embedded|graphed|failed
    fetch_timestamp  TEXT,
    parse_timestamp  TEXT,
    embed_timestamp  TEXT,
    graph_timestamp  TEXT,
    error_message    TEXT,
    char_count       INTEGER,
    chunk_count      INTEGER,
    matter_type      TEXT,
    court            TEXT,
    year             INTEGER,
    verdict          TEXT,
    exoneration_flag INTEGER DEFAULT 0
);

CREATE TABLE fetch_queue (
    queue_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id       TEXT,
    url             TEXT,
    priority        INTEGER DEFAULT 5,  -- 1=highest
    queued_at       TEXT,
    attempts        INTEGER DEFAULT 0
);

CREATE TABLE source_state (
    source_id       TEXT PRIMARY KEY,
    last_poll       TEXT,
    last_rss_etag   TEXT,
    docs_fetched    INTEGER DEFAULT 0,
    docs_failed     INTEGER DEFAULT 0
);
```
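
How the orchestrator might pull work from `fetch_queue` is sketched below. It uses the standard-library `sqlite3` module for brevity where the tech stack names `aiosqlite`; `next_queued` and the `max_attempts` cut-off are hypothetical, and the DDL string repeats the `fetch_queue` table from the schema above so the example is self-contained.

```python
import sqlite3

FETCH_QUEUE_DDL = """
CREATE TABLE fetch_queue (
    queue_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT,
    url       TEXT,
    priority  INTEGER DEFAULT 5,
    queued_at TEXT,
    attempts  INTEGER DEFAULT 0
)
"""


def next_queued(conn: sqlite3.Connection, max_attempts: int = 3):
    """Return (queue_id, source_id, url) for the best queued item, or None.

    Lowest priority number first, oldest first within a priority;
    the row's attempt counter is incremented before returning.
    """
    row = conn.execute(
        "SELECT queue_id, source_id, url FROM fetch_queue "
        "WHERE attempts < ? ORDER BY priority ASC, queued_at ASC LIMIT 1",
        (max_attempts,),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE fetch_queue SET attempts = attempts + 1 WHERE queue_id = ?",
        (row[0],),
    )
    conn.commit()
    return row
```

Storing `queued_at` as ISO8601 text keeps the `ORDER BY` correct, since ISO timestamps in one timezone sort lexicographically.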


---

## 7. Rate Limiting & Politeness

```python
RATE_LIMITS = {
    "fedcourt":      {"rps": 1.0,  "concurrent": 2, "retry_after": 60},
    "highcourt":     {"rps": 0.5,  "concurrent": 1, "retry_after": 120},
    "nsw_caselaw":   {"rps": 1.0,  "concurrent": 2, "retry_after": 60},
    "qld_judgments": {"rps": 0.5,  "concurrent": 1, "retry_after": 120},
    "auslaw_mcp":    {"rps": 0.3,  "concurrent": 1, "retry_after": 180},
}

# Always send:
HEADERS = {
    "User-Agent": "AuCourtIngest/0.1 (legal research; contact: your@email.com)",
    # DOCX uses the OOXML media type; application/msword only covers legacy .doc
    "Accept": (
        "text/html,application/xhtml+xml,application/pdf,"
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ),
}

# Respect robots.txt (NSW Caselaw blocks search engines, not research agents)
# Check before first fetch from each domain
```
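
The per-source `rps` values and the robots.txt check could be implemented along these lines. This is a sketch, not the project's limiter: `MinIntervalLimiter` and `allowed_by_robots` are hypothetical names, the limiter is synchronous and single-threaded (the `concurrent` setting is not modelled), and the robots.txt body is assumed to have been fetched already.

```python
import time
from urllib.robotparser import RobotFileParser


class MinIntervalLimiter:
    """Space requests at least 1/rps seconds apart."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_ok - now)
        if delay:
            self._sleep(delay)
        self._next_ok = max(now, self._next_ok) + self.interval
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Injecting `clock` and `sleep` keeps the limiter testable without real waiting; an async variant would swap `time.sleep` for `asyncio.sleep`.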


---

## 8. Error Handling & Recovery

```python
class FetchError(Exception): pass
class ParseError(Exception): pass
class RateLimitError(Exception): pass

RETRY_POLICY = {
    FetchError:      {"max_retries": 3, "backoff": "exponential", "base": 30},
    RateLimitError:  {"max_retries": 5, "backoff": "exponential", "base": 120},
    ParseError:      {"max_retries": 1, "backoff": "fixed",       "base": 0},
}

# On ParseError after retry: log to MetaDB, move to manual_review queue
# On persistent FetchError: mark source as degraded, alert via Telegram
```
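
A policy entry can be expanded into a concrete delay schedule as sketched below. The doubling factor for `"exponential"` is an assumption (the spec names the strategy but not the multiplier), `base` is taken to be seconds, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(policy: dict) -> list[float]:
    """Return the wait in seconds before each retry for one policy entry."""
    base = policy["base"]
    retries = policy["max_retries"]
    if policy["backoff"] == "exponential":
        # Assumed doubling: base, 2*base, 4*base, ...
        return [base * 2 ** attempt for attempt in range(retries)]
    return [base] * retries  # "fixed": same wait every time
```

Under these assumptions a `FetchError` would be retried after 30, 60 and 120 seconds, and a `RateLimitError` would wait up to 1,920 seconds before its fifth retry, which is worth weighing against the 6-hour RSS poll interval.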


**Telegram alerting** (reuse Lulu's Telegram integration):

```python
ALERTS = {
    "source_degraded": "⚠️ AuCourtIngest: {source_id} failing for {minutes}min",
    "daily_summary":   "📊 Ingested: {new_docs} docs | Graph nodes: {nodes} | Errors: {errors}",
    "milestone":       "🎯 Corpus reached {N} cases",
}
```

---

## 9. Bootstrap Run Plan


Priority order for initial corpus build:


```
Phase 1: Week 1 (no permissions needed):
  [ ] Federal Court RSS → DOCX, 2010–2026, criminal filter
  [ ] High Court judgments 2000–2026, criminal appeals
  [ ] NSW Caselaw criminal browse, 2010–2026
  Target: ~5,000 cases

Phase 2: Week 2:
  [ ] Federal Court DOCX backfill 1995–2009
  [ ] QLD Judgments criminal, 2000–2026
  [ ] VIC via AusLaw MCP targeted search
  Target: ~15,000 cases total

Phase 3: Ongoing:
  [ ] RSS watch mode all sources
  [ ] AustLII partnership email → if approved, full backfill
  [ ] Exoneration cross-reference pass (RMIT register)
  Target: 50,000+ cases
```


---

## 10. Output Contracts

**DocStore:** `/data/docs/{source_id}/{year}/{mnc}.json`

```json
{
  "doc_id": "[2019] NSWSC 1234",
  "meta": { ...CaseMeta... },
  "chunks": [ ...Chunk[] ... ],
  "raw_text_path": "/data/raw/nsw_caselaw/2019/NSWSC_1234.html"
}
```
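
An MNC such as `[2019] NSWSC 1234` contains brackets and spaces, so it needs a filesystem-safe form before it can fill the `{mnc}` slot of the DocStore path. The sketch below mirrors the `NSWSC_1234` naming seen in the `raw_text_path` example; applying the same naming to DocStore files is an assumption, and `doc_store_path` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

_MNC = re.compile(r"\[(\d{4})\]\s+([A-Za-z]+)\s+(\d+)")


def doc_store_path(source_id: str, mnc: str, root: str = "/data/docs") -> PurePosixPath:
    """Map a source and MNC to /data/docs/{source_id}/{year}/{COURT}_{N}.json."""
    m = _MNC.fullmatch(mnc.strip())
    if m is None:
        raise ValueError(f"not a valid MNC: {mnc!r}")
    year, court, number = m.groups()
    return PurePosixPath(root) / source_id / year / f"{court}_{number}.json"
```

Because the year is already a directory level, leaving it out of the filename keeps the DocStore and raw paths parallel.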


**VectorIndex:** ChromaDB collections named `au_cases_{source_id}` (one per source, per the EmbedEngine config), filterable by `chunk_type` and `doc_id`


**PropertyGraph:** Neo4j `au_legal` database, accessible via Bolt protocol


**MetaDB:** SQLite at `/data/meta.db`


---

## 11. Tech Stack

```
Language:       Python 3.12
HTTP:           httpx (async)
HTML parsing:   BeautifulSoup4
DOCX:           python-docx
PDF text:       pdfminer.six
PDF OCR:        via AusLaw MCP (pytesseract wrapper)
Embeddings:     openai text-embedding-3-small
Vector store:   chromadb (local)
Graph DB:       neo4j (Docker) or ryu_graph (Windows)
SQLite:         aiosqlite
LLM extraction: anthropic claude-haiku-4-5-20251001 (cheap, fast)
MCP client:     mcp (official MCP Python SDK)
Scheduling:     APScheduler
Alerting:       Telegram Bot API (reuse Lulu token)
Config:         TOML (pyproject-style)
```

---

## 12. File Structure


```
aucourt_ingest/
├── main.py                  # entry point, mode dispatch
├── config.toml              # source configs, rate limits, paths
├── orchestrator.py          # main loop, queue management
├── sources/
│   ├── base.py              # BaseSource ABC
│   ├── fedcourt.py
│   ├── highcourt.py
│   ├── nsw_caselaw.py
│   ├── qld_judgments.py
│   └── auslaw_mcp.py
├── processing/
│   ├── doc_parser.py
│   ├── meta_extractor.py
│   ├── outcome_parser.py
│   ├── chunk_engine.py
│   ├── embed_engine.py
│   └── graph_builder.py
├── storage/
│   ├── doc_store.py
│   ├── vector_index.py
│   ├── graph_db.py
│   └── meta_db.py
├── jury/
│   ├── personas.py          # JurorPersona definitions
│   └── subgraph_query.py    # get_juror_context()
├── utils/
│   ├── rate_limiter.py
│   ├── retry.py
│   ├── telegram.py
│   └── mnc_parser.py
├── data/
│   ├── docs/
│   ├── raw/
│   └── meta.db
└── tests/
    ├── test_parsers.py
    ├── test_meta_extractor.py
    └── test_graph_builder.py
```


---

## 13. Open Questions / Decisions Deferred


1. **Graph backend:** RyuGraph (existing Lulu stack, Windows/Android) vs Neo4j (Docker, better Cypher tooling). Recommend Neo4j for dev, RyuGraph for eventual mobile/embedded target.

2. **Transcript vs judgment distinction:** Many cases have both. Transcripts are richer for juror simulation (raw testimony). Judgments are cleaner but summarised. Ingest both, tag chunk source type.

3. **Suppression orders:** Some cases have partial suppression. Track this with the `suppression_order` flag on `CaseMeta` (the MetaDB `documents` table has no such column yet). Do not ingest suppressed content. Check AustLII suppression notices as reference.

4. **AustLII partnership timing:** Email now or after MVP? Recommend now: low effort, long lead time for approval.

5. **Embedding model:** text-embedding-3-small is cheap, but text-embedding-3-large or a legal-domain model such as Voyage's voyage-law-2 may give better legal domain performance. Evaluate on retrieval quality after first 1,000 cases.


---

*End of spec v0.1*