Court-Decision Extraction & AI Tagging — Christos Prapas

The Problem

A legal research client needed to extract and structure Dutch court decisions from Rechtspraak.nl, the Netherlands' official judicial database. Each decision is an HTML document identified by a unique ECLI (European Case Law Identifier). The client had a list of thousands of ECLIs and needed the full text of each decision extracted, merged into a single Excel workbook, and annotated with section labels (Background, Reasoning, Decision) to feed downstream analysis.

What I Built

Phase 1 — Scraper (rechtspraak_scraper.py): Fetched each decision by ECLI from the Rechtspraak API endpoint, parsed the HTML with BeautifulSoup, extracted structured fields (ECLI, date, court, title, full text), and saved to CSV.
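The extract-then-save step of Phase 1 can be sketched roughly as follows. This is an illustration under assumptions, not the code in rechtspraak_scraper.py: it uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup, the ECLI pattern is a simplified check of the ECLI:country:court:year:ordinal layout, and the names DecisionParser, parse_decision and to_csv are hypothetical. The date and court fields are left out because they depend on page markup that is not described here.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Simplified ECLI layout check: ECLI:<country>:<court>:<year>:<ordinal>
ECLI_RE = re.compile(r"^ECLI:[A-Z]{2}:[A-Z0-9]{1,7}:\d{4}:[A-Z0-9.]{1,25}$")


class DecisionParser(HTMLParser):
    """Collects the <title> text and the visible body text of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_decision(ecli: str, html: str) -> dict:
    """Turn one decision page into a flat row keyed by its ECLI."""
    if not ECLI_RE.match(ecli):
        raise ValueError(f"not a valid ECLI: {ecli!r}")
    parser = DecisionParser()
    parser.feed(html)
    parser.close()
    return {
        "ecli": ecli,
        "title": "".join(parser.title_parts).strip(),
        "full_text": "\n".join(parser.text_parts),
    }


def to_csv(rows: list) -> str:
    """Serialise rows to CSV text; multi-line fields are quoted by csv."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["ecli", "title", "full_text"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Validating the ECLI before parsing keeps malformed identifiers out of the CSV, where they would otherwise surface only later as failed lookups or duplicate keys.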

Phase 2 — Merge & export (merge_xlsx.py): Consolidated multiple CSV batches into a single Excel workbook, handling encoding edge cases in Dutch legal text.
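A minimal sketch of the merge step, assuming each batch is CSV text with an "ecli" column. The function names are hypothetical, and deduplicating by ECLI is an assumption of this example rather than something merge_xlsx.py is known to do. Normalisation here means Unicode NFC composition plus collapsing runs of whitespace (including non-breaking spaces) within each line.

```python
import csv
import io
import unicodedata


def normalise(text: str) -> str:
    """Compose accented characters and collapse odd whitespace per line."""
    # NFC turns "e" + combining diaeresis into the single character "ë".
    text = unicodedata.normalize("NFC", text)
    # str.split() with no argument splits on any Unicode whitespace,
    # so non-breaking spaces and tabs collapse to single spaces.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())


def merge_batches(batches: list) -> list:
    """Merge CSV texts into one row list; the first row per ECLI wins."""
    merged = {}
    for text in batches:
        for row in csv.DictReader(io.StringIO(text)):
            clean = {key: normalise(value or "") for key, value in row.items()}
            merged.setdefault(clean["ecli"], clean)
    return list(merged.values())
```

When reading batch files from disk, opening them with encoding="utf-8-sig" strips a leading byte-order mark if one is present. The merged rows could then be written to a workbook with pandas, for example pandas.DataFrame(rows).to_excel(path, index=False), which needs the openpyxl package for .xlsx output.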

Phase 3 — LLM section parser (ollama_section_parser.py): An annotation layer that uses Llama 3 8B running locally via Ollama to identify and label the major structural sections in each decision (Background, Reasoning, Decision). The parser accepts either a local HTML file or a live URL, calls the Ollama REST API with a structured prompt, and returns a JSON of labelled sections. Running the model locally keeps annotation costs at zero and avoids sending court decision content to external APIs.
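The call to the local model might look roughly like this. It is a sketch under assumptions: the prompt wording, the model tag "llama3:8b", and the function names are illustrative and not taken from ollama_section_parser.py. The endpoint is Ollama's default local address, and "stream" and "format" are fields of its generate API.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
LABELS = ("Background", "Reasoning", "Decision")


def build_payload(decision_text: str, model: str = "llama3:8b") -> dict:
    """Build the request body; the prompt text here is only illustrative."""
    prompt = (
        "Split the Dutch court decision below into sections. "
        "Return a JSON object with exactly the keys "
        + ", ".join(LABELS)
        + ", each mapping to the matching text.\n\n"
        + decision_text
    )
    # stream=False asks for one complete reply; format="json" asks the
    # model to emit valid JSON.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_sections(api_response: dict) -> dict:
    """Read the labelled sections out of a non-streaming reply."""
    # The generated text arrives as a string in the "response" field.
    sections = json.loads(api_response["response"])
    # Missing labels become empty strings so every row has the same keys.
    return {label: sections.get(label, "") for label in LABELS}


def annotate(decision_text: str) -> dict:
    """Send one decision to the local Ollama server and return its sections."""
    body = json.dumps(build_payload(decision_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return parse_sections(json.load(response))
```

Keeping payload construction and response parsing separate from the network call means both can be checked against canned replies without a running model.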

Technical Highlights

ECLI-based batching: processed ~3,000 ECLIs concurrently with asyncio, rate-limited to respect Rechtspraak's API.
Encoding resilience: Dutch legal text contains special characters and non-standard whitespace; the parser normalises all text before export.
LLM prompt engineering: the Ollama prompt was iteratively tuned on a sample of 50 decisions before full-scale annotation.
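The concurrency pattern from the first highlight can be sketched like this: a semaphore caps the number of requests in flight, and a small limiter spaces out request starts. The rate and concurrency values, the RateLimiter class, and the fetch coroutine passed in are all hypothetical; the actual limits used against Rechtspraak are not stated here.

```python
import asyncio
import time


class RateLimiter:
    """Spaces out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float):
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self):
        # No await happens between reading and updating _next_slot, so
        # this bookkeeping is safe within a single event loop.
        now = time.monotonic()
        start = max(now, self._next_slot)
        self._next_slot = start + self._interval
        await asyncio.sleep(start - now)


async def fetch_all(eclis, fetch, rate=5.0, max_concurrent=10):
    """Run fetch(ecli) for every ECLI; results keep the input order."""
    limiter = RateLimiter(rate)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(ecli):
        async with semaphore:      # cap on requests in flight
            await limiter.wait()   # cap on request starts per second
            return await fetch(ecli)

    return await asyncio.gather(*(fetch_one(e) for e in eclis))
```

A call such as asyncio.run(fetch_all(ecli_list, fetch_decision)) would drive it, where fetch_decision is whatever coroutine performs the HTTP request. The two limits do different jobs: the semaphore bounds memory and open connections, while the limiter bounds the load placed on the server.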

Outcome

Delivered: merged_Rechtspraak_nl.xlsx (~3,000 annotated decisions) + the annotator script. The client used the workbook to build a training dataset for a downstream NLP classifier.