The Problem
A legal research client needed to extract and structure Dutch court decisions from Rechtspraak.nl — the Netherlands' official judicial database. Each decision is an HTML document with a unique ECLI identifier. The client had a list of thousands of ECLIs and needed the full text extracted, merged into Excel, and annotated with section labels (Background, Reasoning, Decision) to feed downstream analysis.
What I Built
Phase 1 — Scraper (rechtspraak_scraper.py): Fetched each decision by ECLI from the Rechtspraak API endpoint, parsed the HTML with BeautifulSoup, extracted structured fields (ECLI, date, court, title, full text), and saved to CSV.
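The fetch step depends on well-formed identifiers, so a scraper of this kind would typically validate each ECLI before building a request URL. The sketch below shows one way to do that with the standard library. The endpoint in BASE_URL and the helper names parse_ecli and decision_url are assumptions for illustration; the write-up does not state the exact URL or function names used in rechtspraak_scraper.py.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint: the write-up does not name the exact URL that was used.
BASE_URL = "https://data.rechtspraak.nl/uitspraken/content"

# ECLI layout: ECLI:<country>:<court>:<year>:<ordinal number>
ECLI_RE = re.compile(
    r"^ECLI:(?P<country>[A-Z]{2}):(?P<court>[A-Z][A-Z0-9.]{0,6}):"
    r"(?P<year>\d{4}):(?P<ordinal>[A-Z0-9.]{1,25})$"
)


def parse_ecli(ecli):
    """Split an ECLI into its components, raising ValueError if malformed."""
    match = ECLI_RE.match(ecli.strip().upper())
    if match is None:
        raise ValueError(f"not a valid ECLI: {ecli!r}")
    return match.groupdict()


def decision_url(ecli):
    """Build the (assumed) request URL for one decision."""
    parse_ecli(ecli)  # reject malformed identifiers before any network call
    return f"{BASE_URL}?{urlencode({'id': ecli.strip().upper()})}"
```

Validating up front means a typo in the client's ECLI list surfaces as a clear error instead of an empty or misleading HTTP response.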
Phase 2 — Merge & export (merge_xlsx.py): Consolidated multiple CSV batches into a single Excel workbook, handling encoding edge cases in Dutch legal text.
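A minimal sketch of the merge step, assuming each batch CSV has an "ecli" column and that duplicates across batches should be dropped. The column name and the functions normalise and merge_batches are hypothetical; merge_xlsx.py itself may work differently. It uses only the standard library, reading with the utf-8-sig codec so that a byte-order mark left by Excel does not corrupt the first header.

```python
import csv
import re
import unicodedata


def normalise(text):
    """Compose accents (NFC), replace non-breaking spaces, collapse runs of blanks."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u00a0", " ").replace("\u200b", "")
    return re.sub(r"[ \t]+", " ", text).strip()


def merge_batches(paths, key="ecli"):
    """Merge CSV batches into one list of rows, keeping the first row per key."""
    seen = set()
    merged = []
    for path in paths:
        # utf-8-sig reads plain UTF-8 too, and strips a BOM if one is present
        with open(path, encoding="utf-8-sig", newline="") as handle:
            for row in csv.DictReader(handle):
                if row[key] in seen:
                    continue  # same decision already loaded from an earlier batch
                seen.add(row[key])
                merged.append({col: normalise(val) for col, val in row.items()})
    return merged
```

The resulting rows could then be written to a workbook, for example with pandas via `pandas.DataFrame(rows).to_excel("merged.xlsx", index=False)`, which requires an Excel engine such as openpyxl to be installed.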
Phase 3 — LLM section parser (ollama_section_parser.py): An annotation layer that uses Llama 3 8B running locally via Ollama to identify and label the major structural sections in each decision (Background, Reasoning, Decision). The parser accepts either a local HTML file or a live URL, calls the Ollama REST API with a structured prompt, and returns a JSON of labelled sections. Running the model locally keeps annotation costs at zero and avoids sending court decision content to external APIs.
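The request and response handling for such an annotator might look roughly like the sketch below. It targets Ollama's /api/generate endpoint on the default local port with streaming disabled and JSON output requested. The prompt wording, the model tag, and the function names build_payload, parse_sections, and annotate are illustrative assumptions, not the contents of ollama_section_parser.py.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
LABELS = ("Background", "Reasoning", "Decision")


def build_payload(decision_text, model="llama3:8b"):
    """Build the request body; the prompt wording here is hypothetical."""
    prompt = (
        "Split the following Dutch court decision into sections. "
        f"Return a JSON object with exactly these keys: {', '.join(LABELS)}.\n\n"
        + decision_text
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_sections(response_body):
    """Extract the labelled sections from a non-streaming Ollama response."""
    outer = json.loads(response_body)
    sections = json.loads(outer["response"])  # the model output is itself JSON text
    # Keep only the expected labels; a missing label becomes an empty string.
    return {label: sections.get(label, "") for label in LABELS}


def annotate(decision_text):
    """Send one decision to the local model and return its labelled sections."""
    data = json.dumps(build_payload(decision_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return parse_sections(response.read())
```

Filtering the model output down to the three expected labels keeps stray keys out of the exported data, which matters when the result feeds a downstream classifier.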
Technical Highlights
- ECLI-based batching: processed ~3,000 ECLIs in parallel with asyncio + rate limiting to respect Rechtspraak's API.
- Encoding resilience: Dutch legal text contains special characters and non-standard whitespace; the parser normalises all text before export.
- LLM prompt engineering: the Ollama prompt was iteratively tuned on a sample of 50 decisions before full-scale annotation.
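The combination of parallel fetching and rate limiting from the first highlight can be sketched as follows. This is one common pattern, not necessarily the one used in the project: a semaphore caps the number of requests in flight while a small limiter spaces out request starts. The class RateLimiter, the function process_all, and the default values are hypothetical, and fetch stands in for whatever coroutine performs the actual HTTP call.

```python
import asyncio
import time


class RateLimiter:
    """Space out callers so that at most one proceeds every `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._next_slot - now
            if delay > 0:
                await asyncio.sleep(delay)
            # Reserve the following slot relative to the one just used.
            self._next_slot = max(now, self._next_slot) + self.interval


async def process_all(eclis, fetch, max_concurrent=5, interval=0.2):
    """Run fetch(ecli) for every ECLI; results come back in input order."""
    limiter = RateLimiter(interval)  # created inside the running event loop
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(ecli):
        async with semaphore:      # cap requests in flight
            await limiter.wait()   # cap the request start rate
            return await fetch(ecli)

    return await asyncio.gather(*(worker(ecli) for ecli in eclis))
```

A call such as `asyncio.run(process_all(ecli_list, fetch_decision))` would drive the whole batch, where fetch_decision is an assumed coroutine that downloads and parses one decision.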
Outcome
Delivered: merged_Rechtspraak_nl.xlsx (~3,000 annotated decisions) and the annotator script. The client used the annotated workbook to build a training dataset for a downstream NLP classifier.