The Problem
A research client needed a Greek-language LLM fine-tuned for legal text: parliamentary transcripts, court decisions, and regulatory documents. The base model (Qwen 2.5) had limited Greek coverage in its vocabulary, so both the tokenizer and the model weights needed updating before fine-tuning would be effective.
Pipeline Overview
The project was structured as a 9-notebook Colab pipeline:
| Notebook | Task |
|---|---|
| 00_download_to_drive | Download Greek legal corpus to Drive |
| 01_data_collection | Parse and normalise source documents |
| 02_tokenizer_merge | Extend Qwen tokenizer with Greek tokens |
| 03_pretraining | Initial pretraining run (baseline) |
| 04_evaluation | Before/after perplexity evaluation |
| 05_full_pretraining | Full extended pretraining on A100 |
| 06_lep_tokenizer_merge | LEP (Legal Encoding Protocol) tokenizer variant |
| 07a_data_curation | Data quality filtering + deduplication |
| 07b_qlora_pretraining | QLoRA fine-tuning on curated subset |
Key Technical Decisions
Tokenizer extension — Qwen 2.5's tokenizer was trained primarily on Chinese/English. Greek legal subword tokens were learned from the corpus with SentencePieceTrainer, filtered by frequency, and merged into the Qwen tokenizer. This reduced the average token count per Greek sentence by ~30%.
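A minimal sketch of this step, assuming a plain-text corpus file (`greek_legal_corpus.txt`) and the public `Qwen/Qwen2.5-7B` checkpoint; the path, vocabulary size, and coverage settings are illustrative, not the project's actual values, and the handling of the SentencePiece word-boundary marker is simplified:

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Learn candidate Greek subword pieces from the raw corpus.
spm.SentencePieceTrainer.train(
    input="greek_legal_corpus.txt",   # illustrative path
    model_prefix="greek_legal_spm",
    vocab_size=16000,                 # illustrative; acts as the frequency cut-off
    character_coverage=0.9995,
)
sp = spm.SentencePieceProcessor(model_file="greek_legal_spm.model")
candidates = [
    sp.id_to_piece(i)
    for i in range(sp.get_piece_size())
    if not sp.is_control(i) and not sp.is_unknown(i)
]

# 2. Keep only pieces the base tokenizer lacks, add them, and resize the embeddings.
#    (Qwen uses byte-level BPE internally, so the space-marker translation here is approximate.)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
vocab = tokenizer.get_vocab()
new_tokens = [p.replace("▁", " ") for p in candidates if p.replace("▁", " ") not in vocab]
added = tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} tokens; new vocab size: {len(tokenizer)}")
```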
Data curation (07a) — the raw corpus contained OCR noise, duplicate paragraphs, and encoding errors. A deduplication pass (MinHash LSH) and quality filters (length, punctuation density, language detection) reduced the dataset from 2.8 GB to 1.9 GB of clean text.
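The deduplication pass can be sketched with the `datasketch` MinHash/MinHashLSH implementation; the shingling scheme and similarity threshold below are assumptions, not the exact 07a settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(paragraphs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, para in enumerate(paragraphs):
        sig = minhash(para)
        if not lsh.query(sig):        # no similar paragraph kept so far
            lsh.insert(str(idx), sig)
            kept.append(para)
    return kept
```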
QLoRA on Qwen 2.5 7B — full pretraining on the 7B parameter model was prohibitively expensive at ~18 A100 GPU-hours. QLoRA (4-bit quantisation + LoRA adapters targeting Q/K/V projections, rank=64) enabled effective fine-tuning within Colab A100 constraints (~40 GB VRAM) at a fraction of the cost.
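The setup described above corresponds roughly to the following `transformers` + `peft` configuration; the rank and Q/K/V targets come from the write-up, while the alpha, dropout, and quantisation details are illustrative defaults rather than confirmed project settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantisation of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention Q/K/V projections, rank 64.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,        # illustrative; not stated in the write-up
    lora_dropout=0.05,    # illustrative
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```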
Evaluation
Perplexity on a held-out Greek legal test set dropped from 28.4 (base Qwen 2.5) to 14.1 (after full pretraining + QLoRA). The model's answers on legal Q&A prompts were validated by the client against reference decisions.
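Perplexity here is the exponentiated mean token-level cross-entropy over the held-out set. A simplified sketch (batch size 1, single context window per document, no stride handling for long texts):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts: list[str], device: str = "cuda") -> float:
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel() - 1        # number of next-token predictions
        total_nll += out.loss.item() * n        # loss is the mean NLL per predicted token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```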
Outcome
Delivered: the trained model checkpoint, the extended tokenizer, evaluation scripts, and training logs. The client continued to the production fine-tuning phase independently.