The Problem
A research client needed a Greek-language LLM fine-tuned for legal text: parliamentary transcripts, court decisions, and regulatory documents. The base model (Qwen 2.5) had limited Greek coverage in its vocabulary, so both the tokenizer and the model weights needed updating before fine-tuning would be effective.
Pipeline Overview
The project was structured as a 9-notebook Colab pipeline:
| Notebook | Task |
|---|---|
| 00_download_to_drive | Download Greek legal corpus to Drive |
| 01_data_collection | Parse and normalise source documents |
| 02_tokenizer_merge | Extend Qwen tokenizer with Greek tokens |
| 03_pretraining | Initial pretraining run (baseline) |
| 04_evaluation | Before/after perplexity evaluation |
| 05_full_pretraining | Full extended pretraining on A100 |
| 06_lep_tokenizer_merge | LEP (Legal Encoding Protocol) tokenizer variant |
| 07a_data_curation | Data quality filtering + deduplication |
| 07b_qlora_pretraining | QLoRA fine-tuning on curated subset |
Key Technical Decisions
Tokenizer extension — Qwen 2.5's tokenizer was trained primarily on Chinese/English. Greek legal subword tokens were learned from the corpus with SentencePieceTrainer, filtered by frequency, and merged into the Qwen tokenizer. This reduced the average token count per Greek sentence by ~30%.
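A minimal sketch of this step, assuming a plain-text corpus file (`greek_legal_corpus.txt`) and the public `Qwen/Qwen2.5-7B` checkpoint; the path, vocabulary size, and coverage settings are illustrative, not the project's actual values, and the handling of the SentencePiece word-boundary marker is simplified:

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Learn candidate Greek subword pieces from the raw corpus.
spm.SentencePieceTrainer.train(
    input="greek_legal_corpus.txt",   # illustrative path
    model_prefix="greek_legal_spm",
    vocab_size=16000,                 # illustrative; acts as the frequency cut-off
    character_coverage=0.9995,
)
sp = spm.SentencePieceProcessor(model_file="greek_legal_spm.model")
candidates = [
    sp.id_to_piece(i)
    for i in range(sp.get_piece_size())
    if not sp.is_control(i) and not sp.is_unknown(i)
]

# 2. Keep only pieces the base tokenizer lacks, add them, and resize the embeddings.
#    (Qwen uses byte-level BPE internally, so the space-marker translation here is approximate.)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
vocab = tokenizer.get_vocab()
new_tokens = [p.replace("▁", " ") for p in candidates if p.replace("▁", " ") not in vocab]
added = tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} tokens; new vocab size: {len(tokenizer)}")
```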
Data curation (07a) — the raw corpus contained OCR noise, duplicate paragraphs, and encoding errors. A deduplication pass (MinHash LSH) and quality filters (length, punctuation density, language detection) reduced the dataset from 2.8 GB to 1.9 GB of clean text.
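The deduplication pass can be sketched with the `datasketch` MinHash/MinHashLSH implementation; the shingling scheme and similarity threshold below are assumptions, not the exact 07a settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(paragraphs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, para in enumerate(paragraphs):
        sig = minhash(para)
        if not lsh.query(sig):        # no similar paragraph kept so far
            lsh.insert(str(idx), sig)
            kept.append(para)
    return kept
```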
QLoRA on Qwen 2.5 7B — full pretraining on the 7B parameter model was prohibitively expensive at ~18 A100 GPU-hours. QLoRA (4-bit quantisation + LoRA adapters targeting Q/K/V projections, rank=64) enabled effective fine-tuning within Colab A100 constraints (~40 GB VRAM) at a fraction of the cost.
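The setup described above corresponds roughly to the following `transformers` + `peft` configuration; the rank and Q/K/V targets come from the write-up, while the alpha, dropout, and quantisation details are illustrative defaults rather than confirmed project settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantisation of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention Q/K/V projections, rank 64.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,        # illustrative; not stated in the write-up
    lora_dropout=0.05,    # illustrative
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```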
Evaluation
Perplexity on a held-out Greek legal test set dropped from 28.4 (base Qwen 2.5) to 14.1 (after full pretraining + QLoRA). The model's answers on legal Q&A prompts were validated by the client against reference decisions.
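Perplexity here is the exponentiated mean token-level cross-entropy over the held-out set. A simplified sketch (batch size 1, single context window per document, no stride handling for long texts):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts: list[str], device: str = "cuda") -> float:
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel() - 1        # number of next-token predictions
        total_nll += out.loss.item() * n        # loss is the mean NLL per predicted token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```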
Outcome
Delivered: the trained model checkpoint, the extended tokenizer, evaluation scripts, and training logs. The client continued to the production fine-tuning phase independently.