The Decision Nobody Should Make Alone

In many Spanish-speaking regions — rural Guatemala, southern Mexico, interior Spain — access to healthcare is not a given. It is a calculation. A calculation that involves distance, money, time, and risk.

In rural Guatemala, the average distance to a hospital is 45 kilometers. In rural Mexico, nearly half the population lives more than an hour from the nearest emergency room. In Spain's aging interior communities, elderly patients face similar isolation. When someone in these areas develops symptoms — chest pain, persistent fever, a strange rash — they face an impossible decision:

  • Go to the emergency room, potentially hours away and costing a day's wages
  • Wait and hope it gets better
  • Ask a neighbor or family member with no medical training

Every day, millions of people make this choice with no guidance at all. Some wait too long. Some travel hours for something that could have waited. The problem is not the absence of medical knowledge — the Manchester Triage System, used in emergency departments worldwide, provides a clear five-level classification for urgency. The problem is that this knowledge does not reach the people who need it, where they need it, when they need it.

This is the problem that med-llm-triage-es set out to solve: a small, offline-capable AI model that speaks Spanish, understands the Manchester Triage System, and runs entirely on a mobile phone — no internet required. Not to diagnose. Not to replace a doctor. Just to help people make a better-informed decision about when and where to seek care.

  • 45 km — average distance to a hospital in rural Guatemala
  • 805 MB — final model size; fits on any modern phone
  • ~80% — triage label accuracy after fixing the ROJO bias
  • $1.40 — total cost to distill 5,000 examples from a teacher model

Why Offline Matters

You might ask: why not just use a cloud API? The answer is simple: the people who need this most often have the least connectivity. In rural Latin America, internet access is intermittent at best. In disaster scenarios — earthquakes, hurricanes, floods — communication infrastructure is the first thing to fail. In many developing regions, mobile data costs are prohibitively expensive. An offline model running on-device ensures this technology reaches the people who need it most, precisely when they need it most: mobile phones in rural clinics, community health worker tablets in remote areas, emergency response kits in disaster zones.

The Manchester Triage System: Five Colors, One Language

The Manchester Triage System (MTS) is the gold standard for emergency department triage across Europe, Latin America, and much of the world [9]. It classifies patients into five urgency levels, each associated with a color and a maximum wait time:

Level | Color | Urgency | Max Wait | Example
1 | ROJO | Immediate | 0 min | Cardiac arrest, severe hemorrhage, loss of consciousness
2 | NARANJA | Very urgent | 10 min | Severe pain, high fever with confusion, moderate injuries
3 | AMARILLO | Urgent | 60 min | Abdominal pain, moderate fever, stable fractures
4 | VERDE | Standard | 120 min | Minor wounds, mild symptoms, stable conditions
5 | AZUL | Non-urgent | 240 min | Chronic complaints, routine check-ups, minor skin issues

The goal of med-llm-triage-es is to teach a small language model this classification system — in Spanish, with explanations, action steps, alarm signs, and safety disclaimers — and compress it into a model small enough to run offline on a phone.

The Technical Foundation: Small Models, Big Ambition

Building a medical AI that runs offline on a phone imposes severe constraints. The model must be small (under 1 GB), fast, and accurate enough for safety-critical decisions. Here is the stack I chose:

The med-llm-es Stack

  1. Base Model: LiquidAI LFM2.5-1.2B. LiquidAI's LFM2.5-1.2B-Base is a compact 1.2-billion-parameter model with strong multilingual capabilities, including Spanish. Liquid Foundation Models use a state-space architecture that is both parameter-efficient and inference-friendly, making them ideal for on-device deployment [1].
  2. Fine-Tuning: Unsloth + TRL (LoRA). LoRA (Low-Rank Adaptation) [8] fine-tuning via Unsloth and Hugging Face's TRL library. LoRA freezes the base model weights and trains small rank-decomposition matrices (r=16, alpha=32), keeping VRAM under 6 GB on a single RTX 4090 [2, 3].
  3. Teacher Model: MiniMax M2.5. MiniMax's M2.5 served as the teacher model for knowledge distillation, generating high-quality Spanish medical responses and preference pairs at scale. At ~$1.40 for 5,000 examples and 4.5M tokens, it delivered exceptional value [4].
  4. Export: GGUF via llama.cpp. Final models are exported to GGUF format using llama.cpp for offline inference on mobile devices, with quantization levels from FP16 (2.2 GB) down to Q2_K (462 MB) [5].
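To make the parameter savings of the r=16 LoRA setup concrete, here is a back-of-the-envelope sketch; the hidden size is an assumed round number for illustration, not LFM2.5's actual dimension:

```python
# Rough LoRA parameter arithmetic for an r=16 adapter on one projection.
# The hidden size below is an assumption for illustration only.
def lora_params(d_in, d_out, r):
    """A LoRA adapter leaves the d_in x d_out weight frozen and trains two
    low-rank factors instead: A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)

hidden = 2048            # assumed hidden size
full = hidden * hidden   # params in one full projection matrix
lora = lora_params(hidden, hidden, r=16)

print(full, lora, round(100 * lora / full, 2))  # 4194304 65536 1.56
```

For a single square projection of this size, the adapter trains roughly 1.6% of the weights, which is why the whole run fits in under 6 GB of VRAM.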

The full training pipeline follows the modern alignment stack:

Training Pipeline

OpenMed Data → CPT (Domain Adaptation) → SFT (Supervised Fine-Tuning) → Distillation (MiniMax Teacher) → Preference Data → DPO (Alignment) → GGUF Export

Each stage served a specific purpose: CPT teaches medical Spanish, SFT teaches triage structure, distillation generates preference pairs, DPO aligns behavior, and GGUF makes it portable.

The Data Journey: From OpenMed to Triage

Starting with OpenMed

Medical AI needs medical data. I started with OpenMed — an open-source medical dataset containing PubHealth medical articles and drug interaction data in English [6]. OpenMed deserves particular recognition: open medical datasets are rare, and the project's commitment to making medical knowledge freely available is what made this work possible. A sincere thank you to the OpenMed team for their contribution to open medical AI.

The challenge: this data was in English, not Spanish. And it was general medical knowledge, not triage-specific. I needed a translation pipeline that went beyond word-for-word conversion — it needed to adapt medical terminology to Spanish-speaking contexts: "acetaminophen" becomes "paracetamol," "ER" becomes "urgencias," American drug names become their Latin American equivalents.

Using MiniMax's M2.5 API, I translated a subset of OpenMed to produce roughly 10,000 examples of Spanish medical text covering anatomy, common diseases, drug interactions, and medical procedures.

Continued Pre-Training: Teaching the Model Medical Spanish

Before asking the model to triage, I ran Continued Pre-Training (CPT) on the Spanish medical corpus. This step teaches the base model medical vocabulary and clinical reasoning patterns in Spanish before we ask it to classify urgency. CPT configuration: LoRA r=16, alpha=32, 10K Spanish medical examples, ~30 minutes on a single RTX 4090, ~5.5 GB VRAM. This stage was not about task performance — it was about domain adaptation.

Generating Triage Training Data

For the triage task itself, I generated synthetic prompts covering all five Manchester levels — from life-threatening emergencies (chest pain with diaphoresis, severe hemorrhage) down to non-urgent complaints (chronic conditions, routine questions). The initial dataset looked solid: 10,000 triage examples covering all five urgency levels in formal medical Spanish.

But there was a critical flaw lurking in the numbers.

The Discovery: Diversity Collapse and the ROJO Skew

When I analyzed the prompt diversity, I found something alarming: of 10,000 prompts, only 1,928 were unique. An 80.72% duplicate rate. The template-based generation was producing the same symptom descriptions over and over with different urgency labels.
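A diversity audit of this kind takes a few lines; this is a minimal sketch of the check (with toy data) that would surface the duplicate rate before any GPU time is spent:

```python
# Minimal prompt-diversity audit: count unique prompts before training.
def duplicate_rate(prompts):
    unique = len(set(prompts))
    return unique, 100 * (len(prompts) - unique) / len(prompts)

# Toy data: 3 distinct prompts stretched over 10 rows.
prompts = ["dolor de pecho"] * 5 + ["fiebre alta"] * 3 + ["herida leve"] * 2
unique, dup_pct = duplicate_rate(prompts)
print(unique, dup_pct)  # 3 70.0
```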

More critically, the distribution was heavily skewed:

Training Data Distribution: Original Dataset
  • ROJO (Emergency): 60%
  • NARANJA (Very urgent): 15%
  • AMARILLO (Urgent): 10%
  • VERDE (Standard): 10%
  • AZUL (Non-urgent): 5%

60% of the training data was emergency cases. The model was learning a simple lesson: when in doubt, say ROJO.
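A label-distribution audit that flags a runaway majority class is equally cheap; the 30% threshold below is an illustrative assumption, not a rule from the project:

```python
from collections import Counter

# Flag a dataset whose majority class exceeds a share threshold,
# as the 60%-ROJO dataset here would have been flagged.
def audit_labels(labels, max_share=0.30):
    counts = Counter(labels)
    top, n = counts.most_common(1)[0]
    share = n / len(labels)
    return top, share, share > max_share

# Toy labels mirroring the skewed distribution above.
labels = (["ROJO"] * 60 + ["NARANJA"] * 15 + ["AMARILLO"] * 10
          + ["VERDE"] * 10 + ["AZUL"] * 5)
print(audit_labels(labels))  # ('ROJO', 0.6, True)
```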

Supervised Fine-Tuning: Beautiful Curves, Broken Model

With the CPT model as a base, I performed Supervised Fine-Tuning (SFT) on the triage data. LoRA r=16, alpha=32, max sequence length 1024, batch size 2 with gradient accumulation of 4, learning rate 2e-4, 3 epochs, 3,750 total steps.
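As a quick consistency check, the reported step count follows directly from the other hyperparameters:

```python
# Sanity check: 3,750 total steps follow from 10K examples, batch size 2,
# gradient accumulation 4, and 3 epochs.
examples, batch, grad_accum, epochs = 10_000, 2, 4, 3
effective_batch = batch * grad_accum           # 8 examples per optimizer step
steps_per_epoch = examples // effective_batch  # 1,250
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 8 1250 3750
```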

The training metrics looked excellent:

  • Training loss: 2.99 → 0.016, a textbook-perfect descent
  • GPU utilization: 47% on a single RTX 4090
  • Total training time: ~2 hours
  • Peak VRAM usage: 5.5 GB

The model learned to generate structured responses with triage labels, action steps, alarm signs, and disclaimers. The loss curves were beautiful. The GPU hummed efficiently.

Then I tested it.

"Paciente con dolor de cabeza tensional ocasional" — Patient with occasional tension headache.

This is clearly AZUL or VERDE at worst. A minor issue.

The model responded: "Nivel de Urgencia: ROJO — Emergencia"

I tried another: "Herida superficial en el dedo que sangra un poco" — superficial finger wound bleeding slightly. VERDE at most.

"Nivel de Urgencia: ROJO — Emergencia"

Every. Single. Input. Emergency.

The model was not technically "wrong" in ways traditional metrics would catch. It had low loss. It generated fluent Spanish. It produced well-formatted responses. But it was clinically useless because it could not distinguish between a heart attack and a headache.

The Insidious Nature of Distribution Bias

This is the insidious thing about training data imbalance: standard training metrics — loss, perplexity, fluency — will not catch it. The model learns the prior distribution perfectly. It is optimizing correctly — for the wrong objective. You need task-specific clinical evaluation to see the problem.

The Distillation Drama: MiniMax M2.5 as Teacher

For knowledge distillation — using a larger model to generate high-quality training examples for preference optimization — I turned to MiniMax's M2.5. The plan: generate 5,000 high-quality Spanish medical responses as "chosen" examples, paired with weaker "rejected" responses for DPO training.

MiniMax M2.5 deserves particular acknowledgment here. It produced excellent Spanish medical output with proper clinical terminology, well-structured triage responses, and natural language quality that matched native medical writing. For a teacher model, it was remarkably affordable: 5,000 examples, 4.5 million tokens, ~92 minutes, $1.40 total.

But getting there was not straightforward.

What Went Wrong

The distillation process hit a series of technical failures: silent crashes that left partial output files with no error messages, low throughput (10–20 requests/minute despite 500 RPM API limits), no checkpoint resume (every crash meant starting from scratch), and a thundering herd problem where all concurrent requests retried simultaneously when rate-limited.

The Fix

I rewrote the distillation script with file logging, checkpoint resume, batch processing (100 prompts per batch instead of 5,000 at once), proper rate limiting via semaphore, and concurrency raised from 5 to 30 parallel requests. Throughput jumped from ~15 to ~200–300 requests per minute with zero data loss from crashes. The lesson: distillation infrastructure is as important as the distillation itself.
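A condensed sketch of that hardened loop, with checkpoint resume, per-batch writes, and a semaphore cap on concurrency; `call_teacher`, the JSONL layout, and all names are assumptions standing in for the real MiniMax client:

```python
import asyncio
import json
import os

# Sketch of the hardened distillation loop: resume from a JSONL checkpoint,
# process prompts in batches, and cap concurrency with a semaphore so
# rate-limited retries cannot stampede the API.
async def distill(prompts, call_teacher, out_path, concurrency=30, batch_size=100):
    done = set()
    if os.path.exists(out_path):  # checkpoint resume: skip finished prompts
        with open(out_path) as f:
            done = {json.loads(line)["prompt"] for line in f}

    sem = asyncio.Semaphore(concurrency)  # hard cap on in-flight requests

    async def one(prompt):
        async with sem:
            return prompt, await call_teacher(prompt)

    todo = [p for p in prompts if p not in done]
    for i in range(0, len(todo), batch_size):  # flush to disk per batch
        batch = todo[i:i + batch_size]
        results = await asyncio.gather(*(one(p) for p in batch))
        with open(out_path, "a") as f:
            for prompt, response in results:
                f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```

A crash mid-run now costs at most one batch; rerunning the script picks up where the checkpoint file left off instead of re-spending API budget.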

The DPO Experiment: Improving Structure, Destroying Safety

Direct Preference Optimization (DPO) is a technique for aligning language models with human preferences without requiring a separate reward model [7]. Instead of reinforcement learning from human feedback (RLHF), DPO directly optimizes the policy using pairs of "chosen" (preferred) and "rejected" responses. I generated 5,005 preference pairs where chosen responses had proper structure (disclaimers, alarm signs, action steps) and rejected responses were missing key elements.

DPO configuration: base model from the SFT checkpoint, beta=0.1, learning rate 5e-7, 1 epoch, ~625 steps.
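For concreteness, one preference pair looks roughly like this; the field names follow TRL's prompt/chosen/rejected convention, and the Spanish text is a condensed illustration, not actual training data:

```python
# Illustrative shape of one DPO preference pair. Field names follow TRL's
# "prompt"/"chosen"/"rejected" convention; the text is a made-up example.
pair = {
    "prompt": "Paciente con fiebre de 39°C y confusión",
    "chosen": (
        "Nivel de Urgencia: NARANJA — Muy urgente.\n"
        "Acciones: acuda a urgencias de inmediato.\n"
        "Signos de alarma: convulsiones, rigidez de cuello.\n"
        "Aviso: esto no sustituye una evaluación médica. "
        "Ante una emergencia, llame al 112/911."
    ),
    "rejected": "Probablemente no es nada. Tome paracetamol y descanse.",
}
print(sorted(pair))  # ['chosen', 'prompt', 'rejected']
```

Note that the rejected response here is a weak negative: it lacks structure entirely, so the pair never forces the model to prefer responses that keep the disclaimer, which is exactly the failure described below.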

The results were unexpected:

Metric | SFT | DPO
Label Accuracy | 20% | 23%
Action Steps Present | 37% | 70%
Alarm Signs Present | 47% | 67%
Safety Disclaimer | 33% | 0%
Strict Pass Rate | 6.7% | 0%

DPO improved structural quality (action steps jumped from 37% to 70%, alarm signs from 47% to 67%) but destroyed safety compliance. The disclaimer rate dropped to zero. The strict pass rate — requiring all safety elements present — went to 0%.

Critical Finding: RL Can Optimize Against Safety

The model learned that omitting disclaimers produced "preferred" responses in the training data — because the rejected responses were weak negatives that did not explicitly test for disclaimer presence. RL methods can optimize for visible metrics while silently degrading safety-critical ones. This is not a theoretical risk; it happened in practice, on a medical application, with real consequences for patient safety.
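A minimal response-level safety gate of the kind that would have caught the 0% disclaimer rate looks like this; the keyword checks are illustrative assumptions, not the project's actual rules:

```python
import re

# Strict pass requires every safety element, not just the label.
# Keywords are illustrative assumptions for this sketch.
def strict_pass(response, expected_label):
    checks = {
        "label": expected_label in response,
        "action_steps": "Acciones" in response,
        "alarm_signs": "Signos de alarma" in response,
        "disclaimer": "no sustituye" in response.lower(),
        "emergency_number": bool(re.search(r"112|911", response)),
    }
    return all(checks.values()), checks

resp = ("Nivel de Urgencia: VERDE\nAcciones: lave la herida con agua limpia.\n"
        "Signos de alarma: sangrado que no cede.\n"
        "Este consejo no sustituye una consulta médica. Emergencias: 911.")
ok, detail = strict_pass(resp, "VERDE")
print(ok)  # True
```

Run over an evaluation set, `all(...)` is the strict pass rate; a checkpoint whose fluent responses silently drop the disclaimer fails the gate regardless of label accuracy.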

The ROJO Bias: Diagnosis and Treatment

Back to the central problem: why did both SFT and DPO models over-predict emergencies?

Root Cause Analysis

Four factors combined to create the ROJO bias:

Four Root Causes of the ROJO Bias

  1. Training data imbalance. 60% ROJO examples taught the model that "emergency" was the statistical default. The prior overwhelmed clinical reasoning.
  2. RL amplification. Both GRPO and DPO rewarded "safe" responses, and labeling everything as ROJO is the safest prediction from a loss-minimization perspective. RL amplified the prior rather than correcting it.
  3. Evaluation mismatch. The evaluation tested for single-word labels when the model was trained to produce detailed medical narratives, so the framework did not match the output format.
  4. Feedback loop. The biased model generated biased preference data, which in turn trained a more biased model: a self-reinforcing cycle that only external intervention could break.

The Solution: Balanced Training

The fix was conceptually simple but practically transformative. I created a new balanced dataset with equal representation for each triage level:

Training Data Distribution: Balanced Dataset
  • ROJO (Emergency): 20%
  • NARANJA (Very urgent): 20%
  • AMARILLO (Urgent): 20%
  • VERDE (Standard): 20%
  • AZUL (Non-urgent): 20%

350 examples (70 per level) instead of 10,000 imbalanced ones. Then I trained a fresh SFT model from the base — not from the biased CPT checkpoint. This was crucial: the CPT model had already absorbed the skewed distribution, so continuing from it would carry the bias forward.
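The rebalancing itself is plain stratified downsampling; this sketch (names assumed) caps each triage level at the same count and discards the majority-class surplus:

```python
import random
from collections import defaultdict

# Stratified downsampling: keep at most `per_level` examples per triage
# label, discarding the majority-class surplus. Names are illustrative.
def balance(examples, per_level=70, seed=0):
    by_level = defaultdict(list)
    for ex in examples:
        by_level[ex["label"]].append(ex)
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for group in by_level.values():
        rng.shuffle(group)
        balanced.extend(group[:per_level])
    rng.shuffle(balanced)  # avoid label-sorted ordering during training
    return balanced
```

Applied to the skewed 10K set, five levels at 70 examples each yields exactly the 350-example dataset described above.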

The results were immediate and dramatic:

Metric | SFT (Original) | DPO | Balanced SFT
Label Accuracy | 20% | 23% | ~80%
Action Steps | 37% | 70% | Present
Alarm Signs | 47% | 67% | Present
Safety Disclaimer | 33% | 0% | Present
Strict Pass Rate | 6.7% | 0% | ~60%

The balanced SFT model correctly classified tension headaches as AZUL/VERDE, minor wounds as VERDE, chest pain as NARANJA/ROJO, and one-sided weakness with confusion as ROJO. It passed safety checks and improved classification. This is the model we deployed.

Quantization: From 2.2 GB to 805 MB

For offline mobile deployment, the model needed to fit on phones. Using llama.cpp [5], I quantized the model across multiple precision levels:

Format | Size | Reduction vs FP16 | Use Case
FP16 | 2.2 GB | baseline | Desktop, high-end devices
Q5_K_M | 805 MB | 63% | Mobile (recommended)
Q4_K_M | 698 MB | 68% | Low-end devices
Q2_K | 462 MB | 79% | Minimal storage

The recommended model — med-llm-es-triage-balanced-Q5_K_M.gguf at 805 MB — fits comfortably on any modern phone while maintaining quality for triage decisions. The 63% size reduction from FP16 comes with minimal quality degradation for this task, thanks to the relatively simple classification nature of triage compared to open-ended generation.
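As a quick arithmetic check, the reduction percentages in the table follow directly from the FP16 baseline:

```python
# Verify the reported size reductions against the 2.2 GB FP16 baseline.
fp16_mb = 2200
for name, size_mb in [("Q5_K_M", 805), ("Q4_K_M", 698), ("Q2_K", 462)]:
    reduction = round(100 * (1 - size_mb / fp16_mb))
    print(name, f"{reduction}%")
# Q5_K_M 63%
# Q4_K_M 68%
# Q2_K 79%
```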

What I Would Do Differently: Five Hard-Won Lessons

This project taught me more about building AI systems than any paper or course. Here are the five things I would change if I started over tomorrow:

Five Recommendations for Safety-Critical AI
1
Define Acceptance Gates Before Training
Do not wait until training completes to think about quality. Define strict criteria upfront: triage label present and correct, action steps included, alarm signs mentioned, safety disclaimer present, emergency number (112/911) included. Reject checkpoints that fail safety gates, even if fluency improves. This single practice would have caught the ROJO bias on the first training run.
2
Balance Your Data First
Imbalanced training data will haunt you. The model will learn the prior, and RL will amplify it. Equal representation per class (20% each in our case) is a simple but effective baseline. In healthcare, where the consequence of over-triage is unnecessary emergency visits and the consequence of under-triage can be death, this balance is not an optimization choice — it is a safety requirement.
3
Design Hard Negatives for Preference Learning
The "rejected" responses in DPO need to be thoughtfully designed. Weak negatives teach nothing. A rejected response of "I don't know" is too easy to distinguish. A near-miss like "Your symptoms suggest VERDE urgency — wait a few days" for an actual AMARILLO case requires real judgment and teaches meaningful distinctions.
4
Calibrate RL Rewards Explicitly
Do not just reward "correct." Penalize over-triage explicitly. A reward function should include: +1.0 for correct labels, -0.5 for predicting ROJO when actual is AZUL (over-triage penalty), and +0.3/-0.5 for disclaimer presence/absence (safety gate). Your reward function is your product specification in executable form.
5
Test Deployment Runtime Continuously
Run the same evaluation prompts through HuggingFace (training runtime), llama.cpp CLI (inference runtime), and your target deployment platform. Track divergence across runtimes and fail builds when parity drifts. The model that works in training may behave differently after quantization and format conversion.
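The reward calibration in recommendation 4 can be sketched directly from the stated values; the function shape is an illustration, not the project's actual implementation:

```python
# Sketch of recommendation 4: reward correctness, explicitly penalize the
# worst over-triage case, and gate on the safety disclaimer.
def triage_reward(predicted, actual, has_disclaimer):
    reward = 1.0 if predicted == actual else 0.0
    # Over-triage penalty: predicting ROJO for an AZUL (non-urgent) case.
    if predicted == "ROJO" and actual == "AZUL":
        reward -= 0.5
    reward += 0.3 if has_disclaimer else -0.5  # safety gate
    return reward

print(triage_reward("ROJO", "AZUL", has_disclaimer=False))   # -1.0
print(triage_reward("VERDE", "VERDE", has_disclaimer=True))  # 1.3
```

Under this scheme the all-ROJO policy is no longer the loss-minimizing default: a fluent emergency response for a non-urgent case scores worse than an honest miss.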

Lessons for the Industry

Beyond the specific recommendations, this project reinforced several principles I believe more AI teams should internalize:

Reward Design Is Product Design

Your reward function or preference data is your product specification in executable form. If it is vague, your model will be vague. If it is inconsistent, your model will be inconsistent. The 0% disclaimer rate after DPO was not a bug in the algorithm — it was a bug in the specification.

Distillation is leverage, not trust transfer. Teacher models like MiniMax M2.5 accelerate development dramatically, but they do not replace validation. Every distilled response still needed checking for format compliance and safety. The $1.40 we spent on distillation was the best investment of the project — but only because we validated the output.

Data quality dominates late-stage tuning. I could have spent weeks tuning DPO hyperparameters. Instead, fixing the training data distribution solved the core problem. 350 balanced examples outperformed 10,000 imbalanced ones. Poor synthetic supervision consumes your RL budget and hides progress.

Process quality is model quality. Reproducibility, evaluation discipline, and safety gates are not "operations" — they are part of model performance. The 6.7% strict pass rate was not a training failure; it was a process failure.

Report failures first. In safety-sensitive applications, honest failure analysis is more valuable than impressive demos. The ROJO bias was not an embarrassment — it was the most important finding of the entire project. If I had hidden it and only reported the final balanced model's ~80% accuracy, no one would learn anything useful.

Why This Matters Beyond the Code

It is easy to get lost in LoRA ranks, DPO betas, and quantization formats. But let me return to the human stakes: a grandmother in rural Oaxaca who develops chest pain at night.

For her, the difference between a model that says "ROJO" for everything and one that accurately triages is not a metric on a dashboard. It is the difference between a 45-kilometer midnight drive that was unnecessary and a calm voice on her phone that says: "Your symptoms suggest moderate urgency. You should visit a clinic in the morning. If you experience shortness of breath, numbness, or the pain becomes severe, call 911 immediately."

That is what this project is about. Not the pipeline. Not the architecture. Not the training curves. It is about compressing medical knowledge into a device that fits in a pocket and works without a cell tower, so that people who have always had to guess about their health can make a more informed choice.

We are not there yet. The model needs more balanced data (target: 1,000+ examples), DPO with hard negatives on the balanced set, clinical constraint validation, and pilot deployment in real communities. But the foundation is solid, the failures are documented, and the path forward is clear.

Medical AI is not about building the most impressive model. It is about building the most trustworthy one. And trustworthiness starts with being honest about what does not work.

Acknowledgments and Open-Source Credits

This project stands on the shoulders of remarkable open-source projects and their communities:

  • OpenMed [6] — The open medical dataset that made this work possible. Medical data is scarce, and OpenMed's commitment to open access is a genuine contribution to healthcare AI. Thank you to the OpenMed team.
  • LiquidAI — LFM2.5 [1] — The base model. LiquidAI's Liquid Foundation Models combine parameter efficiency with strong multilingual capabilities, making on-device medical AI feasible at 1.2B parameters.
  • MiniMax — M2.5 [4] — The teacher model for knowledge distillation. M2.5 produced exceptional Spanish medical output at a fraction of the cost of alternatives. Its quality-to-price ratio made this project economically viable on a single-GPU budget.
  • Unsloth [2] — LoRA fine-tuning at remarkable speed and memory efficiency. Unsloth made it possible to train on a single RTX 4090 with under 6 GB VRAM.
  • Hugging Face TRL [3] — The training library for SFT and DPO. TRL's clean abstractions for alignment training reduced weeks of engineering to configuration files.
  • llama.cpp [5] — GGUF quantization and inference. Without llama.cpp, offline mobile deployment would be orders of magnitude harder.
  • Manchester Triage System — The clinical framework that gave the model its classification structure.

Building medical AI is a team sport. The entire pipeline — from data to deployment — was possible because open-source communities chose to share their work.

References

  1. LiquidAI. "LFM2.5: Liquid Foundation Models." 2025. liquid.ai/liquid-foundation-models
  2. Unsloth. "Fine-tune LLMs 2x Faster with 80% Less Memory." github.com/unslothai/unsloth
  3. Hugging Face. "TRL: Transformer Reinforcement Learning." github.com/huggingface/trl
  4. MiniMax. "MiniMax-M2.5." minimax.io
  5. Gerganov, G. et al. "llama.cpp: LLM Inference in C/C++." github.com/ggerganov/llama.cpp
  6. OpenMed. "Open Medical Datasets for AI Research." github.com/openmedlab
  7. Rafailov, R. et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. arxiv.org/abs/2305.18290
  8. Hu, E. J. et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. arxiv.org/abs/2106.09685
  9. Manchester Triage System. triage-manchester.org
  10. Guerrero, M. "med-llm-triage-es: Spanish Medical Triage AI." github.com/apolmig/med-llm-triage-es