Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
DRIFT precomputes a reasoning prior as the parameter-space difference between a reasoning model and a multimodal model. During multimodal SFT, gradients are biased toward this prior, injecting reasoning ability while preserving multimodal alignment.
Concretely, let Δθ = θ_reason - θ_MM denote the parameter difference between a reasoning-enhanced LLM and its multimodal counterpart. During fine-tuning on multimodal data, we bias each gradient update in the direction of Δθ, which improves reasoning without destabilizing the vision-language interface, as sketched below.
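A minimal sketch of this update rule, assuming a PyTorch training loop, is given below. The coefficient `lam`, the state-dict helpers, and the specific blending rule (adding -λ·Δθ to the gradient) are illustrative choices for exposition, not the released implementation.

```python
import torch

def compute_reasoning_prior(reasoning_sd, mllm_sd):
    """Reasoning prior Δθ = θ_reason - θ_mllm over parameters shared by both
    checkpoints (assumes the LLM-backbone keys already have matching names)."""
    return {k: reasoning_sd[k] - mllm_sd[k] for k in mllm_sd if k in reasoning_sd}

def drift_step(model, loss, prior, optimizer, lam=0.1):
    """One multimodal SFT step whose gradient is biased toward the prior."""
    optimizer.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None and name in prior:
            # Descending along g - lam * Δθ nudges θ toward the reasoning expert,
            # while the task loss continues to drive multimodal alignment.
            p.grad.add_(prior[name].to(p.grad), alpha=-lam)
    optimizer.step()
```

In this form the method reduces to standard SFT when `lam = 0`, so it plugs into existing supervised fine-tuning pipelines without structural changes.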
Effect of model merging on multimodal reasoning benchmarks.
Performance on MathVista, MathVision, and MathVerse for four MLLMs before and after merging with their text-only reasoning experts. Scores include relative change (rel.) versus the base model.
Benchmark | LLaVA-Next-LLaMA3-8B (Base) | +Dart-Uniform | rel. | Idefics-8B (Base) | +MetaMath | rel. | Qwen2-VL-7B (Base) | +Qwen2-Math | rel. | Qwen2.5-VL-7B (Base) | +DeepSeek-R1 | rel. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MathVista | 37.4 | 38.2 | +0.8 | 51.8 | 53.2 | +1.4 | 61.2 | 60.2 | -1.0 | 67.9 | 65.8 | -2.1 |
MathVision | 13.8 | 15.8 | +2.0 | 17.1 | 11.8 | -5.3 | 21.1 | 21.7 | +0.6 | 25.0 | 22.7 | -2.3 |
MathVerse | 16.0 | 17.4 | +1.4 | 11.0 | 12.4 | +1.4 | 26.9 | 26.7 | -0.2 | 41.4 | 33.2 | -8.2 |
rel. values denote absolute score differences relative to the Base model.
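For context, the naive merging evaluated above corresponds to plain weight interpolation (task arithmetic) over the shared LLM backbone. The sketch below is a generic illustration; the scaling factor `alpha` and the assumption that parameter names are already aligned across checkpoints are ours, not details taken from the table.

```python
def task_arithmetic_merge(mllm_sd, reasoning_sd, base_sd, alpha=0.5):
    """Naive parameter-space merging: θ_merged = θ_mllm + α · (θ_reason - θ_base).
    Only keys present in all three state dicts (the shared LLM backbone) change;
    vision-encoder and projector weights are left untouched."""
    merged = dict(mllm_sd)
    for k in mllm_sd:
        if k in reasoning_sd and k in base_sd:
            merged[k] = mllm_sd[k] + alpha * (reasoning_sd[k] - base_sd[k])
    return merged
```

Setting `alpha = 1` recovers full task-vector addition; the Task Arithmetic row in the next table is this family of baselines.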
Layer/Module-wise analysis of model merging pairs.
We compare LLaVA-Next-8B vs. Dart-Uniform, Idefics-8B vs. MetaMath, Qwen2-VL-7B vs. Qwen2-Math-7B, and Qwen2.5-VL-7B vs. DeepSeek-R1-Qwen-7B. Top-Left: per-layer L2 norm differences. Bottom-Left: per-layer cosine similarity. Top-Right: average L2 norm differences for FFN and normalization layers. Bottom-Right: average L2 norm differences for attention projections (Q/K/V/O).
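The per-layer statistics summarized in this analysis can be reproduced with a small diagnostic such as the sketch below; the `layers.<i>.` name pattern is an assumption that fits LLaMA/Qwen-style checkpoints and may need adjusting for other architectures.

```python
import re
from collections import defaultdict

import torch
import torch.nn.functional as F

def layerwise_diff(sd_a, sd_b):
    """Per-layer L2 norm of the weight difference and mean cosine similarity
    between two checkpoints whose parameter names match."""
    sq_l2, cos = defaultdict(float), defaultdict(list)
    for name in sd_a.keys() & sd_b.keys():
        m = re.search(r"layers\.(\d+)\.", name)
        if m is None:
            continue  # skip embeddings, lm_head, vision tower, etc.
        layer = int(m.group(1))
        a = sd_a[name].float().flatten()
        b = sd_b[name].float().flatten()
        sq_l2[layer] += torch.sum((a - b) ** 2).item()
        cos[layer].append(F.cosine_similarity(a, b, dim=0).item())
    l2 = {k: v ** 0.5 for k, v in sorted(sq_l2.items())}
    mean_cos = {k: sum(v) / len(v) for k, v in sorted(cos.items())}
    return l2, mean_cos
```

Grouping the same statistics by module name (FFN, normalization, and Q/K/V/O projections) instead of layer index gives the right-hand panels.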
Evaluation results on multimodal reasoning benchmarks.
We compare DRIFT, which transfers reasoning in the gradient space, with standard parameter-space merging baselines. Results are reported on MathVista, MathVision, MathVerse, WeMath (strict/loose), and LogicVista. The best result in each column is in bold; differences relative to the baseline (Qwen2.5-VL-7B-Instruct) are shown in parentheses.
Model | MathVista | MathVision | MathVerse | WeMath (strict) | WeMath (loose) | LogicVista | Avg. |
---|---|---|---|---|---|---|---|
Qwen2.5-VL-7B-Instruct | 67.9 | 25.0 | 41.4 | 34.3 | 52.8 | **46.7** | 44.7 |
Parameter merging with DeepSeek-R1-Qwen-Distill-7B | | | | | | | |
Task Arithmetic | 65.8 (-2.1) | 22.7 (-2.3) | 33.2 (-8.2) | 30.1 (-4.2) | 51.2 (-1.6) | 42.0 (-4.7) | 40.8 (-3.9) |
Layer Swap | 63.6 (-4.3) | 22.9 (-2.1) | 37.9 (-3.5) | 32.1 (-2.2) | 50.1 (-2.7) | 35.1 (-11.6) | 40.3 (-4.4) |
TIES | 63.6 (-4.3) | 23.1 (-1.9) | 39.5 (-1.9) | 33.4 (-0.9) | 51.7 (-1.1) | 42.1 (-4.6) | 42.2 (-2.5) |
DARE-TIES | 66.3 (-1.6) | 23.6 (-1.4) | 38.3 (-3.1) | 33.7 (-0.6) | 52.6 (-0.2) | 42.0 (-4.7) | 42.8 (-1.9) |
DARE-Linear | 66.0 (-1.9) | 22.3 (-2.7) | 35.5 (-5.9) | 30.8 (-3.5) | 51.2 (-1.6) | 42.5 (-4.2) | 41.4 (-3.3) |
Reasoning Injection from DeepSeek-R1-Qwen-Distill-7B | | | | | | | |
DRIFT (Ours) | **70.3** (+2.4) | **26.5** (+1.5) | **43.7** (+2.3) | **36.9** (+2.6) | **59.2** (+6.4) | 45.6 (-1.1) | **47.0** (+2.3) |
Values in parentheses denote absolute differences relative to the baseline.
Evaluation results on visual reasoning benchmarks.
We report performance on MathVista, MathVision, MathVerse, WeMath (strict), and LogicVista across open-source models and reasoning fine-tuning methods. Our DRIFT results are in bold, with improvements over our SFT baseline shown in parentheses.
Model | MathVista | MathVision | MathVerse | WeMath | LogicVista |
---|---|---|---|---|---|
Open-source Models | | | | | |
LLaVA-OneVision-7B | 62.6 | 17.6 | 17.6 | 17.7 | 32.0 |
InternLM-XComposer2.5 | 64.0 | 17.8 | 16.2 | 14.1 | 34.7 |
InternVL3-8B | 70.5 | 28.6 | 33.9 | 37.5 | 43.6 |
InternVL2.5-8B | 64.5 | 17.0 | 22.8 | 23.5 | 36.0 |
InternVL2-8B | 58.3 | 20.0 | 20.4 | 20.2 | 33.6 |
QvQ-72B-Preview | 70.3 | 34.9 | 48.2 | 39.0 | 58.2 |
Kimi-VL-16B | 66.0 | 21.8 | 34.1 | 32.3 | 42.7 |
Qwen2-VL-7B | 61.6 | 19.2 | 25.4 | 22.3 | 33.3 |
Qwen2.5-VL-7B† | 67.9 | 25.0 | 41.4 | 34.3 | 46.7 |
Reasoning Fine-tuning Methods | | | | | |
R1-Onevision-7B | 64.1 | 29.9 | 40.0 | — | 61.8 |
OpenVLThinker-7B | 65.3 | 23.0 | 38.1 | 35.2 | 44.5 |
R1-VL-7B | 63.5 | 24.7 | 40.0 | — | — |
X-REASONER | 69.0 | 29.6 | — | — | — |
Ours (SFT) | 68.7 | 25.1 | 42.0 | 33.3 | 45.6 |
DRIFT (Ours) | **70.3** (+1.6) | **26.5** (+1.5) | **43.7** (+1.7) | **36.9** (+3.6) | **45.6** (+0.0) |
† indicates results reproduced by ourselves. Values in parentheses show improvements over our SFT baseline.