Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
DRIFT precomputes a reasoning prior as the parameter-space difference between a reasoning model and a multimodal model. During multimodal SFT, gradients are biased toward this prior, injecting reasoning ability while preserving multimodal alignment.
Concretely, let Δθ = θ_reason - θ_MM denote the parameter difference between a reasoning-enhanced LLM and its multimodal counterpart. During fine-tuning on multimodal data, we bias each gradient update in the direction of Δθ, which improves reasoning without destabilizing the vision-language interface, as sketched below.
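A minimal sketch of this update rule, assuming a PyTorch training loop, is given below. The coefficient `lam`, the state-dict helpers, and the specific blending rule (adding -λ·Δθ to the gradient) are illustrative choices for exposition, not the released implementation.

```python
import torch

def compute_reasoning_prior(reasoning_sd, mllm_sd):
    """Reasoning prior Δθ = θ_reason - θ_mllm over parameters shared by both
    checkpoints (assumes the LLM-backbone keys already have matching names)."""
    return {k: reasoning_sd[k] - mllm_sd[k] for k in mllm_sd if k in reasoning_sd}

def drift_step(model, loss, prior, optimizer, lam=0.1):
    """One multimodal SFT step whose gradient is biased toward the prior."""
    optimizer.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None and name in prior:
            # Descending along g - lam * Δθ nudges θ toward the reasoning expert,
            # while the task loss continues to drive multimodal alignment.
            p.grad.add_(prior[name].to(p.grad), alpha=-lam)
    optimizer.step()
```

In this form the method reduces to standard SFT when `lam = 0`, so it plugs into existing supervised fine-tuning pipelines without structural changes.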
Effect of model merging on multimodal reasoning benchmarks.
Performance on MathVista, MathVision, and MathVerse for four MLLMs before and after merging with their text-only reasoning experts. Scores include relative change (rel.) versus the base model.
Benchmark | LLaVA-Next-LLaMA3-8B (Base) | +Dart-Uniform | rel. | Idefics-8B (Base) | +MetaMath | rel. | Qwen2-VL-7B (Base) | +Qwen2-Math | rel. | Qwen2.5-VL-7B (Base) | +DeepSeek-R1 | rel. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MathVista | 37.4 | 38.2 | +0.8 | 51.8 | 53.2 | +1.4 | 61.2 | 60.2 | -1.0 | 67.9 | 65.8 | -2.1 |
MathVision | 13.8 | 15.8 | +2.0 | 17.1 | 11.8 | -5.3 | 21.1 | 21.7 | +0.6 | 25.0 | 22.7 | -2.3 |
MathVerse | 16.0 | 17.4 | +1.4 | 11.0 | 12.4 | +1.4 | 26.9 | 26.7 | -0.2 | 41.4 | 33.2 | -8.2 |
rel. values denote absolute score differences relative to the Base model.
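For context, the naive merging evaluated above corresponds to plain weight interpolation (task arithmetic) over the shared LLM backbone. The sketch below is a generic illustration; the scaling factor `alpha` and the assumption that parameter names are already aligned across checkpoints are ours, not details taken from the table.

```python
def task_arithmetic_merge(mllm_sd, reasoning_sd, base_sd, alpha=0.5):
    """Naive parameter-space merging: θ_merged = θ_mllm + α · (θ_reason - θ_base).
    Only keys present in all three state dicts (the shared LLM backbone) change;
    vision-encoder and projector weights are left untouched."""
    merged = dict(mllm_sd)
    for k in mllm_sd:
        if k in reasoning_sd and k in base_sd:
            merged[k] = mllm_sd[k] + alpha * (reasoning_sd[k] - base_sd[k])
    return merged
```

Setting `alpha = 1` recovers full task-vector addition; the Task Arithmetic row in the next table is this family of baselines.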
Layer/Module-wise analysis of model merging pairs.
We compare LLaVA-Next-8B vs. Dart-Uniform, Idefics-8B vs. MetaMath, Qwen2-VL-7B vs. Qwen2-Math-7B, and Qwen2.5-VL-7B vs. DeepSeek-R1-Qwen-7B. Top-Left: per-layer L2 norm differences. Bottom-Left: per-layer cosine similarity. Top-Right: average L2 norm differences for FFN and normalization layers. Bottom-Right: average L2 norm differences for attention projections (Q/K/V/O).
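The per-layer statistics summarized in this analysis can be reproduced with a small diagnostic such as the sketch below; the `layers.<i>.` name pattern is an assumption that fits LLaMA/Qwen-style checkpoints and may need adjusting for other architectures.

```python
import re
from collections import defaultdict

import torch
import torch.nn.functional as F

def layerwise_diff(sd_a, sd_b):
    """Per-layer L2 norm of the weight difference and mean cosine similarity
    between two checkpoints whose parameter names match."""
    sq_l2, cos = defaultdict(float), defaultdict(list)
    for name in sd_a.keys() & sd_b.keys():
        m = re.search(r"layers\.(\d+)\.", name)
        if m is None:
            continue  # skip embeddings, lm_head, vision tower, etc.
        layer = int(m.group(1))
        a = sd_a[name].float().flatten()
        b = sd_b[name].float().flatten()
        sq_l2[layer] += torch.sum((a - b) ** 2).item()
        cos[layer].append(F.cosine_similarity(a, b, dim=0).item())
    l2 = {k: v ** 0.5 for k, v in sorted(sq_l2.items())}
    mean_cos = {k: sum(v) / len(v) for k, v in sorted(cos.items())}
    return l2, mean_cos
```

Grouping the same statistics by module name (FFN, normalization, and Q/K/V/O projections) instead of layer index gives the right-hand panels.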
Evaluation results on multimodal reasoning benchmarks.
We compare DRIFT, which transfers reasoning in the gradient space, with standard parameter-space merging baselines. Results are reported on MathVista, MathVision, MathVerse, WeMath (strict/loose), and LogicVista. The best result in each column is in bold; differences relative to the baseline (Qwen2.5-VL-7B-Instruct) are shown in parentheses.
Model | MathVista | MathVision | MathVerse | WeMath (strict) | WeMath (loose) | LogicVista | Avg. |
---|---|---|---|---|---|---|---|
Qwen2.5-VL-7B-Instruct | 67.9 | 25.0 | 41.4 | 34.3 | 52.8 | **46.7** | 44.7 |
Parameter merging with DeepSeek-R1-Qwen-Distill-7B | | | | | | | |
Task Arithmetic | 65.8 (-2.1) | 22.7 (-2.3) | 33.2 (-8.2) | 30.1 (-4.2) | 51.2 (-1.6) | 42.0 (-4.7) | 40.8 (-3.9) |
Layer Swap | 63.6 (-4.3) | 22.9 (-2.1) | 37.9 (-3.5) | 32.1 (-2.2) | 50.1 (-2.7) | 35.1 (-11.6) | 40.3 (-4.4) |
TIES | 63.6 (-4.3) | 23.1 (-1.9) | 39.5 (-1.9) | 33.4 (-0.9) | 51.7 (-1.1) | 42.1 (-4.6) | 42.2 (-2.5) |
DARE-TIES | 66.3 (-1.6) | 23.6 (-1.4) | 38.3 (-3.1) | 33.7 (-0.6) | 52.6 (-0.2) | 42.0 (-4.7) | 42.8 (-1.9) |
DARE-Linear | 66.0 (-1.9) | 22.3 (-2.7) | 35.5 (-5.9) | 30.8 (-3.5) | 51.2 (-1.6) | 42.5 (-4.2) | 41.4 (-3.3) |
Reasoning Injection from DeepSeek-R1-Qwen-Distill-7B | | | | | | | |
DRIFT (Ours) | **70.3** (+2.4) | **26.5** (+1.5) | **43.7** (+2.3) | **36.9** (+2.6) | **59.2** (+6.4) | 45.6 (-1.1) | **47.0** (+2.3) |
Values in parentheses denote absolute differences relative to the baseline.
Evaluation results on visual reasoning benchmarks.
We report performance on MathVista, MathVision, MathVerse, WeMath (strict), and LogicVista across open-source models and reasoning fine-tuning methods. Our DRIFT results are in bold, with improvements over our SFT baseline shown in parentheses.
Model | MathVista | MathVision | MathVerse | WeMath | LogicVista |
---|---|---|---|---|---|
Open-source Models | | | | | |
LLaVA-OneVision-7B | 62.6 | 17.6 | 17.6 | 17.7 | 32.0 |
InternLM-XComposer2.5 | 64.0 | 17.8 | 16.2 | 14.1 | 34.7 |
InternVL3-8B | 70.5 | 28.6 | 33.9 | 37.5 | 43.6 |
InternVL2.5-8B | 64.5 | 17.0 | 22.8 | 23.5 | 36.0 |
InternVL2-8B | 58.3 | 20.0 | 20.4 | 20.2 | 33.6 |
QvQ-72B-Preview | 70.3 | 34.9 | 48.2 | 39.0 | 58.2 |
Kimi-VL-16B | 66.0 | 21.8 | 34.1 | 32.3 | 42.7 |
Qwen2-VL-7B | 61.6 | 19.2 | 25.4 | 22.3 | 33.3 |
Qwen2.5-VL-7B† | 67.9 | 25.0 | 41.4 | 34.3 | 46.7 |
Reasoning Fine-tuning Methods | | | | | |
R1-Onevision-7B | 64.1 | 29.9 | 40.0 | — | 61.8 |
OpenVLThinker-7B | 65.3 | 23.0 | 38.1 | 35.2 | 44.5 |
R1-VL-7B | 63.5 | 24.7 | 40.0 | — | — |
X-REASONER | 69.0 | 29.6 | — | — | — |
Ours (SFT) | 68.7 | 25.1 | 42.0 | 33.3 | 45.6 |
DRIFT (Ours) | **70.3** (+1.6) | **26.5** (+1.5) | **43.7** (+1.7) | **36.9** (+3.6) | **45.6** (+0.0) |
† indicates results reproduced by ourselves. Values in parentheses show improvements over our SFT baseline.