MLlm-DR: Towards Explainable Depression Recognition with MultiModal Large Language Models
Clinical depression diagnosis relies heavily on both verbal and non-verbal cues in patient interviews, yet existing automated methods often operate as black-box models and fail to provide trustworthy explanations, limiting their clinical applicability. Moreover, depression datasets commonly impose strict privacy constraints that prohibit access to raw audio–visual data, and deploying large language models in real-world medical environments is often constrained by computational cost. To address these challenges, we propose MLlm-DR, a multimodal large language model for explainable depression recognition. We first employ knowledge distillation to transfer high-quality diagnostic rationales from a powerful LLM to a smaller, deployable LLM, equipping it with clinically aligned reasoning abilities. We further introduce a lightweight query module (LQ-former) that extracts salient depression-related cues from pre-extracted audio and visual features and maps them into an LLM-compatible representation. MLlm-DR is trained in two stages: multimodal alignment via LQ-former pretraining, followed by multi-task optimization that jointly learns rationale generation and score regression, encouraging consistency between explanations and predictions and improving interpretability. Experimental results show that MLlm-DR achieves state-of-the-art performance on two interview-based benchmark datasets, CMDC and E-DAIC-WOZ, while generating clinically meaningful and readable explanations.
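The LQ-former described above compresses variable-length pre-extracted audio/visual feature sequences into a fixed set of tokens via a small bank of learnable queries, in the spirit of Q-Former-style cross-attention. The abstract gives no implementation details, so the following is a minimal NumPy sketch under assumed shapes and randomly initialized placeholder weights; the function name, dimensions, and projection matrices are illustrative, not the authors' actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lq_former_sketch(features, n_queries=8, d_model=64, seed=0):
    """Cross-attend a small bank of learnable queries over pre-extracted
    modality features of shape (T, d_model), returning (n_queries, d_model)
    summary tokens that could then be projected into an LLM's embedding
    space. Weights are random placeholders (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_queries, d_model)) / np.sqrt(d_model)
    W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    K = features @ W_k                                  # (T, d_model)
    V = features @ W_v                                  # (T, d_model)
    attn = softmax(queries @ K.T / np.sqrt(d_model))    # (n_queries, T)
    return attn @ V                                     # fixed-length output

# Example: 120 frames of 64-dim pre-extracted visual features
feats = np.random.default_rng(1).standard_normal((120, 64))
tokens = lq_former_sketch(feats)
print(tokens.shape)  # → (8, 64)
```

Whatever the true internals, the key property this sketch captures is that the output length is independent of interview duration, which is what makes the representation cheap to prepend to an LLM's input.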
Added 2026-04-21