Abstract
This paper addresses the escalating global mental health crisis, accentuated by the COVID-19 pandemic, by proposing an automated approach to depression detection. Leveraging the DAIC-WOZ dataset, a collection of clinical interviews and survey evaluations from over a hundred individuals, the study applies machine learning algorithms to automate and improve depression recognition. Model performance is evaluated using key metrics, including root mean square error (RMSE) and mean absolute error (MAE). The principal contribution is a novel attention fusion network that integrates features extracted from the video, text, and audio modalities, with particular emphasis on intra- and inter-modality connections, capturing how features interact both within and across modalities. The paper is organized into two main parts: the first reviews existing approaches to automatic depression recognition, surveying related areas and the modalities commonly employed; the second presents methodologies for the visual and audio modalities and develops the proposed algorithm. The research aims to contribute an effective approach to depression recognition through the integration of multi-modal machine learning techniques, with implications for more accurate mental health assessment and the development of targeted intervention strategies, addressing the pressing challenges posed by the global mental health crisis.
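For intuition only, the following minimal sketch illustrates how attention weights can fuse per-modality feature vectors (audio, video, text) into a single representation. It is not the paper's implementation; the function name, the mean-query scoring rule, and the feature dimension are illustrative assumptions standing in for the learned attention fusion network described above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(audio_feat, video_feat, text_feat):
    """Fuse three modality feature vectors with attention weights.

    Each input is a 1-D vector already projected to a common dimension d.
    Here attention scores come from similarity to the mean representation;
    in a trained model they would be produced by learned parameters.
    """
    feats = np.stack([audio_feat, video_feat, text_feat])   # (3, d)
    query = feats.mean(axis=0)                               # (d,)
    scores = feats @ query / np.sqrt(feats.shape[1])         # (3,)
    weights = softmax(scores)                                 # (3,)
    return weights @ feats                                    # (d,) fused vector

# Example with random placeholder features of dimension 64.
rng = np.random.default_rng(0)
fused = attention_fusion(rng.normal(size=64),
                         rng.normal(size=64),
                         rng.normal(size=64))
print(fused.shape)  # (64,)
```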