Comparison of Natural Language Processing Models for Depression Detection in Chatbot Dialogues


Bibliographic Details
Main Author: Belser, Christian Alexander
Other Authors: Fletcher, Richard Ribon
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/152710
Description
Summary: Depression is an important challenge in the world today and a major source of disability. In the US, a recent study showed that approximately 36 million adults had at least one major depressive episode, including some with severe impairment [1]. However, approximately two-thirds of all depression cases are never diagnosed [2], largely due to a shortage of trained mental health professionals as well as a lingering cultural stigma that often prevents afflicted people from seeking professional care. To address this need, there is emerging interest in using computer algorithms to automatically screen for depression, which offers the potential to be widely deployed to the public via clinical websites and mobile apps. Within this field, Dr. Fletcher's group at MIT develops mobile platforms that support mental health wellness and psychotherapy, including tools to screen for mental health disorders and refer people to treatment. As part of this work, this thesis compares three distinct Natural Language Processing (NLP) models used to screen for depression. I have revised and updated three state-of-the-art model architectures to accurately screen for depression in individuals: (1) Bi-directional gated recurrent unit (BGRU) models, (2) Hierarchical attention networks (HAN), and (3) Long-sequence Transformer models. The models were all trained and tested on a common standard clinical dataset (DAIC-WOZ) derived from clinical patient interviews. After optimization, and after exploring several variants of each type of model, the following results were found: BGRU (accuracy=0.71, precision=0.65, recall=0.63, F1-score=0.64, MCC=0.20); HAN (accuracy=0.77, precision=0.76, recall=0.77, F1-score=0.76, MCC=0.46); Transformer (accuracy=0.77, precision=0.76, recall=0.77, F1-score=0.76, MCC=0.43). In addition to model performance, I also compare the different categories of models based on computational resources and input token size.
I also discuss the future evolution of these models and provide recommendations for specific use cases.
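The abstract reports five binary-classification metrics for each model: accuracy, precision, recall, F1-score, and the Matthews correlation coefficient (MCC). As a minimal sketch of how these metrics are derived from a 2x2 confusion matrix, the following uses entirely hypothetical counts for illustration, not the thesis's actual predictions:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC ranges from -1 to +1 and, unlike accuracy, accounts for
    # class imbalance; 0 corresponds to chance-level prediction.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts (not from the DAIC-WOZ experiments):
m = binary_metrics(tp=30, fp=10, fn=15, tn=45)
```

MCC is worth reporting alongside accuracy here because depression-screening datasets are typically imbalanced, and a classifier can reach high accuracy while performing near chance on the minority (depressed) class.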