gohjiayi / suicidal-text-detection

Building a suicidal text detection model and mental health chatbot with deep learning models and transformers.

Home Page: https://medium.com/@gohjiayi/building-a-suicidal-text-detection-model-and-mental-health-chatbot-d17fd93215fa?source=friends_link&sk=5bd2e0f111380faf66848031b4b4c033

License: MIT License

suicide-prevention mental-health-app deep-learning transformers

Suicidal Text Detection

This project aims to build a predictive model to detect suicidal intent in social media posts, and to integrate the model into a functional mental health chatbot.

Project Resources

For more project details, please refer to the project report and presentation slides; if you are unable to view them, navigate to the docs/ folder.

Project Directory Structure

├── data_preprocessing.ipynb
├── data_cleaning.ipynb
├── eda.ipynb
├── word2vec.ipynb
├── models_logit.ipynb
├── models_cnn.ipynb
├── models_lstm.ipynb
├── models_bert.ipynb
├── models_electra.ipynb
├── infer.ipynb
├── chatbot.ipynb
├── Data
│   ├── suicide_detection.csv
│   ├── suicide_detection_full_clean.csv
│   ├── suicide_detection_final_clean.csv
│   └── ...
├── Models
│   └── ...
└── …

This project is built on Python 3, and the notebooks were originally hosted on Google Colab. Required packages are installed individually within each .ipynb file.

The Data/ folder consists of the dataset and embeddings used, while the Models/ folder consists of the trained models.

1. Data Collection

The Suicide and Depression Detection dataset was obtained from Kaggle and is stored as Data/suicide_detection.csv. The dataset was scraped from Reddit and consists of 232,074 rows equally distributed between two classes: suicide and non-suicide.
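
A quick way to sanity-check the class balance is with pandas. The column names `text` and `class` below are assumptions about the Kaggle CSV's schema, and the snippet uses a tiny inline stand-in so it runs without the 232,074-row file:

```python
import pandas as pd

# Real usage would load the Kaggle file:
# df = pd.read_csv("Data/suicide_detection.csv")

# Illustrative stand-in rows (hypothetical, not from the dataset):
df = pd.DataFrame({
    "text": ["I can't go on anymore", "Great game last night!",
             "Nobody would miss me", "Just adopted a puppy"],
    "class": ["suicide", "non-suicide", "suicide", "non-suicide"],
})

# Fraction of rows per class -- should be ~0.5 each for a balanced dataset.
balance = df["class"].value_counts(normalize=True)
print(balance.to_dict())
```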

2. Text Preprocessing

Run data_preprocessing.ipynb to perform text preprocessing and generate Data/suicide_detection_full_clean.csv.

Note: Spelling correction requires a long processing time.
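
A minimal sketch of the kind of cleaning involved (the notebook's exact steps may differ; the slow spelling-correction pass is deliberately omitted here):

```python
import re

def clean_text(text: str) -> str:
    """Minimal preprocessing sketch: lowercase, strip URLs and punctuation."""
    text = text.lower()                        # case folding
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s']", " ", text)     # drop digits, punctuation, emoji
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    # Spelling correction (the slow step noted above) would follow here.
    return text

print(clean_text("I CAN'T take it!! http://example.com 😔"))
```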

3. Data Cleaning

Run data_cleaning.ipynb to clean the data and generate Data/suicide_detection_final_clean.csv. This is the final dataset used for the project, and it is split into train:test:val sets in an 8:1:1 ratio. After cleaning, 174,436 rows remain, with an approximate 4:6 class distribution of suicide to non-suicide.
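
The 8:1:1 split can be sketched with a seeded shuffle (the notebook may use library utilities instead; this is an illustrative stand-in):

```python
import random

def split_811(rows, seed=42):
    """Shuffle and split rows into train/test/val with an 8:1:1 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle
    n = len(rows)
    n_train, n_test = int(n * 0.8), int(n * 0.1)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    val = rows[n_train + n_test:]
    return train, test, val

train, test, val = split_811(range(1000))
print(len(train), len(test), len(val))  # 800 100 100
```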

4. Exploratory Data Analysis (EDA)

Run eda.ipynb to gain more insight into both the original and cleaned datasets. EDA on the original dataset helped determine the steps for text preprocessing and data cleaning, while insights from the cleaned training data helped us better build our representations and models.
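
One of the simplest EDA views on a text corpus is token frequency per class; a stdlib-only sketch (the sample sentences are made up for illustration):

```python
from collections import Counter

def top_words(texts, k=3):
    """Return the k most common whitespace tokens across the texts."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return counts.most_common(k)

sample = ["life feels pointless", "life is good", "feels like rain"]
print(top_words(sample))
```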

5. Representation Learning

Run word2vec.ipynb to pre-train custom Word2Vec embeddings on our cleaned dataset and generate Data/vocab.txt and Data/embedding_word2vec.txt. We also experimented with pre-trained Twitter GloVe embeddings (27B tokens, 200 dimensions), which were downloaded and stored as Data/glove_twitter_27B_200d.txt.

6. Model Building & Evaluation

Five models were built for this project: Logistic Regression (Logit), Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM), BERT, and ELECTRA. Each model has its own notebook, named in the format models_[model name].ipynb. The trained models are stored in the Models/ folder.

Note: Different variations were built for each model to find the best hyperparameters by testing empirically.

7. Model Selection

The best variation of each model is shown in the table below. Although BERT and ELECTRA perform comparably, ELECTRA was selected as the best-performing model given our prioritisation of the F1 score, as well as insights into the model architecture. Run infer.ipynb to predict whether input text has suicidal intent using the selected ELECTRA model.

| Best Model | Accuracy | Recall | Precision | F1 Score |
| ---------- | -------- | ------ | --------- | -------- |
| Logit      | 0.9111   | 0.8870 | 0.8832    | 0.8851   |
| CNN        | 0.9285   | 0.9013 | 0.9125    | 0.9069   |
| LSTM       | 0.9260   | 0.8649 | 0.9386    | 0.9003   |
| BERT       | 0.9757   | 0.9669 | 0.9701    | 0.9685   |
| ELECTRA    | 0.9792   | 0.9788 | 0.9677    | 0.9732   |
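
The F1 score is the harmonic mean of precision and recall, which is why it is a sensible single metric to prioritise; the F1 column above can be reproduced from the precision and recall columns:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing the F1 column from the precision/recall columns above:
print(round(f1_score(0.9677, 0.9788), 4))  # ELECTRA -> 0.9732
print(round(f1_score(0.9701, 0.9669), 4))  # BERT    -> 0.9685
```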

The trained BERT and ELECTRA text classification models are available on Hugging Face at gooohjy/suicidal-bert and gooohjy/suicidal-electra.
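
Loading the released checkpoint for inference can be sketched with the transformers `pipeline` API. The helper names and the `LABEL_1` string below are assumptions; check the model card on the Hub for the actual id2label mapping:

```python
def load_detector():
    """Load the released ELECTRA classifier from the Hugging Face Hub.
    (Downloads the checkpoint on first use.)"""
    from transformers import pipeline
    return pipeline("text-classification", model="gooohjy/suicidal-electra")

def flag_if_suicidal(result, threshold=0.5):
    """Interpret one pipeline output dict, e.g. {"label": ..., "score": ...}.
    The label string compared below is an assumption, not verified against
    the model's config."""
    return result["label"] == "LABEL_1" and result["score"] >= threshold

# Usage (commented out to avoid the model download here):
# detector = load_detector()
# flag_if_suicidal(detector("I don't want to be here anymore")[0])
```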

8. Chatbot Integration

Run chatbot.ipynb to use the mental health chatbot, which is integrated with the suicide detection model. The chatbot is based on DialoGPT, with custom retrieval-based responses integrated to suit our use case.
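
DialoGPT conditions each reply on the preceding dialogue turns joined by its end-of-text token. A stdlib-only sketch of that history handling (the function name is hypothetical; the actual generation is done with the transformers library in chatbot.ipynb):

```python
EOS = "<|endoftext|>"  # DialoGPT separates dialogue turns with this token

def build_prompt(history, user_msg, max_turns=5):
    """Concatenate the most recent turns into one DialoGPT-style prompt,
    keeping only the last max_turns turns to bound the context length."""
    turns = (list(history) + [user_msg])[-max_turns:]
    return EOS.join(turns) + EOS

print(build_prompt(["hi", "hello, how are you?"], "not great lately"))
```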

Contributors
