danielmlow / reddit

analysis of mental health support groups on Reddit

License: Apache License 2.0

Python 0.89% Jupyter Notebook 99.07% Shell 0.04%

reddit's Introduction

Data and code for "Natural language processing reveals vulnerable mental health groups and heightened health anxiety on Reddit during COVID-19"

1. Data

Available at Open Science Framework: https://osf.io/7peyq/

Also available through Zenodo: https://zenodo.org/record/3941387#.YFfi3EhJHL8

Please cite if you use the data:

Low, D. M., Rumker, L., Talkar, T., Torous, J., Cecchi, G., & Ghosh, S. S. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research, 22(10), e22635. doi: 10.2196/22635

@article{low2020natural,
  title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study},
  author={Low, Daniel M and Rumker, Laurie and Talkar, Tanya and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S},
  journal={Journal of medical Internet research},
  volume={22},
  number={10},
  pages={e22635},
  year={2020},
  publisher={JMIR Publications Inc., Toronto, Canada}
}

License: This dataset is made available under the Public Domain Dedication and License v1.0 (full text: http://www.opendatacommons.org/licenses/pddl/1.0/). The data were downloaded using the Pushshift API. Re-use of this data is subject to Reddit API terms.

1.1. Reddit mental health dataset

Find in data/input/reddit_mental_health_dataset/

Posts and text features for the following timeframes from 28 mental health and non-mental health subreddits:

  • 15 specific mental health support groups (r/EDAnonymous, r/addiction, r/alcoholism, r/adhd, r/anxiety, r/autism, r/bipolarreddit, r/bpd, r/depression, r/healthanxiety, r/lonely, r/ptsd, r/schizophrenia, r/socialanxiety, and r/suicidewatch)
  • 2 broad mental health subreddits (r/mentalhealth, r/COVID19_support)
  • 11 non-mental health subreddits (r/conspiracy, r/divorce, r/fitness, r/guns, r/jokes, r/legaladvice, r/meditation, r/parenting, r/personalfinance, r/relationships, r/teaching).

Downloaded using the Pushshift API. Re-use of this data is subject to Reddit API terms. If you use this dataset, please cite the article listed in section 1.

Filenames and corresponding timeframes:

  • post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears)
  • pre: Dec 2018 to Dec 2019. A full year, which provides more data for a baseline of Reddit posts.
  • 2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data.
  • 2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data.

See Supplementary Materials for more information.

Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
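
One way to follow this recommendation is sketched below: draw many balanced subsamples with replacement and aggregate a statistic across the draws, rather than relying on a single subsample. The data layout (a dict mapping each subreddit to per-post feature values), the function name, and the toy values are hypothetical and are not taken from the repository.

```python
import random
import statistics

def bootstrap_balanced_mean(values_by_subreddit, n_per_subreddit, n_iterations=1000, seed=0):
    """Mean of a per-post feature over repeated balanced subsamples.

    values_by_subreddit: hypothetical dict {subreddit: [feature value per post]}.
    Returns (mean of bootstrap means, approximate 2.5th and 97.5th percentiles).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_iterations):
        sample = []
        for values in values_by_subreddit.values():
            # Sample with replacement so every subreddit contributes equally.
            sample.extend(rng.choices(values, k=n_per_subreddit))
        means.append(statistics.fmean(sample))
    means.sort()
    # Approximate percentile bounds taken from the sorted bootstrap means.
    lower = means[int(0.025 * (n_iterations - 1))]
    upper = means[int(0.975 * (n_iterations - 1))]
    return statistics.fmean(means), lower, upper

# Toy, made-up values for illustration only.
toy = {"r/anxiety": [0.2, 0.4, 0.6, 0.8], "r/fitness": [0.1, 0.1, 0.2]}
estimate, lower, upper = bootstrap_balanced_mean(toy, n_per_subreddit=3, n_iterations=200)
print(f"mean={estimate:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Reporting the spread across draws, and not only the point estimate, shows how much a result depends on which posts happened to be subsampled.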

1.2. COVID-19 mention dataset (Figure 1)

Find in data/input/covid19_counts/

Same posts as in the post timeframe above, for the 15 mental health subreddits.

Counting these tokens: 'corona', 'virus', 'viral', 'covid', 'sars', 'influenza', 'pandemic', 'epidemic', 'quarantine', 'lockdown', 'distancing', 'national emergency', 'flatten', 'infect', 'ventilator', 'mask', 'symptomatic', 'epidemiolog', 'immun', 'incubation', 'transmission', 'vaccine'

  • One column covid19_boolean: whether any of these tokens appears at least once (Figure 1)
  • One column covid19_total: total count of these tokens
  • One column covid19_weighed_words: total count of these tokens normalized by the number of words (n_words) in a post (Figure S3).
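
The three columns can be illustrated with a minimal sketch. It assumes case-insensitive substring matching (several tokens, such as 'epidemiolog' and 'immun', look like stems) and whitespace-based word counts; both are assumptions, as is the function name, and the repository's actual counting code may differ.

```python
COVID19_TOKENS = [
    'corona', 'virus', 'viral', 'covid', 'sars', 'influenza', 'pandemic',
    'epidemic', 'quarantine', 'lockdown', 'distancing', 'national emergency',
    'flatten', 'infect', 'ventilator', 'mask', 'symptomatic', 'epidemiolog',
    'immun', 'incubation', 'transmission', 'vaccine',
]

def covid19_features(post):
    """Compute the three COVID-19 mention columns for one post (hypothetical helper)."""
    text = post.lower()
    # Assumption: each token is counted as a substring of the lowercased post.
    total = sum(text.count(token) for token in COVID19_TOKENS)
    # Assumption: n_words is the number of whitespace-separated words.
    n_words = len(text.split())
    return {
        "covid19_boolean": int(total > 0),
        "covid19_total": total,
        "covid19_weighed_words": total / n_words if n_words else 0.0,
    }

example = covid19_features("Quarantine is hard and the virus news makes my anxiety worse")
print(example)
```

Under substring matching, a word such as "coronavirus" counts twice, once for 'corona' and once for 'virus', which is worth keeping in mind when interpreting covid19_total.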

1.3. COVID-19 cases

Confirmed COVID-19 cases obtained from ourworldindata.org/covid-cases (source: European CDC).

2. Reproduce

All .ipynb notebooks can run on Google Colab (in which case the data should be on Google Drive; code to load data from Google Drive is available in the scripts) or in Jupyter Notebook.

To run the .py or .ipynb files in Jupyter Notebook, create a virtual environment and install the packages listed in requirements.txt:

  • conda create --name reddit --file requirements.txt
  • conda activate reddit

2.1. Preprocessing

  • reddit_data_extraction.ipynb: downloads the data
  • reddit_feature_extraction.ipynb: feature extraction for classification (TF-IDF was re-done separately on the train set), trend analysis, and supervised dimensionality reduction
  • See below for preprocessing for topic modeling and unsupervised clustering
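
Fitting TF-IDF on the train set only keeps test-set vocabulary and document frequencies from leaking into the features. The idea is sketched below in plain Python with a smoothed IDF and no vector normalization; the function names and toy documents are hypothetical, and the notebook itself may use a library vectorizer with different settings.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(docs, idf):
    """Weight term counts by the training IDF; terms unseen in training are dropped."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append({term: count * idf[term] for term, count in counts.items() if term in idf})
    return vectors

# Toy, made-up documents for illustration only.
train = ["i feel anxious today", "i feel fine"]
test = ["anxious about tomorrow"]

idf = fit_idf(train)                  # statistics come from the train set only
train_vectors = transform(train, idf)
test_vectors = transform(test, idf)   # test set reuses the train vocabulary and IDF
print(test_vectors)
```

Here "about" and "tomorrow" never occur in the training documents, so they are absent from the test vector instead of influencing the weights.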

2.2. Analyses

Classification
  • Clone catpro from https://github.com/danielmlow/catpro/ and change path in run.py sys.path.append('./../../catpro') accordingly
  • config.py: set paths, subreddits to run, and sample size
  • N is the model index (0=SGD L1, 1=SGD EN, 2=SVM, 3=ET, 4=XGB)
  • Run remotely: run_v8_<N>.sh runs run.py on a cluster, with each binary classifier running on a different node, selected through --job_array_task_id set to one of range(0,15)
  • Run locally (set --job_array_task_id and --run_modelN accordingly):
    python3 -i run.py --job_array_task_id=1 --run_modelN=0 --run_version_number=8
  • classification_results.py: produces figure 5-a, summarizes results, extracts important features, and visualizes testing on COVID19_support (psychological profiler). Change paths accordingly before running.
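
To cover every model and task combination locally, the commands can be enumerated as in the sketch below. It only builds and prints the command lines from the flags and ranges listed above; the helper name is hypothetical, and the interactive -i flag is omitted on the assumption that batch runs do not need it.

```python
import shlex

N_TASKS = 15   # --job_array_task_id takes values in range(0, 15)
MODELS = {0: "SGD L1", 1: "SGD EN", 2: "SVM", 3: "ET", 4: "XGB"}
VERSION = 8    # --run_version_number used in the local example

def build_commands(models=MODELS, n_tasks=N_TASKS, version=VERSION):
    """Return one run.py command (as an argument list) per model and task id."""
    commands = []
    for model_n in models:
        for task_id in range(n_tasks):
            commands.append([
                "python3", "run.py",
                f"--job_array_task_id={task_id}",
                f"--run_modelN={model_n}",
                f"--run_version_number={version}",
            ])
    return commands

commands = build_commands()
print(len(commands))
print(shlex.join(commands[0]))
```

Each argument list can then be passed to subprocess.run(command, check=True) from the directory containing run.py; running all combinations sequentially on one machine may take a long time, which is presumably why the cluster scripts use a job array.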
Trend Analysis
  • reddit_descriptive.ipynb: figures 1 and 2
Unsupervised clustering
  • Unsupervised_Clustering_Pipeline.ipynb: figures 3 and 5-c
Topic Modeling
  • reddit_lda_pipeline.ipynb: figure 4 and 5-b
Supervised dimensionality reduction
  • reddit_cluster.ipynb: figure 6
  • reddit_cluster.py: UMAP on 50 random subsamples of 2019 (pre) data to determine sensor precision
    • run remotely: run_umap.sh
    • run locally (--job_array_task_id will run a single subsample):
    python3 reddit_cluster.py --job_array_task_id=0 --plot=True --pre_or_post='pre'
    
