tsainez / kpop-sentiment-analysis

Generating a Korean and English dataset of Tweets about K-pop for sentiment analysis

License: MIT License

Python 4.97% Jupyter Notebook 95.03%
dataset k-pop korean nlp sentiment-analysis

K-pop Sentiment Analysis

Prepared by Anthony Sainez and Jingyi Jennie Wu in collaboration with Barun ICT Research Center [en] [kr] at Yonsei University in Seoul, South Korea.

This repository contains data collected during May 2022 for use in sentiment analysis of K-pop-related Tweets.

Context

The data explores the controversy surrounding Le Sserafim, a new girl group from Hybe entertainment. One of its members, Kim Garam (김가람), is alleged to have been a bully during her schooling.

Hybe is an internationally successful and recognizable entertainment company, known mainly for its most influential boy band, BTS, which is, among other accolades, the best-selling artist in South Korean history.

Data

Data description

  • data.csv

    • Contains the complete and raw dataset, including all language results polled from Twitter during our search periods.
    • Search queries included: [Hybe, Le Sserafim, 르세라핌, Garam, 김가람, #garamout]
  • data_clean.csv contains the cleaned dataset, which is 31.75% smaller than data.csv.

    • Data cleanup involved:
      • Removing duplicate rows.
      • Removing text that is not in English or Korean.
      • Removing suspected advertisements or spam.
    • Korean text was not cleaned up in the same manner as English text.
  • data_ko.csv and data_en.csv contain only the Korean and English results respectively.

    • data_ko.csv also contains a column (en_text) with the text translated from Korean to English using the Google Cloud Translate API.
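
The cleanup and split described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes each row is a dict with id, text, and lang keys (matching the sample columns), that a repeated id marks a duplicate, and that spam can be flagged by simple keywords. The names clean_rows, split_by_language, percent_decrease, and SPAM_MARKERS are hypothetical.

```python
# Hypothetical spam markers; the repository's actual spam rules are not documented here.
SPAM_MARKERS = ("giveaway", "promo code")
KEPT_LANGUAGES = ("en", "ko")


def clean_rows(rows):
    """Drop duplicate ids, rows not tagged en/ko, and suspected spam."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen_ids:
            continue  # assumption: a repeated id marks a duplicate row
        seen_ids.add(row["id"])
        if row["lang"] not in KEPT_LANGUAGES:
            continue
        lowered = row["text"].lower()
        if any(marker in lowered for marker in SPAM_MARKERS):
            continue
        cleaned.append(row)
    return cleaned


def split_by_language(rows):
    """Group rows by their lang value, as in data_en.csv and data_ko.csv."""
    groups = {}
    for row in rows:
        groups.setdefault(row["lang"], []).append(row)
    return groups


def percent_decrease(before, after):
    """Size reduction from before to after, as a percentage of before."""
    return 100 * (before - after) / before
```

Keyword matching like this only catches the most obvious spam, which is consistent with the leftover advertisements mentioned under "Notable complications".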

Data samples

For data_en.csv:

| id | text | created_at | lang |
| --- | --- | --- | --- |
| 1529696231739383808 | fuck you hybe | 2022-05-26 05:29:36+00:00 | en |
| 1529717794786271232 | Listen if you support Kim Garam but believe that Soojin deserved to be kicked out of GIDLE do not talk to me Soojin… | 2022-05-26 06:55:17+00:00 | en |
| 1529720463953387522 | anyway no center ever could outdo garam | 2022-05-26 07:05:54+00:00 | en |
| 1529720293987700736 | garam’s face is top tier kimgaram 김가람 르세라핌 | 2022-05-26 07:05:13+00:00 | en |

For data_ko.csv:

| id | text | en_text | created_at | lang |
| --- | --- | --- | --- | --- |
| 1529696402145955840 | 야가람아아이돌하고싶었으면김가람탈퇴해학폭을하지를말았어야지garamout민폐끼치지말고제발나가주라쏘스뮤직해명해너없으면모두가행복해탈퇴해 | Yagaram, if you wanted to be an idol, you should have left Kim Garam and didn’t have to be abusive. | 2022-05-26 05:30:17+00:00 | ko |
| 1529695383567613952 | 르세라핌넘무잘해요 | Le Seraphim is so good | 2022-05-26 05:26:14+00:00 | ko |
| 1529727257207783424 | 너네좆됨김가람쏘스뮤직병신들아ㅋㅋ | Fuck you, Kim Garam Source Music bastards hahaha | 2022-05-26 07:32:53+00:00 | ko |
| 1529727063254761472 | 누가둘사이원래친구였고틀어진게궁금하대미성년자둘사이자질구레한관계변화일대기를왜대체인터넷에공지로뿌리냐고회사가김가람이학교폭력으로무려보호처분호를받은그상세내용이궁금하… | I wonder who was originally friends and what went wrong between the two of them. I'm curious as to the details of how the company, Kim Ga-ram, received a protective measure due to school violence... | 2022-05-26 07:32:07+00:00 | ko |
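
As a rough sketch of how these columns might be read, the snippet below parses one row copied from the data_ko.csv sample above using only the standard library. It assumes the files are comma-separated with a header row naming the columns shown; the helper names parse_row and read_rows are hypothetical.

```python
import csv
import io
from datetime import datetime

# One row copied from the data_ko.csv sample; the comma-separated layout is an assumption.
SAMPLE_CSV = (
    "id,text,en_text,created_at,lang\n"
    "1529695383567613952,르세라핌넘무잘해요,Le Seraphim is so good,"
    "2022-05-26 05:26:14+00:00,ko\n"
)


def parse_row(row):
    """Convert the id and created_at strings of one CSV row to typed values."""
    parsed = dict(row)
    parsed["id"] = int(row["id"])  # Python ints hold the 19-digit ids exactly
    parsed["created_at"] = datetime.fromisoformat(row["created_at"])
    return parsed


def read_rows(handle):
    """Read every row from an open CSV file (or file-like object)."""
    return [parse_row(row) for row in csv.DictReader(handle)]


rows = read_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["created_at"].isoformat(), rows[0]["en_text"])
```

For a real file, the same function can be given a handle from `open("data_ko.csv", newline="", encoding="utf-8")`.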

Notable complications

  • Much of this data is still very noisy.
    • The text was cleaned in several different ways as we learned which preprocessing steps worked and which did not.
      • For instance, the English dataset still contains some Korean (possibly hashtags), an artifact of an earlier text mining and preprocessing method.
    • Additionally, the data still contains some irrelevant text, such as advertisements and giveaways, that we did not catch on our first pass.
  • This data was collected through a polling process in which we repeatedly scraped Twitter for new data. We chose this approach because we could not access data older than 7 days. Our application for Twitter's Academic Research API v2 was denied, and we were not allowed to reapply.
    • Data collection would be greatly simplified if another team gains access to the aforementioned API, and the resulting analysis would likely be better. Nevertheless, the process and code used to create this dataset are provided with as much documentation as possible.
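
The polling approach can be illustrated with a small sketch. It is not the repository's collection code: fetch_recent is a hypothetical callable standing in for whatever search request returns the most recent Tweets, and each Tweet is assumed to be a dict with an id key. The point is only that overlapping polls return the same Tweets again, so results are keyed by id.

```python
import time


def poll_tweets(fetch_recent, rounds, interval_seconds=0.0, sleep=time.sleep):
    """Call fetch_recent() repeatedly and keep each tweet id only once."""
    collected = {}
    for round_index in range(rounds):
        for tweet in fetch_recent():
            # Overlapping search windows return the same tweet again;
            # setdefault keeps the first copy and ignores later ones.
            collected.setdefault(tweet["id"], tweet)
        if round_index < rounds - 1:
            sleep(interval_seconds)  # wait before the next poll
    return list(collected.values())
```

Passing sleep as a parameter keeps the loop easy to test without waiting. In a real run, interval_seconds would need to respect whatever rate limits apply to the search endpoint.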

Reproducibility

To run and generate the dataset yourself:

  1. Clone the repository:

    git clone https://github.com/tsainez/kpop-sentiment-analysis
    cd kpop-sentiment-analysis
    
  2. Create and activate a virtual environment:

    python3 -m venv venv
    . venv/bin/activate
    
  3. Install Python dependencies:

    pip3 install -r requirements.txt
    
  4. Run the files in the scripts directory in order.
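
One possible way to automate the last step is sketched below. It rests on two assumptions that the README does not confirm: that the files in the scripts directory are plain .py files, and that sorting them by file name gives the intended order. The names ordered_scripts and run_all are hypothetical, and the sketch does not remove the human intervention that the scripts currently require.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(scripts_dir):
    """Return the .py files in scripts_dir sorted by file name."""
    return sorted(Path(scripts_dir).glob("*.py"), key=lambda path: path.name)


def run_all(scripts_dir="scripts"):
    """Run each script with the current interpreter, in file-name order."""
    for script in ordered_scripts(scripts_dir):
        print(f"Running {script}")
        # check=True stops the run at the first script that fails.
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling run_all() from the repository root, inside the activated virtual environment, would then run the scripts one after another.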

TODO:

  • Adjust scripts to not require human intervention to run.
  • Create a Makefile for ease of reproducibility.
