sarc's Introduction

SARC

Evaluation code for the Self-Annotated Reddit Corpus (SARC).

Dependencies: NLTK, scikit-learn, text_embedding.

To recreate the all-balanced and pol-balanced results in Table 2 of the paper:

  1. download 1600-dimensional Amazon GloVe embeddings (NOTE: 2.4 GB compressed)

  2. set the root directory of the SARC dataset at the top of utils.py (see the sketch after this list)

  3. run the following, where $EMBEDDING is the path to the downloaded GloVe embedding file:

  • Bag-of-Words on all: python SARC/eval.py main -l --min_count 5
  • Bag-of-Bigrams on all: python SARC/eval.py main -n 2 -l --min_count 5
  • Embedding on all: python SARC/eval.py main -e -l --embedding $EMBEDDING
  • Bag-of-Words on pol: python SARC/eval.py pol -l
  • Bag-of-Bigrams on pol: python SARC/eval.py pol -n 2 -l
  • Embedding on pol: python SARC/eval.py pol -e -l --embedding $EMBEDDING
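
For step 2, the paths are plain module-level constants at the top of utils.py. A minimal sketch of what that might look like (the variable names here are assumptions; match them to the constants actually defined in the file):

```python
# Top of utils.py -- hypothetical names, adjust to the constants actually defined there.
SARC = '/path/to/SARC/'        # root directory of the downloaded dataset
SARC_MAIN = SARC + 'main/'     # all-balanced data (train-balanced.csv, test-balanced.csv, comments.json)
SARC_POL = SARC + 'pol/'       # pol-balanced data (the politics subset)
```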

If you find this code useful, please cite the following:

@inproceedings{khodak2018corpus,
  title={A Large Self-Annotated Corpus for Sarcasm},
  author={Khodak, Mikhail and Saunshi, Nikunj and Vodrahalli, Kiran},
  booktitle={Proceedings of the Language Resources and Evaluation Conference (LREC)},
  year={2018}
}

sarc's People

Contributors

mkhodak, nsaunshi

sarc's Issues

Finding user comments in raw/sarc.csv

Hi~~~

I'm investigating sarcasm detection using your dataset, and in particular I'm collecting information about users. Please correct me where I've misunderstood.

  1. SARC/2.0/README.txt says that raw/sarc.csv contains the sarcastic and non-sarcastic comments of the authors in authors.json. When I read raw/sarc.csv, the first example shows [0, "Yousa guys didn't upvote nothing!", 'BritishEnglishPolice',
    'worldpolitics', 3, 3, 0, '2009-01', 1233446126,
    "Mafia business 'equal' to 9% of Italian GDP", 'c07e6gg',
    '7tvvp']
    My guess: "Yousa guys didn't upvote nothing!" is a post, "BritishEnglishPolice" is the author who made it, "worldpolitics" is the subreddit, "3, 3, 0" are the score/ups/downs, and "2009-01" and "1233446126" are the date and UTC timestamp. "Mafia business 'equal' to 9% of Italian GDP" is a comment on the post, and "c07e6gg" and "7tvvp" are the sarcastic and non-sarcastic responses to that comment. But when I search comments.json for "c07e6gg" and "7tvvp", nothing is returned (see the lookup sketch after this list). Also, what does the "0" at the beginning mean? I see 0 at the start of many examples. Could you help me understand sarc.csv? My goal is to obtain the sarcastic and non-sarcastic comments for a given author.

  2. I'm using the SARC/2.0 main and pol datasets. SARC/2.0 should contain everything in SARC/1.0 and SARC/0.0, right?
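
The lookup mentioned in item 1, as a minimal sketch; it assumes comments.json maps each comment id to its record, which should be verified against the README:

```python
import json

# Hedged sketch: check whether two comment ids from raw/sarc.csv appear in comments.json.
# Assumes comments.json is a dict keyed by comment id -- verify against SARC/2.0/README.txt.
with open('main/comments.json') as f:
    comments = json.load(f)

for cid in ('c07e6gg', '7tvvp'):
    print(cid, '->', comments.get(cid, 'not found'))
```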

Thank you very much!
Best regards

IndexError: list index out of range

Hey all,

I'm running your code and hitting an issue with the bag-of-words model.

I've made a slight modification to the instructions, so maybe that's playing a role: I'm running the command from within the SARC directory:
```
directory stuff/SARC> python eval.py main -l

Load SARC data
Traceback (most recent call last):
  File "eval.py", line 119, in <module>
    main()
  File "eval.py", line 45, in main
    load_sarc_responses(train_file, test_file, comment_file, lower=args.lower)
  File "directory stuff\SARC\utils.py", line 34, in load_sarc_responses
    responses = row[1].split(' ')
IndexError: list index out of range
```

The reason I'm running this from within the SARC directory is that I've added a few lines (below) at the top of eval.py so it can find the text_embedding module, which shares a parent directory with SARC. This was the only way I could figure out how to import text_embedding from eval.py, so if there's a better way I'm all ears!
```python
import sys

sys.path.append('../')
```
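
One alternative I considered (a sketch, assuming the usual CPython import-path rules) is to derive the parent directory from the file's own location instead of hard-coding a relative path:

```python
import os
import sys

# Sketch: add the parent directory of SARC to sys.path based on this file's location,
# so the import works no matter which directory the script is launched from.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
```

Setting PYTHONPATH to that parent directory when launching the script would have the same effect without editing eval.py at all.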

Final note: this error occurs both with the data linked in the README (https://nlp.cs.princeton.edu/SARC/2.0/) and with the SARC data on Kaggle (https://www.kaggle.com/danofer/sarcasm). The data from the README looks odd when viewed in Excel, so I initially thought that was interfering with the CSV parsing.
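
A quick way to check that theory (a hedged sketch; the '|' delimiter is an assumption to compare against the csv.reader call in load_sarc_responses):

```python
import csv

# Hypothetical sanity check: print how the first few rows of train-balanced.csv split.
# The '|' delimiter is an assumption -- match it to the delimiter used in utils.py.
with open('train-balanced.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f, delimiter='|')):
        print(len(row), 'fields:', [col[:40] for col in row])
        if i == 2:
            break
```

If each row comes back as a single field, the delimiter (or the file itself) is the problem rather than the model code.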

Thank you,

Matt

Subsets of the SARC files

Hi, thanks a million if you can help me clarify a data source question!
I visited the link to the SARC files. There are only two folders for sarcasm evaluation, main and pol, right?

However, I read two papers [1] [2] that both claim to use two subsets of your SARC dataset: /r/movies and /r/technology. [1] reports 8188 samples from /r/movies and 22510 samples from /r/technology; [2] reports 15019 samples from r/movies and 13485 samples from r/technology and links to SARC 2.0.

I checked the SARC 2.0 main folder and counted:

  • r/movies: train-balanced = 1414, train-unbalanced = 121595, test-balanced = 364, test-unbalanced = 27930
  • r/technology: train-balanced = 1652, train-unbalanced = 73641, test-balanced = 408, test-unbalanced = 20104

It seems impossible to get balanced datasets of the sizes used in [1] and [2], right? The sarcastic samples are very few compared with the non-sarcastic ones.
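
For reference, a minimal sketch of how per-subreddit counts like those above could be recomputed (it assumes comments.json maps each comment id to a record with a 'subreddit' field and that the balanced CSVs are '|'-delimited with space-separated ancestor ids, as in utils.py; both assumptions should be checked against the data):

```python
import csv
import json
from collections import Counter

# Hedged sketch: count train-balanced.csv rows per subreddit via each thread's first ancestor.
# Assumes comments.json maps id -> {'subreddit': ..., ...} and the CSV is '|'-delimited.
with open('main/comments.json') as f:
    comments = json.load(f)

counts = Counter()
with open('main/train-balanced.csv', newline='') as f:
    for row in csv.reader(f, delimiter='|'):
        first_ancestor = row[0].split(' ')[0]
        counts[comments[first_ancestor]['subreddit']] += 1

print(counts['movies'], counts['technology'])
```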

Could you help me check whether there is any way the two papers could have obtained such data?
Thanks a million! This has confused me quite a lot.

[1] https://www.aclweb.org/anthology/P18-1093.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/3308558.3313735
