sarc's Introduction

SARC

Evaluation code for the Self-Annotated Reddit Corpus (SARC).

Dependencies: NLTK, scikit-learn, text_embedding.

To recreate the all-balanced and pol-balanced results in Table 2 of the paper:

  1. download 1600-dimensional Amazon GloVe embeddings (NOTE: 2.4 GB compressed)

  2. set the root directory of the SARC dataset at the top of utils.py (see the sketch after this list)

  3. run the following, where $EMBEDDING is the path to the downloaded GloVe embedding file:

  • Bag-of-Words on all: python SARC/eval.py main -l --min_count 5
  • Bag-of-Bigrams on all: python SARC/eval.py main -n 2 -l --min_count 5
  • Embedding on all: python SARC/eval.py main -e -l --embedding $EMBEDDING
  • Bag-of-Words on pol: python SARC/eval.py pol -l
  • Bag-of-Bigrams on pol: python SARC/eval.py pol -n 2 -l
  • Embedding on pol: python SARC/eval.py pol -e -l --embedding $EMBEDDING
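
For step 2, the paths are plain module-level constants at the top of utils.py. A minimal sketch of what that might look like (the variable names here are assumptions; match them to the constants actually defined in the file):

```python
# Top of utils.py -- hypothetical names, adjust to the constants actually defined there.
SARC = '/path/to/SARC/'        # root directory of the downloaded dataset
SARC_MAIN = SARC + 'main/'     # all-balanced data (train-balanced.csv, test-balanced.csv, comments.json)
SARC_POL = SARC + 'pol/'       # pol-balanced data (the politics subset)
```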

If you find this code useful, please cite the following:

@inproceedings{khodak2018corpus,
  title={A Large Self-Annotated Corpus for Sarcasm},
  author={Khodak, Mikhail and Saunshi, Nikunj and Vodrahalli, Kiran},
  booktitle={Proceedings of the Language Resources and Evaluation Conference (LREC)},
  year={2018}
}

sarc's People

Contributors

mkhodak, nsaunshi

sarc's Issues

Finding user comments in raw/sarc.csv

Hi~~~

I'm investigating sarcasm detection using your dataset, and in particular I'm collecting information about users. Please correct me where I've misunderstood.

  1. SARC/2.0/README.txt says that raw/sarc.csv contains the sarcastic and non-sarcastic comments of the authors in authors.json. When I read raw/sarc.csv, the first example shows [0, "Yousa guys didn't upvote nothing!", 'BritishEnglishPolice',
    'worldpolitics', 3, 3, 0, '2009-01', 1233446126,
    "Mafia business 'equal' to 9% of Italian GDP", 'c07e6gg',
    '7tvvp']
    My guess: "Yousa guys didn't upvote nothing!" is a post, "BritishEnglishPolice" is the author who made it, "worldpolitics" is the subreddit, "3, 3, 0" are the score/ups/downs, and "2009-01" and "1233446126" are the date and UTC timestamp. "Mafia business 'equal' to 9% of Italian GDP" is a comment on the post, and "c07e6gg" and "7tvvp" are the sarcastic and non-sarcastic responses to that comment. But when I search comments.json for "c07e6gg" and "7tvvp", nothing is returned (see the lookup sketch after this list). Also, what does the "0" at the beginning mean? I see 0 at the start of many examples. Could you help me understand sarc.csv? My goal is to obtain the sarcastic and non-sarcastic comments for a given author.

  2. I'm using the SARC/2.0 main and pol datasets. SARC/2.0 should contain everything in SARC/1.0 and SARC/0.0, right?
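
The lookup mentioned in item 1, as a minimal sketch; it assumes comments.json maps each comment id to its record, which should be verified against the README:

```python
import json

# Hedged sketch: check whether two comment ids from raw/sarc.csv appear in comments.json.
# Assumes comments.json is a dict keyed by comment id -- verify against SARC/2.0/README.txt.
with open('main/comments.json') as f:
    comments = json.load(f)

for cid in ('c07e6gg', '7tvvp'):
    print(cid, '->', comments.get(cid, 'not found'))
```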

Thank you very much!
Best regards

IndexError: list index out of range

Hey all,

I'm running your code and hitting an issue with the bag-of-words model.

I've made a slight modification to the instructions, so maybe that's playing a role: I'm running the command from within the SARC directory:
```
directory stuff/SARC> python eval.py main -l

Load SARC data
Traceback (most recent call last):
  File "eval.py", line 119, in <module>
    main()
  File "eval.py", line 45, in main
    load_sarc_responses(train_file, test_file, comment_file, lower=args.lower)
  File "directory stuff\SARC\utils.py", line 34, in load_sarc_responses
    responses = row[1].split(' ')
IndexError: list index out of range
```

The reason I'm running this from within the SARC directory is that I've added a few lines (below) at the top of eval.py so it can find the text_embedding module, which shares a parent directory with SARC. This was the only way I could figure out how to import text_embedding from eval.py, so if there's a better way I'm all ears!
```python
import sys

sys.path.append('../')
```
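
One alternative I considered (a sketch, assuming the usual CPython import-path rules) is to derive the parent directory from the file's own location instead of hard-coding a relative path:

```python
import os
import sys

# Sketch: add the parent directory of SARC to sys.path based on this file's location,
# so the import works no matter which directory the script is launched from.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
```

Setting PYTHONPATH to that parent directory when launching the script would have the same effect without editing eval.py at all.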

Final note: this error occurs both with the data linked in the README (https://nlp.cs.princeton.edu/SARC/2.0/) and with the SARC data on Kaggle (https://www.kaggle.com/danofer/sarcasm). The data from the README looks odd when viewed in Excel, so I initially thought that was interfering with the CSV parsing.
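
A quick way to check that theory (a hedged sketch; the '|' delimiter is an assumption to compare against the csv.reader call in load_sarc_responses):

```python
import csv

# Hypothetical sanity check: print how the first few rows of train-balanced.csv split.
# The '|' delimiter is an assumption -- match it to the delimiter used in utils.py.
with open('train-balanced.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f, delimiter='|')):
        print(len(row), 'fields:', [col[:40] for col in row])
        if i == 2:
            break
```

If each row comes back as a single field, the delimiter (or the file itself) is the problem rather than the model code.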

Thank you,

Matt

Subsets of the SARC files

Hi, thanks a million if you can help me clarify a data source question!
I visited the link to the SARC files. There are only two folders for sarcasm evaluation, main and pol, right?

However, I read two papers [1] [2] that both claim to use two subsets of your SARC dataset: /r/movies and /r/technology. [1] reports 8188 samples from /r/movies and 22510 samples from /r/technology; [2] reports 15019 samples from r/movies and 13485 samples from r/technology and links to SARC 2.0.

I checked the SARC 2.0 main folder and counted:

  • r/movies: train-balanced = 1414, train-unbalanced = 121595, test-balanced = 364, test-unbalanced = 27930
  • r/technology: train-balanced = 1652, train-unbalanced = 73641, test-balanced = 408, test-unbalanced = 20104

It seems impossible to get balanced datasets of the sizes used in [1] and [2], right? The sarcastic samples are very few compared with the non-sarcastic ones.
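
For reference, a minimal sketch of how per-subreddit counts like those above could be recomputed (it assumes comments.json maps each comment id to a record with a 'subreddit' field and that the balanced CSVs are '|'-delimited with space-separated ancestor ids, as in utils.py; both assumptions should be checked against the data):

```python
import csv
import json
from collections import Counter

# Hedged sketch: count train-balanced.csv rows per subreddit via each thread's first ancestor.
# Assumes comments.json maps id -> {'subreddit': ..., ...} and the CSV is '|'-delimited.
with open('main/comments.json') as f:
    comments = json.load(f)

counts = Counter()
with open('main/train-balanced.csv', newline='') as f:
    for row in csv.reader(f, delimiter='|'):
        first_ancestor = row[0].split(' ')[0]
        counts[comments[first_ancestor]['subreddit']] += 1

print(counts['movies'], counts['technology'])
```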

Could you help me check whether there is any way the two papers could have obtained such data?
Thanks a million! This has confused me quite a lot.

[1] https://www.aclweb.org/anthology/P18-1093.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/3308558.3313735
