several27 / fakenewscorpus

380.0 16.0 96.0 453 KB

A dataset of millions of news articles scraped from a curated list of data sources.

License: Apache License 2.0

fakenews dataset database corpus natural-language-processing nlp machine-learning artificial-intelligence

fakenewscorpus's Issues

wrong openmagazines.com content

Hello!

Thank you for this huge dataset!
I am currently working with it ("fake" and "reliable" labels only for now), and I will probably find some problems in it; I will post the major ones here :).

First one:
the content of openmagazines.com articles is mainly the following:
"This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Accept"
for 1023 entries out of 1081.
As a quick fix, automatically replacing this content with None or "" would be cleaner :).
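The quick fix suggested above could look roughly like the following sketch. It assumes each article is available as a dict with a "content" key (and, for illustration only, a "domain" key); those key names and the helper name blank_cookie_banner are assumptions, not something this issue confirms. Only entries whose whole content equals the banner quoted above are blanked.

```python
# The banner text is copied from the issue above.
COOKIE_BANNER = (
    "This website uses cookies to improve your experience. "
    "We'll assume you're ok with this, but you can opt-out if you wish. Accept"
)


def blank_cookie_banner(rows, content_key="content"):
    """Yield rows, replacing banner-only content with an empty string.

    `rows` is any iterable of dicts; `content_key` is an assumed column name.
    Rows are copied before being changed, so the input is left untouched.
    """
    for row in rows:
        text = (row.get(content_key) or "").strip()
        if text == COOKIE_BANNER:
            row = {**row, content_key: ""}
        yield row


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"domain": "openmagazines.com", "content": COOKIE_BANNER},
    {"domain": "openmagazines.com", "content": "An actual article body."},
]
cleaned = list(blank_cookie_banner(sample_rows))
print([row["content"] for row in cleaned])
```

Matching on the exact banner text, rather than on the domain, leaves the remaining openmagazines.com entries with real content untouched.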

Cannot save data to database

Hi,

I am trying to save the data into a Cassandra database, but it cannot interpret the CSV. I then tried to check the file with Excel and with a pandas DataFrame, but both reported that the file is not in a valid CSV format.

Could you help me find a way to store the data in Cassandra?
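One thing worth checking before loading into a database is whether the file parses with Python's standard csv module once the field size limit is raised. The sketch below assumes, without confirmation from this thread, that the parse failures come from very long or multi-line quoted fields; the helper names raise_field_limit and iter_rows are made up for illustration.

```python
import csv
import io
import sys


def raise_field_limit():
    """Raise the csv module's per-field size limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject very large values; back off and retry.
            limit //= 10


def iter_rows(handle):
    """Yield one dict per CSV record from an open text-mode file handle."""
    raise_field_limit()
    yield from csv.DictReader(handle)


# In-memory demonstration with one very long field and one multi-line field.
sample = io.StringIO(
    'id,content\r\n1,"' + "x" * 200_000 + '"\r\n2,"line one\nline two"\r\n'
)
rows = list(iter_rows(sample))
print(len(rows), len(rows[0]["content"]))
```

For a real file, the handle would come from open(path, newline="", encoding="utf-8"); the encoding is an assumption. Because the rows are streamed, they can be passed to a database writer one at a time rather than loaded into memory all at once.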

Citing the dataset

Hi @several27!

I'm doing some research on fake news datasets for automatic detection, and your corpus is the most complete I've found; it could be really useful for my study! But I don't know whether you have a preferred way to cite your work, or whether it's published somewhere.
Do you have an email address where I can contact you?

Thanks!

Unable to extract the complete dataset

I used this command to create a combined zip from the split parts in a directory:
zip -F (name of last part of archive, which will end with .zip, not .z0X) --out (output name of compiled archive).zip
and then unzip (archive name).zip to unpack it. Unpacking gives an error after 2.8 GB of data has been extracted.
I also used p7zip to unpack the individual files and the combined file, and got the error as well.

@several27 It'd be great if you could help me with this.

Download data

Hi,

Would it be possible for you to host the dataset somewhere else as a more accessible download?

I have tried downloading the dataset via awscli, but it throws an error that indicates a permission and/or region mismatch. Is the bucket still public?

Label "rumor"

Awesome work, but in the files we can see a type "rumor" that is not documented in the readme...
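To see which labels actually occur in the files, the values of the label column can be tallied and compared against whatever list the readme documents. This is a minimal sketch: the column name "type" is taken from the wording above, and the documented set used in the example is a placeholder containing only the two labels mentioned in these issues, not the readme's full list.

```python
from collections import Counter

MISSING = "<missing>"


def count_labels(rows, label_key="type"):
    """Count how often each label value occurs; empty values are counted as MISSING."""
    counts = Counter()
    for row in rows:
        label = (row.get(label_key) or "").strip()
        counts[label or MISSING] += 1
    return counts


def undocumented_labels(counts, documented):
    """Return the sorted labels present in the data but absent from `documented`."""
    return sorted(
        label for label in counts if label not in documented and label != MISSING
    )


# Hypothetical rows and a placeholder documented set, for illustration only.
sample_rows = [
    {"type": "fake"},
    {"type": "rumor"},
    {"type": "reliable"},
    {"type": "rumor"},
    {"type": ""},
]
counts = count_labels(sample_rows)
extra = undocumented_labels(counts, documented={"fake", "reliable"})
print(counts["rumor"], extra)
```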

Getting the date of when each article was published

Hi @several27 ,

I am currently using your corpus for an NLP project and was wondering whether you have the publication date of each article available. If not, do you have the raw HTML for each article available for download? I could retrieve the dates from the raw HTML. I ask because a lot of the domains are dead, so I can no longer look these dates up online.

Thanks,
Changxiao

NYTimes data?

Dear Maciej,

Thanks a lot for making this amazing dataset available! :D I have one quick question and a comment.

I found that this dataset includes 1.5M NYTimes articles. Can you elaborate a little more on how you collected them?
I'd love to use this dataset for research, but the lack of detail on the data collection procedure (e.g., when the collection started and ended, and what time range the collected news articles cover) makes it really hard to use this data for academic purposes. If you could describe how you collected this data, it would be greatly helpful!

Thanks,
Jisun

Not able to open file

How do I read the file? Is it probably encoded somehow?

Field describing when an article was written

Is there any field describing when an article was written?

Data Labelling

Hello,

How do you label the articles? For example, for the data collected from the URL http://beforeitsnews.com/awakening-start-here/2018/01/awakening-of-12-strands-of-dna-reconnecting-with-you-movie-10623.html there is no information indicating that the article is fake.

Please explain if I am missing something.

CSV File Error

Hello, after I extracted the news.csv.zip file and opened it in Excel, it only showed me a grayed-out screen. Excel did not recognize that a file was open, since I could not Save As or perform any other actions. I also tried opening the CSV another way, through the Data tab and then Get External Data, which did not work either. I believe that the files you have uploaded may be corrupted. I would appreciate it if you could update them. If anyone knows a solution to my problem, please respond.

Thank You

Not able to get the zip file with the wget command in Google Colab

Hi,

When I try to download the file using the wget command on Google Colab, I get the error below:

--2020-01-03 04:14:40-- https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.97.128, 2404:6800:4008:c03::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.97.128|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-01-03 04:14:40 ERROR 403: Forbidden.

File Corrupt

One file from the series, news.csv.z01, seems to be corrupt when I extract the file series. Should I skip this file by making a dummy, or is there a replacement/fix?
