The protestnews-2019 from emerging-welfare

protestnews-2019's Introduction

Unfortunately, we cannot share the full text of the articles we labelled/annotated because of copyright law.
Therefore we prepared scripts for the Document- and Sentence-level data that automatically download the articles from the provided URLs and "fill in the blanks".
This process cannot be made lossless, because the HTML of the source pages is inconsistent and changes over time.
Even though we try to compensate for every problem we can anticipate, the downloaded text will differ somewhat from the original data we labelled. We will therefore evaluate how these small differences affect a baseline model and share the results.

Steps

To get your data ready, go into each of the folders (Document, Sentence) and run: bash run.sh

Requirements

First, install the additional requirements listed in requirements_additional. On Ubuntu, run the apt-get install line given there. For the Python packages listed there, visit their GitHub pages and follow the installation instructions.
For the Python 2 requirements, run: pip2 install -r requirements2.txt
For the Python 3 requirements, run: pip3 install -r requirements3.txt

Logs

Scrapy and Selenium write their logs to collector/log.txt and collector/ghostdriver.log, respectively.
The log of run.sh for a specific task (Document, Sentence) is written to output/{task_name}/{data_set}.log

Outputs

The output files are written to the output/{task_name} folder as {data_set}_filled.json files.
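
The path patterns above can be expanded with a small helper. This is only a sketch: the function name expected_paths, the root parameter, and the data set name "train" used in the usage note are assumptions for illustration, not names taken from the repository.

```python
from pathlib import Path


def expected_paths(task_name, data_set, root="."):
    """Build the log and output paths described in the Logs and Outputs sections.

    `root` is the folder that contains `output/`; which folder that is
    in a given checkout is an assumption the caller has to supply.
    """
    out_dir = Path(root) / "output" / task_name
    return {
        "log": out_dir / f"{data_set}.log",
        "filled": out_dir / f"{data_set}_filled.json",
    }
```

For example, expected_paths("Document", "train") yields output/Document/train.log and output/Document/train_filled.json, where "train" is a hypothetical data set name.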

protestnews-2019's Issues

Downloaded files

The count of downloadable texts is based on the result of the HTTP request, but some websites return a successful response whose HTML only contains a placeholder text such as "There is no news". The number of successfully accessed links can therefore be misleading.
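
One possible way to make the count less misleading is to inspect the response body as well as the status code. The following is a minimal sketch, not the repository's implementation: the function name, the PLACEHOLDER_PHRASES tuple, and the min_chars threshold are all assumptions and would need tuning per website.

```python
# Hypothetical placeholder phrases; extend with what the target sites actually return.
PLACEHOLDER_PHRASES = ("there is no news",)


def looks_downloaded(status_code, body, min_chars=200):
    """Return True only if the response looks like a real article.

    A page counts as downloaded when the status is 200, no known
    placeholder phrase appears in the body, and the body is not
    suspiciously short (min_chars is an arbitrary assumed threshold).
    """
    if status_code != 200:
        return False
    text = body.lower()
    if any(phrase in text for phrase in PLACEHOLDER_PHRASES):
        return False
    return len(text.strip()) >= min_chars
```

Counting only the links for which this check passes gives a more conservative figure than counting successful HTTP requests alone.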

Problem in matching algorithm

The matched substring can begin with non-alphanumeric characters.
Also, if the original substring ends with a non-alphanumeric character, it is not matched.
This problem can be handled in the function that maps the match back to the original text.
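
As an illustration of such a mapping step, the sketch below trims non-alphanumeric characters from both ends of the annotated substring before searching for it in the downloaded text. The function names trim_span and find_span are hypothetical, and plain str.find stands in for whatever matching algorithm the scripts actually use.

```python
from typing import Optional, Tuple


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink [start, end) so it neither starts nor ends on a non-alphanumeric character."""
    while start < end and not text[start].isalnum():
        start += 1
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return start, end


def find_span(text: str, original: str) -> Optional[Tuple[int, int]]:
    """Locate the alphanumeric core of `original` in `text`.

    Returns (start, end) offsets into `text`, or None if the core is
    empty or does not occur. Only the first occurrence is reported.
    """
    core_start, core_end = trim_span(original, 0, len(original))
    core = original[core_start:core_end]
    if not core:
        return None
    pos = text.find(core)
    if pos == -1:
        return None
    return pos, pos + len(core)
```

Because both ends are trimmed before the search, a trailing quotation mark or full stop in the annotation no longer prevents a match, and the returned span never begins with punctuation.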
