emerging-welfare / protestnews-2019

This repository contains data preparation and preprocessing code for CLEF Lab 2019 ProtestNews.

Python 90.86% Shell 9.14%

protestnews-2019's Introduction

Unfortunately, we cannot share the full text of the articles we labelled/annotated because of copyright law.
Therefore we prepared three scripts for the Document- and Sentence-level data that automatically download the articles from the provided URLs and "fill in the blanks".
This process cannot be made lossless, because the HTML of the source pages is tricky and ever-changing.
Although we try to compensate for every possible problem, the downloaded data will differ somewhat from the original data we labelled. We will therefore evaluate how these small changes affect a baseline model and share the results.

Steps

To get your data ready, you need to go into each of the folders (Document, Sentence) and run bash run.sh
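
The step above can also be driven from Python. The following is only a minimal sketch: the helper name run_all and its runner parameter are hypothetical and not part of this repository; the folder names and the bash run.sh command are the ones given above.

```python
import subprocess
from pathlib import Path

# Task folder names as given in the Steps section.
TASKS = ("Document", "Sentence")

def run_all(repo_root=".", runner=subprocess.run):
    """Run `bash run.sh` inside each task folder, stopping at the first failure.

    `runner` is injectable only so the function can be exercised without
    actually launching the scripts; it defaults to subprocess.run.
    """
    executed = []
    for task in TASKS:
        task_dir = Path(repo_root) / task
        # check=True makes subprocess.run raise CalledProcessError
        # if run.sh exits with a non-zero status.
        runner(["bash", "run.sh"], cwd=task_dir, check=True)
        executed.append(task)
    return executed
```

Calling run_all() from the repository root would run both tasks in sequence.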

Requirements

First, install the additional requirements listed in requirements_additional. On Ubuntu, you can do so by running the apt-get install line. For the Python packages, visit their GitHub pages and follow the install instructions.
For the Python 2 requirements, run: pip2 install -r requirements2.txt
For the Python 3 requirements, run: pip3 install -r requirements3.txt

Logs

The log files for scrapy and selenium are collector/log.txt and collector/ghostdriver.log, respectively.
For the run.sh log of a specific task (Document, Sentence), check output/{task_name}/{data_set}.log

Outputs

The output files are the {data_set}_filled.json files under the output/{task_name} folder.

protestnews-2019's People

Contributors

osmanmutlu, ardakdemir, kausta

protestnews-2019's Issues

Downloaded files

The count of downloadable texts is based on the HTTP request succeeding, but some websites return an HTML file that includes a text such as "There is no news" instead of the article. The number of successfully accessed links can therefore be misleading.
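
One possible way to make the count less misleading is to inspect the returned page as well as the status code. The sketch below is an assumption, not the repository's code: the function names, the length thresholds, and the phrase list are hypothetical, and only the phrase "There is no news" comes from this issue.

```python
import re

# Hypothetical phrase list; only "there is no news" is taken from the issue.
PLACEHOLDER_PHRASES = ("there is no news",)

def visible_text(html):
    """Crude tag stripper: drop script/style bodies, then all remaining tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def looks_like_placeholder(text, max_chars=500):
    """A short page containing a known placeholder phrase is treated as empty."""
    lowered = text.lower()
    return len(text) < max_chars and any(p in lowered for p in PLACEHOLDER_PHRASES)

def is_real_article(status_code, html, min_chars=200):
    """Count a download as successful only if it plausibly contains an article."""
    if status_code != 200:
        return False
    text = visible_text(html)
    if looks_like_placeholder(text):
        return False
    # min_chars is an arbitrary assumed threshold, not a measured value.
    return len(text) >= min_chars
```

The thresholds would need tuning per news source; a real article that is very short would be rejected by this heuristic.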

Problem in matching algorithm

The matched substring can begin with non-alphanumeric characters.
Also, if the original substring ends with a non-alphanumeric character, it is not matched.
This problem can be handled in the function that maps matches back to the original text.
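
A minimal sketch of one way such a mapping step could handle the boundary characters, assuming the simplest strategy of trimming non-alphanumeric characters from both ends before searching. The names trim_edges and locate are hypothetical and do not refer to functions in this repository.

```python
def trim_edges(s):
    """Drop non-alphanumeric characters from both ends of s."""
    start, end = 0, len(s)
    while start < end and not s[start].isalnum():
        start += 1
    while end > start and not s[end - 1].isalnum():
        end -= 1
    return s[start:end]

def locate(original_text, snippet):
    """Return the (start, end) span of the trimmed snippet in original_text,
    or None if the snippet is empty after trimming or cannot be found."""
    core = trim_edges(snippet)
    if not core:
        return None
    start = original_text.find(core)
    if start == -1:
        return None
    return start, start + len(core)
```

Because only the edges are trimmed, punctuation inside the snippet still has to match the original text exactly.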
