Sustainability Indicators (SI) News Classifier

The source code in this repository was developed as part of the research project presented in "Advancing sustainability indicators through text mining: a feasibility demonstration". The project is licensed under the University of Illinois/NCSA Open Source License (see License.txt) and was last updated on June 25, 2014. The repository contains:

  • The Matlab code (m-files) for the classification algorithm presented in Rivera et al. (2013) (SI_News_Classifier.m and associated function files).
  • The RapidMiner 5 workflow used to pre-process the data.
  • An illustrative example of its use and implementation.

Further development of the algorithm, suggestions, corrections, and new examples are always welcome.

Citing this work

We kindly ask that any future publication using this software cite the following work: Rivera, S., Minsker, B., Work, D. and Roth, D. (2013) "Advancing sustainability indicators through text mining: a feasibility demonstration." Submitted to Environmental Modeling and Software.

Note: A Python version of the code will be released later this year (2014). The documentation will be updated to cover the Python version.

Installation

Requirements

  1. The latest version of the source code was developed in Matlab R2008a and has also been tested and validated in later versions. A valid Matlab installation and license (an educational license is sufficient) is required.
  2. As stated in Rivera et al. (2013), the code also requires the Dataless Classification software developed by the Cognitive Computation Group, led by Professor Daniel Roth at the University of Illinois at Urbana-Champaign. The software is available at: Importance of Semantic Representation: Dataless Classification. Installation and setup instructions for Dataless Classification can be found at the provided link. The software is integrated into the source code by providing the path to its installation folder (e.g. '../descartes-0.2/bin/DESCARTES').

Install

To begin using the classification algorithm, download the SI_News_Classifier.m function and integrate it into your code. If you change the provided directory structure, you may also need to update the paths hard-coded in the source code.

Example of implementation

An illustrative example is provided with the source code to help users understand the structure of the input and output data. The example contains a total of 26 news articles organized into the following hierarchical tree:

[Figure: hierarchical tree of the example news articles]

Objective: Classify news articles under the sustainability indicators.

The illustrative example includes the following files:

  • News_Data.xls - Metadata of the news articles used for the illustrative example. It contains the following worksheets:
    • News_Labels_and_Sources - List of the news sources and their labels
    • Words - List of all the words in the set of news articles obtained after pre-processing (tokenization and removal of stop words)
  • Example Data - Bag-of-words matrices (binary and tf-idf), news labels, and the training and testing sets
  • Example_1.m - Main m-file for the implementation example

Note: The set of news articles was pre-processed using the provided RapidMiner 5 workflow. Pay special attention to the order in which RapidMiner 5 and Matlab read the news articles, as the two are not necessarily the same.
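
To make the two bag-of-words representations mentioned above concrete, here is a minimal sketch in Python (not the repository's Matlab code or the RapidMiner 5 workflow). The function name `build_word_bag`, the toy documents, and the specific weighting (term frequency normalized by document length, natural-log inverse document frequency) are assumptions for illustration; the provided data may have been produced with a different tf-idf variant.

```python
import math
from collections import Counter

def build_word_bag(docs):
    """Return (vocabulary, binary matrix, tf-idf matrix) for tokenized documents.

    Hypothetical helper: one row per document, one column per vocabulary word.
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))

    binary, tfidf = [], []
    for doc in docs:
        counts = Counter(doc)
        b_row = [0] * len(vocab)
        t_row = [0.0] * len(vocab)
        for w, c in counts.items():
            j = index[w]
            b_row[j] = 1                    # word present in this document
            tf = c / len(doc)               # assumed: length-normalized term frequency
            idf = math.log(n_docs / df[w])  # assumed: natural-log idf, no smoothing
            t_row[j] = tf * idf
        binary.append(b_row)
        tfidf.append(t_row)
    return vocab, binary, tfidf

# Toy, already pre-processed (tokenized, stop words removed) documents.
docs = [
    ["water", "quality", "river"],
    ["energy", "water", "solar"],
    ["solar", "energy", "energy"],
]
vocab, binary, tfidf = build_word_bag(docs)
```

Because each row corresponds to one article, the row order must match the order of the label list, which is why the note above about RapidMiner 5 and Matlab reading order matters.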

The main script is Example_1.m. Before running it, set the path to the folder containing the Dataless Classification software (e.g. '../descartes-0.2/bin/DESCARTES') in the script.
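
A missing or misspelled path is an easy mistake at this step. As a minimal sketch in Python (not part of the repository), the hypothetical helper `resolve_dataless_path` below shows one way to check that a configured path exists before starting a run; the name and behavior are assumptions for illustration only.

```python
from pathlib import Path

def resolve_dataless_path(path_str, base_dir="."):
    """Resolve path_str relative to base_dir and fail early if it does not exist.

    Hypothetical helper: it only checks existence, not that the folder holds
    a working Dataless Classification installation.
    """
    candidate = (Path(base_dir) / path_str).resolve()
    if not candidate.exists():
        raise FileNotFoundError(
            f"Dataless Classification path not found: {candidate}"
        )
    return candidate
```

For example, calling `resolve_dataless_path('../descartes-0.2/bin/DESCARTES')` from the folder containing the script would either return the absolute path or raise `FileNotFoundError` with the resolved location, which makes relative-path mistakes easier to spot.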

Outputs:

  • Classification labels for the test set at every level of the hierarchical tree
  • Classification confusion matrices for the root and parent nodes
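
For readers unfamiliar with the second output, the following Python sketch shows how a confusion matrix for a single node can be computed from true and predicted labels. It is not the repository's Matlab implementation; the function name and the class names are hypothetical, and the convention used here (rows are true classes, columns are predicted classes) is an assumption that may differ from the one in the provided code.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions per (true class, predicted class) pair.

    Rows correspond to true classes and columns to predicted classes,
    both in the order given by `classes`.
    """
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical child classes of one node in the tree.
classes = ["environment", "economy", "society"]
true_labels = ["environment", "economy", "society", "economy"]
predicted = ["environment", "society", "society", "economy"]

cm = confusion_matrix(true_labels, predicted, classes)

# Overall accuracy is the sum of the diagonal divided by the number of articles.
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(true_labels)
```

In a hierarchical setting, one such matrix would be computed at the root and at each parent node, using only the articles and child classes relevant to that node.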
