
cynthiakoopman / short-document-clustering-nlp


Published Article - The Effect of Preprocessing on Short Document Clustering

Jupyter Notebook 100.00%
clustering cluster-analysis text-mining nlp machine-learning data-science feature-extraction preprocessing k-means document-clustering

short-document-clustering-nlp's Introduction

The Effect of Preprocessing on Short Document Clustering

This paper has been accepted in the journal Archives of Data Science, Series A:

Koopman, C., and Wilhelm, A., “The Effect of Preprocessing on Short Document Clustering,” Archives of Data Science, Series A (Online First), Vol. 6, No. 1, 2020, pp. 1–16. https://doi.org/10.5445/KSP/1000098011/01

Abstract: Document clustering has gained popularity due to social media and its large volume. Natural Language Processing is able to extract information from unstructured data, which can be powerful for businesses. Social media posts, customer reviews and even military messages are all very short and therefore harder to handle than longer texts. Cluster analysis is essential in gaining insight from these unlabeled texts. Preprocessing often removes words, which can become risky in short texts, where the main message is made of only a few words. The effect of preprocessing and feature extraction on these short documents is therefore analyzed in this paper. Six different levels of text normalization are combined with four different feature extraction methods. These settings are all applied to K-means clustering and tested on three different datasets. Anticipated results cannot be concluded; however, other findings are insightful in terms of the connection between text cleaning and feature extraction.
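To make the idea of graded text normalization concrete, here is a minimal sketch of cumulative cleaning levels. The four levels, the `normalize` function and the tiny `STOP_WORDS` set are illustrative assumptions; they are not the six levels defined in the paper.

```python
import re
import string

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "of", "to", "in"}


def normalize(text, level):
    """Apply cumulative cleaning steps; higher levels remove more.

    The levels are illustrative assumptions, not the paper's settings:
      0: raw text
      1: lowercase
      2: level 1 + replace punctuation and digits with spaces
      3: level 2 + drop stop words
    """
    if level >= 1:
        text = text.lower()
    if level >= 2:
        # Character class matching any ASCII punctuation mark or digit.
        text = re.sub(r"[%s\d]" % re.escape(string.punctuation), " ", text)
    tokens = text.split()
    if level >= 3:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)


review = "The battery is NOT good, 2 stars!"
for level in range(4):
    print(level, "->", normalize(review, level))
```

At level 3 the example review shrinks to four tokens. Had "not" been in the stop-word list, the remaining text would read as praise, which is the kind of risk the abstract describes for short documents.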

This study implements:

  • TF-IDF
  • TF-IDF with n-grams
  • Word2Vec
  • GloVe
  • K-means clustering
  • various degrees of text cleaning
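As a rough illustration of how two of the listed pieces fit together, the sketch below builds TF-IDF vectors with optional word n-grams and clusters them with K-means. It uses only the Python standard library to stay self-contained and is not taken from the repository's notebooks. The function names, the smoothed IDF formula, the farthest-first initialization and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n_max):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, n_max=1):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    bags = [Counter(ngrams(d.lower().split(), n_max)) for d in docs]
    vocab = sorted({term for bag in bags for term in bag})
    df = Counter(term for bag in bags for term in bag)  # document frequency
    # Smoothed IDF (one common variant, assumed here): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + len(docs)) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for bag in bags:
        row = [bag[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Lloyd's K-means with a deterministic farthest-first initialization."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Next centroid: the row farthest from all centroids chosen so far.
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents, invented for this example.
docs = [
    "great phone battery",
    "phone battery lasts long",
    "tasty pizza crust",
    "pizza crust was cold",
]
vocab, rows = tfidf(docs, n_max=2)  # unigrams and bigrams
labels = kmeans(rows, k=2)
print(labels)  # -> [0, 0, 1, 1]
```

Setting `n_max=1` gives plain TF-IDF, while `n_max=2` adds bigrams such as "phone battery". For real datasets, a library implementation with sparse matrices and multiple random restarts would be the more practical choice.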

The datasets used are from Amazon, Yelp and DBpedia, obtained via: Xiang Zhang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." In Advances in Neural Information Processing Systems, pages 649–657, 2015.
