
jayadeepj / r-text-clustering-silhouette

This repository contains the code implementation of my paper on real-time text clustering, published in the International Journal of Data Mining Techniques and Applications.

r-text-clustering-silhouette's Issues

Run word assocs multiple times for most frequently occurring words

Hello,

I'm new to R, and I'm trying to run the tm package to obtain clusters of words that I can group together.

I wanted to understand whether there is a way to run the "findAssocs" function multiple times, once for each word that occurs more than a given number of times (say, more than 50), to obtain a series of word correlation tables, and then export these tables to a spreadsheet.

Any help would be much appreciated. Thank you so much!

The current code I'm using is pasted below:
#load the tm package
library(tm)

#create corpus
docs <- Corpus(DirSource("C:\\Users\\bhatterp\\Desktop\\Corpus"))

#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

#remove punctuations
docs <- tm_map(docs,toSpace,"-")
docs <- tm_map(docs,toSpace,",")
docs <- tm_map(docs,toSpace,";")
docs <- tm_map(docs,toSpace,":")
docs <- tm_map(docs,toSpace,"\\.")

#Transform to lower case (need to wrap in content_transformer)
docs <- tm_map(docs,content_transformer(tolower))

#Strip digits (std transformation, so no need for content_transformer)
docs <- tm_map(docs, removeNumbers)

#remove stopwords using the standard list in tm
docs <- tm_map(docs, removeWords, stopwords("english"))

#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)

#load library - stemming to aggregate words with common root
library(SnowballC)
#Stem document
docs <- tm_map(docs,stemDocument)

#creating document Matrix from existing corpus
dtm <- DocumentTermMatrix(docs)

#term-document matrix (the transpose of the DTM)
tdm <- TermDocumentMatrix(docs)
tdm

inspect(dtm)

#counting word frequency in DTM
freq <- colSums(as.matrix(dtm))

#length should be total number of terms
length(freq)

##create sort order (descending)
ord <- order(freq,decreasing=TRUE)

#inspect most frequently occurring terms
freq[head(ord)]

#Remove sparse terms
dtms <- removeSparseTerms(dtm, 0.15)

#checking frequency

freq <- colSums(as.matrix(dtm))
head(table(freq), 20)

#view table of selected terms
freq <- colSums(as.matrix(dtms))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 14)

#plotting frequently occurring words
library(ggplot2)
wf <- data.frame(word=names(freq), freq=freq)
p <- ggplot(subset(wf, freq>300), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

findAssocs(dtm, c("xx" , "yy"), corlimit=0.85)
findAssocs(dtms, "word", corlimit=0.60)
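
One possible way to do what is asked above, as a minimal sketch rather than a definitive answer: findFreqTerms() from tm can pick out the terms above a frequency threshold, findAssocs() accepts a whole vector of terms and returns one correlation vector per term, and the result can be flattened into a single table and written as a CSV file that opens in a spreadsheet. The toy corpus, the thresholds, and the names min_count, assoc_table and out_file are assumptions made for illustration; on the real corpus the dtm built above and a threshold of 50 would take their place.

```r
library(tm)

# Toy corpus standing in for the real documents (hypothetical data).
texts <- c("apple banana apple cherry",
           "apple banana banana",
           "cherry date cherry apple",
           "banana date date")
docs <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(docs)

# Terms whose total count meets the threshold (assumed value; use 50 on a real corpus).
min_count <- 3
frequent_terms <- findFreqTerms(dtm, lowfreq = min_count)

# A single call handles all terms and returns a named list,
# one vector of correlations per frequent term.
assocs <- findAssocs(dtm, frequent_terms, corlimit = 0.1)

# Flatten the list into one long table: term, associated word, correlation.
# Terms with no association above corlimit contribute no rows.
assoc_rows <- lapply(names(assocs), function(term) {
  cors <- assocs[[term]]
  if (length(cors) == 0) return(NULL)
  data.frame(term = term,
             associated = names(cors),
             correlation = unname(cors),
             stringsAsFactors = FALSE)
})
assoc_table <- do.call(rbind, assoc_rows)

# Write a CSV file; replace the temporary path with a real location as needed.
out_file <- file.path(tempdir(), "word_associations.csv")
write.csv(assoc_table, out_file, row.names = FALSE)
```

A long table with one row per word pair is used here because it filters and sorts easily in a spreadsheet. If one table per word is preferred, each element of assocs could instead be written to its own file inside the lapply() call.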
