Run word assocs multiple times for most frequently occurring words

Hello,

I'm new to R, and I'm trying to run the tm package to obtain clusters of words that I can group together.

I would like to know whether there is a way to run the "findAssocs" function multiple times to obtain a series of word correlation tables, one for each word that occurs more than (say) 50 times, and then export these tables to a spreadsheet.

Any help would be much appreciated. Thank you so much!

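The current code I'm using is pasted below:
#load the text-mining package
library(tm)
#create corpus (forward slashes: a single backslash in an R string starts an escape sequence)
docs <- Corpus(DirSource("C:/Users/bhatterp/Desktop/Corpus"))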

#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

#remove punctuations
docs <- tm_map(docs,toSpace,"-")
docs <- tm_map(docs,toSpace,",")
docs <- tm_map(docs,toSpace,";")
docs <- tm_map(docs,toSpace,":")
docs <- tm_map(docs,toSpace,"\\.")

#Transform to lower case (need to wrap in content_transformer)
docs <- tm_map(docs,content_transformer(tolower))

#Strip digits (std transformation, so no need for content_transformer)
docs <- tm_map(docs, removeNumbers)

#remove stopwords using the standard list in tm
docs <- tm_map(docs, removeWords,stopwords ("english"))

#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)

#load library - stemming to aggregate words with common root
library(SnowballC)
#Stem document
docs <- tm_map(docs,stemDocument)

#creating document Matrix from existing corpus
dtm <- DocumentTermMatrix(docs)

#term-document matrix (the transpose of the DTM)
tdm <- TermDocumentMatrix(docs)
tdm

inspect(dtm)

#counting word frequency in DTM
freq <- colSums(as.matrix(dtm))

#length should be total number of terms
length(freq)

##create sort order (descending)
ord <- order(freq,decreasing=TRUE)

#inspect most frequently occurring terms
freq[head(ord)]

#Remove sparse terms
dtms <- removeSparseTerms(dtm, 0.15)

#checking frequency

freq <- colSums(as.matrix(dtm))
head(table(freq), 20)

#view table of selected terms
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 14)

#plotting frequently occurring words
library(ggplot2)
wf <- data.frame(word=names(freq), freq=freq)
p <- ggplot(subset(wf, freq>300), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

findAssocs(dtm, c("xx" , "yy"), corlimit=0.85)
findAssocs(dtms, "word", corlimit=0.60)
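
One possible way to do what is asked above, as a minimal sketch rather than a definitive answer: pick the frequent terms with tm's findFreqTerms(), call findAssocs() once per term, stack the results into one long table, and write it out with write.csv() so it opens in a spreadsheet. The sketch assumes the small "crude" corpus that ships with tm so that it runs on its own; the threshold of 20, the 0.6 correlation limit, the names min_freq, frequent_terms, assoc_table and out_file, and the output file name are all illustrative choices, not part of the original script.

```r
library(tm)

# Assumption: a small example corpus bundled with tm, so the sketch is
# self-contained. In the script above you would reuse your own `dtm`.
data("crude")
dtm <- DocumentTermMatrix(crude)

# Terms whose total count across the corpus is at least min_freq.
# The question mentions 50; 20 is used here only because "crude" is tiny.
min_freq <- 20
frequent_terms <- findFreqTerms(dtm, lowfreq = min_freq)

# Run findAssocs() once per frequent term and turn each result into a
# small data frame. Terms with no association above the limit are skipped.
per_term <- lapply(frequent_terms, function(term) {
  a <- findAssocs(dtm, term, corlimit = 0.6)[[1]]
  if (length(a) == 0) return(NULL)
  data.frame(term = term,
             associated_term = names(a),
             correlation = unname(a),
             stringsAsFactors = FALSE)
})

# Stack the per-term tables into one long table.
assoc_table <- do.call(rbind, Filter(Negate(is.null), per_term))

# Hypothetical output file name; a CSV opens directly in spreadsheet software.
out_file <- "word_associations.csv"
write.csv(assoc_table, out_file, row.names = FALSE)
```

Each row of the resulting file holds a frequent term, one word associated with it, and their correlation, so the table can be filtered or pivoted by term in the spreadsheet. If one sheet per term is preferred, the data frames in per_term could instead be written to separate files.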
