Amazon review analysis and classification (NLP) using a gensim dictionary
- Create a corpora.Dictionary and doc2bow vectors (sparse)
- Convert the doc2bow vectors to a dense matrix
- Create a word-context matrix, decompose it with PCA, then apply SPPMI (shifted positive PMI)
- Decompose the dense doc2bow matrix with PCA: doc-vecs, vecs-word
- Build the doc-vecs matrix from the decomposed word-context matrix, with or without SPPMI
- Scale the doc-vecs matrix (min-max, standard, log, exponential, binary)
- Classify with different models: decision tree, random forest, linear regression, SVM, NN
Pipeline: first, generate the dictionary and bag-of-words with gensim; second, decompose the matrices into a doc-vecs matrix using PCA, the word-context matrix, and other formulas (SPPMI); third, classify using the different (scaled) doc-vecs matrices and different models.
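The second stage might look roughly like the sketch below, using only NumPy. Several details are assumptions rather than the project's method: the window size, the shift `shift_k`, the helper names, and the order of operations (SPPMI is applied here before the PCA step, which is the common order, but the original pipeline may differ). PCA is done by centering the matrix and taking an SVD.

```python
import numpy as np

def cooccurrence(docs, vocab, window=2):
    """Symmetric word-context counts within a sliding window."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for tokens in docs:
        ids = [index[t] for t in tokens if t in index]
        for pos, w in enumerate(ids):
            context = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
            for c in context:
                counts[w, c] += 1
    return counts

def sppmi(counts, shift_k=1):
    """Shifted positive PMI: max(PMI - log(shift_k), 0)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts contribute nothing
    return np.maximum(pmi - np.log(shift_k), 0.0)

def pca(matrix, n_components):
    """Project rows onto the top principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy tokenized reviews (hypothetical data)
docs = [
    ["great", "product", "works", "great"],
    ["bad", "product", "broke"],
    ["works", "fine"],
]
vocab = sorted({t for doc in docs for t in doc})

counts = cooccurrence(docs, vocab)
word_vecs = pca(sppmi(counts, shift_k=1), n_components=2)  # words x 2

# Dense bag-of-words (docs x words) times word vectors gives doc-vecs
bow = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)
doc_vecs = bow @ word_vecs  # docs x 2
```

Skipping the `sppmi` call and passing `counts` straight to `pca` gives the "without SPPMI" variant.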
Results: 1. An NN with four layers and 12-14 nodes per layer performs best. 2. Different ways of scaling the data may work well for different models.
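A rough sketch of the scaling and classification stage with scikit-learn is shown below. Everything specific here is an assumption: the data is synthetic, the exact formulas for the log, exponential, and binary scalings are guesses (the original does not define them), and "four layers" is read as four hidden layers of 12 nodes each. No accuracy figures are implied.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for a doc-vecs matrix and labels (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def scale(X_train, X_test, method):
    """Apply one scaling method; fitted scalers only see the training data."""
    if method in ("minmax", "standard"):
        scaler = MinMaxScaler() if method == "minmax" else StandardScaler()
        scaler.fit(X_train)
        return scaler.transform(X_train), scaler.transform(X_test)
    if method == "log":
        # Signed log so negative components are handled (assumed formula)
        f = lambda a: np.sign(a) * np.log1p(np.abs(a))
    elif method == "exponential":
        f = np.exp  # assumed formula
    elif method == "binary":
        f = lambda a: (a > 0).astype(float)  # assumed threshold at zero
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return f(X_train), f(X_test)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for method in ["minmax", "standard", "log", "exponential", "binary"]:
    Xs_train, Xs_test = scale(X_train, X_test, method)
    # Four hidden layers of 12 nodes each (one reading of the result above)
    model = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12),
                          max_iter=300, random_state=0)
    model.fit(Xs_train, y_train)
    scores[method] = model.score(Xs_test, y_test)
```

Swapping `MLPClassifier` for `DecisionTreeClassifier`, `RandomForestClassifier`, or `SVC` in the loop is one way to compare how each model responds to each scaling.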