kevincwu0 / latent-semantic-analysis-book-titles

Latent Semantic Analysis of Book Titles

Python 100.00%
lsa svd numpy nltk wordnetlemmatizer sklearn matplotlib python

Latent-Semantic-Analysis-Book-Titles

Latent Semantic Analysis

  • Synonymy: multiple words with the same meaning
  • Polysemy: one word with multiple meanings

Synonyms:

  • "buy" and "purchase"
  • "big" and "large"
  • "quick" and "speedy"

Polysemes: "man" (human as opposed to animal, male as opposed to female, or the interjection in "hey, man") and "milk" (verb or noun)

Latent variables: combine words with similar meanings into a single hidden variable that represents all of them, e.g. z = 0.7 * computer + 0.5 * PC + 0.6 * laptop
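As a small illustration of the formula above, a latent variable is just a weighted sum of word counts. The weights are the ones from the text; the word counts and the three example documents are made up for this sketch, and in practice LSA finds the weights from the data rather than having them chosen by hand.

```python
import numpy as np

# Weights from the formula in the text: z = 0.7*computer + 0.5*PC + 0.6*laptop
weights = np.array([0.7, 0.5, 0.6])

# Hypothetical word counts for three documents (columns: computer, PC, laptop).
counts = np.array([
    [2, 0, 1],   # uses "computer" twice and "laptop" once
    [0, 3, 0],   # only uses "PC"
    [0, 0, 0],   # uses none of the three words
])

# One latent value per document: three related columns collapse into one.
z = counts @ weights
print(z)
```

Documents that talk about the same thing with different words end up with similar, non-zero values of z, while the unrelated document stays at zero.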

The job of latent semantic analysis (LSA) is to find these latent variables and transform the original data into them. Ideally, the dimensionality of the transformed data is much smaller than that of the original, which speeds up computation.

Does this help with polysemy? There are conflicting viewpoints on whether it does.

Math behind LSA

LSA is really Singular Value Decomposition (SVD) on the term-document matrix.

  • Singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with non-negative eigenvalues) to any m × n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
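A minimal sketch of SVD applied to a term-document matrix, using NumPy's `np.linalg.svd`. The matrix, the term list, and the choice of k = 2 are invented for illustration and are not taken from the book-title data in this repository.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["computer", "laptop", "cooking", "recipe"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values returned in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The full factorization reproduces X up to floating-point error.
X_rebuilt = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a low-rank approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# One k-dimensional row of coordinates per document.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T

print(s)
print(doc_coords.shape)
```

The rows of `doc_coords` are the documents expressed in the reduced latent space, which is the form usually plotted or compared in LSA.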

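PCA can be seen as a simpler form of SVD: it is an eigendecomposition of the covariance matrix of the data, which is equivalent to an SVD of the mean-centered data matrix.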

Principal Component Analysis (PCA)

  • does a transformation on our input vector
  • z = Qx
  • Q is a matrix
  • scalar * vector = another vector, same direction
  • matrix * vector = another vector, possibly different direction
  • PCA rotates our original input vectors: the same vectors, expressed in a different coordinate system.
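The difference between the scalar and matrix cases in the list above can be checked numerically. This is only a sketch: the vector and the 45-degree angle are arbitrary choices, and the rotation matrix stands in for the kind of Q that PCA would produce.

```python
import numpy as np

x = np.array([1.0, 0.0])

# scalar * vector: same direction, different length
scaled = 3.0 * x

# matrix * vector: a rotation matrix changes the direction but not the length
theta = np.pi / 4  # 45 degrees, an arbitrary choice for this example
Q = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])
z = Q @ x

print(scaled, z)
print(np.linalg.norm(x), np.linalg.norm(z))
```

Because Q is orthogonal, the length of the vector is unchanged; only the coordinates used to describe it are different.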

PCA does 3 things for us:

  1. Decorrelates the input data: in the new coordinate system, the dimensions have zero correlation with each other
  2. Orders the transformed data by information content: the first dimension carries the most information, and each later dimension carries less
  3. Dimensionality reduction: lets us keep only the leading dimensions (e.g. 1,000 words => perhaps 100 distinct latent terms)
  • removing information != decreasing predictive ability
  • it can mean denoising / smoothing / improving generalization

Covariance: in PCA, more variance is treated as synonymous with more information. The covariance can be written entry by entry (non-matrix form) or as a single matrix product.
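A sketch of the two ways of computing the covariance, on synthetic data. The 1/N normalization is an assumption made here (some texts and libraries divide by N - 1 instead), and the data and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # 200 samples, 3 features (synthetic data)
X[:, 1] += 0.8 * X[:, 0]            # make features 0 and 1 correlated

N, D = X.shape
mu = X.mean(axis=0)

# Non-matrix form: one entry at a time,
# cov[i, j] = (1/N) * sum_n (x[n, i] - mu[i]) * (x[n, j] - mu[j])
cov_loop = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        cov_loop[i, j] = np.sum((X[:, i] - mu[i]) * (X[:, j] - mu[j])) / N

# Matrix form: the same numbers from a single product on the centered data.
Xc = X - mu
cov_matrix = Xc.T @ Xc / N

print(np.round(cov_matrix, 3))
```

The diagonal holds the variance of each feature, and the off-diagonal entries show how strongly pairs of features vary together.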

Eigenvalues & Eigenvectors

  • A = diagonal matrix of the eigenvalues of the covariance matrix of X (there are D of them -> D x D matrix)
  • Q = matrix of stacked eigenvectors (there are D of them -> D x D matrix)
  • we sort A so the eigenvalues are in descending order (reordering the eigenvectors in Q to match)
  • remember that z = Qx, or in matrix form Z = XQ (each sample is a row of X; with the eigenvectors as the columns of Q, each row is strictly z = Qᵀx)
  • it turns out that A is the covariance matrix of Z, therefore:
    • variance, aka information, in Z is sorted in descending order
    • none of the dimensions in Z are correlated
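The claims in the list above can be verified on synthetic data. This is a minimal sketch under a few assumptions: the data is randomly generated, it is mean-centered before the covariance is computed, and the covariance uses 1/N normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 0.3],
    [0.0, 0.0, 0.2],
])                                  # synthetic, deliberately correlated features
X = X - X.mean(axis=0)              # center the data first

cov_X = X.T @ X / len(X)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, Q = np.linalg.eigh(cov_X)

# Sort eigenvalues (and their eigenvectors, the columns of Q) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

Z = X @ Q                           # transformed data, one row per sample
cov_Z = Z.T @ Z / len(Z)

print(np.round(cov_Z, 6))
print(eigvals)
```

The printed covariance of Z is diagonal up to rounding, and its diagonal matches the sorted eigenvalues, which is what "decorrelated and ordered by information" means in practice.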

Extending PCA: PCA helps us combine input features (words/terms, the columns of the input matrix). In a "term-document matrix", each term is an input feature and each document is a sample. What if we want to combine and decorrelate by document instead? We could just do PCA on the transpose ... which leads to weird results.

SVD (singular value decomposition): SVD does both of these at the same time, in effect performing PCA on the terms and on the documents simultaneously (lots of fun math).
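One way to see how SVD handles terms and documents together is to compare it with the eigendecompositions of X Xᵀ and Xᵀ X. The sketch below uses a made-up matrix and, for simplicity, skips the mean-centering that PCA proper would apply, so it shows the linear-algebra relationship and not a full PCA.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X X^T (term side) and X^T X (document side), sorted descending.
term_eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
doc_eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# The non-zero eigenvalues on both sides equal the squared singular values.
same_term_side = np.allclose(term_eigvals[:len(s)], s**2)
same_doc_side = np.allclose(doc_eigvals, s**2)

# Columns of U are eigenvectors of X X^T; rows of Vt are eigenvectors of X^T X.
U_are_eigvecs = np.allclose((X @ X.T) @ U, U * s**2)
V_are_eigvecs = np.allclose((X.T @ X) @ Vt.T, Vt.T * s**2)

print(same_term_side, same_doc_side, U_are_eigvecs, V_are_eigvecs)
```

A single call to `np.linalg.svd` therefore yields both sets of directions, one for the rows and one for the columns of X, without decomposing the matrix and its transpose separately.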
