
This project forked from teeohh/trecs


NLP text recommendation system built in Python using Gensim, spaCy, and Plotly Dash


tRECS

Text recommendation system developer built in Python and Dash by Plotly.

Authors: Taylor Olson, Janie Neal, Christiana Prater-Lee, Eshita Nandini

This recommendation system developer guides the user through cleaning their data and building models, and ultimately creates a recommendation system (housed within the interface). The user can also visualize some of the models and other features of their data.

INSTALLATION

The user should be running Python 2.7+. Most packages required for the interface can be installed from the command line by navigating to the directory containing the reqs.txt file and running "pip install -r reqs.txt". Some packages require further installation steps; links to their installation documentation are provided below.

spaCy: https://spacy.io/docs/usage/
Follow the instructions for downloading the English model.

NLTK: http://www.nltk.org/data.html#
Follow the instructions to open the interactive installer and install the following corpora: wordnet, wordnet_ic, words, and stopwords.


HOW TO START THE DASH INTERFACE

In order to start the interface, the user should navigate to the "interface" folder in the command line and call "python interface.py". The command window will provide a link that the user can then copy into the address bar of a browser of their choice. The interface should open in that window, and the user can begin using the features.


FEATURES OF THE INTERFACE AND HOW TO USE THEM

Upload your own data:
The user can copy custom data (see the sample data for the desired format) into the interface folder, or use one of the data sets already included in the interface folder.

Inside the interface, the user should select the name of the file and press upload, then select the name of the column that contains the names of the objects. Next, select the description column, which should contain the text describing each object; this is the text that will be analyzed. Click next step.

Choose your own cleaning options:
The user is required to remove non-ASCII characters and punctuation. Other cleaning features are optional.
For background on stemming vs. lemmatizing, see: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
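
To make the two required cleaning steps concrete, here is a minimal sketch using only the Python 3 standard library. It is not the project's own cleaning code (which lives in Utils.py); the function name basic_clean is hypothetical, and replacing punctuation with spaces rather than deleting it is an assumption made for illustration.

```python
import string

def basic_clean(text):
    """Drop non-ASCII characters and punctuation, then collapse whitespace.

    Hypothetical helper for illustration; not taken from the project's Utils.py.
    """
    # Keep only characters in the ASCII range; anything else is dropped.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Map each punctuation character to a space so neighbouring words stay separated.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    no_punct = ascii_only.translate(table)
    # Collapse the runs of spaces left behind by the replacement.
    return " ".join(no_punct.split())

print(basic_clean("Hello, world! (v2.0)"))
```

Note that dropping non-ASCII characters shortens accented words (for example, "café" becomes "caf"), which is worth keeping in mind when choosing the optional cleaning steps.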

Word cut off:
Numbers on the slider indicate the percentage of documents that the words appear in. Use the slider to remove words that occur too frequently or not frequently enough. 
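
The effect of the slider can be sketched as a document-frequency filter. The example below is an assumption about how such a cutoff might work, not the project's implementation: the function name, the inclusive bounds, and the token-list input format are all hypothetical.

```python
def filter_by_document_frequency(docs, min_pct, max_pct):
    """Keep words that appear in between min_pct and max_pct percent of documents.

    docs is a list of token lists. Both bounds are inclusive (an assumption).
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    keep = {
        word for word, count in doc_freq.items()
        if min_pct <= 100.0 * count / n_docs <= max_pct
    }
    return [[word for word in tokens if word in keep] for tokens in docs]

docs = [
    ["the", "red", "wine"],
    ["the", "white", "wine"],
    ["the", "dark", "beer"],
    ["the", "pale", "beer"],
]
# "the" appears in 100% of documents and the adjectives in 25% each,
# so a 30-90 window keeps only "wine" and "beer" (50% each).
print(filter_by_document_frequency(docs, min_pct=30, max_pct=90))
```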

Choose and Build your models:
Options are TF-IDF, LSA, LDA, and spaCy's built-in similarity function.
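
As a rough illustration of the simplest of these options, the sketch below computes TF-IDF weights in plain Python 3. The project itself builds its models with Gensim and spaCy; this example does not use either library, the function name tfidf_vectors is hypothetical, and the weighting formula (term frequency times the natural log of N / document frequency) is one common variant that may differ from what the library computes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document.

    Weight = (count / document length) * ln(number of documents / document frequency).
    docs is a list of token lists.
    """
    n_docs = len(docs)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

vectors = tfidf_vectors([["red", "wine"], ["white", "wine"]])
print(vectors[0])
```

A word that appears in every document ("wine" here) gets a weight of zero, which is why TF-IDF tends to emphasize the words that distinguish one description from another.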

Recommendation: 
Uses the models built in the previous step to provide the selected number of most similar objects to the selected object.
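
One plausible way to turn model vectors into recommendations is to rank every other object by cosine similarity to the selected one. The sketch below assumes sparse vectors stored as dicts; the function names, the use of cosine similarity, and the sample data are illustrative assumptions rather than the project's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def recommend(names, vectors, selected, n):
    """Return the names of the n objects most similar to `selected`, best first."""
    target = vectors[names.index(selected)]
    scored = [
        (cosine(target, vec), name)
        for name, vec in zip(names, vectors)
        if name != selected
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]

# Made-up example objects and weights.
names = ["merlot", "syrah", "stout"]
vectors = [
    {"red": 1.0, "wine": 1.0},
    {"red": 1.0, "wine": 0.5},
    {"dark": 1.0, "beer": 1.0},
]
print(recommend(names, vectors, "merlot", 2))
```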

Visualizations:
The following visualizations have been connected to the interface: entity visualization and LDA clustering. See their descriptions under visualizations.py.


CONTENTS OF MASTER FOLDER

Interface.py: contains the code needed for the front end of the application, including: HTML layout, Dash components and event handlers, and calls to the back end. Contains the iqss_interface object (described below).

Iqss_interface.py: contains all code needed for the back end of the application, including: the user-loaded data frame, the cleaned data frame, and the built models. Contains two objects: the iqss_interface object (holds the data frames) and the model_object (holds the necessary df, tf-idf, lda, and spacy models).

Utils.py: contains all methods used for manipulating data frames, including: converting to term and document frequency matrix, cleaning the data frame, extracting entities from the data frame, and determining the number of topics based on the contents of the data frame.
    
Visualizations.py: contains all code necessary to build the various visualizations, including: 
the LDA cluster graph: http://brandonrose.org/clustering  
the similarity graph: a graph that plots each object's similarity score on the x axis and the group the object belongs to on the y axis
LDA topic distribution: https://pyldavis.readthedocs.io/en/latest/readme.html#installation

