

Building and Deploying A Text Classification Web App


Web App: https://docwebapp-j3zdo3lhcq-uc.a.run.app/

About


In this project, over a series of blog posts, I'll be building a model for document classification (also known as text classification) and deploying it as part of a web application that predicts the topic of research papers from their abstracts.

1st Blog Post: Dealing With Imbalanced Data


In the first blog post I will be working with the Scikit-learn library and an imbalanced dataset (corpus) that I will create from summaries of papers published on arXiv. The topic of each paper is already labeled as its category, alleviating the need for me to label the dataset. The imbalance in the dataset comes from the imbalance in the number of samples in each of the categories we are trying to predict. Imbalanced data occurs quite frequently in classification problems and makes developing a good model more challenging. Often it is too expensive, or simply not possible, to get more data for the classes that have too few samples. Developing strategies for dealing with imbalanced data is therefore paramount for building a good classification model. I will cover some of the basics of handling imbalanced data using the Imbalanced-learn library, as well as building a Naive Bayes classifier and a Support Vector Machine with Scikit-learn. I will also cover the basics of term frequency-inverse document frequency (TF-IDF) and visualize it using the Plotly library.
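
As a rough sketch of how those pieces fit together (the exact code lives in the notebooks; the tiny hard-coded corpus below only stands in for the real arXiv data), an Imbalanced-learn pipeline can chain TF-IDF features, oversampling of the minority class, and a Naive Bayes classifier:

# Minimal sketch, assuming lists of abstracts and their topic labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

abstracts = [
    "we study galaxy formation with n-body simulations",
    "dark matter halos and cosmological structure growth",
    "star formation rates in dwarf galaxies",
    "gravitational lensing of distant quasars",
    "a transformer architecture for machine translation",
    "convolutional networks for image classification",
]
topics = ["astro", "astro", "astro", "astro", "cs", "cs"]  # imbalanced classes

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, topics, test_size=0.33, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # TF-IDF features
    ("sampler", RandomOverSampler(random_state=42)),   # balance the classes
    ("clf", MultinomialNB()),                          # Naive Bayes model
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test), zero_division=0))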

2nd Blog Post: Using The Natural Language Toolkit


In this blog post I picked up from the last one and went over using the Natural Language Toolkit (NLTK) to improve the performance of our text classification models. Specifically, we went over how to remove stop words and how to apply stemming and lemmatization. I applied each of these to the weighted Support Vector Machine model and performed a grid search to find the optimal parameters for our models. Finally, I persisted the model to disk using Joblib so that we can use it as part of a REST API.
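
A minimal sketch of that kind of preprocessing is shown below; the function and file names are illustrative, not the post's exact code, and the tokenizer is meant to be plugged into the vectorizer:

# Illustrative only: an NLTK tokenizer that lowercases, drops stop words,
# and lemmatizes each document.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, remove stop words, and lemmatize a document."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok.isalpha() and tok not in STOP_WORDS]

print(preprocess("The models were trained on papers about galaxies."))
# ['model', 'trained', 'paper', 'galaxy']

# Once a pipeline (e.g. TfidfVectorizer(tokenizer=preprocess) + an SVM) has
# been fit and tuned with GridSearchCV, it can be saved for the API with
# joblib.dump(pipeline, "model.joblib").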

3rd Blog Post: A Machine Learning Powered Web App


In this post we'll build out a serverless web app using a few technologies. The advantage of using a serverless framework for me is cost effectiveness: I don't pay much at all unless people use my web app heavily, and I don't expect people to visit this app very often. The trade-off of going serverless is added latency (for example, from cold starts), which I can live with. I'll first go over how to convert my text classification model from the last post into a REST API using FastAPI and Joblib. Using the model in this way will allow us to send paper abstracts as JSON through an HTTP request and get back the predicted topic label for each abstract. After this I'll build out a web application using FastAPI and Bootstrap. Using Bootstrap allows us to have an attractive, responsive website without having to write much CSS or JavaScript ourselves. Finally, I'll go over deploying both the model API and the web app with Docker and Google Cloud Run to build out a serverless web application!
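
A condensed sketch of what such a model API can look like with FastAPI is shown below; the schema, route, and file names are assumptions for illustration and not necessarily what the modelapi service uses:

# Hypothetical minimal service: load the Joblib-persisted pipeline and
# expose a prediction endpoint that accepts a paper abstract as JSON.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # pipeline saved in the previous post

class Paper(BaseModel):
    abstract: str

@app.post("/predict")
def predict(paper: Paper):
    """Return the predicted topic label for a paper abstract."""
    label = model.predict([paper.abstract])[0]
    return {"topic": label}

# Run locally with `uvicorn main:app --reload` and POST
# {"abstract": "..."} to http://localhost:8000/predict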

4th Blog Post: Deep Learning With TensorFlow & Optuna


This time I will use a Convolutional Neural Network (CNN) model with TensorFlow and Keras to predict the topic of each paper's abstract, and use Optuna to optimize the hyperparameters of the model. Keras is a high-level library that makes building complex deep learning models relatively easy, and since it can use TensorFlow as a backend, it is a production-ready framework. Optuna is a powerful automatic hyperparameter tuning library whose define-by-run design makes it elegant and easy to use. I have just started using this library and have been particularly impressed with the design, which is extremely intuitive. While CNNs are no longer the state-of-the-art algorithms for text classification, they still perform quite well, and I wanted to explore how they would work on this problem. I should note that the point of this isn't to build the highest-performing model, but rather to show how these tools fit together to build an end-to-end deep learning model.
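
To illustrate the define-by-run idea, here is a minimal sketch (assumed, and much smaller than the notebook's actual model and data) of tuning a small Keras CNN with Optuna; the random integer sequences stand in for padded, tokenized abstracts:

# Toy example: fake data and illustrative hyperparameter ranges.
import numpy as np
import optuna
from tensorflow import keras

X_train = np.random.randint(1, 20000, size=(500, 200))  # fake token ids
y_train = np.random.randint(0, 4, size=(500,))          # fake topic labels

def objective(trial):
    # Hyperparameters are suggested inside the trial ("define-by-run").
    embed_dim = trial.suggest_categorical("embed_dim", [64, 128, 256])
    filters = trial.suggest_int("filters", 32, 256, log=True)
    kernel_size = trial.suggest_int("kernel_size", 3, 7)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)

    model = keras.Sequential([
        keras.layers.Embedding(20000, embed_dim),
        keras.layers.Conv1D(filters, kernel_size, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train, validation_split=0.2,
                        epochs=2, batch_size=64, verbose=0)
    return max(history.history["val_accuracy"])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)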

How To Run This:


To use the notebooks in this project, first install Docker, then start the notebook server with the command:

docker-compose up

and go to the posted URL. To recreate the REST API and web app, use the commands listed in the modelapi and webapp directories, respectively.
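
Once the model API is running (locally or on Cloud Run), it can be called with a short Python snippet like the one below; the endpoint path and JSON schema are assumptions for illustration and may differ from the deployed service:

# Hypothetical client call to the model API; adjust the URL and payload
# to match the actual service.
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"abstract": "We study galaxy formation with n-body simulations."},
)
print(response.json())  # e.g. {"topic": "astro-ph"}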
