GithubHelp home page GithubHelp logo

text-classification-bak's Introduction

Step 1: Gather Data

Source data from public data set on BBC news articles.

Its original source.its original source: http://mlg.ucd.ie/datasets/bbc.html

Cleaned up version: https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv

Step 2: Explore Data

Is the number of articles in each category roughly equal?

If our dataset were imbalanced, we would need to carefully configure our model or artificially balance the dataset, for example by undersampling or oversampling each class.

Step 2.5: Choose a Model

Step 3: Prepare Data

extracting features

To further analyze our dataset, we need to transform each article's text to a feature vector, a list of numerical values representing some of the text’s characteristics.

  • one-hot vector

  • Co-occurrence Matrix (SVD Based Methods)

  • word2vec
    • continuous bag of words model

      a model where for each document, an article in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

    • skip-gram

  • Term Frequency, Inverse Document Frequency (tf-idf)

    This statistic represents words’ importance in each document.

Step 4: Build, Train, and Evaluate Model

Select and Train a Model

We don't want to touch the test set until we are ready to launch a model you are confident about, so we need to use part of the training set for training, and part for validation.

Performance Measures

Measuring Accuracy Using Cross-Validation

Confusion Matrix

Precision and Recall

Precision / Recall Tradeoff

The ROC Curve

we must carefully choose the right metric based on the task we are trying to solve. Here, we are dealing with a multi-class classification task (trying to assign one label to each document, out of a number of labels). Given the relative balance of our dataset, accuracy does seem like a good metric to optimize for. If one of the labels was more important than the others, we would then weight it higher, or focus on its own results. For imbalanced datasets such as the one described above, it is common to look at the ROC curve, and optimize the Area Under the Curve (ROC AUC).

Step 5: Tune Hyperparameters

Step 6: Deploy Model

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.