
jiegzhan / multi-class-text-classification-cnn


Classify Kaggle Consumer Finance Complaints into 11 classes. Build the model with a CNN (Convolutional Neural Network) and word embeddings on TensorFlow.

License: Apache License 2.0

Python 100.00%
cnn text-classification tensorflow convolutional-neural-networks embeddings sentence-classification multi

multi-class-text-classification-cnn's Introduction

Project: Classify Kaggle Consumer Finance Complaints

Highlights:

  • This is a multi-class text classification (sentence classification) problem.
  • The purpose of this project is to classify Kaggle Consumer Finance Complaints into 11 classes.
  • The model is built with a Convolutional Neural Network (CNN) and word embeddings on TensorFlow.
  • Input: consumer_complaint_narrative

    • Example: "someone in north Carolina has stolen my identity information and has purchased items including XXXX cell phones thru XXXX on XXXX/XXXX/2015. A police report was filed as soon as I found out about it on XXXX/XXXX/2015. A investigation from XXXX is under way thru there fraud department and our local police department.\n"
  • Output: product

    • Example: Credit reporting

Train:

  • Command: python3 train.py training_data.file parameters.json
  • Example: python3 train.py ./data/consumer_complaints.csv.zip ./parameters.json

A directory will be created during training, and the trained model will be saved in this directory.

Predict:

Provide the model directory (created when running train.py) and new data to predict.py.

  • Command: python3 predict.py ./trained_model_directory/ new_data.file
  • Example: python3 predict.py ./trained_model_1479757124/ ./data/small_samples.json


multi-class-text-classification-cnn's Issues

UnboundLocalError: local variable 'path' referenced before assignment

Hello,
I have a dataset with 1572 sentences and 13 classes, and I am trying to train on it.
Here is the output and the error:
INFO:root:The maximum length of all sentences: 40
INFO:root:x_train: 1271, x_dev: 142, x_test: 158
INFO:root:y_train: 1271, y_dev: 142, y_test: 158
Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 131, in train_cnn
logging.critical('Accuracy on test set is {} based on the best model {}'.format(test_accuracy, path))
UnboundLocalError: local variable 'path' referenced before assignment.
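
The error pattern itself is ordinary Python: a local name that is assigned only inside a conditional branch stays unbound if that branch never runs. The sketch below reproduces the situation in isolation and shows one possible guard. It is a stand-in, not the code from train.py; `run_training`, `steps`, and the checkpoint string are hypothetical, and the idea that `path` is assigned only at evaluation steps is an assumption based on the traceback.

```python
def run_training(steps, evaluate_every):
    """Stand-in for a training loop that saves a checkpoint only at evaluation steps."""
    path = None  # guard: bind the name up front so it always exists
    for step in range(1, steps + 1):
        if step % evaluate_every == 0:
            path = "checkpoints/model-{}".format(step)  # hypothetical checkpoint name
    if path is None:
        # Without the guard above, using `path` here would raise UnboundLocalError.
        return "no checkpoint saved: {} steps < evaluate_every={}".format(steps, evaluate_every)
    return "best model saved at {}".format(path)


print(run_training(steps=8, evaluate_every=200))
print(run_training(steps=400, evaluate_every=200))
```

If this is what happens in train.py, a run that is shorter than one evaluation interval never saves a model, so the final log line has nothing to report.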

Error "not enough values to unpack" on predict

When I try to predict, I get an error after step 7200:

CRITICAL:root:Saved model d:\Django\multi-class-text-classification-cnn\trained_model_1485372325\checkpoints\model-7200 at step 7200
CRITICAL:root:Best accuracy 0.8768386935926203 at step 7200
Traceback (most recent call last):
  File "train.py", line 141, in <module>
    train_cnn()
  File "train.py", line 104, in train_cnn
    x_train_batch, y_train_batch = zip(*train_batch)
ValueError: not enough values to unpack (expected 2, got 0)

The accuracy is good enough, but why does this error occur, and what if it arises while the accuracy is still low?
(Anaconda Python 3.5; dataset of 47,000 rows with Cyrillic characters)
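
The ValueError can be reproduced without TensorFlow: `zip(*batch)` yields nothing when `batch` is empty, so there is nothing to unpack into two names. A minimal sketch of one possible guard follows. `unpack_batch` is a hypothetical helper, and it is an assumption that an empty final batch from the batch iterator is what triggered the error at line 104.

```python
def unpack_batch(batch):
    """Split a list of (x, y) pairs into two tuples, or return None for an empty batch."""
    if len(batch) == 0:
        return None  # zip(*[]) is empty, so `x, y = zip(*batch)` would raise ValueError
    x_batch, y_batch = zip(*batch)
    return x_batch, y_batch


batches = [[("text a", 0), ("text b", 1)], []]  # the second batch is empty
for batch in batches:
    unpacked = unpack_batch(batch)
    if unpacked is None:
        continue  # skip the empty batch instead of crashing
    x_batch, y_batch = unpacked
    print(x_batch, y_batch)
```

Since the best checkpoint had already been saved at step 7200, a crash at this point would not by itself invalidate that saved model.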

Input file format for prediction

Hello,
I am able to run the small_samples.json file successfully and get prediction.json. The small_samples.json file contains the "product" and "consumer_complaint_narrative" fields, which is fine for a test file. However, I am a little confused (I am new to this) about the input file format for unseen data that has only "consumer_complaint_narrative". My problem is how to get a new prediction for a "consumer_complaint_narrative" without providing the "product" field in the input JSON file.
What should the input file look like for unseen "consumer_complaint_narrative" data, and what should the prediction command be? Do I need to edit anything in predict.py?
Can anyone help?
Thanks in advance.
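
One way to lay out unlabeled input is a JSON list of records that carry only the consumer_complaint_narrative field. The sketch below writes such a file with the standard library. The file name and sample texts are made up, it assumes small_samples.json is itself a list of records, and it is an assumption that predict.py accepts records without a product field; its loader may need a small change to skip the accuracy computation when no labels are present.

```python
import json
import os
import tempfile

narratives = [
    "someone has stolen my identity information",            # made-up sample text
    "my mortgage payment was applied to the wrong account",  # made-up sample text
]
# One record per text, with no "product" key because the label is unknown.
records = [{"consumer_complaint_narrative": text} for text in narratives]

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "unseen_samples.json")  # hypothetical file name
with open(out_path, "w") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm its shape.
with open(out_path) as f:
    loaded = json.load(f)
print(out_path, len(loaded), sorted(loaded[0].keys()))
```

The prediction command would then keep the form given in the README, with this file as the second argument.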

NotFoundError: encountered while running function tf.train.latest_checkpoint() in predict.py

Both the model and the checkpoints exist in the same directory. This is the error I encounter when I try to run the file.

NotFoundError                    Traceback (most recent call last)
<ipython-input-60-8de4d687f60c> in <module>()
  5     checkpoint_dir += '/'
  6 print (checkpoint_dir + 'checkpoints')
  ----> 7 checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir +    'checkpoints')
  8 print (checkpoint_file)

  /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py in latest_checkpoint(checkpoint_dir, latest_filename)
 1612     v1_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
 1613                                          saver_pb2.SaverDef.V1)
 -> 1614     if file_io.get_matching_files(v2_path) or     file_io.get_matching_files(
 1615         v1_path):
 1616       return ckpt.model_checkpoint_path

 /usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py in get_matching_files(filename)
 330           # Convert the filenames to string from bytes.
 331           compat.as_str_any(matching_filename)
 --> 332           for single_filename in filename
 333           for matching_filename in pywrap_tensorflow.GetMatchingFiles(
 334               compat.as_bytes(single_filename), status)

/usr/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
 64         if type is None:
 65             try:
 ---> 66                 next(self.gen)
 67             except StopIteration:
 68                 return

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
464           None, None,
465           compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466           pywrap_tensorflow.TF_GetCode(status))
467   finally:
468     pywrap_tensorflow.TF_DeleteStatus(status)

NotFoundError: /home/user/cnn-model/trained_model_1506946529/checkpoints
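
The snippet in the traceback looks for checkpoints in a `checkpoints` subdirectory of the model directory, so a first check is whether that subdirectory exists and what it contains. The sketch below does this with the standard library only. `describe_checkpoint_dir` is a hypothetical helper, and the relative path in the example call is an assumption about where the model directory sits.

```python
import os


def describe_checkpoint_dir(model_dir):
    """Report whether <model_dir>/checkpoints exists and which files it holds."""
    checkpoint_dir = os.path.join(model_dir, "checkpoints")
    if not os.path.isdir(checkpoint_dir):
        return "missing directory: " + checkpoint_dir
    files = sorted(os.listdir(checkpoint_dir))
    return "found {} file(s) in {}: {}".format(len(files), checkpoint_dir, files)


print(describe_checkpoint_dir("./trained_model_1506946529"))
```

If the model files sit directly in the model directory rather than in `checkpoints`, the lookup in the traceback would not find them.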

Print Prediction

Congratulations on the code, it helped a lot. How do I print the prediction that was made for each text?

Regards

error while training

Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 59, in train_cnn
l2_reg_lambda=params['l2_reg_lambda'])
File "/webserver/github/multi-class-text-classification-cnn/text_cnn.py", line 49, in __init__
self.h_pool = tf.concat(3, pooled_outputs)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1047, in concat
dtype=dtypes.int32).get_shape(
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 651, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 716, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
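
This TypeError is what TensorFlow 1.0 and later raise when `tf.concat` is called with the pre-1.0 argument order: the axis used to come first, and now the list of tensors does. A minimal sketch of the updated call follows. `concat_pooled` and its `concat` parameter are hypothetical, added only so the argument order can be exercised without TensorFlow installed.

```python
def concat_pooled(pooled_outputs, axis=3, concat=None):
    """Join pooled feature maps along `axis` using the values-first argument order."""
    if concat is None:
        import tensorflow as tf  # imported lazily; needs TensorFlow 1.0 or later
        concat = tf.concat
    return concat(pooled_outputs, axis)  # was: tf.concat(3, pooled_outputs)


# Exercise the argument order with a stand-in that just records its arguments.
recorded = concat_pooled(["pool_a", "pool_b"], concat=lambda values, axis: (values, axis))
print(recorded)
```

In text_cnn.py the corresponding one-line change would be `self.h_pool = tf.concat(pooled_outputs, 3)`.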

prediction

Hi,
Following your flow, I substituted my own corpus, and I additionally added an embedding matrix to your code before training. The purpose of the embedding matrix is to achieve close matches. The first time, when I trained on "what is your age" and predicted for "how old are you", I got a result, but after that I did not get a proper answer. What am I missing?

Some questions about parameters.json

Sorry, I am a machine learning beginner, so I have a few questions to ask ...

  1. How do I know which training configuration is best for my dataset?
  2. In other words, can you explain the use of each parameter?

For example, my dataset has more than 6,000 sentences (rows) and the maximum length is 68.
I tried batch_size: 4 and evaluate_every: 2, but I only get 0.7 accuracy at step 1152.

  3. So ... does this mean that it has trained on the entire dataset?
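
Whether step 1152 has covered the whole dataset comes down to how many steps one pass takes. The worked example below computes that, assuming each step consumes one batch. The training share of 0.8 is an illustrative assumption, since the actual split is not stated in the question; 1.0 is included as the upper bound.

```python
import math


def steps_per_epoch(n_train, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_train / batch_size)


n_rows = 6000        # "more than 6,000" rows, from the question
batch_size = 4
current_step = 1152

for train_fraction in (1.0, 0.8):  # assumed share of rows used for training
    n_train = int(n_rows * train_fraction)
    per_epoch = steps_per_epoch(n_train, batch_size)
    # Fraction of one full pass completed at the current step.
    print(train_fraction, per_epoch, round(current_step / per_epoch, 2))
```

Under either assumption, one pass takes more than 1152 steps, so at step 1152 the model has not yet seen every training example once.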

tensorflow.python.framework.errors_impl.NotFoundError in predict.py

Traceback (most recent call last):
File "predict.py", line 73, in <module>
predict_unseen_data()
File "predict.py", line 19, in predict_unseen_data
checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir + 'checkpoints')
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1602, in latest_checkpoint
if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 332, in get_matching_files
for single_filename in filename
File "/home/kanimozhiu/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/multi-class-text-classification-cnn/trained_model_1479757124/checkpoints

How to resolve this error?
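
The script runs under /home/kanimozhiu, yet the error names a path under /home/ubuntu. One plausible cause is that the plain-text `checkpoint` state file, which TensorFlow writes next to the saved model files, still records absolute paths from the machine where the model was trained. The sketch below prints the recorded paths so they can be compared with the local directory. `recorded_checkpoint_paths` is a hypothetical helper, and the diagnosis is an assumption based on the traceback alone.

```python
import os
import re


def recorded_checkpoint_paths(checkpoint_dir):
    """Return the paths listed in the plain-text `checkpoint` state file, if present."""
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return []
    with open(state_file) as f:
        # Matches both `model_checkpoint_path` and `all_model_checkpoint_paths` entries.
        return re.findall(r'model_checkpoint_paths?: "([^"]+)"', f.read())


print(recorded_checkpoint_paths("./trained_model_1479757124/checkpoints"))
```

If the printed paths point at another machine, retraining locally or editing the state file to use local paths are two options to try.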

Running on GPU

Is this running on CPU or GPU? If CPU, how can I make this run on GPU?

Try to run the program with my own data

I am very new to this whole subject. I created a small CSV with 25 entries and wanted to run the program with it, but it fails.

I attached my data file, so maybe you can have a look at it and tell me what I am doing wrong.

The first error I get is:

Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 24, in train_cnn
max_document_length = max([len(x.split(' ')) for x in x_raw])
ValueError: max() arg is an empty sequence

consumer_complaints1.csv.zip

Thank you very much.
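
`max()` raises this ValueError when it receives an empty list, which means no texts had been loaded by the time line 24 ran. The traceback does not say why; missing column names or rows dropped during cleaning are possibilities, not confirmed causes. The sketch below shows a hypothetical helper, `max_document_length`, that fails with a more descriptive message.

```python
def max_document_length(texts):
    """Length in space-separated tokens of the longest text, with a clear error if none."""
    if len(texts) == 0:
        raise ValueError("no input texts were loaded; check the column names and empty rows")
    return max(len(text.split(' ')) for text in texts)


print(max_document_length(["a police report was filed", "credit reporting"]))
```

Printing the number of rows right after the data file is read is a quick way to see whether the problem is in the file or in the cleaning step.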

How to get class for given text input?

Could you provide some example code how to get class output for given text input?

I was able to get all the code working with ./data/small_samples.json, but the output is an accuracy percentage. I need the exact class name for every text.
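
Turning model output into class names is an argmax over the per-class scores followed by a lookup in the label list. The sketch below shows that step in plain Python. `predict_labels`, the three labels, and the scores are illustrative, and it assumes the label order matches the order used during training.

```python
def predict_labels(scores, labels):
    """Map each row of per-class scores to the label with the highest score."""
    predictions = []
    for row in scores:
        best_index = max(range(len(row)), key=row.__getitem__)  # argmax
        predictions.append(labels[best_index])
    return predictions


labels = ["Credit reporting", "Debt collection", "Mortgage"]  # illustrative subset
scores = [[2.1, 0.3, -1.0], [0.2, 0.1, 1.7]]                  # made-up model outputs
print(predict_labels(scores, labels))
```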

UnboundLocalError: local variable 'path' referenced before assignment

Hi,

I was using your model trainer, but with considerably less data than in the given example:

  1. Trying to train for 3 classes
  2. Approx. 50 sentences in the training set for each class

What parameters should I provide to get a functioning, effective model?

When I run with the given parameters, I get an error, so I was wondering if there is a better way to go about this.

Errors include:

  1. for the following parameters
    {
    "num_epochs": 1,
    "batch_size": 20,
    "num_filters": 32,
    "filter_sizes": "3,4,5",
    "embedding_dim": 50,
    "l2_reg_lambda": 0.0,
    "evaluate_every": 200,
    "dropout_keep_prob": 0.5
    }
    logging.critical('Accuracy on test set is {} based on the best model {}'.format(test_accuracy, path))
    UnboundLocalError: local variable 'path' referenced before assignment
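
A quick calculation with these parameters shows how few training steps such a small dataset produces. The worked example below assumes one batch per step and takes 150 sentences (3 classes of about 50) as an upper bound on the training set, before any dev or test split. It is also an assumption, consistent with the traceback, that `path` is assigned only when an evaluation step saves a checkpoint.

```python
import json
import math

# Parameters as posted in the issue.
params = json.loads("""
{
  "num_epochs": 1,
  "batch_size": 20,
  "num_filters": 32,
  "filter_sizes": "3,4,5",
  "embedding_dim": 50,
  "l2_reg_lambda": 0.0,
  "evaluate_every": 200,
  "dropout_keep_prob": 0.5
}
""")

n_train = 150  # upper bound: 3 classes x about 50 sentences
total_steps = math.ceil(n_train / params["batch_size"]) * params["num_epochs"]

print("total steps:", total_steps, "evaluate_every:", params["evaluate_every"])
print("reaches an evaluation step:", total_steps >= params["evaluate_every"])
```

Under these assumptions, lowering evaluate_every to no more than the total step count, or raising num_epochs, would let at least one evaluation run.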

Error in training resolved

To switch to TensorFlow 0.9 on Python 3:

So now I am using Python 3 and TensorFlow 0.9.

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/tensorflow-0.9.0rc0-py3-none-any.whl
python3 -m pip install $TF_BINARY_URL

Then, at line 82, replace this:
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y)
with this:
losses = tf.nn.softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.scores) # only named arguments accepted

This worked for me! @jiegzhan, do you agree with this?

Changing Labels

If I want to change the labels, say I want to classify 7 labels, do I just change labels.json?

Own dataset

Hi,

  1. I ran the code with my dataset and the accuracy comes out as zero. What if I have only a small dataset (a few hundred examples)?
  2. There is a variable in parameters.json named evaluate_every. Can we set it to any arbitrary value?
    Thanks

Layers ? Neurons ?

Hi.

I just started learning how to use TensorFlow instead of building ANNs by myself.

I tried to use your code and everything works (I added back TensorBoard visualisation support), but I don't understand where the hidden layers are defined (so I can add more) or where the number of neurons per layer is set.

Could you please help?

the probability of the text

Hello Mr. jiegzhan,
I want to print the probability score of each text after prediction.
Could you please suggest how to print output like [Text_input, index_label, probability_score]?
Thank you
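
Probability scores are usually obtained by applying a softmax to the raw per-class scores. The sketch below does this in plain Python and prints rows in the requested [Text_input, index_label, probability_score] shape. The texts and scores are made up, and it assumes the raw scores can be fetched from the model at prediction time.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable form)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


texts = ["text one", "text two"]             # made-up inputs
logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]  # made-up per-class scores

for text, row in zip(texts, logits):
    probs = softmax(row)
    index_label = probs.index(max(probs))  # position of the most likely class
    print([text, index_label, round(probs[index_label], 3)])
```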
