Classify Kaggle Consumer Finance Complaints into 11 classes. The model is built with a CNN (Convolutional Neural Network) and word embeddings on TensorFlow.

License: Apache License 2.0

multi-class-text-classification-cnn's Introduction

Project: Classify Kaggle Consumer Finance Complaints

Highlights:

This is a multi-class text classification (sentence classification) problem.
The purpose of this project is to classify Kaggle Consumer Finance Complaints into 11 classes.
The model was built with a Convolutional Neural Network (CNN) and word embeddings on TensorFlow.

Data: Kaggle Consumer Finance Complaints

Input: consumer_complaint_narrative
- Example: "someone in north Carolina has stolen my identity information and has purchased items including XXXX cell phones thru XXXX on XXXX/XXXX/2015. A police report was filed as soon as I found out about it on XXXX/XXXX/2015. A investigation from XXXX is under way thru there fraud department and our local police department.\n"
Output: product
- Example: Credit reporting

Train:

Command: python3 train.py training_data.file parameters.json
Example: python3 train.py ./data/consumer_complaints.csv.zip ./parameters.json

A directory will be created during training, and the trained model will be saved in this directory.

Predict:

Provide the model directory (created when running train.py) and new data to predict.py.

Command: python3 predict.py ./trained_model_directory/ new_data.file
Example: python3 predict.py ./trained_model_1479757124/ ./data/small_samples.json

Reference:

Implement a CNN for text classification in TensorFlow

multi-class-text-classification-cnn's Issues

Getting error: list index out of range in train.py

I downloaded the project and ran train.py
but I got the following error:

line 17, in train_cnn
train_file = sys.argv[1]
IndexError: list index out of range

Please help :)
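
The traceback points at `train_file = sys.argv[1]`, which raises IndexError when the script is started without command-line arguments. Below is a minimal sketch of an argument check, assuming the two positional arguments shown in the Train section. `parse_train_args` is a hypothetical helper and is not part of train.py.

```python
def parse_train_args(argv):
    """Return (train_file, params_file), or exit with a usage message.

    Hypothetical helper: argv is expected to look like sys.argv,
    i.e. [script_name, training_data_file, parameters_json].
    """
    if len(argv) < 3:
        # Exiting with a message is clearer than an IndexError on argv[1].
        raise SystemExit("Usage: python3 train.py training_data.file parameters.json")
    return argv[1], argv[2]


train_file, params_file = parse_train_args(
    ["train.py", "./data/consumer_complaints.csv.zip", "./parameters.json"]
)
```

In practice the script would be started as in the Train example, with both the data file and parameters.json on the command line.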

How to visualize the embedding and confusion matrix in tensorboard?

What code should be added so that the word embedding and confusion matrix at each step could be visualized in tensorboard?

UnboundLocalError: local variable 'path' referenced before assignment

Hello,
I have a dataset with 1572 sentences and 13 classes, and I am trying to train on it.
Here is the output and the error:
INFO:root:The maximum length of all sentences: 40
INFO:root:x_train: 1271, x_dev: 142, x_test: 158
INFO:root:y_train: 1271, y_dev: 142, y_test: 158
Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 131, in train_cnn
logging.critical('Accuracy on test set is {} based on the best model {}'.format(test_accuracy, path))
UnboundLocalError: local variable 'path' referenced before assignment.

Error "not enough values to unpack" on predict

I got the following error after step 7200:

CRITICAL:root:Saved model d:\Django\multi-class-text-classification-cnn\trained_model_1485372325\checkpoints\model-7200 at step 7200
CRITICAL:root:Best accuracy 0.8768386935926203 at step 7200
Traceback (most recent call last):
  File "train.py", line 141, in <module>
    train_cnn()
  File "train.py", line 104, in train_cnn
    x_train_batch, y_train_batch = zip(*train_batch)
ValueError: not enough values to unpack (expected 2, got 0)

The accuracy is good enough, but why does this error occur, and what if it occurs while the accuracy is still low?
(Anaconda Python 3.5, dataset of 47,000 rows with Cyrillic characters)

Input file format for prediction

Hello,
I am able to run the small_samples.json file successfully and get prediction.json. The small_samples.json file contains both "product" and "consumer_complaint_narrative". For a test file, this is fine. However, I am a little confused (I am new to this) about the input file format for unseen data that has only consumer_complaint_narrative. My problem is how to get a new prediction for a "consumer_complaint_narrative" without providing the "product" field in the input JSON file.
What does the input file look like for unseen "consumer_complaint_narrative" data, and what should the prediction command be? Do I need to edit anything in predict.py?
Can anyone help?
Thanks in advance.
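
As a rough illustration of the question, here is a sketch that builds input records carrying only the text field. It assumes the input is a JSON list of objects keyed by the field names mentioned above. Whether predict.py accepts records without "product" is not confirmed here, and `build_unlabeled_input` is a hypothetical helper.

```python
import json


def build_unlabeled_input(narratives):
    """Build records that carry only the text field (no 'product' label)."""
    return [{"consumer_complaint_narrative": text} for text in narratives]


records = build_unlabeled_input([
    "I was charged twice for the same transaction.",
    "My credit report lists an account I never opened.",
])

# Serialize to a JSON string; it could then be written to a file such as new_data.json.
payload = json.dumps(records, indent=2)
```

If predict.py reads the "product" column to compute accuracy, that part of the script would need to be skipped for unlabeled data.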

NotFoundError: encountered while running function tf.train.latest_checkpoint() in predict.py

Both the model and the checkpoints exist in the same directory. This is the error I encounter when I try to run the file.

NotFoundError                    Traceback (most recent call last)
<ipython-input-60-8de4d687f60c> in <module>()
  5     checkpoint_dir += '/'
  6 print (checkpoint_dir + 'checkpoints')
  ----> 7 checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir +    'checkpoints')
  8 print (checkpoint_file)

  /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py in latest_checkpoint(checkpoint_dir, latest_filename)
 1612     v1_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
 1613                                          saver_pb2.SaverDef.V1)
 -> 1614     if file_io.get_matching_files(v2_path) or     file_io.get_matching_files(
 1615         v1_path):
 1616       return ckpt.model_checkpoint_path

 /usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py in get_matching_files(filename)
 330           # Convert the filenames to string from bytes.
 331           compat.as_str_any(matching_filename)
 --> 332           for single_filename in filename
 333           for matching_filename in pywrap_tensorflow.GetMatchingFiles(
 334               compat.as_bytes(single_filename), status)

/usr/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
 64         if type is None:
 65             try:
 ---> 66                 next(self.gen)
 67             except StopIteration:
 68                 return

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
464           None, None,
465           compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466           pywrap_tensorflow.TF_GetCode(status))
467   finally:
468     pywrap_tensorflow.TF_DeleteStatus(status)

NotFoundError: /home/user/cnn-model/trained_model_1506946529/checkpoints

Print Prediction

Congratulations on the code, it helped a lot. How do I print the prediction made for each text?

Regards

error while training

Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 59, in train_cnn
l2_reg_lambda=params['l2_reg_lambda'])
File "/webserver/github/multi-class-text-classification-cnn/text_cnn.py", line 49, in __init__
self.h_pool = tf.concat(3, pooled_outputs)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1047, in concat
dtype=dtypes.int32).get_shape(
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 651, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 716, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
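
This TypeError is what TensorFlow 1.0 and later raise when `tf.concat` is called with the pre-1.0 argument order: the signature changed from `tf.concat(concat_dim, values)` to `tf.concat(values, axis)`. The sketch below only picks the argument order from a version string, so it runs without TensorFlow installed. `concat_args` is a hypothetical helper, not part of the repository.

```python
def concat_args(tf_version, tensors, axis):
    """Return positional arguments for tf.concat in the order this version expects.

    TensorFlow >= 1.0: tf.concat(values, axis)
    TensorFlow <  1.0: tf.concat(concat_dim, values)
    """
    major = int(tf_version.split(".")[0])
    if major >= 1:
        return (tensors, axis)
    return (axis, tensors)


# Assumed usage inside text_cnn.py (not executed here):
#   self.h_pool = tf.concat(*concat_args(tf.__version__, pooled_outputs, 3))
# On TensorFlow 1.x this is equivalent to: tf.concat(pooled_outputs, 3)
```

If only TensorFlow 1.x needs to be supported, swapping the two arguments on line 49 of text_cnn.py is the simpler change.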

prediction

Hi,
Following your flow, I substituted my own corpus, but I additionally added an embedding matrix to your code before training. The purpose of the embedding matrix is to achieve close matches. The first time, when I trained on "what is your age" and predicted for "how old are you", I got a result, but after that I did not get proper answers. What am I missing?

ImportError: cannot import name '_ellipsoid'

Accuracy on test set is {} based on the best model {}?

In your code, the accuracy on the test set is based on the latest model after all training steps have finished, not on the best model saved in the checkpoint. Am I right?

Some questions about parameters.json

Sorry, I am a machine learning beginner, so I have a few questions to ask ...

How do I know which training configuration is best for my dataset?
In other words, can you explain the use of each parameter?

For example, my dataset has more than 6,000 sentences (rows) and the maximum length is 68.
I tried batch_size: 4 and evaluate_every: 2, but I only get 0.7 accuracy at step 1152.

So, does this mean that it has trained on the entire dataset?
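
One training step processes one batch, so the fraction of an epoch completed at a given step can be estimated from the training-set size and batch_size. The sketch below assumes, for illustration only, that all 6,000 sentences are used for training; the real training split is smaller. Both function names are hypothetical.

```python
import math


def steps_per_epoch(train_size, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(train_size / batch_size)


def epochs_completed(step, train_size, batch_size):
    """Fraction of epochs finished after `step` training steps."""
    return step / steps_per_epoch(train_size, batch_size)


# Illustrative numbers from the question: 6,000 sentences, batch_size 4, step 1152.
progress = epochs_completed(1152, 6000, 4)
```

Under that assumption one epoch takes 1,500 steps, so step 1152 would be roughly three quarters of the first pass over the data.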

tensorflow.python.framework.errors_impl.NotFoundError in predict.py

Traceback (most recent call last):
File "predict.py", line 73, in <module>
predict_unseen_data()
File "predict.py", line 19, in predict_unseen_data
checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir + 'checkpoints')
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1602, in latest_checkpoint
if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 332, in get_matching_files
for single_filename in filename
File "/home/kanimozhiu/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/kanimozhiu/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/multi-class-text-classification-cnn/trained_model_1479757124/checkpoints

How to resolve this error?
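
The NotFoundError names a path under /home/ubuntu/ while the traceback runs under /home/kanimozhiu/, which suggests the model directory passed to predict.py (or a path recorded in it) comes from another machine. A minimal sketch for checking the directory before calling `tf.train.latest_checkpoint` follows. It assumes the layout `<model_dir>/checkpoints/` with the `checkpoint` index file that the TensorFlow saver writes; `describe_checkpoint_dir` is a hypothetical helper.

```python
import os


def describe_checkpoint_dir(model_dir):
    """Return a short diagnosis string for <model_dir>/checkpoints."""
    checkpoint_dir = os.path.join(model_dir, "checkpoints")
    if not os.path.isdir(checkpoint_dir):
        return "missing directory: " + checkpoint_dir
    # The saver keeps a small text file named 'checkpoint' listing saved models.
    index_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.isfile(index_file):
        return "no 'checkpoint' index file in: " + checkpoint_dir
    return "ok: " + checkpoint_dir
```

Calling it with the directory created by train.py on the local machine, for example `describe_checkpoint_dir("./trained_model_1479757124/")`, shows whether the path given to predict.py actually exists there.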

Running on GPU

Is this running on CPU or GPU? If CPU, how can I make this run on GPU?

Try to run the program with my own data

I am very new to this whole subject. I created a small CSV with 25 entries and wanted to run the program with it, but it fails.

I attached my data file, so maybe you can have a look at it and tell me what I am doing wrong.

The first error I get is:

Traceback (most recent call last):
File "train.py", line 136, in <module>
train_cnn()
File "train.py", line 24, in train_cnn
max_document_length = max([len(x.split(' ')) for x in x_raw])
ValueError: max() arg is an empty sequence
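
`max()` fails here because `x_raw` is empty, meaning no rows survived loading. One plausible cause (an assumption, since the loader code is not shown) is that the CSV lacks the expected column names or has empty narrative cells. The sketch below checks a CSV for the two fields named in the Data section, using only the standard library; `usable_rows` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ("product", "consumer_complaint_narrative")


def usable_rows(csv_text):
    """Return rows that have non-empty values in both required columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    # Short rows yield None for absent cells, hence the `or ""` guard.
    return [
        row for row in reader
        if all((row[c] or "").strip() for c in REQUIRED_COLUMNS)
    ]


sample = (
    "product,consumer_complaint_narrative\n"
    "Credit reporting,Someone opened an account in my name.\n"
    "Mortgage,\n"
)
rows = usable_rows(sample)
```

If this returns an empty list for a data file, training would have nothing to compute a maximum sentence length from.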

consumer_complaints1.csv.zip

Thank you very much.

How to get class for given text input?

Could you provide some example code how to get class output for given text input?

I was able to get all code working with ./data/small_samples.json but output is accuracy percent - i need exact class name for every text
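
A rough sketch of turning predicted indices into class names. It assumes the training run stores the class names in a list (for example in labels.json) whose order matches the model's output indices. The label values and `to_class_names` are illustrative, not taken from the repository.

```python
def to_class_names(predicted_indices, labels):
    """Map each predicted class index to its label string."""
    return [labels[i] for i in predicted_indices]


# Illustrative subset of labels; the real list would be loaded from the model directory.
labels = ["Credit reporting", "Debt collection", "Mortgage"]

# Pretend these indices came from an argmax over the model's scores.
names = to_class_names([0, 2, 2], labels)
```

Pairing `names` with the input texts then gives one class name per text instead of a single accuracy figure.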

My training ran and the best model was saved in the directory, but while running predict.py I get an error: no module named datahelper.py

UnboundLocalError: local variable 'path' referenced before assignment

Hi,

I was using your model trainer, but with considerably less data than in the given example:

Trying to train for 3 classes
Approx. 50 sentences in the training set for each class

What parameters should I provide in order to make a functioning, effective model?

When I run with the given parameters I get an error, so I was wondering if there is a better way to go about this.

Errors include:

For the following parameters:
{
"num_epochs": 1,
"batch_size": 20,
"num_filters": 32,
"filter_sizes": "3,4,5",
"embedding_dim": 50,
"l2_reg_lambda": 0.0,
"evaluate_every": 200,
"dropout_keep_prob": 0.5
}
logging.critical('Accuracy on test set is {} based on the best model {}'.format(test_accuracy, path))
UnboundLocalError: local variable 'path' referenced before assignment
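
A likely explanation, stated as an assumption because train.py is not shown in full: `path` is only assigned when a checkpoint is saved at an evaluation step, and with so little data the run ends before step `evaluate_every` is reached. The sketch below counts the total steps using the batch-count formula quoted elsewhere on this page; the function names are hypothetical.

```python
def total_steps(train_size, batch_size, num_epochs):
    """Total training steps, using int(data_size / batch_size) + 1 per epoch."""
    batches_per_epoch = int(train_size / batch_size) + 1
    return batches_per_epoch * num_epochs


def first_evaluation_reached(train_size, batch_size, num_epochs, evaluate_every):
    """True if training runs long enough to hit the first evaluation step."""
    return total_steps(train_size, batch_size, num_epochs) >= evaluate_every


# About 150 sentences (3 classes x 50), with the parameters listed above.
steps = total_steps(150, 20, 1)
reached = first_evaluation_reached(150, 20, 1, 200)
```

With these numbers the run has only a handful of steps, far fewer than 200, so lowering evaluate_every or raising num_epochs would be the first things to try.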

A bug: num_batches_per_epoch = int(data_size / batch_size) + 1

should be:
num_batches_per_epoch = int(math.ceil(float(data_size) / batch_size))
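
The difference shows up when data_size is an exact multiple of batch_size: the `+ 1` formula then produces one extra, empty batch, and unpacking an empty batch with `zip(*batch)` fails with "not enough values to unpack". The following sketch compares the two formulas; `batch_slices` is a hypothetical stand-in for the batch iterator and is not the repository's code.

```python
import math


def batch_slices(data_size, batch_size, use_ceil):
    """Return (start, end) index pairs for one epoch of batches."""
    if use_ceil:
        num_batches = int(math.ceil(float(data_size) / batch_size))
    else:
        num_batches = int(data_size / batch_size) + 1
    slices = []
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, data_size)
        slices.append((start, end))
    return slices


# 6 examples with batch_size 3: the original formula yields a trailing empty batch.
original = batch_slices(6, 3, use_ceil=False)
fixed = batch_slices(6, 3, use_ceil=True)
empty_batches = [s for s in original if s[0] == s[1]]
```

When data_size is not a multiple of batch_size, both formulas give the same number of batches, which is why the problem only appears for some datasets.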

Error in training resolved

To switch to TensorFlow 0.9 on Python 3:

I am now using Python 3 and TensorFlow 0.9.

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/tensorflow-0.9.0rc0-py3-none-any.whl
python3 -m pip install $TF_BINARY_URL

Then at line 82
replace this -
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y)
with -
losses = tf.nn.softmax_cross_entropy_with_logits(labels = self.input_y, logits = self.scores) # only named arguments accepted

This worked for me! @jiegzhan, do you agree with this?

Changing Labels

If I want to change the labels, say I want to classify 7 labels, do I just change labels.json?

run predict.py error?

TypeError: Expected int32, got list containing Tensors of type '_Message' instead

Hi, I am a newbie.
I tried your code, but I got the error message "TypeError: Expected int32, got list containing Tensors of type '_Message' instead". How can I solve it?
Thanks

I wish to get the raw probability distribution for the predicted classes; facing problems using softmax

Please help me: how can I use softmax in place of argmax() to get the raw probability distribution for the predicted classes?
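
Softmax turns the model's raw scores (logits) into probabilities that sum to 1, while argmax only picks the index of the largest score. In TensorFlow this would typically be `tf.nn.softmax` applied to the scores tensor. The sketch below shows the same computation in plain Python on made-up scores, so it does not depend on the model code.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into a probability distribution."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three classes.
scores = [2.0, 0.5, -1.0]
probabilities = softmax(scores)
predicted_index = probabilities.index(max(probabilities))
```

The index with the highest probability is the same one argmax would return on the raw scores, so the prediction itself does not change.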

How to initialise with custom embeddings
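
This issue has no description. One common approach, given here as an assumption and not as the author's method, is to build an initial embedding matrix from pretrained vectors, fall back to random values for words that have none, and use that matrix as the initial value of the model's embedding variable. All names in the sketch are hypothetical.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one vector per vocabulary word, in vocabulary order.

    vocab:      list of words, index = row in the matrix
    pretrained: dict mapping word -> list of floats of length dim
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None or len(vector) != dim:
            # Unknown word (or wrong size): start from small random values.
            vector = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        matrix.append(list(vector))
    return matrix


vocab = ["credit", "report", "xxxx"]
pretrained = {"credit": [0.1, 0.2], "report": [0.3, 0.4]}
matrix = build_embedding_matrix(vocab, pretrained, dim=2)
```

The embedding_dim value in parameters.json would have to match the dimension of the pretrained vectors.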

Own dataset

Hi,

I ran the code with my dataset and the accuracy comes out as zero. What if I have only a small dataset (a few hundred rows)?
There is a variable in the parameters.json file named evaluate_every; can we set it to any value?
Thanks

Layers ? Neurons ?

Hi.

I have just started learning how to use TensorFlow instead of building an ANN by myself.

I tried your code and everything is fine (I added TensorBoard visualisation support back), but I don't understand where the hidden layers are defined (so I can add more) or where the number of neurons per layer is set.

Could you please help?
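
Assuming the model follows the referenced CNN tutorial, there is no stack of hidden layers with a neuron count. Instead there is one convolution plus max-pooling branch per filter size, and the pooled outputs are concatenated before the output layer. The width of that concatenated layer is set by num_filters and filter_sizes in parameters.json, as sketched below; `pooled_feature_count` is a hypothetical helper.

```python
def pooled_feature_count(num_filters, filter_sizes):
    """Number of features after concatenating all max-pooled filter outputs.

    filter_sizes is the comma-separated string used in parameters.json.
    """
    sizes = [int(s) for s in filter_sizes.split(",") if s.strip()]
    return num_filters * len(sizes)


# Values from a parameters.json posted in another issue on this page.
features = pooled_feature_count(32, "3,4,5")
```

Raising num_filters or adding filter sizes is therefore the closest equivalent to adding neurons in this architecture.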

the probability of the text

Hello Mr. jiegzhan,
I want to print the probability score of each text after prediction.
Could you please suggest how to print output like [Text_input, index_label, probability_score]?
Thank you

jiegzhan / multi-class-text-classification-cnn