japerk / nltk-trainer
Train NLTK objects with zero code
Home Page: http://nltk-trainer.readthedocs.org/en/latest/
License: Apache License 2.0
In requirements.txt, the line:
nltk>=2.0b8
might lead to nltk 3.x being installed, which differs from the requirements stated in the README (and has personally caused problems for my code).
I recommend setting an upper limit with something like this:
nltk>=2.0b8,<=2.9.9
I have a corpus of 70,000 documents (roughly 237 MB), and I keep getting memory-related error messages.
I tried renting a VPS with 100 GB of RAM, but I got the same error messages.
Is there a way to make the process less memory-intensive?
Is it possible to break the corpus up into smaller corpora, train multiple classifiers and then combine them into one large classifier?
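One way to explore the "split and combine" idea is to train one classifier per shard and combine their predictions by majority vote. The sketch below is a minimal illustration under stated assumptions: `split_into_shards` and `MajorityVoteClassifier` are hypothetical helpers, not part of nltk-trainer, and each wrapped classifier only needs a `classify(feats)` method. Whether this avoids the memory errors reported above has not been verified.

```python
from collections import Counter


def split_into_shards(documents, shard_count):
    """Split a list of documents into roughly equal shards (round-robin)."""
    return [documents[i::shard_count] for i in range(shard_count)]


class MajorityVoteClassifier:
    """Combine several classifiers by majority vote.

    Hypothetical wrapper: each wrapped object only needs a
    classify(feats) method that returns a label.
    """

    def __init__(self, classifiers):
        self._classifiers = list(classifiers)

    def classify(self, feats):
        # Count the label predicted by each shard classifier.
        votes = Counter(clf.classify(feats) for clf in self._classifiers)
        # most_common(1) returns a one-element list of (label, count).
        return votes.most_common(1)[0][0]
```

Each shard could be trained and pickled in a separate run, so only one shard needs to be in memory at a time; the combined classifier then loads the small pickled models. Accuracy may differ from a single classifier trained on the full corpus.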
Your link for megam is not correct. http://www.umiacs.umd.edu/~hal/megam/.
I think it would be good to give a better explanation of how to use the trainer. I am trying to load a pre-trained model that made use of this library and couldn't understand how to add this dependency to my project.
Is it compatible with Python 3.7? I ask because the documentation and the README disagree on this.
After cloning the repo, what should be done?
Training a tagger/chunker with Maxent as the classifier fails. To make it work, an old version of scipy (10.1) must be installed.
Currently, I can't pass max_depth to a sklearn.RandomForestClassifier. I would like to be able to do this.
After training a sklearn.BernoulliNB classifier on a corpus, I'm getting sporadic errors when trying to predict labels for features with the stored classifier:
feats = {'and': True, (',', 'clean'): True, ('clean', 'and'): True, 'good': True, ('friendly', 'staff'): True, ',': True, '.': True, 'gyros': True, 'clean': True, ('gyros', ','): True, ('good', 'gyros'): True, ('and', 'friendly'): True, 'friendly': True, ('staff', '.'): True, 'staff': True}
clf = pickle.load(open('saved_classifier.pickle'))
p = clf.prob_classify(feats)
The above works. However if:
feats = {'and': True, 'fresh': True, ('fresh', 'and'): True, 'inexpensive': True, ('and', 'inexpensive'): True}
clf.prob_classify(feats) results in a type error... here's the trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-184-86c30997b740> in <module>()
----> 1 p = clf.prob_classify(feats)
2 p.prob('pos')
/Library/Python/2.7/site-packages/nltk/classify/api.pyc in prob_classify(self, featureset)
63 """
64 if overridden(self.batch_prob_classify):
---> 65 return self.batch_prob_classify([featureset])[0]
66 else:
67 raise NotImplementedError()
/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.pyc in batch_prob_classify(self, featuresets)
71 def batch_prob_classify(self, featuresets):
72 X = self._convert(featuresets)
---> 73 y_proba = self._clf.predict_proba(X)
74 return [self._make_probdist(y_proba[i]) for i in xrange(len(y_proba))]
75
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/pipeline.pyc in predict_proba(self, X)
154 for name, transform in self.steps[:-1]:
155 Xt = transform.transform(Xt)
--> 156 return self.steps[-1][-1].predict_proba(Xt)
157
158 def decision_function(self, X):
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_proba(self, X)
96 the model, where classes are ordered arithmetically.
97 """
---> 98 return np.exp(self.predict_log_proba(X))
99
100
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in predict_log_proba(self, X)
77 in the model, where classes are ordered arithmetically.
78 """
---> 79 jll = self._joint_log_likelihood(X)
80 # normalize by P(x) = P(f_1, ..., f_n)
81 log_prob_x = logsumexp(jll, axis=1)
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
433
434 if self.binarize is not None:
--> 435 X = binarize(X, threshold=self.binarize)
436
437 n_classes, n_features = self.feature_log_prob_.shape
/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/preprocessing.pyc in binarize(X, threshold, copy)
537 X.data[cond] = 1
538 X.data[not_cond] = 0
--> 539 X.eliminate_zeros()
540 else:
541 cond = X > threshold
/Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/compressed.pyc in eliminate_zeros(self)
572 fn = sparsetools.csr_eliminate_zeros
573 M,N = self._swap(self.shape)
--> 574 fn( M, N, self.indptr, self.indices, self.data)
575
576 self.prune() #nnz may have changed
/Library/Python/2.7/site-packages/scipy-0.13.0.dev_c31f167_20130307-py2.7-macosx-10.8-intel.egg/scipy/sparse/sparsetools/csr.pyc in csr_eliminate_zeros(*args)
565 csr_eliminate_zeros(int n_row, int n_col, int Ap, int Aj, npy_clongdouble_wrapper Ax)
566 """
--> 567 return _csr.csr_eliminate_zeros(*args)
568
569 def csr_sum_duplicates(*args):
TypeError: Array of type 'byte' required. Array of type 'bool' given
I just tried training a classifier-based chunker on the conll2002 dataset. I ran the following command:
python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes
Training succeeds, but upon saving, the following error occurs:
Traceback (most recent call last):
File "train_chunker.py", line 210, in <module>
name = '%s.pickle' % '_'.join(parts)
TypeError: sequence item 1: expected string, list found
I worked around this by using the --filename flag, but thought I'd submit the bug report anyway.
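The error message suggests that one element of `parts` is a list (plausibly the `--fileids` value) rather than a string. A minimal sketch of one possible fix is to flatten nested items before joining. The helper name `build_pickle_name` and the example contents of `parts` are assumptions for illustration; the actual contents in train_chunker.py are not shown in the report.

```python
def build_pickle_name(parts):
    """Join name parts into '<a>_<b>_....pickle', flattening nested lists."""
    flat = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            # A list such as ['ned.train'] contributes each item separately.
            flat.extend(str(item) for item in part)
        else:
            flat.append(str(part))
    return '%s.pickle' % '_'.join(flat)


# Assumed example input mirroring the failing command line.
print(build_pickle_name(['conll2002', ['ned.train'], 'NaiveBayes']))
# prints: conll2002_ned.train_NaiveBayes.pickle
```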
The following command works perfectly fine:
python train_chunker.py conll2002 --filename ~/nltk_data/chunkers/conll2002_chunker.pickle --classifier NaiveBayes
Then I copy ~/nltk_data/conll2002/ to ~/ntlk_data/conlltest/ and run the command:
python train_chunker.py conlltest --filename ~/nltk_data/chunkers/conlltest_chunker.pickle --classifier NaiveBayes
The output is:
loading conlltest
Traceback (most recent call last):
File "train_chunker.py", line 80, in <module>
chunked_corpus = load_corpus_reader(args.corpus, reader=args.reader, fileids=args.fileids)
File "/mnt/3E6227E362279F21/scriptie/external/nltk-trainer/nltk_trainer/__init__.py", line 64, in load_corpus_reader
raise ValueError('you must specify a corpus reader')
ValueError: you must specify a corpus reader
What am I missing? My version of nltk is 3.2.5.
I want to pickle my trained classifier, but I don't get a pickled classifier when I use the --cross-fold option.
python train_classifier.py --algorithm NaiveBayes --instances sents --fraction 0.9 --cross-fold 10 --show-most-informative 10 /mycorpuspath/
It shows the cross-validation results but produces no pickled file. If I remove the --cross-fold option, the pickled file is generated.
How can I get the cross-validated classifier? Thanks.
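Cross-validation is generally used only to estimate accuracy, so the per-fold classifiers are usually discarded and a final model is trained on all the data. The sketch below illustrates that common pattern. It is not a description of what train_classifier.py does internally; `k_fold_splits`, `cross_validate_then_fit`, `train_fn`, and `accuracy_fn` are hypothetical names.

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists for each of k folds."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def cross_validate_then_fit(items, k, train_fn, accuracy_fn):
    """Score with k-fold cross-validation, then fit one model on all items.

    train_fn(items) returns a model; accuracy_fn(model, test_items)
    returns a float. Both are supplied by the caller.
    """
    scores = [accuracy_fn(train_fn(train), test)
              for train, test in k_fold_splits(items, k)]
    final_model = train_fn(items)  # this is the model one would pickle
    return final_model, sum(scores) / len(scores)
```

With this approach, the cross-validation score serves as the estimate of how the final model is expected to perform, and only `final_model` is saved.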
Hi:
First of all, thank you for putting this code out. It seems to be very useful.
I installed japerk-nltk-trainer-5c0b53c on my Ubuntu 10.10 box. I did have to change the "requirements.txt" file in this line:
scipy>=0.7.0
The error message I'm getting is this:
dscs@lap02:~/Desktop/USC/taxonomy$ python /usr/local/bin/train_classifier.py --multi --instances sents -- cat_pattern "(.+).txt"
Traceback (most recent call last):
File "/usr/local/bin/train_classifier.py", line 5, in <module>
pkg_resources.run_script('nltk-trainer==0.9', 'train_classifier.py')
File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.6/dist-packages/distribute-0.6.21-py2.6.egg/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/EGG-INFO/scripts/train_classifier.py", line 4, in <module>
import nltk_trainer.classification.args
File "/usr/local/lib/python2.6/dist-packages/nltk_trainer-0.9-py2.6.egg/nltk_trainer/__init__.py", line 7, in <module>
from nltk_trainer.tagging.readers import NumberedTaggedSentCorpusReader
ImportError: No module named tagging.readers
Any suggestions appreciated. Thanks.
Using the invocation shown on the blog, the parameter is ignored.
http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training ['NaiveBayes'] classifier
training NaiveBayes classifier
accuracy: 0.718000
neg precision: 0.950413
neg recall: 0.460000
neg f-measure: 0.619946
pos precision: 0.643799
pos recall: 0.976000
pos f-measure: 0.775835
[z@a japerk-nltk-trainer-dc71c61]$
https://github.com/japerk/nltk-trainer
This package is not the same as the one in your NLTK cookbook.
featx.py is not complete on GitHub.
Please let me know where to find the latest version!
Otherwise I will have to return the book because of the many issues.
-fk
I need to train on 2 to 3 corpora. How should I approach that?
According to the documentation, we can use
feats = dict([(word, True) for word in words + ngrams(words, 1)])
as the feature set for classification, but I get a type error when I use it:
TypeError: can only concatenate list (not "generator") to list
Could you please tell me if I am doing anything wrong?
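In NLTK 3, `nltk.util.ngrams` returns a generator, which cannot be added to a list with `+`. A minimal sketch of one way around this is to chain the two iterables instead (wrapping the call in `list(...)` would also work). To keep the example self-contained, `ngrams` below is a local stand-in with the same lazy behaviour, and `bag_of_ngram_feats` is a hypothetical helper name.

```python
from itertools import chain


def ngrams(sequence, n):
    """Stand-in for nltk.util.ngrams: lazily yield n-grams as tuples."""
    sequence = list(sequence)
    for i in range(len(sequence) - n + 1):
        yield tuple(sequence[i:i + n])


def bag_of_ngram_feats(words, n=2):
    """Build a {feature: True} dict from the words plus their n-grams."""
    # chain() accepts any iterables, so the generator needs no conversion.
    return {feat: True for feat in chain(words, ngrams(words, n))}


feats = bag_of_ngram_feats(['good', 'gyros'], n=2)
print(feats)
# prints: {'good': True, 'gyros': True, ('good', 'gyros'): True}
```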
I have a sentence as text, and I tried to create a feature vector for classification as below:
tokens = word_tokenize(text, include_punc=False)
tokens = functools.reduce(operator.add, [tokens if n == 1 else list(ngrams(tokens, n)) for n in [3]])
if not isinstance(tokens, list):
    tokens = list(tokens)
feats = dict([(word, True) for word in tokens])
print("Classify: ", self._classifier.classify(feats))
But I always get the same pos/neg probabilities irrespective of the sentence, and the overall probability is always negative.
The ConllChunkCorpusReader needs an extra argument, a list of nodetags.
File "analyze_tagger_coverage.py", line 47, in <module>
corpus = reader_cls(args.corpus, '.+')
TypeError: __init__() takes at least 4 arguments (3 given)
After running
python train_classifier.py movie_reviews --classifier sklearn.LinearSVC
I got this:
train_classifier.py: error: argument --classifier/--algorithm: invalid choice: '
sklearn.LinearSVC' (choose from 'NaiveBayes', 'DecisionTree', 'Maxent', 'GIS', '
IIS', 'CG', 'BFGS', 'Powell', 'LBFGSB', 'Nelder-Mead', 'MEGAM', 'TADM')
In python, I CAN import sklearn with no error or warning.
How can I solve this?
Thanks!
I want to implement a phonetic search on my Django database. Can this be used for that? If so, how?
When trying to install using pip, I get the error:
IOError: [Errno 2] No such file or directory: '/Users/icaro.medeiros/.virtualenvs/pylearner/src/nltk-trainer/README.txt'
Hi,
I am working on classifying text as positive/negative/neutral. I have seen your demo on how to perform sentiment analysis on text, and I want to use your model in my code, but I don't know where that code is in your repo. I also don't know how to use it in my code for sentiment analysis.
Thank you
Great package! Wouldn't it be nice if you could invoke the scripts with something like:
$ nltk train movie_reviews --instances paras --classifier NaiveBayes
$ nltk analyze --sort count --reverse
instead of:
$ python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
$ python analyze_tagged_corpus.py treebank --sort count --reverse
so that this is truly a command-line tool? Shouldn't be too much work using docopt and adding a console_script entry point in setup.py. What do you think?
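A rough sketch of the dispatching idea follows. It uses `argparse` from the standard library rather than docopt, purely to stay dependency-free. The `SCRIPTS` mapping and the `build_command` function are hypothetical; only the two scripts named in the request above are listed, and the real set of subcommands would be up to the maintainer.

```python
import argparse

# Hypothetical mapping from subcommand to an existing script.
SCRIPTS = {
    'train': 'train_classifier.py',
    'analyze': 'analyze_tagged_corpus.py',
}


def build_command(argv):
    """Translate ['train', 'movie_reviews', ...] into a python command line."""
    # add_help=False so that -h/--help is passed through to the script.
    parser = argparse.ArgumentParser(prog='nltk', add_help=False)
    parser.add_argument('command', choices=sorted(SCRIPTS))
    # parse_known_args leaves the script's own arguments untouched in `rest`.
    args, rest = parser.parse_known_args(argv)
    return ['python', SCRIPTS[args.command]] + rest


print(build_command(['analyze', '--sort', 'count', '--reverse']))
# prints: ['python', 'analyze_tagged_corpus.py', '--sort', 'count', '--reverse']
```

A `console_scripts` entry point in setup.py could then point at a small `main()` that passes `sys.argv[1:]` to `build_command` and runs the result, for example with `subprocess.call`.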
I have my own corpus that I want to use to classify strings as keep/reject. It's not immediately obvious how to use a corpus that's not in the nltk data directory. I'm guessing I'll need to modify the code?
Hi Jacoub,
The documentation states that the code is Python 3 compatible ("These scripts are Python 2 & 3 compatible and work with NLTK 2.0.4 and higher."), while here it says Python 2 only. I tried to run the code on Python 3.7 but ran into compatibility issues.
It would be very nice if you could make the code compatible with Python 3.
Thanks.