rtapiaoregui / collocater
spaCy-integrable pipeline component to identify collocations in text
License: MIT License
Hi there,
First up, apologies if this is a stupid question - I'm not an NLP person and some of the language and ideas are brand new to me.
So, as I understand it, collocation is the idea of commonly occurring sequences of words. Before I actually looked into NLP this week, I would have called these n-grams, and I think NLTK agrees with me. The NLTK collocation functions primarily look for n-grams, filter them and return what is left (see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py).
So, if I run nltk's collocation on (as an example) The Hound of the Baskervilles, I get phrases like 'Mr. Holmes', 'Grimpen Mire', 'escaped convict' and 'missing boot' - these all seem pretty reasonable given the plot.
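To make the n-gram approach described above concrete, here is a minimal standard-library sketch: count adjacent word pairs, drop the rare ones, and rank the rest by pointwise mutual information. It is not NLTK's code; the function name `bigram_collocations`, its parameters and the sample text are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def bigram_collocations(text, min_count=2, top_n=5):
    """Return the top_n adjacent word pairs ranked by PMI (illustrative sketch)."""
    # Crude tokenizer: lowercase words only, punctuation dropped, so pairs
    # that straddle a sentence boundary are counted too.
    tokens = re.findall(r"[a-z']+", text.lower())
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # frequency filter: ignore pairs seen too rarely
        # PMI = log2(P(first, second) / (P(first) * P(second))),
        # approximating every probability as count / total.
        pmi = math.log2(count * total / (unigram_counts[first] * unigram_counts[second]))
        scored.append(((first, second), pmi))

    # Highest PMI first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in scored[:top_n]]


# Hypothetical sample text, not taken from the novel.
sample = (
    "The escaped convict hid on the moor. "
    "The escaped convict was seen near the Grimpen Mire. "
    "Nobody crosses the Grimpen Mire at night."
)
top_pairs = bigram_collocations(sample, min_count=2, top_n=2)
print(top_pairs)
```

Because nothing here depends on a word list, a proper noun pair such as "Grimpen Mire" is ranked the same way as any other pair.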
But if I run your collocater pipeline I get very different results (and it takes significantly longer to process). The key differences I can see are that there are no proper nouns and that I get duplicate entries that don't compare equal, e.g. I have several 'different' 'look at's returned.
So, I think the lack of proper nouns is caused by the fact that you're determining collocations from a collocation dictionary, so words like 'Sherlock' will never be processed.
The number of duplicate entries, I think, roughly corresponds to the number of times that collocation occurs, and the fact that the duplicates aren't equal is presumably down to the spaCy vectors on those tokens being non-equal.
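The behaviour guessed at above can be reproduced with a toy dictionary lookup. This is only a sketch of the hypothesis, not collocater's actual implementation: `KNOWN_COLLOCATIONS`, `find_known_collocations` and the sample sentence are all invented for illustration.

```python
from collections import Counter

# Hypothetical mini-dictionary; the real package's data is not shown here.
KNOWN_COLLOCATIONS = {("look", "at"), ("escaped", "convict")}


def find_known_collocations(tokens):
    """Return (start_index, text) for each adjacent pair found in the dictionary."""
    matches = []
    for start in range(len(tokens) - 1):
        pair = (tokens[start].lower(), tokens[start + 1].lower())
        if pair in KNOWN_COLLOCATIONS:
            matches.append((start, " ".join(pair)))
    return matches


tokens = "Sherlock Holmes did look at the boot and look at the moor".split()
matches = find_known_collocations(tokens)

# One hit per occurrence: same text, different positions, so the hits
# compare unequal unless they are grouped by their text.
unique_counts = Counter(text for _, text in matches)
print(matches)
print(unique_counts)
```

In this sketch 'Sherlock Holmes' is never reported because it is not in the dictionary, and 'look at' appears once per occurrence. Grouping by text, as `unique_counts` does, collapses the repeats.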
So, my first question is: what are you actually doing to determine these collocations? Why do you need to refer to a dictionary source in order to extract these?
I have a series of follow-up questions that are more about implementation than linguistics algorithms but I think I need to understand the linguistic rationale before I start suggesting technical changes.
Hope you don't mind me reaching out like this.
Hi there,
I just found your module and thought I'd give it a go, so I installed it into my spaCy virtual environment with pipenv install collocater, loaded up a new Jupyter Notebook, copied and pasted your example code from the README.md and hit run. I got the error below.
So, I don't know enough about what's going on in your code but there seem to be three obvious sources that could cause problems here?
Happy to go back and forth on any ideas.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-f67211300592> in <module>
3 from pprint import pprint
4
----> 5 collie = collocater.Collocater.loader()
6 nlp = spacy.load('en_core_web_sm')
7 nlp.add_pipe(collie)
~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/collocater/collocater.py in loader(path)
143
144 if not path:
--> 145 obj = joblib.load(pkr.resource_stream(__name__, 'data/collocater_obj.joblib'))
146 else:
147 with open(path, 'rb') as fh:
~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
573 filename = getattr(fobj, 'name', '')
574 with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 575 obj = _unpickle(fobj)
576 else:
577 with open(filename, 'rb') as f:
~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
502 obj = None
503 try:
--> 504 obj = unpickler.load()
505 if unpickler.compat_mode:
506 warnings.warn("The file '%s' has been generated with a "
/usr/lib64/python3.8/pickle.py in load(self)
1208 raise EOFError
1209 assert isinstance(key, bytes_types)
-> 1210 dispatch[key[0]](self)
1211 except _Stop as stopinst:
1212 return stopinst.value
/usr/lib64/python3.8/pickle.py in load_global(self)
1524 module = self.readline()[:-1].decode("utf-8")
1525 name = self.readline()[:-1].decode("utf-8")
-> 1526 klass = self.find_class(module, name)
1527 self.append(klass)
1528 dispatch[GLOBAL[0]] = load_global
/usr/lib64/python3.8/pickle.py in find_class(self, module, name)
1579 return _getattribute(sys.modules[module], name)[0]
1580 else:
-> 1581 return getattr(sys.modules[module], name)
1582
1583 def load_reduce(self):
AttributeError: module '__main__' has no attribute 'Collocater'
If the input text is "environmentalists", it gets stuck in an infinite loop.