collocater's People

Contributors

rtapiaoregui

collocater's Issues

Theory Question: Collocater vs. nltk Collocation

Hi there,

First up, apologies if this is a stupid question - I'm not an NLP person and some of the language and ideas are brand new to me.

So, as I understand it, a collocation is a commonly occurring sequence of words. Before I started looking into NLP this week, I would have called these n-grams, and I think NLTK agrees with me. The NLTK collocation functions primarily look for n-grams, do some filtering and return those (see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py).
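To make the n-gram idea concrete, here is a minimal sketch of frequency-filtered bigram scoring using only the standard library. It is not NLTK's code: the function name bigram_pmi, the min_freq parameter and the sample text are all made up for illustration, and pointwise mutual information is just one association measure that could be used for ranking.

```python
import math
import re
from collections import Counter


def bigram_pmi(text, min_freq=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    words = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_freq:
            continue  # frequency filter: drop pairs seen too rarely
        # PMI compares how often the pair occurs with how often it
        # would occur if the two words were independent.
        scores[(w1, w2)] = math.log2(count * total / (unigrams[w1] * unigrams[w2]))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up sample text, not taken from the novel.
sample = (
    "The escaped convict hid on the moor. "
    "The moor was dark, and the escaped convict knew the moor well."
)
top_pairs = bigram_pmi(sample)
print(top_pairs[0])
```

Because the scores come purely from counts in the text, any pair that recurs can surface, including proper nouns, which would fit the kind of output I describe below.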

So, if I run nltk's collocation on (as an example) The Hound of the Baskervilles, I get phrases like 'Mr. Holmes', 'Grimpen Mire', 'escaped convict' and 'missing boot' - these all seem pretty reasonable given the plot.

But if I run your collocater pipeline I get very different results (and it takes significantly longer to process). The key differences I can see are: there are no proper nouns, I get duplicate entries, and those duplicates don't compare equal, e.g. several 'different' instances of 'look at' are returned.

So, I think the lack of proper nouns is caused by the fact that you're determining collocations from a collocation dictionary so words like 'Sherlock' will never be processed.

The number of duplicate entries, I think, roughly corresponds to the number of times that collocation occurs, and the fact that the duplicates aren't equal is presumably down to the spaCy vectors on those tokens being non-equal.
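To illustrate my guess, here is a rough sketch of a dictionary-driven lookup. This is not collocater's actual implementation: the COLLOCATIONS mapping, the function name and the sample sentence are hypothetical. It shows how a fixed dictionary can never yield 'Sherlock Holmes', and how each occurrence could come back as its own entry. In this sketch the entries differ by token position; whether position or vectors is what makes them differ in collocater is something I haven't confirmed.

```python
import re

# Hypothetical miniature collocation dictionary: headword -> known collocates.
COLLOCATIONS = {
    "look": {"at", "for"},
    "escaped": {"convict"},
}


def find_dictionary_collocations(text):
    """Return (token_index, phrase) for each adjacent pair found in the dictionary."""
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i in range(len(words) - 1):
        head, nxt = words[i].lower(), words[i + 1].lower()
        # Only pairs listed in the dictionary can ever match.
        if nxt in COLLOCATIONS.get(head, ()):
            hits.append((i, f"{head} {nxt}"))
    return hits


text = "Look at Sherlock Holmes. I look at the moor."
hits = find_dictionary_collocations(text)
print(hits)  # [(0, 'look at'), (5, 'look at')]
```

The two hits have the same phrase but are different entries, and 'Sherlock Holmes' is skipped because neither word is a headword in the dictionary.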

So, my first question is: what are you actually doing to determine these collocations? Why do you need to refer to a dictionary source in order to extract these?

I have a series of follow-up questions that are more about implementation than linguistics algorithms but I think I need to understand the linguistic rationale before I start suggesting technical changes.

Hope you don't mind me reaching out like this.

Loader() fails - possibly due to virtual environment or jupyter notebook? Or pickles?

Hi there,
I just found your module and thought I'd give it a go, so I installed it into my spaCy virtual environment with pipenv install collocater, opened a new Jupyter Notebook, copied and pasted the example code from your README.md and hit run. I got the error below.

I don't know enough about what's going on in your code, but there seem to be three obvious possible sources of the problem:

  1. It looks like the system falls down when trying to unpickle a file. Could it be that the code is searching for the file in a system path instead of a virtualenv path?
  2. Could this be down to the way I have my Jupyter Lab configured? My Jupyter ecosystem is installed globally and I associate new virtualenvs with it through the ipykernel module. I know this occasionally causes issues with magic commands, where Jupyter assumes it should be looking in the system path. It seems unlikely, but could something similar be occurring here?
  3. How did you pickle these files? I remember there being some issues going between Windows/Linux or Py2/Py3.

Happy to go back and forth on any ideas.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-f67211300592> in <module>
      3 from pprint import pprint
      4 
----> 5 collie = collocater.Collocater.loader()
      6 nlp = spacy.load('en_core_web_sm')
      7 nlp.add_pipe(collie)

~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/collocater/collocater.py in loader(path)
    143 
    144         if not path:
--> 145             obj = joblib.load(pkr.resource_stream(__name__, 'data/collocater_obj.joblib'))
    146         else:
    147             with open(path, 'rb') as fh:

~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
    573         filename = getattr(fobj, 'name', '')
    574         with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 575             obj = _unpickle(fobj)
    576     else:
    577         with open(filename, 'rb') as f:

~/.local/share/virtualenvs/novel-language-processing-0RQ7FeDi/lib/python3.8/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
    502     obj = None
    503     try:
--> 504         obj = unpickler.load()
    505         if unpickler.compat_mode:
    506             warnings.warn("The file '%s' has been generated with a "

/usr/lib64/python3.8/pickle.py in load(self)
   1208                     raise EOFError
   1209                 assert isinstance(key, bytes_types)
-> 1210                 dispatch[key[0]](self)
   1211         except _Stop as stopinst:
   1212             return stopinst.value

/usr/lib64/python3.8/pickle.py in load_global(self)
   1524         module = self.readline()[:-1].decode("utf-8")
   1525         name = self.readline()[:-1].decode("utf-8")
-> 1526         klass = self.find_class(module, name)
   1527         self.append(klass)
   1528     dispatch[GLOBAL[0]] = load_global

/usr/lib64/python3.8/pickle.py in find_class(self, module, name)
   1579             return _getattribute(sys.modules[module], name)[0]
   1580         else:
-> 1581             return getattr(sys.modules[module], name)
   1582 
   1583     def load_reduce(self):

AttributeError: module '__main__' has no attribute 'Collocater'
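One possible reading of that last line, which I haven't confirmed against the packaging code: pickle stores a class by reference (module name plus class name), not by value. If the object was dumped from a script where Collocater was defined in '__main__', loading it anywhere else would look for '__main__.Collocater' and fail in just this way. The sketch below reproduces the mechanism with the standard library only. The module name builder_script is a made-up stand-in for '__main__', and the empty Collocater class is a placeholder, not the real one.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the script that originally built the pickle.
# In the traceback above, the recorded module is '__main__'.
builder = types.ModuleType("builder_script")
exec("class Collocater:\n    pass\n", builder.__dict__)
sys.modules["builder_script"] = builder

payload = pickle.dumps(builder.Collocater())

# The pickle records the defining module's name, not the class body.
module_name_stored = b"builder_script" in payload

# Simulate a fresh session: a module with that name exists,
# but it does not define Collocater.
sys.modules["builder_script"] = types.ModuleType("builder_script")
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc
finally:
    del sys.modules["builder_script"]

print(type(load_error).__name__)
```

If this is what is happening, the virtualenv and Jupyter setup would be unrelated, and the data file would need to be rebuilt from code that imports Collocater from the collocater package instead of defining it in the script being run.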

dead loop

If the input text is "environmentalists", it gets stuck in an infinite loop.
