cadmiumcr / cadmium


Natural Language Processing (NLP) library for Crystal

Home Page: https://cadmiumcr.com

License: MIT License

Crystal 100.00%

string-distance stemmer inflector sentiment-analysis phonetics transliterator nlp tf-idf wordnet readability

cadmium's Introduction

Cadmium is a Natural Language Processing (NLP) library for Crystal.

For full API documentation check out the docs.

For more complete and up to date information about specific parts of Cadmium, check out each relevant shard repository.

Shard name	Description
cadmium_tokenizer	Contains several types of string tokenizers
cadmium_stemmer	Contains a Porter stemmer, useful for getting the stems of English words
cadmium_ngrams	Contains methods to obtain unigrams, bigrams, trigrams or n-grams from strings
cadmium_classifier	Contains two probabilistic classifiers used in NLP operations such as language detection or POS tagging
cadmium_readability	Analyzes blocks of text and determines, using various algorithms, the readability of the text.
cadmium_tfidf	Calculates the Term Frequency–Inverse Document Frequency of a corpus
cadmium_glove	Pure Crystal implementation of Global Vectors for Word Representations
cadmium_pos_tagger	Tags each token of a text with its Part Of Speech category
cadmium_lemmatizer	Returns the lemma of each given string token
cadmium_summarizer	Extracts the most meaningful sentences of a text to create a summary
cadmium_sentiment	Evaluates the sentiment of a text
cadmium_distance	Provides two string distance algorithms
cadmium_transliterator	Provides the ability to transliterate UTF-8 strings into pure ASCII so that they can be safely displayed in URL slugs or file names.
cadmium_phonetics	Matches a string with its sound representation
cadmium_inflector	Inflects English words (nouns, verbs and numbers)
cadmium_graph	EdgeWeightedDigraph represents a digraph: you can add an edge, get the number of vertices and edges, get all edges, and use toString to print the digraph.
cadmium_trie	A trie is a data structure for efficiently storing and retrieving strings with identical prefixes, like "meet" and "meek".
cadmium_wordnet	Pure Crystal implementation of Stanford NLP's WordNet
cadmium_util	A collection of useful utilities used internally in Cadmium.
cadmium_language_detector	Returns the most probable language code of the analysed text.

Installation

Your project should only include the Cadmium shard(s) you need.

However, in case you want to test out all of Cadmium in a simple way, you can install all modules of the project in a few lines.

Add this to your application's shard.yml:

dependencies:
  cadmium:
    github: cadmiumcr/cadmium
    branch: master

Contributing

Fork it ( https://github.com/cadmiumcr/cadmium/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Contributors

This project exists thanks to all the people who contribute.

cadmium's People


marcatatem adoxography exploitd johnjansen fossabot andrewzhuk dscottboggs rmarronnier gitter-badger hugoabonizio lcpo coryvegan haydncci lxrst

cadmium's Issues

Website links out of date

Just letting you know that the links under "Content" are a bit out of date. They link to https://github.com/cadmiumcr/cadmium#stemmers, for example, but there is no #stemmers headline. You probably meant to link to https://github.com/cadmiumcr/stemmer?

This is like that for all those links.

Negative values for readability scores

test = "*-/ /*/"
test_readability = Cadmium.readability.new(test).fog # or flesch or kincaid
puts test_readability

Outputs

-NaN

For some longer texts (which I can provide if you need them) I get finite negative values (e.g. -620).

Is this the expected behavior? Shouldn't the value be set to 0?
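
To make the question concrete, here is a minimal sketch using the standard Flesch reading-ease formula. It is not Cadmium's actual implementation, and `flesch_reading_ease` and `guarded_flesch` are hypothetical helper names. It shows that a text with no countable words produces NaN through a division by zero, and that the formula itself has no lower bound, so negative scores are possible for very dense text.

```crystal
# Standard Flesch reading-ease formula, computed from raw counts.
# Assumption: this is NOT taken from Cadmium's source; it only illustrates the arithmetic.
def flesch_reading_ease(words : Int32, sentences : Int32, syllables : Int32) : Float64
  206.835 - 1.015 * (words.to_f / sentences) - 84.6 * (syllables.to_f / words)
end

# Hypothetical guarded variant: 0.0 for empty input, and never below 0.0.
def guarded_flesch(words : Int32, sentences : Int32, syllables : Int32) : Float64
  return 0.0 if words == 0 || sentences == 0
  Math.max(0.0, flesch_reading_ease(words, sentences, syllables))
end

puts flesch_reading_ease(0, 0, 0).nan?  # no words at all: 0.0 / 0 is NaN
puts flesch_reading_ease(10, 1, 40) < 0 # four syllables per word on average: negative score
puts guarded_flesch(0, 0, 0)
```

Whether clamping to 0 is desirable is a design choice: it hides the NaN case, but it also discards the information that a text scored far below the usual range.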

Proposal: Cadmium::POSTagger

Preface

As discussed in #31, Cadmium::Lemmatizer needs a Token object with POS and morphology data to work properly and be fully tested.
The aim of this proposal is to implement a Cadmium::POSTagger that will create such a Token object for each input string.
The first tagging algorithm I'm planning to implement is the Viterbi algorithm.
If I can generalize it enough, the plan is to move it to Cadmium::Classifier so it can be used for other purposes.
I'm also planning to implement dynamic feature induction later, which could also be used for Named Entity Recognition (and so be moved to Classifier).
As with the Tokenizer and Summarizer modules, the plan is to make it possible to choose a specific algorithm instead of being tied to a single one.

Details

I propose implementing this through the following steps:

Create a cadmiumcr/pos_tagger repository
Implement a POC POS tagger with the Viterbi algorithm
If the algorithm can be generalized (i.e. it is not too specific to POS tagging), move it to Cadmium::Classifier::Viterbi
Move the working POS tagger to its repository along with the English tagging data
Push data for other languages to the cadmiumcr/languages repository (I'm not sure about this one yet, without knowing the sizes of the models)
Move the Token struct to Cadmium::Utils, as it will be used by at least the POS tagger and the Lemmatizer
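
For readers unfamiliar with the Viterbi algorithm mentioned above, here is a minimal sketch over a toy hidden Markov model. Everything in it is an assumption made for illustration: the two-tag state set, the probability tables, the `UNSEEN` fallback and the `viterbi` method name are invented and are not part of the proposed Cadmium::POSTagger.

```crystal
# Toy HMM: every probability below is invented for illustration only.
STATES = ["NOUN", "VERB"]
START  = {"NOUN" => 0.6, "VERB" => 0.4}
TRANS  = {
  "NOUN" => {"NOUN" => 0.3, "VERB" => 0.7},
  "VERB" => {"NOUN" => 0.8, "VERB" => 0.2},
}
EMIT = {
  "NOUN" => {"dogs" => 0.5, "bark" => 0.1, "fleas" => 0.4},
  "VERB" => {"dogs" => 0.1, "bark" => 0.8, "fleas" => 0.1},
}
UNSEEN = 1e-6 # emission probability assumed for words missing from EMIT

def viterbi(words : Array(String)) : Array(String)
  return [] of String if words.empty?

  # best[t][state] = probability of the best tag path ending in `state` at word t
  initial = {} of String => Float64
  STATES.each { |state| initial[state] = START[state] * EMIT[state].fetch(words[0], UNSEEN) }
  best = [initial]
  back = [] of Hash(String, String)

  words[1..].each do |word|
    scores = {} of String => Float64
    pointers = {} of String => String
    STATES.each do |state|
      # Pick the previous state that maximizes the path probability into `state`.
      prev = STATES.max_by { |candidate| best.last[candidate] * TRANS[candidate][state] }
      scores[state] = best.last[prev] * TRANS[prev][state] * EMIT[state].fetch(word, UNSEEN)
      pointers[state] = prev
    end
    best << scores
    back << pointers
  end

  # Follow the back-pointers from the most probable final state.
  tags = [STATES.max_by { |state| best.last[state] }]
  back.reverse_each { |pointers| tags << pointers[tags.last] }
  tags.reverse
end

p viterbi(["dogs", "bark"]) # => ["NOUN", "VERB"]
```

A real tagger would typically work with log probabilities to avoid underflow on long sentences and would learn the tables from a tagged corpus.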

References

List of existing POS Taggers

Text summarizers

I've looked at sumy and implemented the Luhn method in Crystal using Cadmium.

I'm planning to implement more methods.

Would you be interested in a PR adding a text summarizer module to Cadmium?

Moving all cadmiumcr/rfcs issues to cadmiumcr/cadmium (i.e. here)

As all cadmium shards have their own repo, there is no point for users to open issues here, unless it's for new proposals which are not covered by existing shards, which cadmiumcr/rfcs is about.

IMO, it would be clearer to move all RFCs/proposals here and close cadmiumcr/rfcs.

We should also move your templates (very useful) here.

@watzon: WDYT?

Problem using the Cadmium shard: Error: can't find file 'cadmium'

Hi,
I am attempting to use the Cadmium shard: https://github.com/cadmiumcr/cadmium

I have created a shard.yml file:

name: tst
version: 0.1.0

authors:
  - serge <[email protected]>

targets:
  tst:
    main: src/tst.cr

crystal: 1.2.2

license: MIT

dependencies:
  cadmium:
    github: cadmiumcr/cadmium
    branch: master

and here is the snippet of code (from the Cadmium docs) I'm trying to compile:

require "cadmium"

tokenizer = Cadmium.word_punctuation_tokenizer.new
tokenizer.tokenize("my dog hasn't any fleas.")

Yet, I get the error message:

In src/tst.cr:1:1

 1 | require "cadmium"
     ^
Error: can't find file 'cadmium'

NB: When running shards install, Cadmium seemed to install...

Add a POS tagger

POS tagging is the categorizing of words in a sentence by part of speech, relative to the other words in the sentence. This can be done very simply with WordNet, but to accomplish full POS tagging each word must be tagged based on its relationship to the previous word and the next word. Some "words" are also made up of multiple grams, such as "New York".

The tagger should be able to see a word like "New" and know that if "York", "Jersey", "Amsterdam", or any number of other words appear next to it that there is a very good possibility that they should be counted as one word, and a proper noun at that.

Proposal: Cadmium::Lemmatizer

Preface

Cadmium has a stemmer which is used downstream in several other modules. Its usefulness is not to be questioned.

However, relying only on a stemmer limits Cadmium in several ways:

Stemming a word strips its grammatical meaning, which renders POS tagging impossible.
Stemming algorithms are implemented for only a handful of languages. Some languages are by their nature very difficult to stem.

In practice, lemmatization essentially maps tokens to lemmas through a lookup table (or dictionary) and applies additional rules depending on the token found.
Lemma lookup tables for many languages are freely available under MIT-compatible licenses.
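
As a minimal sketch of the lookup-table idea, assuming a tiny hand-written table (the `LEMMAS` constant and `lemmatize` method are hypothetical and not part of any existing Cadmium shard):

```crystal
# Toy lookup table; a real one would be loaded from a per-language data file.
LEMMAS = {
  "mice"    => "mouse",
  "was"     => "be",
  "running" => "run",
}

# Hypothetical helper: fall back to the token itself when the table has no entry.
def lemmatize(token : String) : String
  LEMMAS.fetch(token.downcase, token)
end

puts lemmatize("Mice")
puts lemmatize("fleas")
```

The fallback is where the additional rules mentioned above would plug in, for instance choosing between candidate lemmas according to the token's POS tag.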

Details

Create a new lemmatizer repository in cadmiumcr
Create in it a Cadmium::Lemmatizer module modeled on Cadmium::Util::StopWords
Its data folder will only contain the English JSON file of lemmas (several MB in size)
The name of the shard will be cadmium_lemmatizer

The real difficulty is, IMO, how to deal with data for other languages.

Here are several realistic possibilities:

Create a cadmium_i18n_data shard containing the data for all languages (might weigh tens of MB of JSON)
Create a cadmium_XX_data shard for each language containing its own data. (Can we group repos in a folder on GitHub?)
Host the data somewhere and ask developers to download it according to their needs.
Don't provide the data at all and point developers to possible sources.

Those are the solutions I could come up with, but if you have other ideas, do tell!

References

spaCy has a good implementation of lemmatizers.

You can check their GitHub repository to get an idea of what the data looks like: the Spanish language data, for example

Add spellchecker

Spell checking can be accomplished in a number of different ways, none of them particularly fast, unfortunately. The basic spell checker has a dictionary of words; any token that doesn't match one of the words in the dictionary is considered wrong. Suggestions are implemented by calculating the distance between the incorrectly spelled word and every word in the dictionary. Misspelled words are typically within a distance of 1 of the intended word, so words with a distance of 1 would be returned.

Ideally the tolerance would be configurable, with the knowledge that the higher the tolerance the longer it will take. I believe the time complexity is O(log n).
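
A minimal sketch of this approach, using the `Levenshtein` module from Crystal's standard library. The `suggestions` method, its `tolerance` parameter and the sample dictionary are hypothetical names chosen for illustration, not an existing Cadmium API.

```crystal
require "levenshtein"

# Hypothetical helper: return the dictionary words within `tolerance` edits of `word`.
# A word that is already in the dictionary is treated as correctly spelled.
def suggestions(word : String, dictionary : Array(String), tolerance : Int32 = 1) : Array(String)
  return [] of String if dictionary.includes?(word)
  dictionary.select { |candidate| Levenshtein.distance(word, candidate) <= tolerance }
end

dictionary = ["meet", "meek", "melt", "dog", "flea"]
p suggestions("meat", dictionary) # => ["meet", "melt"]
```

Note that this naive scan computes one edit distance per dictionary entry, so its cost grows linearly with the dictionary size; a faster lookup would need some kind of index structure on top of it.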

Add Word2Vec

Word2Vec is an algorithm created at Google that makes words or phrases comparable by assigning each individual word its own position in vector space. During training, the vectors are adjusted so that words used in similar contexts end up close together, yielding a map of words and their similarity to other words.
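
Once words have vectors, comparing them usually comes down to cosine similarity. The sketch below shows only that comparison step, not the Word2Vec training itself; the three-dimensional vectors are made up for illustration, and `cosine_similarity` is a hypothetical helper name.

```crystal
# Cosine similarity between two vectors of equal length.
def cosine_similarity(a : Array(Float64), b : Array(Float64)) : Float64
  raise ArgumentError.new("vectors must have the same length") unless a.size == b.size
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Invented toy vectors; real embeddings have hundreds of dimensions.
king  = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
apple = [0.1, 0.2, 0.9]

puts cosine_similarity(king, queen) # close to 1: similar direction
puts cosine_similarity(king, apple) # noticeably lower
```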

Not enough awesome

You heard me.

Add Word Error Rate evaluator

A WER evaluator would be a nice addition to Cadmium IMO.

I can work on a PR if you're ok with it.

More evaluators can be implemented so I'm planning to create a Cadmium::Evaluators module.
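
For reference, Word Error Rate is the word-level edit distance between a hypothesis and a reference (substitutions, deletions and insertions), divided by the number of words in the reference. A minimal sketch, assuming simple whitespace tokenization; `word_error_rate` is a hypothetical name, not an existing Cadmium method.

```crystal
# Word Error Rate: word-level edit distance divided by the reference length.
def word_error_rate(reference : String, hypothesis : String) : Float64
  ref = reference.split
  hyp = hypothesis.split
  raise ArgumentError.new("reference must not be empty") if ref.empty?

  # previous[j] = edit distance between the reference words seen so far and hyp[0, j]
  previous = (0..hyp.size).to_a
  ref.each_with_index do |ref_word, i|
    current = [i + 1]
    hyp.each_with_index do |hyp_word, j|
      cost = ref_word == hyp_word ? 0 : 1
      current << {previous[j + 1] + 1, current[j] + 1, previous[j] + cost}.min
    end
    previous = current
  end
  previous.last / ref.size
end

# One word of the five-word reference is missing from the hypothesis.
puts word_error_rate("my dog has no fleas", "my dog has fleas") # => 0.2
```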

Proposal: Corpus, Document, Sentence, Token, Language components

As you can see browsing Cadmium shards source code, several entities (for lack of a better word) are declared in different locations and in different ways.

This is not just a namespace or redundancy issue; we'd benefit from having fundamental classes or structs describing the tokens, sentences and documents we're dealing with.

I've started declaring such structs and objects in the pos_tagger. It's a WIP and things might change as we discover what we need and don't need for higher-level text processing functions.

What stands out to me is the neat way the Language is declared in cadmium_tokenizer. Since language information is used in different shards (especially the language codes), we could move a big part of it into a language module in cadmium_utils and keep some language-specific information (abbreviations, tag maps, etc.) in the respective shards.

More examples off the top of my head:

The Document struct or class might benefit from cadmium_tfidf and vice versa.

Cadmium::Utils::Sentence might be renamed to Sentencizer (is that a word?) so that a Cadmium::Sentence can exist without conflict.

I'd like to point out that having these classes or structs won't prevent users from processing raw text without creating these objects. But they are needed to keep morphosyntactic information about the tokens and sentences.

Add language detector

I'm working on a Crystal port of franc and, once it's working, I'd like to merge it into Cadmium if you're ok with this.

Fix ameba lint errors

Currently the build is failing because of ameba lint errors. These include:

src/cadmium/distance/jaro_winkler.cr:64:13
- [C] Metrics/CyclomaticComplexity: Cyclomatic complexity too high [24/10]
src/cadmium/sentiment.cr:42:5
- [C] Metrics/CyclomaticComplexity: Cyclomatic complexity too high [11/10]
src/cadmium/tokenizer/aggressive_tokenizer.cr:14:5
- [C] Metrics/CyclomaticComplexity: Cyclomatic complexity too high [11/10]

It looks like the main problem is the cyclomatic complexity of some of the algorithms. Where possible this should be reduced.

Proposal: Evaluator and/or Benchmark repositories

Preface

Evaluating the accuracy of the output of an NLP component is a science in itself.

When a new NLP algorithm, method or tool is published, it is always accompanied by benchmarks against existing systems.

Those benchmarks are produced using standard evaluation techniques and datasets.

These evaluation techniques are not always automatic.

A human judgment is sometimes necessary. In this case, there's nothing Cadmium can do to help.

However, a set of tools already exists, depending on the NLP task being tested:

Precision, recall and F1 score are useful statistical metrics when evaluating classification tasks such as POS tagging or sentiment analysis.
METEOR or BLEU, if Cadmium ever does machine translation.
ROUGE for summarization evaluation.
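
As a minimal sketch of the first item, precision, recall and F1 for a single label can be computed from parallel lists of gold and predicted labels. The `precision_recall_f1` method and the sample tags are hypothetical, invented for illustration.

```crystal
# Precision, recall and F1 for one target label.
def precision_recall_f1(gold : Array(String), predicted : Array(String), label : String)
  raise ArgumentError.new("size mismatch") unless gold.size == predicted.size
  pairs = gold.zip(predicted)
  true_pos  = pairs.count { |g, pr| g == label && pr == label }
  false_pos = pairs.count { |g, pr| g != label && pr == label }
  false_neg = pairs.count { |g, pr| g == label && pr != label }

  precision = true_pos + false_pos == 0 ? 0.0 : true_pos / (true_pos + false_pos)
  recall = true_pos + false_neg == 0 ? 0.0 : true_pos / (true_pos + false_neg)
  f1 = precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall)
  {precision: precision, recall: recall, f1: f1}
end

gold      = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
predicted = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
p precision_recall_f1(gold, predicted, "NOUN")
```

In this example the NOUN label has two true positives, one false positive and one false negative, so precision, recall and F1 all come out at two thirds.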

To those tools we can add standard datasets and corpora that are already gold-labeled and human-checked.

These are just examples found after a cursory search. The list is bigger and the tools get better fast.

Details

The main idea of this proposal is to:

Create a cadmiumcr/evaluator repository.
This module will have the tools listed above and methods to conveniently download the large datasets of gold labelled data.
Create a cadmiumcr/benchmark repository.
This repository will be more like a custom set of Crystal scripts using the tools of Cadmium::Evaluator to run benchmarks against the vanilla tools of Cadmium (classifiers, POS tagging, language identification, etc.) and display the results next to those of competing tools.

The point is to give a glimpse of Cadmium's possibilities and to routinely check the accuracy of our tools (which crystal spec is not intended to do).

This proposal is mainly a braindump, as I don't intend to start working on this in the short term (I have to finish my POS tagger first!)

Binary dependent on hard coded data path

Great job on this fantastic lib!

I'm developing a web app deployed on Heroku, and after the Crystal binary is compiled all libraries / shards are discarded.

My app compiles fine with Cadmium but crashes because it can't find files inside the data folder:

2019-07-08T20:22:33.238998+00:00 app[web.1]: Unhandled exception: Error opening file '/tmp/build_b996c138b6ca4a4c639a8e6ba2a58942/lib/cadmium/src/cadmium/../../data/sentiment.txt' with mode 'r': No such file or directory (Errno)
2019-07-08T20:22:33.239028+00:00 app[web.1]: from /tmp/crystal/share/crystal/src/crystal/system/file.cr:7:7 in 'new'
2019-07-08T20:22:33.239031+00:00 app[web.1]: from /tmp/crystal/share/crystal/src/file.cr:591:12 in 'read'
2019-07-08T20:22:33.239033+00:00 app[web.1]: from /tmp/crystal/share/crystal/src/path.cr:102:5 in '__crystal_main'
2019-07-08T20:22:33.239036+00:00 app[web.1]: from /tmp/crystal/share/crystal/src/crystal/main.cr:47:14 in 'main'
2019-07-08T20:22:33.239038+00:00 app[web.1]: from /tmp/crystal/share/crystal/src/hash.cr:63:3 in '__libc_start_main'
2019-07-08T20:22:33.239040+00:00 app[web.1]: from _start
2019-07-08T20:22:33.239042+00:00 app[web.1]: from ???

Should binaries compiled against Cadmium depend on Cadmium being present on the target computer?

I think not, but there are several solutions to this problem:

1 - Put all data inside .cr files in arrays / hashes / tuples

2 - Make the path to data files configurable (via a yml config file or ENV variables)

3 - ??? I can't think of more :-)
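
To illustrate option 2, here is a minimal sketch of resolving data files through an environment variable, with a source-relative default. `CADMIUM_DATA_DIR` and `data_path` are hypothetical names invented for this example; Cadmium does not necessarily read any such variable.

```crystal
# Hypothetical: CADMIUM_DATA_DIR is an illustrative variable name, not an existing Cadmium setting.
# Without it, fall back to a data folder located relative to this source file.
def data_path(filename : String) : String
  base = ENV.fetch("CADMIUM_DATA_DIR", File.join(__DIR__, "..", "data"))
  File.join(base, filename)
end

puts data_path("sentiment.txt")
```

On a platform that discards the build directory, the variable would point at a directory shipped alongside the binary. For option 1, Crystal's `read_file` macro method can embed a file's contents at compile time, which removes the runtime lookup entirely at the cost of a larger binary.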

If I missed something in the docs, feel free to close this issue ;-)
