
johnsnowlabs / spark-nlp


State of the Art Natural Language Processing

Home Page: https://sparknlp.org/

License: Apache License 2.0

Scala 68.54% Python 28.35% Java 2.85% HTML 0.18% Shell 0.07% Makefile 0.01%
nlp natural-language-processing spark pyspark named-entity-recognition sentiment-analysis lemmatizer spell-checker entity-extraction part-of-speech-tagger

spark-nlp's Introduction

John Snow Labs: State-of-the-art NLP in Python

The John Snow Labs library provides a simple & unified Python API for delivering enterprise-grade natural language processing solutions:

  1. 15,000+ free NLP models in 250+ languages in one line of code. Production-grade, scalable, trainable, and 100% open-source.
  2. Open-source libraries for Responsible AI (NLP Test), Explainable AI (NLP Display), and No-Code AI (NLP Lab).
  3. 1,000+ healthcare NLP models and 1,000+ legal & finance NLP models with a John Snow Labs license subscription.

Homepage: https://www.johnsnowlabs.com/

Docs & Demos: https://nlp.johnsnowlabs.com/

Features

Powered by John Snow Labs Enterprise-Grade Ecosystem:

  • 🚀 Spark-NLP : State of the art NLP at scale!
  • 🤖 NLU : 1 line of code to conquer NLP!
  • 🕶 Visual NLP : Empower your NLP with a set of eyes!
  • 💊 Healthcare NLP : Heal the world with NLP!
  • Legal NLP : Bring justice with NLP!
  • 💲 Finance NLP : Understand Financial Markets with NLP!
  • 🎨 NLP-Display Visualize and Explain NLP!
  • 📊 NLP-Test : Deliver Reliable, Safe and Effective Models!
  • 🔬 NLP-Lab : No-Code Tool to Annotate & Train new Models!

Installation

! pip install johnsnowlabs

from johnsnowlabs import nlp
nlp.load('emotion').predict('Wow that was easy!')

See the documentation for more details.

Usage

These are examples of getting things done with one line of code. See the General Concepts Documentation for building custom pipelines.

# Example of Named Entity Recognition
nlp.load('ner').predict("Dr. John Snow is a British physician born in 1813")

Returns:

| entities  | entities_class | entities_confidence |
|-----------|----------------|---------------------|
| John Snow | PERSON         | 0.9746              |
| British   | NORP           | 0.9928              |
| 1813      | DATE           | 0.5841              |
# Example of Question Answering
nlp.load('answer_question').predict("What is the capital of France")

Returns:

| text                          | answer |
|-------------------------------|--------|
| What is the capital of France | Paris  |
# Example of Sentiment classification
nlp.load('sentiment').predict("Well this was easy!")

Returns:

| text                | sentiment_class | sentiment_confidence |
|---------------------|-----------------|----------------------|
| Well this was easy! | pos             | 0.999901             |
nlp.load('ner').viz('Bill goes to New York')

Returns:
[NER visualization rendered by .viz()]

For a full overview see the 1-liners Reference and the Workshop.

Use Licensed Products

To use John Snow Labs' paid products like Healthcare NLP, Visual NLP, Legal NLP, or Finance NLP, get a license key and then call nlp.install() to use them:

! pip install johnsnowlabs
# Install paid libraries via a browser login to connect to your account
from johnsnowlabs import nlp
nlp.install()
# Start a licensed session
nlp.start()
nlp.load('en.med_ner.oncology_wip').predict("Woman is on chemotherapy, carboplatin 300 mg/m2.")

Usage

These are examples of getting things done with one line of code using licensed models. See the General Concepts Documentation for building custom pipelines.

# Visualize entity resolution with ICD-10-CM codes
nlp.load('en.resolve.icd10cm.augmented')\
    .viz('Patient with history of prior tobacco use, nausea, nose bleeding and chronic renal insufficiency.')

Returns:

[Entity resolution visualization rendered by .viz()]

# Temporal relationship extraction & visualization
nlp.load('relation.temporal_events')\
    .viz('The patient developed cancer after a mercury poisoning in 1999')

Returns:

[Temporal relation visualization rendered by .viz()]

Helpful Resources

Take a look at the official Johnsnowlabs page: https://nlp.johnsnowlabs.com for user documentation and examples.

| Resource | Description |
|----------|-------------|
| General Concepts | General concepts in the Johnsnowlabs library |
| Overview of 1-liners | Most commonly used models and their results |
| Overview of 1-liners for healthcare | Most commonly used healthcare models and their results |
| Overview of all 1-liner Notebooks | 100+ tutorials on how to use the 1-liners on text datasets for various problems and from various sources like Twitter, Chinese news, crypto news headlines, airline traffic communication, and product review classifier training |
| Connect with us on Slack | Problems, questions or suggestions? We have a very active and helpful community of over 2,000 AI enthusiasts putting Johnsnowlabs products to good use |
| Discussion Forum | Want a more in-depth discussion with the community? Post a thread in our discussion forum |
| Github Issues | Report a bug |
| Custom Installation | Custom installations, air-gap mode and other alternatives |
| The nlp.load(<Model>) function | Load any model or pipeline in one line of code |
| The nlp.load(<Model>).predict(data) function | Predict on strings, lists of strings, NumPy arrays, Pandas, Modin and Spark DataFrames |
| The nlp.load(<train.Model>).fit(data) function | Train a text classifier for 2-class, N-class, or multi-N-class problems, named entity recognition, or part-of-speech tagging |
| The nlp.load(<Model>).viz(data) function | Visualize the results of word embedding similarity matrices, named entity recognizers, dependency trees & parts of speech, entity resolution, entity linking, or entity status assertion |
| The nlp.load(<Model>).viz_streamlit(data) function | Display an interactive GUI which lets you explore and test every model and feature in the Johnsnowlabs 1-liner repertoire in one click |

License

This library is licensed under the Apache 2.0 license. John Snow Labs' paid products are subject to this End User License Agreement.
By calling nlp.install() to add them to your environment, you agree to its terms and conditions.

spark-nlp's People

Contributors

actions-user, agsfer, ahmedlone127, albertoandreottiatgmail, aleksei-ai, alinapetukhova, anju-jsl, bvannah, c-k-loan, danilojsl, devintdha, diatrambitas, drbonis, fernandrez, impr0grammer, josejuanmartinez, kolia1985, kshitizgit, maziyarpanahi, murat-gunay, pabla, prabod, riyajohnsnow, saif-ellafi, showy, tshimanga, vankov, vkocaman, wolliq, xusliebana


spark-nlp's Issues

Unclear documentation on how to properly use the POSTagger

Description

The documentation does not provide a clear way to run the POSTagger. The annotator documentation gives the following snippet:

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

However, using this snippet results in a NullPointerException rather than running.

Expected Behavior

It would be expected that this snippet could be added to a reasonable workflow, such as the pipeline provided in the Quickstart documentation, without crashing.

Current Behavior

Adding the POSTagger to the pipeline results in a NullPointerException:

scala> pipeline.fit(data).transform(data).show()
java.lang.NullPointerException
  at java.io.FilterInputStream.read(FilterInputStream.java:133)
  at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
  at java.io.InputStreamReader.read(InputStreamReader.java:184)
  at java.io.BufferedReader.fill(BufferedReader.java:161)
  at java.io.BufferedReader.readLine(BufferedReader.java:324)
  at java.io.BufferedReader.readLine(BufferedReader.java:389)
  at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at scala.collection.AbstractIterator.to(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.parsePOSCorpusFromDir(PerceptronApproach.scala:227)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.retrievePOSCorpus(PerceptronApproach.scala:246)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:84)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:22)
  at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:28)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
  at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
  ... 54 elided

Possible Solution

The documentation mentions a setCorpusPath config method. From my brief perusal of the code, it appears that setting the corpus path is required, since it does not have a default value. If that is the case, how to set the corpus path should be explained in the documentation along with a full example. Ideally one would not need to specify a corpus, or this library would provide pre-trained models on various corpora.
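A sketch of that workaround, assuming setCorpusPath accepts a path to a POS-tagged corpus as the documentation suggests (the path below is a placeholder):

import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach

// Hypothetical workaround sketch: point the approach at a tagged corpus before fitting.
val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .setCorpusPath("/path/to/pos-tagged-corpus")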

Steps to Reproduce

Enter into the spark shell using

spark-shell --packages JohnSnowLabs:spark-nlp:1.2.2

and then run the following code

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetectorModel
import org.apache.spark.ml.Pipeline

import spark.implicits._
import spark.sql

// Used my own data, adding the data from the notebook as an example
val data = spark.read.parquet("../sentiment.parquet").limit(1000)

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")


val sentenceDetector = new SentenceDetectorModel().setInputCols(Array("document")).setOutputCol("sentence")

val regexTokenizer = new RegexTokenizer().setInputCols(Array("sentence")).setOutputCol("token")

val posTagger = new PerceptronApproach().setInputCols(Array("sentence", "token")).setOutputCol("pos")

val finisher = new Finisher().setInputCols("pos").setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        posTagger,
        finisher
    ))

pipeline.fit(data).transform(data).show()

Context

I was trying to pass data to what I assumed was a pre-trained model for use in an NLP pipeline.

Your Environment

Spark version: 2.1.1
spark-nlp version: 1.2.2
Running on Amazon EMR

tests-jar in release

Description

When implementing my own AnnotatorModel, I would love to (re)use the test fixtures of Spark NLP, such as AnnotatorBuilder or DataBuilder.

Expected Behavior

spark-nlp-${spark-nlp-version}-tests.jar is provided in the spark repo, so that it can be used as a test-dependency

Current Behavior

No spark-nlp-${spark-nlp-version}-tests.jar is provided

Possible Solution

In Maven, the jar plugin would take care of that. However, I don't know the equivalent in SBT.
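For reference, a minimal sketch of the SBT counterpart (a standard sbt setting, not something I have verified in this project's build):

// In build.sbt: also package and publish the test classes as a <name>-tests.jar artifact
publishArtifact in (Test, packageBin) := true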

Steps to Reproduce

Maven dependency:

<dependency>
  <groupId>JohnSnowLabs</groupId>
  <artifactId>spark-nlp</artifactId>
  <version>${spark-nlp-version}</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>

Context

Writing unit tests for my own Annotators

Your Environment

  • Version used: 1.3.0

application resource missing while running spark submit

  1. Latest spark-nlp downloaded
  2. Run using Spark 2.2
    spark-submit --jars /Users/john/spark-nlp/spark-nlp-snapshot.jar
    Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)

Latest release (1.2.4) has not been yet published on Maven

Description

Hello, SBT can't find the latest release in the Maven repository:
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp

Therefore, SBT fails to download the new release on refresh. Also, the jar linked from the home page still points to version 1.2.3 on Maven:
http://nlp.johnsnowlabs.com/

Many thanks.

Current Behavior

[warn] 	:: com.johnsnowlabs.nlp#spark-nlp_2.11;1.2.4: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] 	Note: Unresolved dependencies path:
[warn] 		com.johnsnowlabs.nlp:spark-nlp_2.11:1.2.4 
[warn] 		  +- default:multivac-nlp_2.11:0.1
[error] sbt.librarymanagement.ResolveException: unresolved dependency: com.johnsnowlabs.nlp#spark-nlp_2.11;1.2.4: not found

Steps to Reproduce

"com.johnsnowlabs.nlp" %% "spark-nlp" % "1.2.4"

Road Map for Other Languages

Hi,

I understand it's non-trivial work, but I still wonder if other languages (say, Chinese) are already on your road map. Or is the model pluggable, so that developers can easily attach Chinese models in spark-nlp?


BR,
Todd

All HTML pages should have unique title

Description

Titles are the same for most of the pages, e.g. components.html and faqs.html. We should add unique and contextual titles for each page.
Title tags are used in SEO and SERPs (search engine result pages).

Adding StopWordsRemover

I want to add the pyspark.ml.feature StopWordsRemover as a class in the annotator.py file so I can use that function in the same pipeline as the other sparknlp functions.

I have tried the code below, but I get the error: TypeError: 'JavaPackage' object is not callable

What am I doing wrong?

from pyspark.ml.feature import StopWordsRemover as sparkml_StopWordsRemover
stopwordList = sparkml_StopWordsRemover.loadDefaultStopWords("english")

class StopWordsRemover(AnnotatorTransformer):

    caseSensitive = Param(Params._dummy(),
                             "caseSensitive",
                             'whether to do a case sensitive comparison over the stop words',
                             typeConverter=TypeConverters.toBoolean)

    stopWords = Param(Params._dummy(),
                         "stopWords",
                         "The words to be filtered out",
                         typeConverter=TypeConverters.toListString)    
    @keyword_only
    def __init__(self):
        super(StopWordsRemover, self).__init__()
        self._java_obj = self._new_java_obj("com.johnsnowlabs.nlp.annotators.StopWordsRemover", self.uid)
        self._setDefault(caseSensitive=False, stopWords=stopwordList)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
    
    def setParams(self, caseSensitive=False, 
                  stopWords=stopwordList):
        kwargs = self._input_kwargs
        return self._set(**kwargs)
      
    def setCaseSensitive(self, value):
        return self._set(caseSensitive=value)

    def setStopWords(self, value):
        return self._set(stopWords=value)

Unable to load saved PipelineModel if I used Stemmer in the pipeline

The error: java.lang.NoSuchMethodException: com.johnsnowlabs.nlp.annotators.Stemmer.read()

Description

I have trained a classifier with following Annotators:
documentAssembler,
sentenceDetector,
tokenizer,
normalizer,
spellChecker,
stemmer,
finisher,
filterer.

The model is trained and saved successfully.
However, when I load the model to predict on new data, the load fails with the message:
java.lang.NoSuchMethodException: com.johnsnowlabs.nlp.annotators.Stemmer.read()

However, if I remove the Stemmer stage from the pipeline, it is loaded successfully.

Your Environment

I am using Spark 2.2.0, Scala 2.11.12, spark-nlp 2.11-1.2.3.

  • Operating System and version (desktop or mobile): Mac OS

Wrong FS when loading PerceptronModel

Description

Saving a trained POS model to S3 and loading it back throws a java.lang.IllegalArgumentException with the message "Wrong FS, ... expected: hdfs...."

Expected Behavior

A saved model (either standalone or as part of a Spark PipelineModel) should be loadable from S3.

Current Behavior

While the model can be saved at the moment, reading it back either as a standalone model or as part of a PipelineModel throws an exception. A sample stacktrace looks like this:

java.lang.IllegalArgumentException: Wrong FS: s3://<bucket-name>/pos-models/anc-pos/fields/POS Model, expected: hdfs://<ip-address>:9000
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:651)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
  at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
  at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
  at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1415)
  at com.johnsnowlabs.nlp.serialization.StructFeature.deserializeObject(Feature.scala:111)
  at com.johnsnowlabs.nlp.serialization.Feature.deserialize(Feature.scala:44)
  at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:13)
  at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:12)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
  at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:6)
  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:218)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.load(PerceptronModel.scala:71)

Possible Solution

The problem appears to be related to how the default FileSystem is used instead of being inferred from the given Path. See harsha2010/magellan#114.
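A sketch of that direction using the standard Hadoop API (my illustration, not the library's current code): resolve the FileSystem from the path being read rather than taking the cluster default.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val hadoopConf = SparkSession.builder().getOrCreate().sparkContext.hadoopConfiguration

val modelPath = new Path("s3://<bucket-name>/pos-models/anc-pos/fields/POS Model")

// Picks the FileSystem matching the path's scheme (s3, hdfs, file, ...),
// whereas FileSystem.get(hadoopConf) returns the cluster's default FS (hdfs here)
// and then rejects the s3 path with "Wrong FS".
val fs: FileSystem = modelPath.getFileSystem(hadoopConf)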

Steps to Reproduce

The following code in Scala should help reproduce the problem:

new PerceptronApproach()
  .setInputCols(Array("sentence", "normalized"))
  .setOutputCol("pos")
  .fit(df)
  .save("s3://path")

val perceptronModel = PerceptronModel.read.load("s3://path") // throws an exception

Context

Since we use S3 as our DFS, saving and loading models to/from S3 is critical to model training and serving.

Your Environment

Version used: 1.4.0

Sentiment Analysis always positive

I'm evaluating this product, and sentiment analysis with the SentimentDetector always returns "result->positive". Does the package come with a trained model, or do I have to train one myself?

    val spark = SparkSession
      .builder()
      .appName("Sentiment")
      .getOrCreate()

    import spark.implicits._

    val data = Seq(Article("I hate this natural language processor. It doesn't work at all!")).toDS

    data.cache()

    val documentAssembler = new DocumentAssembler().setInputCol("body")

    val sentenceDetector = new SentenceDetectorModel()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")

    val regexTokenizer = new RegexTokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")

    val sentimentDetector = new SentimentDetectorModel()
      .setInputCols(Array("sentence", "token"))
      .setOutputCol("sentiment")


    val finisher = new Finisher().setInputCols(Array("sentiment"))
      .setIncludeKeys(true)
      .setCleanAnnotations(false)


    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      regexTokenizer,
      sentimentDetector,
      finisher
    ))

    val sentimentData = pipeline.fit(data).transform(data)
    sentimentData.show(false)

Results:
|I hate this natural language processor. It it doesn't work at all!|[[document,0,65,I hate this natural language processor. It it doesn't work at all!,Map()]]|[[document,0,38,I hate this natural language processor.,Map()], [document,39,64,It it doesn't work at all!,Map()]]|[[token,0,0,I,Map(sentence -> 1)], [token,2,5,hate,Map(sentence -> 1)], [token,7,10,this,Map(sentence -> 1)], [token,12,18,natural,Map(sentence -> 1)], [token,20,27,language,Map(sentence -> 1)], [token,29,38,processor.,Map(sentence -> 1)], [token,39,40,It,Map(sentence -> 2)], [token,42,43,it,Map(sentence -> 2)], [token,45,51,doesn't,Map(sentence -> 2)], [token,53,56,work,Map(sentence -> 2)], [token,58,59,at,Map(sentence -> 2)], [token,61,64,all!,Map(sentence -> 2)]]|[[sentiment,0,0,positive,Map()]]|result->positive |

ViveknSentimentApproach: Question regarding positive and negative sources

Description

I've created a Jupyter notebook using the ViveknSentimentApproach along the lines of the notebook provided in the docs. Rather than using previously created files for the positive and negative training data, I am generating the files within the notebook itself. When I write the training sets to the filesystem (i.e., df.write.mode('overwrite').text("train-data-positive")), I end up with multiple data files (one per partition) and a number of metadata files (crc files and a _SUCCESS file).

For example:

./_SUCCESS
./._SUCCESS.crc
./part-00000-1c1f1cff-5720-4088-80d7-6c15e0fe9bd6-c000.txt
./.part-00000-1c1f1cff-5720-4088-80d7-6c15e0fe9bd6-c000.txt.crc
...

If I then reference the directory where these files are written out (e.g., train-data-positive and train-data-negative), I get back an error with the following stacktrace:

Py4JJavaError: An error occurred while calling o1665.fit.
: java.io.FileNotFoundException: Invalid file path
	at java.io.FileInputStream.<init>(FileInputStream.java:133)
	at scala.io.Source$.fromFile(Source.scala:91)
	at scala.io.Source$.fromFile(Source.scala:76)
	at scala.io.Source$.fromFile(Source.scala:54)
	at scala.io.Source$.fromFile(Source.scala:60)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream$$anonfun$2.apply(ResourceHelper.scala:35)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream$$anonfun$2.apply(ResourceHelper.scala:35)
	at scala.Option.getOrElse(Option.scala:121)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream.<init>(ResourceHelper.scala:35)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.wordCount(ResourceHelper.scala:192)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$wordCount$2.apply(ResourceHelper.scala:196)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$wordCount$2.apply(ResourceHelper.scala:196)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.wordCount(ResourceHelper.scala:197)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$wordCount$1.apply(ResourceHelper.scala:189)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$wordCount$1.apply(ResourceHelper.scala:189)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.wordCount(ResourceHelper.scala:189)
	at com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach.train(ViveknSentimentApproach.scala:45)
	at com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach.train(ViveknSentimentApproach.scala:14)
	at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:28)
	at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

However, if I delete the _SUCCESS and crc files then all is well.

Expected Behavior

I can modify my notebook to delete these files but it seems like the code should filter on *.txt if those are the only acceptable file names.

using setRequiredAnnotatorTypes() with many of the annotators throws an error

Many of the annotators in annotator.py list requiredAnnotatorTypes as a parameter when using either <annotatorName>().explainParams() or dir(<annotatorName>).

However, when using <annotatorName>().setRequiredAnnotatorTypes on most of these annotators, the following error is thrown.

Py4JJavaError: An error occurred while calling o830.getParam.
: java.util.NoSuchElementException: Param requiredAnnotatorTypes does not exist.

The only place where the requiredAnnotatorTypes parameter is shown is in the class AnnotatorProperties of annotator.py starting on line 29.

Lemmatizer should use POS and token to generate Lemma, have a default dictionary, and allow addition + overriding

Description

This is an enhancement request for the Lemmatizer.

Expected Behavior

  1. The current lemmatizer appears to be dictionary based, and the caller needs to supply a mapping of word to lemma. In other dictionary-based lemmatizers I have seen, the key is usually (token, pos), because certain words need to be lemmatized differently based on their POS: for example, stranger as a NOUN is lemmatized to stranger, but stranger as an ADJ is lemmatized to strange (see the sketch after this list). The Transformer API allows this; other annotators support the setInputCols(Array[String]) signature, i.e. setInputCols(Array("token", "pos")) instead of setInputCols(Array("token")) as it is currently. The only thing to be careful of is that the POS tag style output by the POSTagger must match the style used in the lemma dictionary key.

  2. The implementation does not provide a lemma dictionary and requires the caller to supply one. There are good open source lemma dictionaries available thanks to the FreeLing project for various languages (look under data/{lang}/dictionary/entries), which could be leveraged and provided as part of the spark-nlp download (at least for language=en) so there would be a sensible default for most people. This would make it easier for people to start working with lemmatizers in spark-nlp.

  3. Lemmatizer dictionaries are usually incomplete, so an addLemmaDict() call could be provided that allows user-supplied mappings to be added to the dictionary. The setLemmaDict() could still be used to override the default if needed.
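As an illustrative sketch only (my own, not the spark-nlp API), the (token, POS)-keyed lookup described in point 1 could look like this:

// Illustrative sketch: a lemma dictionary keyed by (token, POS tag),
// falling back to the surface form when no entry exists.
val lemmaDict: Map[(String, String), String] = Map(
  ("stranger", "NOUN") -> "stranger",
  ("stranger", "ADJ")  -> "strange"
)

def lemmatize(token: String, pos: String): String =
  lemmaDict.getOrElse((token.toLowerCase, pos), token)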

Possible Solution

I haven't provided a solution, since I figure that this is a fairly easy fix for people familiar with the codebase, but if you would prefer a pull request, please let me know and I can work on it.

Your Environment

  • Version used: spark-nlp-2.11-1.2.3

Sentiment analysis not working when tokens get deleted after normalization

Important bug reported by users of the library: it makes the library throw an exception when trying to run sentiment analysis on normalized tokens that may have been removed after becoming empty.

Description

The issue was introduced in 1.2.0, since the unpack function sorts sentence tokens by their index, which causes a problem when tokens are removed after cleanup.

Expected Behavior

should not blow up :)

Steps to Reproduce

  1. Pipeline with normalizer and sentiment analysis detector
  2. A sentence with a single dirty character, such as: "Hello. !!.
  3. Execute

Your Environment

  • Version used: 1.2.2

MLeap integration with spark-nlp

MLeap provides a way to deploy Spark-trained pipelines to production-ready API servers

Description

For NLP, the use cases for this abound. This library is great for training. Having an MLeap integration would mean we can deploy the models instantly to an API service.

Feel free to close this issue if you don't think it belongs here; I also made an issue in MLeap to track this integration: combust/mleap#292

DocumentAssembler().explainParams() throws ValueError.

Running DocumentAssembler().explainParams() throws the following error.

ValueError: Param Param(parent=u'DocumentAssembler_4d3093c7d855b0453bcc', name='idCol', doc='input column name.') does not belong to DocumentAssembler_4d3093c7d855b0453bcc.

Changing base.py lines 10-12 from:

    outputCol = Param(Params._dummy(), "outputCol", "input column name.", typeConverter=TypeConverters.toString)
    idColName = Param(Params._dummy(), "idCol", "input column name.", typeConverter=TypeConverters.toString)
    metadataColName = Param(Params._dummy(), "metadataCol", "input column name.", typeConverter=TypeConverters.toString)

to:

    outputCol = Param(Params._dummy(), "outputCol", "output column name.", typeConverter=TypeConverters.toString)
    idCol = Param(Params._dummy(), "idCol", "id column name.", typeConverter=TypeConverters.toString)
    metadataCol = Param(Params._dummy(), "metadataCol", "metadata column name.", typeConverter=TypeConverters.toString)

fixed the problem for me.

java.lang.UnsatisfiedLinkError: org.rocksdb.LRUCache.newLRUCache(JIZD)J when trying to do NER using NerCrf

Issue with NER in version 1.4

Description

I am using a Databricks notebook to run the latest version of Spark NLP and trying to do NER using NerCrf. I get the error below, and it takes a long time to run.

Expected Behavior

I get the following error when I run the NerCrf:

java.lang.UnsatisfiedLinkError: org.rocksdb.LRUCache.newLRUCache(JIZD)J
at org.rocksdb.LRUCache.newLRUCache(Native Method)
at org.rocksdb.LRUCache.<init>(LRUCache.java:74)
at org.rocksdb.LRUCache.<init>(LRUCache.java:19)
at com.johnsnowlabs.nlp.embeddings.WordEmbeddings.<init>(WordEmbeddings.scala:13)
at com.johnsnowlabs.nlp.embeddings.ApproachWithWordEmbeddings.beforeTraining(ApproachWithWordEmbeddings.scala:55)
at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:32)

Possible Solution

I believe the wrong version of rocksdb is packaged with Spark NLP

Steps to Reproduce

  1. I am using a Databricks notebook and the latest Databricks Runtime 3.4 (includes Apache Spark 2.2.0, Scala 2.11)
  2. I have installed Spark NLP 1.4 from Maven
  3. Here is the link to the Databricks Notebook that shows the error

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5276288487867556/871609068453572/3684151997220649/latest.html

EntityExtractor().setRequireSentences(True) throws error.

EntityExtractor().setRequireSentences(True)

throws this error:

Py4JJavaError: An error occurred while calling o461.getParam.
: java.util.NoSuchElementException: Param requireSentences does not exist.

From the EntityExtractor.scala file, I found the sentences parameter defined on lines 99 to 105, but when I changed lines 191 and 192 in annotator.py from:

def setRequireSentences(self, value):
    return self._set(requireSentences=value)

to:

def setRequireSentences(self, value):
    return self._set(sentences=value)

The error was basically the same:

Py4JJavaError: An error occurred while calling o474.getParam.
: java.util.NoSuchElementException: Param sentences does not exist.

Cannot find entity match in NerCrfApproach

Description

When using NerCrfApproach, it returns the error message below. I use simple text data to test, which includes 'Walmart is in Germany', but it cannot find any match for 'Germany', even though 'Germany' is contained in test_ner_dataset.txt.

Expected Behavior

It should find some matches in the simple test data.

Current Behavior

Error message: java.util.NoSuchElementException: None.get

Steps to Reproduce

  1. Create sample data:

val data = Seq(
(1, "Apple is located in California . It is a great company."),
(2, "Google is located in California . It is a great company."),
(3, "The BBC is located in London . It is a great company."),
(5, "The Walmart is located in Germany . It is a great company.")
).toDF("id", "text")

  2. Create a pipeline for pre-transformation before fitting the NER model:

import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")

val regexTokenizer = new com.johnsnowlabs.nlp.annotators.Tokenizer() // it is imported twice in the same scope by import com.johnsnowlabs.nlp.annotators._ and import org.apache.spark.ml.feature.Tokenizer
.setInputCols(Array("sentence"))
.setOutputCol("token")
val positivePhrases = "/dbfs/FileStore/tables/positivePhrases.txt"
val negativePhrases = "/dbfs/FileStore/tables/negativePhrases.txt"

val ViveknSentiment = new ViveknSentimentApproach()
.setInputCols(Array("token", "sentence"))
.setOutputCol("vivekn")
.setPositiveSourcePath(positivePhrases)
.setNegativeSourcePath(negativePhrases)

val posTagger = new PerceptronApproach() // #41
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")

val finisher = new Finisher()
.setInputCols("token")
.setCleanAnnotations(false)

val pipeline2 = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
regexTokenizer , // works except for pattern
ViveknSentiment,
posTagger,
finisher
))

var result2 = pipeline2
.fit(data.select("id","text"))
.transform(data.select("id","text"))
3. Fit in NER, which produces error:

val res3 = new NerCrfApproach()
.setInputCols("sentence", "token", "pos")
.setDatasetPath("src/test/resources/ner-corpus/test_ner_dataset.txt")
.setOutputCol("ner")
.fit(result2)

Error message: java.util.NoSuchElementException: None.get


Tokenizer: Composite Tokens not retained when targetPattern is set

When I setTargetPattern(\\w+), composite tokens are not retained.

Description

The code below returns 3 tokens, 'New', 'York', 'City', instead of 'New York City'.
However, when I comment out setTargetPattern(\\w+), the behaviour becomes correct.

val df = spark.createDataFrame(Seq((1,"New York City"))).toDF("id","text")

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
.setIdCol("id")

val doc = documentAssembler.transform(df)


val tokenizer = new Tokenizer()
                    .setInputCols("document")
                    .setOutputCol("tokens")
                    .setTargetPattern("\\w+")
                    .setCompositeTokens(Array("New York City"))

Context

I am using Spark-NLP release 1.4

wordCount from ResourceHelper doesn't return any value from Map[String, Int]

When reading the corpus as "TXTDS", configured from the SpellChecker, the variable result has 0 values (e.g. result.toSeq.length == 0). Please offer a suggestion on how to retrieve this.
The code snippet is below (starting at ResourceHelper.scala:273):

case TXTDS =>
        import spark.implicits._
        val dataset = spark.read.textFile(source)
        val wordCount = spark.sparkContext.broadcast(MMap.empty[String, Int].withDefaultValue(0))
        val documentAssembler = new DocumentAssembler()
          .setInputCol("value")
        val tokenizer = new RegexTokenizer()
          .setInputCols("document")
          .setOutputCol("token")
        val normalizer = new Normalizer()
          .setInputCols("token")
          .setOutputCol("normal")
        val finisher = new Finisher()
          .setInputCols("normal")
          .setOutputCols("finished")
          .setAnnotationSplitSymbol("--")
        new Pipeline()
          .setStages(Array(documentAssembler, tokenizer, normalizer, finisher))
          .fit(dataset)
          .transform(dataset)
          .select("finished").as[String]
          .foreach(text => text.split("--").foreach(t => {
            wordCount.value(t) += 1
          }))
        val result = wordCount.value
        wordCount.destroy()
        result
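A likely explanation (my reading of the snippet, not an official diagnosis): the broadcast MMap is copied to each executor, and the increments inside foreach happen on those copies, so the driver-side wordCount.value is never updated. A sketch of the same count done with a distributed aggregation instead, where finishedDs stands for the DataFrame produced by transform(dataset) above:

// Sketch only (not the library code): aggregate token counts on the cluster and
// collect just the final counts, instead of mutating a broadcast map on executors.
import org.apache.spark.sql.functions.{col, explode, split}

val result: Map[String, Long] = finishedDs                    // hypothetical handle to the transformed DataFrame
  .select(explode(split(col("finished"), "--")).as("token"))  // "--" is the annotation split symbol set above
  .groupBy("token")
  .count()
  .collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap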

python DocumentAssembler minor code description error

Error in base.py lines 10-12. Change from:

    outputCol = Param(Params._dummy(), "outputCol", "input column name.", typeConverter=TypeConverters.toString)
    idColName = Param(Params._dummy(), "idCol", "input column name.", typeConverter=TypeConverters.toString)
    metadataColName = Param(Params._dummy(), "metadataCol", "input column name.", typeConverter=TypeConverters.toString)

to:

    outputCol = Param(Params._dummy(), "outputCol", "output column name.", typeConverter=TypeConverters.toString)
    idColName = Param(Params._dummy(), "idCol", "id column name.", typeConverter=TypeConverters.toString)
    metadataColName = Param(Params._dummy(), "metadataCol", "metadata column name.", typeConverter=TypeConverters.toString)

pyspark on EMR not able to import

I am attempting to run this on an AWS EMR cluster with PySpark. It runs just fine in spark-shell, but I cannot import the package via pyspark.

Description

I run pyspark from the CLI via:

pyspark --packages JohnSnowLabs:spark-nlp:1.4.0

which gets the shell started with a few warning lines that are not usually present:

18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.

I immediately try to do from sparknlp.annotator import * and get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator

Possible Solution

More of a question really: does your package work in Python 2 as well as Python 3? I didn't see that in the docs anywhere, but maybe I missed it.

Context

AWS Spark EMR cluster which uses the Amazon Linux AMI.

Your Environment

AWS Spark EMR cluster which uses the Amazon Linux AMI. For versioning info, here is the full console output from when I start up pyspark with the package:

$ pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
Python 2.7.13 (default, Jan 31 2018, 00:17:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found JohnSnowLabs#spark-nlp;1.4.0 in spark-packages
	found com.typesafe#config;1.3.0 in central
	found org.rocksdb#rocksdbjni;5.8.0 in central
	found org.slf4j#slf4j-api;1.7.25 in spark-list
	found org.apache.commons#commons-compress;1.15 in central
	found org.objenesis#objenesis;2.6 in central
:: resolution report :: resolve 282ms :: artifacts dl 5ms
	:: modules in use:
	JohnSnowLabs#spark-nlp;1.4.0 from spark-packages in [default]
	com.typesafe#config;1.3.0 from central in [default]
	org.apache.commons#commons-compress;1.15 from central in [default]
	org.objenesis#objenesis;2.6 from central in [default]
	org.rocksdb#rocksdbjni;5.8.0 from central in [default]
	org.slf4j#slf4j-api;1.7.25 from spark-list in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   6   |   0   |   0   |   0   ||   6   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 6 already retrieved (0kB/7ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/15 17:41:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Jan 31 2018 00:17:36)
SparkSession available as 'spark'.
>>> from sparknlp.annotator import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator
>>> 

Tokenizer: NoSuchMethodError: com.johnsnowlabs.nlp.annotators.common.Sentence.start()

Description

Tokenizer isn't working while using Python bindings for SparkNLP 1.4.
In SparkNLP 1.2.3 I used RegexTokenizer without any issue.

Expected Behavior

I expected to get a list of tokens from the product titles that I passed as parameters, for example:

["this is a product title"] == should return ==> ["this", "is", "a","product","title"]

Current Behavior

The script returns the error below.
py4j.protocol.Py4JJavaError: An error occurred while calling o67.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, www.host.com, executor 1): java.lang.NoSuchMethodError: com.johnsnowlabs.nlp.annotators.common.Sentence.start()I

Steps to Reproduce

import sys
from pyspark import HiveContext
from pyspark import SparkContext

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

spark = SparkContext(appName="ngrams_by_section")
hiveContext = HiveContext(spark)

getData = hiveContext.sql("""
SELECT sentence, label
FROM table
""");

data = getData
document_assembler = DocumentAssembler().setInputCol("sentence").setOutputCol("title")

tokenizer = Tokenizer().setInputCols(["title"]).setOutputCol("token").setTargetPattern("[A-Za-zÀ-úÀ-ÿ0-9ñÑ]+")

tokenizer_pipeline = Pipeline(stages=[
document_assembler,
tokenizer
]
)

tokenizer_model = tokenizer_pipeline.fit(data)
ng = tokenizer_model.transform(data)
ng.show(truncate=False)

Context

I had some code working with Spark NLP 1.2.3; recently I linked to the 1.4.0 package and adapted my code to work with this version, but it isn't working anymore.

Your Environment

  • Version used: Spark 2.2.0 Scala 2.11
  • Command executed:
    spark-submit --master yarn --deploy-mode client --driver-memory 2g --executor-memory 2g --num-executors 4 --executor-cores 4 --packages JohnSnowLabs:spark-nlp:1.4.0 nlp-bug-test.py

SentenceDetectorModel fills incorrect begin and end in annotation

SentenceDetectorModel returns incorrect begin and end of sentence starting from the second sentence in the document.
For example:
"Hello World!! New Sentence"


Could not initialize class com.johnsnowlabs.nlp.util.io.ResourceHelper$ error with EntityExtractor().setEntitiesPath

I receive this error when I try to use EntityExtractor() in a transformer in Python:

java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.nlp.util.io.ResourceHelper$
	at com.johnsnowlabs.nlp.annotators.EntityExtractor$.retrieveEntityExtractorPhrases(EntityExtractor.scala:148)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor.loadEntities(EntityExtractor.scala:80)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor.getSearchTrie(EntityExtractor.scala:67)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor.com$johnsnowlabs$nlp$annotators$EntityExtractor$$search(EntityExtractor.scala:106)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor$$anonfun$annotate$1.apply(EntityExtractor.scala:133)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor$$anonfun$annotate$1.apply(EntityExtractor.scala:133)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at com.johnsnowlabs.nlp.annotators.EntityExtractor.annotate(EntityExtractor.scala:133)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:42)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:41)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:423)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I tested with v1.2.6

Python annotators should be loadable on their own

Description

Python annotators can be loaded when inside a pipeline, but loading fails when they are loaded on their own:

document_assembler = DocumentAssembler() \
            .setInputCol("text")
    
document_assembler.write().overwrite().save("./da")
DocumentAssembler().read()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-78dbc11006d7> in <module>()
      3 
      4 document_assembler.write().overwrite().save("./da")
----> 5 DocumentAssembler().read()
      6 
      7 ### Transform input to appropriate schema

~/apps/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/util.py in read(cls)
    263     def read(cls):
    264         """Returns an MLReader instance for this class."""
--> 265         return JavaMLReader(cls)
    266 
    267 

~/apps/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/util.py in __init__(self, clazz)
    186     def __init__(self, clazz):
    187         self._clazz = clazz
--> 188         self._jread = self._load_java_obj(clazz).read()
    189 
    190     def load(self, path):

TypeError: 'JavaPackage' object is not callable

Expected Behavior

We should be able to load standalone Python components.

SentenceDetector replaces "ö" with "?"

The SentenceDetector is applied to German texts. If an "ö" occurs, it is replaced by "?". However, if other German special characters (such as ä, ü, ß) occur, the SentenceDetector performs as it should and does not replace them with a question mark.

Expected Behavior

The SentenceDetector should not change the text within its annotation.

Current Behavior

The SentenceDetector replaces "ö" by "?".

Steps to Reproduce

  1. Run the test-case:
    sentenceDetector.txt

  2. Show the transformed dataframe:
    output.txt
    The token in question is the second one within the second row.

Context

Subsequent annotators perform worse; for example, a lemma cannot be found if the token was changed.

Your Environment

  • Version used: 1.4.0

Velocity of this project

This project was featured in a blog post on the Databricks site on October 19. Since then the project was pushed to GitHub and 24 issues were logged. Of these, five have been closed as of December 13, and those were closed only between October 21 and November 5. So there has been visible activity for about two weeks out of the project's 8 weeks of visible lifetime.

What is the story about project maintenance and improvement here, including the long list of "want to haves" in the README? Most OSS projects die rather quickly due to neglect: what evidence can we have that this will not be one of those?

Resolve annotators and pipelines consuming lots of driver RAM

Description

Annotators, specifically Vivekn sentiment analysis, are consuming lots of driver RAM due to standard Scala collections containing model information. This becomes a storage problem both inside pipelines and when reading the models back. We need to let this information flow from disk instead.


DocumentAssembler().setIdCol() throws error

This code:

documentAssembler = DocumentAssembler() \
  .setInputCol("input") \
  .setOutputCol("output") \
  .setIdCol("id")

throws this error:

AttributeError: 'DocumentAssembler' object has no attribute 'idCol'
AttributeErrorTraceback (most recent call last)
in engine
----> 1 documentAssembler = DocumentAssembler()   .setInputCol("input")   .setOutputCol("output")   .setIdCol("id")

/home/cdsw/sparknlp/base.pyc in setIdCol(self, value)
     31 
     32     def setIdCol(self, value):
---> 33         return self._set(idCol=value)
     34 
     35     def setMetadataCol(self, value):

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/ml/param/__init__.py in _set(self, **kwargs)
    414         """
    415         for param, value in kwargs.items():
--> 416             p = getattr(self, param)
    417             if value is not None:
    418                 try:

AttributeError: 'DocumentAssembler' object has no attribute 'idCol'

Changing line 33 of sparknlp base.py from:
return self._set(idCol=value)
to:
return self._set(idColName=value)
fixed the problem for me.

Common English words being tagged as NER entities

I am using NERRegexApproach to tag named entities, and using dict.txt to train the model.
I am getting many common English words tagged as entities, because of the dict.txt data.
Is there another training dataset that I can use which reduces the number of false positives on the entities? Or is there another way to process the data which will produce a more accurate result?

I thought of filtering out stop words, but some of the incorrect tags are not stop words either.

Description

Here is my code (in Java)

Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{
            documentAssembler,
            sentenceDetector,
            regexTokenizer                      
        });
        Dataset<Row> output = pipeline
                .fit(docText)
                .transform(docText);
        String nerCorpusdict = "src/spark-nlp/src/main/resources/ner-corpus/dict.txt";       
        NERRegexApproach nerTagger = new NERRegexApproach(); 
        NERRegexApproach$.MODULE$.train(nerCorpusdict);
       
        nerTagger.setInputCols(new String[] {"sentence"});
        nerTagger.setOutputCol("ner");
        nerTagger.setCorpusPath(nerCorpusdict);
       
        Dataset<Row> nerTags = nerTagger.fit(sentenceDetector.transform(output)).transform(output);
        nerTags.select("ner","sentence").show(false);

Here is a sample of the output -
[[named_entity,409,415,PER,Map(will -> PER)], [named_entity,522,528,PER,Map(will -> PER)], [named_entity,263,269,PER,Map(case -> PER)], [named_entity,367,372,PER,Map(age -> PER)], [named_entity,229,235,PER,Map(show -> PER)], [named_entity,651,655,LOC,Map(be -> LOC)]] |[[document,0,226,We demonstrate here several previously unrecognized or insufficiently appreciated properties of the Lee-Carter mortality forecasting approach, the dominant method used in both the academic literature and practical applications.,Map()], [document,227,570,We show that this model is a special case of a considerably simpler, and less often biased, random walk with drift model, and prove that the age profile forecast from both approaches will always become less smooth and unrealistic after a point (when forecasting forward or backwards in time) and will eventually deviate from any given baseline.,Map()], [document,571,682,We use these and other properties we demonstrate to suggest when the model would be most applicable in practice.,Map()]]

  • Versions used: Java 8, Apache Spark 2.2.0, Scala 2.11, spark-nlp:1.2.3 (from Maven Central)

python example sentiment.ipynb - issues with Lemmatizer

Description

  1. git clone the repository
  2. export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
  3. pyspark --packages JohnSnowLabs:spark-nlp:1.2.3
  4. Had issues with the Lemmatizer when running the sentiment notebook in the Python examples

Error message

Py4JJavaError: An error occurred while calling o40.setLemmaDict.
: java.lang.NoClassDefFoundError: com/typesafe/config/ConfigMergeable
at com.johnsnowlabs.nlp.annotators.Lemmatizer$.<init>(Lemmatizer.scala:62)
at com.johnsnowlabs.nlp.annotators.Lemmatizer$.<clinit>(Lemmatizer.scala)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.setLemmaDict(Lemmatizer.scala:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.typesafe.config.ConfigMergeable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more

Environment

  • Version used: JohnSnowLabs:spark-nlp:1.2.3
  • Browser Name and version: Google Chrome version 61.0.3163.100 (64-bit)
  • Operating System and version (desktop or mobile): macOS Sierra Version 10.12.6

java.lang.NullPointerException in DocumentAssembler

Hi,

Thanks for the code! Lots of good functionality here!

I've run into the following issue:

Seq(Annotation(annotatorType, 0, text.length - 1, text, metadata))

According to this line, if text.length <= 0 then the upper bound of the range is < 0, which I believe is causing a NullPointerException for my DataFrame:

+--------------------+
|                body|
+--------------------+
|Date and Time: 08...|
|Date and Time: 05...|
|Date and Time: 06...|
|Date and Time: 05...|
|Date and Time: 04...|
|Date and Time: 02...|
|Date and Time: 09...|
|Date and Time: 08...|
|Date and Time: 12...|
|Date and Time: 09...|
+--------------------+
only showing top 10 rows

Incidentally, this data frame is being generated from an Elasticsearch source, but I don't expect it to have zero-length values for body.
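As a guard on the caller's side (a sketch of my own, assuming df is the Elasticsearch-backed DataFrame shown above), empty bodies could be filtered out before DocumentAssembler:

// Drop rows whose body is null or empty before running the Spark NLP pipeline.
import org.apache.spark.sql.functions.{col, length, trim}

val nonEmpty = df.filter(col("body").isNotNull && length(trim(col("body"))) > 0)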

Thanks.

Input format examples in DateMatcher()

Could you please include some examples of accepted input date formats that would be automatically detected by the DateMatcher? I have a corpus using multiple formats that are not detected; here are some examples:

06JAN2018
June 30, 2021
MAY 2018

Is it possible to add our own custom input formats?

Also, it is not clear what the accepted InputCols are. Trying Tokens and Sentences crashes the pipeline execution.

Pregenerated downloadable models

We want to make it possible for the user to download pretrained annotators and even entire pipelines, so they are ready to process text without a training stage.
