
johnsnowlabs / nlu


1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.

License: Apache License 2.0

Python 99.93% Shell 0.07%
nlu natural-language-understanding sentiment-classifier text-classification transformers language-detection named-entity-recognition seq2seq t5 lemmatizer

nlu's Introduction

NLU: The Power of Spark NLP, the Simplicity of Python

John Snow Labs' NLU is a Python library for applying state-of-the-art text mining directly on any dataframe with a single line of code. As a facade for the award-winning Spark NLP library, it comes with 1000+ pretrained models in 100+ languages, all production-grade, scalable, and trainable, with everything in 1 line of code.

NLU in Action

See how easy it is to use any of the thousands of models in 1 line of code. There are hundreds of tutorials and simple examples you can copy and paste into your projects to achieve state-of-the-art results easily.

NLU & Streamlit in Action

This 1 line lets you visualize and play with 1000+ SOTA NLU & NLP models in 200 languages:

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py

NLU provides a tight and simple integration with Streamlit, which enables building powerful web apps that showcase NLU models in just 1 line of code. View the NLU & Streamlit documentation or the NLU & Streamlit examples section.

All NLU resources overview

Take a look at our official NLU page, https://nlu.johnsnowlabs.com/, for user documentation and examples.

Resource Description
Install NLU Just run pip install nlu pyspark==3.0.2
The NLU Namespace Find all the names of models you can load with nlu.load()
The nlu.load(<Model>) function Load any of the 1000+ models in 1 line
The nlu.load(<Model>).predict(data) function Predict on Strings, List of Strings, Numpy Arrays, Pandas, Modin and Spark Dataframes
The nlu.load(<train.Model>).fit(data) function Train a text classifier for 2-Class, N-Classes, Multi-N-Classes, Named-Entity-Recognition or Parts of Speech Tagging
The nlu.load(<Model>).viz(data) function Visualize the results of Word Embedding Similarity Matrix, Named Entity Recognizers, Dependency Trees & Parts of Speech, Entity Resolution, Entity Linking or Entity Status Assertion
The nlu.load(<Model>).viz_streamlit(data) function Display an interactive GUI which lets you explore and test every model and feature in NLU in 1 click.
General Concepts General concepts in NLU
The latest release notes Newest features added to NLU
Overview NLU 1-liners examples Most commonly used models and their results
Overview NLU 1-liners examples for healthcare models Most commonly used healthcare models and their results
Overview of all NLU tutorials and Examples 100+ tutorials on how to use NLU on text datasets for various problems and from various sources like Twitter, Chinese News, Crypto News Headlines, Airline Traffic communication and Product review classifier training
Connect with us on Slack Problems, questions or suggestions? We have a very active and helpful community of 2000+ AI enthusiasts putting NLU, Spark NLP & Spark OCR to good use
Discussion Forum Want a more in-depth discussion with the community? Post a thread in our discussion forum
John Snow Labs Medium Articles and tutorials on NLU, Spark NLP and Spark OCR
John Snow Labs Youtube Videos and tutorials on NLU, Spark NLP and Spark OCR
NLU Website The official NLU website
Github Issues Report a bug
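
As a quick illustration of the trainable spells referenced above, here is a minimal sketch of fitting a sentiment classifier. It assumes the labeled data is a pandas DataFrame with a 'text' column and a 'y' label column, the convention used in the NLU training tutorials; exact column names may differ between releases.

import pandas as pd
import nlu

# Tiny labeled dataset; NLU trainable spells expect the label in a 'y' column
# next to the 'text' column (assumption based on the training tutorials).
train_df = pd.DataFrame({
    'text': ['I love this movie', 'This was a terrible film'],
    'y':    ['positive', 'negative'],
})

# Load a trainable spell, fit it on the labeled data and predict with the
# freshly trained pipeline.
fitted_pipe = nlu.load('train.sentiment').fit(train_df)
print(fitted_pipe.predict(['I really enjoyed it', 'Not my cup of tea']))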

Getting Started with NLU

To get your hands on the power of NLU, you just need to install it via pip and ensure Java 8 is installed and properly configured. Check out the Quickstart for more info.

pip install nlu pyspark==3.0.2
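
Since NLU runs Spark NLP on the JVM, a quick sanity check of the Java setup before importing nlu can save debugging time. This is only an illustrative sketch; the JAVA_HOME path below is an example and will differ per machine.

import os
import subprocess

# Java prints its version banner to stderr; any Java 8 build should show '1.8'.
print(subprocess.run(['java', '-version'], capture_output=True, text=True).stderr)

# If the wrong Java is picked up, point JAVA_HOME at a Java 8 installation
# before importing nlu (example path, adjust for your system):
# os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'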

Loading and predicting with any model in 1 line

import nlu 
nlu.load('sentiment').predict('I love NLU! <3') 

Loading and predicting with multiple models in 1 line

Get 6 different embeddings in 1 line and use them for downstream data science tasks!

nlu.load('bert elmo albert xlnet glove use').predict('I love NLU! <3') 
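
The call above returns a pandas DataFrame with one column per annotator output, so the embeddings can be fed straight into downstream tooling. The exact column names depend on the models and the NLU version, so the sketch below selects them generically instead of hard-coding names.

import nlu

df = nlu.load('bert elmo albert xlnet glove use').predict('I love NLU! <3')

# Pick out the embedding columns generically and inspect the vector sizes
# (the column naming scheme is an assumption and varies across NLU releases).
embedding_cols = [c for c in df.columns if 'embedding' in c]
for col in embedding_cols:
    print(col, len(df[col].iloc[0]))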

What kind of models does NLU provide?

NLU provides everything a data scientist might wish for in one line of code:

  • 1000+ pre-trained models
  • 100+ of the latest NLP word embeddings (BERT, ELMO, ALBERT, XLNET, GLOVE, BIOBERT, ELECTRA, COVIDBERT) and different variations of them
  • 50+ of the latest NLP sentence embeddings (BERT, ELECTRA, USE) and different variations of them
  • 100+ Classifiers (NER, POS, Emotion, Sarcasm, Questions, Spam)
  • 300+ Supported Languages
  • Summarize Text and Answer Questions with T5
  • Labeled and Unlabeled Dependency parsing
  • Various Text Cleaning and Pre-Processing methods like Stemming, Lemmatizing, Normalizing, Filtering, Cleaning pipelines and more
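
For example, the T5 summarization and question answering mentioned in the list above are also one-liners. The sketch below uses the en.t5.base and answer_question spells from the tutorial table further down; the 'summarize:' task prefix follows the T5 convention shown in the NLU T5 tutorials (an assumption, exact task prefixes may vary).

import nlu

# Text summarization via a T5 task prefix (assumed convention).
t5 = nlu.load('en.t5.base')
print(t5.predict('summarize: NLU wraps Spark NLP so that thousands of pretrained '
                 'models can be applied to text with a single line of code.'))

# Closed-book question answering with the dedicated spell.
qa = nlu.load('answer_question')
print(qa.predict('What is the capital of France?'))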

Classifiers trained on many different datasets

Choose the right tool for the right task! Whether you analyze movies or Twitter data, NLU has the right model for you!

  • trec6 classifier
  • trec10 classifier
  • spam classifier
  • fake news classifier
  • emotion classifier
  • cyberbullying classifier
  • sarcasm classifier
  • sentiment classifier for movies
  • IMDB Movie Sentiment classifier
  • Twitter sentiment classifier
  • NER pretrained on OntoNotes
  • NER trainer on CONLL
  • Language classifier for 20 languages on the wiki 20 lang dataset.
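
All of these classifiers load the same way; only the spell name changes. A couple of hedged examples, using spell names that appear in the tutorial table below:

import nlu

# Emotion and sarcasm classification, each in one line.
print(nlu.load('emotion').predict('I am so happy the project finally shipped!'))
print(nlu.load('en.classify.sarcasm').predict('Oh great, another Monday morning meeting.'))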

Utilities for the Data Science NLU applications

Working with text data can sometimes be quite a dirty job. NLU helps you keep your hands clean by providing components that take the data-engineering-intensive tasks off your hands.

  • Datetime Matcher
  • Pattern Matcher
  • Chunk Matcher
  • Phrases Matcher
  • Stopword Cleaners
  • Pattern Cleaners
  • Slang Cleaner
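
These components are loaded like any other spell. A short sketch showing the date matcher and the stopword cleaner, using the match.datetime and stopwords spell names from the tutorial table below:

import nlu

# Extract date expressions from raw text.
print(nlu.load('match.datetime').predict('We met on 2021-01-05 and will meet again next Friday.'))

# Remove stopwords from a sentence.
print(nlu.load('stopwords').predict('This is just a simple sentence with some stopwords.'))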

Where can I see all models available in NLU?

For all models available to load with NLU, see the NLU Namespace, the John Snow Labs Models Hub, or go straight to the source.

Supported Data Types

  • Pandas DataFrame and Series
  • Spark DataFrames
  • Modin with Ray backend
  • Modin with Dask backend
  • Numpy arrays
  • Strings and lists of strings
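
Whatever the input type, predict() returns a pandas DataFrame with the results. A minimal sketch, assuming the text lives in a column named 'text' (an assumption; see the NLU docs for how input columns are resolved):

import pandas as pd
import nlu

pipe = nlu.load('sentiment')

# Strings, lists of strings and pandas DataFrames all work as input.
print(pipe.predict('NLU is great'))
print(pipe.predict(['NLU is great', 'This release is disappointing']))

pdf = pd.DataFrame({'text': ['NLU is great', 'This release is disappointing']})
print(pipe.predict(pdf))

# A Spark or Modin DataFrame built from the same data can be passed the same
# way (requires a running Spark session or a Modin Ray/Dask backend).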

Overview of all tutorials using the NLU-Library

The following table lists all available tutorials using NLU. These tutorials will help you learn how to use the NLU library for your own tasks. Some of the tasks NLU handles are translating from any language to English, lemmatizing, tokenizing, cleaning text of symbols or unwanted syntax, spellchecking, detecting entities, analyzing sentiment and many more!
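
For instance, translation to English is itself a one-liner. The sketch below uses the zh.translate_to.en spell listed in the table; the other translate_to spells work the same way.

import nlu

# Translate Chinese text to English with a pretrained Marian model.
print(nlu.load('zh.translate_to.en').predict('你好，世界'))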


Tutorial Description NLU Spells Used Open In Colab Dataset and Paper References
Albert Word Embeddings with NLU albert, sentiment pos albert emotion Open In Colab Albert-Paper, Albert on Github, Albert on TensorFlow, T-SNE, T-SNE-Albert, Albert_Embedding
Bert Word Embeddings with NLU bert, pos sentiment emotion bert Open In Colab Bert-Paper, Bert Github, T-SNE, T-SNE-Bert, Bert_Embedding
BIOBERT Word Embeddings with NLU biobert , sentiment pos biobert emotion Open In Colab BioBert-Paper, Bert Github , BERT: Deep Bidirectional Transformers, Bert Github, T-SNE, T-SNE-Biobert, Biobert_Embedding
COVIDBERT Word Embeddings with NLU covidbert, sentiment covidbert pos Open In Colab CovidBert-Paper, Bert Github, T-SNE, T-SNE-CovidBert, Covidbert_Embedding
ELECTRA Word Embeddings with NLU electra, sentiment pos en.embed.electra emotion Open In Colab Electra-Paper, T-SNE, T-SNE-Electra, Electra_Embedding
ELMO Word Embeddings with NLU elmo, sentiment pos elmo emotion Open In Colab ELMO-Paper, Elmo-TensorFlow, T-SNE, T-SNE-Elmo, Elmo-Embedding
GLOVE Word Embeddings with NLU glove, sentiment pos glove emotion Open In Colab Glove-Paper, T-SNE, T-SNE-Glove , Glove_Embedding
XLNET Word Embeddings with NLU xlnet, sentiment pos xlnet emotion Open In Colab XLNet-Paper, Bert Github, T-SNE, T-SNE-XLNet, Xlnet_Embedding
Multiple Word-Embeddings and Part of Speech in 1 Line of code bert electra elmo glove xlnet albert pos Open In Colab Bert-Paper, Albert-Paper, ELMO-Paper, Electra-Paper, XLNet-Paper, Glove-Paper
Normalizing with NLU norm Open In Colab -
Detect sentences with NLU sentence_detector.deep, sentence_detector.pragmatic, xx.sentence_detector Open In Colab Sentence Detector
Spellchecking with NLU n.a. n.a. -
Stemming with NLU en.stem, de.stem Open In Colab -
Stopwords removal with NLU stopwords Open In Colab Stopwords
Tokenization with NLU tokenize Open In Colab -
Normalization of Documents norm_document Open In Colab -
Open and Closed book question answering with Google's T5 en.t5 , answer_question Open In Colab T5-Paper, T5-Model
Overview of every task available with T5 en.t5.base Open In Colab T5-Paper, T5-Model
Translate between more than 200 Languages in 1 line of code with Marian Models tr.translate_to.fr, en.translate_to.fr ,fr.translate_to.he , en.translate_to.de Open In Colab Marian-Papers, Translation-Pipeline (En to Fr), Translation-Pipeline (En to Ger)
BERT Sentence Embeddings with NLU embed_sentence.bert, pos sentiment embed_sentence.bert Open In Colab Bert-Paper, Bert Github, Bert-Sentence_Embedding
ELECTRA Sentence Embeddings with NLU embed_sentence.electra, pos sentiment embed_sentence.electra Open In Colab Electra Paper, Sentence-Electra-Embedding
USE Sentence Embeddings with NLU use, pos sentiment use emotion Open In Colab Universal Sentence Encoder, USE-TensorFlow, Sentence-USE-Embedding
Sentence similarity with NLU using BERT embeddings embed_sentence.bert, use en.embed_sentence.electra embed_sentence.bert Open In Colab Bert-Paper, Bert Github, Bert-Sentence_Embedding
Part of Speech tagging with NLU pos Open In Colab Part of Speech
NER Aspect Airline ATIS en.ner.aspect.airline Open In Colab NER Airline Model, Atis intent Dataset
NLU-NER_CONLL_2003_5class_example ner Open In Colab NER-Piple
Named-entity recognition with Deep Learning ONTO NOTES ner.onto Open In Colab NER_Onto
Aspect based NER-Sentiment-Restaurants en.ner.aspect_sentiment Open In Colab -
Detect Named Entities (NER), Part of Speech Tags (POS) and Tokenize in Chinese zh.segment_words, zh.pos, zh.ner, zh.translate_to.en Open In Colab Translation-Pipeline (Zh to En)
Detect Named Entities (NER), Part of Speech Tags (POS) and Tokenize in Japanese ja.segment_words, ja.pos, ja.ner, ja.translate_to.en Open In Colab Translation-Pipeline (Ja to En)
Detect Named Entities (NER), Part of Speech Tags (POS) and Tokenize in Korean ko.segment_words, ko.pos, ko.ner.kmou.glove_840B_300d, ko.translate_to.en Open In Colab -
Date Matching match.datetime Open In Colab -
Typed Dependency Parsing with NLU dep Open In Colab Dependency Parsing
Untyped Dependency Parsing with NLU dep.untyped Open In Colab -
E2E Classification with NLU e2e Open In Colab e2e-Model
Language Classification with NLU lang Open In Colab -
Cyberbullying Classification with NLU classify.cyberbullying Open In Colab Cyberbullying-Classifier
Sentiment Classification with NLU for Twitter emotion Open In Colab Emotion detection
Fake News Classification with NLU en.classify.fakenews Open In Colab Fakenews-Classifier
Intent Classification with NLU en.classify.intent.airline Open In Colab Airline-Intention classifier, Atis-Dataset
Question classification based on the TREC dataset en.classify.questions Open In Colab Question-Classifier
Sarcasm Classification with NLU en.classify.sarcasm Open In Colab Sarcasm-Classifier
Sentiment Classification with NLU for Twitter en.sentiment.twitter Open In Colab Sentiment_Twitter-Classifier
Sentiment Classification with NLU for Movies en.sentiment.imdb Open In Colab Sentiment_imdb-Classifier
Spam Classification with NLU en.classify.spam Open In Colab Spam-Classifier
Toxic text classification with NLU en.classify.toxic Open In Colab Toxic-Classifier
Unsupervised keyword extraction with NLU using the YAKE algorithm yake Open In Colab -
Grammatical Chunk Matching with NLU match.chunks Open In Colab -
Getting n-Grams with NLU ngram Open In Colab -
Assertion en.med_ner.clinical en.assert, en.med_ner.clinical.biobert en.assert.biobert, ... Open In Colab Healthcare-NER, NER_Clinical-Classifier, Toxic-Classifier
De-Identification Model overview med_ner.jsl.wip.clinical en.de_identify, med_ner.jsl.wip.clinical en.de_identify.clinical, ... Open In Colab NER-Clinical
Drug Normalization norm_drugs Open In Colab -
Entity Resolution med_ner.jsl.wip.clinical en.resolve_chunk.cpt_clinical, med_ner.jsl.wip.clinical en.resolve.icd10cm, ... Open In Colab NER-Clinical, Entity-Resolver clinical
Medical Named Entity Recognition en.med_ner.ade.clinical, en.med_ner.ade.clinical_bert, en.med_ner.anatomy,en.med_ner.anatomy.biobert, ... Open In Colab -
Relation Extraction en.med_ner.jsl.wip.clinical.greedy en.relation, en.med_ner.jsl.wip.clinical.greedy en.relation.bodypart.problem, ... Open In Colab -
Visualization of NLP-Models with Spark-NLP and NLU ner, dep.typed, med_ner.jsl.wip.clinical resolve_chunk.rxnorm.in, med_ner.jsl.wip.clinical resolve.icd10cm Open In Colab NER-Piple, Dependency Parsing, NER-Clinical, Entity-Resolver (Chunks) clinical
NLU Covid-19 Emotion Showcase emotion Open In GitHub Emotion detection
NLU Covid-19 Sentiment Showcase sentiment Open In GitHub Sentiment classification
NLU Airline Emotion Demo emotion Open In GitHub Emotion detection
NLU Airline Sentiment Demo sentiment Open In GitHub Sentiment classification
Bengali NER Hindi Embeddings for 30 Models bn.ner, bn.lemma, ja.lemma, am.lemma, bh.lemma, en.ner.onto.bert.small_l2_128,.. Open In Colab Bengali-NER, Bengali-Lemmatizer, Japanese-Lemmatizer, Amharic-Lemmatizer
Entity Resolution med_ner.jsl.wip.clinical en.resolve.umls, med_ner.jsl.wip.clinical en.resolve.loinc, med_ner.jsl.wip.clinical en.resolve.loinc.biobert Open In Colab -
NLU 20 Minutes Crashcourse - the fast Data Science route spell, sentiment, pos, ner, yake, en.t5, emotion, answer_question, en.t5.base ... Open In Colab T5-Model, Part of Speech, NER-Piple, Emotion detection , Spellchecker, Sentiment classification
Chapter 0: Intro: 1-liners sentiment, pos, ner, bert, elmo, embed_sentence.bert Open In Colab Part of Speech, NER-Piple, Sentiment classification, Elmo-Embedding, Bert-Sentence_Embedding
Chapter 1: NLU base-features with some classifiers on testdata emotion, yake, stem Open In Colab Emotion detection
Chapter 2: Translation between 300+ languages with Marian tr.translate_to.en, en.translate_to.fr, en.translate_to.he Open In Colab Translation-Pipeline (En to Fr), Translation (En to He)
Chapter 3: Answer questions and summarize Texts with T5 answer_question, en.t5, en.t5.base Open In Colab T5-Model
Chapter 4: Overview of T5-Tasks en.t5.base Open In Colab T5-Model
Graph NLU 20 Minutes Crashcourse - State of the Art Text Mining for Graphs spell, sentiment, pos, ner, yake, emotion, med_ner.jsl.wip.clinical, ... Open In Colab Part of Speech, NER-Piple, Emotion detection, Spellchecker, Sentiment classification
Healthcare with NLU med_ner.human_phenotype.gene_biobert, med_ner.ade_biobert, med_ner.anatomy, med_ner.bacterial_species,... Open In Colab -
Part 0: Intro: 1-liners spell, sentiment, pos, ner, bert, elmo, embed_sentence.bert Open In Colab Bert-Paper, Bert Github, T-SNE, T-SNE-Bert , Part of Speech, NER-Piple, Spellchecker, Sentiment classification, Elmo-Embedding , Bert-Sentence_Embedding
Part 1: NLU base-features with some classifiers on Testdata yake, stem, ner, emotion Open In Colab NER-Piple, Emotion detection
Part 2: Translate between 200+ Languages in 1 line of code with Marian-Models en.translate_to.de, en.translate_to.fr, en.translate_to.he Open In Colab Translation-Pipeline (En to Fr), Translation-Pipeline (En to Ger), Translation (En to He)
Part 3: More Multilingual NLP-translations for Asian Languages with Marian en.translate_to.hi, en.translate_to.ru, en.translate_to.zh Open In Colab Translation (En to Hi), Translation (En to Ru), Translation (En to Zh)
Part 4: Unsupervised Chinese Keyword Extraction, NER and Translation from Chinese news zh.translate_to.en, zh.segment_words, yake, zh.lemma, zh.ner Open In Colab Translation-Pipeline (Zh to En), Zh-Lemmatizer
Part 5: Multilingual sentiment classifier training for 100+ languages train.sentiment, xx.embed_sentence.labse train.sentiment n.a. Sentence_Embedding.Labse
Part 6: Question-answering and Text-summarization with the T5-Model answer_question, en.t5, en.t5.base Open In Colab T5-Paper
Part 7: Overview of all tasks available with T5 en.t5.base Open In Colab T5-Paper
Part 8: Overview of some of the Multilingual modes with State Of the Art accuracy (1-liner) bn.lemma, ja.lemma, am.lemma, bh.lemma, zh.segment_words, ... Open In Colab Bengali-Lemmatizer, Japanese-Lemmatizer , Amharic-Lemmatizer
Overview of some Multilingual modes available with State Of the Art accuracy (1-liner) bn.ner.cc_300d, ja.ner, zh.ner, th.ner.lst20.glove_840B_300D, ar.ner Open In Colab Bengali-NER
NLU 20 Minutes Crashcourse - the fast Data Science route - Open In Colab -

Need help?

Simple NLU Demos

Features in NLU Overview

  • Tokenization
  • Trainable Word Segmentation
  • Stop Words Removal
  • Token Normalizer
  • Document Normalizer
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Sentence Detector
  • Deep Sentence Detector (Deep learning)
  • Dependency parsing (Labeled/unlabeled)
  • Part-of-speech tagging
  • Sentiment Detection (ML models)
  • Spell Checker (ML and DL models)
  • Word Embeddings (GloVe and Word2Vec)
  • BERT Embeddings (TF Hub models)
  • ELMO Embeddings (TF Hub models)
  • ALBERT Embeddings (TF Hub models)
  • XLNet Embeddings
  • Universal Sentence Encoder (TF Hub models)
  • BERT Sentence Embeddings (42 TF Hub models)
  • Sentence Embeddings
  • Chunk Embeddings
  • Unsupervised keyword extraction
  • Language Detection & Identification (up to 375 languages)
  • Multi-class Sentiment analysis (Deep learning)
  • Multi-label Sentiment analysis (Deep learning)
  • Multi-class Text Classification (Deep learning)
  • Neural Machine Translation
  • Text-To-Text Transfer Transformer (Google T5)
  • Named entity recognition (Deep learning)
  • Easy TensorFlow integration
  • GPU Support
  • Full integration with Spark ML functions
  • 1000+ pre-trained models in 200+ languages!
  • Multi-lingual NER models: Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu and more
  • Natural Language inference
  • Coreference resolution
  • Sentence Completion
  • Word sense disambiguation
  • Clinical entity recognition
  • Clinical Entity Linking
  • Entity normalization
  • Assertion Status Detection
  • De-identification
  • Relation Extraction
  • Clinical Entity Resolution
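
Many of the features above are exposed through the same one-line interface. As one more hedged example, language identification (the 'lang' spell from the tutorial table) can be used to route mixed-language text to the right models:

import nlu

# Detect the language of each input string.
print(nlu.load('lang').predict(['This is an English sentence',
                                'Dies ist ein deutscher Satz',
                                'Ceci est une phrase française']))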

Citation

We have published a paper that you can cite for the NLU library:

@article{KOCAMAN2021100058,
    title = {Spark NLP: Natural language understanding at scale},
    journal = {Software Impacts},
    pages = {100058},
    year = {2021},
    issn = {2665-9638},
    doi = {https://doi.org/10.1016/j.simpa.2021.100058},
    url = {https://www.sciencedirect.com/science/article/pii/S2665963821000063},
    author = {Veysel Kocaman and David Talby},
    keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
    abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}

nlu's People

Contributors

agsfer, ahmedlone127, alexott, arkajyotichakraborty, brollb, c-k-loan, davebulaval, dcecchini, dependabot[bot], devintdha, diatrambitas, gadde5300, josejuanmartinez, luca-martial, mahmoodbayeshi, maziyarpanahi, meryem1425, milyiyo, murat-karadag, rajeshkppt, roverrwe, skocer, sonurdogan


nlu's Issues

Unknown environment issue with BioBert

I am using the nlu BioBert mapper to improve upon a tool that already exists called text2term. A few weeks ago, I was able to get the tool working on a personal computer (Mac), but shortly after when I switched to my new work computer (also Mac, same OS but with an Apple Chip instead of Intel), the program no longer worked even with the same source code, Python, and Java version.

A coworker recreated the issue with an Apple Chip computer, Python 3.9.5, and Java 17. If you have any insights, please let me know.

Here are the requirements with their versions, as well as the error:
Python 3.10.6 (Also tried 3.9.13)
Java version "1.8.0_341" (Also tried Java 16)
requirements.txt:

Owlready2==0.36
argparse==1.4.0
pandas==1.4.1
numpy==1.23.2
gensim==4.1.2
scipy==1.8.0
scikit-learn==1.0.2
setuptools==60.9.3
requests==2.27.1
tqdm==4.62.3
sparse_dot_topn==0.3.1
bioregistry==0.4.63
nltk==3.7
rapidfuzz==2.0.5
shortuuid==1.0.9

Error:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
[OK!]
Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 276, in get_trained_component_for_nlp_model_ref
    component.get_pretrained_model(nlp_ref, lang, model_bucket),
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/components/embeddings/sentence_bert/BertSentenceEmbedding.py", line 13, in get_pretrained_model
    return BertSentenceEmbeddings.pretrained(name,language,bucket) \
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/annotator/embeddings/bert_sentence_embeddings.py", line 231, in pretrained
    return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/pretrained/resource_downloader.py", line 40, in downloadModel
    j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/__init__.py", line 317, in __init__
    super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py", line 26, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py", line 36, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/pyspark/ml/wrapper.py", line 86, in _new_java_obj
    return java_obj(*java_args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/__init__.py", line 234, in load
    nlu_component = nlu_ref_to_component(nlu_ref)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 160, in nlu_ref_to_component
    resolved_component = get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_params)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 287, in get_trained_component_for_nlp_model_ref
    raise ValueError(f'Failure making component, nlp_ref={nlp_ref}, nlu_ref={nlu_ref}, lang={lang}, \n err={e}')
ValueError: Failure making component, nlp_ref=sent_biobert_pmc_base_cased, nlu_ref=en.embed_sentence.biobert.pmc_base_cased, lang=en, 
 err=An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/__main__.py", line 48, in <module>
    Text2Term().map_file(arguments.source, arguments.target, output_file=arguments.output, csv_columns=csv_columns,
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 63, in map_file
    return self.map(source_terms, target_ontology, source_terms_ids=source_terms_ids, base_iris=base_iris,
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 115, in map
    self._do_biobert_mapping(source_terms, target_terms, biobert_file)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 161, in _do_biobert_mapping
    biobert = BioBertMapper(ontology_terms)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/biobert_mapper.py", line 28, in __init__
    self.biobert = self.load_biobert()
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/biobert_mapper.py", line 34, in load_biobert
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/__init__.py", line 249, in load
    raise Exception(
Exception: Something went wrong during creating the Spark NLP model_anno_obj for your request =  en.embed_sentence.biobert.pmc_base_cased Did you use a NLU Spell?

Installing Java

Hello,
I wanted to know if it's possible to use nlu in a 'non-cells' editor like VS Code.
I tried to, but I get this error:
Exception: Java gateway process exited before sending its port number

I looked into the Colab file and I think I need to paste this:

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

but I don't have this file /usr/lib/jvm/java-8-openjdk-amd64

Could you please send it to me or post it on GitHub if that's the solution?
Thanks for your help

bad casing for nlp ref

Some NLP refs in the spellbook have wrong casing.
Double-check against the Models Hub / S3 metadata and fix.

did the last version support python==3.8.10

hx@hx-image:~$ streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py
Traceback (most recent call last):
  File "/home/hx/.local/bin/streamlit", line 5, in <module>
    from streamlit.web.cli import main
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/__init__.py", line 55, in <module>
    from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/delta_generator.py", line 38, in <module>
    from streamlit import config, cursor, env_util, logger, runtime, type_util, util
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/cursor.py", line 18, in <module>
    from streamlit.runtime.scriptrunner import get_script_run_ctx
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/__init__.py", line 16, in <module>
    from streamlit.runtime.runtime import Runtime as Runtime
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/runtime.py", line 28, in <module>
    from streamlit.runtime.app_session import AppSession
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/app_session.py", line 35, in <module>
    from streamlit.runtime import caching, legacy_caching
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/caching/__init__.py", line 21, in <module>
    from streamlit.runtime.state.session_state import WidgetMetadata
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/__init__.py", line 16, in <module>
    from streamlit.runtime.state.safe_session_state import (
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/safe_session_state.py", line 20, in <module>
    from streamlit.runtime.state.session_state import (
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 44, in <module>
    from streamlit.type_util import ValueFieldName, is_array_value_field_name
  File "/home/hx/.local/lib/python3.8/site-packages/streamlit/type_util.py", line 35, in <module>
    import pyarrow as pa
  File "/home/hx/.local/lib/python3.8/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
  File "pyarrow/compat.pxi", line 43, in init pyarrow.lib
  File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/__init__.py", line 3, in <module>
    from cloudpickle.cloudpickle import *
  File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 167, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 148, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

Model Loading

I am loading the model like this:

import sparknlp
import nlu

spark = sparknlp.start()
df = spark.read.csv("nlp_data.csv")
res = nlu.load("pos").predict(df[["text"]].rdd.flatMap(lambda x: x).collect())
print(res)
spark.stop()

Each time I get the following messages in my console:

com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3f17e4b8-0bdf-40c5-9879-d62f9c2dc974;1.0
        confs: [default]
        found com.johnsnowlabs.nlp#spark-nlp_2.12;5.2.3 in central
        found com.typesafe#config;1.4.2 in central
        found org.rocksdb#rocksdbjni;6.29.5 in central
        found com.amazonaws#aws-java-sdk-s3;1.12.500 in central
        found com.amazonaws#aws-java-sdk-kms;1.12.500 in central
        found com.amazonaws#aws-java-sdk-core;1.12.500 in central
        found commons-logging#commons-logging;1.1.3 in central
        found commons-codec#commons-codec;1.15 in central
        found org.apache.httpcomponents#httpclient;4.5.13 in central
        found org.apache.httpcomponents#httpcore;4.4.13 in central
        found software.amazon.ion#ion-java;1.0.2 in central
        found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 in central
        found joda-time#joda-time;2.8.1 in central
        found com.amazonaws#jmespath-java;1.12.500 in central
        found com.github.universal-automata#liblevenshtein;3.0.0 in central
        found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
        found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
        found com.google.code.gson#gson;2.3 in central
        found it.unimi.dsi#fastutil;7.0.12 in central
        found org.projectlombok#lombok;1.16.8 in central
        found com.google.cloud#google-cloud-storage;2.20.1 in central
        found com.google.guava#guava;31.1-jre in central
        found com.google.guava#failureaccess;1.0.1 in central
        found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
        found com.google.errorprone#error_prone_annotations;2.18.0 in central
        found com.google.j2objc#j2objc-annotations;1.3 in central
        found com.google.http-client#google-http-client;1.43.0 in central
        found io.opencensus#opencensus-contrib-http-util;0.31.1 in central
        found com.google.http-client#google-http-client-jackson2;1.43.0 in central
        found com.google.http-client#google-http-client-gson;1.43.0 in central
        found com.google.api-client#google-api-client;2.2.0 in central
        found com.google.oauth-client#google-oauth-client;1.34.1 in central
        found com.google.http-client#google-http-client-apache-v2;1.43.0 in central
        found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central
        found com.google.code.gson#gson;2.10.1 in central
        found com.google.cloud#google-cloud-core;2.12.0 in central
        found io.grpc#grpc-context;1.53.0 in central
        found com.google.auto.value#auto-value-annotations;1.10.1 in central
        found com.google.auto.value#auto-value;1.10.1 in central
        found javax.annotation#javax.annotation-api;1.3.2 in central
        found com.google.cloud#google-cloud-core-http;2.12.0 in central
        found com.google.http-client#google-http-client-appengine;1.43.0 in central
        found com.google.api#gax-httpjson;0.108.2 in central
        found com.google.cloud#google-cloud-core-grpc;2.12.0 in central
        found io.grpc#grpc-alts;1.53.0 in central
        found io.grpc#grpc-grpclb;1.53.0 in central
        found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central
        found io.grpc#grpc-auth;1.53.0 in central
        found io.grpc#grpc-protobuf;1.53.0 in central
        found io.grpc#grpc-protobuf-lite;1.53.0 in central
        found io.grpc#grpc-core;1.53.0 in central
        found com.google.api#gax;2.23.2 in central
        found com.google.api#gax-grpc;2.23.2 in central
        found com.google.auth#google-auth-library-credentials;1.16.0 in central
        found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central
        found com.google.api#api-common;2.6.2 in central
        found io.opencensus#opencensus-api;0.31.1 in central
        found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central
        found com.google.protobuf#protobuf-java;3.21.12 in central
        found com.google.protobuf#protobuf-java-util;3.21.12 in central
        found com.google.api.grpc#proto-google-common-protos;2.14.2 in central
        found org.threeten#threetenbp;1.6.5 in central
        found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central
        found com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central
        found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central
        found com.fasterxml.jackson.core#jackson-core;2.14.2 in central
        found com.google.code.findbugs#jsr305;3.0.2 in central
        found io.grpc#grpc-api;1.53.0 in central
        found io.grpc#grpc-stub;1.53.0 in central
        found org.checkerframework#checker-qual;3.31.0 in central
        found io.perfmark#perfmark-api;0.26.0 in central
        found com.google.android#annotations;4.1.1.4 in central
        found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central
        found io.opencensus#opencensus-proto;0.2.0 in central
        found io.grpc#grpc-services;1.53.0 in central
        found com.google.re2j#re2j;1.6 in central
        found io.grpc#grpc-netty-shaded;1.53.0 in central
        found io.grpc#grpc-googleapis;1.53.0 in central
        found io.grpc#grpc-xds;1.53.0 in central
        found com.navigamez#greex;1.0 in central
        found dk.brics.automaton#automaton;1.11-8 in central
        found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central
        found com.microsoft.onnxruntime#onnxruntime;1.16.3 in central
:: resolution report :: resolve 1966ms :: artifacts dl 54ms
        :: modules in use:
        com.amazonaws#aws-java-sdk-core;1.12.500 from central in [default]
        com.amazonaws#aws-java-sdk-kms;1.12.500 from central in [default]
        com.amazonaws#aws-java-sdk-s3;1.12.500 from central in [default]
        com.amazonaws#jmespath-java;1.12.500 from central in [default]
        com.fasterxml.jackson.core#jackson-core;2.14.2 from central in [default]
        com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 from central in [default]
        com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
        com.google.android#annotations;4.1.1.4 from central in [default]
        com.google.api#api-common;2.6.2 from central in [default]
        com.google.api#gax;2.23.2 from central in [default]
        com.google.api#gax-grpc;2.23.2 from central in [default]
        com.google.api#gax-httpjson;0.108.2 from central in [default]
        com.google.api-client#google-api-client;2.2.0 from central in [default]
        com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default]
        com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default]
        com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default]
        com.google.auth#google-auth-library-credentials;1.16.0 from central in [default]
        com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default]
        com.google.auto.value#auto-value;1.10.1 from central in [default]
        com.google.auto.value#auto-value-annotations;1.10.1 from central in [default]
        com.google.cloud#google-cloud-core;2.12.0 from central in [default]
        com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default]
        com.google.cloud#google-cloud-core-http;2.12.0 from central in [default]
        com.google.cloud#google-cloud-storage;2.20.1 from central in [default]
        com.google.code.findbugs#jsr305;3.0.2 from central in [default]
        com.google.code.gson#gson;2.10.1 from central in [default]
        com.google.errorprone#error_prone_annotations;2.18.0 from central in [default]
        com.google.guava#failureaccess;1.0.1 from central in [default]
        com.google.guava#guava;31.1-jre from central in [default]
        com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default]
        com.google.http-client#google-http-client;1.43.0 from central in [default]
        com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default]
        com.google.http-client#google-http-client-appengine;1.43.0 from central in [default]
        com.google.http-client#google-http-client-gson;1.43.0 from central in [default]
        com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default]
        com.google.j2objc#j2objc-annotations;1.3 from central in [default]
        com.google.oauth-client#google-oauth-client;1.34.1 from central in [default]
        com.google.protobuf#protobuf-java;3.21.12 from central in [default]
        com.google.protobuf#protobuf-java-util;3.21.12 from central in [default]
        com.google.re2j#re2j;1.6 from central in [default]
        com.johnsnowlabs.nlp#spark-nlp_2.12;5.2.3 from central in [default]
        com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default]
        com.microsoft.onnxruntime#onnxruntime;1.16.3 from central in [default]
        com.navigamez#greex;1.0 from central in [default]
        com.typesafe#config;1.4.2 from central in [default]
        commons-codec#commons-codec;1.15 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        dk.brics.automaton#automaton;1.11-8 from central in [default]
        io.grpc#grpc-alts;1.53.0 from central in [default]
        io.grpc#grpc-api;1.53.0 from central in [default]
        io.grpc#grpc-auth;1.53.0 from central in [default]
        io.grpc#grpc-context;1.53.0 from central in [default]
        io.grpc#grpc-core;1.53.0 from central in [default]
        io.grpc#grpc-googleapis;1.53.0 from central in [default]
        io.grpc#grpc-grpclb;1.53.0 from central in [default]
        io.grpc#grpc-netty-shaded;1.53.0 from central in [default]
        io.grpc#grpc-protobuf;1.53.0 from central in [default]
        io.grpc#grpc-protobuf-lite;1.53.0 from central in [default]
        io.grpc#grpc-services;1.53.0 from central in [default]
        io.grpc#grpc-stub;1.53.0 from central in [default]
        io.grpc#grpc-xds;1.53.0 from central in [default]
        io.opencensus#opencensus-api;0.31.1 from central in [default]
        io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default]
        io.opencensus#opencensus-proto;0.2.0 from central in [default]
        io.perfmark#perfmark-api;0.26.0 from central in [default]
        it.unimi.dsi#fastutil;7.0.12 from central in [default]
        javax.annotation#javax.annotation-api;1.3.2 from central in [default]
        joda-time#joda-time;2.8.1 from central in [default]
        org.apache.httpcomponents#httpclient;4.5.13 from central in [default]
        org.apache.httpcomponents#httpcore;4.4.13 from central in [default]
        org.checkerframework#checker-qual;3.31.0 from central in [default]
        org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default]
        org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default]
        org.projectlombok#lombok;1.16.8 from central in [default]
        org.rocksdb#rocksdbjni;6.29.5 from central in [default]
        org.threeten#threetenbp;1.6.5 from central in [default]
        software.amazon.ion#ion-java;1.0.2 from central in [default]
        :: evicted modules:
        commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default]
        commons-codec#commons-codec;1.11 by [commons-codec#commons-codec;1.15] in [default]
        com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
        com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
        com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   85  |   0   |   0   |   5   ||   80  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-3f17e4b8-0bdf-40c5-9879-d62f9c2dc974
        confs: [default]
        0 artifacts copied, 80 already retrieved (0kB/27ms)


pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ / ]pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ — ]Download done! Loading the resource.
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ]sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ / ]Download done! Loading the resource.
[ — ]2024-02-06 14:43:45.340048: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[OK!]

Is it indicating that I am downloading the model(s) from the internet again and again, or am I loading them from the jar files?
I assume that the jar files are now on my local system, since it took some time when I first installed spark-nlp, and now it just prints the jar information almost immediately when I run the code.

Something went wrong during loading and fitting the pipe...

I saw this error occur in the closed issues and I believe it was fixed in a later version. I'm not sure if this is the same issue as well.

system:
Windows 10
Python 3.8.8
Pyspark 3.0.2
NLU 3.1.1
Spark 3.1.2

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError:

Exception:
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Breaking dependencies

Hello, I'm trying to run your library in WSL but an error occurs with dependencies. The full trace is below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py", line 331, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'
callmarl@LAPTOP-QS9M6N2F ~/workzone/nlp % python --version
Python 3.9.2
callmarl@LAPTOP-QS9M6N2F ~/workzone/nlp % pip freeze
asttokens==2.4.0
backcall==0.2.0
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
colorama==0.4.6
databricks-api==0.9.0
databricks-cli==0.17.7
dataclasses==0.6
decorator==5.1.1
exceptiongroup==1.1.3
executing==1.2.0
idna==3.4
ipython==8.15.0
jedi==0.19.0
johnsnowlabs==5.0.7
matplotlib-inline==0.1.6
nlu==5.0.0
numpy==1.25.2
oauthlib==3.2.2
pandas==2.1.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
pkg_resources==0.0.0
prompt-toolkit==3.0.39
ptyprocess==0.7.0
pure-eval==0.2.2
py4j==0.10.9
pyarrow==13.0.0
pydantic==1.10.11
Pygments==2.16.1
PyJWT==2.8.0
pyspark==3.1.2
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
six==1.16.0
spark-nlp==5.0.2
spark-nlp-display==4.1
stack-data==0.6.2
svgwrite==1.4
tabulate==0.9.0
traitlets==5.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
wcwidth==0.2.6

GPU support

Hi,
I am using the Marian Models for translation.
It works fine, but I am assuming it works only on CPU
(I am using the following code:
pipe_translate = nlu.load('hu.translate_to.en')
translate = pipe_translate.predict("Sziasztok, mi a helyzet?")
and the predict part takes about 5 seconds, and I have an A100 GPU,
I don't think this should take so long...)
I can't figure out how to use the GPU, or how to check whether it is using the GPU...
(print(tf.test.gpu_device_name()) shows that the GPU is there...)
Where can I find some documentation/info about this issue?
I had some issues with the CUDA and Java installation, but right now these look fine...

Thanks

[New Feature] LLMs for Machine Translation of slot-annotated data

Describe the feature
Expansion of SLU to new languages requires a lot of manual data annotation. To significantly reduce the amount of work, LLMs can be used to machine-translate slot-annotated data, e.g.
"play me <a> Dune <a> on <b> Youtube <b>" => "Spiele mir <a> Dune <a> auf <b> Youtube <b>"

Such a feature is especially useful for expanding On-Device SLU to new languages, as high-quality multilingual transformers/LLMs cannot be used as the core SLU model in this case.

Expected behavior
The MT-LLM pipeline expects English sentences annotated in a generic <> tags format (for example: "play me <a> Dune <a> on <b> Youtube <b>") and outputs the translated sentence in the same format ("Spiele mir <a> Dune <a> auf <b> Youtube <b>"). Such a data format can be easily converted to BIO annotation and to other popular NLU formats.

Additional context
https://paperswithcode.com/paper/large-language-models-for-expansion-of-spoken

In our recent work, we fine-tuned an MT-LLM called BigTranslate for MT of slot-annotated NLU data. We used the parallel Amazon MASSIVE dataset for fine-tuning. There is a significant performance improvement after fine-tuning (compared to zero-shot LLM-based machine translation) on the multiATIS++ benchmark.

Here you can find fine-tuned BigTranslate: https://huggingface.co/Samsung/BigTranslateSlotTranslator
Here you can find code for fine-tuning + code for NLU training: https://github.com/samsung/mt-llm-nlu

In summary, we are wondering how we can merge our work into this project, and what parts of our work might be useful for it (e.g., scripts for conversion from BIO to tags format?).

using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word

Hi, so we are working on generating BioBERT embeddings for our project. When we run it on a single word it takes about a second or so. When we run it on a list of 10,000 words, it either times out or takes upwards of hours to run. Is this normal? Below is how we are using it:

def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2fs seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]

Missing embedding results

Hello,

It seems like embeddings are not returned when running any of the embedding predictions. Sentiment and other models do return results fine though.

Any ideas what I could be missing here?

(screenshot of the prediction output omitted)

nlu 3.3.0
pyspark 3.0.2
py4j 0.10.9
spark-nlp 3.3.2
running on google colab

logging

Hi!
I would like to ask if it is possible to turn off logging or change the logging level from a Python script that uses the nlu library.
Even a simple 'import nlu' generates lines of logs, and when loading models there are tons of them...

Before importing nlu, I am trying to create a PySpark context and set the desired log level, as suggested in the log output from importing nlu: Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel)

But it doesn't seem to help; actually the opposite, I then can't load models and run the predictions...

The other approach was setting logging levels for all possible loggers: nlu, py4j, py4j.java_gateway to CRITICAL in my case

logging.getLogger('nlu').setLevel(logging.CRITICAL)
logging.getLogger('py4j').setLevel(logging.CRITICAL)
logging.getLogger('py4j.java_gateway').setLevel(logging.CRITICAL)

But it also didn't help.
There are still messages from e.g. WARN SparkSession$Builder, WARN ApacheUtils, I tensorflow/core/platform/cpu_feature_guard.cc:142], etc...

spark nlu load error

I am trying to explore the NLU models first and then the NLU Healthcare models.
The nlu.load('emotion') step is failing. The logs are attached.

OS – Linux RHEL
Pyspark – version 3.0.1
Command used for install - python3 -m pip install nlu pyspark==3.0.1 --trusted-host pypi.org --trusted-host files.pythonhosted.org
I have created a python venv and install the NLU per above command.

I also tried reinstalling with below command:
python3 -m pip install --upgrade nlu streamlit pyspark==3.0.2

Code below:
import nlu
pp=nlu.load('emotion')
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[ / ]
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.NoClassDefFoundError: org/tensorflow/Tensor
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:397)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel(TensorflowSerializeModel.scala:145)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel$(TensorflowSerializeModel.scala:120)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflowModel(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow(ClassifierDLModel.scala:278)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow$(ClassifierDLModel.scala:276)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflow(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1$adapted(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:47)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:46)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:46)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:35)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:333)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:327)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:456)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.tensorflow.Tensor
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 34 more
[OK!]
EXCEPTION: Could not resolve singular Component for type=emotion and nlp_ref=classifierdl_use_emotion and nlu_ref=emotion and lang =en
Traceback (most recent call last):
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py", line 852, in construct_component_from_identifier
is_licensed=is_licensed)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/components/classifier.py", line 69, in init
else : self.model = ClassifierDl.get_pretrained_model(nlp_ref, language)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/components/classifiers/classifier_dl/classifier_dl.py", line 11, in get_pretrained_model
return ClassifierDLModel.pretrained(name,language,bucket)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/annotator.py", line 8063, in pretrained
return ResourceDownloader.downloadModel(ClassifierDLModel, name, lang, remote_loc)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/pretrained.py", line 62, in downloadModel
raise e
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/pretrained.py", line 59, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 214, in init
name, language, remote_loc)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 165, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 175, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/pyspark/ml/wrapper.py", line 69, in _new_java_obj
return java_obj(*java_args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.NoClassDefFoundError: org/tensorflow/Tensor
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:397)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel(TensorflowSerializeModel.scala:145)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel$(TensorflowSerializeModel.scala:120)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflowModel(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow(ClassifierDLModel.scala:278)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow$(ClassifierDLModel.scala:276)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflow(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1$adapted(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:47)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:46)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:46)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:35)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:333)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:327)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:456)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.tensorflow.Tensor
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 34 more


ValueError Traceback (most recent call last)
/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/__init__.py in load(request, path, verbose, gpu, streamlit_caching)
341 if nlu_ref == '': continue
--> 342 nlu_component = nlu_ref_to_component(nlu_ref, authenticated=is_authenticated)
343 # if we get a list of components, then the NLU reference is a pipeline, we do not need to check order

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py in nlu_ref_to_component(nlu_reference, detect_lang, authenticated, is_recursive_call)
322 authenticated=authenticated,
--> 323 is_recursive_call=is_recursive_call)
324 if resolved_component is None:

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py in resolve_component_from_parsed_query_data(lang, component_type, dataset, component_embeddings, nlu_ref, trainable, path, authenticated, is_recursive_call)
467 if constructed_component is None:
--> 468 raise ValueError(f'EXCEPTION : Could not create NLU component for nlp_ref={nlp_ref} and nlu_ref={nlu_ref}')
469 else:

ValueError: EXCEPTION : Could not create NLU component for nlp_ref=classifierdl_use_emotion and nlu_ref=emotion

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in
----> 1 pp=nlu.load('emotion')

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/__init__.py in load(request, path, verbose, gpu, streamlit_caching)
360 print(e[1])
361 raise Exception(
--> 362 "Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?")
363
364

Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

`pyspark.sql.utils.IllegalArgumentException` on fresh install

  • Using Windows 10, same errors in Ubuntu WSL
  • Java version: openjdk version "1.8.0_282" (equivalent to JDK 8)
  • Installed with pip
  • in Python 3.6: >>> import nlu without errors
>>> nlu.load('tokenize').predict('Each word and symbol in a sentence will generate token.') # From the homepage
Ivy Default Cache set to: C:\Users\USERNAME\.ivy2\cache
The jars for the packages stored in: C:\Users\USERNAME\.ivy2\jars
:: loading settings :: url = jar:file:/C:/Users/USERNAME/.conda/envs/py36nlp/Lib/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.johnsnowlabs.nlp#spark-nlp_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f4225ffb-68be-4e92-a6fc-c8cf6d7928e2;1.0
        confs: [default]
        found com.johnsnowlabs.nlp#spark-nlp_2.11;2.7.5 in central
        found com.typesafe#config;1.3.0 in central
        found org.rocksdb#rocksdbjni;6.5.3 in central
        found com.amazonaws#aws-java-sdk;1.7.4 in central
        found commons-logging#commons-logging;1.1.1 in central
        found org.apache.httpcomponents#httpclient;4.2 in central
        found org.apache.httpcomponents#httpcore;4.2 in central
        found commons-codec#commons-codec;1.3 in central
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.ivy.util.url.IvyAuthenticator (file:/C:/Users/USERNAME/.conda/envs/py36nlp/Lib/site-packages/pyspark/jars/ivy-2.4.0.jar) to field java.net.Authenticator.theAuthenticator
WARNING: Please consider reporting this to the maintainers of org.apache.ivy.util.url.IvyAuthenticator
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
        found joda-time#joda-time;2.10.10 in central
        [2.10.10] joda-time#joda-time;[2.2,)
        found com.github.universal-automata#liblevenshtein;3.0.0 in central
        found com.google.code.findbugs#annotations;3.0.1 in central
        found net.jcip#jcip-annotations;1.0 in central
        found com.google.code.findbugs#jsr305;3.0.1 in central
        found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
        found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
        found com.google.code.gson#gson;2.3 in central
        found it.unimi.dsi#fastutil;7.0.12 in central
        found org.projectlombok#lombok;1.16.8 in central
        found org.slf4j#slf4j-api;1.7.21 in central
        found com.navigamez#greex;1.0 in central
        found dk.brics.automaton#automaton;1.11-8 in central
        found org.json4s#json4s-ext_2.11;3.5.3 in central
        found org.joda#joda-convert;1.8.1 in central
        found org.tensorflow#tensorflow;1.15.0 in central
        found org.tensorflow#libtensorflow;1.15.0 in central
        found org.tensorflow#libtensorflow_jni;1.15.0 in central
        found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 1184ms :: artifacts dl 28ms
        :: modules in use:
        com.amazonaws#aws-java-sdk;1.7.4 from central in [default]
        com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
        com.google.code.findbugs#annotations;3.0.1 from central in [default]
        com.google.code.findbugs#jsr305;3.0.1 from central in [default]
        com.google.code.gson#gson;2.3 from central in [default]
        com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
        com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
        com.johnsnowlabs.nlp#spark-nlp_2.11;2.7.5 from central in [default]
        com.navigamez#greex;1.0 from central in [default]
        com.typesafe#config;1.3.0 from central in [default]
        commons-codec#commons-codec;1.3 from central in [default]
        commons-logging#commons-logging;1.1.1 from central in [default]
        dk.brics.automaton#automaton;1.11-8 from central in [default]
        it.unimi.dsi#fastutil;7.0.12 from central in [default]
        joda-time#joda-time;2.10.10 from central in [default]
        net.jcip#jcip-annotations;1.0 from central in [default]
        net.sf.trove4j#trove4j;3.0.3 from central in [default]
        org.apache.httpcomponents#httpclient;4.2 from central in [default]
        org.apache.httpcomponents#httpcore;4.2 from central in [default]
        org.joda#joda-convert;1.8.1 from central in [default]
        org.json4s#json4s-ext_2.11;3.5.3 from central in [default]
        org.projectlombok#lombok;1.16.8 from central in [default]
        org.rocksdb#rocksdbjni;6.5.3 from central in [default]
        org.slf4j#slf4j-api;1.7.21 from central in [default]
        org.tensorflow#libtensorflow;1.15.0 from central in [default]
        org.tensorflow#libtensorflow_jni;1.15.0 from central in [default]
        org.tensorflow#tensorflow;1.15.0 from central in [default]
        :: evicted modules:
        commons-codec#commons-codec;1.6 by [commons-codec#commons-codec;1.3] in [default]
        joda-time#joda-time;2.9.5 by [joda-time#joda-time;2.10.10] in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   29  |   1   |   0   |   2   ||   27  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f4225ffb-68be-4e92-a6fc-c8cf6d7928e2
        confs: [default]
        0 artifacts copied, 27 already retrieved (0kB/17ms)
21/03/08 16:01:46 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2823)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:191)
        at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:147)
        at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:145)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
        at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:343)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/03/08 16:01:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
No accepted Data type or usable columns found or applying the NLU models failed.
Make sure that the first column you pass to .predict() is the one that nlu should predict on OR rename the column you want to predict on to 'text'
If you are on Google Collab, click on Run time and try factory reset Runtime run the setup script again, you might have used too much memory
On Kaggle try to reset restart session and run the setup script again, you might have used too much memory
Full Stacktrace: see bottom
Additional info:
<class 'pyspark.sql.utils.IllegalArgumentException'> pipeline.py 1380
Stuck? Contact us on Slack! https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA

Same errors occur when running nlu.load('tokenize').predict('Each word and symbol in a sentence will generate token.')
Full stack trace:

Full Stacktrace was (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Unsupported class file major version 55', 'org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
         at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:50)
         at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:845)
         at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:828)
         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
         at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
         at org.apache.spark.util.FieldAccessFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:828)
         at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
         at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
         at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
         at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:272)
         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:271)
         at scala.collection.immutable.List.foreach(List.scala:392)
         at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:271)
         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:163)
         at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:820)
         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:819)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
         at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
         at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:819)
         at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
         at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.GenerateExec.doExecute(GenerateExec.scala:80)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
         at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
         at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
         at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)
         at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)
         at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
         at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
         at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
         at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
         at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
         at py4j.Gateway.invoke(Gateway.java:282)
         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
         at py4j.commands.CallCommand.execute(CallCommand.java:79)
         at py4j.GatewayConnection.run(GatewayConnection.java:238)
         at java.base/java.lang.Thread.run(Thread.java:834)'), <traceback object at 0x0000024A29857188>)
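For the winutils part of this trace, a common workaround (a sketch with placeholder paths, and only for the "Failed to locate the winutils binary" error) is to point HADOOP_HOME at a folder that contains bin\winutils.exe before importing nlu. Note also that class file major version 55 corresponds to Java 11, so the JVM Spark actually picked up here may not be the Java 8 install listed above.

import os
os.environ["HADOOP_HOME"] = r"C:\hadoop"              # placeholder: folder containing bin\winutils.exe
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

import nlu
nlu.load('tokenize').predict('Each word and symbol in a sentence will generate token.')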

DataFrame problem with pyspark and pandas interaction

When executing the following code, an error occurs

from johnsnowlabs import nlp

pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")
...
Approximate size to download 354.6 KB
Download done! Loading the resource.
[OK!]
Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/home/user/Documents/test/nlu/test_maen.py", line 8, in <module>
    pipeline.predict("I love this Documentation! It's so good!")
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/session.py", line 603, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 327, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'. Did you mean: 'isetitem'?

There is a solution for this error on Stack Overflow.
Maybe you should pin the right pandas version in the dependencies of the johnsnowlabs module?
For example pandas >= 1.3.5, < 2
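Two possible workarounds until such a pin lands (both are assumptions, not an official fix): install a pandas release below 2.0, or alias the removed DataFrame.iteritems back to DataFrame.items so the installed pyspark can keep calling it.

# pip install "pandas>=1.3.5,<2"    <- option 1: pin pandas below 2.0
import pandas as pd

if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items   # option 2: restore the alias removed in pandas 2.0

from johnsnowlabs import nlp
pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")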

Platform - Fedora Linux 36

openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7, mixed mode, sharing)

Error when loading match.datetime component

import nlu
nlu.load('match.datetime').predict('In the years 2000/01/01 to 2010/01/01 a lot of things happened')

Running it in Colab after pip install nlu pyspark==3.0.2, I get this error:
Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Error while trying to load nlu.load('embed_sentence.bert')

I am trying to create a sentence similarity model using Spark NLP, but I am getting the two different errors below.

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]

IllegalArgumentException Traceback (most recent call last)
File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:276, in get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_configs)
274 if component.get_pretrained_model:
275 component = component.set_metadata(
--> 276 component.get_pretrained_model(nlp_ref, lang, model_bucket),
277 nlu_ref, nlp_ref, lang, False, license_type)
278 else:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\components\embeddings\sentence_bert\BertSentenceEmbedding.py:13, in BertSentence.get_pretrained_model(name, language, bucket)
11 @staticmethod
12 def get_pretrained_model(name, language, bucket=None):
---> 13 return BertSentenceEmbeddings.pretrained(name,language,bucket)
14 .setInputCols('sentence')
15 .setOutputCol("sentence_embeddings")

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\annotator\embeddings\bert_sentence_embeddings.py:231, in BertSentenceEmbeddings.pretrained(name, lang, remote_loc)
230 from sparknlp.pretrained import ResourceDownloader
--> 231 return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\pretrained\resource_downloader.py:40, in ResourceDownloader.downloadModel(reader, name, language, remote_loc, j_dwn)
39 try:
---> 40 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
41 except Py4JJavaError as e:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\__init__.py:317, in _DownloadModel.__init__(self, reader, name, language, remote_loc, validator)
316 def __init__(self, reader, name, language, remote_loc, validator):
--> 317 super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
318 name, language, remote_loc)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\extended_java_wrapper.py:26, in ExtendedJavaWrapper.__init__(self, java_obj, *args)
25 self.sc = SparkContext._active_spark_context
---> 26 self._java_obj = self.new_java_obj(java_obj, *args)
27 self.java_obj = self._java_obj

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\extended_java_wrapper.py:36, in ExtendedJavaWrapper.new_java_obj(self, java_class, *args)
35 def new_java_obj(self, java_class, *args):
---> 36 return self._new_java_obj(java_class, *args)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\pyspark\ml\wrapper.py:69, in JavaWrapper._new_java_obj(java_class, *args)
68 java_args = [_py2java(sc, arg) for arg in args]
---> 69 return java_obj(*java_args)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\py4j\java_gateway.py:1304, in JavaMember.__call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1307 for temp_arg in temp_args:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\pyspark\sql\utils.py:134, in capture_sql_exception.<locals>.deco(*a, **kw)
131 if not isinstance(converted, UnknownException):
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:

File <string>:3, in raise_from(e)

IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(sent_small_bert_L2_128,Some(en),public/models,4.0.2,3.3.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@c7c973f

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\__init__.py:234, in load(request, path, verbose, gpu, streamlit_caching, m1_chip)
233 continue
--> 234 nlu_component = nlu_ref_to_component(nlu_ref)
235 # if we get a list of components, then the NLU reference is a pipeline, we do not need to check order

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:160, in nlu_ref_to_component(nlu_ref, detect_lang, authenticated)
159 else:
--> 160 resolved_component = get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_params)
162 if resolved_component is None:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:287, in get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_configs)
286 except Exception as e:
--> 287 raise ValueError(f'Failure making component, nlp_ref={nlp_ref}, nlu_ref={nlu_ref}, lang={lang}, \n err={e}')
289 return component

ValueError: Failure making component, nlp_ref=sent_small_bert_L2_128, nlu_ref=embed_sentence.bert, lang=en,
err=requirement failed: Was not found appropriate resource to download for request: ResourceRequest(sent_small_bert_L2_128,Some(en),public/models,4.0.2,3.3.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@c7c973f

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
Cell In [16], line 2
1 import nlu
----> 2 pipe = nlu.load('embed_sentence.bert')
3 print("pipe",pipe)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\__init__.py:249, in load(request, path, verbose, gpu, streamlit_caching, m1_chip)
247 print(e[1])
248 print(err)
--> 249 raise Exception(
250 f"Something went wrong during creating the Spark NLP model_anno_obj for your request = {request} Did you use a NLU Spell?")
251 # Complete Spark NLP Pipeline, which is defined as a DAG given by the starting Annotators
252 try:

Exception: Something went wrong during creating the Spark NLP model_anno_obj for your request = embed_sentence.bert Did you use a NLU Spell?
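Assuming the download error can be resolved, a minimal sentence-similarity sketch on top of embed_sentence.bert could look like this; the exact embedding column name varies by model, so it is looked up dynamically rather than hard-coded.

import nlu
import numpy as np

pipe = nlu.load('embed_sentence.bert')
preds = pipe.predict(['I love NLU', 'I really like NLU'], output_level='document')

emb_col = [c for c in preds.columns if 'embedding' in c][0]   # column name depends on the model
a, b = (np.array(v) for v in preds[emb_col])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(similarity)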

Elmo does not work

I installed all the packages and ran the examples, but Elmo does not work.
(two screenshots of the error were attached)

please help me!

Unable to load en.ner.dl.bert

I have the following code:

documents = ["Open my files on oceans.", "Open my presentation on oceans.", "open my presentation on week 6 day 3"]
nlu_model = nlu.load('en.ner.dl.bert')
nlu_model.predict(documents, output_level='token') 

The nlu.load('en.ner.dl.bert') part causes an error that I am not sure how to fix:

ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
<class 'AttributeError'>
'NoneType' object has no attribute '__set_missing_model_attributes__'
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?
The NLU components could not be properly created. Please check previous print messages and Verbose mode for further info

My environment:
ubuntu 20.10
python 3.7.9
pyspark 2.4.7
spark-nlp 2.6.5

I appreciate your help

Embed Japanese Sentences with Bert

Hi, thanks for such a convenient tool! I would like to ask the authors: does this tool support embedding Japanese sentences with BERT? Thank you

combining 'sentiment' and 'emotion' models causes crash

I'm working in a Google Colab notebook and I set up via

!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

import nlu

A quick version check with nlu.version() confirms 3.4.2.

Several of the official tutorial notebooks (for ex.: XLNet) create a multi-model pipeline that includes both 'sentiment' and 'emotion'.

Direct copy of content from the notebook:

import pandas as pd

# Download the dataset 
!wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

# Load dataset to Pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')

pipe = nlu.load('sentiment pos xlnet emotion') 

df['text'] = df['comment']

max_rows = 200

predictions = pipe.predict(df.iloc[0:100][['comment','label']], output_level='token')

predictions

However, running a prediction on this pipe results in the following error:


sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
xlnet_base_cased download started this may take some time.
Approximate size to download 417.5 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-1-9b2e4a06bf65> in <module>()
     34 
     35 # NLU to gives us one row per embedded word by specifying the output level
---> 36 predictions = pipe.predict( df.iloc[0:5][['text','label']], output_level='token' )
     37 
     38 display(predictions)

9 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)

IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SentimentDLModel_6c1a68f3f2c7.

Current inputCols: sentence_embeddings@glove_100d. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=sentence,is_nlp_annotator=true,type=document)
(column_name=sentence_embeddings@tfhub_use,is_nlp_annotator=true,type=sentence_embeddings).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: sentence_embeddings

Having experimented with various combinations of models, it turns out that the problem is caused whenever 'sentiment' and 'emotion' models are specified in the same pipeline (regardless of pipeline order or what other models are listed).

Running pipe = nlu.load('emotion ANY OTHER MODELS') or pipe = nlu.load('sentiment ANY OTHER MODELS') will be successful, so it really appears to be only a result of combining 'sentiment' and 'emotion'

Is this a known bug? Does anyone have any suggestions for fixing?

My temporary solution has been to run emoPipe = nlu.load('emotion').predict() in isolation, then inner join the resulting dataframe to the resulting df of pipe = nlu.load('sentiment pos xlnet').predict().
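A rough sketch of that workaround (the column handling and output_level are assumptions; document-level output keeps one row per input, so a plain index join works):

import nlu
import pandas as pd

df = pd.read_csv('/tmp/train-balanced-sarcasm.csv').iloc[0:100].copy()
df['text'] = df['comment']

emo_df  = nlu.load('emotion').predict(df[['text', 'label']], output_level='document')
rest_df = nlu.load('sentiment pos xlnet').predict(df[['text', 'label']], output_level='document')

merged = rest_df.join(emo_df, rsuffix='_emotion')   # join the two result frames on their shared index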

However, I would like to understand better what is failing and to know if there is a way to streamline the inclusion of all models.

Thanks

Issue with nlu.load('sentiment')

I'm trying to follow the example at nlu/examples/colab/component_examples/sequence2sequence/translation_demo.ipynb but I keep getting this error when calling nlu.load('sentiment').

My code:

import nlu 
nlu.load('sentiment').predict('I love NLU! <3') 

My error:

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]
<class 'pyspark.sql.utils.IllegalArgumentException'>
'Unsupported class file major version 55'
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?
<nlu.NluError at 0x7f046d214be0>

Java problems when using the library

Hello, I followed all the installation steps in the documentation, but it was not enough to get the library working.

Then I had to install the JDK and specify the interpreter and the path to the JDK:

import os
from johnsnowlabs import nlp

os.environ["PYSPARK_DRIVER_PYTHON"] = "D:\\myproject\\nlp_command\\.venv\\Scripts"
os.environ["JAVA_HOME"] = "C:\\Program Files\\Java\\jdk-20"

pipeline = nlp.load('sentiment')
# pipeline.predict("I love this Documentation! It's so good!")

But I'm still getting the "Java gateway process exited before sending its port number" error.

Platform - windows 10

java version "20.0.2" 2023-07-18
Java(TM) SE Runtime Environment (build 20.0.2+9-78)
Java HotSpot(TM) 64-Bit Server VM (build 20.0.2+9-78, mixed mode, sharing)
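Two things are worth checking here (both are assumptions about this setup, not a confirmed diagnosis): PYSPARK_DRIVER_PYTHON is normally expected to point at the python executable itself rather than at the Scripts folder, and Spark NLP releases target JDK 8 or 11 rather than JDK 20. A sketch with placeholder paths:

import os

os.environ["PYSPARK_PYTHON"] = r"D:\myproject\nlp_command\.venv\Scripts\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"D:\myproject\nlp_command\.venv\Scripts\python.exe"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"   # placeholder for a JDK 8/11 install

from johnsnowlabs import nlp
pipeline = nlp.load('sentiment')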

How to set the batch size?

Hi,

The prediction process takes a long time to finish, so I checked the GPU memory usage and found it only uses 3 GB of memory (I have a 16 GB GPU).
I want to set a larger batch size to speed up the process, but I can't find the argument.
How do I set the batch size when using the predict function?

import nlu
pipe = nlu.load('xx.embed_sentence.labse', gpu=True)
pipe.predict(text, output_level='document')

Thanks
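If nlu.load() does not expose a batch-size argument for this model, one fallback (a sketch, not the official NLU answer; the model name "labse" and the column names are assumptions to check against the Models Hub) is to build the same embedding step with Spark NLP directly, where transformer annotators provide setBatchSize():

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start(gpu=True)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = (BertSentenceEmbeddings.pretrained("labse", "xx")
              .setInputCols(["document"])
              .setOutputCol("sentence_embeddings")
              .setBatchSize(64))   # larger batches trade GPU memory for throughput

pipeline = Pipeline(stages=[document, embeddings])
df = spark.createDataFrame([("some text to embed",)], ["text"])
result = pipeline.fit(df).transform(df)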

error while downloading Hebrew NER

I am running the official Docker image of nlp-server and trying to run NER on a Hebrew sentence, but it fails to download the model. I also tried to download the model manually and it says access denied.

load error

I get the following error when running:

import nlu
nlu.load('elmo')

using configuration:
OS: Windows 10
Java version: 1.8.0_311 (Java 8)
Pyspark – version: 3.1.2

:: loading settings :: url = jar:file:/C:/Spark/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\Lukas\.ivy2\cache
The jars for the packages stored in: C:\Users\Lukas\.ivy2\jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f9a2f2a7-e7ac-44f5-a922-ae1493621cbc;1.0
confs: [default]
found com.johnsnowlabs.nlp#spark-nlp_2.12;3.3.4 in central
found com.typesafe#config;1.4.1 in central
found org.rocksdb#rocksdbjni;6.5.3 in central
found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
found com.github.universal-automata#liblevenshtein;3.0.0 in central
found com.google.code.findbugs#annotations;3.0.1 in central
found net.jcip#jcip-annotations;1.0 in central
found com.google.code.findbugs#jsr305;3.0.1 in central
found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
found com.google.code.gson#gson;2.3 in central
found it.unimi.dsi#fastutil;7.0.12 in central
found org.projectlombok#lombok;1.16.8 in central
found org.slf4j#slf4j-api;1.7.21 in central
found com.navigamez#greex;1.0 in central
found dk.brics.automaton#automaton;1.11-8 in central
found org.json4s#json4s-ext_2.12;3.5.3 in central
found joda-time#joda-time;2.9.5 in central
found org.joda#joda-convert;1.8.1 in central
found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 in central
found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 391ms :: artifacts dl 16ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.11.603 from central in [default]
com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
com.google.code.findbugs#annotations;3.0.1 from central in [default]
com.google.code.findbugs#jsr305;3.0.1 from central in [default]
com.google.code.gson#gson;2.3 from central in [default]
com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
com.johnsnowlabs.nlp#spark-nlp_2.12;3.3.4 from central in [default]
com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 from central in [default]
com.navigamez#greex;1.0 from central in [default]
com.typesafe#config;1.4.1 from central in [default]
dk.brics.automaton#automaton;1.11-8 from central in [default]
it.unimi.dsi#fastutil;7.0.12 from central in [default]
joda-time#joda-time;2.9.5 from central in [default]
net.jcip#jcip-annotations;1.0 from central in [default]
net.sf.trove4j#trove4j;3.0.3 from central in [default]
org.joda#joda-convert;1.8.1 from central in [default]
org.json4s#json4s-ext_2.12;3.5.3 from central in [default]
org.projectlombok#lombok;1.16.8 from central in [default]
org.rocksdb#rocksdbjni;6.5.3 from central in [default]
org.slf4j#slf4j-api;1.7.21 from central in [default]
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   21  |   0   |   0   |   0   ||   21  |   0   |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f9a2f2a7-e7ac-44f5-a922-ae1493621cbc
confs: [default]
0 artifacts copied, 21 already retrieved (0kB/0ms)
22/01/14 17:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
elmo download started this may take some time.
22/01/14 17:31:05 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
EXCEPTION: Could not resolve singular Component for type=elmo and nlp_ref=elmo and nlu_ref=elmo and lang =en
Traceback (most recent call last):
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 708, in construct_component_from_identifier
return Embeddings(get_default=False, nlp_ref=nlp_ref, nlu_ref=nlu_ref, lang=language,
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\components\embedding.py", line 98, in init
else : self.model =SparkNLPElmo.get_pretrained_model(nlp_ref, lang)
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\components\embeddings\elmo\spark_nlp_elmo.py", line 14, in get_pretrained_model
return ElmoEmbeddings.pretrained(name,language)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\annotator.py", line 7760, in pretrained
return ResourceDownloader.downloadModel(ElmoEmbeddings, name, lang, remote_loc)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\pretrained.py", line 50, in downloadModel
file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 231, in init
super(_GetResourceSize, self).init(
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 165, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 175, in new_java_obj
return self._new_java_obj(java_class, *args)
File "D:.venv\python3.8_nlu\lib\site-packages\pyspark\ml\wrapper.py", line 66, in _new_java_obj
return java_obj(*java_args)
File "D:.venv\python3.8_nlu\lib\site-packages\py4j\java_gateway.py", line 1304, in call
return_value = get_return_value(
File "D:.venv\python3.8_nlu\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "D:.venv\python3.8_nlu\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
at org.json4s.Formats$.customDeserializer(Formats.scala:66)
at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
at org.json4s.Extraction$.extract(Extraction.scala:454)
at org.json4s.Extraction$.extract(Extraction.scala:56)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:101)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:129)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:128)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at scala.collection.AbstractIterator.to(Iterator.scala:1431)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:128)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:123)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:78)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:577)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 51 more

Traceback (most recent call last):
File "D:.venv\python3.8_nlu\lib\site-packages\nlu_init_.py", line 236, in load
nlu_component = nlu_ref_to_component(nlu_ref, authenticated=is_authenticated)
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 171, in nlu_ref_to_component
resolved_component = resolve_component_from_parsed_query_data(language, component_type, dataset,
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 320, in resolve_component_from_parsed_query_data
raise ValueError(f'EXCEPTION : Could not create NLU component for nlp_ref={nlp_ref} and nlu_ref={nlu_ref}')
ValueError: EXCEPTION : Could not create NLU component for nlp_ref=elmo and nlu_ref=elmo

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "D:.venv\python3.8_nlu\lib\site-packages\nlu_init_.py", line 255, in load
raise Exception(
Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Unsupported class file major version 58

Hi,
First, thanks a lot for this nice work!
It seems that you used Spark and the Java version of TensorFlow to accomplish that, right (with Python as a wrapper)?

I tried to install the nlu package on my Python 3.8, which for now doesn't work (and that's okay, see #11).
So I created a virtual environment with Python 3.7.

Launching ipython, import nlu works.

However, when I try to do as the docs say: nlu.load('sentiment').predict('Why is NLU is awesome? Because of the sauce!') (in fact just nlu.load('sentiment') is the cause of the crash)

It returns:

<class 'pyspark.sql.utils.IllegalArgumentException'>
'Unsupported class file major version 58'

I'm using Arch Linux (zen kernel 5.9.1) in a new virtual env with only wheel and nlu installed, on Python 3.7.9.

I have java 8 / 11 and 14 installed
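Class file major version 58 corresponds to Java 14, so the JVM pyspark picked up was most likely the Java 14 install rather than Java 8. A sketch (the JVM path is a placeholder for an Arch Linux layout) that pins Java 8 before nlu starts Spark:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"                      # placeholder path
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]  # put that java first on PATH

import nlu
nlu.load('sentiment').predict('Why is NLU is awesome? Because of the sauce!')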

nlu support on Python 3.8

On import nlu, it looks like pyspark/cloudpickle.py is failing with:

TypeError: an integer is required (got type bytes). From some research, I found this is an issue with running pyspark on Python 3.8. I am not sure if this is the only cause, but if it is, I recommend adding a Python < 3.8 requirement.
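A sketch of that suggestion (an assumption about where it would live in nlu's own setup.py, not a change that has been made): declaring python_requires makes pip refuse to install the package on Python 3.8 until the pyspark/cloudpickle issue is resolved.

from setuptools import setup, find_packages

setup(
    name="nlu",
    packages=find_packages(),
    python_requires=">=3.6,<3.8",   # reject Python 3.8+ at install time
)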

Unable to pip install nlu (macOS BigSur, Python3.9)

Hello,
I was able to pip install nlu before upgrading macOS.
After the upgrade, I wanted to get a clean environment, and when I tried to install nlu again I got this error:

pablos-MBP:spark pablo$ pip install nlu
Defaulting to user installation because normal site-packages is not writeable
Collecting nlu
  Using cached nlu-1.0.2-py3-none-any.whl (150 kB)
Collecting pyarrow>=0.16.0
  Using cached pyarrow-1.0.1.tar.gz (1.3 MB)
  Installing build dependencies ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/local/opt/python@3.9/bin/python3.9 /Users/pablo/Library/Python/3.9/lib/python/site-packages/pip install --ignore-installed --no-user --prefix /private/var/folders/8j/lbsf0k851g391m73x6y10rsr0000gn/T/pip-build-env-xh65myma/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'cython >= 0.29' 'numpy==1.14.5; python_version<'"'"'3.7'"'"'' 'numpy==1.16.0; python_version>='"'"'3.7'"'"'' setuptools setuptools_scm wheel
       cwd: None
  Complete output (4217 lines):

If you want, I can share the 4217 lines of the complete error. It is probably the same error as the other ticket about compatibility with Python 3.8, just with 3.9 in this case, so does anything above Python 3.7 really break?
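
For what it's worth, those 4217 lines are most likely pyarrow 1.0.1 being compiled from source because it ships no pre-built wheels for Python 3.9. One hedged workaround, assuming nlu's pyarrow>=0.16.0 pin is the only constraint, is to pre-install a newer pyarrow that does ship 3.9 wheels before installing nlu:

# Hedged workaround sketch: install a pyarrow release with Python 3.9 wheels first,
# so pip never tries to build 1.0.1 from source. Assumes nlu only pins pyarrow>=0.16.0.
import subprocess, sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "pyarrow>=3.0.0"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "nlu", "pyspark==3.0.2"])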

Remove the hard dependency on pyspark

Right now, the nlu package has a hard dependency on pyspark, which makes it hard to use with the Databricks runtime or other compatible Spark runtimes. Instead, this package should either rely entirely on an implicit dependency, or use something like the findspark package, similar to what is done here.

P.S. the spark-nlp package itself doesn't depend on pyspark.
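
A rough sketch of the findspark approach the issue describes, assuming the surrounding runtime (e.g. Databricks) already provides a compatible Spark installation, could look like this:

# Rough sketch of an optional pyspark dependency: fall back to findspark
# when pyspark is not pip-installed but the runtime ships Spark.
try:
    import pyspark
except ImportError:
    import findspark
    findspark.init()   # locates SPARK_HOME and puts pyspark on sys.path
    import pyspark

print(pyspark.__version__)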

error while using biobert PubMed PMC

Hi, I am really interested in this NLU biobert library. It's easy to implement yet understandable. However, I faced difficulties while trying to use NLU biobert for my project. I want to run this code:

import nlu

embeddings_df2 = nlu.load('en.embed.biobert.pubmed_pmc_base_cased', gpu=True).predict(df['text'], output_level='token')
embeddings_df2

I am using Google Colab with a GPU. After approximately 40 minutes, it suddenly stopped and produced the error below:

biobert_pubmed_pmc_base_cased download started this may take some time.
Approximate size to download 386.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Exception happened during processing of request from ('127.0.0.1', 40522)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1033, in send_command
response = connection.send_command(command)
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1212, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/usr/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.7/socketserver.py", line 347, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.7/socketserver.py", line 720, in init
self.handle()
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/lib/python3.7/dist-packages/pyspark/serializers.py", line 595, in read_int
raise EOFError
EOFError

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:35473)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 977, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1115, in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
ERROR:nlu:Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
No accepted Data type or usable columns found or applying the NLU models failed.
Make sure that the first column you pass to .predict() is the one that nlu should predict on OR rename the column you want to predict on to 'text'
On try to reset restart Jupyter session and run the setup script again, you might have used too much memory
Full Stacktrace was (<class 'TypeError'>, TypeError('cannot unpack non-iterable NoneType object'), <traceback object at 0x7f4ed5dd60f0>)
Additional info:
<class 'TypeError'> pipeline.py 435
cannot unpack non-iterable NoneType object
Stuck? Contact us on Slack! https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA

I already tried 2-3 times; in my opinion it is probably due to running out of RAM. However, I already enabled the GPU. Any solution for this? Thanks in advance.
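
The Py4J "Answer from Java side is empty" / connection-refused chain usually means the JVM driver process died, and on Colab that is most often an out-of-memory kill, which matches your suspicion. One hedged workaround, sketched below, is to feed the text column to predict() in smaller chunks so the driver never holds the whole embedded corpus at once; the chunk size of 1000 rows is an arbitrary assumption to tune for your data:

# Hedged sketch: predict in chunks and concatenate, to keep peak RAM low.
# `df` and the 1000-row chunk size are assumptions; adjust for your dataset.
import nlu
import pandas as pd

pipe = nlu.load('en.embed.biobert.pubmed_pmc_base_cased', gpu=True)

chunks = []
for start in range(0, len(df), 1000):
    batch = df['text'].iloc[start:start + 1000]
    chunks.append(pipe.predict(batch, output_level='token'))

embeddings_df2 = pd.concat(chunks, ignore_index=True)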

predict() - pyspark IndexError on python 3.11.4

Python version: 3.11.4
pyspark version: 3.1.2

model.predict('I love NLU! <3')
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]

Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 630, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 503, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 484, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 156, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 236, in _extract_code_globals
    out_names = {names[oparg] for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 236, in <setcomp>
    out_names = {names[oparg] for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users//nlu/nlu/pipe/pipeline.py", line 485, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//nlu/nlu/pipe/utils/predict_helper.py", line 267, in __predict__
    pipe.fit()
  File "/Users//nlu/nlu/pipe/pipeline.py", line 204, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//nlu/nlu/pipe/pipeline.py", line 103, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
    return super(SparkSession, self).createDataFrame(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/pandas/conversion.py", line 300, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/session.py", line 701, in _create_dataframe
    jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2618, in _to_java_object_rdd
    return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
                                                ^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2949, in _jrdd
    wrapped_func = _wrap_function(self.ctx, self.func, self._prev_jrdd_deserializer,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2828, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2814, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
                      ^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 447, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
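
The root cause here is that the cloudpickle bundled with pyspark 3.1.2 predates the bytecode changes in Python 3.11, so it trips over the new opcode layout while pickling the sample-dataframe function. Until a pyspark release that supports 3.11 is used, a small guard like the hedged sketch below at least fails fast with a readable message (the exact supported range is an assumption, not an official statement):

# Hedged guard sketch: fail fast on Python 3.11+ instead of hitting the
# cloudpickle IndexError. The version ceiling is an assumption for pyspark 3.1.x.
import sys

if sys.version_info >= (3, 11):
    raise RuntimeError(
        "pyspark 3.1.x cannot pickle functions on Python 3.11+; "
        "use an older interpreter (e.g. 3.8/3.9) or a newer pyspark."
    )

import nlu
print(nlu.load('sentiment').predict('I love NLU! <3'))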

using NLU-biobert for entity linking or word embedding

I just wanted to ask how one can use this model for entity linking. I believe I did see some linking and POS tagging, but is there documentation that shows matching words to their meaning rather than just matching by similarity? I want to load a Spark dataset, use the model to perform word embedding by meaning on the whole dataset, store the output in another data frame, and also be able to measure its performance with various metrics.
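
In case it helps while waiting for an answer: predict() also accepts Pandas and Spark DataFrames, so token-level BioBERT vectors can be attached to a dataset and then compared with ordinary cosine similarity. The sketch below is hedged: the 'token' and 'word_embedding_biobert' column names are assumptions about the predict() output schema, and similarity over contextual vectors is still not the same as dictionary-backed entity linking (that is what entity-resolution models are for):

# Hedged sketch: token-level embeddings for a small dataset plus a plain
# numpy cosine similarity. Column names 'token' and 'word_embedding_biobert'
# are assumptions about the output schema; inspect emb_df.columns to confirm.
import nlu
import numpy as np
import pandas as pd

pipe = nlu.load('en.embed.biobert.pubmed_pmc_base_cased')
texts = pd.DataFrame({'text': ['The tumor was benign.', 'The cancer had spread.']})
emb_df = pipe.predict(texts, output_level='token')   # one row per token

def vec(word):
    rows = emb_df[emb_df['token'].str.lower() == word]
    return np.array(rows.iloc[0]['word_embedding_biobert'])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec('tumor'), vec('cancer')))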

Could not locate executable null\bin\winutils.exe

Hi, thanks for the package, I'm starting to explore it and it looks good so far!
I've just faced some minor issues when trying to run it on my Windows machine and thought about giving a heads-up here in case someone runs into a similar problem.

First, you need to run your Python as admin, because folders are created (to store downloaded models, I presume) and this causes errors if no permission is granted. Is there a way to choose where these models are downloaded to? That might help with this.

Second, after the installation steps in the instructions I got this error when trying to run nlu:

Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

I found the answer to this problem here: https://stackoverflow.com/a/50430966/11483674
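
Concretely, the "null\bin\winutils.exe" message just means HADOOP_HOME is unset; the linked answer amounts to downloading winutils.exe and pointing HADOOP_HOME at its parent folder. A hedged sketch of doing that from Python before nlu starts Spark (the C:\hadoop location is an assumption, put winutils.exe wherever you prefer):

# Hedged sketch for Windows: make Spark find winutils.exe before it starts.
# C:\hadoop is an assumed location containing bin\winutils.exe; adjust as needed.
import os

os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

import nlu
print(nlu.load('sentiment').predict('NLU on Windows works now'))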
