Comments (4)
@maziyarpanahi to reproduce:
Dockerfile for ECR Image referenced by EMR Serverless Application
FROM public.ecr.aws/emr-serverless/spark/emr-6.14.0:20230928-x86_64
USER root
RUN pip3 install spark-nlp==5.1.4
USER hadoop:hadoop
Spark Submit via Console
{
"applicationConfiguration": [
{
"classification": "spark-defaults",
"configurations": [],
"properties": {
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.hive.convertMetastoreParquet": "false",
"spark.kryoserializer.buffer.max": "2000M",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4",
"spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar, s3://BUCKET/jars/spark-nlp_2.12-5.1.4.jar",
"spark.jsl.settings.pretrained.cache_folder": "s3://BUCKET/sparknlp/cache",
"spark.jsl.settings.aws.region": "REGION"
}
}
]
}
Sentence Detector DL Model is the point of contention here.
class forEachBatchProcessor:
def __init__(self):
document_assembler = (
DocumentAssembler()
.setInputCol("content")
.setOutputCol("document")
.setCleanupMode("shrink_full")
)
cleanUpPatterns = [
"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)|^\.\.\."
]
document_normalizer = (
DocumentNormalizer()
.setInputCols("document")
.setOutputCol("normalizedDocument")
.setLowercase(False)
.setAction("clean")
.setPatterns(cleanUpPatterns)
.setReplacement(" ")
.setPolicy("pretty_all")
)
sentence_detector_dl = (
SentenceDetectorDLModel()
.pretrained()
.setInputCols(["normalizedDocument"])
.setOutputCol("sentences")
.setMinLength(5)
.setExplodeSentences(False)
)
self.pipeline = Pipeline(
stages=[document_assembler, document_normalizer, sentence_detector_dl]
)
Ultimately it's a file not found error in /home/hadoop/cache_pretrained
which has inspired this feature request.
from spark-nlp.
Stack Trace while using the default cache location (adding permissions and ownership to user hadoop:hadoop included here, same as leaving it out)
3.4.1-amzn-1
5.1.4
Internet is connected.
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ]sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
Download done! Loading the resource.
[ / ]
[ — ]
[ \ ]
[ | ]
[ / ]
[ — ]
[ \ ]
[ | ]
[ / ]
[ — ]
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) ([2600:1f18:3d:a503:edb3:ae2d:502b:9e73] executor 1): java.io.FileNotFoundException: File file:/home/hadoop/cache_pretrained/sentence_detector_dl_en_2.7.0_2.4_1609611052663/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.ChecksumFileSystem.lambda$openFileWithOptions$0(ChecksumFileSystem.java:896)
at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)
at org.apache.hadoop.fs.ChecksumFileSystem.openFileWithOptions(ChecksumFileSystem.java:894)
at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4865)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:115)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:288)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:287)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2974)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2910)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2909)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2909)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1263)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3173)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3101)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1028)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2288)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2307)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1462)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:518)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:510)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:709)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/hadoop/cache_pretrained/sentence_detector_dl_en_2.7.0_2.4_1609611052663/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.ChecksumFileSystem.lambda$openFileWithOptions$0(ChecksumFileSystem.java:896)
at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)
at org.apache.hadoop.fs.ChecksumFileSystem.openFileWithOptions(ChecksumFileSystem.java:894)
at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4865)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:115)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:288)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:287)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
[OK!]
Traceback (most recent call last):
File "/tmp/spark-45ccd940-724b-4145-a054-1e60d66da1ab/consolidate.py", line 260, in <module>
instantiateForEachBatchProcessor = forEachBatchProcessor()
File "/tmp/spark-45ccd940-724b-4145-a054-1e60d66da1ab/consolidate.py", line 77, in __init__
SentenceDetectorDLModel()
File "/usr/local/lib/python3.7/site-packages/sparknlp/annotator/sentence/sentence_detector_dl.py", line 467, in pretrained
return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
File "/usr/local/lib/python3.7/site-packages/sparknlp/pretrained/resource_downloader.py", line 99, in downloadModel
raise e
File "/usr/local/lib/python3.7/site-packages/sparknlp/pretrained/resource_downloader.py", line 96, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/__init__.py", line 350, in __init__
name, language, remote_loc)
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 27, in __init__
self._java_obj = self.new_java_obj(java_obj, *args)
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 37, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 86, in _new_java_obj
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) ([2600:1f18:3d:a503:edb3:ae2d:502b:9e73] executor 1): java.io.FileNotFoundException: File file:/home/hadoop/cache_pretrained/sentence_detector_dl_en_2.7.0_2.4_1609611052663/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.ChecksumFileSystem.lambda$openFileWithOptions$0(ChecksumFileSystem.java:896)
at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)
at org.apache.hadoop.fs.ChecksumFileSystem.openFileWithOptions(ChecksumFileSystem.java:894)
at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4865)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:115)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:288)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:287)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2974)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2910)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2909)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2909)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1263)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3173)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3101)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1028)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2288)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2307)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1462)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:518)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:510)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:709)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/hadoop/cache_pretrained/sentence_detector_dl_en_2.7.0_2.4_1609611052663/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:832)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1153)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:472)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.ChecksumFileSystem.lambda$openFileWithOptions$0(ChecksumFileSystem.java:896)
at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)
at org.apache.hadoop.fs.ChecksumFileSystem.openFileWithOptions(ChecksumFileSystem.java:894)
at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4865)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:115)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:288)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:287)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
from spark-nlp.
Update
I've confirmed step 1 of a workaround with the below
session = Session()
credentials = session.get_credentials()
current_credentials = credentials.get_frozen_credentials()
os.environ['AWS_ACCESS_KEY_ID'] = current_credentials.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = current_credentials.secret_key
old_spark: SparkSession = (
SparkSession
.builder
.getOrCreate()
)
old_conf = old_spark.sparkContext._conf.getAll()
new_conf = SparkConf()
for key, value in old_conf:
new_conf.set(key, value)
old_spark.stop()
new_conf.set('spark.jsl.settings.aws.credentials.access_key_id', current_credentials.access_key)
new_conf.set('spark.jsl.settings.aws.credentials.secret_access_key', current_credentials.secret_key)
new_conf.set('spark.jsl.settings.aws.credentials.session_token', current_credentials.token)
new_conf.set('spark.eventLog.overwrite', 'true')
spark = (
SparkSession.builder
.config(conf=new_conf)
.getOrCreate())
print(spark.sparkContext._conf.getAll())
However, after confirming the new session does in fact have the updated config, I still receive this error. Should I be setting Hadoop configs directly instead of with spark.jsl.settings.aws.credentials
?
Error: Empty access.key and secret.key hadoop configuration and parameters.
[OK!]
Traceback (most recent call last):
File "/tmp/spark-73462ad5-6316-40dd-84a3-b8cdc13a856c/consolidate.py", line 281, in <module>
instantiateForEachBatchProcessor = forEachBatchProcessor()
File "/tmp/spark-73462ad5-6316-40dd-84a3-b8cdc13a856c/consolidate.py", line 98, in __init__
SentenceDetectorDLModel()
File "/usr/local/lib/python3.7/site-packages/sparknlp/annotator/sentence/sentence_detector_dl.py", line 467, in pretrained
return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
File "/usr/local/lib/python3.7/site-packages/sparknlp/pretrained/resource_downloader.py", line 96, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/__init__.py", line 350, in __init__
name, language, remote_loc)
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 27, in __init__
self._java_obj = self.new_java_obj(java_obj, *args)
File "/usr/local/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 37, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 86, in _new_java_obj
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(sentence_detector_dl,Some(en),public/models,5.1.2,3.4) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@25901693
from spark-nlp.
Final Note
The request stands, my comments were an exercise to find a workaround.
@maziyarpanahi the request can be more concisely articulated to "EMRFS pretrained model cache". What do you think?
I don't believe there is a valid workaround for my use case as cross account hudi queries (documentation implies any data pulled across EMRFS) now fail. I believe it is because the spark context reset forces s3a instead of EMRFS. Redacted stdout:
py4j.protocol.Py4JJavaError: An error occurred while calling o707.load.
: org.apache.hudi.exception.HoodieIOException: Could not check if s3a:///is a valid table
at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:137)
at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
at org.apache.hudi.DefaultSource.sourceSchema(DefaultSource.scala:193)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:233)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:168)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:211)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3a:////.hoodie: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID:; S3 Extended Request ID:/=; Proxy: null), S3 Extended Request ID:/=:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID:; S3 Extended Request ID:/=; Proxy: null)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:249)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3797)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3689)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3557)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2338)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2357)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3555)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404)
at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51)
from spark-nlp.
Related Issues (20)
- .setReadMonthFirst always reads month first
- SparkSession should only be created and accessed on the driver HOT 2
- Unclear in general how to use classification models requiring sentence pairs
- ONNX models crash when they are used in Colab's T4 GPU runtime HOT 3
- No Operation named [init_all_tables] in the Graph when using GPT2Transformer HOT 4
- NER pipeline Not scaling up to use the full cluster nodes HOT 3
- Unsupported Model: BGEEmbeddings
- Spark NLP Configuration's spark.jsl.settings.storage.cluster_tmp_dir: Databricks DBFS location does not work HOT 3
- spark-nlp in databricks writing to root s3 in cluster HOT 1
- Import Whisper large v3 into Spark NLP HOT 5
- Zero-Shot NER gives wrong entities with labels HOT 12
- Cannot cast to float HOT 2
- Flexible normalization HOT 1
- XlmRoBertaSentenceEmbeddings returns huge amount of embeddings instead of set dimensions
- Sparknlp returning different embedding for manual spark dataframe vs reading from file spark dataframe HOT 5
- SparkNLP Embeddings inference 3X slower than with pandas_udf HOT 3
- EntityRuler fails two basic tests HOT 3
- Show an error of 'GLIBC_2.27 not found' when pretrained model download in AWS EMR HOT 2
- Onnx models fail when saving transformer
- Hardcoded column name in DocumentSimilarityRanker annotator
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from spark-nlp.