onsdigital / address-index-data
License: MIT License
Steps followed:
Cloned master branch -> git clone --branch master https://github.com/ONSdigital/address-index-data
Ran `sbt clean assembly` in the address-index-data folder (assembly was successful, but with WARNs while merging)
Created an application.conf with the following:

```hocon
addressindex.elasticsearch.nodes="IP of my Elasticsearch"
addressindex.elasticsearch.pass="password"
addressindex.elasticsearch.user="elastic"
```
In /batch/build.sbt, I set `val localTarget: Boolean = true` since I needed a single jar, and changed the connector dependency to `"org.elasticsearch" %% "elasticsearch-spark-20" % "8.7.1"` to go with my Elasticsearch version.
Running the following throws exceptions:

java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar

I am pasting parts of the dump below:
23/06/06 11:51:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/06 11:51:24 WARN Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
org.datanucleus.exceptions.NucleusUserException: ClassLoaderResolver for class "" gave error on creation : {1}
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1087)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
Caused by: java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
... 135 more
Nested Throwables StackTrace:
java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
.........
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
.......
23/06/06 11:51:24 WARN HiveMetaStore: Retrying creating default database after error: Unexpected exception caught.
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
......
23/06/06 11:51:24 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
.......
Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:283)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
... 123 more
Scala and Java versions are as follows:
Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.
openjdk version "1.8.0_362"
Hi,
Would you mind explaining what dataset the "hierarchy" CSV is compiled from?
When I run the command below, I get an error:

java -Dconfig.file=application.conf -cp batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar uk.gov.ons.addressindex.Main
21/07/14 02:17:15 WARN Utils: Your hostname, myimac.local resolves to a loopback address: 127.0.0.1; using 192.168.1.5 instead (on interface en0)
21/07/14 02:17:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/07/14 02:17:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/14 02:17:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
21/07/14 02:17:20 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: RECORD_IDENTIFIER, CHANGE_TYPE, PRO_ORDER, UPRN, UDPRN, ORGANISATION_NAME, DEPARTMENT_NAME, SUB_BUILDING_NAME, BUILDING_NAME, BUILDING_NUMBER, DEPENDENT_THOROUGHFARE, THROUGHFARE, DOUBLE_DEPENDENT_LOCALITY, DEPENDENT_LOCALITY, POST_TOWN, POSTCODE, POSTCODE_TYPE, DELIVERY_POINT_SUFFIX, WELSH_DEPENDENT_THOROUGHFARE, WELSH_THOROUGHFARE, WELSH_DOUBLE_DEPENDENT_LOCALITY, WELSH_DEPENDENT_LOCALITY, WELSH_POST_TOWN, PO_BOX_NUMBER, PROCESS_DATE, START_DATE, END_DATE, LAST_UPDATE_DATE, ENTRY_DATE
Schema: recordIdentifier, changeType, proOrder, uprn, udprn, organisationName, departmentName, subBuildingName, buildingName, buildingNumber, dependentThoroughfare, thoroughfare, doubleDependentLocality, dependentLocality, postTown, postcode, postcodeType, deliveryPointSuffix, welshDependentThoroughfare, welshThoroughfare, welshDoubleDependentLocality, welshDependentLocality, welshPostTown, poBoxNumber, processDate, startDate, endDate, lastUpdateDate, entryDate
Expected: recordIdentifier but found: RECORD_IDENTIFIER
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/delivery_point/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, ORGANISATION, LEGAL_NAME
Schema: uprn, organisation, legalName
Expected: legalName but found: LEGAL_NAME
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/organisation/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, PRIMARY_UPRN, THIS_LAYER, PARENT_UPRN
Schema: uprn, primaryUprn, thisLayer, parentUprn
Expected: primaryUprn but found: PRIMARY_UPRN
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, LPI_KEY, LANGUAGE, LOGICAL_STATUS, START_DATE, END_DATE, LAST_UPDATE_DATE, SAO_START_NUMBER, SAO_START_SUFFIX, SAO_END_NUMBER, SAO_END_SUFFIX, SAO_TEXT, PAO_START_NUMBER, PAO_START_SUFFIX, PAO_END_NUMBER, PAO_END_SUFFIX, PAO_TEXT, USRN, USRN_MATCH_INDICATOR, LEVEL, OFFICIAL_FLAG
Schema: uprn, lpiKey, language, logicalStatus, startDate, endDate, lastUpdateDate, saoStartNumber, saoStartSuffix, saoEndNumber, saoEndSuffix, saoText, paoStartNumber, paoStartSuffix, paoEndNumber, paoEndSuffix, paoText, usrn, usrnMatchIndicator, level, officialFlag
Expected: lpiKey but found: LPI_KEY
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/lpi/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: USRN, STREET_CLASSIFICATION
Schema: usrn, streetClassification
Expected: streetClassification but found: STREET_CLASSIFICATION
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/street/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: USRN, STREET_DESCRIPTOR, LOCALITY, TOWN_NAME, LANGUAGE
Schema: usrn, streetDescriptor, locality, townName, language
Expected: streetDescriptor but found: STREET_DESCRIPTOR
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/street_descriptor/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, CROSS_REFERENCE, SOURCE
Schema: uprn, crossReference, source
Expected: crossReference but found: CROSS_REFERENCE
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/crossref/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, CLASSIFICATION_CODE, CLASS_SCHEME
Schema: uprn, classificationCode, classScheme
Expected: classificationCode but found: CLASSIFICATION_CODE
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/classification/ABP_E811a_v111017.csv
21/07/14 02:17:24 WARN CSVDataSource: CSV header does not conform to the schema.
Header: UPRN, PRIMARY_UPRN, PARENT_UPRN, ESTAB_TYPE, ADDRESS_TYPE
Schema: uprn, primaryUprn, parentUprn, addressType, estabType
Expected: primaryUprn but found: PRIMARY_UPRN
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)
at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:56)
at uk.gov.ons.addressindex.writers.ElasticSearchWriter$.saveHybridAddresses(ElasticSearchWriter.scala:27)
at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:96)
at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:56)
at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
at uk.gov.ons.addressindex.Main.main(Main.scala)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[pvr-locations.es.eu-west-2.aws.cloud.es.io:9243] returned [400|Bad Request:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:426)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:388)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:392)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:168)
at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:735)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:330)
... 17 more
I have no idea how to resolve this error.
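For what it's worth, the exception message itself names the relevant setting: when targeting an Elastic Cloud/WAN endpoint such as `pvr-locations.es.eu-west-2.aws.cloud.es.io:9243`, the elasticsearch-hadoop connector needs WAN mode, SSL, and the non-default port. A sketch of the underlying es-hadoop properties follows; whether and how address-index-data exposes these through its application.conf is an assumption to verify against the project's reference.conf:

```properties
# es-hadoop connector settings for a cloud/WAN endpoint (sketch):
es.nodes.wan.only = true   # talk only to the declared node; skip cluster discovery
es.net.ssl = true          # the cloud endpoint on 9243 is HTTPS
es.port = 9243             # default es.port is 9200, so set it explicitly
```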
The prerequisites for this project seem to be old versions, especially Spark 2.4. What are the latest versions of each of these with which the project builds and runs well? I am planning to use the master branch.
Java 8
SBT 0.13.16 (http://www.scala-sbt.org/)
Scala 2.12.4
Apache Spark 2.4.0
Elasticsearch 7.9.3
Elasticsearch-spark-30 7.9.12
Also, I am planning to connect this to an Elasticsearch 8.7 cluster, so is https://www.javadoc.io/doc/org.elasticsearch/elasticsearch-spark-30_2.12/8.7.0/index.html okay to use?
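In case it helps: as far as I can tell, Elastic only publishes the `elasticsearch-spark-30` artifact (built against Spark 3.x and Scala 2.12/2.13) for 8.x; there is no 8.x build of `elasticsearch-spark-20`. A hedged sketch of the build.sbt change, with the caveat that whether the rest of this project compiles against Spark 3 is exactly the open question above:

```scala
// Sketch only: swaps the connector for the Spark 3 / Elasticsearch 8.x build.
// The assumption that the surrounding build already uses Scala 2.12 and
// Spark 3.x has not been tested against this repository.
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-30" % "8.7.0"
```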
I hit this error while building a Docker image. My Dockerfile is as follows:
```dockerfile
FROM ubuntu:latest

# Install OpenJDK-8
RUN apt-get update -y && \
    apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && \
    apt-get clean

# Fix certificate issues
RUN apt-get update && \
    apt-get install ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f

# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME

# Install sbt and scala
RUN apt-get -qq -y install curl wget
RUN apt-get install gnupg -y
RUN echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list && \
    echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee /etc/apt/sources.list.d/sbt_old.list && \
    curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add && \
    apt-get update && \
    apt-get install sbt -y
RUN apt-get install scala -y

# Install spark
RUN wget https://apache.claz.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
RUN tar xvf spark-*
RUN mv spark-3.1.2-bin-hadoop3.2 /opt/spark
ENV SPARK_HOME /opt/spark
RUN export SPARK_HOME
ENV PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"
RUN start-master.sh
RUN start-slave.sh spark://localhost:7077

# Install Elasticsearch 7
RUN curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list
RUN apt-get update
RUN apt-get install elasticsearch

# Install hadoop
ENV HADOOP_HOME /opt/hadoop
RUN export HADOOP_HOME
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz && \
    tar -xzf hadoop-3.3.1.tar.gz && \
    mv hadoop-3.3.1 $HADOOP_HOME
ENV PATH="${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin"
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native \
    HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib" \
    HADOOP_MAPRED_HOME=$HADOOP_HOME \
    HADOOP_COMMON_HOME=$HADOOP_HOME \
    HADOOP_HDFS_HOME=$HADOOP_HOME \
    YARN_HOME=$HADOOP_HOME \
    HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop \
    LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/"
RUN start-all.sh

WORKDIR /opt/address-index
COPY ./ ./
RUN sbt clean assembly
RUN java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar
```
The API for one of my Elasticsearch instances requires a user ID, password and certificate fingerprint. I have appended Python code at the bottom that works against my ES instance.
But in the address-index-data project, using reference.conf or application.conf I can only specify host, user and password. How do I specify the certificate fingerprint?
```python
from elasticsearch import Elasticsearch

ELASTIC_PASSWORD = "---------------------"
CERT_FINGERPRINT = "-------------------"

client = Elasticsearch(
    "https://10.222.13.197:9200",
    ssl_assert_fingerprint=CERT_FINGERPRINT,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)
```
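As far as I can tell, elasticsearch-hadoop (which this project writes through) has no direct equivalent of the Python client's `ssl_assert_fingerprint`; its TLS options work through a Java truststore instead. A sketch of the relevant es-hadoop settings, assuming the cluster's CA certificate has been imported into a local truststore; the file paths, password, and whether this project forwards these keys from application.conf are all assumptions:

```properties
# Trust the cluster's CA via a truststore instead of pinning a fingerprint.
# The truststore can be built with, e.g.:
#   keytool -importcert -file http_ca.crt -keystore truststore.jks
es.net.ssl = true
es.net.ssl.truststore.location = file:///path/to/truststore.jks
es.net.ssl.truststore.pass = changeit
```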
While running the job, I hit this error:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:108)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)
at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:56)
at uk.gov.ons.addressindex.writers.ElasticSearchWriter$.saveSkinnyHybridAddresses(ElasticSearchWriter.scala:39)
at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:94)
at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:65)
at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
at uk.gov.ons.addressindex.Main.main(Main.scala)
Caused by: java.lang.RuntimeException: Read timed out
at org.codehaus.jackson.map.MappingIterator.next(MappingIterator.java:115)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:241)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:499)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:113)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:192)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:172)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:74)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLException: Read timed out
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1386)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1354)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:948)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
at org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:62)
at org.codehaus.jackson.impl.Utf8StreamParser.loadMore(Utf8StreamParser.java:172)
at org.codehaus.jackson.impl.Utf8StreamParser._skipWSOrEnd(Utf8StreamParser.java:2309)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:444)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:219)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.MapDeserializer._readAndBind(MapDeserializer.java:319)
at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:249)
at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:33)
at org.codehaus.jackson.map.MappingIterator.nextValue(MappingIterator.java:178)
at org.codehaus.jackson.map.MappingIterator.next(MappingIterator.java:111)
... 16 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:457)
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237)
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
... 36 more
21/07/23 05:49:00 WARN TaskSetManager: Lost task 8.0 in stage 18.0 (TID 681, localhost, executor driver): TaskKilled (Stage cancelled)
How do I handle this error?
I'm running it from inside IntelliJ IDEA, using real data.
The data format is CSV and the file size is more than 8 GB,
so I hit java.lang.OutOfMemoryError: Java heap space.
How do I resolve it?
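A heap-space error on an input this size usually just means the default JVM heap is too small for a local-mode Spark driver (in local mode the executors run inside the driver JVM). A sketch of raising it when running the jar directly; in IntelliJ the same `-Xmx` flag goes in the run configuration's VM options. The 12g figure is a guess to tune against the machine's RAM, not a project recommendation:

```shell
# Give the driver JVM a larger heap; tune -Xmx to available memory.
java -Xmx12g -Dconfig.file=application.conf \
     -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar
```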
Hi, I have the Elasticsearch server running locally on localhost and it connects fine, but I get an error from the SessionHiveMetaStoreClient. Can you help? Thanks.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at uk.gov.ons.addressindex.readers.AddressIndexFileReader$.readCsv(AddressIndexFileReader.scala:110)
at uk.gov.ons.addressindex.readers.AddressIndexFileReader$.readBlpuCSV(AddressIndexFileReader.scala:39)
at uk.gov.ons.addressindex.Main$.generateNagAddresses(Main.scala:88)
at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:98)
at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:78)
at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
at uk.gov.ons.addressindex.Main.main(Main.scala)
Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:283)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
... 91 more
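The DataNucleus "ClassLoaderResolver ... not found by the DataNucleus plugin mechanism" failure is a classic symptom of running a Spark-with-Hive application from a flattened assembly jar: the three DataNucleus jars each carry a `plugin.xml` extension registry, and fat-jar merging clobbers them. One workaround people use (hedged, not verified against this repository) is to launch the assembly through `spark-submit` from a Spark distribution built with Hive support, so the DataNucleus jars stay intact on the classpath:

```shell
# Sketch: run the assembly via spark-submit instead of plain `java -jar`.
# The master URL is an assumption; the conf file is passed to the driver
# JVM the same way as before.
spark-submit \
  --master "local[*]" \
  --class uk.gov.ons.addressindex.Main \
  --driver-java-options "-Dconfig.file=application.conf" \
  batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar
```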
The latest ONS AddressBase Premium data is delivered as zip files containing data grouped by postcode.
This results in over 10,000 files that need to be processed.
The current solution appears to accept only one CSV file per record type.
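Until the loader grows multi-file support, one pragmatic route is to pre-concatenate each record type's CSVs into the single file the current reader expects. A small sketch; the function name and the assumption that every file of a given record type shares an identical header are mine, not the project's:

```python
import csv
import glob

def merge_csvs(pattern, out_path):
    """Concatenate all CSV files matching `pattern` into `out_path`,
    keeping the header row from the first file only."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="") as out:
        writer = None
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # consume (and skip repeating) the header
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
```

Since Spark's CSV reader also accepts glob paths, patching AddressIndexFileReader to read a directory per record type might be the cleaner long-term fix.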