onsdigital / address-index-data

License: MIT License

Languages: Python 48.72%, Scala 35.44%, R 15.84%
Topics: address-index, address-parser, data-linkage, data-science, machine-learning, nlp-machine-learning, office-for-national-statistics, ons

address-index-data's People

Contributors

alexflav23, alwestvt, analytically, gaskyk, gsmanderson, ivyons, jameshoskisson, mironor, paul-joel, richardsmithons, sammypoot, saniemi, steve-thorne-ons

address-index-data's Issues

Issues running the address-index-data project

Steps followed:
Cloned master branch -> git clone --branch master https://github.com/ONSdigital/address-index-data
Ran ‘sbt clean assembly’ in the address-index-data folder (assembly was successful, but with WARNs while merging)

Created an application.conf with the following:
addressindex.elasticsearch.nodes="IP of my Elasticsearch"
addressindex.elasticsearch.pass="password"
addressindex.elasticsearch.user="elastic"

In /batch/build.sbt:
I set val localTarget: Boolean = true since I needed a single jar, and changed the dependency to "org.elasticsearch" %% "elasticsearch-spark-20" % "8.7.1" to match my Elasticsearch version (sketched below).

Ran the following, which throws exceptions:
java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar

I am pasting parts of the dump below:
23/06/06 11:51:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/06 11:51:24 WARN Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
org.datanucleus.exceptions.NucleusUserException: ClassLoaderResolver for class "" gave error on creation : {1}
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1087)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)

Caused by: java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
... 135 more
Nested Throwables StackTrace:
java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
.........
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
.......
23/06/06 11:51:24 WARN HiveMetaStore: Retrying creating default database after error: Unexpected exception caught.
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
......
23/06/06 11:51:24 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
.......
Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:283)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
... 123 more

  1. I tried switching from Java 8 to 11 and back; at least it builds with Java 8.
  2. Tried using "bintray-spark-packages" at "https://repos.spark-packages.org" instead of "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/".
  3. Tried to add "org.apache.hive" %% "hive-common" % "2.3.3" as a dependency, as some blogs seemed to suggest, but didn't know which repository to resolve it from (see the sketch after this list).
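
For what it is worth, a sketch of what item 3 could look like in build.sbt: hive-common is a plain Java artifact, so it takes a single "%" rather than "%%", and it is published to Maven Central, which sbt resolves from by default. Whether it actually fixes the DataNucleus error is not established here:

// hypothetical addition to batch/build.sbt; compatibility with this project is not verified
libraryDependencies += "org.apache.hive" % "hive-common" % "2.3.3"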

Scala and Java versions are as follows:
Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.
openjdk version "1.8.0_362"

Datasets

Hi,

Would you mind explaining what dataset the "hierarchy" CSV is compiled from?

Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException

When I run the command below,

java -Dconfig.file=application.conf -cp batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar uk.gov.ons.addressindex.Main

I get the following error:

21/07/14 02:17:15 WARN Utils: Your hostname, myimac.local resolves to a loopback address: 127.0.0.1; using 192.168.1.5 instead (on interface en0)
21/07/14 02:17:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/07/14 02:17:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/14 02:17:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
21/07/14 02:17:20 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: RECORD_IDENTIFIER, CHANGE_TYPE, PRO_ORDER, UPRN, UDPRN, ORGANISATION_NAME, DEPARTMENT_NAME, SUB_BUILDING_NAME, BUILDING_NAME, BUILDING_NUMBER, DEPENDENT_THOROUGHFARE, THROUGHFARE, DOUBLE_DEPENDENT_LOCALITY, DEPENDENT_LOCALITY, POST_TOWN, POSTCODE, POSTCODE_TYPE, DELIVERY_POINT_SUFFIX, WELSH_DEPENDENT_THOROUGHFARE, WELSH_THOROUGHFARE, WELSH_DOUBLE_DEPENDENT_LOCALITY, WELSH_DEPENDENT_LOCALITY, WELSH_POST_TOWN, PO_BOX_NUMBER, PROCESS_DATE, START_DATE, END_DATE, LAST_UPDATE_DATE, ENTRY_DATE
 Schema: recordIdentifier, changeType, proOrder, uprn, udprn, organisationName, departmentName, subBuildingName, buildingName, buildingNumber, dependentThoroughfare, thoroughfare, doubleDependentLocality, dependentLocality, postTown, postcode, postcodeType, deliveryPointSuffix, welshDependentThoroughfare, welshThoroughfare, welshDoubleDependentLocality, welshDependentLocality, welshPostTown, poBoxNumber, processDate, startDate, endDate, lastUpdateDate, entryDate
Expected: recordIdentifier but found: RECORD_IDENTIFIER
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/delivery_point/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, ORGANISATION, LEGAL_NAME
 Schema: uprn, organisation, legalName
Expected: legalName but found: LEGAL_NAME
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/organisation/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, PRIMARY_UPRN, THIS_LAYER, PARENT_UPRN
 Schema: uprn, primaryUprn, thisLayer, parentUprn
Expected: primaryUprn but found: PRIMARY_UPRN
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, LPI_KEY, LANGUAGE, LOGICAL_STATUS, START_DATE, END_DATE, LAST_UPDATE_DATE, SAO_START_NUMBER, SAO_START_SUFFIX, SAO_END_NUMBER, SAO_END_SUFFIX, SAO_TEXT, PAO_START_NUMBER, PAO_START_SUFFIX, PAO_END_NUMBER, PAO_END_SUFFIX, PAO_TEXT, USRN, USRN_MATCH_INDICATOR, LEVEL, OFFICIAL_FLAG
 Schema: uprn, lpiKey, language, logicalStatus, startDate, endDate, lastUpdateDate, saoStartNumber, saoStartSuffix, saoEndNumber, saoEndSuffix, saoText, paoStartNumber, paoStartSuffix, paoEndNumber, paoEndSuffix, paoText, usrn, usrnMatchIndicator, level, officialFlag
Expected: lpiKey but found: LPI_KEY
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/lpi/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: USRN, STREET_CLASSIFICATION
 Schema: usrn, streetClassification
Expected: streetClassification but found: STREET_CLASSIFICATION
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/street/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: USRN, STREET_DESCRIPTOR, LOCALITY, TOWN_NAME, LANGUAGE
 Schema: usrn, streetDescriptor, locality, townName, language
Expected: streetDescriptor but found: STREET_DESCRIPTOR
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/street_descriptor/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, CROSS_REFERENCE, SOURCE
 Schema: uprn, crossReference, source
Expected: crossReference but found: CROSS_REFERENCE
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/crossref/ABP_E811a_v111017.csv
21/07/14 02:17:23 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, CLASSIFICATION_CODE, CLASS_SCHEME
 Schema: uprn, classificationCode, classScheme
Expected: classificationCode but found: CLASSIFICATION_CODE
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/classification/ABP_E811a_v111017.csv
21/07/14 02:17:24 WARN CSVDataSource: CSV header does not conform to the schema.
 Header: UPRN, PRIMARY_UPRN, PARENT_UPRN, ESTAB_TYPE, ADDRESS_TYPE
 Schema: uprn, primaryUprn, parentUprn, addressType, estabType
Expected: primaryUprn but found: PRIMARY_UPRN
CSV file: file:///Volumes/REPO/Goran/address-index/batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
        at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
        at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)
        at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)
        at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:56)
        at uk.gov.ons.addressindex.writers.ElasticSearchWriter$.saveHybridAddresses(ElasticSearchWriter.scala:27)
        at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:96)
        at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:56)
        at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
        at uk.gov.ons.addressindex.Main.main(Main.scala)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[pvr-locations.es.eu-west-2.aws.cloud.es.io:9243] returned [400|Bad Request:]
        at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:426)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:388)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:392)
        at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:168)
        at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:735)
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:330)
        ... 17 more

I have no idea how to resolve this error.
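
The exception itself points at the es.nodes.wan.only setting, which elasticsearch-hadoop needs when the cluster is only reachable over a WAN/Cloud endpoint such as the one in the log above. Below is a minimal, stand-alone Scala sketch of passing those connector settings through the Spark session; the project's own code reads its Elasticsearch settings from application.conf, so how to surface this key there is an assumption on my part:

import org.apache.spark.sql.SparkSession

// Hypothetical example, not the project's own Spark setup:
val spark = SparkSession.builder()
  .appName("address-index-wan-example")
  .config("es.nodes", "pvr-locations.es.eu-west-2.aws.cloud.es.io")  // host from the log above
  .config("es.port", "9243")
  .config("es.net.ssl", "true")              // Elastic Cloud endpoints are HTTPS
  .config("es.nodes.wan.only", "true")       // the setting named in the exception
  .getOrCreate()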

Latest versions of prerequisite software

The prerequisites for this project seem to be old versions, especially Spark 2.4. What are the latest versions of each of these with which the project builds and runs well? I am planning to use the master branch.

Java 8
SBT 0.13.16 (http://www.scala-sbt.org/)
Scala 2.12.4
Apache Spark 2.4.0
Elasticsearch 7.9.3
Elasticsearch-spark-30 7.9.12

Also, I am planning to connect this to an Elasticsearch 8.7 cluster, so is https://www.javadoc.io/doc/org.elasticsearch/elasticsearch-spark-30_2.12/8.7.0/index.html okay to use?
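
For reference, a sketch of the sbt coordinate behind that javadoc link; whether it is compatible with this project's current Spark and Scala versions is not established here:

// the artifact name already encodes Scala 2.12, so a plain "%" is used rather than "%%"
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark-30_2.12" % "8.7.0"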

Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.

I get this error while building a Docker image.
My Dockerfile is as follows:

FROM ubuntu:latest

# Install OpenJDK-8
RUN apt-get update -y && \
    apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && \
    apt-get clean;
    
# Fix certificate issues
RUN apt-get update && \
    apt-get install -y ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f;

# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME

# Install sbt and scala
RUN apt-get -qq -y install curl wget
RUN apt-get install gnupg -y
RUN echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list && \
    echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee /etc/apt/sources.list.d/sbt_old.list && \
    curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add && \
    apt-get update && \
    apt-get install sbt -y
RUN apt-get install scala -y

# Install spark
RUN wget https://apache.claz.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
RUN tar xvf spark-*
RUN mv spark-3.1.2-bin-hadoop3.2 /opt/spark

ENV SPARK_HOME /opt/spark
RUN export SPARK_HOME
ENV PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"

RUN start-master.sh
RUN start-slave.sh spark://localhost:7077

# Install Elasticsearch 7
RUN curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list
RUN apt-get update
RUN apt-get install -y elasticsearch

# Install hadoop
ENV HADOOP_HOME /opt/hadoop
RUN export HADOOP_HOME
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz && \
    tar -xzf hadoop-3.3.1.tar.gz && \
    mv hadoop-3.3.1 $HADOOP_HOME

ENV PATH="${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin"
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native \
    HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib" \
    HADOOP_MAPRED_HOME=$HADOOP_HOME \
    HADOOP_COMMON_HOME=$HADOOP_HOME \
    HADOOP_HDFS_HOME=$HADOOP_HOME \
    YARN_HOME=$HADOOP_HOME \
    HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop \
    LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/"

RUN start-all.sh

WORKDIR /opt/address-index
COPY ./ ./
RUN sbt clean assembly
RUN java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar

How to specify ca-cert or Fingerprint along with user/pass when calling Elasticsearch

The API for one of my Elasticsearch instances requires a user ID, password and certificate fingerprint. I have appended Python code at the bottom that works against my ES instance.

But in the address-index-data project, using reference.conf or application.conf, I can only specify host, user and password. How do I specify the certificate fingerprint?

from elasticsearch import Elasticsearch

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "---------------------"
CERT_FINGERPRINT = "-------------------"

# Create the client instance
client = Elasticsearch(
    "https://10.222.13.197:9200",
    ssl_assert_fingerprint=CERT_FINGERPRINT,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)
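
As far as I know, the elasticsearch-hadoop connector this project builds on has no direct equivalent of ssl_assert_fingerprint; the documented route is to put the CA certificate into a Java truststore and point the es.net.ssl.* settings at it. A hedged Scala sketch follows, where the truststore path and passwords are placeholders and the config keys would still need to be surfaced through application.conf:

import org.apache.spark.sql.SparkSession

// Hypothetical example, not the project's own Spark setup:
val spark = SparkSession.builder()
  .appName("address-index-ssl-example")
  .config("es.nodes", "10.222.13.197")
  .config("es.port", "9200")
  .config("es.net.http.auth.user", "elastic")
  .config("es.net.http.auth.pass", "<password>")                               // placeholder
  .config("es.net.ssl", "true")
  .config("es.net.ssl.truststore.location", "file:///path/to/truststore.jks")  // placeholder
  .config("es.net.ssl.truststore.pass", "<truststore password>")               // placeholder
  .getOrCreate()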

Caused by: java.net.SocketTimeoutException: Read timed out

While running the job, I get this error.

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:108)
	at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)
	at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)
	at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:56)
	at uk.gov.ons.addressindex.writers.ElasticSearchWriter$.saveSkinnyHybridAddresses(ElasticSearchWriter.scala:39)
	at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:94)
	at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:65)
	at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
	at uk.gov.ons.addressindex.Main.main(Main.scala)
Caused by: java.lang.RuntimeException: Read timed out
	at org.codehaus.jackson.map.MappingIterator.next(MappingIterator.java:115)
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:241)
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:499)
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:113)
	at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:192)
	at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:172)
	at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:74)
	at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
	at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLException: Read timed out
	at sun.security.ssl.Alert.createSSLException(Alert.java:127)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
	at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1386)
	at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1354)
	at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:948)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
	at org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:62)
	at org.codehaus.jackson.impl.Utf8StreamParser.loadMore(Utf8StreamParser.java:172)
	at org.codehaus.jackson.impl.Utf8StreamParser._skipWSOrEnd(Utf8StreamParser.java:2309)
	at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:444)
	at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:219)
	at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
	at org.codehaus.jackson.map.deser.std.MapDeserializer._readAndBind(MapDeserializer.java:319)
	at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:249)
	at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:33)
	at org.codehaus.jackson.map.MappingIterator.nextValue(MappingIterator.java:178)
	at org.codehaus.jackson.map.MappingIterator.next(MappingIterator.java:111)
	... 16 more
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:457)
	at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237)
	at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190)
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
	... 36 more
21/07/23 05:49:00 WARN TaskSetManager: Lost task 8.0 in stage 18.0 (TID 681, localhost, executor driver): TaskKilled (Stage cancelled)

How can I handle this error?
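
The failure is a read timeout during a bulk flush to Elasticsearch, so one common mitigation is to give the connector a longer HTTP timeout and smaller, retried bulk batches. A hedged Scala sketch using documented elasticsearch-hadoop settings; the specific values are guesses, not tested against this project:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("address-index-timeout-example")
  .config("es.http.timeout", "5m")              // default is 1m; raise it for slow clusters
  .config("es.batch.size.entries", "500")       // smaller bulk requests
  .config("es.batch.size.bytes", "1mb")
  .config("es.batch.write.retry.count", "6")    // retry failed bulk batches
  .config("es.batch.write.retry.wait", "60s")
  .getOrCreate()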

java.lang.OutOfMemoryError: Java heap space

I'm running it from inside IntelliJ IDEA, using real data.
The data is in CSV format and the file size is more than 8 GB,
so I get java.lang.OutOfMemoryError: Java heap space.
How can I resolve it?
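
When Main is run directly from IntelliJ in local mode, the driver and executors share the IDE-launched JVM, so the heap usually has to be raised through the run configuration's VM options (for example -Xmx8g) rather than through Spark settings applied after start-up. A hedged Scala sketch of the related session settings, with the values being guesses:

// Run configuration VM options (assumption): -Xmx8g
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("address-index-local")
  .master("local[*]")
  .config("spark.driver.memory", "8g")            // only effective if set before the JVM starts
  .config("spark.sql.shuffle.partitions", "200")  // more, smaller partitions for an 8 GB CSV
  .getOrCreate()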

error instantiating SessionHiveMetaStoreClient

Hi, I have the Elasticsearch server running locally on localhost and it connects fine, but I get an error when the SessionHiveMetaStoreClient is instantiated. Can you help? Thanks.


org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
	at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
	at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
	at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:432)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:233)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
	at uk.gov.ons.addressindex.readers.AddressIndexFileReader$.readCsv(AddressIndexFileReader.scala:110)
	at uk.gov.ons.addressindex.readers.AddressIndexFileReader$.readBlpuCSV(AddressIndexFileReader.scala:39)
	at uk.gov.ons.addressindex.Main$.generateNagAddresses(Main.scala:88)
	at uk.gov.ons.addressindex.Main$.saveHybridAddresses(Main.scala:98)
	at uk.gov.ons.addressindex.Main$.delayedEndpoint$uk$gov$ons$addressindex$Main$1(Main.scala:78)
	at uk.gov.ons.addressindex.Main$delayedInit$body.apply(Main.scala:14)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at uk.gov.ons.addressindex.Main$.main(Main.scala:14)
	at uk.gov.ons.addressindex.Main.main(Main.scala)

Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
	at org.datanucleus.NucleusContext.<init>(NucleusContext.java:283)
	at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
	at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
	... 91 more

Setup with latest ONS data

The latest ONS AddressBase Premium data is distributed as zip files containing data grouped by postcode.
This results in over 10,000 files that need to be processed.

It would appear that the current solution only accepts one CSV file for each record type.
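
If it helps, Spark itself can read many CSV files in one pass by pointing the reader at a directory or glob rather than a single file. A hedged Scala sketch; the path is a placeholder, and whether the project's AddressIndexFileReader can be switched over this way is not verified here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("abp-multi-csv").getOrCreate()

// Read every delivery-point CSV extracted from the postcode-split zips into one DataFrame.
val deliveryPoint = spark.read
  .option("header", "true")
  .csv("/data/abp/delivery_point/*.csv")   // placeholder glob path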
