
nerdammer / spark-hbase-connector

298 stars · 36 watchers · 108 forks · 130 KB

Connect Spark to HBase for reading and writing data with ease

License: Apache License 2.0

Languages: Scala 99.64%, Shell 0.36%
Topics: hbase, spark

spark-hbase-connector's Introduction

Spark-HBase Connector

Build status

This library lets your Apache Spark application interact with Apache HBase using a simple and elegant API.

If you want to read and write data to HBase, you don't need to use the Hadoop API anymore; you can just use Spark.

Including the library

The spark-hbase-connector is available in the Sonatype repository. You can just add the following dependency to your sbt build:

libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3"

The Maven style version of the dependency is:

<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <version>1.0.3</version>
</dependency>

If you don't like sbt or Maven, you can also check out this GitHub repo and run the following command from the root folder:

sbt package

SBT will create the library jar under target/scala-2.10.

Note that the library depends on the following artifacts:

libraryDependencies +=  "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"

libraryDependencies +=  "org.apache.hbase" % "hbase-common" % "1.0.3" excludeAll(ExclusionRule(organization = "javax.servlet", name="javax.servlet-api"), ExclusionRule(organization = "org.mortbay.jetty", name="jetty"), ExclusionRule(organization = "org.mortbay.jetty", name="servlet-api-2.5"))

libraryDependencies +=  "org.apache.hbase" % "hbase-client" % "1.0.3" excludeAll(ExclusionRule(organization = "javax.servlet", name="javax.servlet-api"), ExclusionRule(organization = "org.mortbay.jetty", name="jetty"), ExclusionRule(organization = "org.mortbay.jetty", name="servlet-api-2.5"))

libraryDependencies +=  "org.apache.hbase" % "hbase-server" % "1.0.3" excludeAll(ExclusionRule(organization = "javax.servlet", name="javax.servlet-api"), ExclusionRule(organization = "org.mortbay.jetty", name="jetty"), ExclusionRule(organization = "org.mortbay.jetty", name="servlet-api-2.5"))


libraryDependencies +=  "org.scalatest" % "scalatest_2.10" % "2.2.4" % "test"

Also check whether the current branch is passing all tests in Travis CI before checking out (see the "build" badge above).

Setting the HBase host

The HBase Zookeeper quorum host can be set in multiple ways.

(1) Passing the host to the spark-submit command:

spark-submit --conf spark.hbase.host=thehost ...

(2) Using the hbase-site.xml file (in the root of your jar, i.e. src/main/resources/hbase-site.xml):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>thehost</value>
	</property>
	
	<!-- Put any other property here, it will be used -->
</configuration>

(3) If you have access to the JVM parameters:

java -Dspark.hbase.host=thehost -jar ....

(4) Using Scala code:

val sparkConf = new SparkConf()
...
sparkConf.set("spark.hbase.host", "thehost")
...
val sc = new SparkContext(sparkConf)

or you can configure it directly through the SparkContext's Hadoop configuration:

sparkContext.hadoopConfiguration.set("spark.hbase.host", "thehost")
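
To double-check which host value the job will actually use, a quick sketch (reading back whichever of the settings above was applied):

println(sc.getConf.getOption("spark.hbase.host")
  .orElse(Option(sc.hadoopConfiguration.get("spark.hbase.host")))
  .getOrElse("not set"))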

Writing to HBase (Basic)

Writing to HBase is very easy. Remember to import the implicit conversions:

import it.nerdammer.spark.hbase._

You just have to create an RDD, like the following one:

val rdd = sc.parallelize(1 to 100)
            .map(i => (i.toString, i+1, "Hello"))

This RDD is made of tuples like ("1", 2, "Hello") or ("27", 28, "Hello"). The first element of each tuple is taken as the row id; the others are assigned to columns.

rdd.toHBaseTable("mytable")
    .toColumns("column1", "column2")
    .inColumnFamily("mycf")
    .save()

You are done. HBase now contains 100 rows in table mytable, each row containing two values for columns mycf:column1 and mycf:column2.
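
A quick way to verify the write (a sketch, using the read API described in the next section):

val count = sc.hbaseTable[(String, Int, String)]("mytable")
    .select("column1", "column2")
    .inColumnFamily("mycf")
    .count
// count should be 100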

Reading from HBase (Basic)

Reading from HBase is easier. Remember to import the implicit conversions:

import it.nerdammer.spark.hbase._

If you want to read the data written in the previous example, you just need to write:

val hBaseRDD = sc.hbaseTable[(String, Int, String)]("mytable")
    .select("column1", "column2")
    .inColumnFamily("mycf")

Now hBaseRDD contains all the data found in the table. Each object in the RDD is a tuple containing (in order) the row id, the corresponding value of column1 (Int) and column2 (String).

If you don't want the row id and only want to read the columns, just remove the first element from the tuple spec:

val hBaseRDD = sc.hbaseTable[(Int, String)]("mytable")
    .select("column1", "column2")
    .inColumnFamily("mycf")

This way, only the columns that you have chosen will be selected.

You don't have to prefix column names with the column family passed to inColumnFamily(COLUMN_FAMILY_NAME); for columns belonging to other families, prefix the column name with the family name and a colon (:).

val hBaseRDD = sc.hbaseTable[(Int, String, String)]("mytable")
    .select("column1","columnfamily2:column2","columnfamily3:column3")
    .inColumnFamily("columnfamily1")

Other Topics

Filtering

It is possible to filter the results by prefixes of row keys. Filtering also supports additional salting prefixes (see the salting section).

val rdd = sc.hbaseTable[(String, String)]("table")
      .select("col")
      .inColumnFamily(columnFamily)
      .withStartRow("00000")
      .withStopRow("00500")

The example above retrieves all rows having a row key greater than or equal to 00000 and lower than 00500. The options withStartRow and withStopRow can also be used separately.
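
For example, a sketch using only withStartRow to read every row with key greater than or equal to 00500:

val tail = sc.hbaseTable[(String, String)]("table")
      .select("col")
      .inColumnFamily(columnFamily)
      .withStartRow("00500")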

Managing Empty Columns

Empty columns are managed by using Option[T] types:

val rdd = sc.hbaseTable[(Option[String], String)]("table")
      .select("column1", "column2")
      .inColumnFamily(columnFamily)

rdd.foreach(t => {
    if(t._1.nonEmpty) println(t._1.get)
})

You can use the Option[T] type every time you are not sure whether a given column is present in your HBase RDD.
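
For instance, instead of calling .get you can pattern match on the Option (a sketch, with a placeholder value of our choosing for missing cells):

val withDefaults = rdd.map {
  case (Some(value), col2) => (value, col2)
  case (None, col2)        => ("<missing>", col2)
}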

Using different column families

Different column families can be used both when reading and when writing an RDD.

data.toHBaseTable("mytable")
      .toColumns("column1", "cf2:column2")
      .inColumnFamily("cf1")
      .save()

In the example above, cf1 refers only to column1, because cf2:column2 is already fully qualified.

val count = sc.hbaseTable[(String, String)]("mytable")
      .select("cf1:column1", "column2")
      .inColumnFamily("cf2")
      .count

In the reading example above, the default column family cf2 applies only to column2.

Usage in Spark Streaming

The connector can be used in Spark Streaming applications with the same API.

// stream is a DStream[(Int, Int)]

stream.foreachRDD(rdd =>
    rdd.toHBaseTable("table")
      .inColumnFamily("cf")
      .toColumns("col1")
      .save()
    )
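
For completeness, a minimal end-to-end sketch (assumptions not taken from this README: an existing SparkContext sc, a plain-text socket source on localhost:9999, and illustrative table and column names):

import it.nerdammer.spark.hbase._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// DStream[(String, Int)]: the line is the row id, its length goes into col1
val stream = ssc.socketTextStream("localhost", 9999)
  .map(line => (line, line.length))

stream.foreachRDD { rdd =>
  rdd.toHBaseTable("table")
    .inColumnFamily("cf")
    .toColumns("col1")
    .save()
}

ssc.start()
ssc.awaitTermination()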

HBaseRDD as Spark Dataframe

You can convert the hBaseRDD to a Spark DataFrame for further Spark transformations or for moving the data to other stores such as MongoDB or Hive.

val hBaseRDD = sparkContext.hbaseTable[(Option[String], Option[String], Option[String], Option[String], Option[String])](HBASE_TABLE_NAME)
    .select("column1", "column2", "column3", "column4", "column5")
    .inColumnFamily(HBASE_COLUMN_FAMILY)

Map over hBaseRDD to create an org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] (i.e. an RDD[Row]):

val rowRDD = hBaseRDD.map(i => Row(i._1.get,i._2.get,i._3.get,i._4.get,i._5.get))
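
Since each field is an Option, calling .get throws when a cell is missing; a safer variant of the mapping above (a sketch, with empty-string defaults of our choosing) is:

val rowRDD = hBaseRDD.map(i => Row(
  i._1.getOrElse(""), i._2.getOrElse(""), i._3.getOrElse(""),
  i._4.getOrElse(""), i._5.getOrElse("")))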

Create a schema for the RDD[Row] above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

object myschema {
  val column1 = StructField("column1", StringType)
  val column2 = StructField("column2", StringType)
  val column3 = StructField("column3", StringType)
  val column4 = StructField("column4", StringType)
  val column5 = StructField("column5", StringType)
  val struct = StructType(Array(column1, column2, column3, column4, column5))
}
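
The same schema can also be built more concisely; a sketch using the same imports:

val struct = StructType(
  Seq("column1", "column2", "column3", "column4", "column5")
    .map(name => StructField(name, StringType)))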

Create a Spark DataFrame from the RDD[Row] and the schema:

val myDF = sqlContext.createDataFrame(rowRDD, myschema.struct)

Now you can apply any Spark transformation or action, for example:

myDF.show()

This displays the DataFrame's data in tabular form.
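
Any other DataFrame operation works the same way; for example, a sketch that filters on the row id column and projects a subset of columns (column names as in the schema above):

myDF.filter(myDF("column1") === "1")
  .select("column2", "column3")
  .show()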

SparkSQL on HBase

Since the previous example converted the hBaseRDD into a Spark DataFrame, you can now run Spark SQL on it.

Create a temporary table in Spark:

myDF.registerTempTable("mytable")

Run Spark SQL against the temporary table:

sqlContext.sql("SELECT * FROM mytable").show()

Advanced

Salting Prefixes

Salting is supported in reads and writes. Only string-valued row ids are supported at the moment, so salting prefixes must also be of String type.

sc.parallelize(1 to 1000)
      .map(i => (pad(i.toString, 5), "A value"))
      .toHBaseTable(table)
      .inColumnFamily(columnFamily)
      .toColumns("col")
      .withSalting((0 to 9).map(s => s.toString))
      .save()

In the example above, each row id is composed of 5 digits: from 00001 to 01000. The salting property adds a random digit in front, so you will have records like: 800001, 600031, ...
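
The pad helper used above is not defined in this README; a minimal sketch that left-pads a string with zeros up to the given length could be:

def pad(s: String, length: Int): String =
  "0" * (length - s.length) + s

pad("42", 5)   // "00042"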

When reading the RDD, you just have to declare the salting used in the table and ignore it when specifying bounds (withStartRow or withStopRow). The library takes care of dealing with the salting.

val rdd = sc.hbaseTable[String](table)
      .select("col")
      .inColumnFamily(columnFamily)
      .withStartRow("00501")
      .withSalting((0 to 9).map(s => s.toString))

Custom Mapping with Case Classes

Custom mapping can be used in place of the default tuple-mapping technique. Just define a case class for your type:

case class MyData(id: Int, prg: Int, name: String)

and define an object containing an implicit writer and reader for your type:

implicit def myDataWriter: FieldWriter[MyData] = new FieldWriter[MyData] {
    override def map(data: MyData): HBaseData =
      Seq(
        Some(Bytes.toBytes(data.id)),
        Some(Bytes.toBytes(data.prg)),
        Some(Bytes.toBytes(data.name))
      )

    override def columns = Seq("prg", "name")
}

Do not forget to override the columns method.

Then, you can define an implicit reader:

implicit def myDataReader: FieldReader[MyData] = new FieldReader[MyData] {
    override def map(data: HBaseData): MyData = MyData(
      id = Bytes.toInt(data.head.get),
      prg = Bytes.toInt(data.drop(1).head.get),
      name = Bytes.toString(data.drop(2).head.get)
    )

    override def columns = Seq("prg", "name")
}

Once done, make sure that the implicits are imported and that they do not produce a non-serializable task (Spark will check this at runtime).

You can now use your converters easily:

val data = sc.parallelize(1 to 100).map(i => new MyData(i, i, "Name" + i.toString))
// data is an RDD[MyData]

data.toHBaseTable("mytable")
  .inColumnFamily("mycf")
  .save()

val read = sc.hbaseTable[MyData]("mytable")
  .inColumnFamily("mycf")

The converters above are low level and use the HBase API directly. Since the connector already provides predefined converters for many simple and complex types, you will probably want to reuse them. The FieldReaderProxy and FieldWriterProxy API was created for this purpose.

High-level converters using FieldWriterProxy

You can create a new FieldWriterProxy by declaring a conversion from your custom type to a predefined type. In this case, the predefined type is a tuple of three basic fields:

// MySimpleData is a case class

implicit def myDataWriter: FieldWriter[MySimpleData] = new FieldWriterProxy[MySimpleData, (Int, Int, String)] {

  override def convert(data: MySimpleData) = (data.id, data.prg, data.name) // the first element is the row id

  override def columns = Seq("prg", "name")
}

The corresponding FieldReaderProxy converts a tuple of three basic fields back into an object of class MySimpleData:

implicit def myDataReader: FieldReader[MySimpleData] = new FieldReaderProxy[(Int, Int, String), MySimpleData] {

  override def columns = Seq("prg", "name")

  override def convert(data: (Int, Int, String)) = MySimpleData(data._1, data._2, data._3)
}

Note that we have not used the HBase API. Currently, the proxy API can read and write tuples of up to 22 fields (including the row id).
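
Relating back to the serialization note above, one convenient pattern (a sketch; the object name is illustrative and MySimpleData is assumed to be case class MySimpleData(id: Int, prg: Int, name: String)) is to group the implicit proxies in a top-level serializable object and import them wherever the RDDs are built:

object MySimpleDataConversions extends Serializable {

  implicit def mySimpleDataWriter: FieldWriter[MySimpleData] =
    new FieldWriterProxy[MySimpleData, (Int, Int, String)] {
      override def convert(data: MySimpleData) = (data.id, data.prg, data.name)
      override def columns = Seq("prg", "name")
    }

  implicit def mySimpleDataReader: FieldReader[MySimpleData] =
    new FieldReaderProxy[(Int, Int, String), MySimpleData] {
      override def columns = Seq("prg", "name")
      override def convert(data: (Int, Int, String)) = MySimpleData(data._1, data._2, data._3)
    }
}

// import MySimpleDataConversions._ where you call toHBaseTable / hbaseTable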

spark-hbase-connector's People

Contributors

chetkhatri, dskskv, fabiofumarola, nicolaferraro, rinofm


spark-hbase-connector's Issues

Class not found exception

Hi there,

While running a Spark HBase application, it throws the exception below:

Exception in thread "main" java.lang.NoClassDefFoundError: it/nerdammer/spark/hbase/package$
at SparkHBaseExample$.main(SparkHBaseExample.scala:29)
at SparkHBaseExample.main(SparkHBaseExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: it.nerdammer.spark.hbase.package$
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

I have imported it.nerdammer.spark.hbase._ in my Scala code and added libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" to my build.sbt.

write to hbase table exception

I am using HDP 2.2 and recompiled spark-1.5.2 with the Jackson package,
but when I run "rdd.toHBaseTable("mytable").toColumns("column1", "column2").inColumnFamily("mycf").save()"

it fails and seems to connect to a ZooKeeper server on localhost. How can I fix it? Do I have to specify the server in code or configure some files to make it connect to the right server?

thanks.

Log info:

java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-02-24 12:48:44,286 INFO [Executor task launch worker-2-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-24 12:48:44,286 WARN [Executor task launch worker-2-SendThread(localhost:2181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-02-24 12:48:44,291 WARN [Executor task launch worker-2] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2016-02-24 12:48:45,387 INFO [Executor task launch worker-2-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-24 12:48:45,387 WARN [Executor task launch worker-2-SendThread(localhost:2181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

Reading multiple column families

Looking at the basic APIs, I understand how to read multiple columns from a single column family; is there any way of reading from multiple column families too? The column family argument cannot be an array, right? Thanks!

It doesn't support the format of a DStream like DStream[(String, Map)]

I am currently using your Spark-HBase Connector library. My project is a Spark Streaming job that reads data from Kafka and, after processing, writes the results into HBase. I saw that your library supports Spark Streaming, but the problem is that my DStream has the format DStream[(String, Map)], i.e. a nested data structure with a Map inside. It seems your library doesn't support that DStream format. Can you solve that problem, or can you give me some hints on your code so that I can try to solve it and help others who have the same problem? Thanks in advance.

Spark 2.0 support?

Just wanted to check if there are any plans on supporting more recent versions of spark.

Thanks for your help, very promising project!

issue about kerberos

Have you considered adding Kerberos support? When I use this awesome package, I found that even if the executor gets the keytab file, it doesn't use it to connect to the HBase server. Has anyone run into this situation before?

convert to dataframe

Hi,

Any idea how to convert the RDD to a DataFrame in order to join it with another DataFrame?

val rdd = sc.hbaseTable[(String, String)]("table")
  .select("col")
  .inColumnFamily(columnFamily)
  .withStartRow("00000")
  .withStopRow("00500")

cordially

issues writing to HBase version 0.94.2-cdh4.2.0

I'm trying to write to HBase version 0.94.2-cdh4.2.0 using Scala and Spark.

I'm not getting any errors, only these messages:

INFO  2015-11-02 15:50:54,285 [Executor task launch worker-1-SendThread(host:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip/ip:2181. Will not attempt to authenticate using SASL (unknown error)
INFO  2015-11-02 15:50:54,453 [Executor task launch worker-1-SendThread(host:2181)] org.apache.zookeeper.ClientCnxn: Socket connection established to host/host:2181, initiating session
[Stage 0:>                                                          (0 + 4) / 4]INFO  2015-11-02 15:50:54,627 [Executor task launch worker-1-SendThread(host:2181)] org.apache.zookeeper.ClientCnxn: Session establishment complete on server host/host:2181, sessionid = 0x150608c04de044c, negotiated timeout = 90000
INFO  2015-11-02 15:50:55,287 [Executor task launch worker-1] org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for testTable
INFO  2015-11-02 15:50:55,324 [Executor task launch worker-2] org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for testTable
INFO  2015-11-02 15:50:55,325 [Executor task launch worker-3] org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for testTable
INFO  2015-11-02 15:50:55,326 [Executor task launch worker-0] org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for testTable

Is there any known issue with writing to that specific version of HBase?

NoSuchFieldError: RPC_HEADER

I am running Spark (v1.5.1) in local mode (--master local[4]) and HBase (1.1.2) in local mode as well. When I use spark-hbase-connector I get the following exception:

16/06/26 09:09:30 ERROR client.AsyncProcess: Failed to get region location 
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchFieldError: RPC_HEADER
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:807)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:920)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:889)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1222)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
    at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
    at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchFieldError: RPC_HEADER
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeConnectionHeaderPreamble(RpcClientImpl.java:830)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:753)
    ... 16 more

Here is my code

val sparkConf = new SparkConf().setAppName("Dump Key Vales to HBase using Spark")
val sc = new SparkContext(sparkConf)
System.setProperty("hadoop.home.dir", "/home/sachin/swtools/hadoop-2.7.1")

// Load file data
val file = "/home/sachin/work/data/sample.txt"
val fileData = sc.textFile(file)
val pairs = fileData.map(line => (line.split("\t")(0), line.split("\t")(1)))

val aliasTable = new AliasTable("id_lookup", "cf", "id")

pairs.toHBaseTable(aliasTable.tableName)
  .toColumns(aliasTable.columnName)
  .inColumnFamily(aliasTable.columnFamily)
  .save()

What could possibly trigger this exception? I have already searched on Google, but no luck there. Let me know if any more information is required.

Are reads/writes data-local?

Hi, sorry, more of a question rather than an issue, but I didn't know where to ask.

This is a very handy library, but I am wondering whether all reads/writes use data locality, i.e. run within a Spark executor reading straight from HDFS, or whether they go via the HBase server. It seems it at least needs to connect to HBase, but does it fetch the data straight from HBase's HDFS files or from HBase via TCP?

Thanks

dependency not found: org.codehaus.jackson#jackson-core-asl;1.8.3

Hey Nicolas,
Thanks for the great job.
I am trying to run the following spark-shell:
spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3
But it seems that the jackson-core-asl dependency is broken: it does not download from anywhere. Could you please recheck?
::: WARNINGS
[NOT FOUND ] org.codehaus.jackson#jackson-core-asl;1.8.3!jackson-core-asl.jar (0ms)

==== local-m2-cache: tried

  file:/root/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.3/jackson-core-asl-1.8.3.jar


Change HBaseData and column into FieldMapper

In order to avoid storing default values in HBase, we have to change how the fields are mapped.

My proposal is to change the type HBaseData
from

  type HBaseData = Iterable[Option[Array[Byte]]]

to

  type ColumnQualifier = String
  type HBaseData = Map[ColumnQualifier,Option[Array[Byte]]]

Using this approach we can change the columns method from Iterable[String] to Set[String], and we don't need an order-based conversion when mapping case classes.

Why

sc.parallelize(mrArray).toHBaseTable(tableName)
  .toColumns("column1", "column2")
  .inColumnFamily("base_info")
  .save()

"Messages Make" is"Error:(55, 29) value toHBaseTable is not a member of org.apache.spark.rdd.RDD[Seq[Any]]
sc.parallelize(mrArray).toHBaseTable(tableName)"
^

Why?

Deal With NA When Write Into HBaseTable

Hi,
When I try to save an RDD into HBase, how can I get rid of NA values (like "", "null", "NA", etc.)?

    trdd.toHBaseTable("mbk_user_label")
      .toColumns("mobikeno", "registertime", "ts", "progress", "nation", "status", "citycode")
      .inColumnFamily("attrs")
      .save()

trdd: org.apache.spark.rdd.RDD[(String, String, String, String, String, String, String, String)]

Should I change my trdd into Option[String]?

Thx.

connector not working on Hortonworks Sandbox

Hi, I am trying to use this connector to read/write to HBase but have failed despite much effort. I am using Hortonworks Sandbox version 2.5 and sbt assembly to build a jar, and then run the jar through spark-submit. I am getting the following error:
16/12/16 18:45:35 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
16/12/16 18:45:35 INFO ClientCnxn: Session establishment complete on server sandbox.hortonworks.com/172.17.0.2:2181, sessionid = 0x15905fbca090173, negotiated timeout = 40000
16/12/16 18:45:35 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
16/12/16 18:45:35 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
16/12/16 18:45:35 INFO TableOutputFormat: Created table instance for mytable
16/12/16 18:45:35 INFO TableOutputFormat: Created table instance for mytable
16/12/16 18:45:35 INFO TableOutputFormat: Created table instance for mytable
16/12/16 18:45:35 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:218)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:410)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:359)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:238)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:163)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:120)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1295)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:489)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:558)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1211)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1178)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
... 19 more
(The same stack trace is repeated for tasks 2.0, 3.0, and 1.0, and again in the "Lost task 2.0 in stage 0.0" warning.)

16/12/16 18:45:35 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
16/12/16 18:45:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/16 18:45:35 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor localhost: java.lang.RuntimeException (java.lang.NullPointerException) [duplicate 1]
16/12/16 18:45:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/16 18:45:35 INFO TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3) on executor localhost: java.lang.RuntimeException (java.lang.NullPointerException) [duplicate 2]
16/12/16 18:45:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/16 18:45:35 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor localhost: java.lang.RuntimeException (java.lang.NullPointerException) [duplicate 3]
16/12/16 18:45:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/16 18:45:35 INFO TaskSchedulerImpl: Cancelling stage 0
16/12/16 18:45:35 INFO DAGScheduler: ResultStage 0 (saveAsNewAPIHadoopDataset at HBaseWriterBuilder.scala:101) failed in 1.984 s
16/12/16 18:45:35 INFO DAGScheduler: Job 0 failed: saveAsNewAPIHadoopDataset at HBaseWriterBuilder.scala:101, took 2.593875 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.RuntimeException: java.lang.NullPointerException
(same executor stack trace as above)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1946)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1144)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
at it.nerdammer.spark.hbase.HBaseWriter.save(HBaseWriterBuilder.scala:101)
at org.inno.redistagger.redistagger$.main(redistag.scala:72)
at org.inno.redistagger.redistagger.main(redistag.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
(same stack trace as above)
16/12/16 18:45:35 INFO SparkContext: Invoking stop() from shutdown hook
(followed by the usual SparkContext / ShutdownHookManager shutdown log)
How can I get it resolved?
Thanks
Ravi

Does the connector support scan with filtering on some column values?

Hello,
I want to get some records based on scan filters, e.g.:
column:something = "somevalue"
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html
Does the connector support this kind of filter? If yes, please give an example.

Thank you very much,
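
One possible client-side workaround, assuming the connector does not expose HBase server-side Filters directly, is to read the column as an Option and filter the resulting RDD; the table, family and qualifier names below just mirror the example in the question and are otherwise hypothetical:

import it.nerdammer.spark.hbase._

// A minimal sketch of client-side filtering, assuming no server-side Filter support.
val matching = sc.hbaseTable[(String, Option[String])]("mytable")
  .select("something")
  .inColumnFamily("column")
  .filter { case (_, value) => value.exists(_ == "somevalue") }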

how can I get the timestamp of each row data

how can I get the timestamp of each row data?

val rdd = sc.hbaseTable[(String, String)]("yourTable")   // table name omitted in the original question
  .select("da:")
  .inColumnFamily("co")
  .filter(rowData => {
    // when I use newAPIHadoopRDD, I can get the timestamp like this:
    // val time: Long = rowData._2.listCells().get(0).getTimestamp
    // so, what can I write here?
    true   // placeholder so the snippet compiles
  })
  .map(rowData => {...})
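
One fallback, already hinted at in the question, is to bypass the tuple API and read through newAPIHadoopRDD, which returns the raw HBase Result and therefore the cell timestamps; the ZooKeeper host and table name below are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// A minimal sketch of reading timestamps via the plain Hadoop API instead of the connector.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "thehost")        // hypothetical host
hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable")    // hypothetical table

val raw = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

val rowsWithTimestamps = raw.map { case (_, result) =>
  val cell = result.listCells().get(0)                    // first cell of the row
  (Bytes.toString(result.getRow), cell.getTimestamp)
}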

Persisting Collections type in Hbase

I have tried to follow the How-Use-Guide; it worked perfectly for primitive types. However, when I tried to use collection types, I received:

error: value toHBaseTable is not a member of org.apache.spark.rdd.RDD[(String, Int, String, scala.collection.immutable.Map[Int,String])]
rddMap.toHBaseTable("testMap").

I guess collections are not supported yet, are they?

The code I used is as follows:

val rddMap = sc.parallelize(1 to 100).map(i => (i.toString, i+1, "Hello", Map(i -> "Hello")))

// write to hbase
rddMap.toHBaseTable("testMap")
  .toColumns("c1", "c2", "c3")
  .inColumnFamily("cf")
  .save()
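
One workaround, assuming collections are indeed not supported, is to flatten the Map into a plain String before writing; the "k=v;..." encoding below is an arbitrary choice, not something provided by the library:

import it.nerdammer.spark.hbase._

// A minimal sketch: serialize the Map into a String so the tuple only contains supported types.
val rddFlat = sc.parallelize(1 to 100)
  .map(i => (i.toString, i + 1, "Hello", Map(i -> "Hello")))
  .map { case (key, n, s, m) =>
    (key, n, s, m.map { case (k, v) => s"$k=$v" }.mkString(";"))   // arbitrary encoding
  }

rddFlat.toHBaseTable("testMap")
  .toColumns("c1", "c2", "c3")      // row key + 3 values, so 3 column names
  .inColumnFamily("cf")
  .save()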

Supporting collections

The library should support storing scala collections in a single HBase column. Since it seems there is not a standard way to map collections to Array[Byte] in HBase, the library should provide different, configurable, ways of handling collections.

The default strategy should be compatible with the one used in the Apache Phoenix project.

Reading a table hangs at the last stage

While reading columns from an HBase table, the job hangs at the last stage.

Cluster size: 3 nodes
Spark version: 1.5.2
Executor memory (per node): 16 GB
Number of cores: 24
Driver memory: 4 GB

Does connector support Batch Get?

Hi there,

I want to read multiple rows according to a list of row ids, but it seems like the connector only supports scan operations. I'm wondering whether the connector supports batch Get. If it does not, is it theoretically possible to add this function? It would be great if you could share some ideas about it.

Thanks.
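
Assuming batch Get is not supported, one workaround is to scan the table and keep only the wanted row keys; note this still scans the table, so it is less efficient than a true multi-Get. Table, column and family names below are hypothetical:

import it.nerdammer.spark.hbase._

// A minimal sketch: broadcast the wanted ids and filter the scanned RDD by row key.
val wantedIds = sc.broadcast(Set("row-1", "row-2", "row-3"))

val rows = sc.hbaseTable[(String, Option[String])]("mytable")
  .select("column1")
  .inColumnFamily("mycf")
  .filter { case (rowKey, _) => wantedIds.value.contains(rowKey) }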

Write into multiple HBase

Right now, both the HBase host and the HBase root dir are retrieved from the SparkConf, so there is the limitation that we can only connect to one HBase cluster.
What if I want to write the data to two separate HBase clusters?

Different types of row-id

Row ids in HBase are not required to be strings. Any type can be used, if it can be converted to Array[Byte].

Currently, the library supports filtering and salting only for string-valued row ids.

We should add support for any type of row id when filtering rows, and for other types of row ids in the salting mechanism.

need help: format of "thehost"

Hey, thank you so much for providing such a good package.
I'm new to HBase and thus need some help.

If I want to access a remote HBase, what should I put in set("spark.hbase.host", "thehost")?
Is it similar to MongoDB, e.g. 10.200.xx.xxx:27017 (host:port)? Do I need to add a computer user name and password?

Thanks in advance!
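
A minimal sketch, assuming the remote cluster's ZooKeeper quorum is reachable at 10.200.xx.xxx: spark.hbase.host takes only the ZooKeeper quorum host (no port; the default ZooKeeper client port 2181 is assumed), and no user name or password is needed unless the cluster is secured:

import org.apache.spark.{SparkConf, SparkContext}

// Host only, unlike MongoDB's host:port style; the application name is hypothetical.
val sparkConf = new SparkConf()
  .setAppName("remote-hbase-example")
  .set("spark.hbase.host", "10.200.xx.xxx")
val sc = new SparkContext(sparkConf)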

select single option empty column

sc.hbaseTable[(String, Option[String])]("table").select("column1").inColumnFamily("f").foreach{case (k, v) => println(s"$k, ${v.isDefined}, ${v.isEmpty}")}

does not print anything.

Running in Spark 2.2

import it.nerdammer.spark.hbase._
import org.apache.spark.sql.SparkSession

object SparkHBase {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("FunnelSpark")
      .master("local[*]")
      .config("spark.hbase.host", "localhost")
      .getOrCreate

    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("spark.hbase.host", "localhost")

    val rdd = sc.parallelize(1 to 100)
      .map(i => (i.toString, i + 1, "Hello"))

    rdd.toHBaseTable("mytable")
      .toColumns("column1", "column2")
      .inColumnFamily("mycf")
      .save()

    spark.stop()

  }

}

This code writes data to HBase, but then it fails with this error:

java.lang.IllegalArgumentException: Can not create a Path from a null string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
	at org.apache.hadoop.fs.Path.<init>(Path.java:89)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:132)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:101)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
	at it.nerdammer.spark.hbase.HBaseWriter.save(HBaseWriterBuilder.scala:102)
	at SparkHBase$.main(SparkHBase.scala:47)
	at SparkHBase.main(SparkHBase.scala)
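
One commonly suggested workaround, under the assumption that the Spark 2.2 commit protocol fails because no Hadoop output directory is configured when the connector calls saveAsNewAPIHadoopDataset, is to point the committer at a scratch directory before saving; this is a sketch of that assumption, not a confirmed fix, and the path is arbitrary:

// Give the Hadoop committer a dummy output directory so it does not build a Path from null.
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/spark-hbase-dummy-out")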

Write to HBase Using hbaseTable Read From HBase

Hi,

A quick question, say I read a table from HBase,

val hBaseRDD = sc.hbaseTable[(String, String, Float, Float)]("photos_sub")
  .select("tags", "lat", "lon")
  .inColumnFamily("details")

How can I write hBaseRDD back to HBase? Thanks
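
A minimal sketch, assuming the first element of each tuple read back is the row key: the same RDD can be written with toHBaseTable, naming the three non-key fields. The destination table "photos_sub_copy" is hypothetical:

import it.nerdammer.spark.hbase._

// Write the RDD read above back to HBase, reusing the first tuple element as the row key.
hBaseRDD
  .toHBaseTable("photos_sub_copy")
  .toColumns("tags", "lat", "lon")
  .inColumnFamily("details")
  .save()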

could not find implicit value for parameter mapper: it.nerdammer.spark.hbase.conversion.FieldWriter[org.apache.spark.sql.Row] rdd.toHBaseTable("mytable")

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, concat, lit}
import it.nerdammer.spark.hbase._

object SparkHBase {
    def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("HbaseSpark")
      .setMaster("local[*]")
      .set("spark.hbase.host", "localhost")

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext
      .read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\001")
      .load("/Users/11130/small")

    val df1 = df.withColumn("row_key", concat(col("_c3"), lit("_"), col("_c5"), lit("_"), col("_c0")))
    df1.registerTempTable("mytable1")
    val newDf = sqlContext.sql("Select row_key, _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7," +
      "_c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19 from mytable")

    val rdd = newDf.rdd

    rdd.toHBaseTable("mytable")
      .toColumns("event_id", "device_id", "uidx", "session_id", "server_ts", "client_ts", "event_type", "data_set_name",
        "screen_name", "card_type", "widget_item_whom", "widget_whom", "widget_v_position", "widget_item0_h_position",
        "publisher_tag", "utm_medium", "utm_source", "utm_campaign", "referrer_url", "notification_class")
      .inColumnFamily("mycf")
      .save()

    sc.stop()
  }
}

Any idea what is going wrong here?
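
A minimal sketch of one possible workaround, assuming the error simply means there is no implicit FieldWriter for org.apache.spark.sql.Row: convert each Row into a tuple of supported primitive types before calling toHBaseTable. The field positions, String types and the two column names shown are assumptions about the data, kept short for illustration:

// Map Rows to tuples so the connector can find a FieldWriter for them.
val tupleRdd = newDf.rdd.map(r => (r.getString(0), r.getString(1), r.getString(2)))

tupleRdd.toHBaseTable("mytable")
  .toColumns("event_id", "device_id")   // one column name per non-key tuple element
  .inColumnFamily("mycf")
  .save()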

multiple versions with the same row key

When I tried with data that contains multiple versions of values for the same row key, I can only get the latest version.

I saved RDD into HBase as follows:

myRDD.toHBaseTable("table").toColumns("col1", "col2").inColumnFamily("cf").save()

And retrieved as follows:

val hBaseRDD = sc.hbaseTable[(String, Option[String], Option[String])]("table")
  .select("col1", "col2")
  .inColumnFamily("cf")
  .withStartRow("columbian")
  .withStopRow("columbian")

hBaseRDD.foreach(t => {
  if (t._1.nonEmpty) print(t._1.mkString + ": ")
  if (t._2.nonEmpty) print(t._2.mkString + ", ")
  if (t._3.nonEmpty) println(t._3.mkString)
})

I set "hbase.column.max.version" as "65536" in both hbase-site.xml and within Spark application.
Within Spark application:

val sparkConf = new SparkConf().setAppName("sample")
val hbaseSparkConf = new HBaseSparkConf()
val conf= hbaseSparkConf.createHadoopBaseConfig()
conf.addResource("/usr/local/hbase/conf/hbase-site.xml")
conf.set("spark.hbase.root.dir", "file:///usr/local/hbase/hbaseroot")
conf.set("hbase.column.max.version", "65536")
val sc = new SparkContext(sparkConf)

in hbase-site.xml

<property>
  <name>hbase.rootdir</name>
  <value>file:///usr/local/hbase/hbaseroot</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/hbase/zookeeper</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>/usr/local/hbase/tmp</value>
</property>
<property>
  <name>hbase.column.max.version</name>
  <value>65536</value>
</property>

Optimizations on Spark HBase connector

Hey Guys,

I was wondering what kind of optimizations I can perform to improve read and write speed on HBase. Is there any documentation, or are there recommended steps, for this?

For example:

  1. Partition discovery, or some kind of region/key-based optimization that internally handles the read or scan operations (see the salting sketch after this list)
  2. Memory management parameters that could improve read/write speed, etc.
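
One key-based optimization the library itself appears to offer is row-key salting, which spreads consecutive string keys across regions and can help write throughput; a minimal sketch, with hypothetical table, column and family names and an arbitrary choice of ten salt prefixes:

import it.nerdammer.spark.hbase._

// Salting sketch: distribute sequential keys over ten buckets before writing.
val rdd = sc.parallelize(1 to 100).map(i => (i.toString, i + 1, "Hello"))

rdd.toHBaseTable("mytable")
  .toColumns("column1", "column2")
  .inColumnFamily("mycf")
  .withSalting((0 to 9).map(_.toString))
  .save()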

Example fails on dependency

Hi
I am trying to run this example and am getting NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration, and it is coming from it.nerdammer.spark.hbase.HBaseSpark.createHadoopBaseConfig. Is there any library I am missing or any dependency I need to fix? I am using Scala and mvn.

spark streaming with hbase ERROR

I got this error:
17/10/30 09:20:51 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Wrong number of columns. Expected 4 found 3
at it.nerdammer.spark.hbase.HBaseWriter$$anonfun$1.apply(HBaseWriterBuilder.scala:80)
at it.nerdammer.spark.hbase.HBaseWriter$$anonfun$1.apply(HBaseWriterBuilder.scala:66)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:147)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:144)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:159)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
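
A minimal sketch illustrating the likely cause (an assumption based on the message above): the first tuple element is used as the row key, so the number of names passed to toColumns must be exactly the tuple arity minus one; "Expected 4 found 3" suggests the declared columns and the tuple arity do not line up. Table and column names here are hypothetical:

import it.nerdammer.spark.hbase._

// A 4-tuple (row key + 3 values) needs exactly 3 column names.
val rdd = sc.parallelize(Seq(("key-1", 1, 2.0, "a")))

rdd.toHBaseTable("mytable")
  .toColumns("c1", "c2", "c3")
  .inColumnFamily("cf")
  .save()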

Spark Streaming context

I am using a Spark Streaming context to take a Flume stream and process it with Spark, and I then want to save the DStream into HBase. Is it possible to do so using your library? Thanks in advance.
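
A minimal sketch, assuming the Flume events have already been parsed into tuples: the connector works on RDDs, so each micro-batch can be written from foreachRDD. Table, column and family names are hypothetical:

import it.nerdammer.spark.hbase._
import org.apache.spark.streaming.dstream.DStream

// Save every micro-batch of the DStream through the connector.
def saveToHBase(stream: DStream[(String, Int, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.toHBaseTable("mytable")
      .toColumns("column1", "column2")
      .inColumnFamily("mycf")
      .save()
  }
}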

hbase Read issue - Json Mapping error

We are reading from a secured hbase cluster, and facing the following issue -

Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: No suitable constructor found for type [simple type, class org.apache.spark.rdd.RDDOperationScope]: can not instantiate from JSON object (need to add/enable type information?)

The actual content from the table is a plain String. The source code for this is fairly straightforward:

val hBaseRDD = sparkContext.hbaseTable[(String)]("myDB:test").select("123456789").inColumnFamily("m")
hBaseRDD.take(10)

We have provided the hbase quorum information through hbase-site.xml in the packaged jar file.
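
One commonly suggested remedy, offered here as an assumption rather than a confirmed cause: this JsonMappingException on RDDOperationScope is usually a Jackson version clash between Spark and the HBase/Hadoop client jars, and forcing a single Jackson version in the sbt build is one way to test that theory. The version below is an assumption; align it with the Jackson version bundled with your Spark distribution:

// sbt build sketch: pin Jackson to one version across all transitive dependencies.
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4",
  "com.fasterxml.jackson.core" % "jackson-core" % "2.4.4",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"
)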

As a Spark SQL datasource

Support for HBase as a Datasource for ThriftServer:

CREATE TEMPORARY TABLE tempTable
USING hbase
OPTIONS (table"tempHBaseTable")

Python support

Does this package support Python? If not, is there any similar package that reads HBase data into Spark using Python?

Not able to write to an hbase table

I have an EMR cluster on which Spark is running, and another EMR cluster on which HBase is running. I have created a table named 'TableForSpark' on it, and I'm trying to write data to it using the following code:

import it.nerdammer.spark.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
//import org.apache.spark.sql.execution.datasources.hbase._

object hbaseTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hbase test")
    //conf.set("spark.hbase.host", "192.168.0.23")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 10).map(i => (i.toString, i + 1, "Hello"))

    val rdd1 = rdd.toHBaseTable("TableForSpark").toColumns("column1", "column1").inColumnFamily("cf")
    rdd1.save()
  }
}

I have built 'spark-hbase-connector' using scala 2.11.8 on spark 2.0.0.

When I submit the job using the following command, it gets stuck in the last stage:
sudo spark-submit --deploy-mode client --jars $(echo lib/*.jar | tr ' ' ',') --class com.oreilly.learningsparkexamples.hbaseTest target/scala-2.11/hbase-test_2.11-0.0.1.jar

I have also kept the hbase-site.xml file in the resources folder, and the program is correctly picking up the ZooKeeper IP from it.

I have checked the logs of the task: it is able to connect to ZooKeeper but not able to write to HBase. Could anyone throw some light on the problem?

The last part of the log looks like this:

16/08/18 11:48:35 INFO YarnClientSchedulerBackend: Application application_1470825934412_0088 has started running.
16/08/18 11:48:35 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46496.
16/08/18 11:48:35 INFO NettyBlockTransferService: Server created on 10.60.0.xxx:46496
16/08/18 11:48:35 INFO BlockManager: external shuffle service port = 7337
16/08/18 11:48:35 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.60.0.13, 46496)
16/08/18 11:48:35 INFO BlockManagerMasterEndpoint: Registering block manager 10.60.0.xxx:46496 with 414.4 MB RAM, BlockManagerId(driver, 10.60.0.xxx, 46496)
16/08/18 11:48:35 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.60.0.13, 46496)
16/08/18 11:48:36 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1470825934412_0088
16/08/18 11:48:36 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
16/08/18 11:48:36 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
16/08/18 11:48:36 INFO SparkContext: Starting job: saveAsNewAPIHadoopDataset at HBaseWriterBuilder.scala:102
16/08/18 11:48:36 INFO DAGScheduler: Got job 0 (saveAsNewAPIHadoopDataset at HBaseWriterBuilder.scala:102) with 2 output partitions
16/08/18 11:48:36 INFO DAGScheduler: Final stage: ResultStage 0 (saveAsNewAPIHadoopDataset at HBaseWriterBuilder.scala:102)
16/08/18 11:48:36 INFO DAGScheduler: Parents of final stage: List()
16/08/18 11:48:36 INFO DAGScheduler: Missing parents: List()
16/08/18 11:48:36 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at map at HBaseWriterBuilder.scala:66), which has no missing parents
16/08/18 11:48:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 89.1 KB, free 414.4 MB)
16/08/18 11:48:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.2 KB, free 414.3 MB)
16/08/18 11:48:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.60.0.13:46496 (size: 33.2 KB, free: 414.4 MB)
16/08/18 11:48:37 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1012
16/08/18 11:48:37 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at map at HBaseWriterBuilder.scala:66)
16/08/18 11:48:37 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/08/18 11:48:37 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/08/18 11:48:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.60.0.134:53842) with ID 1
16/08/18 11:48:42 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16/08/18 11:48:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-60-0-xxx.ec2.internal, partition 0, PROCESS_LOCAL, 5427 bytes)
16/08/18 11:48:42 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-10-60-0-xxx.ec2.internal, partition 1, PROCESS_LOCAL, 5484 bytes)
16/08/18 11:48:42 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-60-0-xxx.ec2.internal:34581 with 2.8 GB RAM, BlockManagerId(1, ip-10-60-0-134.ec2.internal, 34581)
16/08/18 11:48:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 0 on executor id: 1 hostname: ip-10-60-0-xxx.ec2.internal.
16/08/18 11:48:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1 on executor id: 1 hostname: ip-10-60-0-xxx.ec2.internal.
16/08/18 11:48:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-60-0-xxx.ec2.internal:34581 (size: 33.2 KB, free: 2.8 GB)

It gets stuck at this point.

Thanks & Regards,
Surender.

ArrayIndexOutOfBoundsException when using HashSaltingProvider

Hello,

When adding salting on the row keys using withSalting, my Spark job fails with an ArrayIndexOutOfBoundsException at it.nerdammer.spark.hbase.HashSaltingProvider.salt(SaltingProvider.scala:67).

The line is:

 override def salt(rowKey: Array[Byte]): T = salting(hash(rowKey) % salting.size)

I think the issue here is that if hash(rowKey) is negative, then hash(rowKey) % salting.size is also negative, causing the exception.
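
A minimal sketch of the fix suggested above (not the library's actual code): compute a non-negative bucket index so that a negative hash can never index outside the salting array:

// Equivalent to Math.floorMod: always returns a value in [0, bucketCount).
def saltIndex(hash: Int, bucketCount: Int): Int =
  ((hash % bucketCount) + bucketCount) % bucketCount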

"hbaseTable" cannot read from HBase

I am trying to pull data from HBase table using spark-hbase-connector.

Here are the steps that I have done:

  1. Download spark-hbase-connector_2.10-1.0.3.jar file
  2. call spark using:
    spark-shell --deploy-mode client --master yarn --jars spark-hbase-connector_2.10-1.0.3.jar --conf spark.hbase.host=server1.dev.hbaseserver.com
  3. run:

import it.nerdammer.spark.hbase._

import org.apache.spark.SparkContext

import org.apache.spark.SparkConf

  4. read the HBase table using:
    val hBaseRDD = sc.hbaseTable[(String, String, String)]("WEATHER_INFO").select("tem-0301", "tem-0401").inColumnFamily("r")
  5. try to check the first 10 rows:
    hBaseRDD.take(10)

Here is the error log:

Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68347: row 'WEATHER_INFO,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=server1.dev.hbaseserver.com,60020,1465837116220, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to server1.dev.hbaseserver.com/10.14.116.11:60020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to server1.dev.hbaseserver.com/10.14.116.11:60020 is closing. Call id=9, waitTime=2

It seems like the program is pointing to the table 'hbase:meta' instead of the table 'WEATHER_INFO'.
I am new to Scala; I use PySpark most of the time. I want to try this Spark-HBase connector because it has a much better and more powerful UI.
Please point out where I made mistakes. Thanks.

sbt-spark-package

Hi,

Thank you very much for this library! Would you like to integrate your Spark Package with the sbt-spark-package plugin? This will make publishing new releases to the Spark Packages website super easy (you can simply publish a release using sbt spPublish, without the need to fill out the release form on the webpage!)

If your package is hosted on the Spark Packages Repository, people can simply use your package in Spark Applications with:
spark-shell --packages nerdammer/spark-hbase-connector:0.9.4

Similarly, users that also have the sbt-spark-package plugin can add
libraryDependencies += "nerdammer/spark-hbase-connector:0.9.4"
to their sbt build file.

The integration is simple, and I would be happy to submit a PR if you like.

Best,
Burak
