michal-harish / kafka-hadoop-loader

Hadoop job for schemaless incremental loading of messages from Kafka topics onto HDFS with configurable output partitioning.

License: Apache License 2.0

Java 100.00%

kafka-hadoop-loader's Issues

Work out a good API for transformations

Use the Parquet output format as a guiding example, and when stable, create branches for Kafka v0.9 and v0.10.

handle null message payloads

Kafka treats a null payload as a valid message, so null payloads need to be handled and covered by the system test.
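
A minimal sketch of one way to guard against null payloads, assuming the record reader sees the payload as a `java.nio.ByteBuffer` that may be null. The class `NullSafePayload` and its method are hypothetical names for illustration and are not part of kafka-hadoop-loader; whether a null payload should be skipped or written as an empty record is a decision this sketch leaves to the caller.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

/** Hypothetical helper, not part of kafka-hadoop-loader. */
final class NullSafePayload {
    private NullSafePayload() {}

    /**
     * Copies the payload into a byte array, or returns an empty Optional
     * when the message carries a null payload.
     */
    static Optional<byte[]> toBytes(ByteBuffer payload) {
        if (payload == null) {
            return Optional.empty(); // valid message, but nothing to copy
        }
        byte[] bytes = new byte[payload.remaining()];
        // duplicate() leaves the caller's buffer position untouched
        payload.duplicate().get(bytes);
        return Optional.of(bytes);
    }
}
```

A caller could then skip the record, or emit an empty value, when the result is empty instead of failing on a `NullPointerException`.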

Exactly-once delivery with hdfs checkpointer

It's mostly there; it just needs a lock mechanism and a system test which should:

produce a batch of messages
run the job successfully
produce another batch of messages
run the job and crash it after it has consumed some messages
re-run the job successfully
check all messages from both batches are on HDFS exactly once
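
The scenario above can be sketched as a small in-memory simulation. This is only an illustration of the test's logic under a strong assumption: that output and offset checkpoint are committed together, atomically, at the end of a successful run. The class `ExactlyOnceSketch` and all of its members are hypothetical stand-ins (a list for the Kafka partition, a list for the files on HDFS, an integer for the checkpoint) and do not reflect the loader's actual checkpointer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model, not the loader's real checkpointer. */
final class ExactlyOnceSketch {
    final List<String> topic = new ArrayList<>();     // stand-in for a Kafka partition
    final List<String> committed = new ArrayList<>(); // stand-in for files on HDFS
    int checkpoint = 0;                               // stand-in for the stored offset

    /** Appends count messages named prefix0, prefix1, ... to the topic. */
    void produce(int count, String prefix) {
        for (int i = 0; i < count; i++) {
            topic.add(prefix + i);
        }
    }

    /**
     * Runs one job from the last checkpoint. If crashAfter is non-negative,
     * the job fails once it has consumed that many messages; a negative
     * value lets it run to completion.
     */
    void runJob(int crashAfter) {
        List<String> staged = new ArrayList<>(); // uncommitted task output
        int offset = checkpoint;
        while (offset < topic.size()) {
            if (crashAfter >= 0 && staged.size() == crashAfter) {
                // staged output is dropped and the checkpoint stays put
                throw new IllegalStateException("simulated crash");
            }
            staged.add(topic.get(offset++));
        }
        // commit output and offset together (assumed to be atomic)
        committed.addAll(staged);
        checkpoint = offset;
    }
}
```

In a real system test the two lists would be replaced by an embedded broker and an HDFS directory, and the final check would compare the messages found on HDFS against the messages produced.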

Hi,
It seems there is an error in your pom.xml: all gridport elements use the .co extension instead of .com.
After fixing that and running mvn assembly:single, I still get an error:
Downloaded: http://maven.gridport.com/content/repositories/snapshots/commons-cli/commons-cli/1.2/commons-cli-1.2.jar (51 B at 0.0 KB/sec)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 44.610s
[INFO] Finished at: Tue Jan 27 16:56:19 IST 2015
[INFO] Final Memory: 10M/724M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project kafka-hadoop-loader: Failed to create assembly: Error creating assembly archive jar-with-dependencies: error in opening zip file -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Did I do anything wrong?

How do I know that the messages have been written to HDFS?

I use the command:
hadoop jar kafka-hadoop-loader.jar -z hadoop1:2181,hadoop2:2181 testFile1 /input

Then I use the hdfs command to check the files: hdfs dfs -ls /input. The result is:

Found 4 items
drwxr-xr-x - root supergroup 0 2016-09-06 15:07 /input/'testFile
drwxr-xr-x - root supergroup 0 2016-09-06 15:14 /input/'testFile1
drwxr-xr-x - root supergroup 0 2016-09-06 15:31 /input/_OFFSETS
-rw-r--r-- 1 root supergroup 0 2016-09-06 15:31 /input/_SUCCESS

But I can't open the folders that have single quotes in their names.

Default timestamp extractor for Kafka 0.10+

java.nio.BufferUnderflowException

17/07/26 15:20:58 INFO mapreduce.Job main: Task Id : attempt_1496398596009_5376987_m_000005_1, Status : FAILED
Error: java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:145)
at java.nio.ByteBuffer.get(ByteBuffer.java:694)
at kafka.api.ApiUtils$.readShortString(ApiUtils.scala:40)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:95)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:168)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:168)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at kafka.javaapi.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:47)
at io.amient.kafka.hadoop.io.KafkaInputRecordReader.nextKeyValue(KafkaInputRecordReader.java:154)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I ran this and got the error above.

Dynamic port allocation in embedded kafka system test

It's fixed to 9092, which will fail if, say, Kafka is already running locally on its default port.
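
One common way to avoid a fixed port in tests is to ask the operating system for a free one by binding a `java.net.ServerSocket` to port 0. The sketch below shows only that step; the class `FreePort` is a hypothetical helper, not part of kafka-hadoop-loader, and how the chosen port is handed to the embedded broker is left open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

/** Hypothetical test helper, not part of kafka-hadoop-loader. */
final class FreePort {
    private FreePort() {}

    /** Returns a TCP port that was free at the moment of the call. */
    static int find() {
        // Port 0 tells the OS to pick any unused ephemeral port.
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The socket is closed before the broker binds, so another process could in principle grab the port in between. For a local system test that window is usually acceptable, but it is a known limitation of this approach.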

Loaded data loses its partition and overwrites previous data instead of appending to the file

I checked out the code and did a test run. I found that the loaded data loses its partition and overwrites the previous data instead of appending new data to the old file.

@Test for exactly-once guarantee

produce a batch of messages
run the job successfully
produce another batch of messages
run the job and crash it after it has consumed some messages
re-run the job successfully
check all messages from both batches are on HDFS exactly once

Replace fake Hadoop job test with MRUnit + embedded Kafka server

This will work nicely with different versions

Upgrade to Kafka 0.8+ protocols

Generic configuration for the underlying consumers

At the moment the consumer can be configured only for these settings:

CONFIG_KAFKA_MESSAGE_MAX_BYTES = "kafka.fetch.message.max.bytes";
CONFIG_KAFKA_SOCKET_TIMEOUT_MS = "kafka.socket.timeout.ms";
CONFIG_KAFKA_RECEIVE_BUFFER_BYTES = "kafka.socket.receive.buffer.bytes";

But it should simply pass anything under kafka.consumer.* to the underlying consumer.
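
A minimal sketch of such a pass-through, assuming the job configuration can be iterated as string key/value entries (Hadoop's `Configuration` implements `Iterable<Map.Entry<String, String>>`, as does the entry set of a plain map). The class `ConsumerConfigPassThrough`, and the choice to strip the `kafka.consumer.` prefix before handing the keys to the consumer, are assumptions made for illustration and are not taken from the project.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper, not part of kafka-hadoop-loader. */
final class ConsumerConfigPassThrough {
    static final String PREFIX = "kafka.consumer.";

    private ConsumerConfigPassThrough() {}

    /**
     * Collects every entry whose key starts with "kafka.consumer." and
     * returns it with the prefix removed, ready to hand to a consumer.
     */
    static Properties extract(Iterable<Map.Entry<String, String>> jobConf) {
        Properties consumerProps = new Properties();
        for (Map.Entry<String, String> entry : jobConf) {
            String key = entry.getKey();
            if (key.startsWith(PREFIX)) {
                consumerProps.setProperty(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return consumerProps;
    }
}
```

With this shape, a setting such as `kafka.consumer.socket.timeout.ms=30000` in the job configuration would reach the consumer as `socket.timeout.ms=30000` without the loader needing a dedicated constant for it.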

compile failing

mvn assembly:single

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project kafka-hadoop-loader: Failed to create assembly: Failed to resolve dependencies for project: co.gridport.kafka:kafka-hadoop-loader:jar:1.2.2-SNAPSHOT: Unable to get dependency information for org.apache.kafka:kafka-core:jar:0.7.3-1: Failed to retrieve POM for org.apache.kafka:kafka-core:jar:0.7.3-1: Failure to transfer org.apache.kafka:kafka-core:pom:0.7.3-1 from http://maven.gridport.co/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of gridport-snapshots has elapsed or updates are forced. Original error: Could not transfer artifact org.apache.kafka:kafka-core:pom:0.7.3-1 from/to gridport-snapshots (http://maven.gridport.co/content/repositories/snapshots): maven.gridport.co: Name or service not known
