michal-harish / kafka-hadoop-loader

Hadoop job for schemaless incremental loading of messages from Kafka topics onto HDFS with configurable output partitioning.

License: Apache License 2.0

Java 100.00%

kafka-hadoop-loader's Issues

Exactly-once delivery with hdfs checkpointer

It's mostly there; it just needs a lock mechanism and a system test, which should:

  1. produce a batch of messages
  2. run the job successfully
  3. produce another batch of messages
  4. run the job and crash it after it has consumed some messages
  5. re-run the job successfully
  6. check that all messages from both batches are on HDFS exactly once
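
One possible shape for the lock mechanism mentioned above is an atomically created lock file next to the checkpoint data. The sketch below is only an illustration under stated assumptions: the class name `CheckpointLock` and the file name `_LOCK` are hypothetical, and it uses `java.nio.file` on a local directory as a stand-in for HDFS. It is not the project's implementation.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a lock file guarding a checkpoint directory.
// A local directory stands in for HDFS here (assumption).
final class CheckpointLock implements AutoCloseable {
    private final Path lockFile;

    private CheckpointLock(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Returns a held lock, or null if another job run already holds it.
    static CheckpointLock tryAcquire(Path checkpointDir) throws IOException {
        Path lockFile = checkpointDir.resolve("_LOCK"); // hypothetical file name
        try {
            // createFile checks for existence and creates in one atomic step,
            // so only one caller can succeed.
            Files.createFile(lockFile);
            return new CheckpointLock(lockFile);
        } catch (FileAlreadyExistsException alreadyHeld) {
            return null;
        }
    }

    // Releases the lock by deleting the lock file.
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoints");
        try (CheckpointLock lock = CheckpointLock.tryAcquire(dir)) {
            System.out.println("acquired: " + (lock != null));
            System.out.println("second attempt blocked: " + (tryAcquire(dir) == null));
        }
    }
}
```

A job that crashes while holding such a lock leaves the file behind, so a real implementation would also need some way to detect and clear stale locks.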

Assembly Error

Hi,
it seems you have an error in your pom.xml: all the gridport elements use the .co extension instead of .com.
After fixing that and running mvn assembly:single, I still get an error:
Downloaded: http://maven.gridport.com/content/repositories/snapshots/commons-cli/commons-cli/1.2/commons-cli-1.2.jar (51 B at 0.0 KB/sec)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 44.610s
[INFO] Finished at: Tue Jan 27 16:56:19 IST 2015
[INFO] Final Memory: 10M/724M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project kafka-hadoop-loader: Failed to create assembly: Error creating assembly archive jar-with-dependencies: error in opening zip file -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Did I do anything wrong?

How do I know that the message has been sent to hdfs?

I use the command:
hadoop jar kafka-hadoop-loader.jar -z hadoop1:2181,hadoop2:2181 testFile1 /input

Then I use the hdfs command to check the files: hdfs dfs -ls /input. The result is:

Found 4 items
drwxr-xr-x - root supergroup 0 2016-09-06 15:07 /input/'testFile
drwxr-xr-x - root supergroup 0 2016-09-06 15:14 /input/'testFile1
drwxr-xr-x - root supergroup 0 2016-09-06 15:31 /input/_OFFSETS
-rw-r--r-- 1 root supergroup 0 2016-09-06 15:31 /input/_SUCCESS

But I can't open the folders whose names contain a single quote.

java.nio.BufferUnderflowException

17/07/26 15:20:58 INFO mapreduce.Job main: Task Id : attempt_1496398596009_5376987_m_000005_1, Status : FAILED
Error: java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:145)
at java.nio.ByteBuffer.get(ByteBuffer.java:694)
at kafka.api.ApiUtils$.readShortString(ApiUtils.scala:40)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:95)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:168)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:168)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at kafka.javaapi.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:47)
at io.amient.kafka.hadoop.io.KafkaInputRecordReader.nextKeyValue(KafkaInputRecordReader.java:154)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I ran the job and it gave me this error.

@Test for exactly-once guarantee

  1. produce a batch of messages
  2. run the job successfully
  3. produce another batch of messages
  4. run the job and crash it after it has consumed some messages
  5. re-run the job successfully
  6. check that all messages from both batches are on HDFS exactly once
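
The final check in the steps above could be sketched roughly as follows. This is a minimal illustration, assuming each produced message carries a unique id and that the ids found on HDFS have already been read into a list. The class and method names are hypothetical and not part of the project.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for step 6: every produced id must appear exactly once.
final class ExactlyOnceCheck {

    // Returns id -> observed count for every id that violates exactly-once:
    // missing (0), duplicated (>1), or loaded without having been produced.
    static Map<String, Integer> violations(Set<String> produced, List<String> loaded) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String id : produced) {
            counts.put(id, 0);
        }
        for (String id : loaded) {
            counts.merge(id, 1, Integer::sum);
        }
        Map<String, Integer> bad = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            boolean expected = produced.contains(e.getKey());
            if (!expected || e.getValue() != 1) {
                bad.put(e.getKey(), e.getValue());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Set<String> produced = new LinkedHashSet<>(Arrays.asList("m1", "m2", "m3"));
        List<String> loaded = Arrays.asList("m1", "m2", "m2");
        // m2 was loaded twice and m3 is missing, so both are reported.
        System.out.println(violations(produced, loaded));
    }
}
```

An empty result means the run satisfied the exactly-once expectation for the given ids.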

Generic configuration for the underlying consumers

At the moment the consumer can be configured only for these settings:

CONFIG_KAFKA_MESSAGE_MAX_BYTES = "kafka.fetch.message.max.bytes";
CONFIG_KAFKA_SOCKET_TIMEOUT_MS = "kafka.socket.timeout.ms";
CONFIG_KAFKA_RECEIVE_BUFFER_BYTES = "kafka.socket.receive.buffer.bytes";

But it should simply pass any kafka.consumer.* setting through to the underlying consumer.
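
The pass-through requested here could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: a plain `Map` stands in for the Hadoop job configuration, the class name is hypothetical, and it assumes that stripping the `kafka.consumer.` prefix yields the property name the underlying consumer expects.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch: forward every kafka.consumer.* job setting to the consumer.
final class ConsumerConfigPassThrough {
    // Prefix taken from the wording of the issue.
    static final String PREFIX = "kafka.consumer.";

    // Copies each prefixed entry into consumer properties, with the prefix removed.
    static Properties extract(Map<String, String> jobConf) {
        Properties consumerProps = new Properties();
        for (Map.Entry<String, String> e : jobConf.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                consumerProps.setProperty(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return consumerProps;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new TreeMap<>();
        jobConf.put("kafka.consumer.socket.timeout.ms", "30000");
        jobConf.put("kafka.consumer.fetch.message.max.bytes", "1048576");
        jobConf.put("mapreduce.job.name", "kafka-hadoop-loader"); // not forwarded
        Properties props = extract(jobConf);
        System.out.println(props.getProperty("socket.timeout.ms"));
        System.out.println(props.getProperty("fetch.message.max.bytes"));
    }
}
```

With this approach no new constant is needed per setting; any key under the prefix reaches the consumer unchanged apart from the stripped prefix.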

compile failing

mvn assembly:single

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project kafka-hadoop-loader: Failed to create assembly: Failed to resolve dependencies for project: co.gridport.kafka:kafka-hadoop-loader:jar:1.2.2-SNAPSHOT: Unable to get dependency information for org.apache.kafka:kafka-core:jar:0.7.3-1: Failed to retrieve POM for org.apache.kafka:kafka-core:jar:0.7.3-1: Failure to transfer org.apache.kafka:kafka-core:pom:0.7.3-1 from http://maven.gridport.co/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of gridport-snapshots has elapsed or updates are forced. Original error: Could not transfer artifact org.apache.kafka:kafka-core:pom:0.7.3-1 from/to gridport-snapshots (http://maven.gridport.co/content/repositories/snapshots): maven.gridport.co: Name or service not known
