mvallebr / csvinputformat

Input format for hadoop able to read multiline CSVs

License: Apache License 2.0

Languages: Java 99.69%, Dockerfile 0.31%

csvinputformat's Issues

Question about records spanning splits

Hello! I've been looking through the code for this project, and I do see how CSVLineRecordReader reads multiple lines if necessary, but I'm having trouble seeing how you handle the case where a record lies across a split boundary. The line reader stuff in Hadoop itself seems to have some handling where it will read a little beyond the end point that it is told in order to read a full line, and will skip the first line at the start point that it is told so as not to double-read the line that was read as part of the previous split. I don't see anything like that in this project. Am I missing something? If so, could you point me to how that is handled/implemented?

Thanks!
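
For reference, the line-reader convention described in this question can be sketched without any Hadoop dependency. The class below is a simplified, hypothetical model: SplitBoundaryDemo, readSplit and skipLine are illustrative names, not code from this project or from Hadoop. It assumes plain newline-terminated records and byte-range splits. A reader that does not start at offset 0 discards its first (possibly partial) line, and every reader keeps going while its position is at or before the end of its range, so a line that straddles a boundary is read once, by the earlier split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    /** Returns the index just past the next '\n', or data.length if there is none. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Returns the lines owned by the split covering bytes [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the line in progress at 'start' belongs to
            // the previous split, so discard everything up to the next newline.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // '<=' lets this split also read a line that begins exactly at 'end';
        // the next split will discard that same line as its "first line".
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nb,2\nc,3\nd,4\n".getBytes(StandardCharsets.UTF_8);
        int[][] splits = {{0, 6}, {6, 12}, {12, 16}};
        for (int[] s : splits) {
            System.out.println("[" + s[0] + "," + s[1] + ") -> "
                    + readSplit(data, s[0], s[1]));
        }
    }
}
```

With the three splits in main, the first split yields two lines, the second yields two, and the third yields none, so each of the four lines is read exactly once. This model is line-oriented: it cannot tell whether a newline sits inside a quoted CSV field, which is the extra difficulty a multiline CSV reader has to deal with at split boundaries.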

Sending parameters to mapper

Hi,

I am new to Big Data and MapReduce. I want to send a columnIndex to the mapper so that it prints the value of that particular column in each row. Where can I set this additional parameter for the mapper? Please respond.

org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated

This is my error:
compile:
[javac] /Users/colmac/Java/CSVInputFormat-master/build.xml:20: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 8 source files to /Users/colmac/Java/CSVInputFormat-master/build
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:23: org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
[javac] TaskAttemptContext context = new TaskAttemptContext(conf, new TaskAttemptID());
[javac] ^
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:26: org.apache.hadoop.mapreduce.JobContext is abstract; cannot be instantiated
[javac] List actualSplits = inputFormat.getSplits(new JobContext(conf, new JobID()));
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 2 errors

I did a bit of searching and found that TaskAttemptContext has become an interface in Hadoop 2.x, so the code needs to be reworked.

Is this something you can fix?

Bug in CSVNLineInputFormat.getSplitsForFile?

Line 125 and Line 127 are the same so I was wondering why there's an if-else?

issue with getSplits?

Hi, I am using this in Spark to read hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number such that the file does not get split at all. However, if I do set it to something smaller, it seems to find a larger number of records than there actually are. Have you had any issues with this?
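
One way to narrow this down is to compare the count seen in Spark against an independent count of logical records in the same file. The sketch below is a hypothetical helper (CsvRecordCounter, countRecords and countLines are illustrative names, not part of this project) and it assumes double-quoted fields with "" as the escaped quote. It is not a diagnosis of the behaviour reported here; it only shows how physical lines and logical records can differ when a quoted field contains a newline.

```java
public class CsvRecordCounter {

    /** Counts logical records: a newline inside a quoted field does not end a record. */
    static int countRecords(String csv) {
        int records = 0;
        boolean inQuotes = false;
        boolean sawContent = false; // true once the current record has any character
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
                sawContent = true;
            } else if (c == '\n' && !inQuotes) {
                records++; // note: an empty line also counts as a record here
                sawContent = false;
            } else {
                sawContent = true;
            }
        }
        if (sawContent) {
            records++; // last record without a trailing newline
        }
        return records;
    }

    /** Counts physical lines, ignoring quoting entirely. */
    static int countLines(String csv) {
        int lines = 0;
        for (int i = 0; i < csv.length(); i++) {
            if (csv.charAt(i) == '\n') {
                lines++;
            }
        }
        if (!csv.isEmpty() && csv.charAt(csv.length() - 1) != '\n') {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        System.out.println("physical lines:  " + countLines(csv));
        System.out.println("logical records: " + countRecords(csv));
    }
}
```

If the count reported with a small LINES_PER_MAP is closer to the physical line count, or grows with the number of splits, that would point at records being cut or re-read at split boundaries.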

Can you publish this with an Apache 2.0 License?

I'm currently working on a project where CSVInputFormat would come in handy, but we can only include external code that is properly licensed. Are you willing to republish your repository under the Apache 2.0 License? I'm suggesting this one because it is commonly used in the Apache Hadoop ecosystem.

I can create a pull request for this if you prefer.

Thank you

Error when using a zipped CSV and have multiple splits.

I got the error below when processing a zipped CSV file and forcing multiple splits by setting a small LINES_PER_MAP.
The error occurs on the last chunk.

java.lang.Exception: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1459)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at com.pointcross.CSVLineRecordReader.initialize(CSVLineRecordReader.java:213)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
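
A possible explanation, offered as an assumption rather than a confirmed diagnosis: if split start offsets are counted in decompressed bytes but the seek is performed on the compressed file, the offset of a late chunk can lie beyond the end of the file on disk. The sketch below uses only the JDK and assumes gzip (the issue only says "zipped"); CompressedOffsetDemo and its methods are hypothetical names, not code from this project.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressedOffsetDemo {

    static final String ROW = "id,name,value\n"; // 14 bytes per row

    /** Builds a highly repetitive CSV with the given number of rows. */
    static byte[] sampleCsv(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append(ROW);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses the bytes with gzip and returns the compressed form. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Start offset of a chunk, measured in decompressed bytes. */
    static long chunkStart(int firstRowOfChunk) {
        return (long) firstRowOfChunk * ROW.length();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleCsv(1000);
        byte[] compressed = gzip(raw);
        long lastChunk = chunkStart(900); // last chunk when using 100 rows per chunk
        System.out.println("decompressed size: " + raw.length);
        System.out.println("compressed size:   " + compressed.length);
        System.out.println("last chunk offset: " + lastChunk);
        System.out.println("offset past end of compressed file: "
                + (lastChunk > compressed.length));
    }
}
```

Hadoop's own TextInputFormat avoids this class of problem by reporting a file as splittable only when it is uncompressed or its codec implements SplittableCompressionCodec, so a gzip file is processed as a single split.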

LICENSE about this library

@mvallebr

Hi mvallebr
Could you tell me the license for this library?
Thank you!
