mvallebr / csvinputformat
Input format for Hadoop able to read multiline CSVs
License: Apache License 2.0
Hello! I've been looking through the code for this project, and I do see how CSVLineRecordReader reads multiple lines if necessary, but I'm having trouble seeing how you handle the case where a record lies across a split boundary. The line reader in Hadoop itself seems to have some handling where it will read a little beyond the end point that it is told in order to read a full line, and will skip the first line at the start point that it is told so as not to double-read the line that was read as part of the previous split. I don't see anything like that in this project. Am I missing something? If so, could you point me to how that is handled/implemented?
Thanks!
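For reference, the convention described in the question above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the general skip-first-line / read-past-end idea for newline-delimited records, not this project's code; the class and method names (`SplitBoundarySketch`, `readSplit`, `skipLine`) are hypothetical. It also assumes that every newline ends a record, which does not hold for multiline CSV where a newline may sit inside a quoted field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: how a line reader can own exactly the lines that
// start inside (or right at the end of) its split, with no double reads.
class SplitBoundarySketch {

    /** Returns the lines owned by the split covering bytes [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the line we land in (or exactly on) is
            // read by the previous split, so discard through its newline.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // "<=" means a line starting exactly at 'end' is read here, which
        // matches the unconditional skip performed by the next split.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            boolean hasNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = hasNewline ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next; // may run past 'end' to finish the last line
        }
        return lines;
    }

    /** Returns the offset just after the next '\n' at or after pos. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        // Cut in the middle of "bb": the first split finishes the line,
        // the second split skips the partial line it starts in.
        System.out.println(readSplit(data, 0, 3));
        System.out.println(readSplit(data, 3, data.length));
    }
}
```

For a multiline CSV reader the same idea would need a quote-aware way to find the first true record start after `start`, which is the harder part of the problem raised here.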
Hi,
I am new to Big Data and MapReduce. I want to pass a columnIndex to the mapper so that it prints that particular column's value for each row. Where can I set this additional parameter for the mapper? Please respond.
This is my error:
compile:
[javac] /Users/colmac/Java/CSVInputFormat-master/build.xml:20: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 8 source files to /Users/colmac/Java/CSVInputFormat-master/build
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:23: org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
[javac] TaskAttemptContext context = new TaskAttemptContext(conf, new TaskAttemptID());
[javac] ^
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:26: org.apache.hadoop.mapreduce.JobContext is abstract; cannot be instantiated
[javac] List actualSplits = inputFormat.getSplits(new JobContext(conf, new JobID()));
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 2 errors
I did a bit of searching and found this:
it has become an interface in Hadoop 2.x so the code needs to be reworked
Is this something you can fix?
Hi, I am using this in Spark to read hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number such that the file does not get split at all. However, if I do set it to something smaller, it seems to find a larger number of records than there actually are. Have you had any issues with this?
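One possible explanation, which nothing in this thread confirms, is that a split boundary chosen by counting physical lines can fall inside a quoted multiline field, so that fragments of one record are seen as separate records. The sketch below only illustrates how a physical line count and a quote-aware record count diverge on such data; it does not use this project's classes, and `RecordCountSketch` and its methods are hypothetical names.

```java
// Hypothetical sketch: physical lines vs. logical CSV records.
class RecordCountSketch {

    /** Counts newline-delimited lines, ignoring CSV quoting entirely. */
    static int countPhysicalLines(String text) {
        int lines = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lines++;
            }
        }
        // A final line without a trailing newline still counts.
        if (!text.isEmpty() && text.charAt(text.length() - 1) != '\n') {
            lines++;
        }
        return lines;
    }

    /** Counts records, treating newlines inside double quotes as data. */
    static int countCsvRecords(String text) {
        boolean inQuotes = false;
        boolean pending = false; // characters seen since the last record end
        int records = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records++;
                pending = false;
            } else {
                pending = true;
            }
        }
        return pending ? records + 1 : records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain\n";
        System.out.println("physical lines: " + countPhysicalLines(csv));
        System.out.println("csv records:    " + countCsvRecords(csv));
    }
}
```

Comparing the two counts on a sample of the input is a quick way to check whether multiline fields are present at all before looking at the split logic.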
I'm currently working on a project where CSVInputFormat would come in handy, but we can only include external code that is properly licensed. Are you willing to republish your repository under the Apache 2.0 License? I'm suggesting this one as it is commonly used in the Apache/Hadoop ecosystem.
I can create a pull request for this if you prefer.
Thank you
I got the error below when processing a zipped CSV file and forcing multiple splits by setting a small LINES_PER_MAP.
The error occurs for the last chunk.
java.lang.Exception: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1459)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at com.pointcross.CSVLineRecordReader.initialize(CSVLineRecordReader.java:213)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Hi mvallebr
Could you tell me the license for this library?
Thank you!