
csvinputformat's People

Contributors

amalrik, chaos-generator, davideeva, dependabot[bot], mvallebr, psuryawanshi, sergey-podunov, trsudarshan, tsukaby, zachradtka

csvinputformat's Issues

Question about records spanning splits

Hello! I've been looking through the code for this project. I can see how CSVLineRecordReader reads multiple lines when a record needs it, but I can't see how it handles a record that lies across a split boundary. The line reader in Hadoop itself seems to handle this in two ways: it reads a little beyond the end point it is given in order to finish a full line, and it skips the first line at the start point it is given so that it does not double-read the line already read as part of the previous split. I don't see anything like that in this project. Am I missing something? If so, could you point me to where that is handled?

Thanks!
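
For reference, the convention described above can be sketched without any Hadoop dependency. This is a simplified model only, not Hadoop's actual code and not this project's: the class `SplitLineReaderSketch` and its methods are hypothetical names, the file is modeled as an in-memory byte array, and lines are assumed to end in `\n`. It also ignores quoted CSV fields that contain newlines, which is exactly the case where a split can start in the middle of a record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "skip first line, read past the end" convention.
class SplitLineReaderSketch {

    /** Returns the lines owned by the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        // Every split except the first discards the line it lands in (or on);
        // the previous split is responsible for reading that line.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it begins at or before `end`;
        // reading is allowed to run past `end` to finish that line.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            // Exclude the trailing newline, if the line has one.
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the next '\n' at or after `from`, or data.length if there is none. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }
}
```

Calling `readSplit` once per split and concatenating the results should yield every line exactly once, even when the split boundaries fall in the middle of a line. A reader that does neither the skip nor the read-ahead would be expected to drop or duplicate records at the boundaries.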

Sending parameters to mapper

Hi,

I am new to Big Data and MapReduce. I want to pass a columnIndex to the mapper so that it prints the value of that particular column in each row. Where can I set this additional parameter for the mapper? Please respond.

org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated

This is my error:
compile:
[javac] /Users/colmac/Java/CSVInputFormat-master/build.xml:20: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 8 source files to /Users/colmac/Java/CSVInputFormat-master/build
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:23: org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
[javac] TaskAttemptContext context = new TaskAttemptContext(conf, new TaskAttemptID());
[javac] ^
[javac] /Users/colmac/Java/CSVInputFormat-master/src/test/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormatTest.java:26: org.apache.hadoop.mapreduce.JobContext is abstract; cannot be instantiated
[javac] List actualSplits = inputFormat.getSplits(new JobContext(conf, new JobID()));
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 2 errors

I did a bit of searching and found this:
it has become an interface in Hadoop 2.x so the code needs to be reworked

Is this something you can fix?

issue with getSplits?

Hi, I am using this in Spark to read Hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number, so that the file does not get split at all. However, if I set it to something smaller, it seems to find more records than there actually are. Have you had any issues with this?

Can you publish this with an Apache 2.0 License?

I'm currently working on a project where CSVInputFormat would come in handy, but we can only include external code that is properly licensed. Are you willing to republish your repository under the Apache 2.0 License? I'm suggesting this one because it is commonly used in the Apache Hadoop ecosystem.

I can create a pull request for this if you prefer.

Thank you

Error when using a zipped CSV with multiple splits

I got the error below when processing a zipped CSV file and forcing multiple splits by setting a small LINES_PER_MAP.
The error occurs on the last chunk.

java.lang.Exception: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1459)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at com.pointcross.CSVLineRecordReader.initialize(CSVLineRecordReader.java:213)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
