Comments (5)
Sorry for the delay in answering @snabar. I believe I had a problem like this in the beginning, but it should be working in the master version... When I get some time I will check it, and if it really isn't working I will provide a fix.
from csvinputformat.
Hi, the code works perfectly fine, but my only trouble is that my data files, which are CSV, can range from 10 GB to 200 GB. How do I decide LINES_PER_MAP so that one mapper does not get overloaded and the work is distributed properly? I was thinking of doing some calculations to derive LINES_PER_MAP (I know the input data size and the column data types), but I am still not confident that would solve it. Any thoughts on how to decide LINES_PER_MAP?
LINES_PER_MAP determines the size of a mapper task. If you have a CSV with 10,000 lines and LINES_PER_MAP=1000, you will get 10000/1000 = 10 map tasks, to be distributed to the mapper workers across your cluster.
If your file is 10 GB+, the number of lines is probably huge, so a value like 10k or 100k seems reasonable to me, but I would decide this after experimenting with different values on the real cluster.
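The split arithmetic above can be sketched like this (the 10 GB file size and the ~100-byte average line length are hypothetical numbers for illustration, not measurements from any real file):

```java
public class SplitMath {
    // Number of map tasks produced when every split holds linesPerMap lines
    // (ceiling division, so a final partial split still gets its own task).
    static long numMapTasks(long totalLines, long linesPerMap) {
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        // The example from the comment: 10,000 lines, LINES_PER_MAP=1000.
        System.out.println(numMapTasks(10_000, 1_000)); // 10 map tasks

        // A hypothetical 10 GB file with ~100-byte lines has ~107M lines;
        // LINES_PER_MAP=100k then yields on the order of a thousand tasks.
        long bigFileLines = 10L * 1024 * 1024 * 1024 / 100;
        System.out.println(numMapTasks(bigFileLines, 100_000));
    }
}
```

The ceiling division matters at the edges: 10,001 lines with LINES_PER_MAP=1000 produces 11 tasks, the last one holding a single line.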
Thanks for confirming the code works fine.
Thank you for the prompt reply, I really appreciate it. I can run a few tests and come up with numbers, but do you think there could be smarter ways?
Two concerns:
1. Consider a file with 10 records. The number of lines could be 10, or it could be 15 if, say, one record is big and spans 6 lines when opened in a vim editor. In most cases people know the number of records, not the number of lines, so even if I know the number of input records I might not be able to cleanly estimate the number of lines.
2. Another way to figure it out: take a standard file of 100 GB with 1000 columns as the worst-case scenario, run it through a simple map-based job, and count how many records are processed in one split (assuming a 256 MB block size). Say that number comes out to 100k; then use that number for all files, since 256 MB is what each split should be filled with.
Any thoughts, or any other smarter way?
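The block-size approach in the second concern can be written down as a back-of-the-envelope estimate. This is only a sketch under stated assumptions: it presumes you can sample an average record size in bytes (e.g. from the test job described above), and the 256 MB target and 2,560-byte average record are made-up numbers, not values from this library:

```java
public class LinesPerMapEstimate {
    // Aim each split at roughly targetSplitBytes (e.g. the HDFS block size),
    // given the average bytes per record measured from a sample.
    static long estimateLinesPerMap(long targetSplitBytes, long avgBytesPerRecord) {
        return Math.max(1, targetSplitBytes / avgBytesPerRecord);
    }

    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024; // hypothetical 256 MB target split
        long avgRecord = 2_560;              // hypothetical sampled record size
        System.out.println(estimateLinesPerMap(blockSize, avgRecord));
    }
}
```

Note this counts records rather than physical lines, which sidesteps the first concern: if the input format counts logical CSV records (so a quoted record spanning several physical lines still counts as one), a record-based estimate applies directly; otherwise you would also need the average physical lines per record from the same sample.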