Comments (5)
Sorry for the delay in answering @snabar. I believe I had a problem like this in the beginning, but it should be working in the master version... When I get some time I will check it, and if it really isn't working I will provide a fix.
from csvinputformat.
Hi, the code works perfectly fine, but my only trouble is that my data files, which are CSV, can range from 10 GB to 200 GB. How do I decide LINES_PER_MAP so that one mapper does not get overloaded and the work is distributed properly? I was thinking of doing some calculations to derive LINES_PER_MAP (I know the input data size and the column data types), but I am still not confident that would solve it. Any thoughts on how to decide LINES_PER_MAP?
LINES_PER_MAP determines the size of a mapper task. If you have a CSV with 10,000 lines and LINES_PER_MAP=1000, you will get 10000/1000 = 10 map tasks, to be distributed to the mapper workers across your cluster.
If your file is 10 GB+, the number of lines is probably huge, so a value like 10k or 100k seems reasonable to me, but I would decide this after experimenting with different values on the real cluster.
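The split arithmetic above can be sketched like this (the 10 GB file size and the ~100-byte average line length are hypothetical numbers for illustration, not measurements from any real file):

```java
public class SplitMath {
    // Number of map tasks produced when every split holds linesPerMap lines
    // (ceiling division, so a final partial split still gets its own task).
    static long numMapTasks(long totalLines, long linesPerMap) {
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        // The example from the comment: 10,000 lines, LINES_PER_MAP=1000.
        System.out.println(numMapTasks(10_000, 1_000)); // 10 map tasks

        // A hypothetical 10 GB file with ~100-byte lines has ~107M lines;
        // LINES_PER_MAP=100k then yields on the order of a thousand tasks.
        long bigFileLines = 10L * 1024 * 1024 * 1024 / 100;
        System.out.println(numMapTasks(bigFileLines, 100_000));
    }
}
```

The ceiling division matters at the edges: 10,001 lines with LINES_PER_MAP=1000 produces 11 tasks, the last one holding a single line.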
Thanks for confirming the code works fine.
Thank you for the prompt reply, I really appreciate it. I can run a few tests and come up with numbers, but do you think there could be smarter ways?
Two concerns:
1. Consider a file with 10 records. The number of lines could be 10, or it could be 15 if, say, one record is big and spans 6 lines when opened in a vim editor. In most cases people know the number of records, not the number of lines, so even if I know the number of input records I might not be able to cleanly estimate the number of lines.
2. Another way to figure it out: take a standard file of 100 GB with 1000 columns as the worst-case scenario, run it through a simple map-based job, and count how many records are processed in one split (assuming a 256 MB block size). Say that number comes out to 100k; then use that number for all files, since 256 MB is what each split should be filled with.
Any thoughts, or any other smarter way?
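The block-size approach in the second concern can be written down as a back-of-the-envelope estimate. This is only a sketch under stated assumptions: it presumes you can sample an average record size in bytes (e.g. from the test job described above), and the 256 MB target and 2,560-byte average record are made-up numbers, not values from this library:

```java
public class LinesPerMapEstimate {
    // Aim each split at roughly targetSplitBytes (e.g. the HDFS block size),
    // given the average bytes per record measured from a sample.
    static long estimateLinesPerMap(long targetSplitBytes, long avgBytesPerRecord) {
        return Math.max(1, targetSplitBytes / avgBytesPerRecord);
    }

    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024; // hypothetical 256 MB target split
        long avgRecord = 2_560;              // hypothetical sampled record size
        System.out.println(estimateLinesPerMap(blockSize, avgRecord));
    }
}
```

Note this counts records rather than physical lines, which sidesteps the first concern: if the input format counts logical CSV records (so a quoted record spanning several physical lines still counts as one), a record-based estimate applies directly; otherwise you would also need the average physical lines per record from the same sample.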