Comments (5)

AngusDavis commented on May 21, 2024

In general, we can't retry uploads because we don't keep a copy of the stream contents locally. In this case the write appears to happen in a mapper, which can hopefully be retried.

From the stack trace, it looks like two or more processes are writing to the same GCS object. When the stream is finalized, there's a check to ensure that the object hasn't been changed since we began writing; it also guards against the case where one finalize RPC returns a 50X error and a retried RPC then returns 200 OK (see https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java#L302 for details on that).
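
To make that precondition concrete, here is a minimal standalone sketch of the same "only create if nothing exists yet" idea. It is not the connector's code; it uses the newer google-cloud-storage client, and the bucket and object names are made up:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

public class PreconditionSketch {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId id = BlobId.of("some-bucket", "some/object");  // hypothetical names
    try {
      // doesNotExist() sends ifGenerationMatch=0, so the create only succeeds
      // if no live generation of the object exists yet.
      storage.create(BlobInfo.newBuilder(id).build(), "payload".getBytes(),
          Storage.BlobTargetOption.doesNotExist());
    } catch (StorageException e) {
      if (e.getCode() == 412) {
        // Precondition failed: another writer (or our own earlier, seemingly
        // failed attempt) already finalized this object.
        Blob existing = storage.get(id);
        System.out.println("Already written, generation " + existing.getGeneration());
      } else {
        throw e;
      }
    }
  }
}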

Is it expected that two or more processes are writing to the same object?

Dima1224 commented on May 21, 2024

No, it was just one process writing. I see these types of errors (although usually 404 or 500) when using multiple outputs via the GCS connector. In this particular case I was writing to HDFS via Context and to GCS via MultipleOutputs.

Hadoop attempted to retry, but wasn't able to recover due to a bunch of errors like:

java.io.IOException: Object gs://spins/home/dmikhels/sld_load/exceptions_out/part-m-20185 already exists.
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:383)
        at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:98)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:254)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:238)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:896)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:411)
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
        at com.spins.hdp.sld.FilterSld$M.reportRuleFailure(FilterSld.java:139)
        at com.spins.hdp.sld.FilterSld$M.runRules(FilterSld.java:128)
        at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:88)
        at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

AngusDavis commented on May 21, 2024

In the case of a single writer, the most likely explanation is that a first request to finalize the write failed (50x or timeout), we retried, and the second attempt failed because, unbeknownst to us, the first write had actually succeeded at some point. The fact that a second map attempt then fails with "already exists" seems to support this.

It is absolutely a hack, but here's a possible way to work around the subsequent map failures (assuming you're OK with overwriting existing files):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class OverwritingTextFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    // Delete the object left behind by the failed attempt before the parent
    // class tries to create it again.
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (fs.exists(file)) {
      fs.delete(file, false);
    }
    return super.getRecordWriter(job);
  }
}

and when setting up the MultipleOutputs:

MultipleOutputs.addNamedOutput(job, "text", OverwritingTextFormat.class, LongWritable.class, Text.class);
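
For completeness, the mapper side would stay the usual MultipleOutputs pattern. This is just a sketch mirroring the named output and types registered above; the class and field names are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ExampleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private MultipleOutputs<LongWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);      // normal job output (e.g. to HDFS)
    mos.write("text", key, value);  // the "text" named output registered above
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}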

A less hacky approach might be to write the multiple output files to HDFS as well and then distcp to GCS after the job completes. It'll obviously take longer, but might be worthwhile if you're uncomfortable with overwriting files.

While we can't retry, we might be able to recover. A possible approach would be to add our own nonce to the object metadata with each object write. If we encounter a 412, we can fetch the current version of the object; if we find the nonce associated with the failed write, we know that the write actually succeeded. I'm not yet certain where we'd hook this into the stack, but we can probably figure something out.
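
Very roughly, the recovery check might look something like the sketch below. This is purely illustrative, not the connector's code; the metadata key, the method name, and the use of the google-cloud-storage client are all assumptions:

import java.util.Collections;
import java.util.Map;
import java.util.UUID;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;

public class NonceRecoverySketch {
  // Hypothetical metadata key; not something the connector actually writes.
  private static final String NONCE_KEY = "write-nonce";

  static void createWithRecovery(Storage storage, BlobId id, byte[] payload) {
    String nonce = UUID.randomUUID().toString();
    BlobInfo info = BlobInfo.newBuilder(id)
        .setMetadata(Collections.singletonMap(NONCE_KEY, nonce))
        .build();
    try {
      // Only create if no live generation exists yet (ifGenerationMatch=0).
      storage.create(info, payload, Storage.BlobTargetOption.doesNotExist());
    } catch (StorageException e) {
      if (e.getCode() != 412) {
        throw e;
      }
      // Precondition failed: check whether the existing object carries our
      // nonce, i.e. whether our earlier "failed" finalize actually went through.
      Blob existing = storage.get(id);
      Map<String, String> metadata = existing == null ? null : existing.getMetadata();
      if (metadata != null && nonce.equals(metadata.get(NONCE_KEY))) {
        return;  // Our own write succeeded after all; treat as success.
      }
      throw e;  // Someone else wrote the object; surface the error.
    }
  }
}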

@dennishuo - Thoughts on the rough recovery proposal?

*updated to fix getRecordWriter in the workaround

Dima1224 commented on May 21, 2024

Thanks for the suggestion, that makes sense. I'll try it out if the problem comes up again. Looking forward to a fix on your side when you can get to it.

Thanks!

Shasidhar commented on May 21, 2024

@AngusDavis Do we have a fix for this, or do we still have to go through the workaround mentioned above?
Also, does GCS support parallel writes now?
