Comments (5)
In general, we can't retry uploads because we don't keep a local copy of the stream contents. In this case, the write appears to happen in a mapper, which can hopefully be retried.
From the stack trace, it looks like what's happening is that there are two or more processes writing to the same GCS object. When the stream is finalized, there's a check to ensure that the object hasn't been changed since we began writing (and to prevent errors in case one finalize RPC returns a 50X error followed by a second RPC that returns 200 OK - see https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java#L302 for details on that).
Is it expected that two or more processes are writing to the same object?
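The generation-match check described above can be sketched with a small in-memory stand-in. This is purely illustrative (the class and method names here are made up, not the connector's actual code); an AtomicLong plays the role of a GCS object's generation number, and finalize uses compare-and-set semantics the way an `ifGenerationMatch` precondition would:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the generation-match check, with an AtomicLong standing
// in for a GCS object's generation number. Names are illustrative only.
public class GenerationCheckSketch {
  static final AtomicLong generation = new AtomicLong(0); // 0 = object absent

  // Finalize succeeds only if the object's generation still matches what the
  // writer observed when it opened the stream (compare-and-set semantics).
  static boolean finalizeIfUnchanged(long expectedGeneration, long newGeneration) {
    return generation.compareAndSet(expectedGeneration, newGeneration);
  }

  public static void main(String[] args) {
    long observed = generation.get();              // writer A opens the stream
    generation.set(7);                             // writer B finalizes first
    boolean ok = finalizeIfUnchanged(observed, 8); // writer A's finalize now fails
    System.out.println(ok ? "finalized" : "precondition failed (412)");
  }
}
```

With two writers, whichever finalizes second sees a generation it didn't expect, which is what surfaces as the 412-style failure.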
from hadoop-connectors.
No, it was just one process writing. I see these types of errors (although usually 404 or 500) when using multiple outputs via the GCS connector. In this particular case I was writing to HDFS via Context and to GCS via MultipleOutputs.
Hadoop attempted to retry, but wasn't able to recover due to a series of errors like:
java.io.IOException: Object gs://spins/home/dmikhels/sld_load/exceptions_out/part-m-20185 already exists.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:383)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:98)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:254)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:238)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:896)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:411)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
at com.spins.hdp.sld.FilterSld$M.reportRuleFailure(FilterSld.java:139)
at com.spins.hdp.sld.FilterSld$M.runRules(FilterSld.java:128)
at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:88)
at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In the case of a single writer, it's most likely due to a first request to finalize the write failing (50x or timeout) followed by a second attempt and the second attempt fails because unbeknownst to us, the first write actually succeeded at some point. The fact that a second map attempt then fails seems to support this.
It is absolutely a hack, but here's a possible way to work around the subsequent map failures (assuming that you're OK with overwriting existing files):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class OverwritingTextFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    // Delete any partially-written file left by a previous failed attempt so
    // the parent getRecordWriter can recreate it without the "already exists"
    // error.
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (fs.exists(file)) {
      fs.delete(file, /* recursive= */ false);
    }
    return super.getRecordWriter(job);
  }
}
and when setting up the MultipleOutputs:
MultipleOutputs.addNamedOutput(job, "text", OverwritingTextFormat.class, LongWritable.class, Text.class);
A less hacky approach might be to write the multiple output files to HDFS as well and then distcp to GCS after the job completes. It'll obviously take longer, but might be worthwhile if you're uncomfortable with overwriting files.
While we can't retry, we might be able to recover. A possible approach would be adding our own nonce to object metadata with each object write. If we encounter a 412, we can fetch the current version of the object and if we find the nonce associated with the failed write then we know that the write succeeded. I'm not yet certain where we'd hook this into the stack, but we can probably figure something out.
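Roughly, the recovery idea above could look like the following self-contained sketch. A HashMap stands in for GCS object metadata, and all names here ("write-nonce", the helper methods) are assumptions for illustration, not the connector's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of nonce-based recovery: tag each write with a nonce in object
// metadata; on a failed finalize, re-read the live object and check whether
// it carries our nonce. An in-memory map simulates the object store.
public class NonceRecoverySketch {
  // Simulated object store: object name -> metadata map.
  static final Map<String, Map<String, String>> store = new HashMap<>();

  // Simulated finalize: the server applies the write, but the RPC response
  // may be lost (mirroring a 50x/timeout on a write that actually succeeded).
  static boolean finalizeWrite(String object, String nonce, boolean rpcFails) {
    store.put(object, Map.of("write-nonce", nonce)); // server applies the write
    return !rpcFails;                                // but the response is lost
  }

  // On a failed finalize, fetch the live object's metadata; if it carries our
  // nonce, the earlier attempt succeeded and the write can be treated as done.
  static boolean recover(String object, String nonce) {
    Map<String, String> meta = store.get(object);
    return meta != null && nonce.equals(meta.get("write-nonce"));
  }

  public static void main(String[] args) {
    String nonce = UUID.randomUUID().toString();
    boolean ok = finalizeWrite("gs://bucket/part-m-00000", nonce, /* rpcFails= */ true);
    if (!ok && recover("gs://bucket/part-m-00000", nonce)) {
      System.out.println("recovered"); // the "failed" write had actually landed
    }
  }
}
```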
@dennishuo - Thoughts on the rough recovery proposal?
*updated to fix workaround getRecordWriter
Thanks for the suggestion, that makes sense. I'll try it out if the problem comes up again. Looking forward to a fix from your side when you can get to it.
Thanks!
@AngusDavis Do we have a fix for this, or do we still have to go through the workaround mentioned above?
Also, does GCS support parallel writes now?