Comments (5)
In general, we can't retry uploads because we don't keep a local copy of the stream contents. In this case, the write appears to happen in a mapper, which can hopefully be retried.
From the stack trace, it looks like what's happening is that there are two or more processes writing to the same GCS object. When the stream is finalized, there's a check to ensure that the object hasn't been changed since we began writing (and to prevent errors in case one finalize RPC returns a 50X error followed by a second RPC that returns 200 OK - see https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java#L302 for details on that).
Is it expected that two or more processes are writing to the same object?
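The generation-match check described above can be sketched with a small in-memory stand-in. This is purely illustrative (the class and method names here are made up, not the connector's actual code); an AtomicLong plays the role of a GCS object's generation number, and finalize uses compare-and-set semantics the way an `ifGenerationMatch` precondition would:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the generation-match check, with an AtomicLong standing
// in for a GCS object's generation number. Names are illustrative only.
public class GenerationCheckSketch {
  static final AtomicLong generation = new AtomicLong(0); // 0 = object absent

  // Finalize succeeds only if the object's generation still matches what the
  // writer observed when it opened the stream (compare-and-set semantics).
  static boolean finalizeIfUnchanged(long expectedGeneration, long newGeneration) {
    return generation.compareAndSet(expectedGeneration, newGeneration);
  }

  public static void main(String[] args) {
    long observed = generation.get();              // writer A opens the stream
    generation.set(7);                             // writer B finalizes first
    boolean ok = finalizeIfUnchanged(observed, 8); // writer A's finalize now fails
    System.out.println(ok ? "finalized" : "precondition failed (412)");
  }
}
```

With two writers, whichever finalizes second sees a generation it didn't expect, which is what surfaces as the 412-style failure.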
from hadoop-connectors.
No, it was just one process writing. I see these types of errors (although usually 404 or 500) when using multiple outputs via the GCS connector. In this particular case I was writing to HDFS via Context and to GCS via MultipleOutputs.
Hadoop attempted to retry, but wasn't able to recover due to a series of errors like:
java.io.IOException: Object gs://spins/home/dmikhels/sld_load/exceptions_out/part-m-20185 already exists.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:383)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:98)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:254)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:238)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:896)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:411)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
at com.spins.hdp.sld.FilterSld$M.reportRuleFailure(FilterSld.java:139)
at com.spins.hdp.sld.FilterSld$M.runRules(FilterSld.java:128)
at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:88)
at com.spins.hdp.sld.FilterSld$M.map(FilterSld.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In the case of a single writer, it's most likely due to a first request to finalize the write failing (50x or timeout) followed by a second attempt and the second attempt fails because unbeknownst to us, the first write actually succeeded at some point. The fact that a second map attempt then fails seems to support this.
It is absolutely a hack, but here's a possible way to work around the subsequent map failures (assuming that you're OK with overwriting existing files):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class OverwritingTextFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    // Delete any partially-written file left by a previous failed attempt so
    // the parent getRecordWriter can recreate it without the "already exists"
    // error.
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (fs.exists(file)) {
      fs.delete(file, /* recursive= */ false);
    }
    return super.getRecordWriter(job);
  }
}
and when setting up the MultipleOutputs:
MultipleOutputs.addNamedOutput(job, "text", OverwritingTextFormat.class, LongWritable.class, Text.class);
A less hacky approach might be to write the multiple output files to HDFS as well and then distcp to GCS after the job completes. It'll obviously take longer, but might be worthwhile if you're uncomfortable with overwriting files.
While we can't retry, we might be able to recover. A possible approach would be adding our own nonce to object metadata with each object write. If we encounter a 412, we can fetch the current version of the object and if we find the nonce associated with the failed write then we know that the write succeeded. I'm not yet certain where we'd hook this into the stack, but we can probably figure something out.
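Roughly, the recovery idea above could look like the following self-contained sketch. A HashMap stands in for GCS object metadata, and all names here ("write-nonce", the helper methods) are assumptions for illustration, not the connector's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of nonce-based recovery: tag each write with a nonce in object
// metadata; on a failed finalize, re-read the live object and check whether
// it carries our nonce. An in-memory map simulates the object store.
public class NonceRecoverySketch {
  // Simulated object store: object name -> metadata map.
  static final Map<String, Map<String, String>> store = new HashMap<>();

  // Simulated finalize: the server applies the write, but the RPC response
  // may be lost (mirroring a 50x/timeout on a write that actually succeeded).
  static boolean finalizeWrite(String object, String nonce, boolean rpcFails) {
    store.put(object, Map.of("write-nonce", nonce)); // server applies the write
    return !rpcFails;                                // but the response is lost
  }

  // On a failed finalize, fetch the live object's metadata; if it carries our
  // nonce, the earlier attempt succeeded and the write can be treated as done.
  static boolean recover(String object, String nonce) {
    Map<String, String> meta = store.get(object);
    return meta != null && nonce.equals(meta.get("write-nonce"));
  }

  public static void main(String[] args) {
    String nonce = UUID.randomUUID().toString();
    boolean ok = finalizeWrite("gs://bucket/part-m-00000", nonce, /* rpcFails= */ true);
    if (!ok && recover("gs://bucket/part-m-00000", nonce)) {
      System.out.println("recovered"); // the "failed" write had actually landed
    }
  }
}
```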
@dennishuo - Thoughts on the rough recovery proposal?
*updated to fix workaround getRecordWriter
Thanks for the suggestion, that makes sense. I'll try it out if the problem comes up again. Looking forward to a fix from your side when you can get to it.
Thanks!
@AngusDavis Do we have a fix for this, or do we still have to go through the workaround mentioned above?
Also, does GCS support parallel writes now?