googleclouddataproc / hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.

License: Apache License 2.0

Java 99.83% Shell 0.14% Dockerfile 0.03%
hadoop-filesystem hadoop-hcfs google-cloud-dataproc bigquery hadoop

hadoop-connectors's Introduction

Apache Hadoop Connectors

Libraries and tools for interoperability between Apache Hadoop related open-source software and Google Cloud Platform.

Google Cloud Storage connector for Apache Hadoop (HCFS)

The Google Cloud Storage connector for Hadoop enables running MapReduce jobs directly on data in Google Cloud Storage by implementing the Hadoop FileSystem interface. For details, see the README.
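As a quick illustration of what the HCFS interface provides, the following sketch lists objects under a gs:// path through the standard Hadoop FileSystem API (assuming the connector is on the classpath and credentials are configured; the bucket and path are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// With the connector installed, gs:// paths resolve to GoogleHadoopFileSystem.
val path = new Path("gs://my-bucket/input/")
val fs: FileSystem = path.getFileSystem(conf)
fs.listStatus(path).foreach(status => println(status.getPath))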

Building the Cloud Storage connector

Note that the build requires Java 11+ and fails with older Java versions.

To build the connector for a specific Hadoop version, run the following command from the main directory:

./mvnw clean package

To verify test coverage for a specific Hadoop version, run the following command from the main directory:

./mvnw -P coverage clean verify

The Cloud Storage connector JAR can be found in the gcs/target/ directory.

Adding the Cloud Storage connector to your build

The Maven group ID is com.google.cloud.bigdataoss and the artifact ID for the Cloud Storage connector is gcs-connector.

To add a dependency on the Cloud Storage connector using Maven, use the following:

<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>3.0.0</version>
</dependency>
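If you build with sbt instead, the equivalent dependency (same group ID, artifact ID, and version as the Maven snippet above) should look roughly like this:

libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "3.0.0"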

Resources

On Stack Overflow, use the tag google-cloud-dataproc for questions about the connectors in this repository. This tag receives responses from the Stack Overflow community and Google engineers, who monitor the tag and offer unofficial support.

hadoop-connectors's People

Contributors

aalexx-s, abmodi, andrej-moravian, arunkumarchacko, brandony, cnauroth, cushon, cyxxy, dansedov, deependra-patel, dennishuo, graememorgan, guljain, hongyegong, laraschmidt, laurenhstephens, maen-allaga, mayanks, medb, mfschwartz, mgasiorowski, ml8, mooman219, mprashanthsagar, pmkc, shilpi23pandey, sidseth, singhravidutt, szewi, veblush

hadoop-connectors's Issues

Accessing GCS from Spark/Hadoop outside Google Cloud

My issue is superficially similar to #48, but seems separate so I'm filing here.

I'm interested in reading some gs:// URLs from a local Spark/Hadoop app.

I ran gcloud auth application-default login and got a key file.

Then I run spark-shell --jars my-assembly.jar which includes this library correctly on the classpath.

Then in the spark-shell, I set the hadoop configs detailed in INSTALL.md:

val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", "<MY_PROJECT>")  // actual project filled in
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile")  // actual keyfile filled in

import org.apache.hadoop.fs.Path
val path = new Path("gs://BUCKET/OBJECT")  // actual path filled in
val fs = path.getFileSystem(conf)

So far so good, but then actually accessing the object fails:

scala> fs.exists(path)
...
java.io.IOException: Error accessing: bucket: BUCKET, object: OBJECT
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1706)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1732)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1617)
  at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfo(ForwardingGoogleCloudStorage.java:214)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1093)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1413)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
  ... 48 elided
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
  "code" : 401,
  "errors" : [ {
    "domain" : "global",
    "location" : "Authorization",
    "locationType" : "header",
    "message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT.",
    "reason" : "required"
  } ],
  "message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT."
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
  at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1726)
  ... 53 more

I seemed to successfully oauth with my gcloud-linked google-account when I ran gcloud auth application-default login, per the docs; why am I accessing GCS as an anonymous user?

Support wildcard or TABLE_DATE_RANGE in sqlContext.bigQueryTable()

I want to load data from BigQuery with the Spark connector from Spotify (https://github.com/spotify/spark-bigquery) which is built on top of the bigdata-interop project.

It works totally fine when I load one table at a time, for example:
sqlContext.bigQueryTable("<dataset-id>.<some-table-name>")

But I want to load multiple tables at a time. I tried to use DATE_RANGE functions and table wildcards which are provided by BigQuery SQL dialect:
sqlContext.bigQueryTable("<dataset-id>.<some-table-name-with-wildcard>")

But this doesn't work and I get the following error:
Invalid datasetAndTableString '<dataset-id>.<some-table-name-with-wildcard>'; must match regex '[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+'

That's clear because the specification of a table reference says: "[...] The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_)[...]"

If I do a select with sqlContext.bigQuerySelect() I would have to manually unnest all the arrays contained in the source data (which are a lot). Otherwise I would get errors like Cannot output multiple independently repeated fields at the same time.

It would be much more convenient, and the code or SQL statement for loading data from multiple tables at a time would be much simpler, if you supported wildcard tables or TABLE_DATE_RANGE in the table specification validation.

Thank you in advance for a little statement on this issue.
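In the meantime, a possible workaround under the current validation, sketched against the sqlContext.bigQueryTable API quoted above (the dataset and the dated table names below are hypothetical), is to load each table individually and union the results:

// Load each dated table separately and union the resulting DataFrames.
val dates = Seq("20160101", "20160102", "20160103")
val combined = dates
  .map(d => sqlContext.bigQueryTable(s"my_dataset.events_$d"))
  .reduce((a, b) => a.unionAll(b))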

DynamicFileListRecordReader error

I'm experiencing an issue with proper reader termination.
DynamicFileListRecordReader expects a 0-record file to be present at the bucket path, but it looks like it's not created automatically, so the reader goes into an infinite loop.
According to this
https://cloud.google.com/bigquery/exporting-data-from-bigquery
the 0-record file is created only when using wildcard URLs. But according to the Hadoop log, a single URL with no wildcards is used, so that wait is unnecessary.

My config for the Hadoop job is:

        BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare");
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        // temporary path where to download data from BigQuery tables
        conf.set(BigQueryConfiguration.TEMP_GCS_PATH_KEY, "gs://mybucket/mypath");
        conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "myprojid");
        conf.set(GoogleHadoopFileSystemBase.GCS_PROJECT_ID_KEY, "myprojid");

So maybe remove that wait or make DynamicFileListRecordReader more flexible?

Space not allowed in file name

I have existing code that creates files with spaces in the name. It works in HDFS, and it also works when creating the file using gsutil, but it fails when using gcs-connector.

$ gsutil cp /tmp/cloud-dmp/.metadata_never_index "gs://cloud-dmp/a b c"
Copying file:///tmp/cloud-dmp/.metadata_never_index [Content-Type=application/octet-stream]...

$ hadoop fs -mkdir "gs://cloud-dmp/tmp/a b"
16/04/08 12:08:00 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
16/04/08 12:08:01 ERROR gcsio.GoogleCloudStorageFileSystem: Invalid bucket name (cloud-dmp) or object name (tmp/a b)
java.net.URISyntaxException: Illegal character in path at index 20: gs://cloud-dmp/tmp/a b
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parseHierarchical(URI.java:3086)
at java.net.URI$Parser.parse(URI.java:3034)
at java.net.URI.<init>(URI.java:595)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getPath(GoogleCloudStorageFileSystem.java:1567)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:173)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1167)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1683)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1629)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1338)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1261)
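For context, the failure comes from java.net.URI's single-string constructor, which expects an already-encoded URI and rejects a raw space; the multi-argument constructors percent-encode such characters instead. A minimal illustration (not the connector's own code):

import java.net.URI

// new URI("gs://cloud-dmp/tmp/a b")            // throws URISyntaxException, as in the trace above
val encoded = new URI("gs", "cloud-dmp", "/tmp/a b", null)
println(encoded)                                // prints gs://cloud-dmp/tmp/a%20b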

com.google.api.client not shaded in GCS Connector fat jar

GCS connector has a dependency on google-api-client, but does not shade it. When an application brings its own version of google-api-client, this results in 404 errors like the following:

Job setup failed : com.google.api.client.http.HttpResponseException: 404 Not Found
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
	at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flushIfPossible(BatchHelper.java:118)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flush(BatchHelper.java:132)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfos(GoogleCloudStorageImpl.java:1493)
	at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfos(ForwardingGoogleCloudStorage.java:221)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfos(GoogleCloudStorageFileSystem.java:1159)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:530)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:1381)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:313)
	at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:131)
	at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:265)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:254)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:234)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

This seems to be caused by not shading com.google.api.client dependencies in the maven-shade-plugin config.
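For applications built with sbt-assembly, one possible workaround on the application side (a sketch of the general relocation idea, not a fix in the connector; the target package name is arbitrary) is to relocate the application's own copy of the client library so it no longer clashes with the connector's:

// build.sbt fragment: shade the application's google-api-client classes.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.api.client.**" -> "repackaged.com.google.api.client.@1").inAll
)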

Default credentials also when running outside of Google

Currently, to run the gcs-connector on-premises, we have to download and set up an explicit OAuth 2.0 private key (as explained in the docs). It would be easier if the gcs-connector could use gcloud's default credentials (which in our case we already have to set up for other purposes anyway).

Got exception: Connection closed prematurely

Hey there,

We recently started running into errors when reading files from Cloud Storage:

2015-09-30 23:15:17,334 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:u_1661 cause:java.io.IOException: Error reading gs://example-bucket/some-file.gz at position 20971520
java.io.IOException: Error reading gs://example-bucket/some-file.gz at position 20971520
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStreamAndSetMetadata(GoogleCloudStorageReadChannel.java:667)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:555)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:289)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:158)
  at java.io.DataInputStream.read(DataInputStream.java:149)
  at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:151)
  at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:135)
  at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
  at java.io.InputStream.read(InputStream.java:101)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
  at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:139)
  at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:259)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
  at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:530)
  at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:363)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
  at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:152)
  at java.net.SocketInputStream.read(SocketInputStream.java:122)
  at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
  at sun.security.ssl.InputRecord.read(InputRecord.java:480)
  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
  at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
  at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
  at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
  at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
  at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
  at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
  at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:4680)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStreamAndSetMetadata(GoogleCloudStorageReadChannel.java:651)
  ... 23 more

It also seems like there's a bug in the log output of the current retry.

Any idea why this can occur? It seems like an intermittent issue, but I want to make sure.

Thanks

v1.6.1 not shown on maven

The release page shows that the latest release is v1.6.1 (in April), after 1.6.0 (December last year).

However, if I go to the mvnrepository page, the latest version it shows is 1.6.0 (December last year; same for bigdataoss-parent). It looks like 1.6.1 is missing?

Is this indicative of a problem, or perhaps I'm misunderstanding something?

GoogleHadoopFileSystemBase.setTimes() not working

I have a reference to the GoogleHadoopFileSystemBase in my Java code, and I'm trying to call setTimes(Path p, long mtime, long atime) to modify the timestamp of a file. It doesn't seem to be working though. I also checked the timestamp using hadoop fs -ls gs://mybucket/, but that timestamp also shows up as unchanged.

This is also posted at http://stackoverflow.com/questions/33640664/googlehadoopfilesystembase-settimes-not-working, but has the answer changed? Using the link that James posted there, there does seem to be a field for GCS objects listing the updated time, so GoogleHadoopFileSystem should be able to use that as the mtime?

To that, Dennis Huo responded:
So, the answer hasn't quite changed yet. We did investigate using the field you mention before, but it's not quite what we need; in particular, GCS doesn't have a notion of actually "updating" the content of GCS objects, so in the case of the "updated" field, it only changes when something mutates the GCS metadata itself. Since we couldn't modify one of the actual behavior fields like acl, cache-control, etc., just for purposes of getting a new updated timestamp, this meant modifying the arbitrary-data "metadata" key/value fields was the only feasible way to change the field, and then in that case we often want better client-side control of the values, so in the related "directory update times" feature we ended up just sticking mtime in a custom metadata field in the GCS object.

This means in theory, we can indeed implement mtime using custom metadata key/value pairs in the GCS objects, just that it isn't implemented yet today. So far, most cases which depend on mtime, however, will run into other problematic assumptions, such as directory mtimes always being updated when objects are added/removed to directories, and in the case of GCS, directory-modification bottlenecks run into rate-limit issues very quickly (see http://stackoverflow.com/questions/31851192/rate-limit-with-apache-spark-gcs-connector).

If you need mtimes but don't need directory modification semantics, we could add single-object mtime as a feature request for an upcoming release.


I indeed don't need the directory modification semantics. My use case is that we used to create files in HDFS which functioned as locks that could time out. We'd periodically update the timestamp of the lock to signal that the process owning the lock is still alive.

Retry on 412 possible?

Getting a 412 Precondition Failed causes an exception to be thrown; is it possible to retry or recover from this situation instead? The connector version was 1.4.1-hadoop1.

Stacktrace:

java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 412 Precondition Failed 
{ 
"code" : 412, 
"errors" : [ { 
"domain" : "global", 
"location" : "If-Match", 
"locationType" : "header", 
"message" : "Precondition Failed", 
"reason" : "conditionNotMet" 
} ], 
"message" : "Precondition Failed" 
} 
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289) 
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage$1.close(CacheSupplementedGoogleCloudStorage.java:117) 
at java.nio.channels.Channels$1.close(Channels.java:178) 
at java.io.FilterOutputStream.close(FilterOutputStream.java:160) 
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.close(GoogleHadoopOutputStream.java:121) 
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) 
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) 
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106) 
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:309) 
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:465) 
at com.spins.hdp.sld.FilterSld$M.cleanup(FilterSld.java:150) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 

Rate limit of GCS connector

Hi,
I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:

java.io.IOException: Error inserting: bucket: *****, object: *****
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
  "code" : 429,
  "errors" : [ {
    "domain" : "usageLimits",
    "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
  ... 3 more

Is there a way to control the read/write rate of the GCS connector?

Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue

I am trying to TRUNCATE and LOAD data into a Google Cloud BigQuery table using Apache Spark. Though it is achievable with the help of IndirectBigQueryOutputFormat, as stated by Dennis Huo in #43, I ran into a serious performance issue.

Below are some of the code samples/configurations used:

conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_,_]].getName)
BigQueryOutputConfiguration.configure(conf, projectId, outputDatasetId, outputTableId, outputSchema, "gs://spark3/temp", BigQueryFileFormat.NEWLINE_DELIMITED_JSON, classOf[TextOutputFormat[_,_]])

After doing some transformations on the input data, I have val append as a Dataset[Row], and I am converting it to an RDD[(Long, String, Long)] using:

val Final_Stmt = append.as[(Long, String, Long)].rdd

Now it's time to load the data into the BigQuery table, so I used:

Final_Stmt.map(pair => (null, convertToJson(pair))).saveAsNewAPIHadoopDataset(conf)

where the definition of convertToJson is as below:

def convertToJson(pair: (Long, String, Long)): JsonObject = {
  val id = pair._1
  val name = pair._2
  val score = pair._3
  val jsonObject = new JsonObject()
  jsonObject.addProperty("id", id)
  jsonObject.addProperty("name", name)
  jsonObject.addProperty("score", score)
  return jsonObject
}

If I comment out saveAsNewAPIHadoopDataset(conf) and only print Final_Stmt's map, it runs in 2.44 minutes for data of 8 rows and 3 columns. But if I run with saveAsNewAPIHadoopDataset(conf), which is used to load data into the BigQuery table, it takes about 41 minutes to load.

Looking at the output window, TaskSetManager creates about 200 tasks, and for each task the following gets executed:

INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
INFO com.google.cloud.hadoop.io.bigquery.output.ForwardingBigQueryFileOutputFormat: Delegating functionality to 'TextOutputFormat'

and each task takes too much time to execute.

How can I improve the performance of this job, or is there anything missing on my end?

java.lang.StackOverflowError when using AvroBigQueryInputFormat with complex tables

I got the following stacktrace when using AvroBigQueryInputFormat on a fairly complex table with many nested/repeated records. I inspected the dump .avro files in GCS with avro-tools. avro-tools getschema works fine but avro-tools tojson triggers the same stacktrace.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.160.0.27): java.lang.StackOverflowError
    at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)
    at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)
    at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)
    at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)
    at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)
    at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)
    at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)
    at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)

Support for increasing timeouts?

I'm having trouble uploading large files via the connector - once a file goes past 100 MB I begin having upload timeouts. Further, it seems like there is no way to configure the Google API client's default timeout of 20 seconds.

Mismatching `AbstractBigQueryInputFormat` expectations when specifying a query

When using AbstractBigQueryInputFormat and its descendants:

If a query is specified via BigQueryConfiguration.INPUT_QUERY_KEY, and the table against which you are issuing the query is specified as INPUT_TABLE_ID_KEY, QueryBasedExport.runQuery will fail with Already Exists: Table [tablename].

If you specify a new table name as INPUT_TABLE_ID_KEY, which QueryBasedExport.runQuery will use as the tableRef parameter (the table to write the results to), the following failure occurs at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.constructExport(AbstractBigQueryInputFormat.java:191):

{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table [tablename]",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table [tablename]"
}

It appears there is a mismatch of expectations. Both QueryBasedExport and the other Exports use the same table name / table reference, but they have mutually exclusive preconditions.

QueryBasedExport expects the table to not exist, while the logic of the AbstractBigQueryInputFormat expects that same table reference to already exist (https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/bigquery/src/main/java/com/google/cloud/hadoop/io/bigquery/AbstractBigQueryInputFormat.java#L191).

Supply GCS credentials through the command line

Hi,

To my understanding, in order to execute a distcp command to upload data from our local Hadoop cluster into GCS, I have to define the google.cloud.auth.service.account.json.keyfile parameter and make sure all worker nodes in my cluster have the keyfile available locally.

Since I want to allow multiple users across the org to upload data to their own bucket in GCS using their own credentials, I want to be able to pass GCS credentials through the distcp command line, like we currently do when uploading data to S3.

Is there a solution in the pipeline which should enable this functionality?

Best,

Eyal

Issues loading project in IntelliJ

Has anyone successfully gotten the project loaded into IntelliJ? My normal flow for doing so is resulting in IntelliJ not understanding the project's Maven-module structure.

Here's a screencast showing me importing it by pointing IntelliJ at its POM, selecting the hadoop2 profile (the same failure happens whether I do this or not), and then IntelliJ not recognizing the <module> declarations in the root POM, and accordingly hadoop imports not being resolved in e.g. the gcs module:

[screencast GIF showing the IntelliJ project-load flow]

I'm using:

IntelliJ IDEA 2017.1.1
Build #IU-171.4073.35, built on April 6, 2017
JRE: 1.8.0_112-release-736-b16 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Mac OS X 10.12.4

Not able to use WRITE_DISPOSITION property from Spark job

I am trying to write data to a BigQuery table; the requirement is to overwrite the existing data in the table. For that purpose I am trying to use the BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY property to override the default value, which is WRITE_APPEND, but it is still appending the data to the table.

I am using Apache Spark with Scala and running the job in the Cloud Dataproc cluster environment.

Here is the piece of code I am trying:

BigQueryConfiguration.configureBigQueryOutput(conf, projectId, outputDataSetId, outputTableId, outputTableSchema)
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
conf.set("mapreduce.job.outputformat.class", classOf[BigQueryOutputFormat[_,_]].getName)
rdd.map(data => (null, data)).saveAsNewAPIHadoopDataset(conf)

BigQueryOutputCommitter support for multiple concurrent jobs into same output dataset

At the moment, BigQueryOutputCommitter uses a temporary dataset based solely on the name of the final output dataset, and requires the temporary dataset to be strictly owned by one job at a time. Following the same convention of FileOutputCommitter, it should incorporate JobAttemptID into the temporary dataset's name to allow multiple jobs concurrently outputting to the same output dataset. Additionally, for custom cases where JobAttemptID may not be available, the setting should be configurable via some overridable configuration key.

Support for setting User Agent

The Google Cloud Partner Data Reporting Policy requires adding a User-Agent string when using Google Cloud services. Is it possible to provide a new configuration property (something like fs.google.user.agent.prefix) to let users set it at the Hadoop/Spark configuration level?

Multiple processes to read BigQuery tables conflict with a temporary export directory

I get errors caused by conflicts over the export directory among multiple processes when I submit Spark jobs that read BigQuery tables at the same time from different processes.
I think the temporary GCS export directory depends only on a job ID. To avoid the issue, we could, for example, append a random number or something similar to it.

https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/bigquery/src/main/java/com/google/cloud/hadoop/io/bigquery/BigQueryConfiguration.java#L314

Exception in thread "main" java.io.IOException: Conflict occurred creating export directory. Path gs://spark-helper-us-region/hadoop/tmp/bigquery/job_201612022259_0000 already exists
        at com.google.cloud.hadoop.io.bigquery.AbstractExportToCloudStorage.prepare(AbstractExportToCloudStorage.java:55)
        at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:113)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1302)
        at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1342)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1341)
        at com.spotify.spark.bigquery.package$BigQuerySQLContext.bigQueryTable(package.scala:112)
        at com.mercari.spark.sql.SparkBigQueryHelper.readBigQueryTable(SparkBigQueryHelper.scala:123)
        at com.mercari.spark.batch.ActivitiesTableCreator.fetchEventLog(ActivitiesTableCreator.scala:99)
        at com.mercari.spark.batch.ActivitiesTableCreator.run(ActivitiesTableCreator.scala:33)
        at com.mercari.spark.batch.ActivitiesTableCreator.run(ActivitiesTableCreator.scala:25)
        at com.mercari.spark.batch.ActivitiesTableCreator$.apply(ActivitiesTableCreator.scala:222)

Issue with JobTracker freezing on later versions of gcs-connector

Using Hadoop 1.2.1 with the latest GCS connector (as of this writing), we run into a condition after a job completes where the job tracker completely freezes for ~15 minutes (the web UI is inaccessible and applications cannot submit new jobs to the cluster, but the machine is still accessible). This causes a bunch of headaches with job failures/etc. This has happened on a couple clusters in GCE (one 128 node, one 16 node; namenode is sized appropriately). It used to happen intermittently but now it's become very frequent.

I did a jstack dump when it was frozen which you can see here: https://gist.github.com/ap0/89c4fce43bfe177f15cd

The one that sticks out is this:

Thread 5369: (state = IN_NATIVE)
java.net.SocketOutputStream.socketWrite0(java.io.FileDescriptor, byte[], int, int) @bci=0 (Interpreted frame)
java.net.SocketOutputStream.socketWrite(byte[], int, int) @bci=52, line=113 (Interpreted frame)
java.net.SocketOutputStream.write(byte[], int, int) @bci=4, line=159 (Interpreted frame)
sun.security.ssl.OutputRecord.writeBuffer(java.io.OutputStream, byte[], int, int, int) @bci=5, line=377 (Interpreted frame)
sun.security.ssl.OutputRecord.write(java.io.OutputStream, boolean, java.io.ByteArrayOutputStream) @bci=458, line=363 (Interpreted frame)
sun.security.ssl.SSLSocketImpl.writeRecordInternal(sun.security.ssl.OutputRecord, boolean) @bci=62, line=830 (Interpreted frame)
sun.security.ssl.SSLSocketImpl.writeRecord(sun.security.ssl.OutputRecord, boolean) @bci=302, line=801 (Interpreted frame)
sun.security.ssl.AppOutputStream.write(byte[], int, int) @bci=161, line=122 (Interpreted frame)
org.apache.http.impl.io.AbstractSessionOutputBuffer.write(byte[], int, int) @bci=34, line=128 (Interpreted frame)
...

It looks like it gets hung in this call. My suspicion was that the connection dies during the transfer and there's no SO_TIMEOUT value set. I tried changing this in the google-http-java-client by setting the SO_TIMEOUT value on SSL sockets (zulily/google-http-java-client@380341e), which seemed promising but eventually did not work. A colleague said they had issues with the job tracker locking up as well; they found the problem occurred after a job completed when it would write the log files to GCS; their fix was to just disable this log writing.

One thing that does work is to use the older version of gcs-connector which we have running on some other clusters (1.2.8). This one does not seem to exhibit this behavior. I think a regression occurred between then and now that is causing this, but I have been unable to track it down.

[Kind/Help] Exception in thread "main" java.lang.NoSuchFieldError: GCS_SCOPES

I am trying to read from and write to BigQuery from Hadoop/Spark using this example. I have configured the GCS connector as mentioned here. When I try to list the data using hadoop fs -ls gs://big_query_test I get the following error. Can someone help me understand this?

akhilputhiry@akhilputhiry-Latitude-3450:~/opt/hadoop/etc/hadoop$ hadoop fs -ls gs://big_query_test
17/11/01 18:04:08 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
Exception in thread "main" java.lang.NoSuchFieldError: GCS_SCOPES
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1824)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1012)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:975)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:378)

"java.io.IOException: Not an Avro data file" when using AvroBigQueryInputFormat

The code looks like this:

val conf = self.hadoopConfiguration

conf.setClass(
  AbstractBigQueryInputFormat.INPUT_FORMAT_CLASS_KEY,
  classOf[AvroBigQueryInputFormat], classOf[InputFormat[_, _]])
BigQueryConfiguration.configureBigQueryInput(conf, table.getProjectId, table.getDatasetId, table.getTableId)

sc.newAPIHadoopFile(
  BigQueryStrings.toString(table).replace(':', '.'),
  classOf[AvroBigQueryInputFormat],
  classOf[LongWritable], classOf[GenericData.Record]).map(_._2)

Stacktrace:

Caused by: java.io.IOException: Not an Avro data file
    at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
    at com.google.cloud.hadoop.io.bigquery.AvroRecordReader.initialize(AvroRecordReader.java:49)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:176)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

I saw these in the log:
15/10/20 14:51:54 INFO bigquery.AbstractBigQueryInputFormat: Resolved GCS export path: 'gs://starship/hadoop/tmp/bigquery/job_201510201451_0002'
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Computed '2' shards for sharded BigQuery export.
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Table 'prefab-wave-844:bigquery_staging.spark_query_20151020144723_1201584542' to be exported has 4251779 rows and 1641186694 bytes
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Computed '2' shards for sharded BigQuery export.
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Table 'prefab-wave-844:bigquery_staging.spark_query_20151020144723_1201584542' to be exported has 4251779 rows and 1641186694 bytes
I verified that the export path indeed contains only Avro files and nothing else.
I also tried reading the export path with spark-avro, and that works fine.

Accessing Google storage by Spark from outside Google cloud

Hi guys,

We're trying to access Google Storage from within a Spark on YARN job (writing to gs://...) on a cluster that resides outside Google Cloud.

We have set up the correct service account and credentials but are still facing some issues:

The spark.hadoop.google.cloud.auth.service.account.keyfile points to the credentials file on the Spark driver, but the Spark code (workers running on different servers) still tries to access the same file path (which doesn't exist). We got it to work correctly by having the credentials file at the exact same location on both the driver and the workers, but this is not practical and was a temporary workaround.

Is there any delegation token mechanism by which the driver authenticates with Google Cloud and sends it to the workers, so they don't need to have the same credential key at the exact same path?

We also tried to upload the credential file (p12 or json) to the workers and set:
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS
or
spark.executor.extraJavaOptions

to the file path (different from the driver file path), but we're getting:

java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:87)
	at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:68)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1319)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:549)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:512)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
Caused by: java.net.UnknownHostException: metadata
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
	at com.google.api.client.googleapis.compute.ComputeCredential.executeRefreshToken(ComputeCredential.java:87)
	at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:85)
	... 14 more

Is there any documentation for this use case that we missed?

Thanks,

Hadoop Connector consistently fails to read specific shards from BigQuery

When extracting data from BigQuery, I see the following on specific tables.

java.lang.IndexOutOfBoundsException
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:138)
  at java.io.DataInputStream.read(DataInputStream.java:149)
  at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
  at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
  at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
  at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
  at com.google.cloud.hadoop.io.bigquery.GsonRecordReader.nextKeyValue(GsonRecordReader.java:87)
  at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)

This issue always occurs on the exact same shard. If I change ENABLE_SHARDED_EXPORT_KEY to false, I don't have any issue (but this is much slower). The size of the JSON file in one shard created on GCS is 2.36 GB, with 8 shards.

Erroneous "already exists" error when reusing Hadoop output path due to stale cache

Originally reported by a Hadoop user to one of our feedback lists: it appears the directory inference feature introduced in 1.4.0 has a minor bug when the GCS connector is used in conjunction with tools which don't cooperate with the DirectoryListCache, such as the Pantheon GUI or gsutil. A stale cache entry persists after a directory is deleted in the web GUI or with gsutil, and that entry erroneously prevents Hadoop from reusing the same output path until either the cache is cleared manually or hadoop fs -rmr is used to let the DirectoryListCache remove the stale entries. In general, the steps to reproduce the issue were:

  1. Run a job outputting to gs://foo-bucket/foo-dir
  2. Use web UI to remove output or gsutil rm -R gs://foo-bucket/foo-dir
  3. Re-run job outputting to gs://foo-bucket/foo-dir
  4. Job fails with "already exists" error.

The workaround when using gcs-connector-1.4.0 is simply to set fs.gs.implicit.dir.infer.enable=false in the Hadoop configs, or in gcs-core-template.xml when using bdutil. A code-side fix will be in by the next release.
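Applied in code, the workaround amounts to a single setting on the Hadoop configuration (a sketch; the key is the one named above):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Disable implicit directory inference to avoid the stale DirectoryListCache entries described above.
conf.set("fs.gs.implicit.dir.infer.enable", "false")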

Streaming inserts to BigQuery

Hi Folks

I have been trying to do streaming inserts into BigQuery with no luck.

Even the simple demo from the guides doesn't work.

Here is sample code that tries to write a row to a BigQuery table:

    val jsonString = """{"name":"testName","isNew":false}"""
    val allRows = Array.fill[TableDataInsertAllRequest.Rows](100)(new TableDataInsertAllRequest.Rows)
    allRows(0).setJson(gson.fromJson(jsonString, targetClass))
    val insertAllRequest = new TableDataInsertAllRequest().setRows(allRows.toList)
    val request = bigquery.tabledata().insertAll(targetTable.getProjectId,
      targetTable.getDatasetId,
      targetTable.getTableId,
      insertAllRequest)

I am getting the error below:

{"insertErrors":[{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":1},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":2},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":3},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":4},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":5},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":6},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":7},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":8},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":9},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":10},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":11},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":12},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":13},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":14},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":15},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":16},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":17},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":18},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":19},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":20},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":21},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":22},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":23},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":24},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":25},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":26},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":27},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":28},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":29},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":30},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":31},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":32},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":33},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":34},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":35},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":36},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":37},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":38},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":39},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":40},{"errors":[{"debugInfo":"","location":"","message":"","reason":"inv
alid"}],"index":41},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":42},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":43},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":44},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":45},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":46},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":47},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":48},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":49},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":50},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":51},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":52},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":53},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":54},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":55},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":56},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":57},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":58},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":59},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":60},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":61},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":62},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":63},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":64},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":65},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":66},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":67},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":68},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":69},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":70},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":71},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":72},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":73},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":74},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":75},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":76},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":77},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":78},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":79},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":80},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":81},{"errors":[{"debugInfo":"","location":"","message":"","
reason":"invalid"}],"index":82},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":83},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":84},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":85},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":86},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":87},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":88},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":89},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":90},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":91},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":92},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":93},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":94},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":95},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":96},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":97},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":98},{"errors":[{"debugInfo":"","location":"","message":"","reason":"invalid"}],"index":99},{"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":0}],"kind":"bigquery#tableDataInsertAllResponse"} 

Can anyone help?

Setting Content-type on data created from GCS connector FS calls

Hello!

Currently, data we create via the GCS connector using regular FS calls always defaults to application/octet-stream for its Content-Type in the object metadata.

Is there a way to configure this when writing per file?

Example using Jackson to write json:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

private static final ObjectMapper MAPPER = new ObjectMapper();

FileSystem fs = FileSystem.get(getConfiguration());
Path path = new Path("gs://some-bucket/somepath.json");
try (FSDataOutputStream outputStream = fs.create(path)) {
  // writerWithDefaultPrettyPrinter() returns an ObjectWriter that pretty-prints the JSON
  MAPPER.writerWithDefaultPrettyPrinter().writeValue(outputStream, someJsonObject);
}

Results in:
(screenshot of the resulting object metadata, showing Content-Type: application/octet-stream)
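The Hadoop FileSystem.create() call has no content-type parameter, so one possible workaround (a sketch only, not a connector feature I can confirm) is to patch the object's metadata after the write using the google-cloud-storage client library; the bucket and object names below are placeholders matching the example above:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class SetContentType {
  public static void main(String[] args) {
    // Placeholder names matching the FS example above.
    BlobId blobId = BlobId.of("some-bucket", "somepath.json");
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Updates only the metadata; the object data written through the connector is untouched.
    storage.update(BlobInfo.newBuilder(blobId).setContentType("application/json").build());
  }
}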

Append/Overwrite to table via Spark API

Hi Guys

I have been using this project to set up a Spark connector to BQ. I noticed that writing to a BQ table via .saveAsNewAPIHadoopDataset(hadoopConf) results in my table getting duplicate data; I understand that this is because I am calling the save method twice (two different batch jobs).

The problem is that the source data (an API) can't be streamed, so I have to rerun the job in batches.

This means that some of the data returned per job is the same, which results in duplicate data being inserted.

Is there a way to set up the connector so that it recognises that the records are the same and either appends or overwrites them?

I think the job configuration allows me to do this via the create/write disposition.

Is there something similar to that using the API?
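For reference, a minimal sketch of setting the write disposition through the job's Hadoop configuration is below. The property name mapred.bq.output.table.writedisposition is an assumption based on the connector's BigQueryConfiguration keys and may differ between versions; note also that neither WRITE_APPEND nor WRITE_TRUNCATE deduplicates records, they only control whether the target table is appended to or replaced:

import org.apache.hadoop.conf.Configuration;

Configuration hadoopConf = new Configuration();
// Assumed property name; check BigQueryConfiguration in your connector version.
hadoopConf.set("mapred.bq.output.table.writedisposition", "WRITE_TRUNCATE");  // or "WRITE_APPEND"
// ... then pass hadoopConf to saveAsNewAPIHadoopDataset as before.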

BigQuery Connector for Spark: Query Views

I tried the BigQuery Connector (0.10.2-hadoop2) for Spark (2.1.0) and it works fine for normal tables.

But when I use it for BigQuery views, the Spark job shows the following info message (in the aggregated YARN logs): ShardedExportToCloudStorage: Table '<project>:<dataset>.<my-view-name>' to be exported has 0 rows and 0 bytes. The job doesn't exit with any error.

Has anybody tried to query a view with the BigQuery connector from a Spark application before and encountered a similar problem?
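One workaround to consider (a sketch, assuming the sharded export behind the connector only operates on plain tables, which would explain the 0-rows message) is to materialize the view into a table with a query job and point the connector at that table. The project, dataset, and table names are placeholders:

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;

/** Materializes a view into a regular table that the connector can then export. */
static void materializeView(Bigquery bigquery, String projectId) throws Exception {
  JobConfigurationQuery query = new JobConfigurationQuery()
      .setQuery("SELECT * FROM [" + projectId + ":my_dataset.my_view]")  // legacy SQL syntax
      .setDestinationTable(new TableReference()
          .setProjectId(projectId)
          .setDatasetId("my_dataset")
          .setTableId("my_view_materialized"))
      .setAllowLargeResults(true)
      .setWriteDisposition("WRITE_TRUNCATE");
  Job job = new Job().setConfiguration(new JobConfiguration().setQuery(query));
  // Fire-and-forget here; in practice, poll the job until DONE before reading the table.
  bigquery.jobs().insert(projectId, job).execute();
}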

Missing rows reading Google Cloud Storage gzipped through Spark

Hi,

I have a Spark job from Dataproc where I simply count the number of rows in a newline-delimited JSON file.
The number of lines is correct when I upload a plain-text JSON file.
The number of lines is incorrect (11 rows out of 576 in my example) when the JSON is gzipped first and content-encoding: gzip is set as metadata in GCS.
I don't have this issue if I use gsutil to download the gzipped JSON and re-upload it (gsutil decompresses the file when downloading, so the re-upload is effectively without compression).
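As a sanity check (a minimal sketch, assuming the gzipped file has also been downloaded locally; data.json.gz is a placeholder name), the expected line count can be computed outside Spark with plain JDK classes and compared to the Spark result:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class CountGzippedLines {
  public static void main(String[] args) throws Exception {
    long count = 0;
    // Decompress and count newline-delimited records.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream("data.json.gz")), StandardCharsets.UTF_8))) {
      while (reader.readLine() != null) {
        count++;
      }
    }
    System.out.println("lines: " + count);
  }
}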

Any clue?

Thanks

Olivier

Saving to an EU location via the spark API

Hi

When saving to BQ using saveAsNewAPIHadoopDataset, it defaults to the US location. Is there any way to save to the EU location?

Setting the Hadoop configuration to EU doesn't seem to be picked up by the connector.
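One thing to check (a sketch only, under the assumption that the connector stages data through a temporary GCS bucket): the destination dataset must already exist in the EU, and the staging bucket should be an EU bucket as well, since BigQuery load jobs require the bucket and dataset locations to be compatible. The property name mapred.bq.gcs.bucket is assumed from the connector's BigQueryConfiguration and may differ between versions; the bucket name below is a placeholder:

import org.apache.hadoop.conf.Configuration;

Configuration hadoopConf = new Configuration();
// Point the connector's staging bucket at a bucket created in the EU.
hadoopConf.set("mapred.bq.gcs.bucket", "my-eu-staging-bucket");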

"mapred.bq.temp.gcs.path" error when folder already exists

When the folder corresponding to the parameter mapred.bq.temp.gcs.path already exists, an exception is thrown:

java.io.IOException: Conflict occurred creating export directory. Path gs://rocket-omds-cluster-tmp/test7 already exists

It would be handy to allow the client code to configure overwriting of the temp folder, or to allow using a folder that already exists.
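As a workaround (a sketch, not connector behavior; the bucket path reuses the one from the error above and the rest is a placeholder), the export directory can be made unique per run, or a leftover directory can be deleted before the job is submitted:

import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();

// Option 1: give every run its own export directory.
String exportPath = "gs://rocket-omds-cluster-tmp/export-" + UUID.randomUUID();
conf.set("mapred.bq.temp.gcs.path", exportPath);

// Option 2: delete a leftover directory from a previous run before submitting the job.
Path tempPath = new Path(exportPath);
FileSystem fs = tempPath.getFileSystem(conf);
if (fs.exists(tempPath)) {
  fs.delete(tempPath, true);  // recursive delete
}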

LeaseExpiredException when copying many files from gcs to HDFS

Hi,

We're syncing large numbers of files from a google bucket to HDFS using the gcs_connector. This works great, up to a point -- maybe 100 files in? -- after which point we start seeing a number of LeaseExpiredExceptions.

Specifically:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /datasets/path/to/destination/directory/on/hdfs/union-effects.pkl._COPYING_ (inode 36373244): File does not exist. Holder DFSClient_NONMAPREDUCE_-1721918755_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3602)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3399)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3255)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:676)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:212)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:483)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)

Typically, once we start seeing these errors, they continue for the subsequent files being downloaded. If we kill the processes and start again, the downloads succeed until a number of files have been copied, at which point the errors start up again.

For the sake of completeness, we are using the following bash snippet to download the files:

#!/usr/bin/env bash

if [ $# -ne 2 ]; then
        echo "Usage: $0 <src> <dest>" >&2
        exit 1
fi

export SRC="$1"
shift
export DEST="$1"
shift

echo "Copying from $SRC to $DEST"

export hl="hdfs dfs $hadoop_args -Dfs.gs.path.encoding=uri-path -ls"
export hc="hdfs dfs $hadoop_args -Dfs.gs.path.encoding=uri-path -cp"
export ht="hdfs dfs $hadoop_args -Dfs.gs.path.encoding=uri-path -test"
export gsl="gsutil ls"

one_file() {
        file="$1"
        if ! $ht -e "$DEST/$file" 2> /dev/null; then
                echo -n "Copying: $SRC/$file to $DEST/$file"
                err="$(mktemp)"
                if $hc "$SRC/$file" "$DEST/$file" 2>"$err"; then
                        echo " done"
                else
                        echo " error!"
                        cat "$err"
                fi
                rm -f "$err"
        else
                echo "Found: $DEST/$file"
        fi
}

export -f one_file

$hl -R $SRC 2>/dev/null \
| tail -n +2 \
| sed "s_^.* ${SRC}/__" \
| parallel -j+0 --env hc,ht,SRC,DEST,one_file 'one_file'

Wondering (hoping) you might have some insight as to what might be causing these errors?

This ends up being about 24 processes running in parallel.
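For comparison (a sketch only, non-recursive, assuming the same source and destination layout as the script above), copying inside a single JVM with FileUtil.copy keeps one HDFS client alive for the whole run instead of spawning a short-lived hdfs dfs -cp client per file, which may sidestep the lease churn:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GcsToHdfsCopy {
  public static void main(String[] args) throws Exception {
    Path src = new Path(args[0]);   // e.g. gs://bucket-name/rna
    Path dest = new Path(args[1]);  // e.g. /datasets/bucket-name/rna
    Configuration conf = new Configuration();
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem destFs = dest.getFileSystem(conf);
    // One client for the whole run; skip files that already exist at the destination.
    for (FileStatus status : srcFs.listStatus(src)) {
      Path target = new Path(dest, status.getPath().getName());
      if (!destFs.exists(target)) {
        FileUtil.copy(srcFs, status.getPath(), destFs, target, false, conf);  // false = keep source
      }
    }
  }
}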

(addendum):
Here is another, possibly more complete log of this error:

Copying: gs://bucket-name/rna/etc-etc-etc-read-name-sorted.bam to /datasets/bucket-name/rna/etc-etc-etc-read-name-sorted.bam error!
17/07/27 19:53:43 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/27 19:53:44 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://bucket-name/'
17/07/27 19:53:45 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /datasets/bucket-name/rna/etc-etc-etc-read-name-sorted.bam._COPYING_ (inode 36384065): File does not exist. Holder DFSClient_NONMAPREDUCE_-1837747866_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3602)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3399)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3255)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:676)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:212)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:483)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1674)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1471)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:668)
17/07/27 19:53:45 WARN gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://bucket-name/rna/etc-etc-etc-read-name-sorted.bam' is not open.
cp: No lease on /datasets/bucket-name/rna/etc-etc-etc-read-name-sorted.bam._COPYING_ (inode 36384065): File does not exist. Holder DFSClient_NONMAPREDUCE_-1837747866_1 does not have any open files.

Spark Bigquery Job Cannot read from non-public tables

Using the latest master branch: GHFS version 1.5.0-hadoop2 and the BigQuery connector 0.7.6, both built from master.
Tested on Spark local, a Spark 1.6.0/1.6.1 cluster, and the current Dataproc version.
BigQuery dataset permissions for owners, editors, viewers, and the service account are set to "is owner".

Steps to recreate:

When using this example:
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example

and the additional necessary configuration is added:

conf.set("google.cloud.auth.service.account.enable", "true");
conf.set("google.cloud.auth.service.account.email", "[email protected]");
conf.set("google.cloud.auth.service.account.keyfile", "location.p12");
conf.set("fs.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

All works well:

16/07/29 15:19:10 INFO NewHadoopRDD: Input split: gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-1/data-*.json[82328 estimated records]
16/07/29 15:19:10 INFO NewHadoopRDD: Input split: gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-0/data-*.json[82328 estimated records]
16/07/29 15:19:10 INFO DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:82328 locations: [] toString(): gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-0/data-*.json[82328 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201607291519_0000_m_000000_0 Status:'
16/07/29 15:19:10 INFO DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:82328 locations: [] toString(): gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-1/data-*.json[82328 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201607291519_0000_m_000001_0 Status:'
16/07/29 15:19:42 INFO DynamicFileListRecordReader: Adding new file 'data-000000000000.json' of size 0 to knownFileSet.
16/07/29 15:19:42 INFO DynamicFileListRecordReader: Moving to next file 'gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-1/data-000000000000.json' which has 0 bytes. Records read so far: 0
16/07/29 15:19:42 INFO GoogleCloudStorageReadChannel: Got 'range not satisfiable' for reading gs://tempspark/hadoop/tmp/bigquery/wordcounttmp51001d75-d5d8-4470-865a-0e6e5ea89e26/shard-1/data-000000000000.json at position 0; assuming empty.

etc and the word count and persistence completes.

When the dataset is changed from the public dataset to a private one:

// Input parameters (public dataset)
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
// changed to a private table with a valid project, dataset and table
val fullyQualifiedInputTableId = "project:dataset.table"

Shards and records are found but nothing happens and the job stalls at:

16/07/29 16:01:23 INFO NewHadoopRDD: Input split: gs://tempspark/hadoop/tmp/bigquery/wordcounttmp02e3b5bf-5929-4117-a306-45ddadca5896/shard-1/data-*.json[2755 estimated records]
16/07/29 16:01:23 INFO NewHadoopRDD: Input split: gs://tempspark/hadoop/tmp/bigquery/wordcounttmp02e3b5bf-5929-4117-a306-45ddadca5896/shard-0/data-*.json[2755 estimated records]
16/07/29 16:01:23 INFO DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:2755 locations: [] toString(): gs://tempspark/hadoop/tmp/bigquery/wordcounttmp02e3b5bf-5929-4117-a306-45ddadca5896/shard-0/data-*.json[2755 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201607291601_0000_m_000000_0 Status:'
16/07/29 16:01:23 INFO DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:2755 locations: [] toString(): gs://tempspark/hadoop/tmp/bigquery/wordcounttmp02e3b5bf-5929-4117-a306-45ddadca5896/shard-1/data-*.json[2755 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201607291601_0000_m_000001_0 Status:'

without proceeding further, and without errors in any driver or executor logs. It seems to be a permissions issue or some other problem with the private dataset.

listing of large directories in gcs

Listing large directories in a GCS bucket is slow because maxListItemsPerCall in GoogleCloudStorageOptions is not configurable, and the default of 1024 is too low for large buckets/directories, resulting in a large number of round trips to the server.

bigquery connector no such field error is not useful.

Can the error message include the name of the field, pretty please?

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: no such field
    at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:97)
    at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter$BigQueryAsyncWriteChannel.handleResponse(BigQueryRecordWriter.java:135)
    at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter$BigQueryAsyncWriteChannel.handleResponse(BigQueryRecordWriter.java:100)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
    at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter.close(BigQueryRecordWriter.java:360)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1043)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1215)

Using BigQuery connector in HDInsights cluster: Guava version conflict

I want to use the BigQuery connector in an HDInsight cluster with Spark 2.1.0 (Hortonworks Data Platform 2.6). If I run my job locally it works fine, but if I deploy it to the cluster (via Livy, but this shouldn't matter here) I get:

ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:789)
...

I had similar issues in the past with Spark and third-party libraries (especially libraries which use Google Guava). As far as I know, the best solution is usually to explicitly shade the conflicting libraries. I use Maven as the build tool, so I use the Maven Shade Plugin to shade the conflicting libraries:

<plugin>
	<artifactId>maven-shade-plugin</artifactId>
	<version>3.0.0</version>
	<executions>
		<execution>
			<!-- ...... --> 
			<phase>package</phase>
			<goals>
				<goal>shade</goal>
			</goals>
			<configuration>
				<transformers>
					<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
					</transformer>
				</transformers>
				<filters>
					<filter>
						<artifact>*:*</artifact>
						<excludes>
							<exclude>META-INF/*.SF</exclude>
							<exclude>META-INF/*.DSA</exclude>
							<exclude>META-INF/*.RSA</exclude>
						</excludes>
					</filter>
				</filters>
			</configuration>
		</execution>
	</executions>
</plugin>

My next attempt was to add an explicit relocation of Guava via the Maven Shade Plugin:

<relocation>
	<pattern>com.google.guava</pattern>
	<shadedPattern>shaded.com.google.guava</shadedPattern>
</relocation>

But I got the same error message. I also tried to set spark.{driver,executor}.userClassPathFirst=true, also without success. Then I got something like:

Caused by: java.lang.RuntimeException: java.lang.ClassCastException: cannot assign instance of scala.concurrent.duration.FiniteDuration to field org.apache.spark.rpc.RpcTimeout.duration of type scala.concurrent.duration.FiniteDuration in instance of org.apache.spark.rpc.RpcTimeout
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
...

which is not thrown in the Google BigQuery connector. Is this a different error, which would mean it has nothing to do with the library conflict above? Or does this also originate from the library issue?

I'm out of ideas... Does anybody have an idea of what could be going wrong here or what I am missing? I'm happy about any tips or ideas! Thank you very much.
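One observation worth adding here (not from the original report): Guava's classes live under the com.google.common packages, not com.google.guava, so the relocation above never rewrites com.google.common.base.Splitter. A relocation along these lines might behave differently:

<relocation>
	<pattern>com.google.common</pattern>
	<shadedPattern>shaded.com.google.common</shadedPattern>
</relocation>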

Crashes / failures on key paths with odd characters in them

I have a test bucket with the following first level key path components:

(screenshot of the bucket's first-level object names, which include spaces and other special characters)

Attempting to list that directory via the connector gives the following:

$ hadoop fs -ls 'gs://<redacted>'
16/02/16 10:18:36 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
16/02/16 10:18:40 ERROR gcsio.GoogleCloudStorageFileSystem: Invalid bucket name (<redacted>) or object name (a b/)
java.net.URISyntaxException: Illegal character in path at index 20: gs://<redacted>/a b/
        at java.net.URI$Parser.fail(URI.java:2829)
        at java.net.URI$Parser.checkChars(URI.java:3002)
        at java.net.URI$Parser.parseHierarchical(URI.java:3086)
        at java.net.URI$Parser.parse(URI.java:3034)
        at java.net.URI.<init>(URI.java:595)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getPath(GoogleCloudStorageFileSystem.java:1566)
        at com.google.cloud.hadoop.gcsio.FileInfo.<init>(FileInfo.java:70)
        at com.google.cloud.hadoop.gcsio.FileInfo.fromItemInfo(FileInfo.java:252)
        at com.google.cloud.hadoop.gcsio.FileInfo.fromItemInfos(FileInfo.java:263)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:995)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:1050)
        at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
        at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
        at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:90)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
-ls: Invalid bucket name (<redacted>) or object name (a b/)
Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path> ...]

The following works:

hadoop fs -ls 'gs://<redacted>/a%20b'

The following does not:

hadoop fs -ls 'gs://<redacted>/a b'
hadoop fs -ls 'gs://<redacted>/a?b'
hadoop fs -ls 'gs://<redacted>/a#b'
hadoop fs -ls 'gs://<redacted>/a%23b'
hadoop fs -ls 'gs://<redacted>/a%3Fb'
hadoop fs -ls 'gs://<redacted>/a%2523b'
hadoop fs -ls 'gs://<redacted>/a%253Fb'
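For reference (a sketch based on the fs.gs.path.encoding=uri-path flag used in other reports on this page, not a confirmed fix for every character above), the same setting can be applied cluster-wide in core-site.xml instead of per command:

    <property>
        <name>fs.gs.path.encoding</name>
        <value>uri-path</value>
    </property>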

Issue with connecting to GCS when behind a Squid proxy

Hi, I have installed a hadoop cluster on multiple GCE instances. The namenode GCE instance has an external IP, the datanode GCE instances have no external IP, and they proxy through Squid on the namenode for http and https. When I run "hadoop fs -ls gs://mybucket" on the namenode, everything works fine. When I run "hadoop fs -ls gs://mybucket" on any of the datanodes, I get the following stacktrace:

14/10/29 18:35:07 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop1
Oct 29, 2014 6:35:28 PM com.google.api.client.http.HttpRequest execute
WARNING: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1285)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1240)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.getItemInfo(CacheSupplementedGoogleCloudStorage.java:452)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1004)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.exists(GoogleCloudStorageFileSystem.java:368)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configureBuckets(GoogleHadoopFileSystemBase.java:1621)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.configureBuckets(GoogleHadoopFileSystem.java:71)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1569)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:700)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:663)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.fs.FsShell.ls(FsShell.java:583)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1591)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1810)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)

When I run "gsutil ls gs://mybucket" on the datanodes, everything works fine, so I know it's not an issue with the way I have configured Squid. I have tried with the latest 1.3.0-hadoop1 and 1.2.6 versions of the gcs-connector. Any help would be appreciated.

Distcp OAuth Access

When using OAuth, distributed copy (hadoop distcp) doesn't work because the authentication token and the setting (google.cloud.auth.service.account.enable=false) are not propagated, leading to mapper errors.

Hadoop Connector ignores proxy settings

I have been trying to set fs.gs.proxy.address to no avail, for use with command line tools like /opt/hadoop/bin/hdfs dfs -ls gs://bucket-name.

I checked via tcpdump that the connector does not try to use the proxy.

I've also tried to change the transport type. Although setting XXXX does trigger an error (i.e., it is correctly taken into account), I get the same timeout backtrace with both APACHE and JAVA_NET. It seems that even when the transport is set to APACHE, the connector attempts to use JAVA_NET anyway -- see the following backtrace, obtained with

    <property>
        <name>fs.gs.http.transport.type</name>
        <value>APACHE</value>
    </property>

Finally, the connector does not log the proxy settings, so it is hard to figure out what is going on.

java.net.SocketTimeoutException: connect timed out
       	at java.net.PlainSocketImpl.socketConnect(Native Method)
       	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
       	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
       	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
       	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
       	at java.net.Socket.connect(Socket.java:589)
       	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
       	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
       	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
       	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
       	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
       	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
       	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
       	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
       	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
       	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
       	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1316)
       	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1291)
       	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)
       	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
       	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
       	at com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:283)
       	at com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:307)
       	at com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:132)
       	at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
       	at com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:217)
       	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:859)
       	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
       	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
       	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
       	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1657)
       	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1612)
       	at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfo(ForwardingGoogleCloudStorage.java:214)
       	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1093)
       	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1413)
       	at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
       	at org.apache.hadoop.fs.Globber.glob(Globber.java:265)
       	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
       	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1583)
       	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1506)
       	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
       	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
       	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
       	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
       	at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
       	at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
       	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
       	at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)

I am running

# java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-1~bpo8+1-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

edit: I managed to make it work through export HADOOP_OPTS="$HADOOP_OPTS -Dhttp.proxyHost=.... However, it is a bit confusing that the connector's explicit configuration parameters do not seem to have any effect here.
