spotify / gcs-tools

GCS support for avro-tools, parquet-tools and protobuf

License: Apache License 2.0

Scala 48.53% Java 40.57% Shell 10.90%
gcs google-storage avro protobuf gcs-connector gcp parquet

gcs-tools's Issues

No valid credential configuration discovered

Hello,
Coming from Scio's documentation, I ended up installing proto-tools through Homebrew (spotify/public/gcs-proto-tools stable 0.2.4). However, when I run it I get the following error:

$ proto-tools getschema gs://bucket/data.protobuf.avro
Exception in thread "main" java.lang.IllegalArgumentException: No valid credential configuration discovered:  [CredentialOptions{serviceAccountEnabled=false, serviceAccountPrivateKeyId=null, serviceAccountPrivateKey=null, serviceAccountEmail=null, serviceAccountKeyFile=null, serviceAccountJsonKeyFile=null, nullCredentialEnabled=false, transportType=JAVA_NET, tokenServerUrl=https://oauth2.googleapis.com/token, proxyAddress=null, proxyUsername=null, proxyPassword=null, authClientId=32555940559.apps.googleusercontent.com, authClientSecret=<redacted>, authRefreshToken=null}]
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
	at com.google.cloud.hadoop.util.CredentialOptions$Builder.build(CredentialOptions.java:171)
	at com.google.cloud.hadoop.util.HadoopCredentialConfiguration.getCredentialFactory(HadoopCredentialConfiguration.java:227)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getCredential(GoogleHadoopFileSystemBase.java:1343)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1501)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1483)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:470)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3572)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3673)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3624)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.avro.mapred.FsInput.<init>(FsInput.java:38)
	at org.apache.avro.tool.ProtoGetSchemaTool.run(ProtoGetSchemaTool.java:33)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)

I'm logged into my GCP project with gcloud.

The README seems to suggest that something needs to be done with GCS-connector but I can't figure out what exactly:

  • Is it something to be installed separately? How?
  • How can I edit some core-site.xml file within a Homebrew installation? Or can it be passed to the command-line?
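
As a first diagnostic for the questions above, here is a minimal sketch that checks the two usual locations of Google Application Default Credentials: the file named by the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the well-known file written by gcloud auth application-default login. Whether this version of proto-tools (or the GCS connector it bundles) actually reads either location is an assumption, not something confirmed in this thread. The helper name find_adc is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Well-known location (relative to the home directory) on Linux and macOS.
WELL_KNOWN = Path(".config") / "gcloud" / "application_default_credentials.json"


def find_adc(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the first existing candidate credentials file, or None."""
    candidates = []
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        # An explicitly configured key file is checked first.
        candidates.append(Path(explicit))
    candidates.append(home / WELL_KNOWN)
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Calling find_adc(os.environ, Path.home()) returns None when neither file exists, which would suggest running gcloud auth application-default login before retrying the tool.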

New release with parquet-tools 1.10.1?

Hello there 👋

I noticed that the parquet-tools version on master was updated quite a while ago (which allows using rowcount 🙏), but there has been no release that includes it.

Do you have any idea if/when you would be able to make a new release?

We rely heavily on Parquet files hosted on GCS, so this is sorely missed!

Nonetheless, thanks for this awesome tool 👍

All latest tools fail to authenticate to GCS

STR:

1a. Install the latest released tools (v0.2.2, Aug 29)
1b. Or build the latest master to get parquet-cli-1.12.3.jar, proto-tools-3.21.1.jar, avro-tools-1.11.0.jar, magnolify-tools-0.4.8.jar

2. Run each of them with a basic read command such as <TOOL> tojson <GCS_PATH>

Actual:
The tool launches a browser that shows a page with this message:

The version of the app you're using doesn't include the latest security features to keep you protected. Please make sure to download from a trusted source and update to the latest, most secure version.

Expected:
The tool reads the file according to spec

no JSON input found: gcloud credentials

On gcs-avro-tools 0.1.7 from Homebrew, there appear to be issues loading the application default credentials.

Exception in thread "main" java.lang.IllegalArgumentException: no JSON input found
	at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
	at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:49)
	at com.google.api.client.json.JsonParser.startParsing(JsonParser.java:222)
	at com.google.api.client.json.JsonParser.parse(JsonParser.java:379)
	at com.google.api.client.json.JsonParser.parse(JsonParser.java:335)
	at com.google.api.client.json.JsonParser.parseAndClose(JsonParser.java:165)
	at com.google.api.client.json.JsonParser.parseAndClose(JsonParser.java:147)
	at com.google.api.client.json.JsonFactory.fromInputStream(JsonFactory.java:206)
	at com.google.api.client.extensions.java6.auth.oauth2.FileCredentialStore.loadCredentials(FileCredentialStore.java:154)
	at com.google.api.client.extensions.java6.auth.oauth2.FileCredentialStore.<init>(FileCredentialStore.java:86)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromFileCredentialStoreForInstalledApp(CredentialFactory.java:301)

....

When I run gcloud auth application-default login, it saves my credentials to /Users/cchow/.config/gcloud/application_default_credentials.json. Did the expected path change?
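
The trace above fails inside FileCredentialStore.loadCredentials while parsing JSON, which would be consistent with a credential store file that exists but is empty. The trace does not show which file the tool opens, so that path is unknown here. The sketch below only classifies a given file as missing, empty, invalid, or parseable JSON; the helper name classify_json_file is hypothetical.

```python
import json
from pathlib import Path


def classify_json_file(path: Path) -> str:
    """Return 'missing', 'empty', 'invalid', or 'ok' for a candidate JSON file."""
    if not path.is_file():
        return "missing"
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        # A zero-length or whitespace-only file has no JSON token to parse.
        return "empty"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    return "ok"
```

Running it against the application_default_credentials.json path mentioned above, and against any credential store file the tool turns out to use, can help tell a path problem apart from a corrupt or empty file.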

proto-tools NoSuchMethod error

When calling proto-tools with either tojson or getschema, the following error is thrown:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:641)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:1978)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1675)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:862)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:825)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.avro.tool.Util.openFromFS(Util.java:88)
	at org.apache.avro.tool.Util.fileOrStdin(Util.java:60)
	at org.apache.avro.tool.ProtoToJsonTool.run(ProtoToJsonTool.java:48)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)
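
A NoSuchMethodError like the one above commonly means the class loaded at runtime comes from a different library version than the one the caller was compiled against (here, possibly an older Guava on the classpath, though this thread does not confirm it). The message uses a JVM method descriptor, which is hard to read. The following sketch, with the hypothetical helper describe, turns such a message into a Java-style signature so the missing overload is easier to look up.

```python
# JVM descriptor codes for primitive types and void.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _parse_type(desc: str, pos: int):
    """Parse one type starting at pos; return (readable name, next position)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: Lpkg/Class;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                            # single-letter primitive
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def describe(signature: str) -> str:
    """'pkg.Class.method(<params>)<ret>' -> '<ret> pkg.Class.method(<params>)'."""
    owner_and_name, rest = signature.split("(", 1)
    params_desc, return_desc = rest.split(")", 1)
    params = []
    pos = 0
    while pos < len(params_desc):
        type_name, pos = _parse_type(params_desc, pos)
        params.append(type_name)
    ret, _ = _parse_type(return_desc, 0)
    return f"{ret} {owner_and_name}({', '.join(params)})"


print(describe(
    "com.google.common.base.Splitter.splitToList"
    "(Ljava/lang/CharSequence;)Ljava/util/List;"
))
# prints: java.util.List com.google.common.base.Splitter.splitToList(java.lang.CharSequence)
```

With the readable signature in hand, one can check which library version first provides that method and compare it with the version bundled in the tool's jar.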

add `proto-tools fromPb` method?

An example use case is inspecting the pipelineUrl file that Dataflow stages (a .pb file representing org.apache.beam.model.pipeline.v1.Pipeline) to verify coders and transforms. It would just be a wrapper around protoc's --decode or --decode_raw option, maybe with built-in support for common Protobuf messages like Pipeline (although we'd have to handle different schema versions for different Beam versions).
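
To illustrate what a schema-less decode involves, here is a minimal sketch of reading the top-level fields of a serialized Protobuf message from its wire format, similar in spirit to protoc --decode_raw. It is not the proposed fromPb implementation: the function names are hypothetical, nested messages are returned as raw bytes rather than decoded recursively, and the deprecated group wire types are rejected.

```python
from typing import Iterator, Tuple, Union


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a base-128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if not byte & 0x80:                # high bit clear means last byte
            return result, pos
        shift += 7


def decode_raw(buf: bytes) -> Iterator[Tuple[int, int, Union[int, bytes]]]:
    """Yield (field number, wire type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:                 # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:               # fixed 64-bit, kept as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:               # length-delimited: bytes, string, or nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:               # fixed 32-bit, kept as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field, wire_type, value


print(list(decode_raw(bytes.fromhex("089601"))))
# prints: [(1, 0, 150)]
```

Because the wire format carries no field names or exact types, decoding a Pipeline message into something readable would still need the matching .proto definitions, which is where the Beam version concern above comes in.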

error running proto-tools tojson

Running proto-tools tojson throws this error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.newInstance(Ljava/nio/ByteBuffer;)Lcom/google/protobuf/CodedInputStream;
	at me.lyh.protobuf.generic.GenericReader.read(GenericReader.scala:21)
	at org.apache.avro.tool.ProtobufReader.toJson(ProtobufReader.scala:9)
	at org.apache.avro.tool.ProtoToJsonTool.run(ProtoToJsonTool.java:59)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)

NoSuchMethodError when running parquet-tools locally

Hello there 👋

Following the README, I tried to build the project & use it locally but parquet-tools fails with a NoSuchMethodError:

The command:

% java -jar parquet-tools/target/scala-2.12/parquet-tools-1.10.1.jar rowcount --debug gs://path/to/parquet/file

The error:

java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:790)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:2140)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1832)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1013)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:976)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.parquet.tools.command.RowCountCommand.execute(RowCountCommand.java:83)
        at org.apache.parquet.tools.Main.main(Main.java:223)
java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
