Comments (5)
@dennishuo is this project still being maintained? no commits since Dec and no replies on the issues :(
from hadoop-connectors.
Sorry for the delay, indeed this project is still being maintained.
I believe the location is a property of the destination BigQuery dataset, and IIRC the connector doesn't auto-create the output dataset if it doesn't exist yet. The configuration location also has to be set to match the destination dataset, because it is used for the temporary dataset that holds uncommitted results before they are appended into the destination table during commitTask.
You should make sure the destination dataset you pre-created is already in the EU, and keep the configuration key set to match.
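Since the connector won't create the dataset for you, the pre-created dataset's location and the job configuration have to agree. A minimal sketch of that bookkeeping, using plain properties; the mapred.bq.* key names here are from memory and should be verified against the BigQueryConfiguration constants in your connector version (the location key in particular is an assumption):

```java
import java.util.Properties;

public class BigQueryOutputConf {
    // Record the connector output settings. The point is that the location you
    // set here must match the location of the dataset you created by hand --
    // the connector will not create the dataset, and the temporary dataset for
    // uncommitted results uses this configured location.
    static Properties outputConf(String projectId, String datasetId, String location) {
        Properties conf = new Properties();
        conf.setProperty("mapred.bq.output.project.id", projectId);
        conf.setProperty("mapred.bq.output.dataset.id", datasetId);
        conf.setProperty("mapred.bq.output.location", location); // assumed key name
        return conf;
    }

    public static void main(String[] args) {
        // "my_dataset" is assumed to already exist in the EU.
        Properties conf = outputConf("my-project", "my_dataset", "EU");
        System.out.println(conf.getProperty("mapred.bq.output.dataset.id")
                + " @ " + conf.getProperty("mapred.bq.output.location"));
    }
}
```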
Incidentally, are you using the older com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat or the newer com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat?
Thanks for the reply! I was using the older version, so perhaps that was the reason.
I will check whether that sorts the problem out.
Out of interest, will using the newer version give any performance improvements?
What's the difference between the old and the new?
Indeed, the newer version has been measured to have better performance.
The difference is that the old version tried to get too fancy and wrote straight into BigQuery temporary tables, calling "CopyTable" inside of commitTask; this was a "simpler" flow, since it doesn't require intermediate storage outside of BigQuery, but unfortunately it consumes much more "BigQuery load jobs" quota, since every task first commits its own independent temporary table.
The new version writes to GCS first via the GCS connector for Hadoop, and on commitJob calls a single BigQuery "load" to ingest from GCS. In theory this could be slower, because BigQuery doesn't even begin its backend ingest until commitJob, but it turns out the overhead of per-task BigQuery loads is much higher, so overall the newer version is faster.
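The quota difference can be made concrete with a back-of-the-envelope count of BigQuery jobs per MapReduce job. This is an illustrative model, not the connector's exact accounting:

```java
public class LoadJobCount {
    // Old flow: every task commits into its own temporary table at commitTask,
    // then a CopyTable consolidates them -- roughly one BigQuery job per task.
    static int oldFlowJobs(int numTasks) {
        return numTasks + 1;
    }

    // New flow: tasks only write files to GCS (no BigQuery jobs at all), and
    // commitJob issues a single load from GCS regardless of task count.
    static int newFlowJobs(int numTasks) {
        return 1;
    }

    public static void main(String[] args) {
        int tasks = 500;
        System.out.println("old flow: ~" + oldFlowJobs(tasks) + " BigQuery jobs");
        System.out.println("new flow: " + newFlowJobs(tasks) + " BigQuery job");
    }
}
```

With 500 tasks, the old flow burns roughly 500x the load-job quota of the new flow, which is why the single load at commitJob wins despite starting later.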
We're still in the process of switching the documentation over to recommending the new connector version as the default; since it's newer, it's possible there are bugs we haven't found yet, but since it's built on the well-tested GCS connector for Hadoop, it's expected to be fairly stable nonetheless.
Note, however, that neither the old nor the new version creates a new BigQuery dataset, and the location is still determined by the dataset's location more so than by the config key.
If you use the new version, the GCS staging location you specify should also be in the EU if you're going to load it into an EU BigQuery dataset.
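Since a BigQuery load cannot read across locations, a cheap pre-flight check can catch this before a failed commitJob. A sketch, assuming you have already looked up the bucket and dataset locations (e.g. via the GCS and BigQuery APIs); the helper name is hypothetical:

```java
public class LocationCheck {
    // BigQuery load jobs require the source GCS data and the destination dataset
    // to be in the same location, so compare them up front. Location strings are
    // treated case-insensitively ("EU" vs "eu").
    static boolean sameLocation(String bucketLocation, String datasetLocation) {
        return bucketLocation.equalsIgnoreCase(datasetLocation);
    }

    public static void main(String[] args) {
        if (!sameLocation("EU", "eu")) {
            throw new IllegalStateException("staging bucket and dataset must be co-located");
        }
        // A US staging bucket feeding an EU dataset would fail at load time.
        System.out.println("US staging -> EU dataset ok? " + sameLocation("US", "EU"));
    }
}
```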
I see, well thanks for the clarification and, of course, your help!