
ga-bq's Introduction

Google Analytics -> BigQuery streaming

Stream raw hit-level Google Analytics data into BigQuery

Installation

  1. Create new project here https://console.developers.google.com/project
  2. Create new dataset in Google BigQuery https://bigquery.cloud.google.com
  3. Download and install Google App Engine python SDK https://cloud.google.com/appengine/downloads
  4. git clone https://github.com/lnklnklnk/ga-bq.git
  5. Create new app from source in Google SDK
  6. Set gcloud project: gcloud config set project your-project
  7. Change gifPath in js/gabq.js to ga-tracker-dot-[your-project].appspot.com/collect
  8. Set project_id (your-project), dataset_id (from step 2), table_id in bqloader.py
  9. Deploy application: gcloud app deploy app.yaml
  10. Visit ga-tracker-dot-[your-project].appspot.com/tasks/create_bq_table to create the BigQuery table. (If everything goes well, the expected response is simply "ok".)
  11. Include the plugin on your website. Add the line <script async src="http://ga-tracker-dot-[your-project].appspot.com/js/gabq.js"></script> after the GA code, and ga('require', 'gabqplugin'); after ga('create',..)
  12. Your raw GA data is now collected in the BigQuery table

Note: Ecommerce data is currently not supported; it will be added soon.
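For reference, the settings edited in step 8 look something like the sketch below. The names follow the step's wording and the values are illustrative placeholders; check bqloader.py itself for the exact variable names.

```python
# Illustrative sketch of the bqloader.py settings from step 8;
# the actual variable names and values in the file may differ.
project_id = 'your-project'   # the project set in step 6
dataset_id = 'ga_dataset'     # the dataset created in step 2 (example name)
table_id = 'ga_hits'          # the table created in step 10 (example name)
```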

Tuning

In case you have more than 1000 events per minute, you can multiply the number of cron workers by duplicating the entry in cron.yaml, e.g. something like this:

cron:
- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

Keep in mind that there is a limit: you cannot lease more than 1000 rows from the queue at once, so we scale by the number of cron jobs; with the five entries above, 5000 events can be processed each minute. While experimenting we noticed a limit of 60 cron jobs, so with that in mind you can scale up to 60,000 events per minute.
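The scaling arithmetic above can be checked with a quick sketch; the 1000-row lease limit and the 60-cron-job cap are the figures quoted in this section:

```python
MAX_LEASE_ROWS = 1000  # rows that can be leased from the pull queue at once
CRON_JOB_LIMIT = 60    # observed cap on the number of cron jobs

# With the five cron entries shown above:
per_minute = MAX_LEASE_ROWS * 5        # 5000 events per minute

# Scaling all the way to the cron-job cap:
max_per_minute = MAX_LEASE_ROWS * CRON_JOB_LIMIT   # 60000 events per minute
```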

Troubleshooting

Internal Server error (UnknownQueueError) when sending data to /collect

If you don't see your pull queue under the Pull queues tab of the Cloud Tasks section in the developer console, try deploying the queue config explicitly:

gcloud app deploy queue.yaml
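A minimal queue.yaml for a pull queue looks something like this; the queue name here is a placeholder and must match the name used in the application code:

```yaml
queue:
- name: ga-queue   # placeholder; must match the queue name in the app
  mode: pull
```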

ga-bq's People

Contributors

antonzol, jwest75674, laszlocsontos, lnklnklnk, mac2000, mayank-bhatia


ga-bq's Issues

IE 9 and previous versions - no data to BQ

Hi, I'm testing your app and it works great on recent browsers (IE10, IE11, and the latest Firefox and Chrome versions), but I see no results for IE9, IE8 and earlier versions.

I think it is about XHR but I'm not sure...

This JS works for me:

function GaBqPlugin(tracker) {
    ga(function(tracker) {
        var originalSendHitTask = tracker.get('sendHitTask');
        tracker.set('sendHitTask', function(model) {
            var payLoad = model.get('hitPayload');
            originalSendHitTask(model);

            // Old IE (8/9) only supports cross-domain requests via XDomainRequest;
            // everywhere else a plain XMLHttpRequest works.
            var gifRequest = window.XDomainRequest ?
                new XDomainRequest() : new XMLHttpRequest();

            var gifPath = "http://[xxx].appspot.com/collect";
            gifRequest.open('GET', gifPath + '?' + payLoad, true);
            gifRequest.send();
        });
    });
}
ga('provide', 'gabqplugin', GaBqPlugin);

Hope this helps.

Thanks

Queue Lease Time

The queue lease time should be increased from one second to at least ten (or even 30) seconds to avoid possible issues when more than one cron job processes messages; otherwise there is a chance of duplicate insertions into BigQuery.

InsertId should be used for automatic deduplication

InsertId is an optional string, unique ID for each row. BigQuery uses this property to detect duplicate insertion requests on a best-effort basis.

docs

I do not see any documented limits for it, so it seems the whole payload could be used as the ID to avoid duplicates.
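Using the whole payload verbatim can run into BigQuery's insertId length limit (see the "row insert id length ... is too long" error reported in a later issue), so a hashed form is safer. A minimal sketch, assuming the raw hit payload string is available:

```python
import hashlib

def insert_id_for(payload):
    # Deterministic short ID: identical payloads map to identical IDs,
    # so BigQuery's best-effort deduplication can drop repeats.
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

# Example hit payload (illustrative)
payload = 'v=1&tid=UA-00000-1&cid=555&t=pageview&dp=%2Fhome'
row = {'insertId': insert_id_for(payload), 'json': {'payload': payload}}
```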

Error 405 - Requesting https://ga-tracker-dot-myproject.appspot.com/collect

Error 405 - Requesting https://ga-tracker-dot-myproject.appspot.com/collect

I seem to be having sporadic issues across multiple websites relating to error 405 responses from ga-bq.

I am wondering if anyone else has reported this issue, or might have insight as to possible causes and fixes...

This error seems to cause some websites to only partially load, others it is specific pages, like a wordpress based elementor editor failing to load completely.


split pull-queue

If there are too many tasks in the queue, cron can't process them all.
Error log below.
I think it would be better to split the queue into batches.

<HttpError 400 when requesting https://bigquery.googleapis.com/bigquery/v2/projects/kuluarpohod-147905/datasets/GAfirst/tables/GAhits/insertAll?alt=json returned "The row insert id length 1182 is too long."> (/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py:1552)
Traceback (most recent call last):
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/7cb976f64e72c78c/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~kuluarpohod-147905/ga-tracker:20200103t165006.423608064662716982/process_queue.py", line 29, in get
    bq_loader.insert_rows(rows)
  File "/base/data/home/apps/s~kuluarpohod-147905/ga-tracker:20200103t165006.423608064662716982/bqloader.py", line 202, in insert_rows
    body=body).execute()
  File "/base/data/home/apps/s~kuluarpohod-147905/ga-tracker:20200103t165006.423608064662716982/oauth2client/util.py", line 132, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/base/data/home/apps/s~kuluarpohod-147905/ga-tracker:20200103t165006.423608064662716982/apiclient/http.py", line 723, in execute
    raise HttpError(resp, content, uri=self.uri)
HttpError: <HttpError 400 when requesting https://bigquery.googleapis.com/bigquery/v2/projects/kuluarpohod-147905/datasets/GAfirst/tables/GAhits/insertAll?alt=json returned "The row insert id length 1182 is too long.">
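One way to implement the suggested splitting is to insert rows in fixed-size batches rather than one insertAll call per lease; a sketch (the batch size is arbitrary, and bq_loader.insert_rows is the handler's existing loader call):

```python
def batches(rows, size=500):
    # Yield fixed-size slices so a single oversized insertAll call
    # is replaced by several smaller ones.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Each batch would then be passed to bq_loader.insert_rows(batch)
```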

Partitioning option

Not sure why, but the script does not create a partitioned table.

For small projects this may be fine, but with 1+ million records per day it can become an expensive solution.

So the question is: is this by design, or should we care about it at all?

Perhaps it could be exposed as a configuration parameter or something like that.
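If partitioning were made configurable, the table body passed to BigQuery's tables.insert would only need a timePartitioning block; a hypothetical sketch of that request body (project, dataset, and table names are placeholders):

```python
table_body = {
    'tableReference': {
        'projectId': 'your-project',
        'datasetId': 'ga_dataset',
        'tableId': 'ga_hits',
    },
    # Ingestion-time day partitioning: queries can then restrict scans
    # to a single day via _PARTITIONTIME, keeping costs down.
    'timePartitioning': {'type': 'DAY'},
}
```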

Queue configuration

It would be nice to have a queue configuration description in the README. We have tried out your implementation, but in our case the queue is now at 1 million records, and it seems the queue handler is not able to process all incoming requests. Just wondering, from your experience, what is the best way to tune it?

HTTP 404
