
Google Search Console for Python (by Antoine Eripret)

License: MIT

Package purpose and content

gscwrapper is aimed at providing SEO professionals working with Python a strong basis to work with Google Search Console's APIs.

It provides an easy way to query and work with data from the following endpoints: Search Analytics, Sitemaps and URL Inspection.

For now, this package is only designed to work with the webmasters.readonly scope from the API.

Another GSC library?

There are countless GSC libraries available. My favorite (and the one I've been using for years) is available here. That being said, these libraries:

  • Are often limited to downloading data and don't offer methods to run common SEO analyses. I would often end up copying my code between notebooks, and I needed a library to centralize the common operations I often do.
  • Are sometimes owned by non-SEOs and therefore aren't always up-to-date, especially when there is an API update. Python is used by many SEO professionals, and yet we often rely on non-SEOs to maintain the libraries we use as an industry.

I've decided to create my own based on my most common needs as an SEO professional. It has also been a fun project to work on :)

DISCLAIMER: this library is not aimed at taking decisions for you; it just speeds up some repetitive data manipulation tasks we often do. I strongly advise you to read & understand the code behind a method if you aim to make decisions based solely on its output. In most cases, the only library used under the hood is Pandas.

Suggestions? Issues?

I'm more than happy to receive suggestions or solve issues through GitHub. Nevertheless:

  • The code is extensively commented to make it readable for everyone, even if you don't master Python. If you have a question on how a method works under the hood, please have a look at the code first.
  • I'm not a developer and this is, by far, the most complex project I've had to work on by myself. I try to stick to concepts I understand and I won't update my code just because I'm not using a best practice here and there.
  • I do it for free and hence I have to prioritize my (paid) work and my personal life over this library.

Quickstart

First, install the package using:

pip3 install git+https://github.com/antoineeripret/gsc_wrapper

Then, follow these steps:

  • Create a new project in the Google Developers Console.
  • Enable the Google Search Console API under "APIs & Services".
  • Create an "OAuth consent screen" (External). The app name doesn't really matter.
  • Add the webmasters.readonly scope.
  • Add your e-mail(s) to the test users.
  • Next, create credentials under "Credentials" (OAuth client ID).
  • Choose "Desktop app" for the "Application type". Again, the name doesn't matter.
  • Download the JSON file and save it in your working directory.

If you want more detail about this process, have a look at this video (from 12:00 to 27:00).

After that, executing your first query is as easy as using the following code snippet:

import gscwrapper
#authenticate 
account = gscwrapper.generate_auth(
    'config/client_secret_mvp.json', 
    serialize='config/credentials.json'
)
#we choose the website we want 
webproperty = account['https://www.exemple.com/']
report = (
    webproperty
    #we call the query method 
    .query
    #we define the dates 
    .range(start="2023-01-01", stop="2023-02-01")
    #we define the dimensions 
    .dimensions(['page'])
    #we get the data 
    .get()
)

The above example will use your client configuration file to interactively generate your credentials. You'll then be able to call any available method on the returned object containing your GSC data.

If you're unsure what webproperties are linked to your account, you can run the following code, which will return a DataFrame with your webproperties and their permission levels.

account.list_webproperties()

The first time you run this code, you'll be asked to visit a URL:

  • Copy and paste it in your browser
  • Select the e-mail address you've added as a test user in a previous step
  • Click on "Continue"
  • Click on "Continue" again
  • Copy the authorization code and paste it into the input box that will appear in your terminal / notebook where you run the code

If you prefer to use a service account key, the process is easier and you just have to run the following code. No need to save the credentials in that case.

account = (
    gscwrapper
    .generate_auth(
        client_config="service_account.json", 
        service_account_auth=True
    )
)

Saving credentials

If you wish to save your credentials, to avoid going through the OAuth consent screen in the future, you can specify a path to save them with serialize='path/to/credentials.json'.

When you want to authenticate a new account, you run:

account = gscwrapper.generate_auth(
    'config/client_secret_mvp.json', 
    serialize='config/credentials.json'
)

Which will save your credentials to a file called credentials.json.

From then on, you can authenticate with:

account = gscwrapper.generate_auth(
    'config/client_secret_mvp.json', 
    credentials='config/credentials.json'
)

Search Analytics

Querying

To query data from the search analytics, you can filter the data you retrieve based on:

  • range: Unlike Josh's library, you need to explicitly define the dates (YYYY-MM-DD format).
report = (
    webproperty
    .query
    .range(start="2023-01-01", stop="2023-02-01")
    .dimensions(["date"])
    .get()
)
  • dimensions: dimensions need to be passed as a list. You cannot get data without specifying at least one dimension.
report = (
    webproperty
    .query
    .range(start="2023-01-01", stop="2023-02-01")
    .dimensions(["date"])
    .get()
)

Please be aware that you may be affected by data sampling based on the number of dimensions you request. I strongly advise you to include only the dimensions you need; otherwise the data extraction may take more time than necessary.

  • filter: you can decide to analyse just a part of your website. You can filter using any dimension or operator included below:
DIMENSIONS = ['country','device','page','query','searchAppearance','date']
OPERATORS = ['equals','notEquals','contains','notContains','includingRegex','excludingRegex']

If you use a REGEX, I strongly advise you to test it using the GSC UI first, because some characters are not supported.

IMPORTANT: you can filter on a dimension that is not included in your report, as the examples below show (filtering on page while only requesting the date dimension).

report = (
    webproperty
    .query
    .range(start="2023-01-01", stop="2023-02-01")
    .filter("page", "blog", "contains")
    .dimensions(["date"])
    .get()
)
  • search_type: by default, data is returned for web search results. You can use this method to retrieve data for another search type, such as discover in the example below.
report = (
    webproperty
    .query
    .range(start="2023-01-01", stop="2023-02-01")
    .filter("page", "blog", "contains")
    .dimensions(["date"])
    .search_type('discover')
    .get()
)
  • limit: by default, the library will query the API and try to retrieve all the available data. You can decide to retrieve a specific number of results using this method.
report = (
    webproperty
    .query
    .range(start="2023-01-01", stop="2023-02-01")
    .filter("page", "blog", "contains")
    .dimensions(["date"])
    .limit(50)
    .get()
)

When you run any of these code snippets, you'll generate a Report object.

Report

A Report object contains all the data you downloaded from GSC. I developed a couple of methods you can freely use based on your needs. Depending on the method you use, you'll need specific dimensions in your report. For instance, you cannot call the ctr_yield_curve() method without the date and the query dimensions.

If you are not using the API (working with the bulk export for instance), you can load a Report object from any DataFrame using the following code:

(
    gscwrapper
    .query
    .Report(
        #the dataframe where you have your data 
        x, 
        #the name of the webproperty 
        "https://www.website.com/",
        #min date 
        "2023-01-01",
        #max date 
        "2023-01-30"
    )
)

You'll then be able to use any of the available methods.
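
For instance, here is a minimal sketch of building a Report from a bulk export loaded with pandas. The file name and its columns are hypothetical; the Report() call itself is the one shown above.

import pandas as pd
import gscwrapper

#hypothetical file name: any DataFrame with GSC-style columns works 
data = pd.read_csv("gsc_bulk_export.csv")

report = (
    gscwrapper
    .query
    .Report(
        data, 
        "https://www.website.com/",
        "2023-01-01",
        "2023-01-30"
    )
)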

Available methods:

show_data()

Required dimensions: None | Required metrics: None | Output: pd.DataFrame

This method is pretty straightforward. It returns your data as a Pandas DataFrame. Useful to check what the API has returned and to perform ad-hoc analyses that are not covered by the other methods from this library.
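
For instance, assuming no argument is needed:

(
    report
    .show_data()
)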

ctr_yield_curve()

Required dimensions: query / date | Required metrics: clicks / impressions / position | Output: pd.DataFrame

You can call this method to build a CTR yield curve with your data.

(
    report 
    .ctr_yield_curve()
)

For example, you could get this output.

position ctr clicks impressions kw_count
1.0 19.02 807 4242 1516
2.0 11.66 476 4084 1026
3.0 15.1 340 2252 637
4.0 12.2 268 2196 730
5.0 7.16 126 1761 551
6.0 5.49 195 3552 909
7.0 3.66 107 2921 696
8.0 3.27 102 3124 930
9.0 2.17 67 3087 897
10.0 1.51 73 4841 1250

group_data_by_period()

Required dimensions: date | Required metrics: None | Output: pd.DataFrame

We often need to compare weekly or monthly data, while GSC only provides daily data. Using this method, you can resample your data.

Accepted periods are the following:

  • D: day
  • W: week
  • M: month
  • Q: quarter
  • Y: year
report.group_data_by_period('W')

This will output a table where the main metrics (clicks & impressions) are grouped by the new period chosen.

date clicks impressions
2023-01-01 206 8109
2023-01-08 1871 69275
2023-01-15 1998 67207
2023-01-22 1706 60436
2023-01-29 1980 67552
2023-02-05 935 29155

active_pages()

Required dimensions: page | Required metrics: clicks / impressions | Output: pd.DataFrame

We sometimes have to know the percentage of pages that are active from a list of URLs. This method allows you to compare the pages you have in your GSC report to a list of URLs.

You can either:

  • pass a manual list of URLs
  • provide a sitemap or sitemap index URL and the library will download all the URLs included there.
(
    report
    .active_pages(
        sitemap_url = "https://www.website.com/sitemap_index.xml"
    )
)

The outcome: a table telling you which pages from your list are active (based on impressions or clicks).

page clicks impressions active_impression active_clicks
https://www.website.com/blog/content_1 1.0 21.0 True True
https://www.website.com/blog/content_2 0.0 78.0 True False
https://www.website.com/blog/content_3 1.0 161.0 True True

cannibalization()

Required dimensions: query / page | Required metrics: clicks / impressions | Output: pd.DataFrame

GSC can be a fabulous tool to find cannibalization at scale. The trick is to remove false positives and cases of good cannibalization. This method can be used to find queries where:

  • we have more than one page ranking for a specific query
  • we have more than one page representing at least 10% of the clicks for a given query
  • this query represents at least 10% of the total clicks of the page

This definition is subjective and you are free to have a look at the source code to create your own based on your projects.

(
    report 
    .cannibalization(brand_variants=['brand','mybrand'])
)

IMPORTANT: you need to provide the common branding structures as a list to remove these cases from the cannibalization analysis. Otherwise, you would end up with a lot of false positives on your branded terms.

This method would return the following table, with a selection of the pages that seem to suffer from cannibalization. You can then define what you want to do with them based on the SEO context.

page query clicks_query impressions click_pct clicks_page click_pct_page
https://www.website.com/blog/content_1 xxxxx 11 385 57.89 44 25.00
https://www.website.com/blog/content_2 xxxxx 8 191 42.10 23 34.78
https://www.website.com/blog/content_1 yyyyy 15 387 62.50 44 34.09
https://www.website.com/blog/content_2 yyyyy 9 148 37.50 23 39.13

forecast()

Required dimensions: date | Required metrics: clicks | Output: pd.DataFrame

This method is used to forecast traffic, using Prophet. Using the data available in our Report object, we can use this external library to forecast future clicks.

The function is simple to use, but please note that I do not advise creating a forecast if you don't have at least a decent number of days in your Report object; otherwise the forecast will be inaccurate.

You can specify the number of days you want to forecast as the only accepted parameter for this method.

(
    report 
    .forecast(days=10)
)

This method returns the DataFrame as created by Prophet, and if you want to understand the column names, have a look at their documentation.
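
For instance, a minimal sketch to inspect the forecast, assuming the standard Prophet output columns (ds, yhat, yhat_lower, yhat_upper):

forecast = (
    report 
    .forecast(days=10)
)

#keep only the date, the prediction and its uncertainty interval 
#these column names come from Prophet, not from gscwrapper 
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(10))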

brand_vs_no_brand()

Required dimensions: query / date | Required metrics: None | Output: pd.DataFrame

This method allows you to simply compare your branded and non-branded clicks / impressions over time.

(
    report 
    .brand_vs_no_brand(brand_variants=['brand', 'mybrand'])
)

It returns a clean table with your clicks & impressions over time that allows you to see how your traffic is evolving on branded and non-branded terms.

date clicks_brand impressions_brand clicks_no_brand impressions_no_brand
2023-01-01 0.0 4.0 60 4917
2023-01-02 0.0 15.0 84 6648
2023-01-03 1.0 10.0 88 5401
2023-01-04 0.0 11.0 80 5390
2023-01-05 0.0 12.0 80 5401
2023-01-06 0.0 13.0 79 5492
2023-01-07 0.0 11.0 80 5458

keyword_gap()

Required dimensions: query | Required metrics: None | Output: pd.DataFrame

Some common SEO tools such as Semrush, Ahrefs or Sistrix allow you to perform a keyword gap analysis.

This is often the first step to understand what content you may want to create for your projects, if it makes sense.

This method allows you to perform a similar operation by comparing the keywords you have in your Report object with any other list of keywords.

(
    report 
    .keyword_gap(
        #the DataFrame where your list of keywords is stored
        df, 
        #the column name where your keywords are stored
        column='keyword',
    )
)

This method will filter your df to keep only the keywords that are not included in your Report object.

causal_impact()

Required dimensions: date | Required metrics: clicks | Output: Causal Impact object

GSC is a great tool to understand if some changes applied to a specific set of pages are having a positive impact.

This method allows you to use Causal Impact to infer the expected effect a given intervention (or any action) had on some response variable by analyzing differences between expected and observed time series data.

(
    report 
    .causal_impact(
        intervention_date="2023-01-01",
    )
)

To ensure that the results make sense, you need to have at least the same number of days before and after the intervention date in your Report object.

This method will return a ci object. Refer to the documentation to understand how you can explore the results.
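
As a sketch of how you could explore it, assuming the usual interface of the Python causalimpact package (summary() and plot() belong to that package, not to gscwrapper):

ci = (
    report 
    .causal_impact(intervention_date="2023-01-01")
)

#both calls below come from the causalimpact package 
print(ci.summary())   #numerical summary of the estimated effect 
ci.plot()             #observed vs. expected clicks around the intervention 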

update_urls()

Required dimensions: page | Required metrics: None | Output: Report object

This method is one you need if you're dealing with a migration. When you want to compare the traffic before / after, you need to manually update the URLs returned by GSC based on a redirect mapping.

This method simplifies the process:

  • you provide a redirect mapping with from and to columns
  • the method will update your Report object by replacing the from URLs with the to URLs.
(
    report 
    .update_urls(
        #the dataframe where we have the from & to columns 
        redirect_mapping=redirects 
    )
)

As the method just updates the object itself, you can then call any other method freely.

extract_search_volume()

Required dimensions: query | Required metrics: None | Output: pd.DataFrame

GSC is a goldmine when it comes to keyword discovery, but it is sometimes handy to know the search volume of a given keyword we have in our dataset.

Even if keyword volume isn't the only data point you should use to assess a content's potential, in some cases it must be used.

This method leverages DataForSEO, an API I often recommend to get search volumes at scale, to retrieve the search_volume for your keywords.

Please note that this method is designed to work with any dataset size, but I do not recommend using it if you have more than 100,000 keywords. In that case, use a separate script.

(
    report 
    .extract_search_volume(
        #the location code from DataForSEO
        ## 2250 is France 
        location_code=2250, 
        #credentials for the API 
        client_email="[email protected]", 
        client_password="xxxxx", 
        #If you just want to calculate the cost 
        calculate_cost=True
    )
)

To avoid any cost-related issue, you need to explicitly set the calculate_cost parameter to False to run the extraction. For now, search volumes are extracted for Google only.

find_potential_contents_to_kill()

Required dimensions: page | Required metrics: clicks / impressions | Output: pd.DataFrame

This method is similar to active_pages(): you provide a sitemap URL and the function returns the contents based on your clicks_threshold and/or impressions_threshold.

(
    report 
    .find_potential_contents_to_kill(
        #my sitemap 
        "https://www.website.com/sitemap_index.xml", 
        #the threshold 
        clicks_threshold=0, 
        impressions_threshold=0,
    )
)

This code would return a DataFrame like the following one:

loc clicks impressions ctr position
https://www.website.com/ 0.0 0.0 0.0 0.0
https://www.website.com/blog/2024/01/09/xxx 0.0 0.0 0.0 0.0
https://www.website.com/blog/2024/01/09/yyy 0.0 0.0 0.0 0.0

Please note that you shouldn't just kill contents that have no impressions / clicks, because:

  • Some are useful even if they do not generate SEO traffic
  • Some of them may have been published only a few days ago
  • Some of them may need to be updated

But it speeds up a part of this process.

find_content_decay()

Required dimensions: page / date | Required metrics: clicks | Output: pd.DataFrame

Content Decay is something we always have to investigate to ensure that our best performing contents always stay at the top of the SERPs.

It is indeed easier to improve the ranking of an existing content than to rank a new one (all things being equal, obviously). This method analyzes your data and returns the contents that seem to be suffering from this issue.

(
    report 
    .find_content_decay(
        threshold_decay=0.25, 
        threshold_clicks=100
    )
)

This code would return a list of contents with the following characteristics:

  • During the last (full) month of available data in the Report object, the page generated at least 25% fewer clicks than it did during its peak month
  • During its peak, the content generated at least 100 clicks

When you get the output, you need to add your industry knowledge to understand what is going on because:

  • Seasonality can affect the outcome. Indeed, if your peak month is August and you run the analysis in December, all your contents may be "decaying".
  • A new SERP layout can also affect your CTR and hence the output of this method

Still, it's a good way to speed up the process and come up with a smaller list of contents to update to protect your key positions.

pages_not_in_sitemap()

Required dimensions: page | Required metrics: None | Output: pd.DataFrame

This method is self-explanatory.

It will allow you to find the pages in your GSC Report object that are not included in a sitemap you provide.
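
A minimal sketch, assuming the method accepts a sitemap URL the same way active_pages() does (the parameter name below is an assumption):

(
    report 
    .pages_not_in_sitemap(
        #assumed parameter name, mirroring active_pages() 
        sitemap_url="https://www.website.com/sitemap_index.xml"
    )
)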

winners_losers()

Required dimensions: page / date | Required metrics: clicks | Output: pd.DataFrame

This method is especially useful after Google updates. It allows you to quickly know which contents are generating fewer / more clicks between two periods.

(
    report 
    .winners_losers(
        period_from=['2023-01-01', '2023-01-15'],
        period_to=['2023-01-16', '2023-01-31'],
    )
)

find_long_tail_keywords()

Required dimensions: query | Required metrics: None | Output: pd.DataFrame

Quickly filter your GSC Report object based on the number of words included in your keywords.

For instance:

(
    report 
    .find_long_tail_keywords(number_of_words=7)
)

This will simply filter your data to include only keywords composed of at least 7 words.

find_ctr_outliers()

Required dimensions: query / date | Required metrics: clicks / impressions / position | Output: pd.DataFrame

This method allows you to find CTR outliers.

  • It first calls the ctr_yield_curve() method to build a custom basis for comparison. I strongly advise you to filter the Report object accordingly to ensure that you are not mixing several SERP layouts.
  • It then compares the expected CTR (based on weighted average position) with the real CTR (based on clicks & impressions) to find outliers.
(
    report 
    .find_ctr_outliers()
)

abcd()

Required dimensions: None | Required metrics: None | Output: pd.DataFrame

This method allows you to assign an ABCD rank to a metric based on its cumulative percentage contribution:

  • A: belongs to the top 50%
  • B: between 50 & 75%
  • C: between 75 and 90%
  • D: between 90 and 100%

For instance, if we run the following code:

(
    report
    .abcd('clicks')
 )

we could get this table as the output:

country clicks abcd
mex 5955 A
ecu 1936 B
ven 1447 B
esp 765 C
col 716 C

pages_per_day()

Required dimensions: date / page | Required metrics: None | Output: pd.DataFrame

This method has been designed with Google Discover in mind. You can easily know how many pages have appeared per day using this method.

(
    report
    .pages_per_day()
 )

This would return a table with the number of pages per day, as the name of the method suggests.

date page
2024-01-01 1901
2024-01-02 2544
2024-01-03 2761
2024-01-04 2853
2024-01-05 2473
2024-01-06 2281
2024-01-07 2796

pages_lifespan()

Required dimensions: date / page | Required metrics: None | Output: pd.DataFrame

Similar to pages_per_day(), but it tells you the average lifespan of a page in your dataset.
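
For instance, assuming it takes no argument:

(
    report
    .pages_lifespan()
)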

seasonality_per_day()

Required dimensions: date | Required metrics: clicks / impressions | Output: pd.DataFrame

This method allows you to quickly understand the weekly seasonality you have in your data.

(
    report
    .seasonality_per_day()
 )

It would return a table like the following:

date click impressions
Monday 2218 352148
Tuesday 2456 399690
Wednesday 2532 414933
Thursday 2312 381517
Friday 1081 195331
Saturday 827 142410
Sunday 1267 210677

replace_query_from_list()

Required dimensions: query | Required metrics: None | Output: pd.DataFrame

When you are working on a project where pages are created at scale, you don't want to know what the most common keywords are, but what the most common structures are.

To achieve this objective, this method allows us to replace any occurrence of an element of a list in our query column. For instance, if you have a travel website, you won't get "flight paris barcelona" but "flight element element", assuming that you provide a list of cities.

This will considerably speed up your analysis to optimize your templates.
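
For instance, word_list in the snippet below could be a hypothetical list of cities; any query containing one of these values would then be reported with "element" in its place, as in the example above.

#hypothetical list of values to replace in the query column 
word_list = ['paris', 'barcelona', 'madrid']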

(
    report
    .replace_query_from_list(word_list)
 )

Sitemaps

You can also query the Sitemaps by creating a Sitemap object.

import gscwrapper

#authenticate 
account = gscwrapper.generate_auth(
    'config/client_secret_mvp.json', 
    serialize='config/credentials.json'
)

#we choose the website we want 
webproperty = account['https://www.exemple.com/']
#we create the sitemap object 
sitemap = (webproperty.sitemap)

list_sitemaps()

We can easily list the sitemaps that are included in GSC:

(
    sitemap
    .list_sitemaps()
)

This would return a DataFrame similar to the following one:

path lastSubmitted isPending isSitemapsIndex lastDownloaded warnings errors contents type
https://www.website.com/sitemap_index.xml 2023-12-23T23:41:05.453Z False True 2024-01-09T15:07:27.851Z 0 0 [{'type': 'web', 'submitted': '475', 'indexed': '0'}, {'type': 'image', 'submitted': '457', 'indexed': '0'}]
https://www.website.com/page-sitemap.xml 2019-10-15T15:44:16.831Z False False 2024-01-04T09:48:28.500Z 3 0 [{'type': 'web', 'submitted': '14', 'indexed': '0'}] sitemap

check_sitemaps()

This method will check if the sitemaps we have in GSC are all returning a 200 response code. If you have a heavy IT policy in place, please note that the response code this method gets and the one Google would get may be different.

(
    sitemap
    .check_sitemaps()
)

path response_code
https://www.website.com/sitemap_index.xml 200
https://www.website.com/page-sitemap.xml 200

URL Inspection

You can also query the URL Inspection API by creating an Inspect object.

import gscwrapper

#authenticate 
account = gscwrapper.generate_auth(
    'config/client_secret_mvp.json', 
    serialize='config/credentials.json'
)

#we choose the website we want 
webproperty = account['https://www.exemple.com/']
#we create the inspect object 
inspect = (webproperty.inspect)

add_urls()

This method allows you to add URLs to the Inspect object. These URLs are the ones you'll send to the API in the execute() call to get the data for.

(
 inspect
 .add_urls([
     'https://www.website.com/page',
     'https://www.website.com/other-page'
     ]
    )
)

remove_urls()

Same logic but to remove URLs from the Inspect object.
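
A sketch, assuming it mirrors add_urls() and takes a list of URLs:

(
    inspect
    .remove_urls(['https://www.website.com/other-page'])
)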

execute()

This will call the API for every unique URL you added using the add_urls() method. Depending on the number you have, the extraction can take a moment.

(
    inspect 
    .execute()
)

If for whatever reason the execution fails during the process, you can retrieve the results that have already been generated by inspecting inspect.results.
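
For instance, a minimal sketch of keeping the partial results if the extraction is interrupted:

try:
    inspect.execute()
except Exception:
    #rows fetched before the failure remain available on the object 
    partial_results = inspect.results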
