The seo-monitoring from oxylabs

Scraping Experts - Building SEO Monitoring System using Python, Celery, and a SERP Scraper API

Video

Building an SEO Monitoring System with Python, Celery, and SERP Scraper API

Abstract

This solution applies the data engineering principles of data ingestion and processing, combined with remote API calls for data enrichment.

The features are as follows:

Accepts a CSV or XLSX file as input for keyword SERP scraping
Moves the input file to a different directory after it has been processed
Cleans the input keywords and prepares them for submission to the Oxylabs SERP Scraper API
Uses Celery to send parallel requests to the SERP Scraper API (see the docker-compose file for use of the --autoscale parameter)
Aggregates the responses in exactly the same order as the tasks were submitted to the Celery worker
Applies retries and timeouts to the Celery tasks
Authenticates each request to the SERP Scraper API
Produces a new output file (CSV or XLSX) with the results from the SERP Scraper API
Continuously watches for new input files to process
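
The ordering and retry behaviour listed above can be illustrated without Celery. The sketch below is not the project's code: it swaps in a standard-library thread pool for the Celery workers, and the names with_retry, scrape_all and fake_fetch are hypothetical. Timeouts are left out to keep it short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(func, retries=3, delay=0.0):
    """Wrap func so a failing call is attempted up to `retries` times."""
    def wrapper(*args):
        last_error = None
        for _ in range(retries):
            try:
                return func(*args)
            except Exception as error:  # broad on purpose in this sketch
                last_error = error
                time.sleep(delay)
        raise last_error
    return wrapper


def scrape_all(keywords, fetch, max_workers=5):
    """Run fetch(keyword) in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order,
        # regardless of which call finishes first.
        return list(pool.map(with_retry(fetch), keywords))


def fake_fetch(keyword):
    # Hypothetical stand-in for a real SERP Scraper API request.
    return {"keyword": keyword, "length": len(keyword)}


results = scrape_all(["sample1", "sample2", "other"], fake_fetch)
```

The point of the sketch is that results are collected by submission position rather than by completion time, which is what keeps the output rows aligned with the input keywords.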

Installation

This project uses Python 3.10.x and runs in a virtual environment (venv), so make sure Python 3.10 is installed on your local system.

Credentials and configuration

To configure the application, copy the bundled dist.env to .env and update the parameters as needed (refer to the Oxylabs SERP Scraper API docs):

SERP configuration

SERP_TARGET=xxxxxxx (Refer to the Oxylabs SERP Scraper API docs)
SERP_DOMAIN=xxxxxxx (Refer to the Oxylabs SERP Scraper API docs)
SERP_PARSE_RESULT=True (Should SERP Scraper API parse the results?)
SERP_LANGUAGE=en
SERP_PAGES=5 (how many pages to scrape)

Local directories and file watcher poll interval (in seconds)

INPUT_KEYWORDS="./input" (Where the keyword input file is placed)
INPUT_PROCESSED="./input/processed" (Where the processed keyword input file is moved)
OUTPUT_KEYWORDS="./output" (Where the result output file is written)
OUTPUT_FILE_TYPE=xlsx (What OUTPUT file type to use [CSV/XLSX])
OUTPUT_FILE_NAME=keywords_serps (What name to use for OUTPUT file)
INPUT_POLL_TIME=5 (How many seconds to wait before checking for new input files)

SERP Scraper API authentication

OXY_SERPS_AUTH_USERNAME=XXXXX
OXY_SERPS_AUTH_PASSWORD=YYYYY
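
As a rough illustration of how these settings might be turned into a request, the sketch below reads them from a mapping (defaulting to the process environment) and builds a payload and a credentials pair. The payload key names (source, domain, query, parse, locale, pages) and the example values are assumptions to be checked against the Oxylabs SERP Scraper API docs. The helper names are hypothetical and not taken from the project.

```python
import os


def as_bool(value):
    """Interpret common truthy strings found in a .env file."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def build_payload(keyword, env=None):
    """Build one request payload from the SERP_* settings."""
    env = os.environ if env is None else env
    return {
        # Assumption: how SERP_* names map to payload keys.
        "source": env["SERP_TARGET"],
        "domain": env["SERP_DOMAIN"],
        "query": keyword,
        "parse": as_bool(env.get("SERP_PARSE_RESULT", "False")),
        "locale": env.get("SERP_LANGUAGE", "en"),
        "pages": int(env.get("SERP_PAGES", "1")),
    }


def build_auth(env=None):
    """Return the (username, password) pair for HTTP basic auth."""
    env = os.environ if env is None else env
    return (env["OXY_SERPS_AUTH_USERNAME"], env["OXY_SERPS_AUTH_PASSWORD"])


# Placeholder values for illustration only.
example_env = {
    "SERP_TARGET": "google_search",
    "SERP_DOMAIN": "com",
    "SERP_PARSE_RESULT": "True",
    "SERP_LANGUAGE": "en",
    "SERP_PAGES": "5",
    "OXY_SERPS_AUTH_USERNAME": "XXXXX",
    "OXY_SERPS_AUTH_PASSWORD": "YYYYY",
}
payload = build_payload("sample1", example_env)
auth = build_auth(example_env)
```

Values read from a .env file arrive as strings, so the boolean and integer settings are converted explicitly.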

Local (Mac)

Check out the scraping-experts-seo-monitoring source
Run: cd scraping-experts-seo-monitoring
Run: python3.10 -m venv venv
Run: source venv/bin/activate
Run: pip install --upgrade pip wheel setuptools
Run: pip install -r requirements.txt

Additionally, the word tokenizer requires NLTK data packages to be downloaded. After the project is installed, run the following:

Run: cd scraping-experts-seo-monitoring
Run: source venv/bin/activate
Run: python (this opens the Python interactive prompt)
Run: import nltk; nltk.download('punkt')
Run: import nltk; nltk.download('stopwords')
Use CTRL+D to exit the Python prompt
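
The project's cleaning code is not shown here, but the effect of tokenizing a keyword and dropping stopwords can be sketched with the standard library alone. The example below does not use NLTK. It substitutes a regular-expression tokenizer and a tiny hand-written stopword set for the punkt and stopwords data, and the function name clean_keyword is hypothetical.

```python
import re

# Tiny illustrative set; the NLTK stopwords corpus is far larger.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "to", "and"}


def clean_keyword(raw):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)


cleaned = clean_keyword("  The BEST shoes, for running! ")
```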

Now you should be able to develop the project locally in your favourite IDE.

Docker (using Docker Compose)

Check out the scraping-experts-seo-monitoring source
Run: cd scraping-experts-seo-monitoring
Run: docker-compose build
Run: docker-compose up -d --scale worker=5 && docker-compose logs -f
To stop the running services, exit the log watch mode with CTRL+C and run: docker-compose down

INPUT file

The input keywords file must be placed in the root of the /input directory. The Python application scans this directory for new files every INPUT_POLL_TIME seconds and starts processing a file as soon as it finds one.
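
A minimal sketch of such a polling loop, assuming only the behaviour described in this README (scan the root of the input directory, process each CSV or XLSX file, then move it to the processed directory). The function names and the max_polls parameter are hypothetical. The project's actual watcher may work differently.

```python
import shutil
import time
from pathlib import Path

SUPPORTED = {".csv", ".xlsx"}


def find_input_files(input_dir):
    """Return supported files in the root of input_dir, sorted by name."""
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED
    )


def watch(input_dir, processed_dir, handle, poll_seconds=5, max_polls=None):
    """Poll input_dir; pass each file to handle(), then move it away."""
    processed = Path(processed_dir)
    processed.mkdir(parents=True, exist_ok=True)
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_input_files(input_dir):
            handle(path)
            # Moving the file prevents it from being picked up again.
            shutil.move(str(path), str(processed / path.name))
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

Calling watch("./input", "./input/processed", handler) with a handler function of your own would loop indefinitely. Passing max_polls limits the number of scans, which is convenient for testing.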

The application expects the XLSX (or CSV) file to have the following format:

XLSX

Keyword
sample1
sample2
other

CSV (with header)

keyword
sample1
sample2
other
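
For the CSV variant, reading the keyword column could look roughly like the sketch below. It uses only the standard csv module. Matching the header case-insensitively is a choice made here because the two samples above differ in capitalisation, not confirmed project behaviour. The function name read_keywords is hypothetical.

```python
import csv
import io


def read_keywords(handle):
    """Read non-empty keywords from a CSV file object with a 'keyword' header."""
    reader = csv.DictReader(handle)
    column = next(
        (name for name in reader.fieldnames or []
         if name.strip().lower() == "keyword"),
        None,
    )
    if column is None:
        raise ValueError("expected a 'keyword' header column")
    return [
        row[column].strip()
        for row in reader
        if row[column] and row[column].strip()
    ]


sample = io.StringIO("keyword\nsample1\nsample2\nother\n")
keywords = read_keywords(sample)
```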
