danoctua / tradeshows-scrape

Set of trade shows crawling spiders

tradeshows-scrape's Introduction

Trade Shows scraping tool

A simple scraping tool for extracting exhibitor data from trade show pages.

Current features:

base spider that generates the start request(s).
base exhibitor item with predefined fields (add new fields if required).
export-to-CSV pipeline (the crawls of each spider are grouped in their own folder).

Configuration

To use this tool, be sure to install all dependencies from the requirements.txt file.

Development

Automatic

There's a custom command to create a new trade show spider: scrapy createspider.

You'll have to enter some required fields for the spider, and the core will be generated from the template without any manual steps.
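
To illustrate the idea of generating a spider from a template, here is a minimal sketch using the standard library's string.Template. The template text, the render_spider helper, and its parameters are all hypothetical; the actual createspider command may ask for different fields and use a different template.

```python
from string import Template

# Hypothetical template; the repository's real template may look different.
# BaseSpider must be imported in the generated file (the path depends on the repo).
SPIDER_TEMPLATE = Template('''\
class ${class_name}(BaseSpider):
    name = ${name}
    EXHIBITION_NAME = ${exhibition_name}
    EXHIBITION_DATE = ${exhibition_date}
    EXHIBITION_WEBSITE = ${exhibition_website}
    URLS = [${start_url}]

    def fetch_exhibitors(self, response):
        raise NotImplementedError

    def parse_exhibitors(self, response):
        raise NotImplementedError
''')


def render_spider(name, exhibition_name, exhibition_date, exhibition_website, start_url):
    """Return the source code of a spider skeleton filled in from the template."""
    # "homi_milano" becomes "HomiMilanoSpider".
    class_name = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return SPIDER_TEMPLATE.substitute(
        class_name=class_name,
        # repr() quotes and escapes each value so the output stays valid Python.
        name=repr(name),
        exhibition_name=repr(exhibition_name),
        exhibition_date=repr(exhibition_date),
        exhibition_website=repr(exhibition_website),
        start_url=repr(start_url),
    )
```

The returned string could then be written to a file in the spiders folder, where the two callbacks are filled in by hand.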

Manual

To create a new spider, follow these steps:

check whether a spider already exists for this trade show.
if not, create a spider file in the spiders folder (let's keep the same convention here: trade_show_name_spider.py).
- inherit from the BaseSpider (or any existing base spider), which for now provides the start_requests method.
- add the following variables:
  - name - basic spider name that you can refer to later and that is used to identify the scraped data.
  - EXHIBITION_DATE, EXHIBITION_NAME, EXHIBITION_WEBSITE - from the ticket description.
  - URLS - initial URLs from which the exhibitors list is extracted.
  - [optional] HEADERS - headers you want to send along with the initial request.
- override two methods:
  - fetch_exhibitors - callback for the initial URLs. Select the exhibitors here and set the next method as the callback.
  - parse_exhibitors - extract the exhibitor data and yield the Exhibitor item from this method.

Pipelines

Currently available pipelines (please add short documentation when you create a new one):

PrefetchExhibitionDataPipeline - inserts the exhibition data you've provided as spider attributes.
ExportItemPipeline - exports items to a CSV file.
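
As a rough picture of what these two pipelines do, here is a minimal sketch built only on the standard library. It reuses the class names from the list above, but the bodies are assumptions: the item field names, the timestamp format, and the base_dir parameter are hypothetical and may not match the repository.

```python
import csv
from datetime import datetime
from pathlib import Path


class PrefetchExhibitionDataPipeline:
    """Copy the exhibition attributes of the spider onto every item."""

    # Spider attribute -> item field; the field names are assumed.
    FIELDS = {
        "EXHIBITION_NAME": "exhibition_name",
        "EXHIBITION_DATE": "exhibition_date",
        "EXHIBITION_WEBSITE": "exhibition_website",
    }

    def process_item(self, item, spider):
        # With scrapy.Item objects these fields must be declared on the item.
        for attribute, field in self.FIELDS.items():
            item[field] = getattr(spider, attribute, None)
        return item


class ExportItemPipeline:
    """Write items to <base_dir>/<spider name>/Crawl-<start datetime>.csv."""

    def __init__(self, base_dir="result"):
        self.base_dir = Path(base_dir)

    def open_spider(self, spider):
        folder = self.base_dir / spider.name
        folder.mkdir(parents=True, exist_ok=True)
        # The timestamp format is an assumption.
        started = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
        self.file = open(folder / f"Crawl-{started}.csv", "w", newline="", encoding="utf-8")
        self.writer = None

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            # The CSV header is taken from the fields of the first item.
            self.writer = csv.DictWriter(self.file, fieldnames=list(row), extrasaction="ignore")
            self.writer.writeheader()
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()
```

Scrapy calls open_spider, process_item, and close_spider on each enabled pipeline, so neither class needs to inherit from a Scrapy base class.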

Middlewares

Available middlewares:

ProxyDownloaderMiddleware - middleware that sets a proxy for the spider's requests. Remember to set the PROXY attribute on the spider if you're using this middleware. See the middleware documentation and usages for more details.
ProxySessionMiddleware - the same as above, but it also keeps the same IP for all requests, which is useful in some cases. Currently this feature is proxy-dependent, as only Zyte supports it, but it may be extended to other proxies as well.
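
The core of a proxy middleware can be sketched in a few lines. This is a simplified illustration rather than the repository's implementation; in particular, it assumes that PROXY holds a proxy URL string, which the source does not confirm.

```python
class ProxyDownloaderMiddleware:
    """Route a spider's requests through the proxy named in its PROXY attribute."""

    def process_request(self, request, spider):
        # Assumption: PROXY is a URL such as "http://user:pass@host:port".
        proxy = getattr(spider, "PROXY", None)
        if proxy:
            # Scrapy honours the "proxy" key in request.meta.
            request.meta["proxy"] = proxy
        # Returning None lets Scrapy continue processing the request.
        return None
```

Spiders without a PROXY attribute pass through unchanged, so the middleware can stay enabled project-wide.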

Running

To run your crawl, execute the following command in the root folder: scrapy crawl [SPIDER_NAME]

You will then see the logs and output of the spider. The spider finishes execution once it has fetched all the data.

Go to result/[SPIDER_NAME] to find all the crawl results. The naming format is Crawl-[CRAWL_START_DATETIME].csv.
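
To inspect the output of the most recent crawl from Python, something like the following sketch could be used. The latest_crawl_rows helper is hypothetical; because the exact datetime format in the file name is not documented here, it picks the newest file by modification time instead of parsing the name.

```python
import csv
from pathlib import Path


def latest_crawl_rows(spider_name, result_dir="result"):
    """Return the rows of the newest Crawl-*.csv for a spider as a list of dicts."""
    folder = Path(result_dir) / spider_name
    # Sort by modification time so the last entry is the most recent crawl.
    files = sorted(folder.glob("Crawl-*.csv"), key=lambda path: path.stat().st_mtime)
    if not files:
        return []
    with files[-1].open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling latest_crawl_rows with a spider name from the root folder returns an empty list when that spider has not produced any crawl yet.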

Feel free to use it and add some docs here when you create useful pipelines/middlewares.

List of supported trade shows

This list is generated with the scrapy readme command.

tradeshows-scrape's Issues

Investigate A2Z blocking issue for some targets

Some exhibitions, such as the New York Now show and Sports Tailgate, block requests even when all headers are sent to the target. Manual anti-bot validation is required. The only way to crawl those pages is to use a proxy.

Homi Milano blocking

Homi Milano blocks requests with a 500 status code.

Debug why this is happening
Crawl successfully

Create a simple UI to run spiders

Create a simple UI to run all available spiders.

Add tests

Add tests to confirm spiders are working after code changes.
