
Scrapy ipfs & filecoin pipelines and feed exports to store items into Web3Storage, LightHouse, Estuary, Pinata, Moralis, Filebase or any s3 compatible services.

Home Page: https://pypi.org/project/scrapy-ipfs-filecoin/

License: MIT License



Welcome to Scrapy-IPFS-Filecoin


Scrapy is a popular open-source, collaborative Python framework for extracting the data you need from websites. scrapy-ipfs-filecoin provides Scrapy pipelines and feed exports to store items in IPFS and Filecoin using services like Web3.Storage, LightHouse.Storage, Estuary, Pinata, Moralis, Filebase, or any S3-compatible service.

๐Ÿ  Homepage

Install

npm install -g https://github.com/pawanpaudel93/ipfs-only-hash.git
pip install scrapy-ipfs-filecoin

Example

scrapy-ipfs-filecoin-example

Usage

  1. Install ipfs-only-hash and scrapy-ipfs-filecoin.
npm install -g https://github.com/pawanpaudel93/ipfs-only-hash.git
pip install scrapy-ipfs-filecoin
  2. Add 'scrapy_ipfs_filecoin.pipelines.ImagesPipeline' and/or 'scrapy_ipfs_filecoin.pipelines.FilesPipeline' to the ITEM_PIPELINES setting in your Scrapy project if you need to store images or other files to IPFS and Filecoin. For the Images Pipeline, use:
ITEM_PIPELINES = {'scrapy_ipfs_filecoin.pipelines.ImagesPipeline': 1}

For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy_ipfs_filecoin.pipelines.FilesPipeline': 1}

The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.
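As a rough illustration of those extra options, the settings below are the ones Scrapy's built-in ImagesPipeline reads for thumbnails and size filtering. That the scrapy_ipfs_filecoin ImagesPipeline honours them unchanged is an assumption, and the sizes are placeholder values.

```python
# settings.py (sketch): thumbnail and size-filter options from Scrapy's
# built-in ImagesPipeline. Assumed, not confirmed, to be honoured by
# scrapy_ipfs_filecoin.pipelines.ImagesPipeline as well.

# Generate two thumbnails per downloaded image: name -> (width, height).
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Drop images narrower than IMAGES_MIN_WIDTH or shorter than IMAGES_MIN_HEIGHT
# (both in pixels).
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```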

You can also use both the Files and Images Pipelines at the same time:

ITEM_PIPELINES = {
 'scrapy_ipfs_filecoin.pipelines.ImagesPipeline': 1,
 'scrapy_ipfs_filecoin.pipelines.FilesPipeline': 1
}
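With both pipelines enabled, each item needs to carry the URLs to download. The sketch below builds such an item as a plain dict using only the standard library. `image_urls` and `file_urls` are the default input fields of Scrapy's built-in ImagesPipeline and FilesPipeline; that the scrapy_ipfs_filecoin pipelines read the same fields is an assumption, and `build_item` and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def build_item(page_url, image_srcs, file_hrefs):
    """Return a plain-dict item with absolute URLs for both pipelines.

    The field names follow Scrapy's built-in media pipelines (assumed to
    apply to the scrapy_ipfs_filecoin pipelines too).
    """
    return {
        "page": page_url,
        # Relative links are resolved against the page they were found on.
        "image_urls": [urljoin(page_url, src) for src in image_srcs],
        "file_urls": [urljoin(page_url, href) for href in file_hrefs],
    }


item = build_item(
    "https://example.com/houses/42",
    image_srcs=["/img/front.jpg", "side.jpg"],
    file_hrefs=["../docs/floorplan.pdf"],
)
print(item["image_urls"])
print(item["file_urls"])
```

In a spider, a dict like this would typically be yielded from the `parse` callback.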

If you are using the ImagesPipeline, make sure to install the Pillow package. The Images Pipeline requires Pillow 7.1.0 or greater, which it uses for thumbnailing and for normalizing images to JPEG/RGB format.

pip install pillow

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

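Add the store path for files or images for Web3Storage, LightHouse, Moralis, Pinata, Estuary, or an S3-compatible service as required. The alternatives are listed together below; keep only the line for the service you use.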

# for ImagesPipeline
IMAGES_STORE = 'w3s://images' # For Web3Storage
IMAGES_STORE = 'es://images' # For Estuary
IMAGES_STORE = 'lh://images' # For LightHouse
IMAGES_STORE = 'pn://images' # For Pinata
IMAGES_STORE = 'ms://images' # For Moralis
# For Filebase or other S3-compatible services
# bucket-name is the bucket you created; folder-name can be a scraping-specific folder for your files
IMAGES_STORE = "s3://bucket-name/folder-name/images/"

# For FilesPipeline
FILES_STORE = 'w3s://files' # For Web3Storage
FILES_STORE = 'es://files' # For Estuary
FILES_STORE = 'lh://files' # For LightHouse
FILES_STORE = 'pn://files' # For Pinata
FILES_STORE = 'ms://files' # For Moralis
# For Filebase or other S3-compatible services
# bucket-name is the bucket you created; folder-name can be a scraping-specific folder for your files
FILES_STORE = "s3://bucket-name/folder-name/files/"
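Since only one store value per setting can be active, it may help to pick the scheme from a single lookup. The following is a minimal sketch; `STORE_SCHEMES` and `store_uri` are hypothetical helpers, not part of scrapy-ipfs-filecoin, and the scheme prefixes are the ones shown in the IMAGES_STORE examples above.

```python
# Scheme prefix for each non-S3 service, as shown in the IMAGES_STORE examples.
STORE_SCHEMES = {
    "web3storage": "w3s",
    "estuary": "es",
    "lighthouse": "lh",
    "pinata": "pn",
    "moralis": "ms",
}


def store_uri(service, folder):
    """Build a FILES_STORE / IMAGES_STORE value for a non-S3 service."""
    try:
        scheme = STORE_SCHEMES[service.lower()]
    except KeyError:
        raise ValueError(f"unknown service: {service!r}") from None
    return f"{scheme}://{folder}"


# Example: images go to Pinata, other files go to Moralis.
IMAGES_STORE = store_uri("pinata", "images")
FILES_STORE = store_uri("moralis", "files")
```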

For more information about ImagesPipeline and FilesPipeline, see here.

  3. For feed storage, to store the scraped output as json, jsonlines, jsonl, jl, csv, xml, marshal, pickle, etc., set FEED_STORAGES as follows:
from scrapy_ipfs_filecoin.feedexport import get_feed_storages
FEED_STORAGES = get_feed_storages()

Then set the API key for one of the storage services, i.e. Web3Storage, LightHouse, Moralis, Pinata, or Estuary, and set FEEDS as follows to store the scraped data.

For Web3Storage:

W3S_API_KEY = "<W3S_API_KEY>"

FEEDS = {
 'w3s://house.json': {
  "format": "json"
 },
}

For LightHouse:

LH_API_KEY = "<LH_API_KEY>"

FEEDS = {
 'lh://house.json': {
  "format": "json"
 },
}

For Estuary:

ES_API_KEY = "<ES_API_KEY>"

FEEDS = {
 'es://house.json': {
  "format": "json"
 },
}

For Pinata:

PN_JWT_TOKEN = "<PN_JWT_TOKEN>"

FEEDS = {
 'pn://house.json': {
  "format": "json"
 },
}

For Moralis:

MS_API_KEY = "<MS_API_KEY>"

FEEDS = {
 'ms://house.json': {
  "format": "json"
 },
}
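The five blocks above differ only in the URI scheme and the name of the credential setting. One way to avoid hard-coding secrets is to read the credential from the environment, sketched below. `KEY_SETTINGS` and `feed_settings` are hypothetical helpers, and using an environment variable that shares the setting's name is a convention assumed here, not something the package requires.

```python
import os

# Credential setting for each URI scheme, as used in the examples above.
KEY_SETTINGS = {
    "w3s": "W3S_API_KEY",
    "lh": "LH_API_KEY",
    "es": "ES_API_KEY",
    "pn": "PN_JWT_TOKEN",
    "ms": "MS_API_KEY",
}


def feed_settings(scheme, filename, fmt="json", environ=os.environ):
    """Return the credential setting and FEEDS entry for one export target.

    The credential is looked up in `environ` under the setting's own name;
    an empty string is used when the variable is not set.
    """
    key_name = KEY_SETTINGS[scheme]
    return {
        key_name: environ.get(key_name, ""),
        "FEEDS": {f"{scheme}://{filename}": {"format": fmt}},
    }


# A fake environment is passed here so the example does not depend on the shell.
settings = feed_settings("w3s", "house.json", environ={"W3S_API_KEY": "secret"})
print(settings)
```

The returned values can then be assigned to the matching names in settings.py.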

For Filebase or other S3-compatible services:

The S3 pipeline requires botocore, so install it:

pip install botocore
 S3_ACCESS_KEY_ID = "<S3_ACCESS_KEY_ID>"
 S3_SECRET_ACCESS_KEY = "<S3_SECRET_ACCESS_KEY>"
 S3_ENDPOINT_URL = "https://s3.filebase.com"
 S3_IPFS_URL_FORMAT = "https://ipfs.filebase.io/ipfs/{cid}"

 # bucket-name is the bucket you created; folder-name can be a scraping-specific folder for your files

 FEEDS = {
  "s3://bucket-name/folder-name/%(name)s_%(time)s.json": {"format": "json"},
  "s3://bucket-name/folder-name/%(name)s_%(time)s.csv": {"format": "csv"},
 }

See more on FEEDS here
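The `%(name)s` and `%(time)s` parts of the S3 feed URIs are printf-style placeholders that Scrapy fills with the spider name and a timestamp when the feed is created. The sketch below imitates that substitution in plain Python to show the resulting object key. `expand_feed_uri` is a hypothetical helper, and the timestamp format (ISO 8601 with colons replaced by hyphens) is an assumption about how Scrapy renders it.

```python
from datetime import datetime

FEED_URI = "s3://bucket-name/folder-name/%(name)s_%(time)s.json"


def expand_feed_uri(uri_template, spider_name, now):
    """Fill the %(name)s and %(time)s placeholders with printf-style formatting."""
    # Colons are replaced so the timestamp is safe inside an object key.
    timestamp = now.replace(microsecond=0).isoformat().replace(":", "-")
    return uri_template % {"name": spider_name, "time": timestamp}


uri = expand_feed_uri(FEED_URI, "houses", datetime(2022, 11, 5, 9, 30, 0))
print(uri)  # s3://bucket-name/folder-name/houses_2022-11-05T09-30-00.json
```

Because the timestamp changes on every run, each crawl writes a new object instead of overwriting the previous export.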

  4. Now run the scraping as you normally would.

Author

👤 Pawan Paudel

๐Ÿค Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a โญ๏ธ if this project helped you!

Copyright © 2022 Pawan Paudel.
