Spider ported to Python

Home Page: https://spider-rs.github.io/spider-py/

License: MIT License

spider-py

The spider project ported to Python.

Getting Started

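  1. pip install spider_rs

import asyncio

from spider_rs import Website

async def main():
    # Create a crawler for the target site.
    website = Website("https://choosealicense.com")
    # Crawl the site; the discovered links are then available via get_links().
    website.crawl()
    print(website.get_links())

asyncio.run(main())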

View the examples to learn more.
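
As a small follow-up to the snippet above, the links can be post-processed with the standard library alone. The sketch below groups URLs by host. It assumes get_links() yields plain URL strings, which the README does not state explicitly, and group_links_by_host and sample_links are hypothetical names used only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_links_by_host(links):
    """Group URL strings by host name, keeping each group sorted."""
    grouped = defaultdict(list)
    for link in links:
        host = urlsplit(link).netloc
        grouped[host].append(link)
    return {host: sorted(urls) for host, urls in grouped.items()}


# Hypothetical sample standing in for website.get_links(); a real crawl
# returns whatever links the crawler actually found.
sample_links = [
    "https://choosealicense.com/licenses/",
    "https://choosealicense.com/about/",
    "https://github.com/github/choosealicense.com",
]

for host, urls in group_links_by_host(sample_links).items():
    print(host, len(urls))
```

In a real script, sample_links would be replaced by the value returned from website.get_links(), provided the assumption about its return type holds.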

Development

Install maturin (pipx install maturin) and Python.

  1. maturin develop

Benchmarks

See the benchmarks for a breakdown across libraries and platforms.

Test URL: https://espn.com

library                   pages      time
spider(rust): crawl       150,387    1m
spider(nodejs): crawl     150,387    153s
spider(python): crawl     150,387    186s
scrapy(python): crawl     49,598     1h
crawlee(nodejs): crawl    18,779     30m

The benchmarks above were run on an M1 Mac. On Linux ARM machines, spider performs about 2-10x faster.
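
Because the libraries crawled different numbers of pages in different amounts of time, the figures are easier to compare as throughput. The sketch below converts the reported numbers to pages per second. The durations are the rounded values from the benchmark listing (1m, 1h, 30m taken as exact), so the results are approximate, and pages_per_second is an illustrative helper rather than part of any of the benchmarked libraries.

```python
# Figures copied from the benchmark listing; durations are the rounded
# values reported there, converted to seconds.
BENCHMARKS = [
    ("spider(rust)", 150_387, 60),
    ("spider(nodejs)", 150_387, 153),
    ("spider(python)", 150_387, 186),
    ("scrapy(python)", 49_598, 3_600),
    ("crawlee(nodejs)", 18_779, 1_800),
]


def pages_per_second(pages, seconds):
    """Average crawl throughput over the whole run."""
    return pages / seconds


for name, pages, seconds in BENCHMARKS:
    print(f"{name:<16} {pages_per_second(pages, seconds):>8.1f} pages/s")
```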

Issues

Please submit a GitHub issue for any problems you find.

spider-py's People

Contributors

j-mendez

spider-py's Issues

pip install fail

Collecting spider_rs
Using cached http://mirrors.aliyun.com/pypi/packages/26/ba/bc4bf77e7923583aede0ed4a203f2ea5c2d6a2c85fe8d07f4b3f81570c9b/spider_rs-0.0.30.tar.gz (38 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [14 lines of output]
Updating rsproxy index
From https://rsproxy.cn/crates.io-index
* [new ref] HEAD -> origin/HEAD
error: failed to select a version for the requirement spider = "^1.86.11" (locked to 1.86.11)
candidate versions found which didn't match: 1.86.5, 1.86.4, 1.86.3, ...
location searched: rsproxy index (which is replacing registry crates-io)
required by package spider_rs v0.0.30 (C:\Users\m1778\AppData\Local\Temp\pip-install-v_9bx8vo\spider-rs_d3f1942900e14bb9966f68dbe963a045)
perhaps a crate was updated and forgotten to be re-vendored?
💥 maturin failed
Caused by: Cargo metadata failed. Does your crate compile with cargo build?
Caused by: cargo metadata exited with an error:
Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', 'C:\Users\m1778\AppData\Local\Temp\pip-modern-metadata-pzhkjv6h', '--interpreter', 'C:\Users\m1778\.pyenv\pyenv-win\versions\3.11.6\python.exe']' returned non-zero exit status 1.
Checking for Rust toolchain....
Running maturin pep517 write-dist-info --metadata-directory C:\Users\m1778\AppData\Local\Temp\pip-modern-metadata-pzhkjv6h --interpreter C:\Users\m1778\.pyenv\pyenv-win\versions\3.11.6\python.exe
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
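
One reading of the log above, based only on its own lines: cargo resolved crates through the rsproxy index instead of crates.io, and none of the versions that index offered matched spider 1.86.11, which spider_rs 0.0.30 is locked to. The sketch below pulls the locked version and the listed candidates out of the two relevant error lines and confirms the mismatch. The function names and regular expressions are illustrative and not part of pip, cargo, or maturin, and the candidate list is truncated in the log itself, so only the versions shown can be checked.

```python
import re

# The two relevant lines, copied from the log above.
ERROR_LINES = [
    'error: failed to select a version for the requirement spider = "^1.86.11" (locked to 1.86.11)',
    "candidate versions found which didn't match: 1.86.5, 1.86.4, 1.86.3, ...",
]

VERSION = r"\d+(?:\.\d+)*"


def parse_locked_version(line):
    """Return the version cargo reports the dependency is locked to, or None."""
    match = re.search(rf"\(locked to ({VERSION})\)", line)
    return match.group(1) if match else None


def parse_candidates(line):
    """Return the versions listed after the colon, skipping the elided '...'."""
    _, _, tail = line.partition(": ")
    parts = [part.strip() for part in tail.split(",")]
    return [part for part in parts if re.fullmatch(VERSION, part)]


locked = parse_locked_version(ERROR_LINES[0])
candidates = parse_candidates(ERROR_LINES[1])
print(f"locked to {locked}; listed candidates: {candidates}")
print("locked version among listed candidates:", locked in candidates)
```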
