
Comments (7)

oltarasenko commented on June 9, 2024

@Ziinc can probably give more info here.

But could you please describe the use case? Why can't you use parse_item?


ziyouchutuwenwu commented on June 9, 2024

Here is my usage scenario:

For site demo.com, I need to get some info, such as title and category, from the main page, and collect sub-URLs from some of the links.
When I get a sub-URL, I send a request and then parse data from the response; here I need to get detail info such as author, price, etc.

The data parser for the sub page should be different from the one for the main page, and I don't know how to do that with Crawly.

Thanks a lot.
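For reference, a minimal Crawly sketch of this scenario might look like the following. DemoSpider, the demo.com URL pattern, and every selector below are hypothetical; the point is only to show parse_item branching on the page type and emitting different items plus follow-up requests.

# Hypothetical spider: main page yields title/category and detail links,
# detail pages yield author/price.
defmodule DemoSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://demo.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://demo.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    if detail_page?(response.request_url) do
      # Detail page: extract author, price, etc. (selectors are assumptions).
      item = %{
        author: document |> Floki.find(".author") |> Floki.text(),
        price: document |> Floki.find(".price") |> Floki.text()
      }

      %Crawly.ParsedItem{items: [item], requests: []}
    else
      # Main page: extract title/category and follow the detail links.
      item = %{
        title: document |> Floki.find("h1.title") |> Floki.text(),
        category: document |> Floki.find(".category") |> Floki.text()
      }

      requests =
        document
        |> Floki.find("a.detail-link")
        |> Floki.attribute("href")
        |> Enum.map(&(URI.merge(response.request_url, &1) |> to_string()))
        |> Enum.map(&Crawly.Utils.request_from_url/1)

      %Crawly.ParsedItem{items: [item], requests: requests}
    end
  end

  # Hypothetical helper: decide the page type from the URL.
  defp detail_page?(url), do: String.contains?(url, "/detail/")
end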


ziyouchutuwenwu commented on June 9, 2024

For the Python part, my demo code looks like this:
[image: screenshot of the Python demo code]


oltarasenko commented on June 9, 2024

So... do you have different items on different pages, or the same data just structured differently?


ziyouchutuwenwu commented on June 9, 2024

Yes, basically I have different data structures on different pages, but from the sample code I can't figure out how to write mine.
It would be appreciated if there were some examples that could help me.


oltarasenko commented on June 9, 2024

Sorry, I still don't understand whether it's one of these two:

  1. Same item which can be extracted with other selectors
  2. Two different items


Ziinc commented on June 9, 2024

Sorry @ziyouchutuwenwu, I only just saw this; I must have missed the ping.

Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, with the result of each parser being passed to the next. The opts third positional argument lets you provide spider-specific configuration to your parser.

For example, on site 1 you want to extract all links matching an h1 selector, filter them with some site-specific filter function, and build requests from all the extracted links:

# spider 1
parsers: [
  {MyCustomRequestParser, [selector: ".h1", filter: &my_filter_function/1]}
]

Then, in spider 2, which crawls site 2, we only want h2 tags, without any filtering:

# spider 2
parsers: [
  {MyCustomRequestParser, [selector: ".h2"]}
]

Then your MyCustomRequestParser.run/3 contains the logic required to select and build the requests.
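For completeness, a rough sketch of what such a parser module could look like. The exact contract of run/3 for parsers (how the parsed item and the fetched response are passed in) may differ from what is assumed here, so treat the argument shapes and field names as assumptions and check the Crawly parser docs.

# Sketch only: assumes the parsed item comes in as the first argument and
# the fetched response is available in the state map.
defmodule MyCustomRequestParser do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    # Spider-specific configuration from the opts keyword list.
    selector = Keyword.get(opts, :selector, "a")
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    {:ok, document} = Floki.parse_document(response.body)

    # Select links, apply the optional filter, and build requests from them.
    requests =
      document
      |> Floki.find(selector)
      |> Floki.attribute("href")
      |> Enum.filter(filter)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end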

