Comments (7)
@Ziinc probably can give more info here.
But could you please describe the use case? Why can't you use parse_item?
from crawly.
Here is my usage scenario:
For site demo.com, I need to extract some info from the main page, such as the title and category,
and collect sub-page URLs from some of its links.
For each sub-page URL I send a request and parse the response; there I need detail info such as author, price, etc.
The parser for a sub page has to be different from the one for the main page, and I don't know how to do that with Crawly.
Many thanks.
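One common way to handle this scenario in Crawly is a single `parse_item/1` callback that dispatches on the request URL. Below is a minimal sketch under stated assumptions: `demo.com`, the `/items/` path convention, the CSS selectors, and the item fields are all placeholders, not anything from the thread; only the `Crawly.Spider` callbacks, `Crawly.Utils` helpers, and `Floki` calls are real APIs.

```elixir
defmodule Demo.Spider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://demo.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://demo.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Dispatch on the URL shape; assumes detail pages live under /items/.
    if String.contains?(response.request_url, "/items/") do
      parse_detail(response)
    else
      parse_main(response)
    end
  end

  # Main page: title + category, plus requests for each detail link.
  defp parse_main(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      title: document |> Floki.find("h1") |> Floki.text(),
      category: document |> Floki.find(".category") |> Floki.text()
    }

    requests =
      document
      |> Floki.attribute("a.detail-link", "href")
      |> Enum.map(&Crawly.Utils.build_absolute_url(&1, response.request_url))
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    %{items: [item], requests: requests}
  end

  # Detail page: a different structure entirely — author, price, etc.
  defp parse_detail(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      author: document |> Floki.find(".author") |> Floki.text(),
      price: document |> Floki.find(".price") |> Floki.text()
    }

    %{items: [item], requests: []}
  end
end
```

Because `parse_item/1` returns a map with both `:items` and `:requests`, the main page can emit its own item and enqueue the detail pages in one pass, while detail pages emit a differently shaped item.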
For the Python part, my demo code looks like this:
So... Do you have different items on different pages? Or same data just structured differently?
Yes, basically I have different data structures on different pages, but from the sample code I can't work out how to write this.
Some examples would be much appreciated.
Sorry, I still don't understand which of these two it is:
- Same item which can be extracted with other selectors
- Two different items
sorry @ziyouchutuwenwu I only just saw this, must have missed the ping.
Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, with the result of each parser passed to the next. The third positional argument, opts, lets you provide spider-specific configuration to your parser.
For example, on site 1 you want to extract all links inside h1 tags, filter them with a site-specific filter function, and build requests from the extracted links:
# spider 1
parsers: [
  {MyCustomRequestParser, [selector: "h1", filter: &my_filter_function/1]}
]
Then, in spider 2, which crawls site 2, we only want h2 tags, without any filtering:
# spider 2
parsers: [
  {MyCustomRequestParser, [selector: "h2"]}
]
Then your MyCustomRequestParser.run/3 contains the logic required to select the links and build the requests.
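The two configurations above can back a single reusable parser module. Here is a minimal sketch, assuming the parser receives the accumulated parsed item plus a state map carrying the response, and that the `:selector` and `:filter` option keys match the hypothetical spider configs; `MyCustomRequestParser` itself is the made-up name from this thread, while `Crawly.Pipeline`, `Crawly.Utils.request_from_url/1`, and the `Floki` calls are real APIs.

```elixir
defmodule MyCustomRequestParser do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(parsed_item, state, opts \\ []) do
    # Spider-specific configuration arrives via the opts keyword list.
    selector = Keyword.fetch!(opts, :selector)
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    # Assumes the pipeline state carries the fetched response.
    {:ok, document} = Floki.parse_document(state.response.body)

    requests =
      document
      |> Floki.find(selector)
      |> Floki.find("a")
      |> Floki.attribute("href")
      |> Enum.filter(filter)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Append the built requests and hand the item to the next parser.
    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end
```

Because each spider only supplies opts, the selection-and-filtering logic lives in one place and both spiders reuse it.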