Though the Scrapy-like callbacks are easy and straightforward to code, it would be nice to have a higher level of abstraction so we could write scrapers declaratively. This would remove boilerplate, separate selector logic from page navigation logic, and additionally allow graceful handling of unexpected and unsupported page types that would otherwise (without error handling) crash the scraper.
For example, it would be nice to be able to do this:
class YourSpider < ApplicationSpider
  ...
  item :product do
    text :name, '#ProdPageTitle'
    int :ean, '#ProdPageProdCode' do |r|
      r[/([0-9]+)/]
    end
    async do
      array :images, combi(css('#ShopProdImagesNew img.ShopProdThumbImg'), xpath('@src'))
      text :description, '#ProdPageTabsWrap #tab1'
      custom :specs, '#ProdPageProperties > span' do |r|
        r.to_a.in_groups_of(2).map { |s| {
          name: s[0].text,
          value: s[1].text
        } }
      end
    end
  end
end
The block contains invocations of (predefined) field types, each given a name, one or more selectors and optionally a block for post-processing the Nokogiri result. Every field accepts an async argument that specifies whether the element is rendered by JavaScript. The async block marks every field inside it as async, meaning that browser.current_page is queried a few times with a timeout until extraction of the specified element succeeds (i.e. once the page is actually rendered):
def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
  retries.times do
    # extract() determines the type of the selector expression so it knows
    # whether to call xpath() or css() on the Nokogiri object.
    result = extract(expression, multi: multi, async: false)
    case result
    when Nokogiri::XML::Element
      return result
    when Nokogiri::XML::NodeSet, Array
      return result unless result.empty?
    end
    sleep wait
    refresh # self.response = browser.current_response
  end
  default
end
Because the selectors are declaratively defined and Nokogiri's css and xpath methods are called indirectly, the expression type has to be given explicitly (it defaults to css):
css('#ShopProdImagesNew img.ShopProdThumbImg')
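A minimal sketch of how such tagged selector expressions could be represented. The Expression struct and the combi chaining semantics are assumptions for illustration, not existing API:

```ruby
# Hypothetical sketch: a selector expression is tagged with its type so the
# extractor later knows whether to call Nokogiri's #css or #xpath on it.
Expression = Struct.new(:type, :query)

def css(query)   = Expression.new(:css, query)
def xpath(query) = Expression.new(:xpath, query)

# combi chains expressions: each one is meant to be evaluated on the results
# of the previous one (here: collect the src attribute of each matched image).
def combi(*expressions) = expressions

expr = combi(css('#ShopProdImagesNew img.ShopProdThumbImg'), xpath('@src'))
expr.map(&:type) # => [:css, :xpath]
```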
This design allows for the following, given a parse_item() function that extracts all fields from the item definition of the same name as the current inline handler:
class YourSpider < ApplicationSpider
  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      for link in product_list.css(".css-selector2")
        request_in :product, url: link do |product, **opts3|
          save_to "results.json", parse_item(), format: :pretty_json
        end
      end
    end
  end
end
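A sketch of what parse_item could do under the hood, assuming the item DSL stores field definitions (name, coercion type, selector, optional post-processing block) on the class. The Field struct is an assumption, and the hash-based lookup stands in for a real Nokogiri query:

```ruby
# Hypothetical sketch: parse_item walks the field definitions registered by
# the item DSL and builds the result hash, applying each field's
# post-processing block and type coercion.
Field = Struct.new(:name, :type, :selector, :block)

PRODUCT_FIELDS = [
  Field.new(:name, :text, '#ProdPageTitle', nil),
  Field.new(:ean,  :int,  '#ProdPageProdCode', ->(r) { r[/[0-9]+/] })
]

def parse_item(extract, fields)
  fields.each_with_object({}) do |field, item|
    raw = extract.call(field.selector)           # stand-in for a Nokogiri query
    raw = field.block.call(raw) if field.block   # optional post-processing
    item[field.name] = field.type == :int ? raw.to_i : raw
  end
end

# stubbed page: selector => extracted text
page = { '#ProdPageTitle' => 'Widget', '#ProdPageProdCode' => 'Code: 4006381' }
parse_item(page.method(:fetch), PRODUCT_FIELDS)
# returns the name as text and the EAN parsed out as an integer
```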
You could even go further: if you move the result file definition into the item definition, the inline callback handler could extract the item automatically and implicitly:
class YourSpider < ApplicationSpider
  ...
  # item definition with the result file passed as an option
  item :product, file: "results/{:category}.json" do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
  end

  # item definition with the result file configured in the body
  item :otherproduct do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
    save_to "results/{:category}.json", {
      format: :pretty_json,
      append: true,
      position: false
    }
  end

  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      for link in product_list.css(".css-selector2")
        # this call requests the page at url, knows that it contains a :product
        # entity, auto-extracts it using the predefined entity selectors, and
        # auto-saves it to the result file defined on the entity.
        request_item :product, url: link
      end
    end
  end
end
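A sketch of how request_item could tie these pieces together, assuming an item registry on the spider class and a fetch/extract/save pipeline. Everything here is illustrative: fetching and saving are stubbed so only the control flow is visible:

```ruby
# Hypothetical sketch of request_item: look up the registered item definition,
# fetch the page, auto-extract the declared fields and save the result to the
# file configured on the item.
ItemDefinition = Struct.new(:name, :file_pattern, :fields)

class MiniSpider
  ITEMS = {
    product: ItemDefinition.new(:product, 'results/{:category}.json', [:name, :category])
  }

  def request_item(name, url:)
    definition = ITEMS.fetch(name)             # find the registered item
    response   = fetch(url)                    # stub for a browser request
    item       = response.slice(*definition.fields) # auto-extract declared fields
    save_to(definition.file_pattern, item)     # auto-save to the configured file
  end

  private

  def fetch(url)
    # stubbed page: a real implementation would return a parsed response
    { name: 'Widget', category: 'tools' }
  end

  def save_to(pattern, item)
    # interpolate the item's category into the file pattern; return instead of
    # writing so the behavior is observable
    [pattern.gsub('{:category}', item[:category]), item]
  end
end

MiniSpider.new.request_item(:product, url: 'https://example.com/p/1')
# => ["results/tools.json", {name: "Widget", category: "tools"}]
```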
The logical result leaves us with only a DSL for defining the relationship between pages and how to get from one to the next. If we had a class-level Page description, we could have a singular parse() entrypoint that figures out the page type on its own.
class YourSpider < ApplicationSpider
  # class-level declaration of a page type
  page :product_list do
    identifier css: 'body.product-list-page'
    has_many :product, css: '#productlist a.product-link'
  end

  page :product do
    identifier do |response|
      !response.xpath('//div[@id="product-image"]').empty? && response.css('body.is-product').length > 0
    end
  end
end
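A sketch of how a Page definition could decide whether a response matches it: the identifier is either a selector hash or a block, and page_of_type evaluates whichever was given. The PageDefinition struct and the stub response are assumptions for illustration:

```ruby
# Hypothetical sketch of page-type identification from the Page definition.
PageDefinition = Struct.new(:name, :identifier) do
  def page_of_type(response)
    if identifier.respond_to?(:call)
      identifier.call(response)           # block identifier: user-defined check
    else
      type, query = identifier.first      # e.g. { css: 'body.product-list-page' }
      !response.public_send(type, query).empty?
    end
  end
end

# stub standing in for a Nokogiri document: matches a fixed list of selectors
StubResponse = Struct.new(:matching_selectors) do
  def css(query)
    matching_selectors.include?(query) ? [query] : []
  end
end

product_list = PageDefinition.new(:product_list, { css: 'body.product-list-page' })
product_list.page_of_type(StubResponse.new(['body.product-list-page'])) # => true
product_list.page_of_type(StubResponse.new([]))                        # => false
```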
In ApplicationSpider:
def parse(response, url, **opts)
  # find the first page definition that matches; warn only when none does
  definition = @page_types.find { |page_definition| page_definition.page_of_type(response) }
  if definition
    @entities[definition.name].parse(response)
  else
    puts "unrecognised page type at #{url}!"
  end
end
The parse() entrypoint would automatically find the right Page definition to know how to parse it and how to branch to deeper pages. All deeper pages are also parsed using the singular parse() callback. The advantage of this approach is that the navigational flow gets very robust, since page types are explicitly identified by selectors in the Page definition. You would get a nice log of all unexpected page types (customized landing pages, error pages etc), and encountering them does not break the code or require error catching by the user.
The downside to the explicit approach is reduced customizability when you need something specific to be done in order to parse a specific page (type). To account for this, the parse() entrypoint would have to check whether a user-defined callback exists that fits the page definition, similar to how it works now. So for a :product Page definition, it would look for a parse_product_page(response, url, **opts) callback that allows a user to hook into the flow.
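That lookup could be sketched as follows; the class and method names other than the parse_&lt;name&gt;_page convention described above are assumptions:

```ruby
# Hypothetical sketch of the callback lookup: before auto-parsing, the
# entrypoint checks whether the spider defines a parse_<name>_page method
# and hands control to it if so.
class DispatchingSpider
  def dispatch(page_name, response, url)
    custom = "parse_#{page_name}_page"
    if respond_to?(custom)
      public_send(custom, response, url)  # user hook takes over
    else
      :auto_parsed                        # fall back to the declarative Page/item definitions
    end
  end

  # user-defined hook for the :product page type
  def parse_product_page(response, url)
    :custom_parsed
  end
end

spider = DispatchingSpider.new
spider.dispatch(:product, nil, nil)      # => :custom_parsed
spider.dispatch(:product_list, nil, nil) # => :auto_parsed
```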