elixir-crawly / crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Home Page: https://hexdocs.pm/crawly
License: Apache License 2.0
This might look a little off-topic; the question interests me purely in terms of my own education.
Is there a reason why Crawly does not use the Poolboy library?
Hi everyone! I've been using Crawly recently and I found the folder option a bit confusing.
The folder is always set to /tmp in the examples, which makes it seem that only absolute paths are allowed. A single example with a local or ~ path would make things clearer, for example in the last snippet here: https://hexdocs.pm/crawly/Crawly.Pipelines.WriteToFile.html
The other thing that bugs me is that the folder has to exist already. It would be nicer if the folder were created when missing. That would open up the possibility of defaulting to a local path, making it immediately obvious whether the parser is working.
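For illustration, a minimal sketch of what such an example could look like. The project-local priv/scraped path is my assumption about how a non-/tmp folder would be written; as things stand, the directory must already exist (e.g. created with File.mkdir_p!/1):

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    # assumption: a project-local folder instead of the usual /tmp example;
    # the directory must exist beforehand, e.g. File.mkdir_p!("priv/scraped")
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "priv/scraped"}
  ]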
I'm looking to do cron-style spider scheduling, where the engine starts the spider at a scheduled time interval if it is not already running.
Should this be within the Engine or Commander (?) module context?
This would require the Engine to be part of a supervision tree, I think.
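A minimal sketch of the supervision-tree approach, kept outside the Engine. The module name and interval are hypothetical; Crawly.Engine.start_spider/1 and Crawly.Engine.running_spiders/0 are the public calls mentioned elsewhere in this thread:

defmodule MyApp.SpiderScheduler do
  use GenServer

  @interval :timer.hours(1)

  def start_link(spider), do: GenServer.start_link(__MODULE__, spider)

  @impl true
  def init(spider) do
    schedule()
    {:ok, spider}
  end

  @impl true
  def handle_info(:tick, spider) do
    # assumption: running_spiders/0 returns a map keyed by spider module
    unless Map.has_key?(Crawly.Engine.running_spiders(), spider) do
      Crawly.Engine.start_spider(spider)
    end

    schedule()
    {:noreply, spider}
  end

  defp schedule(), do: Process.send_after(self(), :tick, @interval)
end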
I see that you can export JSON with the following config:
config :crawly,
  # other configs...
  pipelines: [Crawly.Pipelines.JSONEncoder]
Is there a middleware for exporting to CSV format, or a recommended way to do this?
Also, why does the output file name end in file.jl instead of file.json when using JSONEncoder?
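A hedged sketch of CSV export, assuming the Crawly.Pipelines.CSVEncoder module referenced later in this thread is available in your version (check its docs for the exact option names):

config :crawly,
  pipelines: [
    # assumption: CSVEncoder takes the list of fields to emit as columns
    {Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
    {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp"}
  ]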
It looks like it's easy to overflow the queue and bring the entire system down. This happens when I set concurrency higher than ~20. What is the recommended concurrency setting for Crawly, and what performance can you achieve with it?
22:27:10.427 [error] GenServer #PID<0.527.0> terminating
** (MatchError) no match of right hand side value: {:empty, {[], []}}
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:509: :hackney_pool.queue_out/2
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:376: :hackney_pool.dequeue/3
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:349: :hackney_pool.handle_info/2
(stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.4051887452.3045064705.41660>, :request, #PID<0.455.0>, :shutdown}
State: {:state, :default, {:metrics_ng, :metrics_dummy}, 50, 150000, ...} (the rest of the hackney pool state dump is truncated; it lists dozens of connection references to {'www.homebase.co.uk', 443, :hackney_ssl})
Optional XPath support.
Should be able to handle dirty XML input.
Hi, I am unable to scrape the Erlang Solutions blog as described in the quickstart guide here:
https://github.com/oltarasenko/crawly#quickstart
Attempting to run the spider through iex results in:
iex(1)> Crawly.Engine.start_spider(MyCrawler.EslSpider)
[info] Starting the manager for Elixir.MyCrawler.EslSpider
[debug] Running spider init now.
[debug] Scraped ":title,:url"
[debug] Starting requests storage worker for Elixir.MyCrawler.EslSpider...
[debug] Started 2 workers for Elixir.MyCrawler.EslSpider
:ok
iex(2)> [info] Current crawl speed is: 0 items/min
[info] Stopping MyCrawler.EslSpider, itemcount timeout achieved
I'm quite lost, as there is no way for me to debug this: whether it is a network issue (highly unlikely, since I can access the ESL website through my browser), or whether the URLs are being filtered out.
defmodule MyCrawler.EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils
  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    Logger.debug("Running spider init now.")
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    IO.inspect(response)
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    # Modified this to make it even more general, to eliminate the possibility of a selector problem
    title = response.body |> Floki.find("title") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
Of note is that the spider does not even invoke the parse_item callback, as the IO.inspect of the response is never hit.
Config is as follows:
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 2,
  follow_redirects: true,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title
Hi
Feature request: we could replace Splash with headless Chrome via Puppeteer.
Is there an option to set a proxy for Splash from Crawly?
It's quite hard to maintain the documentation in good shape when there are multiple versions with different settings (and especially different, slightly diverging tutorials).
We need versioning.
I am slightly biased against the standard Elixir style of documenting code (e.g. very large docstrings make the code unreadable, at least to me).
I would try to add an index page to docsify, and would store standalone copies for different versions (in case of major API changes).
As discussed here: #97 (comment)
we want to build a lightweight (probably HTTP-based) UI for single-node Crawly operations, for people who don't want (or don't need) the more complex https://github.com/oltarasenko/crawly_ui.
As we see it now, we need to develop a lightweight HTTP client (alternatively, we might look into command-line clients like https://github.com/ndreynolds/ratatouille).
One of the common problems I am facing right now is that it's not possible to separate one Crawly job from another from an external point of view.
E.g. the same spider can be executed multiple times; how do we know that the data came from a given run? For example here: http://18.216.221.122/, in order to group items inside a UI we need an ID to unite them.
I plan to:
What do you think about the idea?
Hey people,
I have spent quite a bit of time prototyping a UI for Crawly. I think it's the next step we have to take to make Crawly visible in the Elixir and web crawling space.
A good UI would help turn web scraping into a process with clear create-test-use cycles. Please have a glance at the early prototype here:
https://github.com/oltarasenko/crawly_ui
or test it here: http://18.216.221.122/jobs/1/items/
I have about 250k URLs in a database that need to be enriched: go to every link in the list and parse its HTML. Is there a good way to feed them into the Crawly queue? Is Crawly suitable for this use case? A sketch of what I have in mind is below.
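One possible sketch, assuming the URLs live in an Ecto-backed table (MyApp.Repo and the "pages" table are hypothetical names). Whether a 250k-entry start_urls list is practical is exactly the open question:

import Ecto.Query

@impl Crawly.Spider
def init() do
  # load every stored URL and hand the whole batch to Crawly's request storage
  urls = MyApp.Repo.all(from p in "pages", select: p.url)
  [start_urls: urls]
end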
One of the concerns I want to address in the new milestone is Logs.
We need to be able to split logs per spider if requested. That would keep each spider's logs in one place and help us understand a job's performance, what was dropped, and so on.
As I see it should be quite similar to: https://docs.scrapy.org/en/latest/topics/logging.html
@Ziinc I wonder if you have solved this problem on your own already or have some ideas to share?
Hi,
I'm interested in knowing the appropriate pattern for authenticating a spider. Most of the spiders I'm writing need to log in before they can scrape the content I need access to. What would be the normal pattern for this? I've not found any examples of it that I can see.
My guess would be to write a middleware that performs the login and sets the auth cookies; however, the authentication process differs between spiders. Would this be done within the spider itself, perhaps in the init() function?
Thanks for any help.
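For what it's worth, a sketch of the init()-based variant. Assumptions: the login endpoint and form fields are hypothetical, and the installed Crawly version accepts prepared requests from init via start_requests (if yours only supports start_urls, the same header logic could live in a custom middleware instead):

@impl Crawly.Spider
def init() do
  # log in once and capture the session cookies (endpoint/fields hypothetical)
  {:ok, response} =
    HTTPoison.post("https://example.com/login", {:form, [username: "u", password: "p"]})

  cookies =
    for {name, value} <- response.headers,
        String.downcase(name) == "set-cookie",
        do: value

  cookie_header = Enum.join(cookies, "; ")

  # attach the session cookie to every initial request
  requests =
    Crawly.Utils.requests_from_urls(["https://example.com/private"])
    |> Enum.map(fn req -> %{req | headers: [{"Cookie", cookie_header}]} end)

  # assumption: init may return start_requests in recent Crawly versions
  [start_requests: requests]
end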
Hello!
I'm looking for a way to associate scraped data. I'm doing it with Ecto but got stuck. Can you help me with it?
Link to the original post here: https://elixirforum.com/t/how-should-i-use-put-assoc-with-upsert-many-to-many/30315
I am currently running into an issue where one of my spiders gets denied for making too many requests, and I would like to set the concurrent requests for that specific spider to 1 without affecting the other spiders. So far I haven't found a way to do this.
Currently the only way I see of achieving this is creating a separate application for each spider, each with its own config, which doesn't feel optimal, as I will end up with probably 50+ spiders, meaning 50+ apps.
The question: is there currently a way to make configuration spider-specific, and if not, do you intend to make that possible in the future?
I have a problem building a custom Ecto pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines
If I understand correctly, I need to list the pipeline module in the pipelines: [] section of my config.exs file?
Also, I don't understand what this part means: MyApp.insert_with_ecto(item). Do you mean Repo.insert here, or what?
Can you please describe in more detail how I should wire it up?
P.S. I apologize if the questions seem stupid, but I still do not understand how this works. I will be thrilled when I figure it out. Ecto is a missing puzzle piece for my Crawly projects.
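A minimal sketch of such a pipeline, assuming run/2 semantics (newer versions add an opts argument) and hypothetical MyApp.Repo / MyApp.Item modules. MyApp.insert_with_ecto(item) in the docs stands for exactly this kind of Repo.insert call:

defmodule MyApp.Pipelines.StoreToDB do
  @behaviour Crawly.Pipeline
  require Logger

  @impl Crawly.Pipeline
  def run(item, state) do
    # MyApp.Item.changeset/2 is a hypothetical Ecto changeset function
    case MyApp.Repo.insert(MyApp.Item.changeset(%MyApp.Item{}, item)) do
      {:ok, _record} ->
        {item, state}

      {:error, changeset} ->
        Logger.error("Could not store item: #{inspect(changeset.errors)}")
        # returning false drops the item from the rest of the pipeline
        {false, state}
    end
  end
end

It would then be listed in the pipelines: [] section, e.g. before or instead of the encoder pipelines.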
This looks like a very useful project.
Can you please update the Crawly example (https://www.homebase.co.uk) and the tutorial? They both seem to be broken.
Right now the file name is the spider's module name, and there's no way to configure it. Because the file name is static, previous data is overwritten, whereas scrapers usually add a timestamp to the file name. That would be very nice, because previous results would be retained automatically and it would be immediately clear when the parsing was done.
Spiders could have an optional name method which users could override, and the default naming could include a timestamp.
I think Postgres would be a good way to store the spider state, so that in case the system crashes, crawling can continue from where it stopped.
Are there any recommendations or suggestions on how to implement this in my current Crawly project?
Pipeline config should be localized to the specific pipeline.
Benefits: it removes the Application.get_env usage within pipeline modules, which makes things less clear when declaring pipelines. Related to #20; it would pave the way for adding logic into a pipeline (instead of having a fat pipeline module).
Proposed API, for example:
pipelines: [
  ....
  MyCustom.Pipeline.CleanItem,
  {Crawly.Pipelines.Validate, item: [:title, :url]},
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
  Crawly.Pipelines.JSONEncoder
]
Besides adjusting the existing built-in pipelines, this proposed change would also require adjusting Crawly.Utils.pipe to check for tuple definitions.
Some tests are quite ugly at the moment :(. Need to check the race condition here:
1) test CSV encoder test Items are stored in CSV after csv pipeline (DataStorageWorkerTest)
test/data_storage_worker_test.exs:149
** (MatchError) no match of right hand side value: {:error, :already_started}
stacktrace:
test/data_storage_worker_test.exs:6: DataStorageWorkerTest.__ex_unit_setup_0/1
test/data_storage_worker_test.exs:1: DataStorageWorkerTest.__ex_unit__/2
This failure breaks the CI pipeline for master.
Problem:
Crawly only allows scraping a single item type. However, what if I am crawling two different sites with vastly different items?
For example, web page A (e.g. a blog) will have:
while web page B (e.g. a weather site) will have:
In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the end item declaration in config will be:
item: [:title, :comments, :article_content, :related_links, :temperature, :country]
The issues are that validation and duplicate filtering (:item_id) are not item-specific. Since the item type from the weather site has no title, I can't specify an item-type-specific field.
I have some idea of how this could be implemented, taking inspiration from Scrapy.
We could define item structs, and sort the items to their appropriate pipelines according to struct.
Using the tutorial as an example:
using this ideal scenario config:
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {MyItemStruct, [
      Crawly.Pipelines.Validate,
      {Crawly.Pipelines.DuplicatesFilter, item_id: :title}, # similar to how supervisor trees are declared
      Crawly.Pipelines.CSVEncoder
    ]},
    {MyOtherItemStruct, [
      Crawly.Pipelines.Validate,
      Crawly.Pipelines.CleanMyData,
      {Crawly.Pipelines.DuplicatesFilter, item_id: :name}, # similar to how supervisor trees are declared
      Crawly.Pipelines.CSVEncoder
    ]}
  ]
with the spider implemented like so:
@impl Crawly.Spider
def parse_item(response) do
  hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

  requests =
    Utils.build_absolute_urls(hrefs, base_url())
    |> Utils.requests_from_urls()

  title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
  name = response.body |> Floki.find("article.blog_post h2") |> Floki.text()

  %{
    :requests => requests,
    :items => [
      %MyItemStruct{title: title, url: response.request_url},
      # name: name (rather than name: title) appears to be what was intended here
      %MyOtherItemStruct{name: name, url: response.request_url}
    ]
  }
end
The returned items can then be sorted into their specified pipelines.
This configuration method proposes the following pipeline declaration shape:
{MyPipelineModule, validate_more_than: 5}
For backwards compatibility, a plain single pipeline list could still be declared; the struct-keyed form would only be needed for multi-item pipelines.
Do let me know what you think @oltarasenko
Hello!
When connecting through a proxy, my IP does not change. I am using ProxyMesh. When I try on my machine via OS settings, connections over HTTPS work fine. Does Crawly support HTTPS proxying? Could that be the cause of the issue?
Here is my config file:
use Mix.Config

# in config.exs
config :crawly,
  proxy: "us-ca.proxymesh.com:31280",
  closespider_timeout: 10,
  concurrent_requests_per_domain: 7,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url]},
    # {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
  ],
  port: 4001
To check the IP each process goes out with, I used this small module:
defmodule Spider.Proxy do
  @behaviour Crawly.Spider
  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://whatismyipaddress.com/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://whatismyipaddress.com/"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    item = %{
      url:
        Floki.find(
          response.body,
          "div#section_left > div:nth-of-type(2) > div:nth-of-type(1) > a"
        )
        |> Floki.text()
    }

    %Crawly.ParsedItem{:items => [item], :requests => []}
  end
end
I want to store scraped data in Postgres with the help of Ecto, of course.
Is there a best practice for this?
Is HTML, CSS, and JS in-browser rendering on the roadmap?
Thanks!
I believe an optional on_finish/0 callback would be very beneficial (I personally need to know when my spiders finish, and I would rather not poke Crawly.Engine every X seconds with Crawly.Engine.running_spiders to check whether my spider is still running).
It can be called, if defined, prior to
GenServer.call(__MODULE__, {:stop_spider, spider_name})
in the Engine. A sketch of that change follows below.
I have tested this change with my application and it seems to work; however, I am very new to Elixir and I don't know this repo well, so I'm not certain this is an acceptable solution, nor that it would cover all the ways spiders stop.
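A sketch of the Engine-side hook being described (Engine internals simplified; function_exported?/3 guards the optional callback):

def stop_spider(spider_name) do
  # call the optional callback before actually stopping the spider
  if function_exported?(spider_name, :on_finish, 0) do
    spider_name.on_finish()
  end

  GenServer.call(__MODULE__, {:stop_spider, spider_name})
end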
Hello,
I've learned the basics of Elixir and OTP and created an online card game with Phoenix (including sockets, channels, and presence). I'd like to contribute to this project to gain more experience with Elixir.
Do you have any tasks I could help with? I've seen on your roadmap that you want a UI for job management; that seems like an interesting feature to build.
Hi, great project. I'm considering rewriting my existing Scrapy scraper, but I can't seem to find any proxy support in the docs. Is there a way to customize requests to go through a proxy?
What:
Currently Crawly uses HTTPoison to perform requests. We want to make this more dynamic, so that other HTTP clients and headless browsers can be used.
Why:
All currently known HTTP clients have specific behaviors. Some sites ban everything that does not look like a browser; some sites use JS to render their pages. We need to be able to address all these problems by dynamically configuring which backend to use in each concrete situation.
How do I scrape an API using POST calls? Are there any examples or docs? Any sample code would be nice. Thanks.
How is it possible to add options for HTTPoison in Crawly?
For example, such:
url = "https://example.com/api/endpoint_that_needs_a_bearer_token"
headers = []
options = [ssl: [{:versions, [:"tlsv1.2"]}], recv_timeout: 500]
{:ok, response} = HTTPoison.get(url, headers, options)
And so that it can then be used in a spider.
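For reference, later Crawly versions expose a middleware for this. If yours has it, something like the following should forward the keyword list to HTTPoison on every request (treat the middleware name and option shape as an assumption to verify against your version's docs):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # assumption: this middleware passes the list straight through as HTTPoison options
    {Crawly.Middlewares.RequestOptions, [ssl: [versions: [:"tlsv1.2"]], recv_timeout: 500]}
  ]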
Thanks for this awesome library <3
Is there any possibility of writing the scraped items to a database with Ecto instead of writing them to a file?
This feature would involve rate limiting of requests made by a spider, such as X requests per min.
The docs for Basic Concepts and for configuring the Splash fetcher state that the configuration should be
fetcher: {Crawly.Fetchers.Splash, [base_splash_url: "http://localhost:8050/render.html"]}
However, the fetch method looks for the base_url keyword in the config, as seen here.
The message for the error, here, should be updated too.
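Given the fetch implementation described above, the configuration that currently works should presumably be:

fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}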
mix helpers for generating boilerplate code for spiders and pipelines.
Based on:
https://elixirforum.com/t/web-scraping-tools/4823/31
Stop parsing each page four times.
When you run response.body |> Floki.find(...), you're really running the equivalent of response.body |> Floki.parse() |> Floki.find(...), which means your four Floki.find calls parse the whole document four times.
Instead, try parsed_body = Floki.parse(response.body), then parsed_body |> Floki.find(...).
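Applied to the spider from earlier in this thread, the parse-once pattern looks roughly like this (Floki.parse/1 matches the quote above; newer Floki versions use parse_document/1 instead):

@impl Crawly.Spider
def parse_item(response) do
  # parse the HTML once...
  parsed_body = Floki.parse(response.body)

  # ...then run every selector against the already-parsed document
  title = parsed_body |> Floki.find("title") |> Floki.text()
  hrefs = parsed_body |> Floki.find("a.more") |> Floki.attribute("href")

  requests =
    Crawly.Utils.build_absolute_urls(hrefs, base_url())
    |> Crawly.Utils.requests_from_urls()

  %{:requests => requests, :items => [%{title: title, url: response.request_url}]}
end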
It may be Crawly.crawl(url, spider_name) or Crawly.crawl_with(url, spider_name)
The idea is based on:
#51 (comment)
We want to build a way to debug how spider would fetch a given page. Which links are going to be extracted? Which items are going to be fetched?
Hello!
I copied the examples repo and tried to run it on my system. What am I doing wrong?
iex(1)> Crawly.Engine.start_spider(Esl)
16:46:27.399 [info] Starting the manager for Elixir.Esl
16:46:27.409 [debug] Starting requests storage worker for Elixir.Esl...
16:46:27.514 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)>
16:47:27.515 [info] Current crawl speed is: 0 items/min
16:47:27.515 [info] Stopping Esl, itemcount timeout achieved
Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.
I think some critical areas are: a stop_all_spiders function to stop all running spiders, and per-spider stats. The stopping of spiders should be easy to implement.
For the spider stats, since some of the data is nested quite deep in the supervision tree, I'm not so sure how to get it to "bubble up" to the Crawly.Engine level.
@oltarasenko thoughts?
When writing JSON, WriteToFile produces invalid JSON, as it uses a newline to separate items:
{"title": "first item"}
{"title": "second item"}
The items should be in a list, comma separated:
[
{"title": "first item"},
{"title": "second item"}
]
Are there any examples of saving multiple files, for example saving multiple images for each Crawly request? So far I have only come across the WriteToFile pipeline, which seems to be for saving data into a single file (CSV, JSON, etc.).
A lot of my crawling depends on proper user-agent strings. It's a bit hard to supply user agents via the config, as we're doing now. It would be good to have a database of user agents and to pick user agents from it. I am thinking of a standalone application with a simple interface, which could then be integrated into Crawly.
We could get a database from http://www.useragentstring.com/pages/api.php or any other service.
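Until such a service exists, a hedged sketch of feeding a UA list through the config, assuming the UserAgent middleware accepts a user_agents option and picks one per request (the strings below are truncated placeholders):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # assumption: the middleware selects one of these at random per request
    {Crawly.Middlewares.UserAgent, user_agents: [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."
    ]}
  ]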
Currently, we're setting the proxy and follow-redirect settings directly, for example:
Check whether it's possible to migrate them to middlewares.
init/0 results in a "hardcoded" way of providing the initial URLs to crawl.
For example, during application runtime, if I wanted to feed URLs to the spider to crawl, I would not be able to do so without a convoluted method of fetching those URLs (e.g. from a database).
I propose allowing options to be passed as arguments to the spider's init callback, forwarded through the Crawly.Engine.start_spider function. These options are optional, and it is up to the user to handle them in the spider.
Example:
Crawly.Engine.start_spider(MySpider, urls: ["my urls"], pagination: false)

# in the spider
def init(opts \\ []) do
  # default options
  opts = Enum.into(opts, %{urls: ["Other url"], pagination: true})

  # do something with the pagination flag
  # ....
  [start_urls: opts[:urls]]
end
We need to be able to re-try failing requests.