elixir-crawly / crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Home Page: https://hexdocs.pm/crawly
License: Apache License 2.0
This might look a little off-topic; the question interests me purely in terms of my own education.
Is there a reason why Crawly does not use the Poolboy library?
Hi everyone! I've been using Crawly recently and I found the folder option a bit confusing.
The folder is always set to /tmp in the examples, which makes it seem that only absolute paths are allowed. A single example with a local or ~ path would make things clearer, for example in the last snippet here: https://hexdocs.pm/crawly/Crawly.Pipelines.WriteToFile.html
The other thing that bugs me is that the folder has to exist already. It would be nicer if the folder were created when missing. That would open up the possibility of defaulting to a local path, making it immediately obvious whether the parser is working.
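For illustration, a minimal sketch of what such an example could look like. The project-local priv/scraped path is my assumption about how a non-/tmp folder would be written; as things stand, the directory must already exist (e.g. created with File.mkdir_p!/1):

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    # assumption: a project-local folder instead of the usual /tmp example;
    # the directory must exist beforehand, e.g. File.mkdir_p!("priv/scraped")
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "priv/scraped"}
  ]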
I'm looking to do cron-style spider scheduling, where the engine starts the spider at a scheduled time interval if it is not already running.
Should this be within the Engine or Commander (?) module context?
This would require the Engine to be part of a supervision tree, I think.
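A minimal sketch of the supervision-tree approach, kept outside the Engine. The module name and interval are hypothetical; Crawly.Engine.start_spider/1 and Crawly.Engine.running_spiders/0 are the public calls mentioned elsewhere in this thread:

defmodule MyApp.SpiderScheduler do
  use GenServer

  @interval :timer.hours(1)

  def start_link(spider), do: GenServer.start_link(__MODULE__, spider)

  @impl true
  def init(spider) do
    schedule()
    {:ok, spider}
  end

  @impl true
  def handle_info(:tick, spider) do
    # assumption: running_spiders/0 returns a map keyed by spider module
    unless Map.has_key?(Crawly.Engine.running_spiders(), spider) do
      Crawly.Engine.start_spider(spider)
    end

    schedule()
    {:noreply, spider}
  end

  defp schedule(), do: Process.send_after(self(), :tick, @interval)
end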
I see that you can export JSON with the following config:
config :crawly,
  # other configs...
  pipelines: [Crawly.Pipelines.JSONEncoder]
Is there a middleware for exporting to CSV format, or a recommended way to do this?
Also, why does the output file name end in file.jl instead of file.json when using JSONEncoder?
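A hedged sketch of CSV export, assuming the Crawly.Pipelines.CSVEncoder module referenced later in this thread is available in your version (check its docs for the exact option names):

config :crawly,
  pipelines: [
    # assumption: CSVEncoder takes the list of fields to emit as columns
    {Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
    {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp"}
  ]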
It looks like it's easy to overflow the queue and bring the entire system down. This happens when I set concurrency higher than ~20. What is the recommended concurrency setting for Crawly, and what performance can you achieve with it?
22:27:10.427 [error] GenServer #PID<0.527.0> terminating
** (MatchError) no match of right hand side value: {:empty, {[], []}}
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:509: :hackney_pool.queue_out/2
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:376: :hackney_pool.dequeue/3
(hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:349: :hackney_pool.handle_info/2
(stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.4051887452.3045064705.41660>, :request, #PID<0.455.0>, :shutdown}
State: {:state, :default, {:metrics_ng, :metrics_dummy}, 50, 150000, ...} (the rest of the hackney pool state dump is truncated; it lists dozens of connection references to {'www.homebase.co.uk', 443, :hackney_ssl})
Optional XPath support.
Should be able to handle dirty XML input.
Hi, I am unable to scrape the Erlang Solutions blog as described in the quickstart guide here:
https://github.com/oltarasenko/crawly#quickstart
Attempting to run the spider through iex results in:
iex(1)> Crawly.Engine.start_spider(MyCrawler.EslSpider)
[info] Starting the manager for Elixir.MyCrawler.EslSpider
[debug] Running spider init now.
[debug] Scraped ":title,:url"
[debug] Starting requests storage worker for Elixir.MyCrawler.EslSpider...
[debug] Started 2 workers for Elixir.MyCrawler.EslSpider
:ok
iex(2)> [info] Current crawl speed is: 0 items/min
[info] Stopping MyCrawler.EslSpider, itemcount timeout achieved
I'm quite lost, as there is no way for me to debug this: whether it is a network issue (highly unlikely, since I can access the ESL website through my browser), or whether the URLs are being filtered out.
defmodule MyCrawler.EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils
  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    Logger.debug("Running spider init now.")
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    IO.inspect(response)
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    # Modified this to make it even more general, to eliminate the possibility of a selector problem
    title = response.body |> Floki.find("title") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
Of note is that the spider does not even invoke the parse_item callback, as the IO.inspect of the response is never hit.
Config is as follows:
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 2,
  follow_redirects: true,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title
Hi
Feature request: we could replace Splash with headless Chrome via Puppeteer.
Is there an option to set a proxy for Splash from Crawly?
It's quite hard to maintain the documentation in good shape when there are multiple versions with different settings (and especially different, slightly diverging tutorials).
We need versioning.
I am slightly biased against the standard Elixir style of documenting code (e.g. very large docstrings make the code unreadable, at least to me).
I would try to add an index page to docsify, and would store standalone copies for different versions (in case of major API changes).
As discussed here: #97 (comment)
we want to build a lightweight (probably HTTP-based) UI for single-node Crawly operations, for people who don't want (or don't need) the more complex https://github.com/oltarasenko/crawly_ui.
As we see it now, we need to develop a lightweight HTTP client (alternatively, we might look into command-line clients like https://github.com/ndreynolds/ratatouille).
One of the common problems I am facing right now is that it's not possible to separate one Crawly job from another from an external point of view.
E.g. the same spider can be executed multiple times; how do we know that the data came from a given run? For example here: http://18.216.221.122/, in order to group items inside a UI we need an ID to unite them.
I plan to:
What do you think about the idea?
Hey people,
I have spent quite a bit of time prototyping a UI for Crawly. I think it's the next step we have to take to make Crawly visible in the Elixir and web crawling space.
A good UI would help turn web scraping into a process with clear create-test-use cycles. Please have a glance at the early prototype here:
https://github.com/oltarasenko/crawly_ui
or test it here: http://18.216.221.122/jobs/1/items/
I have about 250k URLs in a database that need to be enriched: go to every link in the list and parse its HTML. Is there a good way to feed them into the Crawly queue? Is Crawly suitable for this use case? A sketch of what I have in mind is below.
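One possible sketch, assuming the URLs live in an Ecto-backed table (MyApp.Repo and the "pages" table are hypothetical names). Whether a 250k-entry start_urls list is practical is exactly the open question:

import Ecto.Query

@impl Crawly.Spider
def init() do
  # load every stored URL and hand the whole batch to Crawly's request storage
  urls = MyApp.Repo.all(from p in "pages", select: p.url)
  [start_urls: urls]
end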
One of the concerns I want to address in the new milestone is Logs.
We need to be able to split logs per spider if requested. That would keep each spider's logs in one place and help us understand a job's performance, what was dropped, and so on.
As I see it should be quite similar to: https://docs.scrapy.org/en/latest/topics/logging.html
@Ziinc I wonder if you have solved this problem on your own already or have some ideas to share?
Hi,
I'm interested in knowing the appropriate pattern for authenticating a spider. Most of the spiders I'm writing need to log in before they can scrape the content I need access to. What would be the normal pattern for this? I've not found any examples of it that I can see.
My guess would be to write a middleware that performs the login and sets the auth cookies; however, the authentication process differs between spiders. Would this be done within the spider itself, perhaps in the init() function?
Thanks for any help.
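For what it's worth, a sketch of the init()-based variant. Assumptions: the login endpoint and form fields are hypothetical, and the installed Crawly version accepts prepared requests from init via start_requests (if yours only supports start_urls, the same header logic could live in a custom middleware instead):

@impl Crawly.Spider
def init() do
  # log in once and capture the session cookies (endpoint/fields hypothetical)
  {:ok, response} =
    HTTPoison.post("https://example.com/login", {:form, [username: "u", password: "p"]})

  cookies =
    for {name, value} <- response.headers,
        String.downcase(name) == "set-cookie",
        do: value

  cookie_header = Enum.join(cookies, "; ")

  # attach the session cookie to every initial request
  requests =
    Crawly.Utils.requests_from_urls(["https://example.com/private"])
    |> Enum.map(fn req -> %{req | headers: [{"Cookie", cookie_header}]} end)

  # assumption: init may return start_requests in recent Crawly versions
  [start_requests: requests]
end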
Hello!
I'm looking for a way to associate scraped data. I'm doing it with Ecto but got stuck. Can you help me with it?
Link to the original post here: https://elixirforum.com/t/how-should-i-use-put-assoc-with-upsert-many-to-many/30315
I am currently running into an issue where one of my spiders gets denied for making too many requests, and I would like to set the concurrent requests for that specific spider to 1 without affecting the other spiders. So far I haven't found a way to do this.
Currently the only way I see of achieving this is creating a separate application for each spider, each with its own config, which doesn't feel optimal, as I will end up with probably 50+ spiders, meaning 50+ apps.
The question: is there currently a way to make configuration spider-specific, and if not, do you intend to make that possible in the future?
I have a problem building a custom Ecto pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines
If I understand correctly, I need to list the pipeline module in the pipelines: [] section of my config.exs file?
Also, I don't understand what this part means: MyApp.insert_with_ecto(item). Do you mean Repo.insert here, or what?
Can you please describe in more detail how I should wire it up?
P.S. I apologize if the questions seem stupid, but I still do not understand how this works. I will be thrilled when I figure it out. Ecto is a missing puzzle piece for my Crawly projects.
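A minimal sketch of such a pipeline, assuming run/2 semantics (newer versions add an opts argument) and hypothetical MyApp.Repo / MyApp.Item modules. MyApp.insert_with_ecto(item) in the docs stands for exactly this kind of Repo.insert call:

defmodule MyApp.Pipelines.StoreToDB do
  @behaviour Crawly.Pipeline
  require Logger

  @impl Crawly.Pipeline
  def run(item, state) do
    # MyApp.Item.changeset/2 is a hypothetical Ecto changeset function
    case MyApp.Repo.insert(MyApp.Item.changeset(%MyApp.Item{}, item)) do
      {:ok, _record} ->
        {item, state}

      {:error, changeset} ->
        Logger.error("Could not store item: #{inspect(changeset.errors)}")
        # returning false drops the item from the rest of the pipeline
        {false, state}
    end
  end
end

It would then be listed in the pipelines: [] section, e.g. before or instead of the encoder pipelines.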
This looks like a very useful project.
Can you please update the Crawly example (https://www.homebase.co.uk) and the tutorial? They both seem to be broken.
Right now the file name is the spider's module name, and there's no way to configure it. Because the file name is static, previous data is overwritten, whereas scrapers usually add a timestamp to the file name. That would be very nice, because previous results would be retained automatically and it would be immediately clear when the parsing was done.
Spiders could have an optional name method which users could override, and the default naming could include a timestamp.
I think Postgres would be a good way to store the spider state, so that in case the system crashes, crawling can continue from where it stopped.
Are there any recommendations or suggestions on how to implement this in my current Crawly project?
Pipeline config should be localized to the specific pipeline.
Benefits: it removes the Application.get_env usage within pipeline modules, which makes things less clear when declaring pipelines. Related to #20; it would pave the way for adding logic into a pipeline (instead of having a fat pipeline module).
Proposed API, for example:
pipelines: [
  ....
  MyCustom.Pipeline.CleanItem,
  {Crawly.Pipelines.Validate, item: [:title, :url]},
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
  Crawly.Pipelines.JSONEncoder
]
Besides adjusting the existing built-in pipelines, this proposed change would also require adjusting Crawly.Utils.pipe to check for tuple definitions.
Some tests are quite ugly at the moment :(. Need to check the race condition here:
1) test CSV encoder test Items are stored in CSV after csv pipeline (DataStorageWorkerTest)
test/data_storage_worker_test.exs:149
** (MatchError) no match of right hand side value: {:error, :already_started}
stacktrace:
test/data_storage_worker_test.exs:6: DataStorageWorkerTest.__ex_unit_setup_0/1
test/data_storage_worker_test.exs:1: DataStorageWorkerTest.__ex_unit__/2
This failure breaks the CI pipeline for master.
Problem:
Crawly only allows scraping a single item type. However, what if I am crawling two different sites with vastly different items?
For example, web page A (e.g. a blog) will have:
while web page B (e.g. a weather site) will have:
In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the end item declaration in config will be:
item: [:title, :comments, :article_content, :related_links, :temperature, :country]
The issues are that validation and duplicate filtering (:item_id) are not item-specific. Since the item type from the weather site has no title, I can't specify an item-type-specific field.
I have some idea of how this could be implemented, taking inspiration from Scrapy.
We could define item structs, and sort the items to their appropriate pipelines according to struct.
Using the tutorial as an example:
using this ideal scenario config:
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {MyItemStruct, [
      Crawly.Pipelines.Validate,
      {Crawly.Pipelines.DuplicatesFilter, item_id: :title}, # similar to how supervisor trees are declared
      Crawly.Pipelines.CSVEncoder
    ]},
    {MyOtherItemStruct, [
      Crawly.Pipelines.Validate,
      Crawly.Pipelines.CleanMyData,
      {Crawly.Pipelines.DuplicatesFilter, item_id: :name}, # similar to how supervisor trees are declared
      Crawly.Pipelines.CSVEncoder
    ]}
  ]
with the spider implemented like so:
@impl Crawly.Spider
def parse_item(response) do
  hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

  requests =
    Utils.build_absolute_urls(hrefs, base_url())
    |> Utils.requests_from_urls()

  title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
  name = response.body |> Floki.find("article.blog_post h2") |> Floki.text()

  %{
    :requests => requests,
    :items => [
      %MyItemStruct{title: title, url: response.request_url},
      # name: name (rather than name: title) appears to be what was intended here
      %MyOtherItemStruct{name: name, url: response.request_url}
    ]
  }
end
The returned items can then be sorted into their specified pipelines.
This configuration method proposes the following pipeline declaration shape:
{MyPipelineModule, validate_more_than: 5}
For backwards compatibility, a plain single pipeline list could still be declared; the struct-keyed form would only be needed for multi-item pipelines.
Do let me know what you think @oltarasenko
Hello!
When connecting through a proxy, my IP does not change. I am using ProxyMesh. When I try on my machine via OS settings, connections over HTTPS work fine. Does Crawly support HTTPS proxying? Could that be the cause of the issue?
Here is my config file:
use Mix.Config

# in config.exs
config :crawly,
  proxy: "us-ca.proxymesh.com:31280",
  closespider_timeout: 10,
  concurrent_requests_per_domain: 7,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url]},
    # {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
  ],
  port: 4001
To check the IP each process goes out with, I used this small module:
defmodule Spider.Proxy do
  @behaviour Crawly.Spider
  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://whatismyipaddress.com/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://whatismyipaddress.com/"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    item = %{
      url:
        Floki.find(
          response.body,
          "div#section_left > div:nth-of-type(2) > div:nth-of-type(1) > a"
        )
        |> Floki.text()
    }

    %Crawly.ParsedItem{:items => [item], :requests => []}
  end
end
I want to store scraped data in Postgres with the help of Ecto, of course.
Is there a best practice for this?
Is HTML, CSS, and JS in-browser rendering on the roadmap?
Thanks!
I believe an optional on_finish/0 callback would be very beneficial (I personally need to know when my spiders finish, and I would rather not poke Crawly.Engine every X seconds with Crawly.Engine.running_spiders to check whether my spider is still running).
It can be called, if defined, prior to
GenServer.call(__MODULE__, {:stop_spider, spider_name})
in the Engine. A sketch of that change follows below.
I have tested this change with my application and it seems to work; however, I am very new to Elixir and I don't know this repo well, so I'm not certain this is an acceptable solution, nor that it would cover all the ways spiders stop.
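A sketch of the Engine-side hook being described (Engine internals simplified; function_exported?/3 guards the optional callback):

def stop_spider(spider_name) do
  # call the optional callback before actually stopping the spider
  if function_exported?(spider_name, :on_finish, 0) do
    spider_name.on_finish()
  end

  GenServer.call(__MODULE__, {:stop_spider, spider_name})
end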
Hello,
I've learned the basics of Elixir and OTP and created an online card game with Phoenix (including sockets, channels, and presence). I'd like to contribute to this project to gain more experience with Elixir.
Do you have any tasks I could help with? I've seen on your roadmap that you want a UI for job management; that seems like an interesting feature to build.
Hi, great project. I'm considering rewriting my existing Scrapy scraper, but I can't seem to find any proxy support in the docs. Is there a way to customize requests to go through a proxy?
What:
Currently Crawly uses HTTPoison to perform requests. We want to make this more dynamic, so that other HTTP clients and headless browsers can be used.
Why:
All currently known HTTP clients have specific behaviors. Some sites ban everything that does not look like a browser; some sites use JS to render their pages. We need to be able to address all these problems by dynamically configuring which backend to use in each concrete situation.
How do I scrape an API using POST calls? Are there any examples or docs? Any sample code would be nice. Thanks.
How is it possible to add options for HTTPoison in Crawly?
For example, such:
url = "https://example.com/api/endpoint_that_needs_a_bearer_token"
headers = []
options = [ssl: [{:versions, [:"tlsv1.2"]}], recv_timeout: 500]
{:ok, response} = HTTPoison.get(url, headers, options)
And so that it can then be used in a spider.
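For reference, later Crawly versions expose a middleware for this. If yours has it, something like the following should forward the keyword list to HTTPoison on every request (treat the middleware name and option shape as an assumption to verify against your version's docs):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # assumption: this middleware passes the list straight through as HTTPoison options
    {Crawly.Middlewares.RequestOptions, [ssl: [versions: [:"tlsv1.2"]], recv_timeout: 500]}
  ]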
Thanks for this awesome library <3
Is there any possibility of writing the scraped items to a database with Ecto instead of writing them to a file?
This feature would involve rate limiting of requests made by a spider, such as X requests per min.
The docs for Basic Concepts and for configuring the Splash fetcher state that the configuration should be
fetcher: {Crawly.Fetchers.Splash, [base_splash_url: "http://localhost:8050/render.html"]}
However, the fetch method looks for the base_url keyword in the config, as seen here.
The message for the error, here, should be updated too.
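Given the fetch implementation described above, the configuration that currently works should presumably be:

fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}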
mix helpers for generating boilerplate code for spiders and pipelines.
Based on:
https://elixirforum.com/t/web-scraping-tools/4823/31
Stop parsing each page four times.
When you run response.body |> Floki.find(...), you're really running the equivalent of response.body |> Floki.parse() |> Floki.find(...), which means your four Floki.find calls parse the whole document four times.
Instead, try parsed_body = Floki.parse(response.body), then parsed_body |> Floki.find(...).
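Applied to the spider from earlier in this thread, the parse-once pattern looks roughly like this (Floki.parse/1 matches the quote above; newer Floki versions use parse_document/1 instead):

@impl Crawly.Spider
def parse_item(response) do
  # parse the HTML once...
  parsed_body = Floki.parse(response.body)

  # ...then run every selector against the already-parsed document
  title = parsed_body |> Floki.find("title") |> Floki.text()
  hrefs = parsed_body |> Floki.find("a.more") |> Floki.attribute("href")

  requests =
    Crawly.Utils.build_absolute_urls(hrefs, base_url())
    |> Crawly.Utils.requests_from_urls()

  %{:requests => requests, :items => [%{title: title, url: response.request_url}]}
end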
It may be Crawly.crawl(url, spider_name) or Crawly.crawl_with(url, spider_name)
The idea is based on:
#51 (comment)
We want to build a way to debug how spider would fetch a given page. Which links are going to be extracted? Which items are going to be fetched?
Hello!
I copied the examples repo and tried to run it on my system. What am I doing wrong?
iex(1)> Crawly.Engine.start_spider(Esl)
16:46:27.399 [info] Starting the manager for Elixir.Esl
16:46:27.409 [debug] Starting requests storage worker for Elixir.Esl...
16:46:27.514 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)>
16:47:27.515 [info] Current crawl speed is: 0 items/min
16:47:27.515 [info] Stopping Esl, itemcount timeout achieved
Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.
I think some critical areas are: a stop_all_spiders function to stop all running spiders, and per-spider stats. The stopping of spiders should be easy to implement.
For the spider stats, since some of the data is nested quite deep in the supervision tree, I'm not so sure how to get it to "bubble up" to the Crawly.Engine level.
@oltarasenko thoughts?
When writing JSON, WriteToFile produces invalid JSON, as it uses a newline to separate items:
{"title": "first item"}
{"title": "second item"}
The items should be in a list, comma separated:
[
{"title": "first item"},
{"title": "second item"}
]
Are there any examples of saving multiple files, for example saving multiple images for each Crawly request? So far I have only come across the WriteToFile pipeline, which seems to be for saving data into a single file (CSV, JSON, etc.).
A lot of my crawling depends on proper user-agent strings. It's a bit hard to supply user agents via the config, as we're doing now. It would be good to have a database of user agents and to pick user agents from it. I am thinking of a standalone application with a simple interface, which could then be integrated into Crawly.
We could get a database from http://www.useragentstring.com/pages/api.php or any other service.
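Until such a service exists, a hedged sketch of feeding a UA list through the config, assuming the UserAgent middleware accepts a user_agents option and picks one per request (the strings below are truncated placeholders):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # assumption: the middleware selects one of these at random per request
    {Crawly.Middlewares.UserAgent, user_agents: [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."
    ]}
  ]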
Currently, we're setting the proxy and follow-redirect settings directly, for example:
Check whether it's possible to migrate them to middlewares.
init/0 results in a "hardcoded" way of providing the initial URLs to crawl.
For example, during application runtime, if I wanted to feed URLs to the spider to crawl, I would not be able to do so without a convoluted method of fetching those URLs (e.g. from a database).
I propose allowing options to be passed as arguments to the spider's init callback, forwarded through the Crawly.Engine.start_spider function. These options are optional, and it is up to the user to handle them in the spider.
Example:
Crawly.Engine.start_spider(MySpider, urls: ["my urls"], pagination: false)

# in the spider
def init(opts \\ []) do
  # default options
  opts = Enum.into(opts, %{urls: ["Other url"], pagination: true})

  # do something with the pagination flag
  # ....
  [start_urls: opts[:urls]]
end
We need to be able to re-try failing requests.