This project forked from elixir-crawly/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

Home Page: https://oltarasenko.github.io/crawly/

License: Apache License 2.0

Crawly

Overview

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Requirements

  1. Elixir "~> 1.10"
  2. Works on Linux, Windows, OS X and BSD

Quickstart

  1. Add Crawly and Floki as dependencies:

    # mix.exs
    defp deps do
      [
        {:crawly, "~> 0.13.0"},
        {:floki, "~> 0.26.0"}
      ]
    end
  2. Fetch dependencies: $ mix deps.get

  3. Create a spider

    # lib/crawly_example/esl_spider.ex
    defmodule EslSpider do
      use Crawly.Spider
      
      alias Crawly.Utils
    
      @impl Crawly.Spider
      def base_url(), do: "https://www.erlang-solutions.com"
    
      @impl Crawly.Spider
      def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog/"]]
    
      @impl Crawly.Spider
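      def parse_item(response) do
        # Parse the HTML body so it can be queried with CSS selectors
        {:ok, document} = Floki.parse_document(response.body)

        # Collect the links to follow
        hrefs = document |> Floki.find("a.btn-link") |> Floki.attribute("href")
    
        # Turn the hrefs into absolute URLs, then into Crawly requests
        requests =
          hrefs
          |> Utils.build_absolute_urls(base_url())
          |> Utils.requests_from_urls()
    
        title = document |> Floki.find("h1.page-title-sm") |> Floki.text()
    
        # Return both the follow-up requests and the scraped items
        %{
          requests: requests,
          items: [%{title: title, url: response.request_url}]
        }
      end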
    end
  4. Configure Crawly

    • Crawly works without any configuration by default, but you will usually want to tune the crawl:
    # in config.exs
    import Config

    config :crawly,
      closespider_timeout: 10,
      concurrent_requests_per_domain: 8,
      middlewares: [
        Crawly.Middlewares.DomainFilter,
        Crawly.Middlewares.UniqueRequest,
        {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
      ],
      pipelines: [
        {Crawly.Pipelines.Validate, fields: [:url, :title]},
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
        Crawly.Pipelines.JSONEncoder,
        {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
      ]
  5. Start the Crawl:

    • $ iex -S mix
    • iex(1)> Crawly.Engine.start_spider(EslSpider)
  6. Results can be seen with: $ cat /tmp/EslSpider.jl
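
The `pipelines` list in step 4 can also contain your own modules. The following is a minimal sketch, assuming the `Crawly.Pipeline` contract in which `run/3` receives an item and the pipeline state and returns either `{item, state}` to pass the item on or `{false, state}` to drop it. The module name `MyApp.Pipelines.TrimTitle` is hypothetical, and the `@behaviour Crawly.Pipeline` declaration is left out so that the snippet compiles on its own.

```elixir
defmodule MyApp.Pipelines.TrimTitle do
  @moduledoc """
  Hypothetical item pipeline: trims whitespace around :title and drops
  items whose title is missing or blank.
  """

  # In a project that depends on Crawly, declare `@behaviour Crawly.Pipeline`
  # here; it is omitted so this sketch compiles without the dependency.

  # Returns {item, state} to pass the item on, or {false, state} to drop it.
  def run(item, state, opts \\ [])

  def run(%{title: title} = item, state, _opts) when is_binary(title) do
    case String.trim(title) do
      "" -> {false, state}
      trimmed -> {%{item | title: trimmed}, state}
    end
  end

  # Items without a binary :title are dropped.
  def run(_item, state, _opts), do: {false, state}
end
```

To try it, one option is to list `MyApp.Pipelines.TrimTitle` ahead of `Crawly.Pipelines.Validate` in the `pipelines` configuration, so that later pipelines only see items with a cleaned-up title.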

Need more help?

I have created a public Telegram channel where you can ask questions and get answers faster.

Please join me on: https://t.me/crawlyelixir

Browser rendering

Crawly can be configured so that every fetched page is rendered in a browser, which is useful when you need to extract data from pages that load much of their content asynchronously (for example, parts loaded via AJAX).
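
As a rough sketch of what such a setup can look like, the fragment below switches the fetcher to Splash. It assumes a Splash instance is already running and reachable at `http://localhost:8050` (the URL is an assumption, not part of this README); check the documentation for your Crawly version for the exact fetcher options.

```elixir
# in config.exs
import Config

config :crawly,
  # Assumption: a Splash instance is reachable at this URL
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```

With this in place, the spider code itself should not need to change; only the way pages are fetched differs.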

You can read more here:

Experimental UI

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders.

Check out the code on GitHub or try it online at CrawlyUIDemo.

See more at Experimental UI

Roadmap

  1. Pluggable HTTP client
  2. Retries support
  3. Cookies support
  4. XPath support (can already be done with Meeseeks)
  5. Project generators (spiders)
  6. UI for jobs management

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring
  7. Building a Chrome-based fetcher for Crawly

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Documentation is available on HexDocs.

Production usages

Using Crawly in production? Please let us know about your use case!
