sukhcha-in / dart_web_scraper

Easy-to-use, reusable web scraper. Extracts and cleans HTML/JSON, providing structured data results.

Home Page: https://pub.dev/packages/dart_web_scraper

License: MIT License

Dart 100.00%
htmlparser jsonparser parser parsing scraper scraping webscraper webscraping

dart_web_scraper's Introduction

Note

This package is still in development and not ready for production use. Please contribute to help make it better.

Why this package?

I was tired of writing the parsing logic from scratch every time I needed to scrape a new website. Debugging and testing the parsing logic was a nightmare. This package is an attempt to solve this problem. It provides a simple and reusable way to scrape data from the web.

Getting Started

Add dart_web_scraper as a dependency in your pubspec.yaml:

dependencies:
  dart_web_scraper:

Import the package:

import 'package:dart_web_scraper/dart_web_scraper.dart';

Usage

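import 'dart:convert';

import 'package:dart_web_scraper/dart_web_scraper.dart';

Future<void> main() async {
  /// Initialize WebScraper
  WebScraper webScraper = WebScraper();

  /// Scrape website based on configMap.
  /// `configMap` must be defined separately; its structure is described below.
  Map<String, Object> result = await webScraper.scrape(
    url: Uri.parse("https://quotes.toscrape.com"),
    configMap: configMap,
    configIndex: 0,
    debug: true,
  );

  /// jsonEncode comes from dart:convert
  print(jsonEncode(result));
}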

Example configMap for quotes.toscrape.com can be found here.

Structure

This is the basic structure for configMap:

  configMap // Map<String, List<Config>>
  ├── quotes.toscrape.com // String
  │   ├── Config
  │   │   ├── parsers // Map<String, List<Parser>>
  │   │   │   └── urltarget.name // String. Name of your UrlTarget
  │   │   │       ├── Parser
  │   │   │       │   ├── id // String. Add _root as entrypoint id
  │   │   │       │   ├── parent // List<String>
  │   │   │       │   └── type // ParserType
  │   │   │       └── Parser // Another Parser for same UrlTarget
  │   │   └── urlTargets // List<UrlTarget>
  │   │       ├── UrlTarget
  │   │       │   ├── name // String
  │   │       │   └── where // List<String>
  │   │       └── UrlTarget // Another UrlTarget for a Config
  │   └── Config // Another Config for same domain
  └── example.com // Another domain in configMap
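
As a rough illustration of the tree above, a configMap for quotes.toscrape.com might look like the sketch below. It only uses the constructor parameters documented in this README. The target name "main", the parser ids, and the CSS selectors are assumptions chosen for illustration, as is the reading that the entry-point parser lists "_root" as its parent.

```dart
import 'package:dart_web_scraper/dart_web_scraper.dart';

// A sketch of a configMap for quotes.toscrape.com.
// The target name, parser ids, and selectors are illustrative assumptions.
final Map<String, List<Config>> configMap = {
  "quotes.toscrape.com": [
    Config(
      urlTargets: [
        // "/" is documented as matching any path.
        UrlTarget(name: "main", where: ["/"]),
      ],
      parsers: {
        // The key is the name of the UrlTarget declared above.
        "main": [
          // Entry point: assumed to attach to the root through "_root".
          Parser(
            id: "quotes",
            parent: ["_root"],
            type: ParserType.element,
            selector: [".quote"],
            multiple: true,
          ),
          // Child parsers reference their parent by id.
          Parser(
            id: "text",
            parent: ["quotes"],
            type: ParserType.text,
            selector: ["span.text"],
          ),
          Parser(
            id: "author",
            parent: ["quotes"],
            type: ParserType.text,
            selector: ["small.author"],
          ),
        ],
      },
    ),
  ],
};
```

With a map like this, `configIndex: 0` in the usage example would select the first (and only) Config in the list for the domain.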

Classes and Methods

Config class

Config for a domain.

Config Config({
  // Scrape the URL again even if `html` is passed to `WebScraper.scrape`. Defaults to false.
  bool forceFetch = false,
  // User agent device. Defaults to mobile.
  UserAgentDevice userAgent = UserAgentDevice.mobile,
  // Allow user passed HTML. Defaults to true.
  bool usePassedHtml = true,
  // Allow user passed User-Agent. Defaults to false.
  bool usePassedUserAgent = false,
  // Allow user passed cookies. Defaults to false.
  bool usePassedCookies = false,
  // Map of UrlTarget's name containing list of parsers.
  required Map<String, List<Parser>> parsers,
  // List of UrlTarget. More details below.
  required List<UrlTarget> urlTargets,
})

UrlTarget class

UrlTarget is used to target different sections of a website. For example, a single Config can hold one set of parsers for /products/foo and another for /search?q=foo.

UrlTarget UrlTarget({
  // Name of the UrlTarget
  required String name,
  // Set list of static paths in a url. For any path set it to "/"
  required List<String> where,
  // Useful if you want to use API request instead of scraping a webpage. Defaults to true.
  bool needsHtml = true,
  // Parameters cleaner. More details below.
  UrlCleaner? urlCleaner,
})

UrlCleaner class

Clean the URL before it's passed to a scraper.

UrlCleaner UrlCleaner({
  // Set whitelisted or blacklisted URL parameters.
  List<String>? whitelistParams,
  List<String>? blacklistParams,
  // Set custom static parameters to a URL.
  Map<String, String>? appendParams,
})
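
A UrlTarget and its UrlCleaner could be combined roughly as in the sketch below. The target name, the path, and the parameter names are hypothetical, and the comments describe the intended effect as suggested by the field names, not verified behavior.

```dart
import 'package:dart_web_scraper/dart_web_scraper.dart';

// Hypothetical target for search pages such as /search?q=foo&utm_source=x.
final searchTarget = UrlTarget(
  name: "search",
  where: ["/search"],
  urlCleaner: UrlCleaner(
    // Intended to keep only the `q` parameter and drop everything else.
    whitelistParams: ["q"],
    // Intended to always add a fixed parameter (assumed name and value).
    appendParams: {"lang": "en"},
  ),
);
```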

Parser class

Easy to use and reusable parser class :)

Parser Parser({
  // `id` is used in the final result.
  // Child parsers can reference a parent parser by its `id`.
  // Multiple parsers can share the same id and parent; they execute one by one, and execution stops once one parser successfully parses the data.
  required String id,
  // A child can have multiple parents; it executes once a parent parser has executed successfully.
  required List<String> parent,
  // Set the parser type.
  required ParserType type,
  // Selectors execute one by one, and execution stops once one selector successfully parses the data.
  List<String> selector = const [],
  // Mark the parser as private. Its data will not be added to the final result.
  bool isPrivate = false,
  // Set multiple to `true` if data is a List.
  bool multiple = false,
  // Optional parameters explained below.
  AbstractOptional? optional,
  // Custom cleaner function, clean the data and return data.
  Object? Function(Data, bool)? cleaner,
})

Parser Types

| Type | Description | Selector | Optional |
| --- | --- | --- | --- |
| ParserType.element | Extracts element nodes from HTML using CSS selectors. | CSS selector required. | N/A |
| ParserType.attribute | Extracts an attribute from an HTML element using CSS selectors. | Use a CSS selector to select an element and append the attribute name with ::. Ex: div#myid::name where name refers to the attribute name. | Optional |
| ParserType.text | Extracts text from an HTML element using CSS selectors. | CSS selector required. | Optional |
| ParserType.image | Extracts an image URL from an HTML element. | CSS selector required. After selecting an element it tries to find the src attribute. | Optional |
| ParserType.url | Extracts a URL from an HTML element. | CSS selector required. After selecting an element it tries to find the href attribute. | Optional |
| ParserType.urlParam | Extracts a query parameter from a URL. | Add the parameter name in selector. | Optional |
| ParserType.table | Extracts data from an HTML table. | CSS selector required. Select the table using this selector. | TableOptional |
| ParserType.sibling | Used when the target element doesn't have a valid selector but its sibling does. | CSS selector required. | SiblingOptional |
| ParserType.strBetween | Extracts the string between two strings. | Not required | StrBtwOptional |
| ParserType.http | Gets data using an HTTP request. | Not required | HttpOptional |
| ParserType.json | Decodes a JSON string or extracts data from it. | json_path syntax should be used as a selector. | Optional |
| ParserType.jsonld | Extracts all Ld+Json objects and places them into a list. | Not required | N/A |
| ParserType.jsonTable | Extracts data from JSON as a table. | json_path syntax should be used as a selector. | TableOptional |
| ParserType.json5decode | Decodes JSON5 syntax. | Not required | N/A |
| ParserType.staticVal | Useful if you want to set static values in the final result. | Not required | StaticOptional |
| ParserType.returnUrlParser | Returns the URL which was passed to WebScraper. | Not required | Optional |

Data injection to selector

You can inject previously parsed data into a parser's selector using <slot>. For example:

selector: [
  // for css selector
  "div#<slot>id</slot>",
  // or for json path:
  r"$.data.<slot>id</slot>.value",
]

You can also inject data using slot into HttpOptional's url field. For example:

Parser(
  id: "json",
  parent: ["product_id"],
  type: ParserType.http,
  isPrivate: true,
  optional: HttpOptional(
    url: "https://example.com/productdetails/<slot>product_id</slot>",
    responseType: HttpResponseType.json,
  ),
),
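
To make the idea of slot injection concrete, the sketch below shows one way such a substitution could work using only the Dart core library. `injectSlots` is a hypothetical helper written for illustration; it is not the package's implementation, and the package may treat missing values differently.

```dart
/// Hypothetical helper: replaces every <slot>key</slot> in [template]
/// with the value stored under `key` in [parsed].
String injectSlots(String template, Map<String, Object?> parsed) {
  final slot = RegExp(r'<slot>(.*?)</slot>');
  return template.replaceAllMapped(slot, (match) {
    final key = match.group(1)!;
    final value = parsed[key];
    // Leave the slot untouched when no data exists for that key.
    return value == null ? match.group(0)! : value.toString();
  });
}

void main() {
  final url = injectSlots(
    "https://example.com/productdetails/<slot>product_id</slot>",
    {"product_id": 42},
  );
  print(url);
}
```

Here the key inside the slot plays the role of a parser id, matching the `product_id` parent in the HttpOptional example above.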

Optional Parameters

Optional class

Can be used with any parser.

Optional Optional({
  // Pre defined methods which can be applied to final result
  ApplyMethod? apply,
  // Regex selector and regexGroup can be used together to select data from final result
  String? regex,
  int? regexGroup,
  // Replace matches of regexReplace with regexReplaceWith
  String? regexReplace,
  String? regexReplaceWith,
  // Replace the first occurrence in a string with a string
  Map<String, String>? replaceFirst,
  // Replace all occurrences in a string with a string
  Map<String, String>? replaceAll,
  // Crop string from start. If data is List it removes the elements from start.
  int? cropStart,
  // Crop string from end. If data is List it removes the elements from end.
  int? cropEnd,
  // Prepend something to a string
  String? prepend,
  // Append something to a string
  String? append,
  // Converts final result to boolean if data matches with one of the `match` object
  List<Object>? match,
  // Select nth child from a list
  int? nth,
  // Split a string by something
  String? splitBy,
})
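
As a small sketch of how Optional might be attached to a parser, the example below combines `regex` and `regexGroup`. The parser id, the selector, and the sample text are hypothetical, and the described result is the intended effect suggested by the field comments above, not verified output.

```dart
import 'package:dart_web_scraper/dart_web_scraper.dart';

// Hypothetical parser: reads text such as "Price: $12.50" and is
// intended to keep only the first capture group, "12.50".
final priceParser = Parser(
  id: "price",
  parent: ["_root"],
  type: ParserType.text,
  selector: ["span.price"],
  optional: Optional(
    regex: r"(\d+\.\d+)",
    regexGroup: 1,
  ),
);
```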

HttpOptional class

Required with ParserType.http

HttpOptional HttpOptional({
  // URL to fetch data
  String? url,
  // GET and POST methods are currently supported
  HttpMethod? method,
  // Custom headers
  Map<String, Object>? headers,
  // Set User-Agent to mobile or desktop
  UserAgentDevice? userAgent,
  // Set expected response type
  HttpResponseType? responseType,
  // Payload for POST requests
  Object? payload,
  // Use proxy?
  bool usePassedProxy = false,
  // Set payload type for POST request
  HttpPayload? payloadType,
  // Used for debugging purposes only, saves file to /cache folder
  bool cacheResponse = false,
})

StrBtwOptional class

Required with ParserType.strBetween

StrBtwOptional StrBtwOptional({
  // Starting of a string
  String? start,
  // Ending of a string
  String? end
})
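
The idea behind `start` and `end` can be illustrated with plain Dart. `strBetween` below is a hypothetical helper written for this sketch, not the package's code; the package may handle missing markers or repeated matches differently.

```dart
/// Hypothetical helper: returns the text between the first occurrence of
/// [start] and the next occurrence of [end], or null if either is missing.
String? strBetween(String source, String start, String end) {
  final from = source.indexOf(start);
  if (from == -1) return null;
  final contentStart = from + start.length;
  final to = source.indexOf(end, contentStart);
  if (to == -1) return null;
  return source.substring(contentStart, to);
}

void main() {
  // Pull an embedded JSON literal out of an inline script.
  const script = 'var data = {"a":1};';
  print(strBetween(script, "var data = ", ";"));
}
```

This kind of extraction is handy when the data sits inside a script tag, and the result could then be handed to a JSON parser.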

SiblingOptional class

Used with ParserType.sibling

SiblingOptional SiblingOptional({
  // previous or next sibling, defaults to next if `SiblingOptional` is not passed
  required SiblingDirection direction,
  // Check if sibling.text contains some string
  List<String>? where
})

TableOptional class

Used with ParserType.table. Required with ParserType.jsonTable.

TableOptional TableOptional({
  // Set CSS selector for selecting keys row
  // When using jsonTable set keys as json path selector
  String? keys,
  // Set CSS selector for selecting values row
  // When using jsonTable set values as json path selector
  String? values,
})

StaticOptional class

Required with ParserType.staticVal

StaticOptional StaticOptional({
  // Set string value to result
  String? strVal,
  // Set Map to result
  Map<String, Object>? mapVal
})

Credits

json_path - JSON path selector
json5 - JSON5 syntax decoder
