
scalpel's Introduction

Scalpel

Scalpel is a convenient web scraping library for extracting data from HTML webpages. It is inspired by libraries like Parsec and Perl's Web::Scraper, and provides a declarative, monadic interface on top of the robust HTML parsing library TagSoup.

Quickstart

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

htmlString :: String
htmlString =
    "<html>\
    \  <body>\
    \    <div class='comments'>\
    \      <div class='comment container'>\
    \        <span class='comment author'>Sally</span>\
    \        <div class='comment text'>Woo hoo!</div>\
    \      </div>\
    \      <div class='comment container'>\
    \        <span class='comment author'>Bill</span>\
    \        <img class='comment image' src='http://example.com/cat.gif' />\
    \      </div>\
    \    </div>\
    \  </body>\
    \</html>"

main :: IO ()
main = do
    -- We can either scrape raw HTML of any StringLike type (fetched beforehand by other means):
    let scrapedCommentsFromString = scrapeStringLike htmlString comments
    -- prints: Just [TextComment "Sally" "Woo hoo!",ImageComment "Bill" "http://example.com/cat.gif"]
    print scrapedCommentsFromString

    -- or, for convenience, let Scalpel fetch and scrape an HTML page for us:
    scrapedCommentsFromUrl <- scrapeURL "http://example.org/article.html" comments
    -- example.org doesn't have the HTML above
    -- prints: Just []
    print scrapedCommentsFromUrl

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)

comments :: Scraper String [Comment]
comments = chroots ("div" @: [hasClass "container"]) comment
  where
    comment :: Scraper String Comment
    comment = textComment <|> imageComment

    textComment :: Scraper String Comment
    textComment = do
        author <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div" @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: Scraper String Comment
    imageComment = do
        author <- text $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img" @: [hasClass "image"]
        return $ ImageComment author imageURL

This example demonstrates the most important features of this library: you can parse and extract data from raw HTML text or from a webpage by providing a URL. Here we use a hypothetical HTML document located at "http://example.org/article.html" to extract a list of all of the comments.

More examples can be found in the examples folder in the Scalpel git repository.

To understand the code it's important to know that this library provides two main building blocks for building web scrapers: Selectors and Scrapers.

Selectors

Selectors describe a location within an HTML DOM tree. The simplest selector is a plain string value. For example, the selector "div" matches every single div node in a DOM. Selectors can be combined using tag combinators. The // operator defines nested relationships within a DOM tree. For example, the selector "div" // "a" matches all anchor tags nested arbitrarily deep within a div tag.

In addition to describing the nested relationships between tags, selectors can also include predicates on the attributes of a tag. The @: operator creates a selector that matches a tag based on its name and various conditions on the tag's attributes. An attribute predicate is just a function that takes an attribute and returns a boolean indicating whether the attribute matches a criterion. There are several attribute operators that can be used to generate common predicates. The @= operator creates a predicate that matches the name and value of an attribute exactly. For example, the selector "div" @: ["id" @= "article"] matches div tags where the id attribute is equal to "article".
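
As a rough sketch reusing htmlString and the OverloadedStrings pragma from the quickstart above, these combinators compose as follows (the scraper names are hypothetical):

-- // pairs with @: and hasClass to match every author span nested anywhere
-- inside the comments div.
authorNames :: Scraper String [String]
authorNames =
    texts $ ("div" @: [hasClass "comments"]) // ("span" @: [hasClass "author"])
-- scrapeStringLike htmlString authorNames == Just ["Sally","Bill"]

-- @= requires an exact attribute value, so this selector only matches the
-- outer div whose class attribute is exactly "comments".
commentsHtml :: Scraper String String
commentsHtml = html $ "div" @: ["class" @= "comments"]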

Scrapers

Scrapers are values that extract data from an HTML DOM tree; they are built by applying scraping primitives to selectors. The Scraper type takes two type parameters. The first is the string-like type that is used to store the text values within a DOM tree. Any string-like type supported by Text.StringLike is valid. The second type is the type of value that the scraper produces.

There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
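
For example, run against htmlString from the quickstart, the singular text returns only the first matching author while the plural texts returns all of them (a small sketch, not part of the library):

firstAuthor :: Scraper String String
firstAuthor = text $ "span" @: [hasClass "author"]
-- scrapeStringLike htmlString firstAuthor == Just "Sally"

allAuthors :: Scraper String [String]
allAuthors = texts $ "span" @: [hasClass "author"]
-- scrapeStringLike htmlString allAuthors == Just ["Sally","Bill"]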

Tips & Tricks

The primitives provided by scalpel are intentionally minimalistic with the assumption being that users will be able to build up complex functionality by combining them with functions that work on existing type classes (Monad, Applicative, Alternative, etc.).

This section gives examples of common tricks for building up more complex behavior from the simple primitives provided by this library.

OverloadedStrings

Selector, TagName and AttributeName are all IsString instances, and thus it is convenient to use scalpel with OverloadedStrings enabled. If not using OverloadedStrings, all tag names must be wrapped with tagSelector.
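
For example, the same scraper can be written either way (a small sketch; the names are hypothetical):

-- With OverloadedStrings, string literals are interpreted as Selectors:
linksOverloaded :: Scraper String [String]
linksOverloaded = attrs "href" ("div" // "a")

-- Without OverloadedStrings, tag names are wrapped with tagSelector:
linksExplicit :: Scraper String [String]
linksExplicit = attrs "href" (tagSelector "div" // tagSelector "a")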

Matching Wildcards

Scalpel has three different wildcard values, each corresponding to a distinct use case.

  • anySelector is used to match all tags:

    textOfAllTags = texts anySelector

  • AnyTag is used when matching all tags with some attribute constraint. For example, to match all tags with the attribute class equal to "button":

    textOfTagsWithClassButton = texts $ AnyTag @: [hasClass "button"]

  • AnyAttribute is used when matching tags with some arbitrary attribute equal to a particular value. For example, to match all tags with some attribute equal to "button":

    textOfTagsWithAnAttributeWhoseValueIsButton = texts $ AnyTag @: [AnyAttribute @= "button"]

Complex Predicates

It is possible to run into scenarios where the name and attributes of a tag are not sufficient to isolate interesting tags, and the properties of child tags need to be considered as well.

In these cases the guard function of the Alternative type class can be combined with chroot and anySelector to implement predicates of arbitrary complexity.

Building off the above example, consider a use case where we would like to find the HTML contents of a comment that mentions the word "cat".

The strategy will be the following:

  1. Isolate the comment div using chroot.

  2. Then within the context of that div the textual contents can be retrieved with text anySelector. This works because the first tag within the current context is the div tag selected by chroot, and the anySelector selector will match the first tag within the current context.

  3. Then the predicate that "cat" appear in the text of the comment will be enforced using guard. If the predicate fails, scalpel will backtrack and continue the search for divs until one is found that matches the predicate.

  4. Return the desired HTML content of the comment div.

catComment :: Scraper String String
catComment =
    -- 1. First narrow the current context to the div containing the comment's
    --    textual content.
    chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
        -- 2. anySelector can be used to access the root tag of the current context.
        contents <- text anySelector
        -- 3. Skip comment divs that do not contain "cat".
        guard ("cat" `isInfixOf` contents)
        -- 4. Generate the desired value.
        html anySelector

For the full source of this example, see complex-predicates in the examples directory.

Generalized Repetition

The pluralized versions of the primitive scrapers (texts, attrs, htmls) allow the user to extract content from all of the tags matching a given selector. For more complex scraping tasks it is at times desirable to extract multiple values from the same tag.

Like the previous example, the trick here is to use a combination of the chroots function and the anySelector selector.

Consider an extension to the original example where image comments may contain some alt text and the desire is to return a tuple of the alt text and the URLs of the images.

The strategy will be the following:

  1. Isolate each img tag using chroots.

  2. Then within the context of each img tag, use the anySelector selector to extract the alt and src attributes from the current tag.

  3. Create and return a tuple of the extracted attributes.

altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
    -- 1. First narrow the current context to each img tag.
    chroots "img" $ do
        -- 2. Use anySelector to access all the relevant content from the currently
        -- selected img tag.
        altText <- attr "alt" anySelector
        srcUrl  <- attr "src" anySelector
        -- 3. Combine the retrieved content into the desired final result.
        return (altText, srcUrl)

For the full source of this example, see generalized-repetition in the examples directory.

Operating with other monads inside the Scraper

ScraperT is a scraper monad transformer: it allows lifting m a operations into a ScraperT str m a with functions like:

-- Particularizes to 'm a -> ScraperT str m a'
lift :: (MonadTrans t, Monad m) => m a -> t m a

-- Particularizes to things like 'IO a -> ScraperT str IO a'
liftIO :: MonadIO m => IO a -> m a

Example: Perform HTTP requests on page images as you scrape:

  1. Isolate images using chroots.

  2. Within the context of each img tag, obtain the src attribute containing the location of the file.

  3. Perform an IO operation to request metadata headers from the source.

  4. Use the retrieved data to build and return a more complex data structure.

-- Holds original link and data if it could be fetched
data Image = Image String (Maybe Metadata)
  deriving Show

-- Holds mime type and file size
data Metadata = Metadata String Int
  deriving Show

-- Scrape the page for images: get their metadata
scrapeImages :: URL -> ScraperT String IO [Image]
scrapeImages topUrl = do
    chroots "img" $ do
        source <- attr "src" "img"
        guard . not . null $ source
        -- getImageMeta is called via liftIO because ScraperT transforms over IO
        liftM (Image source) $ liftIO (getImageMeta topUrl source)

For the full source of this example, see downloading data.

For more documentation on monad transformers, see the Hackage page.

Explicit error handling

ScraperT is an instance of MonadError which allows you to throw errors from within parsing code to stop parsing and return an error.

When doing error handling in this way, there are 3 cases to consider:

  1. An explicitly thrown error
  2. A failed scraping without a thrown error
  3. A valid result

This can be implemented for String valued errors as follows:

type Error = String
type ScraperWithError a = ScraperT String (Either Error) a

scrapeStringOrError :: String -> ScraperWithError a -> Either Error a
scrapeStringOrError html scraper
    | Left error     <- result = Left error
    | Right Nothing  <- result = Left "Unknown error"
    | Right (Just a) <- result = Right a
  where
    result = scrapeStringLikeT html scraper

To add explicit erroring you can use the <|> operator from Alternative to throw an error when something fails:

comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> throwError "Unknown comment type"

With this approach, when you throw an error it stops all parsing. So if you have an expression a <|> b and there is a nested throwError in a, then the parsing will fail even if b would have succeeded.
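
The sketch below illustrates this pitfall; it assumes the quickstart's textComment and imageComment have been re-typed at ScraperWithError, and the name commentStrict is hypothetical:

-- Because the throwError is nested inside the left-hand alternative, a failed
-- textComment aborts the whole parse with Left "no text comment"; the
-- imageComment alternative on the right is never tried.
commentStrict :: ScraperWithError Comment
commentStrict =
    (textComment <|> throwError "no text comment") <|> imageComment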

For the full source for this approach, see error-handling in the examples directory.

Another approach that would let you accumulate errors without stopping parsing would be to use MonadWriter and accumulate debugging information in a Monoid like a list:

type Error = String
type ScraperWithError a = ScraperT String (Writer [Error]) a

scrapeStringOrError :: String -> ScraperWithError a -> (Maybe a, [Error])
scrapeStringOrError html scraper = runWriter (scrapeStringLikeT html scraper)

Then to log an error you can use tell:

comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> (tell ["Unknown comment type"] >> empty)

You can also retrieve the current HTML being parsed with html anySelector and incorporate that into your log message:

logError :: String -> ScraperWithError a
logError message = do
  currentHtml <- html anySelector
  tell [message ++ currentHtml]
  empty

comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> logError "Unknown comment type: "

For the full source for this approach, see error-handling-with-writer in the examples directory.

scalpel-core

The scalpel package depends on 'http-client' and 'http-client-tls' to provide networking support. For projects with an existing HTTP client these dependencies may be unnecessary.

For these scenarios users can instead depend on scalpel-core which does not provide networking support and has minimal dependencies.
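
As a minimal sketch (assuming the project fetches the HTML itself and the function name is hypothetical), the core module can be used on its own:

{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel.Core

-- Scrape pre-fetched HTML using only scalpel-core, which exposes the scraping
-- API without networking helpers such as scrapeURL.
titleOf :: String -> Maybe String
titleOf pageHtml = scrapeStringLike pageHtml (text "title")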

Troubleshooting

My Scraping Target Doesn't Return The Markup I Expected

Some websites return different markup depending on the user agent sent along with the request. In some cases, this even means returning no markup at all in an effort to prevent scraping.

To work around this, you can add your own user agent string.

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.6.0
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP


-- Create new manager settings based on the default TLS manager, updating the
-- request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
  HTTP.managerModifyRequest = \req -> do
    req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
    return $ req' {
      HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                          : HTTP.requestHeaders req'
    }
}

main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
    maybe printError printHtml html
  where
    url = "https://www.google.com"
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn

A list of user agent strings can be found here.

scalpel's People

Contributors

acastello, aupiff, debug-ito, fimad, jbaum98, jezen, jwiegley, levinotik, malteneuss, ncfavier, ocramz, peterbecich, raveline, sportanova, sulami, teofilc


scalpel's Issues

"fail" implementation

It would be great if you could add fail _ = mzero to Selector str's Monad instance.

This would allow me to cleanly pattern match inside selectors, like this:

do Just x <- fmap readMaybe . text $ "foo" // "bar"
   ...

This selector would fail without throwing an error when the tag in question can't be parsed as an integer.

Can this be built on Windows 10?

This API looks really promising. Nice work!

I'm using stack to build but get this error:

Configuring curl-1.3.8...
Building curl-1.3.8...
Preprocessing library curl-1.3.8...
[1 of 8] Compiling Network.Curl.Debug ( Network\Curl\Debug.hs, .stack-work\dist\b7fec021\build\Network\Curl\Debug.o )
[2 of 8] Compiling Network.Curl.Code ( Network\Curl\Code.hs, .stack-work\dist\b7fec021\build\Network\Curl\Code.o )
[3 of 8] Compiling Network.Curl.Types ( Network\Curl\Types.hs, .stack-work\dist\b7fec021\build\Network\Curl\Types.o )
[4 of 8] Compiling Network.Curl.Info ( Network\Curl\Info.hs, .stack-work\dist\b7fec021\build\Network\Curl\Info.o )
[5 of 8] Compiling Network.Curl.Post ( Network\Curl\Post.hs, .stack-work\dist\b7fec021\build\Network\Curl\Post.o )
[6 of 8] Compiling Network.Curl.Opts ( Network\Curl\Opts.hs, .stack-work\dist\b7fec021\build\Network\Curl\Opts.o )
[7 of 8] Compiling Network.Curl.Easy ( Network\Curl\Easy.hs, .stack-work\dist\b7fec021\build\Network\Curl\Easy.o )

D:\atc\AppData\Local\Temp\stack8212\curl-1.3.8\Network\Curl\Easy.hs:27:1: warning: [-Wtabs]
    Tab character found here, and in two further locations.
    Please use spaces instead.
[8 of 8] Compiling Network.Curl     ( Network\Curl.hs, .stack-work\dist\b7fec021\build\Network\Curl.o )

D:\atc\AppData\Local\Temp\stack8212\curl-1.3.8\Network\Curl.hs:278:1: warning: [-Wtabs]
    Tab character found here, and in 8 further locations.
    Please use spaces instead.

D:\atc\AppData\Local\Temp\stack8212\curl-1.3.8\curlc.c:10:23: error:
     fatal error: curl/curl.h: No such file or directory
compilation terminated.
`gcc.exe' failed in phase `C Compiler'. (Exit code: 1)

I have this set in my stack.yaml:
extra-lib-dirs: ["C:/Program Files/cURL/dlls"]

But that obviously doesn't help.

Any ideas please?

Document scalpel's parsing algorithm

Scalpel does not attempt to follow the HTML spec's parsing algorithm. How it interprets / parses content should be documented in the public API, not just in the internal comments around the parsing code.

Avoid curl - bad things happen

So against my gut I decided to just use curl as Scalpel already had it built in. But I ended up seriously regretting that. Once my scraper was large enough, it started running with a lot of concurrency. Sure enough, the app started randomly segfaulting in the libcurl code. I replaced all libcurl bindings with http-client-tls bindings and haven't had any issues. I strongly suggest that you move away from curl. I can submit a PR for the changes if you like. The only loss will be in allowing users to supply their own Curl options.

I was very pleasantly surprised to find that http-client-tls even supports SOCKS proxies thanks to the connection package. I needed this feature in curl to connect to Tor proxies.

Example issue.

It took me A While to figure out why the example wasn't returning anything in ghci.

Turns out that unless you give a Show instance for Comment, ghci silently returns when you run allComments. Here's a complete, working version (assuming the URL has the html in question):

  import Text.HTML.Scalpel
  import Control.Applicative

  type Author = String

  data Comment
      = TextComment Author String
      | ImageComment Author URL
      deriving (Show, Eq)

  allComments :: IO (Maybe [Comment])
  allComments = scrapeURL "http://vrici.lojban.org/~rlpowell/media/public/test.html" comments
     where
         comments :: Scraper String [Comment]
         comments = chroots ("div" @: [hasClass "container"]) comment

         comment :: Scraper String Comment
         comment = textComment <|> imageComment

         textComment :: Scraper String Comment
         textComment = do
             author      <- text $ "span" @: [hasClass "author"]
             commentText <- text $ "div"  @: [hasClass "text"]
             return $ TextComment author commentText

         imageComment :: Scraper String Comment
         imageComment = do
             author   <- text       $ "span" @: [hasClass "author"]
             imageURL <- attr "src" $ "img"  @: [hasClass "image"]
             return $ ImageComment author imageURL

Giving:

  Prelude> :load scalpel.hs
  [1 of 1] Compiling Main             ( scalpel.hs, interpreted )
  Ok, modules loaded: Main.
  *Main> allComments
  Just [TextComment "Sally" "Woo hoo!",ImageComment "Bill" "http://example.com/cat.gif",TextComment "Susan" "WTF!?!"]
  *Main>

Selectors don't play nicely with OverloadedStrings

For example, this:

imageURL :: Scraper BL.ByteString BL.ByteString
imageURL = attr "src" $ "img"

spits out the following error (with OverloadedStrings turned on):

/Users/Ben/projects/amazon-experiments/src/Main.hs:224:12:
    No instance for (Selectable r0) arising from a use of ‘attr’
    The type variable ‘r0’ is ambiguous
    Note: there are several potential instances:
      instance Selectable Text.HTML.Scalpel.Any
        -- Defined in ‘scalpel-0.3.1:Text.HTML.Scalpel.Internal.Select.Types’
      instance Selectable Selector
        -- Defined in ‘scalpel-0.3.1:Text.HTML.Scalpel.Internal.Select.Types’
      instance Selectable String
        -- Defined in ‘scalpel-0.3.1:Text.HTML.Scalpel.Internal.Select.Types’
    In the expression: attr "src"
    In the expression: attr "src" $ "img"
    In an equation for ‘imageURL’: imageURL = attr "src" $ "img"
/Users/Ben/projects/amazon-experiments/src/Main.hs:224:25:
    No instance for (Data.String.IsString r0)
      arising from the literal ‘"img"’
    The type variable ‘r0’ is ambiguous
    Note: there are several potential instances:
      instance Data.String.IsString
                 aeson-0.11.2.1:Data.Aeson.Types.Internal.Value
        -- Defined in ‘aeson-0.11.2.1:Data.Aeson.Types.Internal’
      instance (a ~ ByteString) =>
               Data.String.IsString
                 (attoparsec-0.13.0.2:Data.Attoparsec.ByteString.Internal.Parser a)
        -- Defined in ‘attoparsec-0.13.0.2:Data.Attoparsec.ByteString.Char8’
      instance (a ~ text-1.2.2.1:Data.Text.Internal.Text) =>
               Data.String.IsString
                 (attoparsec-0.13.0.2:Data.Attoparsec.Text.Internal.Parser a)
        -- Defined in ‘attoparsec-0.13.0.2:Data.Attoparsec.Text.Internal’
      ...plus 14 others
    In the second argument of ‘($)’, namely ‘"img"’
    In the expression: attr "src" $ "img"
    In an equation for ‘imageURL’: imageURL = attr "src" $ "img"

The way around this is to provide an explicit type for "img" and "src", e.g:

imageURL :: Scraper BL.ByteString BL.ByteString
imageURL = attr ("src" :: String) $ ("img" :: String)

which is a little tedious. I realize this is a limitation of the type system, it's not like the initial problem can be fixed outright. In practice, I've created a bunch of declarations like this:

src_ :: String
src_ = "src"

img_ :: String
img_ = "img"

-- and so on...

And used them like so:

imageURL :: Scraper BL.ByteString BL.ByteString
imageURL = attr src_ $ img_

It might be nice to provide some selectors for common HTML elements and attributes in the library itself for this purpose. What do you think?

scalpel-core (without curl)

It would be nice for me if you created something like scalpel-core without dependency on curl package or functions like scrapeURL.

For those who want to use Scalpel purely for scraping (like me), dependency on curl is pretty redundant. curl depends on libcurl, so users often fail to install it once (and then they install libcurl and try again). This is a little annoying.

Figure out story for queries that span multiple sub-trees.

Right now scalpel makes the assumption that an HTML document is a tree (possibly a malformed one) and that scraping involves selecting one or more sub-trees and extracting (the same) data from each sub-tree.

There are use cases (#41, #45) that don't fit into this model but would be nice to support.

This issue is for brainstorming API/architecture changes that would support these types of queries.
#41 could be solved by extending selectors to allow jumping between sibling sub-trees, maybe something that looked like:

<p class="something">Here</p>
<p>Other stuff that matters</p>
chroot "p" @: [hasClass "something"] $ do
  here <- text AnyTag
  otherStuff <- text $ rightSibling "p"

The problem with this is that it doesn't really extend to more complicated scenarios like those #45 which involves collecting a sequence of siblings until a certain condition is met.

\r\n

The Windows carriage-return/newline sequence (\r\n) is turned into a TagText.

Sequences

Hi. I like your library 👍.
However, I do not see any clear/obvious way how to parse (/scrape)

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>

into something like

type Title = String
type Paragraph = String -- For simplicity
data Part = Part Title [Paragraph]

expected :: [Part]
expected =
    [ Part "title2 1" ["text 1", "text 2"]
    , Part "title2 2" ["text 3"]
    , Part "title2 3" []
    ]

If I just miss something, would you consider adding this into examples. Or maybe a slight change in combinators? Or maybe introduce some sequence operator?

Probably related to issue #41.

Thanks :-).

Getting nth element?

If I'm working with html resembling the following,

<table>
  <tr>
    <td><strong>col 1</strong></td>
    <td><a href="special-link">col 2</a></td>
    <td>col k</td>
  </tr>
  <tr>
    <td>col 1</td>
    <td><a href="special-link">col 2</a></td>
    <td>col k</td>
  </tr>
  <tr>
    <td><a href="link">col 1</a></td>
    <td><a href="special-link">col 2</td>
    <td>col k</td>
  </tr>
</table>        

What is the suggested way to use Scalpel to extract the nth td from each tr? Notice that we can only be sure of the form of the column of interest, and that the other tds in each row have variable structure.

I'm resorting to hacky solutions that look like:

tableScraper :: Scraper String [String]
tableScraper = chroot ("table" :: String) $
    chroots ("tr" :: String) $ (!! 1) <$> columnLinkScraper

columnLinkScraper = chroots ("td" :: String) $
    attr ("href" :: String) ("a" :: String) <|> return ""

main :: IO ()
main = hspec $ do

  describe "scalpel" $
    it "parsed well" $ do
        contents <- readFile "test.html"
        let scrapedLinks = scrapeStringLike contents tableScraper
        case scrapedLinks of
           Just links -> links `shouldBe` replicate (length links) "special-link"
           Nothing -> scrapedLinks `shouldSatisfy` isJust

Is there a better way to do this? Does a chrootIndex :: Int -> s -> Scraper str a -> Scraper str a function make sense? chroot is essentially just chrootHead, after all. I've looked at the implementation of chroot and it seems like we would just need to use an alternative chroots that doesn't use the mapMaybe function and preserves the total number of matched root tags. I'm happy to write this function if everyone thinks it's a good idea.

Implicitly terminated <li> tags and atDepth

I'm having problems parsing a document that relies on HTML's implicit termination of <li> tags. I can't get it to work correctly in combination with scalpel's atDepth facility.

I've written a pair of test cases that I think shows the problem:

jforberg@f83413e

    -- This case passes
    ,   scrapeTest
            "atDepth correctly handles explicitly terminated <li> tags"
            "<body><ul><li>Li</li></ul><ul><li>La</li></ul></body>"
            (Just ["Li", "La"])
            (texts $ "body" // "li" `atDepth` 2)

    -- This case fails
    ,   scrapeTest
            "atDepth correctly handles implicitly terminated <li> tags"
            "<body><ul><li>Li</ul><ul><li>La</ul></body>"
            (Just ["Li", "La"])
            (texts $ "body" // "li" `atDepth` 2)

Based on my understanding of HTML, the two strings are equivalent representations of the same document. But the latter seems to be parsed incorrectly by scalpel (or maybe an underlying HTML library?).

My real use case is very similar to the test strings, except that there are nested <ul>'s which I need to avoid recursing into, hence the use of atDepth.

how to use attrs combinator

Not sure this should be an issue as much as a question, but I'm trying to get the value of all src attributes for img tags from a page. I'm trying something like this:

pageImgSources :: String -> IO (Maybe [Text])
pageImgSources url = scrapeURL url $ attrs (pack "src") $ (pack "img")

But seem to come up with Nothing every time. Would you mind giving me a pointer on this?

Thanks for this great library!

Make "withAll" always succeed

I need to match a field that may be optional. I figured I could use the ‘plural’ versions of selectors (e.g. texts) plus something like listToMaybe to ‘safely’ match only the first occurrence of an element, but this does not work as expected.

To my great surprise, texts also fails matching anything if there are 0 elements, which I did not expect. I expected it to return an empty list (like ‘many’ in Parsec). In keeping with the principle of least surprise, I would recommend changing the semantics of these matchers.

Build failure with GHC 8

> /tmp/stackage-build9$ stack unpack scalpel-0.3.0.1
Unpacked scalpel-0.3.0.1 to /tmp/stackage-build9/scalpel-0.3.0.1/
> /tmp/stackage-build9/scalpel-0.3.0.1$ runghc -clear-package-db -global-package-db -package-db=/home/stackage/work/builds/nightly/pkgdb Setup configure --package-db=clear --package-db=global --package-db=/home/stackage/work/builds/nightly/pkgdb --libdir=/home/stackage/work/builds/nightly/lib --bindir=/home/stackage/work/builds/nightly/bin --datadir=/home/stackage/work/builds/nightly/share --libexecdir=/home/stackage/work/builds/nightly/libexec --sysconfdir=/home/stackage/work/builds/nightly/etc --docdir=/home/stackage/work/builds/nightly/doc/scalpel-0.3.0.1 --htmldir=/home/stackage/work/builds/nightly/doc/scalpel-0.3.0.1 --haddockdir=/home/stackage/work/builds/nightly/doc/scalpel-0.3.0.1 --flags=
Configuring scalpel-0.3.0.1...
> /tmp/stackage-build9/scalpel-0.3.0.1$ runghc -clear-package-db -global-package-db -package-db=/home/stackage/work/builds/nightly/pkgdb Setup build
Building scalpel-0.3.0.1...
Preprocessing library scalpel-0.3.0.1...
[1 of 7] Compiling Text.HTML.Scalpel.Internal.Select.Types ( src/Text/HTML/Scalpel/Internal/Select/Types.hs, dist/build/Text/HTML/Scalpel/Internal/Select/Types.o )
[2 of 7] Compiling Text.HTML.Scalpel.Internal.Select.Combinators ( src/Text/HTML/Scalpel/Internal/Select/Combinators.hs, dist/build/Text/HTML/Scalpel/Internal/Select/Combinators.o )
[3 of 7] Compiling Text.HTML.Scalpel.Internal.Select ( src/Text/HTML/Scalpel/Internal/Select.hs, dist/build/Text/HTML/Scalpel/Internal/Select.o )

src/Text/HTML/Scalpel/Internal/Select.hs:78:28: error:
    • Couldn't match type ‘forall a. Maybe a’ with ‘CloseOffset’
      Expected type: (TagSoup.Tag str, Int)
                     -> ((TagSoup.Tag str, CloseOffset), Int)
        Actual type: (TagSoup.Tag str, Int)
                     -> ((TagSoup.Tag str, forall a. Maybe a), Int)
    • In the first argument of ‘map’, namely ‘(first (, Nothing))’
      In the expression: map (first (, Nothing))
      In the expression:
        map (first (, Nothing)) $ concat $ Map.elems state

src/Text/HTML/Scalpel/Internal/Select.hs:85:41: error:
    • Couldn't match type ‘Maybe Int’ with ‘forall a. Maybe a’
      Expected type: Maybe ((TagSoup.Tag str, forall a. Maybe a), Int)
        Actual type: Maybe ((TagSoup.Tag str, Maybe Int), Int)
    • In the expression: calcOffset <$> maybeOpen
      In the first argument of ‘catMaybes’, namely
        ‘[Just ((tag, Nothing), index), calcOffset <$> maybeOpen]’
      In the expression:
        catMaybes [Just ((tag, Nothing), index), calcOffset <$> maybeOpen]

src/Text/HTML/Scalpel/Internal/Select.hs:87:21: error:
    • Couldn't match type ‘forall a. Maybe a’ with ‘Maybe Int’
      Expected type: [((TagSoup.Tag str, CloseOffset), Int)]
        Actual type: [((TagSoup.Tag str, forall a. Maybe a), Int)]
    • In the expression: res ++ go xs state'
      In the expression:
        let
          maybeOpen = head <$> Map.lookup tagName state
          state' = Map.alter popTag tagName state
          res = catMaybes [...]
        in res ++ go xs state'
      In an equation for ‘go’:
          go (x@(tag, index) : xs) state
            | TagSoup.isTagClose tag
            = let
                maybeOpen = head <$> Map.lookup tagName state
                state' = Map.alter popTag tagName state
                ....
              in res ++ go xs state'
            | TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
            | otherwise = ((tag, Nothing), index) : go xs state
            where
                tagName = getTagName tag
                appendTag ::
                  Maybe [(TagSoup.Tag str, Int)] -> Maybe [(TagSoup.Tag str, Int)]
                appendTag m = (x :) <$> (m <|> Just [])
                calcOffset :: (t, Int) -> ((t, Maybe Int), Int)
                ....

src/Text/HTML/Scalpel/Internal/Select.hs:87:28: error:
    • Couldn't match type ‘Maybe Int’ with ‘forall a. Maybe a’
      Expected type: [((TagSoup.Tag str, forall a. Maybe a), Int)]
        Actual type: [((TagSoup.Tag str, CloseOffset), Int)]
    • In the second argument of ‘(++)’, namely ‘go xs state'’
      In the expression: res ++ go xs state'
      In the expression:
        let
          maybeOpen = head <$> Map.lookup tagName state
          state' = Map.alter popTag tagName state
          res = catMaybes [...]
        in res ++ go xs state'

src/Text/HTML/Scalpel/Internal/Select.hs:89:40: error:
    • Couldn't match type ‘forall a. Maybe a’ with ‘Maybe Int’
      Expected type: [((TagSoup.Tag str, CloseOffset), Int)]
        Actual type: [((TagSoup.Tag str, forall a. Maybe a), Int)]
    • In the expression: ((tag, Nothing), index) : go xs state
      In an equation for ‘go’:
          go (x@(tag, index) : xs) state
            | TagSoup.isTagClose tag
            = let
                maybeOpen = head <$> Map.lookup tagName state
                state' = Map.alter popTag tagName state
                ....
              in res ++ go xs state'
            | TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
            | otherwise = ((tag, Nothing), index) : go xs state
            where
                tagName = getTagName tag
                appendTag ::
                  Maybe [(TagSoup.Tag str, Int)] -> Maybe [(TagSoup.Tag str, Int)]
                appendTag m = (x :) <$> (m <|> Just [])
                calcOffset :: (t, Int) -> ((t, Maybe Int), Int)
                ....
      In an equation for ‘tagWithOffset’:
          tagWithOffset tags
            = let
                indexed = zip tags ...
                unsorted = go indexed Map.empty
                ....
              in map fst sorted
            where
                go ::
                  [(TagSoup.Tag str, Int)]
                  -> Map.Map str [(TagSoup.Tag str, Int)]
                     -> [((TagSoup.Tag str, CloseOffset), Int)]
                go [] state = map (first (, Nothing)) $ concat $ Map.elems state
                go (x@(tag, index) : xs) state
                  | TagSoup.isTagClose tag = let ... in res ++ go xs state'
                  | TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
                  | otherwise = ((tag, Nothing), index) : go xs state
                  where
                      tagName = getTagName tag
                      appendTag ::
                        Maybe [(TagSoup.Tag str, Int)] -> Maybe [(TagSoup.Tag str, Int)]
                      ....

src/Text/HTML/Scalpel/Internal/Select.hs:89:66: error:
    • Couldn't match type ‘Maybe Int’ with ‘forall a. Maybe a’
      Expected type: [((TagSoup.Tag str, forall a. Maybe a), Int)]
        Actual type: [((TagSoup.Tag str, CloseOffset), Int)]
    • In the second argument of ‘(:)’, namely ‘go xs state’
      In the expression: ((tag, Nothing), index) : go xs state
      In an equation for ‘go’:
          go (x@(tag, index) : xs) state
            | TagSoup.isTagClose tag
            = let
                maybeOpen = head <$> Map.lookup tagName state
                state' = Map.alter popTag tagName state
                ....
              in res ++ go xs state'
            | TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
            | otherwise = ((tag, Nothing), index) : go xs state
            where
                tagName = getTagName tag
                appendTag ::
                  Maybe [(TagSoup.Tag str, Int)] -> Maybe [(TagSoup.Tag str, Int)]
                appendTag m = (x :) <$> (m <|> Just [])
                calcOffset :: (t, Int) -> ((t, Maybe Int), Int)
                ....

html stream should be decoded before calling TagSoup's parseTags

I need to work on a page encoded in UTF-8 that also has HTML entities. The problem is that in that case entities aren't converted to their UTF-8 encoded representation; as a result I don't know how to recover all special characters on the page.

The problem comes from the fact that TagSoup converts html entities to their unicode code point equivalent; but scalpel doesn't pass a decoded string to TagSoup.

A small example:

Prelude Data.ByteString.Char8 Text.HTML.TagSoup> parseTags (pack "test\xC3\xA9 &#233; hop") :: [Tag ByteString]
[TagText "test\195\169 \233 hop"]

The resulting string mixes utf-8 and unicode code points.

  • \195\169 is the "é" character encoded in utf-8
  • \233 is the "é" character unicode code point.

To work around it I would need to be able to decode the text from utf-8 in downloadAsTags

downloadAsTags url = do
        maybeBytes <- openURIWithOpts url options
        return $ (TagSoup.parseTags . decodeEncoding . TagSoup.castString) <$> maybeBytes
decodeEncoding = undefined  -- to be done, how ?

Depth-First Search not really Depth-First ?

Thanks for this great, easy-to-use, smartly thought out, library.

According to comments, the selectNodes function is a DFS implementation. I'm not completely sure (manual recursion is not my forte), but I think it's not.

Example case:

$ stack ghci
> :set -XOverloadedStrings
> let xml = "<html><div><p>p1</p><p>p2</p><blockquote><p>p3</p></blockquote><p>p4</p></div></html>"
> scrapeStringLike xml (texts $ "div" // "p")
Just ["p1","p2","p4","p3"]

But this XML:

<html>
  <div>
    <p>p1</p>
    <p>p2</p>
    <blockquote><p>p3</p></blockquote>
    <p>p4</p>
  </div>
</html>

Should, in DFS, be processed in the order given by the letters in parentheses in the example below:

Span 0 17 (A)
|
`- Span 1 16 (B)
   |
   +- Span 2 4 (C)
   |
   +- Span 5 7 (D)
   |
   +- Span 8 12 (E)
   |  |
   |  `- Span 9 11 (F)
   |
   `- Span 13 15 (G)

So the selector should yield Just ["p1","p2","p3","p4"].

Unless the previous result was the intended behaviour? In which case I think the comments and README should warn about this.

Suggested fix

In this pattern (https://github.com/fimad/scalpel/blob/master/scalpel-core/src/Text/HTML/Scalpel/Internal/Select.hs#L222), we should not have:

| otherwise = selectNodes [n] (tags, fs, ctx) $ selectNodes [n] (tags, Tree.subForest f, ctx) acc

Or we won't get depth-first; we should rather have:

| otherwise  = selectNodes [n] (tags, Tree.subForest f, ctx) $ selectNodes [n] (tags, fs, ctx) acc

This passes the tests and gives the proper result with the earlier ghci example. I have a PR ready to fire if needed.

(a -> Maybe b) -> Scraper str a -> Scraper str b

Thanks to the Functor instance, it is easy to process the result of a scraper to get another scraper. However, I could not find any way to do it with a processing that might fail.

Use case: I read some integer value from an attribute that I convert to Int, and I'd like it to fail gracefully.

read <$> attr "data-count" "div"  -- Exception when there's no parse

What I'd like

read <$?> attr "data-count" "div"  -- The scraper returns Nothing when there's no parse

The implementation is simple provided access to MkScraper (which is internal). The hard part is finding the right function name or the right symbol :)

Consider changing the type of scrape, html, and related functions

scrape :: _ => Scraper str a -> [Tag str] -> Maybe a (and friends) lead to the awkward situation of scrape (htmls Any) :: [Tag str] -> Maybe [a], which IIUC will never return Just [].

Changing the type of e.g. html to _ => Scraper str (Maybe str) also has the benefit that patterns like maybe "default" <$> html become possible, leading to a richer monadic EDSL.

howto extract all attribute pairs of a tag

is there a way to extract all attributes with their values of tags?
example"

  <img src="foo.gif" title="My Foo">
  <img src="bar.gif" title="My Bar">

I would like to get a list of tuples
[("foo.gif","My Foo"), ("bar.gif","My Bar")]

I am able to find all img tags with a src attribute, but I see no way to get the title to the corresponding img tag.

Scalpel completely fails for some sites

Some sites return no markup at all, or just fail.

I made a small test case to reproduce the issue.

Change the first argument to scrapeURL to one of the other URLs to test.

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.4.0

import Text.HTML.Scalpel

-- SUCCESS: Prints all the HTML
reed = "http://www.reed.co.uk/jobs/london?keywords=javascript"

-- SUCCESS: Prints all the HTML
indeed = "http://www.indeed.co.uk/jobs?q=javascript&l=london"

-- FAILED: Prints the string "Failed"
jobsite = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"

-- FAILED: Doesn't print anything at all, which I think translates to a result
-- of `Just []`
monster = "http://www.monster.co.uk/jobs/search/?q=javascript&where=London"

main :: IO ()
main = do
  html <- scrapeURL monster $ htmls anySelector
  maybe printError printHtml html
  where
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn

Generalize singular and plural scrapers

Having singular and plural versions of all top-level scraper functions seems like a poor API. A singular scraper could easily be built from a plural scraper by simply selecting the first one. Getting optimal performance might be a little tricky but certainly not impossible. Have you considered this?

[Q] Combinator with default values?

A Scraper might fail for multiple reasons; for example, a web page might not contain every field for certain query configurations.

What's a good approach for scraping optionally-present fields?

Expose function to construct AttributePredicate generally

It would be nice to have a function match :: (String -> String -> Bool) -> AttributePredicate for maximum flexibility.

There are, of course, other potential types, such as match :: (forall str. StringLike str => Attribute str -> Bool) -> AttributePredicate, but the simplest one seems like a good start.

// reverses the order of selection

Hi, I discovered that calling something like...

htmls x

will match results in the opposite order as

htmls (x // y)

(assuming that there is one y inside each x).

Is this intended? Thanks!

[question] How to select siblings?

In libraries like jQuery/cheerio, given an HTML document like:

<p class="something">Here</p>
<p>Other stuff that matters</p>

You can select "Other stuff that matters" with a selector like: .something+p.

This structure, while not my cup of tea, is used every now and then on websites such as http://hackage.haskell.org.

Is there a way to do this?

nested selector gives redundant result

With scalpel-0.3.0.1, I ran the following.

import Text.HTML.Scalpel (Scraper, attrs, (//), scrapeStringLike)

nestedDivs :: String
nestedDivs = "<div id=\"outer\"><div id=\"inner\">inner text</div></div>"

idScraper :: Scraper String [String]
idScraper = attrs "id" ("div" // "div")

main :: IO ()
main = do
  print $ scrapeStringLike nestedDivs idScraper

and got the result:

Just ["outer","inner","inner"]

but I had expected:

Just ["inner"]

I don't understand why I got that result. Is it a bug, or it's expected behavior?

Class for things that can be parsed by Scalpel

This would be very useful for things like interacting with Servant. I also like the way Aeson uses this for things like polymorphic .:, so that could be worth looking into.

See this for the current class we are using for this purpose and this for the way we are integrating it with Servant.

Add generalized repetition

matchAll :: Scraper a b -> Scraper a [b], generalizing htmls, attrs, texts, and chroots from their singular forms.

This is useful in case I want to matchAll (html "a" <* (attr "title" "a" >>= \x -> guard (somePredicate x))).

(Yes, I actually ran into this.)

many doesn't solve the problem because it's alternation; replicateM and friends don't solve the problem because each Scraper looks from the current spot.

Scraper for inner HTML

Currently there's the text scraper that grabs the textual content of the selected node, and there's the html scraper that gets the whole selected node. I'd like a scraper that gives me only the inner HTML, without the selected tag itself.

Keeping HTML after scraping

I'm having a lot of fun getting to know scalpel! I was wondering - is there a way to run the Scraper but still keep the html (not just take the inner text)?

Select nth element?

Hi,

in scalpel there some functions that simply return the "first match" - e.g. text or chroot. Is there a way to select/chroot the nth matching element? Suppose, I have the following HTML:

One

Two

Now I want to select only the second - is there a way to do this?

Thanks,
Marius

Hard to use scalpel with OverloadedStrings enabled.

Related to: #4

As of scalpel-0.3.0.1, it's hard to use with OverloadedStrings enabled.

As explained in #4, the reason is that GHC cannot figure out the concrete type for string literals for Selectable, AttributeName and TagName classes. As a result, we always have to specify the type (e.g., ("div" :: String)).

One solution to this problem is to remove Selectable class and make Selector an instance of IsString.

data Selector = ...

tagSelector :: String -> Selector
tagSelector tag = tag @: []

-- | Substitute for 'Any'.
anySelector :: Selector
anySelector = ...

instance IsString Selector where
  fromString = tagSelector

and then

(//) :: Selector -> Selector -> Selector 

That way, we always deal with the concrete type Selector. With OverloadedStrings enabled, we can use a string literal for Selector.

The same goes for TagName and AttributeName classes. For those classes, we have to introduce corresponding data types, though.

Downsides of this approach are:

  • It destroys backward-compatibility. So if I were to do that, I would make a new module for this new API. The API of Text.HTML.Scalpel would stay, but its implementation would be rewritten.
  • This new API may be hard to use WITHOUT OverloadedStrings, compared with the current API. It's OK for me. I almost always use OverloadedStrings.

How to use (@=~)?

It looks like a convenience operator but I don't see how to use it easily. The RegexLike instance in Text.Regex.TDFA.String wants Regex, which is not convenient to create. How do you mean it to be used? I came up with the following after a fair deal of browsing the docs:

import Text.Regex.TDFA
import Text.Regex.TDFA.String (compile)

  let re = forceRight $ compile defaultCompOpt defaultExecOpt "b*"
  sym <- text ("a" @: ["href" @=~ re])

hasClass doesn't work when attribute contains whitespace

While trying to scrape haddock, I found scalpel can't find <div class="subs methods">.
Following is a minimum reproduction.

file.html

<p class="foo"> foo </p>
<p class="foo bar"> foo bar</p>
> file <- readFile "./file.html"
> scrapeStringLike file (html $ "p" @: [hasClass "foo"])
Just "<p class=\"foo\"> foo </p>"
> scrapeStringLike file (html $ "p" @: [hasClass "foo bar"])
Nothing

Hard to use on ByteStrings. expects tag names to be the same type

Hi, thanks for your hard work! Very useful library.

I'm having a hard time using it to scrape a lazy ByteString. I'm using scrape after parsing the tags with tag soup.

So I have to make all my scraping functions be Scraper Bytestring, right? In order to get this to work with OverloadedStrings , I have to add the :: ByteString to the tag name. Shouldn't it be able work without that?

scrapeTitle :: Scraper ByteString ByteString
scrapeTitle = text ("title" :: ByteString)

It seems like the type of the tag name is tied to the type of the string I'm parsing. Is there any way to have the tag names always be the same type, and just convert it when you need to compare it later?

Allow selecting bare text nodes

Unfortunately I don't think position would help with that example since there is currently no way to select bare text nodes. One of the assumptions scalpel makes is that anything you'd want to select is between <tags>.

It's also not immediately clear how to expose bare text selection in a way that would be backwards compatible. My current thinking is to create an additional value for SelectNode for text nodes. That would let you do something like the following to grab the second text node under an <h2>:

chroot "h2" $ 
  chroots textSelector $ do
    p <- position
    guard (p == 1)
    text textSelector

With an API like the one proposed in #21 you could do something even more snazzy like: text ("h2" /// textSelector) to grab just the text nodes that are direct children of the <h2>.

The potential issue here though is that allowing selection of bare text nodes would create a breaking change in the behavior of anySelector. For example, scrapeStringLike "<a>text</a>" $ texts anySelector currently returns Just ["text"] but if we treated each text node as selectable then it would return Just ["text", "text"].

This might be an OK breaking change though since I think the most useful use of anySelector is to select the current root node in a chroot block like the examples in the read me.

Originally posted by @fimad in #48 (comment)

chroot does not backtrack if guard fails

If guard is used within the inner scraper passed to chroot scalpel should backtrack and try to find a tag that matches if the first tag fails. Currently this backtracking does not happen.

For example,

scrapeStringLike "<a>foo</a><a>bar</a><a>baz</a>" $ chroot "a" $ do
        t <- text Any
        guard ("b" `isInfixOf` t)
        html Any

Currently returns Nothing, but should return Just "<a>bar</a>"

Allow direct manipulation of TagSpec object

Sometimes I would rather work with the node tree (and thus TagSpec) itself rather than the Scraper / SerialScraper interface.

It would be optimal for my use case if TagSpec and various functions for manipulating it (children, name etc.) were exposed as a low level api. The current high level api would then be a layer on top of that and would be the same as it is currently, except perhaps some extra functions for dropping into the low level api when desired.

Of course TagSpec itself would have to be an abstract data type with a hidden constructor / fields rather than a tuple to preserve various invariants from being violated. It would also probably be worth renaming the type to something like Html or Nodes or similar. Another thing to consider would be whether or not its worth having explicit types for when you know you have a single node vs potentially zero or multiple nodes (Tree/Node vs Forest/Nodes/Html) to make functions like name :: Node str -> str make more sense.

Single Level Selectors

I can't see any way to select tags that are immediate children of a parent rather than ones at an arbitrary nesting. Basically I have some html like this:

<div id="a">
   <div>
      <span></span>
      <span></span>
   </div>
   <span></span>
</div>

I'd like to chroot into the nested div and do stuff stuff that takes into account the nested spans. I'd ALSO like to process all the spans at the top level of div "a" without touching those under the other div.

In my actual use case I was able to work around this because of what I was doing. In general, though, this is a feature I would expect from any scraping API. If you were super-motivated and just added css selectors that would help a lot :p

I hope you can solve this, this is one of the most pleasant scraping APIs I've ever used, much more pleasant that handsomesoup.

Negate Selector?

How do I select tags that lack a certain class, or in general fails to satisfy an AttributePredicate?
This would be simple if there was some function like,

notP :: AttributePredicate -> AttributePredicate
notP (MkAttributePredicate f) = MkAttributePredicate (not . f)

but MkAttributePredicate isn't exposed.

Is there another way to do this, or should I make a pull request about adding something like notP?

Optional Scrapers?

Example: Scraping Tweets. Some tweets have location information, and some don't. Some tweets have an extra "card url", and some don't.

If I define a scraper like this:

type ScrapeReturn = (T.Text, T.Text, T.Text, T.Text, T.Text, T.Text, T.Text)

tweetScraper :: Scraper T.Text [ScrapeReturn]
tweetScraper = tweets
   where
       tweets :: Scraper T.Text [ScrapeReturn]
       tweets = chroots ("div" @: [hasClass "js-stream-tweet"]) infos

       infos :: Scraper T.Text ScrapeReturn
       infos = do
           author <- attr "data-screen-name" Any
           id <- attr "data-tweet-id" Any
           body <- text $ "div"  @: [hasClass "js-tweet-text-container"]
           counters <- texts $ "span" @: [hasClass "ProfileTweet-actionCountForPresentation"]
           let retweets = head counters
           let likes = counters !! 2
           location <- text $ "span" @: [hasClass "Tweet-geo"]
           card_url <- attr "data-card-url" ("div"  @: [hasClass "js-macaw-cards-iframe-container"])
           return (id, author, location, retweets, likes, T.strip body, card_url)

then it will only return scraped values for those tweets that have both a location and card_url. That is, nothing at all will be returned for a huge majority of tweets, because most tweets are missing either a location or a card_url.

Is it possible to define a Scraper as optional, rather than a necessary match that causes the Scraper to return nothing when it isn't matched?

Or, is there an "and" operator, as opposed to the <|> operator? I could do something like scrape all the locations AND all the card urls AND all the rest of the infos?

Or, this would also be easily achievable with a Scraper that returned a fixed value, something like Scraper "", which returns the empty String. Then I could use the OR operator: location <|> Scraper "".

Please expose Internal modules :)

Please consider exposing internal modules, even if they're not part of the versioning scheme. As a library user, it'd be much easier for me to jump through the docs via links and understand what's going on (and why some function is missing and whether I can implement it myself), if the Internal modules are made available.
