TagSoup

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the Sample file from the source repository. The examples we give are:

  • Obtaining the last modified date of the Haskell wiki
  • Obtaining a list of Simon Peyton Jones' latest papers
  • A brief overview of some other examples

The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. In the examples general hints on screen scraping are included, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any given time, you may be in for a shock!

This library was written without knowledge of the Java version of TagSoup. They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.
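
For example, parsing a small nested fragment in GHCi shows the flat structure - the result is just a list of open, text and close tags, with no nesting:

Prelude Text.HTML.TagSoup> parseTags "<p><b>bold</b> text</p>"
[TagOpen "p" [],TagOpen "b" [],TagText "bold",TagClose "b",TagText " text",TagClose "p"]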

Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.

Potential Bugs

There are two things that may go wrong with these examples:

  • The websites being scraped may change. There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.
  • The openURL method may not work. This happens quite regularly; depending on your server, your proxies and the direction of the wind, it may not work. The solution is to use wget to download the page locally, then use readFile instead, as sketched below. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.
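
A minimal sketch of that fallback, assuming the page has already been saved locally as temp.htm by wget:

module Main where

import Text.HTML.TagSoup

-- Assumes: wget -O temp.htm "http://wiki.haskell.org/Haskell" was run beforehand.
main :: IO ()
main = do
    src <- readFile "temp.htm"
    print $ take 5 $ parseTags src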

Last modified date of Haskell wiki

Our goal is to develop a program that displays the date that the wiki at wiki.haskell.org was last modified. This example covers the basics of designing a simple web-scraping application.

Finding the Page

We first need to find where the information is displayed and in what format. Taking a look at the front web page, when not logged in, we see:

<ul id="footer-info">
  <li id="footer-info-lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li id="footer-info-copyright">Recent content is available under <a href="/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">simple permissive license</a>.</li>
</ul>

So, we see that the last modified date is available. This leads us to rule 1:

Rule 1: Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will modify the displayed HTML with scripting, and some pages will display differently depending on your cookies. Before you can figure out how to start scraping, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

Using the HTTP package

We can write a simple HTTP downloader using the HTTP package:

module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src

Now open temp.htm, find the fragment of HTML containing the last modified date, and examine it.

Finding the Information

Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:

Rule 2: Do not be robust to design changes, do not even consider the possibility when writing the code.

If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust talk to the site owner, or buy the data from someone. If you try and think about design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.

So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice id or class attribute - something which is unlikely to occur multiple times. In the above example, an id with value footer-info-lastmod seems perfect.

module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime

Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of Tags with parseTags. The fromFooter function does the interesting thing, and can be read right to left:

  • First we throw away everything (dropWhile) until we get to an li tag containing id=footer-info-lastmod. The (~==) and (~/=) operators are different from standard equality and inequality since they allow additional attributes to be present. We write "<li id=footer-info-lastmod>" as syntactic sugar for TagOpen "li" [("id","footer-info-lastmod")]. If we just wanted any open tag with the given id attribute we could have written (~== TagOpen "" [("id","footer-info-lastmod")]) and this would have matched. Any empty strings in the second element of the match are considered as wildcards. (The short GHCi session after this list illustrates the matching.)
  • Next we take two elements: the <li> tag and the text node immediately following.
  • We call the innerText function to get all the text values from inside, which will just be the text node following the footer-info-lastmod.
  • We split the string into a series of words and drop the first six, i.e. the words This, page, was, last, modified and on.
  • We reassemble the remaining words into the resulting string 9 September 2013, at 22:38.
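
To see the wildcard matching in action, here is a quick GHCi check of the rules described above:

Prelude Text.HTML.TagSoup> TagOpen "li" [("id","footer-info-lastmod")] ~== "<li id=footer-info-lastmod>"
True
Prelude Text.HTML.TagSoup> TagOpen "li" [("id","footer-info-lastmod")] ~== "<li>"
True
Prelude Text.HTML.TagSoup> TagOpen "li" [("id","other-id")] ~== "<li id=footer-info-lastmod>"
False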

This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.

Rule 3: TagSoup is for extracting information where structure has been lost, use more structured information if it is available.

Simon's Papers

Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his home page. The largest change from the previous example is that now we desire a list of papers, rather than just a single result.

As before we start by writing a simple program that downloads the appropriate page, and we look for common patterns. This time we want to look for all patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.

First we spot that the page helpfully has named anchors, there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple take/drop pair:

takeWhile (~/= "<a name=haskell>") $
drop 5 $ dropWhile (~/= "<a name=current>") tags

This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:

map f $ sections (~== "<A>") $ ...

Remember that the function to select all tags with name "A" could have been written as (~== TagOpen "A" []), or alternatively isTagOpenName "A". Afterwards we map each item with an f function. This function needs to take the tags starting just after the link, and find the text inside the link.

f = dequote . unwords . words . fromTagText . head . filter isTagText

Here the complexity of interfacing to human-written markup comes through. Some of the links are in italic, some are not - the filter drops everything that is not a text node, until we find a pure text node. The unwords . words deletes all multiple spaces, replaces tabs and newlines with spaces and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name, some are not - dequote will remove the quotes if they exist.
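
As a quick illustration of the unwords . words trick:

Prelude> (unwords . words) "  A   History\n\tof Haskell  "
"A History of Haskell"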

For completeness, we now present the entire example:

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

spjPapers :: IO ()
spjPapers = do
        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
        let links = map f $ sections (~== "<A>") $
                    takeWhile (~/= "<a name=haskell>") $
                    drop 5 $ dropWhile (~/= "<a name=current>") tags
        putStr $ unlines links
    where
        f :: [Tag String] -> String
        f = dequote . unwords . words . fromTagText . head . filter isTagText

        dequote ('\"':xs) | last xs == '\"' = init xs
        dequote x = x

main :: IO ()
main = spjPapers

Other Examples

Several more examples are given in the Sample.hs file, including obtaining the (short) list of papers from my site, getting the current time and a basic XML validator. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two for enjoyment only.

My Papers

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

ndmPapers :: IO ()
ndmPapers = do
        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
        let papers = map f $ sections (~== "<li class=paper>") tags
        putStr $ unlines papers
    where
        f :: [Tag String] -> String
        f xs = fromTagText (xs !! 2)

main :: IO ()
main = ndmPapers

UK Time

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

currentTime :: IO ()
currentTime = do
    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
    putStrLn time

main :: IO ()
main = currentTime

Other Examples

In Sample.hs the following additional examples are listed:

  • Google Tech News
  • Package list from Hackage
  • Print names of story contributors on sequence.complete.org
  • Parse rows of a table

Issues

Incorrect source position before bogus comment

Bug #70 is fixed, but the same bug with bogus comments is not:

> parseTagsOptions Text.HTML.TagSoup.parseOptions{ optTagWarning = True, optTagPosition = True } "<div><!--foo-->bar</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 6,TagComment "foo",TagPosition 1 16,TagText "bar",TagPosition 1 19,TagClose "div"]
> parseTagsOptions Text.HTML.TagSoup.parseOptions{ optTagWarning = True, optTagPosition = True } "<div><?foo</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 8,TagOpen "?foo<" [("div","")],TagPosition 1 13,TagWarning "Unexpected \"/\"",TagPosition 1 17,TagWarning "Expected \"?>\""]

Note the TagPosition 1 8 in the second example.

empty attribute values are omitted in rendering

I'd like to reopen "renderTags renders empty attributes <body style=""> as <body style>".

I do have some problems with this, as in the legacy html I am processing there are some content-less attributes that were included solely for being there (like <img alt="" …>), and I don't want to change anything about that (for now, of course a proper solution is omitting them entirely or filling them with values).

If you don't want to alter the data model (like wrapping the value in Maybe), a quick fix might be to distinguish "boolean" attributes which are commonly empty from "text" attributes, so that you can both produce the empty attribute syntax for the former and the empty string value for the latter respectively. This would also work around the problems with the doctype that were mentioned in the google code issue.

Provide a more primitive parsing function

Hi,

it would be great if there was a more primitive parsing function that would allow usage of tagsoup in enumerators, pipes or conduits, e.g.

parseTag :: ByteString -> (Maybe Tag, ByteString)

or a return value similar to attoparsec's IResult type
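
None of this exists in tagsoup today, but a sketch of what such a hypothetical attoparsec-style result type for tags might look like:

import Text.HTML.TagSoup (Tag)

-- Hypothetical incremental result type, by analogy with attoparsec's IResult.
data TagResult str
    = TagDone str (Tag str)             -- one parsed tag, plus the remaining input
    | TagPartial (str -> TagResult str) -- the parser needs more input to decide
    | TagFail str                       -- input that could not be parsed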

Preserve collapsed tags

From https://code.google.com/p/ndmitchell/issues/detail?id=359

From looking at the positioning data, it seems like the rendering function
could take advantage of whether both the open and closing tag have the exact
same positioning data to make the output the same as input. Is it possible
you could make this change?

Fiddly at the moment, but with the new position information becomes more feasible.

Entities conversion doesn't use document's encoding, result is unusable.

I need to work on a page encoded in UTF-8 that also has HTML entities. The problem is that in that case entities aren't converted to their UTF-8 encoded representation; as a result I don't know how to recover all special characters on the page.

A small example:

Prelude Data.ByteString.Char8 Text.HTML.TagSoup> parseTags (pack "test\xC3\xA9 &#233; hop") :: [Tag ByteString]
[TagText "test\195\169 \233 hop"]

The resulting string mixes utf-8 and iso-8859-1 encodings.

\195\169 is the "é" character encoded in utf-8
\233 is the "é" character encoded in iso-8859-1

I have no idea (yet) how to work around it.
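
One possible workaround (a sketch, not from the original thread) is to decode the bytes to Text before parsing, so entity resolution produces proper Unicode characters:

import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Text.HTML.TagSoup

-- Decode first, then parse: both raw UTF-8 bytes and numeric entities
-- end up as proper Unicode characters in the resulting Text.
parseUtf8 :: B.ByteString -> [Tag Text]
parseUtf8 = parseTags . decodeUtf8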

Incorrect source position before comment.

With tagsoup 0.14.2:

Prelude Text.HTML.TagSoup> parseTagsOptions parseOptions{ optTagWarning = True, optTagPosition = True } "<div><!--foo-->bar</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 8,TagComment "foo",TagPosition 1 16,TagText "bar",TagPosition 1 19,TagClose "div"]

If I'm not terribly confused, the third token should be TagPosition 1 6, not TagPosition 1 8. The open tag consists of five characters. Is this a bug?

See jgm/pandoc#4282.

Current UK time sample does not work due to change in URL and HTML layout

Program currently fails as follows:

$ stack exec tagsoup-test
tagsoup-test: Prelude.!!: index too large

Firstly, the HTTP request fails due to a 301 "moved permanently" response from the server. Secondly, the HTML layout has changed slightly: we should look for a <span> instead of a <strong> element. I'll update the sample.

Is it even possible to make `tagTree` lazy?

https://github.com/ndmitchell/tagsoup/blob/master/Text/HTML/TagSoup/Tree.hs#L29 contains a comment saying:

-- | Convert a list of tags into a tree. This version is not lazy at
--   all, that is saved for version 2.

The thing is though, is it even possible to make this lazy? I think the answer is no for the current API. Consider this:

(TagOpen "html" []):undefined

Would this output

(TagBranch "html" [] undefined):undefined

or

(TagLeaf $ TagOpen "html" []):undefined

if made lazy? The only thing we know is that it is of the form

_:_

so that is all a lazy version could output at the moment.

There are three solutions I can think of:

  1. Tell tagTree which tags should have closing tags, and which shouldn't, and have it raise an error or do some other behavior when this assumption is violated. No user facing types would need to be changed. This is the least safe option though.
  2. Somehow merge TagBranch and TagLeaf . TagOpen (and I think this was done in the past). You could have a lazy boolean flag (evaluating this boolean flag would cause the list to be evaluated). The amount of laziness possible, while still keeping it a tree, is still limited though. You could not lazily see what comes after that part of the tree. You would be able to scan its inner HTML lazily though. This requires changing user-facing types. It also sacrifices the tree-like nature of the datatype somewhat.
  3. Not making it lazy.

Modify HTML

How would I modify the content of some tag or attribute and return the modified version of the full HTML document?
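
One answer, as a sketch: parse to a tag list, rewrite it tag by tag, and render it back. For example, turning every <b> into <strong>:

import Text.HTML.TagSoup

-- Parse, rewrite tag-by-tag, and render back to a String.
boldToStrong :: String -> String
boldToStrong = renderTags . map f . parseTags
  where
    f (TagOpen "b" attrs) = TagOpen "strong" attrs
    f (TagClose "b")      = TagClose "strong"
    f t                   = t

Note that renderTags normalises entity escaping and attribute quoting, so untouched parts of the document are not guaranteed to round-trip byte for byte.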

Schema validation

Is schema validation a possibility for this library? This is an important feature, especially when working with well-formatted XML.

TagPosition not updated properly before bare & character

The bug is shown by the following example:

Prelude Text.HTML.TagSoup> parseTagsOptions  Text.HTML.TagSoup.parseOptions{ optTagWarning = False, optTagPosition = True} "<a>&</a>"
[TagPosition 1 1,TagOpen "a" [],TagPosition 1 5,TagText "&",TagPosition 1 5,TagClose "a"]

Note that the TagPosition is the same before and after TagText "&".

I would expect instead:

[TagPosition 1 1,TagOpen "a" [],TagPosition 1 4,TagText "&",TagPosition 1 5,TagClose "a"]

Compare:

Prelude Text.HTML.TagSoup> parseTagsOptions  Text.HTML.TagSoup.parseOptions{ optTagWarning = False, optTagPosition = True} "<a>x</a>"
[TagPosition 1 1,TagOpen "a" [],TagPosition 1 4,TagText "x",TagPosition 1 5,TagClose "a"]

This is related to two pandoc bugs, jgm/pandoc#4094, jgm/pandoc#4088.

Escape single quote (') characters as &#39;

I was surprised to discover that the escapeXml function escapes double quotes (") but not single quotes ('). Digging into the source code of tagsoup, I found this comment:

-- | A table mapping XML entity names to resolved strings. All strings are a single character long.
-- Does /not/ include @apos@ as Internet Explorer does not know about it.
xmlEntities :: [(String, String)]
xmlEntities = let a*b = (a,[b]) in
["quot" * '"'
,"amp" * '&'
-- ,"apos" * '\'' -- Internet Explorer does not know that
,"lt" * '<'
,"gt" * '>'
]

This suggests that the reason that single quotes aren't escaped is due to Internet Explorer not supporting &apos;. But this feels a bit too conservative, since Internet Explorer does support &#39;, as suggested here. (Credit goes to this stache pull request for that idea.)

Would you be open to escapeXml escaping single quotes as &#39; instead?
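
As a stop-gap one could post-process the escaped output; a sketch, where escapeApos is a hypothetical helper composed as escapeApos . escapeXml:

-- Hypothetical helper: escape single quotes after the library's escaping has run.
escapeApos :: String -> String
escapeApos = concatMap (\c -> if c == '\'' then "&#39;" else [c])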

Issue about working with `OverloadedStrings` extension

I came across the same error once asked here: http://stackoverflow.com/questions/7352213/haskell-tagsoup-library-with-overloadedstrings

The error looks like:

No instance for (TagRep t0) arising from a use of ‘~==’
    The type variable t0 is ambiguous
    Note: there are several potential instances:
      instance Text.StringLike.StringLike str => TagRep (Tag str)
        -- Defined in ‘Text.HTML.TagSoup’
      instance TagRep String -- Defined in ‘Text.HTML.TagSoup’
    In the first argument of sections, namely (~== "<input>")
    In the first argument of (.), namely sections (~== "<input>")
    In the second argument of (<$>), namely
      sections (~== "<input>")
       . takeWhile (~/= s "<input name=user>")

Currently I'm using the workaround posted in the answer, adding a helper function:

s :: String -> String
s = id

-- have to add `s` before string everywhere.. for example 
let sth = sections (~== s "<input>") . takeWhile (~/= s "<input name=user>") $ tags

It still looks wordy. Is there a better solution to this problem?
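
One alternative sketch is to pin the literal's type inline, which resolves the TagRep ambiguity without a named helper:

-- Inline annotations instead of the s helper.
let sth = sections (~== ("<input>" :: String)) . takeWhile (~/= ("<input name=user>" :: String)) $ tags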

Parsing arbitrary numerical tags

While trying to parse the output from Hoogle in order to extract function names:

fmapFromHogle ="<span class=name><0>fmap</0></span> :: Functor f =&gt; (a -&gt; b) -&gt; f a -&gt; f b"

*Main λ> TS.parseTags fmapFromHogle
[TagOpen "span" [("class","name")],TagText "<0>fmap",TagComment "0",TagClose "span",TagText " :: Functor f => (a -> b) -> f a -> f b"]

the <0> isn't parsed as a tag; if I replace it with an <x>, for example, it works.
<x1> works as well, but <1x> doesn't. With <x> the output is:

[TagOpen "span" [("class","name")],TagOpen "x" [],TagText "fmap",TagClose "x",TagClose "span",TagText " :: Functor f => (a -> b) -> f
a -> f b"]

Incorrect parsing of tags inside `<script>` environments?

It's possible that I don't understand the HTML 5 parsing algorithm correctly, but it doesn't seem to me that tagsoup should be finding an img tag inside a quoted javascript string (this is with tagsoup 0.13.1, ghc 7.8.3):

Prelude Text.HTML.TagSoup> let x = "<style type=\"text/javascript\">\nvar x = \"<img src=\\\"./image/pic.png\\\" alt=\\\"pic.png\\\" width=400 />\"\n</style>\n" 
Prelude Text.HTML.TagSoup> parseTags x
[TagOpen "style" [("type","text/javascript")],TagText "\nvar x = \"",TagOpen "img" [("src","\\\"./image/pic.png\\\""),("alt","\\\"pic.png\\\""),("width","400")],TagClose "img",TagText "\"\n",TagClose "style",TagText "\n"]

Is this a bug? Cf. jgm/pandoc#1489.

Eliminate Text.HTML.Download

From https://code.google.com/p/ndmitchell/issues/detail?id=368

One user reported that the network dependency in tagsoup was causing issues. Since this is only used by a deprecated part of the package it seems worth trying to remove.

A search of Hackage reveals only two users of this module, hackage-plot and hackage-sparks, both by Dons. Plan is to create patches, get them submitted, wait for these packages to be uploaded, and then delete it.

Dons didn't reply, so I have made them off by default; they can be installed with a flag.

~== should be case insensitive

From https://code.google.com/p/ndmitchell/issues/detail?id=385

José Romildo Malaquias writes:

Currently the comparison of the name of open and close tags in the
inexact match operator (~==) of the Text.HTML.TagSoup package is case
sensitive. Maybe it is better to let this comparison be case
insensitive, as it should be irrelevant to the meaning of the HTML
code. The same would apply to the name of attributes in open tags.

I have just noticed that the package tagsoup-parsec already has parsers
for open and close tags which are already case insensitive.

I replied: Perhaps that does make sense. Not sure, but definitely worth thinking about.

Come to think of it, does anyone searching for <span> not also want to find <SPAN>? I doubt it. Perhaps the tag names and attribute names should be case insensitive, but the text values should still be case sensitive.
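
Until something like that exists, a normalising pre-pass is easy to sketch (lowercasing tag and attribute names, but leaving text nodes alone):

import Data.Char (toLower)
import Text.HTML.TagSoup

-- Lowercase tag and attribute names so case-sensitive matching behaves
-- as if it were case insensitive; text values are left untouched.
lowerNames :: [Tag String] -> [Tag String]
lowerNames = map f
  where
    low = map toLower
    f (TagOpen name attrs) = TagOpen (low name) [(low k, v) | (k, v) <- attrs]
    f (TagClose name)      = TagClose (low name)
    f t                    = t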

Redesign the parser

The current parser is a String parser that can also plug in ByteString and Text, and then treats them like String, which is inefficient. The plan (which will fix #12, #13, #11 and #5) is to:

  • Read the HTML5 spec a lot. Write it as a DSL. Have that DSL generate code pretty directly which is high performance. That will either be a deep embedding, or a shallow embedding that might spit out some generated Haskell (I suspect the latter).

  • Have the primitive parser have specialisations for different types. There should be a strict Text and ByteString one, and one with and without position numbers. These parsers should take a strict buffer, parse what it can, and output a stream of "steps". Something like:

    parseable :: IsText a => S -> a -> ([Fragment a], S)

The Fragment would say things like "start a tag", "start an attribute name", "start an attribute value", "here is some text". The S would be the state to continue with more data. Given this interface, you can write a lazy parser which is fast because it is strict in each chunk of the strict data structure.

CC @ChristopherKing42 who was asking about this on another ticket.

Hit count sample no longer works

Unfortunately, the URL in the hit count example in README.md, i.e. http://www.haskell.org/haskellwiki/Haskell, returns a 302 "moved temporarily" HTTP response. I'll see if I can figure out a way to rewrite the sample to work.

Update: I tried changing the URL to https://wiki.haskell.org/Haskell. Unfortunately, the new front page no longer includes a hit count.

Matching multiple class attribute values

From https://code.google.com/p/ndmitchell/issues/detail?id=475

ACheshkov says:

  1. So we have html : <div class="class1 class2">
  2. I want find all div tags with class1
  3. ~== "<div class=class1>" does not work properly

I replied: Yes, it's only perfect matching at that stage. I wonder if it should be substring matching, or at least have the option. Perhaps you really want regular expressions?

As a workaround, you can run a pass on it first to take class="foo bar" and transform it to class="foo" class="bar", and then your matching would work as you hope.
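
That workaround pass might be sketched as follows (splitting each class value on whitespace into repeated class attributes):

import Text.HTML.TagSoup

-- Turn class="foo bar" into class="foo" class="bar", so that
-- ~== "<div class=class1>" matches as hoped.
splitClasses :: [Tag String] -> [Tag String]
splitClasses = map f
  where
    f (TagOpen name attrs) = TagOpen name (concatMap g attrs)
    f t                    = t
    g ("class", v) = [("class", c) | c <- words v]
    g kv           = [kv]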

possible bug

I'm not sure this is a bug; maybe it's just a dark corner of the HTML5 parsing algorithm? But I did find it very surprising. Can you comment? Is this really valid HTML?

Prelude Text.HTML.TagSoup> parseTags
"<www.boe.es/buscar/act.php?id=BOE-A-1996-8930#a66>"
[TagOpen "www.boe.es" [("buscar",""),("act.php?id","BOE-A-1996-8930#a66")]]

Can StringLike have a IsString dependency?

One nice thing would be if StringLike was a subclass of IsString from the base package. This way, you don't have to require parameters to satisfy both StringLike and IsString, or convert between them (which can be annoying if you want to keep things polymorphic). It also allows OverloadedStrings to work with StringLike.

It would be trivial to do this, and you could even give fromString a default implementation.

If we want to do this, I could implement this.

ByteString version is too slow

From https://code.google.com/p/ndmitchell/issues/detail?id=290

vasyl said: I've used the following attached code for a benchmark; "page.html" could be
an arbitrary page, for example the Hackage package list.

On my PC, String version of tagsoup executes in 132 ms, and ByteString in
453 ms.

{-# LANGUAGE NoMonomorphismRestriction #-}
import Text.HTML.TagSoup
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Criterion.Main

tagsCount = length . parseTags

main = do
  fb <- B.readFile "page.html"
  fs <- readFile "page.html"
  fl <- BL.readFile "page.html"
  defaultMain [
    bench "String" $ nf tagsCount fs,
    bench "ByteString" $ nf tagsCount fb,
    bench "Lazy ByteString" $ nf tagsCount fl,
    bench "ByteString to String" $ nf tagsCount (B.unpack fb)
    ]

IMO this behavior is bad, because everyone expects that ByteString should
be faster. I think the best way is to disable ByteString for now, because
converting a ByteString to String is faster anyway (the last benchmark).

@ndmitchell replied:

Hmm, there is a benchmark in tagsoup, and I found them to be the same speed. The reason
I included ByteString is that it takes less memory, which does matter for some
applications.

I'll see how your benchmarks differ, and combine them in to mine. Tagsoup-0.8 was
intended to be an interface release, with Tagsoup-0.9 providing speed. With any luck
I'll have ByteString going substantially faster in the next release.

Have an option to make all tag names lowercase

In HTML5 this is the standard. There should be an option in parseOptions to make names lowercase. Not sure what the default should be though. This ticket replaces #2 and #19, since making all tags lowercase takes care of both.

canonicalize tags before generating a tag tree

The tagTree function compares tag names to match open and close tags. Perhaps it should call canonicaliseTags on the list of tags first? I don't think anyone would want to differentiate between lower and uppercase tags when generating a tree.
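
In the meantime the caller can compose the two; a one-line sketch, assuming the canonicalizeTags function exported by Text.HTML.TagSoup:

import Text.HTML.TagSoup (canonicalizeTags, parseTags)
import Text.HTML.TagSoup.Tree (TagTree, tagTree)

-- Canonicalise the tags before pairing open/close tags into a tree.
parseTreeCanonical :: String -> [TagTree String]
parseTreeCanonical = tagTree . canonicalizeTags . parseTags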

Consider renaming 'tagsoup' test tool to avoid collision with Java tagsoup

Original report: https://bugs.gentoo.org/show_bug.cgi?id=547734

/usr/bin/tagsoup is being provided by multiple packages in gentoo:
dev-haskell/tagsoup
Homepage: http://community.haskell.org/~ndm/tagsoup/
Description: Parsing and extracting information from (possibly malformed) HTML/XML documents

dev-java/tagsoup
Homepage: http://mercury.ccil.org/~cowan/XML/tagsoup/
Description: A SAX-compliant parser written in Java

We've renamed tagsoup to haskell-tagsoup in gentoo to avoid collision.

What do you think of slightly changing the tool name
to something like 'hstagsoup' / 'hsts' / other?

What is the long term status of Text.HTML.TagSoup.Tree?

Types like TagTree and functions like parseTree would be very useful for scraping, so I was wondering about the long term status of that module, particularly including things like whether or not it will handle omitted td/tr closing tags.

I have been using Scalpel for scraping, and it usually works quite well for web scraping, although sometimes it is too high level and I would rather just work with a node tree.

Scalpel itself has its own notion of a Tree called TagSpec in an internal module here, so it's also worth considering if HTML5 compliant node tree parsing can be done in a single place and reused by any library that needs it.

spjPapers and ndmPapers samples in README.md do not compile

Here's the compiler error:

/home/rcook/haskell/tagsoup-test/src/Main.hs:17:15:
    Expecting one more argument to ‘Tag’
    Expected kind ‘*’, but ‘Tag’ has kind ‘* -> *’
    In the type signature for ‘f’: f :: [Tag] -> String
    In an equation for ‘spjPapers’:
        spjPapers
          = do { tags <- fmap parseTags
                         $ openURL "http://research.microsoft.com/en-us/people/simonpj/";
                 let links = ...;
                 putStr $ unlines links }
          where
              f :: [Tag] -> String
              f = dequote
                  . unwords . words . fromTagText . head . filter isTagText
              dequote ('"' : xs) | last xs == '"' = init xs
              dequote x = x

This is due to the missing type parameter for Tag.

I can fix this.

Add isTagComment function

TagComment is the only tag that does not sport an accompanying isTagComment function in the isTag* family.
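
Presumably the definition would mirror the existing predicates; a sketch:

-- Sketch of the missing predicate, following the isTag* family.
isTagComment :: Tag str -> Bool
isTagComment (TagComment _) = True
isTagComment _              = False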

Cannot disable numeric character reference resolution

Using this option, I was expecting the character references to remain intact:

parseTagsOptions (parseOptionsEntities (const Nothing)) "&#x3c; &#60; &lt;"
  == [TagText "&#x3c; &#60; &lt;"]

However the numerical ones did not. This is the result instead:

[TagText "< < &lt;"]

And in fact the Text.HTML.TagSoup.Implementation.output function ignores the optEntityData option for numerical entities. (link to code)

This is problematic with ByteStrings, and when the charset used by the HTML document is still unknown (e.g., to be extracted from its <head> component, which I wish to use TagSoup for in the first place): resolved multibyte characters are truncated and I don't know what they should be translated to anyway.
Thanks.

TagSoup uses unbounded stack space while parsing a single tag

I expected to be able to use TagSoup to parse arbitrary HTML, but what I've found is that a single tag with lots of content (and by "lots" I only mean hundreds of kilobytes) can fill up the Haskell stack and crash my program. I've switched to fast-tagsoup which doesn't have this problem, but this seems like a relevant bug in tagsoup.

If aaaa.html is a file of the form:

<html><body>[286 KB of content goes here]</body></html>

then the following Haskell program will crash when compiled with GHC's defaults, using TagSoup 0.13.8:

module Main where
import System.IO
import Text.HTML.TagSoup

main :: IO ()
main = do
  text <- openFile "aaaa.html" ReadMode >>= hGetContents
  let tags = parseTags text in putStrLn (show (tags !! 4))

The error it crashes with is:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I can of course increase the stack size, but all that means is that it takes a longer tag to crash it.

If you want a complete example including the HTML file, I put it up at https://gist.github.com/rspeer/57c8d5995bc281e284b4 . I replaced every non-linebreak character in the HTML with "a" to emphasize that the content doesn't matter.

Idea: Make TagTree a Monad!

I know it sounds crazy, but hear me out!

We define a TagTreeM str a as

data TagTreeM str a
    = -- | A 'TagOpen'/'TagClose' pair with the 'Tag' values in between.
      TagBranch str [Attribute str] [TagTree str]
    | -- | Any leaf node
      TagLeaf (Tag str)
    | -- | A hole in the document
      Hole a
                   deriving (Eq,Ord,Show)

And we can define TagTree str as type TagTree str = TagTreeM str Data.Void.Void. [TagTreeM str] is now a Monad! (We probably want to make a newtype for [TagTreeM str], but throughout this post I'll just use it directly.)

What does this monad do, exactly? It is a document with holes in it to be filled in (sort of like the Continuation Monad). This makes it feasible to construct documents using TagSoup instead of just parsing them!

For example, we can define an operator

tag s = [TagBranch s [] [Hole ()]]

Now if we do

do
    tag "Html"
    tag "Body"
    [TagLeaf (TagText ("Hello World"))]

We get

<html>
<body>
Hello World
</body>
</html>

We can also make a list of things. For example, if we do TagBranch "ul" [] [TagBranch "li" [] [Hole 1], TagBranch "li" [] [Hole 2], TagBranch "li" [] [Hole 3]] We get a [TagTreeM str Int], which represents an html document with three holes, labeled with integers. The holes are in an unordered list. You could also imagine algebraic data types being used, representing different parts of the document.

Another useful operator would be

i :: [TagTree str] -> [TagTreeM str ()]
i t = vacuous t ++ return ()

i inserts a [TagTree str] into the document, leaving a hole beneath it for the rest of the document. It allows you to sequentially combine documents.

Now, admittedly this is pretty radical, since tagsoup is a parsing library, not an HTML rendering one. It still is an interesting thing to consider, though.

Edit: It is also possible to add a Hole constructor to Tag str instead. That way both [TagTreeM str] and [Tag str] would be monads, with tag s = [TagOpen s [], Hole (), TagClose s] instead.

Support for well-formed XML validation

TagWarning is used to check whether the individual XML tags are correct, if I'm not mistaken. I am not aware of a function that checks the soundness of the whole structure, i.e. that all TagOpen have a corresponding TagClose.

Are there plans to add this feature or is this something expected from the user? Thanks.
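
In the meantime a whole-structure check can be sketched over the tag list; this version ignores HTML's void elements, which a real validator would need to special-case:

import Text.HTML.TagSoup

-- True if every TagClose matches the most recently opened, still-open tag
-- and no tags remain open at the end.
balanced :: Eq str => [Tag str] -> Bool
balanced = go []
  where
    go stack (TagOpen name _ : rest) = go (name : stack) rest
    go (top : stack) (TagClose name : rest)
        | name == top = go stack rest
        | otherwise   = False
    go [] (TagClose _ : _) = False
    go stack (_ : rest) = go stack rest
    go stack [] = null stack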
