TagSoup

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the Sample file from the source repository. The examples we give are:

  • Obtaining the last modified date of the Haskell wiki
  • Obtaining a list of Simon Peyton Jones' latest papers
  • A brief overview of some other examples

The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. In the examples general hints on screen scraping are included, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any given time, you may be in for a shock!

This library was written without knowledge of the Java version of TagSoup. They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.
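
For example, parsing a small nested fragment in GHCi shows the flat structure - the result is just a list of open, text and close tags, with no nesting:

Prelude Text.HTML.TagSoup> parseTags "<p><b>bold</b> text</p>"
[TagOpen "p" [],TagOpen "b" [],TagText "bold",TagClose "b",TagText " text",TagClose "p"]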

Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.

Potential Bugs

There are two things that may go wrong with these examples:

  • The websites being scraped may change. There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.
  • The openURL method may not work. This happens quite regularly; depending on your server, your proxies and the direction of the wind, it may not work. The solution is to use wget to download the page locally, then use readFile instead, as sketched below. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.
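
A minimal sketch of that fallback, assuming the page has already been saved locally as temp.htm by wget:

module Main where

import Text.HTML.TagSoup

-- Assumes: wget -O temp.htm "http://wiki.haskell.org/Haskell" was run beforehand.
main :: IO ()
main = do
    src <- readFile "temp.htm"
    print $ take 5 $ parseTags src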

Last modified date of Haskell wiki

Our goal is to develop a program that displays the date that the wiki at wiki.haskell.org was last modified. This example covers the basics of designing a simple web-scraping application.

Finding the Page

We first need to find where the information is displayed and in what format. Taking a look at the front web page, when not logged in, we see:

<ul id="footer-info">
  <li id="footer-info-lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li id="footer-info-copyright">Recent content is available under <a href="/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">simple permissive license</a>.</li>
</ul>

So, we see that the last modified date is available. This leads us to rule 1:

Rule 1: Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will modify the displayed HTML with scripting, and some pages will display differently depending on your cookies. Before you can figure out how to start scraping, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

Using the HTTP package

We can write a simple HTTP downloader using the HTTP package:

module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src

Now open temp.htm, find the fragment of HTML containing the last modified date, and examine it.

Finding the Information

Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:

Rule 2: Do not be robust to design changes, do not even consider the possibility when writing the code.

If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust talk to the site owner, or buy the data from someone. If you try and think about design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.

So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice id or class attribute - something which is unlikely to occur multiple times. In the above example, an id with value footer-info-lastmod seems perfect.

module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime

Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of Tags with parseTags. The fromFooter function does the interesting thing, and can be read right to left:

  • First we throw away everything (dropWhile) until we get to an li tag containing id=footer-info-lastmod. The (~==) and (~/=) operators are different from standard equality and inequality since they allow additional attributes to be present. We write "<li id=footer-info-lastmod>" as syntactic sugar for TagOpen "li" [("id","footer-info-lastmod")]. If we just wanted any open tag with the given id attribute we could have written (~== TagOpen "" [("id","footer-info-lastmod")]) and this would have matched. Any empty strings in the second element of the match are considered as wildcards. (The short GHCi session after this list illustrates the matching.)
  • Next we take two elements: the <li> tag and the text node immediately following.
  • We call the innerText function to get all the text values from inside, which will just be the text node following the footer-info-lastmod.
  • We split the string into a series of words and drop the first six, i.e. the words This, page, was, last, modified and on.
  • We reassemble the remaining words into the resulting string 9 September 2013, at 22:38.
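
To see the wildcard matching in action, here is a quick GHCi check of the rules described above:

Prelude Text.HTML.TagSoup> TagOpen "li" [("id","footer-info-lastmod")] ~== "<li id=footer-info-lastmod>"
True
Prelude Text.HTML.TagSoup> TagOpen "li" [("id","footer-info-lastmod")] ~== "<li>"
True
Prelude Text.HTML.TagSoup> TagOpen "li" [("id","other-id")] ~== "<li id=footer-info-lastmod>"
False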

This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.

Rule 3: TagSoup is for extracting information where structure has been lost, use more structured information if it is available.

Simon's Papers

Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his home page. The largest change from the previous example is that now we desire a list of papers, rather than just a single result.

As before we start by writing a simple program that downloads the appropriate page, and we look for common patterns. This time we want to look for all patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.

First we spot that the page helpfully has named anchors, there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple take/drop pair:

takeWhile (~/= "<a name=haskell>") $
drop 5 $ dropWhile (~/= "<a name=current>") tags

This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:

map f $ sections (~== "<A>") $ ...

Remember that the function to select all tags with name "A" could have been written as (~== TagOpen "A" []), or alternatively isTagOpenName "A". Afterwards we map each item with an f function. This function needs to take the tags starting just after the link, and find the text inside the link.

f = dequote . unwords . words . fromTagText . head . filter isTagText

Here the complexity of interfacing to human-written markup comes through. Some of the links are in italic, some are not - the filter drops everything that is not a text node, until we find a pure text node. The unwords . words deletes all multiple spaces, replaces tabs and newlines with spaces and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name, some are not - dequote will remove the quotes if they exist.
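
As a quick illustration of the unwords . words trick:

Prelude> (unwords . words) "  A   History\n\tof Haskell  "
"A History of Haskell"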

For completeness, we now present the entire example:

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

spjPapers :: IO ()
spjPapers = do
        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
        let links = map f $ sections (~== "<A>") $
                    takeWhile (~/= "<a name=haskell>") $
                    drop 5 $ dropWhile (~/= "<a name=current>") tags
        putStr $ unlines links
    where
        f :: [Tag String] -> String
        f = dequote . unwords . words . fromTagText . head . filter isTagText

        dequote ('\"':xs) | last xs == '\"' = init xs
        dequote x = x

main :: IO ()
main = spjPapers

Other Examples

Several more examples are given in the Sample.hs file, including obtaining the (short) list of papers from my site, getting the current time and a basic XML validator. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two for enjoyment only.

My Papers

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

ndmPapers :: IO ()
ndmPapers = do
        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
        let papers = map f $ sections (~== "<li class=paper>") tags
        putStr $ unlines papers
    where
        f :: [Tag String] -> String
        f xs = fromTagText (xs !! 2)

main :: IO ()
main = ndmPapers

UK Time

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

currentTime :: IO ()
currentTime = do
    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
    putStrLn time

main :: IO ()
main = currentTime

Other Examples

In Sample.hs the following additional examples are listed:

  • Google Tech News
  • Package list from Hackage
  • Print names of story contributors on sequence.complete.org
  • Parse rows of a table

Issues

Incorrect source position before bogus comment

Bug #70 is fixed, but the same bug with bogus comments is not:

> parseTagsOptions Text.HTML.TagSoup.parseOptions{ optTagWarning = True, optTagPosition = True } "<div><!--foo-->bar</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 6,TagComment "foo",TagPosition 1 16,TagText "bar",TagPosition 1 19,TagClose "div"]
> parseTagsOptions Text.HTML.TagSoup.parseOptions{ optTagWarning = True, optTagPosition = True } "<div><?foo</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 8,TagOpen "?foo<" [("div","")],TagPosition 1 13,TagWarning "Unexpected \"/\"",TagPosition 1 17,TagWarning "Expected \"?>\""]

Note the TagPosition 1 8 in the second example.

empty attribute values are omitted in rendering

I'd like to reopen "renderTags renders empty attributes <body style=""> as <body style>".

I do have some problems with this, as in the legacy html I am processing there are some content-less attributes that were included solely for being there (like <img alt="" …>), and I don't want to change anything about that (for now, of course a proper solution is omitting them entirely or filling them with values).

If you don't want to alter the data model (like wrapping the value in Maybe), a quick fix might be to distinguish "boolean" attributes which are commonly empty from "text" attributes, so that you can both produce the empty attribute syntax for the former and the empty string value for the latter respectively. This would also work around the problems with the doctype that were mentioned in the google code issue.

Provide a more primitive parsing function

Hi,

it would be great if there was a more primitive parsing function that would allow usage of tagsoup in enumerators, pipes or conduits, e.g.

parseTag :: ByteString -> (Maybe Tag, ByteString)

or a return value similar to attoparsec's IResult type
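
None of this exists in tagsoup today, but a sketch of what such a hypothetical attoparsec-style result type for tags might look like:

import Text.HTML.TagSoup (Tag)

-- Hypothetical incremental result type, by analogy with attoparsec's IResult.
data TagResult str
    = TagDone str (Tag str)             -- one parsed tag, plus the remaining input
    | TagPartial (str -> TagResult str) -- the parser needs more input to decide
    | TagFail str                       -- input that could not be parsed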

Preserve collapsed tags

From https://code.google.com/p/ndmitchell/issues/detail?id=359

From looking at the positioning data, it seems like the rendering function
could take advantage of whether both the open and closing tag have the exact
same positioning data to make the output the same as input. Is it possible
you could make this change?

Fiddly at the moment, but with the new position information becomes more feasible.

Entities conversion doesn't use document's encoding, result is unusable.

I need to work on a page encoded in UTF-8 that also has HTML entities. The problem is that in that case entities aren't converted to their UTF-8 encoded representation; as a result I don't know how to recover all special characters on the page.

A small example:

Prelude Data.ByteString.Char8 Text.HTML.TagSoup> parseTags (pack "test\xC3\xA9 &#233; hop") :: [Tag ByteString]
[TagText "test\195\169 \233 hop"]

The resulting string mixes utf-8 and iso-8859-1 encodings.

\195\169 is the "é" character encoded in utf-8
\233 is the "é" character encoded in iso-8859-1

I have no idea (yet) how to work around it.
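
One possible workaround (a sketch, not from the original thread) is to decode the bytes to Text before parsing, so entity resolution produces proper Unicode characters:

import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Text.HTML.TagSoup

-- Decode first, then parse: both raw UTF-8 bytes and numeric entities
-- end up as proper Unicode characters in the resulting Text.
parseUtf8 :: B.ByteString -> [Tag Text]
parseUtf8 = parseTags . decodeUtf8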

Incorrect source position before comment.

With tagsoup 0.14.2:

Prelude Text.HTML.TagSoup> parseTagsOptions parseOptions{ optTagWarning = True, optTagPosition = True } "<div><!--foo-->bar</div>"
[TagPosition 1 1,TagOpen "div" [],TagPosition 1 8,TagComment "foo",TagPosition 1 16,TagText "bar",TagPosition 1 19,TagClose "div"]

If I'm not terribly confused, the third token should be TagPosition 1 6, not TagPosition 1 8. The open tag consists of five characters. Is this a bug?

See jgm/pandoc#4282.

Current UK time sample does not work due to change in URL and HTML layout

Program currently fails as follows:

$ stack exec tagsoup-test
tagsoup-test: Prelude.!!: index too large

Firstly, the HTTP request fails due to a 301 "moved permanently" response from the server. Secondly, the HTML layout has changed slightly: we should look for a <span> instead of a <strong> element. I'll update the sample.

Is it even possible to make `tagTree` lazy?

https://github.com/ndmitchell/tagsoup/blob/master/Text/HTML/TagSoup/Tree.hs#L29 contains a comment saying:

-- | Convert a list of tags into a tree. This version is not lazy at
--   all, that is saved for version 2.

The thing is though, is it even possible to make this lazy? I think the answer is no for the current API. Consider this:

(TagOpen "html" []):undefined

Would this output

(TagBranch "html" [] undefined):undefined

or

(TagLeaf $ TagOpen "html" []):undefined

if made lazy? The only thing we know is that it is of the form

_:_

so that is all a lazy version could output at the moment.

There are three solutions I can think of:

  1. Tell tagTree which tags should have closing tags, and which shouldn't, and have it raise an error or do some other behavior when this assumption is violated. No user facing types would need to be changed. This is the least safe option though.
  2. Somehow merge TagBranch and TagLeaf . TagOpen (and I think this was done in the past). You could have a lazy boolean flag (evaluating this boolean flag would cause the list to be evaluated). The amount of laziness possible, while still keeping it a tree, is still limited though. You could not lazily see what comes after that part of the tree. You would be able to scan its inner HTML lazily though. This requires changing user-facing types. It also sacrifices the tree-like nature of the datatype somewhat.
  3. Not making it lazy.

Modify HTML

How would I modify the content of some tag or attribute and return the modified version of the full HTML document?
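
One answer, as a sketch: parse to a tag list, rewrite it tag by tag, and render it back. For example, turning every <b> into <strong>:

import Text.HTML.TagSoup

-- Parse, rewrite tag-by-tag, and render back to a String.
boldToStrong :: String -> String
boldToStrong = renderTags . map f . parseTags
  where
    f (TagOpen "b" attrs) = TagOpen "strong" attrs
    f (TagClose "b")      = TagClose "strong"
    f t                   = t

Note that renderTags normalises entity escaping and attribute quoting, so untouched parts of the document are not guaranteed to round-trip byte for byte.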

Schema validation

Is schema validation a possibility for this library? This is an important feature, especially when working with well-formatted XML.

TagPosition not updated properly before bare & character

The bug is shown by the following example:

Prelude Text.HTML.TagSoup> parseTagsOptions  Text.HTML.TagSoup.parseOptions{ optTagWarning = False, optTagPosition = True} "<a>&</a>"
[TagPosition 1 1,TagOpen "a" [],TagPosition 1 5,TagText "&",TagPosition 1 5,TagClose "a"]

Note that the TagPosition is the same before and after TagText "&".

I would expect instead:

[TagPosition 1 1,TagOpen "a" [],TagPosition 1 4,TagText "&",TagPosition 1 5,TagClose "a"]

Compare:

Prelude Text.HTML.TagSoup> parseTagsOptions  Text.HTML.TagSoup.parseOptions{ optTagWarning = False, optTagPosition = True} "<a>x</a>"
[TagPosition 1 1,TagOpen "a" [],TagPosition 1 4,TagText "x",TagPosition 1 5,TagClose "a"]

This is related to two pandoc bugs, jgm/pandoc#4094, jgm/pandoc#4088.

Escape single quote (') characters as &#39;

I was surprised to discover that the escapeXml function escapes double quotes (") but not single quotes ('). Digging into the source code of tagsoup, I found this comment:

-- | A table mapping XML entity names to resolved strings. All strings are a single character long.
-- Does /not/ include @apos@ as Internet Explorer does not know about it.
xmlEntities :: [(String, String)]
xmlEntities = let a*b = (a,[b]) in
["quot" * '"'
,"amp" * '&'
-- ,"apos" * '\'' -- Internet Explorer does not know that
,"lt" * '<'
,"gt" * '>'
]

This suggests that the reason that single quotes aren't escaped is due to Internet Explorer not supporting &apos;. But this feels a bit too conservative, since Internet Explorer does support &#39;, as suggested here. (Credit goes to this stache pull request for that idea.)

Would you be open to escapeXml escaping single quotes as &#39; instead?
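
As a stop-gap one could post-process the escaped output; a sketch, where escapeApos is a hypothetical helper composed as escapeApos . escapeXml:

-- Hypothetical helper: escape single quotes after the library's escaping has run.
escapeApos :: String -> String
escapeApos = concatMap (\c -> if c == '\'' then "&#39;" else [c])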

Issue about working with `OverloadedStrings` extension

I came across the same error once asked here: http://stackoverflow.com/questions/7352213/haskell-tagsoup-library-with-overloadedstrings

The error looks like:

No instance for (TagRep t0) arising from a use of ‘~==’
    The type variable t0 is ambiguous
    Note: there are several potential instances:
      instance Text.StringLike.StringLike str => TagRep (Tag str)
        -- Defined in ‘Text.HTML.TagSoup’
      instance TagRep String -- Defined in ‘Text.HTML.TagSoup’
    In the first argument of sections, namely (~== "<input>")
    In the first argument of (.), namely sections (~== "<input>")
    In the second argument of (<$>), namely
      sections (~== "<input>")
       . takeWhile (~/= s "<input name=user>")

Currently I'm using the workaround posted in the answer, adding a helper function:

s :: String -> String
s = id

-- have to add `s` before string everywhere.. for example 
let sth = sections (~== s "<input>") . takeWhile (~/= s "<input name=user>") $ tags

It still looks wordy. Is there a better solution to this problem?
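
One alternative sketch is to pin the literal's type inline, which resolves the TagRep ambiguity without a named helper:

-- Inline annotations instead of the s helper.
let sth = sections (~== ("<input>" :: String)) . takeWhile (~/= ("<input name=user>" :: String)) $ tags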

Parsing arbitrary numerical tags

While trying to parse the output from Hoogle in order to extract function names:

fmapFromHogle ="<span class=name><0>fmap</0></span> :: Functor f =&gt; (a -&gt; b) -&gt; f a -&gt; f b"

*Main λ> TS.parseTags fmapFromHogle
[TagOpen "span" [("class","name")],TagText "<0>fmap",TagComment "0",TagClose "span",TagText " :: Functor f => (a -> b) -> f a -> f b"]

the <0> isn't parsed as a tag; if I replace it with an <x>, for example, it works.
<x1> works as well, but <1x> doesn't. With <x> the output is:

[TagOpen "span" [("class","name")],TagOpen "x" [],TagText "fmap",TagClose "x",TagClose "span",TagText " :: Functor f => (a -> b) -> f
a -> f b"]

Incorrect parsing of tags inside `<script>` environments?

It's possible that I don't understand the HTML 5 parsing algorithm correctly, but it doesn't seem to me that tagsoup should be finding an img tag inside a quoted javascript string (this is with tagsoup 0.13.1, ghc 7.8.3):

Prelude Text.HTML.TagSoup> let x = "<style type=\"text/javascript\">\nvar x = \"<img src=\\\"./image/pic.png\\\" alt=\\\"pic.png\\\" width=400 />\"\n</style>\n" 
Prelude Text.HTML.TagSoup> parseTags x
[TagOpen "style" [("type","text/javascript")],TagText "\nvar x = \"",TagOpen "img" [("src","\\\"./image/pic.png\\\""),("alt","\\\"pic.png\\\""),("width","400")],TagClose "img",TagText "\"\n",TagClose "style",TagText "\n"]

Is this a bug? Cf. jgm/pandoc#1489.

Eliminate Text.HTML.Download

From https://code.google.com/p/ndmitchell/issues/detail?id=368

One user reported that the network dependency in tagsoup was causing issues. Since this is only used by a deprecated part of the package it seems worth trying to remove.

A search of Hackage reveals only two users of this module, hackage-plot and hackage-sparks, both by Dons. Plan is to create patches, get them submitted, wait for these packages to be uploaded, and then delete it.

Dons didn't reply, so I have made them off by default; they can be installed with a flag.

~== should be case insensitive

From https://code.google.com/p/ndmitchell/issues/detail?id=385

José Romildo Malaquias writes:

Currently the comparison of the name of open and close tags in the
inexact match operator (~==) of the Text.HTML.TagSoup package is case
sensitive. Maybe it is better to let this comparison be case
insensitive, as it should be irrelevant to the meaning of the HTML
code. The same would apply to the name of attributes in open tags.

I have just noticed that the package tagsoup-parsec already has parsers
for open and close tags which are already case insensitive.

I replied: Perhaps that does make sense. Not sure, but definitely worth thinking about.

Come to think of it, does anyone searching for <span> not also want to find <SPAN>? I doubt it. Perhaps the tag names and attribute names should be case insensitive, but the text values should still be case sensitive.
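
Until something like that exists, a normalising pre-pass is easy to sketch (lowercasing tag and attribute names, but leaving text nodes alone):

import Data.Char (toLower)
import Text.HTML.TagSoup

-- Lowercase tag and attribute names so case-sensitive matching behaves
-- as if it were case insensitive; text values are left untouched.
lowerNames :: [Tag String] -> [Tag String]
lowerNames = map f
  where
    low = map toLower
    f (TagOpen name attrs) = TagOpen (low name) [(low k, v) | (k, v) <- attrs]
    f (TagClose name)      = TagClose (low name)
    f t                    = t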

Redesign the parser

The current parser is a String parser that can also plug in ByteString and Text, and then treats them like String, which is inefficient. The plan (which will fix #12, #13, #11 and #5) is to:

  • Read the HTML5 spec a lot. Write it as a DSL. Have that DSL generate code pretty directly which is high performance. That will either be a deep embedding, or a shallow embedding that might spit out some generated Haskell (I suspect the latter).

  • Have the primitive parser have specialisations for different types. There should be a strict Text and ByteString one, and one with and without position numbers. These parsers should take a strict buffer, parse what it can, and output a stream of "steps". Something like:

    parseable :: IsText a => S -> a -> ([Fragment a], S)

The Fragment would say things like "start a tag", "start an attribute name", "start an attribute value", "here is some text". The S would be the state to continue with more data. Given this interface, you can write a lazy parser which is fast because it is strict in each chunk of the strict data structure.

CC @ChristopherKing42 who was asking about this on another ticket.

Hit count sample no longer works

Unfortunately, the URL in the hit count example in README.md, i.e. http://www.haskell.org/haskellwiki/Haskell, returns a 302 "moved temporarily" HTTP response. I'll see if I can figure out a way to rewrite the sample to work.

Update: I tried changing the URL to https://wiki.haskell.org/Haskell. Unfortunately, the new front page no longer includes a hit count.

Matching multiple class attribute values

From https://code.google.com/p/ndmitchell/issues/detail?id=475

ACheshkov says:

  1. So we have html : <div class="class1 class2">
  2. I want find all div tags with class1
  3. ~== "<div class=class1>" does not work properly

I replied: Yes, it's only perfect matching at that stage. I wonder if it should be substring matching, or at least have the option. Perhaps you really want regular expressions?

As a workaround, you can run a pass on it first to take class="foo bar" and transform it to class="foo" class="bar", and then your matching would work as you hope.
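
That workaround pass might be sketched as follows (splitting each class value on whitespace into repeated class attributes):

import Text.HTML.TagSoup

-- Turn class="foo bar" into class="foo" class="bar", so that
-- ~== "<div class=class1>" matches as hoped.
splitClasses :: [Tag String] -> [Tag String]
splitClasses = map f
  where
    f (TagOpen name attrs) = TagOpen name (concatMap g attrs)
    f t                    = t
    g ("class", v) = [("class", c) | c <- words v]
    g kv           = [kv]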

possible bug

I'm not sure this is a bug; maybe it's just a dark corner of the HTML5 parsing algorithm? But I did find it very surprising. Can you comment? Is this really valid HTML?

Prelude Text.HTML.TagSoup> parseTags
"<www.boe.es/buscar/act.php?id=BOE-A-1996-8930#a66>"
[TagOpen "www.boe.es" [("buscar",""),("act.php?id","BOE-A-1996-8930#a66")]]

Can StringLike have a IsString dependency?

One nice thing would be if StringLike was a subclass of IsString from the base package. This way, you don't have to require parameters to satisfy both StringLike and IsString, or convert between them (which can be annoying if you want to keep things polymorphic). It also allows OverloadedStrings to work with StringLike.

It would be trivial to do this, and you could even give fromString a default implementation.

If we want to do this, I could implement this.

ByteString version is too slow

From https://code.google.com/p/ndmitchell/issues/detail?id=290

vasyl said: I've used the following attached code for a benchmark; "page.html" could be
an arbitrary page, for example the Hackage package list.

On my PC, String version of tagsoup executes in 132 ms, and ByteString in
453 ms.

{-# LANGUAGE NoMonomorphismRestriction #-}
import Text.HTML.TagSoup
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Criterion.Main

tagsCount = length . parseTags

main = do
  fb <- B.readFile "page.html"
  fs <- readFile "page.html"
  fl <- BL.readFile "page.html"
  defaultMain [
    bench "String" $ nf tagsCount fs,
    bench "ByteString" $ nf tagsCount fb,
    bench "Lazy ByteString" $ nf tagsCount fl,
    bench "ByteString to String" $ nf tagsCount (B.unpack fb)
    ]

IMO this behavior is bad, because everyone expects that ByteString should
be faster. I think the best way is to disable ByteString for now, because
converting a ByteString to String is faster anyway (the last benchmark).

@ndmitchell replied:

Hmm, there is a benchmark in tagsoup, and I found them to be the same speed. The reason
I included ByteString is that it takes less memory, which does matter for some
applications.

I'll see how your benchmarks differ, and combine them in to mine. Tagsoup-0.8 was
intended to be an interface release, with Tagsoup-0.9 providing speed. With any luck
I'll have ByteString going substantially faster in the next release.

Have an option to make all tag names lowercase

In HTML5 this is the standard. There should be an option in parseOptions to make names lowercase. Not sure what the default should be though. This ticket replaces #2 and #19, since making all tags lowercase takes care of both.

canonicalize tags before generating a tag tree

The tagTree function compares tag names to match open and close tags. Perhaps it should call canonicaliseTags on the list of tags first? I don't think anyone would want to differentiate between lower and uppercase tags when generating a tree.
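
In the meantime the caller can compose the two; a one-line sketch, assuming the canonicalizeTags function exported by Text.HTML.TagSoup:

import Text.HTML.TagSoup (canonicalizeTags, parseTags)
import Text.HTML.TagSoup.Tree (TagTree, tagTree)

-- Canonicalise the tags before pairing open/close tags into a tree.
parseTreeCanonical :: String -> [TagTree String]
parseTreeCanonical = tagTree . canonicalizeTags . parseTags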

Consider renaming 'tagsoup' test tool to avoid collision with Java tagsoup

Original report: https://bugs.gentoo.org/show_bug.cgi?id=547734

/usr/bin/tagsoup is being provided by multiple packages in gentoo:
dev-haskell/tagsoup
Homepage: http://community.haskell.org/~ndm/tagsoup/
Description: Parsing and extracting information from (possibly malformed) HTML/XML documents

dev-java/tagsoup
Homepage: http://mercury.ccil.org/~cowan/XML/tagsoup/
Description: A SAX-compliant parser written in Java

We've renamed tagsoup to haskell-tagsoup in gentoo to avoid collision.

What do you think of slightly changing the tool name
to something like 'hstagsoup' / 'hsts' / other?

What is the long term status of Text.HTML.TagSoup.Tree?

Types like TagTree and functions like parseTree would be very useful for scraping, so I was wondering about the long term status of that module, particularly including things like whether or not it will handle omitted td/tr closing tags.

I have been using Scalpel for scraping, and it usually works quite well for web scraping, although sometimes it is too high level and I would rather just work with a node tree.

Scalpel itself has its own notion of a Tree called TagSpec in an internal module here, so it's also worth considering if HTML5 compliant node tree parsing can be done in a single place and reused by any library that needs it.

spjPapers and ndmPapers samples in README.md do not compile

Here's the compiler error:

/home/rcook/haskell/tagsoup-test/src/Main.hs:17:15:
    Expecting one more argument to ‘Tag’
    Expected kind ‘*’, but ‘Tag’ has kind ‘* -> *’
    In the type signature for ‘f’: f :: [Tag] -> String
    In an equation for ‘spjPapers’:
        spjPapers
          = do { tags <- fmap parseTags
                         $ openURL "http://research.microsoft.com/en-us/people/simonpj/";
                 let links = ...;
                 putStr $ unlines links }
          where
              f :: [Tag] -> String
              f = dequote
                  . unwords . words . fromTagText . head . filter isTagText
              dequote ('"' : xs) | last xs == '"' = init xs
              dequote x = x

This is due to the missing type parameter for Tag.

I can fix this.

Add isTagComment function

TagComment is the only tag that does not sport an accompanying isTagComment function in the isTag* family.
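
Presumably the definition would mirror the existing predicates; a sketch:

-- Sketch of the missing predicate, following the isTag* family.
isTagComment :: Tag str -> Bool
isTagComment (TagComment _) = True
isTagComment _              = False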

Cannot disable numeric character reference resolution

Using this option, I was expecting the character references to remain intact:

parseTagsOptions (parseOptionsEntities (const Nothing)) "&#x3c; &#60; &lt;"
  == [TagText "&#x3c; &#60; &lt;"]

However the numerical ones did not. This is the result instead:

[TagText "< < &lt;"]

And in fact the Text.HTML.TagSoup.Implementation.output function ignores the optEntityData option for numerical entities. (link to code)

This is problematic with ByteStrings, and when the charset used by the HTML document is still unknown (e.g., to be extracted from its <head> component, which I wish to use TagSoup for in the first place): resolved multibyte characters are truncated and I don't know what they should be translated to anyway.
Thanks.

TagSoup uses unbounded stack space while parsing a single tag

I expected to be able to use TagSoup to parse arbitrary HTML, but what I've found is that a single tag with lots of content (and by "lots" I only mean hundreds of kilobytes) can fill up the Haskell stack and crash my program. I've switched to fast-tagsoup which doesn't have this problem, but this seems like a relevant bug in tagsoup.

If aaaa.html is a file of the form:

<html><body>[286 KB of content goes here]</body></html>

then the following Haskell program will crash when compiled with GHC's defaults, using TagSoup 0.13.8:

module Main where
import System.IO
import Text.HTML.TagSoup

main :: IO ()
main = do
  text <- openFile "aaaa.html" ReadMode >>= hGetContents
  let tags = parseTags text in putStrLn (show (tags !! 4))

The error it crashes with is:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I can of course increase the stack size, but all that means is that it takes a longer tag to crash it.

If you want a complete example including the HTML file, I put it up at https://gist.github.com/rspeer/57c8d5995bc281e284b4 . I replaced every non-linebreak character in the HTML with "a" to emphasize that the content doesn't matter.

Idea: Make TagTree a Monad!

I know it sounds crazy, but hear me out!

We define a TagTreeM str a as

data TagTreeM str a
    = -- | A 'TagOpen'/'TagClose' pair with the 'Tag' values in between.
      TagBranch str [Attribute str] [TagTree str]
    | -- | Any leaf node
      TagLeaf (Tag str)
    | -- | A hole in the document
      Hole a
                   deriving (Eq,Ord,Show)

And we can define TagTree str as type TagTree str = TagTreeM str Data.Void.Void. [TagTreeM str] is now a Monad! (We probably want to make a newtype for [TagTreeM str], but throughout this post I'll just use it directly.)

What does this monad do, exactly? It is a document with holes in it to be filled in (sort of like the Continuation Monad). This makes it feasible to construct documents using TagSoup instead of just parsing them!

For example, we can define an operator

tag s = [TagBranch s [] [Hole ()]]

Now if we do

do
    tag "Html"
    tag "Body"
    [TagLeaf (TagText ("Hello World"))]

We get

<html>
<body>
Hello World
</body>
</html>

We can also make a list of things. For example, if we do TagBranch "ul" [] [TagBranch "li" [] [Hole 1], TagBranch "li" [] [Hole 2], TagBranch "li" [] [Hole 3]] We get a [TagTreeM str Int], which represents an html document with three holes, labeled with integers. The holes are in an unordered list. You could also imagine algebraic data types being used, representing different parts of the document.

Another useful operator would be

i :: [TagTree str] -> [TagTreeM str ()]
i t = vacuous t ++ return ()

i inserts a [TagTree str] into the document, leaving a hole beneath it for the rest of the document. It allows you to sequentially combine documents.

Now, admittedly this is pretty radical, since tagsoup is a parsing library, not an HTML rendering one. It still is an interesting thing to consider, though.

Edit: It is also possible to add a Hole constructor to Tag str instead. That way both [TagTreeM str] and [Tag str] would be monads, with tag s = [TagOpen s [], Hole (), TagClose s] instead.

Support for well-formed XML validation

TagWarning is used to check whether the individual XML tags are correct, if I'm not mistaken. I am not aware of a function that checks the soundness of the whole structure, i.e. that all TagOpen have a corresponding TagClose.

Are there plans to add this feature or is this something expected from the user? Thanks.
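
In the meantime a whole-structure check can be sketched over the tag list; this version ignores HTML's void elements, which a real validator would need to special-case:

import Text.HTML.TagSoup

-- True if every TagClose matches the most recently opened, still-open tag
-- and no tags remain open at the end.
balanced :: Eq str => [Tag str] -> Bool
balanced = go []
  where
    go stack (TagOpen name _ : rest) = go (name : stack) rest
    go (top : stack) (TagClose name : rest)
        | name == top = go stack rest
        | otherwise   = False
    go [] (TagClose _ : _) = False
    go stack (_ : rest) = go stack rest
    go stack [] = null stack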
