go-html-transform's People

Contributors

darkhelmet, zaphar

go-html-transform's Issues

go1 tag missing import fixes

The fixes from change 7ac884a7e2f5beaca5abacb9da9263a86c293829 haven't been 
added to the go1 tag.  When installing this library with go get it will pull 
the go1 tagged source and so miss out on the fixes.

I see this with a Go 1.4 install on Linux.  The source of the go get command 
shows what's going on:

https://go.googlesource.com/go/+/master/src/cmd/go/get.go#401

Original issue reported on code.google.com by [email protected] on 18 Dec 2014 at 11:25

Unclosed tags are not supported

Parsing a page that uses unclosed tags (e.g. <p> with no </p>) fails:

Parse error: NotSameTag: End Tag does not match Start Tag start:[p] end:[body]

An example page that cannot be parsed is http://dns-sd.org/ServiceTypes.html.

Original issue reported on code.google.com by [email protected] on 18 May 2012 at 2:43

Missing return types

What steps will reproduce the problem?
1. go get code.google.com/p/go-html-transform


What do you see instead?
# code.google.com/p/go-html-transform/css
.go/src/code.google.com/p/go-html-transform/css/parse.go:140: cannot use 
consumeCommentBody (type func(*PositionByteScanner)) as type cssParseFunc in 
return argument
.go/src/code.google.com/p/go-html-transform/css/parse.go:148: too many arguments

What version of the product are you using? On what operating system?
Go 1.2rc1


Original issue reported on code.google.com by [email protected] on 22 Sep 2013 at 11:37

Remove the dependency on the syscall package so that GAE apps can import this lib.

What steps will reproduce the problem?
1. Create a GAE application which imports "code.google.com/p/go-html-transform/h5".
2. Build it.

What is the expected output? What do you see instead?
The app should build, but the build fails with this error:
Failed parsing input: parser: bad import "syscall".
This happens because GAE does not allow importing the syscall package, for 
security reasons.

What version of the product are you using? On what operating system?
 - go_appengine_sdk_darwin_amd64-1.8.1.zip
 - MacOS 10.8.4

Is it possible to drop the dependency on the syscall package only when building for GAE?

Original issue reported on code.google.com by [email protected] on 12 Jun 2013 at 5:56

transform.Replace fails when selected node's parent is non-nil

Hi Jeremy, 

Continuing from the discussion at 
http://stackoverflow.com/questions/10068552/using-go-html-transform-to-preprocess-html-replace-fails
I've been playing some more with Replace and am still encountering problems. 
Bear in mind I'm a flailing amateur and very new to Go, so take all this with a 
grain of salt.

There are two issues I'm finding:

for i, c := range p.Children {
    if c == n {
        n := i - 1
        if n < 0 {
            n = 0
        }
        var newChild []*Node
        pre := p.Children[:n]
        post := p.Children[i+1:]
        newChild = append(pre, ns...)
        p.Children = append(newChild, post...)
    }
}

This drops the child just before the match: n = i - 1 gives pre = 
p.Children[:i-1] (p.Children[:0] when i is 1) where we need p.Children[:i]. I 
don't see why n is necessary at all in this case. Surely:

        pre := p.Children[:i]
        post := p.Children[i+1:]

works? I may be missing something there.
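
For illustration, here is a minimal sketch of the slicing being proposed. The Node type and the replaceChild helper are hypothetical stand-ins, not the library's own types or API. The sketch also builds a fresh slice: in the quoted loop, pre shares its backing array with p.Children, so append(pre, ns...) with more than one replacement can overwrite elements that post still refers to.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Name     string
	Parent   *Node
	Children []*Node
}

// replaceChild swaps old for the nodes in ns within old.Parent.Children.
// It builds a fresh slice so appending ns cannot overwrite the tail.
func replaceChild(old *Node, ns ...*Node) error {
	p := old.Parent
	if p == nil {
		return errors.New("cannot replace a node that has no parent")
	}
	for i, c := range p.Children {
		if c != old {
			continue
		}
		out := make([]*Node, 0, len(p.Children)-1+len(ns))
		out = append(out, p.Children[:i]...)   // everything before the match
		out = append(out, ns...)               // the replacements
		out = append(out, p.Children[i+1:]...) // everything after the match
		for _, n := range ns {
			n.Parent = p
		}
		p.Children = out
		old.Parent = nil
		return nil
	}
	return errors.New("node not found among its parent's children")
}

// names lists the Name of each node, for printing.
func names(ns []*Node) []string {
	out := make([]string, len(ns))
	for i, n := range ns {
		out[i] = n.Name
	}
	return out
}

func main() {
	root := &Node{Name: "ul"}
	for _, name := range []string{"a", "b", "c"} {
		root.Children = append(root.Children, &Node{Name: name, Parent: root})
	}
	err := replaceChild(root.Children[1], &Node{Name: "x"}, &Node{Name: "y"})
	fmt.Println(names(root.Children), err) // prints: [a x y c] <nil>
}
```

Returning an error for a parentless node is one possible choice in this sketch; the library may want different behaviour for root nodes.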

Secondly, c == n never evaluates to true. I was wondering how the test case missed 
this, then I realised ReplaceTest only checks a node where the parent is nil. 
Replace works as expected there, but fails on any operation where the node 
returned by the selector is the child of another node.

I don't know Go well but the language spec says pointers compare equal if 
they point to the same variable or are both nil. If I've understood correctly 
these pointers don't point to the same variable, since n points to a copy and c 
points to the original. This would suggest there's a problem with the way the 
copy of the linked list is built (in that p.Children contains pointers to the 
old list, not the copy).
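
The pointer-comparison behaviour described above can be shown in isolation. This is a small sketch with a hypothetical Node type and two hypothetical copy helpers; it does not reproduce how the library actually clones its tree, only why a pointer into a clone never equals a pointer into the original.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Name     string
	Children []*Node
}

// shallowCopy copies the node struct but keeps the original child pointers.
func shallowCopy(n *Node) *Node {
	c := *n
	return &c
}

// deepCopy clones the node and, recursively, each child.
func deepCopy(n *Node) *Node {
	c := &Node{Name: n.Name}
	for _, ch := range n.Children {
		c.Children = append(c.Children, deepCopy(ch))
	}
	return c
}

func main() {
	child := &Node{Name: "p"}
	root := &Node{Name: "body", Children: []*Node{child}}

	shallow := shallowCopy(root)
	deep := deepCopy(root)

	fmt.Println(shallow == root)              // false: a different variable
	fmt.Println(shallow.Children[0] == child) // true: the child pointer is shared
	fmt.Println(deep.Children[0] == child)    // false: the child was cloned too
}
```

If a selector returns nodes from one tree while the parent's Children slice points into the other, a c == n check can never succeed, which would match the symptom reported here.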

I'm submitting a patch for the first issue but please treat it with deep 
suspicion since I've never used hg to patch anything before, and it's kind of 
hard to test given the problem discussed above. I'll look at the way the nodes 
are cloned but you might come up with a solution while I'm still trying to 
understand the source!

Original issue reported on code.google.com by [email protected] on 26 Apr 2012 at 10:03

De-referenced nil from parse, inside bogusCommentHandler

What steps will reproduce the problem?
1. Try to parse "http://www.cse.ucsc.edu/~elkaim/elkaim/Overbot.html"

What is the expected output? What do you see instead?

Worker starting url: http://www.cse.ucsc.edu/~elkaim/elkaim/Overbot.html
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x28 pc=0x4772c0]

goroutine 4 [running]:
code.google.com/p/go-html-transform/h5.addSibling(0xf8400ad0c0, 0x478ff5, 
0xf8400ad0c0, 0x3f)
        C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:1208 +0x3a
code.google.com/p/go-html-transform/h5.bogusCommentHandler(0xf8400ad0c0, 
0x47714e, 0x0, 0x0)
        C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:1189 +0x28
code.google.com/p/go-html-transform/h5.(*Parser).Parse(0xf8400ad0c0, 
0xf840091e40, 0xf8400ad0c0, 0xf8400991b0)
        C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:338 +0x52
examples/htmlreader.(*Htmldocument).Parsedoc(0x377f18, 0xf840054300, 0x33, 0x0, 
0x0, ...)
        C:/projects/go/src/examples/htmlreader/htmlreader.go:50 +0x1bb
main.parsesiteworker(0xf84008e000, 0xf840001780, 0x0, 0x0)
        C:/projects/go/src/examples/htmlreaderdemo/htmlreaderdemo.go:112 +0x194
created by main.parsesite
        C:/projects/go/src/examples/htmlreaderdemo/htmlreaderdemo.go:57 +0x17f



What version of the product are you using? On what operating system?

Win7-x64, go version go1.0.3


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 9 Jan 2013 at 10:39

Memory leak

What steps will reproduce the problem?
1. go build test.go
2. ./test
cat test.go
package main

import (
    "fmt"
    "net/http"

    "code.google.com/p/go-html-transform/html/transform"
)

func main() {
    c := new(http.Client)
    selQuery := transform.NewSelectorQuery("table[width=764]", "tr", "td", "a.blue")
    for {
        r, _ := http.NewRequest("GET", "http://www.yahoo.com", nil)

        rs, _ := c.Do(r)

        doc, _ := transform.NewDocFromReader(rs.Body)
        nodes := selQuery.Apply(doc)
        if 0 != len(nodes) {
            fmt.Printf("#")
            break
        } else {
            fmt.Printf(".")
        }
        rs.Body.Close()
    }
    return
}

What is the expected output? What do you see instead?
Steady memory usage

What version of the product are you using? On what operating system?
Linux, both 32 and 64 bit. Go1
transform version was 188:259c2a97052b through "go get 
code.google.com/p/go-html-transform/html/transform" from yesterday.
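
One way to put a number on "steady memory usage" is to compare the live heap before and after a batch of iterations. The sketch below uses only the standard runtime package; the measure helper and the retained slice are hypothetical names, and the deliberately leaky closure stands in for the parse-and-select loop above, which needs network access.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc forces a collection and reports the bytes still live on the heap.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// measure runs work the given number of rounds and returns the live heap
// size before and after.
func measure(rounds int, work func()) (before, after uint64) {
	before = heapAlloc()
	for i := 0; i < rounds; i++ {
		work()
	}
	return before, heapAlloc()
}

// retained simulates a leak by keeping every allocation reachable.
var retained [][]byte

func main() {
	before, after := measure(50, func() {
		retained = append(retained, make([]byte, 1<<20))
	})
	fmt.Printf("live heap: %d bytes before, %d bytes after\n", before, after)
}
```

With a body that does not leak, the two figures should stay close to each other however many rounds are run; steady growth proportional to the round count points at retained memory.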


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 10 Apr 2012 at 2:33

transform.Replace panics on root nodes

What steps will reproduce the problem?
1. Create a tree, with transform.NewTransform, for example.
2. Pass the whole tree to Replace, ex: root.Apply(transform.Replace(...

What is expected?
Entire tree replaced.
What do you see instead?
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x30 pc=0x420b75]

goroutine 1 [running]:
code.google.com/p/go-html-transform/html/transform._func_009(0xf84002f230, 
0x42095a, 0xf84002da80, 0x100000001, 0xf84002da80, ...)
        /pool/byron0/goext/src/code.google.com/p/go-html-transform/html/transform/transform.go:111 +0x2a
code.google.com/p/go-html-transform/html/transform.(*Transformer).Apply(0xf8400550d0, 
0xf84002c3c0, 0x2b55dbbe1f38, 0x100000001, 0xf84002a280, ...)
        /pool/byron0/goext/src/code.google.com/p/go-html-transform/html/transform/transform.go:54 +0xb7
main.main()
        /pool/byron0/so/so.go:17 +0x25c


What version of the product are you using? On what operating system?

go1


Original issue reported on code.google.com by [email protected] on 9 Apr 2012 at 10:05

Imbalanced closing tag causes Parse failure

What steps will reproduce the problem?
1.  See attached html_test.go

What is the expected output? What do you see instead?
It's sloppy HTML, but I would expect a successful parsing.

What version of the product are you using? On what operating system?
$ hg sum
parent: 202:40c397f37dc3 go1
 BugFix: Replace replace did comparisons wrong.
branch: default
commit: 1 modified (new branch head)
update: 16 new changesets (update)

Please provide any additional information below.
The following diff makes the attached test pass, but I don't fully appreciate 
the ramifications of removing that check:

$ hg diff
diff -r 40c397f37dc3 h5/h5.go
--- a/h5/h5.go  Mon May 14 19:15:33 2012 -0500
+++ b/h5/h5.go  Thu Jun 21 13:06:50 2012 -0700
@@ -998,9 +998,11 @@
                                // reset the current node
                                n = p.curr
                        }
+                       /*
                        if string(n.data) != string(tag) {
                                return nil, newEndTagError("NotSameTag", n, tag)
                        }
+                       */
                        //fmt.Println("YYY: closing a tag")
                        popNode(p)
                        return dataStateHandlerSwitch(p), nil
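
A possible middle ground between the strict check and commenting it out is to look for the nearest open element with the same name and close everything above it. The sketch below shows that idea on a plain stack of tag names. The closeTag function is hypothetical, it is not how h5.go is structured, and it is much simpler than the full HTML5 tree-construction rules.

```go
package main

import "fmt"

// closeTag handles an end tag against a stack of open element names.
// It pops up to and including the nearest matching open element, which
// implicitly closes anything left open inside it. A stray end tag with
// no matching open element is ignored.
func closeTag(open []string, tag string) []string {
	for i := len(open) - 1; i >= 0; i-- {
		if open[i] == tag {
			return open[:i]
		}
	}
	return open
}

func main() {
	open := []string{"html", "body", "p"}

	open = closeTag(open, "body") // </body> arrives while <p> is still open
	fmt.Println(open)             // prints: [html]

	open = closeTag(open, "div") // stray </div> with no matching start tag
	fmt.Println(open)            // prints: [html]
}
```

Under this approach an unclosed <p> followed by </body> no longer produces a NotSameTag error, and an imbalanced closing tag is skipped instead of aborting the parse.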


Original issue reported on code.google.com by [email protected] on 21 Jun 2012 at 8:08

Faster selector matching using indexes

Using an in-memory index of ids and classes could speed up selector matching by 
allowing us to walk a subset of the node tree instead of the whole thing.
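
As a rough sketch of what such an index could look like: one walk over the tree fills two maps, and later id or class lookups read the maps instead of walking again. Node, Index and BuildIndex are hypothetical names used for illustration and are not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Tag      string
	Attrs    map[string]string
	Children []*Node
}

// Index maps ids and class names to the nodes that carry them.
type Index struct {
	ByID    map[string][]*Node
	ByClass map[string][]*Node
}

// BuildIndex walks the tree once and records every id and class.
func BuildIndex(root *Node) *Index {
	idx := &Index{ByID: map[string][]*Node{}, ByClass: map[string][]*Node{}}
	var walk func(n *Node)
	walk = func(n *Node) {
		if id := n.Attrs["id"]; id != "" {
			idx.ByID[id] = append(idx.ByID[id], n)
		}
		// A class attribute may hold several space-separated names.
		for _, class := range strings.Fields(n.Attrs["class"]) {
			idx.ByClass[class] = append(idx.ByClass[class], n)
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	walk(root)
	return idx
}

func main() {
	link := &Node{Tag: "a", Attrs: map[string]string{"class": "blue nav"}}
	root := &Node{Tag: "div", Attrs: map[string]string{"id": "main"}, Children: []*Node{link}}

	idx := BuildIndex(root)
	fmt.Println(len(idx.ByID["main"]), len(idx.ByClass["blue"]), len(idx.ByClass["red"]))
	// prints: 1 1 0
}
```

A selector such as a.blue could then start from idx.ByClass["blue"] and check the tag name on those candidates only. The index has to be rebuilt or updated whenever a transform changes the tree.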

Original issue reported on code.google.com by JeremyMZHS on 30 Jan 2013 at 12:37

Full CSS parser.

Having a full CSS parser would allow us to do transforms on the css content of 
an html file as well.

Original issue reported on code.google.com by JeremyMZHS on 22 Jul 2013 at 3:38

undefined: h5.New

What steps will reproduce the problem?
1. go run hello2.go
2. Content of hello2.go:
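package main

import (
    "fmt"
    "net/http"

    "code.google.com/p/go-html-transform/h5"
)

func main() {
    resp, err := http.Get("http://google.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Defer the close only after the error check; resp is nil on failure.
    defer resp.Body.Close()

    // The original passed an undefined rdr; the response body is assumed here.
    tree, _ := h5.New(resp.Body)

    fmt.Println(tree)
}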

What is the expected output? What do you see instead?
It seems to me that when I run:
go get "code.google.com/p/go-html-transform/h5"
it does not update to the latest code found at 
http://code.google.com/p/go-html-transform/source/checkout

What version of the product are you using? On what operating system?
I am using Go built from source, go version 1.0.3, on Windows 7 x64.

Please provide any additional information below.

I guess my question is more about how to get the latest sources from 
http://code.google.com/p/go-html-transform/source/checkout and use them to 
compile my hello2.go program

Thanks

Original issue reported on code.google.com by [email protected] on 1 Mar 2013 at 2:05

Descendent selectors not working as expected

What steps will reproduce the problem?
1. See attached multipleSelectors.go

What is the expected output? What do you see instead?
In the attached file, the expected output is that all <a> elements are selected 
when the selectors "body" and "a" are applied. This is shown under 'results of 
asterick seperated selectors:' in the console output. 

The current behavior is to select nothing, as shown under 'results of standard 
selectors:'
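
For reference, the behaviour expected from "body a" can be written down as a small sketch: a node matches when its own tag is the last name in the chain and the remaining names are found, in order, among its ancestors. The Node type and the matches, find and add helpers are hypothetical, and the sketch handles tag names only, not classes, ids or attributes.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Tag      string
	Parent   *Node
	Children []*Node
}

// matches reports whether n satisfies a chain of tag names joined by the
// descendant combinator, e.g. []string{"body", "a"} for "body a".
func matches(n *Node, chain []string) bool {
	if len(chain) == 0 || n.Tag != chain[len(chain)-1] {
		return false
	}
	rest := chain[:len(chain)-1]
	// Walk up the ancestors, consuming the chain from right to left.
	for anc := n.Parent; len(rest) > 0 && anc != nil; anc = anc.Parent {
		if anc.Tag == rest[len(rest)-1] {
			rest = rest[:len(rest)-1]
		}
	}
	return len(rest) == 0
}

// find visits every node once and collects those that match the chain.
func find(root *Node, chain []string) []*Node {
	var out []*Node
	var walk func(n *Node)
	walk = func(n *Node) {
		if matches(n, chain) {
			out = append(out, n)
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	walk(root)
	return out
}

// add creates a child with the given tag and links it to parent.
func add(parent *Node, tag string) *Node {
	n := &Node{Tag: tag, Parent: parent}
	parent.Children = append(parent.Children, n)
	return n
}

func main() {
	html := &Node{Tag: "html"}
	body := add(html, "body")
	add(add(body, "p"), "a") // body > p > a
	add(body, "a")           // body > a

	fmt.Println(len(find(html, []string{"body", "a"}))) // prints: 2
}
```

Both links are selected, whether they sit directly under body or deeper inside it, which is the result the report expects from the standard selectors.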


What version of the product are you using? On what operating system? Still 
getting used to Go's package management, so I'm not totally sure, but I think 
rev 3efab001d743 (current HEAD). OS: Mac OS 10.6.8


Please provide any additional information below.
I haven't used child-selectors so I don't know if or how they are affected by 
this issue.

Thanks again for the efforts being put into this tool!

Original issue reported on code.google.com by [email protected] on 30 Jun 2012 at 4:45

List is reassigned instead of appended in Chain.Find (should probably be a set though)

What steps will reproduce the problem?
1. With an HTML snippet that contains many separate matches, Find sometimes 
returns only the last set of matches.  For instance, if you create a selector 
out of "tr td" and call Find on a node containing the following HTML, you only 
get the matches nested under the tr element with id=d.

"""
<table>
<tr id=a><td id=1><b>acaricide</b> (əˈkærɪˌsaɪd) <a ><img /></a></td></tr>
<tr id=b><td id=2>&nbsp;</td></tr>
<tr id=c><td id=3>&mdash;<b><i>n</i></b> </td></tr>
<tr id=d><td id=4></td><td >any drug or formulation for killing 
acarids</td></tr>
</table>
"""


What is the expected output? What do you see instead?

I expect it to match all of the td elements. Instead it only matches the td 
elements inside the last tr node.

Please provide any additional information below.

I think a quick fix could be:
https://code.google.com/r/rhironaga-go-html-transform/source/detail?r=798176b5e366e4a9d62b7220ff88293f9bc185ee

But I think it also introduces an issue where it can return duplicate results.  
I think a more robust solution would switch Find to return a set 
(map[*html.Node]bool) instead of a list of nodes.  The problem case I think I 
encountered was:

selector:
div span

html:
<div><div><span></span></div></div>

I think it might make more sense to only return span once.
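
A sketch of the set idea, under the assumption that Find evaluates "outer inner" by searching beneath every outer match: results are appended rather than reassigned, and a map[*Node]bool drops nodes that are reached through more than one ancestor. All names here (Node, descendants, collect, find) are hypothetical and not the library's API.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Tag      string
	Children []*Node
}

// descendants returns every node below n with the given tag.
func descendants(n *Node, tag string) []*Node {
	var out []*Node
	for _, c := range n.Children {
		if c.Tag == tag {
			out = append(out, c)
		}
		out = append(out, descendants(c, tag)...)
	}
	return out
}

// collect gathers all nodes with the given tag, root included.
func collect(root *Node, tag string) []*Node {
	out := descendants(root, tag)
	if root.Tag == tag {
		out = append([]*Node{root}, out...)
	}
	return out
}

// find evaluates "outer inner" by searching under every outer match and
// appending to the result. The seen set drops nodes reached more than once,
// while the slice keeps the order of first discovery.
func find(root *Node, outer, inner string) []*Node {
	var out []*Node
	seen := map[*Node]bool{}
	for _, o := range collect(root, outer) {
		for _, n := range descendants(o, inner) {
			if !seen[n] {
				seen[n] = true
				out = append(out, n)
			}
		}
	}
	return out
}

func main() {
	// <div><div><span></span></div></div>
	span := &Node{Tag: "span"}
	root := &Node{Tag: "div", Children: []*Node{{Tag: "div", Children: []*Node{span}}}}

	fmt.Println(len(find(root, "div", "span"))) // prints: 1
}
```

Keeping a slice alongside the set preserves a stable result order, which a bare map would not give callers.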

Original issue reported on code.google.com by [email protected] on 17 Jun 2013 at 6:12

Err on tag namespace

What steps will reproduce the problem?
1. See attached namespaceErr.go

What is the expected output? What do you see instead?
Expected behavior is to continue parsing document and either ignore the element 
in question, treat the 'namespace:' as the start of a tag name, or handle 
namespaces as a property of a Node: 
http://jsoup.org/apidocs/org/jsoup/select/Selector.html

Currently, the behavior is to close any open tags and return a Node representing 
the portion of the document that was parsed (I think), along with an err 
carrying some info.


What version of the product are you using? On what operating system? Still 
getting used to Go's package management, so I'm not totally sure, but I think 
rev 3efab001d743 (current HEAD). OS: Mac OS 10.6.8


Please provide any additional information below.

Totally stoked to see this library moving along. Thanks for your time and 
effort!

Original issue reported on code.google.com by [email protected] on 30 Jun 2012 at 4:36

exp/html moved to net subrepo

http://code.google.com/p/go/source/detail?r=ffbff9f7596e2655eab581b0188ad2a0217778f0&repo=net

Imports now should be:
"code.google.com/p/go.net/html"

Best regards,
Dobrosław Żybort

Original issue reported on code.google.com by [email protected] on 12 Feb 2013 at 9:07
