lambdax / go-html-transform
Automatically exported from code.google.com/p/go-html-transform
License: Artistic License 2.0
The fixes from change 7ac884a7e2f5beaca5abacb9da9263a86c293829 haven't been
added to the go1 tag. When installing this library with go get it will pull
the go1 tagged source and so miss out on the fixes.
I see this with a Go 1.4 install on Linux. The source of the go get command
shows what's going on:
https://go.googlesource.com/go/+/master/src/cmd/go/get.go#401
Original issue reported on code.google.com by [email protected]
on 18 Dec 2014 at 11:25
Trying to parse a page that uses unclosed tags (e.g. <p> with no </p>) barfs.
Parse error: NotSameTag: End Tag does not match Start Tag start:[p] end:[body]
An example page that cannot be parsed is http://dns-sd.org/ServiceTypes.html.
Original issue reported on code.google.com by [email protected]
on 18 May 2012 at 2:43
What steps will reproduce the problem?
1. go get code.google.com/p/go-html-transform
What do you see instead?
# code.google.com/p/go-html-transform/css
.go/src/code.google.com/p/go-html-transform/css/parse.go:140: cannot use
consumeCommentBody (type func(*PositionByteScanner)) as type cssParseFunc in
return argument
.go/src/code.google.com/p/go-html-transform/css/parse.go:148: too many arguments
What version of the product are you using? On what operating system?
Go 1.2rc1
Original issue reported on code.google.com by [email protected]
on 22 Sep 2013 at 11:37
Hello,
According to
https://groups.google.com/forum/#!msg/golang-nuts/eD8dh3T9yyA/l5Ail-xfMiAJ,
some package paths used in the project have changed, and that now causes a
build problem. I wrote a patch for it. Please consider applying it.
Thanks,
Tatsushi Demachi
Original issue reported on code.google.com by [email protected]
on 11 Nov 2014 at 2:47
Attachments:
What steps will reproduce the problem?
1. Create GAE application which imports
"code.google.com/p/go-html-transform/h5".
2. Build it.
What is the expected output? What do you see instead?
The apps should be built, but actually it gets an error.
Error message -> Failed parsing input: parser: bad import "syscall".
It is because GAE doesn't allow us to import syscall package for security
reasons.
What version of the product are you using? On what operating system?
- go_appengine_sdk_darwin_amd64-1.8.1.zip
- MacOS 10.8.4
Is it possible to remove dependency for syscall package only on GAE?
Original issue reported on code.google.com by [email protected]
on 12 Jun 2013 at 5:56
Hi Jeremy,
Continuing from the discussion at
http://stackoverflow.com/questions/10068552/using-go-html-transform-to-preprocess-html-replace-fails ,
I've been playing some more with Replace and am still
encountering problems. Bear in mind I'm a flailing amateur and very new to Go,
so take all this with a grain of salt.
There are two issues I'm finding:
	for i, c := range p.Children {
		if c == n {
			n := i - 1
			if n < 0 {
				n = 0
			}
			var newChild []*Node
			pre := p.Children[:n]
			post := p.Children[i+1:]
			newChild = append(pre, ns...)
			p.Children = append(newChild, post...)
		}
	}
This causes the first children to be ignored altogether as it results in
p.Children[:0] when we need p.Children[:i]. I don't see why n is necessary at
all in this case, surely:
pre := p.Children[:i]
post := p.Children[i+1:]
works? I may be missing something there.
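A minimal sketch of the slicing the reporter proposes, using a hypothetical Node type rather than the library's own. It also copies into a fresh slice, on the assumption that append(pre, ns...) in the quoted loop can write into the backing array that post still refers to; replaceChild and names are illustrative helpers, not part of go-html-transform.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Name     string
	Children []*Node
}

// replaceChild returns a new child slice in which the child at index i
// is replaced by ns. It builds a fresh slice so that appending ns cannot
// overwrite the elements after i in a shared backing array.
func replaceChild(children []*Node, i int, ns ...*Node) []*Node {
	out := make([]*Node, 0, len(children)-1+len(ns))
	out = append(out, children[:i]...)   // everything before the match
	out = append(out, ns...)             // the replacement nodes
	out = append(out, children[i+1:]...) // everything after the match
	return out
}

// names lists the node names, for printing.
func names(ns []*Node) []string {
	out := make([]string, len(ns))
	for i, n := range ns {
		out[i] = n.Name
	}
	return out
}

func main() {
	a, b, c := &Node{Name: "a"}, &Node{Name: "b"}, &Node{Name: "c"}
	p := &Node{Name: "p", Children: []*Node{a, b, c}}
	for i, ch := range p.Children {
		if ch == b {
			p.Children = replaceChild(p.Children, i, &Node{Name: "x"}, &Node{Name: "y"})
			break
		}
	}
	fmt.Println(names(p.Children)) // [a x y c]
}
```

Using p.Children[:i] keeps every child before the match, including the one at index i-1 that the n := i - 1 version drops.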
Secondly, c == n never equates true. I was wondering how the test case missed
this, then I realised ReplaceTest only checks a node where the parent is nil.
Replace works as expected there, but fails on any operation where the node
returned by the selector is the child of another node.
I don't know Go well but the language spec says pointers will equate true if
they point to the same variable or are both nil. If I've understood correctly
these pointers don't point to the same variable, since n points to a copy and c
points to the original. This would suggest there's a problem with the way the
copy of the linked list is built (in that p.Children contains pointers to the
old list, not the copy).
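The suspected failure mode can be reproduced in isolation. The sketch below uses a hypothetical Node type and hypothetical shallowClone / deepClone helpers; it illustrates the reporter's hypothesis only and is not a confirmed diagnosis of the library's cloning code.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Name     string
	Parent   *Node
	Children []*Node
}

// shallowClone copies only the node itself. The copy's Parent still
// points at the original parent, whose Children hold the original node.
func shallowClone(n *Node) *Node {
	c := *n
	return &c
}

// deepClone copies the whole subtree and re-parents every copied child,
// so parent and child pointers stay consistent inside the copy.
func deepClone(n *Node, parent *Node) *Node {
	c := &Node{Name: n.Name, Parent: parent}
	for _, ch := range n.Children {
		c.Children = append(c.Children, deepClone(ch, c))
	}
	return c
}

// isChildOfParent reports whether n can be found, by pointer comparison,
// among the children of its own parent.
func isChildOfParent(n *Node) bool {
	if n.Parent == nil {
		return false
	}
	for _, c := range n.Parent.Children {
		if c == n {
			return true
		}
	}
	return false
}

func main() {
	root := &Node{Name: "div"}
	span := &Node{Name: "span", Parent: root}
	root.Children = []*Node{span}

	cp := shallowClone(span)
	fmt.Println(isChildOfParent(cp)) // false: the parent lists the original, not the copy

	deep := deepClone(root, nil)
	fmt.Println(isChildOfParent(deep.Children[0])) // true: the copy is self-consistent
}
```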
I'm submitting a patch for the first issue but please treat it with deep
suspicion since I've never used hg to patch anything before, and it's kind of
hard to test given the problem discussed above. I'll look at the way the nodes
are cloned but you might come up with a solution while I'm still trying to
understand the source!
Original issue reported on code.google.com by [email protected]
on 26 Apr 2012 at 10:03
Attachments:
What steps will reproduce the problem?
1. Try to parse "http://www.cse.ucsc.edu/~elkaim/elkaim/Overbot.html"
What is the expected output? What do you see instead?
Worker starting url: http://www.cse.ucsc.edu/~elkaim/elkaim/Overbot.html
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x28 pc=0x4772c0]
goroutine 4 [running]:
code.google.com/p/go-html-transform/h5.addSibling(0xf8400ad0c0, 0x478ff5,
0xf8400ad0c0, 0x3f)
C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:1208 +0x3a
code.google.com/p/go-html-transform/h5.bogusCommentHandler(0xf8400ad0c0,
0x47714e, 0x0, 0x0)
C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:1189 +0x28
code.google.com/p/go-html-transform/h5.(*Parser).Parse(0xf8400ad0c0,
0xf840091e40, 0xf8400ad0c0, 0xf8400991b0)
C:/projects/go/src/code.google.com/p/go-html-transform/h5/h5.go:338 +0x52
examples/htmlreader.(*Htmldocument).Parsedoc(0x377f18, 0xf840054300, 0x33, 0x0,
0x0, ...)
C:/projects/go/src/examples/htmlreader/htmlreader.go:50 +0x1bb
main.parsesiteworker(0xf84008e000, 0xf840001780, 0x0, 0x0)
C:/projects/go/src/examples/htmlreaderdemo/htmlreaderdemo.go:112 +0x194
created by main.parsesite
C:/projects/go/src/examples/htmlreaderdemo/htmlreaderdemo.go:57 +0x17f
What version of the product are you using? On what operating system?
Win7-x64, go version go1.0.3
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 9 Jan 2013 at 10:39
What steps will reproduce the problem?
1. go build test.go
2. ./test
cat test.go
package main

import (
	"net/http"
	"fmt"
	"code.google.com/p/go-html-transform/html/transform"
)

func main() {
	c := new(http.Client)
	selQuery := transform.NewSelectorQuery("table[width=764]", "tr", "td", "a.blue")
	for {
		r, _ := http.NewRequest("GET", "http://www.yahoo.com", nil)
		rs, _ := c.Do(r)
		doc, _ := transform.NewDocFromReader(rs.Body)
		nodes := selQuery.Apply(doc)
		if 0 != len(nodes) {
			fmt.Printf("#")
			break
		} else {
			fmt.Printf(".")
		}
		rs.Body.Close()
	}
	return
}
What is the expected output? What do you see instead?
Steady memory usage
What version of the product are you using? On what operating system?
Linux, both 32 and 64 bit. Go1
transform version was 188:259c2a97052b through "go get
code.google.com/p/go-html-transform/html/transform" from yesterday.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 10 Apr 2012 at 2:33
What steps will reproduce the problem?
1. Create a tree, with transform.NewTransform, for example.
2. Pass the whole tree to Replace, ex: root.Apply(transform.Replace(...
What is expected?
Entire tree replaced.
What do you see instead?
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x30 pc=0x420b75]
goroutine 1 [running]:
code.google.com/p/go-html-transform/html/transform._func_009(0xf84002f230,
0x42095a, 0xf84002da80, 0x100000001, 0xf84002da80, ...)
/pool/byron0/goext/src/code.google.com/p/go-html-transform/html/transform/transform.go:111 +0x2a
code.google.com/p/go-html-transform/html/transform.(*Transformer).Apply(0xf8400550d0,
0xf84002c3c0, 0x2b55dbbe1f38, 0x100000001, 0xf84002a280, ...)
/pool/byron0/goext/src/code.google.com/p/go-html-transform/html/transform/transform.go:54 +0xb7
main.main()
/pool/byron0/so/so.go:17 +0x25c
What version of the product are you using? On what operating system?
go1
Original issue reported on code.google.com by [email protected]
on 9 Apr 2012 at 10:05
Attachments:
What steps will reproduce the problem?
1. See attached html_test.go
What is the expected output? What do you see instead?
It's sloppy HTML, but I would expect a successful parsing.
What version of the product are you using? On what operating system?
$ hg sum
parent: 202:40c397f37dc3 go1
BugFix: Replace replace did comparisons wrong.
branch: default
commit: 1 modified (new branch head)
update: 16 new changesets (update)
Please provide any additional information below.
The following diff makes the attached test pass, but I don't fully appreciate
the ramifications of removing that check:
$ hg diff
diff -r 40c397f37dc3 h5/h5.go
--- a/h5/h5.go Mon May 14 19:15:33 2012 -0500
+++ b/h5/h5.go Thu Jun 21 13:06:50 2012 -0700
@@ -998,9 +998,11 @@
// reset the current node
n = p.curr
}
+ /*
if string(n.data) != string(tag) {
return nil, newEndTagError("NotSameTag", n, tag)
}
+ */
//fmt.Println("YYY: closing a tag")
popNode(p)
return dataStateHandlerSwitch(p), nil
Original issue reported on code.google.com by [email protected]
on 21 Jun 2012 at 8:08
Attachments:
[deleted issue]
Using an in-memory index of ids and classes could speed up selector matching by
allowing us to walk a subset of the node tree instead of the whole thing.
Original issue reported on code.google.com by JeremyMZHS
on 30 Jan 2013 at 12:37
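A rough sketch of what such an index could look like, assuming a hypothetical Node type with attributes; Index, BuildIndex, and the field names are illustrative and not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// Attr and Node are hypothetical stand-ins for the library's types.
type Attr struct{ Key, Val string }

type Node struct {
	Tag      string
	Attrs    []Attr
	Children []*Node
}

// Index maps ids and class names to the nodes that carry them.
type Index struct {
	byID    map[string]*Node
	byClass map[string][]*Node
}

// BuildIndex walks the tree once and records every id and class it sees.
func BuildIndex(root *Node) *Index {
	idx := &Index{byID: map[string]*Node{}, byClass: map[string][]*Node{}}
	var walk func(n *Node)
	walk = func(n *Node) {
		for _, a := range n.Attrs {
			switch a.Key {
			case "id":
				idx.byID[a.Val] = n
			case "class":
				// A class attribute may hold several space-separated names.
				for _, c := range strings.Fields(a.Val) {
					idx.byClass[c] = append(idx.byClass[c], n)
				}
			}
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	walk(root)
	return idx
}

func main() {
	link := &Node{Tag: "a", Attrs: []Attr{{"class", "blue external"}}}
	cell := &Node{Tag: "td", Attrs: []Attr{{"id", "main"}}, Children: []*Node{link}}
	root := &Node{Tag: "table", Children: []*Node{cell}}

	idx := BuildIndex(root)
	fmt.Println(idx.byID["main"].Tag)     // td
	fmt.Println(len(idx.byClass["blue"])) // 1
}
```

A selector such as "#main" or ".blue" could then start from the indexed nodes instead of the root. Any transform that adds, removes, or re-attributes nodes would need to update or rebuild the index.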
Having a full CSS parser would allow us to do transforms on the css content of
an html file as well.
Original issue reported on code.google.com by JeremyMZHS
on 22 Jul 2013 at 3:38
What steps will reproduce the problem?
1.
go run hello2.go
2.
content of hello2.go
package main

import (
	"fmt"
	"code.google.com/p/go-html-transform/h5"
	"net/http"
)

func main() {
	resp, err := http.Get("http://google.com")
	defer resp.Body.Close()
	fmt.Println(err)
	tree, _ := h5.New(rdr)
	fmt.Printf(tree)
}
What is the expected output? What do you see instead?
It seems to me that when i do :
go get "code.google.com/p/go-html-transform/h5"
it does not update to the latest code found at
http://code.google.com/p/go-html-transform/source/checkout
What version of the product are you using? On what operating system?
I am using go build from source at tip go version 1.0.3 on W7x64
Please provide any additional information below.
I guess my question is more about how to get the latest sources from
http://code.google.com/p/go-html-transform/source/checkout and use them to
compile my hello2.go program
Thanks
Original issue reported on code.google.com by [email protected]
on 1 Mar 2013 at 2:05
What steps will reproduce the problem?
1. See attached multipleSelectors.go
What is the expected output? What do you see instead? In attached file,
expected output would be to select all <a> as a result of selectors "body" and
"a" being applied. This is shown under 'results of asterick seperated
selectors:' in console output.
Current behavior is to not select anything, shown under 'results of standard
selectors:'
What version of the product are you using? On what operating system? Still
getting used to Go's package management, so I'm not totally sure, but I think
rev 3efab001d743 (current HEAD). OS: Mac OS 10.6.8
Please provide any additional information below.
I haven't used child-selectors so I don't know if or how they are affected by
this issue.
Thanks again for the efforts being put into this tool!
Original issue reported on code.google.com by [email protected]
on 30 Jun 2012 at 4:45
Attachments:
h5.go and node.go need to use the new path for go.net, that is, golang.org/x/net
This is breaking projects using this library.
Original issue reported on code.google.com by [email protected]
on 14 Nov 2014 at 6:16
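The path change described above is mechanical, so it can be scripted. Below is a minimal sketch using only the standard go/parser and go/format packages; rewriteImports and importPaths are hypothetical helper names, and the sample source is made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// rewriteImports parses src and replaces every import path that equals
// oldPrefix, or lies below it, with the same path under newPrefix.
func rewriteImports(src, oldPrefix, newPrefix string) (string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return "", err
	}
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return "", err
		}
		if path == oldPrefix || strings.HasPrefix(path, oldPrefix+"/") {
			imp.Path.Value = strconv.Quote(newPrefix + strings.TrimPrefix(path, oldPrefix))
		}
	}
	var buf bytes.Buffer
	if err := format.Node(&buf, fset, f); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// importPaths returns the unquoted import paths found in src.
func importPaths(src string) ([]string, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "src.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, imp := range f.Imports {
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	// Made-up sample file; only the import line matters here.
	src := "package h5\n\nimport \"code.google.com/p/go.net/html\"\n\nvar _ = html.Parse\n"
	out, err := rewriteImports(src, "code.google.com/p/go.net", "golang.org/x/net")
	if err != nil {
		panic(err)
	}
	paths, err := importPaths(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(paths) // [golang.org/x/net/html]
}
```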
What steps will reproduce the problem?
1. With an HTML snippet containing many separate matches, Find sometimes
returns only the last set of matches. For instance, if you create a selector
from "tr td" and call Find on a node containing the following HTML, you only
get the matches nested under the id=d tr element.
"""
<table>
<tr id=a><td id=1><b>acaricide</b> (əˈkærɪˌsaɪd) <a ><img /></a></td></tr>
<tr id=b><td id=2> </td></tr>
<tr id=c><td id=3>—<b><i>n</i></b> </td></tr>
<tr id=d><td id=4></td><td >any drug or formulation for killing
acarids</td></tr>
</table>
"""
What is the expected output? What do you see instead?
I expect it to match all of the td elements. Instead it only matches the td
elements inside the last tr node.
Please provide any additional information below.
I think a quick fix could be:
https://code.google.com/r/rhironaga-go-html-transform/source/detail?r=798176b5e366e4a9d62b7220ff88293f9bc185ee
But I think it also introduces an issue where it can return duplicate results.
I think a more robust solution would switch Find to return a set
(map[*html.Node]bool) instead of a list of nodes. The problem case I think I
encountered was:
selector:
div span
html:
<div><div><span></span></div></div>
I think it might make more sense to only return span once.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2013 at 6:12
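The duplicate case from this report can be sketched with a hypothetical Node type and a naive descendant matcher. Neither collect nor unique is the library's Find; they only illustrate how a map[*Node]bool set removes the duplicate while keeping document order.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the library's node type.
type Node struct {
	Tag      string
	Children []*Node
}

// descendants appends every descendant of n with the given tag to out.
func descendants(n *Node, tag string, out []*Node) []*Node {
	for _, c := range n.Children {
		if c.Tag == tag {
			out = append(out, c)
		}
		out = descendants(c, tag, out)
	}
	return out
}

// collect naively matches "outer inner": for every node tagged outer it
// gathers all descendants tagged inner. Nested outer nodes therefore
// report the same inner node more than once.
func collect(root *Node, outer, inner string) []*Node {
	var result []*Node
	var walk func(n *Node)
	walk = func(n *Node) {
		if n.Tag == outer {
			result = descendants(n, inner, result)
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	walk(root)
	return result
}

// unique drops repeated nodes, keeping the first occurrence of each.
func unique(ns []*Node) []*Node {
	seen := map[*Node]bool{}
	var out []*Node
	for _, n := range ns {
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// <div><div><span></span></div></div>
	span := &Node{Tag: "span"}
	inner := &Node{Tag: "div", Children: []*Node{span}}
	root := &Node{Tag: "div", Children: []*Node{inner}}

	raw := collect(root, "div", "span")
	fmt.Println(len(raw), len(unique(raw))) // 2 1
}
```

Returning a bare map would lose document order, so this sketch keeps a slice and uses the map only to track which nodes have been seen.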
What steps will reproduce the problem?
1. See attached namespaceErr.go
What is the expected output? What do you see instead?
Expected behavior is to continue parsing document and either ignore the element
in question, treat the 'namespace:' as the start of a tag name, or handle
namespaces as a property of a Node:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
Currently, the behavior is to close open tags and return a Node representing the
portion of the document that was parsed (I think), along with an err with some info.
What version of the product are you using? On what operating system? Still
getting used to Go's package management, so I'm not totally sure, but I think
rev 3efab001d743 (current HEAD). OS: Mac OS 10.6.8
Please provide any additional information below.
Totally stoked to see this library moving along. Thanks for your time and
effort!
Original issue reported on code.google.com by [email protected]
on 30 Jun 2012 at 4:36
Attachments:
http://code.google.com/p/go/source/detail?r=ffbff9f7596e2655eab581b0188ad2a0217778f0&repo=net
Imports now should be:
"code.google.com/p/go.net/html"
Best regards,
Dobrosław Żybort
Original issue reported on code.google.com by [email protected]
on 12 Feb 2013 at 9:07