PuerkitoBio / goquery

A little like that j-thing, only in Go.

License: BSD 3-Clause "New" or "Revised" License

Go 62.31% Shell 0.27% Roff 37.42%
selector-strings jquery html-parsing goquery

goquery's Introduction

goquery - a little like that j-thing, only in Go

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.

Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.

Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, I felt that a similar HTML-manipulating library was better off following its API than starting anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (looking at you, index()...).

Installation

Required Go version:

  • Starting with version v1.10.0 of goquery, Go 1.23+ is required due to the use of function-based iterators.
  • For v1.9.0 of goquery, Go 1.18+ is required due to the use of generics.
  • For previous goquery versions, a Go version of 1.1+ was required because of the net/html dependency.

Ongoing goquery development is tested on the latest 2 versions of Go.

$ go get github.com/PuerkitoBio/goquery

(optional) To run unit tests:

$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test

(optional) To run benchmarks (warning: it runs for a few minutes):

$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test -bench=".*"

Changelog

Note that goquery's API is now stable, and will not break.

  • 2024-09-06 (v1.10.0) : Add EachIter which provides an iterator that can be used in for..range loops on the *Selection object. goquery now requires Go version 1.23+ (thanks @amikai).
  • 2024-09-06 (v1.9.3) : Update go.mod dependencies.
  • 2024-04-29 (v1.9.2) : Update go.mod dependencies.
  • 2024-02-29 (v1.9.1) : Improve allocation and performance of the Map function and Selection.Map method, better document the cascadia differences (thanks @jwilsson).
  • 2024-02-22 (v1.9.0) : Add a generic Map function, goquery now requires Go version 1.18+ (thanks @Fesaa).
  • 2023-02-18 (v1.8.1) : Update go.mod dependencies, update CI workflow.
  • 2021-10-25 (v1.8.0) : Add Render function to render a Selection to an io.Writer (thanks @anthonygedeon).
  • 2021-07-11 (v1.7.1) : Update go.mod dependencies and add dependabot config (thanks @jauderho).
  • 2021-06-14 (v1.7.0) : Add Single and SingleMatcher functions to optimize first-match selection (thanks @gdollardollar).
  • 2021-01-11 (v1.6.1) : Fix panic when calling {Prepend,Append,Set}Html on a Selection that contains non-Element nodes.
  • 2020-10-08 (v1.6.0) : Parse html in context of the container node for all functions that deal with html strings (AfterHtml, AppendHtml, etc.). Thanks to @thiemok and @davidjwilkins for their work on this.
  • 2020-02-04 (v1.5.1) : Update module dependencies.
  • 2018-11-15 (v1.5.0) : Go module support (thanks @Zaba505).
  • 2018-06-07 (v1.4.1) : Add NewDocumentFromReader examples.
  • 2018-03-24 (v1.4.0) : Deprecate NewDocument(url) and NewDocumentFromResponse(response).
  • 2018-01-28 (v1.3.0) : Add ToEnd constant to Slice until the end of the selection (thanks to @davidjwilkins for raising the issue).
  • 2018-01-11 (v1.2.0) : Add AddBack* and deprecate AndSelf (thanks to @davidjwilkins).
  • 2017-02-12 (v1.1.0) : Add SetHtml and SetText (thanks to @glebtv).
  • 2016-12-29 (v1.0.2) : Optimize allocations for Selection.Text (thanks to @radovskyb).
  • 2016-08-28 (v1.0.1) : Optimize performance for large documents.
  • 2016-07-27 (v1.0.0) : Tag version 1.0.0.
  • 2016-06-15 : Invalid selector strings internally compile to a Matcher implementation that never matches any node (instead of a panic). So for example, doc.Find("~") returns an empty *Selection object.
  • 2016-02-02 : Add NodeName utility function similar to the DOM's nodeName property. It returns the tag name of the first element in a selection, and other relevant values of non-element nodes (see doc for details). Add OuterHtml utility function similar to the DOM's outerHTML property (named OuterHtml in small caps for consistency with the existing Html method on the Selection).
  • 2015-04-20 : Add AttrOr helper method to return the attribute's value or a default value if absent. Thanks to piotrkowalczuk.
  • 2015-02-04 : Add more manipulation functions - Prepend* - thanks again to Andrew Stone.
  • 2014-11-28 : Add more manipulation functions - ReplaceWith*, Wrap* and Unwrap - thanks again to Andrew Stone.
  • 2014-11-07 : Add manipulation functions (thanks to Andrew Stone) and *Matcher functions, that receive compiled cascadia selectors instead of selector strings, thus avoiding potential panics thrown by goquery via cascadia.MustCompile calls. This results in better performance (selectors can be compiled once and reused) and more idiomatic error handling (you can handle cascadia's compilation errors, instead of recovering from panics, which had been bugging me for a long time). Note that the actual type expected is a Matcher interface, that cascadia.Selector implements. Other matcher implementations could be used.
  • 2014-11-06 : Change import paths of net/html to golang.org/x/net/html (see https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA). Make sure to update your code to use the new import path too when you call goquery with html.Nodes.
  • v0.3.2 : Add NewDocumentFromReader() (thanks jweir) which allows creating a goquery document from an io.Reader.
  • v0.3.1 : Add NewDocumentFromResponse() (thanks assassingj) which allows creating a goquery document from an http response.
  • v0.3.0 : Add EachWithBreak() which allows breaking out of an Each() loop by returning false. This function was added instead of changing the existing Each() to avoid breaking compatibility.
  • v0.2.1 : Make go-getable, now that go.net/html is Go1.0-compatible (thanks to @matrixik for pointing this out).
  • v0.2.0 : Add support for negative indices in Slice(). BREAKING CHANGE Document.Root is removed, Document is now a Selection itself (a selection of one, the root element, just like Document.Root was before). Add jQuery's Closest() method.
  • v0.1.1 : Add benchmarks to use as baseline for refactorings, refactor Next...() and Prev...() methods to use the new html package's linked list features (Next/PrevSibling, FirstChild). Good performance boost (40+% in some cases).
  • v0.1.0 : Initial release.

API

goquery exposes two structs, Document and Selection, and the Matcher interface. Unlike jQuery, which is loaded as part of a DOM document, and thus acts on its containing document, goquery doesn't know which HTML document to act upon. So it needs to be told, and that's what the Document type is for. It holds the root document node as the initial Selection value to manipulate.

jQuery often has many variants for the same function (no argument, a selector string argument, a jQuery object argument, a DOM element argument, ...). Instead of exposing the same features in goquery as a single method with variadic empty interface arguments, statically-typed signatures are used following this naming convention:

  • When the jQuery equivalent can be called with no argument, it has the same name as jQuery for the no argument signature (e.g.: Prev()), and the version with a selector string argument is called XxxFiltered() (e.g.: PrevFiltered())
  • When the jQuery equivalent requires one argument, the same name as jQuery is used for the selector string version (e.g.: Is())
  • The signatures accepting a jQuery object as argument are defined in goquery as XxxSelection() and take a *Selection object as argument (e.g.: FilterSelection())
  • The signatures accepting a DOM element as argument in jQuery are defined in goquery as XxxNodes() and take a variadic argument of type *html.Node (e.g.: FilterNodes())
  • The signatures accepting a function as argument in jQuery are defined in goquery as XxxFunction() and take a function as argument (e.g.: FilterFunction())
  • The goquery methods that can be called with a selector string have a corresponding version that takes a Matcher interface and is defined as XxxMatcher() (e.g.: IsMatcher())

Utility functions that are not in jQuery but are useful in Go are implemented as functions (that take a *Selection as parameter), to avoid a potential naming clash on the *Selection's methods (reserved for jQuery-equivalent behaviour).

The complete package reference documentation can be found here.

Please note that Cascadia's selectors do not necessarily match all supported selectors of jQuery (Sizzle). See the cascadia project for details. Also, the selectors work more like the DOM's querySelectorAll, than jQuery's matchers - they have no concept of contextual matching (for some concrete examples of what that means, see this ticket). In practice, it doesn't matter very often but it's something worth mentioning. Invalid selector strings compile to a Matcher that fails to match any node. Behaviour of the various functions that take a selector string as argument follows from that fact, e.g. (where ~ is an invalid selector string):

  • Find("~") returns an empty selection because the selector string doesn't match anything.
  • Add("~") returns a new selection that holds the same nodes as the original selection, because it didn't add any node (selector string didn't match anything).
  • ParentsFiltered("~") returns an empty selection because the selector string doesn't match anything.
  • ParentsUntil("~") returns all parents of the selection because the selector string didn't match any element to stop before the top element.

Examples

See some tips and tricks in the wiki.

Adapted from example_test.go:

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Request the HTML page.
  res, err := http.Get("http://metalsucks.net")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".left-content article .post-title").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the title
    title := s.Find("a").Text()
    fmt.Printf("Review %d: %s\n", i, title)
  })
}

func main() {
  ExampleScrape()
}

Related Projects

  • Goq, an HTML deserialization and scraping library based on goquery and struct tags.
  • andybalholm/cascadia, the CSS selector library used by goquery.
  • suntong/cascadia, a command-line interface to the cascadia CSS selector library, useful to test selectors.
  • gocolly/colly, a lightning fast and elegant Scraping Framework
  • gnulnx/goperf, a website performance test tool that also fetches static assets.
  • MontFerret/ferret, declarative web scraping.
  • tacusci/berrycms, a modern simple to use CMS with easy to write plugins
  • Dataflow kit, Web Scraping framework for Gophers.
  • Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
  • Pagser, a simple, easy, extensible, configurable HTML parser to struct based on goquery and struct tags.
  • stitcherd, A server for doing server side includes using css selectors and DOM updates.
  • goskyr, an easily configurable command-line scraper written in Go.
  • goGetJS, a tool for extracting, searching, and saving JavaScript files (with optional headless browser).
  • fitter, a tool for selecting values from JSON, XML, HTML and XPath formatted pages.
  • seltabl, an orm-like package and supporting language server for extracting values from HTML

Support

There are a number of ways you can support the project:

  • Use it, star it, build something with it, spread the word!
    • If you do build something open-source or otherwise publicly-visible, let me know so I can add it to the Related Projects section!
  • Raise issues to improve the project (note: doc typos and clarifications are issues too!)
    • Please search existing issues before opening a new one - it may have already been addressed.
  • Pull requests: please discuss new code in an issue first, unless the fix is really trivial.
    • Make sure new code is tested.
    • Be mindful of existing code - PRs that break existing code have a high probability of being declined, unless it fixes a serious issue.
  • Sponsor the developer
    • See the Github Sponsor button at the top of the repo on github
    • or via BuyMeACoffee.com, below

Buy Me A Coffee

License

The BSD 3-Clause license, the same as the Go language. Cascadia's license is here.

goquery's People

Contributors

38elements, amikai, andrewstuart, anthonygedeon, aybabtme, bfontaine, conneroisu, deining, dependabot[bot], fesaa, foolin, glebtv, haruyama, ithinco, jauderho, jcconnell, jpillora, jwilsson, kataras, kinoute, matrixik, mna, piotrkowalczuk, radovskyb, santosh653, slotix, thatguystone, thiemok, trtstm, zaba505

goquery's Issues

'*goquery.Document' to 'io.Reader'

This isn't an issue; it's an idea/feature request. It'd be cool if there were a way to convert a goquery.Document type to an io.Reader type. The NewDocumentFromReader function does the reverse. A NewReaderFromDocument function would be cool too.

exp/html api changed, causing cascadia to fail, causing goquery to fail

The exp/html API must have changed, since cascadia relies on functions that don't exist anymore.

Here are some terminal printouts for your pleasure:

$ go get github.com/PuerkitoBio/goquery                                                                                                                                              
# code.google.com/p/cascadia                                                                                                                                                                                                                  
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:18: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                   
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:30: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                   
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:70: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                   
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:257: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                  
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:274: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                  
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:354: parent.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                             
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:399: parent.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                             
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:419: n.FirstChild undefined (type *html.Node has no field or method FirstChild)                                                                                                  
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:465: n.PrevSibling undefined (type *html.Node has no field or method PrevSibling)                                                                                                
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:468: n.PrevSibling undefined (type *html.Node has no field or method PrevSibling)                                                                                                
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:468: too many errors  

Incorrect parse

Hi, I ran into a problem when parsing this web page:

[Pasted HTML source of the IMDb page for "Saturday Night Live: The Best of Chris Kattan (2003) (TV)". Only its rendered text survived extraction (navigation menus, title details, a user review, recommendations, and the site footer); the markup itself is not recoverable.]

Back to Movie index

The Go code looks like this:

for _, n := range doc.Find("body").Children().Not("style").Not("script").Nodes {
    buf.WriteString(getNodeText(n))
}

All I get in the end is the "Back to Movie index" text. I don't quite understand why.

SetAttr only updates existing attributes, it should also create them

I had a look at your tests and you're only testing updating attributes. I confirmed this with:

package main

import (
    "fmt"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {

    doc, _ := goquery.NewDocumentFromReader(strings.NewReader(`
        <html>
        <body>
            <input id="foo" placeholder="123"/>
        </body>
        </html>
    `))

    doc.Find("#foo").SetAttr("value", "bar").SetAttr("placeholder", "456")

    out, _ := doc.Html()

    fmt.Print(out)
}

$ go run example.go
<html><head></head><body>
            <input id="foo" placeholder="456"/>


    </body></html>

find input is nil ~~~

uri = "http://www.google.com"
doc, e := goquery.NewDocument(uri)
if e != nil {
    beego.Error(e)
}
form := doc.Find("input").Each(func(j int, input *goquery.Selection) {
        println(input.Html())
    })

The input is nil. Any help?

imported and not used: "github.com/PuerkitoBio/goquery" but undefined

When I run $ go run main.go

I get this:
# command-line-arguments
./main.go:5: imported and not used: "github.com/PuerkitoBio/goquery"
./main.go:9: undefined: NewDocument
./main.go:13: undefined: doc

The file main.go is the example.
I adjusted it so that it can run on its own:
I added package main and the corresponding import (import "github.com/PuerkitoBio/goquery").
Before this, I ran go get github.com/PuerkitoBio/goquery

How to find html attribute whose value begins with a number?

Say the html document I'm parsing has the attribute class="1post". Is there a way to use the Find() function for this class? If I run doc.Find(".1post"), I get this error:
panic: expected identifier, found 1 instead

This might be a cascadia issue. Do you know of a way around it?

Question about attributes

Hi,

I have HTML element:

[input onkeypress="if(event.keyCode == 13){processHash('Search')}" class="jq-zoho-search-input" type="text" id="searchInputBox" accesskey="f" title="Search jQuery" name="search"]

Is there a way to get all attributes of the selected element? I am expecting to get:

onkeypress="if(event.keyCode == 13){processHash('Search')}"
class="jq-zoho-search-input"
type="text"
id="searchInputBox"
accesskey="f"
title="Search jQuery"
name="search"

I know there is the Attr function, but I want all of them in one list: select an element, call something like GetAttributes (for example), and get back the list of attributes.

Is this possible with the current version of goquery? I can't find it.

Can't build inside a container

I'm trying to build this package inside a container:

Step 8 : RUN go get github.com/PuerkitoBio/goquery ---> Running in e81d2861928f
# github.com/PuerkitoBio/goquery
gopath/src/github.com/PuerkitoBio/goquery/filter.go:116: cannot use sel.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
gopath/src/github.com/PuerkitoBio/goquery/filter.go:116: cannot use cs.Filter(sel.Nodes) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in return argument
gopath/src/github.com/PuerkitoBio/goquery/filter.go:120: cannot use s.Get(0) (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/query.go:20: cannot use s.Nodes[0] (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/query.go:22: cannot use s.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:105: cannot use n (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:385: cannot use c (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to sel.MatchAll
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:385: cannot use sel.MatchAll(c) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in append
2014/11/06 15:01:48 The command [/bin/sh -c go get github.com/PuerkitoBio/goquery] returned a non-zero code: 2

Wondering if anyone can give me some insight?

Got an error when executing go get

After executing go get github.com/PuerkitoBio/goquery, I got the following output:

abort: code.google.com certificate error: certificate is for *.googleusercontent.com, *.blogspot.com, *.bp.blogspot.com, *.commondatastorage.googleapis.com, *.doubleclickusercontent.com, *.ggpht.com, *.googledrive.com, *.googlesyndication.com, *.sandbox.googleusercontent.com, *.storage.googleapis.com, blogspot.com, bp.blogspot.com, commondatastorage.googleapis.com, doubleclickusercontent.com, ggpht.com, googledrive.com, googleusercontent.com, static.panoramio.com.storage.googleapis.com, storage.googleapis.com
(configure hostfingerprint 70:03:d2:44:35:d0:d4:64:85:f0:3e:c8:15:9c:4d:e7:59:91:50:0d or use --insecure to connect insecurely)
package github.com/PuerkitoBio/goquery
    imports code.google.com/p/cascadia: exit status 255

I'm a newbie to Golang. Could anyone tell me how to fix it? Thanks!

The result doesn't match the selector string.

e.g.

...
doc.Find("img[src]")
...

< img src="assets/images/gallery/thumb-1.jpg" alt="150x150"/> is ok
< img alt="150x150" src="assets/images/gallery/thumb-1.jpg" /> will not match

Modifying (XML) file

This actually is a question; I was wondering,

  1. Will goquery work on arbitrary XML files?
  2. Is it possible to alter an attribute in the document?

How to "return" from within the .Each func?

doc.Find("dl[class='brand_tree'] dd ul li").Each(func(index int, s *goquery.Selection) {
    brandLiId, exists := s.Attr("id")
    if exists == false {
        return // End "Each" ?
    }

    //......
})

undefined: goquery.Selection

Are the docs outdated?

When running an Each statement such as the following,

r.Doc.Find("form").Each(func(i int, form *goquery.Selection) {

})

gives me undefined: goquery.Selection.

It seems to return *goquery.Node instead, which then means that I can't call .Find() on the form object.

What am I overlooking?

Why can't I install exp/html?

When I execute go get github.com/PuerkitoBio/goquery, I get an error that says: imports exp/html : unrecognized import path "exp/html".

How to read from file?

What if I have already downloaded an HTML file locally and want to parse it?

Based on the main example, I tried:

if doc, e = goquery.NewDocument("file://somefile.html"); e != nil {
    log.Fatal(e)
} 

But it gives:

2014/06/20 17:05:04 Get file://somefile.html: unsupported protocol scheme "file"
exit status 1

Getting video src?

Sorry for the newb question. I'm learning Go and trying to extract the mp4 from a Vine link. I'm following the example and trying to extract the "src" attribute from the "video" HTML tag, but I get this error when running:

# command-line-arguments
./main.go:17: multiple-value s.Find("video").Attr() in single-value context

I was wondering if I need to select the attribute a different way?

package main

import (
  "fmt"
  "log"

  "github.com/PuerkitoBio/goquery"
)

func getMP4URL() {
  doc, err := goquery.NewDocument("https://vine.co/v/MlWtKgwh7WY")
  if err != nil {
    log.Fatal(err)
  }

  doc.Find(".vine-video-container").Each(func(i int, s *goquery.Selection) {
    mp4 := s.Find("video").Attr("src")
    fmt.Printf("MP4 %d: %s\n", i, mp4)
  })
}

func main() {
  getMP4URL()
}

User-Agent

Can we have a way of configuring the User-Agent?
I'm getting problems recently because of it :(

Why no New from a Reader or String?

I am using goquery to scan existing HTML files. These aren't created from a response. And I don't want to expose go.net/html to my application.

A NewFromString or NewFromReader would be great... or just a simple New

func New(src io.Reader) (d *Document, e error){
    // Parse the HTML into nodes
    root, e := html.Parse(src)
    if e != nil {
        return
    }

    // Create and fill the document
    d = newDocument(root, nil)
    return
}

BTW nice library, thank you very much.

NextWhile function?

I see NextUntil and NextFilteredUntil, which are awesome. Is there a way to select the next elements while they match a certain selector? For example, I know I need to select a bunch of <p> elements but I don't know what the next non-p tag will be, so I don't know what to put in for "until".

If there isn't a way to do this currently, would you like me to try implementing this and submitting a pull request?

Don't change &nbsp; to space in Html()

I'm trying to split a string using the &nbsp; entity. This works with Jsoup for Java, but here it gets changed to a normal space, making what I want to do impossible.

Unable to install exp/html

I'm new to Go, so I'm not having much luck here.

You refer to the https://code.google.com/p/go-wiki/wiki/InstallingExp page for installing the experimental libraries; however, I think this page has changed since you wrote your instructions. Also, the example given on that page does not work.

I think that the HTML package may have been moved outside of experimental? I'm not familiar enough with Go to know. Either way, I cannot get goquery installed.

.Find Error "invalid memory address or nil pointer dereference"

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x459f16]

goroutine 5196 [running]:
github.com/PuerkitoBio/goquery.func·008(0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:383 +0x46
github.com/PuerkitoBio/goquery.mapNodes(0xc2091358f0, 0x1, 0x1, 0x2afca5fda5f8, 0x0, ...)
        /Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:532 +0x8f
github.com/PuerkitoBio/goquery.findWithSelector(0xc2091358f0, 0x1, 0x1, 0x718dd0, 0x25, ...)
        /Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:389 +0x8d
github.com/PuerkitoBio/goquery.(*Selection).Find(0xc208006660, 0x718dd0, 0x25, 0xc2000b3000)
        /Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:27 +0x45
main.qiche4sListSpider(0xc2000aca40, 0x2, 0xc20487e400, 0x32, 0x1512, ...)
        /Users/jinke/golang/src/cds_spider/price/main/bitauto.go:195 +0x406
main.func·005(0x1, 0xc2007fe180)
        /Users/jinke/golang/src/cds_spider/price/main/bitauto.go:165 +0x20e
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc20125a480, 0x2afca5fdadb8, 0x25)
        /Users/jinke/golang/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0xf7
main.qiche4sSpider(0xc2000aca40, 0x1447, 0x6b80f0, 0x0, 0x1512, ...)
        /Users/jinke/golang/src/cds_spider/price/main/bitauto.go:166 +0x5e5
main.func·008(0xc2000aca40, 0x1447, 0x6b80f0, 0x0, 0x1512, ...)
        /Users/jinke/golang/src/cds_spider/price/main/bitauto.go:263 +0xb6
created by cds_spider/price/frame.(*Frame).Start
        /Users/jinke/golang/src/cds_spider/price/frame/frame.go:116 +0x29c

code :

193  root, _ := html.Parse(res.Body)
194  document := goquery.NewDocumentFromNode(root)
195  selections := document.Find("div[class='lm_subprice_blc'] table tr")
196  if selections.Size() == 0 {
197         //.....
198         return
199  }

Is "root, _ := html.Parse(res.Body)" wrong here?
Update:

root, err := html.Parse(res.Body)
if err != nil {
    //.......
    return
}

NewDocumentFromString?

I saw this: #20 for NewDocumentFromReader. I could also really use NewDocumentFromString... I have the unfortunate responsibility to "clean" some HTML that is in an XML pseudo-RSS feed and resave it. So by the time I get down to the parts where I'd need to manipulate the HTML, I'm within a loop of strings.

Text()

First I would like to thank you for your code, it eases my life 👍

Issue I got

The source page is as follows:

<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Test page</title>
</head>
<body>
<h1>This is the Test page for a crawler</h1>
<p>Before getting the Admission of.</p>
</body>
</html>

doc.Find("body").Children().Not("style").Not("script").Text() gave me the result:
This is the Test page for a crawlerBefore getting the Admission of

Why are "crawler" and "Before" not separated? I don't think windows-1252 is the problem. Maybe something is wrong? I have not read through the source code yet.

Package can't be `go get`

Hello, I'm using your package as a testing library in my project, but as of today my builds have started to fail. The CI logs print the following error:

# github.com/PuerkitoBio/goquery
../../PuerkitoBio/goquery/filter.go:116: cannot use sel.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
../../PuerkitoBio/goquery/filter.go:116: cannot use cs.Filter(sel.Nodes) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in return argument
../../PuerkitoBio/goquery/filter.go:120: cannot use s.Get(0) (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/query.go:20: cannot use s.Nodes[0] (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/query.go:22: cannot use s.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
../../PuerkitoBio/goquery/traversal.go:105: cannot use n (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/traversal.go:385: cannot use c (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to sel.MatchAll
../../PuerkitoBio/goquery/traversal.go:385: cannot use sel.MatchAll(c) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in append
FAIL    github.com/9uuso/vertigo [build failed]

I tried using different Go versions, but the command seems to fail on at least Go 1.2, 1.3 and 1.3.1.

doc.Html()

doc.Html() returns the string wrapped in the html structure. How can I get just the string, without the html tag, head tag, and so on?

"Newbier" Example?

Hey there! Nice job with this package.

I'm a newcomer to Go, so I thought I'd most humbly submit the example I'd use to show others. It's just easier to copy and paste this example into a test.go file and simply run it (for complete beginners).

package main

import (
  "fmt"
  "log"
  gq "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  var doc *gq.Document
  var e error

  if doc, e = gq.NewDocument("http://metalsucks.net"); e != nil {
    log.Fatal(e)
  }

  doc.Find(".reviews-wrap article .review-rhs").Each(func(i int, s *gq.Selection) {
    band := s.Find("h3").Text()
    title := s.Find("i").Text()
    fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })
}

func main() {
  ExampleScrape()
}

Deleting element from DOM

Hi,

I can't solve one thing..

I have selection:

[h1 class="title"]
Some kind of title [a href="url" class="encore"][span class="comm red"][/span][/a]
[/h1]

selecting element:

sel := dom.Find("h1[class=title]")

then

sel = sel.Not("a")

After this I expect the "a" element to be removed along with all its children.

then I call

html, _ = sel.Html()
fmt.Println(html)

and I get the title with the "a" element still in it.

I am probably doing something wrong. In my case I need to remove elements from my selection.

thanks
