GithubHelp home page GithubHelp logo

technikempire / gq Goto Github PK

View Code? Open in Web Editor NEW
26.0 6.0 14.0 1.82 MB

CSS Selector Engine for Gumbo Parser

License: Other

C++ 90.69% Batchfile 4.24% HTML 4.72% CMake 0.35%
gumbo-parser css-selector

gq's Introduction

GQ

GQ is a CSS Selector Engine for Gumbo Parser written in C++11. Using Gumbo Parser as a backend, GQ can parse input HTML and allow users to select and modify elements in the parsed document with CSS Selectors and the provided simple, but powerful mutation API.

This project is a fork of gumbo-query. I opted to have this be an unofficial fork because I intended on performing nearly a complete rewrite, which I did, and as such this source is completely irreconcilable with the original gumbo-query source.

Usage

You can either construct a Document around an existing GumboOutput pointer, at which point the Document will assume managing the lifetime of the GumboOutput, or you can supply a raw string of HTML for Document to parse and also maintain.

std::string someHtmlString = "...";
std::string someSelectorString = "...";
auto testDocument = gq::Document::Create();
testDocument->Parse(someHtmlString);

try
{
    auto results = testDocument->Find(someSelectorString);
    auto numResults = results.GetNodeCount();
}
catch(std::runtime_error& e)
{
    // Necessary because naturally, the parser can throw.
}

As you can see, you can run raw selector strings into the ::Find(...) method, but each time, the selector string will be "compiled" into a SharedSelector and destroyed. You can alternatively "precompile" and save built selectors, and as such avoid wrapping every ::Find(...) call in a try/catch.

GumboOutput* output = SOMETHING_NOT_NULL;
auto testDocument = gq::Document::Create(output);

gq::Parser parser;

std::vector<std::string> collectionOfRawSelectorStrings {...};
std::vector<gq::SharedSelector> compiledSelectors();
compiledSelectors.reserve(collectionOfRawSelectorStrings.size());

for(auto& s : collectionOfRawSelectorStrings)
{
    try
    {
        auto result = parser.CreateSelector(s);
        compiledSelectors.push_back(result);
    }
    catch(std::runtime_error& e)
    {
        // Necessary because naturally, the parser can throw.
    }
}

size_t numResults = 0;
for(auto& ss : compiledSelectors)
{
    auto results = testDocument->Find(ss);
    numResults += results.GetNodeCount();
}

These snippets are just meant to demonstrate the most basic of usage. Thanks to the mutation api, it's possible to have fine grain control over elements matched by selectors. Look at the mutation sample for a complete example of using this feature.

The contract placed on the end user is very light. Keep Document alive for as long as you're storing or accessing any Node object, directly or indirectly. That's basically it.

Speed

One of the primary goals with this engine was to maximize speed. For my purposes, I wanted to ensure I could run an insane amount of selectors without any visible delay to the user. Running the TestParser test benchmarks parsing and using every single selector in EasyList (spare a handful which were removed because they're improperly formatted) against a standard high profile website's landing page HTML. For example, if I download the source for the landing page of yahoo.com and use it in the parser test at the time of this writing, the current results on my dev laptop are:

Processed 27646 selectors. Had handled errors? false
Benchmarking parsing speed.
Time taken to parse 2764600 selectors: 2443.11 ms.
Processed at a rate of 0.000883713 milliseconds per selector or 1131.59 selectors per millisecond.
Benchmarking document parsing.
Time taken to parse 100 documents: 8054.37 ms.
Processed at a rate of 80.5437 milliseconds per document.
Benchmarking selection speed.
Time taken to run 2764600 selectors against the document: 5709.75 ms producing 23300 total matches.
Processed at a rate of 0.00206531 milliseconds per selector or 484.189 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 100 times: 6110.32 ms.
Time per cycle 61.1032 ms.
Processed at a rate of 0.0022102 milliseconds per selector or 452.448 selectors per millisecond.

So from these results, a document could be loaded, parsed, and have 27646 precompiled selectors run on it in about 137.6412 milliseconds. If you include reserializing the input to an HTML string with modifications, it's about 141.6469 msec to load, parse the document, run 27646 selectors and serialize the output with modifications based on those selectors back to an HTML string.

It should be obvious that the speed can greatly vary depending on the size and complexity of the input HTML. For example, running the same test program against the cnn.com landing page yields the following results:

Processed 27646 selectors. Had handled errors? false
Benchmarking parsing speed.
Time taken to parse 2764600 selectors: 2396.14 ms.
Processed at a rate of 0.000866723 milliseconds per selector or 1153.77 selectors per millisecond.
Benchmarking document parsing.
Time taken to parse 100 documents: 2081.5 ms.
Processed at a rate of 20.815 milliseconds per document.
Benchmarking selection speed.
Time taken to run 2764600 selectors against the document: 3321.3 ms producing 9900 total matches.
Processed at a rate of 0.00120137 milliseconds per selector or 832.386 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 100 times: 3478.62 ms.
Time per cycle 34.7862 ms.
Processed at a rate of 0.00125827 milliseconds per selector or 794.741 selectors per millisecond.

As you can see, approaching double the speed over the yahoo.com website's landing page.

Speed doesn't mean much if the matching code is broken. As such, over 40 tests currently exist that ensure correct functionality of various types of selectors. I have yet to write tests for nested and combined selectors.

Configuration

Presently, there are only scripts/project files for building GQ under Windows with Visual Studio 2015. There is no reason why GQ cannot be used under Linux or OSX, I just simply have not gone there yet. It will come soon.
A recent contribution to the repository has added Cmake support, and bug fixes to enable correct functionality under Linux. There is a minimal amount of setup required for building under Windows with VS, and it's detailed in the Wiki.

TODO

  • Mutation API.
  • Tests for combined and nested selectors.
  • Reduce candidate collections BEFORE attempting to match in the event that the selector is a BinarySelector with the intersection operator. Can reduce sets by only keeping candidates that match the traits from both the left and right hand sides of the BinarySelector, which would drastically reduce candidates and thus drastically increase matching speed. This was tried and abandoned, it's actually faster to just let it chew through all candidates.
  • Modify Selector::Match() and related methods to return the final matched node. Required for child selectors and such.
  • Work around for including root node without having to switch to the abysmal weak_ptr in TreeMap.
  • Scripts/Project Files for building/using under Linux, OSX.

Original Goals

  • Wrapping things up in proper namespaces.
  • Remove custom rolled automatic reference counting, remove any sort of shared_ptr and make lifetime management simple.
  • Fix broken parsing that was ported from cascadia, but is invalid for use with Gumbo Parser.
  • Make parsing/matching produce the same behavior as jQuery does on the exact same test data.
  • Replace std::string with boost::string_ref wherever string copies don't truly need to be generated.
  • Implement a mapping system to dramatically increase matching speed by filtering potential matches by traits.
  • Remove local state tracking from the selector parser.
  • Expose compiled selectors to the public so that they can be retained and recycled against existing and new documents.
  • "Comments. Lots of comments."
  • "Speed. Lots of Speed."

gq's People

Contributors

hhyyrylainen avatar steven-joruk avatar technikempire avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

gq's Issues

Node tag names are broken

Node tag names, when building node attributes, are lost. The reason for this is because inside Node.cpp, the node's tag name as a string is extracted via the ::Util::GetNodeTagName method. This method returns a temporary string. Inside the ::Node class, this method is called then the result is wrapped in a boost::string_ref and stored within the attributes map. This of course cannot ever work, so performing any function based on a tag type/name would also be broken.

Not sure how this was missed, as tests should have failed as well.

Speed really isn't that great

It would seem that our tree building process is quite expensive. IIRC, when I looked into this some time ago, it wasn't gumbo's parsing speed really dragging things down, but rather the tree building we do post-gumbo parsing.

Should be able to speed that up.

Parser::ParseIdentifier(boost::string_ref& selectorStr) Doesn't calculate index

Parser::ParseIdentifier(boost::string_ref& selectorStr) doesn't correctly calculate the index, and has a duplicate error string.

if (ind <= 0 || ind > static_cast<int>(selectorStr.size()))
{
      throw std::runtime_error(u8"In Parser::ParseIdentifier(boost::string_ref&) - Expected selector containing identifier, yet no valid identifier was found.");
}

This is incorrect, because of the way that the ind member is incremented. In the event of a selector like:

#myid

only Parser::ParseIdentifier will be called, then the selector is complete. The ind will be incremented to string::size() + 1 before the loop breaks. This isn't a failure, it just means that the identifier consumed the entire selector.

Way too many matches

Running the TestParser example on the test html data yields hundreds of matches. CsQuery yields just over 30. Must manually validate results.

Examples

Need examples for this library usage in c++.
Thanks in advance.

Possible nth-child selector parsing bug

Got bug report here:

TechnikEmpire/HttpFilteringEngine#77

Google translate tells me roughly:

For example, this rule : 2muslim.com ### frameD3x9G6_center>: nth-child (2n)
GQ project inside HttpFilteringEngine\deps\qq\src\Parser.cpp 892 line rhs = std :: stoi (rhss);
This time rhss value is an empty string

So it looks like a failure to parse the nth child value. Needs to be confirmed.

Invalid HTML while parsing inline HTML string

Here is my code:

        auto testDocument = gq::Document::Create();
	testDocument->Parse("<div><p>1 <a > some link </a></p><p>2</p><p>3</p><p>4</p><p>5</p><p>6</p></div>");

	try
	{
	    auto results = testDocument->Find("p:nth-child(odd)");
	    auto numResults = results.GetNodeCount();
	    std::cout << numResults << std::endl;
	}
	catch(std::runtime_error& e)
	{
	    // Necessary because naturally, the parser can throw.
	}

Error Thrown: In Document::Parse(const std::string&) - Failed to generate any HTML nodes from parsing process. The supplied string is most likely invalid HTML.

It used to parse correctly before on gumbo-query with selector mistake.
I can see in Document.cpp gumbo_parse_with_options is used previous repo was using gumbo_parse. Not sure if this has anything to do with this?

Both `:contains` selector tests fail

Both :contains selector tests fail because they rightfully expect 3 matches, but because the <html> is not included in the document map to avoid a circular reference in shared_ptrs that will cause memory leaks, 1 less match is returned. I was hoping to avoid having to address this already, but it's an issue.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.