
Speed really isn't that great (gq, OPEN, 17 comments)

TechnikEmpire commented on May 20, 2024
Speed really isn't that great


Comments (17)

TechnikEmpire commented on May 20, 2024

@sinall There is definitely an issue that is killing performance, and I have identified it. I'm working on rewriting this stuff, but it's core functionality, so it's extensive work.

In the meantime, smaller HTML shouldn't be a big deal.


GuacheSuede commented on May 20, 2024

Any progress on this?


TechnikEmpire commented on May 20, 2024

Nope and this probably won't be getting attention for a long time as I have nothing driving that change.


phinzphinz commented on May 20, 2024

The parsing is HORRIBLY slow. I have packed gqsrc.zip so that the problem is reproducible. The only thing I changed in the source is to use "en_US.UTF8" instead of "en_US", as described in #13.

Simply unpack everything and compile. The program is very simple and is based on the example code:

#include "Document.hpp"
#include <iostream>
#include <string>
#include <fstream>
#include <chrono>

using namespace std;

std::string loadFileToString(const std::string &filename){
	std::ifstream t(filename);
	t.seekg(0, std::ios::end);
	size_t size = t.tellg();
	std::string buffer(size, ' ');
	t.seekg(0);
	t.read(&buffer[0], size); 
	return buffer;
}

int main() {
	string html = loadFileToString("./html/coinmarketcapAll.html");

	std::string &someHtmlString = html;
	std::string someSelectorString = "div";
	auto testDocument = gq::Document::Create();
	auto start= chrono::system_clock::now();
	testDocument->Parse(someHtmlString);
	auto startcss= chrono::system_clock::now();

	try
	{
		auto results = testDocument->Find(someSelectorString);
		const auto numResults = results.GetNodeCount();
		cout<<"numResults: "<<numResults<<endl;
	}
	catch(std::runtime_error& e)
	{
		throw e;
		// Necessary because naturally, the parser can throw.
	}


	auto end= chrono::system_clock::now();
	cout<<"Time Parse : "<<chrono::duration_cast<chrono::microseconds>(startcss-start).count()<< " microseconds."<<endl;
	cout<<"Time CSS   : "<<chrono::duration_cast<chrono::microseconds>(end-startcss).count()<< " microseconds."<<endl;

	return 0;
}

You can compile it in the unpacked directory gqsrc with the command g++ *.cpp -lgumbo (for simplicity, everything from GQ is static, except for libgumbo). If I run the executable with ./a.out, I get

numResults: 1681
Time Parse : 107849066 microseconds.
Time CSS   : 55982 microseconds.

And yes, that is more than 100 seconds, using all my CPU resources on an i7-7700HQ!!! If I turn on heavy optimization with g++ *.cpp -lgumbo -flto -Ofast -fwhole-program, ./a.out gives me

numResults: 1681
Time Parse : 5015653 microseconds.
Time CSS   : 7778 microseconds.

which is still a lot. For comparison, the following Julia code does the same thing (use Pkg.add() to install the three packages):

using Cascadia
using Gumbo
using Requests

r = get("https://coinmarketcap.com/all/views/all/")
@time h = parsehtml(convert(String, r.data));
@time cur = matchall(Selector("div"),h.root);

and takes this amount of time on a cold start (Julia is JIT-compiled):

julia> @time h = parsehtml(convert(String, r.data));
  0.906903 seconds (1.11 M allocations: 68.890 MiB, 2.61% gc time)

julia> @time cur = matchall(Selector("div"),h.root);
  1.310899 seconds (1.52 M allocations: 96.578 MiB, 1.70% gc time)

and this amount of time on repeated evaluations:

julia> @time h = parsehtml(convert(String, r.data));
  0.243488 seconds (897.90 k allocations: 57.309 MiB, 8.71% gc time)

julia> @time cur = matchall(Selector("div"),h.root);
  0.614870 seconds (1.27 M allocations: 82.394 MiB, 13.57% gc time)

This is less than a second. I wanted to switch to C++ for more speed and lower resource usage...


phinzphinz commented on May 20, 2024

I did the same for gumbo-query and packed gumboquerysrc.zip. With the program

#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"
#include <fstream>
#include <chrono>

using namespace std;

std::string loadFileToString(const std::string &filename){
	std::ifstream t(filename);
	t.seekg(0, std::ios::end);
	size_t size = t.tellg();
	std::string buffer(size, ' ');
	t.seekg(0);
	t.read(&buffer[0], size); 
	return buffer;
}


int main(int argc, char * argv[])
{
  string html = loadFileToString("./html/coinmarketcapAll.html");
  std::string &page=html;
  CDocument doc;
  auto start= chrono::system_clock::now();
  doc.parse(page.c_str());
  auto startcss= chrono::system_clock::now();
  CSelection c = doc.find("div");
  auto end= chrono::system_clock::now();
  
  cout<<"numResults: "<<c.nodeNum()<<endl;
  cout<<"Time Parse : "<<chrono::duration_cast<chrono::microseconds>(startcss-start).count()<< " microseconds."<<endl;
  cout<<"Time CSS   : "<<chrono::duration_cast<chrono::microseconds>(end-startcss).count()<< " microseconds."<<endl;
  return 0;
}

I get the following results. With g++ *.cpp -lgumbo, ./a.out gives me

numResults: 1681
Time Parse : 105975 microseconds.
Time CSS   : 23936 microseconds.

and with g++ *.cpp -lgumbo -flto -Ofast -fwhole-program, ./a.out gives me

numResults: 1681
Time Parse : 104778 microseconds.
Time CSS   : 2607 microseconds.

I have to admit that I like the GQ interface with namespaces more than that of gumbo-query, but without a performance improvement, GQ is not competitive.


TechnikEmpire commented on May 20, 2024

Please parse with the raw Gumbo interface for comparison, and then pass the constructed Gumbo output to the appropriate constructor for the gq document. Please benchmark both the raw Gumbo parsing and the construction of the gq document separately.

Please also post the results on your machine of the exact example program, without modifications (except the locale fix, I guess).

Parsing takes a hit because of the tree I build for rapid selector execution, but whatever is happening here obviously isn't right, and I've never seen this before.

Also, you mentioned gq competing with gumbo-parser. It was never meant to be a competition. Gumbo parser is so full of bugs I can't even tell you; just try to run some of the tests I've made here with it, and they'll fail entirely.
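For reference, a minimal sketch of the split measurement requested above might look like the following. It uses Gumbo's C API (gumbo_parse) and assumes gq::Document::Create can accept an already-parsed GumboOutput*, as the comment above suggests; that exact signature is a guess on my part, so adjust it to whatever Document.hpp actually exposes. Treat it as a sketch, not working code.

#include "Document.hpp"
#include <gumbo.h>
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
	// Load the same sample file used in the programs above.
	std::ifstream in("./html/coinmarketcapAll.html");
	std::stringstream ss;
	ss << in.rdbuf();
	const std::string html = ss.str();

	auto t0 = std::chrono::steady_clock::now();

	// 1) Raw Gumbo parse only.
	GumboOutput* output = gumbo_parse(html.c_str());

	auto t1 = std::chrono::steady_clock::now();

	// 2) gq document construction from the pre-parsed output.
	// Assumption: Create() accepts a GumboOutput*; this is the part to
	// adapt to the real gq API.
	auto doc = gq::Document::Create(output);

	auto t2 = std::chrono::steady_clock::now();

	std::cout << "Raw gumbo parse       : "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms\n";
	std::cout << "gq tree construction  : "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";

	// If gq does not take ownership of `output`, it would also need
	// gumbo_destroy_output(&kGumboDefaultOptions, output); here.
	return 0;
}

That would separate the time Gumbo itself spends from the time gq spends building its selector index on top of the parsed tree.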


phinzphinz commented on May 20, 2024

Using a modified Parse() function (I just included the <chrono> header and added three checkpoints and two couts), we can see that the Init() function is causing the delay.

	void Document::Parse(const std::string& source)
	{
		auto startparse=std::chrono::system_clock::now();
		// No attempting to parse empty strings.
		#ifndef NDEBUG
			assert(source.length() > 0 && (source.find_first_not_of(u8" \t\r\n") != std::string::npos) && u8"In Document::Parse(const std::string&) - Empty or whitespace string supplied.");
		#else
			if (source.length() == 0) { throw std::runtime_error(u8"In Document::Parse(const std::string&) - Empty string supplied."); }
		#endif

		if (m_gumboOutput != nullptr)
		{
			gumbo_destroy_output(&m_parsingOptions, m_gumboOutput);
		}
		
		m_gumboOutput = gumbo_parse_with_options(&m_parsingOptions, source.c_str(), source.size());

		// Check if we got coal in our stocking.
		if (m_gumboOutput == nullptr)
		{
			throw std::runtime_error(u8"In Document::Parse(const std::string&) - Failed to allocate GumboOutput.");
		}

		// Check if we got absolutely nothing in our stocking. If we didn't then, santa isn't real.
		if (m_gumboOutput->root == nullptr 
			|| m_gumboOutput->root->v.element.children.length == 0 
			|| static_cast<GumboNode*>(m_gumboOutput->root->v.element.children.data[0])->v.element.children.length == 0)
		{
			throw std::runtime_error(u8"In Document::Parse(const std::string&) - Failed to generate any HTML nodes from parsing process. The supplied string is most likely invalid HTML.");
		}

		m_node = m_gumboOutput->root;

		m_treeMap.Clear();
		auto beforeInit=std::chrono::system_clock::now();

		// Must call init to build out and index and children children.
		Init();
		auto afterInit=std::chrono::system_clock::now();
		std::cout<<"Gumbo parsing              : "<< std::chrono::duration_cast<std::chrono::microseconds>(beforeInit-startparse).count()<<" microseconds\n";
		std::cout<<"Init() in Document::Parse(): "<< std::chrono::duration_cast<std::chrono::microseconds>(afterInit-beforeInit).count()<<" microseconds\n";
	}

The new main.cpp I am using is:

#include "../include/Document.hpp"
#include <iostream>
#include <string>
#include <fstream>
#include <chrono>

using namespace std;

std::string loadFileToString(const std::string &filename){
	std::ifstream t(filename);
	t.seekg(0, std::ios::end);
	size_t size = t.tellg();
	std::string buffer(size, ' ');
	t.seekg(0);
	t.read(&buffer[0], size); 
	return buffer;
}

int main(int argc,const char** argv) {
	if (argc!=2) {
		cout<<"Usage: "<<argv[0]<<" url "<<endl;
		return 0;
	}

	string filename{argv[1]};
	cout<<filename<<endl;
	// return 0;
	string html = loadFileToString(filename);

	std::string &someHtmlString = html;
	std::string someSelectorString = "div";
	auto testDocument = gq::Document::Create();
	auto start= chrono::system_clock::now();
	testDocument->Parse(someHtmlString);
	auto startcss= chrono::system_clock::now();

	try
	{
		auto results = testDocument->Find(someSelectorString);
		const auto numResults = results.GetNodeCount();
		// cout<<"numResults: "<<numResults<<endl;
	}
	catch(std::runtime_error& e)
	{
		throw e;
		// Necessary because naturally, the parser can throw.
	}


	auto end= chrono::system_clock::now();
	// cout<<html.size()<<", "<<
	cout<<"Time Parse()               : "<<chrono::duration_cast<chrono::microseconds>(startcss-start).count()<< " microseconds."<<endl;
	cout<<"Time CSS selection         : "<<chrono::duration_cast<chrono::microseconds>(end-startcss).count()<< " microseconds."<<endl;
	cout<<"Length of HTML string      : "<<html.length()<<endl;

	return 0;
}

Note that I cannot use the exact example program, because it does not provide the functionality to load a string from a file and it does not have the timing checkpoints. But you can see that the program I am using is basically the exact example program plus time measurement and the ability to load a string from a file, so nothing special. I packed it in gqsrc.zip with three test HTML files, which can be passed as the first command-line argument to the executable generated with g++ *.cpp -lgumbo in the unpacked directory. The results are:
For a large HTML (2.5MB)

Gumbo parsing              : 104161 microseconds
Init() in Document::Parse(): 104796457 microseconds
Time Parse()               : 104900633 microseconds.
Time CSS selection         : 58580 microseconds.
Length of HTML string      : 2530379

For a mid HTML (160KB)

Gumbo parsing              : 29905 microseconds
Init() in Document::Parse(): 250702 microseconds
Time Parse()               : 280621 microseconds.
Time CSS selection         : 694 microseconds.
Length of HTML string      : 158461

For a small HTML (75KB)

Gumbo parsing              : 4480 microseconds
Init() in Document::Parse(): 48563 microseconds
Time Parse()               : 53059 microseconds.
Time CSS selection         : 410 microseconds.
Length of HTML string      : 75829

I think the algorithm used in Init() is not efficient for large HTMLs.
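A rough back-of-the-envelope check of my own (assuming Init() follows a simple power law in document length, which is only a guess): fitting the smallest and largest runs above gives an exponent of

\alpha \approx \frac{\ln(104796457 / 48563)}{\ln(2530379 / 75829)} \approx 2.2

so Init() appears to grow roughly quadratically with document size, consistent with the 75 KB / 160 KB / 2.5 MB progression (about 49 ms, 251 ms, and 105 s).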


TechnikEmpire commented on May 20, 2024

@phinzphinz Ah, ok, so there it is: it's the size variation.

That whole tree system I made was sort of whipped up in the moment without any planning. I'm thinking I need to just blow it away and let the queries execute dynamically, like gumbo-query does.

Thanks for the reports and for doing the testing. I'm pretty piled up with work right now, but I'll see if I can't get around to looking at this. Pretty sure my goal at this point is going to be to delete the Tree class files and fix things until it compiles/works again.
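For context, "executing dynamically" here just means walking the raw Gumbo tree for each query instead of pre-building an index. A minimal sketch of that idea, using only the plain Gumbo node API and simply counting <div> elements (an illustration, not gq's planned implementation):

#include <gumbo.h>
#include <cstddef>

// Recursively count element nodes with the given tag (e.g. GUMBO_TAG_DIV),
// starting from a parsed tree's root. This is the O(document size) per-query
// walk a dynamic engine pays, traded against gq's up-front index construction.
static std::size_t CountTag(const GumboNode* node, GumboTag tag)
{
	if (node->type != GUMBO_NODE_ELEMENT)
	{
		return 0;
	}

	std::size_t count = (node->v.element.tag == tag) ? 1 : 0;

	const GumboVector& children = node->v.element.children;
	for (unsigned int i = 0; i < children.length; ++i)
	{
		count += CountTag(static_cast<const GumboNode*>(children.data[i]), tag);
	}

	return count;
}

// Usage, given a GumboOutput* output from gumbo_parse():
//   std::size_t divs = CountTag(output->root, GUMBO_TAG_DIV);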


sinall commented on May 20, 2024

@TechnikEmpire
I'm trying to download, parse and extract information from http://vip.stock.finance.sina.com.cn/corp/go.php/vRPD_NewStockIssue/page/1.phtml.

Here's the code:

	auto document = gq::Document::Create();
	document->Parse(utf8Body);
	stopWatch.Stop();
	logger_.Debug(WStringUtils::Format(L"Parsed html in [%.2fms]", stopWatch.GetLastTaskTimeMillis()));

	stopWatch.Start();
	std::vector<StockInfo> stockInfos;
	auto trNodes = document->Find(GetSelector("#NewStockTable tbody tr"));
	for (auto i = 2U; i < trNodes.GetNodeCount(); i++) {
		StockInfo stockInfo;
		auto trNode = trNodes.GetNodeAt(i);
		stockInfo.SetSymbol(trNode->GetChildAt(0)->GetText());
		stockInfo.SetPurchaseSymbol(trNode->GetChildAt(1)->GetText());
		stockInfo.SetName(WStringUtils::TrimSpace(trNode->GetChildAt(2)->GetText()));
		auto issueDateStr = trNode->GetChildAt(3)->GetText();
		auto issueDate = DateUtils::ParseAsLocal(issueDateStr);
		stockInfo.SetIssueDate(issueDate);
		auto issuePrice = StringUtils::To<double>(trNode->GetChildAt(7)->GetText());
		stockInfo.SetIssuePrice(issuePrice);
		stockInfos.push_back(stockInfo);
	}
	stopWatch.Stop();
	logger_.Debug(WStringUtils::Format(L"Extracted [%d] new stocks in [%.2fms]", stockInfos.size(), stopWatch.GetLastTaskTimeMillis()));

Here's the log:
2018-05-27 23:03:42.496175 [0x00004290] debug - Downloaded html in [99.00ms]
2018-05-27 23:03:53.082191 [0x00004290] debug - Parsed html in [10584.00ms]
2018-05-27 23:03:55.506796 [0x00004290] debug - Extracted [40] new stocks in [2422.00ms]

The HTML is about 442KB. It takes 10 seconds to parse the HTML and 2.4 seconds to extract the information. This is too slow for my application.


TechnikEmpire commented on May 20, 2024

Ok, I'm going to actually look at this today, because that's brutal. I merged in a PR or two without testing; that's about all I can figure in terms of what went wrong.

I've been looking at myhtml, and it has a selector engine in Modest. This project might get killed and become a wrapper around that if this issue is that serious. Last time I benchmarked this code, it took 83 msec to parse Yahoo's landing page and build the selector tree/cache. So I dunno what's happened, but I'll take some action on this issue at some point today.


TechnikEmpire commented on May 20, 2024

Is everyone here testing on Linux?


sinall commented on May 20, 2024

I'm using GQ on Windows.


TechnikEmpire commented on May 20, 2024

@sinall Thanks.


TechnikEmpire commented on May 20, 2024

Ok folks, well, you have to weigh in. I can bring parsing back to basically raw Gumbo parsing speed, but the consequence is that the super-fast selector system is RIP in peace.

Time taken to run 27646 selectors against the document: 75036.2 ms producing 1685 total matches.
Processed at a rate of 2.71418 milliseconds per selector or 0.368436 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 1 times: 74865.3 ms.
Time per cycle 74865.3 ms.
Processed at a rate of 2.708 milliseconds per selector or 0.369277 selectors per millisecond.

Not included here: the parsing stage was ~180 msec, IIRC.

Those are the stats using the huge (>2 MB) HTML sample that @phinzphinz submitted. It takes nearly 3 msec to run a single selector through the entire parsed body.

I guess it's up to what people want. To me, this makes the whole project garbage and a waste of time, but that's just my opinion.


TechnikEmpire commented on May 20, 2024

@sinall Oh, definitely don't use debug lol, debug is absolutely horridly slow. Test speeds in release. Anyway I'm working on this.


sinall commented on May 20, 2024

@TechnikEmpire I tried release mode, here's the log:

2018-05-30 11:05:18.679213 [0x00002a78] debug - Downloaded html in [129.00ms]
2018-05-30 11:05:18.938386 [0x00002a78] debug - Parsed html in [259.00ms]
2018-05-30 11:05:19.022441 [0x00002a78] debug - Extracted [40] new stocks in [83.00ms]

This is acceptable for me at the moment, and I'm looking forward to your fast selector system in the future.


TechnikEmpire commented on May 20, 2024

My advice to everyone here is to switch to myhtml (or whichever of his bundles has the selectors). The design of Gumbo is such that the internals are kept away from the user entirely, and to make this work properly, without a slowdown as HTML size increases, I need to make invasive changes to Gumbo. Frankly, I can't decide which is more painful: writing my own spec-compliant parser in C++ or being forced to maintain an abandoned C library written with precisely zero comments detailing anything.

