serp-spider / search-engine-google

:spider: Google client for SERPS

Home Page: https://serp-spider.github.io

License: Other

PHP 98.95% Shell 1.05%

search-engine google scraping serp

search-engine-google's Issues

Add a non captcha blocking exception

A GoogleCaptchaException is thrown if the response status code is other than 200 or 404 and a captcha input is found in the response body. This works well when a captcha is present. But sometimes Google blocks clients, either permanently or temporarily for a certain period, without showing a captcha. In this case a dedicated exception should be thrown so that developers can make a better decision about what to do next while scraping. See the following attachment for the page Google returns when it blocks a request without a captcha.

Wrong start value in the search URL

The search URLs for pages 1 and 2 are fine, but from the 3rd page onward the start value becomes huge.
For example, page 1: https://www.google.co.in/search?q=buy+watches&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
page 2: https://www.google.co.in/search?q=buy+watches&start=10&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
page 3: https://www.google.co.in/search?q=buy+watches&start=200&num=100&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
Here the page 3 start value should be 20, but it is 200.
My inputs are: page 1, 10 results / page 2, 10 results / page 3, 100 results.
So, ideally, the 3rd URL should have start=20&num=100.

Raw adwords parsing

The Google parser does not support AdWords results on raw pages.

Make cookies request dependent

Currently cookies are client dependent and proxies are request dependent. Cookies and proxies should have the same dependence, because we don't want to share cookies across many proxies.
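One way to picture the goal is a cookie store keyed by proxy, so that cookies set while using one proxy are never sent through another. This is only a sketch under that assumption; `ProxyCookieStore` is a hypothetical class and does not reflect the library's cookie jar API:

```php
<?php
// Hypothetical sketch: one cookie set per proxy, keyed by the proxy address.
class ProxyCookieStore
{
    /** @var array<string, array<string, string>> cookies indexed by proxy, then by name */
    private $jars = [];

    public function set(string $proxy, string $name, string $value): void
    {
        $this->jars[$proxy][$name] = $value;
    }

    // Returns only the cookies that were stored for the given proxy.
    public function all(string $proxy): array
    {
        return $this->jars[$proxy] ?? [];
    }
}

$store = new ProxyCookieStore();
$store->set('10.0.0.1:8080', 'NID', 'abc');
var_dump($store->all('10.0.0.2:8080')); // a different proxy sees no cookies
```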

Brand pack is returned as classical

http://serp-spider.github.io/documentation/search-engine/google/parse-page/#classical-large

Contrary to the documentation above, the brand pack is not returned under the "classical large" section. It is returned under "classical" only, although the sub-links array is available.
Please check.

Implement large classical result

When a site has sitelinks, it is detected as a map result; a large classical result should be used instead.

Wrong parameters for exception RequestErrorException in GoogleClient.php 110

PHP Fatal error: Wrong parameters for Exception([string $exception [, long $code [, Exception $previous = NULL]]]) in /vendor/serps/search-engine-google/src/GoogleClient.php on line 110

    $errorDom = new GoogleError($response->getPageContent(), $effectiveUrl);

    if ($errorDom->isCaptcha()) {
        throw new GoogleCaptchaException(new GoogleCaptcha($errorDom));
    } else {
        throw new RequestErrorException($errorDom);
    }

The argument passed to RequestErrorException must be a string, but $errorDom is a GoogleError object.
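PHP's base Exception constructor takes a string message, so an exception that wants to carry an object has to build a message from it first. The sketch below illustrates that pattern with stand-in classes; `PageError` and `PageErrorException` are hypothetical and are not the library's GoogleError or RequestErrorException:

```php
<?php
// Hypothetical stand-in for a page error object.
class PageError
{
    private $url;
    private $statusCode;

    public function __construct(string $url, int $statusCode)
    {
        $this->url = $url;
        $this->statusCode = $statusCode;
    }

    public function getUrl(): string { return $this->url; }
    public function getStatusCode(): int { return $this->statusCode; }
}

// Hypothetical exception: keeps the error object, but passes a string to the parent.
class PageErrorException extends RuntimeException
{
    private $error;

    public function __construct(PageError $error)
    {
        $message = sprintf(
            'Request to %s failed with status %d',
            $error->getUrl(),
            $error->getStatusCode()
        );
        parent::__construct($message, $error->getStatusCode());
        $this->error = $error;
    }

    public function getError(): PageError { return $this->error; }
}

try {
    throw new PageErrorException(new PageError('https://www.google.com/search?q=test', 503));
} catch (PageErrorException $e) {
    echo $e->getMessage(), "\n";
}
```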

Make tests easier to write

Hard-to-write unit tests take the joy out of implementing new features and fixing breakages after Google updates.

An easy way of writing tests means faster development.

Unit tests for the Google parser are hard to write and to read. A new, descriptive way of writing tests is needed.

Local (map) results are not returned

This is the URL I am trying to parse: https://www.google.co.in/search?q=chennai+schools&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr

It contains local (map) results.
But I am only able to get classical results from it through the parser; the map results are missing.

I tried different keywords, and news, images, and videos are all captured, but the map results never appear.

Images are not standard

A new way to work with media is available in the core. The goal is to give a standard way to work with images from results.

All images should now be output in this format (MediaInterface).

Implement related keywords

Your requirements could not be resolved to an installable set of packages.

Problem while installing with Composer:

Problem 1
- Can only install one of: psr/http-message[1.0, 1.0.1].
- Can only install one of: psr/http-message[1.0.1, 1.0].
- Can only install one of: psr/http-message[1.0, 1.0.1].
- serps/core 0.1.0 requires psr/http-message 1.0 -> satisfiable by psr/http-message[1.0].
- serps/http-client-curl v0.1.0 requires serps/core ~0.1.0 -> satisfiable by serps/core[0.1.0].
- Installation request for serps/http-client-curl ^0.1.0 -> satisfiable by serps/http-client-curl[v0.1.0].
- Installation request for psr/http-message == 1.0.1.0 -> satisfiable by psr/http-message[1.0.1].

Installation failed, reverting ./composer.json to its original content.

Version 0.2 unable to fetch a Google page

From #50:

If I pass a user agent, I get no response from the query function and the page keeps processing in what looks like an infinite loop. If the user agent parameter is passed empty, the query function returns a response.
There seems to be an issue with the user agent parameter.
Also, if I proceed by commenting out the user agent parameter, a response comes back, but $response->getNaturalResults(); throws InvalidDOMException.

"url" is blank for classical results

For the sample search "restaurants near me", the classical result set returned no value for the "url" field.

The search was attempted using CurlClient().

Parsing of video group for mobile results

Mobile results have video group items and need a special parser

Error with captcha

I get an error when a captcha is returned:

Fatal error: Uncaught Serps\SearchEngine\Google\Exception\GoogleCaptchaException in /vendor/serps/search-engine-google/src/GoogleClient.php on line 112
Serps\SearchEngine\Google\Exception\GoogleCaptchaException: in /vendor/serps/search-engine-google/src/GoogleClient.php on line 112

Has the DOM of the Google captcha page changed?

I have attached a dump of a different captcha page. I have been getting it very often in the last few days.
When I call getImageUrl(), it throws:
PHP Fatal error: Call to a member function getAttribute() on null in /var/www/seos/vendor/serps/search-engine-google/src/Page/GoogleCaptcha.php on line 57

CaptchaPageDump.txt

Improve IDE integration

Analysing results will be easier with better IDE completion for result set items.

For instance, $item->url should be typed as a URL object.

Large video results don't have the classical type

Large video results should have both the classical_video and classical types, but classical was missing.

Duplicates in search results

I have a problem.
I am trying to parse 10 pages for one keyword,
and the results contain duplicate URLs at different positions
(1-5 URLs across the 10 pages).

I use ArrayCookieJar on the request,
but each time I request the next page for the keyword, Google returns new cookie values.

Why do I see duplicates in the results?

News section (top stories) is not picked up

@gsouf Google now uses a "Top Stories" section to show news, in addition to the earlier news section in search results. This Top Stories section is not being parsed.
Can you include "Top Stories" under news results as well?
An example is given below:
https://www.google.co.in/search?safe=off&site=&source=hp&q=india&oq=india&gs_l=hp.3...380.705.0.866.6.5.0.0.0.0.232.232.2-1.1.0....0...1.1.64.hp..5.0.0.0.hLCgKAmWxZc

Image base64 shows an invalid image

Following the documentation for image_group, $image->image gives the invalid image data below:

data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==

Please fix it.
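The data URI reported above decodes to a GIF whose header declares a size of 1×1 pixel, which suggests a placeholder rather than the real thumbnail. A minimal sketch for spotting such values, assuming that a 1×1 GIF is never a genuine result image (`isOnePixelGif` is a hypothetical helper, not a library function):

```php
<?php
// Hypothetical helper: true when the data URI holds a GIF declared as 1x1 pixel.
function isOnePixelGif(string $dataUri): bool
{
    $prefix = 'data:image/gif;base64,';
    if (strncmp($dataUri, $prefix, strlen($prefix)) !== 0) {
        return false;
    }
    $binary = base64_decode(substr($dataUri, strlen($prefix)), true);
    if ($binary === false || strlen($binary) < 10 || substr($binary, 0, 3) !== 'GIF') {
        return false;
    }
    // GIF layout: 6-byte signature, then width and height as 16-bit little-endian integers.
    $size = unpack('vwidth/vheight', substr($binary, 6, 4));
    return $size['width'] === 1 && $size['height'] === 1;
}

$reported = 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==';
var_dump(isOnePixelGif($reported));
```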

Return all results as literal types

All search results should be returned as literal types.

For instance, $result->url returns a URL object.

The major drawback is that errors are harder to handle (null vs. object is harder to handle than empty string vs. string).

Option to split the process into fetching results and parsing

Currently it is one process (get the data and parse it).

Google can change the HTML at any time, and then the complete process fails or outputs wrong results.
An option to split the process into two parts would be nice, like this:

Get data -> output it as JSON, so I can cache it or save it in a database or S3 storage
Load JSON -> parse it

Two independent processes.
If Google changes the HTML, no problem: we have time to adjust the parsing and can parse later.
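The two-step flow described above can be sketched in plain PHP: store the raw page in a JSON envelope first, then load it later for parsing. The function names `storePage` and `loadPage` and the envelope keys are assumptions made for this illustration, not part of the library:

```php
<?php
// Hypothetical step 1: persist the raw page so it can be parsed later.
function storePage(string $path, string $url, string $html): void
{
    $payload = json_encode(
        ['url' => $url, 'fetched_at' => time(), 'html' => $html],
        JSON_THROW_ON_ERROR
    );
    if (file_put_contents($path, $payload) === false) {
        throw new RuntimeException("Unable to write $path");
    }
}

// Hypothetical step 2: load the stored page; parsing can happen at any later time.
function loadPage(string $path): array
{
    $payload = file_get_contents($path);
    if ($payload === false) {
        throw new RuntimeException("Unable to read $path");
    }
    return json_decode($payload, true, 512, JSON_THROW_ON_ERROR);
}

$path = tempnam(sys_get_temp_dir(), 'serp');
storePage($path, 'https://www.google.com/search?q=test', '<html><body>raw page</body></html>');
$page = loadPage($path);
echo $page['url'], "\n";
unlink($path);
```

The same envelope could be written to a database or object storage instead of a local file; only the two I/O calls would change.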

Local pack result details not being parsed completely

Search term: "restaurants near me"
Method: CurlClient()

Results obtained:

local_pack: [
    {
        title: null,
        url: null,
        street: null,
        stars: "4.0",
        review: null,
        phone: null
    },
    {
        title: null,
        url: null,
        street: null,
        stars: null,
        review: null,
        phone: null
    },
    {
        title: null,
        url: null,
        street: null,
        stars: "3.7",
        review: null,
        phone: null
    }
]

Actual result:

Issues while using proxy

When I use the proxy functionality with a proxy IP, port, username, and password, I run into a couple of issues.
(1) It throws GoogleCaptchaException while using a proxy (although when I checked the proxy by connecting locally, the captcha problem did not happen). I tried various proxies and still get the GoogleCaptchaException error.
(2) When the username or password contains a special character such as '-', the error below is thrown.
Curl was unable to process the request. Error code:56. Message : "Received HTTP code 407 from proxy after CONNECT"

Mobile results are not parsed

Google now returns a new results format on the first page for certain searches, and the parser does not seem to handle it. Please check.
Normal results that used to appear on the first page: https://www.mondovo.io/files/serp-test/2016-09-06/127957-1.html
New-format results on the first page: https://www.mondovo.io/files/serp-test/2016-09-06/127813-1.html
When the new-format results are returned, the parser is not able to parse them.
Note: Google currently sends both formats, so both need to be handled.

Provide basic parsing for flight results

Add flight results, but only basic information (mostly to know whether they are present or not).

Fatal Error for getNumberOfResults

The following error occurs for any search, with any client, when calling the getNumberOfResults() function:

Class 'Symfony\Component\CssSelector\CssSelectorConverter' not found
in Css.php line 28
at FatalErrorException->__construct() in HandleExceptions.php line 133
at HandleExceptions->fatalExceptionFromError() in HandleExceptions.php line 118
at HandleExceptions->handleShutdown() in HandleExceptions.php line 0
at Css::getConverter() in Css.php line 39
at Css::toXPath() in GoogleDom.php line 92
at GoogleDom->cssQuery() in GoogleSerp.php line 70
at GoogleSerp->getNumberOfResults() in SerpsClient.php line 70

Classical recipe results with a thumb are detected as video results

The result in the screenshot is detected as a video result.

Such results should be implemented as a classical result (+ recipe?).

Answer Box Empty Description

Say I search for
"how to earn money": then ['description'] is empty,

but when I search for
"why eating meat is bad", ['description'] is NOT empty.

Documentation issue for "news"

https://serp-spider.github.io/documentation/search-engine/google/parse-page/#in-the-news

The element cards is an array and not news

Get results from HTML file

Is it possible to get the results from a saved copy of google search results HTML file (not fetched with this scraper)?

Conflict with class and trait method

Hi @gsouf
After upgrading to 0.2 by following the migration steps, I am getting the following error:
FatalErrorException in GoogleUrl.php line 123:
Serps\SearchEngine\Google\GoogleUrl has colliding constructor definitions coming from traits

Please check and revert

VideoCover is empty

When searching for "simpsons movie trailer", the videoCover is empty.

Allow to run custom scripts on the dom

We should provide a common interface to run scripts on the DOM.

That would allow some advanced analysis of the page, for instance:

graphical position of things in the page
Simulation of clicks

Answer box being detected as "classical_video"

For example, when searching "what is php", the first result is an answer result (perhaps it could be called "classical_answer"), but it is detected as "classical_video".

Handle no-country redirection

When querying google.com, Google redirects to the local country TLD. Info: https://support.google.com/websearch/answer/873?hl=en

We need to provide a way to provision cookies with that value, one that can be applied again every time the cookies are refreshed (including when using a new proxy).

Using the parameter gws_rd=cr is an alternative fix:

use Serps\SearchEngine\Google\GoogleUrl;

$googleUrl = new GoogleUrl('google.com');
$googleUrl->setParam('gws_rd', 'cr');

Google parser stopped working

Hello, yesterday I noticed that the parser is now broken. I suspect it is caused by a Google update. Can anyone check it?

Issue with description field in news

When trying to access $result->cards[0]->description, it returns the error ErrorException in InTheNews.php line 60: Trying to get property of non-object.

This seems to be because the description doesn't exist for that news item in the search result. If it doesn't exist, it should return blank instead.
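The requested behavior amounts to checking that a node exists before reading it. A minimal sketch using PHP's DOM extension, assuming an illustrative markup with made-up class names (`textOrEmpty` is a hypothetical helper and the HTML does not mirror Google's real structure):

```php
<?php
// Hypothetical helper: text of the first matching node, or '' when nothing matches.
function textOrEmpty(DOMXPath $xpath, string $query, ?DOMNode $context = null): string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

// Illustrative markup: a news card that has a title but no description.
$dom = new DOMDocument();
$dom->loadHTML('<div class="card"><a class="title">Headline</a></div>');
$xpath = new DOMXPath($dom);
$card = $xpath->query('//div[@class="card"]')->item(0);

echo textOrEmpty($xpath, './/a[@class="title"]', $card), "\n";
var_dump(textOrEmpty($xpath, './/span[@class="description"]', $card));
```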

Natural results parsing fails

I have problems parsing natural results.

$results = $response->getNaturalResults();
foreach ($results as $result) {
    // parse results
    echo $result->title;
}

This is my code.
The query is OK, the proxy works, and the other parsers are working (AdWords, number of results).

Another user reported a small change in
#56

Maybe this is related.
I use version 0.1.4.

Crawler Issue

@gsouf
The Google search format for regular results seems to have changed.
The crawler is not fetching regular results; only news and local results come back.
Please look into this.
I have attached a sample HTML file that works (crawled yesterday) for you to compare with the current search.
Old file that works with the crawler: https://mondovo-serp.s3.amazonaws.com/2017-04-20/1026010-1.html

Now I am getting both types of results.

The actual difference I saw is that this tag
<--div class="rc" data-hveid="120" data-ved="0ahUKEwjB_Omp-7TTAhXEbSYKHacpDvoQFQh4KAEwBA">
from the old HTML
now appears as
<--div data-hveid="120" data-ved="0ahUKEwjB_Omp-7TTAhXEbSYKHacpDvoQFQh4KAEwBA">
<--div class="rc">
in the new HTML.
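A selector that matches the "rc" class wherever it appears would tolerate both layouts described above. The sketch below demonstrates this with PHP's DOM extension on simplified markup; it is an illustration only, and neither the helper name `countResultContainers` nor the XPath expression is taken from the library's parser:

```php
<?php
// Hypothetical helper: counts <div> elements carrying the "rc" class at any depth.
function countResultContainers(string $html): int
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match "rc" as a whole class token, even when other classes are present.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " rc ")]';
    return $xpath->query($query)->length;
}

// Simplified versions of the old and new layouts.
$old = '<div class="rc" data-hveid="120"><h3>Title</h3></div>';
$new = '<div data-hveid="120"><div class="rc"><h3>Title</h3></div></div>';

var_dump(countResultContainers($old) === countResultContainers($new));
```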

Throw an exception if natural results are not found

The natural results are the essential part of the result page.
I think it would be good to run a basic check on each request to verify that natural results are found.

If Google changes the HTML and the organic results are no longer found, an exception is thrown.
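The proposed check could look roughly like the following guard, applied to whatever list of natural results the parser produced. Both the exception class and the function are hypothetical names chosen for this sketch:

```php
<?php
// Hypothetical exception for a page on which no natural results could be located.
class MissingNaturalResultsException extends RuntimeException
{
}

// Hypothetical guard: returns the results unchanged, or throws when the list is empty.
function requireNaturalResults(array $results): array
{
    if (count($results) === 0) {
        throw new MissingNaturalResultsException(
            'No natural results were found; the page layout may have changed.'
        );
    }
    return $results;
}

try {
    requireNaturalResults([]);
} catch (MissingNaturalResultsException $e) {
    echo $e->getMessage(), "\n";
}
```

A query that genuinely has no matches would also trigger this guard, so a real check would need a way to tell the two cases apart.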

Add caching feature

Most of the time, searches are repeated with the same keyword and the same search (Google) domain. For this situation a cache should be available so that results can be returned instantly without querying Google in real time. The cached data can be deleted after 24 hours and rebuilt after that.
This feature would make the spider more efficient and reduce the chance of captchas and bans from the search engine. The local file system can be used for this purpose.

Make GoogleUrl::setPage 1-indexed

Currently GoogleUrl::setPage is 0-indexed; it would be more intuitive to make it 1-indexed.
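A file-based cache with a 24-hour lifetime, as proposed above, can be sketched in a few lines. `SerpFileCache` is a hypothetical class, and keying entries by a hash of domain and keyword is an assumption of this sketch:

```php
<?php
// Hypothetical file-based cache keyed by keyword and Google domain.
class SerpFileCache
{
    private $dir;
    private $ttl;

    public function __construct(string $dir, int $ttlSeconds = 86400)
    {
        $this->dir = rtrim($dir, '/');
        $this->ttl = $ttlSeconds;
    }

    // Returns the cached content, or null when it is missing or older than the TTL.
    public function get(string $keyword, string $domain): ?string
    {
        $path = $this->path($keyword, $domain);
        clearstatcache(true, $path);
        if (!is_file($path) || time() - filemtime($path) > $this->ttl) {
            return null;
        }
        $content = file_get_contents($path);
        return $content === false ? null : $content;
    }

    public function set(string $keyword, string $domain, string $content): void
    {
        if (file_put_contents($this->path($keyword, $domain), $content) === false) {
            throw new RuntimeException('Unable to write cache file.');
        }
    }

    private function path(string $keyword, string $domain): string
    {
        return $this->dir . '/' . sha1($domain . '|' . $keyword) . '.cache';
    }
}
```

The cache directory must already exist and be writable. A caller would try `get()` first and only query Google, followed by `set()`, when it returns null.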

Create the new GoogleError for page when IP is banned

Response code: 403
Page content:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
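A block page like the one above could be recognised from its status code together with a phrase from its body. This is a sketch under the assumption that the wording stays stable; `looksLikeBlockPage` is a hypothetical helper, and the marker phrase is copied from the page content shown above:

```php
<?php
// Hypothetical helper: detects the "automated queries" block page.
// The marker phrase comes from the reported page and may change if Google rewords it.
function looksLikeBlockPage(int $statusCode, string $body): bool
{
    if ($statusCode !== 403) {
        return false;
    }
    return stripos($body, 'sending automated queries') !== false;
}

$body = '<p>... but your computer or network may be sending automated queries.</p>';
var_dump(looksLikeBlockPage(403, $body));
var_dump(looksLikeBlockPage(200, $body));
```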

Adwords "visible_url" always null

The "visible_url" field for Adwords results is always null

Shopping ads parsing issue

@gsouf Recently a parsing issue has appeared on pages that contain shopping ads.
You can try any term such as "buy iphone" or "buy watches" in Google: there will be a shopping tab at the top with 12 or 16 ads. But the crawler tries to parse one more (some other div may be matching the ad criteria), so the ad count comes out as 13 or 17 in the crawler, and it throws an error because the url parameter is missing for this additional item.

Review Accept-Language header

The Accept-Language header is generated from the lr parameter.

lr is in the form lang_[ISO], but Accept-Language is just [ISO].
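Converting between the two forms is a matter of stripping the "lang_" prefix. A minimal sketch, assuming a single language value (`acceptLanguageFromLr` is a hypothetical helper, not a library function):

```php
<?php
// Hypothetical helper: turns an lr value such as "lang_fr" into "fr".
// Values without the "lang_" prefix are returned unchanged.
function acceptLanguageFromLr(string $lr): string
{
    $prefix = 'lang_';
    if (strncmp($lr, $prefix, strlen($prefix)) === 0) {
        return substr($lr, strlen($prefix));
    }
    return $lr;
}

echo acceptLanguageFromLr('lang_fr'), "\n";
```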

Proposal: Get the number of total result for a search

At the top of the results page, you can see:
About 119,000,000 results (0.68 seconds)

I think it is very important to know the total number of indexed elements.
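Extracting the number from a line like the one quoted above can be sketched with a regular expression. The helper `parseResultCount` is hypothetical, and the pattern assumes the English wording with comma thousands separators:

```php
<?php
// Hypothetical helper: extracts the result count from the English stats line.
// Returns null when the expected pattern is not present.
function parseResultCount(string $statsLine): ?int
{
    if (preg_match('/(\d[\d,]*)\s+results/', $statsLine, $matches) !== 1) {
        return null;
    }
    return (int) str_replace(',', '', $matches[1]);
}

var_dump(parseResultCount('About 119,000,000 results (0.68 seconds)'));
```

Other interface languages use different wording and separators, so this pattern would need to be adapted for them.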
