
serp-spider / search-engine-google

:spider: Google client for SERPS

Home Page: https://serp-spider.github.io

License: Other

PHP 98.95% Shell 1.05%
search-engine google scraping serp

search-engine-google's People

Contributors

atefbb, gsouf, janpio, lmahesh5, msiemens, shiftas

search-engine-google's Issues

Add a non captcha blocking exception

The captchaException is thrown if the response status code is other than 200 or 404 and a captcha input is found in the response body. This works well when a captcha is found. But sometimes Google blocks a client, either permanently or temporarily for a certain period, without serving a captcha. In that case a dedicated, informative exception should be thrown so that developers can make a better decision about what to do next while scraping. See the following attachment for what Google returns when it blocks a request without a captcha.
google_block

wrong start value coming in the search url

The search URL for pages 1 and 2 is fine, but from the 3rd page on the start value becomes huge.
For example, page 1 : https://www.google.co.in/search?q=buy+watches&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
page 2 : https://www.google.co.in/search?q=buy+watches&start=10&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
page 3 : https://www.google.co.in/search?q=buy+watches&start=200&num=100&uule=w+CAIQICIHQ2hlbm5haQ&gws_rd=cr
Here the start value for page 3 should be 20, but it shows 200.
My inputs are: page 1, 10 results / page 2, 10 results / page 3, 100 results.
So, ideally, the 3rd URL should have start=20&num=100.
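
One reading of the expectation above is that the start offset should be the running total of results requested on earlier pages, rather than (page - 1) multiplied by the current page size (which would give 200 for page 3 with 100 results). A minimal sketch of that arithmetic follows; `startOffsets` is a hypothetical helper for illustration, not part of the library.

```php
<?php
/**
 * Hypothetical helper: given the number of results requested on each
 * page (in order), return the start offset to use for each page.
 */
function startOffsets(array $resultsPerPage): array
{
    $offsets = [];
    $start = 0;
    foreach ($resultsPerPage as $count) {
        $offsets[] = $start;   // this page starts after everything already requested
        $start += $count;      // advance by this page's size
    }
    return $offsets;
}

// Page 1 with 10 results, page 2 with 10 results, page 3 with 100 results.
$offsets = startOffsets([10, 10, 100]);

// Query string for page 3.
echo http_build_query(['q' => 'buy watches', 'start' => $offsets[2], 'num' => 100]), "\n";
// prints: q=buy+watches&start=20&num=100
```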

Make cookies request dependent

Currently cookies are client dependent and proxies are request dependent. Cookies and proxies should have the same scope, because we don't want to share cookies across many proxies.
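
A rough sketch of the idea, assuming cookies are kept as simple name/value pairs: each proxy gets its own jar, so cookies set while going through one proxy are never sent through another. `PerProxyCookieStore` and the proxy key format are hypothetical and do not reflect the library's cookie jar classes.

```php
<?php
/**
 * Hypothetical store that keeps one cookie set per proxy.
 */
final class PerProxyCookieStore
{
    /** @var array<string, array<string, string>> proxy key => cookie name => value */
    private array $jars = [];

    public function set(string $proxy, string $name, string $value): void
    {
        $this->jars[$proxy][$name] = $value;
    }

    /** Cookies recorded for this proxy only; empty when the proxy is new. */
    public function all(string $proxy): array
    {
        return $this->jars[$proxy] ?? [];
    }
}

$store = new PerProxyCookieStore();
$store->set('10.0.0.1:8080', 'NID', 'abc');
$store->set('10.0.0.2:8080', 'NID', 'xyz');
```

Keying on the proxy means a request that switches proxy starts from an empty cookie set instead of reusing cookies issued to a different IP.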

Wrong parameters for exception RequestErrorException in GoogleClient.php 110

PHP Fatal error: Wrong parameters for Exception([string $exception [, long $code [, Exception $previous = NULL]]]) in /vendor/serps/search-engine-google/src/GoogleClient.php on line 110

            $errorDom = new GoogleError($response->getPageContent(), $effectiveUrl);

            if ($errorDom->isCaptcha()) {
                throw new GoogleCaptchaException(new GoogleCaptcha($errorDom));
            } else {
                throw new RequestErrorException($errorDom);
            }

The first argument must be a string, but $errorDom is a GoogleError object.
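
PHP's base `Exception` constructor expects a string message as its first argument, so an exception that needs to carry a page object has to accept the object itself and forward a string to the parent. The sketch below shows that pattern with a hypothetical `PageErrorException`; it is not the library's `RequestErrorException`.

```php
<?php
/**
 * Hypothetical exception that carries an arbitrary page object
 * while still giving the parent constructor a string message.
 */
class PageErrorException extends \Exception
{
    private object $page;

    public function __construct(
        object $page,
        string $message = 'The request returned an error page',
        int $code = 0,
        ?\Throwable $previous = null
    ) {
        // The parent only ever sees a string, an int and a Throwable.
        parent::__construct($message, $code, $previous);
        $this->page = $page;
    }

    public function getPage(): object
    {
        return $this->page;
    }
}
```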

Make tests easier to write

Hard-to-write unit tests mean no joy in implementing new features or fixing breakage after a Google update.

An easy way of writing tests means faster development.

Unit tests for the Google parser are hard to write and to read. A new, descriptive way of writing tests is needed.

Images are not standard

A new way to work with media is available in the core, the goal is to give a standard way to work with images from results.

All images should now be output in this format (MediaInterface).

Your requirements could not be resolved to an installable set of packages.

Problem while installing using Composer:

Problem 1
- Can only install one of: psr/http-message[1.0, 1.0.1].
- Can only install one of: psr/http-message[1.0.1, 1.0].
- Can only install one of: psr/http-message[1.0, 1.0.1].
- serps/core 0.1.0 requires psr/http-message 1.0 -> satisfiable by psr/http-message[1.0].
- serps/http-client-curl v0.1.0 requires serps/core ~0.1.0 -> satisfiable by serps/core[0.1.0].
- Installation request for serps/http-client-curl ^0.1.0 -> satisfiable by serps/http-client-curl[v0.1.0].
- Installation request for psr/http-message == 1.0.1.0 -> satisfiable by psr/http-message[1.0.1].

Installation failed, reverting ./composer.json to its original content.

version 0.2 unable to fetch a google page

From #50:

If I use the UserAgent, I get no response from the query function; the page keeps processing in what looks like an infinite loop. If the UserAgent param is passed empty, the query function returns a response.
There seems to be some issue with the User Agent parameter.
Also, if I proceed by commenting out the user agent param, a response comes back, but $response->getNaturalResults(); throws InvalidDOMException.

"url" is blank for classical results

For sample search "restaurants near me", the classical result set returned no value for the "url" field.

The search was attempted using the CurlClient()

Error with captcha

I get an error when a captcha is returned:

Fatal error: Uncaught Serps\SearchEngine\Google\Exception\GoogleCaptchaException in /vendor/serps/search-engine-google/src/GoogleClient.php on line 112
Serps\SearchEngine\Google\Exception\GoogleCaptchaException: in /vendor/serps/search-engine-google/src/GoogleClient.php on line 112

Has the DOM of the Google captcha page changed?

I have attached a dump of another captcha page. I have been getting this very often in the last few days.
When I call getImageUrl() it throws:
PHP Fatal error: Call to a member function getAttribute() on null in /var/www/seos/vendor/serps/search-engine-google/src/Page/GoogleCaptcha.php on line 57

CaptchaPageDump.txt
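
The fatal error above comes from calling `getAttribute()` on the result of a lookup that found nothing. A defensive version checks the node before using it. The sketch below uses only PHP's DOM extension; `captchaImageUrl` and the `//img` query are assumptions for illustration and are not the selectors used in `GoogleCaptcha.php`.

```php
<?php
/**
 * Hypothetical helper: return the src of the first <img> in the page,
 * or null when the page has no image at all.
 */
function captchaImageUrl(string $html): ?string
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; silence libxml warnings.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $img = $xpath->query('//img')->item(0);   // item() returns null when nothing matched

    return $img instanceof DOMElement ? $img->getAttribute('src') : null;
}
```

Returning null lets the caller decide whether a captcha page without an image is an error or simply a different page layout.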

Improve IDE integration

Analysing results will be easier with better IDE completion for result set items.

For instance, $item->url should be typed as a URL object.

Duplicates in search results.

I have a problem.
I am trying to parse 10 pages for one keyword,
and the results contain duplicate URLs at different positions
(1-5 URLs across the 10 pages).

I use ArrayCookieJar on the request,
but when I request each next page for the keyword, Google returns new values for the cookies.

Why do I see duplicates in the results?
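
Whatever the cause on Google's side, a caller can filter repeated URLs after the pages have been collected. A minimal sketch, assuming each page has already been reduced to a list of URL strings; `uniqueUrls` is a hypothetical helper, not a library function.

```php
<?php
/**
 * Hypothetical helper: flatten per-page URL lists into one list,
 * keeping only the first occurrence of each URL.
 */
function uniqueUrls(array $pages): array
{
    $seen = [];
    $unique = [];
    foreach ($pages as $urls) {
        foreach ($urls as $url) {
            if (!isset($seen[$url])) {
                $seen[$url] = true;
                $unique[] = $url;   // first time this URL appears
            }
        }
    }
    return $unique;
}
```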

Image base64 is showing some invalid image

Following the documentation, for an image_group result $image->image gives the invalid image data below:

data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==

Please fix it.
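
The payload above is not corrupt: it decodes to a GIF whose header declares a size of 1x1 pixel, which suggests a placeholder rather than the real image. The sketch below reads the width and height from the GIF header so such values can be detected; `gifDimensions` is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: return [width, height] for a base64 GIF data URI,
 * or null when the input is not a decodable GIF.
 */
function gifDimensions(string $dataUri): ?array
{
    $comma = strpos($dataUri, ',');
    if ($comma === false) {
        return null;
    }
    $bytes = base64_decode(substr($dataUri, $comma + 1), true);
    if ($bytes === false || strlen($bytes) < 10) {
        return null;
    }
    $signature = substr($bytes, 0, 6);
    if ($signature !== 'GIF87a' && $signature !== 'GIF89a') {
        return null;
    }
    // Bytes 6-9 hold width and height as little-endian 16-bit integers.
    $size = unpack('vwidth/vheight', substr($bytes, 6, 4));
    return [$size['width'], $size['height']];
}

$uri = 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==';
print_r(gifDimensions($uri));   // width 1, height 1
```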

Return all results as literal types

All search results should be returned as literal types.

For instance, $result->url currently returns a URL object.

The major drawback of returning objects is that errors are harder to handle (null vs. object is harder to handle than empty string vs. string).

Option to split the process in get results and parsing

Currently it is one process (get the data and parse it).

Google can change the HTML at any time, and then the complete process fails or outputs wrong results.
An option to split the process into two parts would be nice, like this:

  1. Get data -> output it as JSON, so I can cache it or save it in a database or S3 storage
  2. Load JSON -> parse it

Two independent processes.
If Google changes the HTML, no problem: we have time to adjust the parsing and can parse later.
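
The two steps above can be sketched with PHP's built-in JSON and DOM functions. The record layout (`url`, `fetched_at`, `html`), both function names and the `//h3` query are assumptions for illustration; the library's own parser is not involved here.

```php
<?php
/**
 * Step 1 (hypothetical): wrap a fetched page in a JSON record that can be
 * stored in a file, a database or object storage.
 */
function packRawPage(string $url, string $html, int $fetchedAt): string
{
    return json_encode(
        ['url' => $url, 'fetched_at' => $fetchedAt, 'html' => $html],
        JSON_THROW_ON_ERROR
    );
}

/**
 * Step 2 (hypothetical): load a stored record later and parse it.
 * Here "parsing" just collects the text of every <h3> element.
 */
function extractTitles(string $json): array
{
    $record = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    $dom = new DOMDocument();
    @$dom->loadHTML($record['html']);   // silence warnings from imperfect HTML

    $titles = [];
    foreach ((new DOMXPath($dom))->query('//h3') as $h3) {
        $titles[] = trim($h3->textContent);
    }
    return $titles;
}
```

Because the stored record keeps the untouched HTML, step 2 can be re-run with updated selectors on pages fetched earlier.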

Local pack result details not being parsed completely

Search term: "restaurants near me"
Method: CurlClient()

Results obtained:

local_pack: [
  {
    title: null,
    url: null,
    street: null,
    stars: "4.0",
    review: null,
    phone: null
  },
  {
    title: null,
    url: null,
    street: null,
    stars: null,
    review: null,
    phone: null
  },
  {
    title: null,
    url: null,
    street: null,
    stars: "3.7",
    review: null,
    phone: null
  }
]

Actual result:
restaurants near me google search

Issues while using proxy

When I use the proxy functionality with a proxy IP, port, username and password, I face a couple of issues.
(1) It throws GoogleCaptchaException while using a proxy (though when I checked the proxy connection locally, the captcha problem didn't happen). I tried various proxies and still get the GoogleCaptchaException error.
(2) When the username or password contains a special character like '-', the error below is thrown.
Curl was unable to process the request. Error code:56. Message : "Received HTTP code 407 from proxy after CONNECT""

Mobile results are not parsed

Google is now returning a new result format on the first page for certain searches, and the parser does not seem to handle it. Please check.
Normal results that used to come on the first page: https://www.mondovo.io/files/serp-test/2016-09-06/127957-1.html
New-format results on the first page: https://www.mondovo.io/files/serp-test/2016-09-06/127813-1.html
When results come back in the new format, the parser is not able to parse them.
Note: Google currently sends both formats, so both need to be handled.

Fatal Error for getNumberOfResults

The following error occurs for any search, using any client, when calling the getNumberOfResults() function:

Class 'Symfony\Component\CssSelector\CssSelectorConverter' not found
in Css.php line 28
at FatalErrorException->__construct() in HandleExceptions.php line 133
at HandleExceptions->fatalExceptionFromError() in HandleExceptions.php line 118
at HandleExceptions->handleShutdown() in HandleExceptions.php line 0
at Css::getConverter() in Css.php line 39
at Css::toXPath() in GoogleDom.php line 92
at GoogleDom->cssQuery() in GoogleSerp.php line 70
at GoogleSerp->getNumberOfResults() in SerpsClient.php line 70

Answer Box Empty Description

Say I search
"how to earn money": then ['description'] is empty,

but when I search
"why eating meat is bad", ['description'] is not empty.

Get results from HTML file

Is it possible to get the results from a saved copy of google search results HTML file (not fetched with this scraper)?

Conflict with class and trait method

Hi @gsouf
After upgrading to 0.2 by following the migration steps, I am getting the following error:
FatalErrorException in GoogleUrl.php line 123:
Serps\SearchEngine\Google\GoogleUrl has colliding constructor definitions coming from traits

Please check and reply.

Allow to run custom scripts on the dom

We should provide a common interface to run scripts on the dom.

That would allow some advanced analysis of the page, for instance:

  • graphical position of elements on the page
  • simulation of clicks

handle no country redirection

When querying google.com, Google will redirect to the local country TLD. Info: https://support.google.com/websearch/answer/873?hl=en

We need to provide a way to provision cookies with that value, one that can be re-applied every time the cookies are refreshed (including when using a new proxy).

Using the parameter gws_rd=cr is an alternative fix:

use Serps\SearchEngine\Google\GoogleUrl;

$googleUrl = new GoogleUrl('google.com');
$googleUrl->setParam('gws_rd', 'cr');

Google parser stopped working.

Hello, yesterday I noticed that the parser is broken. I suspect this came with a Google update. Can anyone check it?

Issue with description field in news

When trying to access $result->cards[0]->description, it raises ErrorException in InTheNews.php line 60: Trying to get property of non-object.

This seems to be because the description doesn't exist for the news item in the search result. If it doesn't exist, it should return blank instead.

Natural results parsing fails

I have problems parsing natural results.

$results = $response->getNaturalResults();
foreach ($results as $result) {
    // parse results
    echo $result->title;
}

This is my code.
The query is OK, the proxy works, and the other parsers are working (AdWords, number of results).

Another user reported a small change in
#56

Maybe this is related.
I use version 0.1.4.

Crawler Issue

@gsouf
The Google search format for regular results seems to have changed.
The crawler is not fetching regular results; only News and Local results are coming.
Please look into this.
I have attached a sample working HTML file (crawled yesterday) for you to compare with the current search.
Old file which works with the crawler: https://mondovo-serp.s3.amazonaws.com/2017-04-20/1026010-1.html

Now I am getting both types of results.

The actual difference I saw is this tag
<div class="rc" data-hveid="120" data-ved="0ahUKEwjB_Omp-7TTAhXEbSYKHacpDvoQFQh4KAEwBA">
from the old HTML,
which now comes as
<div data-hveid="120" data-ved="0ahUKEwjB_Omp-7TTAhXEbSYKHacpDvoQFQh4KAEwBA">
<div class="rc">
in the new HTML.
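
One way to be robust to this kind of reshuffle is to select on the class token alone, wherever the element sits in the tree. The sketch below, using only PHP's DOM extension, matches `div` elements whose class list contains `rc` in both markup variants; `countResultBlocks` and the sample fragments are illustrative assumptions, not the selectors the library's parser uses.

```php
<?php
/**
 * Hypothetical helper: count <div> elements whose class list contains
 * the token "rc", regardless of nesting or other attributes.
 */
function countResultBlocks(string $html): int
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // silence warnings from fragment input

    // Standard XPath 1.0 idiom for "class attribute contains this exact token".
    $expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' rc ')]";

    return (new DOMXPath($dom))->query($expr)->length;
}

// Simplified versions of the old and new markup described above.
$old = '<div class="rc" data-hveid="120"><h3>Title</h3></div>';
$new = '<div data-hveid="120"><div class="rc"><h3>Title</h3></div></div>';
```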

throw Exception if natural results is not found

The natural results are the essential part of the result page.
I think it would be good to run a basic check on each request to verify that the natural results are found.

If Google changes the HTML and the organic results are no longer found, an exception is thrown.

Add caching feature

Most of the time a search is conducted using the same keyword and the same (Google) search domain. In this situation a cache should be available so that results can be obtained instantly without requesting Google in real time. The cached data can be deleted after a 24-hour timeframe, and a new cache built after that.
This feature will make the spider more efficient, and there will be less chance of a captcha or a ban from the search engine. The local file system can be used for this purpose.
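
A minimal sketch of such a cache, assuming one file per keyword and domain pair and a 24-hour lifetime based on the file's modification time. `FileSerpCache` is a hypothetical class, not an existing part of the library, and the cache directory must already exist.

```php
<?php
/**
 * Hypothetical file-based cache for raw result pages.
 */
final class FileSerpCache
{
    private string $dir;
    private int $ttl;

    public function __construct(string $dir, int $ttlSeconds = 86400)
    {
        $this->dir = rtrim($dir, '/');
        $this->ttl = $ttlSeconds;
    }

    private function path(string $keyword, string $domain): string
    {
        // Hash the pair so any keyword is safe to use as a file name.
        return $this->dir . '/' . sha1($domain . '|' . $keyword) . '.html';
    }

    /** Return the cached page, or null when missing or older than the TTL. */
    public function get(string $keyword, string $domain, ?int $now = null): ?string
    {
        $file = $this->path($keyword, $domain);
        clearstatcache(true, $file);
        if (!is_file($file)) {
            return null;
        }
        $now = $now ?? time();
        if ($now - filemtime($file) >= $this->ttl) {
            unlink($file);   // expired: drop it so the next fetch rebuilds it
            return null;
        }
        $contents = file_get_contents($file);
        return $contents === false ? null : $contents;
    }

    public function set(string $keyword, string $domain, string $html): void
    {
        file_put_contents($this->path($keyword, $domain), $html, LOCK_EX);
    }
}
```

The optional `$now` argument exists only so that expiry can be exercised without waiting a day.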

Create the new GoogleError for page when IP is banned

Response code: 403
Page content:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
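
A page like the one above could be told apart from a captcha page with a simple check on the body. The sketch below is an assumption-laden illustration: `classifyErrorPage` is hypothetical, the captcha marker `name="captcha"` is a guess at what a captcha form contains, and the block marker is the phrase "sending automated queries" taken from the page content shown above.

```php
<?php
/**
 * Hypothetical classifier for a response.
 * Returns 'ok', 'captcha', 'blocked' or 'error'.
 */
function classifyErrorPage(int $status, string $body): string
{
    if ($status === 200 || $status === 404) {
        return 'ok';
    }
    // Assumed marker for a captcha form; adjust to the real page.
    if (stripos($body, 'name="captcha"') !== false) {
        return 'captcha';
    }
    // Phrase from the block page quoted above.
    if (stripos($body, 'sending automated queries') !== false) {
        return 'blocked';
    }
    return 'error';
}
```

A caller could map 'blocked' to its own exception type and, for example, retire the proxy instead of trying to solve a captcha that was never shown.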

Shopping ads parsing issue

@gsouf Recently a parsing issue has appeared on pages that contain shopping ads.
Try any term like "buy iphone" or "buy watches" in Google: there will be a shopping tab at the top with 12 or 16 ads. But the crawler tries to parse one more (some other div may be matching the ad criteria), so the ad count comes out as 13 or 17 in the crawler, and it throws an error because the url parameter is missing for the additional item.
