nikolait / googlescraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.

Home Page: https://scrapeulous.com/

License: Apache License 2.0

Languages: Python 10.20%, HTML 89.76%, JavaScript 0.01%, Shell 0.01%, Dockerfile 0.03%
Topics: crawler, python, scraping, search-engine, search-engine-optimization, search-engines

googlescraper's People

Contributors

bennycah, dr3x, goloveychuk, hiringbees, jamesspittal, jffifa, marklr, nikolait, sisteamnik, wang-ye


googlescraper's Issues

Error installing from pip: FileNotFoundError: 'README.md'

Clean python3.4 virtualenv installation.

When I run

pip install GoogleScraper

I get:

(python3.4)chefarov@debian:~$ pip install GoogleScraper
Downloading/unpacking GoogleScraper
  Downloading GoogleScraper-0.1.0.tar.gz (48kB): 48kB downloaded
  Running setup.py (path:/opt/envs/python3.4/build/GoogleScraper/setup.py) egg_info for package GoogleScraper
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/opt/envs/python3.4/build/GoogleScraper/setup.py", line 15, in <module>
        long_description=open('README.md').read(),
    FileNotFoundError: [Errno 2] No such file or directory: 'README.md'
  Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/opt/envs/python3.4/build/GoogleScraper/setup.py", line 15, in <module>
        long_description=open('README.md').read(),
    FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

I am using python3-pip

(python3.4)chefarov@debian:~$ pip --version
pip 1.5.6 from /opt/envs/python3.4/lib/python3.4/site-packages (python 3.4)

(python3.4)chefarov@debian:~$ which pip
 /opt/envs/python3.4/bin/pip
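
A common fix on the packaging side (a minimal sketch, not the project's actual setup.py; only the package name and version are taken from the log above) is to guard the long_description read so a missing README.md cannot break installation, and to ship README.md in the sdist via MANIFEST.in:

import os
from setuptools import setup

here = os.path.dirname(os.path.abspath(__file__))
readme_path = os.path.join(here, 'README.md')

# Read README.md only if it actually made it into the source distribution.
long_description = ''
if os.path.exists(readme_path):
    with open(readme_path, encoding='utf-8') as f:
        long_description = f.read()

setup(
    name='GoogleScraper',
    version='0.1.0',  # version taken from the pip log above
    long_description=long_description,
    # ... remaining arguments as in the real setup.py ...
)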

Can't use GoogleScraper because of an ImportError

Hello NikolaiT, thanks for giving us such a great tool 👍

But when I try this repo, I have an issue to report.

  1. uname -a
    Linux vagrant-centos65.vagrantup.com 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

  2. pip install -r requirements.txt
    Requirement already satisfied (use --upgrade to upgrade): lxml in /root/.venv/lib/python3.4/site-packages (from -r requirements.txt (line 1))
    Requirement already satisfied (use --upgrade to upgrade): selenium in /root/.venv/lib/python3.4/site-packages (from -r requirements.txt (line 2))
    Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in /root/.venv/lib/python3.4/site-packages (from -r requirements.txt (line 3))
    Requirement already satisfied (use --upgrade to upgrade): cssselect in /root/.venv/lib/python3.4/site-packages (from -r requirements.txt (line 4))
    Requirement already satisfied (use --upgrade to upgrade): requests in /root/.venv/lib/python3.4/site-packages (from -r requirements.txt (line 5))
    Cleaning up...

  3. python run.py
    Traceback (most recent call last):
      File "run.py", line 8, in <module>
        from GoogleScraper.core import main
      File "/root/.venv/app/GoogleScraper/GoogleScraper/core.py", line 30, in <module>
        from GoogleScraper.results import maybe_create_db
    ImportError: No module named 'GoogleScraper.results'

So:

It responds with ImportError: No module named 'GoogleScraper.results', and the GoogleScraper.results module file does not exist. Can you help me?

Image search not working

Doesn't seem to do an image search when I use image for the --search-type argument.

GoogleScraper --keyword "apple" --search-engines "bing" --search-type image --scrapemethod selenium -f csv

I only get normal web search results, not images.

Programmatically specified config is overwritten by default command line args

So if I call GoogleScraper programmatically, using scrape_with_config(config), it first updates the global Config, then calls main(). But then in the first line of main(), parse_cmd_args() is called, and updates Config again with command line args.

The problem is that, even if you don't specify any command line arg, the default values will still be populated into the result dictionary returned by get_command_line(), which will then be used to update Config again.

Additionally, I think that even if you do specify command line args, the config passed to scrape_with_config() should take precedence. In other words, the solution to this problem is simply to call parse_cmd_args() before updating Config with the config supplied to scrape_with_config().
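
A rough sketch of that suggested ordering, with stand-ins for the real parse_cmd_args(), Config and main() (the actual objects live in GoogleScraper's core and config handling and may look different):

# Stand-in for the global Config that GoogleScraper keeps updating.
Config = {'SCRAPING': {'search_engines': 'google'}}

def parse_cmd_args():
    # Stand-in: pretend argparse produced only default values.
    Config['SCRAPING'].update({'search_engines': 'google'})

def scrape_with_config(config):
    parse_cmd_args()  # command line defaults are applied first ...
    for section, values in config.items():
        # ... and the programmatically supplied config is applied last, so it wins.
        Config.setdefault(section, {}).update(values)
    # main() would run here, without calling parse_cmd_args() again

scrape_with_config({'SCRAPING': {'search_engines': 'bing'}})
print(Config['SCRAPING']['search_engines'])  # -> bing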

The Google Image search is broken.

The command line image searcher for google is broken.

  1. The --search-type parameter is ignored; I commented on how to fix it:
    87e1cb2#diff-6b40e7a6253722f69f8bd6588796f92cR52
  2. After the fix I get this error:

while running:
GoogleScraper --search-engines "google" -m http --keyword "apple" --search-type image

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 554, in run
    SearchEngineScrape.blocking_search(self, self.search, *args, **kwargs)
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 214, in blocking_search
    callback(*args, **kwargs)
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 543, in search
    self.parser.parse(html)
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/parsing.py", line 89, in parse
    self._parse()
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/parsing.py", line 146, in _parse
    css_to_xpath('{container} {result_container}'.format(**selectors))
TypeError: format() argument after ** must be a mapping, not str

I'm not fluent in python, otherwise I'd try to fix it myself.
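
For context, the TypeError itself is plain Python: format(**x) requires x to be a mapping, so the selector lookup is evidently producing a string at that point. A minimal illustration (the selector values are made up):

selectors = {'container': 'div.srg', 'result_container': 'div.g'}
print('{container} {result_container}'.format(**selectors))  # works: a mapping

selectors = 'de_ip'  # a plain string instead of the selector mapping
try:
    print('{container} {result_container}'.format(**selectors))
except TypeError as e:
    print(e)  # format() argument after ** must be a mapping, not str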

Issue during installation via pip install git+git://github.com/NikolaiT/GoogleScraper/

Downloading/unpacking git+git://github.com/NikolaiT/GoogleScraper/
Cloning git://github.com/NikolaiT/GoogleScraper/ to /tmp/pip-ebq84mt9-build
Running setup.py (path:/tmp/pip-ebq84mt9-build/setup.py) egg_info for package from git+git://github.com/NikolaiT/GoogleScraper/

Downloading/unpacking lxml (from GoogleScraper==0.1.9)
Downloading lxml-3.4.1.tar.gz (3.5MB): 3.5MB downloaded
Running setup.py (path:/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/setup.py) egg_info for package lxml
Building lxml version 3.4.1.
Building without Cython.
ERROR: b'/bin/sh: 1: xslt-config: not found\n'
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
/usr/lib/python3.4/distutils/dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
  warnings.warn(msg)

warning: no previously-included files found matching '*.py'

Downloading/unpacking selenium (from GoogleScraper==0.1.9)
Downloading selenium-2.44.0.tar.gz (2.6MB): 2.6MB downloaded
Running setup.py (path:/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/selenium/setup.py) egg_info for package selenium

Downloading/unpacking cssselect (from GoogleScraper==0.1.9)
Downloading cssselect-0.9.1.tar.gz
Running setup.py (path:/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/cssselect/setup.py) egg_info for package cssselect

no previously-included directories found matching 'docs/_build'

Downloading/unpacking requests (from GoogleScraper==0.1.9)
Downloading requests-2.5.0-py2.py3-none-any.whl (464kB): 464kB downloaded
Downloading/unpacking PyMySql (from GoogleScraper==0.1.9)
Downloading PyMySQL-0.6.3-py2.py3-none-any.whl (63kB): 63kB downloaded
Downloading/unpacking sqlalchemy (from GoogleScraper==0.1.9)
Downloading SQLAlchemy-0.9.8.tar.gz (4.1MB): 4.1MB downloaded
Running setup.py (path:/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/sqlalchemy/setup.py) egg_info for package sqlalchemy

warning: no files found matching '*.jpg' under directory 'doc'
warning: no files found matching 'distribute_setup.py'
warning: no files found matching 'sa2to3.py'
warning: no files found matching 'ez_setup.py'
no previously-included directories found matching 'doc/build/output'

Installing collected packages: lxml, selenium, cssselect, requests, PyMySql, sqlalchemy, GoogleScraper
Running setup.py install for lxml
Building lxml version 3.4.1.
Building without Cython.
ERROR: b'/bin/sh: 1: xslt-config: not found\n'
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
building 'lxml.etree' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -I/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/src/lxml/includes -I/usr/include/python3.4m -I/home/kenju254/workspaces/reverseA/GoogleScraperP/env/include/python3.4m -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-3.4/src/lxml/lxml.etree.o -w
In file included from src/lxml/lxml.etree.c:239:0:
/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory
 #include "libxml/xmlversion.h"
                               ^
compilation terminated.
/usr/lib/python3.4/distutils/dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
  warnings.warn(msg)
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Complete output from command /home/kenju254/workspaces/reverseA/GoogleScraperP/env/bin/python3 -c "import setuptools, tokenize;__file__='/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-aimi250p-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/kenju254/workspaces/reverseA/GoogleScraperP/env/include/site/python3.4:
Building lxml version 3.4.1.

Building without Cython.

ERROR: b'/bin/sh: 1: xslt-config: not found\n'

** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt

running install

running build

running build_py

creating build

creating build/lib.linux-x86_64-3.4

creating build/lib.linux-x86_64-3.4/lxml

copying src/lxml/_elementpath.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/ElementInclude.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/cssselect.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/usedoctest.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/sax.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/pyclasslookup.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/doctestcompare.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/__init__.py -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/builder.py -> build/lib.linux-x86_64-3.4/lxml

creating build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/__init__.py -> build/lib.linux-x86_64-3.4/lxml/includes

creating build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/_diffcommand.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/html5parser.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/defs.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/soupparser.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/clean.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/formfill.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/usedoctest.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/_html5builder.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/diff.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/ElementSoup.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/_setmixin.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/__init__.py -> build/lib.linux-x86_64-3.4/lxml/html

copying src/lxml/html/builder.py -> build/lib.linux-x86_64-3.4/lxml/html

creating build/lib.linux-x86_64-3.4/lxml/isoschematron

copying src/lxml/isoschematron/__init__.py -> build/lib.linux-x86_64-3.4/lxml/isoschematron

copying src/lxml/lxml.etree.h -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/lxml.etree_api.h -> build/lib.linux-x86_64-3.4/lxml

copying src/lxml/includes/xpath.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/xmlparser.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/xmlschema.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/xinclude.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/xslt.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/xmlerror.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/config.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/relaxng.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/uri.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/c14n.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/etreepublic.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/tree.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/dtdvalid.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/htmlparser.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/schematron.pxd -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/lxml-version.h -> build/lib.linux-x86_64-3.4/lxml/includes

copying src/lxml/includes/etree_defs.h -> build/lib.linux-x86_64-3.4/lxml/includes

creating build/lib.linux-x86_64-3.4/lxml/isoschematron/resources

creating build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/rng

copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/rng

creating build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl

copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl

copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl

creating build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.linux-x86_64-3.4/lxml/isoschematron/resources/xsl/iso-schematron-xslt1

running build_ext

building 'lxml.etree' extension

creating build/temp.linux-x86_64-3.4

creating build/temp.linux-x86_64-3.4/src

creating build/temp.linux-x86_64-3.4/src/lxml

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -I/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/src/lxml/includes -I/usr/include/python3.4m -I/home/kenju254/workspaces/reverseA/GoogleScraperP/env/include/python3.4m -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-3.4/src/lxml/lxml.etree.o -w

In file included from src/lxml/lxml.etree.c:239:0:

/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory

#include "libxml/xmlversion.h"

                           ^

compilation terminated.

/usr/lib/python3.4/distutils/dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'

warnings.warn(msg)

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1


Cleaning up...
Command /home/kenju254/workspaces/reverseA/GoogleScraperP/env/bin/python3 -c "import setuptools, tokenize;__file__='/home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-aimi250p-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/kenju254/workspaces/reverseA/GoogleScraperP/env/include/site/python3.4 failed with error code 1 in /home/kenju254/workspaces/reverseA/GoogleScraperP/env/build/lxml
Storing debug log for failure in /home/kenju254/.pip/pip.log

It fails at some point ..

This is inside a Virtual Environment created with Python3
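
The root cause is the missing libxml2/libxslt development headers that the build log complains about; on Debian/Ubuntu systems they usually come from the libxml2-dev, libxslt1-dev and python3-dev packages. Once lxml builds, a quick sanity check might look like this (a minimal sketch):

from lxml import etree

# If this imports and prints, lxml was compiled against the system libxml2/libxslt.
print('lxml   :', etree.LXML_VERSION)
print('libxml2:', etree.LIBXML_VERSION)
print('libxslt:', etree.LIBXSLT_VERSION)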

metaclass conflict

Hi, I'm in the virtualenv, and after completing the installation it fails at this:

GoogleScraper sel --keyword-file examples/kw.txt --search-engine duckduckgo

2014-11-15 08:22:59,610 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2014-11-15 08:22:59,610 INFO sqlalchemy.engine.base.Engine ()
2014-11-15 08:22:59,611 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2014-11-15 08:22:59,611 INFO sqlalchemy.engine.base.Engine ()
2014-11-15 08:22:59,611 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("scraper_search")
2014-11-15 08:22:59,611 INFO sqlalchemy.engine.base.Engine ()
2014-11-15 08:22:59,612 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("serp")
2014-11-15 08:22:59,612 INFO sqlalchemy.engine.base.Engine ()
2014-11-15 08:22:59,612 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("link")
2014-11-15 08:22:59,612 INFO sqlalchemy.engine.base.Engine ()
Traceback (most recent call last):
  File "/root/Desktop/MyPrograms/GoogleScraper/env/bin/GoogleScraper", line 9, in <module>
    load_entry_point('GoogleScraper==0.1.4', 'console_scripts', 'GoogleScraper')()
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/pkg_resources.py", line 356, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/pkg_resources.py", line 2431, in load_entry_point
    return ep.load()
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/pkg_resources.py", line 2147, in load
    ['name'])
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/GoogleScraper/__init__.py", line 20, in <module>
    from GoogleScraper.core import scrape_with_config
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/GoogleScraper/core.py", line 13, in <module>
    from GoogleScraper.scraping import SelScrape, HttpScrape
  File "/root/Desktop/MyPrograms/GoogleScraper/env/lib/python3.2/site-packages/GoogleScraper/scraping.py", line 236, in <module>
    class HttpScrape(SearchEngineScrape, threading.Timer):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

I'm using a fresh install of Kali 1.09a with all dependencies.
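
For context, this class of error can be reproduced without GoogleScraper whenever two base classes have unrelated metaclasses (note that the traceback paths above point at a python3.2 environment). A minimal illustration, not the project's actual class hierarchy:

import abc

class MetaA(type):
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=abc.ABCMeta):
    pass

try:
    class C(A, B):  # neither MetaA nor ABCMeta is a subclass of the other
        pass
except TypeError as e:
    print(e)  # metaclass conflict: the metaclass of a derived class must be ...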

A filetype marker in search results makes the title and link not found

For example, the search keyword is android filetype:pdf.
The results could not retrieve the title and link.
Cause: the results CSS selector in parsing.py is not correct:

'results': (['li.g', 'h3.r > a:first-child', 'div.s span.st'], ),

It should be 'results': (['li.g', 'h3.r > a', 'div.s span.st'], ),

When a filetype marker exists, the a element is not the first child of the h3.r element.

Here is an example:

<h3 class="r">
    <span class="_ogd b w xsm">[PDF]</span>
    <a href="http://images.comparecellular.com/phones/1562/Samsung-Galaxy-Nexus-User-Guide.pdf" onmousedown="return rwt(this,'','','','1','AFQjCNHJEetsVIhkFu3tl9G6aJANVeyw5g','','0CB4QFjAA','','',event)" target="_blank">Galaxy Nexus User Guide - Google Help</a>
</h3>
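
The difference is easy to demonstrate with lxml and cssselect, using the markup above (attributes trimmed):

from lxml import html

snippet = '''
<div>
  <h3 class="r">
    <span class="_ogd b w xsm">[PDF]</span>
    <a href="http://images.comparecellular.com/phones/1562/Samsung-Galaxy-Nexus-User-Guide.pdf">Galaxy Nexus User Guide - Google Help</a>
  </h3>
</div>
'''
doc = html.fromstring(snippet)

print(doc.cssselect('h3.r > a:first-child'))  # [] -- the [PDF] span is the first child
print(doc.cssselect('h3.r > a'))              # [<Element a at ...>] -- still matches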

Discussion of Google behavior when scraping

How often can one make requests from a specific IP? Does altering the User-Agent help to make more requests? Can multiple IP addresses request different pages for one keyword in independent search sessions...?

Logger error output is not pretty

C:\Python33\python.exe D:/workfiles/PythonScript/GoogleScraper-master/usage.py
2015-01-05 17:27:15,699 - GoogleScraper - INFO - 0 cache files found in .scrapecache/
2015-01-05 17:27:15,699 - GoogleScraper - INFO - 0/5 keywords have been cached and are ready to get parsed. 5 remain to get scraped.
2015-01-05 17:27:15,749 - GoogleScraper - INFO - Going to scrape 5 keywords with 1 proxies by using 3 threads.
2015-01-05 17:27:15,816 - GoogleScraper - INFO - [+] HttpScrape[localhost][search-type:normal] created using the search engine google. Number of keywords to scrape=1, using proxy=None, number of pages per keyword=2
2015-01-05 17:27:15,816 - GoogleScraper - INFO - [+] HttpScrape[localhost][search-type:normal] created using the search engine google. Number of keywords to scrape=2, using proxy=None, number of pages per keyword=2
2015-01-05 17:27:15,816 - GoogleScraper - INFO - [+] HttpScrape[localhost][search-type:normal] created using the search engine google. Number of keywords to scrape=2, using proxy=None, number of pages per keyword=2
2015-01-05 17:27:23,844 - GoogleScraper - ERROR - Connection timeout (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042907B8>, 'Connection to www.google.com timed out. (connect timeout=5)')
2015-01-05 17:27:23,844 - GoogleScraper - ERROR - Connection timeout (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x000000000429B8D0>, 'Connection to www.google.com timed out. (connect timeout=5)')
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 136, in connect
    timeout=self.timeout,
  File "C:\Python33\lib\socket.py", line 435, in create_connection
    raise err
  File "C:\Python33\lib\socket.py", line 426, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 330, in send
    timeout=timeout
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 480, in urlopen
    body=body, headers=headers)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 285, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Python33\lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\lib\http\client.py", line 1099, in _send_request
    self.endheaders(body)
  File "C:\Python33\lib\http\client.py", line 1057, in endheaders
    self._send_output(message_body)
  File "C:\Python33\lib\http\client.py", line 902, in _send_output
    self.send(msg)
  File "C:\Python33\lib\http\client.py", line 840, in send
    self.connect()
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 141, in connect
    (self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x000000000429B8D0>, 'Connection to www.google.com timed out. (connect timeout=5)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\threading.py", line 637, in _bootstrap_inner
    self.run()
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 609, in run
    SearchEngineScrape.blocking_search(self, self.search, *args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 228, in blocking_search
    callback(*args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 595, in search
    raise te
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 586, in search
    params=self.search_params, timeout=5)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 55, in get
    return request('get', url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 387, in send
    raise Timeout(e)
requests.exceptions.Timeout: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x000000000429B8D0>, 'Connection to www.google.com timed out. (connect timeout=5)')

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 136, in connect
    timeout=self.timeout,
  File "C:\Python33\lib\socket.py", line 435, in create_connection
    raise err
  File "C:\Python33\lib\socket.py", line 426, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 330, in send
    timeout=timeout
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 480, in urlopen
    body=body, headers=headers)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 285, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Python33\lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\lib\http\client.py", line 1099, in _send_request
    self.endheaders(body)
  File "C:\Python33\lib\http\client.py", line 1057, in endheaders
    self._send_output(message_body)
  File "C:\Python33\lib\http\client.py", line 902, in _send_output
    self.send(msg)
  File "C:\Python33\lib\http\client.py", line 840, in send
    self.connect()
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 141, in connect
    (self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042907B8>, 'Connection to www.google.com timed out. (connect timeout=5)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\threading.py", line 637, in _bootstrap_inner
    self.run()
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 609, in run
    SearchEngineScrape.blocking_search(self, self.search, *args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 228, in blocking_search
    callback(*args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 595, in search
    raise te
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 586, in search
    params=self.search_params, timeout=5)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 55, in get
    return request('get', url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 387, in send
    raise Timeout(e)
requests.exceptions.Timeout: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042907B8>, 'Connection to www.google.com timed out. (connect timeout=5)')

2015-01-05 17:27:24,821 - GoogleScraper - ERROR - Connection timeout (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042A8BA8>, 'Connection to www.google.com timed out. (connect timeout=5)')
Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 136, in connect
    timeout=self.timeout,
  File "C:\Python33\lib\socket.py", line 435, in create_connection
    raise err
  File "C:\Python33\lib\socket.py", line 426, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 330, in send
    timeout=timeout
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 480, in urlopen
    body=body, headers=headers)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connectionpool.py", line 285, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Python33\lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\lib\http\client.py", line 1099, in _send_request
    self.endheaders(body)
  File "C:\Python33\lib\http\client.py", line 1057, in endheaders
    self._send_output(message_body)
  File "C:\Python33\lib\http\client.py", line 902, in _send_output
    self.send(msg)
  File "C:\Python33\lib\http\client.py", line 840, in send
    self.connect()
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\packages\urllib3\connection.py", line 141, in connect
    (self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042A8BA8>, 'Connection to www.google.com timed out. (connect timeout=5)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python33\lib\threading.py", line 637, in _bootstrap_inner
    self.run()
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 609, in run
    SearchEngineScrape.blocking_search(self, self.search, *args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 228, in blocking_search
    callback(*args, **kwargs)
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 595, in search
    raise te
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\scraping.py", line 586, in search
    params=self.search_params, timeout=5)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 55, in get
    return request('get', url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python33\lib\site-packages\requests-2.2.1-py3.3.egg\requests\adapters.py", line 387, in send
    raise Timeout(e)
requests.exceptions.Timeout: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000042A8BA8>, 'Connection to www.google.com timed out. (connect timeout=5)')


I used GoogleScraper to scrape 5 keywords without a proxy, so all 5 keywords failed. But the error information is not suitable for reading, and the program gets stuck.

In this code, super().after_search(request.text) will not run:

        except self.requests.ConnectionError as ce:
            logger.error('Network problem occurred {}'.format(ce))
            raise ce
        except self.requests.Timeout as te:
            logger.error('Connection timeout {}'.format(te))
            raise te

        if not request.ok:
            logger.error('HTTP Error: {}'.format(request.status_code))
            self.handle_request_denied(request.status_code)
            return False

        # !!!! if an exception was raised above or False was returned here,
        # the line below is never reached:
        super().after_search(request.text)

So this code never runs:

    def after_search(self, html):
        """Store the results and parse em.

        Notify the progress queue if necessary.

        Args:
            html: The scraped html.
        """
        self.parser.parse(html)
        self.store()
        if self.progress_queue:
            self.progress_queue.put(1)
        self.cache_results()
        self.search_number += 1
As a result, progress_thread.join() never returns and the program never exits.
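
One way to avoid the hang, sketched as a self-contained function rather than GoogleScraper's actual method, is to notify the progress queue in a finally block so every attempted keyword is reported, success or failure:

import logging
import queue

import requests

logger = logging.getLogger(__name__)


def search_once(url, params, progress_queue):
    """Issue one search request and always report progress, success or failure."""
    try:
        response = requests.get(url, params=params, timeout=5)
        if not response.ok:
            logger.error('HTTP Error: %s', response.status_code)
            return None
        return response.text
    except (requests.ConnectionError, requests.Timeout) as e:
        logger.error('Network problem occurred: %s', e)
        return None
    finally:
        # Every code path feeds the queue, so a consumer waiting on it
        # (and ultimately progress_thread.join()) is never left hanging.
        progress_queue.put(1)


progress = queue.Queue()
search_once('https://www.google.com/search', {'q': 'apple'}, progress)
print(progress.qsize())  # 1, whether or not the request succeeded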

If one browser instance errors, all scraping stops

If one browser instance has an error (times out, cannot open), that instance will never run again.

If all instances hit a captcha, all scraping stops.

We need the program to keep scraping; if some keywords cause errors, they should be ignored.

A new idea for better scraping

I use GoogleScraper to scrape 10000 keywords, but there are always some failures. I think this is unavoidable; maybe they just need to be tried again.

So I suggest that GoogleScraper store the keywords whose scrape failed. When the scrape job is done, if there are any failed keywords, scrape them again, until the entire scrape is complete.

Looking forward to your response.
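
A rough sketch of that retry loop (scrape here is a hypothetical callable that takes a list of keywords and returns the subset that failed; it is not GoogleScraper's real API):

def scrape_until_done(scrape, keywords, max_rounds=3):
    """Re-run failed keywords until none remain or max_rounds is exhausted."""
    remaining = list(keywords)
    for _ in range(max_rounds):
        if not remaining:
            break
        remaining = scrape(remaining)
    return remaining  # keywords that still failed after all rounds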

Issue #33: which selector_class to use should be decided

Regarding issue #33:

selector_dict = {
    'results': {
        'us_ip': {
            'container': '#b_results',
            'result_container': '.b_algo',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > .b_attribution > p::text',
            'title': 'h2::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#b_results',
            'result_container': '.b_algo',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > p::text',
            'title': 'h2::text',
            'visible_link': 'cite::text'
        }
    },
    'ads_main': {
        'us_ip': {
            'container': '#b_results .b_ad',
            'result_container': '.sb_add',
            'link': 'h2 > a::attr(href)',
            'snippet': '.sb_addesc::text',
            'title': 'h2 > a::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#b_results .b_ad',
            'result_container': '.sb_add',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > p::text',
            'title': 'h2 > a::text',
            'visible_link': 'cite::text'
        }
    }
}
for result_type, selector_class in selector_dict.items():
    for selector_specific, selectors in selector_class.items():

Because results has two entries, us_ip and de_ip, the SERP result ends up doubled: 10 items use 'snippet': '.b_caption > p::text' and 10 items use 'snippet': '.sb_addesc::text'. But in China, '.sb_addesc::text' yields no snippet. The result is:

{'ads_main': [{'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': None,
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': None,
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://3298057.r.msn.com/?ld=d3OD_VLpTKI-kMJ0dF_CITFjVUCUxN0FalWnZMSIt3e7v5R19iISY77aojSiFe6ICKt_Glrsm9zefr15xonxtlypbOSfJY40JpRjtqLly5PaXmtjTwPO6DXFoVcJ-f0Vxl_fWD8pHGWWYwGRaiOxFoPF2_8AU&u=ipasshortcut.com%2f%3fid%3d5999%26tid%3dpro',
               'snippet': None,
               'title': 'Direct Sales Marketing | breakthroughmastermind.com',
               'visible_link': 'http://breakthroughmastermind.com'},
              {'link': 'http://2482071.r.msn.com/?ld=d3FaCavjiziXjBQh-qAZr74zVUCUzVXXesnDZKVdzwgfz3UKNj8WBli3-mU_uKnqAfyu2GPpArwAvi3NkoBAmE0U5pjoej_X8YS9efyHgNzo4KvJH1c-YGf3xzSoD-JiCkfYWYxU1Dv6Y1PYCZLVk-vQjJPp4&u=list.qoo10.sg%2fgmkt.inc%2fCategory%2fGroup.aspx%3fg%3d10%26jaehuid%3d2000149996',
               'snippet': None,
               'title': 'Best e-Ticket Deals | Qoo10.sg',
               'visible_link': 'www.Qoo10.sg'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': None,
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': None,
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': 'Over 60 Million Visitors.',
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': 'Bring 1,000 premium targeted visitors to your website for $7.95',
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://3298057.r.msn.com/?ld=d3OD_VLpTKI-kMJ0dF_CITFjVUCUxN0FalWnZMSIt3e7v5R19iISY77aojSiFe6ICKt_Glrsm9zefr15xonxtlypbOSfJY40JpRjtqLly5PaXmtjTwPO6DXFoVcJ-f0Vxl_fWD8pHGWWYwGRaiOxFoPF2_8AU&u=ipasshortcut.com%2f%3fid%3d5999%26tid%3dpro',
               'snippet': 'Discover How To Make Your First $3,000 A Month With This Proven System!',
               'title': 'Direct Sales Marketing | breakthroughmastermind.com',
               'visible_link': 'http://breakthroughmastermind.com'},
              {'link': 'http://2482071.r.msn.com/?ld=d3FaCavjiziXjBQh-qAZr74zVUCUzVXXesnDZKVdzwgfz3UKNj8WBli3-mU_uKnqAfyu2GPpArwAvi3NkoBAmE0U5pjoej_X8YS9efyHgNzo4KvJH1c-YGf3xzSoD-JiCkfYWYxU1Dv6Y1PYCZLVk-vQjJPp4&u=list.qoo10.sg%2fgmkt.inc%2fCategory%2fGroup.aspx%3fg%3d10%26jaehuid%3d2000149996',
               'snippet': 'USS, SEA Aquarium, Batam & a lot more awesome deals!',
               'title': 'Best e-Ticket Deals | Qoo10.sg',
               'visible_link': 'www.Qoo10.sg'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': 'Over 60 Million Visitors.',
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': 'Bring 1,000 premium targeted visitors to your website for $7.95',
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'}],
 'num_results': '',
 'results': [{'link': 'http://best-seo-tools.net/',
              'snippet': None,
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},
             {'link': 'http://www.best-5.com/seo-tools/',
              'snippet': None,
              'title': '2014 Best SEO Tools | Best 5 SEO Tool Reviews',
              'visible_link': 'www.best-5.com/seo-tools'},
             {'link': 'http://seo-tools-review.toptenreviews.com/',
              'snippet': None,
              'title': 'SEO Tools Review 2014 | Best SEO Keyword Tools',
              'visible_link': 'seo-tools-review.toptenreviews.com'},
             {'link': 'http://www.bestseotools.net/',
              'snippet': None,
              'title': 'www.bestseotools.net',
              'visible_link': 'www.bestseotools.net'},
             {'link': 'http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html',
              'snippet': None,
              'title': 'Best SEO Tools - SEO & Inbound Marketing Blog …',
              'visible_link': 'www.iblogzone.com/2012/02/best-seo-tools-for-2012.html'},
             {'link': 'http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/',
              'snippet': None,
              'title': 'The Best SEO Tools: What, How, and Why - …',
              'visible_link': 'www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842'},
             {'link': 'http://moz.com/blog/100-free-seo-tools',
              'snippet': None,
              'title': '100 Free SEO Tools & Resources for Every …',
              'visible_link': 'moz.com/blog/100-free-seo-tools'},
             {'link': 'http://bestseotools.com/',
              'snippet': None,
              'title': 'Best SEO Tools of 2014',
              'visible_link': 'bestseotools.com'},
             {'link': 'http://www.socialseo.com/the-top-15-free-seo-tools.html',
              'snippet': None,
              'title': 'The Best 15 Free SEO Tools Online - Top SEO …',
              'visible_link': 'www.socialseo.com/the-top-15-free-seo-tools.html'},
             {'link': 'http://www.link-assistant.com/',
              'snippet': None,
              'title': 'Link-Assistant.Com - Official Site',
              'visible_link': 'www.link-assistant.com'},
             {'link': 'http://best-seo-tools.net/',
              'snippet': 'SEO Company : Spider view This tool ... Site Ranking. Website Cloaking Check This tool lets you check a list of urls for googlebot cheaters : SEO Company ...',
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},
             {'link': 'http://www.best-5.com/seo-tools/',
              'snippet': 'Looking for SEO tools? Our reviews of the best Search Engine Optimization Tools will help you choose the program that is best for you. Make the right choice',
              'title': '2014 Best SEO Tools | Best 5 SEO Tool Reviews',
              'visible_link': 'www.best-5.com/seo-tools'},
             {'link': 'http://seo-tools-review.toptenreviews.com/',
              'snippet': 'Looking for the best SEO tools? Read expert reviews and compare features of the best, cheapest and sometimes free SEO tools.',
              'title': 'SEO Tools Review 2014 | Best SEO Keyword Tools',
              'visible_link': 'seo-tools-review.toptenreviews.com'},
             {'link': 'http://www.bestseotools.net/',
              'snippet': 'Dominio registrato con Totalhosting.it Potresti essere interessato anche a : Power by ; Copyright 2014 Phonia Srl-P.I.02050680442-All Rights Reserved',
              'title': 'www.bestseotools.net',
              'visible_link': 'www.bestseotools.net'},
             {'link': 'http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html',
              'snippet': 'SEO Tools are designed to help make our SEO efforts a bit easier and less tedious. While there are many out there, here are some SEO tools to get you started.',
              'title': 'Best SEO Tools - SEO & Inbound Marketing Blog …',
              'visible_link': 'www.iblogzone.com/2012/02/best-seo-tools-for-2012.html'},
             {'link': 'http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/',
              'snippet': "Power-charge your SEO with the industry's finest SEO tools. Rankings, backlinks, competitors, reports, analytics - you name it - all in one place.",
              'title': 'The Best SEO Tools: What, How, and Why - …',
              'visible_link': 'www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842'},
             {'link': 'http://moz.com/blog/100-free-seo-tools',
              'snippet': 'At Moz, we love using premium SEO Tools. Paid tools are essential when you need advanced features, increased limits, historical features, or professional support. For ...',
              'title': '100 Free SEO Tools & Resources for Every …',
              'visible_link': 'moz.com/blog/100-free-seo-tools'},
             {'link': 'http://bestseotools.com/',
              'snippet': 'Comprehensive List of the Best SEO Tools of 2014 - Updated Monthly. Find the Best and Top Rated SEO Tools',
              'title': 'Best SEO Tools of 2014',
              'visible_link': 'bestseotools.com'},
             {'link': 'http://www.socialseo.com/the-top-15-free-seo-tools.html',
              'snippet': 'The Top 15 Free SEO Tools Posted September 13th, 2007 by Brian Gilley. We are building out a more comprehensive list of SEO and social media tools that you might …',
              'title': 'The Best 15 Free SEO Tools Online - Top SEO …',
              'visible_link': 'www.socialseo.com/the-top-15-free-seo-tools.html'},
             {'link': 'http://www.link-assistant.com/',
              'snippet': 'Get all SEO tools in one pack - download free edition of SEO PowerSuite and get top 10 rankings for your site on Google and other search engines!',
              'title': 'Link-Assistant.Com - Official Site',
              'visible_link': 'www.link-assistant.com'}]}

--- 10 results with no snippet ---
http://best-seo-tools.net/
http://www.best-5.com/seo-tools/
http://seo-tools-review.toptenreviews.com/
http://www.bestseotools.net/
http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html
http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/
http://moz.com/blog/100-free-seo-tools
http://bestseotools.com/
http://www.socialseo.com/the-top-15-free-seo-tools.html
http://www.link-assistant.com/
--- the same 10 results repeated, this time with snippets ---
http://best-seo-tools.net/
http://www.best-5.com/seo-tools/
http://seo-tools-review.toptenreviews.com/
http://www.bestseotools.net/
http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html
http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/
http://moz.com/blog/100-free-seo-tools
http://bestseotools.com/
http://www.socialseo.com/the-top-15-free-seo-tools.html
http://www.link-assistant.com/
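
One way to avoid the duplication would be to parse with exactly one selector variant per result type instead of looping over both us_ip and de_ip; a rough sketch, assuming the selector_dict quoted above and a hypothetical preferred setting:

def pick_selectors(selector_dict, preferred='de_ip'):
    """Return one selector variant per result type instead of all of them."""
    chosen = {}
    for result_type, variants in selector_dict.items():
        # Fall back to any available variant if the preferred one is missing.
        chosen[result_type] = variants.get(preferred, next(iter(variants.values())))
    return chosen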

JSON output doesn't work

I tried the JSON output and I got this error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 554, in run
    SearchEngineScrape.blocking_search(self, self.search, *args, **kwargs)
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 214, in blocking_search
    callback(*args, **kwargs)
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 544, in search
    self.store()
  File "/home/vlad/gscrape_test/lib/env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 267, in store
    json.dump(obj, self.json_outfile, indent=2, sort_keys=True)
  File "/usr/lib/python3.4/json/__init__.py", line 178, in dump
    for chunk in iterable:
  File "/usr/lib/python3.4/json/encoder.py", line 422, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.4/json/encoder.py", line 429, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: datetime.datetime(2014, 11, 26, 15, 31, 54, 300949) is not JSON serializable
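
A common workaround for this class of error (not necessarily how it was fixed in GoogleScraper) is to pass a default callback to json.dump that turns datetime objects into ISO strings:

import datetime
import json


def json_default(obj):
    # Serialize datetimes as ISO 8601 strings; let everything else fail loudly.
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError('{!r} is not JSON serializable'.format(obj))


record = {'requested_at': datetime.datetime(2014, 11, 26, 15, 31, 54)}
print(json.dumps(record, indent=2, sort_keys=True, default=json_default))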

Each search returns 20 results

I find that one search returns 20 results: 10 have no snippet, and 10 are the same results but with a snippet.

parsing.py#141 selector_specific ?

The result has 10 us_ip results and 10 de_ip results!

normal_search_selectors = {
        'results': {
            'us_ip': {
                'container': '#b_results',
                'result_container': '.b_algo',
                'link': 'h2 > a::attr(href)',
                'snippet': '.b_caption > .b_attribution > p::text',
                'title': 'h2::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#b_results',
                'result_container': '.b_algo',
                'link': 'h2 > a::attr(href)',
                'snippet': '.b_caption > p::text',
                'title': 'h2::text',
                'visible_link': 'cite::text'
            }
        },
        'ads_main': {
            'us_ip': {
                'container': '#b_results .b_ad',
                'result_container': '.sb_add',
                'link': 'h2 > a::attr(href)',
                'snippet': '.sb_addesc::text',
                'title': 'h2 > a::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#b_results .b_ad',
                'result_container': '.sb_add',
                'link': 'h2 > a::attr(href)',
                'snippet': '.b_caption > p::text',
                'title': 'h2 > a::text',
                'visible_link': 'cite::text'
            }
        }
    }

base search url does not produce different output

Hello.

Setting (e.g. Israeli Google)
--base-search-url http://google.co.il/ncr
seems to be ignored. What I mean is that:

GoogleScraper http -q apple -n 10 -p 1 -s stdout

produces same output as

GoogleScraper http -q apple -n 10 -p 1 -s stdout  --base-search-url http://google.co.il/ncr

In the 1st case I was expecting to see (among the results)
Link: http://en.wikipedia.org/wiki/Apple
Whereas in the 2nd case:
Link: http://he.wikipedia.org/wiki/Apple

But what I get in >>both<< cases is:
Link: http://el.wikipedia.org/wiki/Apple

which is wikipedia in greek (where my IP resides).

error on firefox web driver

Every time I import GoogleScraper (in both my own script and use.py) I get this error from the selenium Firefox web driver:

Any thoughts?

File "use.py", line 1, in <module>
    import GoogleScraper
  File "/crawler/GoogleScraper/GoogleScraper.py", line 61, in <module>
    from selenium import webdriver
  File "/System/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/__init__.py", line 17, in <module>
    from .firefox.webdriver import WebDriver as Firefox
  File "/System/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/firefox/webdriver.py", line 27, in <module>
    from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
  File "/System/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/firefox/firefox_profile.py", line 343
    except (IOError, KeyError), e
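
For reference, except (IOError, KeyError), e is Python 2 exception syntax and is a SyntaxError under Python 3, which suggests a Python 2 build of selenium ended up in the Python 3.4 site-packages. The Python 3 spelling would be:

try:
    open('/nonexistent')
except (IOError, KeyError) as e:  # Python 3 syntax
    print(e)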

usage.py#17

usage.py#17

'search_engine': 'yandex',

should be

'search_engines': 'yandex',

parsing.py 746

l = Link(
                        link=link['link'],
                        snippet=link['snippet'],
                        title=link['title'],
                        visible_link=link['visible_link'],
                        domain=parsed.netloc,
                        rank=rank,
                        serp=serp
                    )

should have link_type added:

l = Link(
                        link=link['link'],
                        snippet=link['snippet'],
                        title=link['title'],
                        visible_link=link['visible_link'],
                        domain=parsed.netloc,
                        rank=rank,
                        serp=serp,
                        link_type=key  # add this
                    )

Splitting keywords evenly among Selenium instances

I have a concern about the grouper function in utils.py (which was probably lifted from Python's itertools recipes). Say you set the number of Selenium instances to scrape with to 10 (num_workers in config.cfg) and you give the scraper a set of 15 keywords - independently of what you set maximum_workers to in config.cfg, it would launch 15 Selenium instances.
Shouldn't grouper(iterable, n, fillvalue=None) in utils.py be replaced by something like this:

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0

    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg

    return out

(taken from http://stackoverflow.com/questions/2130016/splitting-a-list-of-arbitrary-size-into-only-roughly-n-equal-parts)
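
As a quick sanity check of the proposed chunkIt (my own example, using the function defined above): splitting 15 keywords among 10 workers yields exactly 10 groups that together cover all keywords, instead of 15 single-keyword Selenium instances:

keywords = ['keyword-{}'.format(i) for i in range(15)]

groups = chunkIt(keywords, 10)
print(len(groups))               # 10 -> one group per worker
print([len(g) for g in groups])  # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2] -> all 15 keywords covered

Every keyword ends up in exactly one group, and the number of groups never exceeds num_workers.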

And then change assign_keywords_to_scrapers(all_keywords) on line 34 of core.py to something like this:

mode = Config['SCRAPING'].get('scrapemethod')
num_workers = Config['SCRAPING'].getint('num_workers', 1)

if len(all_keywords) > num_workers:
    # kwgroups = grouper(all_keywords, len(all_keywords)//num_workers, fillvalue=None)
    kwgroups = chunkIt(all_keywords, num_workers)
else:
    # that's a little special there :)
    kwgroups = [[kw, ] for kw in all_keywords]

return kwgroups

That way the keyword split would be even for every combination of num_workers and keyword set, no?

error: the following arguments are required: scrapemethod Issue

Hi
I was expecting to use GoogleScraper inside my Python code, but it still wants command line arguments. When I run example/basic_usage.py from my IDE, the following output shows up:

usage: GoogleScraper [-h] [-q keyword] [--keyword-file KEYWORD_FILE]
[-n number_of_results_per_page]
[-z num_browser_instances]
[--base-search-url BASE_SEARCH_URL] [-p num_of_pages]
[-s results_storing] [-t search_type]
[--proxy-file PROXY_FILE] [--config-file CONFIG_FILE]
[--simulate] [--print] [-x] [--view] [--fix-cache-names]
[--check-oto] [-v VERBOSITY] [--debug {INFO,DEBUG}]
[--view-config] [--mysql-proxy-db MYSQL_PROXY_DB]
{http,sel}
GoogleScraper: error: the following arguments are required: scrapemethod

I see that scrapemethod is set to http in the config dict, but it still wants to get it from the command line. Also, lxml is another dependency that doesn't come with a stock Python 3 installation.

TypeError

$ python3 use.py
Traceback (most recent call last):
File "use.py", line 1, in
import GoogleScraper
File "GoogleScraper-master/GoogleScraper.py", line 156, in
class GoogleScrape(threading.Timer):
TypeError: function() argument 1 must be code, not str

sqlalchemy warning

SAWarning: Ignoring declarative-like tuple value of attribute proxy_id: possibly a copy-and-paste error with a comma left at the end of the line?
_as_declarative(cls, classname, cls.__dict__)
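
For reference, this warning usually comes from a stray trailing comma in a declarative model, which turns the attribute into a tuple. A minimal sketch that reproduces it (a hypothetical model, not code from this project):

from sqlalchemy import Column, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Proxy(Base):
    __tablename__ = 'proxies'

    id = Column(Integer, primary_key=True)
    proxy_id = Column(Integer),  # trailing comma -> tuple value, triggers the SAWarning above

Removing the trailing comma makes the warning go away.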

No working GoogleScraper instance

Sometimes I mess up and leave no working GoogleScraper instance online. For example, right now GoogleScraper 0.1.3 doesn't work when installed from pip, nor does the most recent version on GitHub. So the users aren't happy.

If this happens and I overlook it, please shoot me a short message in this thread and I will fix the issue.

Best regards

using cssselect() to target query data

OK Nikolai,

This is very good - I've managed to figure out (fumble out) cssselect() and am able to traverse the query results by html element. So I can get at the link text and also the description (or "snippet"). Now I will hack some more to cast the results as a dictionary of {'url':'...', 'title':'...', 'snippet':'...'} and feed it back to my calling application.

In the process I discovered that there is no need to scrape all link tags from the Google results page - you can give cssselect() something like an XPath and it will get at the query result URLs exactly:

The query results page contains each "hit" in an HTML list item tag, and from there inside a few divs and spans. You can specify these (with their class names) to cssselect() and it will only select those. Here's an example:

 <li class="g"><!--m-->
 <div class="rc" data-hveid="86">
  <span class="altcts">
  </span>
  <h3 class="r"><a href="http://www.coindesk.com/
                            singapore-regulators-interfere-bitcoin/"
...

To see what I mean:

replace

links = dom.cssselect('a')
return [e.get('href') for e in links]

with

links = dom.cssselect('li.g div.rc h3.r a') 
return [e.get('href') for e in links]

and no need to pass the results through _clean_results() to eliminate badboys!

For my aim of getting at more info inside those list tags I cast a wider net with:

divs = dom.cssselect('li.g div.rc')

and then iterate over each element and extract additional tags:

for el in divs: 
    link_title =  el.cssselect('a')[0].text_content()
    link_url = el.cssselect('a')[0].get('href')
    link_snippet = el.cssselect('span.st')[0].text_content()
    linkdict = {'url':link_url, 'title':link_title, 'snippet':link_snippet}
    linklist.append(linkdict)
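
Putting the pieces together, here is a minimal self-contained version of the idea (my own sketch; it assumes lxml plus the cssselect package are installed, that html holds the fetched results page, and that Google's class names have not changed since):

import lxml.html

def extract_results(html):
    """Return a list of {'url', 'title', 'snippet'} dicts from a Google results page."""
    dom = lxml.html.fromstring(html)
    linklist = []
    for el in dom.cssselect('li.g div.rc'):
        anchors = el.cssselect('h3.r a')
        snippets = el.cssselect('span.st')
        if not anchors:
            continue
        linklist.append({
            'url': anchors[0].get('href'),
            'title': anchors[0].text_content(),
            'snippet': snippets[0].text_content() if snippets else '',
        })
    return linklist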

So, that's the day before Xmas well spent!

Continue last scrape error

Change

last_modified = datetime.datetime.fromtimestamp(os.path.getmtime(last_search.keyword_file))

to

last_modified = datetime.datetime.utcfromtimestamp(os.path.getmtime(last_search.keyword_file))

presumably because the other timestamps GoogleScraper stores are in UTC (e.g. started_searching=datetime.datetime.utcnow()), so the comparison should use UTC as well.

when store serp

sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back by a nested rollback() call.  To begin a new transaction, issue Session.rollback() first.

My modification:

def store(self):
        """Store the parsed data in the sqlalchemy scoped session."""
        assert self.session, 'No database session. Turning down.'

        with self.db_lock:
            serp = SearchEngineResultsPage(
                search_engine_name=self.search_engine,
                scrapemethod=self.scrapemethod,
                page_number=self.current_page,
                requested_at=self.current_request_time,
                requested_by=self.ip,
                query=self.current_keyword,
                num_results_for_keyword=self.parser.search_results['num_results'],
            )
            self.scraper_search.serps.append(serp)

            serp, parser = parse_serp(serp=serp, parser=self.parser)
            # if have no result, skip store
            if serp.num_results == 0:
                return False
            try:
                self.session.add(serp)
                self.session.commit()
            except Exception:
                # the InvalidRequestError above asks for an explicit rollback before a new transaction
                self.session.rollback()
                return False

            store_serp_result(dict_from_scraping_object(self), self.parser)
            return True

scraping.py
If the result page has no SERP results, do not store it:

def after_search(self):
        """Store the results and parse em.

        Notify the progress queue if necessary.

        Args:
            html: The scraped html.
        """
        self.parser.parse(self.html)
        if not self.store():
            logger.error("No results for store, skip current keyword:{0}".format(self.current_keyword))
            self.search_number += 1
            return
        if self.progress_queue:
            self.progress_queue.put(1)
        self.cache_results()
        self.search_number += 1

caching.py

serp = None #get_serp_from_database(session, query, search_engine, scrapemethod)

to

serp = get_serp_from_database(session, query, search_engine, scrapemethod)

            if not serp:
                serp, parser = parse_again(fname, search_engine, scrapemethod, query)

            serp.scraper_searches.append(scraper_search)
            session.add(serp)
            # added: commit periodically so the transaction does not grow too large
            if num_cached % 200 == 0:
                session.commit()

Keyword not found in title: Message: ''

I believe this happens when a scrape is attempted before the page has fully loaded. The result is an empty record in the "links" table.

Maybe we can improve error handling here so that it retries "x" times.

proxy_check does nothing?

    def proxy_check(self):
        assert self.proxy and self.webdriver, 'Scraper instance needs valid webdriver and proxy instance to make the proxy check'

        self.webdriver.get(Config['GLOBAL'].get('proxy_check_url'))

        data = self.webdriver.page_source

        if not self.proxy.host in data:
            logger.warning('Proxy check failed: {host}:{port} is not used while requesting'.format(**self.proxy.__dict__))
        else:
            logger.info('Proxy check successful: All requests going through {host}:{port}'.format(**self.proxy.__dict__))

proxy_check only logs a warning; if the proxy is not valid, the program still uses it.
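
A possible tightening (my own sketch, not the project's code): make proxy_check fail loudly instead of only logging, so a scraper whose requests leak the real IP is not used. It reuses the same self.proxy, self.webdriver, Config and logger names as the method above:

class ProxyCheckFailedException(Exception):
    """Raised when requests do not go through the configured proxy."""

def proxy_check(self):
    assert self.proxy and self.webdriver, 'Scraper instance needs valid webdriver and proxy instance to make the proxy check'

    self.webdriver.get(Config['GLOBAL'].get('proxy_check_url'))
    data = self.webdriver.page_source

    if self.proxy.host not in data:
        # fail fast instead of silently continuing with the real IP
        raise ProxyCheckFailedException('Proxy check failed: {host}:{port} is not used while requesting'.format(**self.proxy.__dict__))

    logger.info('Proxy check successful: All requests going through {host}:{port}'.format(**self.proxy.__dict__))
    return True

Callers could then drop or replace any scraper whose proxy check raises.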

No num_results_for_kw

Hi, very nice module!! But I have a question: when I search with two or more pages, num_results_for_kw is empty. For example:

$ python3 GoogleScraper.py -p 2 -n 10 -q "Unix"
None
[+] 10 links found! The search with the keyword "Unix" has ``

In the case of one page:

$ python3 GoogleScraper.py -p 1 -n 10 -q "Unix"
None
[+] 10 links found! The search with the keyword "Unix" has About 77,000,000 results

Is this normal? Is it a bug?

Thank you very much.

core.py 250 does not add used_search_engines

scraper_search does not add used_search_engines (line 250):

    scraper_search = ScraperSearch(
        number_search_engines_used=1,
        number_proxies_used=len(proxies) - 1 if None in proxies else len(proxies),
        number_search_queries=len(keywords),
        started_searching=datetime.datetime.utcnow(),
        used_search_engines=",".join(search_engines)  # add this line
    )

NoParserForSearchEngineException

If I try to execute a slightly modified example with multiple search engines added, I get GoogleScraper.parsing.NoParserForSearchEngineException: No such parser for yandex

def basic_usage():
    # See in the config.cfg file for possible values
    config = {
        'SCRAPING': {
            'use_own_ip': 'True',
            'keyword': 'Let\'s go bubbles!',
            'search_engines': 'google, yandex',
            'num_pages_for_keyword': 10
        },
        'SELENIUM': {
            'sel_browser': 'phantomjs',
            'num_workers': 10,
        },
        'GLOBAL': {
            'do_caching': 'False'
        }
    }

scraping.py line 935, sel scraper number of pages is wrong

 # Click the next page link not when leaving the loop
 if self.current_page < self.num_pages_per_keyword + 1:
    self.next_url = self._goto_next_page()

should be

 # Click the next page link not when leaving the loop
 if self.current_page < self.num_pages_per_keyword:
    self.next_url = self._goto_next_page()

If self.num_pages_per_keyword == 1, the loop

 for self.current_page in range(1, self.num_pages_per_keyword + 1):

only runs once, but with the original condition self.next_url = self._goto_next_page() is still called, so one page more than requested gets fetched.
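
A quick way to see the off-by-one with plain Python (just simulating the loop and both conditions, not repo code):

num_pages_per_keyword = 1

for current_page in range(1, num_pages_per_keyword + 1):
    print('scraping page', current_page)
    if current_page < num_pages_per_keyword + 1:
        print('  original condition -> would click "next" although only one page was requested')
    if current_page < num_pages_per_keyword:
        print('  fixed condition -> never reached, no extra click')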

scraping.py function handle_request_denied

self.search_input = WebDriverWait(self.webdriver, 5).until(self._get_search_input_field())

should be

self.search_input = WebDriverWait(self.webdriver, 5).until(EC.visibility_of_element_located(self._get_search_input_field()))

?

invalid search keywords

I grabbed the latest git version and tested it using the args
sel --debug --keyword-file ./keywords based on what it says in the readme.

I noticed that it is searching Google for the individual characters of each keyword, not the actual keyword. Tracing the code, I found that this is caused by line 830, self.keywords = set(keywords):
the call to set() is what splits the keyword into individual characters.
It looks like the keywords variable here should not be a string but a list, yet it gets passed as a string.
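
A minimal illustration of the pitfall (plain Python, not repo code): passing a single string to set() explodes it into characters, while a list of keywords stays intact:

assert set("apple") == {'a', 'p', 'l', 'e'}             # a string is iterated character by character
assert set(["apple", "banana"]) == {'apple', 'banana'}  # a list keeps the keywords whole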

I also had some errors related to caching, so I disabled it ('do_caching': False).

Waiting until the keyword appears in the title may not be enough

scraping.py line 900

# Waiting until the keyword appears in the title may
# not be enough. The content may still be from the old page.
try:
    WebDriverWait(self.webdriver, 5).until(EC.title_contains(self.current_keyword))
except TimeoutException as e:
    logger.error(SeleniumSearchError('Keyword "{}" not found in title: {}'.format(self.current_keyword, self.webdriver.title)))
    break

For a search engine like ask.com, the title is always "Ask.com - What's Your Question?", so waiting for the keyword in the title never succeeds. I suggest checking a URL parameter instead.
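
A sketch of that suggestion (my own, not from the repo): wait on the current URL instead of the title, using a custom wait condition. The exact query-string encoding can differ per search engine, so quote_plus is only an assumption here:

from urllib.parse import quote_plus

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_keyword_in_url(webdriver, keyword, timeout=5):
    """Wait until the query shows up in the URL, e.g. ...?q=sand+crusher."""
    try:
        WebDriverWait(webdriver, timeout).until(
            lambda driver: quote_plus(keyword) in driver.current_url
        )
        return True
    except TimeoutException:
        return False

This avoids relying on the page title, which is static on engines like ask.com.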

Google result is null because Google uses AJAX search

Sometimes Google uses an AJAX request to get the SERP results, like this URL:
https://www.google.com.sg/search?biw=768&bih=799&q=sand+crusher&oq=sand+crusher&gs_l=serp.3..0j0i30l9.35717.106298.6.106480.19.15.0.0.0.0.292.1366.0j7j1.8.0.msedr...0...1c.1.60.serp..11.8.1364.4CtAbfCRlaE&bav=on.2,or.&bvm=bv.82001339,d.cGU&fp=807c0983af5fb146&tch=1&ech=1&psi=fvqgVP2-CdW1oQTkioGoCg.1419836023708.7

Google uses JavaScript to fetch a JSON response and then renders it into HTML that is appended to the page.

So the raw HTML source does not contain the elements matched by normal_search_selectors.

I think that in China, when using a proxy, Google redirects to the URL
https://www.google.com.sg/?gfe_rd=cr&ei=ffqgVObpEpHn-QOa-oC4CQ#q=sand+crusher
which defaults to the AJAX search (note the query behind the # fragment).

Not Working

When trying this command:
GoogleScraper -h

I get:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/GoogleScraper", line 9, in <module>
    load_entry_point('GoogleScraper==0.1.8', 'console_scripts', 'GoogleScraper')()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources.py", line 339, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources.py", line 2470, in load_entry_point
    return ep.load()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources.py", line 2184, in load
    ['__name__'])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/GoogleScraper/__init__.py", line 13, in <module>
    from GoogleScraper.config import get_config
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/GoogleScraper/config.py", line 4, in <module>
    import configparser
ImportError: No module named configparser

Pip installation would be nice

It would be very nice if one would be able to install GoogleScraper using pip. This would be extremely useful when used together with virtualenv.

How to install socks/GoogleScraper with Python 3 on Ubuntu?

I have problems installing socks on my Ubuntu 12.04 machine, and so GoogleScraper does not work.

  • sudo apt-get install python-socksipy only gives me an importable module for Python 2.7
  • sudo apt-get install python-txsocksx does not work either.

How do I get the script running on Ubuntu 12.04? I did not come across any installation instructions...

Can't get it working

Hey,

Not an expert on this, but I thought to give it a try. I followed all the steps but I get this error:

(.venv)[root@main google]# python run.py
Traceback (most recent call last):
File "run.py", line 8, in
from GoogleScraper.core import main
File "/home/botswiz/google/GoogleScraper/init.py", line 11, in
Config = get_config()
File "/home/botswiz/google/GoogleScraper/config.py", line 101, in get_config
parse_config(cmd_args)
File "/home/botswiz/google/GoogleScraper/config.py", line 88, in parse_config
logger.error('Exception trying to parse file {}: {}'.format(CONFIG_FILE, e))
ValueError: zero length field name in format

Thanks

http scraper getresponse() unexpected keyword 'buffering'

The error occurs while processing a query with the http parser. It seems to occur reliably when reading a keyword with inurl: and a complex string; it does not occur without inurl:"some_string".

COMMAND:

GoogleScraper -m http --search-engines "google" --output-format csv --output-filename outputTest --keyword-file "keywords.txt"

keywords.txt:
inurl:"https://" inurl:"some/path/to/page.html"
Another Key Word

SYSTEM:

Python 3.4 on OS X

OUTPUT SNIPPET:

2014-11-26 17:22:53,146 - GoogleScraper - ERROR - Connection timeout HTTPSConnectionPool(host='www.google.com', port=443): Read timed out. (read timeout=3.0)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/requests-2.4.3-py3.4.egg/requests/packages/urllib3/connectionpool.py", line 331, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/requests-2.4.3-py3.4.egg/requests/packages/urllib3/connectionpool.py", line 333, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1172, in getresponse
    response.begin()
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 313, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/socket.py", line 371, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ssl.py", line 746, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ssl.py", line 618, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
