mycognitive / google_dl

Python script to download files via Google search.

Python 100.00%

command-line google-search python-script

google_dl's Issues

google_dl doesn't do anything.

The script no longer downloads any files (it used to work some time ago).

Example:

$ google_dl -v http://example.com
$ google_dl "foo" -f pdf

I expect the script to download the files as it did before.
The issue may be in the xgoogle library; if so, send a PR to mycognitive/xgoogle.

UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 48: ordinal not in range(128)

This command:

$ google_dl -x -r -s http://www.example.com/foo/bar/ "foo"
Downloading 'b'%0d%0a\xc3\xb6m_bar.pdf'' from 'b'http://www.example.com/%250d%250a-foo%C3%B6m_bar.pdf'' into . ...
Traceback (most recent call last):
  File "~/google_dl", line 171, in <module>
    print("File '%s' already exists, skipping." % (path))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 48: ordinal not in range(128)

fails with an encoding error.
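
The `b'...'` wrappers in the log line above suggest that the file name and URL were still `bytes` objects when they were formatted into the message. A minimal sketch of normalising such values to text first; `to_text` is a hypothetical helper (not part of google_dl), UTF-8 is an assumed encoding, and the sample file name is made up for illustration:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes and replacing undecodable bytes."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors="replace")
    return str(value)


# Illustrative values only; the real script obtains these from the search results.
filename = to_text(b"f\xc3\xb6o_bar.pdf")
message = "Downloading '%s' into %s ..." % (filename, ".")
```

This only cleans up the message text; printing it can still fail on a terminal whose encoding is ASCII.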

UnicodeEncodeError: 'ascii' codec can't encode character

Error:

Traceback (most recent call last):
  File "google_dl", line 169, in <module>
    print("Downloading '%s' from '%s' into %s..." % (filename, url, dirname))
UnicodeEncodeError: 'ascii' codec can't encode character '\u0142' in position 22: ordinal not in range(128)
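
Both tracebacks come from `print()` writing a non-ASCII character to an output stream whose encoding is ASCII. One possible workaround, sketched here with hypothetical helpers `safe_text` and `safe_print` that are not part of google_dl, is to escape whatever the stream cannot represent instead of raising:

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding cannot represent escaped."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


def safe_print(text):
    """Print text without raising UnicodeEncodeError on limited terminals."""
    print(safe_text(text))
```

Alternatively, setting the environment variable `PYTHONIOENCODING=utf-8` before running the script makes Python use UTF-8 for standard output, which avoids the error without code changes.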

ModuleNotFoundError: No module named 'xgoogle.search'

Hi,
I followed the instructions in the README, but the search module is not found.

$ ./google_dl.py
Traceback (most recent call last):
  File "./google_dl.py", line 11, in <module>
    from xgoogle.search import GoogleSearch, SearchError
ModuleNotFoundError: No module named 'xgoogle.search'

xgoogle is installed:

$ pip list
Package        Version
-------------- -------
beautifulsoup4 4.4.1
chardet        2.3.0
colorama       0.3.2
html5lib       0.999
nltk           3.0.5
pip            19.2.3
requests       2.4.3
setuptools     41.2.0
six            1.10.0
urllib3        1.9.1
wheel          0.24.0
xgoogle        1.4

search.py is present in the xgoogle directory:

$ ls xgoogle/xgoogle/
BeautifulSoup.py   browser.py         realtime.py        sponsoredlinks.py
__init__.py        googlesets.py      search.py          translate.py
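
A plausible cause, which the report does not confirm, is that the name `xgoogle` resolves to something other than the package containing `search.py` (for example, the outer `xgoogle/` checkout directory in the current working directory). A small diagnostic sketch using the standard `importlib` machinery; `locate_module` is a hypothetical helper:

```python
import importlib.util


def locate_module(name):
    """Return the file a dotted module name resolves to, or None if not importable."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, AttributeError, ValueError):
        # Raised when a parent package is missing or is not a real package.
        return None
    return spec.origin if spec is not None else None


for name in ("xgoogle", "xgoogle.search"):
    print(name, "->", locate_module(name))
```

If `xgoogle` is found but `xgoogle.search` is not, comparing the printed location with the directory that holds `search.py` should show which copy Python is picking up.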

Export search result into CSV file

Add an optional parameter to export the search results (list of URLs, etc.) to a given CSV file instead of downloading all the files.
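
A rough sketch of what such an export could look like with the standard `csv` module. The function name `write_results_csv` and the `(title, url)` column layout are assumptions made for illustration; the request does not specify which fields to export:

```python
import csv
import io


def write_results_csv(results, stream):
    """Write (title, url) pairs to stream as CSV with a header row."""
    writer = csv.writer(stream)
    writer.writerow(["title", "url"])
    for title, url in results:
        writer.writerow([title, url])


# Illustrative data; a real run would pass the search results instead.
buffer = io.StringIO()
write_results_csv([("Foo, bar", "http://www.example.com/foo.pdf")], buffer)
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, e.g. `open(path, "w", newline="", encoding="utf-8")`.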

http.client.BadStatusLine

Downloading 'b'foo.pdf'' from 'b'http://www.example.com/foo.pdf'' into ....
Traceback (most recent call last):
  File "google_dl", line 173, in <module>
    page.dlFile(url, path)
  File "google_dl", line 54, in dlFile
    with urllib.request.urlopen(request) as i, open(path, "wb") as o:
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 1185, in do_open
    r = h.getresponse()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1171, in getresponse
    response.begin()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 321, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''

This may happen at random.
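
Since the failure is intermittent, one option is to retry the download a few times before giving up. A minimal sketch, assuming a hypothetical `retry` helper and arbitrary attempt and delay values:

```python
import http.client
import time


def retry(func, attempts=3, delay=1.0, exceptions=(http.client.BadStatusLine,)):
    """Call func(), retrying on the given exceptions with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay * attempt)
```

In the script this might wrap the call shown in the traceback, for example `retry(lambda: page.dlFile(url, path))`.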

Doesn't return all the repeated results

$ google_dl -v -x -r "site:www.imdb.com/board"
site:www.imdb.com/board
Trying to download results from page #1  (results 1-47)
...

$ python3
>>> from xgoogle.search import GoogleSearch
>>> gs = GoogleSearch("site:www.imdb.com/board", repeat=True)

The tool downloads only 47 items, although Google reports 5,740 results. It should take all of them into account.

The fix belongs either in this repo or in xgoogle, depending on where the problem is.
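
Wherever the fix lands, the expected behaviour amounts to fetching page after page until no results remain. A generic sketch of that loop; `fetch_page` is a hypothetical callable, and whether xgoogle's `get_results()` advances to the next page on each call is an assumption to verify:

```python
def iter_all_results(fetch_page):
    """Yield results from successive pages until a page comes back empty."""
    while True:
        results = fetch_page()
        if not results:
            return
        yield from results
```

With xgoogle this might be driven as `iter_all_results(gs.get_results)`, under the assumption stated above.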

Support for omitted results

The following command downloads only one file:

google_dl -s mast.queensu.ca -f pdf ""

The expected result is to download all the files.

For example, googling site:mast.queensu.ca filetype:pdf also shows only one file, but clicking "repeat the search with the omitted results included" shows about 6k other files.

The script should automatically follow the omitted-results link so that it can download all the files.

If the option can be parameterized, this method can be activated by specifying -r (to repeat the search with the omitted results included).
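
To my knowledge, Google's "repeat the search with the omitted results included" link adds `filter=0` to the search URL. A sketch of building such a URL; `build_search_url` and its parameters are hypothetical, and how google_dl or xgoogle actually assemble the query may differ:

```python
from urllib.parse import urlencode


def build_search_url(query, site=None, filetype=None, include_omitted=False):
    """Build a Google search URL; filter=0 asks for omitted results too."""
    terms = [query] if query else []
    if site:
        terms.append("site:" + site)
    if filetype:
        terms.append("filetype:" + filetype)
    params = {"q": " ".join(terms)}
    if include_omitted:
        params["filter"] = "0"  # Assumed switch for including omitted results.
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("", site="mast.queensu.ca", filetype="pdf", include_omitted=True)
```

Here `include_omitted=True` would correspond to passing -r on the command line.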

Est. 2-4h

NameError: name 'name2codepoint' is not defined

Traceback (most recent call last):
  File "binfiles/google_dl", line 161, in <module>
    for results in page:
  File "binfiles/google_dl", line 79, in __next__
    results = self.gs.get_results()
  File "~/.python/xgoogle/search.py", line 201, in get_results
    results = self._extract_results(page)
  File "~/.python/xgoogle/search.py", line 288, in _extract_results
    eres = self._extract_result(result)
  File "~/.python/xgoogle/search.py", line 295, in _extract_result
    title, url = self._extract_title_url(result)
  File "~/.python/xgoogle/search.py", line 309, in _extract_title_url
    title = self._html_unescape(title)
  File "~/.python/xgoogle/search.py", line 369, in _html_unescape
    return re.sub(r'&([^;]+);', entity_replacer, s, re.U)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "~/.python/xgoogle/search.py", line 356, in entity_replacer
    if entity in name2codepoint:
NameError: name 'name2codepoint' is not defined

Example command:

google_dl -s http://www.marquette.edu/maqom/ -f pdf ""
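
In Python 3, `name2codepoint` lives in `html.entities` (it was in `htmlentitydefs` in Python 2), so the `NameError` points to a missing or Python 2 only import in `search.py`. A sketch of two ways to get the same result on Python 3; the wrapper below is illustrative and is not the xgoogle implementation:

```python
import html
from html.entities import name2codepoint  # Python 3 home of this mapping


def html_unescape(text):
    """Replace named and numeric HTML entities with their characters."""
    return html.unescape(text)
```

Separately, the `re.sub(..., s, re.U)` call in the traceback passes `re.U` in the position of the `count` argument; flags have to be passed as `flags=re.U`.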

ValueError: unknown url type

Traceback (most recent call last):
  File "~/binfiles/google_dl", line 165, in <module>
    path = get_path_via_url(url, args.dest, args.dirs)
  File "~/binfiles/google_dl", line 105, in get_path_via_url
    request = urllib.request.Request(url, method="HEAD")
  File "~/anaconda/lib/python3.6/urllib/request.py", line 329, in __init__
    self.full_url = url
  File "~/anaconda/lib/python3.6/urllib/request.py", line 355, in full_url
    self._parse()
  File "~/anaconda/lib/python3.6/urllib/request.py", line 384, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '/search?q=%C5%93+%C5%93+%C5%93+%C5%93+filetype:pdf&num=50&hl=en&prmd=ivns&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwiRx83UpdTWAhWKLVAKHQ4rB1c49ABQsAAInwA'
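
The URL in the error is a relative link back to Google's own search page, which `urllib.request.Request` cannot open. One possible guard is to skip anything that is not an absolute http(s) URL before requesting it; `is_absolute_http_url` is a hypothetical helper:

```python
from urllib.parse import urlparse


def is_absolute_http_url(url):
    """Return True only for URLs with an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Results failing this check could be skipped with a warning rather than aborting the whole run.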

Adds as package

Adds packaging so that the command can be installed via pip or easy_install.
