mycognitive / google_dl
Python script to download files via Google search.
The script doesn't download files anymore (it did some time ago).
Example:
$ google_dl -v http://example.com
$ google_dl "foo" -f pdf
I expect the script to download the files as before.
The issue may be in the xgoogle library; if so, send a PR to mycognitive/xgoogle.
This command:
$ google_dl -x -r -s http://www.example.com/foo/bar/ "foo"
Downloading 'b'%0d%0a\xc3\xb6m_bar.pdf'' from 'b'http://www.example.com/%250d%250a-foo%C3%B6m_bar.pdf'' into . ...
Traceback (most recent call last):
File "~/google_dl", line 171, in <module>
print("File '%s' already exists, skipping." % (path))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 48: ordinal not in range(128)
fails with an encoding error.
Error:
Traceback (most recent call last):
File "google_dl", line 169, in <module>
print("Downloading '%s' from '%s' into %s..." % (filename, url, dirname))
UnicodeEncodeError: 'ascii' codec can't encode character '\u0142' in position 22: ordinal not in range(128)
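Both tracebacks above fail inside print() when the output stream uses the ASCII codec, and the 'b'...'' quoting in the log suggests that bytes objects are being formatted with %s. A minimal sketch of one possible workaround follows; safe_text and safe_print are hypothetical helper names, not functions from google_dl, and the sketch assumes that escaping unprintable characters is acceptable for log messages.

```python
import sys


def safe_text(value, encoding="utf-8"):
    """Return a str for bytes or str input (hypothetical helper)."""
    if isinstance(value, bytes):
        # Decode instead of formatting the bytes repr (the "b'...'" seen in the log).
        return value.decode(encoding, errors="replace")
    return str(value)


def safe_print(message, stream=None):
    """Print message, escaping characters the stream's codec cannot encode."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through the stream encoding; unencodable characters become
    # backslash escapes rather than raising UnicodeEncodeError.
    printable = message.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)
```

A call such as safe_print("Downloading '%s' from '%s' into %s..." % (safe_text(filename), safe_text(url), dirname)) would then not raise on an ASCII-only terminal. Setting the PYTHONIOENCODING environment variable to utf-8 is another way to avoid the error without changing the script.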
Hi
I followed the instructions in the README, but the search module is not found.
$ ./google_dl.py
Traceback (most recent call last):
File "./google_dl.py", line 11, in <module>
from xgoogle.search import GoogleSearch, SearchError
ModuleNotFoundError: No module named 'xgoogle.search'
xgoogle is installed:
$ pip list
Package Version
-------------- -------
beautifulsoup4 4.4.1
chardet 2.3.0
colorama 0.3.2
html5lib 0.999
nltk 3.0.5
pip 19.2.3
requests 2.4.3
setuptools 41.2.0
six 1.10.0
urllib3 1.9.1
wheel 0.24.0
xgoogle 1.4
search.py is present in the xgoogle directory:
$ ls xgoogle/xgoogle/
BeautifulSoup.py browser.py realtime.py sponsoredlinks.py
__init__.py googlesets.py search.py translate.py
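When a package is listed by pip but a submodule cannot be imported, it can help to check which location Python actually resolves the package from. The following is only a diagnostic sketch, not a fix; describe_module is a hypothetical helper name, and the cause of this particular report is not established here.

```python
import importlib.util


def describe_module(name):
    """Report whether a module can be located and where it would load from."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package is missing or is not a package.
        spec = None
    if spec is None:
        return {"name": name, "found": False, "origin": None, "locations": []}
    return {
        "name": name,
        "found": True,
        "origin": spec.origin,  # file path, or None for namespace packages
        "locations": list(spec.submodule_search_locations or []),
    }


if __name__ == "__main__":
    for module_name in ("xgoogle", "xgoogle.search"):
        print(describe_module(module_name))
```

If "xgoogle" is found but its locations do not contain search.py, the interpreter is picking up a different copy than the directory listed above.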
Add an optional parameter to export the results (list of URLs, etc.) to a given CSV file instead of downloading all files.
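A minimal sketch of what such an export could look like, using the standard csv module. The function name write_results_csv and the (title, url) column layout are assumptions made for illustration; the request does not specify the columns or the option name.

```python
import csv


def write_results_csv(results, fileobj):
    """Write (title, url) pairs to an open text file object as CSV.

    Returns the number of data rows written. The column names are an
    assumption, not part of google_dl.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["title", "url"])
    count = 0
    for title, url in results:
        writer.writerow([title, url])
        count += 1
    return count


if __name__ == "__main__":
    # newline="" is what the csv module documentation recommends for files.
    with open("results.csv", "w", newline="", encoding="utf-8") as out:
        write_results_csv([("Foo", "http://www.example.com/foo.pdf")], out)
```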
Downloading 'b'foo.pdf'' from 'b'http://www.example.com/foo.pdf'' into ....
Traceback (most recent call last):
File "google_dl", line 173, in <module>
page.dlFile(url, path)
File "google_dl", line 54, in dlFile
with urllib.request.urlopen(request) as i, open(path, "wb") as o:
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 463, in open
response = self._open(req, data)
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 481, in _open
'_open', req)
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 1185, in do_open
r = h.getresponse()
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1171, in getresponse
response.begin()
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 351, in begin
version, status, reason = self._read_status()
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 321, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
This may happen at random.
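Since the error is intermittent, one common mitigation is to retry the download a few times before giving up. The sketch below assumes that retrying is acceptable here; with_retries and TRANSIENT_ERRORS are hypothetical names, and the choice of which exceptions count as transient is also an assumption.

```python
import http.client
import time
import urllib.error

# Exceptions treated as transient in this sketch (an assumption).
TRANSIENT_ERRORS = (http.client.BadStatusLine, urllib.error.URLError, ConnectionError)


def with_retries(action, attempts=3, delay=1.0, sleep=time.sleep):
    """Call action() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** (attempt - 1))
```

In the script this could wrap the failing call, for example with_retries(lambda: page.dlFile(url, path)), so that a single empty status line does not abort the whole run.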
$ google_dl -v -x -r "site:www.imdb.com/board"
site:www.imdb.com/board
Trying to download results from page #1 (results 1-47)
...
$ python3
>>> from xgoogle.search import GoogleSearch
>>> gs = GoogleSearch("site:www.imdb.com/board", repeat=True)
The tool downloads only 47 items, but Google reports 5,740 results. It needs to take all of them into account.
The fix should go either in this repo or in xgoogle, depending on where the problem is.
The following command downloads only one file:
google_dl -s mast.queensu.ca -f pdf ""
The expected result is to download all the files.
For example, when googling site:mast.queensu.ca filetype:pdf, it also shows one file, but when I click on "repeat the search with the omitted results included", it shows about 6k other files.
The script should automatically follow the omitted-results link so that it can download all the files.
If the option can be parameterized, this behavior could be activated by specifying -r (to repeat the search with the omitted results included).
Est. 2-4h
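One way this could be approached is to build the search URL with duplicate filtering turned off. The sketch below rests on an assumption: that the "repeat the search with the omitted results included" link corresponds to adding filter=0 to the query string. The function build_search_url and its parameters are hypothetical and are not part of google_dl or xgoogle.

```python
from urllib.parse import urlencode


def build_search_url(query, site=None, filetype=None, include_omitted=False, num=50):
    """Build a Google search URL (illustrative only)."""
    terms = [query] if query else []
    if site:
        terms.append("site:" + site)
    if filetype:
        terms.append("filetype:" + filetype)
    params = {"q": " ".join(terms), "num": num}
    if include_omitted:
        # Assumption: filter=0 asks Google not to collapse similar results.
        params["filter"] = "0"
    return "https://www.google.com/search?" + urlencode(params)
```

With this shape, the -r flag would simply map to include_omitted=True when the query is constructed.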
Traceback (most recent call last):
File "binfiles/google_dl", line 161, in <module>
for results in page:
File "binfiles/google_dl", line 79, in __next__
results = self.gs.get_results()
File "~/.python/xgoogle/search.py", line 201, in get_results
results = self._extract_results(page)
File "~/.python/xgoogle/search.py", line 288, in _extract_results
eres = self._extract_result(result)
File "~/.python/xgoogle/search.py", line 295, in _extract_result
title, url = self._extract_title_url(result)
File "~/.python/xgoogle/search.py", line 309, in _extract_title_url
title = self._html_unescape(title)
File "~/.python/xgoogle/search.py", line 369, in _html_unescape
return re.sub(r'&([^;]+);', entity_replacer, s, re.U)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "~/.python/xgoogle/search.py", line 356, in entity_replacer
if entity in name2codepoint:
NameError: name 'name2codepoint' is not defined
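In Python 3, name2codepoint lives in html.entities (it was htmlentitydefs in Python 2), so the NameError suggests the import is missing or still uses the old module name. The traceback also shows re.U passed as the fourth positional argument of re.sub, which is count rather than flags. A minimal sketch of a corrected helper follows; it is an illustration based on the lines visible in the traceback, not the actual xgoogle source.

```python
import re
from html.entities import name2codepoint

_ENTITY = re.compile(r"&([^;]+);")


def html_unescape(text):
    """Replace named HTML entities such as &amp; with their characters."""

    def entity_replacer(match):
        entity = match.group(1)
        if entity in name2codepoint:
            return chr(name2codepoint[entity])
        return match.group(0)  # leave unknown entities untouched

    # Python 3 patterns on str are Unicode-aware by default, so no flag is needed.
    return _ENTITY.sub(entity_replacer, text)
```

The standard library function html.unescape covers the same ground and also handles numeric character references, so it may be a simpler replacement.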
Example command:
google_dl -s http://www.marquette.edu/maqom/ -f pdf ""
Traceback (most recent call last):
File "~/binfiles/google_dl", line 165, in <module>
path = get_path_via_url(url, args.dest, args.dirs)
File "~/binfiles/google_dl", line 105, in get_path_via_url
request = urllib.request.Request(url, method="HEAD")
File "~/anaconda/lib/python3.6/urllib/request.py", line 329, in __init__
self.full_url = url
File "~/anaconda/lib/python3.6/urllib/request.py", line 355, in full_url
self._parse()
File "~/anaconda/lib/python3.6/urllib/request.py", line 384, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '/search?q=%C5%93+%C5%93+%C5%93+%C5%93+filetype:pdf&num=50&hl=en&prmd=ivns&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwiRx83UpdTWAhWKLVAKHQ4rB1c49ABQsAAInwA'
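The failing URL is a relative Google link ('/search?...') rather than a result URL, and urllib.request.Request rejects URLs without a scheme. A small sketch of a guard that could skip such entries before the HEAD request is shown below; is_downloadable_url is a hypothetical helper name, and skipping is only one possible way to handle these links.

```python
from urllib.parse import urlsplit


def is_downloadable_url(url):
    """Return True only for absolute http(s) URLs with a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

The main loop could then skip any result for which is_downloadable_url(url) is false instead of passing it to get_path_via_url.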
Adds a command so the script can be installed via pip or easy_install.