Comments (6)
Hi, sorry for the late reply.
- If you have already got a proxy, e.g. http://user:[email protected]:8888
```python
from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import Proxy, ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('http://user:[email protected]:8888', 'http'))
```
- If you do not own a valid proxy, you can use the `scan()` method to scan some available proxies from the Internet, which may not be stable enough.
```python
from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.scan(region='overseas', expected_num=10)
```
Then it will first scan 10 available proxies from the Internet and use these proxies for crawling tasks.
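Since scanned proxies come from public lists, individual entries may already be dead. As a rough stdlib illustration of what a liveness check amounts to (not icrawler's actual implementation), a plain TCP connect with a timeout looks like this:

```python
import socket

def proxy_alive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note this only tells you the proxy's port is reachable; it says nothing about whether the proxy will actually relay requests to the target site.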
Hi,
sorry for the delay. I didn't manage to get it working.
If I use urllib2:

```python
import urllib2  # Python 2

proxy = urllib2.ProxyHandler({'http': http_proxy,
                              'https': http_proxy})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
```
it works, I can copy stuff from the web without issue (without the proxy setup it doesn't work).
Here is how `http_proxy` looks:

```python
http_proxy = 'chdoleninet\c1xxxx:[email protected]:8080'
```
So as you can see, I have a complicated login and some special characters in my password.
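Special characters in proxy credentials are a common stumbling block: characters like `\`, `@` and `:` must be percent-encoded before they go into a proxy URL, otherwise the URL parser splits the string in the wrong place. A stdlib sketch, with made-up credentials and a hypothetical proxy host:

```python
from urllib.parse import quote  # Python 3; on Python 2 use urllib.quote

# Hypothetical credentials -- substitute your own.
user = 'chdoleninet\\c1xxxx'  # domain\user, contains a backslash
password = 'p@ss:w0rd'        # '@' and ':' would confuse the URL parser

# Percent-encode both parts so every reserved character is escaped.
proxy_url = 'http://%s:%s@proxy.server.com:8080' % (
    quote(user, safe=''), quote(password, safe=''))
print(proxy_url)
# -> http://chdoleninet%5Cc1xxxx:p%40ss%3Aw0rd@proxy.server.com:8080
```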
This is what I tried following your example
```python
self.proxy_pool.add_proxy(Proxy('https://chdolenine\c1xxxx:[email protected]:8080', 'https'))
```
and I get this message:
```
2017-03-05 17:41:06,553 - INFO - downloader - downloader-003 is waiting for new download tasks
2017-03-05 17:41:06,592 - INFO - downloader - downloader-004 is waiting for new download tasks
2017-03-05 17:41:11,486 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x0000000006039400>, 'Connection to www.google.com timed out. (connect timeout=5)')), remaining retry times: 2
2017-03-05 17:41:11,533 - INFO - downloader - downloader-002 is waiting for new download tasks
2017-03-05 17:41:11,549 - INFO - downloader - downloader-001 is waiting for new download tasks
2017-03-05 17:41:11,555 - INFO - downloader - downloader-003 is waiting for new download tasks
```
Any idea what is not set up properly?
Thanks
Cheers
Fabien
It seems that the proxy format is not correct. icrawler depends on the `requests` library to handle all HTTP requests, which is more widely used than urllib2; see here for more information.
Maybe you should try `chdolenine\\user:[email protected]:8080` instead of `https://chdolenine\c1xxxx:[email protected]:8080`, since your proxy is not using the https protocol. But I'm not sure whether it is supported, because the documentation shows only the http(s) and socks protocols.
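A note on the doubled backslash in that suggestion (a quick sketch of Python string mechanics, nothing icrawler-specific): in a normal string literal `\\` denotes a single backslash character, so both spellings below produce the same proxy string with exactly one real backslash in it.

```python
# '\\' in a normal literal is one backslash; a raw string needs no escaping.
a = 'chdolenine\\c1xxxx:[email protected]:8080'
b = r'chdolenine\c1xxxx:[email protected]:8080'
assert a == b
assert a.count('\\') == 1  # exactly one real backslash in the value
print(a)
```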
Hi @hellock,
I made some tests using requests standalone, and the following works:
```python
import requests

proxies = {
    'http': 'http://user:[email protected]:8080/',
    'https': 'https://user:[email protected]:8080/'
}
r = requests.get('https://www.google.com', proxies=proxies)
if r.status_code == requests.codes.ok:
    print(r.headers['content-type'])
```
which gives:
```
text/html; charset=ISO-8859-1
```
So it works, but if I use the same config for icrawler, it still doesn't work. Is requests used in a different way in icrawler?
Thanks
Cheers
Fabien
Hi @tarrade ,
Thanks for the test. icrawler just uses the same method to work with proxies; see here for details.
Here is the code I have tested:
```python
from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool, Proxy

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https'))

def main():
    crawler = MyCrawler(
        downloader_threads=2, storage={'root_dir': 'your_image_dir'})
    crawler.crawl(keyword='sunny', max_num=10)

if __name__ == '__main__':
    main()
```
Hi @hellock,
I replaced the address in `self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https'))` with my own and I got the errors below. I can use my own address to visit Google with shadowsocks in Chrome.
```
2019-02-25 16:22:11,900 - INFO - icrawler.crawler - start crawling...
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 feeder threads...
2019-02-25 16:22:11,901 - INFO - feeder - thread feeder-001 exit
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 parser threads...
2019-02-25 16:22:11,902 - INFO - icrawler.crawler - starting 2 downloader threads...
2019-02-25 16:22:12,218 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 2
2019-02-25 16:22:12,497 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 1
2019-02-25 16:22:12,598 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))), remaining retry times: 0
2019-02-25 16:22:14,602 - INFO - parser - no more page urls for thread parser-001 to parse
2019-02-25 16:22:14,602 - INFO - parser - thread parser-001 exit
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-002
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-001
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-002 exit
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-001 exit
2019-02-25 16:22:17,902 - INFO - icrawler.crawler - Crawling task done!
```