How to set up a Proxy (icrawler) [closed]

hellock commented:
How do I set up a proxy?


Comments (6)

hellock commented:

Hi, sorry for the late reply.

  1. If you have already got a proxy, e.g. http://user:password@proxy.example.com:8888 (placeholder credentials and host):

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import Proxy, ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('http://user:password@proxy.example.com:8888', 'http'))
  2. If you do not own a valid proxy, you can use the scan() method to find some available proxies on the Internet, though these may not be stable.

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.scan(region='overseas', expected_num=10)

It will then first scan 10 available proxies from the Internet and use them for crawling tasks.
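
Either way, the customized class is used like any other crawler; a minimal usage sketch (the output directory name is just a placeholder):

crawler = MyCrawler(downloader_threads=2, storage={'root_dir': 'your_image_dir'})
crawler.crawl(keyword='sunny', max_num=10)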


tarrade commented:

Hi,

sorry for the delay. I didn't manage to get it working.
If I use urllib2:

proxy = urllib2.ProxyHandler({'http': http_proxy,
                              'https': http_proxy})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

it works; I can copy stuff from the web without issue (without the proxy setup it doesn't work).
Here is how http_proxy looks (password and host replaced by placeholders):
http_proxy = 'chdoleninet\c1xxxx:<password>@<proxy-host>:8080'

so as you can see, I have a complicated login and some special characters in my password.

This is what I tried, following your example:
self.proxy_pool.add_proxy(Proxy('https://chdolenine\c1xxxx:<password>@<proxy-host>:8080', 'https'))

and I get this message:

2017-03-05 17:41:06,553 - INFO - downloader - downloader-003 is waiting for new download tasks
2017-03-05 17:41:06,592 - INFO - downloader - downloader-004 is waiting for new download tasks
2017-03-05 17:41:11,486 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x0000000006039400>, 'Connection to www.google.com timed out. (connect timeout=5)')), remaining retry times: 2
2017-03-05 17:41:11,533 - INFO - downloader - downloader-002 is waiting for new download tasks
2017-03-05 17:41:11,549 - INFO - downloader - downloader-001 is waiting for new download tasks
2017-03-05 17:41:11,555 - INFO - downloader - downloader-003 is waiting for new download tasks

Any idea what is not set up properly?

Thanks
Cheers
Fabien


hellock commented:

It seems that the proxy format is not correct. icrawler depends on the requests library (which is more widely used than urllib2) to handle all HTTP requests; see the requests documentation for more information.

Maybe you should try chdolenine\\user:<password>@<proxy-host>:8080 instead of https://chdolenine\c1xxxx:<password>@<proxy-host>:8080, since your proxy is not using the https protocol. But I'm not sure whether it is supported, because the requests documentation only covers the http(s) and socks protocols.
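
If the password contains special characters (such as ':' or '@'), they probably also need to be percent-encoded before being embedded in the proxy URL. A sketch with placeholder credentials (quote comes from the standard library; on Python 2 it lives in urllib rather than urllib.parse):

from urllib.parse import quote  # Python 2: from urllib import quote

from icrawler.utils import Proxy

# Placeholder credentials -- substitute your own. quote() percent-encodes
# characters such as '\', '@' and ':' so they cannot break URL parsing.
user = quote('chdoleninet\\c1xxxx', safe='')
password = quote('my:p@ssword', safe='')
proxy = Proxy('http://%s:%s@proxy-host:8080' % (user, password), 'http')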


tarrade commented:

Hi @hellock,

I made some tests using requests standalone, and the following works:

import requests

# Placeholder credentials and host, substituted for this example.
proxies = {
    'http': 'http://user:password@proxy-host:8080/',
    'https': 'https://user:password@proxy-host:8080/'
}
r = requests.get('https://www.google.com', proxies=proxies)
if r.status_code == requests.codes.ok:
    print(r.headers['content-type'])

which gives:

text/html; charset=ISO-8859-1

So it works, but if I use the same config for icrawler, it still doesn't work. Is requests used in a different way in icrawler?

Thanks
Cheers
Fabien


hellock commented:

Hi @tarrade,
Thanks for the test. icrawler just uses the same method to work with proxies; see the source for details.
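
Roughly, the (addr, protocol) pair of a Proxy becomes a requests-style proxies dict. A sketch of the idea only, assuming Proxy stores its constructor arguments as addr and protocol attributes (not the exact internals):

import requests
from icrawler.utils import Proxy

proxy = Proxy('https://103.14.8.239:8080', 'https')
# Hand the proxy to requests the same way a standalone script would.
r = requests.get('https://www.google.com',
                 proxies={proxy.protocol: proxy.addr}, timeout=5)
print(r.status_code)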

Here is the code I have tested.

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool, Proxy


class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https'))


def main():
    crawler = MyCrawler(
        downloader_threads=2, storage={'root_dir': 'your_image_dir'})
    crawler.crawl(keyword='sunny', max_num=10)


if __name__ == '__main__':
    main()


Bluearrow commented:

Hi @hellock,

I replaced the address in self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https')) with my own, and I got the errors below. I can use my own address to visit Google via shadowsocks in Chrome.

2019-02-25 16:22:11,900 - INFO - icrawler.crawler - start crawling...
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 feeder threads...
2019-02-25 16:22:11,901 - INFO - feeder - thread feeder-001 exit
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 parser threads...
2019-02-25 16:22:11,902 - INFO - icrawler.crawler - starting 2 downloader threads...
2019-02-25 16:22:12,218 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 2
2019-02-25 16:22:12,497 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 1
2019-02-25 16:22:12,598 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))), remaining retry times: 0
2019-02-25 16:22:14,602 - INFO - parser - no more page urls for thread parser-001 to parse
2019-02-25 16:22:14,602 - INFO - parser - thread parser-001 exit
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-002
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-001
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-002 exit
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-001 exit
2019-02-25 16:22:17,902 - INFO - icrawler.crawler - Crawling task done!
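
Since shadowsocks exposes a local SOCKS5 proxy rather than an HTTP(S) one, maybe the socks5 scheme is needed in the proxy URL instead. A sketch, assuming the default local port 1080 and the requests[socks] extra installed:

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import Proxy, ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        # shadowsocks listens locally as a SOCKS5 proxy (port 1080 by default);
        # requests needs the 'requests[socks]' extra to handle socks5:// URLs.
        self.proxy_pool.add_proxy(Proxy('socks5://127.0.0.1:1080', 'https'))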

