How to set up a Proxy (icrawler) [closed]

hellock commented:
How do I set up a proxy?


Comments (6)

hellock commented:

Hi, sorry for the late reply.

  1. If you have already got a proxy, e.g. http://user:password@proxy.example.com:8888 (placeholder credentials and host):

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import Proxy, ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('http://user:password@proxy.example.com:8888', 'http'))
  2. If you do not own a valid proxy, you can use the scan() method to find some available proxies on the Internet, though these may not be stable.

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.scan(region='overseas', expected_num=10)

It will then first scan 10 available proxies from the Internet and use them for crawling tasks.
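
Either way, the customized class is used like any other crawler; a minimal usage sketch (the output directory name is just a placeholder):

crawler = MyCrawler(downloader_threads=2, storage={'root_dir': 'your_image_dir'})
crawler.crawl(keyword='sunny', max_num=10)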


tarrade commented:

Hi,

sorry for the delay. I didn't manage to get it working.
If I use urllib2:

proxy = urllib2.ProxyHandler({'http': http_proxy,
                              'https': http_proxy})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

it works; I can copy stuff from the web without issue (without the proxy setup it doesn't work).
Here is how http_proxy looks (password and host replaced by placeholders):
http_proxy = 'chdoleninet\c1xxxx:<password>@<proxy-host>:8080'

so as you can see, I have a complicated login and some special characters in my password.

This is what I tried, following your example:
self.proxy_pool.add_proxy(Proxy('https://chdolenine\c1xxxx:<password>@<proxy-host>:8080', 'https'))

and I get this message:

2017-03-05 17:41:06,553 - INFO - downloader - downloader-003 is waiting for new download tasks
2017-03-05 17:41:06,592 - INFO - downloader - downloader-004 is waiting for new download tasks
2017-03-05 17:41:11,486 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x0000000006039400>, 'Connection to www.google.com timed out. (connect timeout=5)')), remaining retry times: 2
2017-03-05 17:41:11,533 - INFO - downloader - downloader-002 is waiting for new download tasks
2017-03-05 17:41:11,549 - INFO - downloader - downloader-001 is waiting for new download tasks
2017-03-05 17:41:11,555 - INFO - downloader - downloader-003 is waiting for new download tasks

Any idea what is not set up properly?

Thanks
Cheers
Fabien


hellock commented:

It seems that the proxy format is not correct. icrawler depends on the requests library (which is more widely used than urllib2) to handle all HTTP requests; see the requests documentation for more information.

Maybe you should try chdolenine\\user:<password>@<proxy-host>:8080 instead of https://chdolenine\c1xxxx:<password>@<proxy-host>:8080, since your proxy is not using the https protocol. But I'm not sure whether it is supported, because the requests documentation only covers the http(s) and socks protocols.
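
If the password contains special characters (such as ':' or '@'), they probably also need to be percent-encoded before being embedded in the proxy URL. A sketch with placeholder credentials (quote comes from the standard library; on Python 2 it lives in urllib rather than urllib.parse):

from urllib.parse import quote  # Python 2: from urllib import quote

from icrawler.utils import Proxy

# Placeholder credentials -- substitute your own. quote() percent-encodes
# characters such as '\', '@' and ':' so they cannot break URL parsing.
user = quote('chdoleninet\\c1xxxx', safe='')
password = quote('my:p@ssword', safe='')
proxy = Proxy('http://%s:%s@proxy-host:8080' % (user, password), 'http')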


tarrade commented:

Hi @hellock,

I made some tests using requests standalone, and the following works:

import requests

# Placeholder credentials and host, substituted for this example.
proxies = {
    'http': 'http://user:password@proxy-host:8080/',
    'https': 'https://user:password@proxy-host:8080/'
}
r = requests.get('https://www.google.com', proxies=proxies)
if r.status_code == requests.codes.ok:
    print(r.headers['content-type'])

which gives:

text/html; charset=ISO-8859-1

So it works, but if I use the same config for icrawler, it still doesn't work. Is requests used in a different way in icrawler?

Thanks
Cheers
Fabien


hellock commented:

Hi @tarrade,
Thanks for the test. icrawler just uses the same method to work with proxies; see the source for details.
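
Roughly, the (addr, protocol) pair of a Proxy becomes a requests-style proxies dict. A sketch of the idea only, assuming Proxy stores its constructor arguments as addr and protocol attributes (not the exact internals):

import requests
from icrawler.utils import Proxy

proxy = Proxy('https://103.14.8.239:8080', 'https')
# Hand the proxy to requests the same way a standalone script would.
r = requests.get('https://www.google.com',
                 proxies={proxy.protocol: proxy.addr}, timeout=5)
print(r.status_code)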

Here is the code I have tested.

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import ProxyPool, Proxy


class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https'))


def main():
    crawler = MyCrawler(
        downloader_threads=2, storage={'root_dir': 'your_image_dir'})
    crawler.crawl(keyword='sunny', max_num=10)


if __name__ == '__main__':
    main()


Bluearrow commented:

Hi @hellock,

I replaced the address in self.proxy_pool.add_proxy(Proxy('https://103.14.8.239:8080', 'https')) with my own, and I got the errors below. I can use my own address to visit Google via shadowsocks in Chrome.

2019-02-25 16:22:11,900 - INFO - icrawler.crawler - start crawling...
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 feeder threads...
2019-02-25 16:22:11,901 - INFO - feeder - thread feeder-001 exit
2019-02-25 16:22:11,901 - INFO - icrawler.crawler - starting 1 parser threads...
2019-02-25 16:22:11,902 - INFO - icrawler.crawler - starting 2 downloader threads...
2019-02-25 16:22:12,218 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 2
2019-02-25 16:22:12,497 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',))), remaining retry times: 1
2019-02-25 16:22:12,598 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=0&start=0&tbs=&tbm=isch, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=0&start=0&tbs=&tbm=isch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))), remaining retry times: 0
2019-02-25 16:22:14,602 - INFO - parser - no more page urls for thread parser-001 to parse
2019-02-25 16:22:14,602 - INFO - parser - thread parser-001 exit
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-002
2019-02-25 16:22:16,902 - INFO - downloader - no more download task for thread downloader-001
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-002 exit
2019-02-25 16:22:16,902 - INFO - downloader - thread downloader-001 exit
2019-02-25 16:22:17,902 - INFO - icrawler.crawler - Crawling task done!
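
Since shadowsocks exposes a local SOCKS5 proxy rather than an HTTP(S) one, maybe the socks5 scheme is needed in the proxy URL instead. A sketch, assuming the default local port 1080 and the requests[socks] extra installed:

from icrawler.builtin import GoogleImageCrawler
from icrawler.utils import Proxy, ProxyPool

class MyCrawler(GoogleImageCrawler):

    def set_proxy_pool(self):
        self.proxy_pool = ProxyPool()
        # shadowsocks listens locally as a SOCKS5 proxy (port 1080 by default);
        # requests needs the 'requests[socks]' extra to handle socks5:// URLs.
        self.proxy_pool.add_proxy(Proxy('socks5://127.0.0.1:1080', 'https'))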

