
hellock / icrawler

A multi-threaded crawler framework with many built-in image crawlers provided.

Home Page: http://icrawler.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
bing-image crawler flickr-api google-images python scrapy spider

icrawler's People

Contributors

dreamflasher, gcheron, hellock, innerlee, kethan1, knappmk, lightjohn, pasqlisena, peilin-yang, prairie-guy, redalice64, richylyq, sibojia, sleepless-se, taoyudong, xinntao, zhiyuanchen

icrawler's Issues

Crawl file

The documentation says that we can download videos or other types of files. I googled but haven't found any example of this. Can you give me an example of crawling a file that is not an image?

use proxy to download

I hope icrawler can support proxy services with flexible strategies as soon as possible. I think this would be useful in many applications.

By the way, icrawler is really an excellent image crawler framework, with multi-threading and good extensibility, and the code is well commented.

This framework greatly reduces my workload. Thanks a lot.

Adding additional file check before choosing to keep a file?

Hi! This is a really cool framework and I've been using it for some simple implementations and it has been working great.

Currently, what I'm trying to do is add an additional check of file criteria before the file is saved. I already have this checking code working for offline files that are already saved, but it's a little long to include here. As an example, a redundant check of file name/length/size would work. (Or I can provide the code if it matters!)

The process should be basically this:

  1. File is downloaded using icrawler
  2. File is checked to meet certain criteria and true/false is returned
  3. If the file meets the criteria, it's saved, if not, it's skipped

I've read the API reference, and I think maybe it can be done by overriding keep_file? But I only see the response/size parameters, and I'm not sure how to add or use my own.

keep_file
Parameters:
    response (Response) – response of requests.
    min_size (tuple or None) – minimum size of required images.
    max_size (tuple or None) – maximum size of required images.

The code example I've been using is this one, and it runs great by itself - I'm just not sure how to hook in that extra check before a file is saved or skipped.

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)

Thanks! I'm still new to programming and would really appreciate any help.
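A minimal sketch of one way this might be done, assuming a downloader subclass can override keep_file with the response/min_size/max_size parameters listed above (the exact signature can differ between icrawler versions, so check the installed source; the CheckingDownloader name and the two criteria are only illustrative):

from io import BytesIO

from PIL import Image  # Pillow, which icrawler already uses for the size check
from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

class CheckingDownloader(ImageDownloader):
    def keep_file(self, response, min_size=None, max_size=None, **kwargs):
        # Run the built-in size check first.
        if not super().keep_file(response, min_size, max_size):
            return False
        # Then apply custom criteria, e.g. reject very small files...
        if len(response.content) < 10 * 1024:  # fewer than ~10 KB
            return False
        # ...or keep only RGB images.
        img = Image.open(BytesIO(response.content))
        return img.mode == 'RGB'

google_crawler = GoogleImageCrawler(
    downloader_cls=CheckingDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)

Files for which keep_file returns False should simply be skipped instead of saved.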

How to setup a Proxy

Hi there,

I work behind a proxy and I cannot access https://www.google.com:

2017-02-26 22:55:54,615 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=1, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000032C5908>, 'Connection to www.google.com timed out. (connect timeout=5)')), remaining retry times: 2

I am not sure what the following means:

Then it will scan 10 valid overseas (out of mainland China) proxies and automatically use these proxies to request pages and images.

Is there a way to set up the proxy as with urllib2:

http_proxy = 'chdoleninet\\' + cnumber + ':' + windowspwd + '@sc-wvs-ch-win-pr-01-vip1.ch.doleni.net:8080'
proxy = urllib2.ProxyHandler({'http':http_proxy,
                              'https':http_proxy})                                          
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

Thanks a lot
Cheers
Fabien
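A hedged sketch of two ways this might be approached, assuming icrawler issues its HTTP requests through the requests library and (for the second option) exposes that session as crawler.session; the proxy URL below is a placeholder:

import os

from icrawler.builtin import GoogleImageCrawler

PROXY = 'http://user:password@proxy.example.com:8080'  # placeholder proxy address

# Option 1: requests honours the standard proxy environment variables,
# so exporting them before crawling is often enough.
os.environ['HTTP_PROXY'] = PROXY
os.environ['HTTPS_PROXY'] = PROXY

google_crawler = GoogleImageCrawler(storage={'root_dir': 'sunny'})

# Option 2 (assumption: the crawler's requests session is reachable as
# google_crawler.session): set the proxies on the session directly.
google_crawler.session.proxies = {'http': PROXY, 'https': PROXY}

google_crawler.crawl(keyword='sunny', max_num=100)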

google crawler stuck

I tried to use the built-in crawler to crawl images from Google. I have a list of keywords to crawl and the crawler works well with most of them. I only changed the keyword for searching, and the program always gets stuck at the same keyword.
The last several messages I got:
2017-07-20 12:03:15,237 - INFO - downloader - no more download task for thread downloader-003
2017-07-20 12:03:15,237 - INFO - downloader - thread downloader-003 exit
2017-07-20 12:03:15,480 - INFO - downloader - no more download task for thread downloader-004
2017-07-20 12:03:15,480 - INFO - downloader - thread downloader-004 exit
2017-07-20 12:03:16,180 - INFO - downloader - no more download task for thread downloader-002
2017-07-20 12:03:16,180 - INFO - downloader - thread downloader-002 exit
2017-07-20 12:03:16,832 - INFO - downloader - no more download task for thread downloader-001
2017-07-20 12:03:16,832 - INFO - downloader - thread downloader-001 exit

For a successful crawl, I got:
2017-07-20 13:13:06,827 - INFO - downloader - no more download task for thread downloader-001
2017-07-20 13:13:06,827 - INFO - downloader - thread downloader-001 exit
2017-07-20 13:13:06,966 - INFO - downloader - no more download task for thread downloader-003
2017-07-20 13:13:06,966 - INFO - downloader - thread downloader-003 exit
2017-07-20 13:13:07,104 - INFO - downloader - no more download task for thread downloader-002
2017-07-20 13:13:07,104 - INFO - downloader - thread downloader-002 exit
2017-07-20 13:13:11,862 - INFO - downloader - no more download task for thread downloader-004
2017-07-20 13:13:11,862 - INFO - downloader - thread downloader-004 exit
2017-07-20 13:13:12,439 - INFO - icrawler.crawler - Crawling task done!

How do I get 1000 images correctly?

I set it to download 1000 images, but it only downloads 250–350 images.
How do I get 1000 images correctly?

crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4, storage={'root_dir': TMP_FOLDER})
crawler.crawl(keyword=keyword, max_num=1000, date_min=None, date_max=None)

Q: How to return the file url for reporting?

Hi

I just found your nice package.

I currently use the Google image crawler. I am fine with the numerical filenames it produces, but for reporting I'd also like to store the URL, or at least the original filename, in a dict.

I'm a bit confused how I'd achieve that.

Cheers,
C
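A minimal sketch of one way to do this, using the get_filename hook of ImageDownloader (the same hook that appears in a later issue on this page); the ReportingDownloader name and the module-level dict are illustrative:

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

url_report = {}  # filename -> original URL, filled while crawling

class ReportingDownloader(ImageDownloader):
    def get_filename(self, task, default_ext):
        filename = super().get_filename(task, default_ext)
        url_report[filename] = task['file_url']
        return filename

google_crawler = GoogleImageCrawler(
    downloader_cls=ReportingDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=10)
print(url_report)  # e.g. {'000001.jpg': 'http://...', ...}

The dict can then be dumped to JSON or CSV for reporting.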

OSError: cannot identify image file

Firstly, I really appreciate your work!

I am currently using this to scrape Google images, but there seems to be an error I cannot solve.

Exception in thread downloader-02:
Traceback (most recent call last):
  File "/home/tn/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/tn/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tn/lib/python3.5/site-packages/icrawler-0.2.1-py3.5.egg/icrawler/downloader.py", line 228, in thread_run
    self.download(task, request_timeout, **kwargs)
  File "/home/tn/lib/python3.5/site-packages/icrawler-0.2.1-py3.5.egg/icrawler/downloader.py", line 124, in download
    img = Image.open(io.BytesIO(response.content))
  File "/home/tn/lib/python3.5/site-packages/Pillow-3.2.0-py3.5-linux-x86_64.egg/PIL/Image.py", line 2309, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x2aaabc453a40>

Could you assist me with this?

Is the date range inclusive or exclusive?

Hi, thanks for the awesome code. I am using it to download Google images and it works perfectly. I just have a question: is the date range inclusive or exclusive?

I mean, if I set date_min=1, date_max=5, how many days' data am I downloading?

Thanks a lot.

Is date_min and date_max for Google Image Search working?

Hi, thanks for the great package!

I am trying to get the minimum and maximum date working for Google Image Search. My minimal example

from icrawler.builtin import GoogleImageCrawler
from datetime import date

def scrape():
    date_min = date(1985, 1, 1)
    date_max = date(1990, 1, 1)
    with open("labels.txt", mode="r") as f:
        for j in range(sum(1 for line in open("labels.txt", mode="r"))):
            label = "{}".format(f.readline().rstrip())
            google_crawler = GoogleImageCrawler(
                parser_threads=4, downloader_threads=20, storage={'root_dir': 'test/{}'.format(label)})
            google_crawler.crawl(keyword=label, max_num=1000, date_min=date_min, date_max=date_max)

if __name__ == "__main__":
    scrape()

is not working as expected. It should download no images, but it downloads images from all time periods. Am I doing something wrong?

Thanks for your help!

Tips for IP not getting banned by google

Just curious whether people have had their IP blacklisted by Google while using this, and what tips there are to avoid it, e.g. a maximum number of requests in X amount of time. What measures does the crawler already take in this regard?

Error when using GoogleImageCrawler

Hi icrawler developers,
I've tried BingImageCrawler and GoogleImageCrawler. BingImageCrawler worked; however, the following sample code for GoogleImageCrawler didn't work. Can anyone help?

from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
                                    storage={'root_dir': 'dog'})
google_crawler.crawl(keyword='dog', max_num=1000,
                     date_min=None, date_max=None,
                     min_size=(10,10), max_size=None)

All I received was

2018-01-10 19:22:48,838 - INFO - icrawler.crawler - start crawling...
2018-01-10 19:22:48,839 - INFO - icrawler.crawler - starting 1 feeder threads...
2018-01-10 19:22:48,844 - INFO - feeder - thread feeder-001 exit
2018-01-10 19:22:48,844 - INFO - icrawler.crawler - starting 2 parser threads...
2018-01-10 19:22:48,849 - INFO - icrawler.crawler - starting 4 downloader threads...
2018-01-10 19:22:48,882 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 2
2018-01-10 19:22:48,913 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 1
2018-01-10 19:22:48,940 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 0
2018-01-10 19:22:50,850 - INFO - parser - no more page urls for thread parser-002 to parse
2018-01-10 19:22:50,850 - INFO - parser - thread parser-002 exit
2018-01-10 19:22:50,943 - INFO - parser - no more page urls for thread parser-001 to parse
2018-01-10 19:22:50,943 - INFO - parser - thread parser-001 exit
2018-01-10 19:22:53,851 - INFO - downloader - no more download task for thread downloader-001
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-002
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-004
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-003
2018-01-10 19:23:00,710 - INFO - downloader - thread downloader-001 exit
2018-01-10 19:23:00,714 - INFO - downloader - thread downloader-002 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-004 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-003 exit
2018-01-10 19:23:00,859 - INFO - icrawler.crawler - Crawling task done!

Accessing task_queue

I am trying to access task_queue to get at the task dictionaries, but I am unable to. Can someone please suggest how to go about it?
Thanks

hello, the program crashed.

Traceback (most recent call last):
  File "G:/imdbfull/bingcrawler.py", line 27, in <module>
    content = urllib2.urlopen(url).read()
  File "F:\Anaconda2\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "F:\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "F:\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "F:\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "F:\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "F:\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10060] >

setup prefix for file

Hello, is it possible to set up a prefix for the files generated by the Google/Bing/Flickr downloader?

For example, files could start with google_0001.jpg or something like that.

thanks
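There does not seem to be a built-in option for this in the thread above, but a minimal sketch using the get_filename hook of ImageDownloader (shown in a later issue on this page) could look like the following; the PrefixedDownloader name and the prefix attribute are illustrative:

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

class PrefixedDownloader(ImageDownloader):
    # Prepend a fixed prefix to the default numeric filenames.
    prefix = 'google_'

    def get_filename(self, task, default_ext):
        return self.prefix + super().get_filename(task, default_ext)

google_crawler = GoogleImageCrawler(
    downloader_cls=PrefixedDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=10)  # saves google_000001.jpg, google_000002.jpg, ...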

Exception in parser causes program freeze

When I crawl pictures with 4 parsers, I see this exception:

Exception in thread parser-001:
Traceback (most recent call last):
  File "/home/q/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/q/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/q/anaconda2/lib/python2.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "/home/q/anaconda2/lib/python2.7/site-packages/icrawler/builtin/bing.py", line 21, in parse
    img_url = '{}.jpg'.format(match.group(1))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 48-56: ordinal not in range(128)
...
2017-05-05 16:36:35,446 - INFO - downloader - no more download task for thread downloader-001
2017-05-05 16:36:35,447 - INFO - downloader - thread downloader-001 exit
2017-05-05 16:36:35,634 - INFO - downloader - no more download task for thread downloader-003
2017-05-05 16:36:35,635 - INFO - downloader - thread downloader-003 exit
2017-05-05 16:36:41,574 - INFO - downloader - no more download task for thread downloader-004
2017-05-05 16:36:41,574 - INFO - downloader - thread downloader-004 exit
2017-05-05 16:36:56,608 - INFO - downloader - no more download task for thread downloader-002
2017-05-05 16:36:56,608 - INFO - downloader - thread downloader-002 exit

Then the program froze.

Infinite loop even if work finished

crawler.py

        while True:
            if threading.active_count() <= 1:
                break

The crawler never stops if there was already more than one thread running, e.g. if you are running this on a web server it will never end. I simply disabled this check to get things working.

Baidu invalid escape | JSON

Hey again,

I'm currently trying to download from Google, Bing and Baidu.

Google and Bing work perfectly, but Baidu gives me this error:

Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/site-packages/icrawler/parser.py", line 105, in worker_exec
    for task in self.parse(response, **kwargs):
  File "/usr/lib/python3.6/site-packages/icrawler/builtin/baidu.py", line 32, in parse
    content = json.loads(response.content.decode('utf-8'))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 8 column 113 (char 10372)

Also, for another part of the JSON:

json.decoder.JSONDecodeError: Invalid \escape: line 20 column 110 (char 19232)

Could you give me a hint about what the problem might be? Or is Baidu really returning invalid JSON?

The search query was still "Duck".

Thank you!

How to save the file name differently

Hi, I am trying to save pictures from each URL, and I'd like to know which pictures come from the same URL. I tried to change the code in downloader.py, but it still gives me 1, 2, 3... filenames for each image. If possible, I am wondering how to change the downloaded image filenames to 1_1, 2_1, and so on, i.e. first picture from the first URL, second picture from the first URL, or any format that helps.

I am not an expert in coding and am trying to use this for my dissertation. I'd really appreciate it if anyone could help!

Trying to use other downloader_cls

Hi there,

first of all: keep up the good work! I really like icrawler. :)

I'm currently trying to use the GoogleImageCrawler, but I want to substitute the downloader class. I tried overriding it like this:

from icrawler import ImageDownloader

class AdvancedDownloader(ImageDownloader):
	def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
		print(task['file_url'])
		ImageDownloader.download(task, default_ext, timeout=5, max_retry=3, **kwargs)

calling it like this:

from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedDownloader

google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
	date_min=None, date_max=None, feeder_thr_num=2,
	parser_thr_num=2, downloader_thr_num=8,
	min_size=(200,200), max_size=None)

Result:

Traceback (most recent call last):
  File "/home/user/projects/test.py", line 4, in <module>
    google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
  File "/usr/lib/python3.5/site-packages/icrawler/builtin/google.py", line 45, in __init__
    **kwargs)
TypeError: __init__() got multiple values for keyword argument 'downloader_cls'

I also tried overwriting the whole GoogleImageCrawler class, but the same Error came up.

So, how do I do it?

Background: I want to use multiple Crawlers (Bing, Baidu, Google) and want to check if I already downloaded the exact same URL (and maybe also check md5) to avoid duplicates.
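One possible workaround, sketched under the assumption that a recent icrawler version accepts the storage dict and downloader_cls as keyword arguments (as in the working example further down this page), is to avoid the positional root-dir argument so it cannot collide with downloader_cls. The override itself should probably also call super().download(task, default_ext, ...) rather than the unbound ImageDownloader.download(task, ...):

from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedDownloader  # module name as in the question

google_crawler = GoogleImageCrawler(
    downloader_cls=AdvancedDownloader,
    parser_threads=2,
    downloader_threads=8,
    storage={'root_dir': '/home/user/Downloads/test'})
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     min_size=(200, 200), max_size=None)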

How does the cache mechanism work?

In case of an exception and rerunning the crawler:

  1. Does it cache the HTML page?
  2. Does it cache the parsed HTML?
  3. Does it cache the image URL list?

thanks

The interval for icrawler

First, I want to say thank you. icrawler has helped me a lot.
I have read the source code but didn't find a way to specify the interval for crawling
a website. Is it possible to specify an interval for icrawler?

Parser Error On Running Google Image Crawler

Hi, thanks for the great project. So far I have crawled Baidu and Bing images smoothly, but I have 2 issues:
1.
I am using python3.6 on MacOS Sierra.
The Google image crawler passed the test, but it didn't create a folder and didn't download any images. When I run the Google image crawler example, the log reads:
"
ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=1&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=1&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A&tbm=isch&lr= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x104856278>: Failed to establish a new connection: [Errno 65] No route to host',)), remaining retry times: 6
"
The Baidu and Bing image crawlers work well in my environment.
2.
Is it possible to download more than 1000 images?

Any ideas would be helpful.
Thanks!

Python 2.7 support

Hey, I tried to run the Google example but got the following error:

Traceback (most recent call last):
  File "google.py", line 1, in <module>
    from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/__init__.py", line 4, in <module>
    from .crawler import Crawler
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/crawler.py", line 10, in <module>
    from icrawler import storage as storage_package
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/storage/__init__.py", line 1, in <module>
    from .base import BaseStorage
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/storage/base.py", line 4
    class BaseStorage(metaclass=ABCMeta):
                                        ^
SyntaxError: invalid syntax

parser error(?) in flickr crawler

When I tell it to crawl 1000 images, I get the
<parser - no more page urls for thread parser-001 to parse> message around the 500th image.

Does that mean there are no more images?
But when I search on the Flickr site, there are hundreds of thousands more images.

Is there an easy way to specify to the crawler what language to use?

corrupted images after crawling

Thank you for providing an amazing tool! I'm looking to crawl through some images using the built-in Flickr crawler, but after downloading, this is what I see in my directory:
[screenshot: corrupted image files]

my parameters:
flickr_crawler.crawl(max_num=100, tags='construction', user_id='59595815@N03')

Any ideas?

How to change root_dir in storage argument?

I was trying to change the root_dir as follows:

google_crawler = GoogleImageCrawler(feeder_threads=1, parser_threads=1, downloader_threads=4, storage=storage)
google_crawler.set_storage(new_storage)

But it doesn't seem to work. Did I do it the wrong way?
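A sketch that sidesteps the question, under the assumption that constructing a fresh crawler per destination directory is acceptable (the storage dict with root_dir is passed at construction time, as in the other examples on this page):

from icrawler.builtin import GoogleImageCrawler

# Build a new crawler for each destination directory instead of mutating
# the storage of an existing crawler.
for keyword, root_dir in [('cat', 'images/cat'), ('dog', 'images/dog')]:
    crawler = GoogleImageCrawler(
        feeder_threads=1, parser_threads=1, downloader_threads=4,
        storage={'root_dir': root_dir})
    crawler.crawl(keyword=keyword, max_num=100)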

How to save URLs in a JSON file instead of downloading image files

Hello! Thank you very much for making icrawler. I'm using the library in useful ways. :)
I have a question. I don't want to download images directly; I just want to get the URLs of the images. I wrote the following based on #34, which was posted earlier. This is my code.

import base64
from collections import OrderedDict

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler
from six.moves.urllib.parse import urlparse


class MyImageDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        url_real = OrderedDict()
        url_path = urlparse(task['file_url'])[2]
        #print(task['file_url'])
        url_real['url'] = task['file_url']
        # print(url_real)
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        filename = base64.b64encode(url_path.encode()).decode()
        url_real['file_name'] = '{}.{}'.format(filename, extension)
        print(url_real)
        return '{}.{}'.format(filename, extension)

def get_json(keyword, save, num):
    google_crawler = GoogleImageCrawler(
        downloader_cls=MyImageDownloader,
        downloader_threads=4,
        storage={'root_dir': save})
    google_crawler.crawl(keyword=keyword, max_num=num)         

get_json('sugar glider', '/Users/user/Downloads/url_test', 1000)

Running this code still saves the images in the directory, but I don't need the images. Is there a good way to do this? In short, I want to save the URL and filename in a JSON file without downloading the image files.
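A heavily hedged sketch of one approach: override the downloader's download method (whose signature appears in an earlier issue on this page) so that it records task['file_url'] instead of fetching anything. Note the assumptions: the task['success'] flag and the effect of skipping the parent implementation are not guaranteed across icrawler versions, and because the default bookkeeping that counts fetched images is bypassed, the crawl may not stop exactly at max_num:

import json
import threading

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

_lock = threading.Lock()
_urls = []  # collected image URLs

class UrlOnlyDownloader(ImageDownloader):
    # Record the image URL instead of downloading the file.
    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        with _lock:
            _urls.append(task['file_url'])
        task['success'] = True  # assumption: flag used by icrawler's bookkeeping

google_crawler = GoogleImageCrawler(
    downloader_cls=UrlOnlyDownloader,
    downloader_threads=4,
    storage={'root_dir': '/Users/user/Downloads/url_test'})  # unused, but still required
google_crawler.crawl(keyword='sugar glider', max_num=1000)

with open('urls.json', 'w') as f:
    json.dump(_urls, f, indent=2)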

Date/Time format

Can't seem to figure out the date/time format for min_size= and max_size=.

More documentation on "How do I use my own proxy?"

Hi there.

First I really appreciate your amazing work on icrawler. It even comes with built-in proxies available for use. Awesome!

Yet, the documentation on using custom proxies is quite ambiguous and implicit. I had to check the source of icrawler.utils.set_proxy_pool and examine the output JSON produced by icrawler.utils.proxy_pool.default_scan to guess that supplying JSON like the one below to icrawler.utils.Proxypool would make the crawler use my local proxy (if I understand correctly).

{
  "http": [
    {
      "addr": "127.0.0.1:8081",
      "protocol": "http",
      "weight": 1.0,
      "last_checked": 1537064366
    }
  ],
  "https": []
}

If my guess is correct, would you like to update your documentation or shall I open a pull request on this?

images are sometimes saved with wrong extensions

Some image URLs don't end with '.jpg' or '.png'; instead, additional text with slashes is appended. When such images are saved to the folder, a couple of subfolders are created. For example:
the image with URL https://vignette3.wikia.nocookie.net/hotwheels/images/5/5d/Tesla_Model_X_DTX01.png/revision/latest?cb=20170412160359
is saved to folder/000264.png/revision/latest
The file "latest" is a valid image.
Here are more examples of such URLs that reproduce the problem:
https://teslamotorsclub.com/tmc/media/solid-black-tesla-model-x-22-inch-wheel-ts115-matte-black-2.116043/full
https://s.yimg.com/uu/api/res/1.2/RTEqzLkvrxNuNROqK4G2OQ--/Zmk9c3RyaW07aD01NjI7cT04MDt3PTEwMDA7c209MTthcHBpZD15dGFjaHlvbg--/http://l.yimg.com/yp/offnetwork/1971a974e71f0dd283dd6e987bb5bcc7
