
hellock / icrawler

A multi-threaded crawler framework with many built-in image crawlers provided.

Home Page: http://icrawler.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
bing-image crawler flickr-api google-images python scrapy spider

icrawler's People

Contributors

dreamflasher, gcheron, hellock, innerlee, kethan1, knappmk, lightjohn, pasqlisena, peilin-yang, prairie-guy, redalice64, richylyq, sibojia, sleepless-se, taoyudong, xinntao, zhiyuanchen

icrawler's Issues

Crawl file

The documentation says that we can download videos or other types of files. I googled but haven't found any example of this. Can you give me an example of crawling a file that is not an image?

use proxy to download

I hope icrawler can support proxy services with flexible strategies as soon as possible. I think this would be useful in many applications.

By the way, icrawler is really an excellent image crawler framework, with multi-threading and good extensibility, and the code is well commented.

This framework greatly reduces my workload. Thanks a lot.

Adding additional file check before choosing to keep a file?

Hi! This is a really cool framework and I've been using it for some simple implementations and it has been working great.

Currently, what I'm trying to do is add an additional check of file criteria before the file is saved. I already have this checking code working for offline files that are already saved, but it's a little long to include here. As an example, a redundant check of file name/length/size would work. (Or I can provide the code if it matters!)

The process should be basically this:

  1. File is downloaded using icrawler
  2. File is checked to meet certain criteria and true/false is returned
  3. If the file meets the criteria, it's saved, if not, it's skipped

I've read the API reference, and I think maybe it can be done by overriding keep_file? But I only see the response/size parameters, and I'm not sure how to add or use my own.

keep_file
Parameters:
    response (Response) – response of requests.
    min_size (tuple or None) – minimum size of required images.
    max_size (tuple or None) – maximum size of required images.

The code example I've been using is this one, and it runs great by itself - I'm just not sure how to hook in that extra check before a file is saved or skipped.

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)

Thanks! I'm still new to programming and would really appreciate any help.
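A minimal sketch of one way this might be done, assuming a downloader subclass can override keep_file with the response/min_size/max_size parameters listed above (the exact signature can differ between icrawler versions, so check the installed source; the CheckingDownloader name and the two criteria are only illustrative):

from io import BytesIO

from PIL import Image  # Pillow, which icrawler already uses for the size check
from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

class CheckingDownloader(ImageDownloader):
    def keep_file(self, response, min_size=None, max_size=None, **kwargs):
        # Run the built-in size check first.
        if not super().keep_file(response, min_size, max_size):
            return False
        # Then apply custom criteria, e.g. reject very small files...
        if len(response.content) < 10 * 1024:  # fewer than ~10 KB
            return False
        # ...or keep only RGB images.
        img = Image.open(BytesIO(response.content))
        return img.mode == 'RGB'

google_crawler = GoogleImageCrawler(
    downloader_cls=CheckingDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)

Files for which keep_file returns False should simply be skipped instead of saved.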

How to setup a Proxy

Hi there,

I work behind a proxy and I cannot access https://www.google.com:

2017-02-26 22:55:54,615 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=1, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A&tbm=isch&ijn=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x00000000032C5908>, 'Connection to www.google.com timed out. (connect timeout=5)')), remaining retry times: 2

I am not sure what the following means:

Then it will scan 10 valid overseas (out of mainland China) proxies and automatically use these proxies to request pages and images.

Is there a way to set up the proxy as with urllib2:

http_proxy = 'chdoleninet\\' + cnumber + ':' + windowspwd + '@sc-wvs-ch-win-pr-01-vip1.ch.doleni.net:8080'
proxy = urllib2.ProxyHandler({'http':http_proxy,
                              'https':http_proxy})                                          
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

Thanks a lot
Cheers
Fabien
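A hedged sketch of two ways this might be approached, assuming icrawler issues its HTTP requests through the requests library and (for the second option) exposes that session as crawler.session; the proxy URL below is a placeholder:

import os

from icrawler.builtin import GoogleImageCrawler

PROXY = 'http://user:password@proxy.example.com:8080'  # placeholder proxy address

# Option 1: requests honours the standard proxy environment variables,
# so exporting them before crawling is often enough.
os.environ['HTTP_PROXY'] = PROXY
os.environ['HTTPS_PROXY'] = PROXY

google_crawler = GoogleImageCrawler(storage={'root_dir': 'sunny'})

# Option 2 (assumption: the crawler's requests session is reachable as
# google_crawler.session): set the proxies on the session directly.
google_crawler.session.proxies = {'http': PROXY, 'https': PROXY}

google_crawler.crawl(keyword='sunny', max_num=100)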

google crawler stuck

I tried to use the built-in crawler to crawl images from Google. I have a list of keywords to crawl and the crawler works well with most of them. I only changed the keyword for searching, and the program always gets stuck at the same keyword.
The last several messages I got:
2017-07-20 12:03:15,237 - INFO - downloader - no more download task for thread downloader-003
2017-07-20 12:03:15,237 - INFO - downloader - thread downloader-003 exit
2017-07-20 12:03:15,480 - INFO - downloader - no more download task for thread downloader-004
2017-07-20 12:03:15,480 - INFO - downloader - thread downloader-004 exit
2017-07-20 12:03:16,180 - INFO - downloader - no more download task for thread downloader-002
2017-07-20 12:03:16,180 - INFO - downloader - thread downloader-002 exit
2017-07-20 12:03:16,832 - INFO - downloader - no more download task for thread downloader-001
2017-07-20 12:03:16,832 - INFO - downloader - thread downloader-001 exit

For a successful crawl, I got:
2017-07-20 13:13:06,827 - INFO - downloader - no more download task for thread downloader-001
2017-07-20 13:13:06,827 - INFO - downloader - thread downloader-001 exit
2017-07-20 13:13:06,966 - INFO - downloader - no more download task for thread downloader-003
2017-07-20 13:13:06,966 - INFO - downloader - thread downloader-003 exit
2017-07-20 13:13:07,104 - INFO - downloader - no more download task for thread downloader-002
2017-07-20 13:13:07,104 - INFO - downloader - thread downloader-002 exit
2017-07-20 13:13:11,862 - INFO - downloader - no more download task for thread downloader-004
2017-07-20 13:13:11,862 - INFO - downloader - thread downloader-004 exit
2017-07-20 13:13:12,439 - INFO - icrawler.crawler - Crawling task done!

How do I get 1000 images correctly?

I set it to download 1000 images, but it only downloads 250–350 images.
How do I get 1000 images correctly?

crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4, storage={'root_dir': TMP_FOLDER})
crawler.crawl(keyword=keyword, max_num=1000, date_min=None, date_max=None)

Q: How to return the file url for reporting?

Hi

I just found your nice package.

I currently use the Google image crawler. I am fine with the numerical filenames it produces, but for reporting I'd also like to store the URL, or at least the original filename, in a dict.

I'm a bit confused how I'd achieve that.

Cheers,
C
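A minimal sketch of one way to do this, using the get_filename hook of ImageDownloader (the same hook that appears in a later issue on this page); the ReportingDownloader name and the module-level dict are illustrative:

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

url_report = {}  # filename -> original URL, filled while crawling

class ReportingDownloader(ImageDownloader):
    def get_filename(self, task, default_ext):
        filename = super().get_filename(task, default_ext)
        url_report[filename] = task['file_url']
        return filename

google_crawler = GoogleImageCrawler(
    downloader_cls=ReportingDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=10)
print(url_report)  # e.g. {'000001.jpg': 'http://...', ...}

The dict can then be dumped to JSON or CSV for reporting.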

OSError: cannot identify image file

Firstly, I really appreciate your work!

I am currently using this to scrape Google images, but there seems to be an error I cannot solve.

Exception in thread downloader-02:
Traceback (most recent call last):
  File "/home/tn/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/tn/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tn/lib/python3.5/site-packages/icrawler-0.2.1-py3.5.egg/icrawler/downloader.py", line 228, in thread_run
    self.download(task, request_timeout, **kwargs)
  File "/home/tn/lib/python3.5/site-packages/icrawler-0.2.1-py3.5.egg/icrawler/downloader.py", line 124, in download
    img = Image.open(io.BytesIO(response.content))
  File "/home/tn/lib/python3.5/site-packages/Pillow-3.2.0-py3.5-linux-x86_64.egg/PIL/Image.py", line 2309, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x2aaabc453a40>

Could you assist me with this?

Is the date range inclusive or exclusive?

Hi, thanks for the awesome code. I am using it to download Google images and it works perfectly. I just have a question: is the date range inclusive or exclusive?

I mean, if I set date_min=1, date_max=5, how many days' data am I downloading?

Thanks a lot.

Is date_min and date_max for Google Image Search working?

Hi, thanks for the great package!

I am trying to get the minimum and maximum date working for Google Image Search. My minimal example

from icrawler.builtin import GoogleImageCrawler
from datetime import date

def scrape():
    date_min = date(1985, 1, 1)
    date_max = date(1990, 1, 1)
    with open("labels.txt", mode="r") as f:
        for j in range(sum(1 for line in open("labels.txt", mode="r"))):
            label = "{}".format(f.readline().rstrip())
            google_crawler = GoogleImageCrawler(
                parser_threads=4, downloader_threads=20, storage={'root_dir': 'test/{}'.format(label)})
            google_crawler.crawl(keyword=label, max_num=1000, date_min=date_min, date_max=date_max)

if __name__ == "__main__":
    scrape()

is not working as expected. It should download no images, but it downloads images from all time periods. Am I doing something wrong?

Thanks for your help!

Tips for IP not getting banned by google

Just curious whether people have had their IP blacklisted by Google while using this, and what tips there are to avoid it, e.g. a maximum number of requests in X amount of time. What measures does the crawler already take in this regard?

Error when using GoogleImageCrawler

Hi icrawler developers,
I've tried BingImageCrawler and GoogleImageCrawler. BingImageCrawler worked; however, the following sample code for GoogleImageCrawler didn't work. Can anyone help?

from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
                                    storage={'root_dir': 'dog'})
google_crawler.crawl(keyword='dog', max_num=1000,
                     date_min=None, date_max=None,
                     min_size=(10,10), max_size=None)

All I received was

2018-01-10 19:22:48,838 - INFO - icrawler.crawler - start crawling...
2018-01-10 19:22:48,839 - INFO - icrawler.crawler - starting 1 feeder threads...
2018-01-10 19:22:48,844 - INFO - feeder - thread feeder-001 exit
2018-01-10 19:22:48,844 - INFO - icrawler.crawler - starting 2 parser threads...
2018-01-10 19:22:48,849 - INFO - icrawler.crawler - starting 4 downloader threads...
2018-01-10 19:22:48,882 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 2
2018-01-10 19:22:48,913 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 1
2018-01-10 19:22:48,940 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 0
2018-01-10 19:22:50,850 - INFO - parser - no more page urls for thread parser-002 to parse
2018-01-10 19:22:50,850 - INFO - parser - thread parser-002 exit
2018-01-10 19:22:50,943 - INFO - parser - no more page urls for thread parser-001 to parse
2018-01-10 19:22:50,943 - INFO - parser - thread parser-001 exit
2018-01-10 19:22:53,851 - INFO - downloader - no more download task for thread downloader-001
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-002
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-004
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-003
2018-01-10 19:23:00,710 - INFO - downloader - thread downloader-001 exit
2018-01-10 19:23:00,714 - INFO - downloader - thread downloader-002 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-004 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-003 exit
2018-01-10 19:23:00,859 - INFO - icrawler.crawler - Crawling task done!

Accessing task_queue

I am trying to access task_queue to get at the task dictionaries, but I am unable to. Can someone please suggest how to go about it?
Thanks

hello, the program crashed.

Traceback (most recent call last):
  File "G:/imdbfull/bingcrawler.py", line 27, in <module>
    content = urllib2.urlopen(url).read()
  File "F:\Anaconda2\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "F:\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "F:\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "F:\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "F:\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "F:\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10060] >

setup prefix for file

Hello, is it possible to set up a prefix for the files generated by the Google/Bing/Flickr downloader?

For example, files could start with google_0001.jpg or something like that.

thanks
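There does not seem to be a built-in option for this in the thread above, but a minimal sketch using the get_filename hook of ImageDownloader (shown in a later issue on this page) could look like the following; the PrefixedDownloader name and the prefix attribute are illustrative:

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

class PrefixedDownloader(ImageDownloader):
    # Prepend a fixed prefix to the default numeric filenames.
    prefix = 'google_'

    def get_filename(self, task, default_ext):
        return self.prefix + super().get_filename(task, default_ext)

google_crawler = GoogleImageCrawler(
    downloader_cls=PrefixedDownloader,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=10)  # saves google_000001.jpg, google_000002.jpg, ...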

Exception in parser causes program freeze

When I crawl pictures with 4 parsers, I see this exception:

Exception in thread parser-001:
Traceback (most recent call last):
  File "/home/q/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/q/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/q/anaconda2/lib/python2.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "/home/q/anaconda2/lib/python2.7/site-packages/icrawler/builtin/bing.py", line 21, in parse
    img_url = '{}.jpg'.format(match.group(1))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 48-56: ordinal not in range(128)
...
2017-05-05 16:36:35,446 - INFO - downloader - no more download task for thread downloader-001
2017-05-05 16:36:35,447 - INFO - downloader - thread downloader-001 exit
2017-05-05 16:36:35,634 - INFO - downloader - no more download task for thread downloader-003
2017-05-05 16:36:35,635 - INFO - downloader - thread downloader-003 exit
2017-05-05 16:36:41,574 - INFO - downloader - no more download task for thread downloader-004
2017-05-05 16:36:41,574 - INFO - downloader - thread downloader-004 exit
2017-05-05 16:36:56,608 - INFO - downloader - no more download task for thread downloader-002
2017-05-05 16:36:56,608 - INFO - downloader - thread downloader-002 exit

Then the program froze.

Infinite loop even if work finished

crawler.py

        while True:
            if threading.active_count() <= 1:
                break

The crawler never stops if there was already more than one thread running, e.g. if you are running this on a web server it will never end. I simply disabled this check to get things working.

Baidu invalid escape | JSON

Hey again,

I'm currently trying to download from Google, Bing and Baidu.

Google and Bing work perfectly, but Baidu gives me this error:

Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/site-packages/icrawler/parser.py", line 105, in worker_exec
    for task in self.parse(response, **kwargs):
  File "/usr/lib/python3.6/site-packages/icrawler/builtin/baidu.py", line 32, in parse
    content = json.loads(response.content.decode('utf-8'))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 8 column 113 (char 10372)

Also, for another part of the JSON:

json.decoder.JSONDecodeError: Invalid \escape: line 20 column 110 (char 19232)

Could you give me a hint about what the problem might be? Or is Baidu really returning invalid JSON?

The search query was still "Duck".

Thank you!

How to save the file name differently

Hi, I am trying to save pictures from each URL, and I'd like to know which pictures come from the same URL. I tried to change the code in downloader.py, but it still gives me 1, 2, 3... filenames for each image. If possible, I am wondering how to change the downloaded image filenames to 1_1, 2_1, and so on, i.e. first picture from the first URL, second picture from the first URL, or any format that helps.

I am not an expert in coding and am trying to use this for my dissertation. I'd really appreciate it if anyone could help!

Trying to use other downloader_cls

Hi there,

first of all: keep up the good work! I really like icrawler. :)

I'm currently trying to use the GoogleImageCrawler, but I want to substitute the downloader class. I tried overriding it like this:

from icrawler import ImageDownloader

class AdvancedDownloader(ImageDownloader):
	def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
		print(task['file_url'])
		ImageDownloader.download(task, default_ext, timeout=5, max_retry=3, **kwargs)

calling it like this:

from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedDownloader

google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
	date_min=None, date_max=None, feeder_thr_num=2,
	parser_thr_num=2, downloader_thr_num=8,
	min_size=(200,200), max_size=None)

Result:

Traceback (most recent call last):
  File "/home/user/projects/test.py", line 4, in <module>
    google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
  File "/usr/lib/python3.5/site-packages/icrawler/builtin/google.py", line 45, in __init__
    **kwargs)
TypeError: __init__() got multiple values for keyword argument 'downloader_cls'

I also tried overwriting the whole GoogleImageCrawler class, but the same Error came up.

So, how do I do it?

Background: I want to use multiple Crawlers (Bing, Baidu, Google) and want to check if I already downloaded the exact same URL (and maybe also check md5) to avoid duplicates.
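One possible workaround, sketched under the assumption that a recent icrawler version accepts the storage dict and downloader_cls as keyword arguments (as in the working example further down this page), is to avoid the positional root-dir argument so it cannot collide with downloader_cls. The override itself should probably also call super().download(task, default_ext, ...) rather than the unbound ImageDownloader.download(task, ...):

from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedDownloader  # module name as in the question

google_crawler = GoogleImageCrawler(
    downloader_cls=AdvancedDownloader,
    parser_threads=2,
    downloader_threads=8,
    storage={'root_dir': '/home/user/Downloads/test'})
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     min_size=(200, 200), max_size=None)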

How does the cache mechanism work?

In case of an exception and rerunning the crawler:

  1. Does it cache the HTML page?
  2. Does it cache the parsed HTML?
  3. Does it cache the image URL list?

thanks

The interval for icrawler

First, I want to say thank you. icrawler has helped me a lot.
I have read the source code but didn't find a way to specify the interval for crawling
a website. Is it possible to specify an interval for icrawler?

Parser Error On Running Google Image Crawler

Hi, thanks for the great project. So far I have crawled Baidu and Bing images smoothly, but I have 2 issues:
1.
I am using python3.6 on MacOS Sierra.
The Google image crawler passed the test, but it didn't create a folder and didn't download any images. When I run the Google image crawler example, the log reads:
"
ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=sunny&ijn=1&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=sunny&ijn=1&start=100&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A&tbm=isch&lr= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x104856278>: Failed to establish a new connection: [Errno 65] No route to host',)), remaining retry times: 6
"
The Baidu and Bing image crawlers work well in my environment.
2.
Is it possible to download more than 1000 images?

Any ideas would be helpful.
Thanks!

Python 2.7 support

Hey, I tried to run the Google example but got the following error:

Traceback (most recent call last):
  File "google.py", line 1, in <module>
    from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/__init__.py", line 4, in <module>
    from .crawler import Crawler
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/crawler.py", line 10, in <module>
    from icrawler import storage as storage_package
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/storage/__init__.py", line 1, in <module>
    from .base import BaseStorage
  File "/home/oak/venv2/local/lib/python2.7/site-packages/icrawler/storage/base.py", line 4
    class BaseStorage(metaclass=ABCMeta):
                                        ^
SyntaxError: invalid syntax

parser error(?) in flickr crawler

When I tell it to crawl 1000 images, I get the
<parser - no more page urls for thread parser-001 to parse> message around the 500th image.

Does that mean there are no more images?
But when I search on the Flickr site, there are hundreds of thousands more images.

Is there an easy way to specify to the crawler what language to use?

corrupted images after crawling

Thank you for providing an amazing tool! I'm looking to crawl through some images using the built-in Flickr crawler, but after downloading, this is what I see in my directory:
[screenshot: corrupted image files]

my parameters:
flickr_crawler.crawl(max_num=100, tags='construction', user_id='59595815@N03')

Any ideas?

How to change root_dir in storage argument?

I was trying to change the root_dir as follows:

google_crawler = GoogleImageCrawler(feeder_threads=1, parser_threads=1, downloader_threads=4, storage=storage)
google_crawler.set_storage(new_storage)

But it doesn't seem to work. Did I do it the wrong way?
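A sketch that sidesteps the question, under the assumption that constructing a fresh crawler per destination directory is acceptable (the storage dict with root_dir is passed at construction time, as in the other examples on this page):

from icrawler.builtin import GoogleImageCrawler

# Build a new crawler for each destination directory instead of mutating
# the storage of an existing crawler.
for keyword, root_dir in [('cat', 'images/cat'), ('dog', 'images/dog')]:
    crawler = GoogleImageCrawler(
        feeder_threads=1, parser_threads=1, downloader_threads=4,
        storage={'root_dir': root_dir})
    crawler.crawl(keyword=keyword, max_num=100)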

How to save URLs in a JSON file instead of downloading image files

Hello! Thank you very much for making icrawler. I'm using the library in useful ways. :)
I have a question. I don't want to download images directly; I just want to get the URLs of the images. I wrote the following based on #34, which was posted earlier. This is my code.

import base64
from collections import OrderedDict

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler
from six.moves.urllib.parse import urlparse


class MyImageDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        url_real = OrderedDict()
        url_path = urlparse(task['file_url'])[2]
        #print(task['file_url'])
        url_real['url'] = task['file_url']
        # print(url_real)
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        filename = base64.b64encode(url_path.encode()).decode()
        url_real['file_name'] = '{}.{}'.format(filename, extension)
        print(url_real)
        return '{}.{}'.format(filename, extension)

def get_json(keyword, save, num):
    google_crawler = GoogleImageCrawler(
        downloader_cls=MyImageDownloader,
        downloader_threads=4,
        storage={'root_dir': save})
    google_crawler.crawl(keyword=keyword, max_num=num)         

get_json('sugar glider', '/Users/user/Downloads/url_test', 1000)

Running this code still saves the images in the directory, but I don't need the images. Is there a good way to do this? In short, I want to save the URL and filename in a JSON file without downloading the image files.
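A heavily hedged sketch of one approach: override the downloader's download method (whose signature appears in an earlier issue on this page) so that it records task['file_url'] instead of fetching anything. Note the assumptions: the task['success'] flag and the effect of skipping the parent implementation are not guaranteed across icrawler versions, and because the default bookkeeping that counts fetched images is bypassed, the crawl may not stop exactly at max_num:

import json
import threading

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

_lock = threading.Lock()
_urls = []  # collected image URLs

class UrlOnlyDownloader(ImageDownloader):
    # Record the image URL instead of downloading the file.
    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        with _lock:
            _urls.append(task['file_url'])
        task['success'] = True  # assumption: flag used by icrawler's bookkeeping

google_crawler = GoogleImageCrawler(
    downloader_cls=UrlOnlyDownloader,
    downloader_threads=4,
    storage={'root_dir': '/Users/user/Downloads/url_test'})  # unused, but still required
google_crawler.crawl(keyword='sugar glider', max_num=1000)

with open('urls.json', 'w') as f:
    json.dump(_urls, f, indent=2)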

Date/Time format

Can't seem to figure out the date/time format for min_size= and max_size=.

More documentation on "How do I use my own proxy?"

Hi there.

First I really appreciate your amazing work on icrawler. It even comes with built-in proxies available for use. Awesome!

Yet, the documentation on using custom proxies is quite ambiguous and implicit. I had to check the source of icrawler.utils.set_proxy_pool and examine the output JSON produced by icrawler.utils.proxy_pool.default_scan to guess that supplying JSON like the one below to icrawler.utils.Proxypool would make the crawler use my local proxy (if I understand correctly).

{
  "http": [
    {
      "addr": "127.0.0.1:8081",
      "protocol": "http",
      "weight": 1.0,
      "last_checked": 1537064366
    }
  ],
  "https": []
}

If my guess is correct, would you like to update your documentation or shall I open a pull request on this?

images are sometimes saved with wrong extensions

Some image URLs don't end with '.jpg' or '.png'; instead, additional text with slashes is appended. When such images are saved to the folder, a couple of subfolders are created. For example:
the image with URL https://vignette3.wikia.nocookie.net/hotwheels/images/5/5d/Tesla_Model_X_DTX01.png/revision/latest?cb=20170412160359
is saved to folder/000264.png/revision/latest
The file "latest" is a valid image.
Here are more examples of such URLs that reproduce the problem:
https://teslamotorsclub.com/tmc/media/solid-black-tesla-model-x-22-inch-wheel-ts115-matte-black-2.116043/full
https://s.yimg.com/uu/api/res/1.2/RTEqzLkvrxNuNROqK4G2OQ--/Zmk9c3RyaW07aD01NjI7cT04MDt3PTEwMDA7c209MTthcHBpZD15dGFjaHlvbg--/http://l.yimg.com/yp/offnetwork/1971a974e71f0dd283dd6e987bb5bcc7
