
jiepai's Introduction

Python3 网络爬虫开发实战 (Python 3 Web Crawler Development in Action)

This book shows how to build web crawlers with Python 3. It begins with a detailed walkthrough of environment setup and crawler fundamentals; it then covers request libraries such as urllib and requests, parsing libraries such as Beautiful Soup, XPath, and pyquery, and how to store scraped data as text or in various databases. A series of case studies demonstrates scraping Ajax-loaded data and crawling dynamic sites with Selenium and Splash. Next come practical techniques: crawling through proxies and maintaining a dynamic proxy pool, using ADSL dial-up proxies, solving image, GeeTest, tap, and grid CAPTCHAs, simulating logins, and maintaining a cookies pool. The book also draws on the characteristics of the mobile web to cover App scraping with Charles, mitmdump, and Appium, then introduces the pyspider and Scrapy frameworks and distributed crawling, and closes with Bloom filter efficiency optimization, deploying crawlers with Docker and Scrapyd, and managing them with Gerapy.

The book is published by Turing Education / Posts & Telecom Press. All rights reserved; reproduction is prohibited.

Author: 崔庆才 (Cui Qingcai)

Where to buy:

Reader group:

Video resources:

Python3 爬虫三大案例实战分享 (Three Hands-on Python 3 Crawler Case Studies)

自己动手,丰衣足食!Python3 网络爬虫实战案例 (Do It Yourself: Python 3 Web Crawler Case Studies)

jiepai's People

Contributors

germey · newcluge · orangeshen


jiepai's Issues

2020-01-27: image_detail is gone. How can all of an article's images still be fetched via Ajax?

In earlier issues everyone uses image_list, and I noticed it too, but image_list only holds the few images shown on the search page, not all of the images in the article. It is a completely different thing from the image_detail field the book refers to, and image_detail has disappeared from the response.

My question: other than re-fetching each article URL with GET, is there any other way to obtain all of the images?

The Toutiao API structure changed slightly (2018-08-01)

Below is the updated implementation:

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool


def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    base_url = 'https://www.toutiao.com/search_content/?'
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        if codes.ok == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None


def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            images = item.get('image_list')
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }


def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image,item %s' % item)


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 0
GROUP_END = 7

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()

A problem parsing the returned JSON

With the current parsing approach, most of the 街拍 (street snap) entries retrieved are not the usual street-snap photo sets.


Extracting article_url from each entry in data and then crawling those article URLs yields images that are much better in both quality and number.
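A minimal sketch of that approach; the `JSON.parse(...)` pattern and the `"url"` field are observations from this thread, not a stable API, so treat them as assumptions:

```python
import re

def extract_image_urls(article_html):
    """Pull image URLs out of the gallery JSON embedded in an article page
    (pattern observed in this thread; Toutiao may change it at any time)."""
    m = re.search(r'JSON\.parse\("(.*?)"\)', article_html, re.S)
    if not m:
        return []
    blob = m.group(1).replace('\\', '')   # strip the escaping layer around the embedded JSON
    return re.findall(r'"url":"(.*?)"', blob)
```

Fetching each article_url with requests and passing resp.text through this helper gives the complete image list, at the cost of one extra request per article.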

There is a small problem in line 39 of the code

38 for image in images:
39 origin_image = re.sub("list", "origin",image.get('url')
The comma after "origin" is a full-width (Chinese) comma, and a closing parenthesis is missing at the end of the line.
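With an ASCII comma and the closing parenthesis restored, the line behaves as intended; a quick check (the sample URL is a hypothetical path in the shape this thread discusses):

```python
import re

url = 'http://p3.pstatp.com/list/pgc-image/example'  # hypothetical sample path
origin_image = re.sub("list", "origin", url)
# origin_image == 'http://p3.pstatp.com/origin/pgc-image/example'
```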

就是其实把并没有把一个词条下所有的图片的爬取下来,我找了一晚上几乎都是的,图片并不全呀,有些博文根据image-list来的,可是这个标签下并没有包括所有图片。点进去一个词条,network里找图片的链接找不到,好像被编码了,新手求教,或许可以稍微指点下~

The image links on the page itself don't seem complete either. Was this ever solved? I've hit the same problem.


Originally posted by @czl-2019 in #25 (comment)

Sharing some code that combines walking the search pages with the photo galleries

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool
import re
import random

def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6787304267841324551; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6787304267841324551; csrftoken=6c8d91e61b7db691bfa45021a0e7e511; UM_distinctid=16ffb7f1fcfec-0d6566ad15973e-396a4605-144000-16ffb7f1fd02eb; s_v_web_id=k631j38t_qMPq6VOD_jioN_4lgi_BQB1_GGhAVEKoAmXJ; __tasessionId=ocfmlmswt1580527945483',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url = base_url + urlencode(params)
    # print(url)
    try:
        resp = requests.get(url, headers=headers)
        if 200  == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None
    
def get_images(json):
    headers = {
        'cookie': 'tt_webid=6787304267841324551; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6787304267841324551; csrftoken=6c8d91e61b7db691bfa45021a0e7e511; UM_distinctid=16ffb7f1fcfec-0d6566ad15973e-396a4605-144000-16ffb7f1fd02eb; s_v_web_id=k631j38t_qMPq6VOD_jioN_4lgi_BQB1_GGhAVEKoAmXJ; __tasessionId=ocfmlmswt1580527945483',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('title') is None:  # skip the unrelated leading entries
                continue
            title = re.sub('[\t]', '', item.get('title'))  # get the title
            url = item.get("article_url")  # get the article's own link
            if url is None:
                continue
            try:
                resp = requests.get(url,headers=headers)
                if 200  == resp.status_code:
                    images_pattern = re.compile('JSON.parse\("(.*?)"\),\n',re.S)
                    result = re.search(images_pattern,resp.text)
                    if result is None:  # not a gallery-style article
                        images = item.get('image_list')
                        for image in images:
                            origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))  # use origin/pgc-image for the original-size image
                            yield {
                                'image': origin_image,
                                'title': title
                            }
                    else:  # gallery-style article: grab the JSON-format data under gallery
                        url_pattern = re.compile('url(.*?)"width', re.S)
                        result1 = re.findall(url_pattern, result.group(1))
                        for i in range(len(result1)):
                            yield {
                                'image': "http://p%d.pstatp.com/origin/pgc-image/" % (random.randint(1, 10)) +
                                           result1[i][result1[i].rindex("u002F") + 5:result1[i].rindex('\\"')],  # the image URL
                                'title': title
                            }
            except requests.ConnectionError:  # if the sub-link fails, save the preview images from the search page instead
                for image in images:
                    origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))  # use origin/pgc-image for the original-size image
                    yield {
                        'image': origin_image,
                        'title': title
                    }
                
def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        os.makedirs(img_path)  # create the directory
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(), 
                file_suffix='jpg')  # path of the individual file
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content) 
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e)
        
def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        save_image(item)

if __name__ == '__main__':
    '''
    for i in range(3):
        main(20*i)
    '''
    pool = Pool()
    groups = ([x * 20 for x in range(0, 3)])
    pool.map(main, groups)

The code body is based on @Anodsaber's version and Mr. Cui's video walkthrough; I just added a few comments from my own learning.

The main change is in the get_images function, which gains a branch: if the sub-link opened from the search page turns out to be a gallery loaded as JSON, the code keeps going, extracts the URLs, and returns a generator; if the gallery is rendered directly, it falls back to the original behavior of taking the image URLs from the search page itself. An exception handler is also set up so that if opening the sub-link fails, the search-page extraction mode is used instead.

While crawling the gallery sub-links I found image URLs of the form http://p1.pstatp.com/origin/pgc-image/path, where the digit after p jumps around randomly. I don't really understand the mechanism behind it, so to be safe I added the small detail of constructing the URL with a randomly drawn digit.
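The host rotation described above can be isolated into a small helper; the p1–p10 range is only an observation from this thread, not documented behavior, and the function name is my own:

```python
import random

def gallery_image_url(path):
    """Build an image URL on one of the rotating pstatp hosts (observed p1..p10)."""
    return 'http://p%d.pstatp.com/origin/pgc-image/%s' % (random.randint(1, 10), path)
```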

This is my first post as a beginner; any advice is welcome!

While reading the book I noticed a timestamp has been added to the parameters, so I tweaked the code a little

I added headers and the timestamp, but while running I occasionally get OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect, even though creating the same directory name by hand works fine. Could someone take a look?
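The Errno 22 is almost certainly caused by titles containing characters that Windows forbids in path names (\ / : * ? " < > | and control characters such as the tab). A sanitizing helper along these lines (the function name is my own, a sketch rather than the thread's official fix) avoids it:

```python
import re

def safe_dirname(title):
    """Strip characters that are illegal in Windows file/directory names."""
    cleaned = re.sub(r'[\\/:*?"<>|\t\r\n]', '', title)
    return cleaned.strip() or 'untitled'
```

Calling os.makedirs(safe_dirname(item.get('title'))) instead of using the raw title should stop the intermittent OSError.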

import requests
import os
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing.pool import Pool
from datetime import datetime

def get_page(offset):
    timestamp = str(datetime.timestamp(datetime.today())).replace('.', '')[:-3]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/74.0.3729.169 Safari/537.36',
        'cookie': 'tt_webid=6705372327364445699; WEATHER_CITY=%E5%8C%97%E4%BA%AC; '
                  'UM_distinctid=16b7fbc4c5f2f3-055c8cad207e35-3e385b04-144000-16b7fbc4c601fb;'
                  ' tt_webid=6705372327364445699; csrftoken=565955f383dfff6e64e1fcaf538414be;'
                  ' CNZZDATA1259612802=429684378-1561215529-%7C1561296529; s_v_web_id=4b402c5aa53e24a17fca9d68bd6eb7ff',
        'x-requested-with': 'XMLHttpRequest'
    }
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'time': timestamp,
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

def get_images(json_data):
    if json_data.get('data'):
        for item in json_data.get('data'):
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            images = item.get('image_list')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }

def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)

GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()

Fixed the bugs around empty data in the response, fetching full-size images, and naming folders after the title

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool
import re


def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6667396596445660679; csrftoken=3a212e0c06e7821650315a4fecf47ac9; tt_webid=6667396596445660679; WEATHER_CITY=%E5%8C%97%E4%BA%AC; UM_distinctid=16b846003e03d7-0dd00a2eb5ea11-353166-1fa400-16b846003e1566; CNZZDATA1259612802=2077267981-1561291030-https%253A%252F%252Fwww.baidu.com%252F%7C1561361230; __tasessionId=4vm71cznd1561363013083; sso_uid_tt=47d6f9788277e4e071f3825a3c36a294; toutiao_sso_user=e02fd616c83dff880adda691cd201aaa; login_flag=6859a0b8ffdb01687b00fe96bbeeba6e; sessionid=21f852358a845d783bdbe1236c9b385b; uid_tt=d40499ec45187c2d411cb7bf656330730d8c15a783bb6284da0f73104cd300a2; sid_tt=21f852358a845d783bdbe1236c9b385b; sid_guard="21f852358a845d783bdbe1236c9b385b|1561363028|15552000|Sat\054 21-Dec-2019 07:57:08 GMT"; s_v_web_id=6f40e192e0bdeb62ff50fca2bcdf2944',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url = base_url + urlencode(params)
    # print(url)
    try:
        resp = requests.get(url, headers=headers)
        if 200  == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None


def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('title') is None:
                continue
            title = re.sub('[\t]', '', item.get('title'))
            images = item.get('image_list')
            for image in images:
                origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))
                yield {
                    'image': origin_image,
                    'title': title
                }


def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e)


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        save_image(item)


GROUP_START = 0
GROUP_END = 9

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()

The code makes the following main changes:
1. Added headers so the data field of the returned JSON is no longer empty

headers = {
        'cookie': 'tt_webid=6667396596445660679; csrftoken=3a212e0c06e7821650315a4fecf47ac9; tt_webid=6667396596445660679; WEATHER_CITY=%E5%8C%97%E4%BA%AC; UM_distinctid=16b846003e03d7-0dd00a2eb5ea11-353166-1fa400-16b846003e1566; CNZZDATA1259612802=2077267981-1561291030-https%253A%252F%252Fwww.baidu.com%252F%7C1561361230; __tasessionId=4vm71cznd1561363013083; sso_uid_tt=47d6f9788277e4e071f3825a3c36a294; toutiao_sso_user=e02fd616c83dff880adda691cd201aaa; login_flag=6859a0b8ffdb01687b00fe96bbeeba6e; sessionid=21f852358a845d783bdbe1236c9b385b; uid_tt=d40499ec45187c2d411cb7bf656330730d8c15a783bb6284da0f73104cd300a2; sid_tt=21f852358a845d783bdbe1236c9b385b; sid_guard="21f852358a845d783bdbe1236c9b385b|1561363028|15552000|Sat\054 21-Dec-2019 07:57:08 GMT"; s_v_web_id=6f40e192e0bdeb62ff50fca2bcdf2944',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }

2. Removed tab characters from title, to avoid the bug where the file name is invalid

title = re.sub('[\t]', '', item.get('title'))

3. The original full-size image URLs had expired, so they are rewritten to the new form

origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))
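The rewrite can be verified in isolation; the URL below is a hypothetical sample in the shape this thread discusses, not a live address:

```python
import re

url = 'http://p1.pstatp.com/list/190x124/pgc-image/sample'  # hypothetical sample path
large_image = re.sub("list.*?pgc-image", "large/pgc-image", url)
# large_image == 'http://p1.pstatp.com/large/pgc-image/sample'
```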

Setting a cookie to crawl every image inside a street-snap gallery

The cookie needs to be copied from the browser; otherwise only the basic page skeleton is loaded.

import requests
import re
import json
import os
from urllib.parse import urlencode
from requests import codes
from multiprocessing.pool import Pool
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "cookie": 'tt_webid=6591674770407851527; UM_distinctid=16555dd135182a-0aaf457dfee96f-2711938-144000-16555dd13525c0; csrftoken=e874c1c4c92de13796f658055b321509; tt_webid=6591674770407851527; WEATHER_CITY=%E5%8C%97%E4%BA%AC; uuid="w:69e07afc4113469c8a67bc7e1191a8fe"; ccid=5b05c00c464a23364e612a873a8f8bf8; CNZZDATA1259612802=1586168555-1534743351-%7C1539254542; __tasessionId=y296rn1i01539258641108',
    "x-requested-with": "XMLHttpRequest",
}

def get_source_code(offset):
    params = {
        "offset": offset,
        "format": "json",
        "keyword": "街拍",
        "autoload": "true",
        "count": 20,
        "cur_tab": 3,
        "from": "gallery",
    }
    base_url = "https://www.toutiao.com/search_content/?"
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url, headers=headers)
        if codes.ok == resp.status_code:
            return resp
    except requests.ConnectionError:
        return None

def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            title = item.get('title')
            open_url = item.get('open_url')
            yield title, open_url

def get_images_url(open_url, title):
    try:
        image_source_code = requests.get('https://www.toutiao.com{}'.format(open_url), headers=headers)
        pic_json_reg = re.compile("gallery: JSON.parse(.*?)siblingList:", re.S)
        pic_url_reg = re.compile('"url":"(.*?)"')
        urls = pic_json_reg.search(image_source_code.text)
        urls = urls.group(0)
        urls = urls.replace('\\', '')
        pic_urls = pic_url_reg.findall(urls)
        if pic_urls:
            for num in range(0, len(pic_urls), 4):
                save_image(pic_urls[num], title, num)
                print(pic_urls[num])
    except requests.ConnectionError:
        return None

def save_image(pic_url, title, num):
    path = './data/{title}/'.format(title=title)
    if not os.path.exists('data'):
        os.mkdir('data')
    if not os.path.exists(path):
        os.mkdir(path)
    try:
        pic = requests.get(pic_url, headers=headers)
        with open('data/{title}/{num}.jpg'.format(title=title, num=num), mode='wb') as f:
            f.write(pic.content)
            print('{title}/{num}.jpg saved'.format(title=title, num=num))
    except requests.ConnectionError:
        return None


def main(page):
    json = get_source_code(page)
    json = json.json()
    for title, open_url in get_images(json):
        get_images_url(open_url, title)

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(1, 11)]
    pool.map(main, groups)
    pool.close()
    pool.join()

2019-11-04: a working example

import os
from hashlib import md5
from multiprocessing.pool import Pool
import requests
from urllib.parse import urlencode

GROUP_STRAT = 1
GROUP_END = 10

URL = 'https://www.toutiao.com/api/search/content/?'

def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6755317032361018894; WEATHER_CITY=%E5%8C%97%E4%BA%AC; __tasessionId=iix0i88fv1572844826803; tt_webid=6755317032361018894; s_v_web_id=9e3d1341a0ce3a6625dafa55bd50a8c5; csrftoken=4e5d24d1aa648714a20fa478adb67e3c; sso_uid_tt=9989002b18278fa8a91a2d9bfceff0db; toutiao_sso_user=22695687fb520aba3bba92d81182e09d; sid_guard=77ee6776e543023f76e7fb53e187aa0b%7C1572844883%7C5184000%7CFri%2C+03-Jan-2020+05%3A21%3A23+GMT; uid_tt=6bb53dbf3ec71913bc04a6c4a27ac973; sid_tt=77ee6776e543023f76e7fb53e187aa0b; sessionid=77ee6776e543023f76e7fb53e187aa0b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍美图',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': '1572845169181'
    }
    real_url = URL + urlencode(params)
    print(real_url)
    try:
        response = requests.get(real_url,headers = headers)
        response.raise_for_status()
        #print(response.status_code)
        #print(response.json())
        return response.json()
    except requests.RequestException:
        print('Request failed')

def get_images(json):
    if json.get('data'):
        items = json.get('data')
        try:
            for item in items:
                title = item.get('title')
                images = item.get('image_list')
                for image in images:
                    yield {
                        'title': title,
                        'image': image.get('url')
                    }
        except TypeError:
            print("images came back empty; unknown error, moving on to the next page")
    else:
        print('data is empty')
def save_image(item):
    if not os.path.exists("今日头条街拍"):
        os.makedirs("今日头条街拍" )
    if not os.path.exists("今日头条街拍" + '/' + item.get('title')):
        os.makedirs("今日头条街拍" + '/' + item.get('title'))
    try:
        print(item.get('image'))
        response = requests.get(item.get('image'))
        response.raise_for_status()
        file_path = '{0}/{1}.{2}'.format("今日头条街拍" + '/' + item.get('title'), md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path,'wb') as f:
                print('Downloading image...')
                f.write(response.content)
                print('Download finished')
        else:
            print('Already Download',file_path)
    except Exception:
        print('Bad image link; failed to save')

def main(offset):
    print(offset)
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        #print(type(item))
        save_image(item)

if __name__ == '__main__':
    print('hello')
    # for offset in range(GROUP_STRAT+1,GROUP_END+1):
    #     print(20 * offset)
    #     main(20 * offset)

    pool = Pool()
    pool.map(main, [offset * 20 for offset in range(GROUP_STRAT,GROUP_END+1)])
    pool.close()
    pool.join()

The Ajax format has changed

2018-04-01: while scraping I found that image_detail has been replaced by image_list, so the get_images function needs to be modified

Can the aid, app_name, keyword, and similar fields be left out of get_page()?

def get_page(offset):
    t = int(time.time())
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '%E8%A1%97%E6%8B%8D',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': t
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.json()
    except requests.ConnectionError:
        return None

Crawling all image links inside street-snap galleries (a cookie must be set)

import requests
import re
import json
import os
from urllib.parse import urlencode
from requests import codes
from multiprocessing.pool import Pool
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "cookie": 'tt_webid=6591674770407851527; UM_distinctid=16555dd135182a-0aaf457dfee96f-2711938-144000-16555dd13525c0; csrftoken=e874c1c4c92de13796f658055b321509; tt_webid=6591674770407851527; WEATHER_CITY=%E5%8C%97%E4%BA%AC; uuid="w:69e07afc4113469c8a67bc7e1191a8fe"; ccid=5b05c00c464a23364e612a873a8f8bf8; CNZZDATA1259612802=1586168555-1534743351-%7C1539254542; __tasessionId=y296rn1i01539258641108',
    "x-requested-with": "XMLHttpRequest",
    "referer": "https://www.toutiao.com/a6602140390814384653/"
}

def get_source_code(offset):
    params = {
        "offset": offset,
        "format": "json",
        "keyword": "街拍",
        "autoload": "true",
        "count": 20,
        "cur_tab": 3,
        "from": "gallery",
    }
    base_url = "https://www.toutiao.com/search_content/?"
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url, headers=headers)
        if codes.ok == resp.status_code:
            return resp
    except requests.ConnectionError:
        return None

def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            title = item.get('title')
            open_url = item.get('open_url')
            yield title, open_url

def get_images_url(open_url, title):
    try:
        image_source_code = requests.get('https://www.toutiao.com{}'.format(open_url), headers=headers)
        pic_json_reg = re.compile("gallery: JSON.parse(.*?)siblingList:", re.S)
        pic_url_reg = re.compile('"url":"(.*?)"')
        urls = pic_json_reg.search(image_source_code.text)
        urls = urls.group(0)
        urls = urls.replace('\\', '')
        pic_urls = pic_url_reg.findall(urls)
        if pic_urls:
            for num in range(0, len(pic_urls), 4):
                save_image(pic_urls[num], title, num)
                print(pic_urls[num])
    except requests.ConnectionError:
        return None

def save_image(pic_url, title, num):
    path = './data/{title}/'.format(title=title)
    if not os.path.exists('data'):
        os.mkdir('data')
    if not os.path.exists(path):
        os.mkdir(path)
    try:
        pic = requests.get(pic_url, headers=headers)
        with open('data/{title}/{num}.jpg'.format(title=title, num=num), mode='wb') as f:
            f.write(pic.content)
            print('{title}/{num}.jpg saved'.format(title=title, num=num))
    except requests.ConnectionError:
        return None

def main(page):
    json = get_source_code(page)
    json = json.json()
    for title, open_url in get_images(json):
        get_images_url(open_url, title)
    # get_images(json)

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(1, 11)]
    pool.map(main, groups)
    pool.close()
    pool.join()

Error after adding app_name = "web_search" to params

I noticed that the Ajax URL carries an app_name parameter. After adding it, params is: params = {

    'aid': '24',
    'app_name': 'web_search',
    'offset': offset,
    'format': 'json',
    'keyword': '街拍',
    'autoload': 'true',
    'count': '20',
    'cur_tab': '1',
    'from': 'search_tab',
    'pd': 'synthesis'
}

But the return value is None.

The request is https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis and I added the parameters in the right order, but it still fails. Could someone explain why?
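For what it's worth, urlencode preserves dict insertion order (dicts are ordered as of Python 3.7), and servers normally ignore query-string order, so the ordering alone is unlikely to be the problem; a header or cookie check on Toutiao's side is the more common cause of an empty response. The encoding itself is easy to verify:

```python
from urllib.parse import urlencode

# a shortened version of the params dict above, just to show the serialization
params = {'aid': '24', 'app_name': 'web_search', 'offset': 20}
query = urlencode(params)
# query == 'aid=24&app_name=web_search&offset=20'
```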

Previously only small images were fetched; I changed the code to grab the full-size ones

import os
from multiprocessing.pool import Pool
import requests
from urllib.parse import urlencode
from hashlib import md5
from requests import codes


def get_page(offset):
   params = {
       'offset': offset,
       'format': 'json',
       'keyword': '街拍',
       'autoload': 'true',
       'count': '20',
       'cur_tab': '1',
       'from':'search_tab',
       'pd':'synthesis',
   }
   url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
   try:
       response = requests.get(url)
       if response.status_code == 200:
           return response.json()
   except requests.ConnectionError:
       return None


def get_images(json):
   if json.get('data'):
       for item in json.get('data'):
           title = item.get('title')
           images = item.get('image_list')
           if images:
               for image in images:
                   yield {
                       'image': image.get('url'),
                       'title': title
                   }


def save_image(item):
   img_path = 'img' + os.path.sep + item.get('title')
   if not os.path.exists(img_path):
       os.makedirs(img_path)
   try:
       resp = requests.get('https:'+item.get('image').replace('list','large'))
       if codes.ok == resp.status_code:
           file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
               file_name=md5(resp.content).hexdigest(),
               file_suffix='jpg')
           if not os.path.exists(file_path):
               with open(file_path, 'wb') as f:
                   f.write(resp.content)
               print('Downloaded image path is %s' % file_path)
           else:
               print('Already Downloaded', file_path)
   except requests.ConnectionError:
       print('Failed to Save Image,item %s' % item)

def main(offset):
   json = get_page(offset)
   for item in get_images(json):
       print(item)
       save_image(item)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
   pool = Pool()
   groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
   pool.map(main, groups)
   pool.close()
   pool.join()

The key is here:
'image': image.get('url'),
resp = requests.get('https:'+item.get('image').replace('list','large'))
This builds on the previous version of the code in #5.

The way full-size images are fetched had a problem

I made some simple changes:

1. The previous method of swapping in "large" only retrieved a small fraction of the full-size images (a few per gallery); with the modified code, all of the images download correctly in my tests.
2. I also tweaked how the JSON is fetched from the constructed URL: adding cookies makes it return the same URLs the page itself loads (large-scale crawling would probably need a cookies pool). I still don't understand why the original approach works at all (strangely, it gets a complete JSON without cookies, though not the one the browser sees); I'd be glad if someone could explain.

See the attachment:
Jiepai-master.zip
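One way to reuse a cookie copied from the browser's dev tools is to split the raw Cookie header into a dict and hand it to requests; this is a sketch, and the cookie names in the test are made up:

```python
def cookie_header_to_dict(cookie_header):
    """Turn a raw 'a=1; b=2' Cookie header into a dict for requests' cookies= argument."""
    return dict(
        pair.split('=', 1)                     # split only on the first '=': values may contain '='
        for pair in cookie_header.split('; ')
        if '=' in pair
    )
```

Then requests.get(url, headers=headers, cookies=cookie_header_to_dict(raw_cookie)) sends the same session the browser had.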

Too many redirects

requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
requests refuses to follow more than 30 redirects, but if redirects are disabled the correct address can't be obtained either. How do I solve this?
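One way to see where the loop goes is to follow the redirects by hand with allow_redirects=False and stop as soon as a URL repeats. The tracer below is a sketch (the names are my own); it takes the fetch function as a parameter so the logic can be tested offline:

```python
def trace_redirects(start_url, fetch, max_hops=10):
    """Follow redirects manually, returning the chain of URLs visited.

    fetch(url) must return (status_code, location_or_None); with requests
    that is roughly:
        r = requests.get(url, headers=headers, allow_redirects=False)
        return r.status_code, r.headers.get('Location')
    """
    chain = []
    url = start_url
    while url is not None and len(chain) <= max_hops:
        if url in chain:
            return chain + [url]          # a repeated URL means a redirect loop
        chain.append(url)
        status, location = fetch(url)
        url = location if status in (301, 302, 303, 307, 308) else None
    return chain
```

Seeing the actual loop usually reveals the cause; with Toutiao it is typically a login or anti-bot bounce, which a valid cookie copied from the browser tends to break.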

params and the image download URL changed; the program runs again

Looking at the demo today, parameter changes on Toutiao's side had broken it again; after the updates below it runs once more.
import requests
from urllib.parse import urlencode
import os
from hashlib import md5
from multiprocessing.pool import Pool
import re

def get_page(offset):
    params = {
        'aid': '24',
        'offset': offset,
        'format': 'json',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?keyword=%E8%A1%97%E6%8B%8D&'
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        print(url)
        if 200 == resp.status_code:
            print(resp.json())
            return resp.json()
    except requests.ConnectionError:
        return None

def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            images = item.get('image_list')
            for image in images:
                origin_image = re.sub("list", "origin", image.get('url'))
                yield {
                    'image': origin_image,
                    'title': title
                }

def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    print(img_path)
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    try:
        resp = requests.get(item.get('image'))
        if 200 == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image %s' % item)

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)

GROUP_START = 0
GROUP_END = 7

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
