
Comments (5)

owner888 commented on July 17, 2024

At line 2114 of phpspider.php, the download should use $collect_url:

$html = requests::$method($collect_url, $params);

Otherwise the attached_url pages will never be downloaded.

Make the change and submit a patch to me.
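For illustration only, this is the corrected call the maintainer describes (the surrounding code at line 2114 is not quoted in this thread, so this is a sketch of the one changed line, not the actual diff):

```php
// When a field has 'source_type' => 'attached_url', the request must be
// sent to the extracted attachment URL ($collect_url), not the page URL.
$html = requests::$method($collect_url, $params);
```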

from phpspider.

kavt commented on July 17, 2024

Thank you so much!!
I spent ages trying to work out why the attached pages would not download, only the main page, and it turns out the code itself was buggy!
@owner888 you really did some damage here!! Your code saves us a lot of time, but please at least test it.
I spent 3 days and nights and could not find the cause.


kavt commented on July 17, 2024

@yingzheng1980 One more question: some detail pages are paginated and some are not. How can the spider tell the difference?

```php
'fields' => array(

    array(
        'name' => "contents",
        // alternatives tried:
        //   //div[contains(@class,'art-pre')]//a//@href
        //   //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
        'selector' => "//div[contains(@class,'art-pre')]/a/@href",
        'repeated' => true,
        'required' => true, // required field

        'children' => array(
            array(
                // extract the URLs of the other pages for later use
                'name' => 'content_page_url',
                'selector' => "//text()"
            ),
            array(
                // extract the content of the other pages
                'name' => 'page_content',
                'source_type' => 'attached_url',
                // e.g. 'attached_url' => "https://www.zhihu.com/r/answers/{comment_id}/comments"
                'attached_url' => 'content_page_url',
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
```
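One way to handle detail pages that may or may not be paginated (a sketch against the config above, not tested against the site): make the pagination field optional, so pages without an "art-pre" block are still collected instead of being dropped by the `required` check.

```php
// Sketch: detail pages without a bottom pagination block have no
// matches for this selector; with 'required' => true they would be
// discarded entirely, so mark the field optional instead.
'selector' => "//div[contains(@class,'art-pre')]/a/@href",
'repeated' => true,
'required' => false, // was true: unpaginated pages were being skipped
```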


kavt commented on July 17, 2024

http://www.qikan.com.cn/articleinfo/dinb20222801-2.html


kavt commented on July 17, 2024

Even after the change it is still not right:

1. Detail pages without bottom pagination are silently skipped. With
'selector' => "//div[contains(@class,'art-pre')]/a/@href"
some pages simply have no "next page" links at the bottom, so the selector matches nothing and the page is never collected.

2. On paginated articles, only the linked pages ("next page" and so on) are collected; the current page's own content is not collected.

```php
<?php
require_once __DIR__ . '/../autoloader.php';
use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;

/* Do NOT delete this comment */

/* [Important: simulated login] */
$cookies = "ASP.NET_SessionId=uqbyzahwaa5fedgldcawsogx; Hm_lvt_782a719ae16424b0c7041b078eb9804a=1657892367,1658402814,1658581663,1658932747; Hm_lvt_29f14b13cac2f8b4e5fc964806f3ea52=1657892367,1658402820,1658581663,1658932747; Hm_lpvt_782a719ae16424b0c7041b078eb9804a=1658932755; Hm_lpvt_29f14b13cac2f8b4e5fc964806f3ea52=1658932755; UserToken=nrbjZ+ZFD3ulIoEX50957cwO1CrVaO5/NLAFj6bcy1Gx6rsh; LoginUserName=kavt12; LoginPassword=NRW/PSbsXFo=";

requests::set_cookies($cookies, 'www.qikan.com.cn');

/*
 * [for scan_urls] compute the current year and week number
 * and append them to each request.
 */
$year = date('Y');
$week = date('W'); // Diannaobao usually comes out Monday afternoon, so week number - 2
$week = $week - 2;

// attempt #7: trying to add pagination support

$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 2, // collect at most 2 items per run
    'domains' => array(
        'www.qikan.com.cn'
    ),

    // entry point
    'scan_urls' => array(
        "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html"
        // e.g. http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
    ),

    // content pages (these regexes work)
    'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/[\s\S]+", // e.g. http://www.qikan.com.cn/article/dinb20222701.html
        "http://www.qikan.com.cn/articleinfo/[\s\S]+"
    ),

    'fields' => array(

        array(
            'name' => "contents",
            //'selector_type' => 'regex',
            // alternatives tried:
            //   //div[contains(@class,'art-pre')]//a//@href
            //   //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
            'selector' => "//div[contains(@class,'art-pre')]/a/@href",
            'repeated' => true,
            'required' => true, // required field

            'children' => array(
                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//text()"
                ),
                array(
                    // extract the content of the other pages
                    'name' => 'page_content',
                    'source_type' => 'attached_url',
                    // e.g. 'attached_url' => "https://www.zhihu.com/r/answers/{comment_id}/comments"
                    'attached_url' => 'content_page_url',
                    'selector' => "//div[contains(@class,'textWrap')]"
                ),
            ),
        ),

        // extract the article title from the content page
        array(
            'name' => "title",
            // fallback: //*[@id="form1"]/div[6]/div/div[2]/div[1]/h1
            'selector' => "//div[contains(@class,'article')]//h1",
            'required' => true
        ),

        // body text: //div[contains(@class,'textWrap')]
        /*
        array(
            'name' => "text",
            'selector' => "//div[contains(@class,'textWrap')]",
        ),
        */

        /*
        array(
            'name' => "contents",
            'selector' => "//html",
            'repeated' => true,
            'children' => array(
                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//div[contains(@class,'art-pre')]//a//@href"
                ),
                array(
                    // extract the content of the other pages
                    'name' => 'page_content',
                    // send an attached_url request to fetch the other pages;
                    // attached_url uses the content_page_url captured above
                    'source_type' => 'attached_url',
                    'attached_url' => 'http://www.qikan.com.cn/{content_page_url}',
                    'selector' => "//div[contains(@class,'textWrap')]"
                )
            )
        ),
        */

        // images
        array(
            'name' => "pic",
            'selector' => "//figure[contains(@class,'image')]//img",
            // returns an array of images; may need $data = $data[0] to pick one
            // (show it directly if there is one, the first if several);
            // so far leaving it unprocessed also seems fine
        ),

    ),

    'export' => array(
        'type'  => 'sql',
        'file'  => './data/8.sql',
        'table' => '数据表', // table name
    ),

);

$spider = new phpspider($configs);

// [How to post-process an extracted field?] use on_extract_field
$spider->on_extract_field = function ($fieldname, $data, $page) {
    if ($fieldname == 'contents') {
        $contents = $data;
        $data = "";

        // note: count($contents) - 1 with "<" skips the last element
        $num = count($contents) - 1;
        for ($i = 0; $i < $num; $i++) {
            $data .= $contents[$i]['page_content'];
        }

        /*
        foreach ($contents as $content) {
            $data .= $content['page_content'];
        }
        */
    }
    return $data;
};

$spider->start();
```
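One possible fix for both problems above (a sketch, untested against the site; it assumes phpspider passes the raw page HTML as $page['raw'] to on_extract_field, and uses selector::select as imported at the top of the script): make the pagination field optional in the config, then always start from the current page's own textWrap content and append the attached pages after it.

```php
// Sketch of a combined fix (hypothetical, untested):
// 1) in the config, set 'required' => false on the "contents" field so
//    detail pages with no pagination block are still collected;
// 2) in the callback, take the current page's own content first, then
//    append whatever attached pages were fetched.
$spider->on_extract_field = function ($fieldname, $data, $page) {
    if ($fieldname == 'contents') {
        // current page's body (assumes $page['raw'] holds the page HTML)
        $text = selector::select($page['raw'], "//div[contains(@class,'textWrap')]");
        if (is_array($data)) {
            foreach ($data as $content) {
                if (isset($content['page_content'])) {
                    $text .= $content['page_content'];
                }
            }
        }
        return $text;
    }
    return $data;
};
```

This also sidesteps the off-by-one in the original loop (count($contents) - 1 with "<" drops the last attached page).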

