mmerian / phpcrawl

Copy of http://phpcrawl.cuab.de/ for using with composer

License: GNU General Public License v2.0

PHP 98.89% JavaScript 0.87% CSS 0.24%

composer crawler php phpcrawl

phpcrawl's Issues

Not compatible with PHP 7.1

PHP Warning:  Declaration of MyCrawler::handleDocumentInfo($DocInfo) should be compatible with PHPCrawler::handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) in /var/www/srclast/PHPCrawl/rsclast.class.php on line 10
Page requested: https://security.alibaba.com/top.htm?spm=0.0.0.0.gqgp1o&time= ()
Referer-page: 
Content not received

Summary:
Links followed: 1
Documents received: 0
Bytes received: 0 bytes
Process runtime: 1.7769010066986 sec
root@ydxred:/var/www/srclast/PHPCrawl#
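
The warning above comes from PHP's method signature check: the overriding method omits the `PHPCrawlerDocumentInfo` type hint that the parent method declares. Below is a minimal sketch of a matching signature. `StubCrawler` and `StubDocumentInfo` are hypothetical stand-ins for `PHPCrawler` and `PHPCrawlerDocumentInfo`, used only so the snippet runs without the library installed. It addresses the warning only; the "Content not received" result may well have a different cause.

```php
<?php
// Hypothetical stand-in for PHPCrawlerDocumentInfo (illustration only).
class StubDocumentInfo
{
    public $url = '';
}

// Hypothetical stand-in for PHPCrawler: the parent declares a typed parameter.
class StubCrawler
{
    public function handleDocumentInfo(StubDocumentInfo $PageInfo)
    {
        // The base implementation does nothing.
    }
}

class MyStubCrawler extends StubCrawler
{
    public $seen = [];

    // Same parameter type as the parent. The parameter name may differ.
    public function handleDocumentInfo(StubDocumentInfo $DocInfo)
    {
        $this->seen[] = $DocInfo->url;
    }
}
```

In the reported class this would mean declaring `handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)` instead of `handleDocumentInfo($DocInfo)`.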

PHPCrawlerUtils class appears twice in the libs folder

...and when autoloading the classes the wrong one is loaded (the one missing some of the methods used by the script). PHPCrawlerUtils appears in libs/Utils (the right one) but also in libs/.
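
To confirm which copy the autoloader actually picked, a small diagnostic along these lines might help. `describeLoadedClass` is a hypothetical helper name, not part of PHPCrawl; it relies only on PHP's built-in `ReflectionClass` and `method_exists`.

```php
<?php
/**
 * Report which file a class was loaded from and whether it has a given method.
 */
function describeLoadedClass(string $class, string $method): array
{
    // class_exists() triggers autoloading, so this reflects what the
    // autoloader would really load in the running application.
    if (!class_exists($class)) {
        return ['file' => null, 'has_method' => false];
    }

    $file = (new ReflectionClass($class))->getFileName();

    return [
        'file' => $file === false ? null : $file,
        'has_method' => method_exists($class, $method),
    ];
}

// Example call in a project that has the library installed:
// print_r(describeLoadedClass('PHPCrawlerUtils', 'getURIContent'));
```

If the reported file is the copy in libs/ rather than libs/Utils/, the duplicate is the one being used.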

Problems in PHPCrawlerLinkFinder.class.php prepareHTMLChunk()

The following regexes in the prepareHTMLChunk function lead to a completely empty HTML source for many pages:

$html_source = preg_replace("#^(?:(?!<script).)*<\/script># Uis", "", $html_source);

$html_source = preg_replace("#<\!--.*(?:-->|$)# Uis", "", $html_source);

$html_source = preg_replace("#^(?:(?!<\!--).)*-->#Uis", "", $html_source);

My regex skills are not good enough to debug it.
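
One possible explanation, not confirmed in this report: `preg_replace()` returns `null` when PCRE runs into an error such as the backtrack limit, and assigning that result back to `$html_source` would leave it empty. The sketch below guards against that case. `safePregReplace` is a hypothetical helper name, not a PHPCrawl function.

```php
<?php
/**
 * Run preg_replace() but keep the original subject if PCRE reports an error.
 */
function safePregReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        // preg_last_error() returns the PCRE error code,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        error_log('preg_replace failed, PCRE error code ' . preg_last_error());
        return $subject;
    }

    return $result;
}

// Possible use with one of the patterns quoted above:
// $html_source = safePregReplace("#<\!--.*(?:-->|$)# Uis", "", $html_source);
```

Logging the error code would also show whether a PCRE limit is really what empties the source on the affected pages.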

Call to undefined method PHPCrawlerUtils::getURIContent()

Framework: Laravel
PHPCrawl version: 0.83

Issue:
I'm trying to enable obeyRobotsTxt, but the wrong PHPCrawlerUtils class gets loaded. obeyRobotsTxt calls PHPCrawlerRobotsTxtParser::parseRobotsTxt, which in turn calls PHPCrawlerUtils::getURIContent. That method isn't found because the autoloader picks up this class:

vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php // doesn't contain getURIContent

instead of this one, which it should use:

vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php // does contain getURIContent

error:

Call to undefined method PHPCrawlerUtils::getURIContent()

autoload warning:

Warning: Ambiguous class resolution, "PHPCrawlerUtils" was found in both "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php" and "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php", the first will be used.
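
Until the duplicate file is removed from the package, one workaround might be to load the intended file explicitly before the autoloader is ever asked for the class. This is a sketch under that assumption. `preloadClassFrom` is a hypothetical helper, and whether the libs/Utils copy loads cleanly on its own has not been verified here.

```php
<?php
/**
 * Load a class from an explicit file before any autoloader is asked for it.
 * Returns true if the class is defined afterwards.
 */
function preloadClassFrom(string $class, string $file): bool
{
    // Second argument false: do not trigger autoloading here,
    // since the autoloader might pick the wrong copy.
    if (!class_exists($class, false) && is_file($file)) {
        require_once $file;
    }

    return class_exists($class, false);
}

// Hypothetical usage, right after requiring vendor/autoload.php:
// preloadClassFrom(
//     'PHPCrawlerUtils',
//     __DIR__ . '/vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php'
// );
```

Once the class is defined this way, later references to `PHPCrawlerUtils` no longer reach the autoloader, so the ambiguous class map entry is not consulted.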

Get links after crawl

Hello

I'm using the following code

class MyCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently
        // received document.

        // As example we just print out the URL of the document
        // weekly can be supplied as a parameter
        //return -1;
    }
}

$crawler = new MyCrawler();
$crawler->setURL('http://example.com');
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();

How can I access the PHPCrawlerDocumentInfo->links_found after the crawl is complete?
Thanks in advance.
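
One way to approach this is to collect the links inside handleDocumentInfo and read them from the crawler object once go() returns. The sketch below uses hypothetical stand-ins (`FakeCrawler`, `FakeDocumentInfo`) so that it runs without the library. It assumes each entry of `links_found` is an array with a `url_rebuild` key, and it assumes the single-process go() call shown above.

```php
<?php
// Hypothetical stand-in for PHPCrawlerDocumentInfo (illustration only).
class FakeDocumentInfo
{
    public $url = '';
    public $links_found = [];
}

// Hypothetical stand-in for PHPCrawler.
class FakeCrawler
{
    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
    }
}

class LinkCollectingCrawler extends FakeCrawler
{
    // Links are stored as array keys so duplicates collapse automatically.
    private $links = [];

    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
        foreach ($PageInfo->links_found as $link) {
            // Assumption: each entry carries the absolute URL under 'url_rebuild'.
            if (isset($link['url_rebuild'])) {
                $this->links[$link['url_rebuild']] = true;
            }
        }
    }

    // Call this after the crawl has finished.
    public function getCollectedLinks(): array
    {
        return array_keys($this->links);
    }
}
```

With the real library, the class would extend PHPCrawler, type-hint PHPCrawlerDocumentInfo, and `$crawler->getCollectedLinks()` would be called after `$crawler->go()`.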

Maintenance of phpcrawl

Hello

Is phpcrawl being maintained by the author?
I have tried to reach him by e-mail about the class, but without success.

HTML5

Does it work with HTML5?
