mmerian / phpcrawl

Copy of http://phpcrawl.cuab.de/ for use with Composer

License: GNU General Public License v2.0

PHP 98.89% JavaScript 0.87% CSS 0.24%
composer crawler php phpcrawl

phpcrawl's People

Contributors

mmerian, theputzy

phpcrawl's Issues

Is phpcrawl not compatible with PHP 7.1?

PHP Warning:  Declaration of MyCrawler::handleDocumentInfo($DocInfo) should be compatible with PHPCrawler::handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) in /var/www/srclast/PHPCrawl/rsclast.class.php on line 10
Page requested: https://security.alibaba.com/top.htm?spm=0.0.0.0.gqgp1o&time= ()
Referer-page: 
Content not received

Summary:
Links followed: 1
Documents received: 0
Bytes received: 0 bytes
Process runtime: 1.7769010066986 sec
root@ydxred:/var/www/srclast/PHPCrawl# 
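
The warning above is raised because the overriding method drops the PHPCrawlerDocumentInfo type hint that the parent method declares. A minimal sketch of a matching signature follows. The two stub classes are stand-ins so the snippet runs without PHPCrawl installed; they are assumptions, not the library's real code. The warning by itself does not stop the crawl, so the "Content not received" result probably has a different cause.

```php
<?php
// Stand-ins (assumptions) so the sketch runs without PHPCrawl installed.
// The real classes come from the library and have many more members.
class PHPCrawlerDocumentInfo
{
    public $url = '';
}

class PHPCrawler
{
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
    }
}

class MyCrawler extends PHPCrawler
{
    // Same parameter type hint as the parent method, so PHP 7.1 has
    // nothing to warn about. The parameter name itself may differ.
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        return 'Page requested: ' . $DocInfo->url;
    }
}

$info = new PHPCrawlerDocumentInfo();
$info->url = 'http://example.com/';
$message = (new MyCrawler())->handleDocumentInfo($info);
echo $message, PHP_EOL;
```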

Problems in PHPCrawlerLinkFinder.class.php prepareHTMLChunk()

The following regexes in the prepareHTMLChunk function lead to a completely empty HTML source for many pages:

$html_source = preg_replace("#^(?:(?!<script).)*<\/script># Uis", "", $html_source);

$html_source = preg_replace("#<\!--.*(?:-->|$)# Uis", "", $html_source);

$html_source = preg_replace("#^(?:(?!<\!--).)*-->#Uis", "", $html_source);

My regex skills are not good enough to debug it.
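
One possible explanation, not confirmed in this thread, is that preg_replace() fails on large pages (for example by hitting pcre.backtrack_limit). On failure it returns NULL, which then behaves like an empty string. The sketch below shows a guard that keeps the original text in that case; replaceOrKeep is a hypothetical helper name, not part of PHPCrawl, and the pattern is a simplified comment-stripping regex rather than one of the library's.

```php
<?php
// Hypothetical helper: run preg_replace(), but fall back to the untouched
// subject when PCRE reports an error instead of silently losing the text.
function replaceOrKeep(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() returns the PCRE error code of the last call.
        error_log('preg_replace failed, error code ' . preg_last_error());
        return $subject;
    }
    return $result;
}

// Strip an HTML comment with a lazy quantifier.
$clean = replaceOrKeep('#<!--.*?-->#s', '', 'a<!-- note -->b');
echo $clean, PHP_EOL;
```

If the error log shows a non-zero code for real pages, raising pcre.backtrack_limit or rewriting the patterns would be the next things to try.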

Call to undefined method PHPCrawlerUtils::getURIContent()

Framework: Laravel
PHPCrawl version: 0.83

Issue:
I'm trying to set obeyRobotsTxt, but it uses the wrong PHPCrawlerUtils class. obeyRobotsTxt calls PHPCrawlerRobotsTxtParser::parseRobotsTxt, which in turn calls PHPCrawlerUtils::getURIContent. The method is not found because this class is loaded:

vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php // Doesn't contain getURIContent

instead of this one, which it should use:

vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php // Does contain getURIContent

error:

Call to undefined method PHPCrawlerUtils::getURIContent()

autoload warning:

Warning: Ambiguous class resolution, "PHPCrawlerUtils" was found in both "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php" and "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php", the first will be used.
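
To confirm which of the two files the autoloader actually picked, reflection can report the file a class was loaded from and whether it defines a given method. This is only a diagnostic sketch: describeClass is a hypothetical helper and DemoUtils is a stand-in for PHPCrawlerUtils so the snippet runs on its own.

```php
<?php
// Hypothetical helper: reports where a class was loaded from and whether
// it defines the given method.
function describeClass(string $class, string $method): array
{
    $reflection = new ReflectionClass($class);
    return [
        'file' => $reflection->getFileName(),
        'has_method' => $reflection->hasMethod($method),
    ];
}

// Stand-in for PHPCrawlerUtils so the sketch is self-contained.
class DemoUtils
{
    public static function getURIContent()
    {
    }
}

$report = describeClass('DemoUtils', 'getURIContent');
var_dump($report['has_method']);

// In the affected project the call would be:
//     describeClass('PHPCrawlerUtils', 'getURIContent');
```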

Get links after crawl

Hello

I'm using the following code:

class MyCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently
        // received document.

        // As example we just print out the URL of the document
        // weekly can be supplied as a parameter
        //return -1;
    }
}

$crawler = new MyCrawler();
$crawler->setURL('http://example.com');
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();

How can I access the PHPCrawlerDocumentInfo->links_found after the crawl is complete?
Thanks in advance.
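
One way to approach this is to copy the links out of each document inside handleDocumentInfo() and read the accumulated list once go() returns. The sketch below shows only the collecting part. LinkCollector is a hypothetical helper, not part of PHPCrawl, and it assumes each entry of links_found is an array with a url_rebuild key, as the PHPCrawl documentation describes.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: accumulates links across documents.
class LinkCollector
{
    private $urls = [];

    // $linksFound is expected to look like PHPCrawlerDocumentInfo::$links_found
    // (assumption: each entry carries a 'url_rebuild' key).
    public function addLinks(array $linksFound): void
    {
        foreach ($linksFound as $link) {
            if (isset($link['url_rebuild'])) {
                // Key by URL so duplicates collapse into one entry.
                $this->urls[$link['url_rebuild']] = true;
            }
        }
    }

    public function all(): array
    {
        return array_keys($this->urls);
    }
}

// Inside MyCrawler::handleDocumentInfo() one would call something like:
//     $this->collector->addLinks($PageInfo->links_found);
// and read $crawler->collector->all() after $crawler->go() returns.

$collector = new LinkCollector();
$collector->addLinks([
    ['url_rebuild' => 'http://example.com/a'],
    ['url_rebuild' => 'http://example.com/b'],
]);
$collector->addLinks([['url_rebuild' => 'http://example.com/a']]);
print_r($collector->all());
```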

Maintenance of phpcrawl

Hello

Is phpcrawl still being maintained by the author?
I have tried to reach him by e-mail regarding the class, but without success.

HTML5

Does it work with HTML5?
