webcoast-dk / versatile-crawler

Extendable and easy to use crawler extension for TYPO3 CMS

License: GNU General Public License v3.0

PHP 100.00%

typo3 crawler indexing search extendable

versatile-crawler's Introduction

EXT:versatile_crawler

Versatile Crawler is an extension for crawling pages and content in a TYPO3 CMS installation. It is developed on the basis of TYPO3 CMS version 8. The extension has a clear, easily understandable structure and provides queue and crawler functions for pages and records.

Installation & Setup

Clone the extension from GitHub, manually or via composer, and activate it in the extension manager. Create a crawler configuration record on the page where you would like the indexing to start, e.g. on the homepage. Go to the scheduler module and create a queue task and a process task. Configure a cron job that triggers the TYPO3 CMS scheduler.

Prerequisites

TYPO3 CMS 8
PHP 7 w/ cURL

Documentation

The documentation can be found in the GitHub wiki: https://github.com/webcoast-dk/versatile-crawler/wiki

Contributing

Feel free to fork the repository, make changes and create a pull request. If you are not into coding or do not have the time, open up an issue.

Credits

The extension is developed and maintained by Thorben Nissen (https://www.kapp-hamburg.de/en/)

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Icons

The icons are taken from Material Design Icons (https://materialdesignicons.com/) and are licensed under the SIL Open Font License 1.1.

versatile-crawler's Issues

PHP warning in belog

TYPO3 8.7.10 and PHP 7.2.2

Core: Error handler (FE): PHP Warning: strcmp() expects parameter 1 to be string, array given in /.../web/typo3conf/ext/versatile_crawler/Classes/Frontend/IndexHook.php line 31
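The warning says that an array reached strcmp() where a string was expected. The code at IndexHook.php line 31 is not shown in this report, so the following is only a minimal sketch of a defensive comparison; safeStringEquals() is a hypothetical helper and not part of the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of versatile_crawler): compares a value of
 * unknown type against an expected string without triggering the warning.
 */
function safeStringEquals($value, string $expected): bool
{
    // strcmp() only accepts strings, so anything else counts as "no match".
    if (!is_string($value)) {
        return false;
    }

    return strcmp($value, $expected) === 0;
}
```

With such a guard, an array value would simply be treated as a non-match instead of being passed to strcmp().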

11.5 LTS Compatibility

Hi, do you have plans for TYPO3 11.5 compatibility? :-)

At first glance, one needs to switch to an alternative hook, see https://docs.typo3.org/c/typo3/cms-core/master/en-us/Changelog/10.4/Deprecation-91012-VariousHooksRelatedToTypoScriptFrontendController.html ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['tslib/class.tslib_fe.php']['hook_eofe'] was removed)

Better example possible?

Hi!
I was very glad to find your extension during the process of updating a 4.2 to 8.7! Thank you so much for making it public!
Would it be possible to add simple configurations to the documentation? I need to crawl news and regular textmedia content. It would be most helpful if you could add screenshots showing how to achieve a proper setup.
What is to be done in "Query string for this type of records (without "id" and "L")"?
And do I have to select a language?
Currently I am confused by the list that "Records" offers. One is called "Page", a second one "Pages" (which actually only allows me to assign one page -- why?), then "Page Content"...
Mostly through trial and error I put something together, but everything is failing so far. The logs mainly show errors like this:

http://dev.t3-45.lokal/index.php?id=1&L=0&&cHash=

(Also the double ampersand seems to be problematic...)
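The "&&" in the logged URL above would be consistent with an empty query-string segment being concatenated between the base parameters and the cHash, although the report does not confirm the cause. A minimal sketch of joining the parts while skipping empty ones; buildCrawlUrl() and its parameters are hypothetical and not taken from the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: joins the base parameters with an optional extra
 * query string, skipping empty parts so that no "&&" appears in the result.
 */
function buildCrawlUrl(string $baseUrl, int $pageId, int $languageId, string $extraQuery = ''): string
{
    $parts = [
        // Always produces "id=<pageId>&L=<languageId>".
        http_build_query(['id' => $pageId, 'L' => $languageId], '', '&'),
        // Strip leading/trailing ampersands from the configured string.
        trim($extraQuery, '&'),
    ];

    // Drop empty segments before joining.
    $query = implode('&', array_filter($parts, 'strlen'));

    return $baseUrl . '?' . $query;
}
```

Called with an empty extra query string, this yields a URL ending in "id=1&L=0" with no trailing ampersands.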

Thanks for any help!!!
(The test domain might be confusing: I am actually on TYPO3 8.7.8...)

process queue

The process queue seems to work, but I get the following error in the backend log:

Core: Error handler (FE): PHP Warning: strcmp() expects parameter 1 to be string, array given in /kunden/nn/rp-hosting/nnn/nn/typo3cms/muster8/typo3conf/ext/versatile_crawler/Classes/Frontend/IndexHook.php line 31

How to prevent page indexing (for a part of the tree)?

Hi!
I have several pages on my site, the content of which is used on other pages of the site (as content for tabs). These pages should not appear in search results. How can I prevent these pages from being indexed?

Page items failed

Hi - I just checked out master and tried it on one site. I created the configuration for pages, created the tasks and ran them.
The problem is that all items failed.

Unfortunately there is no way in the BE to see the errors, but in the DB I found the entries, e.g.:

An error occurred. The call to the url "http://ewp.t3/index.php?id=26&L=0" returned the status code 500
Testing the url in FE gave no error at all.

Error: An error occurred. the call to the url "%s" did not returned a valid json. This could .....

Hello,

first, thanks for your extension. I am trying to use it for indexed_search, because the crawler task does not correctly set freeIndexUid.

I have configured the crawler, and when I run it, I get the error message from FrontendRequestCrawler line 67. This is because $content from line 57 is the plain HTML code of the page. json_decode is then called on it, which does not work, because the content is HTML and not JSON.
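The behaviour described here can be reproduced in isolation: json_decode() returns null for an HTML body and json_last_error() reports a syntax error. The following is a minimal sketch; decodeJsonResponse() is a hypothetical helper and does not reproduce the extension's FrontendRequestCrawler code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decodes a response body and reports why decoding
 * failed. Returns [decoded data or null, error message or null].
 */
function decodeJsonResponse(string $content): array
{
    $data = json_decode($content, true);

    // An HTML page is not valid JSON, so decoding fails here.
    if (json_last_error() !== JSON_ERROR_NONE) {
        return [null, json_last_error_msg()];
    }

    // Valid JSON scalars such as "5" are not a usable result either.
    if (!is_array($data)) {
        return [null, 'Response is valid JSON but not an object or array'];
    }

    return [$data, null];
}
```

Passing a string such as '<html><body>Hi</body></html>' gives null data plus an error message, whereas '{"status":"ok"}' gives the decoded array.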

Did I do anything wrong?

Thanks for your help!
Harald

No crawling pages below sysfolders?

Today I tried the master branch of versatile-crawler in TYPO3 9.5. It works well, but I have one problem: I can't crawl pages if they are below a page of type "sysfolder". Is that really the case, or is my configuration wrong?
At first I had the configuration on my root page and tried crawling pages (for the meta navigation) below a sysfolder. The pages were not indexed. Then I tried putting the configuration directly on that sysfolder, but this did not help either.

Crawl paginated page with Fluid Widget Paginator without detailview

Is it possible to crawl a page which has a list view (extbase) with the fluid widget paginator but without a detail view?
At the moment, only the first page is crawled and I can't find a way to index the other pages from the paginator widget (parameter: tx_extkey_pluginkey[widget_0][currentPage]=2).

No RealUrl links in search results

RealUrl is configured and works correctly for general pages and detail views (news).
But the search results do not contain RealUrl links.
Link to a general page looks like - http://www.domain.com/index.php?id=64.
Link to a detail view page looks like - http://www.domain.com/index.php?id=66&id=66&tx_news_pi1[action]=detail&tx_news_pi1[controller]=News&tx_news_pi1[news]=10&cHash=fa0afb9123e9999e8e68a7636632b5e6 (why is the page id there twice?)
I cannot tell whose problem this is. Crawler? IndexedSearch? RealUrl?
Maybe you can give me a clue or a suggestion?