typo3-solr / ext-tika
A TYPO3 CMS extension that provides Apache Tika functionality
License: GNU General Public License v3.0
There might be some reasons to drop support for Tika app and Solr support:
If we were to decide to do that, it would also result in a new major version as it is a breaking change. Nothing is set in stone or even decided yet. We're just looking for opinions for now.
For extracting mp3 metadata, I had to add a mapping of xmpDM:duration to (int)($value / 1000) in \ApacheSolrForTypo3\Tika\Service\Extractor\MetaDataExtractor::normalizeMetaData().
Maybe this could be configurable?
--> EXT:extractor has a nicely configurable metadata mapping (normalization) handling, so no code change would be necessary there. However, EXT:extractor does not support Solr Cell, only a local Tika app or a Tika server, while EXT:tika handles this very nicely.
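A configurable mapping of the kind asked for above could look roughly like the following. This is only a sketch: the $mapping structure, the normalizeWithMapping() function and the target field name "duration" are assumptions made for illustration and are not part of EXT:tika.

```php
<?php
// Hypothetical mapping: Tika metadata key => [target field, converter].
// In a real setup this could come from the extension configuration.
$mapping = [
    'xmpDM:duration' => [
        'duration',
        // Converts milliseconds to whole seconds, as in the issue above.
        function ($value) {
            return (int)($value / 1000);
        },
    ],
];

/**
 * Hypothetical helper: applies the mapping to raw metadata.
 * Keys without a mapping entry are passed through unchanged.
 */
function normalizeWithMapping(array $raw, array $mapping): array
{
    $normalized = [];
    foreach ($raw as $key => $value) {
        if (isset($mapping[$key])) {
            list($field, $convert) = $mapping[$key];
            $normalized[$field] = $convert($value);
        } else {
            $normalized[$key] = $value;
        }
    }
    return $normalized;
}

$result = normalizeWithMapping(['xmpDM:duration' => '215000'], $mapping);
```

With a mapping like this, supporting another field would mean adding one array entry instead of changing normalizeMetaData().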
TIKA preview breaks in the page-menu - the problem seems to be the following condition
if (!$this->table === 'sys_file') {
return false;
}
in the canHandle method - instead it should be:
if ($this->table !== 'sys_file') {
return false;
}
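The reason the first condition never triggers is operator precedence: "!" binds tighter than "===", so the expression compares a boolean with a string. A minimal standalone sketch (the $table variable is a stand-in for $this->table):

```php
<?php
// Stand-in for $this->table when the context menu is opened on a page.
$table = 'pages';

// Buggy form: evaluated as (!$table) === 'sys_file',
// i.e. a boolean strictly compared with a string.
$buggy = (!$table === 'sys_file');

// Fixed form: strict inequality on the string itself.
$fixed = ($table !== 'sys_file');

var_dump($buggy); // bool(false): the early "return false" is never reached
var_dump($fixed); // bool(true): non-file tables are rejected as intended
```

Because the buggy comparison is false for every table name, canHandle() continues for pages and content elements as well.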
We should run the tests against TYPO3 8.7.0 on Travis.
We should prepare the 3.1.0 release
Sometimes you want to limit the file size so that Tika is not used for very big files.
Add a list/view showing supported file types for each extractor (meta data, text/content)
We are about to go in production with the ext-tika extensions. But we have 2 major problems.
It would be nice to have these 2 fixed ASAP - alternatively we could provide a pull request for it.
Hi,
as we can now reproduce, ext-tika seems to save wrong height and width metadata of jpeg (and maybe other) files.
Main problem/symptom - for example:
when a file is cropped or scaled down before uploading (e.g. photoshop) there will still be the initial width and height values in the metadata of this file along with the new and correct values (correlates with the EXIF data of the file).
Now after the upload process (with the tika extension installed) the initial values will be saved in the TYPO3 database (sys_file_metadata) instead of the new and correct values.
If you now want to crop the image with the TYPO3 cropping tool out of the core, it will save wrong cropping areas because of these values in sys_file_metadata.
Steps to reproduce:
Counter check:
If you need any further information, don't hesitate to hit me up.
At the moment the supported file types are hard coded within the extension.
Tika can provide a complete list of file types it supports. The extension should query Tika for that list and use that when TYPO3 asks what file types we can handle.
Since the Tika server can also run on another node, Java does not have to be installed locally.
We should raise the following warnings / errors when Java is not installed:
App mode: Error
Server mode: Warning
Solr mode: nothing
If I open a ContextMenu from a page or item, the Provider from Tika (ApacheSolrForTypo3\Tika\ContextMenu\Preview) tries to check if it can handle the context menu or not.
If there is no file with the same uid as the page or content element, the Provider will throw an error, for example:
#1317178604: No file found for given UID: 120 (More information)
So Tika is blocking you from using the ContextMenu at Pages and Content Elements (maybe more).
Checked on TYPO3 8.7.17/18
We should support version 1.19 of Apache Tika
Under CMS 8 LTS we get a warning in Backend:
Core: Error handler (BE): PHP Warning: file_get_contents(http://localhost:9999/tika): failed to open stream: Connection refused in /var/www/8.7.local.typo3.org/typo3conf/ext/tika/Classes/Service/Tika/ServerService.php line 242
Current status report check fails when using an already running Tika Server
The Tika module should be configurable in the Extension Manager.
When the metadata is extracted with solr cell there is an error when the scheduler task is executed. We should debug this.
EXT:tika 2.0 won't use the old service APIs anymore. Those checked whether a service is available. Because that check would have been costly for Tika (it starts the JVM each time), we stored the results of a check in the TYPO3 registry.
We won't store the availability check results anymore and will instead assume the configuration to be valid.
However, we should add checks for valid configuration to the Status Report.
From https://forge.typo3.org/issues/77659
Using Tika with SOLR for metadata extraction.
The Unsupported exception in SolrCellService::detectLanguageFromFile prevents the success of a file upload from being reported.
Steps to reproduce:
This error message comes from ExtendedFileUtility::func_upload(1157). When a breakpoint is set here, the thrown exception is from Tika with the message "The Tika Solr service does not support language detection". Being unable to extract metadata should not prevent getting an upload finished message.
We should check the extension against Tika 1.14 and make it compatible.
We should automatically do a TER upload when a tag was created and all builds are green.
The FlashMessage is not rendered in the backend in TYPO3 8.0
ext_emconf.php should allow installing EXT:tika with TYPO3 9.3 (9.3 is not officially supported yet).
It would be nice to have the possibility to see the extracted content of the file in the backend with one click
Make the Tika backend module work with the new EXT:solr main module added in TYPO3-Solr/ext-solr#1300
In SolrCellService.php a version comparison for Solr > 3.1 leads to an error, because the variable $solrVersion also contains the patch version, which leads to a value like 3.1.21.
That will call the Solr method extractByQuery, which is only available in Solr 4.
The variable $solrVersion should be stripped to the major and minor version number so that the condition is evaluated correctly.
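One way to do that stripping, as a sketch only: the helper name stripToMajorMinor() is hypothetical and not taken from SolrCellService.php.

```php
<?php
/**
 * Hypothetical helper: reduces a version string such as "3.1.21"
 * to its major and minor parts ("3.1").
 */
function stripToMajorMinor(string $version): string
{
    $parts = explode('.', $version);
    $major = (int)($parts[0] ?? 0);
    $minor = (int)($parts[1] ?? 0);
    return $major . '.' . $minor;
}

// With the full version the comparison succeeds, because "3.1.21"
// has an extra component and therefore counts as greater than "3.1".
$rawIsNewer = version_compare('3.1.21', '3.1', '>');

// After stripping, 3.1.x is no longer treated as newer than 3.1.
$strippedIsNewer = version_compare(stripToMajorMinor('3.1.21'), '3.1', '>');
```

A version such as 4.0.0 still passes the comparison after stripping, so only the 3.1.x patch releases change behavior.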
As an integrator I want to have the possibility to extract the content of the files in a zip archive.
Solr Cell check will fail if the handler has been set to lazy startup and has never been used.
Easiest fix would be to try to call /update/extract/ prior to getting the plugins.
That way the handler will be loaded and exist in the plugin list.
Fatal error: Call to undefined method ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::ping() in /..../typo3conf/ext/tika/Classes/Backend/SolrModule/TikaControlPanelModuleController.php on line 242
How to reproduce:
Since PHP 5.4 we can use the short array syntax
$foo = ['a'];
instead of
$foo = array('a');
We should use this where possible to make the code more readable and aligned with ext-solr.
As a follow-up to TYPO3-Solr/ext-solr#852 we should use the ConnectionManager in a way that passes a read and a write connection when we use the extraction on the Solr server.
We should add release notes for EXT:tika 3.0.0 as well.
We should support EXT:solr 9 and use the solarium queries for extraction
Using GeneralUtility::getFileAbsFileName fails in TYPO3 8.0 using second argument, because it returns "" for files outside the PATH_site folder
I use the tika app for heavy extraction work on a website full of media files.
Here, I had to add another memory-expanding argument to the tika app command: -Xmx512M (verbose version: -XX:MaxHeapSize=512m)
This prevents the Java VM error "Could not reserve enough space for object heap".
The tika EXTCONF could offer options like 256m, 512m etc., which would then be applied to the tika app java shell calls.
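As a rough sketch of how such an option could be applied: the function buildTikaCommand() and its parameters are hypothetical and do not reflect the extension's actual code. Only the -Xmx JVM option and the Tika app's -m (metadata) switch are existing flags.

```php
<?php
/**
 * Hypothetical helper: builds the shell command for a Tika app call
 * and adds -Xmx only when the configured value looks like a heap size.
 */
function buildTikaCommand(
    string $javaPath,
    string $tikaJar,
    string $file,
    string $maxHeap = ''
): string {
    $heapOption = '';
    // Accept values such as "256m", "512M" or "1g" and nothing else,
    // so that arbitrary configuration input never reaches the shell.
    if (preg_match('/^\d+[kKmMgG]$/', $maxHeap) === 1) {
        $heapOption = ' -Xmx' . $maxHeap;
    }

    return escapeshellarg($javaPath) . $heapOption
        . ' -jar ' . escapeshellarg($tikaJar)
        . ' -m ' . escapeshellarg($file);
}

$command = buildTikaCommand('java', 'tika-app.jar', 'song.mp3', '512m');
```

Validating the value against a fixed pattern keeps the option compatible with a select list in the extension configuration (256m, 512m, ...) while ignoring anything unexpected.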
Thanks for your continued work!
We want to extend the testsuite and coverage
Hi,
we recently had a virus report from our hoster.
The test documents testWORD.doc and test-documents.tbz2 appear to have an infection with Win.Exploit.CVE_2016_3316-1.
Not sure if this is a false positive but I wanted to let you know.
Cheers,
Alex
Prepare release version 3.1.1
As a workaround I had to add $GLOBALS['TYPO3_CONF_VARS']['SYS']['FileInfo']['fileExtensionToMimeType']['mp3'] = 'audio/mpeg3'; for \ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::getSupportedMimeTypes() to match.
However, Solr then returns the 'audio/mpeg' mime type for the extracted mp3 file, so getSupportedMimeTypes() should rather be extended by adding 'audio/mpeg'.
'audio/mpeg' is RFC-defined: https://tools.ietf.org/html/rfc3003
It is also the first mime type mentioned at Wikipedia "MP3": https://en.wikipedia.org/wiki/MP3
We should use the latest tika version 1.18
Need to add EXT:filemetadata as dependency
It would be very nice if the supportedFileTypes were not hardcoded in the extractors but a list in the extension configuration, since you might have sites where you would like to be able to configure this. I can provide a pull request fixing this, since I now have to XClass the extractors in order to control this.
The connection handling in EXT:solr 8.0.0 was changed to differentiate between read and write connections. We should make the required changes in EXT:tika too and raise the suggested EXT:solr version to 8.0.0
We should:
to have the extension compatible with EXT:solr dev-master for the upcoming 4.0.0 release
Prepare the release of version 2.2
In Apache Solr mode (version 6.6!) the Reports module does not show the right status.
We need to make Tika compatible with Apache Solr 6.6
https://github.com/TYPO3-Solr/ext-tika/blob/release-2.3.x/Classes/Report/TikaStatus.php#L184-L187
can be solved as follows:
if (array_key_exists('/update/extract', $plugins->plugins->QUERYHANDLER)
|| array_key_exists('/update/extract', $plugins->plugins->QUERY)) {
$solrCellConfigurationOk = true;
}
Currently EXT:tika always tries to delete the local temp file, even when it does not exist. We should check if it still exists before trying to unlink it.
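A minimal sketch of such a guard, assuming a hypothetical helper named removeTempFileIfPresent() (not part of EXT:tika):

```php
<?php
/**
 * Hypothetical helper: deletes a temp file only if it still exists.
 * Returns true when a file was actually removed.
 */
function removeTempFileIfPresent(string $path): bool
{
    // is_file() is false for missing paths, so unlink() is never
    // called on a file that is already gone and no warning is raised.
    if (!is_file($path)) {
        return false;
    }
    return unlink($path);
}

// Usage: the second call finds nothing to delete and stays silent.
$tempFile = tempnam(sys_get_temp_dir(), 'tika');
$firstCall = removeTempFileIfPresent($tempFile);
$secondCall = removeTempFileIfPresent($tempFile);
```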
In the Extension Manager configuration view:
If EXT:solr is installed offer a custom field type to select the Solr server connection instead of having to enter the host, port, and path.
Since PHP 5.5 calls like:
\TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Imaging\IconRegistry');
can be changed to:
use \TYPO3\CMS\Core\Imaging\IconRegistry;
\TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(IconRegistry::class);
We should use this where possible to make the code more readable and aligned with ext-solr.