Comments (16)
@thomashohn Thanks, I think this is a good idea. It would be nice to allow a configuration that limits the file extensions that are sent to Tika. If you can provide a patch, that would be nice!
from ext-tika.
@timohund There seem to be some TODOs about fetching this from the Tika server, but I think I would prefer to have the extension configure what to extract and what not?
I can't quite follow here what you want to achieve. With the last release we're querying Tika for supported file types already instead of having a hard-coded list. Am I missing something? Can you point to the concrete code you're referring to?
The files in https://github.com/TYPO3-Solr/ext-tika/tree/master/Classes/Service/Extractor seem pretty hard-coded to me, or am I wrong?
I still don't know what you mean.
I see the MetaDataExtractor changed to fetch data from Tika :-) So if I would like to exclude files, do I need to configure that on the Tika server? For instance, what if I don't want it to extract meta data for images?
It seems it's a really slow Saturday morning for me since I really can't follow you.
- Can you point to concrete lines of code?
- Explain your issue?
- Explain what you want to achieve/change?
- Why?
Of course the EXT:tika fetches meta data from Tika, what else would you expect? It's been like that since forever.
Why wouldn't you want image meta data? That's data such as width, height, exposure, camera, geo location, description...
Please describe it to me in easy language^^ :)
Before the new release:
public function canProcess(File $file)
{
    // TODO use MIME type instead of extension
    // tika.jar --list-supported-types -> cache supported types
    // compare to file's MIME type
    return in_array($file->getProperty('extension'),
        $this->supportedFileTypes);
}
Here $this->supportedFileTypes was a hard-coded array.
Now it's:
public function canProcess(File $file)
{
    $tikaService = $this->getExtractor();
    $mimeTypes = $tikaService->getSupportedMimeTypes();
    return in_array($file->getMimeType(), $mimeTypes);
}
If I don't want to process, say, GIF files, would I need to configure that on the Tika server? Before, I only had to take gif out of the array $this->supportedFileTypes.
So with the new version, I have to be sure my Tika server is configured to only send back the supported MIME types I want to process?
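For illustration, here is a minimal sketch of the kind of extension-side filter being asked about. It assumes a locally configured exclude list, which EXT:tika does not provide; the function name `canProcessMimeType` and the `$excludedMimeTypes` parameter are hypothetical. The supported types would still come from Tika, as in the new canProcess(), and the exclude list is applied on top. It requires PHP 8 for `str_starts_with()` and `str_ends_with()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of EXT:tika: decides whether a file should
 * be handed to Tika, given the MIME types Tika reports as supported and a
 * locally configured exclude list (an assumed setting).
 *
 * @param string   $mimeType           MIME type of the file, e.g. "image/gif"
 * @param string[] $supportedMimeTypes MIME types reported by Tika
 * @param string[] $excludedMimeTypes  Exact types or wildcards such as "image/*"
 */
function canProcessMimeType(string $mimeType, array $supportedMimeTypes, array $excludedMimeTypes): bool
{
    // Tika must support the type at all
    if (!in_array($mimeType, $supportedMimeTypes, true)) {
        return false;
    }

    foreach ($excludedMimeTypes as $excluded) {
        if (str_ends_with($excluded, '/*')) {
            // "image/*" becomes the prefix "image/" and excludes every subtype
            $prefix = substr($excluded, 0, -1);
            if (str_starts_with($mimeType, $prefix)) {
                return false;
            }
        } elseif ($mimeType === $excluded) {
            return false;
        }
    }

    return true;
}

// Example values only; in the extension the list would come from
// getSupportedMimeTypes().
$supported = ['application/pdf', 'image/gif', 'image/jpeg'];

var_dump(canProcessMimeType('image/gif', $supported, ['image/gif']));     // bool(false)
var_dump(canProcessMimeType('application/pdf', $supported, ['image/*'])); // bool(true)
```

Keeping the exclude list on the extension side would mean the Tika server configuration stays untouched, at the cost of one more setting to maintain.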
Ok, clear now, thanks! :)
However, it's still not clear why anyone would want to do that. Also, as you noticed, it had a TODO comment before :) It was a missing feature. As you mentioned, you modified the extension before; it was never something we supported so far. I'm not even sure Tika supports selectively enabling meta data extraction. If it does though, that's where I'd look.
I don't think this should be, or needs to be, something EXT:tika does (for 95% of use cases).
If you buy images from iStock and other companies, the images contain a lot of additional meta information you don't want to extract because it will confuse your users when searching. I'll make a PR anyway since I'll fix it in my own code. Then you can decide if it should be merged into EXT:tika or not ;-)
Hmm, IMO that's usually pretty valuable meta data. Maybe you can provide an example?
Hi - yes I can.
- You have a lot of files with meta data and start to use Solr and Tika: your "old" valuable meta data will be overwritten, which is kind of annoying.
- Meta data in files does not match the kind of meta data you want. For instance, for an iStock photo that could be the title. You might want another title, or want to add info to the title. This is not possible.
I find the PR quite realistic and it comes from a real-world scenario :-)
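The overwrite concern in the first point can be sketched as a merge that only fills empty fields. This is a hypothetical helper under stated assumptions, not something EXT:tika or the TYPO3 core is known to offer: the function name `mergeWithoutOverwriting` and the sample field values are made up for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: merges extracted meta data into existing meta data,
 * keeping every existing value that is already filled in.
 *
 * @param array<string, mixed> $existing  Curated meta data already stored
 * @param array<string, mixed> $extracted Meta data returned by the extractor
 * @return array<string, mixed>
 */
function mergeWithoutOverwriting(array $existing, array $extracted): array
{
    $merged = $existing;
    foreach ($extracted as $field => $value) {
        $current = $existing[$field] ?? null;
        // Only fill fields that are missing or empty in the curated data
        if ($current === null || $current === '') {
            $merged[$field] = $value;
        }
    }
    return $merged;
}

// Made-up sample values for illustration
$existing  = ['title' => 'Team photo 2015', 'description' => ''];
$extracted = ['title' => 'Stock title', 'description' => 'Group of people', 'width' => 1024];

$result = mergeWithoutOverwriting($existing, $extracted);
// $result keeps the curated title, fills the empty description, adds width
```

Whether such a rule belongs in the extension, an add-on, or the core is exactly the open question in this thread; the sketch only shows that the rule itself is small.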
A short side note from me: I see the use case, but I'd rather discuss this with you in the new year. TYPO3 is missing a meta data manager, and therefore curation of meta data could be something that an add-on could offer.
Fine with me. As I said earlier in the thread, I need to make a "fix" in my own code no matter what, since we can't retrieve meta data from image files currently :-)
Ok, I can see your use case (and your pain stemming from it), too now.
Now here's how I see the situation: IMO EXT:tika is a pure utility to extract meta data from files, a utility that is called/used by the TYPO3 core. The tika extension does not know about any existing meta data for a file that you might want to keep. Neither does the extension offer any custom mapping.
The mapping issue can be seen as a missing feature; I believe EXT:extractor offers something like that.
However, the extension's job is to simply provide meta data to the core. On that end I agree with Olivier, that what you describe is rather an issue that falls into the responsibility of the TYPO3 core.
So my suggestion would be: Feel free to open another issue for meta data property mapping, that would actually be useful to have. However, knowing about when to overwrite data in what cases is not (currently) in the domain of EXT:tika.
Advice for filing future issues:
I had to ask multiple times to understand your issue. The easier you can make it for us to understand your situation, the easier it will be for us to help you and/or agree with your issue. You should always provide as much information as possible. Read through this whole convo again and I hope you will see it was not easy to understand why/what issue you had. That saves us both a lot of time.
Fixed in #48