tikadetect's Introduction

About

Simple demo script that shows how the Apache Tika API can be called from Python for mime type detection. The Java API is accessed using PyJnius.

Adapted from:

http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html

Note that this is not intended as a production-ready tool! My main reason for writing this was to get more familiar with PyJnius and the Tika API. So far I've only managed to test this using Python 2.7 running under Linux Mint. Other platforms may work ... or maybe not!

##Installation of PyJnius and its dependencies

This script uses PyJnius to access the Tika Java classes. I found several guides on installing PyJnius and its dependencies, but none of them quite worked for me (python-dev in particular isn't explicitly mentioned anywhere). After some experimentation, the following did the trick for me under Linux Mint (I haven't tried this under Windows yet):

###Step 1: install Cython

sudo apt-get install cython

###Step 2: install python-dev

sudo apt-get install python-dev

###Step 3: clone & install pyjnius

git clone https://github.com/kivy/pyjnius.git
cd pyjnius
sudo python setup.py install

###Step 4: download & install Apache Tika

Download the latest runnable jar from:

https://tika.apache.org/download.html

Then save it wherever you prefer.

Done!

##Configuration

Open config.py in a text editor and update tikaJar to the location of the Tika JAR on your system (see above).
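As a rough illustration, config.py could look something like the sketch below. Only the variable name tikaJar comes from this README; the path is a placeholder, and the check_tika_jar helper is a hypothetical addition that is not part of the script.

```python
import os

# Placeholder path (an assumption): point this at the Tika JAR you downloaded.
tikaJar = "/path/to/tika-app.jar"


def check_tika_jar(path):
    """Return an error message if path is not an existing file, else None."""
    if not os.path.isfile(path):
        return "Tika JAR not found: {0}".format(path)
    return None
```

Calling check_tika_jar(tikaJar) before starting the JVM would give a clearer error than a Java class-loading failure later on.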

##Command line use

###Usage

python tikadetect.py [-h] [--magiconly] directory

This will result in a recursive scan of directory and all its subdirectories. Output is written to stdout, using the following format for each analysed file:

/path/to/file.ext: mimetype
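The recursive scan and the output format can be sketched as follows. This is not the script's actual code: the scan function and its detect parameter are hypothetical names, and guess_by_extension is a stand-in that uses Python's mimetypes module (extension only) so the example runs without Tika.

```python
import mimetypes
import os
import sys


def guess_by_extension(path):
    """Stand-in detector (assumption): extension-based guess, no Tika involved."""
    return mimetypes.guess_type(path)[0] or "application/octet-stream"


def scan(directory, detect, out=sys.stdout):
    """Walk directory recursively and write one 'path: mimetype' line per file."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            out.write("{0}: {1}\n".format(path, detect(path)))
```

For example, scan(".", guess_by_extension) lists every file below the current directory; in the real script the detect step would be a call into Tika instead.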

###Positional arguments

directory: directory that will be analysed

###Optional arguments

-h, --help: show help message and exit

--magiconly: establish mimetype from magic bytes only (ignoring filename extension)

Note that by default mimetype detection is done using a combination of magic bytes and filename extensions (the latter can be disabled using the --magiconly switch).
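A command line like the one described above could be declared with argparse roughly as shown here. The argument names and help texts follow this README; the build_parser function name and the description string are assumptions, and the real script may be organised differently.

```python
import argparse


def build_parser():
    """Build a parser for: tikadetect.py [-h] [--magiconly] directory"""
    parser = argparse.ArgumentParser(
        prog="tikadetect.py",
        description="Detect mimetypes of all files in a directory tree")
    parser.add_argument("directory",
                        help="directory that will be analysed")
    parser.add_argument("--magiconly",
                        action="store_true",
                        help="establish mimetype from magic bytes only "
                             "(ignoring filename extension)")
    return parser
```

With this setup, build_parser().parse_args() yields an object whose magiconly attribute is False unless the switch is given.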

##Documentation of Tika methods

See this link (describes Tika 1.5), and have a look at the detect methods (which are called in the script):

https://tika.apache.org/1.5/api/org/apache/tika/Tika.html
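The following is a minimal sketch of how the detect methods might be called through PyJnius. It has not been tested against a JVM. The function names load_tika and detect_mime are hypothetical, and setting the CLASSPATH environment variable before importing jnius is assumed to be sufficient to make the Tika classes visible. The two overloads used, detect(InputStream) and detect(InputStream, String), are listed in the Javadoc linked above.

```python
import os


def load_tika(tika_jar):
    """Start the JVM with the Tika JAR on the classpath.

    Returns a Tika instance and the java.io.FileInputStream class.
    """
    # The classpath has to be set before jnius is first imported,
    # since that import is what starts the JVM.
    os.environ["CLASSPATH"] = tika_jar
    from jnius import autoclass  # lazy import on purpose

    Tika = autoclass("org.apache.tika.Tika")
    FileInputStream = autoclass("java.io.FileInputStream")
    return Tika(), FileInputStream


def detect_mime(tika, open_stream, path, magic_only=False):
    """Return the mimetype Tika reports for the file at path."""
    stream = open_stream(path)
    try:
        if magic_only:
            # Magic bytes only: Tika sees the stream but not the file name.
            return tika.detect(stream)
        # Magic bytes plus the file name (and thus its extension).
        return tika.detect(stream, os.path.basename(path))
    finally:
        stream.close()
```

Typical use would be tika, FileInputStream = load_tika(tikaJar), followed by detect_mime(tika, FileInputStream, path) for each file. Passing the stream factory as an argument keeps the detection logic independent of PyJnius.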

tikadetect's People

Contributors

bitsgalore


tikadetect's Issues

No container aware detection!

As it is now the script doesn't support container-aware detection!

As far as I understand it from the docs this would be just a matter of using TikaInputStream instead of FileInputStream, but I just couldn't get this to work! Not sure if this is due to a bug in PyJnius or my lack of knowledge of Java. (BTW with container-aware detection results get a lot better for e.g. MS Office and OpenDocument formats).
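For reference, the approach described in this issue would look roughly like the sketch below. It is untested, and since the issue reports that this could not be made to work through PyJnius, it should be read as a starting point for experiments rather than a fix. TikaInputStream.get(File) is a static factory method in the Tika 1.x API; the helper names load_container_classes and detect_container_aware are hypothetical.

```python
import os


def load_container_classes(tika_jar):
    """Return a Tika instance, the java.io.File class and TikaInputStream.get."""
    os.environ["CLASSPATH"] = tika_jar  # must happen before jnius is imported
    from jnius import autoclass

    Tika = autoclass("org.apache.tika.Tika")
    File = autoclass("java.io.File")
    TikaInputStream = autoclass("org.apache.tika.io.TikaInputStream")
    return Tika(), File, TikaInputStream.get


def detect_container_aware(tika, make_file, get_stream, path):
    """Detect the mimetype from a TikaInputStream instead of a FileInputStream."""
    # A file-backed TikaInputStream lets Tika's detectors look inside
    # container formats (for example ZIP- or OLE2-based documents).
    stream = get_stream(make_file(path))
    try:
        return tika.detect(stream, os.path.basename(path))
    finally:
        stream.close()
```

If PyJnius fails to pick the right overload of TikaInputStream.get, that may be the same problem described above.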
