tikadetect's Introduction

About

Simple demo script to demonstrate how the Apache Tika API can be called from Python for doing mime type detection. Access to the Java API is done using PyJnius.

Adapted from:

http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html

Note that this is not intended as a production-ready tool! My main reason for writing this was to get more familiar with PyJnius and the Tika API. So far I've only tested this with Python 2.7 under Linux Mint. Other platforms may work ... or maybe not!

##Installation of PyJnius and its dependencies

This script uses PyJnius for accessing the Tika Java classes. I found several guides on the installation of PyJnius and its dependencies, and none of them quite worked for me (python-dev in particular isn't explicitly mentioned anywhere). After some experimentation the following did the trick for me under Linux Mint (I haven't tried it under Windows yet):

###Step 1: install Cython

sudo apt-get install cython

###Step 2: install python-dev

sudo apt-get install python-dev

###Step 3: clone & install pyjnius

git clone https://github.com/kivy/pyjnius.git
cd pyjnius
sudo python setup.py install

###Step 4: download & install Apache Tika

Download the latest runnable jar from:

https://tika.apache.org/download.html

Then save it wherever you prefer.

Done!

##Configuration

Open config.py in a text editor and update tikaJar to the location of the Tika JAR on your system (see above).
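As an illustration, the relevant part of config.py might look like the sketch below. Only the tikaJar name comes from this README; the path is a placeholder, and the tika_jar_exists helper is a hypothetical addition for checking the setting, not part of the actual script.

```python
import os

# Placeholder path (assumption): replace with wherever you saved the Tika JAR.
tikaJar = "/path/to/tika-app.jar"


def tika_jar_exists(path=tikaJar):
    """Return True if the given path points at an existing file."""
    return os.path.isfile(path)
```

Calling tika_jar_exists() before starting a scan is a cheap way to catch a wrong path early, rather than hitting a Java class-loading error later.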

##Command line use

###Usage

python tikadetect.py [-h] [--magiconly] directory

This will result in a recursive scan of directory and all its subdirectories. Output is written to stdout, using the following format for each analysed file:

/path/to/file.ext: mimetype
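Because each output line has the same shape, it is easy to post-process. The following is a minimal sketch, assuming the output was saved or piped as plain text lines; parse_line and count_mimetypes are hypothetical helper names, not part of tikadetect. Splitting from the right keeps paths that themselves contain ": " intact.

```python
from collections import Counter


def parse_line(line):
    """Split one '/path/to/file.ext: mimetype' line into (path, mimetype)."""
    # Split on the last ': ' so that a path containing ': ' is not cut short.
    path, mimetype = line.rstrip("\n").rsplit(": ", 1)
    return path, mimetype


def count_mimetypes(lines):
    """Count how many files were reported for each mimetype."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[parse_line(line)[1]] += 1
    return counts
```

For example, count_mimetypes(open("output.txt")) would give a per-mimetype tally for a scan whose output was redirected to a (hypothetical) file output.txt.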

###Positional arguments

directory: directory that will be analysed

###Optional arguments

-h, --help: show help message and exit

--magiconly: establish mimetype from magic bytes only (ignoring filename extension)

Note that by default mimetype detection is done using a combination of magic bytes and filename extensions (the latter can be disabled using the --magiconly switch).
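To make the two detection modes concrete, here is a rough sketch of how they could map onto Tika's detect methods through PyJnius. It is not taken from tikadetect.py: make_tika_detector and scan are hypothetical names, and the choice of detect(File) for the default mode and detect(InputStream) for the magic-bytes-only mode is an assumption based on the Tika 1.5 API documentation.

```python
import os


def make_tika_detector(tika_jar, magic_only=False):
    """Return a function that maps a file path to a mimetype string."""
    # The classpath has to be set before jnius is imported, because
    # importing jnius starts the JVM.
    os.environ["CLASSPATH"] = tika_jar
    from jnius import autoclass  # lazy import: needs PyJnius and a JVM

    Tika = autoclass("org.apache.tika.Tika")
    File = autoclass("java.io.File")
    FileInputStream = autoclass("java.io.FileInputStream")
    tika = Tika()

    def detect(path):
        if magic_only:
            # A bare stream carries no filename, so only content is used.
            stream = FileInputStream(path)
            try:
                return tika.detect(stream)
            finally:
                stream.close()
        # A File carries its name, so content and extension are both used.
        return tika.detect(File(path))

    return detect


def scan(directory, detect):
    """Yield one 'path: mimetype' line per file below directory."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            yield "%s: %s" % (path, detect(path))
```

Keeping the detector separate from the directory walk means scan can be tried out with any function that takes a path and returns a string, without needing a JVM.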

##Documentation of Tika methods

See this link (describes Tika 1.5), and have a look at the detect methods (which are called in the script):

https://tika.apache.org/1.5/api/org/apache/tika/Tika.html
