esfstats's Introduction

EFRE-Lod logo

esfstats - elasticsearch fields statistics

esfstats is a command-line tool (Python3 program) that extracts statistics regarding field coverage from an elasticsearch index.

Usage

esfstats
        required arguments:
          -index INDEX  elasticsearch index to use (default: None)
          -type TYPE    elasticsearch index (document) type to use (default: None)

        optional arguments:
          -h, --help    show this help message and exit
          -host HOST    hostname or IP address of the elasticsearch instance to use (default: localhost)
          -port PORT    port of the elasticsearch instance to use (default: 9200)
          -marc         ignore MARC indicator, i.e., combine only MARC tag + MARC code (valid/applicable for input generated with help of xbib/marc (https://github.com/xbib/marc) or input MARC JSON records that follow this structure) (default: False)
          -csv-output   prints the output as pure CSV data (all values are quoted)
                        (default: False)
  • example:
    esfstats -host [HOSTNAME OF YOUR ELASTICSEARCH INSTANCE] -index [YOUR ELASTICSEARCH INDEX] -type [DOCUMENT TYPE OF THE ELASTICSEARCH INDEX] > [OUTPUT STATISTICS DOCUMENT]
    
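For instance, assuming a local elasticsearch instance and an index named 'myindex' with document type 'mydoc' (both placeholders), a concrete call could look like this:

    esfstats -host localhost -port 9200 -index myindex -type mydoc -csv-output > field_statistics.csv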

Note

When using this command with the '-marc' argument, the input JSON records need to be generated with the help of xbib/marc (e.g. via marc2jsonl), or they need to at least follow that structure; otherwise the result will be unexpected behaviour.

Requirements

elasticsearch-py

e.g.

apt-get install python3-elasticsearch

Run

  • install elasticsearch-py
  • clone this git repo or just download the esfstats.py file
  • run ./esfstats.py
  • for a hackish way to use esfstats system-wide, copy to /usr/local/bin
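
A possible sequence of these steps, assuming the repository lives at github.com/slub/esfstats and a local elasticsearch instance (index and type names are placeholders):

    pip3 install elasticsearch
    git clone https://github.com/slub/esfstats.git
    cd esfstats
    ./esfstats.py -host localhost -port 9200 -index myindex -type mydoc > stats.txt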

Install system-wide via pip

  • via pip:
    sudo -H pip3 install --upgrade [ABSOLUTE PATH TO YOUR LOCAL GIT REPOSITORY OF ESFSTATS]
    
    (which makes esfstats available as a system-wide command-line command)

Description

(of the column headers of a resulting statistic)

... in English

existing

  • number of records that contain this field (path), i.e., field coverage

%

  • ^ percentage of 'existing'
  • (existing / Total Records * 100)

notexisting

  • number of records that do not contain this field (path)

!%

  • ^ percentage of 'notexisting'
  • (notexisting / Total Records * 100)
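  • example (made-up numbers): 1000 records in total, 750 of which contain the field, gives existing = 750, % = 75.0, notexisting = 250 and !% = 25.0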

occurrence

  • total count of occurrences of this field (path) over all records, i.e., an indicator for fields where multiple values are allowed

unique (appr.)

  • number of unique/distinct values of this field (path), i.e., cardinality
  • note: this value is approximate

field name

  • the field (path) of this statistic line
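
To make the relation between these columns concrete, here is a minimal, hypothetical sketch (not the actual esfstats implementation, written against the elasticsearch-py 7.x API) that gathers the corresponding numbers for a single field; host, index and field names are placeholders and the (deprecated) document type is omitted:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])

    index = "myindex"   # placeholder: the elasticsearch index to analyse
    field = "field1"    # placeholder: the field (path) of interest

    # total number of records in the index
    total = es.count(index=index)["count"]

    # 'existing': number of records that contain the field at all
    existing = es.count(index=index,
                        body={"query": {"exists": {"field": field}}})["count"]

    # 'occurrence' and 'unique (appr.)' via aggregations on the keyword sub-field
    aggs = es.search(index=index, body={
        "size": 0,
        "aggs": {
            "occurrence": {"value_count": {"field": field + ".keyword"}},
            "unique": {"cardinality": {"field": field + ".keyword"}},
        },
    })["aggregations"]

    print("existing:       ", existing)
    print("%:              ", existing / total * 100)
    print("notexisting:    ", total - existing)
    print("!%:             ", (total - existing) / total * 100)
    print("occurrence:     ", aggs["occurrence"]["value"])
    print("unique (appr.): ", aggs["unique"]["value"])
    print("field name:     ", field)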

... in German

Explanation of the column headers

existing

  • indicates how many fields with this path exist.

%

  • 'existing' as a percentage
  • existing / Total Records * 100

notexisting

  • indicates how many records do not have this path

!%

  • 'notexisting' as a percentage
  • notexisting / Total Records * 100

occurrence

  • indicates how many values exist for this path (multiple values per record are possible)

unique (appr.)

  • indicates how many unique values can be found in this path
  • note: this value is only approximated, i.e., it may be inaccurate

field name

  • the path to the analysed values

esfstats's People

Contributors

boerni667, toschilling, zazi


esfstats's Issues

mark values for 'occurrence' and 'unique (appr.)' as 'n/a' for upper paths

currently, those values are set to '0'; however, the analysis as it is done right now for this case is wrong (i.e. there are no *.keyword fields available for upper paths), hence it might be better to write 'n/a' into these cells. furthermore, calculating the occurrence and/or cardinality for those upper paths does not really make sense in general, i.e. these statistics are only interesting for concrete values, not for sub-trees of a record (as is the case for upper-path analysis).

add controlfields statistics to output for -marc option

currently, controlfields are skipped when processing with the -marc option. however, it would be useful to add them to the output (this probably requires determining the keyword fields, i.e. processing the MARC tag field when it is a keyword field).

how to deal with paths that end up with '.keyword' or that are 'keyword'

'keyword' fields are usually created at indexing time, e.g. 'field1.keyword'.

the algorithm iterates recursively over all paths in the mapping. for creating the aggregations (cardinality, terms count), the 'keyword' fields are utilised by default. however, there can be cases where a path already ends with '.keyword'; the algorithm then tries to generate the statistics for the upper path, again with '.keyword' appended. this usually results in an error à la "Fielddata is disabled on text fields by default. Set fielddata=true on [*.keyword] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead." (where * is the upper path).

this issue is also somehow related to issue #3
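
For illustration, a minimal, hypothetical request of the kind described above (index and field names are placeholders, elasticsearch-py 7.x API assumed): aggregating directly on a text field raises the quoted error, while aggregating on its 'keyword' sub-field works.

    from elasticsearch import Elasticsearch
    from elasticsearch.exceptions import RequestError

    es = Elasticsearch()

    try:
        # raises RequestError ("Fielddata is disabled on text fields by default. ...")
        # if 'field1' is mapped as a text field
        es.search(index="myindex", body={
            "size": 0,
            "aggs": {"unique": {"cardinality": {"field": "field1"}}},
        })
    except RequestError as e:
        print(e.info["error"]["root_cause"][0]["reason"])

    # works, because 'field1.keyword' is a keyword sub-field created at indexing time
    es.search(index="myindex", body={
        "size": 0,
        "aggs": {"unique": {"cardinality": {"field": "field1.keyword"}}},
    })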
