
Klogproc


Klogproc is a utility for processing/archiving logs generated by applications run by the Institute of the Czech National Corpus.

In general, Klogproc reads an application-specific log record format from a file or a Redis queue, parses individual lines and converts them into a target format which is then stored to ElasticSearch or InfluxDB (both can be used at the same time).

Klogproc replaces LogStash as an alternative that is less demanding in terms of resources and runtime environment. All the processing (reading, writing, handling multiple files) is performed concurrently, which makes it quite fast.

Overview

Supported applications

| Name | config code | note |
|------|-------------|------|
| Akalex | akalex | a Shiny app with a custom log (*) |
| APIGuard | apiguard | CNC's internal API proxy and watchdog |
| Calc | calc | a Shiny app with a custom log (*) |
| Gramatikat | gramatikat | a Shiny app with a custom log (*) |
| KonText | kontext | |
| KorpusDB | korpus-db | |
| Kwords | kwords | |
| Lists | lists | a Shiny app with a custom log (*) |
| Mapka | mapka | using Nginx/Apache access log |
| Morfio | morfio | |
| MQuery-SRU | mquery-sru | a Clarin FCS endpoint (JSONL log) |
| QuitaUP | quita-up | a Shiny app with a custom log (*) |
| SkE | ske | using Nginx/Apache access log |
| SyD | syd | a custom app log |
| Treq | treq | a custom app log |
| WaG | wag | web access log, currently without user credentials |

(*) All the Shiny apps use the same log format.

The program supports three operation modes: batch, tail and redis.

Batch processing of a directory or a file

For non-regular imports, e.g. when migrating older data, the batch mode allows importing multiple files from a single directory. The contents of the directory can even change over time as newer log records are added; klogproc imports only the new items because it keeps a worklog with the newest record processed so far.
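Running the batch mode then amounts to pointing klogproc at a configuration file containing a logFiles section (see the batch example in the Time-zone notes below). A minimal sketch, assuming the subcommand is named after the mode (as with the tail subcommand used in the systemd unit below):

klogproc batch /usr/local/etc/klogproc.json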

Batch processing of a Redis queue (deprecated)

Note: On the application side, this is currently supported only in KonText and SkE (via a special Python module scripts/redislog.py, which is part of the klogproc project).

In this case, an application writes its log to a Redis queue (list type) and klogproc regularly takes N items from the queue (removing them in the process), transforms them and stores them to the specified destinations.
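The pattern can be illustrated with plain Redis commands (a generic sketch only; the key name and record format below are hypothetical, not klogproc's actual ones):

redis-cli RPUSH app_log_queue '{"datetime": "2024-01-01T10:00:00", "action": "search"}'   # application side: append a record
redis-cli LRANGE app_log_queue 0 99    # consumer side: read a chunk of N=100 oldest items...
redis-cli LTRIM app_log_queue 100 -1   # ...and remove the processed items from the queue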

Tail-like listening for changes in multiple files

This is the mode that replaces CNC's LogStash solution and it is the typical mode to use. One or more log file listeners can be configured to read newly added lines. The log files are checked at regular intervals (i.e. changes are not detected immediately). Klogproc remembers the current inode and seek position for each watched file, so it should be able to continue after outages etc. (as long as the log files have not been overwritten in the meantime due to log rotation).
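For illustration, remembering a position in a log file boils down to storing the file's inode and byte offset. A minimal Go sketch of the idea (not klogproc's actual code; the path and offset below are made up):

package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("/var/log/ucnk/syd.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// the inode identifies the file even after it has been renamed by log rotation
	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	inode := info.Sys().(*syscall.Stat_t).Ino // Linux-specific

	// continue from the byte offset remembered in a worklog (a made-up value here)
	var storedOffset int64 = 1024
	if _, err := f.Seek(storedOffset, io.SeekStart); err != nil {
		panic(err)
	}
	fmt.Printf("watching inode %d, continuing at offset %d\n", inode, storedOffset)
}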

Installation

Install the Go language if it is not already available on your system.

Clone the klogproc project:

git clone https://github.com/czcorpus/klogproc.git

Build the project:

go build

Copy the binary somewhere:

sudo cp klogproc /usr/local/bin

Create a config file (e.g. in /usr/local/etc/klogproc.json):

{
  "logPath": "/var/log/klogproc/klogproc.log",
  "logTail": {
    "intervalSecs": 15,
    "worklogPath": "/var/opt/klogproc/worklog-tail.log",
    "files": [
      {"path": "/var/log/ucnk/syd.log", "appType": "syd"},
      {"path": "/var/log/treq/treq.log", "appType": "treq"},
      {"path": "/var/log/ucnk/morfio.log", "appType": "morfio"},
      {"path": "/var/log/ucnk/kwords.log", "appType": "kwords", "tzShift": -120}
      {"path": "/var/log/wag/current.log", "appType": "wag", "version": "0.7"}
    ]
  },
  "elasticSearch": {
    "majorVersion": 6,
    "server": "http://elastic:9200",
    "index": "app",
    "pushChunkSize": 500,
    "scrollTtl": "3m",
    "reqTimeoutSecs": 10
  },
  "geoIPDbPath": "/var/opt/klogproc/GeoLite2-City.mmdb",
  "anonymousUsers": [0, 1, 2]
}

notes:

  • Do not forget to create the directories for logging and the worklog, and also to download and save the GeoLite2-City database.
  • The tzShift applied to the kwords app is just an example; it should be applied only if the stored datetime values carry an incorrect time-zone (e.g. they look like UTC times but the actual values represent local time) - see the section Time-zone notes for more info.

Configure systemd (/etc/systemd/system/klogproc.service):

[Unit]
Description=A custom agent for collecting UCNK apps logs
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/klogproc tail /usr/local/etc/klogproc.json
User=klogproc
Group=klogproc

[Install]
WantedBy=multi-user.target

Reload systemd config:

systemctl daemon-reload

Start the service:

systemctl start klogproc
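To start the service automatically on boot, enable it as well:

systemctl enable klogproc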

Time-zone notes

Klogproc treats each log type individually when parsing, but it converts all timestamps to UTC. In case an application stores incorrect values (e.g. missing time-zone info even though the time values are actually non-UTC), the tzShift setting can be used; it defines the number of minutes klogproc should add to (or subtract from) the logged values.

For the tail action, the config is as follows:

{
  "logTail": {
    "intervalSecs": 5,
    "worklogPath": "/path/to/tail-worklog",
    "numErrorsAlarm": 0,
    "errCountTimeRangeSecs": 15,
    "files": [
        {
          "path": "/path/to/application.log",
          "appType": "korpus-db",
          "tzShift": 120
        }
    ]
  }
}

For the batch mode, the config is like this:

{
  "logFiles": {
    "appType": "korpus-db",
    "worklogPath": "/path/to/batch-worklog",
    "srcPath": "/path/to/log/files/dir",
    "tzShift": 120,
    "partiallyMatchingFiles": false
  }
}

ElasticSearch compatibility notes

Because ElasticSearch underwent some backward-incompatible changes between versions 5.x and 6.x, the configuration contains the majorVersion key which specifies how klogproc stores the data.

ElasticSearch 5

This version supports multiple data types ("mappings") per index, which was also the default approach to storing CNC applications: a single index with multiple document types (one per application). In this case, the configuration directive elasticSearch.index directly specifies the index name klogproc works with. Individual document types can be distinguished either via the internal ES _type property or via the normal type property created by klogproc.
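For illustration, a search limited to a single application could then filter on that property (a hypothetical query, not taken from the project; the index name is made up):

POST /cnc_logs/_search
{
  "query": {"term": {"type": "kontext"}}
}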

ElasticSearch 6

Here, multiple data mappings per index have been removed. In this case, klogproc uses its elasticSearch.index key as a prefix for the index name created for each individual application. E.g. index = "log_archive" with configured "treq" and "morfio" apps expects you to have two indices: log_archive_treq and log_archive_morfio. Please note that klogproc does not create the indices for you. The type property is still present in documents.
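Since the indices have to exist beforehand, they can be created e.g. via the ElasticSearch REST API (assuming the server address from the configuration above and the log_archive prefix from this example; in practice you will likely want to provide explicit mappings as well):

curl -XPUT http://elastic:9200/log_archive_treq
curl -XPUT http://elastic:9200/log_archive_morfio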

InfluxDB notes

InfluxDB is a purely time-based database focused on processing (mostly numerical) measurements. Compared with ElasticSearch, its search capabilities are limited, so it cannot be understood as a possible replacement for ElasticSearch. With a configured InfluxDB output, klogproc can be used to add some more useful data to existing measurements generated by other applications (typically Telegraf, Netdata).

Please note that the InfluxDB output is not currently used in production.

