A simple parser using regular expressions.
- Plugin type: parser
- Guess supported: yes
- regex: regular expression; it must use named capturing groups (string, required)
- columns: column definitions (list of objects)
- regex_name: a named capturing group's name can only contain [a-zA-Z0-9], so a column can use this attribute to name the group in regex that it reads from (string, default: the column's name attribute value)
- skip_if_unmatch: if false, a line that doesn't match the regex raises a RuntimeException. If true, the line is skipped. (boolean, default: false)
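The interaction between regex, regex_name and skip_if_unmatch can be sketched in plain Java. This is only an illustration, not the plugin's implementation: the class RegexLineSketch, the helper parseLine and the shortened two-group regex are hypothetical, and whole-line matching with matches() is an assumption. It relies on java.util.regex, where group names are limited to ASCII letters and digits, which is why a column such as remote_host needs an alias like remoteHost.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of embulk-parser-regex.
class RegexLineSketch {
    // columns maps a column name to the capturing group it reads from
    // (the role played by regex_name; it defaults to the column name).
    static Map<String, String> parseLine(Pattern pattern, Map<String, String> columns,
                                         String line, boolean skipIfUnmatch) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) { // assumption: the whole line must match
            if (skipIfUnmatch) {
                return null; // skip_if_unmatch: true -> the line is skipped
            }
            // skip_if_unmatch: false -> fail loudly
            throw new RuntimeException("line does not match regex: " + line);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : columns.entrySet()) {
            record.put(column.getKey(), m.group(column.getValue()));
        }
        return record;
    }

    public static void main(String[] args) {
        // "(?<remote_host>...)" would be rejected by Pattern.compile,
        // so the group is named remoteHost and aliased to the column.
        Pattern p = Pattern.compile("^(?<remoteHost>[.:0-9]+) (?<status>[0-9]+)$");
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("remote_host", "remoteHost"); // regex_name: remoteHost
        columns.put("status", "status");          // regex_name omitted

        System.out.println(parseLine(p, columns, "127.0.0.1 200", false));
        System.out.println(parseLine(p, columns, "not a log line", true));
    }
}
```

With skipIfUnmatch set to false, the second call would throw instead of returning null.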
in:
  type: any file input plugin type
  parser:
    type: regex
    regex: ^(?<remoteHost>[.:0-9]+) (?<identity>\S+) (?<user>\S+) \[(?<datetime>[^\]]*)\] "((?<method>\S+) (?<path>\S+) (?<protocol>HTTP/\d+\.\d+)|-)" (?<status>[0-9]+) (?<size>[0-9]+|-) "(?<referer>[^"]*)" "(?<userAgent>[^"]*)" (?<inByte>[0-9]+) (?<outByte>[0-9]+)$
    columns:
      - {name: remote_host, type: string, regex_name: remoteHost}
      - {name: identity, type: string}
      - {name: user, type: string}
      - {name: datetime, type: timestamp, format: '%d/%b/%Y:%H:%M:%S %z'}
      - {name: method, type: string}
      - {name: path, type: string}
      - {name: protocol, type: string}
      - {name: status, type: long}
      - {name: size, type: long}
      - {name: referer, type: string}
      - {name: user_agent, type: string, regex_name: userAgent}
      - {name: in_byte, type: long, regex_name: inByte}
      - {name: out_byte, type: long, regex_name: outByte}
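To check what the regex in this configuration captures, it can be tried outside Embulk. The sketch below is a hypothetical standalone test, not plugin code: the class AccessLogSketch, its methods and the sample log line are invented for illustration, and the java.time pattern is an assumed equivalent of the '%d/%b/%Y:%H:%M:%S %z' format, not necessarily how the plugin parses timestamps. Only a few of the columns are extracted to keep it short.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of embulk-parser-regex.
class AccessLogSketch {
    // The regex from the configuration, escaped as a Java string literal.
    static final Pattern LINE = Pattern.compile(
        "^(?<remoteHost>[.:0-9]+) (?<identity>\\S+) (?<user>\\S+) \\[(?<datetime>[^\\]]*)\\] "
        + "\"((?<method>\\S+) (?<path>\\S+) (?<protocol>HTTP/\\d+\\.\\d+)|-)\" "
        + "(?<status>[0-9]+) (?<size>[0-9]+|-) \"(?<referer>[^\"]*)\" \"(?<userAgent>[^\"]*)\" "
        + "(?<inByte>[0-9]+) (?<outByte>[0-9]+)$");

    // Assumed java.time counterpart of '%d/%b/%Y:%H:%M:%S %z'.
    static final DateTimeFormatter TIMESTAMP =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Returns selected columns keyed by column name, or null if the line does not match.
    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> record = new LinkedHashMap<>();
        record.put("remote_host", m.group("remoteHost"));
        record.put("datetime", m.group("datetime"));
        record.put("method", m.group("method"));
        record.put("path", m.group("path"));
        record.put("status", m.group("status"));
        record.put("user_agent", m.group("userAgent"));
        return record;
    }

    static Instant parseTimestamp(String text) {
        return OffsetDateTime.parse(text, TIMESTAMP).toInstant();
    }

    public static void main(String[] args) {
        // Invented sample line in a combined-style format with two trailing byte counters.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /index.html HTTP/1.0\" 200 2326 \"-\" \"Mozilla/5.0\" 512 1024";
        Map<String, String> record = parse(line);
        System.out.println(record);
        System.out.println(parseTimestamp(record.get("datetime")));
    }
}
```

Note that the size group also accepts "-", so a column typed as long may need extra care for such lines; how the plugin handles that case is not covered here.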
Some Apache LogFormats can be guessed.
After writing the in: section, you can let Embulk guess the parser: section using these commands:
$ embulk gem install embulk-parser-regex
$ embulk guess -g regex config.yml -o guessed.yml
$ ./gradlew gem # -t to watch change of files and rebuild continuously
$ ./gradlew check