
Apache HTTPD & NGINX access log parser


This is a log parsing framework intended to make parsing Apache HTTPD and NGINX access log files much easier.

The basic idea is that you construct a parser by simply telling it with which configuration options the lines were written. These configuration options are the schema of the access log lines.

So we use the LogFormat that wrote the file as the input parameter for the parser that reads that same file. In addition to the config options specified in the Apache HTTPD manual under Custom Log Formats, the following shorthands are also recognized:

  • common
  • combined
  • combinedio
  • referer
  • agent

For Nginx the log_format tokens are specified in http://nginx.org/en/docs/http/ngx_http_log_module.html#log_format and http://nginx.org/en/docs/http/ngx_http_core_module.html#variables.

Special notes about the Apache HTTPD token %{format}t

Quote from Apache HTTPD manual

%{format}t: The time, in the form given by format, which should be in strftime(3) format. (potentially localized)
  • Version 2.5 and before: It cannot be extracted. A simple workaround for this limitation: replace the %{...}t with %{timestamp}i. You will then get this timestamp field as if it were a request header: HTTP.HEADER:request.header.timestamp
  • Version 2.6 and newer: You will receive it as a textual TIME.LOCALIZEDSTRING:request.header.time which cannot be extracted any further.
  • Version 3.0 and newer: Supports parsing the customized time as long as all elements can be mapped to fields supported by joda-time. This means that many fields are supported, but not all. Check the implementation of the StrfTimeStampDissector class to see which are and which are not.
  • Version 4.0 and newer: Switched to parsing using the native Java 8 time library, which supports a few fields differently. See StrfTimeToDateTimeFormatter.

Limitation: Only a single %{format}t entry is supported per line. Examples as described in the LogFormat examples section of the Apache HTTPD manual cannot be parsed.

You can, however, use the %{format}t directive multiple times to build up a single time value, using extended format tokens like msec_frac:
Timestamp including milliseconds
         "%{%d/%b/%Y %T}t.%{msec_frac}t %{%z}t"

In this case, where all %{format}t fields are separated only by fixed text, you can rewrite the example like this:

"%{%d/%b/%Y %T}t.%{msec_frac}t %{%z}t"
"%{%d/%b/%Y %T.msec_frac %z}t"

Although the latter is NOT supported by Apache HTTPD, it IS supported by this logparser, so the above works as expected.
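
A minimal Java sketch of using such a rewritten logformat (the record class and the requested field are illustrative choices, not prescribed by this README):

    import nl.basjes.parse.core.Field;
    import nl.basjes.parse.core.Parser;
    import nl.basjes.parse.httpdlog.HttpdLoglineParser;

    public class TimeFormatDemo {
        public static class Record {
            @Field("TIME.EPOCH:request.receive.time.epoch")
            public void setEpoch(String epoch) {
                System.out.println("Epoch (ms): " + epoch);
            }
        }

        public static void main(String[] args) throws Exception {
            // The rewritten single-token form described above.
            Parser<Record> parser =
                new HttpdLoglineParser<>(Record.class, "%{%d/%b/%Y %T.msec_frac %z}t");
            parser.parse(new Record(), "25/Dec/2017 00:00:00.123 +0100");
        }
    }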

Analyze almost anything

I wrote this parser for practical real-life situations. In reality a lot happens that is not allowed by the official specifications, yet in production it does happen. So several of the key parts in this parser try to recover from bad data where possible, and thus allow extracting as much useful information as possible even if the data is not valid. Important examples of this are invalid encoding characters and chopped multibyte encoded characters, both of which are extracted as well as possible.

If you have a real log line that causes a parse error then I kindly request that you submit this line, the logformat and the field that triggered the error as a bug report.

Pre built versions

Prebuilt versions have been deployed to Maven Central, so using this library in a Java based project is as simple as adding this to your dependencies:

<dependency>
    <groupId>nl.basjes.parse.httpdlog</groupId>
    <artifactId>httpdlog-parser</artifactId>
    <version>5.11.0</version>
</dependency>

Building

Simply type mvn package and the whole thing should build.

Java, Apache {Hadoop, Hive, Drill, Flink, Beam}

I'm a big user of big data tools like Apache Hadoop, Hive, etc. So this project also contains a Hadoop InputFormat and a Hive/HCatalog SerDe that are wrappers around this library.

Usage (Overview)

The framework needs two things:

  • The format specification with which the logfile was written (straight from the original Apache HTTPD config file).
  • The identifiers for the fields that you want.

To obtain all the identifiers the system CAN extract from the specified logformat, a separate developer call exists in the various languages that gives you the list of all possible values.
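
For Java, a minimal sketch (assuming the getPossiblePaths() developer call described in the Java documentation of this project):

    import nl.basjes.parse.core.Parser;
    import nl.basjes.parse.httpdlog.HttpdLoglineParser;

    public class ListPossibleFields {
        public static void main(String[] args) throws Exception {
            // A dummy parser against the "combined" shorthand logformat.
            Parser<Object> dummy = new HttpdLoglineParser<>(Object.class, "combined");
            dummy.getPossiblePaths().forEach(System.out::println);
        }
    }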

Languages and Tools

The languages that are supported in this version:

Prebuilt plugins for these are provided in the distribution:

For tools like Apache Flink and Beam there is only example code that is also used to verify that the build still works on those systems.

Tools that ship a version of this parser in their distribution

Internal structure and type remapping

The basic model of this system is a tree. Each node in the tree has a 'type' and a 'name'. The 'type' is really a 'what is the format of this string' indicator. Because there are many more of those kinds of types than your average String or Long you will see a lot of different names. The 'name' is the "breadcrumb" towards the point in the tree where this value is located.

A 'Dissector' is a class that can cut a specific type (format) into a bunch of new parts that each extend the base name and have their own type. Because the internal parser is constructed at the start of a run, this tree has some dynamic properties. To start with, the tree is only constructed for the elements actually requested. This is done to avoid 'dissecting' something that is not wanted. So the parser will have a different structure depending on the requested output.

These dynamic properties also allow 'mapping a field to a different type'. Let's illustrate this with the most common use case. Assume you are trying to parse the log line for a pixel that was written by a web analytics product. In that scenario it is common that the URL is that of a pixel and one of the query string parameters contains the actual URL. Now by default a query string parameter gets the type STRING (which really means that it is arbitrary and cannot be dissected any further). Using this remapping (see the API details per language) we can now say that a specific query string parameter really has the type HTTP.URI. As a consequence the system can now continue dissecting this specific query string parameter into things like the host, port and query string parameters.

All that is needed to map the 'g' and 'r' parameters so they are dissected further is this:

Java: Call these against the parser instance right after construction

parser.addTypeRemapping("request.firstline.uri.query.g", "HTTP.URI", Casts.STRING_ONLY);
parser.addTypeRemapping("request.firstline.uri.query.r", "HTTP.URI", Casts.STRING_ONLY);

Hive: Add these to the SERDEPROPERTIES

"map:request.firstline.uri.query.g"="HTTP.URI",
"map:request.firstline.uri.query.r"="HTTP.URI",

Special Dissectors

mod_unique_id

If you have a log field / request header that gets filled using mod_unique_id you can now peek inside the values that were used to construct this.

NOTE: https://httpd.apache.org/docs/current/mod/mod_unique_id.html clearly states

 it should be emphasized that applications should not dissect the encoding.
 Applications should treat the entire encoded UNIQUE_ID as an opaque token,
 which can be compared against other UNIQUE_IDs for equality only.

If you choose to ignore this clear 'should not' statement, then simply add a type remapping that maps the field to the type MOD_UNIQUE_ID.
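
In Java this mirrors the addTypeRemapping calls shown earlier; the field name below is hypothetical and depends on how the UNIQUE_ID ended up in your logformat:

    // Hypothetical field name; MOD_UNIQUE_ID is the target type named above.
    parser.addTypeRemapping("request.header.unique_id", "MOD_UNIQUE_ID", Casts.STRING_ONLY);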

GeoIP parsing

Head for the separate README file for information about this dissector.

Parsing problems with Jetty generated logfiles

In Jetty there is the option to create a logfile in what they call the NCSARequestLog format. It was found that (historically) this had two formatting problems which cause parse errors:

  1. If the useragent is missing the empty value is logged with an extra ' ' after it. The fix for this in Jetty was committed on 2016-07-27 in the Jetty 9.3.x and 9.4.x branches
  2. Before jetty-9.2.4.v20141103 if there is no user available the %u field is logged as " - " (i.e. with two extra spaces around the '-').

To work around these problems you can easily start the parser with a two line logformat:

ENABLE JETTY FIX
%h %l %u %t \"%r\" %>s %b "%{Referer}i" "%{User-Agent}i" %D

This ENABLE JETTY FIX is a 'magic' value that causes the underlying parser to enable the workaround for both of these problems. For this to work correctly the useragent field must look exactly like this: "%{User-Agent}i"
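
A minimal Java sketch of how this could be passed, assuming the two lines are joined with a newline character (the joining convention is an assumption here, not stated in this README):

    // Sketch: the two-line Jetty-fix logformat joined with '\n'.
    String logformat = "ENABLE JETTY FIX\n"
        + "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D";
    Parser<LogRecord> parser = new HttpdLoglineParser<>(LogRecord.class, logformat);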

Donations

If this project has business value for you then don't hesitate to support me with a small donation.


License

Apache HTTPD & NGINX Access log parsing made easy
Copyright (C) 2011-2023 Niels Basjes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


logparser's Issues

Parsing hours in different timezone

By using docker I found that if the build system is in a different timezone than 'Central Europe' the build will fail due to the hours returned by the timestamp parsing being off.

Custom timeformat doesn't work

This test should pass, but it fails on the regex that is used for %t:

    @Test
    public void dissectCustomTimeFormatWithMissingTimeZone(){
        DissectorTester.create()
            .withDissector(new HttpdLogFormatDissector("%t"))
            .withDissector(new TimeStampDissector("TIME.STAMP", "yyyy-MM-dd HH:mm:ss"))
            .withInput("2017-12-25 00:00:00")
            .expect("TIME.EPOCH:request.receive.time.epoch", "1514160000000")
            .checkExpectations();
    }

Option to call setter only when value is non-null

I am using logparser in an Apache Beam Project.

Apache Beam does not like null values, so having an option to call the setter only when the value is non-null would be helpful. Otherwise one has to write a custom setter with a null check for every field.
It's triggered when - is parsed to null: when the variable holding this value was initialized with an empty string, it becomes null after deserialization.
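
The setterPolicy mechanism shown in the 'Error occurred during setter call' issue further down addresses this; a sketch of that annotation form:

    // setterPolicy = NOT_NULL: the setter is only called when the value is non-null.
    @Field(value = "STRING:connection.client.user", setterPolicy = NOT_NULL)
    public void setUser(String user) {
        this.user = user;
    }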

Fix RPMs

Seems the RPMs package the core jar files without the required dependencies.

diff --git a/httpdlog/httpdlog-serde/pom.xml b/httpdlog/httpdlog-serde/pom.xml
index ea5ed63..12897f1 100644
--- a/httpdlog/httpdlog-serde/pom.xml
+++ b/httpdlog/httpdlog-serde/pom.xml
@@ -88,12 +88,12 @@
               </requires>
               <mappings>
                 <mapping>
-                  <directory>/usr/lib/hive/lib</directory>
+                  <directory>/opt/${project.build.finalName}/hive/lib</directory>
                   <username>root</username>
                   <groupname>root</groupname>
                   <sources>
                     <source>
-                      <location>target/${project.build.finalName}.jar</location>
+                      <location>target/${project.build.finalName}-job.jar</location>
                     </source>
                   </sources>
                 </mapping>

Enhancement: Validate format & message

Hello,
Could the parser validate a message against a format and report whether it is valid, either in a true/false or a true/exception manner?

Case:

  1. common format & common message --> returns true
  2. combined format & combined message --> returns true
  3. common format & combined message --> returns false
  4. combined format & common message --> returns false
  5. common format & custom message --> return false
  6. combined format & custom message --> return false
    etc?

Thanks,
Vn

Whitespaces in Apache HTTPD DateTime format

I am trying to parse the following Apache HTTPD Logformat:
%a %l %u %{%F}t %{%H}t:%{%M}t:%{%S}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" t=%D

Since it has more than one %{format}t entry I translated this into:

%a %l %u %{%F %H:%M:%S}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" t=%D

However, a log line like:

192.168.85.3 - - 2017-12-25 00:00:00 "GET /up.html HTTP/1.0" 203 8 "-" "HTTP-Monitor/1.1" "-" t=4920

fails. When I remove the whitespace in my teststring and adjust the format string like this:
%a %l %u %{%F.%H:%M:%S}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" t=%D

192.168.85.3 - - 2017-12-25.00:00:00 "GET /up.html HTTP/1.0" 203 8 "-" "HTTP-Monitor/1.1" "-" t=4920

I get past parsing, but the following exception is thrown:

nl.basjes.parse.core.exceptions.DissectionFailure: Text '2017-12-25.00:00:00' could not be parsed: Unable to obtain ZonedDateTime from TemporalAccessor: {},ISO resolved to 2017-12-25.00:00:00 of type java.time.format.Parsed

	at nl.basjes.parse.httpdlog.dissectors.TimeStampDissector.dissect(TimeStampDissector.java:391)

How do I add a whitespace to my DateTime pattern? Must my log contain timezone information, or can I pass this in somehow?

Add mid star support

Add support for fields that have a '*' (=star) in the middle, like this:
"STRING:response.cookies.*.value",

Unique instances of the dissectors

In the tree the instances of the dissectors should be unique because that can improve performance in case the same dissector is used twice.

Add Jetty NCSARequestLog Parser

Hi, I'm using this great tool in my work to parse logs. It seems that current parsers can't parse Jetty's NCSA Request Log.
For example:

"%{User-Agent}i" %D

will fail to parse a record when the user agent is null ("-" in the log), because one extra blank is appended after "-". Thus the record will be

"-"  %D

Note there are two blanks between "-" and %D. The same problem also occurs for %h and %l when their values are "-".

Maybe it's necessary to add a parser for the NCSA Request Log and make this tool even more powerful.

Update version in documentation

Consider updating the Maven example fragment shown in README and README-Java to the latest version (i.e. <version>2.3</version> instead of <version>2.0</version>) to ensure that people use the latest version even if they don't explicitly check Maven Central for what it is.

add name change interface

Hi~

I am using logparser well.
But there is one inconvenience.

That is, if the information parsed via @Field("PossiblePathName") is parsed as name/value pairs, such as a Map, then PossiblePathName is not a useful key name.

A typical example is URL query parameter parsing.

@Field ("STRING:request.firstline.orifinal.uri.query.*")
public void addQuery (String name, String value) {
...
}

The results are shown below.

{
STRING:request.firstline.original.uri.query.param1 = xxxxxx,
STRING:request.firstline.original.uri.query.param2 = xxxxxx
}

Here, I wonder if there is a way to get the original parameter name (param1) as the key instead of "request.firstline.original.uri.query.param1", or whether an interface is provided separately.

Also, for @Field() with multiple PossiblePathNames:

@Field({"TIME.DATE:request.receive.time.date:paramname1", "TIME.DATE:request.receive.time.time:paramname2"})
method {
}

I would like to have the ability to specify the name in this form.

thanks....

Wrong regex for timestamp parser

Hi

First of all, tnx for the great library.

But when I used the logparser in java, using an old log file from 1995 I couldn't get the parsing to work properly.
After a long time of debugging and trying I finally found that the year in a request's timestamp has to begin with a 2. It's hardcoded in the timestamp regex.

parsers.add(new TokenParser("%t", "request.receive.time", "TIME.STAMP", Casts.STRING_ONLY, "\\[[0-3][0-9]/(?:[A-Z][a-z][a-z])/2[0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [\\+|\\-][0-9][0-9][0-9]0\\]"));

Maybe there's a reason why you did this?
Otherwise it would be cool to make it work for years before 2000 ;)

Glenn

[PIG] Allow multiple values

For some values (like query string parameters) you can have multiple with the same name.
For those types we should always return a set of values (even if there is only one).

Add full support for modifiers ( < and > and HTTP status )

http://httpd.apache.org/docs/current/mod/mod_log_config.html#customlog

Modifiers
Particular items can be restricted to print only for responses with specific HTTP status codes by placing a comma-separated list of status codes immediately following the "%". The status code list may be preceded by a "!" to indicate negation.

  • %400,501{User-agent}i: Logs User-agent on 400 errors and 501 errors only. For other status codes, the literal string "-" will be logged.
  • %!200,304,302{Referer}i: Logs Referer on all requests that do not return one of the three specified codes, "-" otherwise.
The modifiers "<" and ">" can be used for requests that have been internally redirected to choose whether the original or final (respectively) request should be consulted. By default, the % directives %s, %U, %T, %D, and %r look at the original request while all others look at the final request. So for example, %>s can be used to record the final status of the request and %<u can be used to record the original authenticated user on a request that is internally redirected to an unauthenticated resource.

JodaTime fails when parsing field 'MMM' with mix of upper and lower case

The default timestamp in the Common Log format is "dd/MMM/yyyy:HH:mm:ss z", where MMM={JAN,FEB,...DEC}. JodaTime will fail, however, if you use a differently capitalized month, for example 30/Sep/2016:00:00:06 +0000, where Sep is unrecognized by JodaTime. It seems to be expected behaviour. See http://joda-interest.219941.n2.nabble.com/JodaTime-1-5-2-DateTimeFormatter-and-case-sensitivity-td364187.html

Fix

Add uppercase/lowercase to fieldValue @ master/TimeStampDissector.java:311
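
A sketch of that suggested fix (illustrative, not the actual committed code): title-case the month abbreviation before handing the value to the case-sensitive formatter:

    // "30/SEP/2016:00:00:06 +0000" -> "30/Sep/2016:00:00:06 +0000"
    String normalized = fieldValue.substring(0, 3)
        + fieldValue.substring(3, 4).toUpperCase()
        + fieldValue.substring(4, 6).toLowerCase()
        + fieldValue.substring(6);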

Kubernetes NGINX Ingress Controller default logformat

Add the default logformat of the Kubernetes NGINX Ingress Controller as specified here
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/log-format/

log_format upstreaminfo
    '{{ if $cfg.useProxyProtocol }}$proxy_protocol_addr{{ else }}$remote_addr{{ end }} - '
    '[$the_real_ip] - $remote_user [$time_local] "$request" '
    '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
    '$request_length $request_time [$proxy_upstream_name] $upstream_addr '
    '$upstream_response_length $upstream_response_time $upstream_status';

Note that there is no way to determine from an actual log line whether the IP was the $proxy_protocol_addr or the $remote_addr.
So define two shortcuts (one for each).

ApacheHttpdLoglineParser thread safety

Hi,

I am interested in using your library in our Spark deployment. For simple testing, the parser works great for our Apache logs.

One question I have is about thread safety. I can see your library being used in Hadoop/Hive/Pig, which normally parse the records sequentially within a JVM process. Do you know whether the "ApacheHttpdLoglineParser" parse method is thread safe or not? Spark will indeed run the parsing logic in a multi-threaded way, which could cause issues for MR based code logic.

Thanks
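
Since thread safety is not documented, a conservative approach (an assumption, not a statement from the author) is to use one parser instance per thread, for example via a ThreadLocal:

    // Sketch: one parser per thread; LogRecord and LOG_FORMAT are illustrative names.
    private static final ThreadLocal<Parser<LogRecord>> PARSER =
        ThreadLocal.withInitial(() -> new HttpdLoglineParser<>(LogRecord.class, LOG_FORMAT));

    // In each worker thread (parse declares checked exceptions):
    LogRecord record = PARSER.get().parse(new LogRecord(), line);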

URI with double # gives parse error

@Test
public void testDoubleHashes() throws Exception {
    DissectorTester.create()
        .withDissector(new HttpUriDissector())
        .withInput("https://www.basjes.nl/#foo#bar")
        .withInput("https://www.basjes.nl/path/?s2a=&Referrer=ADV1234#product_title&f=API&subid=?s2a=#product_title&name=12341234")
        .withInput("https://www.basjes.nl/path/?Referrer=ADV1234#&f=API&subid=#&name=12341234")
        .withInput("https://www.basjes.nl/path?sort&#x3D;price&filter&#x3D;new&sortOrder&#x3D;asc")
        .withInput("https://www.basjes.nl/login.html?redirectUrl=https%3A%2F%2Fwww.basjes.nl%2Faccount%2Findex.html&_requestid=1234#x3D;12341234&Referrer&#x3D;ENTblablabla")
        .expect("HTTP.HOST:host", "www.basjes.nl")
        .checkExpectations();
}

List of formats and data types

Hi Niels,
I'm working on adapting your parser for Apache Drill and I was wondering if there is a list somewhere of the fields that the parsers supports and the data types?
Thanks,

Javadoc bug

[WARNING] Javadoc Warnings
[WARNING] /home/niels/workspace/logparser/parser-core/src/main/java/nl/basjes/parse/core/Disector.java:62: warning - Tag @link: can't find addDisection(String, String, String, String, java.util.EnumSet) in nl.basjes.parse.core.Parsable

No stacktrace

Ensure there are NO calls to e.printStackTrace();

Improve parsing URIs

Although the URI class rejects them, these discussions seem to indicate that a URI with a '[' in it should be valid:
http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid
http://stackoverflow.com/questions/11038967/brackets-in-a-request-url-are-legal-but-not-in-a-uri-java

Create Nginx parser

Goal: Support parsing Nginx logfiles in the same way we can parse the Apache accesslog files.

http://nginx.org/en/docs/http/ngx_http_log_module.html#log_format

The log format can contain common variables, 

Apparently these: http://nginx.org/en/docs/http/ngx_http_core_module.html#variables

and variables that exist only at the time of a log write:

$bytes_sent
the number of bytes sent to a client
$connection
connection serial number
$connection_requests
the current number of requests made through a connection (1.1.18)
$msec
time in seconds with a milliseconds resolution at the time of the log write
$pipe
“p” if request was pipelined, “.” otherwise
$request_length
request length (including request line, header, and request body)
$request_time
request processing time in seconds with a milliseconds resolution; time elapsed between the first
bytes were read from the client and the log write after the last bytes were sent to the client
$status
response status
$time_iso8601
local time in the ISO 8601 standard format
$time_local
local time in the Common Log Format

And special names for the response headers:

Header lines sent to a client have the prefix “sent_http_”, for example, $sent_http_content_range.

A basic example to start with:

The configuration always includes the predefined “combined” format:

log_format combined '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent"';

Undeclared dependency on Joda time

It seems that httpdlog-parser version 2.3 has an undeclared dependency on joda-time. Consider adding it as an explicit dependency (so that Maven automatically includes it whenever httpdlog-parser is used), or better yet, drop the dependency entirely and migrate to the java.time implementation provided by Java 8 (which is very close to joda-time and is always available).
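
Until such a change, a workaround sketch is to declare joda-time explicitly next to httpdlog-parser, mirroring the dependency block earlier in this README (the version number here is only an example):

    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>2.9.9</version>
    </dependency>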

Error occurred during setter call: No setter called for key

In Snapshot 4.1 - after using the new setterPolicy = NOT_NULL on a field like:

@Field(value = "NUMBER:connection.client.logname", setterPolicy = NOT_NULL)
  public void setRemoteLogName(String remoteLogName) {
    this.remoteLogName = remoteLogName;
  } 

the following error is thrown when the field is - (null) in the log being processed:

Error occurred during setter call: No setter called for  key = "NUMBER:connection.client.logname"  name = "NUMBER:connection.client.logname"  value = "Value{filled=STRING, s='null', l=null, d=null}"

Usage Question

I am testing with version 3.0 and I have the Apache log format like following:
log.format=%a %{Host}i %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i" %{Content-length}i %P %A

It will generate the following list, based on the example given here: https://github.com/nielsbasjes/logparser/blob/master/README-Java.md

STRING:connection.client.user
IP:connection.server.ip
TIME.STAMP:request.receive.time.last
TIME.DAY:request.receive.time.last.day
TIME.MONTHNAME:request.receive.time.last.monthname
TIME.MONTH:request.receive.time.last.month
TIME.WEEK:request.receive.time.last.weekofweekyear
TIME.YEAR:request.receive.time.last.weekyear
TIME.YEAR:request.receive.time.last.year
TIME.HOUR:request.receive.time.last.hour
TIME.MINUTE:request.receive.time.last.minute
TIME.SECOND:request.receive.time.last.second
TIME.MILLISECOND:request.receive.time.last.millisecond
TIME.DATE:request.receive.time.last.date
TIME.TIME:request.receive.time.last.time
TIME.ZONE:request.receive.time.last.timezone
TIME.EPOCH:request.receive.time.last.epoch
TIME.DAY:request.receive.time.last.day_utc
TIME.MONTHNAME:request.receive.time.last.monthname_utc
TIME.MONTH:request.receive.time.last.month_utc
TIME.WEEK:request.receive.time.last.weekofweekyear_utc
TIME.YEAR:request.receive.time.last.weekyear_utc
TIME.YEAR:request.receive.time.last.year_utc
TIME.HOUR:request.receive.time.last.hour_utc
TIME.MINUTE:request.receive.time.last.minute_utc
TIME.SECOND:request.receive.time.last.second_utc
TIME.MILLISECOND:request.receive.time.last.millisecond_utc
TIME.DATE:request.receive.time.last.date_utc
TIME.TIME:request.receive.time.last.time_utc
HTTP.URI:request.referer
HTTP.PROTOCOL:request.referer.protocol
HTTP.USERINFO:request.referer.userinfo
HTTP.HOST:request.referer.host
HTTP.PORT:request.referer.port
HTTP.PATH:request.referer.path
HTTP.QUERYSTRING:request.referer.query
STRING:request.referer.query.*
HTTP.REF:request.referer.ref
TIME.STAMP:request.receive.time
TIME.DAY:request.receive.time.day
TIME.MONTHNAME:request.receive.time.monthname
TIME.MONTH:request.receive.time.month
TIME.WEEK:request.receive.time.weekofweekyear
TIME.YEAR:request.receive.time.weekyear
TIME.YEAR:request.receive.time.year
TIME.HOUR:request.receive.time.hour
TIME.MINUTE:request.receive.time.minute
TIME.SECOND:request.receive.time.second
TIME.MILLISECOND:request.receive.time.millisecond
TIME.DATE:request.receive.time.date
TIME.TIME:request.receive.time.time
TIME.ZONE:request.receive.time.timezone
TIME.EPOCH:request.receive.time.epoch
TIME.DAY:request.receive.time.day_utc
TIME.MONTHNAME:request.receive.time.monthname_utc
TIME.MONTH:request.receive.time.month_utc
TIME.WEEK:request.receive.time.weekofweekyear_utc
TIME.YEAR:request.receive.time.weekyear_utc
TIME.YEAR:request.receive.time.year_utc
TIME.HOUR:request.receive.time.hour_utc
TIME.MINUTE:request.receive.time.minute_utc
TIME.SECOND:request.receive.time.second_utc
TIME.MILLISECOND:request.receive.time.millisecond_utc
TIME.DATE:request.receive.time.date_utc
TIME.TIME:request.receive.time.time_utc
HTTP.URI:request.referer.last
HTTP.PROTOCOL:request.referer.last.protocol
HTTP.USERINFO:request.referer.last.userinfo
HTTP.HOST:request.referer.last.host
HTTP.PORT:request.referer.last.port
HTTP.PATH:request.referer.last.path
HTTP.QUERYSTRING:request.referer.last.query
STRING:request.referer.last.query.*
HTTP.REF:request.referer.last.ref
NUMBER:connection.server.child.processid
BYTES:response.bytes
BYTESCLF:response.bytes
HTTP.HEADER:request.header.content-length
IP:connection.client.ip.last
HTTP.USERAGENT:request.user-agent.last
STRING:request.status.last
HTTP.USERAGENT:request.user-agent
STRING:connection.client.user.last
IP:connection.client.ip
HTTP.HEADER:request.header.host
HTTP.FIRSTLINE:request.firstline.original
HTTP.METHOD:request.firstline.original.method
HTTP.URI:request.firstline.original.uri
HTTP.PROTOCOL:request.firstline.original.uri.protocol
HTTP.USERINFO:request.firstline.original.uri.userinfo
HTTP.HOST:request.firstline.original.uri.host
HTTP.PORT:request.firstline.original.uri.port
HTTP.PATH:request.firstline.original.uri.path
HTTP.QUERYSTRING:request.firstline.original.uri.query
STRING:request.firstline.original.uri.query.*
HTTP.REF:request.firstline.original.uri.ref
HTTP.PROTOCOL_VERSION:request.firstline.original.protocol
HTTP.PROTOCOL:request.firstline.original.protocol
HTTP.PROTOCOL.VERSION:request.firstline.original.protocol.version
BYTES:response.bytes.last
BYTESCLF:response.bytes.last
HTTP.FIRSTLINE:request.firstline
HTTP.METHOD:request.firstline.method
HTTP.URI:request.firstline.uri
HTTP.PROTOCOL:request.firstline.uri.protocol
HTTP.USERINFO:request.firstline.uri.userinfo
HTTP.HOST:request.firstline.uri.host
HTTP.PORT:request.firstline.uri.port
HTTP.PATH:request.firstline.uri.path
HTTP.QUERYSTRING:request.firstline.uri.query
STRING:request.firstline.uri.query.*
HTTP.REF:request.firstline.uri.ref
HTTP.PROTOCOL_VERSION:request.firstline.protocol
HTTP.PROTOCOL:request.firstline.protocol
HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
NUMBER:connection.server.child.processid.last
IP:connection.server.ip.last

Here is one sample log:
184.105.247.196 - - [03/Apr/2017:03:27:28 -0600] "GET /havimusor/month.calendar/2006/10/28/67.html HTTP/1.1" 404 419 "-" "-" - 115052 173.254.111.153

And here is my test code:

public class LogRecord {
    String clientIp;
    String customerHostName;
    String remoteUser;
    String dateTimeS;
    String httpRequest;
    String httpStatusCode;
    String bytesSent;
    String url;
    String userAgent;
    String contentLength;
    String serverProcessId;
    String customerDedicatedIP;

    public String getClientIp() {
        return clientIp;
    }

    @Field("IP:connection.client.ip")
    public void setClientIp(String clientIp) {
        this.clientIp = clientIp;
    }

    public String getCustomerHostName() {
        return customerHostName;
    }

    @Field("HTTP.HEADER:request.header.host")
    public void setCustomerHostName(String customerHostName) {
        this.customerHostName = customerHostName;
    }

    public String getRemoteUser() {
        return remoteUser;
    }

    @Field("STRING:connection.client.user")
    public void setRemoteUser(String remoteUser) {
        this.remoteUser = remoteUser;
    }

    public String getDateTimeS() {
        return dateTimeS;
    }

    @Field("TIME.STAMP:request.receive.time")
    public void setDateTimeS(String dateTimeS) {
        this.dateTimeS = dateTimeS;
    }

    public String getHttpRequest() {
        return httpRequest;
    }

    @Field("HTTP.FIRSTLINE:request.firstline")
    public void setHttpRequest(String httpRequest) {
        this.httpRequest = httpRequest;
    }

    public String getHttpStatusCode() {
        return httpStatusCode;
    }

    @Field("STRING:request.status.last")
    public void setHttpStatusCode(String httpStatusCode) {
        this.httpStatusCode = httpStatusCode;
    }

    public String getBytesSent() {
        return bytesSent;
    }

    @Field("BYTES:response.bytes")
    public void setBytesSent(String bytesSent) {
        this.bytesSent = bytesSent;
    }

    public String getUrl() {
        return url;
    }

    @Field("HTTP.URI:request.referer")
    public void setUrl(String url) {
        this.url = url;
    }

    public String getUserAgent() {
        return userAgent;
    }

    @Field("HTTP.USERAGENT:request.user-agent")
    public void setUserAgent(String userAgent) {
        this.userAgent = userAgent;
    }

    public String getContentLength() {
        return contentLength;
    }

    @Field("HTTP.HEADER:request.header.content-length")
    public void setContentLength(String contentLength) {
        this.contentLength = contentLength;
    }

    public String getServerProcessId() {
        return serverProcessId;
    }

    @Field("NUMBER:connection.server.child.processid")
    public void setServerProcessId(String serverProcessId) {
        this.serverProcessId = serverProcessId;
    }

    public String getCustomerDedicatedIP() {
        return customerDedicatedIP;
    }

    @Field("IP:connection.server.ip.last")
    public void setCustomerDedicatedIP(String customerDedicatedIP) {
        this.customerDedicatedIP = customerDedicatedIP;
    }

    public LogRecord() { }
}

Here is the issue for the httpRequest field. I may have some log entries like the following example:
184.105.247.196 - - [03/Apr/2017:03:27:28 -0600] "\x16\x03\x01" 404 419 "-" "-" - 115052 173.254.111.153

And the above code will throw the following error:
The input line does not match the specified log format.Line : Value{filled=STRING, s='184.105.247.196 - - [03/Apr/2017:03:27:28 -0600] "\x16\x03\x01" 404 419 "-" "-" - 115052 173.254.111.153', l=null, d=null}
LogFormat: %a %{Host}i %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i" %{Content-length}i %P %A
RegEx : ^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|:?(?:[0-9a-fA-F]{1,4}(?::|.)?){0,8}(?::|::)?(?:[0-9a-fA-F]{1,4}(?::|.)?){0,8}|-)\Q \E(.)\Q \E(.)\Q [\E([0-3][0-9]/(?:[a-zA-Z][a-zA-Z][a-zA-Z])/[1-9][0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [+|-][0-9][0-9][0-9][0-9])\Q] "\E((?:[a-zA-Z-_]+ .(?: HTTP/[0-9]+.[0-9]+)?)|-)\Q" \E([^\s])\Q \E([0-9]|-)\Q "\E(.)\Q" "\E(.)\Q" \E(.)\Q \E([0-9]*)\Q \E((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|:?(?:[0-9a-fA-F]{1,4}(?::|.)?){0,8}(?::|::)?(?:[0-9a-fA-F]{1,4}(?::|.)?){0,8}|-)$

The problem is that I have tried almost all the other formats, including the following:
STRING:request.firstline.uri.query.*
STRING:request.firstline.original.uri.query.*
HTTP.FIRSTLINE:request.firstline.original
etc

Some of them don't produce an error, but then they no longer catch the good requests, like "GET /havimusor/month.calendar/2006/10/28/67.html HTTP/1.1" in the example.

I wonder what format I should specify to get both "GET /havimusor/month.calendar/2006/10/28/67.html HTTP/1.1" and "\x16\x03\x01" in this field.

The 2nd question: we have 12 fields in the log and, for now, I just want to catch these 12 fields. We have another implementation using REGEX, but as the data cases are complex enough, I would like to reuse your library. The issue is that I tested with 2.1M rows, and the performance differs by about 10x. It looks like internally you also use REGEX. I wonder if I misuse your library anywhere, leading to such a performance difference.

Here is my sample test code of using the above log record object:

        // (Completing the fragment: the original program creates the parser and
        // record first; logFormat holds the LogFormat string shown above.)
        Parser<LogRecord> parser = new HttpdLoglineParser<>(LogRecord.class, logFormat);
        LogRecord record = new LogRecord();
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        String line;
        long totalRecordCnt = 0;
        long starting = System.currentTimeMillis();
        while ((line = br.readLine()) != null) {
            totalRecordCnt++;
            try {
                record = parser.parse(record, line);
            } catch (Throwable t) {
                System.err.println(t.getMessage());
            }
        }
        br.close();
        System.out.println("Parsing took " + (System.currentTimeMillis() - starting) + " mills, with " + totalRecordCnt + " total records!");

For this library, it took around 160s to 200s for 2146891 records (68MB), while using a plain REGEX to parse the same data takes about 13s to 15s. I want to confirm that what I did is the correct way to use this library, and I DO want to get all 12 fields out, as this logic is in our Spark ETL.

How to set DateTimeFormatter.Local?

Exception in thread "main" nl.basjes.parse.core.exceptions.DissectionFailure: Invalid format: "28/feb/2017:03:39:40 +0800" is malformed at "feb/2017:03:39:40 +0800"

because the log was written on a server whose DateTime locale differs from that of the computer on which I am analyzing the log.

Parse or ignore timestamp

As

%{format}t : The time, in the form given by format, which should be in strftime(3) format. (potentially localized)

How can I parse or ignore the timestamp if it's not in the standard English format?

Thanks.

Timestamp format Error should be a Warning

This error message should be a warning:

The timestamp format "%F %H:%M:%S.%usec_frac" does NOT contain a timezone so we assume "Z".

Otherwise, it can have the effect that a pipeline gets canceled since it looks like something really bad happened.

Handle numerical values

Update the entire framework to allow outputting numerical values.
This would mean that whether something is numerical or a string is defined by the producing dissector (a sketch of the receiving side follows the list below).
Only a few basic types are needed:

  • String
  • Long
  • Double
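
A hedged sketch of what the receiving side could then look like (aspirational, not the API at the time of this issue): setters typed as Long or Double instead of String:

    // Hypothetical typed setter once numerical output is supported.
    @Field("BYTESCLF:response.bytes")
    public void setResponseBytes(Long bytes) {
        this.bytes = bytes;
    }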

Parse error on android-app://... URI

As described here:

We find referrer values that look like these:

android-app://com.google.android.googlequicksearchbox
android-app://com.google.android.googlequicksearchbox/https/www.google.com

Which result in

nl.basjes.parse.core.exceptions.DissectionFailure: Unable to parse the URI: >>>android-app://com.google.android.googlequicksearchbox<<< (unknown protocol: android-app)
nl.basjes.parse.core.exceptions.DissectionFailure: Unable to parse the URI: >>>android-app://com.google.android.googlequicksearchbox/https/www.google.com<<< (unknown protocol: android-app)

The place of the error is:
at nl.basjes.parse.httpdlog.dissectors.HttpUriDissector.dissect(HttpUriDissector.java:138)

url = new URL(fieldValue);
which fails with a MalformedURLException with the message "unknown protocol: android-app".
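
For context (an illustration of the cause, not the project's fix): java.net.URL refuses any scheme it has no registered protocol handler for, while java.net.URI only checks the generic URI syntax:

    import java.net.URI;

    // new URL("android-app://...") throws MalformedURLException ("unknown protocol"),
    // but URI accepts any syntactically valid scheme (throws URISyntaxException on bad syntax):
    URI uri = new URI("android-app://com.google.android.googlequicksearchbox");
    System.out.println(uri.getScheme()); // android-app
    System.out.println(uri.getHost());   // com.google.android.googlequicksearchbox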

Parser breaking with 408 status code in mod_reqtimeout

The parser breaks when parsing a mod_reqtimeout entry with a 408 status code.

Example line:

187.41.80.255 - - [05/Aug/2016:09:32:06 -0300] "-" 408 - "-" "-" 0/14

Example LogFormat string:

String localLogFormat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T/%D";

I will push to my repository a test case.

Allow lenient parsing at the end of the log line

I was having issues parsing lines like this until I made the change in the attached patch (I masked the IPs):

*.*.*.* - - [31/Jul/2016:03:07:00 -0500] 4679405 4 "GET /admin/cc_info.php?done=1&mv_action=done&_r=106 HTTP/1.1" 200 2618 "https://*.*.*/admin/cc_info.php?done=1&mv_action=done&_r=106" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
I changed line 162 of TokenFormatDissector: regex.append("$.*");
lenient_eol_parsing_patch.txt

Could this be integrated?

Add feature to support multiple logformats

Idea:
When changing the logformat it would be helpful if the parser simply supported multiple logformats.
On the first record it should try each of them until a matching format is found; from there it should use that format for all remaining records until a parse error occurs. Then it should retry all possible formats and only fail if all of them fail.

To be considered: are the output parameters constrained by what all formats support? Or simply assume 'absent' (i.e. null) if a field is not possible in either the old or the new format...
