
vrl's Introduction

Vector Remap Language (VRL)


VRL is a scripting language for processing observability data (logs, metrics, traces). Although VRL was originally created for use in Vector, it was designed to be generic and reusable in many contexts.

Features

VRL is broken up into multiple components, which can be enabled as needed.

| Feature | Default | Description |
| --- | --- | --- |
| compiler | yes | Contains the core functionality of VRL: compiling and running VRL programs. |
| parser | yes | Creates an abstract syntax tree (AST) from VRL source code. |
| value | yes | Contains the primary data type used in VRL. |
| diagnostic | yes | Logic related to errors and displaying information about them. |
| path | yes | Contains the parser, data types, and functions related to VRL paths. |
| stdlib | yes | All of the VRL functions from the standard library. |
| core | yes | Various data structures and utility methods (these may be renamed / moved in the future). |
| datadog_filter | yes | Implements the Datadog log search query filter syntax. |
| datadog_grok | yes | Implements the Datadog grok parser (used with parse_grok and parse_groks in the stdlib). |
| datadog_search | yes | Implements the Datadog log search syntax. |
| cli | no | Contains functionality to create a CLI for VRL. |
| test_framework | no | Contains the test framework for testing VRL functions. Useful for testing custom functions. |
| lua | no | Makes the Value type compatible with the mlua crate. |
| arbitrary | no | Implements Arbitrary (from the quickcheck crate) for the Value type. |

Webassembly

All of the core features and most of the standard library functions can be compiled with the wasm32-unknown-unknown target. A few stdlib functions are unsupported; they will still compile, but abort at runtime.

Unsupported functions:

  • parse_grok
  • parse_groks
  • log
  • get_hostname
  • reverse_dns


vrl's Issues

`unnest` should unnest one level deeper

As noted on discord.

Say you have an event of:

{ "things": [{ "id": 1, "thing": "thong" }, { "id": 2, "thang": "thung" }], "timestamp": "monday" }

and you run:

. = unnest(.things)

This gives you an event array of:

[{ "things": { "id": 1, "thing": "thong" }, "timestamp": "monday" }, 
 { "things": { "id": 2, "thang": "thung" }, "timestamp": "monday" }]

It seems much more likely that the user is going to want the result to be:

[{ "id": 1, "thing": "thong", "timestamp": "monday" }, 
 { "id": 2, "thang": "thung", "timestamp": "monday" }]

We should either just change unnest to work this way, or perhaps make it parameterised so the user has an option as to which way it should work.
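The deeper-unnest behavior requested above can be sketched in Python (a hypothetical reference for the semantics, not the VRL implementation):

```python
def unnest_deep(event, key):
    """Produce one event per element of event[key], merging each element's
    fields into a copy of the event's remaining fields."""
    rest = {k: v for k, v in event.items() if k != key}
    return [{**rest, **item} for item in event[key]]

event = {
    "things": [{"id": 1, "thing": "thong"}, {"id": 2, "thang": "thung"}],
    "timestamp": "monday",
}
print(unnest_deep(event, "things"))
```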

Address UX differences between VRL REPL and remap transform

Specifically, assigning a value to a target lets the compiler infer the type of that value, which means error handling isn't needed (and is, in fact, rejected). At runtime in the remap transform, however, the value type might be unknown, so a program tested in the REPL may not be valid in the remap transform.

I don't think we should "just" make the REPL forget the types of assigned values, but I'm also not sure yet what the correct solution is. We'll have to experiment a bit to find the right balance.

New `to_hive_partition` Remap function

Creating Hive partition strings is very common when writing to file-like storages (such as aws_s3). Unfortunately, creating these partition strings is fraught with foot-guns. To protect users from these issues we should offer a function that makes this task easy.

Examples

Given this event for all examples:

{
	"timestamp": "2021-01-14T21:26:45.433667Z",
	"application_id": 2,
	"environment": "production"
}

Array

And this Remap script:

to_hive_partition([.environment, .application_id, .timestamp], limit: 256)

Would produce this string:

environment=production/application_id=2/timestamp_year=2021/timestamp_month=01/timestamp_day=14/

Notice that:

  1. The keys reflect the path names (curious if this is possible)
  2. The timestamp is opinionated: truncated by the day and split into 3 partitions

Map

And this Remap script:

to_hive_partition({"env": .environment, "app": .application_id, "ts": .timestamp}, limit: 256)

Would produce this string:

env=production/app=2/ts_year=2021/ts_month=01/ts_day=14/

Notice that the map keys are used as the names

Single value

And this Remap script:

to_hive_partition(.timestamp, limit: 256)

Would produce this string:

timestamp_year=2021/timestamp_month=01/timestamp_day=14/

Requirements

  • Key/value pairs should be delimited by /
  • Keys and values should be delimited by =
  • Keys and values must be URL encoded so as not to conflict with the reserved delimiter characters above.
  • The overall partition length cannot exceed 256 characters for most use cases.
  • Opinionated formatting of timestamps.
  • Includes a trailing / character

cc @jszwedko since he had the pleasure of creating such partition strings for the benchmarking work.
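A minimal Python sketch of the map form of the proposed function, under the requirements listed above (the array form and its path-name reflection are left out; the timestamp format assumed here matches the example event):

```python
from datetime import datetime
from urllib.parse import quote

def to_hive_partition(fields, limit=256):
    """Sketch of the proposed to_hive_partition (map form): key=value pairs
    joined by '/', timestamps expanded into _year/_month/_day partitions,
    keys/values URL-encoded, with a trailing '/' and a length limit."""
    parts = []
    for key, value in fields.items():
        ts = None
        if isinstance(value, str):
            try:
                ts = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
            except ValueError:
                ts = None
        if ts is not None:
            # Opinionated: truncate to the day, split into 3 partitions.
            parts += [f"{key}_year={ts.year}",
                      f"{key}_month={ts.month:02}",
                      f"{key}_day={ts.day:02}"]
        else:
            parts.append(f"{quote(key, safe='')}={quote(str(value), safe='')}")
    out = "/".join(parts) + "/"
    if len(out) > limit:
        raise ValueError(f"partition string exceeds {limit} characters")
    return out
```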

add `credit_card` filter to `redact` function

After vectordotdev/vector#5221 lands, we want to expand the list of filters to make it easy to filter sensitive data without having to write your own regexp (and to make sure those implementations are as performant as possible).

First on the list is a credit card filter:

{ "message": "order paid using credit card number 5555 5555 5555 4444" }
.message = redact(.message, filters = ["credit_card"])
{ "message": "order paid using credit card number ****" }

Some relevant references:

ref vectordotdev/vector#4519
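As a rough illustration of what a credit_card filter could match, here is a hedged Python sketch (the pattern and replacement are illustrative; a production filter would likely also validate the Luhn checksum to cut false positives):

```python
import re

# Matches 13-16 digit card numbers, optionally separated by spaces or dashes.
CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def redact_credit_cards(message):
    """Replace anything that looks like a credit card number with ****."""
    return CREDIT_CARD.sub("****", message)
```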

fields extracted using parse_regex imply incorrect type definitions when parse_regex fails

Vector Version

vector 0.13.1 (v0.13.1 x86_64-unknown-linux-gnu 2021-04-29)

Vector Configuration File

[sources.generator]
  type = "generator"
  format = "syslog"
  
[transforms.remap]
  type = "remap"
  inputs = ["generator"]
  source = '''
  . |= parse_regex(.message, r'^<\d+>(?P<a>\S+)') ?? {}
  .a = downcase(string(.a) ?? "")
  '''
  
[sinks.console]
  type = "console"
  inputs = ["remap"]
  encoding = "json"

Debug Output

error[E651]: unnecessary error coalescing operation
  ┌─ :2:17
  │
2 │   .a = downcase(string(.a) ?? "")
  │                 ^^^^^^^^^^ -- -- this expression never resolves
  │                 │          │
  │                 │          remove this error coalescing operation
  │                 this expression can't fail
  │
  = see language documentation at https://vrl.dev

error[E110]: invalid argument type
  ┌─ :2:17
  │
2 │   .a = downcase(string(.a) ?? "")
  │                 ^^^^^^^^^^^^^^^^
  │                 │
  │                 this expression resolves to the exact type "boolean"
  │                 but the parameter "value" expects the exact type "string"


Expected Behavior

Config should work as expected.

Actual Behavior

Config returns an error when using vector validate or running the config.

Example Data

Generated data.

Additional Context

Discussed this with @JeanMertz and he figured out that vector makes a wrong compile-time assumption about the fields extracted by regex capture groups when parse_regex errors are suppressed using the ?? {} notation used above. A potential workaround is:

[sources.generator]
  type = "generator"
  format = "syslog"
  
[transforms.remap]
  type = "remap"
  inputs = ["generator"]
  source = '''
  . |= parse_regex(.message, r'^<\d+>(?P<a>\S+)') ?? {"a": null}
  .a = downcase(string(.a) ?? "")
  '''
  
[sinks.console]
  type = "console"
  inputs = ["remap"]
  encoding = "json"

References

Discussed on Discord in the support channel on 11/05/2021 around 12:00 CET.

Unable to parse timestamp of type millisecond since unix epoch

Vector Version

vector 0.13.1 (v0.13.1 x86_64-unknown-linux-gnu 2021-04-29)

Vector Configuration File

$ parse_timestamp!("1620291751177", "%s%3f")
function call error for "parse_timestamp" at (0:42): Invalid timestamp "1620291751177": premature end of input

$ parse_timestamp!("1620291751.177", "%s%.3f")
t'2021-05-06T09:02:31.177Z'

Debug Output

N/A

Expected Behavior

The first parse_timestamp command should have been able to parse the timestamp.

Actual Behavior

It didn't parse the timestamp.
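Part of the problem is that strftime-style specifiers don't compose well for "milliseconds since epoch" (there is no fractional variant of %s that takes unseparated milliseconds). A Python sketch of the behavior the user wanted, sidestepping format strings entirely (function name hypothetical):

```python
from datetime import datetime, timezone

def parse_epoch_millis(value):
    """Parse a milliseconds-since-Unix-epoch string into a UTC datetime,
    splitting whole seconds from the millisecond remainder exactly."""
    ms = int(value)
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc).replace(
        microsecond=(ms % 1000) * 1000
    )

print(parse_epoch_millis("1620291751177").isoformat())
```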

New `parse_<framework>` Remap functions (niche log parsing)

I wanted to open this issue to see where we land on a solution since these types of parsing problems raise a few interesting questions. Parsing framework logs, like Rails, cuts deeper into the real-world observability problems and we will need to solve this elegantly in the near future.

Examples

For example, let's use Rails. Rails production logs look like this:

I, [2016-01-26T23:21:44.581108 #27447]  INFO -- : Started POST "/articles" for 127.0.0.1 at 2018-06-27 15:48:10 +0000
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- : Processing by ArticlesController#create as HTML
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   Parameters: {"utf8"=>"✓", "article"=>{"title"=>"Create New Post", "text"=>"Description for new post"}, "commit"=>"Save Article"}
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   [1m[35m (0.2ms)[0m  [1m[35mBEGIN[0m
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   ↳ app/controllers/articles_controller.rb:20
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   [1m[36mArticle Create (1.0ms)[0m  [1m[32mINSERT INTO "articles" ("title", "text", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id"[0m  [["title", "Create New Post"], ["text", "Description for new post"], ["created_at", "2018-06-27 15:48:11.116208"], ["updated_at", "2018-06-27 15:48:11.116208"]]
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   ↳ app/controllers/articles_controller.rb:20
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   [1m[35m (0.5ms)[0m  [1m[35mCOMMIT[0m
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- :   ↳ app/controllers/articles_controller.rb:20
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- : Redirected to http://localhost:3000/articles/29
I, [2016-01-26T23:21:44.581108 #27447]  INFO -- : Completed 302 Found in 25ms (ActiveRecord: 4.5ms)

Problem

Parsing these logs is not a fun endeavor:

  1. You must unwrap the log with a custom Ruby prefix to extract the level, timestamp, and pid.
  2. Custom Regexes will need to be written for lines that allow it.
  3. A custom key/value parser is needed for the Ruby stringified parameters. ({"utf8"=>"✓",...)
  4. ANSI stripping will be required on some lines.
  5. Control flow would help to match on each line in order to apply the appropriate parsing strategy. This will help with performance.
  6. I'm probably missing some steps.
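Step 1 (unwrapping the Ruby Logger prefix) can be sketched in Python; the field names below are illustrative, not a committed schema:

```python
import re

# Unwrap "I, [2016-01-26T23:21:44.581108 #27447]  INFO -- : <message>".
PREFIX = re.compile(
    r"^(?P<sev>[A-Z]), \[(?P<timestamp>[^ ]+) #(?P<pid>\d+)\]"
    r"\s+(?P<level>\w+) -- : (?P<message>.*)$"
)

def unwrap_rails_line(line):
    """Extract severity, timestamp, pid, level, and the inner message."""
    m = PREFIX.match(line)
    return m.groupdict() if m else None
```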

Proposal

This is the kind of problem Vector should make easy. I realize this is a very specific version of this problem, but we should think about how we can reduce the pain here.

Proposal 1: Do nothing

We could decide to not handle these formats and tell users to:

  1. Structure your logs at the app level using JSON or some other format.
    1. Note: Vector is sometimes used by operators that do not control the app and therefore can't update the app to structure its logs.
  2. Use Grok, since patterns already exist.
    1. Note: this is slower and these formats appear to be community managed, so their quality is questionable.
  3. Write custom regexes.

Proposal 2: Add a single parse_rails function

This function would perform format detection and return structured logs with a key that users can match on to determine the log type. One simple function call would handle it all.

This offers the best user experience but shifts a lot of tedious responsibility onto us, especially since these formats can change across Rails versions.

Proposal 3: Add multiple parse_rails_* functions

parse_rails_request_start, parse_rails_controller, parse_rails_sql, etc, etc.

I don't like this option but wanted to list it to be comprehensive.

Proposal 4: Community managed Remap functions/formats

Delegate all of this to the community and let them manage this through a number of ways:

  1. Community managed Remap functions.
  2. Community managed patterns.
  3. Community managed components (config macros).

Final thoughts

I don't particularly like any of these solutions, but I feel like this would be best solved with some sort of community approach if we could control quality.

Allow `'`s around values in `parse_key_value` / `parse_logfmt`

Reported by user in discord: https://discord.com/channels/742820443487993987/746070591097798688/832270139293433867

They'd like to parse a line that seems to be logfmt, but with 's around the value instead of "s:

{"appname":"haproxy","file":"/var/log/haproxy-http.log","host":"lb1","hostname":"lb1","message":"client_ip=10.10.0.99 client_port=51428 frontend_transport=stats backend_name=stats backend_server=<PROMEX> time_to_recieve=0 slot_wait_time=0 tcp_establish_time=0 responce_time=0 request_active_time=0 status_code=200 bytes_read=45096 request_cookie=- responce_cookie=- termination_state_cookie=LR-- process_concurrent_connections=1 frontend_concurrent_connections=1 backend_concurrent_connections=0 server_concurrent_connections=0 retries_count=0 server_queue=0 backend_queue=0 request_headers= responce_headers= request='"GET /metrics HTTP/1.1"'","procid":11296,"source_type":"file","timestamp":"2021-04-15T15:02:23Z"}

Note the request key's value. It seems to be double quoted, but I think it'd be reasonable to expand parse_key_value and parse_logfmt to allow this as a delimiter as well.

allow working with durations

I wonder if manipulating timestamps is common enough in events that it'd be worth introducing a duration type in the language itself.

For example, one could write:

.dt = to_timestamp(.dt) + 10s

The <integer>s part is the type that defines a duration (in this case in seconds). We'd support more than just seconds. Alternative notations are possible (including ISO8601 duration format).

Of course, an alternative would be to just allow adding seconds to timestamps:

.dt = to_timestamp(.dt) + 10 // seconds

Or having a duration function that takes an integer or float, and a unit of time, and converts that to seconds (or milliseconds, or whatever):

$hours = 10
.dt = to_timestamp(.dt) + duration($hours, "hour")

Of course, it's worth mentioning we already have a parse_duration function, which might be "just enough" to cover the basic use cases and there's very little demand for anything more advanced.
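Python's timedelta is a reasonable analogue of the proposed duration type, and sketches both variants discussed above:

```python
from datetime import datetime, timedelta, timezone

# Analogue of `.dt = to_timestamp(.dt) + 10s`.
dt = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(dt + timedelta(seconds=10))

# The duration($hours, "hour") variant maps onto timedelta's keyword units.
hours = 10
print(dt + timedelta(hours=hours))
```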

Add pipe operator support

Remap revolves around expressions, most expressions are implemented as functions, which take zero or more input arguments, and provide output.

For example:

trim(uppercase(to_string(.foo)))

The above will eventually (#4905) also work in string templates in Vector itself:

my_config_field = "{{ trim(uppercase(to_string(.foo))) }}"

The above works, but can be hard to read.

I wonder if we want to add support for the pipe operator (similar to Elixir).

.foo |> to_string |> uppercase |> trim
my_config_field = "{{ .foo |> to_string |> uppercase |> trim }}"

The gist of this syntax would be that given an expression, if the expression is followed by the pipe operator (|>), then the expression after that operator would receive the result of the previous expression as its first argument.

The precedence of the pipe operator would be higher than the other operators. This allows one to write:

.foo = .bar |> split |> contains("baz") || .baz |> to_string |> contains("qux") # bool

# similar to
.foo = (.bar |> split |> contains("baz")) || (.baz |> to_string |> contains("qux")) # bool

You can still supply more arguments if need be:

"foo bar,baz" |> split(delimiter = ",") |> join(delimiter = "|") # "foo bar|baz"

This would be purely syntactic sugar to make the code easier to read, especially in string templates, but also in general.
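The desugaring rule (the result of the lhs becomes the first argument of the rhs) can be sketched in Python; `pipe` is a hypothetical helper for illustration, not proposed VRL API:

```python
from functools import reduce

def pipe(value, *stages):
    """Each stage is (function, *extra_args); the previous result is
    passed as the function's first argument, like `|>` would."""
    return reduce(lambda acc, stage: stage[0](acc, *stage[1:]), stages, value)

# .foo |> to_string |> uppercase |> trim
result = pipe(42, (str,), (str.upper,), (str.strip,))

# "foo bar,baz" |> split(delimiter = ",") |> join(delimiter = "|")
joined = pipe("foo bar,baz",
              (lambda s, sep: s.split(sep), ","),
              (lambda parts, sep: sep.join(parts), "|"))
```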


There are still some design decisions to make (for example, what happens if you provide a non-function expression to the rhs of the pipe operator? Compile error?), but I wanted to put this issue up to gauge interest from others.

Also, some languages that focus specifically on templating use the | pipe syntax to do this:

.foo | to_string | uppercase | trim

We could do this, but then we'd lose the ability to implement bitwise operators in the future.

cc @binarylogic @FungusHumungus

Unresolvable VRL compiler complaint about fallibility

Discovered by a user.

Version: vector 0.13.0 (v0.13.0 x86_64-apple-darwin 2021-04-21)

I whittled down the reproducible case to:

[sources.in]
type = "stdin"

[transforms.remap]
  type = "remap"
  inputs = ["in"]
  source = '''
  foo, foo_err = parse_regex("hello 123 world", r'(?P<bar>\d+)')
  if foo_err != null {
    bar, bar_err = parse_regex("hello 123 world", r'(?P<bar>\d+)')
    if bar_err != null {
      log(bar_err)
    } else {
      .line = merge(.line, bar)
    }
  } else {
    .line = merge(.line, foo)
  }
  '''

[sinks.out]
type = "console"
inputs = ["remap"]
encoding.codec = "json"

If you run vector --config with this, it gives:

error[E103]: unhandled fallible assignment
   ┌─ :10:13
   │
10 │     .line = merge(.line, foo)
   │     ------- ^^^^^^^^^^^^^^^^^
   │     │       │
   │     │       this expression is fallible
   │     │       update the expression to be infallible
   │     or change this to an infallible assignment:
   │     .line, err = merge(.line, foo)
   │
   = see documentation about error handling at https://errors.vrl.dev/#handling
   = learn more about error code 103 at https://errors.vrl.dev/103
   = see language documentation at https://vrl.dev

If we resolve this by adding ! to the merge calls we see:

error[E620]: can't abort infallible function
   ┌─ :10:13
   │
10 │     .line = merge!(.line, foo)
   │             ^^^^^- remove this abort-instruction
   │             │
   │             this function can't fail
   │
   = see documentation about error handling at https://errors.vrl.dev/#handling
   = see language documentation at https://vrl.dev

Add new `parse_regex_any` VRL function

As discussed in Discord with Jean Mertz, it would be very valuable to provide a VRL function for regex parsing that can accept a list of regex patterns, similar to how the deprecated regex parser operates.

From Discord:

Unfortunately there's no solution to this right now. We did add a match_any (https://vector.dev/docs/reference/vrl/functions/#match_any) function to speed up matching cases, but we don't have a parse_regex_any yet.

We are tracking compile-time optimisations in https://github.com/timberio/vector/issues/7636, which would include cases like these, but there's nothing concrete yet.

In the mean time, feel free to open an issue for this specific use-case, as I don't see a reason not to add parse_regex_any to at least have a solution in place until we have those optimisations.
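The requested behavior, sketched in Python (first matching pattern wins; returning the named capture groups mirrors parse_regex):

```python
import re

def parse_regex_any(value, patterns):
    """Try each pattern in order; return the named capture groups of the
    first match, or None if nothing matches."""
    for pattern in patterns:
        m = re.search(pattern, value)
        if m:
            return m.groupdict()
    return None
```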

Issue with `del` inside chained functions

Ex:

$ .message = "{\"foobar\": \"baz\"}"
"{\"foobar\": \"baz\"}"

$ . |= object!(parse_json!(del(.message)))
{ "foobar": "baz", "message": "{\"foobar\": \"baz\"}" }

Workaround

Move del function to a separate line either like:

msg = del(.message)
. |= object!(parse_json!(msg))

or

. |= object!(parse_json!(.message))
del(.message)

Add explicit "enum type" to VRL

We've been introducing functions that take an enum variant as an argument.

For example:

parse_nginx_log(.message, format: "combined")

The format argument has to be either "combined" or "error", and has to be a string literal. This has been awkwardly implemented and can be confusing to users. It also results in less-than-ideal error messages if people try to pass in non-literal string values (e.g. through an event field).

I've been pondering a solution, and although having full-blown enum support (both defining and referencing) can be powerful, I also believe it doesn't fit the simplicity we want to achieve with VRL.

Instead, I think we should add a new enum type that can only be referenced as a literal, similar to how we have a timestamp type.

Specifically, I propose we add e'...' as a type for an enum variant literal.

Given this function signature:

foo(timestamp: Timestamp, variant: Enum)

You'd call this function like so:

foo(timestamp: t'2021-01-01T00:00:00Z', variant: e'my_variant')

This would only compile if variant is passed an enum type (e'...') and if the provided enum variant is accepted by the function signature.

Function signatures would define the set of enum variants they accept, which the compiler can then automatically check (and provide useful error messages indicating which enum variants are expected and which one was provided). This is an improvement over the current situation because functions now have to build their own (non-ideal) error messages when dealing with enum-like strings, given that we don't actually have support for enums.

All of this would also bubble up into the Cue docs, which would result in better auto-generated function signatures and a list of enum variants supported for a specific parameter.

Support for HereDocs in Remap

HereDocs allow for the easy, and readable, creation of multi-line strings.

Example

Simple

"""
This is
a
long multi-line
string.
"""

Would produce:

"This is\na\nlong multi-line\nstring."

Indentation preserved

"""
  This is
a
  long multi-line
string.
"""

Would produce:

"  This is\na\n  long multi-line\nstring."

Leading white space ignored

.field = """
    This is
  a
    long multi-line
  string.
  """

Would produce:

"  This is\na\n  long multi-line\nstring."

Notice that the leading 2 spaces were trimmed since they did not align with the closing """ delimiter.

Requirements

  • The leading \n should be stripped (as shown in the examples)
  • The trailing \n should be stripped (as shown in the examples)
  • Program indentation should be ignored (spaces before the closing """)
  • Contained indentation should be preserved (beyond the close """)
  • New lines are preserved
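The trimming rules above can be sketched in Python (the closing-delimiter indentation is passed in explicitly here; lines indented less than the closing delimiter fall back to full trimming, matching the last example):

```python
def heredoc(raw, closing_indent=0):
    """Strip the leading/trailing newline, then remove the closing
    delimiter's indentation from each line, preserving deeper indentation."""
    prefix = " " * closing_indent
    lines = raw.strip("\n").split("\n")
    return "\n".join(
        line[len(prefix):] if line.startswith(prefix) else line.lstrip()
        for line in lines
    )

raw = "\n    This is\n  a\n    long multi-line\n  string.\n"
print(heredoc(raw, closing_indent=2))
```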

Support non-quoted integer path segments

Enable usage of non-quoted integers in path segments. Some examples would be .0, .field.12, .2.label etc...

vectordotdev/vector#7045 enabled usage of non-quoted path identifiers that start with a number. Originally, support for non-quoted integers as path segments was also planned, but it was scrapped once the issues around it became clear and its usefulness was questioned.

Issue

The main issue is _ chars in integers. Currently the parser parses 1_000 as an integer and erases the underscores. A naive implementation would use the parsed integer as a key, causing { "1_000": "a", "1000": "b" }.1_000 to evaluate to "b".

There are two known approaches:

  • Introduce more context into the lexer so that it knows when it is inside a path and parses integers as identifiers.
  • Raise the construction of integers into the grammar, where we have that context.

Both options would introduce considerable complexity for a feature that is at best a nice-to-have and at worst could introduce more confusion than it would resolve.

Alternatives

Quoted integers are supported, ."0" works, so in the case of not going through with this change no other change is required.

Follow up on vectordotdev/vector#7045, vectordotdev/vector#6780

Allow specification of multiple timestamp formats to parse_timestamp

We're experiencing issues with timestamp parsing. Specifically, this is the offending timestamp format:
07/Apr/2021:23:09:56 +0000.

By looking at the Vector supported timestamp formats, it doesn't seem to support the above timestamp format. We were wondering whether it would be possible to add it or adjust Vector to permit expanding the timestamp format catalog from the config file?

Vector Version

vector 0.12.1 (v0.12.1 x86_64-unknown-linux-gnu 2021-03-12)

Expected Behavior

to_timestamp should produce a valid timestamp.

Actual Behavior

Apr 07 23:10:05 ip-10-10-25-9.us-west-2.compute.internal vector[3592]: Apr 07 23:10:05.664 WARN transform{component_kind=“transform” component_name=remap-nginx component_type=remap}: vector::internal_events::remap: Mapping failed with event. error=“remap error: function call error: No matching timestamp format found for \“07/Apr/2021:23:09:56 +0000\“” rate_limit_secs=30
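The requested "try several formats" behavior, sketched in Python with strptime (function name hypothetical; the second format below matches the offending timestamp):

```python
from datetime import datetime

def parse_timestamp_any(value, formats):
    """Try each timestamp format in order; the first one that parses wins."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"no matching timestamp format for {value!r}")

ts = parse_timestamp_any("07/Apr/2021:23:09:56 +0000",
                         ["%Y-%m-%dT%H:%M:%S%z", "%d/%b/%Y:%H:%M:%S %z"])
```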

Add Remap function for comparing semantic versions

I'm not sure I can think of an absolute must-have use case, but we should enable users to compare versions of things, like so:

semver::compare(v1, v2)

This idea is inspired by Rego's semver.compare function, which returns three possible integers:

  • 1 if v1 is later than v2
  • 0 if the two versions are the same
  • -1 if v1 is earlier than v2

This would allow for constructing Boolean expressions:

if (semver::compare(.version, "1.0") == -1) {
    .deprecated = true
}

I'm open to other interfaces for this, for example:

semver::compare(v1, v2, expect = ["gt"]) // Boolean

semver::parse(v1) > semver::parse(v2)
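The compare interface, sketched in Python for plain numeric dotted versions (no pre-release or build metadata handling, which a real semver implementation would need):

```python
def semver_compare(v1, v2):
    """Return 1 if v1 > v2, 0 if equal, -1 if v1 < v2, comparing
    major/minor/patch numerically."""
    def key(v):
        parts = [int(p) for p in v.split(".")]
        return parts + [0] * (3 - len(parts))  # pad "1.0" -> [1, 0, 0]
    a, b = key(v1), key(v2)
    return (a > b) - (a < b)
```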

VRL parse function for hexstring to string

We have several use cases that require decoding a hex string into a plain string from logs. After text extraction with Grok, we would need a hex-string-to-string function (parse_hexToString).

Example below:
74 65 73 74 20 73 74 72 69 6e 67 -> test string
7465737420737472696e67 -> test string

Allow for slice expressions in Remap

An expression to slice values would make transforming values easier.

Examples

.message[1:10]
.array[0:2]

Requirements

  • Ability to slice strings
  • Ability to slice arrays
  • Allow for negative indexing

`parse_logfmt` unflattening

When encoding as logfmt, it is common to flatten structured data. For example:

{
  "foo": {
    "bar": 1
  }
}

Would be encoded as foo.bar=1 in logfmt.

Currently, the parser just reads the keys as-is so you end up with an object like:

{
  "foo.bar": 1
}

I think it'd be nice to unflatten. This could be an optional behavior.

Notably the PR adding encode_logfmt (vectordotdev/vector#6985) does this flattening.
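The proposed optional unflattening, sketched in Python (dotted keys become nested objects; key collisions are ignored here for brevity):

```python
def unflatten(flat):
    """Turn {"foo.bar": 1} into {"foo": {"bar": 1}}."""
    out = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return out
```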

It is possible to create excessively huge arrays in VRL

In VRL running:

.thing[9223372036854775807] = 1

will cause VRL to attempt to allocate a huge array. This basically locks it up and will likely crash it eventually (I didn't wait long enough to find out for sure).

We should probably put a reasonable limit in here to prevent this from occurring.

This will become much more important when we allow variables to be used as array indexers.

improve `parse_key_value` escaping management & align it with `encode_key_value`

parse_key_value has limited escaping support for some sequences, but parsing does not remove the escape character. This is inconsistent with encode_key_value, which adds escape characters, causing parse_key_value(encode_key_value(...)) not to be a no-op. (I'm not sure it is a no-op even when no characters are escaped; it should be, but that's not the point here.)

$ parse_key_value!(s'foo="\"bar\""', whitespace:"strict")
{ "foo": "\\\"bar\\\"" }

whereas it should be

{ "foo": "\"bar\"" }

Test cases that would have to be updated (as of today they are not in master)
https://github.com/timberio/vector/blob/2ce9ddfdb53792e24e3d0360a713f490f717ebb7/lib/vrl/stdlib/src/parse_key_value.rs#L557-L573

Add function memoization

There have been several situations in which a VRL function has relatively slow runtime performance, but the operation it performs results in a value that is relatively static.

see: vectordotdev/vector#6141

I propose we add a “function cache” to the VRL runtime.

The idea is this:

  • We add a new state::FunctionCache object.
  • Functions have access to their own piece of this cache, namespaced by the function identifier (to start)
  • They can store a Value type in the cache to fetch when needed.
  • We either implement a TTL in the cache, or a “query counter”. Alternatively we leave cache invalidation up to the function.

There are still some things to work out here, but it seems a valuable property to have in the runtime.
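A minimal Python sketch of the cache described above, with the TTL variant of invalidation (the name FunctionCache and its shape here are taken from the proposal, but the API is illustrative):

```python
import time

class FunctionCache:
    """Values are namespaced by function identifier and expire after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # (function_ident, key) -> (expires_at, value)

    def get_or_insert(self, ident, key, compute):
        """Return the cached value, or call compute() and cache the result."""
        now = time.monotonic()
        entry = self.entries.get((ident, key))
        if entry and entry[0] > now:
            return entry[1]
        value = compute()
        self.entries[(ident, key)] = (now + self.ttl, value)
        return value
```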

Keep `lib/stdlib/data/user_agent_regexes.yaml` up to date

File lib/vrl/stdlib/data/user_agent_regexes.yaml with regexes for parsing user agent string was introduced in vectordotdev/vector#8262. It's a copy of https://github.com/ua-parser/uap-core/blob/master/regexes.yaml file so it needs to be kept up to date.

The original file is updated on average once a month, so it would be nice to update at that cadence, though we can be coarser than that. The best option would be to automate this. The second best option would be to add a notification script that detects changes and opens an issue for them.

Add `parse_host` function to VRL

We had a question in Discord how to get the TLD of a domain name:

having strings like "qwe.domain.tld" and "qwe.asd.anotherdomain.tld" how can I receive last two items as strings? "domain.tld", "anotherdomain.tld"?

They decided on a solution like the following:

hostParts = split("qwe.asd.anotherdomain.tld", ".")
hostParts[-2] + "." + hostParts[-1] // "anotherdomain.tld"

The problem with this is that not all TLDs are a single word. E.g. co.uk is a well-known one, but there are many others that are comprised of multiple names separated by a dot.

As a language that focuses on correctness and the principle of least surprise at runtime, and given that manipulating/checking domains is a common task in the observability space we're operating in, I think we can do better here.


I propose we add a new parse_host function to wrap the tldextract crate that works like this:

hostParts = parse_host!("qwe.asd.anotherdomain.tld")
hostParts // { "domain": "anotherdomain", "subdomain": "qwe.asd", "suffix": "tld" }

The tldextract library has a local cache of TLDs to know what's part of the TLD and what's part of the domain. It also has an option to update this list at runtime, but I suggest we disable that, opting instead to update the crate occasionally when new TLDs need to be added to the cache.

Alternatively, we could update parse_url to add this information, but the problem there is that this function only works with a valid URL, which requires a scheme to be present (e.g. mydomain.co.uk would be rejected, but https://mydomain.co.uk works), which seems impractical for this particular use-case.
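A Python sketch of the proposed output shape; the tiny PUBLIC_SUFFIXES set below stands in for the public-suffix list that the tldextract crate bundles:

```python
# Stand-in for the bundled public-suffix list.
PUBLIC_SUFFIXES = {"tld", "com", "uk", "co.uk"}

def parse_host(host):
    """Split a host into subdomain, domain, and public suffix, matching
    the longest known suffix (so "co.uk" wins over "uk")."""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            return {
                "domain": labels[i - 1] if i > 0 else "",
                "subdomain": ".".join(labels[: i - 1]),
                "suffix": candidate,
            }
    return {"domain": host, "subdomain": "", "suffix": ""}
```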

Expand supported redactors for redact

Broken off from vectordotdev/vector#7250 (comment)

The initial implementation of redact just had one redactor that always replaced with [REDACTED]. We should expand this to support additional redactors like:

  • Customizing the redaction string
  • Hashing the value to have consistent replacement text
  • Masking such as only showing the last 4 for social security numbers

No way to access keys with quotes in them in VRL

There doesn't appear to be a way to access keys that literally have "s in them in VRL.

vector 0.15.0 (x86_64-apple-darwin 994d812 2021-07-16)
$ . = { "event": {"\"log\"": 1 }}
$ .event
{ ""log"": 1 }
$ .event."\"log\""
null

confusing error messages in REPL mode regarding string error coalescing

Vector Version

vector 0.13.1 (v0.13.1 x86_64-apple-darwin 2021-04-29)

Vector Configuration File

Not applicable, but run the following commands in REPL mode:

.message = "abc"
match(string(.message) ?? "", r'a')

Debug Output

Expected Behavior

To be honest, I think the REPL should allow this notation to be more in line with testing real-life events, or it should give the following error:

error[E651]: unnecessary error coalescing operation
  ┌─ :1:7
  │
1 │ match(string(.message) ?? "", r'a')
  │       ^^^^^^^^^^^^^^^^ -- -- this expression never resolves
  │       │                │
  │       │                remove this error coalescing operation
  │       this expression can't fail
  │
  = see language documentation at https://vrl.dev

Perhaps this error should be a warning instead?

Actual Behavior

vector throws the following two errors:

error[E651]: unnecessary error coalescing operation
  ┌─ :1:7
  │
1 │ match(string(.message) ?? "", r'a')
  │       ^^^^^^^^^^^^^^^^ -- -- this expression never resolves
  │       │                │
  │       │                remove this error coalescing operation
  │       this expression can't fail
  │
  = see language documentation at https://vrl.dev

error[E110]: invalid argument type
  ┌─ :1:7
  │
1 │ match(string(.message) ?? "", r'a')
  │       ^^^^^^^^^^^^^^^^^^^^^^
  │       │
  │       this expression resolves to the exact type "boolean"
  │       but the parameter "value" expects the exact type "string"
  │
  = try: ensuring an appropriate type at runtime
  =
  =     null == null = string!(null == null)
  =     match(null == null, a)
  =
  = try: coercing to an appropriate type and specifying a default value as a fallback in case coercion fails
  =
  =     null == null = to_string(null == null) ?? "default"
  =     match(null == null, a)
  =
  = see documentation about error handling at https://errors.vrl.dev/#handling
  = learn more about error code 110 at https://errors.vrl.dev/110
  = see language documentation at https://vrl.dev

The second error is rather confusing: string(.message) ?? "" can never resolve to a boolean.


Implement a bit-test function within VRL

Current Vector Version

vector 0.12.2 (v0.12.2 x86_64-unknown-linux-gnu 2021-03-30)

Use-cases

We have logs that make use of a bit-flag field to indicate various states, for which we would like to use the route transform to split into multiple streams.

Attempted Solutions

Currently I believe I'd need to use the lua transform to test for bits in an integer field and convert those to separate named boolean values, and then use the route transform to test for those fields with the exists() function.

Proposal

This could either be done via an explicit function, i.e. test_bit(value: <integer>, bit: <integer>) :: <boolean>, or by supporting the typical bit-manipulation syntax from Rust, i.e. <integer> & <integer> or <integer> & (1 << <integer>), depending on how fancy you want to make it. But really the simple test_bit() option would be fine for my use case.

References

NA

add (syntax-only) modules for functions

Now that we're cranking out more and more functions, I'm worried it'll become more difficult to know the distinction between all of them.

Take for example the newly proposed to_level and to_severity. On their own, they make total sense, but when you put them next to to_string, to_int and the other to_* functions that convert to a concrete value, they can be confusing.

A similar situation arises with functions such as parse_aws_vpc_flow and parse_aws_elb.

We've already started categorizing the functions in our Cue documentation precisely because the list of functions is becoming longer still, and you need some way to more easily find what you're looking for.

I propose we extend this categorization to the language itself, by introducing function modules.

The concept is simple:

  1. We define a new Module enum that has variants such as Root, Aws, Syslog, and more.
  2. We update our Function trait to include a fn module(&self) -> Module function.
  3. All function implementations are updated to define their module.
  4. We update the parser to read function modules and convert them to syntax.
  5. Users now write syslog::to_level(.foo), or aws::parse_vpc_flow(.bar) instead.
  6. Functions assigned the Root module will stay as-is, so you'd still use to_int, parse_json, and other often-used functions.

I'll leave it as an exercise to others to come up with the correct module names (I think the Cue files are a good starting point, and I'm sure @lucperkins has some good ideas for this list as well), but I do believe this would only add a small amount of syntactic noise to the language (especially since the most commonly used functions won't have a module prefix), while gaining a bit more structure in the ever-growing list of functions.

(note I used <module>::<ident> since it doesn't collide with other syntax we currently have, and it's also what I'm used to as a Rustacean (and it's not uncommon in other languages either), but we can entertain other forms of syntax for this)

cc: @binarylogic @FungusHumungus

Add `to_array` coercion function to VRL

to_array(null) # => []
to_array([]) # => []

# any other type results in an error

Currently, the same can be achieved using if conditions or (not exactly similar, but close) array(.foo) ?? [], but wanting to coerce a potential null value when working with arrays is a common enough occurrence to justify this function.
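A sketch of the proposed coercion semantics in Python (the error message is illustrative, not the actual VRL wording):

```python
def to_array(value):
    # Proposed coercion: null becomes [], arrays pass through unchanged,
    # and any other type is an error.
    if value is None:
        return []
    if isinstance(value, list):
        return value
    raise TypeError(f"unable to coerce {type(value).__name__} into array")

assert to_array(None) == []
assert to_array([1, 2]) == [1, 2]
```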

cc @leebenson

Cannot Parse HTTP Error Log with parse_apache_log

Discussed in vectordotdev/vector#8429

Originally posted by tastyfrankfurt July 22, 2021
Having issues with the parse_apache_log parser: it doesn't like our error logs. See below for an example:
$ parse_apache_log!(s'[Fri Jul 23 00:28:19.615325 2021] [:error] [pid 17705] ipa: INFO: [jsonserver_session] [email protected]: idview_show(None, version=u\'2.237\'): RequirementError', timestamp_format:"%a %b %d %H:%M:%S.%6f %Y", format: "error")
function call error for "parse_apache_log" at (0:246): failed parsing common log line

Thanks

Tasty

VRL: strange path creation due to double quotes

Consider the following VRL:

$ .nested.""field"" = "foobar"
"foobar"

$ .
{ "nested": { "": { "field": { "": "foobar" } } } }

This could arguably be either a syntax error, or possibly create a path like .nested."field" or even .nested.""field"" (or some other handling of the escaped characters).

I wouldn't expect the current behavior.

Allow literal regexes from variables as arguments to VRL functions

A user ran into an issue trying to assign a regex to a variable in VRL and then use that variable in parse_regex. This fails because we currently require regex literals but this could be relaxed to allow a variable that contains a regex.

That is, this program should work:

my_regex = r'(?P<foo>e)'
parse_regex("test", my_regex)

VRL: syslog priority parsing

Current Vector Version

0.12.1

Use-cases

"The Priority value is calculated by first multiplying the Facility number by 8 and then adding the numerical value of the Severity."
~ cite from RFC5424

e.g.:

Priority = Facility * 8 + Severity 

Facility = Priority // 8
Severity = Priority % 8

Sometimes, when dealing with non-RFC-compliant syslog variations, only parts of the RFC are followed. In those cases, it's not uncommon for the syslog priority to be one of the only consistent distinguishing features of some logs.

Being able to extract both the facility and severity code from those messages would be pretty handy, and from a cursory overview, looks like it would fit in pretty nicely as a VRL function.
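The decomposition described by RFC 5424 is simple integer arithmetic; a Python sketch:

```python
def decompose_pri(priority):
    # RFC 5424: PRI = facility * 8 + severity, with PRI in 0..191.
    if not 0 <= priority <= 191:
        raise ValueError(f"priority {priority} out of range")
    return priority // 8, priority % 8

facility, severity = decompose_pri(165)
assert (facility, severity) == (20, 5)  # local4, notice
```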

Attempted Solutions

None as of yet.

Proposal

Add to_syslog_facility_from_priority and to_syslog_severity_from_priority functions to VRL for making use of Syslog Priority fields.

Or just one function with three return values (facility, severity, error)? What do you all think?

One more question: I noticed that we use the syslog_loose crate, which has a decompose_pri function that pretty much takes care of what I'm talking about.
Would you all have a preference for using that over just implementing it in this plugin? I don't imagine it'd change how much code ends up there in either case, but I figured it's probably worth asking.

References

Related in spirit: vectordotdev/vector#5769

Add optimization pass to VRL compiler

We added the match_any function in vectordotdev/vector#7414 to allow for improved performance when matching a single field against multiple regular expressions. This was in response to an issue with a relatively slow match(.foo, ...) || match(.foo, ...) || ....

An alternative to adding new functions that are more performant in certain cases would be to identify those patterns at compile time and transform them into a more efficient form internally. This would require some kind of pattern-matching optimization pass in the VRL compiler.
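The idea behind the optimization can be sketched in Python with the standard re module: a chain of match(.foo, r1) || match(.foo, r2) || ... is equivalent to one alternation, which the regex engine can scan in a single pass (the actual VRL compiler pass would do this rewrite on the AST; Rust's regex crate also offers RegexSet for this purpose).

```python
import re

# The chained form: each pattern scans the input separately.
patterns = [r"error", r"warn", r"fatal"]

# The optimized form: one combined alternation, one scan.
combined = re.compile("|".join(f"(?:{p})" for p in patterns))

line = "2021-07-22 fatal: disk full"
chained = any(re.search(p, line) for p in patterns)
assert chained == bool(combined.search(line))
```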
