sonnerie's People

Contributors

barskern, db48x, ettom, f5xs-0000a, kerollmops, mathiswellmann, njaard

sonnerie's Issues

Panic when reading an invalid transaction file

Hey,

I am currently using sonnerie to collect textual data via the CreateTx::new and CreateTx::commit API.
I have set up two crontab rules to do a minor compaction every 10min and a major one every day.

Unfortunately, when I checked the database this morning, no compaction had been performed for about 12 hours. It was due to the reading programs crashing with the following backtrace. I found out that it was due to a transaction file (attached below); this file is very big, so it must be the output of a compaction. This bug is also present in the main branch.

tx.16fc0c7069ca48ed.zip

Backtrace:
thread 'main' panicked at 'range end index 4 out of range for slice of length 0', library/core/src/slice/index.rs:73:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::slice::index::slice_end_index_len_fail_rt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/slice/index.rs:73:5
   3: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
   4: core::intrinsics::const_eval_select
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/intrinsics.rs:2361:5
   5: core::slice::index::slice_end_index_len_fail
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/slice/index.rs:67:9
   6: <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/slice/index.rs:303:13
   7: core::slice::index::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/slice/index.rs:18:9
   8: <alloc::vec::Vec<T,A> as core::ops::index::Index<I>>::index
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/alloc/src/vec/mod.rs:2533:9
   9: sonnerie::segment_reader::SegmentReader::open
             at ./src/segment_reader.rs:39:42
  10: sonnerie::key_reader::Reader::new
             at ./src/key_reader.rs:29:9
  11: sonnerie::database_reader::DatabaseReader::new_opts
             at ./src/database_reader.rs:100:12
  12: sonnerie::database_reader::DatabaseReader::new
             at ./src/database_reader.rs:36:3
  13: sonnerie::main
             at ./src/main.rs:209:12
  14: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I am not sure, but it looks like the decode_into_with_unescaping function isn't filling the buffer correctly, which makes the program crash when it tries to index into it with the panicking [..] syntax.

Also, I would like to say thank you: the fact that we can't read (and compact) the database doesn't impact write operations. This is very cool, as we can simply remove the broken transaction to make the system work again 🎉

Create transactions as O_TMPFILE

So that cancelled transactions don't leave .tmp files behind.

When the transaction is committed, use linkat to make them permanent.
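A rough sketch of the idea on Linux, using the libc crate directly (whether sonnerie would use libc, nix, or something else is an open question; the function names here are hypothetical):

// Create an unnamed file in the database directory; if the process dies or the
// transaction is cancelled, the kernel reclaims it and no .tmp file is left behind.
use std::ffi::CString;
use std::io;

fn create_anonymous_tx(db_dir: &str) -> io::Result<libc::c_int> {
    let dir = CString::new(db_dir)?;
    let fd = unsafe { libc::open(dir.as_ptr(), libc::O_TMPFILE | libc::O_WRONLY, 0o644) };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(fd)
}

// On commit, give the anonymous file a permanent name with linkat() via /proc/self/fd.
fn commit_anonymous_tx(fd: libc::c_int, dest: &str) -> io::Result<()> {
    let src = CString::new(format!("/proc/self/fd/{fd}"))?;
    let dst = CString::new(dest)?;
    let rc = unsafe {
        libc::linkat(
            libc::AT_FDCWD,
            src.as_ptr(),
            libc::AT_FDCWD,
            dst.as_ptr(),
            libc::AT_SYMLINK_FOLLOW,
        )
    };
    if rc != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}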

Sonnerie as embedded library

I would like to embed Sonnerie inside my application, but I'm not sure if that's easy to do. The tests seem to spin up an external sonnerie process?
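For what it's worth, here is a minimal embedded-write sketch, stitched together from the API calls quoted elsewhere on this page (CreateTx::new, add_record, commit, sonnerie::record); exact signatures and error types may differ from the current release:

use sonnerie::CreateTx;

fn append_sample(db_dir: &std::path::Path) {
    // Open a write transaction against an existing database directory.
    let mut tx = CreateTx::new(db_dir).expect("creating transaction");
    let now = chrono::Utc::now().naive_utc();
    // "fibonacci" is just an example key; record() builds the value payload.
    tx.add_record("fibonacci", now, sonnerie::record(13u32))
        .expect("adding record");
    tx.commit().expect("committing transaction");
}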

compact --gegnum does not default to nanoseconds

The --help says that, by default, it uses nanoseconds as the timestamp format, but the default timestamp format is actually %FT%T. There's no way to specify "unix epoch in nanoseconds" with strftime, so this special-case behavior is necessary.

This should probably wait until the next compatibility-breaking release since it's a change in API. It may also benefit from a change in the API inside the formatting module.

This crate seems good, but why not cargo fmt

This crate satisfies my need for an embedded time-series database, and I'm quite interested in looking at its implementation.

But the code style makes it hard to read. Maybe it would be good to run

cargo fmt

Extract specific columns

Having the ability to extract specific columns directly via command-line options would greatly improve the UX. Using cut -d' ' works fine until there are string columns, which can contain spaces. It's possible to work around this by replacing unescaped spaces with placeholders, but that is already pretty hacky. It would be very nice if you could just give sonnerie an argument like -f 1 to output only the first column.

Filter results on timestamp range

Is it possible to query a timeseries for a given key and filter on a timestamp range?

E.g., be able to do (pseudo command-line options):
$ sonnerie -d database/ read fibonacci --after-ts 2020-01-03 00:00:00 --before-ts 2020-01-05 00:00:00

That would return:
fibonacci 2020-01-03 00:00:00 2
fibonacci 2020-01-04 00:00:00 3

Mentioning SQLite3 as the chosen backend

Hi,

really interesting project.

Perhaps it would be relevant to mention that it uses SQLite3 as the backend. I did not notice it until the backup section, which made me look at the dependencies in Cargo.toml.
For those looking for something embeddable it might be relevant, like it was for me.

Best regards,
lwk

May I support you with Sonnerie's API

Dear Kalle,

I would like to use Sonnerie's API within my own Actix-based web server. Even though you state in the Sonnerie repository that "Sonnerie can be used as a Rust library so you can read and write databases directly, but the API is incomplete and poorly documented, for now," I managed to create and update a database.

As a contribution, I would like to start by supporting Sonnerie's API with some documentation. Do you have any specific ideas about what this documentation should look like? I could just use Rust's ability to generate docs from the code base, plus additional annotations within the code.

What do you think?

Looking forward to hearing from you soon.

Kind regards,

Tobias

Transactions that are the result of compactions should be sorted before successive transactions

Right now, if you do a minor compaction, the filename for the new transaction file is determined at the time the compacted database is committed. If another transaction is started after the compaction begins and is committed before the compaction completes, the compaction's output file will be ordered after that other transaction, which would violate the precedence of records.

Allow a key to span segments

If a key can span segments, we'll be able to have very large timeseries, and we'll be able to store a timeseries whose field format varies, which means that "unchecked mode" would become safe.

Basic curl put command failing

Trying to run the example curl command fails:

curl -X PUT http://localhost:5555/ --data-binary 'fibonacci 2020-01-07T00:00:00 u 13'
parsing timestamp invalid digit found in string

This is running the basic sonnerie-serve.

Adding/retrieving data using sonnerie read / add works fine.

The root cause appears to be that sonnerie-serve parses raw nanosecond timestamps on line 138:

                let ts: Timestamp = timestamp
                    .parse()
                    .map_err(|e| format!("parsing timestamp {}", e))?;

where Timestamp is pub type Timestamp = u64

I assume the raw nanoseconds-since-UNIX_EPOCH value is meant to be opaque to the client, though sending timestamps through like that does indeed work.
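For reference, the same insert does go through when the timestamp is sent as nanoseconds since the Unix epoch (1578355200000000000 below corresponds to 2020-01-07T00:00:00 UTC; the value was computed for this example):

curl -X PUT http://localhost:5555/ --data-binary 'fibonacci 1578355200000000000 u 13'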

Allow realtime erase

It should be possible to create a special entry that indicates "delete a range", which causes that range to simply not be read.

The special entry can be compacted away.

This will be very difficult.

Rationale

Sometimes you have a lot of records that you want to delete. Right now, you can delete them with a strategic compact --gegnum. For example, I often do compact -M --gegnum 'awk { if ($1 ~ /^prefix/) { } else { print } }' to delete all the records with keys that begin with prefix. This awk-based command can also filter records based on time and date. However, this is slow (linear time). You have to wait for the entire database to be scanned and rewritten before the deletion is complete.

This proposal instead suggests a different kind of record that marks a range to be deleted. This would be a small amount of data that instructs the reader to "ignore these records". That record is written quickly and in constant time; it essentially takes effect immediately and still conforms to all the transactional expectations.

Implementation

I propose a record with a format string of simply "\u007F", (i.e., the "DELETE" character).

The code in merge.rs would have special behavior when it sees a DELETE record: it holds it and applies its effect to each record that meets the criteria encoded in that DELETE record. When it can prove that no following record can possibly meet the deletion criteria, it can drop it. More than one deletion can be in effect at any given time, so it may need to hold and apply multiple deletion operations.

To prevent complexities about having to read-ahead in a transaction file (or in that segment), a DELETE record must be the only record in both its segment and its file-of-segments.

  • A DELETE record has a key that represents the first key it can affect. This means that we don't need to change any existing code to start processing DELETE commands at the right time. We want to be able to delete wildcards, so this is just the wildcard's prefix.
  • The format is simply "\u007f".
  • The first 8 bytes represent the first timestamp to delete. Storing 0 there means "start at the beginning of time".
  • The following 8 bytes represent the last timestamp (inclusive) to delete. Of course, you can store u64::MAX there.
  • To enable wildcard deletions, store the actual wildcard-syntax string, using varint-string encoding with the actual % symbol. An empty string indicates "don't use this field".
  • Also, to enable range deletions, we can store the last key that we want to affect, using the same varint-string encoding we use for everything else. This can be an empty string to indicate "don't use this field". The last key to delete can therefore never actually mean "the empty string", because the empty string always sorts lexicographically before any other key, so we would never need to write it.

Here you see that there are 5 distinct fields that can be used to do deletions, and they must all be complied with simultaneously:

  1. The record's actual key representing "first key"
  2. First timestamp
  3. Last timestamp
  4. Key wildcard
  5. Last key

Each of these fields can also be marked, in some way, as "don't actually delete on this criterion".

When we are reading a database and processing a deletion, only the "last key" and "key wildcard" fields can actually have the effect of causing the Merge code to forget about the deletion.
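To make the criteria concrete, here is a rough sketch of the proposed deletion payload and its matching rule (hypothetical names; nothing like this exists in sonnerie today):

// All five criteria must hold simultaneously for a record to be deleted.
struct DeleteRecord {
    first_key: String,            // the DELETE record's own key: first key it can affect
    first_timestamp: u64,         // 0 means "start at the beginning of time"
    last_timestamp: u64,          // u64::MAX means "to the end of time" (inclusive)
    key_wildcard: Option<String>, // None/empty means "don't use this field"
    last_key: Option<String>,     // None/empty means "don't use this field"
}

impl DeleteRecord {
    fn matches(&self, key: &str, timestamp: u64) -> bool {
        key >= self.first_key.as_str()
            && timestamp >= self.first_timestamp
            && timestamp <= self.last_timestamp
            && self.key_wildcard.as_deref().map_or(true, |w| {
                // wildcards are prefix-style ("prefix%"); only that case is sketched here
                key.starts_with(w.trim_end_matches('%'))
            })
            && self.last_key.as_deref().map_or(true, |last| key <= last)
    }
}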

When a major compaction is done, the DELETE command is applied, and then the actual DELETE record is removed in the final task. However, when a minor compaction is performed, the DELETE record, and therefore its transaction file, must be retained.

API

The CreateTx struct shall be extended with a delete function that accepts the parameters specifying the criteria for deleted records.

All of the reading functions in DatabaseReader will transparently apply the deletions and never generate the record representing the deletion, except under the circumstance where the DatabaseReader is being used while in the process of a minor compaction (as the record must be retained). I don't know how that mode should be indicated, maybe when include_main_db==false?

All of the internal structs, such as in key_reader.rs will necessarily need to output the DELETE records.

User Interface

In the sonnerie CLI tool, the command can be called delete and has the same parameters as read. It produces no output other than an explanation of why it may have failed (which can only be due to impossible constraints, such as --first-key coming after --last-key).

Future Possibilities

I do not propose a way to delete records based on their value. That could be possible one day. Or maybe that's better done as part of a compaction, with a special UI for it that's a little easier than awk.

Per this proposal, a deletion must be the one and only record in a transaction file, which means that trying to perform a delete after any record has been added to the transaction is invalid. Instead, we could rewrite the transaction on the fly (i.e., in linear time), placing the DELETE as the first record in the transaction so that it can still be applied to previous, already-committed transactions.

Out of memory error when compacting

I sent ~250 different keys with 1500 entries each, across multiple HTTP PUT requests. I then tried to compact:

ronnie@Casa:/mnt/d/muCapital$ sonnerie -d database compact
thread 'main' panicked at 'compacting: IOError(Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" })', /home/ronnie/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sonnerie-0.8.4/src/main.rs:148:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

If I send a significantly smaller number of keys, compact works, but that is severely limiting. The same crash occurs whether the compaction is minor or major.

Compacting interrupts reading/writing?

Not sure if I'm misunderstanding something, but in the README it says

Compacting doesn't block readers or writers, but only one can happen at any given moment, so a lock is placed to prevent multiple concurrent compactions.
Compactions are atomic, so you can cancel it (with ^C) at any time.

When doing

mkdir db
touch db/main

And then, simultaneously running

while : ; do sonnerie -d db compact; done

and

for i in $(seq 1 10000); do echo $i; echo "test $i $i" | sonnerie -d db add --format u; done

The writing occasionally panics with:

thread 'main' panicked at 'opening db: Os { code: 2, kind: NotFound, message: "No such file or directory" }'

and compacting gives the following message:

disregarding "db/tx.17a24759f7b343cb", it is zero length

Whenever writing panics, the number for that iteration is missing from the database.

Is this to be expected? Obviously it's an unrealistic scenario, but as far as I can tell, this means there is a non-zero chance this can happen during normal operation too.

disregarding "db/tx.17a6f6aeb948b94b", it is zero length

Occasionally, if a writer is cancelled, a 0-length file is left behind. This can be reproduced by running

for i in $(seq 0 100000); do echo "test $i $i" | timeout 0.01s sonnerie -d db add --format "u"; done

The timeout will probably need adjusting based on the speed of the filesystem, but on my machine this very quickly produces lots of empty transaction files.

I suppose the behaviour is to be expected and sonnerie can't really prevent this from happening (?), but from a UX perspective, perhaps compacting could also clean up these empty files? Maybe an option flag to do it automatically?

Calls to println/eprintln

It would be nice if the library didn't just print stuff to stdout/stderr indiscriminately. AFAIK the idiomatic solution would be to use the log or tracing crate.

Better iterator API

Can we have a ranged-timestamp select API, or at least double-ended iterators to fetch records from newest to oldest?

ToRecord for String?

Hi, I'd like to write a function like this:

pub async fn write(
    key: String,
    values: impl RecordBuilder + std::marker::Send + 'static,
) -> Result<()> {
    let handle: JoinHandle<Result<()>> = tokio::task::spawn_blocking(move || {
        let db = std::path::Path::new(DB_PATH);

        let mut transaction = CreateTx::new(db)?;

        let timestamp = chrono::Utc::now();
        let naive_utc: NaiveDateTime = timestamp.naive_utc();

        transaction.add_record(&key, naive_utc, values)?;
        transaction.commit()?;
        Ok(())
    });

    handle.await?
}

The problem is that, as far as I can tell, this function isn't callable with a Record containing a (non-static) &str, as it won't live long enough. E.g.:

let foo = String::from("foo");
db::write("bar".to_owned(), sonnerie::record(foo.as_str())).await;

Is there an obvious solution to this issue that I'm missing? Would you consider adding a ToRecord impl for String?
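In the meantime, one possible workaround (a sketch built on the snippet above, so DB_PATH and the Result alias are assumed from it): move the owned String into the blocking task and only borrow it as &str inside, so the record never outlives the data backing it.

pub async fn write_str(key: String, value: String) -> Result<()> {
    let handle: JoinHandle<Result<()>> = tokio::task::spawn_blocking(move || {
        let db = std::path::Path::new(DB_PATH);
        let mut transaction = CreateTx::new(db)?;
        let naive_utc = chrono::Utc::now().naive_utc();
        // `value` is owned by this 'static closure, so borrowing it here is fine.
        transaction.add_record(&key, naive_utc, sonnerie::record(value.as_str()))?;
        transaction.commit()?;
        Ok(())
    });
    handle.await?
}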

Support more advanced conflict resolution strategies

Right now, when you have two records identified by the same key+timestamp, the one from the most recent commit takes precedence. This issue is going to decide how to support aggregating those conflicts instead of just discarding the old one.

User stories

  • One common use is counting events, for example recording the number of events once per day. If you have multiple sources of this data, each source accumulates into the counter.
  • Maybe a user measures temperature. In this case, we want to store the minimum and maximum value, which means that min and max are the functions.
  • Maybe the user stores actual error messages. If you can receive more than one message per timestamp, you might want to just concatenate them into one string. Therefore, it'd be best to have a "join with delimiter" function.

By combining a record that has two sum fields, one with a count and one with a value, you also have enough information to produce the mean.
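For example, if three transactions contribute (count, value) pairs of (2, 10.0), (1, 4.0), and (3, 16.0), the merged record holds (6, 30.0), and the mean is 30.0 / 6 = 5.0.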

File format

I think it makes sense to store the aggregation method in the format string: the aggregation method should never change, and the format string is only stored once, so it's efficient.

The format strings right now are single-character codes, like uff representing an unsigned 32-bit integer and two 32-bit floats. I propose a prefix or suffix on each one indicating the aggregation:

For example, +u9f0f could represent "addition for the u", "maximum for the first f", and "minimum for the second f". I'm not too attached to the particular representation, or even to it being constrained to single characters (in fact, it can't be if you need to specify the delimiter). A more complete list:

  • + sum
  • 9 maximum
  • 1 minimum
  • | join with delimiter. The following character must then be " followed by the actual delimiter, backslash-escaped, and then another ". For example, |"," for delimiting with a comma.
  • No character at all, which means "replace".
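As a rough sketch of how these markers could be represented in code (hypothetical; the Aggregate name mirrors the proposed API below, nothing like this exists in the crate yet):

// Proposed per-column aggregation markers from the list above (sketch only).
enum Aggregate {
    Replace,      // no marker character
    Sum,          // '+'
    Max,          // '9'
    Min,          // '1'
    Join(String), // '|' followed by the quoted, backslash-escaped delimiter
}

fn marker(agg: &Aggregate) -> String {
    match agg {
        Aggregate::Replace => String::new(),
        Aggregate::Sum => "+".to_string(),
        Aggregate::Max => "9".to_string(),
        Aggregate::Min => "1".to_string(),
        Aggregate::Join(delim) => {
            // e.g. a comma delimiter becomes |","
            format!("|\"{}\"", delim.replace('\\', "\\\\").replace('"', "\\\""))
        }
    }
}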

API

Right now, you can make records with record. We would need a new function like record_agg which generates the format string with the appropriate marker. For example:

  record_agg(sonnerie::Aggregate::Max, 25u32)
    .record_agg(sonnerie::Aggregate::Sum, 25.0f64)
    .record_agg(sonnerie::Aggregate::Join(","), "one message")

Applying the aggregate

Right now, Merge::discard_repetitions will just keep on reading values from all the transactions until it gets the last one for a given key+timestamp. Instead, Merge should apply the correct aggregate for each column.

A compaction uses Merge directly, so compaction doesn't need special behavior.
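Continuing the sketch above, the per-column merge step might look roughly like this for numeric columns (hypothetical helper, not existing sonnerie code):

// Fold an older value into a newer one according to the column's aggregate.
fn fold_numeric(agg: &Aggregate, older: f64, newer: f64) -> f64 {
    match agg {
        Aggregate::Replace => newer,      // old behavior: last write wins
        Aggregate::Sum => older + newer,
        Aggregate::Max => older.max(newer),
        Aggregate::Min => older.min(newer),
        Aggregate::Join(_) => newer,      // strings are handled separately
    }
}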

When the aggregate is impossible to apply in some manner

What if the data types don't match? For example, you're using the "summation" operator but one field is an integer and the other is a float, or one is numeric and the other is a string. I think the solution is to "try to do the correct thing" and then fall back on replacing the value.

What this means is that if we can guarantee a lossless conversion, then the operator can still occur. For example, if you're doing addition on a f32 and an f64, we can convert that f32 into an f64 and still do a summation.

In the case of such a lossless conversion, the data type should become the "wider" of the two, regardless of which transaction it came from. That is because if a long-running program commits its transaction later than newer processes, it would be surprising for your data to suddenly become corrupt just because of the commit order.

When the aggregate itself conflicts

That is to say, the order of transactions isn't defined until commit-time. That means that if multiple transactions have different aggregate records, it's probably just user error, because there's no way to make mathematical sense of it. Practically speaking, when the merging occurs, there is a defined order to the records and so the aggregate can just be applied in that order. No special work needs to occur.

CLI

The CLI expects the user to enter valid format strings. We can just leave that as it is until we provide a more user-friendly UI.

sonnerie-serve

sonnerie-serve, like the CLI, accepts format strings in the stream. Therefore, nothing special needs to be done there either.

Examples

Support for widening

If you create three separate transactions, the final value is the result of applying the aggregate function across all of them, widening as needed:

key 2023-01-01T00:00:00 +f 1.0
key 2023-01-01T00:00:00 +F 2.0
key 2023-01-01T00:00:00 +f 3.0

You should read back one record:
key 2023-01-01T00:00:00 +F 6.0

Strings

Strings have their aggregate value joined with the delimiter:

key 2023-01-01T00:00:00 |","s One
key 2023-01-01T00:00:00 |","s Two
key 2023-01-01T00:00:00 |","s Three

Read back:
key 2023-01-01T00:00:00 |","s One,Two,Three

Multiple columns

Each column has its own aggregation:

key 2023-01-01T00:00:00 +u9f0f 3 32.0 19.0
key 2023-01-01T00:00:00 +u9f0f 5 48.0 21.0
key 2023-01-01T00:00:00 +u9f0f 7 23.0 6.0

Read back:
key 2023-01-01T00:00:00 +u9f0f 15 48.0 6.0

Conflicting data types

If there's a conflict in the data type and widening can't occur, then just retain the value from the newest transaction:

key 2023-01-01T00:00:00 +u 12
key 2023-01-01T00:00:00 +f 19.0

Read back:
key 2023-01-01T00:00:00 +f 19.0

Retain old behavior

For columns without an aggregate marker, just select the value from the most recent transaction:

key 2023-01-01T00:00:00 f+u 4.0 4
key 2023-01-01T00:00:00 f+u 2.0 6

Read back:
key 2023-01-01T00:00:00 f+u 2.0 10

API for very fast parallel extraction

It should be possible to partition a database and then read multiple segments concurrently. This would be useful when I want to load everything into memory.
