GithubHelp home page GithubHelp logo

vinland-avalon / cycledb Goto Github PK

View Code? Open in Web Editor NEW
8.0 1.0 0.0 20.6 MB

So far, this repo is to build *Grid Index*, a substitute for TSI Engine of InfluxDB. (Still in development)

Go 99.64% Makefile 0.28% Shell 0.08%

cycledb's Introduction

Purpose

So far, this repo is create to build Grid Index. Reach package here. Other codes are forked from InfluxDB.

Grid Index

Abstract

In some time-series databases, a form of an inverted index may be utilized for specific purposes, such as indexing and querying based on tag values or metadata associated with time-series data. This allows for efficient filtering and retrieval of time-series data based on specific tag values, enabling queries that involve filtering by tags.
However, inverted index is a trade-off between storage usage and query performance. The posting list, a data structure to store record identifiers associated with each term in the index, will store redundant identifiers among different tag keys. Since time-series databases always deal with large-scale data in senarios like real-time monitoring and algorithmic trading, such storage overhead is extremely severe.
In this project, the Grid Index is proposed to function the same as an inverted index with limited storage usage. The Grid Index takes multi-dimension index as hyperdimension space where identifiers of points represent records.
The Architecture of Grid Index is as follows. architecture

Method

In previous time-series databases, inverted index or other tag indexes are used to map tag values to a list of series identifiers. This is normally the first step to look up data. The Grid Index functions exactly the same.

Data Structure

A grid is a concept of multi-dimension space where each dimension represents a series of tag values for one tag key. The points inside the space are given ordered serial identifiers which are exactly the identifiers of time-series data. The grid will store the order of tag keys and the order of tag values, through which the identifiers of points can be accessed with calculation. It is worth noticing that it is actually unneccessary to store the space and identifiers and thus save storage. The Grid Index is consist of $N_{grid}$ grids.

Get Series ID

This operation will return a list of identifiers of "hit series", which refer to series that at least contains the same tag keys and tag values as input. The lookup procedure in a single grid is simple. By mapping the input tag keys and tag values to dimensions and coordinate values, the index can position a part of space where identifiers within are wanted. Then for that sub-space, calculate identifiers of every points and return the list. This algorithm will iterate every grid and add hit identifiers to final returned list.

Set Series Key

The purpose of this operation is to index an incoming series key, which is normally consist of tag pairs, and return the corresponding identifier. First, this algorithm will check if such series key has been indexed before. If so, there is no need to set again. If not, the algorithm will check if there is an existing grid able to contain the new series. Such grid should 1) have exactly same dimensions as the new series and 2) have free slot to append value in dimensions that lack coordinate value the same as tag value in new series. If algorithm finds one grid suitable, then append values to dimensions. This will activate pre-allocate space for identifiers. If not find any, the index will be expanded by opening up new grid and insert new seires there. The dimensions of newly created grid will be the same as incoming series, and the capacity for each dimension will be decided by an optimizer.

Complexity Analysis

The main purpose of Grid Index is to optimize storage usage. So this chapter tries to compare spatial complexity between traditional inverted index and Grid Index.

Upper Bound

Regarding the grid index as a compression optimization of invert index, the upper bound of compression ratio occurs when the tag values corresponding to all tag keys are fully arranged. In this case, the series keys are of full permutaion and no point in grids is wasted. Let there be $N$ tag keys, the $i_{th}$ tag key corresponds to $M_i$ tag values, and the average cost to store a tag value is $L$.

  • For traditional inverted index, for $i_{th}$ tag key, the size of posting list is $f(i)=\prod_{j=0}^{N-1}{M_j}$. The overall spatial complexity is $O(N \times \prod_{j=0}^{N-1}{M_j})$.
  • For Grid Index, a dimension with $M_i$ tag values will only take up $M_i \times L$ memory, so the overall spatial complexity is $O(L \times \sum_{i=0}^{N-1}{M_i})$. Also, since we always need a data structure such as hashmap or bit map to avoid fake positive cases, the spatial complexity is indeed $O(\prod_{i=0}^{N-1}M_i)$
  • The compression ratio is $N$. If the bitmap is optimized, the compression ratio will be $\frac{N \times \prod_{i=0}^{N-1}M_i}{L \times \sum_{i=0}^{N-1}{M_i}}$.

Lower Bound

The lowest compression ratio occurs when every tag value only occurs once, which will lead to many voids in grids. Let there be $N$ tag keys, each tag key corresponds to $M$ tag values, and the average cost to store a tag value is $L$. In this case, only the identifiers on diagonal is valid. So the $i_{th} | i \in [0, M-1]$ series key should be like $[ TagKey_0: TagValue_{0,i}, TagKey_1: TagValue_{1,i}, โ€ฆ TagKey_{N-1}: TagValue_{N-1,i} ]$.

  • For traditional inverted index, the overall spatial complexity is $O(M \times N)$.
  • For Grid Index, the overall spatial complexity is $O(L \times M \times N)$.
  • The compression ratio is $\frac{1}{L}$, revealing that even in worst case, the performance of Grid Index is close to that of traditional inverted index.

Result

In this chapter, we compare Grid Index with traditional inverted index. We implement Grid Index in Golang 1.18 and perform our evaluation via single-threaded experiments on an Ubuntu Linux machine with Intel Core Xeon(R) Gold 6330 2.00GHz CPU. As for other experimental parameters, $N$ is equal to 3, $L$ is 1 byte, and $K$ will be decided by optimizer automatically.
Figure (a) shows how memory usages of the two indexes change along with different number of tag values for each tag key.

Full_Permutation_Memory_Usage Figure (b) shows how the query latency changes. One operation in (b) refers to a batch of queries. The basic set up is shown in the table. Full_Permutation_Query_Latency

cycledb's People

Contributors

vinland-avalon avatar

Stargazers

Tianyang Liao avatar yche2656 avatar  avatar Leo Huang avatar Xinyi Liu avatar jmy avatar Kaining Huang avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.