tensorbase / tensorbase

TensorBase is a new big data warehouse built with modern efforts.

Home Page: https://tensorbase.io/

License: Apache License 2.0

Rust 99.93% Dockerfile 0.01% Shell 0.06%
analytics bigdata data data-infrastructure data-warehouse database engineering high-performance infrastructure modern rust rust-lang warehouse


tensorbase's Issues

support float type

Generally, support for any fixed-length type is easy to add. Just pick one currently implemented type as an example :)
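
As a rough illustration of why fixed-length types are cheap to add, here is a minimal sketch. The ColumnType enum and the fixed_size and value_offset helpers are hypothetical names chosen for this example, not taken from the TensorBase code base. Under that assumption, adding Float32 or Float64 mostly means adding a variant and its byte width.

```rust
/// Hypothetical column type enum; names are illustrative, not TensorBase's.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int32,
    Int64,
    Float32, // newly added fixed-length type
    Float64, // newly added fixed-length type
    String,  // dynamically sized, so it has no fixed width
}

/// Returns the per-value width in bytes, or `None` for dynamically sized types.
pub fn fixed_size(t: ColumnType) -> Option<usize> {
    match t {
        ColumnType::Int32 | ColumnType::Float32 => Some(4),
        ColumnType::Int64 | ColumnType::Float64 => Some(8),
        ColumnType::String => None,
    }
}

/// Byte offset of row `row` in a densely packed column of type `t`.
pub fn value_offset(t: ColumnType, row: usize) -> Option<usize> {
    fixed_size(t).map(|width| width * row)
}
```

With a layout like this, a float column can reuse the same offset arithmetic as an integer column of the same width.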

[RFC]Support Builtin functions with Datafusion

The easiest way to support builtin functions seems to be by using ScalarUDF or AggregateUDF.

Changes

To support this, the following changes would have to be made:

lightjit::builtins

We have to convert the functions into type datafusion::ScalarUDF.

Also, add a get_udf() function that matches a string to the function.

Should we still have these located here or maybe move them into a different crate?
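
The name lookup described above could look roughly like the following sketch. To keep it compilable with the standard library only, a plain function pointer (BuiltinFn) stands in for datafusion::ScalarUDF, and the builtin names abs and negate are hypothetical examples rather than the actual contents of lightjit::builtins.

```rust
/// Stand-in for a scalar UDF: a function applied to each value of a column.
/// In the RFC this would be a DataFusion `ScalarUDF`; a plain function
/// pointer is used here so the sketch has no external dependencies.
pub type BuiltinFn = fn(i64) -> i64;

fn builtin_abs(x: i64) -> i64 {
    x.wrapping_abs()
}

fn builtin_negate(x: i64) -> i64 {
    x.wrapping_neg()
}

/// Maps a builtin name (case-insensitive) to its implementation.
/// The names `abs` and `negate` are hypothetical examples.
pub fn get_udf(name: &str) -> Result<BuiltinFn, String> {
    match name.to_ascii_lowercase().as_str() {
        "abs" => Ok(builtin_abs as BuiltinFn),
        "negate" => Ok(builtin_negate as BuiltinFn),
        other => Err(format!("unknown builtin function: {}", other)),
    }
}
```

Returning an error for unknown names lets the caller propagate it with the ? operator, as the pseudocode in this RFC does.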

lang::parse

Add a new function parse_builtins(p). It would be similar to the current parse_tables(p) function, but would look for builtins and return a HashSet of the ones found.
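
A minimal sketch of such a function follows. It assumes a hypothetical KNOWN_BUILTINS list and works on the raw SQL string with a naive textual scan (it does not skip string literals), whereas the real parse_builtins(p) would presumably walk the parsed AST like parse_tables(p) does.

```rust
use std::collections::HashSet;

/// Hypothetical list of builtin names the engine knows about.
const KNOWN_BUILTINS: &[&str] = &["abs", "negate", "to_year"];

/// Collects every known builtin that appears in `sql` as a call, i.e. the
/// identifier is directly followed by `(`. Names are normalized to lowercase.
pub fn parse_builtins(sql: &str) -> HashSet<String> {
    let mut found = HashSet::new();
    let mut ident = String::new();
    for ch in sql.chars() {
        if ch.is_ascii_alphanumeric() || ch == '_' {
            // Still inside an identifier: keep accumulating.
            ident.push(ch.to_ascii_lowercase());
            continue;
        }
        // The identifier just ended; it is a call only if `(` follows directly.
        if ch == '(' && KNOWN_BUILTINS.contains(&ident.as_str()) {
            found.insert(ident.clone());
        }
        ident.clear();
    }
    found
}
```

Returning a set rather than a list means each builtin is registered at most once, however often the query calls it.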

engine::run()

Add let builtins = parse::parse_builtins(p)?; and pass builtins on to datafusion::run in the same way as tabs and cols.

    let (tabs, cols) = parse::parse_tables(p)?;
    let builtins = parse::parse_builtins(p)?; // <--- New
    log::debug!("projections - tabs: {:?}, cols: {:?}", tabs, cols);
    datafusions::run(ms, ps, current_db, raw_query, query_id, tabs, cols, builtins, qs) // <- also pass builtins

engine::datafusion::run()

Before running the query, we can check whether the SQL uses any builtins. If it does, all we need to do is call ctx.register_udf(builtin) for each one.

Pseudocode

let mut ctx = ExecutionContext::new();
...
// Register only the builtins that the query actually uses.
if !builtins.is_empty() {
    for f in builtins.drain() {
        let udf = get_udf(f)?;
        ctx.register_udf(udf);
    }
}
let df = ctx.sql(raw_query)?;
...

[RFC]Enable TPC-H Benchmarks

Why

Arrow-DataFusion already supports parts of TPC-H, but TensorBase cannot yet store all of the data types involved. Enabling these benchmarks would make TensorBase more feature-mature.

How

On the Arrow-DataFusion side, we would need to support the following types: DataType::Float64, DataType::Utf8, DataType::Date32. However, this is neither an economical nor a performant approach. As a first step, it is suggested to enable Decimal, String and Datetime.

TODO

  • support Datetime type
  • support Date type #54
  • support String type #22
  • support Decimal type #26

JDBC Driver Connection Times Out on Read

@jinmingjian , Jin, I attempted to connect to the TB server using the Clickhouse JDBC driver (pulled in by DBeaver) on port 9528, but the connection attempt times out on the read. I tried configuring the connection properties both with no Database/Schema specified as well as with default. I also tried with the No authentication option set. Below are screen captures illustrating the connection properties and the connection error. Note that I also confirmed that the firewall is turned off.

[Screenshots: clickhouse-jdbc-conn, clickhouse-jdbc-conn-error]

Cannot join Slack channel from the links in README or on the official website; maybe consider opening a new communication channel like Gitter?

When I click the Slack Channel link in the README or on the official website, it redirects to TensorBase's official Slack workspace, https://tensorbase.slack.com/, but without an invitation. So I think newcomers cannot log in.

I can log in to other Slack workspaces such as Kubernetes, so I guess it is just because TensorBase's Slack link is not an invitation link. On the Kubernetes one, there is a button that says "GET MY INVITE" for people who are not yet in the group.

add storage layer

  • basic data layout and data partition design
  • enable data write path
  • add server client arch
    • add server
    • add client driver(rust)
    • change baseops to use client driver
    • change baseshell to use client driver
  • basic stress tests and benchmarks

Issue encountered attempting to create a table prior to importing a CSV (additional documentation requested)

Hello,

I am very interested in your project and I am attempting to begin testing it out. However, the documentation for tools that exist in m0 does not seem to be accurate (baseops, baseshell). Subsequently, attempting to use the clickhouse client to create a very simple table using ddl fails. I am not sure what to use for ENGINE although it appears to be required and using MergeTree fails. I tried with and without ORDER BY. Any assistance you can provide would be greatly appreciated.

-Chris Whelan

TensorBase :) create table sales (title string) ENGINE = MergeTree ORDER BY title;

CREATE TABLE sales
(
    `title` string
)
ENGINE = MergeTree
ORDER BY title

Query id: 22fd667c-851d-4087-9fb7-5a58128003de


0 rows in set. Elapsed: 0.001 sec.

Received exception from server (version 2021.3.0):
Code: 3. DB::Exception: Received from localhost:9528. WrappingLangError(ASTError). Error when AST processing.

Distributing Query

If we use DataFusion Ballista, this feature may be easy to achieve. Not sure who wants to try it first :)

complete main type supports

  • LowCardinality String
  • String/Blob
  • Decimal
  • several other fixed length types (low hanging fruits)
  • Nullable types

Support table function `numbers` and `numbers_mt`

There is currently no easy way to try TensorBase.

The blog post Hello, Base has some docs about benchmarks against ClickHouse on the nyc_taxi dataset. Yet the dataset is pretty large, so it is hard for users to explore TensorBase quickly.

Maybe we can implement some table functions like numbers or numbers_mt in ClickHouse.
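
For illustration, the core of such a table function is small. The sketch below mirrors the ClickHouse behaviour of numbers(N), a single UInt64 column holding 0 to N-1. The function names and the chunking scheme for a multi-threaded variant are assumptions, not an existing TensorBase API.

```rust
/// Sketch of a `numbers(n)`-style table function: yields 0, 1, ..., n-1
/// as a single UInt64 column, without touching any storage.
pub fn numbers(n: u64) -> impl Iterator<Item = u64> {
    0..n
}

/// Splits the range into at most `parts` contiguous chunks, as a
/// multi-threaded variant might do before handing chunks to workers.
pub fn numbers_chunks(n: u64, parts: u64) -> Vec<std::ops::Range<u64>> {
    let parts = parts.max(1);
    let step = (n + parts - 1) / parts; // ceiling division
    (0..parts)
        .map(|i| (i * step).min(n)..((i + 1) * step).min(n))
        .filter(|r| !r.is_empty())
        .collect()
}
```

A query such as select sum(number) from numbers(5) would then need no imported dataset at all.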

Make tests runnable for anyone with a Linux system.

  • Some tests use an absolute data path, which makes them fail for others.
    We should use a relative data path or the /tmp directory, which exists on any Linux system.
  • Some tests use data that does not exist in this repository.
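
One possible shape for the fix, sketched with the standard library only: the helper names scratch_dir and write_fixture, the directory naming scheme and the sample rows are all illustrative assumptions.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Creates a scratch directory under the system temp dir (typically `/tmp`
/// on Linux) instead of relying on a hard-coded absolute path.
pub fn scratch_dir(tag: &str) -> std::io::Result<PathBuf> {
    let name = format!("tb_test_{}_{}", tag, std::process::id());
    let dir = std::env::temp_dir().join(name);
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Writes a small fixture file generated in code, so the test does not
/// depend on data that is missing from the repository.
pub fn write_fixture(dir: &Path) -> std::io::Result<PathBuf> {
    let file = dir.join("sample.csv");
    fs::write(&file, "1,North\n2,South\n")?;
    Ok(file)
}
```

Including the process id in the directory name keeps concurrent test runs on the same machine from colliding.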

Status of this project?

What is the status of this project? I only see an initial commit m0 and hardly any coding changes afterwards. Is this project terminated?

Error inserting string value into table

Hi, the documentation indicates that the string data type is supported, but attempting to insert a string into an existing table fails with NoFixedSizeDataTypeError.


SHOW CREATE TABLE sales

Query id: e96c5bbe-52ad-4fcc-9df0-1afdab76700e

┌─statement─────────────────────────────────────────────────┐
│ create table sales ( Region String ) ENGINE = BaseStorage │
└───────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.000 sec.

TensorBase :) insert into sales (Region) values ('North')

INSERT INTO sales (Region) VALUES

Query id: 98e40494-bbbc-461d-bb7d-7e7798987b4d


1 rows in set. Elapsed: 0.001 sec.

Received exception from server (version 2021.3.0):
Code: 4. DB::Exception: Received from localhost:9528. WrappingMetaError(NoFixedSizeDataTypeError). No fixed size for dynamic sized data type.

The same problem occurs with Decimal(x,y) data types.

Write Ahead Log

TB already survives an application crash. It would be nice to have a WAL to protect against a kernel crash or a sudden machine shutdown.
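
To make the idea concrete, here is a minimal sketch of an append-only log with length-prefixed records. The record format and the function names wal_append and wal_replay are assumptions for illustration, not a proposed TensorBase design.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Appends one record as `[u32 little-endian length][payload]` and asks the
/// OS to flush it to stable storage before returning.
pub fn wal_append(path: &Path, payload: &[u8]) -> io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(&(payload.len() as u32).to_le_bytes())?;
    f.write_all(payload)?;
    f.sync_all() // fsync, so the record is intended to outlive a kernel crash
}

/// Replays all complete records; a truncated tail (torn write) is ignored.
pub fn wal_replay(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 4 <= buf.len() {
        let len_bytes = [buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]];
        let len = u32::from_le_bytes(len_bytes) as usize;
        pos += 4;
        if pos + len > buf.len() {
            break; // incomplete record at the end of the log
        }
        records.push(buf[pos..pos + len].to_vec());
        pos += len;
    }
    Ok(records)
}
```

On restart, the server would replay the log and re-apply any records not yet reflected in the main storage. A production design would also need checksums and log truncation, which this sketch leaves out.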

setup some github bots

  • github action for building
  • github action for testing
  • github action for coverage
  • add github performance bot
  • Dependabot

build failed caused by wrong branch name

error: failed to get `baselog` as a dependency of package `meta v0.1.0 (/Users/kaichen/Documents/projects/tensorbase/crates/meta)`

Caused by:
  failed to load source for dependency `baselog`

Caused by:
  Unable to update https://github.com/tensorbase/baselog.git

Caused by:
  failed to find branch `master`

Caused by:
  cannot locate remote-tracking branch 'origin/master'; class=Reference (4); code=NotFound (-3)

Join a Foundation to Ensure Open Source Continuity

Is it possible to consider joining a foundation to ensure open source continuity? Since this is licensed under Apache License 2.0, perhaps the ASF may be a good fit if you can be accepted; otherwise there are other foundations, like the Linux Foundation, Cloud Native Foundation, etc., which you can approach.

build failed with `rustc --explain E0433`

error[E0433]: failed to resolve: could not find `addr_of` in `ptr`
   --> /Users/kaichen/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.40/src/error.rs:606:14
    |
606 |         ptr::addr_of!((*unerased.as_ptr())._object) as *mut E,
    |              ^^^^^^^ could not find `addr_of` in `ptr`

error[E0433]: failed to resolve: could not find `addr_of` in `ptr`
   --> /Users/kaichen/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.40/src/error.rs:647:22
    |
647 |                 ptr::addr_of!((*unerased.as_ptr())._object) as *mut E,
    |                      ^^^^^^^ could not find `addr_of` in `ptr`

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0433`.
error: could not compile `anyhow`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

[RFC] Distributed Storage & Query

WHY

Currently, TensorBase only supports single node mode. A single node may not have enough space for all the data and we need to store them in a distributed manner. By introducing components like Ballista, we can enable TB to support distributed storage and query.

HOW

Currently, a ClickHouse compatible SQL query will be parsed and passed to TB/engine, TB/engine will then invoke DataFusion to execute the query. To support distributed storage and query, we can add a distributed engine (e.g., Ballista) between TB/engine and DataFusion.

For example, when TB is configured to use Ballista to support distributed storage and query, TB/engine can act as a Ballista client and send ExecuteQuery to the Ballista scheduler. The scheduler will then distribute the work to the executor(s). For more details about the architecture of Ballista, please refer to this doc.

In the future, TB may support different distributed engines other than Ballista. We should be able to integrate them in a similar manner.
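
One way to keep the engine pluggable is a small trait that TB/engine calls without knowing where the query runs. The sketch below is purely illustrative: QueryBackend, LocalBackend, DistributedBackend and select_backend are hypothetical names, and the distributed variant only formats a message instead of talking to a real Ballista scheduler.

```rust
/// Hypothetical abstraction over where a query is executed.
pub trait QueryBackend {
    /// Executes `sql` and returns a textual description of what happened.
    fn execute(&self, sql: &str) -> String;
}

/// Runs the query in-process (the current single-node path).
pub struct LocalBackend;

/// Forwards the query to a remote scheduler, e.g. a Ballista scheduler.
pub struct DistributedBackend {
    pub scheduler_addr: String,
}

impl QueryBackend for LocalBackend {
    fn execute(&self, sql: &str) -> String {
        format!("local: {}", sql)
    }
}

impl QueryBackend for DistributedBackend {
    fn execute(&self, sql: &str) -> String {
        // A real implementation would send an ExecuteQuery request here.
        format!("submit to {}: {}", self.scheduler_addr, sql)
    }
}

/// Picks a backend from configuration; `None` means single-node mode.
pub fn select_backend(scheduler_addr: Option<&str>) -> Box<dyn QueryBackend> {
    match scheduler_addr {
        Some(addr) => Box::new(DistributedBackend {
            scheduler_addr: addr.to_string(),
        }),
        None => Box::new(LocalBackend),
    }
}
```

Supporting another distributed engine would then mean adding one more implementation of the trait, leaving the call sites in TB/engine unchanged.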

TODO

  • Add Ballista as a distributed storage and query engine
    • Add arrow-datafusion/ballista to TB #87
