baidu / tera

An Internet-Scale Database.

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 1.90%, Shell 0.63%, C++ 94.78%, C 1.50%, Makefile 0.44%, Java 0.73%, CMake 0.02%
Topics: nosql database c-plus-plus baidu data storage bigtable hbase

tera's Introduction

Tera - An Internet-Scale Database

Copyright 2015, Baidu, Inc.

Tera is a high-performance distributed NoSQL database, inspired by Google's BigTable and designed for real-time applications. Tera can easily scale to petabytes of data across thousands of commodity servers. Tera is widely used in many Baidu products with varied demands, ranging from throughput-oriented applications to latency-sensitive services, including web indexing, WebPage DB, LinkBase DB, etc. (Chinese version)

Features

  • Linear and modular scalability
  • Automatic and configurable sharding
  • Ranged and hashed sharding strategies
  • MVCC
  • Column-oriented storage and locality group support
  • Strictly consistent
  • Automatic failover support
  • Online schema change
  • Snapshot support
  • RAMDISK/SSD/DFS tiered cache support
  • Block cache and Bloom Filters for real-time queries
  • Multi-type table support (RAMDISK/SSD/DISK table)
  • Easy-to-use C++/Java/Python/RESTful API

Data model

Tera is a collection of many sparse, distributed, multidimensional tables. Each table is indexed by a row key, a column key, and a timestamp; each value in the table is an uninterpreted array of bytes.

  • (row:string, (column family+qualifier):string, time:int64) → string

To learn more about the schema, you can refer to BigTable.
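
The mapping above can be pictured as a sorted map from (row, column, timestamp) to a byte string. The following is a minimal in-memory sketch of that idea only; `SparseTable`, `CellKey`, and the choice to order timestamps newest-first are illustrative assumptions, not Tera's SDK or storage layout.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical cell key; "column" stands for column family + qualifier.
struct CellKey {
    std::string row;
    std::string column;
    int64_t timestamp;
};

// Orders by row, then column, then newest timestamp first (an assumption).
struct CellKeyLess {
    bool operator()(const CellKey& a, const CellKey& b) const {
        if (a.row != b.row) return a.row < b.row;
        if (a.column != b.column) return a.column < b.column;
        return a.timestamp > b.timestamp;
    }
};

// In-memory stand-in for one table; values are uninterpreted bytes.
class SparseTable {
public:
    void Put(const std::string& row, const std::string& column,
             int64_t timestamp, const std::string& value) {
        cells_[CellKey{row, column, timestamp}] = value;
    }

    // Finds the newest version with timestamp <= max_timestamp.
    // Returns false if the cell has no such version.
    bool Get(const std::string& row, const std::string& column,
             int64_t max_timestamp, std::string* value) const {
        auto it = cells_.lower_bound(CellKey{row, column, max_timestamp});
        if (it == cells_.end() || it->first.row != row ||
            it->first.column != column) {
            return false;
        }
        *value = it->second;
        return true;
    }

private:
    std::map<CellKey, std::string, CellKeyLess> cells_;
};
```

Because only cells that were written occupy an entry, the map is sparse in the same sense as the tables described above, and keeping several timestamps per cell is what makes versioned reads possible.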

Architecture

Architecture diagram

Tera has three major components: the SDK, the master, and tablet servers.

  • SDK: a library that is linked into every application client to access a Tera cluster.
  • Master: responsible for managing tablet servers and tablets, automatic load balancing, and garbage collection of files in the file system.
  • Tablet Server: the core module in Tera, which uses an enhanced LevelDB as its basic storage engine. A tablet server manages a set of tablets, handles read/write/scan requests, and schedules tablet splits and merges online.

Building blocks

Tera is built on several pieces of open source infrastructure.

  • Filesystem (required)

    Tera uses a distributed file system to store transaction logs and data files. Tera therefore uses an abstract file system interface, called Env, to adapt to different file system implementations (e.g., BFS, HDFS, HDFS2, POSIX file systems).

  • Distributed lock service (required)

    Tera relies on a highly available and persistent distributed lock service, which is used for a variety of tasks: ensuring that there is at most one active master at any time, storing the meta table's location, discovering new tablet servers, and finalizing tablet server deaths. Tera has an adapter class to adapt to different lock service implementations (e.g., ZooKeeper, Nexus).

  • High performance RPC framework (required)

    Tera is designed to handle a variety of demanding workloads, ranging from throughput-oriented applications to latency-sensitive services, so it needs a high-performance network programming framework. Tera currently relies heavily on Sofa-pbrpc to meet this demand.

  • Cluster management system (optional)

    A Tera cluster in Baidu typically operates in a shared pool of machines that runs a wide variety of other distributed applications. Tera can therefore be deployed on the cluster management system Galaxy, which is used for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status. Tera can also be deployed on bare machines or in Docker containers.
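
The adapter idea used for the file system and the lock service can be sketched as an abstract interface with interchangeable implementations. The example below shows, under stated assumptions, how a lock service can guarantee at most one active master. `LockService`, `InMemoryLockService`, `MasterCandidate`, and the path `/tera/master-lock` are all hypothetical names for illustration; they are not Tera's classes, and the in-memory version is single-process and not thread-safe.

```cpp
#include <map>
#include <string>

// Hypothetical adapter interface; a real one would wrap ZooKeeper or Nexus.
class LockService {
public:
    virtual ~LockService() {}
    // Returns true if `owner` holds `path` after the call.
    virtual bool TryLock(const std::string& path, const std::string& owner) = 0;
    // Releases `path` only if `owner` currently holds it.
    virtual void Unlock(const std::string& path, const std::string& owner) = 0;
};

// In-memory implementation for experiments in a single process.
class InMemoryLockService : public LockService {
public:
    bool TryLock(const std::string& path, const std::string& owner) override {
        auto it = holders_.find(path);
        if (it == holders_.end()) {
            holders_[path] = owner;
            return true;
        }
        return it->second == owner;
    }
    void Unlock(const std::string& path, const std::string& owner) override {
        auto it = holders_.find(path);
        if (it != holders_.end() && it->second == owner) holders_.erase(it);
    }

private:
    std::map<std::string, std::string> holders_;
};

// A master candidate is active only while it holds the master lock.
class MasterCandidate {
public:
    MasterCandidate(LockService* locks, const std::string& id)
        : locks_(locks), id_(id), active_(false) {}
    bool TryBecomeActive() {
        active_ = locks_->TryLock("/tera/master-lock", id_);
        return active_;
    }
    void StepDown() {
        locks_->Unlock("/tera/master-lock", id_);
        active_ = false;
    }
    bool active() const { return active_; }

private:
    LockService* locks_;
    std::string id_;
    bool active_;
};
```

Code written against `LockService` does not change when the backing service does, which is the point of the adapter class described above.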

Documents

Quick start

Contributing to Tera

Contributions are welcome and greatly appreciated.

Read the Roadmap for an overview of our development plan.

See Contributions for more details.

Follow us

To join us, please send your resume to tera-user at baidu.com.

tera's People

Contributors

00k, baobaoyeye, bluebore, elithnever, fxsjy, imotai, junhuihuang, lylei, lylei9, michellez, owenliang, pengyusun, shanshanpt, smwyzi, sunjoy1984, taocp, xiaoqing-yuanfang, xupeilin, ye-tian-zero, yoyzhou, yvxiang, yyq224444

tera's Issues

sync API issue

  • User-defined timeouts are not implemented
  • Internal batching causes high latency

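An issue with merge (merging several tablets may scramble the order of events)

a and b are to be merged, so unload is first issued on each of them.
b's unload fails, so b is loaded again and returns to the ready state.
Next, b and c are selected for a merge, and unload is issued on each of them.
Then a's unload succeeds, and a finds b still in the unload state. (This time b is unloading for the merge with c, but a does not know that. It assumes b is still unloading for its own merge, so it moves itself to the OnMerge state and waits for b's unload to finish so that the merge can start.) 😢

Once b and c finish unloading, they merge into d and b no longer exists, yet a stays in the OnMerge state forever. Poor a; developers, please rescue it!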
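
One way to reason about this race is to tag every merge with an id, so that a tablet can tell whether its partner is still unloading for the same merge. The sketch below replays the a/b/c scenario under that assumption. `MergeScheduler`, `Tablet`, and the `merge_id` field are hypothetical and are not taken from Tera's master code; this illustrates one possible guard, not the project's fix.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class TabletState { kReady, kUnloading, kOnMerge };

struct Tablet {
    TabletState state = TabletState::kReady;
    uint64_t merge_id = 0;  // 0 means "not part of any merge"
};

// Hypothetical scheduler that tags each merge attempt with a unique id.
class MergeScheduler {
public:
    // Starts unloading both tablets; returns the merge id, or 0 if refused.
    uint64_t BeginMerge(const std::string& left, const std::string& right) {
        if (left == right) return 0;
        Tablet& l = tablets_[left];
        Tablet& r = tablets_[right];
        if (l.state != TabletState::kReady || r.state != TabletState::kReady) {
            return 0;
        }
        const uint64_t id = ++next_id_;
        l.state = TabletState::kUnloading;
        r.state = TabletState::kUnloading;
        l.merge_id = id;
        r.merge_id = id;
        return id;
    }

    // Unload failed: the tablet is reloaded and leaves its merge.
    void OnUnloadFailed(const std::string& name) {
        Tablet& t = tablets_[name];
        t.state = TabletState::kReady;
        t.merge_id = 0;
    }

    // Unload succeeded. The tablet waits in OnMerge only if the partner still
    // belongs to the same merge; otherwise it aborts and becomes ready again.
    bool OnUnloadDone(const std::string& name, const std::string& partner) {
        Tablet& t = tablets_[name];
        const Tablet& p = tablets_[partner];
        if (t.merge_id == 0 || p.merge_id != t.merge_id) {
            t.state = TabletState::kReady;
            t.merge_id = 0;
            return false;
        }
        t.state = TabletState::kOnMerge;
        return true;
    }

    TabletState StateOf(const std::string& name) const {
        auto it = tablets_.find(name);
        return it == tablets_.end() ? TabletState::kReady : it->second.state;
    }

private:
    std::map<std::string, Tablet> tablets_;
    uint64_t next_id_ = 0;
};
```

With the id check, a notices that b now belongs to a different merge and returns to the ready state instead of waiting indefinitely.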

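Problems encountered while trying out Tera; advice appreciated

A few conceptual questions:
1. In the API description, what do LG and CF mean? What are the principles for dividing data among them?
2. In the read/write interfaces of mutation and Reader, what is the qualifier parameter?
3. In the description of the scan interface,
after the range is set, the following two add calls are made:
desc->AddColumnFamily("family21");
desc->AddColumn("family22", "qualifier22");
Do these specify the columns to scan?
If nothing is specified, is the whole row scanned?
After scanning multiple columns, how do I read a particular column's data from result_stream?
Please explain these points in the API description. Thank you very much.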

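Fix streaming scan

  • Report the current throughput of scanning a single tablet
  • State the expected throughput of streaming scan
  • Verify streaming scan performance after the fix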

2 issues about CopyToLocal

  1. Limit the parallelism of copy-to-local to avoid impacting the DFS.
  2. The number of pending reads rises during copy-to-local, which may trigger load balancing and make the situation worse.
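
The first item, capping parallel copies, can be sketched as a small counting limiter. `CopyLimiter` and its method names are hypothetical, and the cap would presumably come from configuration. The sketch addresses only the parallelism limit, not the load-balancing interaction in the second item.

```cpp
#include <mutex>

// Hypothetical limiter for concurrent copy-to-local tasks.
class CopyLimiter {
public:
    explicit CopyLimiter(int max_in_flight)
        : max_(max_in_flight), in_flight_(0) {}

    // Reserves a slot and returns true if fewer than max copies are running.
    bool TryAcquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (in_flight_ >= max_) return false;
        ++in_flight_;
        return true;
    }

    // Frees a slot once a copy finishes or fails.
    void Release() {
        std::lock_guard<std::mutex> lock(mu_);
        if (in_flight_ > 0) --in_flight_;
    }

    int in_flight() const {
        std::lock_guard<std::mutex> lock(mu_);
        return in_flight_;
    }

private:
    const int max_;
    mutable std::mutex mu_;
    int in_flight_;
};
```

A caller that fails to acquire a slot would keep reading from the DFS as before and try the copy again later, so the limiter bounds extra DFS load without blocking reads.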

There must be a big bug hiding behind this mysterious core dump

Triggered when the distributed file system is unstable and tcmalloc is in use

(gdb) bt
0 tcmalloc::CentralFreeList::FetchFromOneSpans (this=0xe7a8a0, N=24, start=0x59485b10, end=0x59485b18) at src/central_freelist.cc:301
1 0x0000000000b86bb9 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe (this=0xe7a8a0, N=24, start=0x59485b10, end=0x59485b18) at src/central_freelist.cc:282
2 0x0000000000b86c96 in tcmalloc::CentralFreeList::RemoveRange (this=0xe7a8a0, start=0x59485b10, end=0x59485b18, N=24) at src/central_freelist.cc:264
3 0x0000000000b8e53f in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf89d20, cl=Variable "cl" is not available.
) at src/static_vars.h:60
4 0x0000000000b851b3 in (anonymous namespace)::do_malloc_no_errno (size=Variable "size" is not available.
) at src/thread_cache.h:364
5 0x0000000000b92cb5 in tc_malloc (size=8) at src/tcmalloc.cc:1119
6 0x0000003f0bfaf59a in operator new () from /usr/lib64/libstdc++.so.6
7 0x00007fbdb43d26fb in bfs::FileInfo::mutable_name (this=0x1905140) at src/proto/file.pb.h:510
8 0x00007fbdb43cf863 in bfs::FileInfo::MergePartialFromCodedStream (this=0x1905140, input=0x594862d0) at src/proto/file.pb.cc:338
9 0x00007fbdb43cab58 in google::protobuf::internal::WireFormatLite::ReadMessageNoVirtualbfs::FileInfo (input=0x594862d0, value=0x1905140)
at ../../../third-64/protobuf/include/google/protobuf/wire_format_lite_inl.h:488
10 0x00007fbdb43ab933 in bfs::ListDirectoryResponse::MergePartialFromCodedStream (this=0x5ee8eea0, input=0x594862d0) at src/proto/nameserver.pb.cc:3013
11 0x00007fbdb43df0fc in google::protobuf::MessageLite::ParseFromCodedStream (this=0x5ee8eea0, input=0x594862d0) at google/protobuf/message_lite.cc:121
12 0x00007fbdb43df3de in google::protobuf::MessageLite::ParseFromZeroCopyStream (this=0x5ee8eea0, input=Variable "input" is not available.
) at google/protobuf/message_lite.cc:170
13 0x00007fbdb447adf9 in sofa::pbrpc::RpcClientImpl::DoneCallback () from ./bin/bfs_wrapper.so
14 0x00007fbdb44a0ba3 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf2<void, sofa::pbrpc::RpcClientImpl, google::protobuf::Message*, sofa::pbrpc::shared_ptrsofa::pbrpc::RpcControllerImpl const&>, boost::_bi::list3boost::_bi::value<sofa::pbrpc::shared_ptr<sofa::pbrpc::RpcClientImpl >, boost::bi::valuegoogle::protobuf::Message*, boost::arg<1> ()()> >, void, sofa::pbrpc::shared_ptrsofa::pbrpc::RpcControllerImpl const&>::invoke () from ./bin/bfs_wrapper.so
15 0x00007fbdb448e048 in sofa::pbrpc::RpcControllerImpl::Done () from ./bin/bfs_wrapper.so
16 0x00007fbdb448ffb1 in sofa::pbrpc::RpcClientStream::on_received () from ./bin/bfs_wrapper.so
17 0x00007fbdb44a8e57 in sofa::pbrpc::RpcMessageStreamsofa::pbrpc::shared_ptr<sofa::pbrpc::RpcControllerImpl >::on_read_some () from ./bin/bfs_wrapper.so
18 0x00007fbdb44a16bf in boost::asio::asio_handler_invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf2<void, sofa::pbrpc::RpcByteStream, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<sofa::pbrpc::shared_ptr<sofa::pbrpc::RpcByteStream >, boost::arg<1> ()(), boost::arg<2> ()()> >, boost::system::error_code, unsigned long> > ()
from ./bin/bfs_wrapper.so
19 0x00007fbdb44a17a1 in boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::_bi::bind_t<void, boost::_mfi::mf2<void, sofa::pbrpc::RpcByteStream, boost::system::error_code const&, unsigned long>, boost::_bi::list3boost::bi::value<sofa::pbrpc::shared_ptr<sofa::pbrpc::RpcByteStream >, boost::arg<1> ()(), boost::arg<2> (*)()> > >::do_complete ()
from ./bin/bfs_wrapper.so
20 0x00007fbdb448a463 in sofa::pbrpc::ThreadGroupImpl::thread_run () from ./bin/bfs_wrapper.so
21 0x0000003f0b90610a in start_thread () from /lib64/tls/libpthread.so.0
22 0x0000003f0b0c5ee3 in clone () from /lib64/tls/libc.so.6
23 0x0000000000000000 in ?? ()

Compilation fails

Presumably this happens when running build.sh?
libsofa-pbrpc.a(buffer.o):(.data.rel.ro._ZTVN4sofa5pbrpc11WriteBufferE[_ZTVN4sofa5pbrpc11WriteBufferE]+0x38): undefined reference to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)'
libsofa-pbrpc.a(gzip_stream.o):(.data.rel.ro._ZTVN4sofa5pbrpc30AbstractCompressedOutputStreamE[_ZTVN4sofa5pbrpc30AbstractCompressedOutputStreamE]+0x38): undefined reference to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)'
libsofa-pbrpc.a(gzip_stream.o):(.data.rel.ro._ZTVN4sofa5pbrpc16GzipOutputStreamE[_ZTVN4sofa5pbrpc16GzipOutputStreamE]+0x38): undefined reference to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)'
libsofa-pbrpc.a(block_wrappers.o):(.data.rel.ro._ZTVN4sofa5pbrpc28BlockCompressionOutputStreamE[_ZTVN4sofa5pbrpc28BlockCompressionOutputStreamE]+0x38): undefined reference to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)'
libsofa-pbrpc.a(block_wrappers.o):(.data.rel.ro._ZTVN4sofa5pbrpc18SnappyOutputStreamE[_ZTVN4sofa5pbrpc18SnappyOutputStreamE]+0x38): undefined reference to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)'
libsofa-pbrpc.a(block_wrappers.o):(.data.rel.ro._ZTVN4sofa5pbrpc15LZ4OutputStreamE[_ZTVN4sofa5pbrpc15LZ4OutputStreamE]+0x38): more undefined references to `google::protobuf::io::ZeroCopyOutputStream::WriteAliasedRaw(void const*, int)' follow
collect2: error: ld returned 1 exit status
make: *** [sofa-pbrpc-client] Error 1
Also, build.sh is rather crudely written.

Another mysterious core dump

0 0x000000000099a847 in leveldb::TableIter::Valid (this=0x7f732788d550) at table/table.cc:115
1 0x000000000099a264 in leveldb::TableIter::value (this=0x7f732788d550) at table/table.cc:125
2 0x000000000099605a in leveldb::(anonymous namespace)::IterWrapper::value (this=0x7f7170971a30) at table/merger.cc:44
3 0x0000000000995ffa in leveldb::(anonymous namespace)::MergingIterator::value (this=0x7f714c2c1cb0) at table/merger.cc:231
4 0x00000000009c7040 in leveldb::DBImpl::DoCompactionWork (this=0x7f7325f0a4d0, compact=0x7f724e7bacd0) at db/db_impl.cc:1170
5 0x00000000009c5c32 in leveldb::DBImpl::BackgroundCompaction (this=0x7f7325f0a4d0) at db/db_impl.cc:871
6 0x00000000009c54a7 in leveldb::DBImpl::BackgroundCall (this=0x7f7325f0a4d0) at db/db_impl.cc:786
7 0x00000000009c540b in leveldb::DBImpl::BGWork (db=0x7f7325f0a4d0) at db/db_impl.cc:778
8 0x00000000009b8f60 in leveldb::ThreadPool::BGThread (this=0xe9e588) at util/thread_pool.cc:163
9 0x00000000009b8825 in leveldb::ThreadPool::BGThreadWrapper (arg=0xe9e588) at util/thread_pool.cc:46
10 0x0000003f0b90610a in start_thread () from /lib64/tls/libpthread.so.0
11 0x0000003f0b0c5ee3 in clone () from /lib64/tls/libc.so.6
12 0x0000000000000000 in ?? ()

Sometimes it also crashes at line 112; either way it is always in Valid.

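Simple snapshot rollback

Support snapshot rollback when no merge has taken place; there are no performance requirements for now.

  • Snapshots work
  • Rollback is supported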

incremental GC

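Run GC on tablets that are already ready, without requiring every tablet to be ready.

  • Extract the existing global GC from the master
  • Add a new incremental GC strategy as an option for people to try
  • Replace the original global GC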

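The size shown by the client is inaccurate

The size of level 0 is not counted, so the displayed value differs greatly from the actual one.
Consider counting the size of level 0, and the size of the memtable as well.
Alternatively, report two values: the estimated data size and the DFS space usage.
Open questions

  1. When did the problem start?
  2. How can exact statistics be implemented?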
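
The gap described above can be made concrete with a small calculation that compares a size leaving out level 0 with an estimate that also counts level 0 and the memtable. `TabletSizeInput` and both function names are hypothetical. This is a sketch of the accounting proposed in the report, not the code behind the client's show command.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tablet inputs; level_bytes[0] is level 0.
struct TabletSizeInput {
    std::vector<uint64_t> level_bytes;
    uint64_t memtable_bytes;
};

// Size with level 0 left out, as the report describes the current behavior.
uint64_t SizeWithoutLevel0(const TabletSizeInput& in) {
    uint64_t total = 0;
    for (size_t i = 1; i < in.level_bytes.size(); ++i) {
        total += in.level_bytes[i];
    }
    return total;
}

// Estimate that also counts level 0 and the memtable.
uint64_t EstimatedDataSize(const TabletSizeInput& in) {
    uint64_t total = in.memtable_bytes;
    for (size_t i = 0; i < in.level_bytes.size(); ++i) {
        total += in.level_bytes[i];
    }
    return total;
}
```

For a tablet whose data mostly sits in level 0, for example shortly after heavy writes, the two values diverge sharply, which matches the symptom in the report.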

flash_env issue

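After filling the SST cache on flash fails, reads keep going to the DFS and the fill is never retried.

  • Handling of local failures
  • Handling of DFS failures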

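If recovery fails during Load, the dbtable is deleted, which triggers shutdown; this sometimes deadlocks and sometimes dumps core

During Shutdown1, a lock (on DBTable or DBImpl) cannot be acquired, so the process hangs.
The other case is a core dump: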
0 0x0000000000956232 in leveldb::VersionSet::LastSequence (this=0x2bf7a420105ff) at ./db/version_set.h:214
1 0x00000000009a4c11 in leveldb::DBImpl::GetLastVerSequence (this=0x7f8faabdf600) at db/db_impl.cc:1678
2 0x0000000000942de5 in leveldb::DBTable::GarbageClean (this=0x7f8e87bcff50) at db/db_table.cc:1112
3 0x000000000093c65a in leveldb::DBTable::Shutdown1 (this=0x7f8e87bcff50) at db/db_table.cc:135
4 0x000000000093d1dd in ~DBTable (this=0x7f8e87bcff50) at db/db_table.cc:183
5 0x0000000000939c9d in leveldb::DB::Open (options=@0x7f8ebcc38ec0, dbname=@0x7f8ebcc38e88, dbptr=0x7f8ebcc38fd8) at db/db.cc:56
6 0x00000000007f78c8 in tera::io::TabletIO::Load (this=0x7f8ebcc38e30, schema=@0x7f8fba285a30, key_start=@0x7f8f571474c0, key_en

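The code needs more comments.

This holds for internal coding standards and open-source implementations alike, especially Google's open-source projects.
Comments often outnumber the code itself, particularly in .h files.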

Linking problem with snappy

src/leveldb/Makefile

db_bench and tera_bench are prone to failing with /usr/bin/ld: cannot find -lsnappy.

This does not affect building teracli and tera_main.

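Persist the master operation log

Procedures such as load/unload/split/merge need to be persisted; otherwise the master may lose state after a restart, leaving some tablets unserved.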
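
The idea can be sketched as a log that records the start and the end of each operation, so that a restarted master can replay it and find what was left unfinished. `OpLog` and `OpRecord` are hypothetical, and the in-memory vector merely stands in for durable storage. This is an illustration under those assumptions rather than a proposed design for Tera's master.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical log record; a real master would write these durably.
struct OpRecord {
    enum Kind { kBegin, kEnd } kind;
    std::string op;      // e.g. "load", "unload", "split", "merge"
    std::string tablet;
};

class OpLog {
public:
    // Record the intent before the operation starts.
    void Begin(const std::string& op, const std::string& tablet) {
        records_.push_back(OpRecord{OpRecord::kBegin, op, tablet});
    }

    // Record completion after the operation finishes.
    void End(const std::string& op, const std::string& tablet) {
        records_.push_back(OpRecord{OpRecord::kEnd, op, tablet});
    }

    // Replays the log and returns "op:tablet" for every operation that began
    // but never ended; these are the candidates to retry after a restart.
    std::vector<std::string> Unfinished() const {
        std::set<std::string> open;
        for (const OpRecord& r : records_) {
            const std::string key = r.op + ":" + r.tablet;
            if (r.kind == OpRecord::kBegin) {
                open.insert(key);
            } else {
                open.erase(key);
            }
        }
        return std::vector<std::string>(open.begin(), open.end());
    }

private:
    std::vector<OpRecord> records_;  // stands in for a durable log file
};
```

Writing the begin record before acting is what lets the restarted master notice a half-finished split or merge instead of silently leaving the tablet unserved.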

Upgrade split

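  • Change the original split procedure so that the master, rather than the ts (tablet server), updates the meta table;
  • On the ts side, unload the old tablet and load the two new child tablets (groundwork for reducing split latency under high load later);
  • Have the master write a log entry before splitting, so that the split can still be retried after the master crashes and restarts;
  • Separate out the periodic collection of split keys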

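The BUILD instructions could be more professional

1. State which external base libraries are required.
2. Give the download URLs of these libraries directly.
3. State whether the dependencies have version requirements, for example whether pb3 can be used for pb.
4. Provide a complete installation walkthrough with the expected output,
such as configure --help, where things are installed, and what success looks like.
5. Explain how to start the system.
6. Explain how to control the system.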

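The README wording may cause a misunderstanding

https://github.com/BaiduPS/tera/blob/master/README.md#L32 (under system dependencies, the sentence "Uses a distributed file system (HDFS, NFS, etc.) to persist data and metadata")

The NFS here refers to Baidu's private nfs.
On seeing "NFS", the public will probably first think of the NFS that is likewise a distributed file system.
Could this be ambiguous? If so, I suggest removing the word "NFS".

Also, it may be worth stating that although the code contains an adapter for an nfs file system, that nfs is a system developed in-house at Baidu, not Sun's.
This would keep users from tinkering for a long time under the impression that Tera can be built on Sun's NFS.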
