Comments (9)
Hi @jeromegn
Frankly, we haven't figured out how to solve the root cause. Instead, I came up with two workarounds I can share with you.
- For any real-time data analysis I just use an in-memory DB. I only use the persistent DB for storing and querying data, like a more traditional use of a database.
This is because I ran it under flamegraph and saw about a quarter of its runtime spent on file I/O. I also found that the stock sled storage does not handle concurrency well: between the start and end of a transaction it locks the file, so each transaction has to be "atomic". I have a modded version, wrapped in Arc<RwLock>, that you can run concurrently without the database being locked. (You need to terminate the program gracefully, though.)
- Instead of having 100 symbols in a single table, I split them into 100 individual tables.
Although the table storage structure might be a hashmap, filtering and sorting still make a query O(n) or worse by nature, so a query over a table 100 times smaller is at least 100 times faster.
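To make the two workarounds concrete, here is a minimal std-only sketch. It is not the modded storage itself and does not use the GlueSQL or sled APIs; `SharedStore`, `insert` and `select_range` are hypothetical names for illustration. It assumes one in-memory table per symbol, keyed by timestamp, behind an `Arc<RwLock<..>>` so that readers can run concurrently and writers hold the lock only for a single insert.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, RwLock};
use std::thread;

/// One "table" per symbol, keyed by timestamp (seconds) so range scans are cheap.
type SymbolTable = BTreeMap<u64, f64>;

/// Hypothetical shared store: symbol name -> its own table.
#[derive(Clone, Default)]
struct SharedStore {
    tables: Arc<RwLock<HashMap<String, SymbolTable>>>,
}

impl SharedStore {
    /// Takes the write lock only for the duration of one insert.
    fn insert(&self, symbol: &str, ts: u64, price: f64) {
        let mut tables = self.tables.write().unwrap();
        tables.entry(symbol.to_string()).or_default().insert(ts, price);
    }

    /// Readers share the lock; only the requested symbol's table is scanned.
    fn select_range(&self, symbol: &str, from: u64, to: u64) -> Vec<(u64, f64)> {
        let tables = self.tables.read().unwrap();
        match tables.get(symbol) {
            Some(table) => table.range(from..to).map(|(ts, p)| (*ts, *p)).collect(),
            None => Vec::new(),
        }
    }
}

fn main() {
    let store = SharedStore::default();
    // Two writer threads, one per symbol, each writing an hour of 1s data.
    let writers: Vec<_> = vec!["BTC", "ETH"]
        .into_iter()
        .map(|symbol| {
            let store = store.clone();
            thread::spawn(move || {
                for ts in 0..3600u64 {
                    store.insert(symbol, ts, ts as f64);
                }
            })
        })
        .collect();
    for w in writers {
        w.join().unwrap();
    }
    // Select the first minute of one symbol without touching the other table.
    let rows = store.select_range("BTC", 0, 60);
    println!("selected {} rows", rows.len());
}
```

Because each symbol has its own `BTreeMap`, a time-range select costs a tree lookup plus the rows returned, and never iterates over the other symbols' rows.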
from gluesql.
Did you mean that selecting 360,000 (100 * 60 * 60) rows from sled storage took 2.7 seconds, or 1 row out of 360,000 rows?
Did it include client latency to get all the rows, or just the execution time of the query?
And I wonder what the full query text was.
https://gist.github.com/JakkuSakura/4bb9678501dbabf56c1b6d95269740aa
This is the source code used for the benchmark.
It is not the exact one, but it still shows something.
persistent database insertion for 1s data: 7ms
volatile database insertion for 1s data: 1ms
volatile shared database selection with 1 symbols 24hr data: 1581ms, 1hr=false
volatile shared database selection with 1 symbols 24hr data: 1587ms, 1hr=true
persistent database selection with 1 symbols 1hr data: 93ms, 1hr=false
volatile shared database selection with 1 symbols 1hr data: 75ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 2067ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 2071ms, 1hr=true
It's completely useless
There's no simple way we can improve it.
1hr=true means that we select one hour's data from the whole dataset.
> Did you mean that selecting 360,000 (100 * 60 * 60) rows from sled storage took 2.7 seconds, or 1 row out of 360,000 rows? Did it include client latency to get all the rows, or just the execution time of the query?
> And I wonder what the full query text was.
It is 3,600 rows from 360,000, as in getting one symbol out of 100 symbols for 1 hr data.
Jakku has set up a test with a slight modification to the above, changing the symbol string to a symbol ID. It helped quite a lot, so my next optimization guess is to change from Decimal to f64.
persistent database insertion for 1s data: 2ms
volatile database insertion for 1s data: 1ms
volatile shared database selection with 1 symbols 24hr data: 380ms, 1hr=false
volatile shared database selection with 1 symbols 24hr data: 385ms, 1hr=true
persistent database selection with 1 symbols 1hr data: 36ms, 1hr=false
volatile shared database selection with 1 symbols 1hr data: 17ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 826ms, 1hr=true
persistent database selection with 1 symbols 24hr data: 838ms, 1hr=false
But again, a query selecting 86,400 rows out of 8,640,000 in a persistent table taking 800+ms is quite slow. We are aiming for somewhere below 50ms.
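For reference, the string-to-ID change can be sketched as below. This is a std-only illustration, not the benchmark code from the gist; `SymbolInterner`, `intern` and `name` are hypothetical names. It assumes each distinct symbol string is assigned a small integer once, so rows store and compare a `u32` instead of a string.

```rust
use std::collections::HashMap;

/// Hypothetical helper: assigns each distinct symbol string a small integer ID.
#[derive(Default)]
struct SymbolInterner {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl SymbolInterner {
    /// Returns the existing ID for `symbol`, or allocates the next one.
    fn intern(&mut self, symbol: &str) -> u32 {
        if let Some(&id) = self.ids.get(symbol) {
            return id;
        }
        let id = self.names.len() as u32;
        self.ids.insert(symbol.to_string(), id);
        self.names.push(symbol.to_string());
        id
    }

    /// Maps an ID back to its symbol string, e.g. for display.
    fn name(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(|s| s.as_str())
    }
}

fn main() {
    let mut interner = SymbolInterner::default();
    let btc = interner.intern("BTC-USDT");
    let eth = interner.intern("ETH-USDT");
    // Rows store the integer ID; filtering compares u32 values, not strings.
    let rows = vec![(btc, 100.0), (eth, 10.0), (btc, 101.0)];
    let btc_rows = rows.iter().filter(|(id, _)| *id == btc).count();
    println!("{} rows for {:?}", btc_rows, interner.name(btc));
}
```

The symbol names here are made up for the example. The ID mapping would need to be stored alongside the data if the IDs must stay stable across restarts.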
I'm wondering what the limiting factors for performance are here. I get that changing the symbol string to an ID improves performance, because it reduces the time spent on symbol comparison. But does the size of each row make a big difference to selection performance in GlueSQL?
Roughly speaking, sled is structured like a BTreeMap, so the best case for a query should be O(logN). However, 1hr vs 24hr data currently takes 36ms vs 838ms, which is roughly the 24 times you would expect from O(N), instead of the roughly 4.5 times (log2 24 ≈ 4.6) I would expect from O(logN).
The same applies to SharedMemoryStorage: 17ms vs 385ms is also around 24 times.
It looks like some per-row operation is taking so much time that what should be an O(logN) problem is becoming an O(N) problem.
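The arithmetic behind that comparison can be checked with a short sketch using the timings quoted above. The `slowdown` helper is a hypothetical name, and using log2(24) as the reference for logarithmic growth is the rough estimate from this comment, not a measured quantity.

```rust
/// Ratio between two timings given in the same unit.
fn slowdown(small_ms: f64, large_ms: f64) -> f64 {
    large_ms / small_ms
}

fn main() {
    // 24hr of data is 24 times as many rows as 1hr of data.
    let data_growth = 24.0_f64;
    // Timings from the benchmark output: 1hr vs 24hr selection.
    let sled = slowdown(36.0, 838.0);
    let memory = slowdown(17.0, 385.0);
    println!("sled storage:   {:.1}x slower", sled);
    println!("shared memory:  {:.1}x slower", memory);
    println!("linear growth:  {:.1}x", data_growth);
    println!("log2 of growth: {:.2}", data_growth.log2());
}
```

Both measured ratios come out at about 23, close to the 24-fold growth in rows, which is what a per-row cost would produce.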
@kanekoshoyu I'm interested in knowing if you've figured this out.
Since SharedMemoryStorage also shows degradation, I wonder if you could find the cause with a CPU profile. I suspect this is not a sled problem and is possibly something internal to GlueSQL?
https://github.com/kanekoshoyu/gluesql_shared_sled_storage
This is the link to the modded sled storage.
Also, a few micro-optimisations here and there:
- Do not sort or filter on strings; it is too slow. Try using an index instead.
- Try using the AST directly instead of a query string. Each query string is converted to an AST at run time.