Comments (23)
@derlaft #115
This should satisfy everyone's needs.
from qryn.
@derlaft I see the problem. Will think about an acceptable solution.
Hmmm - looks like original loki uses some kind of fingerprint mapping: https://github.com/grafana/loki/blob/a3bacd5739da2d41f3f9a2ad45fb58e8d0039006/pkg/ingester/stream.go#L72
https://github.com/grafana/loki/blob/a3bacd5739da2d41f3f9a2ad45fb58e8d0039006/vendor/github.com/prometheus/common/model/fingerprinting.go#L23
@derlaft sorry, but we can't just replace the fingerprinting function, for backward compatibility reasons. I'll discuss with @lmangani whether we will add an env config for this or make the fingerprint function pluggable.
FNV-1a (checked, maybe the difference is not significant) should still be much better than what's currently used, btw.
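For readers unfamiliar with it, here is a minimal sketch of 64-bit FNV-1a, using the published offset basis and prime. The helper name `fnv1a64` and the choice to hash the UTF-8 bytes of a serialized label string are assumptions for illustration; this is not the function that was merged into cLoki/qryn.

```javascript
// Sketch of 64-bit FNV-1a over the UTF-8 bytes of a string.
// fnv1a64 is a hypothetical helper name, not a function from cLoki/qryn.
const FNV_OFFSET_BASIS_64 = 0xcbf29ce484222325n;
const FNV_PRIME_64 = 0x100000001b3n;
const MASK_64 = 0xffffffffffffffffn;

function fnv1a64(input) {
  const bytes = new TextEncoder().encode(input);
  let hash = FNV_OFFSET_BASIS_64;
  for (const byte of bytes) {
    hash ^= BigInt(byte);                   // XOR the byte in first (the "1a" order)
    hash = (hash * FNV_PRIME_64) & MASK_64; // multiply, then wrap to 64 bits
  }
  return hash; // BigInt in [0, 2^64), so it fits a UInt64 column
}

// Example input is an assumed label serialization, for illustration only.
console.log(fnv1a64('{"level":"error"}').toString());
```

BigInt is used because plain JavaScript numbers cannot hold a full 64-bit integer exactly.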
sorry, but we can't just replace the fingerprinting function due to backward compatibility reasons..
What exactly will break? From my understanding, new entries will use new fingerprint values, and at query time they will be joined into the result anyway. Maybe I am missing something.
if we will make an env config for this or make the fingerprint function pluggable.
This is a major problem, IMO it has to be solved out of box one way or another (mapping hash values and avoiding collisions like in cloki if current fingerprinting hash function is kept).
There's also a minor problem with current hash impl:
Here value is converted from int to hex: https://github.com/lmangani/cLoki/blob/c40d1d9664523e60964a8d0b9338575c0a6d4720/lib/utils.js#L25
Here value is converted from hex to int again: https://github.com/joakimbeng/short-hash/blob/b7e1178b45b06bbf1697ba428b7da53664e96979/src/index.js#L5
This is a major problem, IMO it has to be solved out of box one way or another
So currently in the cloki install I observe:
- total rows: 3196510 in the samples table
- only 752 uniq label combinations (select uniq(arraySort(JSONExtractKeysAndValuesRaw(labels))) from time_series)
- a whole 185 different collisions (select count() from (select fingerprint, uniq(arraySort(JSONExtractKeysAndValuesRaw(labels))) c from time_series group by fingerprint having c > 1))
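To put the 185 figure in perspective, here is a rough birthday-bound sketch. It assumes an ideal, uniformly distributed hash, which is an idealization and not a measurement of the actual function; `expectedCollidingPairs` is a hypothetical helper name. Under that assumption, 752 distinct label sets should yield far fewer than one colliding pair even at 32 bits, which suggests the observed collisions come from the quality of the hash rather than from its width alone.

```javascript
// Expected number of colliding pairs among n distinct keys under an
// ideal, uniformly distributed hash with `bits` output bits:
//   pairs / 2^bits, where pairs = n * (n - 1) / 2.
// expectedCollidingPairs is a hypothetical helper for illustration.
function expectedCollidingPairs(n, bits) {
  const pairs = (n * (n - 1)) / 2;
  return pairs / Math.pow(2, bits);
}

const uniqueLabelSets = 752; // figure quoted in the comment above
console.log(expectedCollidingPairs(uniqueLabelSets, 32)); // ideal 32-bit hash
console.log(expectedCollidingPairs(uniqueLabelSets, 64)); // ideal 64-bit hash
```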
Interesting: despite having UInt64 in the database schema, it looks like the current hash function produces a 32-bit value:
> shortHash('wutwutwutwutwutwutwutwutwut')
'6710f373'
SELECT max(fingerprint)
FROM samples_v2
┌─max(fingerprint)─┐
│ 4294963361 │
└──────────────────┘
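The arithmetic behind that observation can be checked with a short sketch. It uses only the two values quoted above; nothing here calls the short-hash library itself.

```javascript
// An 8-character hex string encodes at most 32 bits.
const hex = '6710f373';               // shortHash output quoted above
const asInt = parseInt(hex, 16);
const maxUInt32 = 2 ** 32 - 1;        // 4294967295

console.log(asInt);                   // 1729164147
console.log(asInt <= maxUInt32);      // true
console.log(4294963361 <= maxUInt32); // true: the observed max(fingerprint) fits in 32 bits
```

So the upper half of the UInt64 column is never used by these fingerprints.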
Shorthash shortcomings are agreed upon fully. As said we were already addressing the same so this is a great booster.
Let's proceed with adding a better hashing function (ENV-configured), defaulting to the safest one from a collision perspective.
There should be no backwards compatibility issues when switching hashes other than temporary duplication while the TTL expires old fingerprints.
@akvlad thanks Vlad once again for leading this change
@derlaft changes were merged to master and CI builders should produce the containers shortly. Please let us know if this resolves this thread, and please feel free to PR any other suggestion you might have to the code anytime!
Please let us know if this resolves this thread
I don't see any collisions so far, but maybe it's worth checking after a couple of days.
I did notice a significant difference in memory requirements for running queries. Basically, relatively simple queries which previously worked do not work now. One example:
{pod_namespace="***",level="error"} !~ "strconv.ParseInt"
WITH
str_sel AS
(
SELECT DISTINCT
fingerprint,
labels
FROM cloki.time_series
WHERE ((JSONHas(labels, 'pod_namespace') = 1) AND (JSONExtractString(labels, 'pod_namespace') = '***')) AND ((JSONHas(labels, 'level') = 1) AND (JSONExtractString(labels, 'level') = 'error'))
),
sel_a AS
(
SELECT
time_series.labels AS labels,
samples.string AS string,
samples.fingerprint AS fingerprint,
samples.timestamp_ms AS timestamp_ms
FROM cloki.samples_read AS samples
LEFT JOIN str_sel AS time_series ON samples.fingerprint = time_series.fingerprint
WHERE ((samples.timestamp_ms >= 1643891813000) AND (samples.timestamp_ms <= 1644496614000)) AND (samples.fingerprint IN (
SELECT fingerprint
FROM str_sel
)) AND (extractAllGroups(string, '(strconv.ParseInt)') = [])
ORDER BY
timestamp_ms DESC,
labels DESC
LIMIT 1000
)
SELECT *
FROM sel_a
ORDER BY
labels DESC,
timestamp_ms DESC
FORMAT JSONEachRow
Query id: d1895d6e-7123-401e-a0f7-fe1e563c8155
← Progress: 4.31 million rows, 3.07 GB (3.01 million rows/s., 2.14 GB/s.) ███████████████████████████ 17%
0 rows in set. Elapsed: 1.500 sec. Processed 4.31 million rows, 3.07 GB (2.87 million rows/s., 2.04 GB/s.)
Received exception from server (version 21.12.3):
Code: 241. DB::Exception: Received from clickhouse-staging-node3.zattoo.com:9000. DB::Exception: Memory limit (for query) exceeded: would use 2.53 GiB (attempt to allocate chunk of 8781984 bytes), maximum: 2.50 GiB: (avg_value_size_hint = 817.2848407643312, avg_chars_size = 971.1418089171974, limit = 8192): (while reading column string): (while reading from part /var/lib/clickhouse/store/225/2256bec5-f3ce-431a-a256-bec5f3ce331a/456804_300279_302368_224/ from mark 10 with max_rows_to_read = 8192): While executing MergeTreeReverse. (MEMORY_LIMIT_EXCEEDED)
I have a problem understanding how changing the hashing algorithm could have caused this. The memory limit is quite low (since it's non-prod), but it worked before. Do you think it could possibly be related?
- please check:
SELECT COUNT(DISTINCT fingerprint)
FROM cloki.time_series
WHERE ((JSONHas(labels, 'pod_namespace') = 1) AND (JSONExtractString(labels, 'pod_namespace') = '***')) AND ((JSONHas(labels, 'level') = 1) AND (JSONExtractString(labels, 'level') = 'error'))
to estimate how many labels you have.
- and please check
SELECT COUNT(fingerprint)
FROM cloki.time_series
WHERE ((JSONHas(labels, 'pod_namespace') = 1) AND (JSONExtractString(labels, 'pod_namespace') = '***')) AND ((JSONHas(labels, 'level') = 1) AND (JSONExtractString(labels, 'level') = 'error'))
to estimate how many duplicates you have.
@akvlad first query - 465. second query - also 465. Which makes OOM even more strange, right?
Yes.
What about samples and samples_v2?
How many entries do you have in the samples table? What is the last timestamp in the samples table?
SELECT
count(),
max(timestamp_ms)
FROM samples_v2
┌──count()─┬─max(timestamp_ms)─┐
│ 26488160 │ 1644506160881 │
└──────────┴───────────────────┘
┌─parts.table─┬─────rows─┬─latest_modification─┬─disk_size─┬─primary_keys_size─┬─engine─────────────┬─bytes_size─┬─compressed_size─┬─uncompressed_size─┬───────────────ratio─┐
│ samples_v2 │ 26499449 │ 2022-02-10 15:16:42 │ 2.25 GiB │ 27.13 KiB │ MergeTree │ 2417512731 │ 2.25 GiB │ 20.52 GiB │ 0.10941320871606185 │
│ time_series │ 521380 │ 2022-02-10 15:16:42 │ 28.93 MiB │ 672.00 B │ ReplacingMergeTree │ 30340130 │ 26.34 MiB │ 131.60 MiB │ 0.20014657659781684 │
└─────────────┴──────────┴─────────────────────┴───────────┴───────────────────┴────────────────────┴────────────┴─────────────────┴───────────────────┴─────────────────────┘
@derlaft so you don't have the samples table? samples_v2 only? Even stranger. Ok. I'll retest with the data amounts you describe.
so you don't have the samples table? samples_v2 only? Even stranger.
yep. something went wrong during applying migrations?
@derlaft the samples table had similar OOM problems, but after migration to samples_v2 they disappeared.
Interesting, replacing WHERE with PREWHERE in the inner query helps to avoid OOM:
WITH
str_sel AS
(
SELECT DISTINCT
fingerprint,
labels
FROM cloki.time_series
WHERE ((JSONHas(labels, 'pod_namespace') = 1) AND (JSONExtractString(labels, 'pod_namespace') = '***')) AND ((JSONHas(labels, 'level') = 1) AND (JSONExtractString(labels, 'level') = 'error'))
),
sel_a AS
(
SELECT
time_series.labels AS labels,
samples.string AS string,
samples.fingerprint AS fingerprint,
samples.timestamp_ms AS timestamp_ms
FROM cloki.samples_read AS samples
LEFT JOIN str_sel AS time_series ON samples.fingerprint = time_series.fingerprint
PREWHERE ((samples.timestamp_ms >= 1643891813000) AND (samples.timestamp_ms <= 1644496614000)) AND (samples.fingerprint IN (
SELECT fingerprint
FROM str_sel
)) AND (extractAllGroups(string, '(strconv.ParseInt)') = [])
ORDER BY
timestamp_ms DESC,
labels DESC
LIMIT 1000
)
SELECT *
FROM sel_a
ORDER BY
labels DESC,
timestamp_ms DESC
FORMAT JSONEachRow
...
24 rows in set. Elapsed: 4.189 sec. Processed 24.95 million rows, 20.35 GB (5.96 million rows/s., 4.86 GB/s.)
But I have no idea whether that will scale or apply to all the different kinds of queries cloki can generate.
@derlaft can you replace FROM cloki.samples_read AS samples with FROM cloki.samples_v2 AS samples? Will it increase performance?
Yes! FROM cloki.samples_v2 does not OOM.
Will think about an acceptable solution.
Awesome, thanks a lot.
I think this one can be closed now