
Memory leak on v0.22.0 (collector) - CLOSED

beyondbill commented on July 24, 2024
Memory leak on v0.22.0

Comments (34)

lfittl commented on July 24, 2024

@dbamohsin @JordanP Whilst not directly an update on the memory leak, it might be worth deploying the new collector release 0.38.1, which has been upgraded to the new query parser version.

The query parser now uses a different mechanism to access the parse trees (essentially it avoids a whole bunch of JSON parsing), which previously would have shown up as increased memory usage if you are using log-based EXPLAIN (e.g. on GCP).

There may also be other memory-related improvements associated with a few changes in the query parser, so it would be great to see if this impacts your environment in any way.

beyondbill commented on July 24, 2024

@lfittl Awesome! I deployed v0.3.0 and will let you know after a few days!

JordanP commented on July 24, 2024

It did not help: same crashes, same frequency. I understand it's a hard problem to debug. It trips my monitoring system because processes are not supposed to crash, so I am going to silence this particular alert for this particular process, which means the issue will move off my radar.

Ping me if I can be of any help here in the future.

JordanP commented on July 24, 2024

Cool! Thanks for the update!

I don't think 2. is the root cause. I saw growing memory usage even though I did not enable log collection/processing at all.

dbamohsin commented on July 24, 2024

Hi @lfittl - we have been running 0.37.0 in prod since Monday and unfortunately the result is similar, although the growth is a lot choppier now (as can be seen in the second graph below).

2 day window

image

last 6 hours

image

I will check the size of the collector's internal state today.

dbamohsin commented on July 24, 2024

Hi @lfittl - we have been running 0.38.1 for a few days now and it does seem a lot smoother for memory:

image

I'll keep an eye on it for another week and let you know for sure.

lfittl commented on July 24, 2024

@beyondbill Thanks for the report - that indeed looks like a memory leak.

Upgrading may fix it if the issue is related to Go garbage collector problems - newer collector versions are built with a newer Go version.

If that doesn't fix it, it would help narrow the problem down if you could check whether turning off log collection (PGA_DISABLE_LOGS=1) avoids this memory usage pattern.
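
For example, with the Docker image this is just one more environment variable; the API key and connection settings below are placeholders for whatever you already pass, and the same env entry can go into a Kubernetes pod spec:

# Run the collector with log collection disabled to isolate the leak
docker run --rm \
  -e PGA_API_KEY="$PGA_API_KEY" \
  -e DB_URL="$DB_URL" \
  -e PGA_DISABLE_LOGS=1 \
  quay.io/pganalyze/collector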

Thanks,
Lukas

beyondbill commented on July 24, 2024

@lfittl We plan to upgrade to v0.25.1. When will that version be published to the Docker registry?

➜  ~ docker pull quay.io/pganalyze/collector:v0.25.1
Error response from daemon: manifest for quay.io/pganalyze/collector:v0.25.1 not found: manifest unknown: manifest unknown

lfittl commented on July 24, 2024

@beyondbill Published this morning - you should be good to run docker pull now.

beyondbill commented on July 24, 2024

@lfittl Perfect! Thanks! I'll let you know whether it fixes the memory leak.

beyondbill commented on July 24, 2024

image
I upgraded the collector to v0.25.1 with PGA_DISABLE_LOGS=1. However, there still appears to be a memory leak trend. I'll leave it running for a longer period before drawing a conclusion.

beyondbill commented on July 24, 2024

image
@lfittl I confirmed that the memory leak still exists... We currently put a memory limit on the collector pod in k8s so that it gets killed once memory usage reaches the limit, but we hope to see this issue fixed in the collector.
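
For anyone applying the same stop-gap, a rough sketch of the workaround (the deployment name and memory values here are illustrative; pick whatever fits your environment):

# Cap the collector's memory so Kubernetes OOM-kills and restarts the pod instead of letting it grow unbounded
kubectl set resources deployment pganalyze-collector --requests=memory=128Mi --limits=memory=250Mi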

JordanP commented on July 24, 2024

Same here. It's leaking memory for sure; we (Gorgias, Scale plan) also work around that leak with a memory limit in Kubernetes.

image

lfittl commented on July 24, 2024

@JordanP Thanks for the report - we are still struggling to find the root cause here, but are certainly tracking this as an active issue that we keep investigating.

One thing that would be helpful for us to validate: When you set PGA_DISABLE_LOGS=1 on the container (which turns off log parsing), does the memory usage keep increasing the same way?

JordanP commented on July 24, 2024

@lfittl Log parsing is not working anyway because we are running PG and the pganalyze collector in Kubernetes. Should I still try to set PGA_DISABLE_LOGS=1?

lfittl commented on July 24, 2024

@beyondbill @JordanP We've just released collector version 0.30.0, which fixes a memory leak we found that could occur in certain situations. Could you upgrade and check whether that improves the situation for you?

beyondbill commented on July 24, 2024

@lfittl I believe the memory is still leaking on v0.30.0
image

JordanP commented on July 24, 2024

Same here, upgraded to 0.30 and memory usage is still growing:
image
Note that we have 2 pganalyze collectors (we have 2 PG databases) and one is not suffering from that leak. The collectors are deployed and configured the same way, but the PG instances are used for very different workloads.

msakrejda commented on July 24, 2024

@JordanP can you tell us more about the nature of the different workloads (and log output)? Are the two lines in the graph you pasted above for the two different databases?

JordanP commented on July 24, 2024

@uhoh-itsmaciek I don't know exactly what to say. The one for which pganalyze is not leaking is mostly used for logging of some sort, so it's mostly an append-only workload. The other PG instance is our main one, which serves our application/website.

Yes, the 2 lines in the graph are the memory consumption of the 2 pganalyze containers; each is monitoring a different PG database.

lfittl commented on July 24, 2024

@JordanP Thanks for the update - I have a suspicion that this might be related to the amount of query text data the collector is handling.

Could you try enabling the reset setting under Server Settings in the pganalyze console, and set it to every 6 hours:

Screen Shot 2020-06-23 at 12 55 36 AM

Additionally, when using the separate monitoring user, you'll need to create the following helper function (create it as a superuser):

CREATE OR REPLACE FUNCTION pganalyze.reset_stat_statements() RETURNS SETOF void AS
$$
  /* pganalyze-collector */ SELECT * FROM public.pg_stat_statements_reset();
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
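
Once the helper is in place, you can quickly confirm that the monitoring role is able to call it (the role and database names below are just examples):

# Note: this really does reset pg_stat_statements; it should complete without a permission error
psql -U pganalyze -d postgres -c 'SELECT pganalyze.reset_stat_statements();'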

JordanP commented on July 24, 2024

@lfittl So I tried this, but no luck. I am not so sure it's a memory leak anymore.

This is the memory used by the pganalyze collector (v0.30):
image

We have 2 Kubernetes clusters; each exhibits the same behavior. The pganalyze collector restarts roughly every day. The memory limit is set to 250 Mi, which is never actually reached, and Kubernetes just reports the container as having exited:
image

Exit code 139 means, if I'm not mistaken, a segfault.

So it might not be just a memory leak. What's weird is that there's nothing in my container logs besides a lot of Submitted compact activity snapshot successfully and a couple of Could not collect activity for server: failed to upload/send activity snapshot: Post "https://api.pganalyze.com/v2/snapshots/compact": EOF
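
(139 is 128 + signal 11, so that does point at SIGSEGV. One way to double-check what Kubernetes recorded for the crashed container - the pod name below is a placeholder:)

# Signal 11 is SEGV
kill -l 11
# Exit reason recorded for the previous container instance
kubectl get pod <collector-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Any output the crashed instance produced before dying
kubectl logs <collector-pod> --previous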

Thoughts? Thanks!

lfittl commented on July 24, 2024

@JordanP Thanks for the update!

I'm wondering if you're hitting the bug we fixed in c3d2c3c (released with 0.32.0).

That bug would surface as the collector crashing when it encountered a network error. The backtrace for the error unfortunately doesn't show up reliably, which might explain why you are just seeing the container exit without any explicit message.

Could you attempt updating to the v0.32.0 tag and see whether that fixes it?

JordanP commented on July 24, 2024

@lfittl I upgraded to 0.32.0, I'll keep you posted.

dbamohsin commented on July 24, 2024

Hi @lfittl,

I was going to email you directly but this issue seems to fit!

image

We are also seeing pods restart as they keep hitting a memory limit we have set - based on the trajectory of how memory is eaten up by pganalyze, the limit is arbitrary and I think it would eventually hit any limit we chose.

Like @JordanP above, it trips our monitoring too, and ideally we don't want to create an indefinite silence on it, so any further details would be great.

Collector version: 0.35

Thanks

Mohsin

msakrejda commented on July 24, 2024

Hi @dbamohsin -- could you try with the 0.36 we just released when you have a chance? I don't think any of the fixes there would have impacted memory usage much, but it'd be nice to verify behavior of the latest version.

dbamohsin commented on July 24, 2024

@uhoh-itsmaciek - I pre-empted this and upgraded this morning. Currently monitoring, so I will feed back in the next couple of days. Thanks

dbamohsin commented on July 24, 2024

Hi @uhoh-itsmaciek - verified the behaviour on 0.36:

image

lfittl commented on July 24, 2024

I wanted to give everyone an update on this issue and where we currently stand. There are three theories as to the root cause of this memory leak, in the order in which we are tackling them:

  1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

  2. A problem with log processing in pganalyze retaining too many old log lines that couldn't be correctly associated (we have some logic to ensure that a LOG line followed by e.g. a DETAIL line gets linked together). Whilst previous audits of this code have not shown an issue, we'll take a closer look if (1) doesn't resolve things.

  3. An as-yet-undiscovered leak in the collector codebase that we have not been able to reproduce.

Separately from this, it's also likely that the newer Go release (1.16) improves memory usage on Linux specifically (where the collector is deployed most of the time), and we plan to build with this Go version in the next release or the one after. See details in the Go 1.16 release notes here: https://golang.org/doc/go1.16#runtime
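
As a rough way to separate a real leak from memory the Go runtime has already freed but only returned to the OS lazily (Go 1.12 through 1.15 use MADV_FREE on Linux, which keeps RSS inflated until there is memory pressure; Go 1.16 switches back to MADV_DONTNEED), you can already opt into MADV_DONTNEED on current builds via GODEBUG. The image tag and other settings below are illustrative:

# With MADV_DONTNEED, freed memory leaves the container's RSS immediately,
# so a genuine leak keeps climbing while lazily-returned memory does not.
docker run --rm \
  -e GODEBUG=madvdontneed=1 \
  -e PGA_API_KEY="$PGA_API_KEY" \
  -e DB_URL="$DB_URL" \
  quay.io/pganalyze/collector:v0.36.0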

dbamohsin commented on July 24, 2024
> 1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

The memory leak fix does not seem to make a difference to the leak unfortunately.

We applied on Friday 17th around 4pm:

image

lfittl commented on July 24, 2024
> 1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)
>
> The memory leak fix does not seem to make a difference to the leak unfortunately.
>
> We applied on Friday 17th around 4pm:
>
> image

Thanks for the update - appreciate the quick verification!

Two follow-on things to test:

  1. We've just released collector version 0.37.0, which is built using Go 1.16 - can you update to the quay.io/pganalyze/collector:v0.37.0 tag and see if that makes any difference?

  2. Per @JordanP's report, at least in their situation, it's unlikely that logs are the cause - which means (unless we see two different problems here), we should look for other leaks not yet known.

One way to get some more data on the root cause would be to check the size of the collector's internal state structure, which gets written to disk on every full snapshot (once every 10 minutes). Could you run this inside the container a few hours apart and share the size of the state file?

du -sm /state/pganalyze-collector.state

(since this is stored in memory as well, it gives us a hint as to whether the problem is the collector state itself vs. something else, e.g. a runaway goroutine)
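
When the collector runs in Kubernetes, the same check can be done from outside the container (the pod name below is a placeholder):

# Size, in MB, of the on-disk copy of the collector's in-memory state; compare readings a few hours apart
kubectl exec <collector-pod> -- du -sm /state/pganalyze-collector.state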

dbamohsin commented on July 24, 2024

Hi @lfittl - although the memory situation seems a little better, it still gradually gets to the point of over 100% utilization, so there still looks to be an underlying memory leak. We are now seeing containers restart after about 10 days. Here is a 7-day view:

image

All collectors show the same behaviour regardless of instance, albeit some leak faster than others.

JordanP commented on July 24, 2024

FWIW, I can no longer observe the memory leak I saw back in 2021 now that I am running v0.46. It seems fixed to me.

lfittl commented on July 24, 2024

@JordanP Thank you for the update!

I believe we have made sufficient improvements here that I feel comfortable closing this issue. If anyone runs into this again, please feel free to report back / open a new issue.
