Comments (34)
@dbamohsin @JordanP Whilst not directly an update on the memory leak, it might be worth deploying the new collector release 0.38.1, which has been upgraded to the new query parser version.
The query parser now uses a different mechanism to access the parse trees (essentially it avoids a whole bunch of JSON parsing), which would have shown up as increased memory usage if you are using log-based EXPLAIN (e.g. on GCP).
There may also be other memory-related improvements associated with a few changes in the query parser, so it would be great to see whether this impacts your environment in any way.
from collector.
@lfittl Awesome! I deployed v0.38.1 and will let you know after a few days!
It did not help: same crashes, same frequency. I understand it's a hard problem to debug. It trips my monitoring system because processes are not supposed to crash, so I am going to silence this particular alert on this particular process, which means this issue will move off my radar.
Ping me if I can be of any help here in the future.
Cool! Thanks for the update!
I don't think 2. is the root cause. I saw growing memory usage even though I did not enable log collection/processing at all.
Hi @lfittl - we have been running 0.37.0 in prod since Monday, and unfortunately the result is similar, although the growth is a lot choppier now (as can be seen in the second graph below).
[graph: 2 day window]
[graph: last 6 hours]
I will check the size of the collector internals today.
Hi @lfittl - we have been running 0.38.1 for a few days now and it does seem a lot smoother for memory:
I'll keep an eye on it for another week and let you know for sure.
@beyondbill Thanks for the report - that indeed looks like a memory leak.
Upgrading may fix it if the issue is related to Go garbage collector problems - newer collector versions are built with a newer Go version.
If that doesn't fix it, it would be helpful to narrow the problem down: could you check whether turning off log collection (PGA_DISABLE_LOGS=1) avoids this memory usage pattern?
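For a Docker-based deployment, disabling log collection is just an extra environment variable. A sketch along these lines, with the API key value and image tag as placeholders:

```shell
# Sketch: run the collector with log collection disabled, to check whether
# the leak is related to log processing. PGA_API_KEY value and the image
# tag below are placeholders - adjust to your setup.
docker run -d --name pganalyze-collector \
  -e PGA_API_KEY=your-api-key \
  -e PGA_DISABLE_LOGS=1 \
  quay.io/pganalyze/collector:latest
```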
Thanks,
Lukas
@lfittl We plan to upgrade to v0.25.1. When will that version be published to the Docker registry?
➜ ~ docker pull quay.io/pganalyze/collector:v0.25.1
Error response from daemon: manifest for quay.io/pganalyze/collector:v0.25.1 not found: manifest unknown: manifest unknown
@beyondbill Published this morning - you should be good to run docker pull now.
@lfittl Perfect! Thanks! I'll let you know whether it fixes the memory leak.
I upgraded the collector to v0.25.1 with PGA_DISABLE_LOGS=1. However, there's still an upward memory trend. I'll leave it running for a longer period to confirm.
@lfittl I confirmed that the memory leak still exists... We currently put a memory limit on the collector pod in k8s so that it gets killed once memory usage reaches the limit, but we hope to see this issue fixed in the collector.
Same. It's leaking memory for sure; we (Gorgias, Scale plan) also work around the leak with a memory limit (Kubernetes).
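For anyone wanting the same workaround, a memory limit can be attached to an existing collector workload from the command line. A sketch - the deployment name here is an assumption, and the values are only examples:

```shell
# Workaround sketch (not a fix): cap the collector container's memory so
# Kubernetes restarts it before the leak affects the node.
# "pganalyze-collector" and the sizes below are assumptions - adjust to
# your deployment name and capacity.
kubectl set resources deployment/pganalyze-collector \
  --limits=memory=250Mi --requests=memory=100Mi
```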
@JordanP Thanks for the report - we are still struggling to find the root cause here, but are certainly tracking this as an active issue that we keep investigating.
One thing that would be helpful for us to validate: When you set PGA_DISABLE_LOGS=1 on the container (which turns off log parsing), does the memory usage keep increasing the same way?
@lfittl log parsing is not working anyway, because we are running PG and pganalyze in Kubernetes. Should I still try to set PGA_DISABLE_LOGS=1?
@beyondbill @JordanP We've just released collector version 0.30.0, which fixes a memory leak we found that could occur in certain situations. Could you upgrade and check whether that improves the situation for you?
@lfittl I believe the memory is still leaking on v0.30.0
Same here, upgraded to 0.30 and memory usage is still growing:
Note that we have 2 pganalyze collectors (we have 2 PG databases), and one is not suffering from that leak. Both collectors are deployed and configured the same, but the PG instances serve very different workloads.
@JordanP can you tell us more about the nature of the different workloads (and log output)? Are the two lines in the graph you pasted above for the two different databases?
@uhoh-itsmaciek I don't know exactly what to say. The instance for which pganalyze is not leaking mostly handles "logging" of some sort, so it's mostly an append-only workload. The other PG instance is our main one, which serves our application/website.
Yes, the 2 lines in the graph are the memory consumption of the 2 pganalyze containers; each is monitoring a different PG database.
@JordanP Thanks for the update - I have a suspicion that this might be related to the amount of query text data the collector is handling.
Could you try enabling the reset setting in the Server Settings in the pganalyze console, and set it to every 6 hours:
Additionally, you'll need to create the following helper function when using the separate monitoring user (create it as a superuser):
CREATE OR REPLACE FUNCTION pganalyze.reset_stat_statements() RETURNS SETOF void AS
$$
/* pganalyze-collector */ SELECT * FROM public.pg_stat_statements_reset();
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
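Depending on your permission setup, the monitoring role may also need execute rights on the helper. A sketch - the role name `pganalyze` and database name are assumptions, so substitute your own monitoring user:

```shell
# Assumption: the monitoring role is named "pganalyze" and the database is
# "yourdb" - run this as a superuser after creating the helper function.
psql -d yourdb -c \
  'GRANT EXECUTE ON FUNCTION pganalyze.reset_stat_statements() TO pganalyze;'
```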
@lfittl so I tried this, but no luck. I am not so sure it's a memory leak anymore.
This is the memory used by the pganalyze collector (v0.30):
We have 2 Kubernetes clusters; each exhibits the same behavior. The pganalyze collector restarts roughly every day. The memory limit is set to 250 Mi, which is never actually reached, and Kubernetes just reports the container as having exited:
Exit code 139 means, if I am not mistaken, a segfault.
So it might not be just a memory leak. What's weird is that there is nothing in my container logs besides a lot of Submitted compact activity snapshot successfully
and a couple of Could not collect activity for server: failed to upload/send activity snapshot: Post "https://api.pganalyze.com/v2/snapshots/compact": EOF
Thoughts? Thanks!
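For what it's worth, exit code 139 does decode to a segfault: container exit codes above 128 are 128 plus the number of the fatal signal, which can be checked from a shell:

```shell
# An exit code above 128 is 128 + the number of the fatal signal.
# 139 - 128 = 11, and signal 11 is SIGSEGV (segmentation fault).
exit_code=139
signal=$((exit_code - 128))
kill -l "$signal"   # prints the signal name, e.g. SEGV
```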
@JordanP Thanks for the update!
I'm wondering if you're hitting the bug we fixed in c3d2c3c (released with 0.32.0).
That bug would surface as the collector crashing when it encountered a network error. The backtrace for the error unfortunately doesn't reliably show, which might explain why you are just seeing the container exit without any explicit message.
Could you try updating to the v0.32.0 tag and see whether that fixes it?
@lfittl I upgraded to 0.32.0, I'll keep you posted.
Hi @lfittl,
I was going to email you directly but this issue seems to fit!
We are also seeing pods restart as they keep hitting a limit we have set - based on the trajectory of how memory is eaten up by pganalyze, the limit is arbitrary, and I think it would eventually hit any limit.
Like @JordanP above, it trips our monitoring too, and ideally we don't want to create an indefinite silence on it, so any further details would be great.
collector version 0.35
Thanks
Mohsin
Hi @dbamohsin -- could you try with the 0.36 we just released when you have a chance? I don't think any of the fixes there would have impacted memory usage much, but it'd be nice to verify behavior of the latest version.
@uhoh-itsmaciek - I pre-empted this and upgraded this morning. Currently monitoring so will feed back in the next couple of days. Thanks
Hi @uhoh-itsmaciek - verified the same behaviour on 0.36:
I wanted to give everyone an update on this issue, and where we currently stand. There are three theories as to the root cause of this memory leak, by order in which we are tackling them:
1. A memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leak-check run).
2. A problem with log processing in pganalyze retaining too many old log lines that couldn't be correctly associated (we have some logic to ensure that a LOG line followed by e.g. a DETAIL line gets linked together). Whilst previous audits of this code have not shown an issue, we'll take a closer look if (1) doesn't resolve things.
3. A yet-undiscovered leak in the collector codebase that we haven't reproduced yet.
Separately from this, it's also likely that the newer Go release (1.16) improves memory usage on Linux specifically (where the collector is deployed most of the time), and we plan to build with this Go version in the next release or the one after. See details in the Go 1.16 release notes: https://golang.org/doc/go1.16#runtime
- Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)
The memory leak fix does not seem to make a difference to the leak unfortunately.
We applied on Friday 17th around 4pm:
Thanks for the update - appreciate the quick verification!
Two follow-on things to test:
1. We've just released collector version 0.37.0, which is built using Go 1.16 - can you update to the quay.io/pganalyze/collector:v0.37.0 tag and see if that makes any difference?
2. Per @JordanP's report, at least in their situation, it's unlikely that logs are the cause - which means (unless we are seeing two different problems here) we should look for other leaks not yet known.
One way to get some more data on the root cause would be to check the size of the collector's internal state structure, which gets written to disk on every full snapshot (once every 10 minutes). Could you run this inside the container a few hours apart, and share the size of the state file?
du -sm /state/pganalyze-collector.state
(since this is stored in memory as well, it gives us a hint whether the problem is the collector state itself, vs something else, like e.g. a runaway goroutine)
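To make samples taken hours apart easier to compare, the one-off du call could be wrapped in a small logging helper. A sketch - the function name and paths are illustrative, not part of the collector:

```shell
# Sketch: append a timestamped size reading (in MB) for the state file to a
# log, so growth over a few hours is easy to eyeball. Names are illustrative.
log_state_size() {
  state_file=$1
  log_file=$2
  printf '%s %s MB\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(du -sm "$state_file" | cut -f1)" >> "$log_file"
}

# Example (run inside the container a few hours apart):
# log_state_size /state/pganalyze-collector.state /tmp/state-size.log
```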
Hi @lfittl - although the memory situation seems a little better, it still gradually gets to over 100% utilization, so there still looks to be an underlying memory leak. We are now seeing containers restart after about 10 days. Here is a 7 day view:
All collectors show the same behaviour regardless of instance, albeit some leak faster than others.
FWIW, I can no longer observe the memory leak I had seen since 2021, now that I am running v0.46. It seems fixed to me.
@JordanP Thank you for the update!
I believe we have made sufficient improvements here that I feel comfortable closing this issue. If anyone runs into this again, please feel free to report back / open a new issue.