
Memory leak on v0.22.0 (collector) - CLOSED

beyondbill commented on July 24, 2024
Memory leak on v0.22.0

Comments (34)

lfittl commented on July 24, 2024

@dbamohsin @JordanP Whilst not directly an update on the memory leak, it might be worth deploying the new collector release 0.38.1, which has been upgraded to the new query parser version.

The query parser now uses a different mechanism to access the parse trees (essentially it avoids a whole bunch of JSON parsing), which previously would have shown up as increased memory usage if you are using log-based EXPLAIN (e.g. on GCP).

There may also be other memory-related improvements associated with a few changes in the query parser, so it would be great to see if this impacts your environment in any way.

beyondbill commented on July 24, 2024

@lfittl Awesome! I deployed v0.3.0 and will let you know after a few days!

JordanP commented on July 24, 2024

It did not help: same crashes, same frequency. I understand it's a hard problem to debug. It trips my monitoring system because processes are not supposed to crash, so I am going to silence this particular alert for this particular process, which means the issue will move off my radar.

Ping me if I can be of any help here in the future.

JordanP commented on July 24, 2024

Cool! Thanks for the update!

I don't think 2. is the root cause. I saw growing memory usage even though I did not enable log collection/processing at all.

dbamohsin commented on July 24, 2024

Hi @lfittl - we have been running 0.37.0 in prod since Monday and unfortunately the result is similar, although the growth is a lot choppier now (as can be seen in the second graph below).

2 day window

image

last 6 hours

image

I will check the size of the collector's internal state today.

dbamohsin commented on July 24, 2024

Hi @lfittl - we have been running 0.38.1 for a few days now and it does seem a lot smoother for memory:

image

I'll keep an eye on it for another week and let you know for sure.

lfittl commented on July 24, 2024

@beyondbill Thanks for the report - that indeed looks like a memory leak.

Upgrading may fix it if the issue is related to Go garbage collector problems - newer collector versions are built with a newer Go version.

If that doesn't fix it, it would help narrow the problem down if you could check whether turning off log collection (PGA_DISABLE_LOGS=1) avoids this memory usage pattern.
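
For example, with the Docker image this is just one more environment variable; the API key and connection settings below are placeholders for whatever you already pass, and the same env entry can go into a Kubernetes pod spec:

# Run the collector with log collection disabled to isolate the leak
docker run --rm \
  -e PGA_API_KEY="$PGA_API_KEY" \
  -e DB_URL="$DB_URL" \
  -e PGA_DISABLE_LOGS=1 \
  quay.io/pganalyze/collector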

Thanks,
Lukas

beyondbill commented on July 24, 2024

@lfittl We plan to upgrade to v0.25.1. When will that version be published to the Docker registry?

➜  ~ docker pull quay.io/pganalyze/collector:v0.25.1
Error response from daemon: manifest for quay.io/pganalyze/collector:v0.25.1 not found: manifest unknown: manifest unknown

lfittl commented on July 24, 2024

@beyondbill Published this morning - you should be good to run docker pull now.

beyondbill commented on July 24, 2024

@lfittl Perfect! Thanks! I'll let you know whether it fixes the memory leak.

beyondbill commented on July 24, 2024

image
I upgraded the collector to v0.25.1 with PGA_DISABLE_LOGS=1. However, there still appears to be a memory leak trend. I'll leave it running for a longer period before drawing a conclusion.

beyondbill commented on July 24, 2024

image
@lfittl I confirmed that the memory leak still exists... We currently put a memory limit on the collector pod in k8s so that it gets killed once memory usage reaches the limit, but we hope to see this issue fixed in the collector.
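
For anyone applying the same stop-gap, a rough sketch of the workaround (the deployment name and memory values here are illustrative; pick whatever fits your environment):

# Cap the collector's memory so Kubernetes OOM-kills and restarts the pod instead of letting it grow unbounded
kubectl set resources deployment pganalyze-collector --requests=memory=128Mi --limits=memory=250Mi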

JordanP commented on July 24, 2024

Same here. It's leaking memory for sure; we (Gorgias, Scale plan) also work around that leak with a memory limit in Kubernetes.

image

lfittl commented on July 24, 2024

@JordanP Thanks for the report - we are still struggling to find the root cause here, but are certainly tracking this as an active issue that we keep investigating.

One thing that would be helpful for us to validate: When you set PGA_DISABLE_LOGS=1 on the container (which turns off log parsing), does the memory usage keep increasing the same way?

JordanP commented on July 24, 2024

@lfittl Log parsing is not working anyway because we are running PG and the pganalyze collector in Kubernetes. Should I still try to set PGA_DISABLE_LOGS=1?

lfittl commented on July 24, 2024

@beyondbill @JordanP We've just released collector version 0.30.0, which fixes a memory leak we found that could occur in certain situations. Could you upgrade and check whether that improves the situation for you?

beyondbill commented on July 24, 2024

@lfittl I believe the memory is still leaking on v0.30.0
image

JordanP commented on July 24, 2024

Same here, upgraded to 0.30 and memory usage is still growing:
image
Note that we have 2 pganalyze collectors (we have 2 PG databases) and one is not suffering from that leak. The collectors are deployed and configured the same way, but the PG instances are used for very different workloads.

msakrejda commented on July 24, 2024

@JordanP can you tell us more about the nature of the different workloads (and log output)? Are the two lines in the graph you pasted above for the two different databases?

JordanP commented on July 24, 2024

@uhoh-itsmaciek I don't know exactly what to say. The one for which pganalyze is not leaking is mostly used for logging of some sort, so it's mostly an append-only workload. The other PG instance is our main one, which serves our application/website.

Yes, the 2 lines in the graph are the memory consumption of the 2 pganalyze containers; each is monitoring a different PG database.

lfittl commented on July 24, 2024

@JordanP Thanks for the update - I have a suspicion that this might be related to the amount of query text data the collector is handling.

Could you try enabling the reset setting under Server Settings in the pganalyze console, and set it to every 6 hours:

Screen Shot 2020-06-23 at 12 55 36 AM

Additionally, when using the separate monitoring user, you'll need to create the following helper function (create it as a superuser):

CREATE OR REPLACE FUNCTION pganalyze.reset_stat_statements() RETURNS SETOF void AS
$$
  /* pganalyze-collector */ SELECT * FROM public.pg_stat_statements_reset();
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
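
Once the helper is in place, you can quickly confirm that the monitoring role is able to call it (the role and database names below are just examples):

# Note: this really does reset pg_stat_statements; it should complete without a permission error
psql -U pganalyze -d postgres -c 'SELECT pganalyze.reset_stat_statements();'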

JordanP commented on July 24, 2024

@lfittl So I tried this, but no luck. I am not so sure it's a memory leak anymore.

This is the memory used by the pganalyze collector (v0.30):
image

We have 2 Kubernetes clusters; each exhibits the same behavior. The pganalyze collector restarts roughly every day. The memory limit is set to 250 Mi, which is never actually reached, and Kubernetes just reports the container as having exited:
image

Exit code 139 means, if I'm not mistaken, a segfault.

So it might not be just a memory leak. What's weird is that there's nothing in my container logs besides a lot of Submitted compact activity snapshot successfully and a couple of Could not collect activity for server: failed to upload/send activity snapshot: Post "https://api.pganalyze.com/v2/snapshots/compact": EOF
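
(139 is 128 + signal 11, so that does point at SIGSEGV. One way to double-check what Kubernetes recorded for the crashed container - the pod name below is a placeholder:)

# Signal 11 is SEGV
kill -l 11
# Exit reason recorded for the previous container instance
kubectl get pod <collector-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Any output the crashed instance produced before dying
kubectl logs <collector-pod> --previous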

Thoughts? Thanks!

lfittl commented on July 24, 2024

@JordanP Thanks for the update!

I'm wondering if you're hitting the bug we fixed in c3d2c3c (released with 0.32.0).

That bug would surface as the collector crashing when it encountered a network error. The backtrace for the error unfortunately doesn't show up reliably, which might explain why you are just seeing the container exit without any explicit message.

Could you attempt updating to the v0.32.0 tag and see whether that fixes it?

JordanP commented on July 24, 2024

@lfittl I upgraded to 0.32.0, I'll keep you posted.

dbamohsin commented on July 24, 2024

Hi @lfittl,

I was going to email you directly but this issue seems to fit!

image

We are also seeing pods restart as they keep hitting a memory limit we have set - based on the trajectory of how memory is eaten up by pganalyze, the limit is arbitrary and I think it would eventually hit any limit we chose.

Like @JordanP above, it trips our monitoring too, and ideally we don't want to create an indefinite silence on it, so any further details would be great.

Collector version: 0.35

Thanks

Mohsin

msakrejda commented on July 24, 2024

Hi @dbamohsin -- could you try with the 0.36 we just released when you have a chance? I don't think any of the fixes there would have impacted memory usage much, but it'd be nice to verify behavior of the latest version.

dbamohsin commented on July 24, 2024

@uhoh-itsmaciek - I pre-empted this and upgraded this morning. Currently monitoring, so I will feed back in the next couple of days. Thanks

dbamohsin commented on July 24, 2024

Hi @uhoh-itsmaciek - verified the behaviour on 0.36:

image

lfittl commented on July 24, 2024

I wanted to give everyone an update on this issue and where we currently stand. There are three theories as to the root cause of this memory leak, in the order in which we are tackling them:

  1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

  2. A problem with log processing in pganalyze retaining too many old log lines that couldn't be correctly associated (we have some logic to ensure that a LOG line followed by e.g. a DETAIL line gets linked together). Whilst previous audits of this code have not shown an issue, we'll take a closer look if (1) doesn't resolve things.

  3. An as-yet-undiscovered leak in the collector codebase that we have not been able to reproduce.

Separately from this, it's also likely that the newer Go release (1.16) improves memory usage on Linux specifically (where the collector is deployed most of the time), and we plan to build with this Go version in the next release or the one after. See details in the Go 1.16 release notes here: https://golang.org/doc/go1.16#runtime
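
As a rough way to separate a real leak from memory the Go runtime has already freed but only returned to the OS lazily (Go 1.12 through 1.15 use MADV_FREE on Linux, which keeps RSS inflated until there is memory pressure; Go 1.16 switches back to MADV_DONTNEED), you can already opt into MADV_DONTNEED on current builds via GODEBUG. The image tag and other settings below are illustrative:

# With MADV_DONTNEED, freed memory leaves the container's RSS immediately,
# so a genuine leak keeps climbing while lazily-returned memory does not.
docker run --rm \
  -e GODEBUG=madvdontneed=1 \
  -e PGA_API_KEY="$PGA_API_KEY" \
  -e DB_URL="$DB_URL" \
  quay.io/pganalyze/collector:v0.36.0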

dbamohsin commented on July 24, 2024
> 1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

The memory leak fix does not seem to make a difference to the leak unfortunately.

We applied on Friday 17th around 4pm:

image

lfittl commented on July 24, 2024
> 1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)
>
> The memory leak fix does not seem to make a difference to the leak unfortunately.
>
> We applied on Friday 17th around 4pm:
>
> image

Thanks for the update - appreciate the quick verification!

Two follow-on things to test:

  1. We've just released collector version 0.37.0, which is built using Go 1.16 - can you update to the quay.io/pganalyze/collector:v0.37.0 tag and see if that makes any difference?

  2. Per @JordanP's report, at least in their situation, it's unlikely that logs are the cause - which means (unless we see two different problems here), we should look for other leaks not yet known.

One way to get some more data on the root cause would be to check the size of the collector's internal state structure, which gets written to disk on every full snapshot (once every 10 minutes). Could you run this inside the container a few hours apart and share the size of the state file?

du -sm /state/pganalyze-collector.state

(since this is stored in memory as well, it gives us a hint as to whether the problem is the collector state itself vs. something else, e.g. a runaway goroutine)
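
When the collector runs in Kubernetes, the same check can be done from outside the container (the pod name below is a placeholder):

# Size, in MB, of the on-disk copy of the collector's in-memory state; compare readings a few hours apart
kubectl exec <collector-pod> -- du -sm /state/pganalyze-collector.state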

dbamohsin commented on July 24, 2024

Hi @lfittl - although the memory situation seems a little better, it still gradually gets to the point of over 100% utilization, so there still looks to be an underlying memory leak. We are now seeing containers restart after about 10 days. Here is a 7-day view:

image

All collectors show the same behaviour regardless of instance, albeit some leak faster than others.

JordanP commented on July 24, 2024

FWIW, I can no longer observe the memory leak I saw back in 2021 now that I am running v0.46. It seems fixed to me.

lfittl commented on July 24, 2024

@JordanP Thank you for the update!

I believe we have made sufficient improvements here that I feel comfortable closing this issue. If anyone runs into this again, please feel free to report back / open a new issue.
