Comments (8)

robey commented on July 26, 2024

i think that exception is harmless.

how many queues do you have running? how many clients typically connect to it?

mihneagiurgea commented on July 26, 2024

We have 50 queues, but most of them (~35) have very low traffic (<1000 items per day). The number of clients typically connected is around 100.
The number of operations is around 1,400/second.

robey commented on July 26, 2024

Hm, yeah, none of those are particularly high numbers. You're definitely not running out of fds at 100 clients.

You might lower the max_memory_size: with 50 queues, if 10 of them fill up, that's 5GB (which is more than can fit in a 6GB JVM because of the way garbage collection works).
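
For reference, the per-queue limit in kestrel 1.2.x lives in the configgy-style config file. A minimal sketch, assuming that format (the queue name and the 128 MB figure are placeholders, not values from this deployment):

    <queues>
        <busy_queue>
            # keep at most ~128 MB of this queue's items in memory; beyond that
            # the queue drops into read-behind mode and items stay journal-backed
            max_memory_size = 134217728
        </busy_queue>
    </queues>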

Things you can check: heap usage; the GC log (is it spending a lot of time in GC when it crashes?); how backed up the queues are.
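
For the heap-usage check, the stock JDK tools are usually enough; the pid below is a placeholder for the running kestrel process:

    jstat -gcutil <kestrel-pid> 5000    # heap occupancy and accumulated GC time, sampled every 5 s
    jmap -heap <kestrel-pid>            # one-shot view of heap configuration and usage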

mihneagiurgea commented on July 26, 2024

We noticed that when it crashes, the following error is written to nohup output (not the kestrel log, but the nohup.out file where it was started from):

java.lang.OutOfMemoryError: Java heap space
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
        at java.lang.Class.getDeclaredMethod(Class.java:1935)
        at scala.runtime.RichString.format(RichString.scala:240)
        at net.lag.kestrel.KestrelHandler$$anonfun$get$2.apply(KestrelHandler.scala:218)
        at net.lag.kestrel.KestrelHandler$$anonfun$get$2.apply(KestrelHandler.scala:212)
        at net.lag.kestrel.QueueCollection$$anonfun$remove$1.apply(QueueCollection.scala:148)
        at net.lag.kestrel.QueueCollection$$anonfun$remove$1.apply(QueueCollection.scala:142)
        at net.lag.kestrel.PersistentQueue.operateReact(PersistentQueue.scala:292)
        at net.lag.kestrel.PersistentQueue.removeReact(PersistentQueue.scala:334)
        at net.lag.kestrel.QueueCollection.remove(QueueCollection.scala:142)
        at net.lag.kestrel.KestrelHandler.get(KestrelHandler.scala:212)
        at net.lag.kestrel.KestrelHandler.net$lag$kestrel$KestrelHandler$$handle(KestrelHandler.scala:113)
        at net.lag.kestrel.KestrelHandler$$anonfun$act$1$$anonfun$apply$1.apply(KestrelHandler.scala:68)
        at net.lag.kestrel.KestrelHandler$$anonfun$act$1$$anonfun$apply$1.apply(KestrelHandler.scala:66)
        at com.twitter.actors.Reaction.run(Reaction.scala:79)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

At the time of the crash all queue sizes were relatively small (a few of them around 20-30 MB, and the rest < 500 KB).

mihneagiurgea commented on July 26, 2024

We restarted java with -verbose:gc to see what garbage collection was doing.

These are the last lines of gc.log, after around 3 hours of running (before crashing). The number of items remained constant throughout these 3 hours.

11298.686: [Full GC 1521716K->1521716K(1578432K), 1.1662680 secs]
11299.852: [Full GC 1521716K->1520693K(1578432K), 2.0338540 secs]
11301.901: [Full GC 1521726K->1521726K(1578432K), 1.1637670 secs]
11303.065: [Full GC 1521727K->1520726K(1578432K), 1.9922750 secs]
11305.077: [Full GC 1521726K->1521726K(1578432K), 1.1747770 secs]
11306.252: [Full GC 1521726K->1521726K(1578432K), 1.1726630 secs]
11307.425: [Full GC 1521726K->1521726K(1578432K), 1.1758330 secs]
11308.601: [Full GC 1521726K->1520851K(1578432K), 2.0171970 secs]
11310.622: [Full GC 1521727K->1521727K(1578432K), 1.1734600 secs]
11311.795: [Full GC 1521727K->1521727K(1578432K), 1.1749680 secs]
11312.971: [Full GC 1521727K->1521727K(1578432K), 1.1907470 secs]
11314.162: [Full GC 1521728K->1520970K(1578432K), 1.1750460 secs]
11315.340: [Full GC 1521727K->1521727K(1578432K), 1.1664440 secs]
11316.507: [Full GC 1521727K->1521727K(1578432K), 1.1724850 secs]
11317.679: [Full GC 1521727K->1521007K(1578432K), 1.1712050 secs]
11318.863: [Full GC 1521727K->1520790K(1578432K), 1.1731550 secs]
11320.039: [Full GC 1521727K->1521727K(1578432K), 1.1760910 secs]
11321.216: [Full GC 1521727K->1521727K(1578432K), 1.1724200 secs]
11322.389: [Full GC 1521727K->1521727K(1578432K), 1.1749710 secs]
11323.564: [Full GC 1521727K->1520987K(1578432K), 1.1942830 secs]
11324.761: [Full GC 1521727K->1521727K(1578432K), 1.1956870 secs]
11325.957: [Full GC 1521727K->1521727K(1578432K), 1.1730420 secs]

The entire log is here: http://pastie.org/1871509

Any idea what this means? We restarted our production environment using 1.2.4 instead of 1.2.8, to see if this changes anything (in case there's a memory leak in 1.2.8).

robey commented on July 26, 2024

a leak is a possibility. :( [1.2 uses mina instead of netty.] but it's more likely that there just isn't enough heap space for the queues that are backing up.

you can try adding more heap space -- when java is given 6GB, it can't actually use all 6GB for the app, because of GC overhead. you can also try reducing the memory size of queues, to keep less stuff in memory.
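
Concretely, the heap ceiling is whatever -Xmx the JVM was started with. A hypothetical invocation (the jar name, flag values, and log path are assumptions, not the actual startup script):

    java -server -Xms4096m -Xmx8192m -verbose:gc -Xloggc:gc.log \
         -jar kestrel-1.2.8.jar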

mihneagiurgea commented on July 26, 2024

The items on our queues are being processed constantly and are not piling up. The queue sizes remain approximately constant while kestrel is running, so I don't see why more heap space would be needed.

robey commented on July 26, 2024

We have our kestrels monitored by ganglia, but any monitoring system will do. At worst, set up a cron job to pipe kestrel's "stats" output to a file. What you want to do is see (ideally, graph) curr_items and curr_connections and correlate them with the misbehavior. We run our kestrels pretty hot, and generally if one crashes, it's due to running out of file descriptors or running out of heap.
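
A crude version of that cron job is a one-liner against kestrel's memcache port (22133 is kestrel's default memcache port, and the -q flag depends on your nc variant), appending a timestamped stats dump every minute:

    * * * * * (date; echo stats | nc -q 1 localhost 22133) >> /var/log/kestrel-stats.log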

The logfile you posted makes it pretty clear that the JVM just ran out of heap, and was growing gradually the whole time.

We're currently running 1.2.2 on most machines, it looks like, so if regressing to 1.2.4 works, that would be a valuable data point that 1.2.8 has some kind of leak. (We're also in the process of upgrading to 2.1, but I'll post to the mailing list as that happens. We'll almost certainly find a few bugs as it rolls out.)
