Comments (8)

robey commented on July 26, 2024

i think that exception is harmless.

how many queues do you have running? how many clients typically connect to it?

mihneagiurgea commented on July 26, 2024

We have 50 queues, but most of them (~35) have very low traffic (<1000 items per day). The number of clients typically connected is around 100.
The number of operations is around 1,400/second.

robey commented on July 26, 2024

Hm, yeah, none of those are particularly high numbers. You're definitely not running out of fds at 100 clients.

You might lower the max_memory_size: with 50 queues, if 10 of them fill up, that's 5GB (which is more than can fit in a 6GB JVM because of the way garbage collection works).
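
For reference, the per-queue limit in kestrel 1.2.x lives in the configgy-style config file. A minimal sketch, assuming that format (the queue name and the 128 MB figure are placeholders, not values from this deployment):

    <queues>
        <busy_queue>
            # keep at most ~128 MB of this queue's items in memory; beyond that
            # the queue drops into read-behind mode and items stay journal-backed
            max_memory_size = 134217728
        </busy_queue>
    </queues>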

Things you can check: heap usage; the GC log (is it spending a lot of time in GC when it crashes?); how backed up the queues are.
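
For the heap-usage check, the stock JDK tools are usually enough; the pid below is a placeholder for the running kestrel process:

    jstat -gcutil <kestrel-pid> 5000    # heap occupancy and accumulated GC time, sampled every 5 s
    jmap -heap <kestrel-pid>            # one-shot view of heap configuration and usage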

mihneagiurgea commented on July 26, 2024

We noticed that when it crashes, the following error is written to nohup output (not the kestrel log, but the nohup.out file where it was started from):

java.lang.OutOfMemoryError: Java heap space
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
        at java.lang.Class.getDeclaredMethod(Class.java:1935)
        at scala.runtime.RichString.format(RichString.scala:240)
        at net.lag.kestrel.KestrelHandler$$anonfun$get$2.apply(KestrelHandler.scala:218)
        at net.lag.kestrel.KestrelHandler$$anonfun$get$2.apply(KestrelHandler.scala:212)
        at net.lag.kestrel.QueueCollection$$anonfun$remove$1.apply(QueueCollection.scala:148)
        at net.lag.kestrel.QueueCollection$$anonfun$remove$1.apply(QueueCollection.scala:142)
        at net.lag.kestrel.PersistentQueue.operateReact(PersistentQueue.scala:292)
        at net.lag.kestrel.PersistentQueue.removeReact(PersistentQueue.scala:334)
        at net.lag.kestrel.QueueCollection.remove(QueueCollection.scala:142)
        at net.lag.kestrel.KestrelHandler.get(KestrelHandler.scala:212)
        at net.lag.kestrel.KestrelHandler.net$lag$kestrel$KestrelHandler$$handle(KestrelHandler.scala:113)
        at net.lag.kestrel.KestrelHandler$$anonfun$act$1$$anonfun$apply$1.apply(KestrelHandler.scala:68)
        at net.lag.kestrel.KestrelHandler$$anonfun$act$1$$anonfun$apply$1.apply(KestrelHandler.scala:66)
        at com.twitter.actors.Reaction.run(Reaction.scala:79)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

At the time of the crash all queue sizes were relatively small (a few of them around 20-30 MB, and the rest < 500 KB).

mihneagiurgea commented on July 26, 2024

We restarted java with -verbose:gc to see what garbage collection was doing.

These are the last lines of gc.log, after around 3 hours of running (before crashing). The number of items remained constant throughout these 3 hours.

11298.686: [Full GC 1521716K->1521716K(1578432K), 1.1662680 secs]
11299.852: [Full GC 1521716K->1520693K(1578432K), 2.0338540 secs]
11301.901: [Full GC 1521726K->1521726K(1578432K), 1.1637670 secs]
11303.065: [Full GC 1521727K->1520726K(1578432K), 1.9922750 secs]
11305.077: [Full GC 1521726K->1521726K(1578432K), 1.1747770 secs]
11306.252: [Full GC 1521726K->1521726K(1578432K), 1.1726630 secs]
11307.425: [Full GC 1521726K->1521726K(1578432K), 1.1758330 secs]
11308.601: [Full GC 1521726K->1520851K(1578432K), 2.0171970 secs]
11310.622: [Full GC 1521727K->1521727K(1578432K), 1.1734600 secs]
11311.795: [Full GC 1521727K->1521727K(1578432K), 1.1749680 secs]
11312.971: [Full GC 1521727K->1521727K(1578432K), 1.1907470 secs]
11314.162: [Full GC 1521728K->1520970K(1578432K), 1.1750460 secs]
11315.340: [Full GC 1521727K->1521727K(1578432K), 1.1664440 secs]
11316.507: [Full GC 1521727K->1521727K(1578432K), 1.1724850 secs]
11317.679: [Full GC 1521727K->1521007K(1578432K), 1.1712050 secs]
11318.863: [Full GC 1521727K->1520790K(1578432K), 1.1731550 secs]
11320.039: [Full GC 1521727K->1521727K(1578432K), 1.1760910 secs]
11321.216: [Full GC 1521727K->1521727K(1578432K), 1.1724200 secs]
11322.389: [Full GC 1521727K->1521727K(1578432K), 1.1749710 secs]
11323.564: [Full GC 1521727K->1520987K(1578432K), 1.1942830 secs]
11324.761: [Full GC 1521727K->1521727K(1578432K), 1.1956870 secs]
11325.957: [Full GC 1521727K->1521727K(1578432K), 1.1730420 secs]

The entire log is here: http://pastie.org/1871509

Any idea what this means? We restarted our production environment using 1.2.4 instead of 1.2.8, to see if this changes anything (in case there's a memory leak in 1.2.8).

robey commented on July 26, 2024

a leak is a possibility. :( [1.2 uses mina instead of netty.] but it's more likely that there just isn't enough heap space for the queues that are backing up.

you can try adding more heap space -- when java is given 6GB, it can't actually use all 6GB for the app, because of GC overhead. you can also try reducing the memory size of queues, to keep less stuff in memory.
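
Concretely, the heap ceiling is whatever -Xmx the JVM was started with. A hypothetical invocation (the jar name, flag values, and log path are assumptions, not the actual startup script):

    java -server -Xms4096m -Xmx8192m -verbose:gc -Xloggc:gc.log \
         -jar kestrel-1.2.8.jar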

mihneagiurgea commented on July 26, 2024

The items on our queues are being processed constantly and are not piling up. The queue sizes remain approximately constant while kestrel is running, so I don't see why more heap space would be needed.

robey commented on July 26, 2024

We have our kestrels monitored by ganglia, but any monitoring system will do. At worst, set up a cron job to pipe kestrel's "stats" output to a file. What you want to do is see (ideally, graph) curr_items and curr_connections and correlate them with the misbehavior. We run our kestrels pretty hot, and generally if one crashes, it's due to running out of file descriptors or running out of heap.
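
A crude version of that cron job is a one-liner against kestrel's memcache port (22133 is kestrel's default memcache port, and the -q flag depends on your nc variant), appending a timestamped stats dump every minute:

    * * * * * (date; echo stats | nc -q 1 localhost 22133) >> /var/log/kestrel-stats.log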

The logfile you posted makes it pretty clear that the JVM just ran out of heap, and was growing gradually the whole time.

We're currently running 1.2.2 on most machines, it looks like, so if regressing to 1.2.4 works, that would be a valuable data point that 1.2.8 has some kind of leak. (We're also in the process of upgrading to 2.1, but I'll post to the mailing list as that happens. We'll almost certainly find a few bugs as it rolls out.)
