GithubHelp home page GithubHelp logo

Comments (26)

ringofhealth avatar ringofhealth commented on June 10, 2024 2

aight will do! Ill start some test session this weekend

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024 1

ah I see, thanks for clarifying. Going to deploy this to staging env and let it soak for a few days, will report back status

from thousand_island.

woylie avatar woylie commented on June 10, 2024 1

Thank you for digging into this. I just updated Bandit to 1.0.0-pre.3, and we'll be monitoring this. I'll let you know if this occurs again, and if it doesn't occur again, you'll hear from me in a week.

from thousand_island.

moogle19 avatar moogle19 commented on June 10, 2024

:emfile means "Too many open files" (see here).
It looks like you have more connections than available file descriptors. You could try to increase the file descriptor limit for your server.

from thousand_island.

woylie avatar woylie commented on June 10, 2024

We increased the file descriptor limit before.

For more context, we saw the memory usage continuously increasing over time, without resources ever being freed. After changing the adapter back to Cowboy, we don't see this happening anymore, and we don't see any errors indicating too many connections.

from thousand_island.

moogle19 avatar moogle19 commented on June 10, 2024

Could you pin down which resources wouldn't get freed?
I would assume Ports / Sockets, but it would be nice to have further informations.

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

I'm extremely interested / perplexed by this!

My first hunch here is that there's some aspect of your situation that's causing connections to not be closed off once you're done with them. My initial guess would be that there may be a bug in our HTTP/1 keepalive logic that's causing us to not be as aggressive as Cowboy in closing off lingering keep alive clients.

Some questions to help diagnose:

  • What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
  • What config are you starting Bandit / Cowboy up with?
  • Do you see any other Bandit / Thousand Island messages in your logs (specifically timeout messages)?
  • Do the errors happen over time, all at once, after a certain number of connections? Anything we can do to try and repro?
  • What sort of connection rate (distinct connections / second, requests per connection, etc) does the system in question serve?
  • Are you able to share methods & charts that demonstrate 'memory usage continuously increasing over time'? Do you happen to have a metric available for total number of VM processes as well? Can you share that?

(on the off chance that you're able to share aspects of the project causing this, it would help immensely in diagnosing this).

Looking forward to hearing more!

from thousand_island.

woylie avatar woylie commented on June 10, 2024
* What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?

The endpoint serves HTTP 1.1 and WebSockets (LiveView) behind an Envoy proxy that terminates TLS.

* What config are you starting Bandit / Cowboy up with?

Only defaults.

* Do you see any other Bandit / Thousand Island messages in your logs (specifically timeout messages)?
* Do the errors happen over time, all at once, after a certain number of connections? Anything we can do to try and repro?
* What sort of connection rate (distinct connections / second, requests per connection, etc) does the system in question serve?
* Are you able to share methods & charts that demonstrate 'memory usage continuously increasing over time'? Do you happen to have a metric available for total number of VM processes as well? Can you share that?

I'll have to get back to you for that. We're pretty busy at the moment, and I'll be out of office from next week, so it may take a while. The only thing I can give you right now is this 24h error distribution chart:

image

From top to bottom:

  • GenServer.stop(:undefined, :normal, :infinity) ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started (no further stacktrace)
  • ** (exit) exited in: GenServer.call(:undefined, :acceptor_info, 5000)
  • Unexpected error in accept: :emfile

We also saw a couple of ** (exit) "Not a valid WebSocket upgrade request") errors.

image

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

When you're running under Cowboy, do you ever see log messages like the following?

Ranch acceptor reducing accept rate: out of file descriptors

Looking at where this error is happening in Thousand Island, the corresponding implementation in Ranch explicitly handles :emfile errors by waiting for 100ms and retrying an accept call. Ranch and Thousand Island are very similar in structure in this regard, so I would expect that Ranch is seeing the same :emfile error we are, they're just handling it.

If this is the case, I'll add a similar clause to Thousand Island (though I may handle it a tad differently; busy waiting without end like Ranch does is a tad sketch).

from thousand_island.

woylie avatar woylie commented on June 10, 2024

I didn't find that message in our logs.

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

I've added connection count limiting to Thousand Island (released as 0.6.6). This doesn't fix the core of the issue (why you're seeing increased memory usage & emfile errors), but it should help mitigate the issue for others via configuration levers, and brings us up to parity with ranch's support for this.

I'm still looking into the underlying issue, but I'm not able to reproduce anything locally. I've run a bunch of long-running soak tests on the HTTP/1, h2, and WebSocket stacks and I'm seeing what I would expect: no process growth and steady long-term memory usage. I still think that there's something in your setup that's causing connections to be held open longer than expected. This would explain everything you're seeing (growing memory use over time, and an eventual exhaustion of open file handles). My biggest suspects here are keepalive handling and/or a possible bug with handling invalid websocket upgrades.

To help debug this further, would you be able to provide:

  • Any Bandit / Thousand Island messages in your logs (specifically timeout messages) around / prior to this happening
  • A rough idea of connection rates, connection duration, HTTP/1 vs websocket breakdown / overall load & box size etc, so I know what scale of problem this may be
  • Can you quantify 'memory growing over time'? How fast does it grow, and by how much?

In the meantime, if you're game to help debugging this, you could try adding [http_1_options: [max_requests: 1]] to your bandit config (this would manifest in your Phoenix app's config.exs (or env sub-config) like so):

config :my_app, MyAppWeb.Endpoint,
  http: [...(port: xxx, other options)... , http_1_options: [max_requests: 1]]
  ...

This will cause Bandit's HTTP/1 stack to close the connection after each request, effectively disabling keepalives. If you could run this for a long enough period of time that you would normally have seen an error and see if it helps matters, it would be a great place to help bisect this error.

Thanks!

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

okay I ran this config

this is my config.exs

config :myapp, AppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  url: [host: "localhost"],

my dev.exs

config :myapp, AppWeb.Endpoint,
  http: [ip: {127, 0, 0, 1}, port: 4000],
  http_1_options: [max_requests: 1],

with that http_1_option on, and it was good until I swap out the config

after I commented that option out, and reran the app, the VM blew up within 5-10 min with

                     []},
                 {'Elixir.Kernel',inspect,2,
                     [{file,"lib/kernel.ex"},{line,2254}]},
                 {'Elixir.Logger.Handler',do_log,4,
                     [{file,"lib/logger/handler.ex"},{line,141}]},
                 {'Elixir.Logger.Handler',log,2,
                     [{file,"lib/logger/handler.ex"},{line,84}]}]}
[warning] Attempted to log 140761 messages, which is above :discard_threshold
[warning] Attempted to log 0 messages, which is below :discard_threshold

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

I am using the latest build 1.0.0-pre2, let me know if there is more diagnostic I can run to help you narrow this down

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

aight something wierd is going on. so I originally ran it with the http_1_option, on then I turn it off and ran into a bunch of emfile error. Then I took out the option swapped out bandit and went back to cowboy.

while I was using other part of the application I ran into the emfile error, which has never happened before. I deleted my entire build dir and recompiled withoht bandit. we will see what happens

it seems that it had some side effect on the erlang installaton in asdf dir

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

@ringofhealth a couple of things that would be good to know:

  • What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
  • What sort of clients are they? Are they direct from a browser or are they being forwarded through a reverse proxy? Which one if so?
  • Roughly how many connections per second are you handling? How long / how many connections does it take to start seeing emails errors?

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

@ringofhealth I'm confused about your indication of max_requests: 1 solving the problem. Are you saying that:

  • When you used a default Bandit config (ie: WITHOUT max_requests: 1, you saw the issue occur
  • When you added max_requests: 1 to your config, it fixed the problem and the issue did not recur?

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

@ringofhealth a couple of things that would be good to know:

  • What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
  • What sort of clients are they? Are they direct from a browser or are they being forwarded through a reverse proxy? Which one if so?
  • Roughly how many connections per second are you handling? How long / how many connections does it take to start seeing emails errors?
  • this was all during local development, so over HTTP only, and with websockets for liveview 0.18.18. I am not sure if its loading HTTP/2, just using default configs
  • no reverse proxy, just local dev with local chrome
  • it was local dev env, so just my 1 user was using it

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

@ringofhealth I'm confused about your indication of max_requests: 1 solving the problem. Are you saying that:

  • When you used a default Bandit config (ie: WITHOUT max_requests: 1, you saw the issue occur
  • When you added max_requests: 1 to your config, it fixed the problem and the issue did not recur?

yes I was confused about this too, let me walk you through the timelines

-added bandited, with max_request:1 config on

this seemingly was good for a while, used it for about and hour to 2. Then I decided to try it without max_request: 1 setting

  • within 5 - 10 min, the error started popping up.
  • went back to max_config:1
  • error still popped up after
  • swapped out the adaptor line for bandit for cowboy
  • started running some heavy IO bound job for my application locally
  • EM error popped up for that as well
  • deleted my deps dir, removed bandit completed from mix and recompiled.
  • have not tried the jobs yet, but cowboy is stable

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

@ringofhealth those are some very interesting points! I've tried to reproduce this locally and can't seem to (despite working in a Phoenix app for an hour or so, I'm not seeing anything unexpected in any process or file metrics). A few questions:

  1. What OTP/Erlang are you using (copying & pasting the startup banner from an iex session would provide everything I'd need to know)?

  2. What OS are you on? The output of uname -a would be great!

  3. If you're able to, it would be great if you could:

    • Add adapter: Bandit.PhoenixAdapter to your config, WITHOUT any special options such as max_requests
    • rm -rf _build just to ensure you get any old artifacts out of the way
    • Provide screenshots of http://localhost:4000/dev/dashboard at startup, and after a few minutes of development (you'd indicated that the error occurs after 5-10 minutes of working, so maybe grab a screenshot after 2-3 minutes)
    • If you're able to, provide a screenshot after the error starts occurring

Thanks so much!

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

I think I've figured this out. Point release coming in the next few hours with a fix for testing!

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

I just released Bandit 1.0.0-pre.3 with a fix for this issue (I hope!)

Please update to 1.0.0-pre.3 or newer of Bandit and ensure that the issue no longer occurs. You DO NOT need to use max_requests: 1 any longer.

from thousand_island.

ringofhealth avatar ringofhealth commented on June 10, 2024

Thank you for digging into this. It does look like the lib is not throwing out emfile errors.

I do notice this issue on local dev mode

not sure why this is coming up now

[info] GET /live/longpoll
[debug] ** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

[error] GenServer #PID<0.4812.0> terminating
** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:continue, :handle_connection}
State: {%ThousandIsland.Socket{socket: #Port<0.117>, transport_module: ThousandIsland.Transports.TCP, read_timeout: 60000, span: %ThousandIsland.Telemetry{span_name: :connection, telemetry_span_context: #Reference<0.494451594.3792437251.136067>, start_time: -576460569945550400, start_metadata: %{parent_telemetry_span_context: #Reference<0.494451594.3792437249.137336>, remote_address: {127, 0, 0, 1}, remote_port: 57111, telemetry_span_context: #Reference<0.494451594.3792437251.136067>}}}, %{handler_module: Bandit.InitialHandler, http_1_enabled: true, http_2_enabled: true, opts: %{http_1: [], http_2: [], websocket: []}, plug: {Phoenix.Endpoint.SyncCodeReloadPlug, {HelioWeb.Endpoint, []}}, websocket_enabled: true}}

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

Yep; that's being worked upstream at phoenixframework/phoenix#5445

In the meantime it's great to hear we've solved the original issue here; thanks for your help and input!

I'll wait a few days for @woylie to chime in before closing this issue.

from thousand_island.

woylie avatar woylie commented on June 10, 2024

Sorry, forgot about this. We didn't encounter that issue again since switching back to Bandit.

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

Great to hear! Thanks again for the original issue!

from thousand_island.

mtrudel avatar mtrudel commented on June 10, 2024

@ringofhealth the leaking NoRouteError upstream issue has been merged - this should be fixed in the next Phoenix release.

from thousand_island.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.