GithubHelp

Comments (32)

bhaller commented on May 29, 2024

Hmm. This is difficult for me to debug, because it runs upward of 50 GB of memory usage without the raise happening yet, for me; my laptop is not up to the task. :-> Can you get me a backtrace from the point where the raise occurs? I suspect the raise is coming from eidos_object_pool.h:57, but it would be good to confirm that. My guess is that on Debian (as perhaps on other platforms) the Individual class might be either 32 or 64 bytes (2^5 or 2^6), and thus allocating a block of 2^26+1 individuals exceeds a size of 2^31 or 2^32 that something on the system is imposing upon malloc allocations. I'm a bit surprised by that, though; modern 64-bit operating systems should not impose such restrictions on processes, it seems to me. I'd be curious whether this bug reproduces on macOS, but I can't test it until I'm back in NY in mid-September. But even without being able to reproduce the bug, I can probably fix it for you if you can confirm the location of the raise. I just need to make the eidos_object_pool code smart enough to allocate multiple buffers when the total size exceeds 2^31; shouldn't be very hard. What's the urgency of this for you?

from slim.
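The fix described above (having the object pool allocate multiple buffers instead of one huge one) might look roughly like the following sketch. This is illustrative C++ under assumed names (`BlockedPool`, `Reserve`), not SLiM's actual eidos_object_pool code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical sketch of the fix: cap the size of any single malloc so that a
// huge initial reservation becomes several smaller allocations instead of one
// block that may exceed a per-allocation limit. Not SLiM's actual pool code.
class BlockedPool {
public:
    BlockedPool(std::size_t objectSize, std::size_t maxBlockBytes)
        : objectSize_(objectSize), maxBlockBytes_(maxBlockBytes) {}

    ~BlockedPool() {
        for (void *block : blocks_)
            std::free(block);
    }

    // Reserve space for `count` objects, using as many capped blocks as needed.
    void Reserve(std::size_t count) {
        std::size_t objectsPerBlock = maxBlockBytes_ / objectSize_;
        if (objectsPerBlock == 0)
            objectsPerBlock = 1;    // cap smaller than one object: degrade gracefully
        while (count > 0) {
            std::size_t n = (count < objectsPerBlock) ? count : objectsPerBlock;
            void *block = std::malloc(n * objectSize_);
            if (!block)
                throw std::bad_alloc();
            blocks_.push_back(block);
            count -= n;
        }
    }

    std::size_t BlockCount() const { return blocks_.size(); }

private:
    std::size_t objectSize_;
    std::size_t maxBlockBytes_;
    std::vector<void *> blocks_;
};
```

With 232-byte objects and, say, a 1 GB cap per block, reserving 2^26 + 1 objects would produce roughly 15 modest allocations instead of one ~14.5 GB request.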

petrelharp commented on May 29, 2024

Not urgent.


petrelharp commented on May 29, 2024

this doesn't happen in 581bc96

Because you get the bad_alloc error originally reported above instead, right? Or am I misunderstanding?

yes, that's right


bhaller commented on May 29, 2024

I'm going to close this for now, but if you disagree, comment and I will reopen it.


bhaller commented on May 29, 2024

One instance of Individual is presently 232 bytes, on macOS (probably the same on Debian). 232 * 2^26 is 15569256448 bytes; 232 * (2^26 + 1) is 15569256680 bytes. I'm not sure why that would be the magic threshold above which malloc() would fail – I was expecting the threshold to involve crossing 2^32, but that is only 4294967296 or 4 GB, whereas the threshold is apparently at ~14.5 GB – but ours not to reason why.


bhaller commented on May 29, 2024

Hi @petrelharp. I think I have fixed this, but since I can't actually reproduce the bug on my system (see above), I'm not sure. Could you test the current master branch to confirm the fix, and reopen this issue if there is still a problem? Thanks!


petrelharp commented on May 29, 2024

Something is wrong. Running the script above with this version of SLiM causes my computer to freeze up entirely. I'm not sure how to debug that sort of thing: please advise?


petrelharp commented on May 29, 2024

Oh, I can't re-open this: you have to.


bhaller commented on May 29, 2024

Well, yeah; allocating 2^26 individuals involves a huge amount of memory. You're the one who reported the bug, man; don't blame me. ;-> I was assuming that you were running this model on some cluster node with an insane amount of memory. It probably won't run on ordinary desktops/laptops. But the fact that it locks your machine up is good; that means that the original bug, the raise from malloc(), has probably been fixed, and now it just freezes your computer because it burns through all available memory and then sends you into swap-land. :->


bhaller commented on May 29, 2024

Probably the total memory footprint of this model is in the ballpark of 30 GB, but I'm not sure because I can't run it either. ;->


petrelharp commented on May 29, 2024

now it just freezes your computer because it burns through all available memory and then sends you into swap-land. :->

That's not what's happening: after a second or two it totally freezes (e.g. the Caps Lock key no longer toggles its light). Thrashing is different.


bhaller commented on May 29, 2024

My guess is that it just progresses beyond thrashing to total lock-up so quickly you don't see it. Maybe I'm wrong; but on my machine I can watch it under the debugger allocating block after block after block, burning through the GBs, until I kill it before it kills my machine. So to the extent that I can run it at all, it appears to be working. It's certainly possible, though, that there is also some sort of 32-bit overflow bug that occurs with a sufficiently large population size. However, a bug like that should only be able to lock up SLiM itself, not the whole machine; the rest of the OS ought to be protected from bad behavior by one process. The fact that it's locking up the whole machine says that it's exceeding some limit in the kernel, and excessive memory allocation seems like the obvious culprit. Anyway, I think one of us needs to run it on a machine that has enough memory to actually run the model, and see what happens then. I can do that when I get back to NY, but it'd be interesting to get a data point from you on it too, if you could.


petrelharp commented on May 29, 2024

Additional info:

  1. this doesn't happen in 581bc96
  2. master still works fine if I don't add so many individuals.

I'll try to test this on a machine with actually enough memory.


bhaller commented on May 29, 2024

  1. this doesn't happen in 581bc96

Because you get the bad_alloc error originally reported above instead, right? Or am I misunderstanding?

  2. master still works fine if I don't add so many individuals.

Because then memory limits are not exceeded? Or are you saying something else? Sorry if I'm being dense. :->

I'll try to test this on a machine with actually enough memory.

OK, great.


bhaller commented on May 29, 2024

this doesn't happen in 581bc96

Because you get the bad_alloc error originally reported above instead, right? Or am I misunderstanding?

yes, that's right

So if you run it with 2^26 instead of 2^26+1, does it lock up your machine the same with both the master head and the previous version? Because the behavior certainly has changed; with the previous version it was allocating a single block of at least 14.5 GB (maybe bigger?), while with the current version it progressively allocates a set of smaller blocks to avoid any maximum block allocation size that might be in effect. So the behavior might indeed change.


petrelharp commented on May 29, 2024

So if you run it with 2^26 instead of 2^26+1,

Then I get the original error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

which is perhaps as it should be, actually: I'm trying to allocate more memory than there actually is? See this thread.


petrelharp commented on May 29, 2024

master still works fine if I don't add so many individuals.

Because then memory limits are not exceeded? Or are you saying something else? Sorry if I'm being dense. :->

I don't know why, I just wanted to check that the build wasn't totally broken on my system.


petrelharp commented on May 29, 2024

Maybe instead of whatever you did to fix this issue, you should just catch the bad_alloc and return a more informative error message.

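The suggestion here (catching the bad_alloc and translating it into a user-facing message) could be sketched like this. `TryAllocate` is a hypothetical wrapper for illustration, not SLiM's code:

```cpp
#include <cstddef>
#include <new>
#include <string>

// Hypothetical wrapper: translate a failed allocation into a readable error
// message instead of letting an uncaught std::bad_alloc call terminate().
std::string TryAllocate(std::size_t bytes) {
    try {
        char *p = new char[bytes];
        delete[] p;
        return "ok";
    } catch (const std::bad_alloc &) {
        return "ERROR: could not allocate " + std::to_string(bytes) +
               " bytes; try a smaller population size or a machine with more memory.";
    }
}
```

Whether this is worth doing is exactly the judgment call discussed below: the handler itself must avoid allocating much, and the process still cannot meaningfully continue after the failure.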

bhaller commented on May 29, 2024

So if you run it with 2^26 instead of 2^26+1,

Then I get the original error

OK, I'm confused. Your bug report above says that with 2^26+1 you get the bad_alloc error, but "With one fewer individuals, it does not." I took that to mean that with 2^26 the model ran. No?


bhaller commented on May 29, 2024

Maybe instead of whatever you did to fix this issue, you should just catch the bad_alloc and return a more informative error message.

Well, at present I don't think so. It sounds like Debian perhaps has a maximum limit of 14 GB or 16 GB or something on any one malloc, and that is triggering this bad_alloc error. But it's perfectly possible that a machine might have much more than that amount of memory, and there's no reason that SLiM should prevent the user from running large models arbitrarily. By breaking up the one big allocation into multiple smaller allocations, we should allow larger models to be run. If the machine does not, in fact, have enough memory to handle the model, then it might swap or freeze or crash; but there's not really any way to know that that will happen. We should just try to run, and if the OS can't handle the load, that's the OS's problem. If it freezes the machine instead of killing the process, that's an OS bug; it's not our responsibility to prevent the kernel from crashing, and even if it were, we really have no way of knowing that we're running into a limit like that.


petrelharp commented on May 29, 2024

That's not a fixed malloc limit in Debian: it's probably throwing bad_alloc when you try to malloc more memory than the machine has, which is reasonable.

And, sorry, I'm not understanding you: that thread I linked to said that bad_alloc is actually an error that you can catch; if it is thrown then slim is going to exit, so the only question there is whether to re-interpret the error for users or not, right?


bhaller commented on May 29, 2024

That's not a fixed malloc limit in Debian: it's probably giving the bad malloc when you try to malloc more memory than the machine has, which is reasonable.

No, that's not how Unix systems are supposed to work. The whole point of virtual memory is that you can exceed the physical memory of the machine. An OS may need to draw a line on the size of any individual block, for technical reasons having to do with the design of the malloc allocator or the memory-mapping strategy or whatever; but the process as a whole is supposed to be unlimited, more or less. Certainly a total usage of 14 or 16 GB seems perfectly reasonable, and should be fulfilled. And in any case, if the OS really wants to limit a process to 16 GB total, for some arbitrary reason, then it can decline two 8 GB allocations just as easily as it can decline one 16 GB allocation, if it wants to (but I'm pretty sure Debian has no such total usage limit; 16 GB would be way too low). It makes sense to me to break up the large allocation into smaller blocks. Indeed, the fact that the initial allocation of the memory pool wasn't already being broken up into smaller blocks was just a bug; you can see from the design of the code that all subsequent allocations for the memory pool were limited in size, but the first one erroneously wasn't held to the same policy.

And, sorry, I'm not understanding you: that thread I linked to said that bad_alloc is actually an error that you can catch; if it is thrown then slim is going to exit, so the only question there is whether to re-interpret the error for users or not, right?

No, I don't think so. That bad_alloc error is because we're trying to do a single allocation that is over an OS threshold. That's fine; we shouldn't be doing that, and it's a bug that we were, and I've just fixed that bug (I think). The error message in that case could perhaps be improved, but that's really a side issue. By breaking up the allocation into smaller blocks – which we should have been doing all along – the bad_alloc throw doesn't happen, so that error message is no longer user-visible anyway.


petrelharp commented on May 29, 2024

No, that's not how Unix systems are supposed to work. The whole point of virtual memory is that you can exceed the physical memory of the machine.

Exceeding physical plus swap?


petrelharp commented on May 29, 2024

I tested this on a cluster node with lots of memory. It works fine to allocate 2^26 + 1 individuals; and when I multiplied that by 10 to get something bigger than the available memory, it was killed and did not crash the node. So, it's still clearly not good that it can crash my machines, but not as bad as it could be.


bhaller commented on May 29, 2024

The only limit on swap is available disk space. Are you so short on that, that you can't run a 30 GB process? If so, then if you free up some more disk space, does the behavior change? In any case, in my experience Unix systems issue warnings to the user when swap space is running low, rather than making allocations fail. And if the system did decide to make a 16 GB allocation fail because physical+swap was insufficient, wouldn't it simply do the same for two 8 GB allocations? If the kernel wants to give us a bad_alloc for exceeding memory limits, it is free to do so. The fix I did here is a good fix (assuming I didn't make a mistake in my changes) of what is clearly a bug; whatever further issues might have been exposed by the fix are new bugs (whether in SLiM or in the kernel). Maybe we ought to skype about this?


petrelharp commented on May 29, 2024

The only limit on swap is available disk space.

Linux uses a swap partition.


bhaller commented on May 29, 2024

I tested this on a cluster node with lots of memory. It works fine to allocate 2^26 + 1 individuals; and when I multiplied that by 10 to get something bigger than the available memory, it was killed and did not crash the node. So, it's still clearly not good that it can crash my machines, but not as bad as it could be.

Yes, that sounds fine. Clusters typically handle allocation differently than desktop machines, because they want to prevent one process from interfering with any other process. So they usually enforce per-process memory limits that are typically much smaller than the total memory of the machine, and if a process exceeds that limit then it is unceremoniously killed. Desktop machines do not typically behave in that manner; instead, they typically act as if memory is infinite, and go into swapping as needed, and warn the user if swap space gets tight. There might be a hard limit at which they kill a process, but it is generally far beyond physical memory limits, in my experience, because there is no compelling reason to limit/kill a process that the user has chosen to run. (But then I have no experience with Debian specifically; perhaps it is unusual.)

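The per-process limits described for cluster nodes can be approximated on Linux with setrlimit(RLIMIT_AS). This sketch (hypothetical helper name; values illustrative) shows an oversized allocation being refused outright rather than pushing the machine into swap:

```cpp
#include <sys/resource.h>

#include <cstddef>
#include <new>

// Sketch (Linux): impose a per-process address-space cap, as cluster
// schedulers do via `ulimit -v` or cgroups, and check whether an allocation
// of `allocBytes` survives under it. Returns false if the allocation is
// refused (a clean bad_alloc, not swapping).
bool AllocationSurvivesCap(std::size_t capBytes, std::size_t allocBytes) {
    struct rlimit original, capped;
    getrlimit(RLIMIT_AS, &original);
    capped = original;
    capped.rlim_cur = capBytes;          // soft limit; illustrative value
    setrlimit(RLIMIT_AS, &capped);

    bool survived = true;
    try {
        char *p = new char[allocBytes];
        delete[] p;
    } catch (const std::bad_alloc &) {
        survived = false;                // refused outright, no swapping
    }

    setrlimit(RLIMIT_AS, &original);     // restore the original limit
    return survived;
}
```

Real cluster enforcement often goes through cgroups and the OOM killer (a SIGKILL rather than a catchable bad_alloc), so this only approximates the behavior described above.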

bhaller commented on May 29, 2024

The only limit on swap is available disk space.

Linux uses a swap partition.

Oh? How big is it?


petrelharp commented on May 29, 2024

Oh? How big is it?

On my laptop, both physical and swap are about 16G.

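On Linux, the figures quoted here can be read from /proc/meminfo. A small Linux-specific sketch (hypothetical helper name):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Linux-specific sketch: parse SwapTotal (in kB) from /proc/meminfo -- the
// number behind the "physical plus swap" ceiling discussed above.
long SwapTotalKB() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("SwapTotal:", 0) == 0) {
            std::istringstream fields(line);
            std::string key, unit;
            long kb = -1;
            fields >> key >> kb >> unit;   // e.g. "SwapTotal: 16777212 kB"
            return kb;
        }
    }
    return -1;   // /proc/meminfo absent or unexpected format (non-Linux)
}
```

MemTotal can be read the same way; `swapon --show` and `free -h` report the same numbers at the shell.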

petrelharp commented on May 29, 2024

Note that the 2^26+1 limit was determined on a different machine, though. The crash was on the laptop; I don't want to test on the other machine because I have some long-running jobs there.


bhaller commented on May 29, 2024

Note that the 2^26+1 limit was determined on a different machine, though. The crash was on the laptop; I don't want to test on the other machine because I have some long-running jobs there.

Hmm. Well, different OSes, different machines, etc., may have different memory policies, certainly. That's not SLiM's problem, in the end. If the kernel gives an allocation error, then the fun is over, and that's fine. But nothing we do should ever make the machine freeze; that's not our bug, and there's no point in trying to work around kernel bugs on a specific OS/machine; that's a fool's errand.


bhaller commented on May 29, 2024

Doing some issues cleanup prior to releasing SLiM 3.4. Skimming through the preceding history, it sounds to me like there is no longer a bug here. If a specific pattern of memory allocation on Debian locks the kernel, that's a kernel bug, not a bug in the process. The allocation of an extremely large block up front, in SLiM's object pool, was fixed. I'm inclined to close this now. @petrelharp do you see a bug remaining here that needs to be fixed on SLiM's side?

