Comments (32)
Hmm. This is difficult for me to debug, because it runs upward of 50 GB of memory usage without the raise happening yet, for me; my laptop is not up to the task. :-> Can you get me a backtrace from the point where the raise occurs? I suspect the raise is coming from eidos_object_pool.h:57, but it would be good to confirm that.

My guess is that on Debian (as perhaps on other platforms) the Individual class might be either 32 or 64 bytes (2^5 or 2^6), and thus allocating a block of 2^26+1 individuals exceeds a size of 2^31 or 2^32 that something on the system is imposing upon malloc allocations. I'm a bit surprised by that, though; modern 64-bit operating systems should not impose such restrictions on processes, it seems to me. I'd be curious whether this bug reproduces on macOS, but I can't test it until I'm back in NY in mid-September.

But even without being able to reproduce the bug, I can probably fix it for you if you can confirm the location of the raise. I just need to make the eidos_object_pool code smart enough to allocate multiple buffers when the total size exceeds 2^31; shouldn't be very hard. What's the urgency of this for you?
from slim.
Not urgent.
I'm going to close this for now, but if you disagree, comment and I will reopen it.
One instance of Individual is presently 232 bytes, on macOS (probably the same on Debian). 232 * 2^26 is 15569256448 bytes; 232 * (2^26 + 1) is 15569256680 bytes. I'm not sure why that would be the magic threshold above which malloc() would fail – I was expecting the threshold to involve crossing 2^32, but that is only 4294967296 or 4 GB, whereas the threshold is apparently at ~14.5 GB – but ours not to reason why.
Hi @petrelharp. I think I have fixed this, but since I can't actually reproduce the bug on my system (see above), I'm not sure. Could you test the current master branch to confirm the fix, and reopen this issue if there is still a problem? Thanks!
Something is wrong. Running the script above with this version of SLiM causes my computer to freeze up entirely. I'm not sure how to debug that sort of thing: please advise?
Oh, I can't re-open this: you have to.
Well, yeah; allocating 2^26 individuals involves a huge amount of memory. You're the one who reported the bug, man; don't blame me. ;-> I was assuming that you were running this model on some cluster node with an insane amount of memory. It probably won't run on ordinary desktops/laptops. But the fact that it locks your machine up is good; that means that the original bug, the raise from malloc(), has probably been fixed, and now it just freezes your computer because it burns through all available memory and then sends you into swap-land. :->
Probably the total memory footprint of this model is in the ballpark of 30 GB, but I'm not sure because I can't run it either. ;->
> now it just freezes your computer because it burns through all available memory and then sends you into swap-land. :->
That's not what's happening: after a second or two it totally freezes (e.g. capslock button doesn't turn on/off the light on the keyboard). Thrashing is different.
My guess is that it just progresses beyond thrashing to total lock-up so quickly you don't see it. Maybe I'm wrong; but on my machine I can watch it under the debugger allocating block after block after block, burning through the GBs, until I kill it before it kills my machine. So to the extent that I can run it at all, it appears to be working. It's certainly possible, though, that there is also some sort of 32-bit overflow bug that occurs with a sufficiently large population size. However, a bug like that should only be able to lock up SLiM itself, not the whole machine; the rest of the OS ought to be protected from bad behavior by one process. The fact that it's locking up the whole machine says that it's exceeding some limit in the kernel, and excessive memory allocation seems like the obvious culprit. Anyway, I think one of us needs to run it on a machine that has enough memory to actually run the model, and see what happens then. I can do that when I get back to NY, but it'd be interesting to get a data point from you on it too, if you could.
Additional info:
- this doesn't happen in 581bc96
- master still works fine if I don't add so many individuals.
I'll try to test this on a machine with actually enough memory.
> this doesn't happen in 581bc96

Because you get the bad_alloc error originally reported above instead, right? Or am I misunderstanding?

> master still works fine if I don't add so many individuals.

Because then memory limits are not exceeded? Or are you saying something else? Sorry if I'm being dense. :->

> I'll try to test this on a machine with actually enough memory.

OK, great.
> > this doesn't happen in 581bc96
>
> Because you get the bad_alloc error originally reported above instead, right? Or am I misunderstanding?

yes, that's right
So if you run it with 2^26 instead of 2^26+1, does it lock up your machine the same with both the master head and the previous version? Because the behavior certainly has changed; with the previous version it was allocating a single block of at least 14.5 GB (maybe bigger?), while with the current version it progressively allocates a set of smaller blocks to avoid any maximum block allocation size that might be in effect. So the behavior might indeed change.
> So if you run it with 2^26 instead of 2^26+1,

Then I get the original error

```
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
```

which is perhaps as it should be, actually: I'm trying to allocate more memory than there actually is? See this thread.
> > master still works fine if I don't add so many individuals.
>
> Because then memory limits are not exceeded? Or are you saying something else? Sorry if I'm being dense. :->

I don't know why, I just wanted to check that the build wasn't totally broken on my system.
Maybe instead of whatever you did to fix this issue, you should just catch the bad_alloc and return a more informative error message.
> > So if you run it with 2^26 instead of 2^26+1,
>
> Then I get the original error

OK, I'm confused. Your bug report above says that with 2^26+1 you get the bad_alloc error, but "With one fewer individuals, it does not." I took that to mean that with 2^26 the model ran. No?
> Maybe instead of whatever you did to fix this issue, you should just catch the bad_alloc and return a more informative error message.
Well, at present I don't think so. It sounds like Debian perhaps has a maximum limit of 14 GB or 16 GB or something on any one malloc, and that is triggering this bad_alloc error. But it's perfectly possible that a machine might have much more than that amount of memory, and there's no reason that SLiM should prevent the user from running large models arbitrarily. By breaking up the one big allocation into multiple smaller allocations, we should allow larger models to be run. If the machine does not, in fact, have enough memory to handle the model, then it might swap or freeze or crash; but there's not really any way to know that that will happen. We should just try to run, and if the OS can't handle the load, that's the OS's problem. If it freezes the machine instead of killing the process, that's an OS bug; it's not our responsibility to prevent the kernel from crashing, and even if it were, we really have no way of knowing that we're running into a limit like that.
That's not a fixed malloc limit in Debian: it's probably giving the bad malloc when you try to malloc more memory than the machine has, which is reasonable.
And, sorry, I'm not understanding you: that thread I linked to said that bad_alloc is actually an error that you can catch; if it is thrown then slim is going to exit, so the only question there is whether to re-interpret the error for users or not, right?
> That's not a fixed malloc limit in Debian: it's probably giving the bad malloc when you try to malloc more memory than the machine has, which is reasonable.
No, that's not how Unix systems are supposed to work. The whole point of virtual memory is that you can exceed the physical memory of the machine. An OS may need to draw a line on the size of any individual block, for technical reasons having to do with the design of the malloc allocator or the memory-mapping strategy or whatever; but the process as a whole is supposed to be unlimited, more or less. Certainly a total usage of 14 or 16 GB seems perfectly reasonable, and should be fulfilled.

And in any case, if the OS really wants to limit a process to 16 GB total, for some arbitrary reason, then it can decline two 8 GB allocations just as easily as it can decline one 16 GB allocation, if it wants to (but I'm pretty sure Debian has no such total usage limit; 16 GB would be way too low).

It makes sense to me to break up the large allocation into smaller blocks. Indeed, the fact that the initial allocation of the memory pool wasn't already being broken up into smaller blocks was just a bug; you can see from the design of the code that all subsequent allocations for the memory pool were limited in size, but the first one erroneously wasn't held to the same policy.
> And, sorry, I'm not understanding you: that thread I linked to said that bad_alloc is actually an error that you can catch; if it is thrown then slim is going to exit, so the only question there is whether to re-interpret the error for users or not, right?
No, I don't think so. That bad_alloc error is because we're trying to do a single allocation that is over an OS threshold. That's fine; we shouldn't be doing that, and it's a bug that we were, and I've just fixed that bug (I think). The error message in that case could perhaps be improved, but that's really a side issue. By breaking up the allocation into smaller blocks – which we should have been doing all along – the bad_alloc throw doesn't happen, so that error message is no longer user-visible anyway.
> No, that's not how Unix systems are supposed to work. The whole point of virtual memory is that you can exceed the physical memory of the machine.
Exceeding physical plus swap?
I tested this on a cluster node with lots of memory. It works fine to allocate 2^26 + 1 individuals; and when I multiplied that by 10 to get something bigger than the available memory, it was killed and did not crash the node. So, it's still clearly not good that it can crash my machines, but not as bad as it could be.
The only limit on swap is available disk space. Are you so short on that, that you can't run a 30 GB process? If so, then if you free up some more disk space, does the behavior change? In any case, in my experience Unix systems issue warnings to the user when swap space is running low, rather than making allocations fail. And if the system did decide to make a 16 GB allocation fail because physical+swap was insufficient, wouldn't it simply do the same for two 8 GB allocations? If the kernel wants to give us a bad_alloc for exceeding memory limits, it is free to do so. The fix I did here is a good fix (assuming I didn't make a mistake in my changes) of what is clearly a bug; whatever further issues might have been exposed by the fix are new bugs (whether in SLiM or in the kernel). Maybe we ought to skype about this?
> The only limit on swap is available disk space.
Linux uses a swap partition.
> I tested this on a cluster node with lots of memory. It works fine to allocate 2^26 + 1 individuals; and when I multiplied that by 10 to get something bigger than the available memory, it was killed and did not crash the node. So, it's still clearly not good that it can crash my machines, but not as bad as it could be.
Yes, that sounds fine. Clusters typically handle allocation differently than desktop machines, because they want to prevent one process from interfering with any other process. So they usually enforce per-process memory limits that are typically much smaller than the total memory of the machine, and if a process exceeds that limit then it is unceremoniously killed. Desktop machines do not typically behave in that manner; instead, they typically act as if memory is infinite, and go into swapping as needed, and warn the user if swap space gets tight. There might be a hard limit at which they kill a process, but it is generally far beyond physical memory limits, in my experience, because there is no compelling reason to limit/kill a process that the user has chosen to run. (But then I have no experience with Debian specifically; perhaps it is unusual.)
> > The only limit on swap is available disk space.
>
> Linux uses a swap partition.

Oh? How big is it?
> Oh? How big is it?
On my laptop, both physical and swap are about 16G.
Note that the 2^26+1 limit was determined on a different machine, though. The crash was on the laptop; I don't want to test on the other machine because I have some long-running jobs there.
> Note that the 2^26+1 limit was determined on a different machine, though. The crash was on the laptop; I don't want to test on the other machine because I have some long-running jobs there.
Hmm. Well, different OSes, different machines, etc., may have different memory policies, certainly. That's not SLiM's problem, in the end. If the kernel gives an allocation error, then the fun is over, and that's fine. But nothing we do should ever make the machine freeze; that's not our bug, and there's no point in trying to work around kernel bugs on a specific OS/machine; that's a fool's errand.
Doing some issues cleanup prior to releasing SLiM 3.4. Skimming through the preceding history, it sounds to me like there is no longer a bug here. If a specific pattern of memory allocation on Debian locks the kernel, that's a kernel bug, not a bug in the process. The allocation of an extremely large block up front, in SLiM's object pool, was fixed. I'm inclined to close this now. @petrelharp do you see a bug remaining here that needs to be fixed on SLiM's side?