Comments (33)
from slim.
Yeah, I have no idea about the interpretation, other than that the issue happens before hitting any tskit code, meaning it is generated somewhere within slim. I've been thinking about this b/c I've been working on understanding if it really is an issue in tskit, and all evidence so far points to "no". So I dug into the slim code a bit and did #61. It seems that more aggressive debug code will be helpful during the sims to see if you can catch an invalid child ID in action.
from slim.
In the meantime, I just added some printf statements into the tskit code to see if sort/simplify ever do get hit there. It takes a couple hours to run, so I'll get back to you tomorrow.
It is such a bizarre number of nodes that it seems to me that simplify may have been run, even though the interval is set to the simulation length
from slim.
no printf output.
from slim.
OK, the next time that tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() detects an overflow and returns TSK_ERR_COLUMN_OVERFLOW. Guess what the value of that constant is? -704. tsk_node_table_add_row() sees that ret != 0 and returns the -704 to the caller. In SLiMSim::RecordNewGenome(), that return value goes into offspringTSKID, with no check that it is positive – no error check at all. So, there's the bug.
from slim.
OK, so I guess I'm continuing to be dense, but. 1e9+2e6 is 1002000000. That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"? If anything, isn't the number of nodes much smaller than expected? (Which also seems odd.) I'm also puzzled by the negative child index of -704; do you think we've overflowed the range of int32_t here? That would seem quite surprising – and even more surprising that we didn't crash a long time ago if we really did that, since we must have been writing values to illegal memory locations for quite a long time! Can you spell out for me what exactly you think this debug output shows? Sorry – the tree-seq stuff is not my area of expertise, perhaps @petrelharp should be in the loop on this since he's the main architect of it in SLiM.
from slim.
For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.
from slim.
That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"?
oops--I was off when I wrote that.
from slim.
For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.
This is explained above, and should be clear from the diff in #61 where the values come from. Yes, the negative number is the problem, though. I agree--if this problem was happening all the time, then you'd see crashes all the time. Indeed, the node table is too small if anything.
from slim.
Can you spell out for me what exactly you think this debug output shows?
It seems to show that a negative child index is being recorded (see diff in #61). I am assuming that we are entering the tskit code for the first time, but it is hard for me to guarantee that, as there seem to be multiple ways to get there (sort and simplify are called on at least two different lines in slim_core.cpp).
from slim.
Yes, I understand the literal meaning of the debug output, it's the interpretation – what is really happening to lead to that output – that I'm having trouble with. @petrelharp, thoughts? This is bizarre.
from slim.
Yes, if this code is doing what it intends to do I'll probably add it to the DEBUG build of SLiM. The more DEBUG checks the better. :->
from slim.
@petrelharp can say more, and it looks to me like it could/should be a function that gets called at all places right before the tsk_foo function for sorting tables, in order to guarantee that all execution paths hit it.
I did it via exceptions b/c that was easiest for me. I'm guessing you have some way of handling these things so that your GUI doesn't explode upon error?
from slim.
For the record, fwdpy11 is also missing a check like this. I added one in order to generate a table collection of the same size for testing, but haven't committed it. I need to decide where and how is best to put it in.
from slim.
Just catching up on this now - strange stuff! But, looks like we can probably close tskit-dev/tskit#402?
from slim.
I just put the debugging code from #61 into SLiM, ran the SLiM model given above, and did not get any debugging output. The debug code did get executed at the end of the run – I had a breakpoint on it – and the number of nodes in the node table was 214748263 as before. But it generated no debug logs. Not sure why, but unimportant.
I then extended the check to also look for a negative value for either index, and I moved the debug code into _RunOneGenerationWF() so that it gets run in every generation, and changed it to log to std::cout rather than throwing, and had it log the number of edges and number of nodes in each generation, since that seemed odd (see discussion above). Here's the resulting output:
1: tables_.edges.num_rows == 2000000, tables_.nodes.num_rows == 4000000
2: tables_.edges.num_rows == 4000000, tables_.nodes.num_rows == 6000000
3: tables_.edges.num_rows == 6000000, tables_.nodes.num_rows == 8000000
4: tables_.edges.num_rows == 8000000, tables_.nodes.num_rows == 10000000
5: tables_.edges.num_rows == 10000000, tables_.nodes.num_rows == 12000000
...
104: tables_.edges.num_rows == 208000000, tables_.nodes.num_rows == 210000000
105: tables_.edges.num_rows == 210000000, tables_.nodes.num_rows == 212000000
106: tables_.edges.num_rows == 212000000, tables_.nodes.num_rows == 214000000
107: tables_.edges.num_rows == 214000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
108: tables_.edges.num_rows == 216000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
109: tables_.edges.num_rows == 218000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
110: tables_.edges.num_rows == 220000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
...
So, something clearly went wrong in generation 107; it looks like probably the nodes table overflowed or something, although 214748263 is not a round power of two (log2 of it is 27.678, which doesn't suggest anything to me).
It seems like the upshot here is:
(1) A sufficiently large model can overflow the tree-sequence recording tables;
(2) This overflow is not being caught by the tskit code, at least not in the version we're presently using in SLiM (which is a bit out of date, I think), which seems like a bug unless we are somehow causing the overflow in SLiM without using a tskit API to do it;
(3) This debug code would perhaps be a useful thing to add somewhere in SLiM. Perhaps once we understand the bug better, a more precise check for the actual overflow condition can be added too/instead, though. Scanning every entry of the edge table in every generation is probably a bit time-consuming. But when in DEBUG mode, at least, it might be reasonable. Maybe there are a bunch of other similar integrity checks we ought to be doing as well. If they help catch a bug like this, they're worth it. :->
So, (2) seems like where we need to focus for now. @petrelharp, do you have a suggestion as to where in the code I ought to look to pin this down, or why it might be happening, or the significance of 214748263?
from slim.
@petrelharp, you have been intending to merge new tskit code into SLiM for a while now. Maybe now would be a good time to do that, so that I'm not debugging this issue against stale tskit code? It'd be nice to be sure that this bug still actually exists before spending a bunch of time tracking it down. :-> When do you think you'd have cycles free to do that merge?
from slim.
@molpopgen @jeromekelleher any insights before I delve deeper into this?
from slim.
Afraid not. You are still only 1/10th of the way to overflow by the time the error happens, which is why this is so odd.
In [1]: import numpy as np
In [2]: x = 214000000
In [3]: np.iinfo(np.int32).max
Out[3]: 2147483647
In [4]: np.iinfo(np.int32).max/x
Out[4]: 10.034970313084113
from slim.
Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.
from slim.
If it is overflow, then I think you're right; we should definitely merge in the new tskit code. I'm a bit overwhelmed at the moment, but this is pretty important: this is a big simulation, but not that big.
But, thinking more: if it really is overflow, then we shouldn't really be waiting for tskit to catch it, I think? The things inserted into the edge tables are node IDs, and the node IDs are assigned by SLiM and stored as properties of each individual (well, IIRC the indiv has ID n and the genomes have IDs 2n and 2n+1)? So at least for debugging and maybe always, we should be checking - say, when we add each new individual to the node table - whether their ID is negative?
I'm not convinced it's overflow, though: if it was overflow, why would the first one be -704? And as Kevin says, we shouldn't have gotten to overflow yet?
from slim.
Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.
Oh, it's probably the metadata, you're right. Never mind my previous comment about individual IDs.
from slim.
So, is this caught in tskit now? Well, here's something representative of inserting metadata, which asserts that something isn't bigger than something; and in fact it is being checked that the second something doesn't get too big; which is not present in SLiM's code. Which is embarrassingly old.
So, once I merge in new tskit we should appropriately get an error here, if that's really the problem?
from slim.
I guess I'm confused how metadata can lead to a negative node value in the edge table, but I never tried to track down how and where you are adding rows.
from slim.
I was guessing by overflowing the metadata, which could be next to the edge table in memory? But then why would it be always the same problem?
from slim.
OK. I set a breakpoint in tsk_node_table_add_row_internal() when self->num_rows == 214748262 (one before the point at which things seem to go south). When that breakpoint is hit, self->metadata_length is 2147482620, which is 1028 less than 2^31. The next time tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() is called with additional_length == 10 (so, 10 bytes per metadata entry, as suspected). In that call, expand_column() does get called to expand the metadata, and self->max_metadata_length_increment is 1024 so the size of the metadata buffer goes over 2^31 (or very nearly; 4 bytes short?). The expansion seems to proceed without drama, however, and tsk_size_t is uint32_t so it should be able to go up to 2^32, perhaps. I have not yet seen where things actually go south; I'm continuing to trace through the code. But it is certainly clear that it is the result of the metadata buffer growing beyond 2^31 bytes.
from slim.
ACK!!! We need to wrap that in
ret = tsk_node_table_add_row( ... );
if (ret < 0) {
    /* handle the tskit error */
}
from slim.
Looks like we've got the same error here and here and here and here and here maybe, although there's an error handler later... and a few more places...
... but deal with it correctly here.
from slim.
Yep. A quick search shows that we're calling tskit add_row() functions in various spots and not checking the return value we get back for being < 0. I think we never reviewed that aspect of our code after we got things working, @petrelharp! So I'll add a check not only there, but in several other spots too. OK, now the test model halts in SLiMgui and shows this error: "tsk_node_table_add_row: Table column too large; cannot be more than 2**31 bytes." Seems good. I just need to decide what debugging code to leave behind, and where to put it...
from slim.
I think I'm going to remove the debug check now that the bug is found, upon reflection. It's very specific to this scenario; one could write similar code to sanity-check all kinds of things about the state of the tables, which is probably a good idea, but (1) there's no reason to focus on checking this specific thing without checking all the others, (2) checking all the others is not something I'm going to do right now, and really would exceed my level of knowledge of the inner workings of tskit, and (3) to that point, this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables. Which would be great. I'll log a new issue to that effect.
from slim.
I think we're good - tskit is generally very good at checking inputs; the problem here is that we weren't paying attention to the error it gave us! Thanks for tracking that down!!
from slim.
this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables
Maybe you're looking for this function?
Indeed. See tskit-dev/tskit#592 which I just closed, but maybe should not have.
from slim.