Comments (33)

molpopgen commented on June 5, 2024

from slim.

molpopgen commented on June 5, 2024

Yeah, I have no idea about the interpretation, other than that the issue happens before hitting any tskit code, meaning it is generated somewhere within slim. I've been thinking about this b/c I've been working on understanding if it really is an issue in tskit, and all evidence so far points to "no". So I dug in to the slim code a bit and did #61. It seems that more aggressive debug code will be helpful during the sims to see if you can catch an invalid child ID in action.

molpopgen commented on June 5, 2024

In the meantime, I just added some printf statements into the tskit code to see if sort/simplify ever do get hit there. It takes a couple hours to run, so I'll get back to you tomorrow.

It is such a bizarre number of nodes that it seems to me that simplify may have been run, even though the interval is set to the simulation length

molpopgen commented on June 5, 2024

no printf output.

bhaller commented on June 5, 2024

OK, the next time that tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() detects an overflow and returns TSK_ERR_COLUMN_OVERFLOW. Guess what the value of that constant is? -704. tsk_node_table_add_row() sees that ret != 0 and returns the -704 to the caller. In SLiMSim::RecordNewGenome(), that return value goes into offspringTSKID, with no check that it is positive – no error check at all. So, there's the bug.
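A minimal Python mock of that failure mode may help make it concrete. The mock table and function names here are illustrative, not tskit's actual implementation; the point is that an add-row call which returns either a new row id or a negative error code becomes a bogus negative "node id" the moment the caller skips the check.

```python
# Illustrative mock, NOT tskit itself: the real tsk_node_table_add_row()
# returns the new row's id on success, or a negative error code on failure.
TSK_ERR_COLUMN_OVERFLOW = -704  # value per the discussion above
COLUMN_LIMIT = 2**31            # column size limit, in bytes

def add_row_mock(table, metadata_bytes=10):
    """Return the new row id, or TSK_ERR_COLUMN_OVERFLOW on overflow."""
    if table["metadata_length"] + metadata_bytes > COLUMN_LIMIT:
        return TSK_ERR_COLUMN_OVERFLOW
    table["metadata_length"] += metadata_bytes
    table["num_rows"] += 1
    return table["num_rows"] - 1

# A node table whose metadata column is about to hit the limit:
table = {"num_rows": 214748262, "metadata_length": COLUMN_LIMIT - 5}
offspring_id = add_row_mock(table)  # stored unchecked, as in the buggy path
# offspring_id is now -704: an error code masquerading as a node id,
# which later surfaces as the negative child index in the edge table.
```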

petrelharp commented on June 5, 2024

this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables

Maybe you're looking for this function?

bhaller commented on June 5, 2024

OK, so I guess I'm continuing to be dense, but: 1e9+2e6 is 1002000000. That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"? If anything, isn't the number of nodes much smaller than expected? (Which also seems odd.) I'm also puzzled by the negative child index of -704; do you think we've overflowed the range of int32_t here? That would seem quite surprising – and even more surprising that we didn't crash a long time ago if we really did that, since we must have been writing values to illegal memory locations for quite a long time! Can you spell out for me what exactly you think this debug output shows? Sorry – the tree-seq stuff is not my area of expertise; perhaps @petrelharp should be in the loop on this since he's the main architect of it in SLiM.

bhaller commented on June 5, 2024

For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.

molpopgen commented on June 5, 2024

That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"?

oops--I was off when I wrote that.

molpopgen commented on June 5, 2024

For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.

This is explained above, and should be clear from the diff in #61 where the values come from. Yes, the negative number is the problem, though. I agree--if this problem was happening all the time, then you'd see crashes all the time. Indeed, the node table is too small if anything.

molpopgen commented on June 5, 2024

Can you spell out for me what exactly you think this debug output shows?

It seems to show that a negative child index is being recorded (see diff in #61). I am assuming that we are entering the tskit code for the first time, but it is hard for me to guarantee that as there seems to be multiple ways to get there (sort and simplify are called on at least two different lines in slim_core.cpp).

bhaller commented on June 5, 2024

Yes, I understand the literal meaning of the debug output, it's the interpretation – what is really happening to lead to that output – that I'm having trouble with. @petrelharp, thoughts? This is bizarre.

bhaller commented on June 5, 2024

Yes, if this code is doing what it intends to do I'll probably add it to the DEBUG build of SLiM. The more DEBUG checks the better. :->

molpopgen commented on June 5, 2024

@petrelharp can say more, and it looks to me like it could/should be a function that gets called at all places right before the tsk_foo function for sorting tables, in order to guarantee that all execution paths hit it.

I did it via exceptions b/c that was easiest for me. I'm guessing you have some way of handling these things so that your GUI doesn't explode upon error?

molpopgen commented on June 5, 2024

For the record, fwdpy11 is also missing a check like this. I added one in order to generate a table collection of the same size for testing, but haven't committed it. I need to decide where and how is best to put it in.

jeromekelleher commented on June 5, 2024

Just catching up on this now - strange stuff! But, looks like we can probably close tskit-dev/tskit#402?

bhaller commented on June 5, 2024

I just put the debugging code from #61 into SLiM, ran the SLiM model given above, and did not get any debugging output. The debug code did get executed at the end of the run – I had a breakpoint on it – and the number of nodes in the node table was 214748263 as before. But it generated no debug logs. Not sure why, but unimportant.

I then extended the check to also look for a negative value for either index, and I moved the debug code into _RunOneGenerationWF() so that it gets run in every generation, and changed it to log to std::cout rather than throwing, and had it log the number of edges and number of nodes in each generation, since that seemed odd (see discussion above). Here's the resulting output:

1: tables_.edges.num_rows == 2000000, tables_.nodes.num_rows == 4000000
2: tables_.edges.num_rows == 4000000, tables_.nodes.num_rows == 6000000
3: tables_.edges.num_rows == 6000000, tables_.nodes.num_rows == 8000000
4: tables_.edges.num_rows == 8000000, tables_.nodes.num_rows == 10000000
5: tables_.edges.num_rows == 10000000, tables_.nodes.num_rows == 12000000
...
104: tables_.edges.num_rows == 208000000, tables_.nodes.num_rows == 210000000
105: tables_.edges.num_rows == 210000000, tables_.nodes.num_rows == 212000000
106: tables_.edges.num_rows == 212000000, tables_.nodes.num_rows == 214000000
107: tables_.edges.num_rows == 214000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
108: tables_.edges.num_rows == 216000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
109: tables_.edges.num_rows == 218000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
110: tables_.edges.num_rows == 220000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
...

So, something clearly went wrong in generation 107; it looks like probably the nodes table overflowed or something, although 214748263 is not a round power of two (log2 of it is 27.678, which doesn't suggest anything to me).

It seems like the upshot here is:

(1) A sufficiently large model can overflow the tree-sequence recording tables;

(2) This overflow is not being caught by the tskit code, at least not in the version we're presently using in SLiM (which is a bit out of date, I think), which seems like a bug unless we are somehow causing the overflow in SLiM without using a tskit API to do it;

(3) This debug code would perhaps be a useful thing to add somewhere in SLiM. Perhaps once we understand the bug better, a more precise check for the actual overflow condition can be added too/instead, though. Scanning every entry of the edge table in every generation is probably a bit time-consuming. But when in DEBUG mode, at least, it might be reasonable. Maybe there are a bunch of other similar integrity checks we ought to be doing as well. If they help catch a bug like this, they're worth it. :->
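For illustration, here is the shape of the per-generation check described in (3), sketched as a standalone Python mock. The actual SLiM/#61 code is C++; `check_edges` and its `(parent, child)` edge representation are hypothetical, invented for this sketch.

```python
def check_edges(edges, num_nodes):
    # Flag any edge whose parent or child id falls outside [0, num_nodes),
    # mirroring the "child index out of range" message in the debug output.
    # edges: iterable of (parent_id, child_id) pairs (hypothetical layout).
    for parent, child in edges:
        if not 0 <= child < num_nodes:
            raise ValueError(f"child index out of range: {child}, {num_nodes}")
        if not 0 <= parent < num_nodes:
            raise ValueError(f"parent index out of range: {parent}, {num_nodes}")

# An unchecked error code recorded as a child id trips the check:
# check_edges([(5, -704)], 214748263) raises ValueError
```

Scanning every edge each generation is O(number of edges), which is why it probably belongs behind a DEBUG flag, as discussed above.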

So, (2) seems like where we need to focus for now. @petrelharp, do you have a suggestion as to where in the code I ought to look to pin this down, or why it might be happening, or the significance of 214748263?

bhaller commented on June 5, 2024

@petrelharp, you have been intending to merge new tskit code into SLiM for a while now. Maybe now would be a good time to do that, so that I'm not debugging this issue against stale tskit code? It'd be nice to be sure that this bug still actually exists before spending a bunch of time tracking it down. :-> When do you think you'd have cycles free to do that merge?

bhaller commented on June 5, 2024

@molpopgen @jeromekelleher any insights before I delve deeper into this?

molpopgen commented on June 5, 2024

Afraid not. You are still only 1/10th of the way to overflow by the time the error happens, which is why this is so odd.

In [1]: import numpy as np                                                                                                                                                               

In [2]: x = 214000000                                                                                                                                                                    

In [3]: np.iinfo(np.int32).max                                                                                                                                                           
Out[3]: 2147483647

In [4]: np.iinfo(np.int32).max/x                                                                                                                                                         
Out[4]: 10.034970313084113

bhaller commented on June 5, 2024

Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.
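A quick back-of-envelope supports that hunch. Assuming tskit's 2**31-byte column limit and the hypothesized 10 metadata bytes per node entry, the maximum row count lands right next to the observed table size:

```python
COLUMN_LIMIT = 2**31     # column size limit, in bytes
BYTES_PER_NODE = 10      # hypothesized metadata bytes per node entry

max_nodes = COLUMN_LIMIT // BYTES_PER_NODE
print(max_nodes)         # 214748364: within about 100 rows of the observed 214748263
```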

petrelharp commented on June 5, 2024

If it is overflow, then I think you're right; we should definitely merge in the new tskit code. I'm a bit overwhelmed at the moment, but this is pretty important: this is a big simulation, but not that big.

But, thinking more: if it really is overflow, then we shouldn't really be waiting for tskit to catch it, I think? The things inserted into the edge tables are node IDs, and the node IDs are assigned by SLiM and stored as properties of each individual (well, IIRC the indiv has ID n and the genomes have IDs 2n and 2n+1)? So at least for debugging and maybe always, we should be checking - say, when we add each new individual to the node table - whether their ID is negative?
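In miniature, the id convention recalled above plus the proposed negativity check might look like this. This is a hypothetical Python sketch, not SLiM's actual code, and it relies on the convention as recalled ("IIRC") in the comment.

```python
def genome_node_ids(individual_id):
    # Per the convention recalled above: individual n owns
    # genome node ids 2n and 2n+1.
    if individual_id < 0:
        # the proposed debug check: a negative id is never valid
        raise ValueError(f"negative individual id: {individual_id}")
    return 2 * individual_id, 2 * individual_id + 1
```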

I'm not convinced it's overflow, though: if it was overflow, why would the first one be -704? And as Kevin says, we shouldn't have gotten to overflow yet?

petrelharp commented on June 5, 2024

Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.

Oh, it's probably the metadata, you're right. Never mind my previous comment about individual IDs.

petrelharp commented on June 5, 2024

So, is this caught in tskit now? Well, here's something representative of inserting metadata, which asserts that something isn't bigger than something; and in fact it is being checked that the second something doesn't get too big; which is not present in SLiM's code. Which is embarrassingly old.

So, once I merge in new tskit we should appropriately get an error here, if that's really the problem?

molpopgen commented on June 5, 2024

I guess I'm confused how metadata can lead to a negative node value in the edge table, but I never tried to track down how and where you are adding rows.

petrelharp commented on June 5, 2024

I was guessing by overflowing the metadata, which could be next to the edge table in memory? But then why would it be always the same problem?

bhaller commented on June 5, 2024

OK. I set a breakpoint in tsk_node_table_add_row_internal() when self->num_rows == 214748262 (one before the point at which things seem to go south). When that breakpoint is hit, self->metadata_length is 2147482620, which is 1028 less than 2^31. The next time tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() is called with additional_length == 10 (so, 10 bytes per metadata entry, as suspected). In that call, expand_column() does get called to expand the metadata, and self->max_metadata_length_increment is 1024 so the size of the metadata buffer goes over 2^31 (or very nearly; 4 bytes short?). The expansion seems to proceed without drama, however, and tsk_size_t is uint32_t so it should be able to go up to 2^32, perhaps. I have not yet seen where things actually go south; I'm continuing to trace through the code. But it is certainly clear that it is the result of the metadata buffer growing beyond 2^31 bytes.
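Replaying the numbers from that trace confirms the picture; this is pure arithmetic using the values quoted above:

```python
rows = 214748262             # self->num_rows at the breakpoint
meta = rows * 10             # 10 metadata bytes per entry, as observed
assert meta == 2147482620    # matches self->metadata_length at the breakpoint
assert 2**31 - meta == 1028  # "1028 less than 2^31", as reported

# The next expansion adds max_metadata_length_increment == 1024 bytes,
# pushing the column to 2147483644: 4 bytes short of 2**31, as noted.
expanded = meta + 1024
assert 2**31 - expanded == 4
```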

petrelharp commented on June 5, 2024

ACK!!! We need to wrap that in

ret = tsk_node_table_add_row( ... );
if (ret < 0) {
    /* handle the tskit error */
}

petrelharp commented on June 5, 2024

Looks like we've got the same error here and here and here and here and here maybe, although there's an error handler later... and a few more places...

... but deal with it correctly here.

bhaller commented on June 5, 2024

Yep. A quick search shows that we're calling tskit add_row() functions in various spots and not checking the return value we get back for being < 0. I think we never reviewed that aspect of our code after we got things working, @petrelharp! So I'll add a check not only there, but in several other spots too. OK, now the test model halts in SLiMgui and shows this error: "tsk_node_table_add_row: Table column too large; cannot be more than 2**31 bytes." Seems good. I just need to decide what debugging code to leave behind, and where to put it...
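The general pattern of the fix can be sketched as a hypothetical Python wrapper; SLiM's actual fix is in C++ at each call site, and `checked_add_row` here is illustrative only.

```python
def checked_add_row(add_row, *args):
    # Guard illustrating the fix: treat any negative return from a
    # tskit add_row-style call as an error, instead of letting it
    # flow onward as if it were a valid row id.
    ret = add_row(*args)
    if ret < 0:
        # in SLiM this would route through the normal error-reporting path
        raise RuntimeError(f"tskit error code {ret}")
    return ret
```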

bhaller commented on June 5, 2024

I think I'm going to remove the debug check now that the bug is found, upon reflection. It's very specific to this scenario; one could write similar code to sanity-check all kinds of things about the state of the tables, which is probably a good idea, but (1) there's no reason to focus on checking this specific thing without checking all the others, (2) checking all the others is not something I'm going to do right now, and really would exceed my level of knowledge of the inner workings of tskit, and (3) to that point, this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables. Which would be great. I'll log a new issue to that effect.

petrelharp commented on June 5, 2024

I think we're good - tskit is generally very good at checking inputs; the problem here is that we weren't paying attention to the error it gave us! Thanks for tracking that down!!

bhaller commented on June 5, 2024

this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables

Maybe you're looking for this function?

Indeed. See tskit-dev/tskit#592 which I just closed, but maybe should not have.
