Hello! I have a few questions related to the missing data or NaN in

version 1.15 dealing with NaN (redm issue, CLOSED)

sugiharalab commented on August 27, 2024

version 1.15 dealing with NaN

from redm.

Comments (4)

SoftwareLiteracy commented on August 27, 2024

Perhaps a little background for context could be helpful.

In Simplex and its derivatives (EmbedDimension, CCM, ...), nan in the data are passed through the entire | embedding : knn : projection | pipeline. As a result, any nan in the data are automatically rendered in the library, excluded from prediction, and properly represented in the output.

SMap embeds the data, then creates a linear system matrix that is solved with a LAPACK/BLAS SVD. LAPACK does not allow nan. In versions 1.14 and earlier, time series rows that contained nan were removed prior to the SVD. This effectively prevented any library vectors with nan, but it also created gaps in the output and raised the question of whether Takens embedding remains theoretically valid.
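To make the row-removal idea concrete, here is a minimal base R sketch. It is an illustration only, not the cppEDM code: the matrix `A`, the targets `b`, and the coefficients are made up, and base R's `svd()` stands in for the LAPACK solver.

```r
# Illustration only: a small linear system with one NaN row.
set.seed(1)
A <- cbind(1, rnorm(8))           # design matrix: intercept + one predictor
b <- A %*% c(0.5, 2)              # exact targets from coefficients 0.5 and 2
A[4, 2] <- NaN                    # a missing observation

# Base R's svd() stops on non-finite input, so the NaN must go first.
svd_fails <- inherits(try(svd(A), silent = TRUE), "try-error")

# Drop rows containing NaN, then solve least squares via the SVD
# pseudo-inverse: beta = V * diag(1/d) * t(U) * b.
keep <- complete.cases(A, b)
s    <- svd(A[keep, ])
beta <- s$v %*% ((t(s$u) %*% b[keep]) / s$d)

print(svd_fails)                  # TRUE
print(as.numeric(beta))           # recovers 0.5 and 2 from the remaining rows
```

Dropping row 4 lets the solve succeed, but the time series now has a hole at that row, which is the gap issue described above.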

S-map ignoreNan in version 1.15 is new. It adjusts the library to ignore all embedding vectors with nan. This should properly represent nan in the output where appropriate, rather than returning gaps in the output as the previous method did.

So the answer to the first question is no, version 1.15 SMap does not handle nan in the same way as versions 1.14 and earlier.

The answer to the second question is yes: Simplex-based functions ignore nan, though not by redefining the library. Since the numerical computations are all internal and nan are carried through | embedding : knn : projection |, any projection influenced by a nan will return nan.
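As a rough illustration of how a nan is carried through the projection step, the sketch below treats a Simplex-style prediction as a distance-weighted mean of neighbor targets. The `project` helper, the distances, and the target values are all hypothetical and are not taken from rEDM.

```r
# Illustration only: a Simplex-style projection as a weighted mean of
# neighbor targets, with weights that decay with distance to the neighbor.
project <- function(dist, target) {
  w <- exp(-dist / min(dist))
  sum(w * target) / sum(w)
}

dist    <- c(0.10, 0.15, 0.30)                # distances to E + 1 = 3 neighbors
clean   <- project(dist, c(0.42, 0.44, 0.40))
tainted <- project(dist, c(0.42, NaN, 0.40))  # one neighbor projects onto a NaN

print(c(clean = clean, tainted = tainted))    # tainted is NaN
```

No library redefinition is needed here: ordinary floating point arithmetic is enough to turn any projection that touches a nan into nan.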

On a related note, one can also consider handling missing data with "bundle embedding":
An empirical dynamic modeling framework for missing or irregular samples

ecosan327 commented on August 27, 2024

Thank you so much for the clarification of how different versions do in Simplex and in S-map.
Could you explain in more detail how ignoreNan adjusts the library?
I am curious how the library is adjusted in the state space (does it change the shadow manifold?) in order to cope with the gap issue.
I think this is interesting and really helpful.=)

SoftwareLiteracy commented on August 27, 2024

It is a bit complex, since E, Tp, and tau all influence the availability of valid embedding vectors in response to a NaN. Put simply, when a NaN is present, no prediction should be made with a library vector that has a NaN neighbor (a function of E, tau) or where Tp would include a vector with a NaN component. Recall that projections are made by taking neighbors projected Tp time steps ahead (or behind) in Simplex, while all neighbors are used in SMap.
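The dependence on E, tau, and Tp can be sketched with a small helper. This is a hypothetical function, not part of rEDM, and it assumes lags point backwards so that the embedding vector at row t holds x[t], x[t - step], ..., x[t - (E - 1) * step] with step = abs(tau).

```r
# Hypothetical helper (not part of rEDM): prediction times lost because the
# predicting embedding vector itself contains a NaN component.
lost_prediction_times <- function(nan_rows, E, Tp, step = 1) {
  # Embedding rows that contain a NaN: the NaN row and the next E - 1 lags.
  bad_rows <- sort(unique(as.vector(outer(nan_rows, (0:(E - 1)) * step, `+`))))
  # A prediction made from row t is reported at time t + Tp.
  bad_rows + Tp
}

lost_prediction_times(10, E = 2, Tp = 1)    # 11 12
lost_prediction_times(10, E = 2, Tp = -1)   #  9 10
```

This covers only NaN in the predicting vector. A NaN can also reach a prediction through a neighbor or its projected target, which this sketch does not model.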

Perhaps some examples can illustrate.

Insert a NaN into observation x[10]:

> library( rEDM )
> df = circle
> dim( df )
[1] 200   3

> head( df, 2 )
  Time      x     y
1    1 0.0000 1.000
2    2 0.0631 0.998

> df $ x[10] = NaN
> df[ 8:12, ]
   Time      x      y
8     8 0.4278 0.9039
9     9 0.4840 0.8751
10   10    NaN 0.8428
11   11 0.5903 0.8072
12   12 0.6401 0.7683

Simplex

Simplex prediction with E=2, Tp=1 and library including NaN observation.

Note that Time 11 & 12 do not have a prediction, since Tp = 1, E = 2. The NaN at Time 9 is likely from a neighbor that included the NaN as a component of its embedding vector.

> Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
           columns = 'x', target = 'x', E = 2, Tp = 1 )
   Time Observations Predictions Pred_Variance
1     5       0.2499         NaN           NaN
2     6       0.3105      0.2451      0.011215
3     7       0.3699      0.3056      0.010833
4     8       0.4278      0.3648      0.010375
5     9       0.4840         NaN           NaN
6    10          NaN      0.4183      0.002957
7    11       0.5903         NaN           NaN
8    12       0.6401         NaN           NaN
9    13       0.6873      0.6162      0.008243
10   14       0.7318      0.6816      0.006449
11   15       0.7733      0.7260      0.005701
12   16       0.8118      0.7676      0.004952

In the case of Tp = -1, we expect Time 9 & 10 to not have a prediction with E = 2:

> Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
           columns = 'x', target = 'x', E = 2, Tp = -1 )
   Time Observations Predictions Pred_Variance
1     4       0.1883      0.2034     0.0030564
2     5       0.2499      0.2648     0.0029724
3     6       0.3105      0.3251     0.0028645
4     7       0.3699      0.3841     0.0027358
5     8       0.4278      0.4500     0.0040664
6     9       0.4840         NaN           NaN
7    10          NaN         NaN           NaN
8    11       0.5903      0.6460     0.0003792
9    12       0.6401      0.6518     0.0018654
10   13       0.6873      0.6983     0.0016638
11   14       0.7318      0.7420     0.0014626
12   15       0.7733         NaN           NaN

Perhaps this is clearer in the case where the observation (target) does not have a NaN, but the library still does. Here we use target = 'y' and see no predictions at Time 11 & 12:

> Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
           columns = 'x', target = 'y', E = 2, Tp = 1 )
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.3156        0.8291
3     7       0.9291      0.3042        0.8033
4     8       0.9039      0.2916        0.7716
5     9       0.8751      0.2779        0.7345
6    10       0.8428     -0.2784        0.7446
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264     -0.2757        0.5360
10   14       0.6815      0.1936        0.4916
11   15       0.6340      0.1740        0.4369
12   16       0.5839      0.1539        0.3822

SMap

SMap is a bit different, since all library vectors are processed (but localized with theta) and the SVD solver does not allow NaN. ignoreNan (default TRUE) redefines the library to exclude the appropriate vectors (gaps) in the library according to E, Tp, tau.

The cross mapping example with SMap (columns = 'x', target = 'y'):

> SMap( dataFrame = df, lib = '1 50', pred = '5 15',
        columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1 ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.9506        1.9172
3     7       0.9291      0.9289        1.9033
4     8       0.9039      0.9044        1.8924
5     9       0.8751      0.8750        1.8894
6    10       0.8428      0.8428        1.7217
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264      0.7270        1.3985
10   14       0.6815      0.6811        1.1857
11   15       0.6340      0.6346        1.0133
12   16       0.5839      0.5829        0.8618

Prior to version 1.15 and ignoreNaN, one could achieve a similar result by explicitly specifying validLib to exclude NaN.

Create a validLib vector. Recall df $ x[10] is nan, so the initial validLib has FALSE in row 10. Add FALSE to row 11 for the E = 2, Tp = 1 example:

> validLib = !is.nan(df $ x)
> validLib[11] = FALSE
> validLib[5:15]
[1] 1 1 1 1 1 0 0 1 1 1 1
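The hand-built vector above can be generalized to other values of E and tau. The helper below is a hypothetical sketch, not part of rEDM, and assumes backward lags of abs(tau) steps. The data vector is a stand-in for df $ x.

```r
# Hypothetical helper (not part of rEDM): mark library rows whose embedding
# vector would contain a NaN, assuming backward lags of `step` = abs(tau).
make_valid_lib <- function(x, E, step = 1) {
  valid <- !is.nan(x)
  for (lag in seq_len(E - 1) * step) {
    # shifted[t] is TRUE when x[t - lag] is NaN
    shifted <- c(rep(FALSE, lag), is.nan(x))[seq_along(x)]
    valid   <- valid & !shifted
  }
  valid
}

x <- seq(0, 1, length.out = 20)   # stand-in data, not the circle data set
x[10] <- NaN

v <- make_valid_lib(x, E = 2)
which(!v)                          # 10 11
as.integer(v[5:15])                # 1 1 1 1 1 0 0 1 1 1 1
which(!make_valid_lib(x, E = 3))   # 10 11 12
```

With E = 2 this reproduces the two FALSE entries at rows 10 and 11. If the target column itself contains NaN, rows whose Tp-ahead target is NaN may also need to be excluded, which this sketch does not do.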

Now using validLib = validLib, ignoreNan = FALSE:

> SMap( dataFrame = df, lib = '1 50', pred = '5 15', 
        columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1, 
        validLib = validLib, ignoreNan = FALSE ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.9506        1.7717
3     7       0.9291      0.9289        1.7709
4     8       0.9039      0.9044        1.7573
5     9       0.8751      0.8750        1.7315
6    10       0.8428      0.8428        1.7055
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264      0.7270        1.3448
10   14       0.6815      0.6811        1.1575
11   15       0.6340      0.6346        0.9989
12   16       0.5839      0.5829        0.8551

Whereas if one uses ignoreNan = FALSE with no validLib, all predictions are NaN, since all neighbors (library vectors) are used, including the embedding vectors that contain the NaN from row 10.

> SMap( dataFrame = df, lib = '1 50', pred = '5 15',
        columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1,
        ignoreNan = FALSE ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506         NaN           NaN
3     7       0.9291         NaN           NaN
4     8       0.9039         NaN           NaN
5     9       0.8751         NaN           NaN
6    10       0.8428         NaN           NaN
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264         NaN           NaN
10   14       0.6815         NaN           NaN
11   15       0.6340         NaN           NaN
12   16       0.5839         NaN           NaN
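One way to see why a single NaN row is enough to spoil every prediction is to look at a dense weighted solve. The sketch below uses the normal equations purely as an illustration. It is not the cppEDM solver, and the matrix and weights are made up.

```r
# Illustration only: one NaN row contaminates a dense weighted solve.
A <- cbind(1, c(0.1, 0.2, NaN, 0.4, 0.5))   # row 3 carries the NaN
w <- exp(-2 * c(0.1, 0.5, 1.0, 1.5, 2.0))   # localization weights, all > 0

# t(A) W A sums over every row, so the NaN row reaches each entry that
# involves column 2, however small that row's weight is.
AtWA <- crossprod(A * sqrt(w))
print(AtWA)

# Excluding the row (which is what validLib or ignoreNan achieve) restores
# finite values.
ok      <- !is.nan(A[, 2])
AtWA_ok <- crossprod(A[ok, ] * sqrt(w[ok]))
print(AtWA_ok)
```

Because SMap uses all library vectors for every prediction, each prediction row builds a system like this one, and each inherits the NaN unless the offending vectors are removed from the library.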

For a peek under the hood, the code that actually creates the library vector is here:
https://github.com/SugiharaLab/cppEDM/blob/c41f7f5b16d3b13895523f0ae4b541b45babdcb2/src/Parameter.cc#L394

While the SMap code to adjust lib if NaN are found is here:
https://github.com/SugiharaLab/cppEDM/blob/c41f7f5b16d3b13895523f0ae4b541b45babdcb2/src/API.cc#L449

Perhaps the warning issued by SMap, "Time delay embedding presumption violated.", is a bit extreme, as it is not absolute whether or not the embedding violates the Takens presumption for a specific prediction.

ecosan327 commented on August 27, 2024

Thank you so much, especially the examples!
