Answers to questions received via email:
Q1. The anomaly_sequences column in labeled_anomalies.csv gives the start and end indices of the true anomalies in the stream, but I don’t know whether the indices in your file begin at 0 or 1. For example, given [[6000,8127]] for channel id “D-2”, does the start index “6000” refer to the 6000th row (if indices begin at 1) or the 6001st row (if indices begin at 0) of the file “test/D-2.txt”?
The indices begin at 0.
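As a minimal sketch of what 0-based indexing means in practice, the snippet below turns an anomaly_sequences string into a per-row boolean mask. The helper name `anomaly_mask` and the stream length of 9000 are hypothetical, chosen only for illustration, and the code assumes the end index is inclusive (which is consistent with the answer to Q2, but is still an assumption here).

```python
import ast


def anomaly_mask(anomaly_sequences, num_values):
    """Return one boolean per row; True marks an anomalous row (0-based)."""
    mask = [False] * num_values
    # The column stores a list of [start, end] pairs as text, e.g. "[[6000, 8127]]".
    for start, end in ast.literal_eval(anomaly_sequences):
        # Assumption: `end` is inclusive, so the last anomalous index is `end` itself.
        for i in range(start, end + 1):
            mask[i] = True
    return mask


# 9000 is a made-up stream length used only for this example.
mask = anomaly_mask("[[6000, 8127]]", 9000)
print(mask[5999], mask[6000], mask[8127], mask[8128])
```

With 0-based indices, `mask[6000]` corresponds to the 6001st row of the file when rows are counted from 1.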
Q2. Comparing the anomaly_sequences and num_values columns in labeled_anomalies.csv, I found that some end indices are larger than num_values: A-8.txt, A-9.txt, D-9.txt, F-2.txt. Is this a mistake?
This was an error and has been cleaned up. The anomalies run to the end of the sequence, so the end of the range should equal num_values - 1.
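A quick consistency check along these lines can confirm that no labeled range runs past the end of a stream. This is only a sketch: the function name `out_of_range_sequences` is hypothetical, and the values passed in below are made up rather than taken from labeled_anomalies.csv.

```python
import ast


def out_of_range_sequences(anomaly_sequences, num_values):
    """Return the [start, end] pairs that do not fit in a stream of num_values rows."""
    bad = []
    for start, end in ast.literal_eval(anomaly_sequences):
        # With 0-based indices, the largest valid index is num_values - 1.
        if start < 0 or start > end or end > num_values - 1:
            bad.append([start, end])
    return bad


# Made-up values: an end index of 100 does not fit in a 100-row stream.
print(out_of_range_sequences("[[10, 100]]", 100))  # flagged
print(out_of_range_sequences("[[10, 99]]", 100))   # fits exactly
```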
Q3. In both your test and train files, most of the values are 0. Could you give more background on the data to explain why?
The “Raw experiment data” section of the readme explains this: “Model input data also includes one-hot encoded information about commands that were sent or received by specific spacecraft modules in a given time window. No identifying information related to the timing or nature of commands is included in the data.” So you see lots of zeroes wherever commands weren’t sent to or received by a specific spacecraft module in a time window. At most timesteps, for most of the spacecraft submodules, there is no command activity. The first dimension holds the prior telemetry values for that channel (the -1.000s in the example you screenshotted) and will be primarily nonzero.
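To make the layout concrete, here is a small sketch that separates the telemetry value (first dimension) from the one-hot command dimensions and measures how sparse the latter are. The rows are invented for illustration and are not taken from the dataset, and the helper names `split_row` and `command_zero_fraction` are hypothetical.

```python
def split_row(row):
    """Split one timestep into (telemetry_value, command_flags)."""
    # Per the answer above: dimension 0 is telemetry, the rest encode commands.
    return row[0], row[1:]


def command_zero_fraction(rows):
    """Fraction of command entries that are zero across all timesteps."""
    flags = [flag for row in rows for flag in split_row(row)[1]]
    return sum(1 for flag in flags if flag == 0) / len(flags)


# Made-up rows: one telemetry value followed by four command flags.
rows = [
    [-1.0, 0, 0, 0, 0],
    [-1.0, 0, 1, 0, 0],
    [-0.9, 0, 0, 0, 0],
]
print(command_zero_fraction(rows))
```

In this toy example only one command flag is set, so nearly all command entries are zero while the telemetry column is nonzero throughout.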
Q4. What is the time interval between the adjacent rows?
For the anomalies from the SMAP spacecraft, values are aggregated into 1-minute buckets. For MSL, the time bucket size is variable because data rates are inconsistent and no interpolation between values was performed to fill missing buckets. This is one factor in the poorer performance seen for MSL anomalies and something we will be addressing in future iterations.
Q5. The anomalies for channel id P-2 are described twice, with different values (in row 19 and row 53), but there is no description of any anomaly for channel id T-10.
P-2 is the same channel with two anomalies occurring at different points in time, which is why you see two separate anomalies for that channel. These are entirely separate events and data that happen to occur for the same channel at different points in time. The full ranges of values are non-overlapping and the fact that the anomalous sequences have overlapping indices is coincidental.
T-10 didn’t have enough values to include, so it was intentionally removed from the dataset. In the interest of time, we didn’t rename the remaining channels.