
compromise_heatmap's Introduction

compromise_heatmap

A brief paper on this repo is available as Johnson & Ekstedt, "Towards a Graph Neural Network-Based Approach for Estimating Hidden States in Cyber Attack Simulations", 2023. Another brief paper, on the concept of detectors, is available as Johnson, Ekstedt & Kakouros, "Introducing Threat Detectors in the Meta Attack Language", 2024.

The code can run in GitHub Codespaces, as well as in GCP Batch and GCP Vertex AI.

In Codespaces (or another execution environment), try python3 main train. To run hyperparameter tuning in GCP Vertex AI, use hp_tuning.sh. To run parallel simulation in GCP Batch, use batch.sh.

You will need a GCP bucket, and you will need to provide a GCP service account with access to that bucket.

The project is also integrated with Weights & Biases, so you will need a wandb API key.

compromise_heatmap's People

Contributors

nkakouros, pontusj101


compromise_heatmap's Issues

Another GNN-LSTM architecture?

Currently, the LSTM consumes the GNN embedding as well as the previous LSTM hidden state of the node in question. The GNN, in turn, consumes the local graph structure as well as the most recent events recorded on neighboring nodes. Another architectural option would be to feed the GNN also with the LSTM hidden state. The GNN would thus directly access a representation of the memory of each neighboring node, which, if that memory is appropriately organized, could enable the GNN to make accurate predictions.

The current architecture ought to be able to do that too, though, since the LSTM should be able to record any relevant historical events in the neighborhood. For example, if the access-granting credential was previously attacked, that should increase the probability of compromise of the associated host. This may, however, be a harder learning problem, as the LSTM needs to distinguish between many different neighborhood configurations across many historical time steps.

Yet another option would be to have the LSTM remember only the historical records of the detectors. In this case, the log record (and, I think, the attack step type) is fed into the LSTM, which is tasked with remembering the salient features of that record. The GNN is subsequently given the same responsibility as in the pure GNN case, but instead of viewing the raw log history, it sees the LSTM hidden state. I think the drawback of this architecture is that it may be difficult for the LSTM to capture the order in which events happened (e.g. that A triggered first, followed by B and then C), since the time between events may vary, as may the time between the events and the prediction.

But the best thing to do is probably to start by learning about the existing temporal GNNs (#17), many of which combine LSTMs and GNNs.
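The data flow of the current architecture can be illustrated without any ML libraries; the mean aggregation and the exponential-moving-average update below are toy stand-ins for the real GAT and LSTM layers:

```python
def gnn_lstm_step(graph, events, hidden, alpha=0.5):
    """One time step of the current architecture, as a toy sketch.

    graph:  dict node -> list of neighbor nodes
    events: dict node -> most recent event value observed on that node
    hidden: dict node -> scalar stand-in for the node's LSTM hidden state
    """
    new_hidden = {}
    for node, neighbors in graph.items():
        # "GNN": aggregate the most recent events from the local
        # neighborhood (a real GAT would use attention-weighted messages).
        neigh_events = [events[n] for n in neighbors] or [0.0]
        embedding = sum(neigh_events) / len(neigh_events)
        # "LSTM": combine the GNN embedding with the node's previous
        # hidden state (a real LSTM has gated, learned updates).
        new_hidden[node] = alpha * hidden[node] + (1 - alpha) * embedding
    return new_hidden

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
hidden = {n: 0.0 for n in graph}
for events in [{"a": 1, "b": 0, "c": 0}, {"a": 0, "b": 1, "c": 0}]:
    hidden = gnn_lstm_step(graph, events, hidden)
```

The first alternative discussed above would additionally feed each neighbor's hidden state into the aggregation step, so the embedding would be computed over (event, hidden state) pairs rather than events alone.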

Neighbor sampling

Pytorch Geometric offers neighbor sampling, which might be relevant for our use case:

In the context of this project, which involves training Graph Neural Networks (GNNs) on simulations conforming to the Meta Attack Language (MAL) and using Reinforcement Learning (RL) to estimate the hidden state of cyber-attacks, neighbor sampling can provide several benefits:

  1. Scalability: One of the primary benefits of neighbor sampling is its ability to handle large graphs efficiently. In the domain of cyber-attack simulations, the attack graphs can be quite extensive and complex. Neighbor sampling allows GNNs to scale to these large graphs by limiting the number of neighbors processed at each layer of the network.

  2. Reduction in Computational Load: By sampling a fixed number of neighbors instead of using the entire neighborhood, the computational load is significantly reduced. This is particularly important when dealing with large-scale attack simulations where the full neighborhood might encompass a vast number of nodes.

  3. Mitigating Overfitting: Neighbor sampling can also help in reducing overfitting. By randomly selecting a subset of neighbors, the model is less likely to overfit to the specific structure of the training graph and can generalize better to unseen data.

  4. Variability in Training: Sampling introduces variability in the training process. Each epoch can use different subsets of neighbors, which can aid in robust learning as the model is exposed to different aspects of the data across training iterations.

  5. Balancing Graph Structure and Features: In attack simulations, certain nodes (representing attack steps or system components) might be more connected than others, leading to a skewed representation. Neighbor sampling can help in balancing the influence of highly connected nodes versus less connected ones, ensuring that the model does not overemphasize certain parts of the graph over others.

  6. Efficient Learning of Node Representations: Neighbor sampling can improve the efficiency of learning node representations in GNNs. By focusing on a subset of neighbors, the GNN can learn meaningful representations that capture local graph structures effectively, which is crucial for accurate predictions in attack simulations.

  7. Applicability to Dynamic Graphs: Cyber-attack graphs in simulations can be dynamic, with the graph structure changing over time (e.g., as attacks progress or defenses are deployed). Neighbor sampling can be adapted to such dynamic graphs, allowing the GNN to focus on the most relevant parts of the graph at each time step.

Incorporating neighbor sampling in the GNN model for MAL-conformant RDDL simulations can thus lead to more efficient and effective learning, especially considering the complexity and scale of attack graphs in cybersecurity simulations.
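PyTorch Geometric's NeighborLoader implements this; stripped of the library, the core mechanism is just per-hop capped random sampling (the fanout numbers below are illustrative):

```python
import random

def sample_neighborhood(adj, seed, fanouts, rng=None):
    """Sample a bounded multi-hop neighborhood around a seed node.

    adj:     dict node -> list of neighbor nodes
    fanouts: max number of neighbors kept per node at each hop
    Returns the set of nodes in the sampled computation graph.
    """
    rng = rng or random.Random(0)
    frontier, sampled = {seed}, {seed}
    for fanout in fanouts:
        next_frontier = set()
        for node in frontier:
            neighbors = adj[node]
            # Cap the neighborhood instead of aggregating over all of it.
            next_frontier.update(rng.sample(neighbors, min(fanout, len(neighbors))))
        sampled |= next_frontier
        frontier = next_frontier
    return sampled

# A hub node 0 with nine neighbors: with a fanout of 3, each batch
# aggregates over only three of them, bounding the computational load.
adj = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
nodes = sample_neighborhood(adj, seed=0, fanouts=[3])
```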

The random walker attacker is problematic during long attack sequences

The biggest problem with the random walker arises once the attack has progressed for a while: at that point, the attack horizon is very large and the attacker has many alternatives. It then becomes very difficult to predict what the attacker will do, and few patterns will be visible, as any number of unrelated attack steps can succeed one another.

One option is to simply delete the oldest attack steps in the attack horizon, thus maintaining a reasonably-sized horizon.

A similar option would be to assign weights to attack steps, so that the most recently discovered ones are the most likely to be attacked. For instance, 50% of the probability mass could be assigned to the most recently discovered steps, 25% to the second-most recently discovered, etc.
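The weighting scheme can be sketched directly with random.choices; halving the weight per recency rank is one possible decay schedule among many:

```python
import random

def pick_attack_step(horizon, decay=0.5, rng=None):
    """Pick a step from the attack horizon, biased toward recent steps.

    horizon: attack steps ordered oldest -> newest. The newest step
    gets weight 1, the next-newest decay, then decay**2, and so on,
    so with decay=0.5 roughly half the probability mass falls on the
    most recently discovered step.
    """
    rng = rng or random.Random()
    weights = [decay ** (len(horizon) - 1 - i) for i in range(len(horizon))]
    return rng.choices(horizon, weights=weights, k=1)[0]

horizon = ["old_step", "mid_step", "new_step"]
counts = {step: 0 for step in horizon}
rng = random.Random(0)
for _ in range(10_000):
    counts[pick_attack_step(horizon, rng=rng)] += 1
# counts now reflects the recency bias: new_step > mid_step > old_step.
```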

Temporal GNNs?

GATs work pretty well, but they suffer from a short memory, or more specifically a limited time window of past events. We might be able to live with that, but we could also consider various temporal GNNs. Currently, there is one such implementation, where a GAT is combined with an LSTM, but it does not seem to perform as well as the GATs.

But there are many ways to take time into account. Check out the many options in
https://pytorch-geometric-temporal.readthedocs.io/en/latest/modules/root.html#

According to ChatGPT, among the layers listed in the PyTorch Geometric Temporal documentation, the ability to recognize or handle different types of relations is a crucial aspect for tasks that involve heterogeneous networks or graphs with multiple types of nodes and edges. Here are some layers that are particularly tailored to handle different types of relationships:

LRGCN (Long Short Term Memory Relational Graph Convolution Layer): This layer explicitly handles different types of relations by allowing each relation its own convolutional filters. This is particularly important in scenarios where the interactions or connections between nodes are not uniform and can differ significantly (e.g., different types of alerts or events in an intrusion detection system). By considering the number and type of relations, LRGCN can learn how different types of connections between nodes evolve over time.

HeteroGCLSTM (Heterogeneous Graph Convolutional Long Short Term Memory): This is a variant designed for heterogeneous graphs where different types of nodes and edges might exist. It considers a dictionary of in_channels for different node types and leverages the graph's metadata to understand the structure and type of nodes and edges. This could be particularly useful in an IDS scenario where different kinds of nodes (servers, routers, endpoints) and different types of edges (data flow, control commands, etc.) may represent various aspects of the network and its traffic.

These layers are explicitly designed to work with different types of relations, making them potentially more suitable for complex scenarios where understanding the different types of interactions between nodes is crucial. In the context of intrusion detection, recognizing different types of relations might help in identifying patterns of normal vs. malicious activities, understanding how different types of attacks propagate, and how different parts of the network interact under various conditions.

Hyperparameter: attacker starting point

Currently, the (first) attacker always starts in the same position. This could be randomized. Or maybe not: in reality, the attacker cannot simply appear from nowhere, so perhaps that should not be allowed in the simulations either.

Replace or fix PyRDDLGym

For unclear reasons, PyRDDLGym does not seem to scale well as the graph size increases. We should either explore whether that issue can be remedied, or replace it with a better simulation engine.

Compress log history window

One compression algorithm is to transform
[1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1]
into
[11 4 4 2]
But integrating an LSTM or using some temporal GNN (#17) might be a better option.
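The transformation above is gap encoding: store the distances between successive events rather than the raw 0/1 window. A minimal sketch (assuming, as in the example, that the offset of the first event is left implicit):

```python
def compress(window):
    """Encode a 0/1 log window as the gaps between successive events."""
    idx = [i for i, bit in enumerate(window) if bit]
    return [b - a for a, b in zip(idx, idx[1:])]

window = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
compress(window)  # -> [11, 4, 4, 2]
```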

Conv nets for log window patterns?

The logs will contain the same toxic patterns shifted in time as the simulation proceeds. It would be good if the model didn't need to learn these anew for each position in the log window. Perhaps conv nets could be used? This is also an alternative to temporal GNNs (#17) for handling time.
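The shift argument is the standard motivation for running a 1-D convolution along the time axis of the log window: one learned filter covers every position. A dependency-free sketch of that property:

```python
def cross_correlate(window, kernel):
    """Slide a pattern detector across a log window (1-D convolution
    without kernel flipping); one set of weights covers every offset."""
    k = len(kernel)
    return [sum(w * v for w, v in zip(kernel, window[i:i + k]))
            for i in range(len(window) - k + 1)]

pattern = [1, 0, 1]             # a toy "toxic" event pattern
early = [1, 0, 1, 0, 0, 0, 0]   # pattern at the start of the window
late = [0, 0, 0, 0, 1, 0, 1]    # same pattern, later in the window
scores_early = cross_correlate(early, pattern)
scores_late = cross_correlate(late, pattern)
# The same filter produces the same peak response at both positions;
# only the location of the peak moves with the pattern.
```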

Improve prediction accuracy

The prediction accuracy should be improved. Here is an example of the current state.

This is what the model (the best performer is a GAT) sees:
https://github.com/pontusj101/rddl_training_data_producer/assets/9636072/e26da30b-f7ca-4d85-851d-7b64e13a1a2b

This is what it predicts:
https://github.com/pontusj101/rddl_training_data_producer/assets/9636072/4c594dd8-76ca-45bf-a3ab-b5d2227622b8

And this is what actually happened:
https://github.com/pontusj101/rddl_training_data_producer/assets/9636072/f80a08e7-5eb9-433d-b18e-8df7be2b2459

Improvements can be achieved in different ways:

  • Try out new architectures, such as temporal GNNs (#17).
  • Modify aspects of the training data, such as the attacker policy (probably very important), the total amount of data, the amount of uncompromised snapshot sequences, etc. Many of these can be explored through hyperparameter optimization, e.g. in GCP Vertex AI.
  • Train for longer, or with other training hyperparameters, such as the number of layers and neurons, batch size, learning rate, etc. These can also be explored through hyperparameter optimization. I have seen in experiments that more neurons are better, so we should push that limit. That will probably require GPU acceleration (#18), which we should have in any event.

Tabular or MLP baseline

Make a baseline for fixed and small networks to compare with the GNNs. A tabular approach, simply recording the probabilities of compromise for different log histories, would probably be a good choice. There is some old code in the code base aiming to do this, but it would need to be refreshed.
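On a fixed, small network, the tabular baseline reduces to a lookup table from observed log history to empirical compromise frequency; the class and names below are illustrative, not the old code referred to above:

```python
from collections import defaultdict

class TabularBaseline:
    """Record, per (node, log-history) pair, how often the node turned
    out to be compromised in the training simulations."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # [compromised, total]

    def update(self, node, history, compromised):
        key = (node, tuple(history))
        self.counts[key][0] += int(compromised)
        self.counts[key][1] += 1

    def predict(self, node, history, prior=0.5):
        compromised, total = self.counts.get((node, tuple(history)), (0, 0))
        # Unseen histories fall back to a prior probability.
        return compromised / total if total else prior

model = TabularBaseline()
model.update("host1", [0, 1, 0], compromised=True)
model.update("host1", [0, 1, 0], compromised=True)
model.update("host1", [0, 1, 0], compromised=False)
```

Unlike the GNNs, this table cannot generalize across network topologies, which is exactly why it makes a useful lower-bound comparison on fixed networks.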

GPU acceleration

I have not succeeded in running on CUDA, as the GPUs I could access on GCP required older versions of PyTorch, which RDDL did not accept. It would, of course, be better to find more capable GPUs (A100 or H100) than the measly P100 I could access. Different data centers (GCP regions) feature different hardware.
