Topological botnet detection datasets and graph neural network applications

License: MIT License

botnet-detection's Introduction

botnet-detection

Topological botnet detection datasets and automatic detection with graph neural networks.

A collection of different botnet topologies overlaid onto normal background network traffic, containing featureless graphs of relatively large scale for inductive learning.

Installation

From source

git clone https://github.com/harvardnlp/botnet-detection
cd botnet-detection
python setup.py install

To Load the Botnet Data

We provide standard, easy-to-use datasets and data loaders that automatically handle dataset downloading and standard data splitting. They are compatible with most graph learning libraries through the graph_format argument:

from botdet.data.dataset_botnet import BotnetDataset
from botdet.data.dataloader import GraphDataLoader

botnet_dataset_train = BotnetDataset(name='chord', split='train', graph_format='pyg')
botnet_dataset_val = BotnetDataset(name='chord', split='val', graph_format='pyg')
botnet_dataset_test = BotnetDataset(name='chord', split='test', graph_format='pyg')

train_loader = GraphDataLoader(botnet_dataset_train, batch_size=2, shuffle=False, num_workers=0)
val_loader = GraphDataLoader(botnet_dataset_val, batch_size=1, shuffle=False, num_workers=0)
test_loader = GraphDataLoader(botnet_dataset_test, batch_size=1, shuffle=False, num_workers=0)

The choices for dataset name are (indicating different botnet topologies):

'chord' (synthetic, 10k botnet nodes)
'debru' (synthetic, 10k botnet nodes)
'kadem' (synthetic, 10k botnet nodes)
'leet' (synthetic, 10k botnet nodes)
'c2' (real, ~3k botnet nodes)
'p2p' (real, ~3k botnet nodes)

The choices for graph_format are (selecting the graph data format of the corresponding graph library):

'pyg' for PyTorch Geometric
'dgl' for DGL
'nx' for NetworkX
'dict' for plain python dictionary

Based on different choices of the above argument, when indexing the botnet dataset object, it will return a corresponding graph data object defined by the specified graph library.

The data loader handles automatic batching and is agnostic to the specific graph learning library.
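
To illustrate what batching several graphs means, here is a minimal, library-free sketch. It is not the GraphDataLoader implementation; the function name `collate_graphs` and the dictionary keys `num_nodes`, `edges`, and `graph_ids` are assumptions made for this example. The idea is to merge the graphs into one larger disconnected graph by offsetting node indices.

```python
def collate_graphs(graphs):
    """Merge graphs into one big disconnected graph (illustrative only).

    Each graph is a dict with hypothetical keys:
      'num_nodes': int, 'edges': list of (src, dst) pairs with local indices.
    """
    edges = []
    graph_ids = []  # graph_ids[i] = index of the graph that node i came from
    offset = 0
    for gid, g in enumerate(graphs):
        # Shift local node indices so they stay unique in the merged graph.
        edges.extend((s + offset, d + offset) for s, d in g["edges"])
        graph_ids.extend([gid] * g["num_nodes"])
        offset += g["num_nodes"]
    return {"num_nodes": offset, "edges": edges, "graph_ids": graph_ids}


g1 = {"num_nodes": 3, "edges": [(0, 1), (1, 2)]}
g2 = {"num_nodes": 2, "edges": [(0, 1)]}
batch = collate_graphs([g1, g2])
print(batch["edges"])      # [(0, 1), (1, 2), (3, 4)]
print(batch["graph_ids"])  # [0, 0, 0, 1, 1]
```

Because the merged graph has no edges between its components, message passing on the batch behaves like message passing on each graph separately.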

To Evaluate a Model Predictor

We prepare a standardized evaluator for easy evaluation and comparison of different models. First load the dataset class with BotnetDataset and the evaluation function eval_predictor. Then define a simple wrapper of your model as a predictor function (see examples), which takes in a graph from the dataset and returns the prediction probabilities for the positive class (as well as the loss from the forward pass, optionally).

We mainly use the average F1 score to compare across models. For example, to get evaluations on the chord test set:

from botdet.data.dataset_botnet import BotnetDataset
from botdet.eval.evaluation import eval_predictor
from botdet.eval.evaluation import PygModelPredictor

botnet_dataset_test = BotnetDataset(name='chord', split='test', graph_format='pyg')
predictor = PygModelPredictor(model)    # 'model' is some graph learning model
result_dict_avg, loss_avg = eval_predictor(botnet_dataset_test, predictor)

print(f'Testing --- loss: {loss_avg:.5f}')
print(' ' * 10 + ', '.join(['{}: {:.5f}'.format(k, v) for k, v in result_dict_avg.items()]))

test_f1 = result_dict_avg['f1']
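
For reference, the metric can be sketched in plain Python, assuming "average F1" means the F1 score of the positive class computed per graph and then averaged over graphs (check botdet.eval.evaluation for the exact definition). The function names, the 0.5 threshold, and the zero-score convention below are assumptions for this sketch.

```python
def f1_score_binary(y_true, y_prob, threshold=0.5):
    """F1 for the positive (botnet) class from per-node probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention chosen here when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(graphs):
    """Mean of per-graph F1; `graphs` is a list of (y_true, y_prob) pairs."""
    scores = [f1_score_binary(y, p) for y, p in graphs]
    return sum(scores) / len(scores)


graphs = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1]),  # one bot node missed
    ([0, 1, 0, 0], [0.1, 0.8, 0.7, 0.3]),  # one false alarm
]
print(round(average_f1(graphs), 4))
```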

To Train a Graph Neural Network for Topological Botnet Detection

We provide a set of graph convolutional neural network (GNN) models here with PyTorch Geometric, along with the corresponding training script (note: the training pipeline was tested with PyTorch 1.2 and torch-scatter 1.3.1). Various basic GNN models can be constructed and tested by specifying configuration arguments:

number of layers, hidden size
node updating model each layer (e.g. direct message passing, MLP, gated edges, or graph attention)
message normalization
residual hops
final layer type
etc. (check the model API and the training script)

As an example, to train a GNN model on the topological botnet datasets, simply run:

bash run_botnet.sh

With the above configuration, we run graph neural network models (12 layers, hidden size 32, random walk normalization, and residual connections) on each of the topologies. The results are as follows:

Topology	Chord	de Bruijn	Kademlia	LEET-Chord	C2	P2P
Test F1 (%)	99.061	99.926	98.935	99.231	98.992	98.692
Average	99.140

Note

We also provide labels on the edges under the name edge_y, which can be used for the complete botnet community recovery task, or for interpretation matters.

Citing

@article{zhou2020auto,
  title={Automating Botnet Detection with Graph Neural Networks},
  author={Jiawei Zhou*, Zhiying Xu*, Alexander M. Rush, and Minlan Yu},
  journal={AutoML for Networking and Systems Workshop of MLSys 2020 Conference},
  year={2020}
}

botnet-detection's Issues

TypeError: scatter_add() takes from 2 to 5 positional arguments but 6 were given

Traceback (most recent call last):
  File ".\train_botnet.py", line 270, in <module>
    scheduler, logger)
  File ".\train_botnet.py", line 138, in train
    x = model(batch.x, batch.edge_index)
  File "C:\Program Files\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\tusha\Downloads\botnet-detection\botdet\models_pyg\gcn_model.py", line 129, in forward
    xo = net(x, edge_index, edge_attr, deg, edge_weight, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\tusha\Downloads\botnet-detection\botdet\models_pyg\gcn_model.py", line 220, in forward
    xo = self.gcn(x, edge_index, edge_attr, deg, edge_weight, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\tusha\Downloads\botnet-detection\botdet\models_pyg\gcn_base_models.py", line 240, in forward
    x = scatter(self.aggr, x_j, edge_index[1], dim_size=x.size(0))
  File "C:\Users\tusha\Downloads\botnet-detection\botdet\models_pyg\common.py", line 56, in scatter
    out = op(src, index, 0, out, dim_size, fill_value)

Graph loading extremely slow?

Is it just me, or does getting the data loaders ready take much longer than it should? I'm on a machine with 512GB RAM (most of which is free), a 40-core Xeon Silver, SSD storage. Loading the graphs (train, test, val included) takes at least 30-40 minutes. For context, loading a graph with 170K nodes and 1.1M edges (via OGB) takes less than 10 seconds.

Is it normal to take this long to load these graphs? If yes, is there any way to speed this process up (apart from setting in_memory=False, which shifts the loading part to later on, if I understand correctly)?

Could you provide the input data format?

Could you give a detailed explanation of the HDF5 instance format for pyg, dgl, nx, or dict?

I want to draw the model architecture. Do you have any suggestions?

I saw that your code prints some information about the model architecture, like this.

Now I want to draw a picture of this architecture, like this:

Do you have any suggestions for me?
Thanks a lot.

xxxx Killed

Hi, I need your help.
When I run bash run_botnet.sh, I get the error below. Do you have a solution?

Mon Aug 29 08:22:24 2022

loading dataset...
model ----------
GCNModel(
  (gcn_net): ModuleList(
    (0): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 1, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 64)
      (non_linear): Identity()
    )
    (1): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (2): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (3): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (4): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (5): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (6): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (7): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (8): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (9): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (10): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
    (11): GCNLayer(
      (gcn): NodeModelAdditive (in_channels: 32, out_channels: 32, in_edgedim: None, deg_norm: rw, edge_gate: NoneType,aggr: add | number of parameters: 1056)
      (non_linear): Identity()
    )
  )
  (dropout): Dropout(p=0.0, inplace=False)
  (residuals): ModuleList(
    (0): Linear(in_features=1, out_features=32, bias=False)
    (1): Identity()
    (2): Identity()
    (3): Identity()
    (4): Identity()
    (5): Identity()
    (6): Identity()
    (7): Identity()
    (8): Identity()
    (9): Identity()
    (10): Identity()
    (11): Identity()
  )
  (non_linear): ReLU()
  (final): Linear(in_features=32, out_features=2, bias=True)
)
/content/botnet_detection/run_botnet.sh: line 3: 3960 Killed CUDA_VISIBLE_DEVICES=$gpu python /content/botnet_detection/train_botnet.py --devid 0 --data_dir ./data/botnet --data_name "$topo" --batch_size 2 --enc_sizes 32 32 32 32 32 32 32 32 32 32 32 32 --act relu --residual_hop 1 --deg_norm rw --final proj --epochs 50 --lr 0.005 --early_stop 1 --save_dir ./saved_models --save_name "$topo"_model_lay12_rh1_rw_ep50.pt

Tips/sample code for visualizing graphs on the dataset

Hi,

Could you provide some tips on visualizing the graph data? It's a bit hard when the data is wrapped in a Batch or another pre-defined graph class. Thanks again for the great work.

urllib.error.URLError: <urlopen error unknown url type: https>

Hi, when I use this package, the following error keeps bothering me:

raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: https>

I can't solve it. Could you please give some advice? Thank you.
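
One possible cause, stated as an assumption since the traceback alone does not confirm it, is a Python build without SSL support, in which case urllib has no handler for https URLs. A quick standard-library check (the helper names are made up for this sketch):

```python
import http.client


def https_supported():
    """Return True if this Python build can open https:// URLs with urllib.

    http.client only defines HTTPSConnection when the ssl module imports
    successfully, and urllib registers its https handler on that condition.
    """
    return hasattr(http.client, "HTTPSConnection")


def ssl_import_error():
    """Return the ImportError message for `import ssl`, or None if it works."""
    try:
        import ssl  # noqa: F401
    except ImportError as exc:
        return str(exc)
    return None


print("https supported:", https_supported())
print("ssl import error:", ssl_import_error())
```

If the first line prints False, the interpreter was probably built without OpenSSL, and switching to a Python build that includes the ssl module is the usual remedy.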

Graphs have parallel edges

It seems the data pre-processing does not convert the graphs to simple graphs. Even though they are undirected and unweighted, some nodes in the graphs have multiple parallel edges. This bug seems to impact graphs loaded via torch_geometric and dgl - networkx already handles this.

Thankfully, the number of such parallel edges is not significant - I did a quick check on the train and validation set, and the graphs have ~ 11 extra parallel edges on average. Not a big issue (it should not impact model performance or results in any way), though.
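
A minimal sketch of how such duplicates could be counted and dropped from a plain edge list. The function name and the `undirected` flag are assumptions for this example; for formats that store both directions of every undirected edge, pass `undirected=False` so that (u, v) and (v, u) are not treated as copies of each other.

```python
def remove_parallel_edges(edges, undirected=True):
    """Deduplicate an edge list given as (u, v) pairs.

    Returns (unique_edges, number_of_extra_copies). With undirected=True,
    (u, v) and (v, u) are treated as the same edge.
    """
    seen = set()
    unique = []
    for u, v in edges:
        # Canonical orientation so both directions map to the same key.
        key = (min(u, v), max(u, v)) if undirected else (u, v)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique, len(edges) - len(unique)


edges = [(0, 1), (1, 0), (1, 2), (1, 2), (2, 2)]
print(remove_parallel_edges(edges))                    # ([(0, 1), (1, 2), (2, 2)], 2)
print(remove_parallel_edges(edges, undirected=False))  # ([(0, 1), (1, 0), (1, 2), (2, 2)], 1)
```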

torch_scatter==1.3.1 installation was unsuccessful and torch_scatter==2.0.9 raises an exception

Thanks for your paper and repo.

Thanks for your paper and repo.
But I have a problem.
When I run the data loading script, it fails with the error "no module deepdish".
It is really hard to fix. Please help me.

I think there is a mistake

In scatter_ in common.py, the call out = op(src, index, 0, out, dim_size, fill_value) passes 6 arguments, but the error message says the function accepts only 2 to 5.
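
A likely explanation, stated as an assumption: torch-scatter 1.x accepted a sixth fill_value argument, and as far as I know torch-scatter 2.x dropped it, so the six-argument positional call fails. The sketch below uses pure-Python stand-ins (`scatter_add_old` and `scatter_add_new`, both hypothetical and 1-D only) to show one way a wrapper could tolerate both signatures.

```python
def _scatter_add_1d(src, index, out, dim_size, fill_value=0):
    """Reference 1-D scatter-add on Python lists."""
    if out is None:
        size = dim_size if dim_size is not None else max(index) + 1
        out = [fill_value] * size
    for value, i in zip(src, index):
        out[i] += value
    return out


# Hypothetical stand-in for the older six-argument signature.
def scatter_add_old(src, index, dim=-1, out=None, dim_size=None, fill_value=0):
    return _scatter_add_1d(src, index, out, dim_size, fill_value)


# Hypothetical stand-in for the newer five-argument signature.
def scatter_add_new(src, index, dim=-1, out=None, dim_size=None):
    return _scatter_add_1d(src, index, out, dim_size)


def scatter_compat(op, src, index, dim=0, out=None, dim_size=None, fill_value=0):
    """Call `op` with fill_value if it accepts one, otherwise without it."""
    try:
        return op(src, index, dim=dim, out=out, dim_size=dim_size,
                  fill_value=fill_value)
    except TypeError:
        # Caveat: this also hides unrelated TypeErrors raised inside `op`.
        return op(src, index, dim=dim, out=out, dim_size=dim_size)


src, index = [1.0, 2.0, 3.0], [0, 0, 2]
print(scatter_compat(scatter_add_old, src, index, dim_size=3))
print(scatter_compat(scatter_add_new, src, index, dim_size=3))
```

With the real library, simpler options are to drop fill_value from the call in common.py or to install the versions the README says the pipeline was tested with (PyTorch 1.2 and torch-scatter 1.3.1).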

Synthetic data generation

Hi.

Fascinating work! Nice to finally see a usable collection of network-based datasets for GNN driven. Could you please add the code used for synthetic graph generation? I'm looking to generate more graphs with some additional constraints, so I'd appreciate even any useful links or pointers to generation code used by the authors 😸

Questions about the Graph Dataset

Thank you for sharing the code of your publication; it is really helpful.
Can I ask how you structured the data into graphs?

What features did you use for the node and edge embeddings? And how did you decide on that?
How did you decide which nodes are connected? Or on what basis did you generate the connections between the nodes?

Thanks a lot

Correspondences of normalization between code and paper

The paper describes random walk normalization instead of symmetric normalization.

In the source code, I think it is implemented in an incorrect way.

Here we normalize each node based on the inverse of its degree, then we do aggregation. I think we have to aggregate all features, then in the end apply an inverse degree normalization.

Assume x1, x2, x3 are nodes. If we want update representation of x3 node and it is connected with x1 and x2, we calculate in the following way relu(x1/degree_x1 + x2/degree_x2), but actually we have to do relu(x1/degree_x1 + x2/degree_x1).
So I think we have to normalize based on source node degree.

Please correct me if I understood something wrong.
Thanks in advance.
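
To make the two orderings in this question concrete, here is a small sketch on a toy graph. It does not claim which one the repository implements; self-loops and the nonlinearity are left out for brevity, features are scalars, and all names are made up for this example. In matrix terms, the first function computes a row of A D^{-1} x and the second a row of D^{-1} A x.

```python
def neighbor_lists(num_nodes, edges):
    """Adjacency lists for an undirected graph given as (u, v) pairs."""
    nbrs = [[] for _ in range(num_nodes)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs


def normalize_then_aggregate(x, nbrs, i):
    """Each neighbor j contributes x[j] / deg(j)."""
    return sum(x[j] / len(nbrs[j]) for j in nbrs[i])


def aggregate_then_normalize(x, nbrs, i):
    """Sum the neighbor features, then divide by deg(i)."""
    return sum(x[j] for j in nbrs[i]) / len(nbrs[i])


# Node 2 is connected to nodes 0 and 1; node 0 also has an extra neighbor 3.
edges = [(0, 2), (1, 2), (0, 3)]
x = [4.0, 6.0, 1.0, 2.0]
nbrs = neighbor_lists(4, edges)

print(normalize_then_aggregate(x, nbrs, 2))  # 4/2 + 6/1 = 8.0
print(aggregate_then_normalize(x, nbrs, 2))  # (4 + 6)/2 = 5.0
```

The two results differ whenever neighbors have different degrees, so the choice matters for the model.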

Undirected or directed? And how to create a graph?

Hi, I need your help.

In your paper you say: "All the graphs are undirected and preprocessed to have self-loops to speed up training". You also say: "we propose to use a random walk style normalization $\bar{A} = D^{-1}A$ which only involves the degree of the source nodes to equate the normalized adjacency matrix to the corresponding probability transition matrix". Here you use the term "degree of the source nodes", which I think is equivalent to "out-degree". But "out-degree" is only used for directed graphs, so I am confused about whether your graphs are undirected or directed.
Why can self-loops speed up training? And what does "normalized adjacency matrix to the corresponding probability transition matrix" mean?
I looked at your code in the botgen folder. It seems to create a botnet by picking some nodes at random. So I want to ask: by randomly selecting bot nodes, is it possible to create a botnet with the same topology as in reality, and why?

Thanks for your help!
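
As one reading of the quoted sentence (an interpretation, not the authors' answer): for an undirected graph the adjacency matrix A is symmetric, but D^{-1}A generally is not, and each of its rows sums to 1, so entry (i, j) can be read as the probability that a random walk at node i steps to node j. A small sketch with a made-up helper name:

```python
def random_walk_matrix(num_nodes, edges, add_self_loops=True):
    """Return D^{-1} A as a dense list of lists for an undirected graph.

    With add_self_loops=True every node has degree >= 1, so no row
    division by zero can occur.
    """
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    if add_self_loops:
        for i in range(num_nodes):
            a[i][i] = 1.0
    # Divide each row by its degree (the row sum of A).
    return [[a_ij / sum(row) for a_ij in row] for row in a]


p = random_walk_matrix(3, [(0, 1), (1, 2)])
for row in p:
    print(row, "row sum =", sum(row))
```

In this path graph, p[0][1] is 1/2 while p[1][0] is 1/3, which shows how a normalized matrix can be asymmetric even though the underlying graph is undirected.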

Question about result

Hi, thanks for your repo.
I have one question. When I run the model on the Chord, Leet, Debru, Kadem, and C2 datasets, the results usually match those in the README, but on the P2P dataset the model yields >= 99.1% F1 (higher than the 98.692% you reported). I didn't change your settings; I only removed the fill value and changed some library versions so the code could run. In short, why does the model give significantly better results when I rerun it?
Thanks!

Question about CAIDA account

Really charming work.

Could you tell me how to get a CAIDA account, or point me to a guide on the CAIDA website? I find it hard to locate.

Running the training scripts in Jupyter: Issues with arguments

I translated part of the training script in train_botnet.py to a Jupyter notebook, and when calling train with the parameters provided in the script, I get:

TypeError: scatter_add() takes from 2 to 5 positional arguments but 6 were given

This seems to happen during the forward pass at x = model(batch.x, batch.edge_index) in train when aggr = 'add'. A similar issue occurs when aggr = 'mean', where the error changes to TypeError: scatter_mean() takes from 2 to 5 positional arguments but 6 were given, and likewise for aggr = 'max'.

Any ideas why this happens?

Signed integer is greater than maximum

I seem to be having an issue downloading from the 'chord' dataset. When I run the line

botnet_dataset_train = BotnetDataset(name='chord', split='train', graph_format='dict')

I seem to get this message:

Downloading https://zenodo.org/record/3689089/files/botnet_chord.tar.gz
Traceback (most recent call last):
  File "train_lstm.py", line 4, in <module>
    botnet_dataset_train = BotnetDataset(name='chord', split='train', graph_format='dict')
  File "/home/joshuavanstaden/.virtualenvs/cvML/lib/python3.8/site-packages/botdet-0.1.0-py3.8.egg/botdet/data/dataset_botnet.py", line 64, in __init__
    self.download()
  File "/home/joshuavanstaden/.virtualenvs/cvML/lib/python3.8/site-packages/botdet-0.1.0-py3.8.egg/botdet/data/dataset_botnet.py", line 133, in download
    path = download_url(self.url, self.raw_dir)
  File "/home/joshuavanstaden/.virtualenvs/cvML/lib/python3.8/site-packages/botdet-0.1.0-py3.8.egg/botdet/data/url_utils.py", line 54, in download_url
    f.write(data.read())
  File "/usr/lib/python3.8/http/client.py", line 471, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 612, in _safe_read
    data = self.fp.read(amt)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

Let me know if you need any more info. TIA.
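
A possible workaround, offered as a sketch rather than a confirmed fix: the traceback shows the whole response being read with a single data.read() call, and on some Python versions one SSL read of more than about 2 GiB is reported to raise this OverflowError. Streaming the body in fixed-size chunks avoids the large read. `download_in_chunks` and its parameters are hypothetical names, not part of botdet.

```python
import shutil
import urllib.request
from pathlib import Path


def download_in_chunks(url, dest_path, chunk_size=1 << 20):
    """Stream `url` to `dest_path`, reading at most `chunk_size` bytes at a time."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as f:
        # copyfileobj loops over response.read(chunk_size) until the body ends,
        # so no single read request exceeds chunk_size bytes.
        shutil.copyfileobj(response, f, length=chunk_size)
    return dest_path
```

For example, calling download_in_chunks with the Zenodo URL from the log above and a local target path would fetch the archive 1 MiB at a time; extraction and the rest of the dataset setup would still need to follow whatever BotnetDataset.download does afterwards.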

Issues replicating on Jupyter notebook

I have been trying to run parts of the code in train_botnet.py in a Jupyter notebook and have encountered some issues.

First, I get issues with from torch_geometric.utils import scatter_ as it says scatter_ does not exist. I believe that the version of torch_geometric I have is a bit higher. But then I saw the scatter_() in common.py in the repository under models_pyg. So, I changed the imports in graph_attention.py to point to that scatter_ function.

When doing the train() as per the bash script (in a notebook version), I get
TypeError: scatter_add() takes from 2 to 5 positional arguments but 6 were given

Is the scatter_ in common.py a replacement for the one in torch_geometric? It's a bit confusing how to redo the analysis correctly without knowing the right versions of the packages used in this project. Any input is appreciated.