supercowpowers / zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

License: MIT License

Languages: Makefile 0.25%, Python 15.32%, Jupyter Notebook 84.35%, Dockerfile 0.08%
Topics: python, networking, security, bro, pandas, scikit-learn, spark, zeek, kafka, zeek-analysis

zat's Introduction

Zeek Analysis Tools (ZAT)


The ZAT Python package supports the processing and analysis of Zeek data with Pandas, scikit-learn, Kafka, and Spark.

Install

pip install zat
pip install zat[pyspark] (includes the pyspark library)
pip install zat[all] (includes pyarrow, yara-python, and tldextract)

Getting Started

AWS Data Processing and ML Modeling

Installing on Raspberry Pi!

Recent Improvements

Video Presentation

Why ZAT?

Zeek already has a flexible, powerful scripting language, so why should I use ZAT?

Offloading: Complex tasks like statistics, state machines, machine learning, etc. should be offloaded from Zeek so that Zeek can focus on the efficient processing of high-volume network traffic.

Data Analysis: We have a large set of support classes that help bridge from raw Zeek data to packages like Pandas, scikit-learn, Kafka, and Spark. We also have example notebooks that show step-by-step how to get from here to there.

Analysis Notebooks

Documentation

https://supercowpowers.github.io/zat/

Running the Tests

pip install pytest coverage pytest-cov
pytest zat

About SuperCowPowers

The company was formed so that its developers could follow their passion for Python, streaming data pipelines, and having fun with data analysis. We also think cows are cool and should be superheroes, or at least carry around rayguns and burner phones. Visit SuperCowPowers

zat's People

Contributors

brandenwagner, brifordwylie, jonzeolla, keithjjones, kwbrandenwagner, swedishmike, the-alchemist, welch-b, zachmullen


zat's Issues

Expand Bro logs Reader with Pattern and Compression Support

In addition to being able to load a single uncompressed file with the BroLogReader:

reader = bro_log_reader.BroLogReader('http.log')

Add support for loading multiple files (with optional gz detection)

reader = bro_log_reader.BroLogReader('http(.*).log.gz')

This would facilitate loading a set of files for analysis.
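A minimal sketch (an assumption, not the reader's implementation) of how the pattern expansion plus optional gzip detection could work using only the standard library:

    import glob
    import gzip

    def open_log_files(pattern):
        """Yield an open text handle for every file matching the pattern, gz or not."""
        for path in sorted(glob.glob(pattern)):
            if path.endswith('.gz'):
                yield gzip.open(path, 'rt')   # text mode so lines come back as str
            else:
                yield open(path, 'r')

    # Usage: read every rotated, compressed http log in the current directory
    for handle in open_log_files('http*.log.gz'):
        for line in handle:
            pass  # feed each line to the log parser
        handle.close()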

Issue in reading various log files where service column has various values

#Operating system - Windows -64 bit

Jupyter Notebook - Python3

import pandas as pd
import numpy as np
import os
import gzip
import re
import sklearn
import sqlalchemy
from sqlalchemy import create_engine
from bat import bro_log_reader  # provides BroLogReader used below

# pandas python code
reader = bro_log_reader.BroLogReader('storm_conn.log.txt')
storm_df = pd.DataFrame(reader.readrows())
print(storm_df.head())

#Sample log file

#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path conn
#open 2016-08-01-22-37-35
#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
#types time string addr port addr port enum string interval count count string bool bool count string count count count count set[string]
9.503370 CDHyzp103R7vM3B4Di 10.0.2.104 64324 8.8.8.8 53 udp dns - - - S0 - - 0 D 1 62 0 0 (empty)
10.496136 CYRYDd4hzE1dGQnZ1f 10.0.2.104 64324 8.8.4.4 53 udp dns 0.001071 34 50 SF - - 0 Dd 1 62 1 78 (empty)
10.497730 CYetjF2BLsKlBeXLm 10.0.2.104 54512 8.8.4.4 53 udp dns 0.000924 34 62 SF - - 0 Dd 1 62 1 90 (empty)
6.840918 Ck1imZ29H4UmZErn6k :: 135 ff02::1:ffb8:a750 136 icmp - - - - OTH - - 0 - 1 64 0 0 (empty)
6.841024 Cv2Hil24LziEl96MV9 fe80::c06e:84b6:bcb8:a750 143 ff02::16 0 icmp - 0.500843 40 0 OTH - - 0 - 2 152 0 0 (empty)
6.840965 CCxmsT35XMpImxttC5 fe80::c06e:84b6:bcb8:a750 133 ff02::2 134 icmp - 8.001889 24 0 OTH - - 0 - 3 168 0 0 (empty)
6.600556 ChcQ99oXrhNvDh1yd fe80::c06e:84b6:bcb8:a750 546 ff02::1:2 547 udp - 63.681689 588 0 S0 - - 0 D 7 924 0 0 (empty)
169.437321 CtV9pH2iCY35WHgzz 10.0.2.104 49157 195.113.232.89 80 tcp http 0.002621 97 179 SF - - 0 ShADadfF 5 309 5 383 (empty)
166.786398 CSN7jI1CwHx1HRri27 fe80::c06e:84b6:bcb8:a750 56899 ff02::1:3 5355 udp dns 0.098369 44 0 S0 - - 0 D 2 140 0 0 (empty)
166.786838 CeI4z22y0pPPTNddwl 10.0.2.104 51506 224.0.0.252 5355 udp dns 0.098144 44 0 S0 - - 0 D 2 100 0 0 (empty)
167.095598 CTEJ5i3DStut47RUg3 10.0.2.104 137 10.0.2.255 137 udp dns 1.501089 150 0 S0 - - 0 D 3 234 0 0 (empty)
169.392290 CMNvAs1ZXKkapRqSIi 10.0.2.104 58727 8.8.8.8 53 udp dns 0.026124 34 140 SF - - 0 Dd 1 62 1 168 (empty)
169.421042 Chciyb4ArzaUY8ybWe 10.0.2.104 59236 8.8.8.8 53 udp dns 0.001232 34 140 SF - - 0 Dd 1 62 1 168 (empty)
147.217168 C9ZaK41wbQmwpoaBM1 :: 135 ff02::1:ffb8:a750 136 icmp - - - - OTH - - 0 - 1 64 0 0 (empty)
147.217241 C8TB0MUzTO5y95Prg fe80::c06e:84b6:bcb8:a750 133 ff02::2 134 icmp - 8.000417 24 0 OTH - - 0 - 3 168 0 0 (empty)
147.217324 C3v0Am2Hq7PyTLTedh fe80::c06e:84b6:bcb8:a750 143 ff02::16 0 icmp - 0.499631 40 0 OTH - - 0 - 2 152 0 0 (empty)
166.786946 CPTUIA3tLdBgAvvb8c 10.0.2.2 11 10.0.2.104 0 icmp - - - - OTH - - 0 - 0 0 0 0 (empty)
167.095837 CMwsTr3sIyp5AMYI9 10.0.2.2 3 10.0.2.104 0 icmp - - - - OTH - - 0 - 0 0 0 0 (empty)
276.375829 CcrgFD3Ql41215wEkc 10.0.2.104 49158 112.175.184.65 80 tcp http 0.684047 297 447 SF - - 0 ShADadfF 5 509 5 651 (empty)
159.293911 CFBrnw4dpzP9pYSWv8 fe80::c06e:84b6:bcb8:a750 546 ff02::1:2 547 udp - 63.043744 588 0 S0 - - 0 D 7 924 0 0 (empty)
277.082464 C1Jt3u5N88vTt65U1 10.0.2.104 49159 112.175.184.100 80 tcp http 0.644069 280 455 SF - - 0 ShADadfF 5 492 5 659 (empty)
277.750904 CAdXNp2YlyVLfcpnoh 10.0.2.104 49160 112.175.184.100 80 tcp http 0.671336 284 8205 SF - - 0 ShADadfF 7 576 9 8569 (empty)
273.482082 CcSdJ11pd1ra1c1hv4 fe80::c06e:84b6:bcb8:a750 57406 ff02::1:3 5355 udp dns 0.099220 44 0 S0 - - 0 D 2 140 0 0 (empty)
273.482430 CAtoscHO6Y9zTC9Fd 10.0.2.104 55240 224.0.0.252 5355 udp dns 0.099009 44 0 S0 - - 0 D 2 100 0 0 (empty)
273.781786 C57sgV3qGaWmDRhWN5 10.0.2.104 137 10.0.2.255 137 udp dns 1.501878 150 0 S0 - - 0 D 3 234 0 0 (empty)
276.090694 CNw3p02HMJ3eB8dKjl 10.0.2.104 54753 8.8.8.8 53 udp dns 0.283899 41 57 SF - - 0 Dd 1 69 1 85 (empty)
277.061944 CI6o9f3MBIEMSzqLa3 10.0.2.104 55875 8.8.8.8 53 udp dns 0.019935 31 47 SF - - 0 Dd 1 59 1 75 (empty)
277.728765 CFMb8KlLQqYZMzGgg 10.0.2.104 57520 8.8.8.8 53 udp dns 0.021402 35 51 SF - - 0 Dd 1 63 1 79 (empty)
273.482522 CBqcyX2WS1vqbl6fZj 10.0.2.2 11 10.0.2.104 0 icmp - - - - OTH - - 0 - 0 0 0 0 (empty)
273.781914 CsM80i2vCWNHKuSr87 10.0.2.2 3 10.0.2.104 0 icmp - - - - OTH - - 0 - 0 0 0 0 (empty)
338.427969 CoUu9t4mydgW6PWso8 10.0.2.104 49161 112.175.184.65 80 tcp http 0.628953 297 447 SF - - 0 ShADadfF 5 509 5 651 (empty)
339.057576 C5TxbK1SYv53S3WWN9 10.0.2.104 49162 112.175.184.100 80 tcp http 0.616180 280 455 SF - - 0 ShADadfF 5 492 5 659 (empty)

spark?

In the setup.py, there is a dependency on the python package spark. I was wondering if this is a typo and you mean pyspark? The spark package also exists but it's a web framework.

Refactor the LogToParquet class

The current class is quite bleh; we need a better approach to reading the log data in and converting it to a Parquet format.

can't read log file with bat

My code just hangs when trying to use bat on a log file

from bat import bro_log_reader

reader = bro_log_reader.BroLogReader('ssh.log')
for row in reader.readrows():
    print(row)

This code never completes or prints anything; it seems to have problems with the for loop.

How to run anomaly_detection.py

Hello Brian,

How do you run these .py scripts? I tried looking in the handbook but it doesn't say. Also, do custom-written scripts have to be put in $PREFIX/share/bro/site or a default folder (base), or do we put them in any folder that's suitable for what we are trying to log?

Kind regards
Daniel

Is Bat supposed to work under Windows?

I tried to do some work on a Windows machine and ran into some issues when trying to import log files.

If I follow the 'normal' pattern to import a log file, for example conn_df = LogToDataFrame('conn.log'), it crashes out with the following error message:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-13-c8896a68360f> in <module>()
----> 1 conn_df = LogToDataFrame('conn.log')

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bat\log_to_dataframe.py in __init__(self, log_filename, ts_index)
     22 
     23         # Create a bro reader on a given log file
---> 24         reader = bro_log_reader.BroLogReader(log_filename)
     25 
     26         # Create a Pandas dataframe from reader

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bat\bro_log_reader.py in __init__(self, filepath, delimiter, tail, strict)
     57                             'unknown': lambda x: x}
     58         self.dash_mapper = {'bool': False, 'count': 0, 'int': 0, 'port': 0, 'double': 0.0,
---> 59                             'time': datetime.datetime.fromtimestamp(0), 'interval': datetime.timedelta(seconds=0),
     60                             'string': '-', 'unknown:': '-'}
     61 

OSError: [Errno 22] Invalid argument

I did verify that the file is in the current dir etc by a quick:

print(os.listdir(os.curdir))
['.ipynb_checkpoints', 'conn.log', 'Connection Time Diagrams.ipynb', 'data']

I've tried this with both Jupyter Notebook and on the Python command prompt with the same result.

Using the same approach/files/directory structure works just fine under Linux. I don't really have access to a Mac so haven't been able to test this on that platform.
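For what it's worth, the traceback points at datetime.datetime.fromtimestamp(0), which is known to raise OSError: [Errno 22] on some Windows setups when the local timezone pushes the epoch value negative. A minimal workaround sketch (an assumption, not a patch from the repo):

    import datetime

    # Build the epoch default without the C-level fromtimestamp(0), which can
    # raise OSError 22 on Windows
    try:
        epoch = datetime.datetime.fromtimestamp(0)
    except OSError:
        epoch = datetime.datetime(1970, 1, 1)  # naive epoch fallback

    dash_defaults = {'time': epoch, 'interval': datetime.timedelta(seconds=0)}
    print(dash_defaults)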

Whitelisting Example

Hi,

From the start - I'm not sure if this is a good idea or not. It might have consequences that I have not thought about. However, here it goes...

In my experience there are quite a few vendors that use DNS for various purposes. A/V vendors, like Sophos [1] for example, use lookups for detection. These lookups tend to be quite weird looking and generate quite a lot of noise in DNS exfil rules and other such tools.

I had a look at the anomaly_detection.py script and sort of got a hamfisted piece of code together. It's not very elegant and definitely not very efficient, but at least it shows the idea to a certain extent. I put the 'maxmind.com' domain in there since that works on the provided example dns.log file and shows what I'm trying to achieve.

        # These domains could cause false positives, ie - long queries/responses, 
        # weird domain names etc.
        known_good = set(['sophosxl.net',  'maxmind.com']) 


        # Create a Pandas dataframe from a Bro log
        bro_df = log_to_dataframe.LogToDataFrame(args.bro_log)
        print('Read in {:d} Rows...'.format(len(bro_df)))

        # If the log type is dns and the known_good set is populated we will remove
        # rows if and when suitable/necessary.
        if log_type == 'dns' and len(known_good) > 0:
            print("Checking for 'known good' domains and will remove from the dataframe")
            for index, row in bro_df.iterrows():
                if any([domain in row['query'] for domain in known_good]):
                    # bro_df.drop(index, inplace=False)
                    bro_df.drop(index, inplace=True)
                else:
                    pass
            print("We now have {:d} Rows...".format(len(bro_df)))
        else:
            pass

1 - As stated, not sure if this is a good idea or not?

2 - I don't think I'm doing this in the most efficient way here; it seems to be quite an intensive operation the way I'm doing it. I'll have a look at whether there's a better, more Pythonic way of doing it (see the vectorized sketch after this list). Maybe there's a very established way of doing this that I just don't know about yet? Maybe it should even be done on the initial population of the dataframe somehow?

3 - I'm also aware that this could be done on the Bro side. Getting rid of these by using Bro scripting to never even log them would save both disk space and Splunk license costs and also resolve this issue. However, I think that being able to iterate through and 'whitelist' certain domains as you analyze your dns logs could be quite useful.
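A minimal vectorized sketch (an assumption, not code from the issue) of the same filtering done with a single regex and a boolean mask instead of iterrows():

    import re
    import pandas as pd

    known_good = {'sophosxl.net', 'maxmind.com'}

    # Tiny stand-in frame; real code would use the dataframe built from the dns.log above
    bro_df = pd.DataFrame({'query': ['abc123.sophosxl.net', 'www.example.com', 'geoip.maxmind.com']})

    pattern = '|'.join(re.escape(domain) for domain in known_good)
    mask = bro_df['query'].str.contains(pattern, na=False)
    bro_df = bro_df[~mask]
    print(bro_df)  # only 'www.example.com' remains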

What's your thoughts on this?

Cheers, Mike

[1]
1508220101.531867 CEypeHRT1TQrypVyk 192.1.1.1 57519 8.8.8.8 53 udp 21710 0.017866 3.1o19sr00or1942r9p3764opq8r30n2n4n91549o91253r94psn0o25s1o13r2p2.psqp3r741p635393648s22s52n1r1n0231997psrq7r530s1o70np98198np693.pn282s.rn3o9687n0r8o62pp317p94n645q2o618264op91.i.00.s.sophosxl.net 1 C_INTERNET 16 TXT 0 NOERROR F F T T 0 TXT 3 x c 258.000000 F

Figure out why coveralls is breaking

In the .travis.yml file we had to remove coveralls because it kept giving a 405 error... figure out what's going on and fix it. #73

    env: TOXENV=py27,coveralls,docs,style

Make a Spark load helper class

The 'standard' way of converting basically turns all the types into either StringType or LongType (the two largest types memory-wise). Also, more importantly, if a NaN is found it will either fail or do the wrong thing.

Refs:
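Below is a minimal sketch (an assumption, not the repo's helper class) of loading Zeek-style data into Spark with an explicit schema so the types are not all inferred as StringType/LongType:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

    spark = SparkSession.builder.appName('zeek_load').getOrCreate()

    # Tiny stand-in for a conn.log dataframe; NaN is common where Zeek logged '-'
    conn_df = pd.DataFrame({
        'id_orig_h': ['10.0.2.104', '10.0.2.104'],
        'id_resp_p': [53, 80],
        'orig_bytes': [34.0, float('nan')],
    })

    schema = StructType([
        StructField('id_orig_h', StringType(), True),
        StructField('id_resp_p', LongType(), True),
        StructField('orig_bytes', DoubleType(), True),  # double tolerates missing values
    ])

    # Fill NaNs (or keep them as doubles) before handing the pandas frame to Spark
    spark_df = spark.createDataFrame(conn_df.fillna(0.0), schema=schema)
    spark_df.printSchema()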

Make sure all notebooks run

Take a pass at all the notebooks and make sure they all run.

Some notebooks (like risky_domains) need third-party data files, so put in some super helpful messages for those.

Notebooks known to run:

  • Anomaly_Detection.ipynb
  • Bro_to_Scikit_Learn.ipynb
  • Bro_to_Spark_Cheesy.ipynb
  • Clustering_Picking_K.ipynb

Update code/examples that use df_stats

The df_stats import has 'moved'. Instead of

from bat.utils import df_stats

it's now

import bat.dataframe_stats as df_stats

So we need to update the code/examples to reflect the change.
See: #59

Not sure if this is a bug or just me misunderstanding things.

I tried to generate some graphs from the conn.log file when I ran into some problems.

If I follow the 'Bro_to_Plot.ipynb' framework and try the following:

conn_df['duration'].hist()

It bombs out with quite a lot of error output which culminates with:

TypeError: Cannot cast ufunc less input from dtype('float64') to dtype('<m8[ns]') with casting rule 'same_kind'

Some Googling later I tried with the following:

conn_df.duration = conn_df.duration.astype('timedelta64')

Once I did that the plot displayed properly.

I'm still very much a beginner when it comes to Pandas so I'm not sure if the problem lies in the generation of the Dataframe and the 'type assignment' or if it is something else?
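For reference, a minimal sketch (an assumption, not from the issue) of plotting a timedelta64 duration column without changing its dtype, by converting it to seconds just for the histogram:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Tiny stand-in for the conn.log dataframe's timedelta64[ns] duration column
    conn_df = pd.DataFrame({'duration': pd.to_timedelta([0.5, 1.2, 3.4], unit='s')})

    conn_df['duration'].dt.total_seconds().hist()
    plt.xlabel('Connection duration (seconds)')
    plt.show()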

Run from command line

I've been trying to get your package to work on the command line to get a feel for, and an understanding of, how it's doing its magic. Unfortunately, while Python (both 2.7 and 3.6) imports and uses the bat package without complaint, your basic example of pprint-ing the logs on the command line outputs nothing. The same goes for print, which leads me to believe there is no working implementation of this for the command line.

Running, say, python3.6 test.py, where test.py is a direct copy of the Bro-log-to-Python example in the docs, just hangs and does nothing. The documentation isn't specific about running from the command line, nor does it contain any information about why this isn't working. Any ideas?

Bro Log Generator to Pandas DataFrame optimization

If you look at the internals of the list-of-dicts-to-arrays method in the Pandas frame.py module, it looks fairly inefficient. So perhaps provide a BroThon class that does the conversion faster/better (a sketch follows the pandas code below), and also look at an Apache Arrow implementation.

def _list_of_dict_to_arrays(data, columns, coerce_float=False, dtype=None):
    if columns is None:
        gen = (list(x.keys()) for x in data)
        sort = not any(isinstance(d, OrderedDict) for d in data)
        columns = lib.fast_unique_multiple_list_gen(gen, sort=sort)

    # assure that they are of the base dict class and not of derived
    # classes
    data = [(type(d) is dict) and d or dict(d) for d in data]

    content = list(lib.dicts_to_array(data, list(columns)).T)
    return _convert_object_array(content, columns, dtype=dtype,
                                 coerce_float=coerce_float)
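A minimal sketch (an assumption, not the project's optimization) of sidestepping that pandas path by accumulating column lists first and building the DataFrame once; it assumes every row dict has the same keys, as Bro/Zeek log rows do:

    from collections import defaultdict
    import pandas as pd

    def rows_to_dataframe(row_generator):
        """Accumulate column lists from row dicts, then build the DataFrame in one shot."""
        columns = defaultdict(list)
        for row in row_generator:
            for key, value in row.items():
                columns[key].append(value)
        return pd.DataFrame(columns)

    # Hypothetical usage: df = rows_to_dataframe(reader.readrows())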

Parsing issue (suspect regex issue?)

The Anomaly detection notebook is unable to handle data containing "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)"

ValueError Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)"

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
1 # Create a Pandas dataframe from the Zeek HTTP log
2 log_to_df = LogToDataFrame()
----> 3 bro_df = log_to_df.create_dataframe('../data/http2.log')
4 print('Read in {:d} Rows...'.format(len(bro_df)))
5 bro_df.head()

/usr/local/lib/python3.6/dist-packages/zat/log_to_dataframe.py in create_dataframe(self, log_filename, ts_index, aggressive_category, usecols)
63
64 # Now actually read the Zeek Log using Pandas read CSV
---> 65 self._df = pd.read_csv(log_filename, sep='\t', names=header_names, usecols=usecols, dtype=pandas_types, comment="#", na_values='-')
66
67 # Now we convert 'time' and 'interval' fields to datetime and timedelta respectively

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.name = name

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
461
462 try:
--> 463 data = parser.read(nrows)
464 finally:
465 parser.close()

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
1152 def read(self, nrows=None):
1153 nrows = _validate_integer("nrows", nrows)
-> 1154 ret = self._engine.read(nrows)
1155
1156 # May alter columns / col_dict

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
2057 def read(self, nrows=None):
2058 try:
-> 2059 data = self._reader.read(nrows)
2060 except StopIteration:
2061 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

~/.local/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence_of_strings(cls, strings, dtype, copy)
325 @classmethod
326 def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
--> 327 scalars = to_numeric(strings, errors="raise")
328 return cls._from_sequence(scalars, dtype, copy)
329

~/.local/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
149 coerce_numeric = errors not in ("ignore", "raise")
150 values = lib.maybe_convert_numeric(
--> 151 values, set(), coerce_numeric=coerce_numeric
152 )
153

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)" at position 30242

Jupyter Notebooks: div float not working on Github viewer

I've already sent out a support email to Github on this:

Using a 'float' on a div used to work with the Github Jupyter notebook viewer but sometime during the last year it stopped working.

GitHub viewer (div float not working)

Same Notebook on NBViewer (div float working fine)

So, shrug, perhaps I'll change the links in the Readme to nbviewer links until GitHub fixes the div float issue... also, if anyone knows a workaround, I'm happy to change the notebook structure a bit.

Better error handling when the bro log file is not found.

Right now bro_log_reader.py doesn't raise an IOError exception when the bro log file is not found (or not readable), so that needs to be fixed. The log_to_dataframe class also needs to catch and rethrow that IOError so that scripts using either of these can properly catch it.
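A minimal sketch (an assumption, not the repo's exact fix) of checking the file up front and raising IOError from the reader's constructor:

    import os

    class BroLogReader(object):
        """Hypothetical fragment: fail fast if the log file is missing or unreadable."""
        def __init__(self, filepath):
            if not os.path.isfile(filepath) or not os.access(filepath, os.R_OK):
                raise IOError('Could not open/read the log file: {:s}'.format(filepath))
            self._filepath = filepath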

Consider breaking requirements into extras

In particular, if pyspark is not a core dependency, it would be good to move it under a "spark" extra, since that package is a 188 MB tarball download on my machine.
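A minimal sketch (an assumption about how the packaging could be laid out, not the project's actual setup.py) of declaring those extras:

    from setuptools import setup

    setup(
        name='zat',
        version='0.0.0',                                     # placeholder version
        install_requires=['requests', 'pandas'],             # hypothetical core-only list
        extras_require={
            'pyspark': ['pyspark'],                          # pip install zat[pyspark]
            'all': ['pyarrow', 'yara-python', 'tldextract'],  # pip install zat[all]
        },
    )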

Parquet/Arrow Nullable integer arrays

So right now we convert nullable integer arrays into 'float32' types before converting them into Parquet files. This is 'ok', but we might think of something more clever in the future.

    # Nullable integer arrays are currently not handled by Arrow
    # See: https://issues.apache.org/jira/browse/ARROW-5379
    # Cast nullable integer arrays to float32 before 'serializing'
    null_int_types = [pd.UInt16Dtype, pd.UInt32Dtype, pd.UInt64Dtype, pd.Int64Dtype]
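A minimal, self-contained sketch (an assumption, not the repo's exact code) of the cast described above:

    import pandas as pd

    # Tiny example frame with a nullable integer column (pd.NA stands in for a '-' field)
    df = pd.DataFrame({'orig_bytes': pd.array([34, None, 97], dtype='Int64')})

    null_int_types = [pd.UInt16Dtype, pd.UInt32Dtype, pd.UInt64Dtype, pd.Int64Dtype]
    for column in df.columns:
        if type(df[column].dtype) in null_int_types:
            df[column] = df[column].astype('float32')  # pd.NA becomes NaN

    print(df.dtypes)  # orig_bytes is now float32, which Arrow/Parquet can handle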

Change the way that the examples check the names of the log files

Having played around with the example files I might have found a small issue.

The code currently roughly looks like this:

    if not args.bro_log.endswith('ssl.log'):
        print('This example only works with Bro ssl.log files..')
        sys.exit(1)

The potential issue here, as I see it, is the use of 'endswith'.

Given that when Bro rotates the logfiles out they get a name like 'ssl.21:00:00-22:00:00.log.gz', we'd end up not being able to open an extracted file without manually renaming it.

Maybe we could do an 'endswith .log and contains ssl' check, or something along those lines.
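A minimal sketch (an assumption, not the examples' current code) of a looser check along those lines:

    import os

    def looks_like_ssl_log(path):
        """Accept both 'ssl.log' and rotated names like 'ssl.21:00:00-22:00:00.log.gz'."""
        name = os.path.basename(path)
        return 'ssl' in name and '.log' in name

    print(looks_like_ssl_log('ssl.log'))                       # True
    print(looks_like_ssl_log('ssl.21:00:00-22:00:00.log.gz'))  # True
    print(looks_like_ssl_log('conn.log'))                      # False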

Thoughts?

New example to try and find Tor connections and also port number statistics

@brifordwylie

I've just uploaded another branch called 'ssl_tor_and_port_count'. It adds two new files

  • data/tor_ssl.log
  • examples/tor_and_port_count.py

The example shows two more use cases for Bat.

  1. Some people see Tor as something bad and want to know if users are accessing Tor resources from their networks. One known way of doing this is to look for certificate issuers and subjects that follow a very distinctive pattern. This script uses regexes to try to locate connections that follow this pattern and reports them.

  2. Having encrypted traffic on unusual ports can be an indication of potentially malicious activity, so this script also outputs totals for the port numbers on which Bro has identified SSL traffic.

I've based this on one of the earlier examples, so it follows the 'normal standard' when it comes to the positional argument for the log file and so on. It will also accept the '-t' argument to tail a live log file.

The total number of identified potential Tor connections and the port number statistics currently only print when the program is not working on a live log file. I'm in two minds about whether I should change this so that the stats are also printed intermittently for a live log file.

I've tested this back and forth and it seems to work fairly well. If you have the time, could you please check out the branch, give it a go, and see if you can spot anything that's broken or anything you'd want me to change?

If you're ok with me adding this to the main branch - please let me know and I'll make sure to sync the branch to the latest version and then create a pull request for a more formal review.

Cheers, Mike

P.S
I have not looked at adding anything to the Docs yet - I thought I'd run the program past you before I got into that bit. ;-)

pd.Concat does not work directly on LogToDataFrame

Realized that doing a pd.concat on LogToDataFrame objects (which should work fine) does not actually work. It appears that concat tries to create a combined dataframe of the same class as those in the list, so when the constructor is called there's obviously no log file to read and we get an exception.
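A minimal workaround sketch (an assumption, not code from the issue), using a stand-in subclass to show the idea: cast back to plain DataFrames before concatenating so concat does not try to rebuild the subclass.

    import pandas as pd

    class LogFrame(pd.DataFrame):
        """Stand-in for LogToDataFrame: a subclass whose real constructor expects a log file."""
        pass

    df1 = LogFrame({'uid': ['C1'], 'proto': ['tcp']})
    df2 = LogFrame({'uid': ['C2'], 'proto': ['udp']})

    # Workaround: hand concat plain DataFrames so no subclass constructor is invoked
    combined = pd.concat([pd.DataFrame(df1), pd.DataFrame(df2)], ignore_index=True)
    print(combined)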

Improve log rotation handling

Right now the Bro log tailing 'kinda' handles log rotation but there are lots of little corner cases that we're not taking care of. We might consider using something like Pygtail (https://github.com/bgreenlee/pygtail). Looking at the project/code they've put a lot of work into handling all those crazy corner cases.

dataframe_to_matrix class needs to be 'upgraded'

Since our log_to_dataframe class now handles/encodes NaN values properly, the dataframe_to_matrix class also needs to make sure those are handled during the conversion to a numpy ndarray.
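A minimal sketch (an assumption, not the actual dataframe_to_matrix code) of one way the NaN handling could look during the ndarray conversion: fill numeric NaNs with zero and encode categorical columns, NaNs included, as integer codes.

    import numpy as np
    import pandas as pd

    def dataframe_to_ndarray(df):
        """Hypothetical helper: encode a mixed-type frame as a float ndarray, NaNs included."""
        df = df.copy()
        for column in df.columns:
            if pd.api.types.is_numeric_dtype(df[column]):
                df[column] = df[column].fillna(0)
            else:
                # Treat NaN as just another category value before encoding
                df[column] = df[column].fillna('NaN').astype('category').cat.codes
        return df.to_numpy(dtype=np.float64)

    # Usage with a tiny frame containing both numeric and categorical NaNs
    demo = pd.DataFrame({'orig_bytes': [10, None, 30], 'proto': ['tcp', None, 'udp']})
    print(dataframe_to_ndarray(demo))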

FEATURE REQUEST - Bro JSON

Greetings! Cool project and thank you for creating it and sharing it.

I opened this issue as a feature request: it would be nice if this also supported logs written by Bro using the JSON output. In some medium-to-large sensor deployments it is common to use JSON wherever possible to ease ETL and compatibility with things like the ELK stack, etc.
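For what it's worth, Bro/Zeek's JSON output writes one JSON object per line, so a minimal sketch (an assumption, not an existing ZAT feature) of loading such a log is just:

    import pandas as pd

    # Each line of a JSON-formatted Zeek log is a single JSON object
    dns_df = pd.read_json('dns.log', lines=True)
    print(dns_df.head())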

Unable to open bro logs - [Errno 22] Invalid argument

Tried to open conn.log and dns.log files using BAT.

When I try -
reader = bro_log_reader.BroLogReader('dns.log')

I get an error [Errno 22] Invalid argument

Traceback -

File "", line 52, in init
'time': datetime.datetime.fromtimestamp(0), 'interval': datetime.timedelta(seconds=0),

Please help

Huge requirements footprint

If I don't have Cython, scipy, or spark installed, trying to install bat will attempt to install those as well. All I want to do is pull a Bro log into a dictionary. Can we reduce the requirements footprint so that most dependencies are optional, and attempting to use or import a feature that requires one of them raises an exception explaining the need for it?

Is there an issue with resampling/plotting of larger sets of data?

I've successfully done some resampling and plotting of data from various dns.log files from different hosts in my environment following the layout/example in https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Plot.ipynb.

Then I thought I'd do the same for conn.log and this is where I run into problems. I'm not sure if the issue lies with Bat and/or Pandas (or for that matter myself) so I thought I'd post it as an issue.

The steps I take are as follows:

bro_df = LogToDataFrame('data/conn.log')
bro_df.info()

<class 'bat.log_to_dataframe.LogToDataFrame'>
DatetimeIndex: 3596641 entries, 2018-01-03 08:59:43.217746 to 2018-01-03 09:54:49.948516
Data columns (total 20 columns):
conn_state        object
duration          timedelta64[ns]
history           object
id.orig_h         object
id.orig_p         int64
id.resp_h         object
id.resp_p         int64
local_orig        bool
local_resp        bool
missed_bytes      int64
orig_bytes        int64
orig_ip_bytes     int64
orig_pkts         int64
proto             object
resp_bytes        int64
resp_ip_bytes     int64
resp_pkts         int64
service           object
tunnel_parents    object
uid               object
dtypes: bool(2), int64(9), object(8), timedelta64[ns](1)
memory usage: 528.2+ MB

bro_df['uid'].resample('1T').count().plot()
plt.xlabel('Connections per Minute')

This gives the following output:

(screenshot of the resulting plot omitted)

Which clearly looks incorrect.

I've tried using the rather small conn.log that's shipped in the /data directory for this repo and that works just fine which sort of leads me to think that it possibly could be related to the size of the dataframe?

I've also tried different resample settings and none of them work.

The index seems to be sequential and doesn't contain any rogue entries. There are no duplicates either.

I've verified the same behavior on both Windows and Linux.

Any ideas?

Anomaly detection for entire bro log long period

Hi, I need to perform anomaly detection on the entire Bro log data continuously. I found anomaly_detection_streaming.py and anomaly_detection.py, but I have no idea which one is meant for continuous monitoring over, say, one month. Even if I use anomaly_detection_streaming.py, when I restart the program it starts from the current log only and the old data is missed. How can I run the program for continuous monitoring over a long period of time without being affected by restarts?

Push new version to PyPI

With the new dataframe optimizations we should push a new version to PyPI once testing/docs/etc. are complete.

bro_log_reader freezes for large files

Three significant issues with the bro_log_reader class:

  1. For very large files (10^6 lines or greater) the readrows() method is an inefficient way to create pandas dataframes.
  2. The bro_log_reader class also does not optimize numpy types based on the number of unique instances of bro log objects.
  3. Finally, the dict-based approach in readrows() puts the columns out of the order expected from the bro log.
