ml4its / hackathon2021-anomalydetection

Anomaly Detection problem proposed at Hackathon 2021 organized by BrainNTNU and CogitoAI. The aim is to perform Unsupervised Anomaly Detection on a Radio Access Network (RAN) dataset shared by a Telenor Business Unit, with the possibility of leveraging the information on the position of the base stations.

Hackathon 2021:

"Anomaly Detection on Telenor network data"

Scope:

The aim is to perform Unsupervised Anomaly Detection on a Radio Access Network (RAN) dataset shared by a Telenor Business Unit, with the possibility of leveraging the information on the position of the base stations.

Context:

In the telecom domain, efficient and accurate anomaly detection is vital for continuously monitoring the key metrics of the network’s base stations and raising alerts about possible incidents in time. With constant upgrades to the network infrastructure, the arrival of 5G and the exponential increase in devices and antennas, such detection is infeasible without data-driven models that automate the task.

Most commonly, the anomalies to be detected do not concern single measurements but come from systems recording several counters, that is, generating multivariate time series. The difficulty in detecting anomalies in multivariate time series arises from the fact that the contexts and the correlations between the different features, time windows and neighbouring base stations have to be taken into account and examined. There are two main types of anomalies that are desirable to detect:

  1. point anomalies
  2. trend anomalies

The latter, corresponding to misconfigurations/failures in the network, are especially hard to recognise because they are not easily distinguishable from “normal” behaviour. Leveraging the correlations between the time series components and the topological information is therefore particularly important.
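The difference between the two anomaly types can be illustrated on a single synthetic counter. The sketch below is only a univariate baseline under assumed settings (a 24-hour window and hand-picked thresholds, none of which come from the hackathon material). It is not the method proposed by the organisers and it ignores the cross-feature and topological correlations discussed above.

```python
import numpy as np
import pandas as pd


def rolling_zscore(series: pd.Series, window: int = 24) -> pd.Series:
    """Z-score of each value against the `window` observations before it."""
    # shift(1) keeps the current value out of its own baseline.
    baseline = series.shift(1).rolling(window, min_periods=window)
    return (series - baseline.mean()) / baseline.std()


def level_shift(series: pd.Series, window: int = 24) -> pd.Series:
    """Mean of the last `window` values minus the mean of the `window` before them."""
    recent = series.rolling(window, min_periods=window).mean()
    return recent - recent.shift(window)


# Synthetic hourly counter (made-up values, not Telenor data).
rng = np.random.default_rng(0)
index = pd.date_range(
    "2020-01-01", periods=240, freq=pd.Timedelta(hours=1), tz="UTC"
)
values = 0.1 + 0.005 * rng.standard_normal(240)
values[100] += 0.2    # point anomaly: a single-hour spike
values[180:] += 0.03  # trend anomaly: a persistent level shift

series = pd.Series(values, index=index)

# Thresholds are illustrative assumptions, tuned to this toy series only.
point_flags = rolling_zscore(series).abs() > 6
trend_flags = level_shift(series).abs() > 0.02
```

The spike is obvious to the z-score, while the level shift only becomes visible once a full window of shifted values has accumulated, which is one reason trend anomalies are harder to catch early.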

Data

The data shared by Telenor consists of:

  • radio_kpis.csv: hourly aggregated RAN technical counters coming from 403 cells belonging to 31 different base stations
  • distance_matrix.csv: relative distance matrix of the cells.
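One way to use the distance matrix is to look up the nearest neighbours of a cell. The sketch below assumes that distance_matrix.csv is a square table with cell names as both row index and column labels; the README does not confirm this layout, so a small hand-made matrix stands in for the real file. The function name `nearest_cells` and the distance values are hypothetical.

```python
import pandas as pd


def nearest_cells(distances: pd.DataFrame, cell: str, k: int = 3) -> pd.Series:
    """Return the k cells closest to `cell`, excluding the cell itself."""
    row = distances.loc[cell].drop(labels=cell)
    return row.nsmallest(k)


# Toy stand-in for the real file. If the assumed layout holds, the real
# matrix could be loaded with pd.read_csv("distance_matrix.csv", index_col=0).
cells = ["00_11X", "00_21X", "01_11X", "02_11X"]
distances = pd.DataFrame(
    [
        [0.0, 0.0, 0.4, 0.9],
        [0.0, 0.0, 0.4, 0.9],
        [0.4, 0.4, 0.0, 0.6],
        [0.9, 0.9, 0.6, 0.0],
    ],
    index=cells,
    columns=cells,
)

neighbours = nearest_cells(distances, "00_11X", k=2)
```

Neighbour lists like this can then be used to compare a cell's counters with those of nearby cells.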

Data Counters

All counters are normalised.

| column name | data type | description |
|---|---|---|
| timestamp | timestamp | the metric values correspond to the hour following the timestamp |
| cell_name | string | name of the cell |
| avail_period_duration | double | hourly rate the cell was available |
| bandwidth | decimal(20,1) | total available bandwidth for the sector in PRBs (3G is also mapped into PRB-like measures: 12.5 PRBs per carrier) |
| num_voice_attempts | double | total number of voice related attempts |
| num_data_attempts | double | total number of data related attempts |
| voice_failure_rate | double | total voice failure rate |
| data_failure_rate | double | total data failure rate |
| unavail_unplan_rate | double | hourly rate the cell was unavailable due to unplanned causes |
| unavail_total_rate | double | total unavailable hourly rate |
| voice_setup_failure_rate | double | voice related setup failure rate |
| voice_drop_rate | double | voice related drop rate |
| data_setup_failure_rate | double | data related setup failure rate |
| data_drop_rate | double | data related drop rate |
| thp_rate_tt_kpi | double | amount of downlink data transferred per user over the estimated user throughput |
| ho_failure_rate | double | handover failure rate (inter-, intra-frequency, inter-, intra-technology) |

The cell name is a string of letters and digits with a particular meaning, corresponding to the hierarchical structure of the base station.

  • Base stations - also called sites - beam signals to a 360° area around them.
  • Each site is divided into three sectors covering an area of 120°.
  • Multiple cells belong to each sector, each running at a prescribed frequency. Cells in the same sector running on the same frequency are identified by their carrier number. The numbering corresponds to their installation order.

There are two types of cells:

  • coverage cells: run at lower frequencies (700, 800, 900 MHz) and aim to “cover” a larger area around the site.
  • capacity cells: run at higher frequencies (1800, 2100, 2600 MHz) and serve a smaller area around the site, with a better quality signal.
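The two cell types follow directly from the frequency bands listed above, so the mapping can be written as a small lookup. This is a minimal sketch; the names `COVERAGE_MHZ`, `CAPACITY_MHZ` and `cell_type` are illustrative and not part of the dataset.

```python
# Frequency bands (MHz) as listed in the README.
COVERAGE_MHZ = frozenset({700, 800, 900})
CAPACITY_MHZ = frozenset({1800, 2100, 2600})


def cell_type(frequency_mhz: int) -> str:
    """Classify a cell as 'coverage' or 'capacity' from its frequency in MHz."""
    if frequency_mhz in COVERAGE_MHZ:
        return "coverage"
    if frequency_mhz in CAPACITY_MHZ:
        return "capacity"
    raise ValueError(f"unknown frequency band: {frequency_mhz} MHz")
```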

With this structure in mind, the cell_name has the form 'XX_ija', where:

  • XX in {00,01,02,..,30} denotes the site the cell belongs to;
  • i in {1,2,3} denotes the sector the cell belongs to;
  • j in {1,2,...} denotes the carrier;
  • a in {'Z','X','Y','W','V','R','Q','P'} denotes the technology and frequency of the cell based on the table below.

| key | technology | frequency |
|---|---|---|
| 'Z' | 4G | 2100 MHz |
| 'X' | 4G | 800 MHz |
| 'Y' | 2G | 900 MHz |
| 'W' | 4G | 2600 MHz |
| 'V' | 3G | 900 MHz |
| 'R' | 4G | 1800 MHz |
| 'Q' | 3G | 2100 MHz |
| 'P' | 2G | 1800 MHz |
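
The naming scheme can be decoded with a regular expression and the key table above. The sketch below is one possible parser; the `Cell` class and `parse_cell_name` function are hypothetical helpers. It assumes the carrier may span one or more digits, since the README only gives j in {1,2,...}, and it does not check that the site lies in 00..30.

```python
import re
from dataclasses import dataclass

# key -> (technology, frequency in MHz), copied from the table in the README.
TECH_FREQ = {
    "Z": ("4G", 2100),
    "X": ("4G", 800),
    "Y": ("2G", 900),
    "W": ("4G", 2600),
    "V": ("3G", 900),
    "R": ("4G", 1800),
    "Q": ("3G", 2100),
    "P": ("2G", 1800),
}

# 'XX_ija': two-digit site, sector 1-3, carrier digits, technology key.
CELL_NAME_RE = re.compile(
    r"(?P<site>\d{2})_(?P<sector>[123])(?P<carrier>\d+)(?P<key>[ZXYWVRQP])"
)


@dataclass(frozen=True)
class Cell:
    site: str
    sector: int
    carrier: int
    technology: str
    frequency_mhz: int


def parse_cell_name(name: str) -> Cell:
    """Split a cell_name such as '02_21Y' into its hierarchical parts."""
    match = CELL_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid cell_name: {name!r}")
    technology, frequency_mhz = TECH_FREQ[match["key"]]
    return Cell(
        site=match["site"],
        sector=int(match["sector"]),
        carrier=int(match["carrier"]),
        technology=technology,
        frequency_mhz=frequency_mhz,
    )
```

For example, `parse_cell_name("02_21Y")` yields site "02", sector 2, carrier 1, technology 2G at 900 MHz. Grouping cells by `(site, sector)` then gives the sets of cells that share a sector.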

Sample rows of the radio_kpis.csv dataset, in CSV format

cell_name,timestamp,avail_period_duration,bandwidth,num_voice_attempts,num_data_attempts,voice_failure_rate,data_failure_rate,unavail_unplan_rate,unavail_total_rate,voice_setup_failure_rate,voice_drop_rate,data_setup_failure_rate,data_drop_rate,thp_rate_tt_kpi,ho_failure_rate
02_21Y,2019-12-31 23:00:00+00:00,1.0,0.49975,0.001335,0.012488,0.0,0.000000,0.0,0.348986,0.0,0.0,0.000000,0.000000,0.000098,0.333333
11_31Y,2019-12-31 23:00:00+00:00,1.0,0.49975,0.028037,0.049471,0.0,0.000772,0.0,0.348986,0.0,0.0,0.000373,0.000644,0.000054,0.334979
25_21X,2019-12-31 23:00:00+00:00,1.0,0.49975,0.000000,0.000000,NaN,NaN,0.0,0.348986,NaN,NaN,NaN,NaN,NaN,NaN
00_22Z,2019-12-31 23:00:00+00:00,1.0,1.00000,0.005340,0.011638,0.0,0.000000,0.0,0.348986,0.0,0.0,0.000000,0.000000,0.000084,0.333333
11_21Z,2019-12-31 23:00:00+00:00,1.0,1.00000,0.148198,0.070752,0.0,0.001529,0.0,0.348986,0.0,0.0,0.000261,0.001442,0.000074,0.336182
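
Rows in this format can be read with pandas. The sketch below parses three of the sample rows from an in-memory string (for the real file, the path would replace the `io.StringIO` object) and adds a `period_end` column, a name introduced here for illustration, to make explicit that each row describes the hour following its timestamp.

```python
import io

import pandas as pd

SAMPLE = """\
cell_name,timestamp,avail_period_duration,bandwidth,num_voice_attempts,num_data_attempts,voice_failure_rate,data_failure_rate,unavail_unplan_rate,unavail_total_rate,voice_setup_failure_rate,voice_drop_rate,data_setup_failure_rate,data_drop_rate,thp_rate_tt_kpi,ho_failure_rate
02_21Y,2019-12-31 23:00:00+00:00,1.0,0.49975,0.001335,0.012488,0.0,0.000000,0.0,0.348986,0.0,0.0,0.000000,0.000000,0.000098,0.333333
25_21X,2019-12-31 23:00:00+00:00,1.0,0.49975,0.000000,0.000000,NaN,NaN,0.0,0.348986,NaN,NaN,NaN,NaN,NaN,NaN
11_21Z,2019-12-31 23:00:00+00:00,1.0,1.00000,0.148198,0.070752,0.0,0.001529,0.0,0.348986,0.0,0.0,0.000261,0.001442,0.000074,0.336182
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Parse the timestamps as timezone-aware UTC values.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Each row covers the hour that follows its timestamp.
df["period_end"] = df["timestamp"] + pd.Timedelta(hours=1)

# One long multivariate series per cell: index by cell and time.
df = df.set_index(["cell_name", "timestamp"]).sort_index()
```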

Comments:

  • NaN values can have several causes: the denominator was 0 when the rate was calculated (e.g. a data drop rate of NaN means that there were no data attempts), the cell was unavailable and could not record signals, or values are simply missing from the data for technical reasons.
  • Very low throughput indicates an anomaly (the resource allocated per user is too low to satisfy the user's needs).
  • A high number of data/voice attempts is also an indication of an anomaly (some parameter misconfiguration is occurring).
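
These comments can be turned into simple first-pass checks. The sketch below labels missing `data_drop_rate` values with a plausible cause and flags extreme throughput and attempt counts. The source gives no thresholds, so the quantile cut-offs are placeholder assumptions, as is reading `avail_period_duration == 0` as "cell unavailable" on normalised data. The toy values and function names are made up for illustration.

```python
import numpy as np
import pandas as pd


def explain_missing_drop_rate(df: pd.DataFrame) -> pd.Series:
    """Guess why data_drop_rate is NaN for each row (heuristic only)."""
    missing = df["data_drop_rate"].isna()
    cause = pd.Series("unknown", index=df.index, dtype=object)
    # No data attempts: the rate has a zero denominator.
    cause[missing & (df["num_data_attempts"] == 0)] = "zero_denominator"
    # Assumption: availability 0 means the cell could not record anything.
    cause[missing & (df["avail_period_duration"] == 0)] = "cell_unavailable"
    cause[~missing] = "not_missing"
    return cause


def flag_extremes(df: pd.DataFrame, low_q: float = 0.25, high_q: float = 0.75) -> pd.DataFrame:
    """Flag very low throughput and very high data attempts by quantile."""
    thp = df["thp_rate_tt_kpi"]
    attempts = df["num_data_attempts"]
    return pd.DataFrame(
        {
            # Comparisons with NaN evaluate to False, so missing rows stay unflagged.
            "low_throughput": thp < thp.quantile(low_q),
            "high_attempts": attempts > attempts.quantile(high_q),
        },
        index=df.index,
    )


# Toy rows (hypothetical values in the normalised range).
toy = pd.DataFrame(
    {
        "avail_period_duration": [1.0, 1.0, 0.0, 1.0, 1.0],
        "num_data_attempts": [0.012, 0.0, 0.0, 0.070, 0.9],
        "data_drop_rate": [0.0, np.nan, np.nan, np.nan, 0.001],
        "thp_rate_tt_kpi": [0.000098, np.nan, np.nan, 0.000074, 0.000001],
    },
    index=["a", "b", "c", "d", "e"],
)

causes = explain_missing_drop_rate(toy)
flags = flag_extremes(toy)
```

On real data the quantiles would more sensibly be computed per cell or per cell type, since coverage and capacity cells carry very different traffic.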

Contributors: ruoccoma, saraannem-telenor
