
vc1492a / pynomaly

302 stars · 25 watchers · 36 forks · 34.94 MB

Anomaly detection using LoOP: Local Outlier Probabilities, a local density-based outlier detection method providing an outlier score in the range [0, 1].

License: Other

Python 97.08% TeX 2.92%
outlier-detection anomaly-detection machine-learning outlier-scores probability outliers anomalies nearest-neighbors

pynomaly's People

Contributors: bbambico, lmeshoo, michaelschreier, robmarkcole, vc1492a


pynomaly's Issues

Clarify orientation of 2d array

It might help to mention what the semantic orientation of the input data array is. Originally I had it transposed, which gave me a somewhat unhelpful index error. With a square array, perhaps for testing, you might not realize that the data is flipped.
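For reference, a minimal sketch of the orientation that works, assuming rows are observations and columns are features:

import numpy as np
from PyNomaly import loop

data = np.random.rand(100, 3)  # 100 observations (rows), 3 features (columns)
m = loop.LocalOutlierProbability(data, n_neighbors=10).fit()
scores = m.local_outlier_probabilities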

Distance Matrix support

I'm currently using LOF with a distance matrix. Is it possible to also use a distance matrix with LoOP? Or are the raw points needed for the computation of the probabilities?
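A sketch of what this could look like, assuming the distance_matrix and neighbor_matrix keyword arguments shown in the README's distance-matrix example (discussed in a later issue below):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop

data = np.random.rand(200, 2)
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
dist, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# Fit from the precomputed matrices instead of the raw points.
m = loop.LocalOutlierProbability(distance_matrix=dist, neighbor_matrix=idx,
                                 n_neighbors=10).fit()
scores = m.local_outlier_probabilities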

Add docstring to tests

While the names of the tests in tests/test_loop.py provide an indication of each unit test's purpose, it would be beneficial to include more verbose docstrings that give a better idea of what each test covers.
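For example, a docstring along these lines (the test name and body here are hypothetical):

import numpy as np
import pandas as pd
from PyNomaly import loop

def test_scores_are_probabilities():
    """Fit LoOP on a small random DataFrame and verify that every
    returned local outlier probability lies in the interval [0, 1]."""
    data = pd.DataFrame(np.random.rand(50, 3))
    m = loop.LocalOutlierProbability(data, n_neighbors=5).fit()
    scores = m.local_outlier_probabilities.astype(float)
    assert np.all((scores >= 0) & (scores <= 1))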

Passing cluster_labels broken

I think I have found a bug that occurs when passing some cluster_labels.

When I completely reverse the order of all input (data and cluster_labels) and then reverse the result (local_outlier_probabilities), I would expect the same numbers. This holds as long as all cluster_labels values are equal. But once I have two (genuinely separate) clusters, the results change when flipped!
An extra indication that something goes wrong (IMHO): the second cluster's neighbor indices point into the first cluster!

A small reproduction example:

import matplotlib.pyplot as plt
import numpy as np
from PyNomaly import loop

np.random.seed(1)
n = 9
data = np.append(np.random.normal(2, 1, [n, 2]), np.random.normal(8, 1, [n, 2]), axis=0)
clus = np.append(np.ones(n), 2 * np.ones(n)).tolist()  # 2 cluster numbers!
model = loop.LocalOutlierProbability(data, n_neighbors=5, cluster_labels=clus)
fit = model.fit()
res = fit.local_outlier_probabilities
print(res)
print(fit.neighbor_matrix)

data_flipped = np.flipud(data)
clus_flipped = np.flipud(clus).tolist()
model2 = loop.LocalOutlierProbability(data_flipped, n_neighbors=5, cluster_labels=clus_flipped)
fit2 = model2.fit()
res2 = np.flipud(fit2.local_outlier_probabilities)
print(res2)
print(np.flipud(fit2.neighbor_matrix))

s  = 1 + 100 * res.astype(float)
s2 = 1 + 100 * res2.astype(float)
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s,  marker='+')
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s2, marker='x')
plt.show()

Abstract User Warnings

Abstract the user warnings so that warnings can be issued on object instantiation (i.e. on f = loop() rather than only on loop.fit()).
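A minimal sketch of the idea, with a hypothetical validation check moved into __init__ so the warning fires at construction time:

import warnings

class LocalOutlierProbability:
    def __init__(self, data=None, n_neighbors=10):
        self.data = data
        self.n_neighbors = n_neighbors
        # Hypothetical check: warn at instantiation instead of
        # deferring all validation to fit().
        if data is not None and n_neighbors >= len(data):
            warnings.warn("n_neighbors must be smaller than the number "
                          "of observations.", UserWarning)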

Add regression tests for refactor validation

Add regression tests to establish a baseline with the current working version of PyNomaly by comparing the LoOP results of a predefined array of values to the expected results. These regression tests can be used to compare the results of future refactored versions: if we detect differences in the results, we will know that something in our refactored calculations is wrong.
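A sketch of what such a test might look like (the input array and baseline file here are hypothetical; the baseline would hold scores saved from the current working version):

import numpy as np
from PyNomaly import loop

def test_regression_against_baseline():
    """Compare LoOP scores on a fixed input with scores captured
    from the pre-refactor version."""
    data = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0]])
    m = loop.LocalOutlierProbability(data, n_neighbors=2).fit()
    scores = m.local_outlier_probabilities.astype(float)
    baseline = np.load("tests/baseline_scores.npy")  # saved from the current version
    np.testing.assert_allclose(scores, baseline, atol=1e-10)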

Progress bar fails with ZeroDivisionError using a small number of observations

The line below, contained within PyNomaly/loop.py, fails with a ZeroDivisionError when the total number of observations is less than the width of the terminal or cell where the code is executed.

if index % block_size == 0:

The cause is a block_size of 0, which is not allowable with the current implementation.

This can be resolved by changing the code to the following:

if total < w:
    block_size = int(w / total)
else:
    block_size = int(total / w)

if index % block_size == 0:

This ensures that block_size remains greater than 0 even when the number of observations is less than the width of the terminal / cell window.

Add coverage reporting for Numba JIT-compiled functions

Any functions with the @jit decorator are not recorded by pytest-cov as having been executed, despite running successfully within the unit tests: they are compiled by Numba to native code and are therefore no longer Python functions that pytest-cov can evaluate.

This occurs even when setting the environment variable that disables Numba's JIT, e.g. NUMBA_DISABLE_JIT = "1".
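One possible explanation (an assumption on my part, not a verified fix): NUMBA_DISABLE_JIT has to be in the environment before Numba compiles the decorated functions, so it may need to be set in conftest.py, which runs ahead of the test modules:

# conftest.py -- executed before the test modules import PyNomaly,
# so the flag is visible when Numba first sees the @jit functions.
import os
os.environ["NUMBA_DISABLE_JIT"] = "1"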

Implementation Speed

I am running the LoOP algorithm on 37k unclustered two-dimensional points and it's taking forever to run. Is that because of the implementation, or is the algorithm inherently slow?

Division by zero when including cluster labels

When using Kriegel et al.'s original 2-d synthetic dataset, and when including the cluster labels, the result is a divide-by-zero error.

(Screenshots of the ZeroDivisionError traceback omitted.)

Without the cluster labels, the algorithm runs to completion, but produces the result we talked about last week (slightly too confident probability values). The two behaviors may be related, but as I am not sure, I thought it better to mention both issues.

Alter Fit Convention

Provide the parameters in LocalOutlierProbability() and the data in fit(), as opposed to providing the data in LocalOutlierProbability() along with the parameters, so that PyNomaly is more in line with scikit-learn and other popular libraries (see the sketch below).
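A sketch of the proposed convention (hypothetical, mirroring scikit-learn; not the current API):

import numpy as np
from PyNomaly import loop

data = np.random.rand(100, 2)
m = loop.LocalOutlierProbability(n_neighbors=10)  # parameters only
m.fit(data)                                       # data at fit time
scores = m.local_outlier_probabilities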

ZeroDivisionError when using DataFrame

I had been using version 0.1.5 without issues in the following script, but I decided to upgrade and now I see the following issue:

data = [43.3,62.9,55.2,48.6,67.1,421.5] # example data
new_array=pd.DataFrame(data)
scores = loop.LocalOutlierProbability(new_array).fit()
scores = scores.local_outlier_probabilities

Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-6-1cb16c12004f> in <module>()
      6 
      7 new_array=pd.DataFrame(l)
----> 8 scores = loop.LocalOutlierProbability(new_array).fit()
      9 scores = scores.local_outlier_probabilities
     10 np.where(scores > DETECTION_FACTOR)[0]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in fit(self)
    226         store = self._norm_prob_local_outlier_factors(store)
    227         self.norm_prob_local_outlier_factor = np.max(store[:, 9])
--> 228         store = self._local_outlier_probabilities(store)
    229         self.local_outlier_probabilities = store[:, 10]
    230 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probabilities(self, data_store)
    211         return np.hstack(
    212             (data_store,
--> 213              np.array([np.apply_along_axis(self._local_outlier_probability, 0, data_store[:, 7], data_store[:, 9])]).T))
    214 
    215     def fit(self):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    114     except StopIteration:
    115         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 116     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    117 
    118     # build a buffer for storing evaluations of func1d.

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probability(plof_val, nplof_val)
    111     def _local_outlier_probability(plof_val, nplof_val):
    112         erf_vec = np.vectorize(erf)
--> 113         return np.maximum(0, erf_vec(plof_val / (nplof_val * np.sqrt(2.))))
    114 
    115     def _n_observations(self):

ZeroDivisionError: float division by zero

Integrate Numba's JIT with key functions

Numba's JIT can greatly accelerate the processing speed of the _distances function and others. One of the main drawbacks of nearest-neighbor approaches is their computational cost; reducing this drawback by just-in-time compiling key functions has the potential to greatly accelerate the speed at which observations are processed.
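A minimal sketch of the idea (illustrative, not PyNomaly's actual code): JIT-compiling the hot inner function of a pairwise-distance computation.

import numpy as np
from numba import jit

@jit(nopython=True)
def euclidean(vec_a, vec_b):
    # Compiled to native code on first call; subsequent calls skip
    # the Python interpreter entirely.
    return np.sqrt(np.sum((vec_a - vec_b) ** 2))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0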

LoOP for novelty detection

LOF can be used to score unseen data using its predict method.
I can't find a predict method in LoOP. Does LoOP support this scenario?
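For what it's worth, newer PyNomaly releases document a stream method that approximates the LoOP score of an unseen point against an already-fitted model; a sketch, assuming that API:

import numpy as np
from PyNomaly import loop

train = np.random.rand(500, 2)
m = loop.LocalOutlierProbability(train, n_neighbors=10).fit()
new_point = np.array([0.5, 0.5])
score = m.stream(new_point)  # approximate LoOP score for the unseen point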

Changes to distance measure implementation to improve speed

Hello authors,
I have worked with a range of anomaly detection algorithms, and I have been using LoOP in my testbed; the time consumed is very high. More specifically, my training data includes about 46000 points with two features each, and the number of clusters is 1 (only normal traffic, with label 0). The training phase took about 11000 seconds and the testing phase about 0.5 s (with k=20).
When I reviewed the code in the loop.py file, I saw that you use two loops in the function:
def _distances(self, progress_bar: bool = False) -> None:
I have rewritten this function, ignoring the clusters (because I do not need them) and using numpy functions to reduce it to a single 'for' loop. It now takes only about 120 s instead.
The function looks like this:
import time
import numpy as np

def distance_dn(self, point_vector):
    """
    Calculate distances from each point to the remaining points.
    """
    data = point_vector
    k = self.n_neighbors
    distance = np.zeros((len(data), k))
    index = np.zeros((len(data), k))
    t1 = time.time()
    for i in range(len(data)):
        # Broadcast point i against every point, then take the
        # Euclidean distance along the feature axis.
        data_i = np.array([[data[i][0], data[i][1]]])
        point_arr = np.repeat(data_i, len(data), axis=0)
        diff = (point_arr - data) ** 2
        dis = (np.sum(diff, axis=1)) ** 0.5
        # Note: the k smallest returned by argpartition include point i
        # itself (distance zero), so the self-match is among the neighbors.
        index[i] = (np.argpartition(dis, k))[0:k]
        distance[i] = dis[(np.argpartition(dis, k))[0:k]]
        # print(distance)
    t2 = time.time()
    # print(t2 - t1)
    return distance, index

I hope this can be my contribution to LoOP.

question again

When using LoOP to detect outliers, it often fails at the statement return (probabilistic_distance / ev_prob_dist) - 1., and I have to set n_neighbors to a lower value. But sometimes, even when I give 4 or 5, it still fails with ZeroDivisionError: float division by zero. Thanks for answering!

Library documentation

As the current capabilities of PyNomaly are solidified and new capabilities are added, it would be beneficial to have dedicated documentation that is hosted and available to users outside of the README.

Algorithm Discrepancies

Recently I've had to do outlier analysis for work and came across your package, which neatly implements LoOP. When reading through the source code, I noticed two spots where the calculations don't seem to line up with what is described in the paper. There may be good reasons why it differs from the original algorithm (or I'm misreading it entirely). The relevant definitions from the paper are reproduced after this list.

  1. When calculating the PLOF for an object O, the expected value used in this package is the expected value over every pdist value, whereas the paper says to use the expected value of pdist only over the objects in O's neighborhood.

  2. In the calculation of the nPLOF, _prob_local_outlier_factors_ev first calculates the expected value of PLOF squared just fine, but afterwards in _norm_prob_outlier_factor the value is squared again and then square rooted. The paper specifies just a square root, not another square.
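For reference, the definitions from the Kriegel et al. paper, as I read them:

\[
\mathrm{pdist}(\lambda, o, S) = \lambda \sqrt{\frac{\sum_{s \in S} d(o, s)^2}{|S|}}
\qquad
\mathrm{PLOF}_{\lambda,S}(o) = \frac{\mathrm{pdist}(\lambda, o, S(o))}{\mathbb{E}_{s \in S(o)}\!\left[\mathrm{pdist}(\lambda, s, S(s))\right]} - 1
\]
\[
\mathrm{nPLOF} = \lambda \cdot \sqrt{\mathbb{E}\!\left[\mathrm{PLOF}^2\right]}
\qquad
\mathrm{LoOP}_S(o) = \max\!\left\{0,\ \operatorname{erf}\!\left(\frac{\mathrm{PLOF}_{\lambda,S}(o)}{\mathrm{nPLOF} \cdot \sqrt{2}}\right)\right\}
\]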

I was just wondering if there was a reason for the deviation from the paper in these regards.
Thanks!

Inconsistency in case of dataframe and distance matrix input

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example in the README of using a distance matrix as input for LoOP. It shows how sklearn.neighbors.NearestNeighbors can be used to obtain a distance matrix together with an index matrix. It seems that, built this way, the matrices also contain each point's distance to itself, resulting in a zero distance for the first nearest neighbor of every point.
On the other hand, the internal method _compute_distance_and_neighbor_matrix, used when the data argument is specified, excludes the distances from a point to itself, and so gives different scores on the same data.
I took a look at the test case, which allows a difference of 0.15 in the scores vector, so the difference between 0.45 and 0.6 is considered negligible.
I think the output matrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm; a sketch of the transformation follows.
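A sketch of that transformation (an illustration, not project code): query one extra neighbor, then drop the first column, which holds each point's zero distance to itself.

import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 10
data = np.random.rand(200, 2)
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
dist, idx = neigh.kneighbors(data, n_neighbors=k + 1, return_distance=True)
dist, idx = dist[:, 1:], idx[:, 1:]  # strip the self-match column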

data stream

Is this package suitable for streaming data?
By streaming data I mean an array of data that is constantly being produced, where I want to find an anomaly as soon as it appears.

Adjust indexing in data_store

cardinality, data_store[:, 3])]).T))

PyNomaly used to index the nearest-neighbor distance in column number 2, data_store[:, 2]. This indexing has been removed, but the indexing of the remaining calculations has not been updated to reflect this, resulting in a column of None values in the second column of the data_store object.

Predict the probability of testing data

Hi dear author, I wonder whether this package contains an API for independent testing after fitting? For instance, something like:

m = loop.LocalOutlierProbability(data).fit()

scores_of_test_data = m.local_outlier_probabilities(test_data)

where the "data" is used for training (fitting) and "test_data" is another np.array used for testing only, by which we want to know whether "test_data" contains outliers relative to the training "data", without putting them together for fitting (because fitting again every time takes a long time).

Does this package have such an API?

sir, help

In loop.py there is the assignment data_store[vec][1] = distances[vec]. data_store is a two-dimensional numpy array, and so is distances; why is it legal to assign a sequence to data_store[vec][1]?
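A possible answer (an assumption about loop.py's internals, not a confirmed one): if the array is created with dtype=object, each cell holds a Python reference and can therefore store an entire sequence. A minimal sketch:

import numpy as np

store = np.empty((3, 2), dtype=object)   # object cells, not floats
store[0][1] = np.array([1.0, 2.0, 3.0])  # a whole vector fits in one cell
print(store[0][1])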

parallelize

It would be great to have an option for embarrassingly parallel computation, especially when all N^2 distances are calculated.
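A sketch of the idea (illustrative only, not PyNomaly's implementation): each row of the N x N distance matrix is independent of the others, so rows can be computed in separate workers, e.g. with joblib.

import numpy as np
from joblib import Parallel, delayed

def row_distances(i, data):
    # Euclidean distances from point i to every point.
    return np.sqrt(np.sum((data - data[i]) ** 2, axis=1))

data = np.random.rand(1000, 2)
rows = Parallel(n_jobs=-1)(
    delayed(row_distances)(i, data) for i in range(len(data)))
dist_matrix = np.vstack(rows)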

Use Trees for data structures

It looks like all pairwise distances are currently being calculated, which is expensive. Borrowing from sklearn, BallTree and KDTree could be used to speed up the nearest-neighbor calculations, as sketched below.
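A sketch of the suggestion (not project code): query only the k nearest neighbors through a BallTree instead of materializing all N^2 distances.

import numpy as np
from sklearn.neighbors import BallTree

k = 10
data = np.random.rand(1000, 2)
tree = BallTree(data)
dist, idx = tree.query(data, k=k + 1)  # +1: each point matches itself
dist, idx = dist[:, 1:], idx[:, 1:]    # drop the self-match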

numpy arrays as input don't work

Using data = pd.DataFrame(np.random.rand(100, 5)) as the input to LocalOutlierProbability works as expected; however, passing the raw ndarray np.random.rand(100, 5) throws an error:

self.points_vector = self.data.reshape(self.data.shape[1:])
ValueError: cannot reshape array of size 500 into shape (5,)
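A workaround until this is fixed, grounded in the observation above: wrap the array in a DataFrame first.

import numpy as np
import pandas as pd
from PyNomaly import loop

data = np.random.rand(100, 5)
m = loop.LocalOutlierProbability(pd.DataFrame(data)).fit()
scores = m.local_outlier_probabilities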

Alter Naming Convention

It could just be loop().fit() (so you can have lof().fit(), etc.), e.g. from PyNomaly import loop, then loop().fit().

data stream

Does this package support streaming data for anomaly detection?

Support Python 3.7

PyNomaly currently supports Python 3.4-3.6. Extend and test support with Python 3.7.
