
vc1492a / pynomaly

302 stars · 25 watchers · 36 forks · 34.94 MB

Anomaly detection using LoOP: Local Outlier Probabilities, a local density-based outlier detection method providing an outlier score in the range [0, 1].

License: Other

Python 97.08% TeX 2.92%
outlier-detection anomaly-detection machine-learning outlier-scores probability outliers anomalies nearest-neighbors

pynomaly's People

Contributors: bbambico, lmeshoo, michaelschreier, robmarkcole, vc1492a


pynomaly's Issues

Clarify orientation of 2d array

It might help to mention what the semantic orientation of the input data array is. Originally I had it transposed, which gave me a somewhat unhelpful index error. With a square array, perhaps for testing, you might not realize that the data is flipped.
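For reference, a minimal sketch of the orientation that works, assuming rows are observations and columns are features:

import numpy as np
from PyNomaly import loop

data = np.random.rand(100, 3)  # 100 observations (rows), 3 features (columns)
m = loop.LocalOutlierProbability(data, n_neighbors=10).fit()
scores = m.local_outlier_probabilities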

Distance Matrix support

I'm currently using LOF with a distance matrix. Is it possible to also use a distance matrix with LoOP? Or are the raw points needed for the computation of the probabilities?
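A sketch of what this could look like, assuming the distance_matrix and neighbor_matrix keyword arguments shown in the README's distance-matrix example (discussed in a later issue below):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop

data = np.random.rand(200, 2)
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
dist, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# Fit from the precomputed matrices instead of the raw points.
m = loop.LocalOutlierProbability(distance_matrix=dist, neighbor_matrix=idx,
                                 n_neighbors=10).fit()
scores = m.local_outlier_probabilities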

Add docstring to tests

While the names of the tests in tests/test_loop.py provide an indication of each unit test's purpose, it would be beneficial to include more verbose docstrings that give a better idea of what each test covers.
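For example, a docstring along these lines (the test name and body here are hypothetical):

import numpy as np
import pandas as pd
from PyNomaly import loop

def test_scores_are_probabilities():
    """Fit LoOP on a small random DataFrame and verify that every
    returned local outlier probability lies in the interval [0, 1]."""
    data = pd.DataFrame(np.random.rand(50, 3))
    m = loop.LocalOutlierProbability(data, n_neighbors=5).fit()
    scores = m.local_outlier_probabilities.astype(float)
    assert np.all((scores >= 0) & (scores <= 1))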

Passing cluster_labels broken

I think I have found a bug that occurs when passing some cluster_labels.

When I completely reverse the order of all input (data and cluster_labels) and then reverse the result (local_outlier_probabilities), I would expect the same numbers. This holds as long as all cluster_labels values are equal. But once I have two (genuinely separate) clusters, the results change when flipped!
An extra indication that something goes wrong (IMHO): the second cluster's neighbor indices point into the first cluster!

A small reproduction example:

import matplotlib.pyplot as plt
import numpy as np
from PyNomaly import loop

np.random.seed(1)
n = 9
data = np.append(np.random.normal(2, 1, [n, 2]), np.random.normal(8, 1, [n, 2]), axis=0)
clus = np.append(np.ones(n), 2 * np.ones(n)).tolist()  # 2 cluster numbers!
model = loop.LocalOutlierProbability(data, n_neighbors=5, cluster_labels=clus)
fit = model.fit()
res = fit.local_outlier_probabilities
print(res)
print(fit.neighbor_matrix)

data_flipped = np.flipud(data)
clus_flipped = np.flipud(clus).tolist()
model2 = loop.LocalOutlierProbability(data_flipped, n_neighbors=5, cluster_labels=clus_flipped)
fit2 = model2.fit()
res2 = np.flipud(fit2.local_outlier_probabilities)
print(res2)
print(np.flipud(fit2.neighbor_matrix))

s  = 1 + 100 * res.astype(float)
s2 = 1 + 100 * res2.astype(float)
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s,  marker='+')
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s2, marker='x')
plt.show()

Abstract User Warnings

Abstract the user warnings so that warnings can be issued on object instantiation (i.e. on f = loop() rather than only on loop.fit()).
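A minimal sketch of the idea, with a hypothetical validation check moved into __init__ so the warning fires at construction time:

import warnings

class LocalOutlierProbability:
    def __init__(self, data=None, n_neighbors=10):
        self.data = data
        self.n_neighbors = n_neighbors
        # Hypothetical check: warn at instantiation instead of
        # deferring all validation to fit().
        if data is not None and n_neighbors >= len(data):
            warnings.warn("n_neighbors must be smaller than the number "
                          "of observations.", UserWarning)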

Add regression tests for refactor validation

Add regression tests to establish a baseline with the current working version of PyNomaly by comparing the LoOP results of a predefined array of values to the expected results. These regression tests can be used to compare the results of future refactored versions: if we detect differences in the results, we will know that something in our refactored calculations is wrong.
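A sketch of what such a test might look like (the input array and baseline file here are hypothetical; the baseline would hold scores saved from the current working version):

import numpy as np
from PyNomaly import loop

def test_regression_against_baseline():
    """Compare LoOP scores on a fixed input with scores captured
    from the pre-refactor version."""
    data = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0]])
    m = loop.LocalOutlierProbability(data, n_neighbors=2).fit()
    scores = m.local_outlier_probabilities.astype(float)
    baseline = np.load("tests/baseline_scores.npy")  # saved from the current version
    np.testing.assert_allclose(scores, baseline, atol=1e-10)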

Progress bar fails with ZeroDivisionError using a small number of observations

The line below, contained within PyNomaly/loop.py, fails with a ZeroDivisionError when the total number of observations is less than the width of the terminal or cell where the code is executed.

if index % block_size == 0:

The cause is a block_size of 0, which is not allowable with the current implementation.

This can be resolved by changing the code to the following:

if total < w:
    block_size = int(w / total)
else:
    block_size = int(total / w)

if index % block_size == 0:

This ensures that block_size remains greater than 0 even when the number of observations is less than the width of the terminal / cell window.

Add coverage reporting for Numba JIT-compiled functions

Any functions with the @jit decorator are not recorded by pytest-cov as having been executed, despite running successfully within the unit tests: they are compiled by Numba to native code and are therefore no longer Python functions that pytest-cov can evaluate.

This occurs even when setting the environment variable that disables Numba's JIT, e.g. NUMBA_DISABLE_JIT = "1".
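One possible explanation (an assumption on my part, not a verified fix): NUMBA_DISABLE_JIT has to be in the environment before Numba compiles the decorated functions, so it may need to be set in conftest.py, which runs ahead of the test modules:

# conftest.py -- executed before the test modules import PyNomaly,
# so the flag is visible when Numba first sees the @jit functions.
import os
os.environ["NUMBA_DISABLE_JIT"] = "1"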

Implementation Speed

I am running the LoOP algorithm on 37k unclustered two-dimensional points and it's taking forever to run. Is that because of the implementation, or is the algorithm inherently slow?

Division by zero when including cluster labels

When using Kriegel et al.'s original 2-d synthetic dataset, and when including the cluster labels, the result is a divide-by-zero error.

(Screenshots of the ZeroDivisionError traceback omitted.)

Without the cluster labels, the algorithm runs to completion, but produces the result we talked about last week (slightly too confident probability values). The two behaviors may be related, but as I am not sure, I thought it better to mention both issues.

Alter Fit Convention

Provide the parameters in LocalOutlierProbability() and the data in fit(), as opposed to providing the data in LocalOutlierProbability() along with the parameters, so that PyNomaly is more in line with scikit-learn and other popular libraries (see the sketch below).
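A sketch of the proposed convention (hypothetical, mirroring scikit-learn; not the current API):

import numpy as np
from PyNomaly import loop

data = np.random.rand(100, 2)
m = loop.LocalOutlierProbability(n_neighbors=10)  # parameters only
m.fit(data)                                       # data at fit time
scores = m.local_outlier_probabilities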

ZeroDivisionError when using DataFrame

I had been using version 0.1.5 without issues in the following script, but I decided to upgrade and now I see the following issue:

data = [43.3,62.9,55.2,48.6,67.1,421.5] # example data
new_array=pd.DataFrame(data)
scores = loop.LocalOutlierProbability(new_array).fit()
scores = scores.local_outlier_probabilities

Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-6-1cb16c12004f> in <module>()
      6 
      7 new_array=pd.DataFrame(l)
----> 8 scores = loop.LocalOutlierProbability(new_array).fit()
      9 scores = scores.local_outlier_probabilities
     10 np.where(scores > DETECTION_FACTOR)[0]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in fit(self)
    226         store = self._norm_prob_local_outlier_factors(store)
    227         self.norm_prob_local_outlier_factor = np.max(store[:, 9])
--> 228         store = self._local_outlier_probabilities(store)
    229         self.local_outlier_probabilities = store[:, 10]
    230 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probabilities(self, data_store)
    211         return np.hstack(
    212             (data_store,
--> 213              np.array([np.apply_along_axis(self._local_outlier_probability, 0, data_store[:, 7], data_store[:, 9])]).T))
    214 
    215     def fit(self):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    114     except StopIteration:
    115         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 116     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    117 
    118     # build a buffer for storing evaluations of func1d.

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probability(plof_val, nplof_val)
    111     def _local_outlier_probability(plof_val, nplof_val):
    112         erf_vec = np.vectorize(erf)
--> 113         return np.maximum(0, erf_vec(plof_val / (nplof_val * np.sqrt(2.))))
    114 
    115     def _n_observations(self):

ZeroDivisionError: float division by zero

Integrate Numba's JIT with key functions

Numba's JIT can greatly accelerate the processing speed of the _distances function and others. One of the main drawbacks of nearest-neighbor approaches is their computational cost; reducing this drawback by just-in-time compiling key functions has the potential to greatly accelerate the speed at which observations are processed.
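A minimal sketch of the idea (illustrative, not PyNomaly's actual code): JIT-compiling the hot inner function of a pairwise-distance computation.

import numpy as np
from numba import jit

@jit(nopython=True)
def euclidean(vec_a, vec_b):
    # Compiled to native code on first call; subsequent calls skip
    # the Python interpreter entirely.
    return np.sqrt(np.sum((vec_a - vec_b) ** 2))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0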

LoOP for novelty detection

LOF can be used to score unseen data using its predict method.
I can't find a predict method in LoOP. Does LoOP support this scenario?
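For what it's worth, newer PyNomaly releases document a stream method that approximates the LoOP score of an unseen point against an already-fitted model; a sketch, assuming that API:

import numpy as np
from PyNomaly import loop

train = np.random.rand(500, 2)
m = loop.LocalOutlierProbability(train, n_neighbors=10).fit()
new_point = np.array([0.5, 0.5])
score = m.stream(new_point)  # approximate LoOP score for the unseen point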

Changes to distance measure implementation to improve speed

Hello authors,
I have worked with a range of anomaly detection algorithms, and I have been using LoOP in my testbed; the time consumed is very high. More specifically, my training data includes about 46000 points with two features each, and the number of clusters is 1 (only normal traffic, with label 0). The training phase took about 11000 seconds and the testing phase about 0.5 s (with k=20).
When I reviewed the code in the loop.py file, I saw that you use two loops in the function:
def _distances(self, progress_bar: bool = False) -> None:
I have rewritten this function, ignoring the clusters (because I do not need them) and using numpy functions to reduce it to a single 'for' loop. It now takes only about 120 s instead.
The function looks like this:
import time
import numpy as np

def distance_dn(self, point_vector):
    """
    Calculate distances from each point to the remaining points.
    """
    data = point_vector
    k = self.n_neighbors
    distance = np.zeros((len(data), k))
    index = np.zeros((len(data), k))
    t1 = time.time()
    for i in range(len(data)):
        # Broadcast point i against every point, then take the
        # Euclidean distance along the feature axis.
        data_i = np.array([[data[i][0], data[i][1]]])
        point_arr = np.repeat(data_i, len(data), axis=0)
        diff = (point_arr - data) ** 2
        dis = (np.sum(diff, axis=1)) ** 0.5
        # Note: the k smallest returned by argpartition include point i
        # itself (distance zero), so the self-match is among the neighbors.
        index[i] = (np.argpartition(dis, k))[0:k]
        distance[i] = dis[(np.argpartition(dis, k))[0:k]]
        # print(distance)
    t2 = time.time()
    # print(t2 - t1)
    return distance, index

I hope this can be my contribution to LoOP.

question again

When using LoOP to detect outliers, it often fails at the statement return (probabilistic_distance / ev_prob_dist) - 1., and I have to set n_neighbors to a lower value. But sometimes, even when I give 4 or 5, it still fails with ZeroDivisionError: float division by zero. Thanks for answering!

Library documentation

As the current capabilities of PyNomaly are solidified and new capabilities are added, it would be beneficial to have dedicated documentation that is hosted and available to users outside of the README.

Algorithm Discrepancies

Recently I've had to do outlier analysis for work and came across your package, which neatly implements LoOP. When reading through the source code, I noticed two spots where the calculations don't seem to line up with what is described in the paper. There may be good reasons why it differs from the original algorithm (or I'm misreading it entirely). The relevant definitions from the paper are reproduced after this list.

  1. When calculating the PLOF for an object O, the expected value used in this package is the expected value over every pdist value, whereas the paper says to use the expected value of pdist only over the objects in O's neighborhood.

  2. In the calculation of the nPLOF, _prob_local_outlier_factors_ev first calculates the expected value of PLOF squared just fine, but afterwards in _norm_prob_outlier_factor the value is squared again and then square rooted. The paper specifies just a square root, not another square.
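For reference, the definitions from the Kriegel et al. paper, as I read them:

\[
\mathrm{pdist}(\lambda, o, S) = \lambda \sqrt{\frac{\sum_{s \in S} d(o, s)^2}{|S|}}
\qquad
\mathrm{PLOF}_{\lambda,S}(o) = \frac{\mathrm{pdist}(\lambda, o, S(o))}{\mathbb{E}_{s \in S(o)}\!\left[\mathrm{pdist}(\lambda, s, S(s))\right]} - 1
\]
\[
\mathrm{nPLOF} = \lambda \cdot \sqrt{\mathbb{E}\!\left[\mathrm{PLOF}^2\right]}
\qquad
\mathrm{LoOP}_S(o) = \max\!\left\{0,\ \operatorname{erf}\!\left(\frac{\mathrm{PLOF}_{\lambda,S}(o)}{\mathrm{nPLOF} \cdot \sqrt{2}}\right)\right\}
\]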

I was just wondering if there was a reason for the deviation from the paper in these regards.
Thanks!

Inconsistency in case of dataframe and distance matrix input

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example in the README of using a distance matrix as input for LoOP. It shows how sklearn.neighbors.NearestNeighbors can be used to obtain a distance matrix together with an index matrix. It seems that, built this way, the matrices also contain each point's distance to itself, resulting in a zero distance for the first nearest neighbor of every point.
On the other hand, the internal method _compute_distance_and_neighbor_matrix, used when the data argument is specified, excludes the distances from a point to itself, and so gives different scores on the same data.
I took a look at the test case, which allows a difference of 0.15 in the scores vector, so the difference between 0.45 and 0.6 is considered negligible.
I think the output matrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm; a sketch of the transformation follows.
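A sketch of that transformation (an illustration, not project code): query one extra neighbor, then drop the first column, which holds each point's zero distance to itself.

import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 10
data = np.random.rand(200, 2)
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
dist, idx = neigh.kneighbors(data, n_neighbors=k + 1, return_distance=True)
dist, idx = dist[:, 1:], idx[:, 1:]  # strip the self-match column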

data stream

Is this package suitable for streaming data?
By streaming data I mean an array of data that is constantly being produced, where I want to find an anomaly as soon as it appears.

Adjust indexing in data_store

cardinality, data_store[:, 3])]).T))

PyNomaly used to index the nearest-neighbor distance in column number 2, data_store[:, 2]. This indexing has been removed, but the indexing of the remaining calculations has not been updated to reflect this, resulting in a column of None values in the second column of the data_store object.

Predict the probability of testing data

Hi dear author, I wonder whether this package contains an API for independent testing after fitting? For instance, something like:

m = loop.LocalOutlierProbability(data).fit()

scores_of_test_data = m.local_outlier_probabilities(test_data)

where the "data" is used for training (fitting) and "test_data" is another np.array used for testing only, by which we want to know whether "test_data" contains outliers relative to the training "data", without putting them together for fitting (because fitting again every time takes a long time).

Does this package have such an API?

sir, help

In loop.py there is the assignment data_store[vec][1] = distances[vec]. data_store is a two-dimensional numpy array, and so is distances; why is it legal to assign a sequence to data_store[vec][1]?
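A possible answer (an assumption about loop.py's internals, not a confirmed one): if the array is created with dtype=object, each cell holds a Python reference and can therefore store an entire sequence. A minimal sketch:

import numpy as np

store = np.empty((3, 2), dtype=object)   # object cells, not floats
store[0][1] = np.array([1.0, 2.0, 3.0])  # a whole vector fits in one cell
print(store[0][1])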

parallelize

It would be great to have an option for embarrassingly parallel computation, especially when all N^2 distances are calculated.
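A sketch of the idea (illustrative only, not PyNomaly's implementation): each row of the N x N distance matrix is independent of the others, so rows can be computed in separate workers, e.g. with joblib.

import numpy as np
from joblib import Parallel, delayed

def row_distances(i, data):
    # Euclidean distances from point i to every point.
    return np.sqrt(np.sum((data - data[i]) ** 2, axis=1))

data = np.random.rand(1000, 2)
rows = Parallel(n_jobs=-1)(
    delayed(row_distances)(i, data) for i in range(len(data)))
dist_matrix = np.vstack(rows)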

Use Trees for data structures

It looks like all pairwise distances are currently being calculated, which is expensive. Borrowing from sklearn, BallTree and KDTree could be used to speed up the nearest-neighbor calculations, as sketched below.
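A sketch of the suggestion (not project code): query only the k nearest neighbors through a BallTree instead of materializing all N^2 distances.

import numpy as np
from sklearn.neighbors import BallTree

k = 10
data = np.random.rand(1000, 2)
tree = BallTree(data)
dist, idx = tree.query(data, k=k + 1)  # +1: each point matches itself
dist, idx = dist[:, 1:], idx[:, 1:]    # drop the self-match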

numpy arrays as input don't work

Using data = pd.DataFrame(np.random.rand(100, 5)) as the input to LocalOutlierProbability works as expected; however, passing the raw ndarray np.random.rand(100, 5) throws an error:

self.points_vector = self.data.reshape(self.data.shape[1:])
ValueError: cannot reshape array of size 500 into shape (5,)
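A workaround until this is fixed, grounded in the observation above: wrap the array in a DataFrame first.

import numpy as np
import pandas as pd
from PyNomaly import loop

data = np.random.rand(100, 5)
m = loop.LocalOutlierProbability(pd.DataFrame(data)).fit()
scores = m.local_outlier_probabilities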

Alter Naming Convention

It could just be loop().fit() (so you can have lof().fit(), etc.), e.g. from PyNomaly import loop, then loop().fit().

data stream

Does this package support streaming data for anomaly detection?

Support Python 3.7

PyNomaly currently supports Python 3.4-3.6. Extend and test support with Python 3.7.
