expediagroup / adaptive-alerting

Anomaly detection for streaming time series, featuring automated model selection.

License: Apache License 2.0

Java 92.16% R 0.19% Shell 1.68% Scala 0.83% Makefile 0.11% Dockerfile 0.22% HCL 2.51% Smarty 1.17% Python 0.46% Jupyter Notebook 0.68%
anomaly-detection anomaly outlier-detection outlier monitoring time-series streaming

adaptive-alerting's Introduction

Adaptive Alerting (AA)

Streaming anomaly detection with automated model selection and fitting.

Wiki documentation

Build

To build the Maven project:

$ ./mvnw clean verify

To build the Docker images:

$ make docker_build

How the Travis CI build works

We use Travis CI to build AA Docker images and push them to Docker Hub. Here's how it works:

  • A developer pushes a branch (master or otherwise) to GitHub.
  • GitHub kicks off a Travis CI build.
  • Travis CI reads .travis.yml, which drives the build.
  • .travis.yml invokes the top-level Makefile.
  • The top-level Makefile
    • runs a Maven build for the whole project
    • invokes module-specific Makefiles to handle building and releasing Docker images
  • Each module-specific Makefile runs one or more module-specific build scripts to
    • build the Docker images
    • release the Docker images
  • For the release (docker push), the module-specific build script delegates to the shared scripts/publish-to-docker-hub.sh script. This script pushes the image to Docker Hub if and only if the current branch is master.

adaptive-alerting's People

Contributors

ankgupta, ashishagg, ayansen, bibinss, brett-miller, dependabot[bot], derikulous, dinilatgit, djsutho, exp-dasutherland, hridyeshpant, keshavpeswani, mattcallanan, palash-goel, pranavsundriyal, shsethi, swfortier, tkamenov-expedia, tusharbahl, vldmrkl, williewheeler, zrashwani

adaptive-alerting's Issues

Create new health check to replace the one we lost when removing haystack-commons

The haystack-commons health check worked like this: when an uncaught exception occurred in Kafka, the health controller wrote to a file indicating that the app was unhealthy, and the Kubernetes liveness probe then saw that file and failed the health check.

We want something like this, but we need to coordinate it with #253 and #245. Ill-formatted metrics should now cause the health check to fail.
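
Below is a minimal sketch of the file-based liveness marker described above, assuming we reimplement the same approach ourselves; the class name and marker path are hypothetical, not the actual AA or haystack-commons implementation.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical replacement for the health check lost with haystack-commons.
public class FileHealthStatusController implements Thread.UncaughtExceptionHandler {

    // The k8s liveness probe would be configured to fail whenever this file exists,
    // e.g. an exec probe running: test ! -f /tmp/app-unhealthy
    private static final Path UNHEALTHY_MARKER = Paths.get("/tmp/app-unhealthy");

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            Files.createFile(UNHEALTHY_MARKER); // mark the app as unhealthy
        } catch (Exception ignored) {
            // best effort; the stream thread is already dying
        }
    }
}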

ad-mapper crashes when malformed metrics are sent to aa-metrics

The ad-mapper crashes when malformed metrics are sent to the aa-metrics Kafka topic.

To reproduce:
Run ad-mapper locally in Kubernetes and run salesdata-to-gaas to push to Kafka's mapped-metrics topic.

Example metrics:
{"metric":{"tags":[null,null,"m_application=sales-data","unit=count","what=bookings","lob=tshop","region=travelocity-na"],"meta":{"source":"sales-data-api"}},"epochTimeInSeconds":1541112000,"value":1.0}

Result:
ad-mapper breaks and shuts down (and is subsequently restarted by Kubernetes).

Log:

2018-11-01 22:58:00 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from REBALANCING to RUNNING
2018-11-01 22:58:00 INFO  StateChangeListener:40 - State change event called with newState=RUNNING and oldState=REBALANCING
2018-11-01 22:58:00 INFO  Fetcher:561 - [Consumer clientId=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1-consumer, groupId=ad-mapper] Resetting offset for partition aa-metrics-0 to offset 229.
2018-11-01 22:59:05 ERROR StreamThread:40 - Exception caught during Deserialization, taskId: 0_0, topic: aa-metrics, partition: 0, offset: 229
org.msgpack.core.MessageTypeException: Expected Map, but got Integer (7b)
	at org.msgpack.core.MessageUnpacker.unexpected(MessageUnpacker.java:595)
	at org.msgpack.core.MessageUnpacker.unpackMapHeader(MessageUnpacker.java:1297)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:144)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:61)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:60)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:50)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
	at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
	at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
	at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
	at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
	at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
2018-11-01 22:59:05 INFO  StreamThread:200 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
2018-11-01 22:59:05 INFO  StreamThread:1108 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Shutting down
2018-11-01 22:59:05 INFO  KafkaProducer:1054 - [Producer clientId=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1-producer] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
2018-11-01 22:59:05 INFO  StreamThread:200 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from RUNNING to ERROR
2018-11-01 22:59:05 INFO  StateChangeListener:40 - State change event called with newState=ERROR and oldState=RUNNING
2018-11-01 22:59:05 WARN  KafkaStreams:413 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] All stream threads have died. The instance will be in error state and should be closed.
2018-11-01 22:59:05 INFO  StreamThread:1128 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Shutdown complete
2018-11-01 22:59:05 ERROR StateChangeListener:50 - uncaught exception occurred running kafka streams for thread=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:74)
	at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
	at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
	at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
	at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.msgpack.core.MessageTypeException: Expected Map, but got Integer (7b)
	at org.msgpack.core.MessageUnpacker.unexpected(MessageUnpacker.java:595)
	at org.msgpack.core.MessageUnpacker.unpackMapHeader(MessageUnpacker.java:1297)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:144)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:61)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:60)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:50)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
	at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
	... 7 more
2018-11-01 22:59:05 ERROR HealthStatusController:47 - Setting the app status as 'UNHEALTHY'
2018-11-01 22:59:05 INFO  Application:91 - Shutting down topology
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from ERROR to PENDING_SHUTDOWN
2018-11-01 22:59:05 INFO  StateChangeListener:40 - State change event called with newState=PENDING_SHUTDOWN and oldState=ERROR
2018-11-01 22:59:05 INFO  StreamThread:1094 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Informed to shut down
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from PENDING_SHUTDOWN to NOT_RUNNING
2018-11-01 22:59:05 INFO  KafkaStreams:867 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] Streams client stopped completely
2018-11-01 22:59:05 INFO  Application:94 - Shutting down jmxReporter

Highlight:
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
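
The highlighted exception points directly at the Streams deserialization handler setting. Here is a minimal sketch of the configuration change it suggests, assuming Kafka Streams' built-in LogAndContinueExceptionHandler is acceptable for this topology (application id and bootstrap servers are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ad-mapper");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Log and skip records that fail deserialization instead of killing the stream thread.
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);

The default handler, LogAndFailExceptionHandler, preserves the current fail-fast behavior; LogAndContinueExceptionHandler skips the bad record and keeps the pipeline running.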

Research: pipelines via PipelineAI, Pachyderm, StreamSets

Research PipelineAI, Pachyderm, and StreamSets to determine whether any of them might be a good fit for our ML pipeline needs. Some key needs include:

  • Shipping data to S3 (ideally would like to continue being able to use AWS Glue+Athena here)
  • Scheduling model build jobs
  • Support for HPO and automated model selection
  • Hosting training/prediction workloads (any language and ML framework)

ad-manager crashes when malformed metrics are sent to mapped-metrics

The ad-manager shuts down when malformed metrics are sent to the mapped-metrics Kafka topic.

To reproduce:
Run ad-manager locally in Kubernetes and, from a command prompt, run fakespans to push to the mapped-metrics topic, e.g.
fakespans --kafka-broker="192.168.99.100:9092" -topic mapped-metrics
ad-manager breaks and shuts down (and is subsequently restarted by Kubernetes).

Log:

2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=ERROR and oldState=RUNNING
2018-10-31 00:58:37 WARN  KafkaStreams:413 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] All stream threads have died. The instance will be in error state and should be closed.
2018-10-31 00:58:37 INFO  StreamThread:1128 - stream-thread [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1] Shutdown complete
2018-10-31 00:58:37 ERROR StateChangeListener:50 - uncaught exception occurred running kafka streams for thread=ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:74)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token '$a05d37fd': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"
$a05d37fd-539e-44b9-ab6c-78574623b39b$e2ec24b3-07ad-4fc1-9f09-bcc31fa00381"some-service*  some-span0�Њ����8����"; line: 2, column: 11]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token '$a05d37fd': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"
$a05d37fd-539e-44b9-ab6c-78574623b39b$e2ec24b3-07ad-4fc1-9f09-bcc31fa00381"some-service*  some-span0�Њ����8����"; line: 2, column: 11]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:679)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3526)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2621)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:826)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:723)
    at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4141)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4000)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
    at com.expedia.adaptivealerting.kafka.serde.JsonPojoDeserializer.deserialize(JsonPojoDeserializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
    at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
2018-10-31 00:58:37 ERROR HealthStatusController:47 - Setting the app status as 'UNHEALTHY'
2018-10-31 00:58:37 INFO  Application:91 - Shutting down topology
2018-10-31 00:58:37 INFO  KafkaStreams:261 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] State transition from ERROR to PENDING_SHUTDOWN
2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=PENDING_SHUTDOWN and oldState=ERROR
2018-10-31 00:58:37 INFO  StreamThread:1094 - stream-thread [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1] Informed to shut down
2018-10-31 00:58:37 INFO  KafkaStreams:261 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] State transition from PENDING_SHUTDOWN to NOT_RUNNING
2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=NOT_RUNNING and oldState=PENDING_SHUTDOWN
2018-10-31 00:58:37 INFO  KafkaStreams:867 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] Streams client stopped completely
2018-10-31 00:58:37 INFO  Application:94 - Shutting down jmxReporter
2018-10-31 00:58:37 INFO  Application:97 - Shutting down logger. Bye!

Remove anomvalidate module

We've decided to limit AA's scope to time series anomaly detection and have validation occur as a separate step in the pipeline, outside of AA. Therefore we need to remove the anomvalidate module.

Create kafka-junit test for KafkaAnomalyDetectorManager

Currently the unit test is based on Kafka's test-utils package, but kafka-junit provides deeper support as it allows us to set up a mock broker with input/output topics that we can use. Update the existing unit test to use kafka-junit.
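
A minimal sketch of the kind of test this enables, assuming the charithe/kafka-junit library (its KafkaJunitRule, EphemeralKafkaBroker, and string produce/consume helpers); the topic name is illustrative, and the real test would exercise the KafkaAnomalyDetectorManager topology rather than simply echo strings:

import static org.junit.Assert.assertEquals;

import com.github.charithe.kafka.EphemeralKafkaBroker;
import com.github.charithe.kafka.KafkaJunitRule;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.junit.Rule;
import org.junit.Test;

public class KafkaAnomalyDetectorManagerKafkaJunitTest {

    // Spins up an embedded broker per test, with real input/output topics.
    @Rule
    public KafkaJunitRule kafkaRule = new KafkaJunitRule(EphemeralKafkaBroker.create());

    @Test
    public void producesAndConsumesAgainstEmbeddedBroker() throws Exception {
        kafkaRule.helper().produceStrings("mapped-metrics", "record-1", "record-2");
        List<String> records =
                kafkaRule.helper().consumeStrings("mapped-metrics", 2).get(10, TimeUnit.SECONDS);
        assertEquals(2, records.size());
    }
}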

Create Kafka Streams app to feed anomalies back into the metrics topic for visualization

We want to be able to overlay anomalies on metric data for visualization in tools like Grafana. So we need a Kafka Streams app that

  • reads anomalies from the anomalies topic,
  • transforms the anomalies into metric data, and
  • pushes the metric data onto the mdm topic

From an implementation perspective, the transformer shouldn't be tied to Kafka since it's conceivable that somebody would want to use other messaging technologies like Kinesis or RabbitMQ.

The transformation needs to enrich the original metric with a key/value tag with key detector_uuid and value being the detector UUID associated with the anomaly. The transformation must reject anomalies with this tag already specified.
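
A minimal, Kafka-agnostic sketch of the tag-enrichment rule described above; the map-based types and class name are simplified stand-ins, not the actual AA domain classes:

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class AnomalyToMetricTransformer {

    static final String DETECTOR_UUID_TAG = "detector_uuid";

    // Enriches the original metric's tags with the detector UUID associated with the anomaly.
    public Map<String, String> transformTags(Map<String, String> metricTags, UUID detectorUuid) {
        if (metricTags.containsKey(DETECTOR_UUID_TAG)) {
            // Reject anomalies whose metric already carries the detector_uuid tag.
            throw new IllegalArgumentException("Metric already contains a " + DETECTOR_UUID_TAG + " tag");
        }
        Map<String, String> enriched = new HashMap<>(metricTags);
        enriched.put(DETECTOR_UUID_TAG, detectorUuid.toString());
        return enriched;
    }
}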

StreamsAppConfigLoader does not apply config overrides

Locally and in test we are observing that the kstreams apps fail to apply the configuration overrides to the base config. The result is that the app starts up with the base config (the one in the JAR).
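
For reference, this is roughly the merge behavior we expect, sketched with the Lightbend (Typesafe) Config library; whether StreamsAppConfigLoader actually uses this library, and the resource and property names below, are assumptions for illustration only:

import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Base config bundled in the JAR (hypothetical resource name).
Config base = ConfigFactory.parseResources("base.conf");
// External override file, located via a hypothetical system property.
Config overrides = ConfigFactory.parseFile(new File(System.getProperty("config.override.path")));
// Overrides should win; anything they omit falls back to the base config.
Config effective = overrides.withFallback(base).resolve();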

Remove JsonPojoSerde and associated de/serializer

These are deprecated, having been replaced by type-specific de/serializers. (The generic versions are typically instantiated by reflection by a client that has no idea it needs to read the POJO class from the config and set it on the deserializer.)

May need to update the Terraform module config to remove references to the POJO serde and/or de/serializers. May also need to implement type-specific replacements.

Create kafka-junit test for KafkaAnomalyDetectorMapper

Currently the unit test is based on Kafka's test-utils package, but kafka-junit provides deeper support as it allows us to set up a mock broker with input/output topics that we can use. Update the existing unit test to use kafka-junit.

Remove Spring from the Kafka Notifier app

In general we would prefer not to use Spring for the Kafka Streams apps, as these are pretty lightweight and don't really warrant pulling Spring in. The Notifier app is currently based on Spring Boot. We would like to remove Spring from this particular app.

Clarify which password `make release` is requesting

The make process produces this output:

Releasing
DOCKER_ORG=expediadotcom, DOCKER_IMAGE_NAME=haystack-adaptive-alerting-metric-router, QUALIFIED_DOCKER_IMAGE_NAME=expediadotcom/haystack-adaptive-alerting-metric-router
BRANCH=, TAG=, SHA=
Password:
Error response from daemon: Get https://registry-1.docker.io/v2/: unauthorized: incorrect username or password
make[1]: *** [release] Error 1
make: *** [release] Error 2

Please clarify which password is being requested in the prompt to avoid people sending incorrect credentials to docker.io.

Support external detectors in a fully-decoupled way

Problem

Currently, to add a new detector type, we have to create it in the AA codebase itself. External detectors should be integrated dynamically at runtime without requiring AA to be restarted.

Also external detectors should not be required to know anything about AA. This includes AA APIs/SPIs, message formats, and endpoints. We should be able to integrate detectors that were developed completely independently of AA.

As part of this issue, we would want to move both Aquila and RCF out of the AA codebase.

Proposed design

(Diagram: aa-arch-logical)

The idea is that the mapper would push messages onto type-specific topics where they can be picked up by the detectors. So AA doesn't know anything about the detector other than the name of its topic, which would be a data record added to the AA database at integration time.

The metric that AA pushes is in an AA-specific format, so it may seem that the detectors are coupled to AA, but this is easy to avoid. A detector implementer will generally have a detector with its own proprietary expectations around request/response formats, and integration would involve creating an adapter (implemented as a Kafka Streams app) that wraps the detector in a way that enables the integration:

(Diagram: aa-detector-ext)

The adapter would:

  • listen to the input topic,
  • map the AA metric to the format that the detector expects,
  • invoke the detector,
  • get the response,
  • map it to the output format that AA expects, and finally
  • push the mapped output to the AA anomalies topic.

The deployment of external detectors is totally independent of AA.
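
A minimal sketch of such an adapter topology, assuming string-serialized messages; the topic name, class name, and the three mapping/invocation methods are hypothetical placeholders, not actual AA components:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class ExternalDetectorAdapter {

    public Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("my-detector-input-topic");
        input
            .mapValues(this::toDetectorRequest)   // AA metric -> detector's own request format
            .mapValues(this::invokeDetector)      // call the external detector
            .mapValues(this::toAaAnomaly)         // detector response -> AA anomaly format
            .to("anomalies");                     // push onto the AA anomalies topic
        return builder.build();
    }

    private String toDetectorRequest(String aaMetricJson) { return aaMetricJson; } // placeholder
    private String invokeDetector(String request) { return request; }              // placeholder
    private String toAaAnomaly(String response) { return response; }               // placeholder
}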

Support left-, right- and two-tailed anomaly detection in general

In many cases, we want to generate anomalies only if observations fall too far on one particular side of some point forecast. For example, for bookings, we generally care more about drops than spikes. (Bookings spikes can be anomalous and interesting too, but generally we care more about drops.) Another example would be Haystack telemetry, where we care about spikes in the error rate, latencies, and durations, but about both spikes and drops in request volume.

We already support one- and two-tailed constant threshold detectors, but not other detector types. We need to extend this support to some common cases, including EWMA, PEWMA, and Holt-Winters. The solution should be one we can apply to other detectors based on point forecasts.

Start with:

  • EWMA
  • PEWMA
  • Holt-Winters
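
A minimal sketch of what left-, right-, and two-tailed evaluation against a point forecast could look like; the enum, class name, and simple band math are illustrative, not AA's actual detector classes:

public class TailedThresholdSketch {

    enum Tail { LEFT, RIGHT, TWO }

    // Returns true if the observation is anomalous for the given tail and band width.
    static boolean isAnomalous(double observed, double forecast, double band, Tail tail) {
        switch (tail) {
            case LEFT:  return observed < forecast - band;            // drops only (e.g. bookings)
            case RIGHT: return observed > forecast + band;            // spikes only (e.g. error rate)
            case TWO:   return Math.abs(observed - forecast) > band;  // either direction (e.g. request volume)
            default:    throw new IllegalArgumentException("Unknown tail: " + tail);
        }
    }
}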

Remove @author tags

One of our team values is shared code ownership, and including author tags tends to work against this goal. We can see who wrote which code anytime we want (git blame).

Integration with AWS SageMaker Random Cut Forest algorithm for anomaly detection

https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/

This is largely R&D work at this point, aimed at proving that we have an integration strategy for AWS SageMaker. Benefits include:

  • Having access to trained models, in addition to the simple stream filters we already have. In particular, in the future we intend to integrate our internally-developed Aquila model, which we train offline, and so SageMaker integration will give us an option here.
  • Having access to new AWS-provided anomaly detection models as they appear.
  • Having a way to interface with non-SageMaker ML pipelines. (Expedia Group has multiple such pipelines, and it would be great to be able to integrate with any of them.)
  • Having access to models developed in ML frameworks such as TensorFlow, MXNet, etc.
  • Lowering the bar for developing, training and deploying new anomaly detection models.

Some questions we want to resolve:

  • How do we want to use the SageMaker models? Does AA call models running live in SageMaker? Does AA read the models from S3 and then apply them locally as filters? Both?
  • Relatedly, what are the performance characteristics of the two approaches I just described?
  • Also relatedly, which format(s) does SageMaker use when storing the model? Is there some Java SDK for reading those models into AA, or would we need to build Java versions along with a deserializer? We need to resolve this early on to ensure we have some reasonable way to read the models.
  • What's the integration strategy for supporting multiple pipelines? For example, do we want pipelines to dump models in a common format to a common location (e.g. transform to PMML and store in known S3 location)? Do we implement a service provider interface with provider-specific plugins? Something else?

Avoid throwing exception when trying to create a metric and it already exists

In MetricEventHandler, if a metric already exists in the database, the handler throws an exception to avoid trying to reinsert it. At scale, this will happen continuously. We have a handler wrapping that to avoid filling the log with exceptions.

We would like to avoid throwing exceptions in what is definitely not an exceptional situation. The main reason is that all metrics enter through the front door, and if there are easy ways to avoid throwing exceptions continuously then it seems like we should adopt them. https://stackoverflow.com/questions/567579/how-expensive-are-exceptions

(I understand the whole concept of premature optimization, but doing the analysis here would take more effort than simply replacing the exception-based control flow with a non-exception-based control flow.)

The suggestion is to provide an overriding implementation of the JPA MetricRepository.create() method that performs an existence check and inserts only if the metric doesn't exist.
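
A hypothetical create-if-absent sketch of that suggestion; MetricRepository, findByKey, and save are stand-ins for whatever the actual JPA repository exposes, defined inline here only to keep the sketch self-contained:

import java.util.Optional;

public class MetricCreator {

    private final MetricRepository repository;

    public MetricCreator(MetricRepository repository) {
        this.repository = repository;
    }

    // Inserts the metric only if no metric with the same key already exists.
    public Metric createIfAbsent(Metric metric) {
        Optional<Metric> existing = repository.findByKey(metric.getKey());
        return existing.orElseGet(() -> repository.save(metric));
    }

    interface MetricRepository {
        Optional<Metric> findByKey(String key);
        Metric save(Metric metric);
    }

    interface Metric {
        String getKey();
    }
}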

Alternatively, we could implement this functionality in the DB itself (a create-if-absent sort of thing), but this is less preferable since it would tie us to a particular DBMS.

Options for transformations on metrics stream

Reviewing options that work well for our use case for running transformations (aggregation, group-by, math functions) on metrics streamed through Kafka.

Kafka Streams and KTables are probably the best fit for reading, transforming, and writing the metrics data back to Kafka (see the sketch below).
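
As a point of reference for that option, here is a minimal windowed-aggregation sketch, assuming a recent Kafka Streams version with the Duration-based windowing API; the topic names and the one-minute sum are illustrative:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Double> metrics =
        builder.stream("aa-metric-values", Consumed.with(Serdes.String(), Serdes.Double()));
metrics
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))    // group each metric key into 1-minute windows
    .reduce(Double::sum)                                  // aggregate (here: sum) within each window
    .toStream((windowedKey, value) -> windowedKey.key())  // drop the window back to the plain metric key
    .to("aa-metric-values-1m", Produced.with(Serdes.String(), Serdes.Double()));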

Kapacitor (https://www.influxdata.com/blog/announcing-kapacitor-an-open-source-streaming-and-batch-time-series-processor/) provides efficient transformations using InfluxQL but does not yet support reading directly from a Kafka source.

Please let us know about open source solutions we could review for use.

Fix the broken Kafka e2e test (Scala)

The Scala e2e test in the Kafka module is currently ignored and probably doesn't work as we haven't been maintaining it. We would like it to work.

Remove dependency on haystack-commons

AA shouldn't have any Haystack dependencies. In particular we will need to ensure that we can handle the following without haystack-commons:

  • Metric-related data structures and serdes
  • KStream app configuration

Add e2e integration testing based on docker-compose

Currently it is very easy for a change to break the Kafka apps. For example when I moved a Spring Boot-based Kafka app into the kafka module, it broke the Docker containers because Spring Boot puts the classes in BOOT-INF, and so our Docker startup scripts for the non-Boot apps could no longer find them. We need a way to prevent this from happening.

Our thought is to add e2e integration tests using docker-compose to set up the overall flow.

Move RCF detector out of AA into its own repo

The RCF detector should be an external detector since it involves offline training and it's not a basic statistical algo. We need to move this out of AA into a separate repo. We also need to remove the AWS dependencies from AA.

Clean up the MappedMetricData data schema

There are several issues:

  • The metric definition appears twice
  • Detector UUID appears twice.
  • Detector UUID appears as detectorUuid one time and detectorUUID the other time.
  • The schema includes the investigationResults field, which we no longer use in AA
  • We reference prediction/threshold related fields that don't make sense for many of the detectors

Whoever takes this issue, let's discuss more carefully what the target data schema should be.

Expedia-specific: We will need to roll this out in a way that minimizes client impact. Some potential coordination areas:

  • Aquila is posting anomalies, so we will need to update it
  • Seyren will be posting anomalies, so we will need to update that
  • Vector is reading anomalies
  • Armor is reading anomalies
  • I don't think Haystack is reading anomalies directly, but double-check

Here's an example:

  {
    "metricData": {
      "metricDefinition": {
        "tags": {
          "kv": {
            "mtype": "gauge",
            "unit": "metric",
            "what": "[redacted]",
            "m_application": "[redacted]",
            "org_id": "1",
            "name": "[redacted]",
            "interval": "60",
            "m_account": "[redacted]",
            "m_region": "us-west-2"
          },
          "v": []
        },
        "meta": {
          "kv": {},
          "v": []
        },
        "key": "[redacted]"
      },
      "value": 0,
      "timestamp": 1542295565
    },
    "detectorUuid": "37ef6be4-dd86-11e8-956a-061bc4b8b7e4",
    "detectorType": "constant-detector",
    "anomalyResult": {
      "detectorUUID": "37ef6be4-dd86-11e8-956a-061bc4b8b7e4",
      "metricData": {
        "metricDefinition": {
          "tags": {
            "kv": {
              "mtype": "gauge",
              "unit": "metric",
              "what": "[redacted]",
              "m_application": "[redacted]",
              "org_id": "1",
              "name": "[redacted]",
              "interval": "60",
              "m_account": "[redacted]",
              "m_region": "us-west-2"
            },
            "v": []
          },
          "meta": {
            "kv": {},
            "v": []
          },
          "key": "[redacted]"
        },
        "value": 0,
        "timestamp": 1542295565
      },
      "anomalyLevel": "NORMAL",
      "predicted": null,
      "thresholds": null,
      "investigationResults": null
    }
  }

Also, this issue gives us a good opportunity to document an approach to schema migration.

Updates

  • Probably MappedMetricData shouldn't have an AnomalyResult at all.
  • Probably we should push AnomalyResult onto the anomaly topic instead of pushing a MappedMetricData onto that topic.
