expediagroup / adaptive-alerting

Anomaly detection for streaming time series, featuring automated model selection.

License: Apache License 2.0

Java 92.16% R 0.19% Shell 1.68% Scala 0.83% Makefile 0.11% Dockerfile 0.22% HCL 2.51% Smarty 1.17% Python 0.46% Jupyter Notebook 0.68%
anomaly-detection anomaly outlier-detection outlier monitoring time-series streaming

adaptive-alerting's Introduction

Adaptive Alerting (AA)

Streaming anomaly detection with automated model selection and fitting.

Wiki documentation

Build

To build the Maven project:

$ ./mvnw clean verify

To build the Docker images:

$ make docker_build

How the Travis CI build works

We use Travis CI to build AA Docker images and push them to Docker Hub. Here's how it works:

  • A developer pushes a branch (master or otherwise) to GitHub.
  • GitHub kicks off a Travis CI build.
  • Travis CI reads .travis.yml, which drives the build.
  • .travis.yml invokes the top-level Makefile.
  • The top-level Makefile
    • runs a Maven build for the whole project
    • invokes module-specific Makefiles to handle building and releasing Docker images
  • Each module-specific Makefile runs one or more module-specific build scripts to
    • build the Docker images
    • release the Docker images
  • For the release (docker push), the module-specific build script delegates to the shared scripts/publish-to-docker-hub.sh script. This script pushes the image to Docker Hub if and only if the current branch is master.

adaptive-alerting's People

Contributors

ankgupta, ashishagg, ayansen, bibinss, brett-miller, dependabot[bot], derikulous, dinilatgit, djsutho, exp-dasutherland, hridyeshpant, keshavpeswani, mattcallanan, palash-goel, pranavsundriyal, shsethi, swfortier, tkamenov-expedia, tusharbahl, vldmrkl, williewheeler, zrashwani

adaptive-alerting's Issues

Create new health check to replace the one we lost when removing haystack-commons

The haystack-commons health check worked like this: when an uncaught exception occurred in Kafka, the health controller wrote to a file indicating that the app was unhealthy, and the Kubernetes liveness probe then saw that file and failed the health check.

We want something like this, but we need to coordinate it with #253 and #245. Ill-formatted metrics should now cause the health check to fail.
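
Below is a minimal sketch of the file-based liveness marker described above, assuming we reimplement the same approach ourselves; the class name and marker path are hypothetical, not the actual AA or haystack-commons implementation.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical replacement for the health check lost with haystack-commons.
public class FileHealthStatusController implements Thread.UncaughtExceptionHandler {

    // The k8s liveness probe would be configured to fail whenever this file exists,
    // e.g. an exec probe running: test ! -f /tmp/app-unhealthy
    private static final Path UNHEALTHY_MARKER = Paths.get("/tmp/app-unhealthy");

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            Files.createFile(UNHEALTHY_MARKER); // mark the app as unhealthy
        } catch (Exception ignored) {
            // best effort; the stream thread is already dying
        }
    }
}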

ad-mapper crashes when malformed metrics are sent to aa-metrics

The ad-mapper crashes when malformed metrics are sent to the aa-metrics Kafka topic.

To reproduce:
Run ad-mapper locally in Kubernetes and run salesdata-to-gaas to push to Kafka's mapped-metrics topic.

Example metrics:
{"metric":{"tags":[null,null,"m_application=sales-data","unit=count","what=bookings","lob=tshop","region=travelocity-na"],"meta":{"source":"sales-data-api"}},"epochTimeInSeconds":1541112000,"value":1.0}

Result:
ad-mapper breaks and shuts down (and is subsequently restarted by Kubernetes).

Log:

2018-11-01 22:58:00 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from REBALANCING to RUNNING
2018-11-01 22:58:00 INFO  StateChangeListener:40 - State change event called with newState=RUNNING and oldState=REBALANCING
2018-11-01 22:58:00 INFO  Fetcher:561 - [Consumer clientId=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1-consumer, groupId=ad-mapper] Resetting offset for partition aa-metrics-0 to offset 229.
2018-11-01 22:59:05 ERROR StreamThread:40 - Exception caught during Deserialization, taskId: 0_0, topic: aa-metrics, partition: 0, offset: 229
org.msgpack.core.MessageTypeException: Expected Map, but got Integer (7b)
	at org.msgpack.core.MessageUnpacker.unexpected(MessageUnpacker.java:595)
	at org.msgpack.core.MessageUnpacker.unpackMapHeader(MessageUnpacker.java:1297)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:144)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:61)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:60)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:50)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
	at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
	at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
	at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
	at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
	at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
2018-11-01 22:59:05 INFO  StreamThread:200 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
2018-11-01 22:59:05 INFO  StreamThread:1108 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Shutting down
2018-11-01 22:59:05 INFO  KafkaProducer:1054 - [Producer clientId=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1-producer] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
2018-11-01 22:59:05 INFO  StreamThread:200 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from RUNNING to ERROR
2018-11-01 22:59:05 INFO  StateChangeListener:40 - State change event called with newState=ERROR and oldState=RUNNING
2018-11-01 22:59:05 WARN  KafkaStreams:413 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] All stream threads have died. The instance will be in error state and should be closed.
2018-11-01 22:59:05 INFO  StreamThread:1128 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Shutdown complete
2018-11-01 22:59:05 ERROR StateChangeListener:50 - uncaught exception occurred running kafka streams for thread=ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:74)
	at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
	at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
	at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
	at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.msgpack.core.MessageTypeException: Expected Map, but got Integer (7b)
	at org.msgpack.core.MessageUnpacker.unexpected(MessageUnpacker.java:595)
	at org.msgpack.core.MessageUnpacker.unpackMapHeader(MessageUnpacker.java:1297)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:144)
	at com.expedia.metrics.metrictank.MessagePackSerializer.deserialize(MessagePackSerializer.java:61)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:60)
	at com.expedia.adaptivealerting.kafka.serde.MetricDataSerde$DataDeserializer.deserialize(MetricDataSerde.java:50)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
	at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
	at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
	at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
	... 7 more
2018-11-01 22:59:05 ERROR HealthStatusController:47 - Setting the app status as 'UNHEALTHY'
2018-11-01 22:59:05 INFO  Application:91 - Shutting down topology
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from ERROR to PENDING_SHUTDOWN
2018-11-01 22:59:05 INFO  StateChangeListener:40 - State change event called with newState=PENDING_SHUTDOWN and oldState=ERROR
2018-11-01 22:59:05 INFO  StreamThread:1094 - stream-thread [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063-StreamThread-1] Informed to shut down
2018-11-01 22:59:05 INFO  KafkaStreams:261 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] State transition from PENDING_SHUTDOWN to NOT_RUNNING
2018-11-01 22:59:05 INFO  KafkaStreams:867 - stream-client [ad-mapper-a0320118-b8eb-44c1-a63a-5403a8d44063] Streams client stopped completely
2018-11-01 22:59:05 INFO  Application:94 - Shutting down jmxReporter

Highlight:
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
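
The highlighted exception points directly at the Streams deserialization handler setting. Here is a minimal sketch of the configuration change it suggests, assuming Kafka Streams' built-in LogAndContinueExceptionHandler is acceptable for this topology (application id and bootstrap servers are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ad-mapper");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Log and skip records that fail deserialization instead of killing the stream thread.
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);

The default handler, LogAndFailExceptionHandler, preserves the current fail-fast behavior; LogAndContinueExceptionHandler skips the bad record and keeps the pipeline running.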

Research: pipelines via PipelineAI, Pachyderm, StreamSets

Research PipelineAI, Pachyderm, and StreamSets to determine whether any of them might be a good fit for our ML pipeline needs. Some key needs include:

  • Shipping data to S3 (ideally would like to continue being able to use AWS Glue+Athena here)
  • Scheduling model build jobs
  • Support for HPO and automated model selection
  • Hosting training/prediction workloads (any language and ML framework)

ad-manager crashes when malformed metrics are sent to mapped-metrics

The ad-manager shuts down when malformed metrics are sent to the mapped-metrics Kafka topic.

To reproduce:
Run ad-manager locally in Kubernetes and, from a command prompt, run fakespans to push to the mapped-metrics topic, e.g.
fakespans --kafka-broker="192.168.99.100:9092" -topic mapped-metrics
ad-manager breaks and shuts down (and is subsequently restarted by Kubernetes).

Log:

2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=ERROR and oldState=RUNNING
2018-10-31 00:58:37 WARN  KafkaStreams:413 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] All stream threads have died. The instance will be in error state and should be closed.
2018-10-31 00:58:37 INFO  StreamThread:1128 - stream-thread [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1] Shutdown complete
2018-10-31 00:58:37 ERROR StateChangeListener:50 - uncaught exception occurred running kafka streams for thread=ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1
org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:74)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token '$a05d37fd': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"
$a05d37fd-539e-44b9-ab6c-78574623b39b$e2ec24b3-07ad-4fc1-9f09-bcc31fa00381"some-service*  some-span0�Њ����8����"; line: 2, column: 11]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token '$a05d37fd': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"
$a05d37fd-539e-44b9-ab6c-78574623b39b$e2ec24b3-07ad-4fc1-9f09-bcc31fa00381"some-service*  some-span0�Њ����8����"; line: 2, column: 11]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:679)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3526)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2621)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:826)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:723)
    at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4141)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4000)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
    at com.expedia.adaptivealerting.kafka.serde.JsonPojoDeserializer.deserialize(JsonPojoDeserializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedDeserializer$Wrapper.deserialize(ExtendedDeserializer.java:55)
    at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:56)
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:61)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:91)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:117)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:567)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:900)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:801)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
2018-10-31 00:58:37 ERROR HealthStatusController:47 - Setting the app status as 'UNHEALTHY'
2018-10-31 00:58:37 INFO  Application:91 - Shutting down topology
2018-10-31 00:58:37 INFO  KafkaStreams:261 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] State transition from ERROR to PENDING_SHUTDOWN
2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=PENDING_SHUTDOWN and oldState=ERROR
2018-10-31 00:58:37 INFO  StreamThread:1094 - stream-thread [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc-StreamThread-1] Informed to shut down
2018-10-31 00:58:37 INFO  KafkaStreams:261 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] State transition from PENDING_SHUTDOWN to NOT_RUNNING
2018-10-31 00:58:37 INFO  StateChangeListener:40 - State change event called with newState=NOT_RUNNING and oldState=PENDING_SHUTDOWN
2018-10-31 00:58:37 INFO  KafkaStreams:867 - stream-client [ad-manager-df8e12fe-23d3-4c9e-a912-268dbd99a7dc] Streams client stopped completely
2018-10-31 00:58:37 INFO  Application:94 - Shutting down jmxReporter
2018-10-31 00:58:37 INFO  Application:97 - Shutting down logger. Bye!

Remove anomvalidate module

We've decided to limit AA's scope to time series anomaly detection and have validation occur as a separate step in the pipeline, outside of AA. Therefore we need to remove the anomvalidate module.

Create kafka-junit test for KafkaAnomalyDetectorManager

Currently the unit test is based on Kafka's test-utils package, but kafka-junit provides deeper support as it allows us to set up a mock broker with input/output topics that we can use. Update the existing unit test to use kafka-junit.
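
A minimal sketch of the kind of test this enables, assuming the charithe/kafka-junit library (its KafkaJunitRule, EphemeralKafkaBroker, and string produce/consume helpers); the topic name is illustrative, and the real test would exercise the KafkaAnomalyDetectorManager topology rather than simply echo strings:

import static org.junit.Assert.assertEquals;

import com.github.charithe.kafka.EphemeralKafkaBroker;
import com.github.charithe.kafka.KafkaJunitRule;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.junit.Rule;
import org.junit.Test;

public class KafkaAnomalyDetectorManagerKafkaJunitTest {

    // Spins up an embedded broker per test, with real input/output topics.
    @Rule
    public KafkaJunitRule kafkaRule = new KafkaJunitRule(EphemeralKafkaBroker.create());

    @Test
    public void producesAndConsumesAgainstEmbeddedBroker() throws Exception {
        kafkaRule.helper().produceStrings("mapped-metrics", "record-1", "record-2");
        List<String> records =
                kafkaRule.helper().consumeStrings("mapped-metrics", 2).get(10, TimeUnit.SECONDS);
        assertEquals(2, records.size());
    }
}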

Create Kafka Streams app to feed anomalies back into the metrics topic for visualization

We want to be able to overlay anomalies on metric data for visualization in tools like Grafana. So we need a Kafka Streams app that

  • reads anomalies from the anomalies topic,
  • transforms the anomalies into metric data, and
  • pushes the metric data onto the mdm topic

From an implementation perspective, the transformer shouldn't be tied to Kafka since it's conceivable that somebody would want to use other messaging technologies like Kinesis or RabbitMQ.

The transformation needs to enrich the original metric with a key/value tag with key detector_uuid and value being the detector UUID associated with the anomaly. The transformation must reject anomalies with this tag already specified.
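
A minimal, Kafka-agnostic sketch of the tag-enrichment rule described above; the map-based types and class name are simplified stand-ins, not the actual AA domain classes:

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class AnomalyToMetricTransformer {

    static final String DETECTOR_UUID_TAG = "detector_uuid";

    // Enriches the original metric's tags with the detector UUID associated with the anomaly.
    public Map<String, String> transformTags(Map<String, String> metricTags, UUID detectorUuid) {
        if (metricTags.containsKey(DETECTOR_UUID_TAG)) {
            // Reject anomalies whose metric already carries the detector_uuid tag.
            throw new IllegalArgumentException("Metric already contains a " + DETECTOR_UUID_TAG + " tag");
        }
        Map<String, String> enriched = new HashMap<>(metricTags);
        enriched.put(DETECTOR_UUID_TAG, detectorUuid.toString());
        return enriched;
    }
}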

StreamsAppConfigLoader does not apply config overrides

Locally and in test we are observing that the kstreams apps fail to apply the configuration overrides to the base config. The result is that the app starts up with the base config (the one in the JAR).
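
For reference, this is roughly the merge behavior we expect, sketched with the Lightbend (Typesafe) Config library; whether StreamsAppConfigLoader actually uses this library, and the resource and property names below, are assumptions for illustration only:

import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Base config bundled in the JAR (hypothetical resource name).
Config base = ConfigFactory.parseResources("base.conf");
// External override file, located via a hypothetical system property.
Config overrides = ConfigFactory.parseFile(new File(System.getProperty("config.override.path")));
// Overrides should win; anything they omit falls back to the base config.
Config effective = overrides.withFallback(base).resolve();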

Remove JsonPojoSerde and associated de/serializer

These are deprecated, having been replaced by type-specific de/serializers. (The generic versions are typically instantiated by reflection by a client that has no idea it needs to read the POJO class from the config and set it on the deserializer.)

May need to update the Terraform module config to remove references to the POJO serde and/or de/serializers. May also need to implement type-specific replacements.

Create kafka-junit test for KafkaAnomalyDetectorMapper

Currently the unit test is based on Kafka's test-utils package, but kafka-junit provides deeper support as it allows us to set up a mock broker with input/output topics that we can use. Update the existing unit test to use kafka-junit.

Remove Spring from the Kafka Notifier app

In general we would prefer not to use Spring for the Kafka Streams apps, as these are pretty lightweight and don't really warrant pulling Spring in. The Notifier app is currently based on Spring Boot. We would like to remove Spring from this particular app.

Clarify which password `make release` is requesting

The make process produces this output:

Releasing
DOCKER_ORG=expediadotcom, DOCKER_IMAGE_NAME=haystack-adaptive-alerting-metric-router, QUALIFIED_DOCKER_IMAGE_NAME=expediadotcom/haystack-adaptive-alerting-metric-router
BRANCH=, TAG=, SHA=
Password:
Error response from daemon: Get https://registry-1.docker.io/v2/: unauthorized: incorrect username or password
make[1]: *** [release] Error 1
make: *** [release] Error 2

Please clarify which password is being requested in the prompt to avoid people sending incorrect credentials to docker.io.

Support external detectors in a fully-decoupled way

Problem

Currently, to add a new detector type, we have to create it in the AA codebase itself. External detectors should be integrated dynamically at runtime without requiring AA to be restarted.

Also external detectors should not be required to know anything about AA. This includes AA APIs/SPIs, message formats, and endpoints. We should be able to integrate detectors that were developed completely independently of AA.

As part of this issue, we would want to move both Aquila and RCF out of the AA codebase.

Proposed design

(Diagram: aa-arch-logical)

The idea is that the mapper would push messages onto type-specific topics where they can be picked up by the detectors. So AA doesn't know anything about the detector other than the name of its topic, which would be a data record added to the AA database at integration time.

The metric that AA pushes is in an AA-specific format, so it may seem that the detectors are coupled to AA, but this is easy to avoid. A detector implementer will generally have a detector with its own proprietary expectations around request/response formats, and integration would involve creating an adapter (implemented as a Kafka Streams app) that wraps the detector in a way that enables the integration:

(Diagram: aa-detector-ext)

The adapter would:

  • listen to the input topic,
  • map the AA metric to the format that the detector expects,
  • invoke the detector,
  • get the response,
  • map it to the output format that AA expects, and finally
  • push the mapped output to the AA anomalies topic.

The deployment of external detectors is totally independent of AA.
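
A minimal sketch of such an adapter topology, assuming string-serialized messages; the topic name, class name, and the three mapping/invocation methods are hypothetical placeholders, not actual AA components:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class ExternalDetectorAdapter {

    public Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("my-detector-input-topic");
        input
            .mapValues(this::toDetectorRequest)   // AA metric -> detector's own request format
            .mapValues(this::invokeDetector)      // call the external detector
            .mapValues(this::toAaAnomaly)         // detector response -> AA anomaly format
            .to("anomalies");                     // push onto the AA anomalies topic
        return builder.build();
    }

    private String toDetectorRequest(String aaMetricJson) { return aaMetricJson; } // placeholder
    private String invokeDetector(String request) { return request; }              // placeholder
    private String toAaAnomaly(String response) { return response; }               // placeholder
}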

Support left-, right- and two-tailed anomaly detection in general

In many cases, we want to generate anomalies only if observations fall too far on one particular side of some point forecast. For example, for bookings, we generally care more about drops than spikes. (Bookings spikes can be anomalous and interesting too, but generally we care more about drops.) Another example would be Haystack telemetry, where we care about spikes in the error rate, latencies, and durations, but about both spikes and drops in request volume.

We already support one- and two-tailed constant threshold detectors, but not other detector types. We need to extend this support to some common cases, including EWMA, PEWMA, and Holt-Winters. The solution should be one we can apply to other detectors based on point forecasts.

Start with:

  • EWMA
  • PEWMA
  • Holt-Winters
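
A minimal sketch of what left-, right-, and two-tailed evaluation against a point forecast could look like; the enum, class name, and simple band math are illustrative, not AA's actual detector classes:

public class TailedThresholdSketch {

    enum Tail { LEFT, RIGHT, TWO }

    // Returns true if the observation is anomalous for the given tail and band width.
    static boolean isAnomalous(double observed, double forecast, double band, Tail tail) {
        switch (tail) {
            case LEFT:  return observed < forecast - band;            // drops only (e.g. bookings)
            case RIGHT: return observed > forecast + band;            // spikes only (e.g. error rate)
            case TWO:   return Math.abs(observed - forecast) > band;  // either direction (e.g. request volume)
            default:    throw new IllegalArgumentException("Unknown tail: " + tail);
        }
    }
}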

Remove @author tags

One of our team values is shared code ownership, and including author tags tends to work against this goal. We can see who wrote which code anytime we want (git blame).

Integration with AWS SageMaker Random Cut Forest algorithm for anomaly detection

https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/

This is largely R&D work at this point, aimed at proving that we have an integration strategy for AWS SageMaker. Benefits include:

  • Having access to trained models, in addition to the simple stream filters we already have. In particular, in the future we intend to integrate our internally-developed Aquila model, which we train offline, and so SageMaker integration will give us an option here.
  • Having access to new AWS-provided anomaly detection models as they appear.
  • Having a way to interface with non-SageMaker ML pipelines. (Expedia Group has multiple such pipelines, and it would be great to be able to integrate with any of them.)
  • Having access to models developed in ML frameworks such as TensorFlow, MXNet, etc.
  • Lowering the bar for developing, training and deploying new anomaly detection models.

Some questions we want to resolve:

  • How do we want to use the SageMaker models? Does AA call models running live in SageMaker? Does AA read the models from S3 and then apply them locally as filters? Both?
  • Relatedly, what are the performance characteristics of the two approaches I just described?
  • Also relatedly, which format(s) does SageMaker use when storing the model? Is there some Java SDK for reading those models into AA, or would we need to build Java versions along with a deserializer? We need to resolve this early on to ensure we have some reasonable way to read the models.
  • What's the integration strategy for supporting multiple pipelines? For example, do we want pipelines to dump models in a common format to a common location (e.g. transform to PMML and store in known S3 location)? Do we implement a service provider interface with provider-specific plugins? Something else?

Avoid throwing exception when trying to create a metric and it already exists

In MetricEventHandler, if a metric already exists in the database, the handler throws an exception to avoid trying to reinsert it. At scale, this will happen continuously. We have a handler wrapping that to avoid filling the log with exceptions.

We would like to avoid throwing exceptions in what is definitely not an exceptional situation. The main reason is that all metrics enter through the front door, and if there are easy ways to avoid throwing exceptions continuously then it seems like we should adopt them. https://stackoverflow.com/questions/567579/how-expensive-are-exceptions

(I understand the whole concept of premature optimization, but doing the analysis here would take more effort than simply replacing the exception-based control flow with a non-exception-based control flow.)

The suggestion is to provide an overriding implementation of the JPA MetricRepository.create() method that performs an existence check and inserts only if the metric doesn't exist.
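
A hypothetical create-if-absent sketch of that suggestion; MetricRepository, findByKey, and save are stand-ins for whatever the actual JPA repository exposes, defined inline here only to keep the sketch self-contained:

import java.util.Optional;

public class MetricCreator {

    private final MetricRepository repository;

    public MetricCreator(MetricRepository repository) {
        this.repository = repository;
    }

    // Inserts the metric only if no metric with the same key already exists.
    public Metric createIfAbsent(Metric metric) {
        Optional<Metric> existing = repository.findByKey(metric.getKey());
        return existing.orElseGet(() -> repository.save(metric));
    }

    interface MetricRepository {
        Optional<Metric> findByKey(String key);
        Metric save(Metric metric);
    }

    interface Metric {
        String getKey();
    }
}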

Alternatively, we could implement this functionality in the DB itself (a create-if-absent sort of thing), but this is less preferable since it would tie us to a particular DBMS.

Options for transformations on metrics stream

Reviewing options that work well for our use case for running transformations (aggregation, group-by, math functions) on metrics streamed through Kafka.

Kafka Streams and KTables are probably the best fit for reading, transforming, and writing the metrics data back to Kafka (see the sketch below).
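
As a point of reference for that option, here is a minimal windowed-aggregation sketch, assuming a recent Kafka Streams version with the Duration-based windowing API; the topic names and the one-minute sum are illustrative:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Double> metrics =
        builder.stream("aa-metric-values", Consumed.with(Serdes.String(), Serdes.Double()));
metrics
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))    // group each metric key into 1-minute windows
    .reduce(Double::sum)                                  // aggregate (here: sum) within each window
    .toStream((windowedKey, value) -> windowedKey.key())  // drop the window back to the plain metric key
    .to("aa-metric-values-1m", Produced.with(Serdes.String(), Serdes.Double()));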

Kapacitor (https://www.influxdata.com/blog/announcing-kapacitor-an-open-source-streaming-and-batch-time-series-processor/) provides efficient transformations using InfluxQL but does not yet support reading directly from a Kafka source.

Please let us know about open source solutions we could review for use.

Fix the broken Kafka e2e test (Scala)

The Scala e2e test in the Kafka module is currently ignored and probably doesn't work as we haven't been maintaining it. We would like it to work.

Remove dependency on haystack-commons

AA shouldn't have any Haystack dependencies. In particular we will need to ensure that we can handle the following without haystack-commons:

  • Metric-related data structures and serdes
  • KStream app configuration

Add e2e integration testing based on docker-compose

Currently it is very easy for a change to break the Kafka apps. For example when I moved a Spring Boot-based Kafka app into the kafka module, it broke the Docker containers because Spring Boot puts the classes in BOOT-INF, and so our Docker startup scripts for the non-Boot apps could no longer find them. We need a way to prevent this from happening.

Our thought is to add e2e integration tests using docker-compose to set up the overall flow.

Move RCF detector out of AA into its own repo

The RCF detector should be an external detector since it involves offline training and it's not a basic statistical algo. We need to move this out of AA into a separate repo. We also need to remove the AWS dependencies from AA.

Clean up the MappedMetricData data schema

There are several issues:

  • The metric definition appears twice
  • Detector UUID appears twice.
  • Detector UUID appears as detectorUuid one time and detectorUUID the other time.
  • The schema includes the investigationResults field, which we no longer use in AA
  • We reference prediction/threshold related fields that don't make sense for many of the detectors

Whoever takes this issue, let's discuss more carefully what the target data schema should be.

Expedia-specific: We will need to roll this out in a way that minimizes client impact. Some potential coordination areas:

  • Aquila is posting anomalies, so we will need to update it
  • Seyren will be posting anomalies, so we will need to update that
  • Vector is reading anomalies
  • Armor is reading anomalies
  • I don't think Haystack is reading anomalies directly, but double-check

Here's an example:

  {
    "metricData": {
      "metricDefinition": {
        "tags": {
          "kv": {
            "mtype": "gauge",
            "unit": "metric",
            "what": "[redacted]",
            "m_application": "[redacted]",
            "org_id": "1",
            "name": "[redacted]",
            "interval": "60",
            "m_account": "[redacted]",
            "m_region": "us-west-2"
          },
          "v": []
        },
        "meta": {
          "kv": {},
          "v": []
        },
        "key": "[redacted]"
      },
      "value": 0,
      "timestamp": 1542295565
    },
    "detectorUuid": "37ef6be4-dd86-11e8-956a-061bc4b8b7e4",
    "detectorType": "constant-detector",
    "anomalyResult": {
      "detectorUUID": "37ef6be4-dd86-11e8-956a-061bc4b8b7e4",
      "metricData": {
        "metricDefinition": {
          "tags": {
            "kv": {
              "mtype": "gauge",
              "unit": "metric",
              "what": "[redacted]",
              "m_application": "[redacted]",
              "org_id": "1",
              "name": "[redacted]",
              "interval": "60",
              "m_account": "[redacted]",
              "m_region": "us-west-2"
            },
            "v": []
          },
          "meta": {
            "kv": {},
            "v": []
          },
          "key": "[redacted]"
        },
        "value": 0,
        "timestamp": 1542295565
      },
      "anomalyLevel": "NORMAL",
      "predicted": null,
      "thresholds": null,
      "investigationResults": null
    }
  }

Also, this issue gives us a good opportunity to document an approach to schema migration.

Updates

  • Probably MappedMetricData shouldn't have an AnomalyResult at all.
  • Probably we should push AnomalyResult onto the anomaly topic instead of pushing a MappedMetricData onto that topic.
