phdata / pulse

phData Pulse application log aggregation and monitoring

License: Apache License 2.0

Makefile 1.35% Scala 77.40% Shell 5.59% Python 7.08% Java 8.57%
log-aggregation solr solrcloud hadoop akka-streams scala csd

pulse's Introduction


Hadoop log aggregation, alerting, and lifecycle management


Pulse

Pulse is an Apache 2.0 licensed log aggregation framework built on top of Solr Cloud (Cloudera Search). It can be used with applications written in any language, but was built especially for improving logging in Apache Spark Streaming applications running on Apache Hadoop.

Pulse gives application users centralized, full-text search of their logs and flexible alerting on log content, and it works with several visualization tools.

Pulse handles log lifecycle, so application developers don't have to worry about rotating or maintaining log indexes themselves.

See our documentation page on readthedocs.org for more details: https://pulse-logging.readthedocs.io/en/latest/

pulse's People

Contributors

afoerster, astadtler, bpmcd, brockn, caseycrawford, cattmarlin, davidbluml, edemiraydin, jtbirdsell, keithssmith, kjmccarthy, mariyamg, namanj, niveditha-phdata, paladin235, raymondblanc, sada3390, safwanislam, shashireddypalle, sumitbsn


pulse's Issues

Create default configurations for first run

The default configuration would allow the Collection Roller and Alert Engine to start on a first install and validate that everything is working. It can use a test collection named 'pulse-test-default'.

Add a role action to trigger test email

  • Create a main class that takes arguments (the config and an email address) and sends an email (see the sketch after this list)
  • Add a role action in 'control.sh' and the service.sdl CSD file
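
A minimal sketch of such a main class, assuming Scallop is used for argument parsing as elsewhere in the project; the wiring to the existing mail/notification code is left as a placeholder and the option names are illustrative:

import org.rogach.scallop.ScallopConf

object TestEmailMain {

  // Hypothetical argument parser for this role action; option names are illustrative.
  class Args(arguments: Seq[String]) extends ScallopConf(arguments) {
    val conf  = opt[String]("conf", required = true, descr = "Path to the alert engine configuration")
    val email = opt[String]("email", required = true, descr = "Recipient address for the test email")
    verify()
  }

  def main(args: Array[String]): Unit = {
    val parsed = new Args(args)
    // Load SMTP settings from the configuration file and send a single test message.
    // Wire this to the project's existing mail/notification code; println is a stand-in.
    println(s"Would send a test email to ${parsed.email()} using config ${parsed.conf()}")
  }
}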

Adding a non-string value to the MDC can cause the app to hang

When using the HttpAppender, adding a non-string value to the MDC can cause the application to hang.

The error:

Exception in thread "HTTP appender dispatcher" java.lang.RuntimeException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
	at io.phdata.pulse.log.HttpAppender$Dispatcher.flush(HttpAppender.java:357)
	at io.phdata.pulse.log.HttpAppender$Dispatcher.run(HttpAppender.java:341)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
	at io.phdata.pulse.log.JsonParser.marshallEventInternal(JsonParser.java:68)
	at io.phdata.pulse.log.JsonParser.marshallArray(JsonParser.java:24)
	at io.phdata.pulse.log.HttpAppender$Dispatcher.flush(HttpAppender.java:355)

The location:

      for (Map.Entry<String, String> entry : props) {
        jg.writeStringField(entry.getKey(), entry.getValue());
      }

I think the easy fix is to call toString on entry.getValue(); I don't see any downsides to this. A sketch of the fix follows.
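
A hedged sketch of that fix in the same loop, in Java to match the snippet above, assuming the MDC properties are iterated with Object values (the MDC map is effectively untyped):

      for (Map.Entry<String, Object> entry : props) {
        // MDC values are not guaranteed to be Strings; convert defensively instead of casting.
        jg.writeStringField(entry.getKey(), String.valueOf(entry.getValue()));
      }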

Application ID logging for Spark Driver

Pulse will automatically log the application ID for executors based on environment variables passed into YARN containers, but the same method isn't working for the drivers.

Figure out a way to automatically log application IDs in drivers, or write an example using the log4j MDC.
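
For reference, a minimal sketch of the MDC workaround in a driver, assuming Pulse indexes a field such as "application_id" (the key name is illustrative, not the project's actual field name):

import org.apache.log4j.MDC
import org.apache.spark.{ SparkConf, SparkContext }

object DriverMdcExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pulse-mdc-example"))
    // Make the YARN application ID available to every log event from the driver.
    // "application_id" is an illustrative key; use whatever field name Pulse indexes.
    MDC.put("application_id", sc.applicationId)
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}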

Create an endpoint to ingest raw json

The current endpoint requires that the JSON payload conform to a LogEvent type. This LogEvent type has the data needed for log4j or Python logging, but isn't flexible if we want to insert arbitrary data, like metrics, into Pulse.

  • Create a new endpoint in LogCollectorRoutes with the path `v1/json`
  • Parse the JSON using Spray (already a project dependency) into a Map[String, String]
  • Change all code in SolrCloudStreams to work with Map[String, String] instead of LogEvent. Figure out: do we still need the LogEvent class? How can we move it up the call stack?
  • The new endpoint should dump the Map[String, String] onto the Solr stream.
  • At the end of the stream, here: solrService.insertDocuments(latestCollectionAlias, events.map(DocumentConversion.toSolrDocument)) we need a function to convert the Map into a SolrDocument (a rough sketch follows this list)
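
A rough, hypothetical sketch of the new route and of the Map-to-SolrDocument conversion, assuming the routes are akka-http and the payload is a flat JSON object of string values; none of the names below are the project's actual API:

import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import org.apache.solr.common.SolrInputDocument
import spray.json._
import spray.json.DefaultJsonProtocol._

object RawJsonEndpointSketch {

  // Hypothetical route; the real stream plumbing lives in SolrCloudStreams.
  val jsonRoute: Route =
    path("v1" / "json") {
      post {
        parameter("application") { application =>
          entity(as[String]) { body =>
            // Parse the raw payload into a flat Map[String, String] with Spray.
            val event: Map[String, String] = body.parseJson.convertTo[Map[String, String]]
            // Hand the map to the Solr stream here (omitted), then acknowledge the request.
            complete(s"accepted event for $application with ${event.size} fields")
          }
        }
      }
    }

  // Converting a Map[String, String] into a SolrInputDocument at the end of the stream.
  def toSolrDocument(event: Map[String, String]): SolrInputDocument = {
    val doc = new SolrInputDocument()
    event.foreach { case (key, value) => doc.addField(key, value) }
    doc
  }
}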

Client configurations?

From the base log appender:

curl -X POST -H 'Content-Type: application/json' -d '{"category": "'$category'","timestamp": '$timestamp', "level": "'$level'", "message": "'$message'", "threadName": "'$threadName'"}' http://0.0.0.0:9005/log?application=$application

It'd be nice to get this from a client config, e.g. /etc/pulse/conf/env.sh, so all I'd have to do as a user is:

source /opt/cloudera/parcels/PULSE/lib/appenders/logger.sh

and I'd be running.

Queries calling StatsComponent on TextField are failing

Arcadia is calling StatsComponent on some text fields causing the error:

 error: Error reading Solr data: Field type text_general{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported 

This should be fixed by not using the 'text_general' type (which is backed by TextField) on short fields like 'category', keeping it only for long free-text fields. Arcadia tries to aggregate on category, so I suspect this is where the issue is.

The Bash logger script adds extra single quotes around fields and breaks the log collector.

Stack trace:

Error posting documents to solr
org.apache.solr.common.SolrException: Could not find collection : 'kafka_kudu_streaming'_latest
at org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:162)
at org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:324)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:563)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at io.phdata.pulse.common.SolrService.insertDocuments(SolrService.scala:176)

Can be fixed by updating the logger.sh script to not quote every property, or by sanitizing the params in the log-collector POST handler...

Add secured solrconfig set option

It can be copied from solrconfigv2. There are some changes made to solrconfig.xml that will need to be moved to the secure version of the file.

Empty SMTP password config results in alert engine crash

2018-08-27 14:40:20,134 INFO io.phdata.pulse.alertengine.notification.MailNotificationService: starting notification for profile mailProfile1
2018-08-27 14:40:20,185 INFO io.phdata.pulse.alertengine.notification.MailNotificationService: sending alert
2018-08-27 14:40:20,186 INFO io.phdata.pulse.alertengine.notification.Mailer: authenticating with password
2018-08-27 14:40:20,320 ERROR io.phdata.pulse.alertengine.AlertEngineMain$: caught exception in Collection Roller task
javax.mail.AuthenticationFailedException: null
at javax.mail.Service.connect(Service.java:306)
at javax.mail.Service.connect(Service.java:156)
at javax.mail.Service.connect(Service.java:105)
at javax.mail.Transport.send0(Transport.java:168)
at javax.mail.Transport.send(Transport.java:98)
at io.phdata.pulse.alertengine.notification.Mailer.sendMail(Mailer.scala:60)
at io.phdata.pulse.alertengine.notification.MailNotificationService$$anonfun$notify$1.apply(MailNotificationService.scala:35)
at io.phdata.pulse.alertengine.notification.MailNotificationService$$anonfun$notify$1.apply(MailNotificationService.scala:31)
at scala.collection.immutable.List.foreach(List.scala:392)
at io.phdata.pulse.alertengine.notification.MailNotificationService.notify(MailNotificationService.scala:31)
at io.phdata.pulse.alertengine.AlertEngineImpl$$anonfun$sendAlert$1.apply(AlertEngineImpl.scala:139)
at io.phdata.pulse.alertengine.AlertEngineImpl$$anonfun$sendAlert$1.apply(AlertEngineImpl.scala:138)
at scala.collection.immutable.List.foreach(List.scala:392)
at io.phdata.pulse.alertengine.AlertEngineImpl.sendAlert(AlertEngineImpl.scala:138)
at io.phdata.pulse.alertengine.AlertEngineImpl$$anonfun$notify$1.apply(AlertEngineImpl.scala:112)
at io.phdata.pulse.alertengine.AlertEngineImpl$$anonfun$notify$1.apply(AlertEngineImpl.scala:110)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at io.phdata.pulse.alertengine.AlertEngineImpl.notify(AlertEngineImpl.scala:110)
at io.phdata.pulse.alertengine.AlertEngineImpl.run(AlertEngineImpl.scala:45)
at io.phdata.pulse.alertengine.AlertEngineMain$AlertEngineTask.run(AlertEngineMain.scala:143)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2018-08-27 14:40:20,321 WARN io.phdata.pulse.alertengine.AlertEngineMain$: Caught exit signal, trying to cleanup tasks
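
A minimal sketch of one possible guard, assuming the mail session is built from a Properties object and an Option[String] password; the method and parameter names are illustrative, not the actual Mailer API:

import java.util.Properties
import javax.mail.{ Authenticator, PasswordAuthentication, Session }

object MailerSessionSketch {

  // Only authenticate when a non-empty password is actually configured.
  def buildSession(props: Properties, user: String, password: Option[String]): Session =
    password.filter(_.nonEmpty) match {
      case Some(pass) =>
        props.put("mail.smtp.auth", "true")
        Session.getInstance(props, new Authenticator {
          override def getPasswordAuthentication(): PasswordAuthentication =
            new PasswordAuthentication(user, pass)
        })
      case None =>
        // No (or empty) password: skip SMTP authentication instead of failing with
        // javax.mail.AuthenticationFailedException.
        props.put("mail.smtp.auth", "false")
        Session.getInstance(props)
    }
}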

Kafka Integration

This story includes all work associated with getting Kafka integration working. It can be broken out as needed into smaller pieces.

Add kafka arguments to LogCollectorCliParser

LogCollectorCliParser

Here is a rough cut:

import org.rogach.scallop.ScallopConf

class LogCollectorCliParser(args: Seq[String]) extends ScallopConf(args) {
  lazy val port    = opt[Int]("port", required = false, descr = "Listening port")
  lazy val zkHosts = opt[String]("zk-hosts", required = true, descr = "Zookeeper hosts")
  lazy val topic   = opt[String]("topic", required = false, descr = "Kafka Topic")
  lazy val mode    = opt[String]("consume-mode", required = false, descr = "'http' or 'kafka'",
                                 default = Some("http"))

  verify()
}

Verify that a mode is chosen and that it is valid. To keep backward compatibility, if no mode is chosen it should default to 'http'.

Verify that 'port' is provided in the http listen mode and that 'topic' is provided in the kafka listen mode (a sketch of these checks follows).
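
A hedged sketch of those checks as plain assertions after parsing; the helper name and its placement are illustrative:

// Illustrative validation, e.g. invoked at the top of LogCollector's main method.
def validateArgs(args: Array[String]): LogCollectorCliParser = {
  val cli  = new LogCollectorCliParser(args)
  val mode = cli.mode() // defaults to 'http' to keep backward compatibility

  require(Set("http", "kafka").contains(mode), s"Unsupported consume-mode: $mode")
  if (mode == "http") require(cli.port.isSupplied, "--port is required in http mode")
  if (mode == "kafka") require(cli.topic.isSupplied, "--topic is required in kafka mode")
  cli
}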

Expose Kafka

Branch on the new 'consume-mode' in LogCollector.scala to start listening to the Kafka topic.
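
Roughly, with hypothetical start-up helpers standing in for whatever the real entry points end up being:

// Hypothetical branch in LogCollector's startup code; startHttpServer and
// startKafkaConsumer are placeholder names, not existing methods.
cli.mode() match {
  case "kafka" => startKafkaConsumer(cli.zkHosts(), cli.topic())
  case _       => startHttpServer(cli.zkHosts(), cli.port())
}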

Integrate new arguments into control.sh

Add environment variables in control.sh for the new arguments.
Create a script in the bin directory to run the kafka consume mode by calling control.sh. This will make it easy to test changes to the scripts and arguments.

Create a test producer

Create a test producer that will put events onto a topic that will then be read by the Kafka consumer. The test producer will make it easy to run the Kafka consumer outside of unit tests, without needing a full production deployment (a sketch follows).
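
A minimal sketch of such a producer, assuming the standard kafka-clients API; the broker address and topic name are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerRecord }

object TestLogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      (1 to 10).foreach { i =>
        val event = s"""{"category":"test","level":"INFO","message":"test event $i"}"""
        producer.send(new ProducerRecord[String, String]("pulse-test", event))
      }
    } finally {
      producer.close() // flushes buffered records before exit
    }
  }
}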

Integrate new arguments into service.sdl

Service.sdl is the configuration file for the CSD: https://github.com/cloudera/cm_ext/wiki/Service-Descriptor-Language-Reference

There should be at least two new arguments, for consume-mode and topic.
The consume mode should default to 'http'.

Deploy the CSD to Valhalla and test

Test all changes with the CSD and new parcel deployed on a test cluster.

I have scripts for this that are not yet committed; hopefully they will be by the time this task is reached.

Document changes

Add a page to the docs dir describing usage and limitations, and register it in mkdocs.yml.

Messages can be lost in the HttpAppender buffer when application exits

Since Pulse v2 we have an asynchronous appender, and it doesn't get flushed properly because nothing calls 'close' when the application completes, so log events at the end of an application can get lost.

A workaround would be to close the buffer manually in a high-level finally block, or to add a shutdown hook (a sketch follows).
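
A minimal sketch of the shutdown-hook workaround, assuming log4j 1.x (LogManager.shutdown closes all appenders, which flushes the HttpAppender buffer):

import org.apache.log4j.LogManager

object PulseShutdownHook {
  def install(): Unit =
    sys.addShutdownHook {
      // Closes all appenders, flushing any log events still buffered in the HttpAppender.
      LogManager.shutdown()
    }
}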
