plugin-server's Introduction

⚠️ This repository is archived – the plugin server now lives in the https://github.com/posthog/posthog monorepo ⚠️

PostHog Plugin Server


This service takes care of processing events with plugins and more.

Get started

Let's get you developing the plugin server in no time:

  1. Install dependencies and prepare for takeoff by running the command yarn.

  2. Start a development instance of PostHog - instructions here. After all, this is the PostHog Plugin Server, and it works in conjunction with the main server. To avoid interference, disable the plugin server there by setting the PLUGIN_SERVER_IDLE env variable before running: PLUGIN_SERVER_IDLE=true ./bin/start

  3. Make sure that the plugin server is configured correctly (see Configuration). Two settings that you MUST get right are DATABASE_URL and REDIS_URL - they need to be identical between the plugin server and the main server.

  4. If developing the enterprise Kafka + ClickHouse pipeline, set KAFKA_ENABLED to true and provide KAFKA_HOSTS plus CLICKHOUSE_HOST, CLICKHOUSE_DATABASE, CLICKHOUSE_USER, and CLICKHOUSE_PASSWORD.

    Otherwise, if developing the basic Redis + Postgres pipeline, skip ahead.

  5. Start the plugin server in autoreload mode with yarn start, or in compiled mode with yarn build && yarn start:dist, and develop away!

  6. To run migrations for the test databases, run yarn setup:test:postgres or yarn setup:test:clickhouse. Run Postgres pipeline tests with yarn test:postgres:{1,2}. Run ClickHouse pipeline tests with yarn test:clickhouse:{1,2}. Run benchmarks with yarn benchmark.

Alternative modes

This program's main mode of operation is processing PostHog events, but there are also a few alternative utility ones. Each one does a single thing. They are listed in the table below, in order of precedence.

| Name | Description | CLI flags |
| --- | --- | --- |
| Help | Show plugin server configuration options | -h, --help |
| Version | Only show currently running plugin server version | -v, --version |
| Healthcheck | Check plugin server health and exit with 0 or 1 | --healthcheck |
| Migrate | Migrate Graphile job queue | --migrate |
| Idle | Start server in a completely idle, non-processing mode | --idle |

Configuration

There's a multitude of settings you can use to control the plugin server. Use them as environment variables.

| Name | Description | Default value |
| --- | --- | --- |
| DATABASE_URL | Postgres database URL | 'postgres://localhost:5432/posthog' |
| REDIS_URL | Redis store URL | 'redis://localhost' |
| BASE_DIR | base path for resolving local plugins | '.' |
| WORKER_CONCURRENCY | number of concurrent worker threads | 0 – all cores |
| TASKS_PER_WORKER | number of parallel tasks per worker thread | 10 |
| REDIS_POOL_MIN_SIZE | minimum number of Redis connections to use per thread | 1 |
| REDIS_POOL_MAX_SIZE | maximum number of Redis connections to use per thread | 3 |
| SCHEDULE_LOCK_TTL | how many seconds to hold the lock for the schedule | 60 |
| CELERY_DEFAULT_QUEUE | Celery outgoing queue | 'celery' |
| PLUGINS_CELERY_QUEUE | Celery incoming queue | 'posthog-plugins' |
| PLUGINS_RELOAD_PUBSUB_CHANNEL | Redis channel for reload events | 'reload-plugins' |
| CLICKHOUSE_HOST | ClickHouse host | 'localhost' |
| CLICKHOUSE_DATABASE | ClickHouse database | 'default' |
| CLICKHOUSE_USER | ClickHouse username | 'default' |
| CLICKHOUSE_PASSWORD | ClickHouse password | null |
| CLICKHOUSE_CA | ClickHouse CA certs | null |
| CLICKHOUSE_SECURE | whether to secure ClickHouse connection | false |
| KAFKA_ENABLED | use Kafka instead of Celery to ingest events | false |
| KAFKA_HOSTS | comma-delimited Kafka hosts | null |
| KAFKA_CONSUMPTION_TOPIC | Kafka incoming events topic | 'events_plugin_ingestion' |
| KAFKA_CLIENT_CERT_B64 | Kafka certificate in Base64 | null |
| KAFKA_CLIENT_CERT_KEY_B64 | Kafka certificate key in Base64 | null |
| KAFKA_TRUSTED_CERT_B64 | Kafka trusted CA in Base64 | null |
| KAFKA_PRODUCER_MAX_QUEUE_SIZE | Kafka producer batch max size before flushing | 20 |
| KAFKA_FLUSH_FREQUENCY_MS | Kafka producer batch max duration before flushing | 500 |
| KAFKA_MAX_MESSAGE_BATCH_SIZE | Kafka producer batch max size in bytes before flushing | 900000 |
| LOG_LEVEL | minimum log level | 'info' |
| SENTRY_DSN | Sentry ingestion URL | null |
| STATSD_HOST | StatsD host - integration disabled if this is not provided | null |
| STATSD_PORT | StatsD port | 8125 |
| STATSD_PREFIX | StatsD prefix | 'plugin-server.' |
| DISABLE_MMDB | whether to disable MMDB IP location capabilities | false |
| INTERNAL_MMDB_SERVER_PORT | port of the internal server used for IP location (0 means random) | 0 |
| DISTINCT_ID_LRU_SIZE | size of persons distinct ID LRU cache | 10000 |
| PLUGIN_SERVER_IDLE | whether to disengage the plugin server, e.g. for development | false |
| CAPTURE_INTERNAL_METRICS | whether to capture internal metrics for posthog in posthog | false |
| PISCINA_USE_ATOMICS | corresponds to the piscina useAtomics config option (https://github.com/piscinajs/piscina#constructor-new-piscinaoptions) | true |
| PISCINA_ATOMICS_TIMEOUT | (advanced) corresponds to the length of time (in ms) a piscina worker should block for when looking for tasks - instances with high volumes (100+ events/sec) might benefit from setting this to a lower value | 5000 |
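
As a rough, hedged sketch of how such environment-based settings are typically consumed (illustrative only – the names and defaults mirror the table above, but this is not the server's actual config loader):

interface PluginServerConfig {
    DATABASE_URL: string
    REDIS_URL: string
    WORKER_CONCURRENCY: number
    TASKS_PER_WORKER: number
    KAFKA_ENABLED: boolean
}

// Read a handful of the settings above, falling back to the documented defaults.
function configFromEnv(env: NodeJS.ProcessEnv = process.env): PluginServerConfig {
    return {
        DATABASE_URL: env.DATABASE_URL ?? 'postgres://localhost:5432/posthog',
        REDIS_URL: env.REDIS_URL ?? 'redis://localhost',
        WORKER_CONCURRENCY: Number(env.WORKER_CONCURRENCY ?? 0), // 0 means "use all cores"
        TASKS_PER_WORKER: Number(env.TASKS_PER_WORKER ?? 10),
        KAFKA_ENABLED: env.KAFKA_ENABLED === 'true',
    }
}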

Releasing a new version

Just bump the version in package.json on the main branch and the new version will be published automatically, with a matching PR created in the main PostHog repo.

It's advised to use a bump patch/minor/major label on PRs - that way the above is done automatically when the PR is merged.

Courtesy of GitHub Actions.

Walkthrough

The story begins with pluginServer.ts -> startPluginServer, which is the main thread of the plugin server.

This main thread spawns WORKER_CONCURRENCY worker threads, managed using Piscina. Each worker thread runs TASKS_PER_WORKER tasks (concurrentTasksPerWorker).
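
As a hedged illustration of how those two settings interact (Piscina's maxThreads and concurrentTasksPerWorker options are real; the filename and the commented-out task invocation are placeholders, not the server's actual makePiscina code):

import Piscina from 'piscina'

const piscina = new Piscina({
    filename: __dirname + '/worker.js', // placeholder entry point run in each worker thread
    maxThreads: Number(process.env.WORKER_CONCURRENCY) || undefined, // undefined lets Piscina pick based on CPU count
    concurrentTasksPerWorker: Number(process.env.TASKS_PER_WORKER) || 10,
})

// Up to maxThreads * concurrentTasksPerWorker tasks run at once; anything beyond
// that waits in Piscina's internal queue.
// await piscina.run({ task: 'processEvent', args: { event } })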

Main thread

Let's talk about the main thread first. This has:

  1. pubSub – Redis powered pub-sub mechanism for reloading plugins whenever a message is published by the main PostHog app.

  2. hub – Handler of connections to required DBs and queues (ClickHouse, Kafka, Postgres, Redis), holds loaded plugins. Created via hub.ts -> createHub. Every thread has its own instance.

  3. piscina – Manager of tasks delegated to threads. makePiscina creates the manager, while createWorker creates the worker threads.

  4. scheduleControl – Controller of scheduled jobs. Responsible for adding Piscina tasks for scheduled jobs, when the time comes. The schedule information makes it into the controller when plugin VMs are created.

    Scheduled tasks are controlled with Redlock (redis-based distributed lock), and run on only one plugin server instance in the entire cluster.

  5. jobQueueConsumer – The internal job queue consumer. This enables retries and scheduling jobs to run once at some point in the future (which is what distinguishes it from scheduleControl). While scheduleControl is triggered via runEveryMinute and runEveryHour tasks, the jobQueueConsumer deals with meta.jobs.doX(event).runAt(new Date()). See the sketch after this list.

    Jobs are enqueued by job-queue-manager.ts, which is backed by Postgres-based Graphile-worker (graphile-queue.ts).

  6. queue – Event ingestion queue. This is a Celery (backed by Redis) or Kafka queue, depending on the setup (EE/Cloud is Kafka due to high volume). Events consumed from this queue are sent off to the Piscina workers (src/main/ingestion-queues/queue.ts -> ingestEvent). Since all of the actual ingestion happens inside worker threads, you'll find the specific ingestion code there (src/worker/ingestion/ingest-event.ts). There the data is saved into Postgres (and ClickHouse via Kafka on EE/Cloud).

    It's also a good idea to see the producer side of this ingestion queue, which comes from Posthog/posthog/api/capture.py. The plugin server gets the process_event_with_plugins Celery task from there, in the Postgres pipeline. The ClickHouse via Kafka pipeline gets the data by way of Kafka topic events_plugin_ingestion.

  7. mmdbServer – TCP server, which works as an interface between the GeoIP MMDB data reader located in main thread memory and plugins run in worker threads of the same plugin server instance. This way the GeoIP reader is only loaded in one thread and can be used in all. Additionally this mechanism ensures that mmdbServer is ready before ingestion is started (the database is downloaded from http-mmdb and read), and keeps the database up to date in the background.
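
To make the scheduleControl vs. jobQueueConsumer distinction concrete, here's a hedged plugin-side sketch. The jobs, runEveryMinute, and processEvent exports follow the plugin API described above; flushBuffer and the payload shape are made up for illustration:

export const jobs = {
    // a hypothetical job, retried and scheduled via the Graphile-backed job queue
    flushBuffer: async (payload: Record<string, any>, meta: any) => {
        // deferred work happens here
    },
}

export async function runEveryMinute(meta: any) {
    // picked up by scheduleControl; runs on only one plugin server instance, guarded by Redlock
}

export async function processEvent(event: any, meta: any) {
    // a one-off job in the future, i.e. the meta.jobs.doX(event).runAt(new Date()) pattern
    await meta.jobs.flushBuffer({ distinctId: event.distinct_id }).runAt(new Date(Date.now() + 60_000))
    return event
}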

Worker threads

This begins with worker.ts and createWorker().

hub is the same setup as in the main thread.

New functions called here are:

  1. setupPlugins – Loads plugins and prepares them for lazy VM initialization.

  2. createTaskRunner – Creates a Piscina task runner that allows operating on plugin VMs.

Note: An organization_id is tied to a company and its installed plugins, a team_id is tied to a project and its plugin configs (enabled/disabled+extra config).

Questions?


plugin-server's Issues

Effectively monitor batch processing times.

This Kafka eachBatch is going to be an important unit of work. Since the next eachBatch runs only once the last one has finished and been committed, the runtime of the slowest plugin determines how quickly batches get processed.

For 500 events, if 1 event takes 30sec to process (e.g. some long await fetch) and the other 499 take a combined 1sec, this instance of the plugin server will be sitting idle for 29 seconds out of 30.

We need a way to reliably monitor, detect and alert about slow batch processing times. The console logs as shown in #154 are not enough.
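
One possible direction, sketched below – the hot-shots StatsD client, the metric names, and the 10-second threshold are assumptions for illustration, not what the server currently does:

import { StatsD } from 'hot-shots'

const statsd = new StatsD({ host: process.env.STATSD_HOST, prefix: 'plugin-server.' })

// Hypothetical wrapper around the Kafka eachBatch handler to surface slow batches.
async function timedEachBatch(
    batch: { messages: unknown[] },
    eachBatch: (b: { messages: unknown[] }) => Promise<void>
): Promise<void> {
    const startTime = Date.now()
    try {
        await eachBatch(batch)
    } finally {
        const durationMs = Date.now() - startTime
        statsd.timing('each_batch_duration_ms', durationMs)
        statsd.increment('each_batch_messages', batch.messages.length)
        if (durationMs > 10_000) {
            console.warn(`⚠️ eachBatch took ${durationMs} ms for ${batch.messages.length} messages`)
        }
    }
}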

PR #155 is about enforcing timeouts when we do encounter slow plugins to assure some level of throughput, and is perhaps a prerequisite for starting this work.


To be specced out in another issue, but related to the above:

There is only one true way around this, and that is to convert both Kafka and Celery streams into some other, buffered, sequential, persistent and insanely fast queues that operate on the edge of the larger pipe. Possibly backed by Postgres? This buffer would be responsible for keeping track of where each of the 500 messages that arrive in one Kafka batch are in the processing pipeline. If the server crashes, it could pick up where it left off. Is Postgres fast enough for this or what could we use?

Read events from Kafka WAL

On EE we currently process events directly in the web worker. We however also emit a log with the processed event into kafka.

The plugin server should be able to read this log, just as it currently reads from Celery, and process the event from there.
Adding event ingestion (#10) directly into the plugin server will unlock running plugins on EE.

Extract benchmarks from tests

I don't want to reach for the power cable every time I release a new version.

They're conceptually "testing" different things as well.

Benchmarks should still run in a github action.

Add a benchmarking system

It'd be super useful to know what the limits of the plugin server are and how performance is affected by various changes we make.

Lazy VMs / Improve reloads

Currently when reloading, we:

  • shut down celery
  • wait 2 seconds
  • shut down all workers
  • restart all workers (they will reload all plugins)
  • restart celery

Downtime of at least 2 seconds, but more like 3–4 seconds. Even more if you have hundreds of plugins.

Ideally we should:

  • only restart plugins that changed
  • restart them live (keep the old one running while the new one restarts) and abort reload if there's an init error
  • rolling restart between threads, not all at once

Respect `team.anonymize_ips` earlier in the pipeline and hide IPs from plugins

We store the IP address of the user calling posthog.capture inside the $ip property. If this property exists, we use it. If not, we take the IP from the request and put it in the property.

When the event passes through plugins, we pass along the request's IP, all the way until ingestion. If a plugin wants to export the event to another database, it'll also have access to and send the request's IP, not the provided $ip property.

Solution: we should override the IP earlier in the pipeline, so as to prevent storing PII that we don't want.
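
A minimal sketch of what that could look like, assuming a team object with an anonymize_ips flag is available at the point where the event enters the pipeline (the function and field names are illustrative):

function scrubIp(
    event: { ip?: string | null; properties?: Record<string, any> },
    team: { anonymize_ips: boolean }
) {
    if (team.anonymize_ips) {
        event.ip = null // don't pass the request IP on to plugins or ingestion
        if (event.properties) {
            delete event.properties['$ip']
        }
    }
    return event
}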

Enable multiprocessing

Currently the server can only use 1 thread, even if the machine has for instance 8 cores and could handle 8 server threads at once, greatly improving performance under load.

Plugin server gets stuck when ingesting via Kafka

We have had cases where the Kafka-powered plugin server just stops working. Such as last night on cloud.


Plugin server ingestion on its own is turned off, yet I'm still running the "queue latency" plugin, which emits one event via posthog.capture every minute. Between noon and 9am, the plugin server stopped ingesting events.

Here's what the logs show:

[Screenshot of plugin server logs from 2021-02-15 09:44]

The line "took too long to run plugins on..." is part of a system that detects if and when eachBatch gets stuck (without killing anything).

What exactly happened is still to be determined. Everything came back online when the task was redeployed this morning.

Tests for Kafka

The Kafka code is currently untested. This should be improved, to be sure we don't break anything critical with random changes in the future.

Event ingestion in plugin server

As discussed in the call yesterday, we should try ingesting the event into postgres and clickhouse directly in this nodejs server as opposed to sending the event on to celery.

This is more important for EE, since celery can no longer handle the volume on app, and creating a system of python workers that listen to kafka topics is going to be annoying.

Create Dockerfile

This should be dockerized for simplicity of spinning up with docker-compose setups.

Fix Kafka Sentry errors

There are a lot of recurring errors in sentry. I don't know which are serious, which are not, and which can be silenced. Should be investigated.

Sample error types:

  • The producer is disconnected
  • Timeout while acquiring lock (1 waiting locks): "updating target topics"
  • Connection timeout
  • Timeout while acquiring lock (3 waiting locks): "connect to broker URL-REMOVED:9096"
  • The replica is not available for the requested topic-partition
  • Connection error: Client network socket disconnected before secure TLS connection was established
  • Broker not connected

Do not load plugins for teams that have the plugin server disabled

If you enable plugins on a project, install and activate some plugins, and then disable plugins on that project, these plugins still get loaded (VMs created, etc) in the plugin server. The processEvent and similar functions will not get called, but setupPlugin and all scheduled functions will.

Move from `plugin.json` to `package.json`

Now that you can install plugins from npm, a package.json file is required. It duplicates most of the fields from plugin.json (name, description, main as is, url --> homepage), with the exception of lib and config.

We can just remove lib and keep all code in main / index.js. Most plugins will be compiled down to one .js file anyway, and magically including all code in lib.js inside index.js doesn't lead to a good experience while developing (no autocomplete for functions in lib.js).

For config, we can just use config in package.json.
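
For illustration, a plugin's package.json could then look roughly like this – the plugin name comes from the examples elsewhere in this repo, the homepage URL is illustrative, and the exact shape of the config entry is an assumption rather than a finalized schema:

{
    "name": "Currency Normalization Plugin",
    "description": "Normalize monetary values to a single currency",
    "main": "index.js",
    "homepage": "https://github.com/PostHog/posthog-currency-normalization-plugin",
    "config": {
        "targetCurrency": { "type": "string", "default": "EUR" }
    }
}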

Persistent storage

The github star sync plugin needs to store state (last read page, last ingested time) somewhere.

The only thing available is cache (redis), which should not be used as persistent storage.

A solution is needed.

I wouldn't want to introduce new services and databases (mongodb, anyone?), so the easiest thing is probably a postgres-backed localStorage API that you can call from within plugins.
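
As seen from inside a plugin, such an API could look roughly like this – a hedged sketch, with the storage interface and key names made up for illustration rather than being a shipped API:

// Hypothetical Postgres-backed storage handed to plugins via meta.
interface PluginStorage {
    get: (key: string) => Promise<any>
    set: (key: string, value: any) => Promise<void>
}

export async function runEveryMinute(meta: { storage: PluginStorage }) {
    const lastPage: number = (await meta.storage.get('last_read_page')) ?? 0
    // ... fetch the next page of GitHub stars and capture events here ...
    await meta.storage.set('last_read_page', lastPage + 1)
}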

Stream console.log output somewhere

To improve the dev experience, running console.log (and .info, etc) inside the plugin should send this information somewhere.

First thoughts:

  • Do console.log = (...args) => posthog.capture(`${pluginName} console.log`, { args })? Too easy to endlessly loop...
  • Something on plugin_config in postgres? Too high throughput...
  • Redis?
  • Kafka?
  • Custom nodejs logger service that remembers the last 100 console.log messages per plugin and just periodically flushes the last state to postgres? How to handle multithreading and multiple worker instances?
  • ???

Add support for scheduled plugins

Todo:

  • Make it possible to register "tasks" in plugins, which can be called through celery (and/or via HTTP in the future?)
  • Add node-redlock and use it to acquire a "scheduler" lock from redis in just one worker/thread (see the sketch after this list)
  • Calculate the "next run at" time for all scheduled tasks per plugin_config and store that somewhere (redis? postgres?)
  • Periodically check if any task needs to run. If so, run it by dispatching a task on celery and then calculating the next run time.
  • Handle rolling updates to plugins - reloading plugins shouldn't reset the scheduled times
  • Tasks scheduled to run while the server was down should be run after it starts (perhaps except if the "next run at" is less than "time since it should have run"?)
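
A rough sketch of the locking part using node-redlock and ioredis – the resource name, TTLs, and the extend loop are illustrative, and error handling is minimal:

import Redis from 'ioredis'
import Redlock from 'redlock'

const redlock = new Redlock([new Redis(process.env.REDIS_URL ?? 'redis://localhost')], { retryCount: 0 })

// Only the instance that wins this lock runs the scheduler loop.
async function tryToBecomeScheduler(): Promise<boolean> {
    try {
        let lock = await redlock.lock('plugin-scheduler', 60_000) // TTL in ms, cf. SCHEDULE_LOCK_TTL
        setInterval(async () => {
            try {
                lock = await lock.extend(60_000) // keep the lock while this instance is alive
            } catch {
                process.exit(1) // lost the lock – let the process restart and re-contend
            }
        }, 30_000)
        return true
    } catch {
        return false // another worker/thread already holds the scheduler lock
    }
}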

Outgoing webhooks

Basically, we need to make what's described at https://posthog.com/docs/integrations/slack happen inside https://github.com/PostHog/plugin-server.

The rationale is that in the current setup webhooks are fired with raw incoming events in the capture endpoint. They should instead be fired with the final event, as processed by plugins. We could also possibly smooth out some reliability issues simply by refactoring.

This would be super cool done as a plugin, but there are two considerations that complicate this if we want no regressions:

  • the plugin would have to allow firing the webhook only for specific actions,
  • the format would have to be customizable per specific action.

Handle errors thrown inside plugins

Currently if a plugin method throws an error, it will bubble up through the server and get caught by Sentry (like this). We should log these errors in a smarter way, so that they can be useful to plugin users and don't pollute our Sentry data.

Rework logging to show thread ID

With multithreading it's confusing which console.* message comes from where. A nice addition would be prefixing e.g. [MAIN] 😮 Something happened for the main thread and e.g. [0003] 🍆 A thing started for thread ID 3.
Also, a silly but handy thing would be including the emoji as a separate parameter. Some devs hate them, but I think they really highlight things in logs. 😂
So something like:

import { threadId } from 'worker_threads'

function info(message: string, emoji?: string): void {
    // threadId is 0 on the main thread, so fall back to 'MAIN' there explicitly
    const thread = threadId > 0 ? threadId.toString().padStart(4, '0') : 'MAIN'
    console.info(`[${thread}]${emoji ? ' ' + emoji : ''} ${message}`)
}
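
Called as e.g. info('Reloaded plugins', '🔁'), this prints [0003] 🔁 Reloaded plugins from worker thread 3 and [MAIN] 🔁 Reloaded plugins from the main thread.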

Server fails when uploading MaxMind DB on a live instance

Was setting up the MaxMind plugin today on a live instance. Uploading the DB file led to this:

[Screenshot of the in-app error from 2020-12-02 15:13]

Heroku Logs:

2020-12-02T15:11:10.466690+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/celery/backends/redis.py", line 177, in _consume_from
2020-12-02T15:11:10.466691+00:00 app[worker.1]: self._pubsub.subscribe(key)
2020-12-02T15:11:10.466691+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/contextlib.py", line 131, in __exit__
2020-12-02T15:11:10.466691+00:00 app[worker.1]: self.gen.throw(type, value, traceback)
2020-12-02T15:11:10.466699+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/celery/backends/redis.py", line 130, in reconnect_on_error
2020-12-02T15:11:10.466699+00:00 app[worker.1]: self._ensure(self._reconnect_pubsub, ())
2020-12-02T15:11:10.466700+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/celery/backends/redis.py", line 355, in ensure
2020-12-02T15:11:10.466700+00:00 app[worker.1]: return retry_over_time(
2020-12-02T15:11:10.466700+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/kombu/utils/functional.py", line 344, in retry_over_time
2020-12-02T15:11:10.466701+00:00 app[worker.1]: return fun(*args, **kwargs)
2020-12-02T15:11:10.466701+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/celery/backends/redis.py", line 115, in _reconnect_pubsub
2020-12-02T15:11:10.466701+00:00 app[worker.1]: metas = self.backend.client.mget(self.subscribed_to)
2020-12-02T15:11:10.466702+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/client.py", line 1644, in mget
2020-12-02T15:11:10.466702+00:00 app[worker.1]: return self.execute_command('MGET', *args, **options)
2020-12-02T15:11:10.466703+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/client.py", line 875, in execute_command
2020-12-02T15:11:10.466703+00:00 app[worker.1]: conn = self.connection or pool.get_connection(command_name, **options)
2020-12-02T15:11:10.466703+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/connection.py", line 1185, in get_connection
2020-12-02T15:11:10.466704+00:00 app[worker.1]: connection.connect()
2020-12-02T15:11:10.466704+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/connection.py", line 561, in connect
2020-12-02T15:11:10.466705+00:00 app[worker.1]: self.on_connect()
2020-12-02T15:11:10.466705+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/connection.py", line 637, in on_connect
2020-12-02T15:11:10.466705+00:00 app[worker.1]: auth_response = self.read_response()
2020-12-02T15:11:10.466706+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/connection.py", line 734, in read_response
2020-12-02T15:11:10.466706+00:00 app[worker.1]: response = self._parser.read_response()
2020-12-02T15:11:10.466706+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/redis/connection.py", line 333, in read_response
2020-12-02T15:11:10.466707+00:00 app[worker.1]: raise error
2020-12-02T15:11:10.466707+00:00 app[worker.1]: redis.exceptions.ConnectionError: max number of clients reached

Do action matching in plugin server

If we match fully processed events to actions in the plugin server (in parallel with saving the event to DB), we'll be able to again have asynchronous events>actions matching. Even more importantly, we'll be able to support method onAction(action: Action, event: PluginEvent) on plugins, facilitating making webhooks a plugin. In the long run something like PostHog/posthog#3357 may be built on top of this.
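
A hedged sketch of what that plugin method could look like – the Action and PluginEvent shapes are simplified, the Slack URL is a hypothetical placeholder, and fetch is assumed to be available in the plugin sandbox:

interface Action { id: number; name: string }
interface PluginEvent { event: string; distinct_id: string; properties?: Record<string, any> }

const SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...' // hypothetical placeholder

export async function onAction(action: Action, event: PluginEvent): Promise<void> {
    // fired with the final, plugin-processed event once it matches an action
    await fetch(SLACK_WEBHOOK_URL, {
        method: 'POST',
        body: JSON.stringify({ text: `Action "${action.name}" matched by ${event.distinct_id}` }),
    })
}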

Add Sentry

If the plugin server crashes in production, we have no way of knowing.

Plugin server metrics

There are many metrics that the worker-based multithreaded plugin server can export.

For example the worker/task library provides:

Plus we can export things like:

  • Number of events processed
  • Breakdown by category (scheduled task, webhook, process_event, etc)
  • Failure rate

Celery in posthog is reporting with statsd. Should I do the same here? I'm new to this monitoring world. @fuziontech what would you suggest?

Remove backwards compatibility with setupTeam

I was just looking at this and thinking if we really need to be backwards compatible with setupTeam. The project is so early that I'm not sure this is justified. More of a nit really because it's just a matter of aesthetics.

Rate limit celery and kafka consumption to what the worker threads can handle

Copied and adapted from this comment.

The way piscina works is that we spawn WORKER_CONCURRENCY (e.g. 8) threads, each of which can handle TASKS_PER_WORKER (default 10) async tasks. When all 8 threads * 10 tasks are busy, any new tasks get put on piscina's internal queue.

There is a scenario where we will receive more tasks from Celery than piscina can handle: if all 80 slots are filled with waiting async connections, we'll still have CPU cycles to burn and will keep asking for even more tasks to backfill the limitless queue with... until crashing from running out of memory and losing all the events that were queued.

We can increase the parallel slots from 80 to 800 or 8000, but that's beside the point. Instead we should make sure that no matter the configuration, we always ask for just as many events as we can handle.

Celery uses something like this, where it just asks for the next message when it can. Instead it should ask for the next message when we have free slots.

There's code here that could be adapted to call celeryQueue.pause() and celeryQueue.resume() to either ask for more or stop asking for events.

This could also be used with Kafka.
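
Sketched out, the idea could look something like this – queue here stands for whatever exposes pause()/resume() (the Celery consumer referenced above, or a Kafka consumer), and piscina.queueSize is Piscina's count of tasks waiting for a free slot; the high-water mark is an assumption:

const HIGH_WATER_MARK =
    Number(process.env.WORKER_CONCURRENCY ?? 1) * Number(process.env.TASKS_PER_WORKER ?? 10)

// Call this before asking the broker for more messages.
function applyBackpressure(
    queue: { pause: () => void; resume: () => void },
    piscina: { queueSize: number }
): void {
    if (piscina.queueSize >= HIGH_WATER_MARK) {
        queue.pause() // all slots busy – stop asking Celery/Kafka for more events
    } else {
        queue.resume() // free slots again – resume consumption
    }
}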

Silence Kafka console errors

There are many kafka errors logged to the console in AWS, such as:

{"level":"ERROR","timestamp":"2021-02-09T10:33:34.091Z","logger":"kafkajs","message":"[Runner] The group is rebalancing, re-joining","groupId":"clickhouse-ingestion","memberId":"plugin-server-v0.7.5-01778658-264a-0000-8ffa-3881cf81f867-a0bd54dc-197a-4769-8485-ffef6b71b283","error":"The group is rebalancing, so a rejoin is needed","retryCount":0,"retryTime":352}
{"level":"ERROR","timestamp":"2021-02-09T10:33:32.747Z","logger":"kafkajs","message":"[Connection] Response Heartbeat(key: 12, version: 3)","broker":"ec2-IP.compute-1.amazonaws.com:9096","clientId":"plugin-server-v0.7.5-01778658-264a-0000-8ffa-3881cf81f867","error":"The group is rebalancing, so a rejoin is needed","correlationId":159,"size":10}
{"level":"ERROR","timestamp":"2021-02-09T10:31:10.975Z","logger":"kafkajs","message":"[Connection] Response OffsetCommit(key: 8, version: 5)","broker":"ec2-IP.compute-1.amazonaws.com:9096","clientId":"plugin-server-v0.7.5-01778658-264a-0000-8ffa-3881cf81f867","error":"Specified group generation id is not valid","correlationId":11,"size":53}

I guess they're actually fine and part of the normal life of a kafka connection? According to this issue we must build a custom logger to downgrade the severity of these Kafka errors.
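
A hedged sketch of such a custom logger using kafkajs's logCreator hook – the patterns to downgrade and the console-based output are illustrative, not a finished implementation:

import { Kafka, logLevel } from 'kafkajs'

const NOISY_PATTERNS = [/rebalancing/i, /generation id is not valid/i] // illustrative

const kafka = new Kafka({
    clientId: 'plugin-server',
    brokers: (process.env.KAFKA_HOSTS ?? 'localhost:9092').split(','),
    logCreator: () => ({ namespace, level, log }) => {
        if (level !== logLevel.ERROR) {
            return // this sketch only deals with the noisy errors
        }
        const noisy = NOISY_PATTERNS.some((pattern) => pattern.test(log.message))
        const print = noisy ? console.warn : console.error
        print(`[kafkajs:${namespace}] ${log.message}`)
    },
})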

Use labels to create releases

When a label like version:patch is added to a PR, actually release a new patch version of the plugin server when it lands in master.

Instrument tests

This is currently lacking tests. We need to cover basic behavior, but also scalability and edge cases that can be introduced by plugins (e.g. make a plugin that just waits for 30sec and then sends on the event, make one that blocks the running thread, etc).

Refactor `src/*.ts`

There's a lot of stuff in the root of the src folder. This should be cleaned up. Kind of pending behind the currently open PRs (#39, #34, #25) and better to do after.

Automatic GeoIP Plugin

As discussed in ops, we can go for the MaxMind GeoIP redistribution license, pay them a bit of money, and then distribute the GeoLite City database with the plugin server to offer effortless and "good enough" quality GeoIP information.

Things we need to make this work:

  • Get the commercial redistribution license and double check there's nothing there against using it in open source projects.
  • Create a new private Github repository that will:
    • Store the latest .mmdb (or compressed) geolite2 database
    • Deploy this to a custom subdomain on netlify or S3 or wherever, so we can access the database via HTTP on something like https://geoip.posthog.com/ip/latest.mmdb
    • Have a scheduled github action to update the database each month from maxmind
    • Also store the md5sum of the database in latest.md5sum or whatever, as we need to periodically update the database from the app and this helps quickly check if we're on the latest or not.
  • Expose a geoip global function object to plugin users
    • This would wrap this new Reader().get() line inside geoip.get(), just with a preconfigured database.
    • There needs to be a way to download the initial database and update it periodically. The plugin should be disabled until the database is downloaded... or geoip.get() could await it.
    • The database, once downloaded, needs to be stored in postgresql somewhere. That's the only persistent storage system we have. File system won't work. Reusing the plugin attachments table seems like an option?
    • The initial download could happen in the new geoip plugin's setupPlugin... and have a runEveryDay that checks if there are updates (basically just runs geoip.update()). Still we shouldn't have all teams or all threads updating it at the same time...
  • The main point is to have this geoip database reader running per-instance of the plugin server (or per worker thread), but not per enabled plugin in a team.
  • We should also not download the database for users who don't want geoip support.
  • I imagine "enable geoip" will be something added to the ingestion/preflight routine eventually, next to e.g. "enable session recording".

What did I miss?
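
As a hedged sketch of the geoip.get() wrapper idea above, using the maxmind npm package (the file path, lazy open, and object shape are illustrative; the download/update logic is elided):

import maxmind, { CityResponse, Reader } from 'maxmind'

// Hypothetical geoip object exposed to plugins.
let reader: Reader<CityResponse> | null = null

export const geoip = {
    async get(ip: string): Promise<CityResponse | null> {
        if (!reader) {
            reader = await maxmind.open<CityResponse>('/tmp/GeoLite2-City.mmdb') // placeholder path
        }
        return reader.get(ip)
    },
}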

New package name proposal: @posthog/processor

This project started out as a worker to only process events received from the main PostHog server with plugins, and then pass them back to the main server, but it's now growing into a much more robust server with greater responsibility in the ingestion pipeline and beyond (possibly processing HTTP requests with Fastify).

Therefore the original name now seems to be somewhat narrow and boring. I suggest we rename the package. My proposal is: @posthog/processor – cool, short, and describing what all this does: data processing central to PostHog, from and into.

Feedback needed and welcome.

Custom timestamps in `posthog.capture`

For the github star sync plugin, I'd like to send events with a different timestamp than now. I can send the starred_at property, but I don't think I can use that instead of the timestamp under insights.

Clean up all plugins

Posting it in this repo, but it's a bit of a meta-issue.

Some things to do:

  • Some of the readmes contain old instructions (CLI, etc)
  • We should standardise to the "main" branch in all repos
  • We can probably remove the "posthog-" bit from the beginning of all plugin repos
  • If a plugin contains a "logo.png" in its main branch, we use it in the interface. We should add these to all plugins.
  • The "name" field for plugins could be changed from user hostile "posthog-currency-normalization-plugin" to a friendlier "Currency Normalization Plugin". Currently in the repository some "name" fields contain a nice and friendly name (e.g. "Email Scoring"), yet the "plugin.json" itself contains a technical name "mailboxlayer-plugin". The result is that the name changes in the interface from before to after installing the plugin. We should just go with the friendly names. This is made a bit trickier since the name is what we use to prevent the same plugin from being installed again... but a unique friendly name is still a unique name

Defend against malicious plugins

We should figure out what are the limits of the plugin system and find ways to protect against malicious plugins, deliberate or not. For example plugins that:

  • Await for 300 seconds and then send on the event
  • Block the thread indefinitely with a long running for loop
  • Take up 100GB of RAM
  • Mint crypto
  • etc

Ideally none of these cases should bring down the server and long running tasks should get killed in some way.

Does not work on Heroku

I'm seeing issues like this on all review apps:


error: no pg_hba.conf entry for host "54.91.8.233", user "qrsiljldhoxjis", database "d17e94jsledgnn", SSL off

Heroku requires SSL on Postgres connections.

@Twixes any idea what might have changed? I remember this TLS PR, but it only touched Kafka...

E2E benchmarks for celery and Kafka

We have benchmarks (#5 and #41), yet they just measure the performance of the VM, yielding results up to 100k events/sec under specific loads (simple plugin + batches of 100)

What we don't have are E2E tests that take the overhead of Celery and Kafka into account.

Possibly related: #55

Instrument plugin usage

We should add some custom "plugin installed" or "plugin server launched" style events to see how many people are using plugins.
