
deployment's Issues

Increase $wgQueryCacheLimit

rjackson: Oh btw, would it be possible to extend the [[Special:WantedPages]]? I find 1k to be quite limiting when dividing by the plethora of languages we support. Since this probably results in a higher workload I think generating this page like once a day would be enough, not every hour

10,000 is the highest we can go. I'm also not concerned about regeneration cadence; I only set it to 1 hour because it turned out to be relatively cheap to compute 1000.

I'll see how long it takes to compute 10,000 and make an informed decision; once a day would probably be more than enough, but if it only takes double-digit minutes to compute we might as well run it 4 times a day or so just to get fresher data sooner?
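If we do settle on a few runs per day, the regeneration could live alongside the rest of the cluster config as a CronJob running MediaWiki's updateSpecialPages.php maintenance script. A minimal sketch, assuming our MediaWiki image ships the maintenance scripts (the image name and schedule are illustrative; older clusters may need batch/v1beta1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: update-special-pages
spec:
  schedule: "0 */6 * * *"        # every 6 hours, i.e. 4 times a day
  concurrencyPolicy: Forbid      # never run two regenerations at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: update-special-pages
              image: gcr.io/our-project/mediawiki:latest   # hypothetical image
              command: ["php", "maintenance/updateSpecialPages.php"]
```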

Media sync between pre-Cloud and Cloud infrastructure

If we want two-way sync, Unison should be able to help us with this. It should allow us to sync our mediawiki-images volume (a Google Compute disk-backed NFS PersistentVolume) with the Valve-infrastructure media folder (over an SSH tunnel).

Otherwise, rsync will suffice.
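For the one-way rsync case, the pull could run in-cluster as a scheduled Job against the same mediawiki-images volume. A rough sketch, assuming an image with rsync + ssh available and an SSH key mounted from a Secret (the image, hostname, and source path are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: media-sync
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid            # don't overlap sync runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rsync
              image: our-rsync-ssh:latest     # hypothetical image bundling rsync + ssh
              command:
                - rsync
                - -az
                - -e
                - ssh -i /etc/media-sync/id_rsa -o StrictHostKeyChecking=no
                - wikiops@legacy-wiki-host:/path/to/wiki/images/   # hypothetical source
                - /uploads/
              volumeMounts:
                - name: mediawiki-images
                  mountPath: /uploads
                - name: ssh-key
                  mountPath: /etc/media-sync
                  readOnly: true
          volumes:
            - name: mediawiki-images
              persistentVolumeClaim:
                claimName: mediawiki-images
            - name: ssh-key
              secret:
                secretName: media-sync-ssh-key
```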

Continuous deployment

One of the strengths of Kubernetes is that it removes the need for human intervention in so many aspects of running a site.

Currently, human intervention is necessary to deploy updates to our Kubernetes environment via `kubectl apply ...`.

This isn't great:

  • Every developer who works on the Wiki would need direct access to the Kubernetes clusters (not minimising risk)
  • I can deploy changes I haven't pushed up to Git yet (violating single-source-of-truth)
  • I could accidentally deploy changes for the development environment on the production cluster (error-prone)
  • I could fat-finger my keyboard and do something completely unintentional to the cluster (error-prone)

I would much rather have this repository serve as the source of truth for the Kubernetes clusters, and have commits landing here be the trigger for the cluster to update. That would mitigate all of the above concerns.
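One low-effort way to get there on GKE would be a Cloud Build trigger that runs `kubectl apply` on every push to master, so only the build service account ever touches the cluster. A sketch of what the `cloudbuild.yaml` could look like under that assumption (zone, cluster name, and manifest path are illustrative):

```yaml
# cloudbuild.yaml — run by a trigger on pushes to master
steps:
  - name: gcr.io/cloud-builders/kubectl
    args: ["apply", "-f", "k8s/production/"]        # directory of manifests in this repo
    env:
      - CLOUDSDK_COMPUTE_ZONE=europe-west1-b        # illustrative zone
      - CLOUDSDK_CONTAINER_CLUSTER=wiki-production  # illustrative cluster name
```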

Useful resources:

Cloud friendly file storage

We're currently using NFS, as that's what we use for the existing site, but it's not exactly a cloud-friendly technology, and it requires our config to hard-code a service IP address when running in Google Container Engine (kubernetes/kubernetes#48212).

We should look into Cloud-friendly alternatives, particularly those that are easy and quick to migrate to (i.e. minimal disruption to production).
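For reference, the hard-coding in question sits in the PersistentVolume definition, because the node-level NFS mount can't resolve the in-cluster NFS Service's DNS name (kubernetes/kubernetes#48212). The names, size, and IP below are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mediawiki-images
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.3.240.20      # hard-coded ClusterIP of the NFS service
    path: /
```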


September 2018: Thinking-through-writing about Object Storage

NFS as a ReadWriteMany volume has served fairly well since the Google Cloud migration, but it's not been entirely reliable: we get occasional filesystem-related fatal errors (NFS erroneously reporting directories as not writable springs to mind), and there have been a couple of instances of old image cache entries hanging around after overwrites. On top of that, I'm not confident in my ability to maintain NFS should any issues arise; it works now, but if that were to change I wouldn't know where to start digging.

I would rather we were persisting images into object storage, as is typical of modern cloud-deployed systems. As we're running on GKE, Google Cloud Storage ("GCS") is the obvious candidate.

Unfortunately, MediaWiki doesn't speak Object Storage natively, beyond SwiftFileBackend. It speaks Filesystem, or there are years-old extensions implementing FileRepo for Amazon S3 and Windows Azure Storage.

Another unfortunate wrinkle is that Kubernetes volumes don't speak Object Storage either (or maybe they do through some of their storage drivers, but they don't speak GCS). There is an open issue for FUSE volumes (kubernetes/kubernetes#7890), which would help: we can mount GCS via FUSE. Until then, there are some workarounds mentioned in that issue, the most promising being a DaemonSet to mount the FUSE volume on the host and then mounting the host path in the container.
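For the record, that DaemonSet workaround would look roughly like the sketch below: a privileged pod runs gcsfuse on each node and propagates the mount to a hostPath, which MediaWiki pods then mount. The image and bucket name are hypothetical, and the cluster needs mount-propagation support:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gcs-fuse-mounter
spec:
  selector:
    matchLabels:
      app: gcs-fuse-mounter
  template:
    metadata:
      labels:
        app: gcs-fuse-mounter
    spec:
      containers:
        - name: gcsfuse
          image: our-gcsfuse:latest       # hypothetical image bundling gcsfuse
          command: ["gcsfuse", "--foreground", "wiki-media-bucket", "/mnt/gcs"]
          securityContext:
            privileged: true              # FUSE needs access to /dev/fuse on the host
          volumeMounts:
            - name: gcs
              mountPath: /mnt/gcs
              mountPropagation: Bidirectional   # expose the FUSE mount to the host
      volumes:
        - name: gcs
          hostPath:
            path: /mnt/gcs
```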

So our options for Cloud Friendly (object) storage are:

  • Write a Google Cloud Storage extension for MediaWiki
  • Host-mounted FUSE volume via DaemonSet

I'm leaning more toward the first option at the moment, as the S3 and Azure extensions are available to serve as references (assuming those extensions haven't been touched in years because they "just work", and not because they've been abandoned 😬).

The FUSE-in-Kubernetes workaround feels like we'd just be replacing NFS layers with FUSE layers:

-MediaWiki -> FSFileBackend -> `/uploads` -> (NFS volume) -> disk
+MediaWiki -> FSFileBackend -> `/uploads` -> (FUSE workaround) -> GCS

That would put me back in the same situation: extra complexity I'm not familiar with handling.

Let's set our sights on a new extension, so the flow is more like:

-MediaWiki -> FSFileBackend -> `/uploads` -> NFS -> disk
+MediaWiki -> GCSFileBackend -> GCS

Upload deployment configuration

With the migration to the cloud, we are changing our production environment from bare-metal servers to a Kubernetes-managed cluster. This repository will hold all of our Kubernetes deployment configuration files.

Monitoring?

  • URL alerts
  • Stackdriver logs
  • PHP error tracking
  • Javascript error tracking
  • CSP report-url
  • Search console
  • APM? (Datadog?)
    • MediaWiki job queue
    • Varnish
    • Memcache
    • Apache

Caching problems with images

The cache system is way too slow.
Hi. The cache system on the Wiki has been very slow ever since it was migrated to MediaWiki 2.0 (I think; I just know that the main framework was updated). I tend to upload a lot of images, and a lot of the time I have to wait days for an image to catch up; I can almost never check if my work looks good on the page until the image is finally fully cached. Here's an example, an image I recently updated.

The problem is that the old image size from the previous revision gets used for the new image, so it stretches very poorly on the page. Other times the image seems to be properly sized, but still appears blurry on the page, as shown on Weight Room Warmer; that image was uploaded at 14:07, 22 January 2019, and almost a month later it is still not fully cached.

And other times the images don't update properly at all, as seen on Assassin's Attire: some paint variant tables (uploaded around 16:37, 7 July 2018) show the previous version instead. I'm not sure if this is a case of the cache or something else, but even when I attempted re-uploading the file, it didn't work.

I remember I didn't have these issues on the previous MediaWiki. Would there be a way to improve the time for the cache system, or perhaps optimize it somehow? I've worked on and been involved with other Wikis, and this is the only one where I see this problem happening. The Weight Room Warmer pic was just an example; there are other images around the Wiki with the same or similar problems (some from old uploads, others from recent modifications). Thanks. — Gabrielwoj 20:25, 19 February 2019 (UTC)

Crons & jobs

The Kubernetes Job object is made exactly for the purpose of running arbitrary commands from images, and the CronJob object for triggering such Jobs on a schedule.

Need to review MediaWiki's maintenance docs and see what maintenance tasks are worth running on a cron to keep our Wiki pretty.

Our current (non-cloud) stack runs the runJobs.php script every minute, with a lock to prevent concurrent execution:

* * * * *     wikiops if [ ! -f /tmp/mediawiki.lock ]; then touch /tmp/mediawiki.lock; /usr/bin/php /valve/var/www/wiki.teamfortress.com/w/maintenance/runJobs.php --procs 8; rm /tmp/mediawiki.lock; fi
  • runJobs.php ?
  • Pre-Cloud infrastructure media sync: #10
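In Kubernetes, the crontab entry above maps fairly directly onto a CronJob whose concurrencyPolicy replaces the /tmp lock-file guard. A minimal sketch, assuming our MediaWiki image carries the maintenance scripts (the image name is illustrative; older clusters may need batch/v1beta1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mediawiki-run-jobs
spec:
  schedule: "* * * * *"            # every minute, like the existing crontab
  concurrencyPolicy: Forbid        # replaces the /tmp/mediawiki.lock guard
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: run-jobs
              image: gcr.io/our-project/mediawiki:latest   # hypothetical image
              command: ["php", "maintenance/runJobs.php", "--procs", "8"]
```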

SMTP Relay service

We haven't got any useful email-related logs because MediaWiki sends emails directly to Mailjet.

It would be useful to set up an SMTP relay, primarily just for logging of outgoing e-mails.

There would also be a secondary benefit of increased resilience against connectivity issues between us & Mailjet, as the relay would also serve as an outgoing email queue.
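One possible shape for this, assuming an off-the-shelf Postfix relay image (the image name and its environment variables are hypothetical): a small Deployment plus a Service that MediaWiki's $wgSMTP points at, with Mailjet configured as the upstream relay.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smtp-relay
spec:
  replicas: 1
  selector:
    matchLabels:
      app: smtp-relay
  template:
    metadata:
      labels:
        app: smtp-relay
    spec:
      containers:
        - name: postfix
          image: our-postfix-relay:latest   # hypothetical relay image
          ports:
            - containerPort: 25
          env:
            - name: RELAY_HOST              # hypothetical variable: upstream SMTP host
              value: in-v3.mailjet.com      # Mailjet SMTP endpoint (verify)
---
apiVersion: v1
kind: Service
metadata:
  name: smtp-relay
spec:
  selector:
    app: smtp-relay
  ports:
    - port: 25
      targetPort: 25
```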

Load testing

  • What's the typical traffic we currently receive? (users/sec; insight from Google Analytics)
  • How many requests-per-second does that equate to?
  • What is our target per-pod traffic? Keeping this low for mediawiki minimises the impact of individual failures. Having it low for varnish will result in additional backend hits, however. 🤔
  • How much resources does a single pod handling that target traffic require? This will provide answers for resource limits, for efficient bin-packing during heavy load.
  • Minimum replicas across stack (redundant nodes + redundancy per node == 4?)
  • Maximum replicas? Order of magnitude over typical traffic? What do our historic traffic spikes look like?

The above will also provide the insight required to configure node autoscaling.
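Once per-pod resource requests and a target per-pod traffic figure exist, the replica bounds can be encoded as a HorizontalPodAutoscaler; the numbers below are placeholders to be replaced with whatever the load testing says:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mediawiki
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mediawiki
  minReplicas: 4                      # redundant nodes + redundancy per node?
  maxReplicas: 40                     # order of magnitude over typical traffic?
  targetCPUUtilizationPercentage: 60  # placeholder until load tests say otherwise
```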

Once the above resource limits have been determined & set, we can load test the following scenarios, using Google Analytics to provide insight:

Remove Javascript error tracking

This is just clogging up the Sentry logs with more noise than signal.

Some of the errors seem genuine, but the actual error detail is useless (no stacktrace, little useful insight into how to reproduce). Other errors are caused by the user (custom on-wiki JS, browser extensions, ad-blocking services, network issues, ...).

The JS error noise is also having a detrimental impact on the value of the PHP error tracking, which is also in Sentry: the JS noise outweighs the PHP-related signals and causes us to hit our Sentry error-tracking limit.

Simplify deployment via Helm?

Review Helm and see if it would be a good means of simplifying the Kubernetes deployment process from maintaining a bunch of raw Kubernetes resources (quite low-level) to maintaining a manifest of resources (high-level). This would also move the responsibility for managing configuration from this repository into specific service repositories.

https://helm.sh/
http://blog.kubernetes.io/2016/10/helm-charts-making-it-simple-to-package-and-deploy-apps-on-kubernetes.html
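The draw is that the raw manifests would collapse into a chart plus a short values file per environment; something like the hypothetical sketch below (the keys and structure depend entirely on how the chart ends up organised):

```yaml
# values-production.yaml (hypothetical structure)
mediawiki:
  replicaCount: 4
  image:
    repository: gcr.io/our-project/mediawiki   # hypothetical image
    tag: "1.31"
varnish:
  replicaCount: 2
memcached:
  replicaCount: 2
nfs:
  serverIP: 10.3.240.20    # still hard-coded until the object-storage work lands
```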
