
deployment's Issues

Increase $wgQueryCacheLimit

rjackson: Oh btw, would it be possible to extend the [[Special:WantedPages]]? I find 1k to be quite limiting when dividing by the plethora of languages we support. Since this probably results in a higher workload I think generating this page like once a day would be enough, not every hour

10,000 is the highest we can go. I'm also not concerned about regeneration cadence; I only set it to 1 hour because it turned out to be relatively cheap to compute 1000.

I'll see how long it takes to compute 10,000 and make an informed decision; once a day would probably be more than enough, but if it only takes double-digit minutes to compute we might as well run it 4 times a day or so just to get fresher data sooner?
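If we do settle on a few runs per day, the regeneration could live alongside the rest of the cluster config as a CronJob running MediaWiki's updateSpecialPages.php maintenance script. A minimal sketch, assuming our MediaWiki image ships the maintenance scripts (the image name and schedule are illustrative; older clusters may need batch/v1beta1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: update-special-pages
spec:
  schedule: "0 */6 * * *"        # every 6 hours, i.e. 4 times a day
  concurrencyPolicy: Forbid      # never run two regenerations at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: update-special-pages
              image: gcr.io/our-project/mediawiki:latest   # hypothetical image
              command: ["php", "maintenance/updateSpecialPages.php"]
```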

Media sync between pre-Cloud and Cloud infrastructure

If we want two-way sync, Unison should be able to help us with this. It should allow us to sync our mediawiki-images volume (a Google Compute disk-backed NFS PersistentVolume) with the Valve-infrastructure media folder (over an SSH tunnel).

Otherwise, rsync will suffice.
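For the one-way rsync case, the pull could run in-cluster as a scheduled Job against the same mediawiki-images volume. A rough sketch, assuming an image with rsync + ssh available and an SSH key mounted from a Secret (the image, hostname, and source path are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: media-sync
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid            # don't overlap sync runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rsync
              image: our-rsync-ssh:latest     # hypothetical image bundling rsync + ssh
              command:
                - rsync
                - -az
                - -e
                - ssh -i /etc/media-sync/id_rsa -o StrictHostKeyChecking=no
                - wikiops@legacy-wiki-host:/path/to/wiki/images/   # hypothetical source
                - /uploads/
              volumeMounts:
                - name: mediawiki-images
                  mountPath: /uploads
                - name: ssh-key
                  mountPath: /etc/media-sync
                  readOnly: true
          volumes:
            - name: mediawiki-images
              persistentVolumeClaim:
                claimName: mediawiki-images
            - name: ssh-key
              secret:
                secretName: media-sync-ssh-key
```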

Continuous deployment

One of the strengths of Kubernetes is that it removes the need for human intervention in so many aspects of running a site.

Currently, human intervention is necessary to deploy updates to our Kubernetes environment via `kubectl apply ...`.

This isn't great:

  • Every developer who works on the Wiki would need direct access to the Kubernetes clusters (not minimising risk)
  • I can deploy changes I haven't pushed up to Git yet (violating single-source-of-truth)
  • I could accidentally deploy changes for the development environment on the production cluster (error-prone)
  • I could fat-finger my keyboard and do something completely unintentional to the cluster (error-prone)

I would much rather have this repository serve as the source of truth for the Kubernetes clusters, and have commits landing here be the trigger for the cluster to update. That would mitigate all of the above concerns.
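One low-effort way to get there on GKE would be a Cloud Build trigger that runs `kubectl apply` on every push to master, so only the build service account ever touches the cluster. A sketch of what the `cloudbuild.yaml` could look like under that assumption (zone, cluster name, and manifest path are illustrative):

```yaml
# cloudbuild.yaml — run by a trigger on pushes to master
steps:
  - name: gcr.io/cloud-builders/kubectl
    args: ["apply", "-f", "k8s/production/"]        # directory of manifests in this repo
    env:
      - CLOUDSDK_COMPUTE_ZONE=europe-west1-b        # illustrative zone
      - CLOUDSDK_CONTAINER_CLUSTER=wiki-production  # illustrative cluster name
```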

Useful resources:

Cloud friendly file storage

We're currently using NFS, as that's what we use for the existing site, but it's not exactly a cloud-friendly technology, and it requires our config to hard-code a service IP address when running in Google Container Engine (kubernetes/kubernetes#48212).

We should look into Cloud-friendly alternatives, particularly those that are easy and quick to migrate to (i.e. minimal disruption to production).
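For reference, the hard-coding in question sits in the PersistentVolume definition, because the node-level NFS mount can't resolve the in-cluster NFS Service's DNS name (kubernetes/kubernetes#48212). The names, size, and IP below are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mediawiki-images
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.3.240.20      # hard-coded ClusterIP of the NFS service
    path: /
```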


September 2018: Thinking-through-writing about Object Storage

NFS as a ReadWriteMany volume has served fairly well since the Google Cloud migration, but it's not been entirely reliable: we get occasional filesystem-related fatal errors (NFS erroneously reporting directories as not writable springs to mind), and there have been a couple of instances of old image cache entries hanging around after overwrites. On top of that, I'm not confident in my ability to maintain NFS should any issues arise; it works now, but if that were to change I wouldn't know where to start digging.

I would rather we were persisting images into object storage, as is typical of modern cloud-deployed systems. As we're running on GKE, Google Cloud Storage ("GCS") is the obvious candidate.

Unfortunately, MediaWiki doesn't speak Object Storage natively, beyond SwiftFileBackend. It speaks Filesystem, or there are years-old extensions implementing FileRepo for Amazon S3 and Windows Azure Storage.

Another unfortunate wrinkle is that Kubernetes volumes don't speak Object Storage either (or maybe they do through some of their storage drivers, but they don't speak GCS). There is an open issue for FUSE volumes (kubernetes/kubernetes#7890), which would help: we can mount GCS via FUSE. Until then, there are some workarounds mentioned in that issue, the most promising being a DaemonSet to mount the FUSE volume on the host and then mounting the host path in the container.
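For the record, that DaemonSet workaround would look roughly like the sketch below: a privileged pod runs gcsfuse on each node and propagates the mount to a hostPath, which MediaWiki pods then mount. The image and bucket name are hypothetical, and the cluster needs mount-propagation support:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gcs-fuse-mounter
spec:
  selector:
    matchLabels:
      app: gcs-fuse-mounter
  template:
    metadata:
      labels:
        app: gcs-fuse-mounter
    spec:
      containers:
        - name: gcsfuse
          image: our-gcsfuse:latest       # hypothetical image bundling gcsfuse
          command: ["gcsfuse", "--foreground", "wiki-media-bucket", "/mnt/gcs"]
          securityContext:
            privileged: true              # FUSE needs access to /dev/fuse on the host
          volumeMounts:
            - name: gcs
              mountPath: /mnt/gcs
              mountPropagation: Bidirectional   # expose the FUSE mount to the host
      volumes:
        - name: gcs
          hostPath:
            path: /mnt/gcs
```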

So our options for Cloud Friendly (object) storage are:

  • Write a Google Cloud Storage extension for MediaWiki
  • Host-mounted FUSE volume via DaemonSet

I'm leaning more toward the first option at the moment, as the S3 and Azure extensions are available to serve as references (assuming those extensions haven't been touched in years because they "just work", and not because they've been abandoned 😬).

The FUSE-in-Kubernetes workaround feels like we'd just be replacing NFS layers with FUSE layers:

-MediaWiki -> FSFileBackend -> `/uploads` -> (NFS volume) -> disk
+MediaWiki -> FSFileBackend -> `/uploads` -> (FUSE workaround) -> GCS

That would put me back in the same situation: extra complexity I'm not familiar with handling.

Let's set our sights on a new extension, so the flow is more like:

-MediaWiki -> FSFileBackend -> `/uploads` -> NFS -> disk
+MediaWiki -> GCSFileBackend -> GCS

Upload deployment configuration

With the migration to the cloud, we are changing our production environment from bare-metal servers to a Kubernetes-managed cluster. This repository will hold all of our Kubernetes deployment configuration files.

Monitoring?

  • URL alerts
  • Stackdriver logs
  • PHP error tracking
  • Javascript error tracking
  • CSP report-url
  • Search console
  • APM? (Datadog?)
    • MediaWiki job queue
    • Varnish
    • Memcache
    • Apache

Caching problems with images

The cache system is way too slow.
Hi. The cache system on the Wiki has been very slow ever since it was migrated to MediaWiki 2.0 (I think; I just know that the main framework was updated). I tend to upload a lot of images, and a lot of the time I have to wait days for an image to catch up; I can almost never check if my work looks good on the page until the image is finally fully cached. Here's an example, an image I recently updated.

The problem is that the old image size from the previous revision gets used for the new image, so it stretches very poorly on the page. Other times the image seems to be properly sized, but still appears blurry on the page, as shown on Weight Room Warmer; that image was uploaded at 14:07, 22 January 2019, and almost a month later it is still not fully cached.

And other times the images don't update properly at all, as seen on Assassin's Attire: some paint variant tables (uploaded around 16:37, 7 July 2018) show the previous version instead. I'm not sure if this is a case of the cache or something else, but even when I attempted re-uploading the file, it didn't work.

I remember I didn't have these issues on the previous MediaWiki. Would there be a way to improve the time for the cache system, or perhaps optimize it somehow? I've worked on and been involved with other Wikis, and this is the only one where I see this problem happening. The Weight Room Warmer pic was just an example; there are other images around the Wiki with the same or similar problems (some from old uploads, others from recent modifications). Thanks. — Gabrielwoj 20:25, 19 February 2019 (UTC)

Crons & jobs

The Kubernetes Job object is made exactly for the purpose of running arbitrary commands from images, and the CronJob object for triggering such Jobs on a schedule.

Need to review MediaWiki's maintenance docs and see what maintenance tasks are worth running on a cron to keep our Wiki pretty.

Our current (non-cloud) stack runs the runJobs.php script every minute, with a lock to prevent concurrent execution:

* * * * *     wikiops if [ ! -f /tmp/mediawiki.lock ]; then touch /tmp/mediawiki.lock; /usr/bin/php /valve/var/www/wiki.teamfortress.com/w/maintenance/runJobs.php --procs 8; rm /tmp/mediawiki.lock; fi
  • runJobs.php ?
  • Pre-Cloud infrastructure media sync: #10
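In Kubernetes, the crontab entry above maps fairly directly onto a CronJob whose concurrencyPolicy replaces the /tmp lock-file guard. A minimal sketch, assuming our MediaWiki image carries the maintenance scripts (the image name is illustrative; older clusters may need batch/v1beta1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mediawiki-run-jobs
spec:
  schedule: "* * * * *"            # every minute, like the existing crontab
  concurrencyPolicy: Forbid        # replaces the /tmp/mediawiki.lock guard
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: run-jobs
              image: gcr.io/our-project/mediawiki:latest   # hypothetical image
              command: ["php", "maintenance/runJobs.php", "--procs", "8"]
```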

SMTP Relay service

We haven't got any useful email-related logs because MediaWiki sends emails directly to Mailjet.

It would be useful to set up an SMTP relay, primarily just for logging of outgoing e-mails.

There would also be a secondary benefit of increased resilience against connectivity issues between us & Mailjet, as the relay would also serve as an outgoing email queue.
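One possible shape for this, assuming an off-the-shelf Postfix relay image (the image name and its environment variables are hypothetical): a small Deployment plus a Service that MediaWiki's $wgSMTP points at, with Mailjet configured as the upstream relay.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smtp-relay
spec:
  replicas: 1
  selector:
    matchLabels:
      app: smtp-relay
  template:
    metadata:
      labels:
        app: smtp-relay
    spec:
      containers:
        - name: postfix
          image: our-postfix-relay:latest   # hypothetical relay image
          ports:
            - containerPort: 25
          env:
            - name: RELAY_HOST              # hypothetical variable: upstream SMTP host
              value: in-v3.mailjet.com      # Mailjet SMTP endpoint (verify)
---
apiVersion: v1
kind: Service
metadata:
  name: smtp-relay
spec:
  selector:
    app: smtp-relay
  ports:
    - port: 25
      targetPort: 25
```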

Load testing

  • What's the typical traffic we currently receive? (users/sec; insight from Google Analytics)
  • How many requests-per-second does that equate to?
  • What is our target per-pod traffic? Keeping this low for mediawiki minimises the impact of individual failures. Having it low for varnish will result in additional backend hits, however. 🤔
  • How much resources does a single pod handling that target traffic require? This will provide answers for resource limits, for efficient bin-packing during heavy load.
  • Minimum replicas across stack (redundant nodes + redundancy per node == 4?)
  • Maximum replicas? Order of magnitude over typical traffic? What do our historic traffic spikes look like?

The above will also provide the insight required to configure node autoscaling.
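Once per-pod resource requests and a target per-pod traffic figure exist, the replica bounds can be encoded as a HorizontalPodAutoscaler; the numbers below are placeholders to be replaced with whatever the load testing says:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mediawiki
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mediawiki
  minReplicas: 4                      # redundant nodes + redundancy per node?
  maxReplicas: 40                     # order of magnitude over typical traffic?
  targetCPUUtilizationPercentage: 60  # placeholder until load tests say otherwise
```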

Once the above resource limits have been determined & set, we can load test the following scenarios, using Google Analytics to provide insight:

Remove Javascript error tracking

This is just clogging up the Sentry logs with more noise than signal.

Some of the errors seem genuine, but the actual error detail is useless (no stacktrace, little useful insight into how to reproduce). Other errors are caused by the user (custom on-wiki JS, browser extensions, ad-blocking services, network issues, ...).

The JS error noise is also having a detrimental impact on the value of the PHP error tracking, which is also in Sentry: the JS noise outweighs the PHP-related signals and causes us to hit our Sentry error-tracking limit.

Simplify deployment via Helm?

Review Helm and see if it would be a good means of simplifying the Kubernetes deployment process from maintaining a bunch of raw Kubernetes resources (quite low-level) to maintaining a manifest of resources (high-level). This would also move the responsibility for managing configuration from this repository into specific service repositories.

https://helm.sh/
http://blog.kubernetes.io/2016/10/helm-charts-making-it-simple-to-package-and-deploy-apps-on-kubernetes.html
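The draw is that the raw manifests would collapse into a chart plus a short values file per environment; something like the hypothetical sketch below (the keys and structure depend entirely on how the chart ends up organised):

```yaml
# values-production.yaml (hypothetical structure)
mediawiki:
  replicaCount: 4
  image:
    repository: gcr.io/our-project/mediawiki   # hypothetical image
    tag: "1.31"
varnish:
  replicaCount: 2
memcached:
  replicaCount: 2
nfs:
  serverIP: 10.3.240.20    # still hard-coded until the object-storage work lands
```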
