
Load testing (deployment) · 4 comments · CLOSED

tfwiki commented on June 23, 2024
Load testing


Comments (4)

rjackson commented on June 23, 2024

WIP comment; updating as per findings

What's the typical traffic we currently receive? (users/sec; insight from Google Analytics)

From Google Analytics' "Audience" data over the last 2 years, 2016-01-09 to 2018-01-09, converting to per-timeframe equivalents:

| Metric | Value | Per day | Per hour | Per minute | Per second |
| --- | --- | --- | --- | --- | --- |
| Sessions | 34,556,894 | 47,338 | 1,972 | 32 | 0.54 |
| Users | 14,372,905 | 19,689 | 820 | 13 | 0.21 |
| Page views | 119,635,151 | 163,884 | 6,829 | 114 | 1.9 |
| Pages/session | 3.46 | | | | |
| Avg. session duration | 00:04:26 | | | | |

(What we refer to as "users" in the load test would map to "sessions" in the above data)

To mimic this average traffic in a load test, we would have to create a user which browses 3.5 pages every 4.5 minutes (1 page per 90 seconds):

from locust import HttpLocust

class AverageUser(HttpLocust):
    """ Emulate an average user according to Google Analytics data collected between
        2016-01-09 and 2018-01-09 """

    task_set = Top10Pages  # TaskSet covering the wiki's most-viewed pages (defined elsewhere)

    # Average of 90 seconds between page views, with +/- 50% variance
    avg_wait = 90 * 1000  # milliseconds
    min_wait = int(0.5 * avg_wait)
    max_wait = int(1.5 * avg_wait)
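The Top10Pages task set isn't included in the comment; as a rough illustration of what one looks like in Locust's old (pre-1.0) API, it might be something along these lines (the page choice below is an assumption, not taken from the actual load-test repo):

from locust import TaskSet, task

class Top10Pages(TaskSet):
    """ Illustrative sketch only: request one of the wiki's most-viewed pages """

    @task
    def main_page(self):
        # Main_Page exists on any MediaWiki install; the real task set presumably
        # weights the wiki's actual top 10 pages
        self.client.get("/wiki/Main_Page")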

From the average session duration, we can also figure out how many simultaneous visitors the website serves: 32 sessions per minute * 4.5 minute average session duration = 144 simultaneous sessions.
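As a minimal sketch of that arithmetic (Little's law: concurrency = arrival rate * time in system); the helper name is illustrative, not part of the load-test code:

def simultaneous_sessions(sessions_per_minute, avg_session_minutes):
    """ Little's law: concurrency = arrival rate * average time in the system """
    return sessions_per_minute * avg_session_minutes

# Figures from the Google Analytics table above
print(simultaneous_sessions(32, 4.5))  # -> 144.0 simultaneous sessions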

How many requests-per-second does that equate to?

Running a load test (tfwiki/load-tests) that models this average user behaviour with 144 simultaneous users, and that also loads page resources (images, stylesheets, JavaScript), we see our typical traffic generate approximately 40 requests per second to the server.
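As a back-of-envelope cross-check (my own inference from the figures above, not a measured number), ~40 requests/second from 144 users implies roughly 25 HTTP requests per page view once static assets are included:

page_views_per_second = 144 / 90.0          # 144 users, 1 page view per 90 s each -> ~1.6 pages/s
requests_per_page = 40 / page_views_per_second
print(round(requests_per_page, 1))          # ~25 requests per page view, including assets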

This appears to be severely slowed down by the MediaWiki pod handling images (the pod is I/O-bound rather than CPU-bound). It may be worth re-evaluating without image handling to get a better idea of raw MediaWiki performance.


rjackson commented on June 23, 2024

What does a major update spike look like?

The Pyromania Update caused the largest single-day traffic spike in the Wiki's history on June 28, 2012. Let's create a "Major Update" traffic model based on traffic on that day:

| Metric | Value | Per hour | Per minute | Per second |
| --- | --- | --- | --- | --- |
| Sessions | 447,249 | 18,635.38 | 310.59 | 5.18 |
| Users | 250,774 | 10,448.92 | 174.15 | 2.9 |
| Page views | 3,257,921 | 135,746.71 | 2,262.45 | 37.71 |
| Pages/session | 7.28 | | | |
| Avg. session duration | 00:08:02 | | | |

(What we refer to as "users" in the load test would map to "sessions" in the above data)

To mimic these users in a load test, we would have to create a user which browses 7.5 pages every 8 minutes (1 page per 64 seconds):

from locust import HttpLocust

class PyromaniacUser(HttpLocust):
    """ Emulate Pyromania update users, according to behaviour observed on 2012-06-28 """

    task_set = PyromaniaTop10Pages  # TaskSet of the update's most-viewed pages (defined elsewhere)

    # Average of 64 seconds between page views, with +/- 50% variance
    avg_wait = 64 * 1000  # milliseconds
    min_wait = int(0.5 * avg_wait)
    max_wait = int(1.5 * avg_wait)

From the average session duration, we can also figure out how many simultaneous visitors the website served during this event: 311 sessions per minute * 8 minute average session duration = 2488 simultaneous sessions.


rjackson commented on June 23, 2024

Yeeaah, the current setup with 4 Varnish instances (1 per server) handles a boatload of traffic perfectly fine – I've had it handling 10,000 simulated Pyromania users and it's not breaking a sweat. So no need to worry about getting resource limits perfect yet; we can handle that down the line.

[Screenshot of load-test results, captured at 127.0.0.1:58906]
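For context, a swarm like that would be started with Locust's old (pre-1.0) CLI roughly as follows; the locustfile name and target host are assumptions, not taken from the repo:

locust -f locustfile.py PyromaniacUser --host=https://wiki.teamfortress.com
# then open Locust's web UI (http://127.0.0.1:8089 by default) and start a swarm of 10,000 users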


rjackson commented on June 23, 2024

Don't care about these reports any more. Live traffic never quite matched up due to the Wiki having a lot of pages, and thus a lot of uncached content when we went live.

So the numbers seemed impressive, but weren't realistic.

With typical traffic nowadays, 4 MediaWiki containers just about manage the non-cached traffic. Clearing Varnish significantly increases load, and Kubernetes' horizontal pod autoscaler can be slow to react, leading to single-digit minutes of perceived downtime. In those scenarios, manually scaling MediaWiki up to 8 containers seems to handle typical traffic well enough (likely more than needed), and Kubernetes will automatically scale back down to the minimum of 4 containers once Varnish has taken over the brunt of the load.
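For reference, that kind of manual scaling is a one-liner against the MediaWiki Deployment; the deployment name below is a placeholder, not necessarily what the cluster uses:

kubectl scale deployment mediawiki --replicas=8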

