service-capacity-modeling's Introduction

Service Capacity Modeling

A generic toolkit for modeling capacity requirements in the cloud. All pricing information included in this repository is public pricing.

NOTE: Netflix confidential information should never enter this repo. Please consider this repository public when making changes to it.

Trying it out

Run the tests:

# Test the capacity planner on the included Netflix models
$ tox -e py38

# Run a single test with a debugger attached if the test fails
$ .tox/py38/bin/pytest -n0 -k test_java_heap_heavy --pdb --pdbcls=IPython.terminal.debugger:Pdb

# Verify all type contracts
$ tox -e mypy

Run IPython to use the library interactively:

tox -e dev -- ipython

Example of Provisioning a Database

Fire up IPython and let's capacity plan a Tier 1 (important to the product, aka "prod") Cassandra database.

from service_capacity_modeling.interface import CapacityDesires
from service_capacity_modeling.interface import FixedInterval, Interval
from service_capacity_modeling.interface import QueryPattern, DataShape

db_desires = CapacityDesires(
    # This service is important to the business, not critical (tier 0)
    service_tier=1,
    query_pattern=QueryPattern(
        # Not sure exactly how much QPS we will do, but we think around
        # 10,000 reads and 10,000 writes per second.
        estimated_read_per_second=Interval(
            low=1000, mid=10000, high=100000, confidence=0.9
        ),
        estimated_write_per_second=Interval(
            low=1000, mid=10000, high=100000, confidence=0.9
        ),
    ),
    # Not sure how much data, but we think it'll be below 1 TiB
    data_shape=DataShape(
        estimated_state_size_gib=Interval(low=100, mid=100, high=1000, confidence=0.9),
    ),
)

Now we can load up some models and do some capacity planning:

from service_capacity_modeling.capacity_planner import planner
from service_capacity_modeling.models.org import netflix
import pprint

# Load up the Netflix capacity models
planner.register_group(netflix.models)

cap_plan = planner.plan(
    model_name="org.netflix.cassandra",
    region="us-east-1",
    desires=db_desires,
    # Simulate the possible requirements 512 times
    simulations=512,
    # Request 3 diverse hardware families to be returned
    num_results=3,
)

# The range of requirements in hardware resources (CPU, RAM, Disk, etc ...)
requirements = cap_plan.requirements

# The ordered list of least regretful choices for the requirement
least_regret = cap_plan.least_regret

# Show the range of requirements for a single zone
pprint.pprint(requirements.zonal[0].dict(exclude_unset=True))

# Show our least regretful choices of hardware in least regret order
# So for example if we can buy the first set of computers we would prefer
# to do that but we might not have availability in that family in which
# case we'd buy the second one.
for choice in range(3):
    num_clusters = len(least_regret[choice].candidate_clusters.zonal)
    print(f"Our #{choice + 1} choice is {num_clusters} zones of:")
    pprint.pprint(least_regret[choice].candidate_clusters.zonal[0].dict(exclude_unset=True))

Note that we can customize more information given what we know about the use case, but each model (e.g. Cassandra) supplies reasonable defaults.

For example, we can specify a lot more information:

db_desires = CapacityDesires(
    # This service is important to the business, not critical (tier 0)
    service_tier=1,
    query_pattern=QueryPattern(
        # Not sure exactly how much QPS we will do, but we think around
        # 50,000 reads and 45,000 writes per second with a rather narrow
        # bound
        estimated_read_per_second=Interval(
            low=40_000, mid=50_000, high=60_000, confidence=0.9
        ),
        estimated_write_per_second=Interval(
            low=42_000, mid=45_000, high=50_000, confidence=0.9
        ),
        # This use case might do some partition scan queries that are
        # somewhat expensive, so we hint a rather expensive ON-CPU time
        # that a read will consume on the entire cluster.
        estimated_mean_read_latency_ms=Interval(
            low=0.1, mid=4, high=20, confidence=0.9
        ),
        # Writes at LOCAL_ONE are pretty cheap
        estimated_mean_write_latency_ms=Interval(
            low=0.1, mid=0.4, high=0.8, confidence=0.9
        ),
        # We want single digit latency, note that this is not a p99 of 10ms
        # but defines the interval where 98% of latency falls to be between
        # 0.4 and 10 milliseconds. Think of:
        #   low = "the minimum reasonable latency"
        #   high = "the maximum reasonable latency"
        #   mid = "value between low and high such that I want my distribution
        #          to skew left or right"
        read_latency_slo_ms=FixedInterval(
            low=0.4, mid=4, high=10, confidence=0.98
        ),
        write_latency_slo_ms=FixedInterval(
            low=0.4, mid=4, high=10, confidence=0.98
        )
    ),
    # Not sure how much data, but we think it'll be below 1 TiB
    data_shape=DataShape(
        estimated_state_size_gib=Interval(low=100, mid=500, high=1000, confidence=0.9),
    ),
)

Example of provisioning a caching cluster

In this example we tweak the QPS up, the on-CPU time of operations down, and the latency SLO down. This more closely approximates a caching workload.

cache_desires = CapacityDesires(
    service_tier=1,
    query_pattern=QueryPattern(
        # Not sure exactly how much QPS we will do, but we think around
        # 100,000 reads and 20,000 writes per second.
        estimated_read_per_second=Interval(
            low=10_000, mid=100_000, high=1_000_000, confidence=0.9
        ),
        estimated_write_per_second=Interval(
            low=1_000, mid=20_000, high=100_000, confidence=0.9
        ),
        # Memcache is consistently fast at queries
        estimated_mean_read_latency_ms=Interval(
            low=0.05, mid=0.2, high=0.4, confidence=0.9
        ),
        estimated_mean_write_latency_ms=Interval(
            low=0.05, mid=0.2, high=0.4, confidence=0.9
        ),
        # Caches usually have tighter SLOs
        read_latency_slo_ms=FixedInterval(
            low=0.4, mid=0.5, high=5, confidence=0.98
        ),
        write_latency_slo_ms=FixedInterval(
            low=0.4, mid=0.5, high=5, confidence=0.98
        )
    ),
    # Not sure how much data, but we think it'll be below 500 GiB
    data_shape=DataShape(
        estimated_state_size_gib=Interval(low=100, mid=200, high=500, confidence=0.9),
    ),
)

cache_cap_plan = planner.plan(
    model_name="org.netflix.cassandra",
    region="us-east-1",
    desires=cache_desires,
    allow_gp2=True,
)

requirements = cache_cap_plan.requirements
least_regret = cache_cap_plan.least_regret

Notebooks

We have a demo notebook in the notebooks directory that you can use to experiment. Start it with:

tox -e notebook -- jupyter notebook notebooks/demo.ipynb

Development

To contribute to this project:

  1. Make your change in a branch. If you are making significant changes, consider creating a new model and registering it under a different name.
  2. Write a unit test using pytest in the tests folder.
  3. Ensure your tests pass (or debug them) with:
tox -e py38 -- -k test_<your_functionality> --pdb --pdbcls=IPython.terminal.debugger:Pdb

Release

TODO

service-capacity-modeling's People

Contributors

abersnaze, akashdeepgoel, alexsyeo, arunagrawal-84, arunagrawal84, gndcshv, jolynch, kaidanfullerton, nickmahilani, rajivshringi, raksoras, ramsrivatsa, shengweiwang, susheelaroskar, szimmer1, tcdevoe

service-capacity-modeling's Issues

Unclear repetition

I'm working on summarizing the cost, cpu, disk (local & attached) for both regional and zonal clusters. I want there to be more consistency in the way repetition is represented.

us-east-1: # trimmed
us-west-2:
  least_regret:
    - candidate_clusters:
        total_annual_cost: # redacted
        zonal:
          - cluster_type: cassandra # trimmed
          - cluster_type: cassandra # trimmed
          - cluster_type: cassandra # trimmed
        regional:
          - cluster_type: dgwkv
            total_annual_cost: # redacted
            count: 3
            instance:
              total_annual_cost: # redacted
              name: r5.large
            attached_drives:
              - name: gp2
                size_gib: 20
                annual_cost_per_gib: # redacted
                annual_cost_per_read_io: # redacted
                annual_cost_per_write_io: # redacted

In the sample above there are:

  • enumerated list of regions
  • duplicated zonals
  • instance with a count property
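
One way to smooth over the mixed representations is to collapse them into a single total per key. The sketch below is only illustrative: `instance_totals` is a hypothetical helper, not part of the library, and it assumes that every zonal and regional entry carries `cluster_type`, `count`, and `instance.name` fields like the regional entry in the sample. The zonal values are made up, since they are trimmed above.

```python
from collections import Counter
from typing import Any, Dict, Tuple

# (scope, cluster_type, instance name) -> total instance count
InstanceKey = Tuple[str, str, str]


def instance_totals(candidate_clusters: Dict[str, Any]) -> Dict[InstanceKey, int]:
    """Collapse repeated zonal/regional entries into one total per key."""
    totals: Counter = Counter()
    for scope in ("zonal", "regional"):
        for cluster in candidate_clusters.get(scope, []):
            key = (scope, cluster["cluster_type"], cluster["instance"]["name"])
            # Duplicated entries and the explicit count both end up in one sum.
            totals[key] += cluster["count"]
    return dict(totals)


# Hypothetical data shaped like the sample; the zonal fields are assumptions.
sample = {
    "zonal": [
        {"cluster_type": "cassandra", "count": 4, "instance": {"name": "i3.xlarge"}},
        {"cluster_type": "cassandra", "count": 4, "instance": {"name": "i3.xlarge"}},
        {"cluster_type": "cassandra", "count": 4, "instance": {"name": "i3.xlarge"}},
    ],
    "regional": [
        {"cluster_type": "dgwkv", "count": 3, "instance": {"name": "r5.large"}},
    ],
}
totals = instance_totals(sample)
```

With this shape, three duplicated zonal entries and a single regional entry with a count are summarized the same way.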

For rightsizing ask for existing compute usage

Right now most models are split into two parts:

  1. Try to determine the resources you need for a desire using math on the desire (CPU, RAM, Disk, Network, etc ...).
  2. Size and price clusters based on that particular service deployment mode (e.g. C* has to scale by factors of 2, and deploys to zones).

This makes sense for provisioning, where we are trying to guess CPU time from e.g. payload sizes and RPS. For rightsizing it might make more sense to provide the existing choices in the desire along with their utilization, so the model can produce ideal hardware for that specific requirement. Perhaps modify CapacityDesires to have an additional field called existing_deployment that takes either a Requirements or a Clusters, maybe with the modification that instead of supplying a frequency to the requirements, it takes a hardware shape/count (the cpu_count would be cpu * utilization, for example).

Then models can short circuit the requirements generation or at least use the provided numbers as good defaults. RAM is the only one that seems tricky to me that might require merging.
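
A minimal sketch of what that short circuit could look like, using plain dataclasses rather than the library's own types. `ExistingDeployment`, its fields, and `required_cpu` are hypothetical names introduced here for illustration; only the "cpu * utilization" idea comes from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExistingDeployment:
    """Hypothetical description of what is deployed today."""

    instance_name: str
    instance_count: int
    cpu_per_instance: int
    cpu_utilization: float  # observed fraction of CPU in use, 0..1

    def __post_init__(self) -> None:
        if not 0.0 <= self.cpu_utilization <= 1.0:
            raise ValueError("cpu_utilization must be within [0, 1]")

    @property
    def used_cpu_count(self) -> float:
        # Total cores actually consumed: cpu * utilization across the fleet.
        return self.instance_count * self.cpu_per_instance * self.cpu_utilization


def required_cpu(existing: Optional[ExistingDeployment], modeled_cpu: float) -> float:
    """Prefer observed usage; fall back to the model's own estimate."""
    if existing is None:
        return modeled_cpu
    return existing.used_cpu_count


current = ExistingDeployment("m5d.2xlarge", 12, 8, 0.25)
```

Here `required_cpu(current, modeled_cpu=64.0)` uses the observed 24 cores instead of the modeled 64. RAM is deliberately left out, since it may need merging with the modeled value rather than a straight override.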

Capacity plans should return recommended autoscaling policies

Right now we just make a recommendation like "12 m5d.2xlarge", but for software that can autoscale (stateless Java apps, Elasticsearch, etc ...) it would be nice if we could return a hint of the autoscaling policy.

Step 1: Define how we will represent a scaling policy (e.g. how to represent various metrics like CPU utilization etc ...)
Step 2: Make the models return them
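
As a starting point for Step 1, a policy could be represented as a metric, a target value, and instance-count bounds. This is a sketch under assumptions: `ScalingMetric`, `TargetTrackingPolicy`, and the proportional rule in `desired_count` are hypothetical and are not something the library defines today.

```python
import math
from dataclasses import dataclass
from enum import Enum


class ScalingMetric(str, Enum):
    """Hypothetical set of metrics a policy could track."""

    cpu_utilization = "cpu_utilization"
    network_utilization = "network_utilization"


@dataclass(frozen=True)
class TargetTrackingPolicy:
    """Hypothetical policy: keep `metric` near `target_value`."""

    metric: ScalingMetric
    target_value: float  # e.g. 0.5 for 50% utilization
    min_count: int
    max_count: int

    def __post_init__(self) -> None:
        if self.target_value <= 0:
            raise ValueError("target_value must be positive")
        if self.min_count > self.max_count:
            raise ValueError("min_count must not exceed max_count")

    def desired_count(self, current_count: int, observed_value: float) -> int:
        # Proportional rule: scale the fleet by observed / target,
        # round up, then clamp to the allowed bounds.
        raw = math.ceil(current_count * observed_value / self.target_value)
        return max(self.min_count, min(self.max_count, raw))


policy = TargetTrackingPolicy(ScalingMetric.cpu_utilization, 0.5, 4, 24)
```

For example, `policy.desired_count(12, 0.75)` asks for 18 instances, while very low or very high utilization is clamped to the 4 and 24 instance bounds.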

Improve C* scaling logic when including EVCache in KV plan

In our current logic (https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/key_value.py#L85), we scale the C* cluster by a factor of 1 - estimated_kv_cache_hit_rate, where estimated_kv_cache_hit_rate is configurable (default 0.8).

Per a previous convo with @jolynch and @szimmer1, we discussed possibly tying in the read/write ratio from the user desires into this calculation.

One toy example:

estimated_cache_hit_rate = extra_model_arguments.get("estimated_cache_hit_rate", 0.8)
estimated_cache_miss_rate = 1 - estimated_cache_hit_rate
rps_interval.scale(min(estimated_cache_miss_rate, max(0.1, 1 - read_write_ratio)))
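
The toy example can be made runnable as a standalone function. This sketch assumes that read_write_ratio means the fraction of requests that are reads, i.e. reads / (reads + writes); that definition, the function name `cassandra_rps_scale`, and the `floor` parameter name are illustrative assumptions and not part of the library.

```python
def cassandra_rps_scale(
    read_write_ratio: float,
    estimated_cache_hit_rate: float = 0.8,
    floor: float = 0.1,
) -> float:
    """Factor applied to the RPS that Cassandra sees behind the cache.

    Assumes read_write_ratio = reads / (reads + writes), in [0, 1].
    """
    if not 0.0 <= read_write_ratio <= 1.0:
        raise ValueError("read_write_ratio must be within [0, 1]")
    if not 0.0 <= estimated_cache_hit_rate <= 1.0:
        raise ValueError("estimated_cache_hit_rate must be within [0, 1]")
    estimated_cache_miss_rate = 1 - estimated_cache_hit_rate
    # Never scale above the miss rate, and never below the floor
    # unless the miss rate itself is lower than the floor.
    return min(estimated_cache_miss_rate, max(floor, 1 - read_write_ratio))
```

With the default hit rate of 0.8, a 95% read workload scales to roughly 0.1, an 85% read workload to roughly 0.15, and anything with 80% reads or fewer stays at the miss rate of roughly 0.2.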

Adding a new model

Greetings,

I was asked to add a new model to your capacity planner. Are there general directions or documentation on what it takes to add a new model? I see that the existing models vary significantly on how they are implemented. Since the model is provided to the capacity planner, there must be a protocol somewhere that must be followed so the planner understands the model, but if there is, I can't find it.

Thank you,
