lewisfogden / heavylight

A lightweight actuarial modelling framework for Python

Home Page: https://lewisfogden.github.io/heavylight/

License: MIT License

Language: Python 100.00%

Topics: actuarial, cashflow, modelling, projection, actuarial-modelling, insurance

heavylight's People

Contributors: lewisfogden, matthewcaseres


heavylight's Issues

Performance vs. Alternative Implementations

Heavylight generally favours coding simplicity over performance; in particular, the caching classes used to store results add some overhead versus alternatives.

Directly using dictionaries (example below) removes quite a bit of this overhead, but much more boilerplate code is needed: a manual store for each variable must be initialised and then updated.

Version using heavylight:

from heavylight import Model

class Policy(Model):
    def num_pols(self, t):
        if t == 0:
            return 1
        else:
            return self.num_pols(t - 1) - self.num_deaths(t - 1)

    def num_deaths(self, t):
        return self.num_pols(t) * 0.01

def run_pol():
    p = Policy(do_run=True, proj_len=400)
    return p.num_pols(399)

Version using dictionaries - runs faster but is much less readable:

class Policy:
    def __init__(self, proj_len: int):
        self.proj_len = proj_len
        self._v1 = {}
        self._v2 = {}
        
    def num_pols(self, t):     #_v1
        if t in self._v1:
            return self._v1[t]
        else:
            if t == 0:
                value = 1
            else:
                value = self.num_pols(t - 1) - self.num_deaths(t - 1)
            self._v1[t] = value
            return value
        
    def num_deaths(self, t):     #_v2
        if t in self._v2:
            return self._v2[t]
        else:
            value = self.num_pols(t) * 0.01
            self._v2[t] = value
            return value

def run_pol():
    p = Policy(400)
    return p.num_pols(399)
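
To quantify the overhead, a quick micro-benchmark with the standard library's timeit can time whichever run_pol is in scope (a sketch; the repeat count is arbitrary):

import timeit

# Times 100 full projections of whichever run_pol is defined above.
elapsed = timeit.timeit(run_pol, number=100)
print(f"100 runs: {elapsed:.3f}s")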

source code generator compatible with AI compilers

might not be a productive use of time to read this, be warned

motivation

memory bottleneck

Because everything is elementwise and there is no big matmul, the models are memory-bound on GPU and do not utilize all the available FLOPS. https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#element-op

use the AI compiler

To get around the bottleneck we can use an AI compiler.
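
For illustration only (assuming the per-timestep arithmetic were expressed as a plain function over tensors, which is not how the models are written today), torch.compile can fuse elementwise operations into fewer kernels:

import torch

# A hypothetical elementwise projection step; torch.compile traces it and
# fuses the multiply and subtract so the intermediate stays on-chip.
@torch.compile
def step(pols_if: torch.Tensor, mort_rate: torch.Tensor):
    pols_death = pols_if * mort_rate
    return pols_if - pols_death, pols_death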

AI compiler doesn't work with memoization

Our models can't be compiled right now. The logic that checks whether values are already in the cache is probably causing what PyTorch calls a "graph break", since arbitrary Python code has to execute between operations.

To avoid graph breaks, I believe we have to work around memoization/recursion and actually write a loop?

source code generation

But writing out the model as a loop would really suck. So we will have to generate the loop.

It would maybe be easier to just generate the unrolled loop, but 22 formulas across 277 timesteps would be roughly 6,000 lines of code. You can't really edit that by hand in any productive way, so we will probably have to actually write the loop, just for ergonomics.

constraints

I don't want to deal with functions that take anything other than a single integer parameter; will probably enforce that.

implementation details

  • cache_graph.graph is currently unused; probably use that sort of thing.
  • Enforce that timesteps never go back more than 1 (i.e. only t-1 is ever referenced).
  • Enforce that the only argument is ever the timestep.

algorithm:

  • Check for data dependencies on t-1: all functions that are ever called as func(t-1) go into the t_prev_list.
  • Collect the graph for t=0, topologically sort it, source-to-source compile, ending with func_t_prev = func_t for every func in the t_prev_list (see the sketch after this list).
    • t=0 is handled separately because of if t == 0 initialization conditions on pols_if.
  • Collect the graph for t=1, sort, compile.
    • Calls to functions at time t-1 reference func_t_prev.
    • We expect no timestep-related conditionals to be in play here, unlike at t=0.
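
A minimal sketch of the per-timestep ordering step, using the standard library's graphlib (the dependency dict and the emitted assignments are hypothetical):

from graphlib import TopologicalSorter

# Same-timestep dependencies only: pols_death(t) calls pols_if(t);
# calls at t-1 go through the *_prev variables, so they add no edges here.
deps = {
    "pols_death": {"pols_if"},
    "pols_if": set(),
}

order = list(TopologicalSorter(deps).static_order())  # ['pols_if', 'pols_death']
body = [f"{name} = compute_{name}(...)" for name in order]
body += [f"{name}_prev = {name}" for name in order]
print("\n".join(body))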

The whole generated function can be parameterized by t, which determines the number of iterations in the loop, or something like that. At the end of the day, the output of the compiler would look like this:

class MyClass:
    def __init__(self):
        # same code as before
        self.mp = ...
        ...
    def run(self, max_t: int):
        # t = 0: handled separately for the initialization conditions
        pols_if = self.mp.pols_if_init
        pols_death = pols_if * assume.mort_rate
        pols_if_prev = pols_if
        pols_death_prev = pols_death
        # t = 1..max_t: recurrence, reading the *_prev values
        for _ in range(max_t):
            pols_if = pols_if_prev - pols_death_prev
            pols_death = pols_if * assume.mort_rate
            pols_if_prev = pols_if
            pols_death_prev = pols_death

Tables (heavytables): String lookups match on incorrect values

If a table has keys 'A', 'B' and 'C', then looking up table['AB'] returns the value for table['B'].

Cause: np.searchsorted places 'AB' between 'A' and 'B'

Ideal behaviour: Should return np.nan or raise an exception if the key doesn't exist.

As keys and data should be aligned, this shouldn't happen in practice, but if incorrect data is passed in, the lookup will not fail.
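
A quick reproduction of the mechanism (a sketch using plain numpy, outside heavytables):

import numpy as np

keys = np.array(['A', 'B', 'C'])
# 'AB' sorts between 'A' and 'B', so searchsorted silently returns index 1
# (the position of 'B') instead of signalling a missing key.
idx = np.searchsorted(keys, 'AB')
print(idx, keys[idx])   # 1 B

# A validation step along the lines of the ideal behaviour:
if keys[idx] != 'AB':
    raise KeyError("'AB' is not a table key")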

What is ideal behavior for band lookups above max value?

I think in the past you wanted to return np.nan, but that changed the dtype of the array.

Currently it throws an error.

I almost think we shouldn't even throw the error. If someone says 999999, surely they mean np.inf anyway?

What is the best possible behavior?
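
One option, sketched below with hypothetical band data, is to make the bounds check optional and otherwise clip to the top band:

import numpy as np

band_upper = np.array([10, 100, 1000])     # hypothetical band upper bounds
band_values = np.array([0.1, 0.2, 0.3])

def band_lookup(x, strict=False):
    idx = np.searchsorted(band_upper, x)
    if strict and np.any(idx >= len(band_upper)):
        raise KeyError("value above maximum band")
    # non-strict: treat anything above the max (e.g. 999999) as the top band
    return band_values[np.minimum(idx, len(band_upper) - 1)]

band_lookup(999999)   # 0.3 rather than an error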

What utility/support items do we need?

A few topics

Inputs

  • preparing inputs (probably a simple dataframe converter; see the sketch after this list)
  • validation (would need to specify the datatype of each input somewhere in the model?)
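
A possible shape for the converter (names hypothetical):

import numpy as np
import pandas as pd

# Turns a model-point DataFrame into a dict of numpy arrays, one per column,
# ready to be passed into a vectorised model as keyword arguments.
def dataframe_to_inputs(df: pd.DataFrame) -> dict:
    return {col: df[col].to_numpy() for col in df.columns}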

Outputs

  • function to extract one model point from a vectorised run (mostly written)
  • function to summarise all model points (mostly written)
  • function to extract specific variables, either aggregated or per model point (pandas.DataFrame.agg style?)
  • exporter that saves to Excel, and includes function definition as a comment/note.

Examples

  • function to generate a new run folder containing demo/example models/model templates (e.g. heavylight.demo.create_sample('numpy_template', 'path/to/folder'))

ban kwargs?

I don't think the model you developed supports them. They are pretty annoying; should I stop supporting them as well? Should you start supporting them? Does it matter?

method level aggregation on LightModel

The LightModel currently aggregates all methods using the storage function. This is not practical because some functions might return different data types; it becomes messy and we end up with if statements and such.

So try using decorators to apply storage functions at the method level. For benchmarking purposes, this allows a reduction in the number of floating-point operations used to calculate results.
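
A minimal sketch of what a method-level decorator could look like (the decorator name and cache integration are hypothetical, not the current LightModel API):

import numpy as np

def agg(storage_func):
    """Attach a per-method aggregation function for the cache to apply."""
    def decorator(method):
        method.agg_func = storage_func
        return method
    return decorator

class TermModel:
    @agg(np.sum)                 # aggregate across model points when storing
    def net_cf(self, t):
        ...

    @agg(lambda arr: arr)        # store the raw array, no aggregation
    def pols_if(self, t):
        ...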

Memory optimization, not optimal

Was seeing cache-size reductions of 80% when larger reductions (> 99%) were expected on the lifelib Term_ME model.

A cached function pols_new_biz that is calculated and then cleared by another function (expenses) will still be called in the loop:

    for t in range(proj_len+1):
        for func in model._single_param_timestep_funcs:
            func(t)

Issue with negative indexes

From another issue

I've found an issue with the indexing in Tables - if you have multiple keys, with an integer key going from 18-90 (say), and you look up 2, you can get a false positive: the offset goes negative and it finds an earlier index. (2 would find the value for 90-16 for the prior key.)

Possible solution: resolve by bounds checking. Possibly make the check optional, like Table(safe=False), with safe=True as the default. safe=True performs any important validations that take time; safe=False skips them.
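
A sketch of the safe lookup (the key/value arrays and the function name are illustrative):

import numpy as np

def int_key_lookup(keys: np.ndarray, values: np.ndarray, x, safe: bool = True):
    # keys must be sorted integers (e.g. ages 18..90), aligned with values
    if safe and (np.any(x < keys[0]) or np.any(x > keys[-1])):
        raise KeyError(f"lookup value outside key range [{keys[0]}, {keys[-1]}]")
    return values[np.searchsorted(keys, x)]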

`_Cache.values` returns insertion order, not time order.

import heavylight

class Non_T_Model(heavylight.Model):
    def get_a(self, a):
        return a * 10

ntm = Non_T_Model(proj_len=10)
ntm.get_a(5)
ntm.get_a(10)
ntm.get_a(7)

After this, ntm.get_a.values will return [50, 100, 70].

I don't think this is a big issue, as non-t functions are unlikely to have a meaningful order; however, it might cause issues.

Aside: I've added a keys property, e.g. ntm.get_a.keys which is aligned to the values.

Memory optimized runs, `generate_actions` and `execute_actions`

If we optimize the model, it basically says: after executing a function, clear some other functions' caches. This is sensitive to the order the functions are executed in, so it is possible that there will be cache misses. This won't happen if both calls are made through RunModel(c), where c is the same for an optimized vs. unoptimized model; the call orders are then the same, so it is not terrible.

But still, this is a footgun. The generate_actions and execute_actions approach avoids this by making model optimizations happen only inside a separate execute_actions context, rather than clearing the cache during normal model execution based on some internal state.

https://modelx.io/blog/2022/03/26/running-model-while-saving-memory/

Do projections need re-run? (i.e. does the cache need to be cleared)

The current approach in heavylight is that projections run when the instance is created, e.g. if the user model is:

class MyModel(heavylight.Model):
    def <user_method>(self, t):
        return <stuff>

Then when this is run with proj1 = MyModel(do_run=True, proj_len=10), the model will be run and the results stored in the cache.

Once the projection has run, users can access values from proj1.<user_method>(t) for individual values, proj1.<user_method>.values for an array, and proj1.ToDataFrame() to pull all single-parameter values into a dataframe (handy for debugging/viewing, as it is easy to copy into Excel). There is also a sum method on the cache which returns the total, e.g. proj1.<user_method>.sum().

The proj_len variable controls how much of the projection is pre-computed (t=0 to proj_len-1); if the user requests a method result from beyond this, the model will run through all the intermediate calculations and cache them.

e.g. proj1.<user_method>(20) would calculate a further 11 values and cache them (10, 11, ..., 20).
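
Putting that together, a minimal usage sketch (the method name is illustrative):

import heavylight

class MyModel(heavylight.Model):
    def net_cf(self, t):
        return 100.0 if t == 0 else self.net_cf(t - 1) * 0.99

proj1 = MyModel(do_run=True, proj_len=10)   # runs t = 0..9 up front
proj1.net_cf(5)           # single cached value
proj1.net_cf.values       # all cached values as an array
proj1.net_cf.sum()        # total over the cache
df = proj1.ToDataFrame()  # single-parameter methods as a dataframe
proj1.net_cf(20)          # extends the cache through t = 20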

The rationale for pre-computing is that deep recursion can overload the Python stack.

I initially allowed the cache to be cleared; however, I found this risky (I use some proprietary software which doesn't always clear the cache correctly 😯), and instead decided that if a new projection is needed, you should just create a new instance.

Thoughts on bringing heavylight under a single API?

Changes I'm thinking of; @MatthewCaseres, I would like your views too.

  1. BeforeRun / AfterRun to be removed as not useful.
  2. Initialisation: remove the need for a user __init__ (i.e. Model behaviour rather than LightModel behaviour).
  3. Backends: combine via a backend parameter on Model, covering:
    • standard (i.e. the current model), non-optimising
    • t-2: a simple aggregator that applies agg_func to all methods taking t as a parameter
    • graph: memory optimised
  4. t-2 and graph both take the agg_func parameter, which defaults to np.sum.
  5. graph requires a pre-run to optimise; this samples 50 points (say) from the dataset, but could accept an optional optimiser_data parameter?
  6. All other keyword arguments become attributes of the class instance (e.g. data/basis), as per current heavylight.Model behaviour.
  7. Use of the backend allows us to add different optimisers in future (potentially even a compiler).

The caller might look like this:

proj = Model(data=data, basis=basis, proj_len=120, backend='t-2', agg_func=np.mean)

renaming the package

heavylight is the name of a famous algorithm (heavy-light decomposition), so maybe this package can't easily be found online. Why not rename it to something like LifeInsurance, so that people looking up "life insurance python" find out about it?

edit: another way to increase discoverability is making a PR to lifelib once the package is more stabilized

Functions using future periods cause RecursionError

As stated in the README, functions that use future periods cause a RecursionError. For example:

import heavylight

class MyModel(heavylight.Model):
    def test(self, t):
        if t == 1440:
            return 1
        return self.test(t+1)

model = MyModel(do_run=True, proj_len=1440)
model_cashflows = model.ToDataFrame()
print(model_cashflows)

results in:

RecursionError: maximum recursion depth exceeded while calling a Python object

If I change line 141 in heavylight.py from:

for t in range(proj_len):

to

for t in range(proj_len, -1, -1):

then it works without any problems.

So maybe the package should check whether the given function calls itself with argument t+..., and if so, run the for-loop backwards.
The ast module can be used to inspect the function's arguments.
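
A sketch of that check using ast and inspect (the function name is hypothetical):

import ast
import inspect
import textwrap

def calls_future(func) -> bool:
    """Return True if the function body contains a call whose first
    argument is t + <something>, suggesting a backwards loop is needed."""
    tree = ast.parse(textwrap.dedent(inspect.getsource(func)))
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and node.args:
            arg = node.args[0]
            if (isinstance(arg, ast.BinOp) and isinstance(arg.op, ast.Add)
                    and isinstance(arg.left, ast.Name) and arg.left.id == "t"):
                return True
    return False

calls_future(MyModel.test)   # True for the example above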

write tests

Tests

basic

  • check run works
  • check cache.values & cache.sum() (see the sketch after these lists)

tables only

  • test tables constructed from dataframes
  • test tables from csv / xlsx
  • test vectorised and non-vectorised lookup

combined

  • test tables & model combined

advanced

  • optimised cache etc
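
A minimal pytest-style sketch of the basic checks (the model and expected values are illustrative):

import heavylight

class TwoStep(heavylight.Model):
    def val(self, t):
        return 1 if t == 0 else self.val(t - 1) + 1

def test_run_and_cache():
    m = TwoStep(do_run=True, proj_len=3)
    assert m.val(2) == 3
    assert list(m.val.values) == [1, 2, 3]
    assert m.val.sum() == 6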

Differences between `Model` and `LightModel`

I'm starting up some docs and it isn't clear how to motivate the difference between the models. From a historical perspective, I wanted to do my own thing and contribute to a group effort without breaking existing stuff.

How do we communicate this to users in a way that isn't confusing? Do I just say

LightModel is like Model but it implements memory optimizations, automatic aggregations, and has a slightly different API

and move on from it? Is there anything to really be done from a source code perspective here that would be easy to do?
