lewisfogden / heavylight

A lightweight actuarial modelling framework for Python

Home Page: https://lewisfogden.github.io/heavylight/

License: MIT License

Language: Python 100.00%

Topics: actuarial, cashflow, modelling, projection, actuarial-modelling, insurance

heavylight's People

Contributors: lewisfogden, matthewcaseres


heavylight's Issues

Performance vs. Alternative Implementations

Heavylight generally favours coding simplicity over performance; in particular, the caching classes used to store results add some overhead versus alternatives.

Directly using dictionaries (example below) removes quite a bit of this overhead, but much more boilerplate code is needed: a manual store for each variable must be initialised and then updated.

Version using heavylight:

from heavylight import Model

class Policy(Model):
    def num_pols(self, t):
        if t == 0:
            return 1
        else:
            return self.num_pols(t - 1) - self.num_deaths(t - 1)

    def num_deaths(self, t):
        return self.num_pols(t) * 0.01

def run_pol():
    p = Policy(do_run=True, proj_len=400)
    return p.num_pols(399)

Version using dictionaries - runs faster but is much less readable:

class Policy:
    def __init__(self, proj_len: int):
        self.proj_len = proj_len
        self._v1 = {}
        self._v2 = {}
        
    def num_pols(self, t):     #_v1
        if t in self._v1:
            return self._v1[t]
        else:
            if t == 0:
                value = 1
            else:
                value = self.num_pols(t - 1) - self.num_deaths(t - 1)
            self._v1[t] = value
            return value
        
    def num_deaths(self, t):     #_v2
        if t in self._v2:
            return self._v2[t]
        else:
            value = self.num_pols(t) * 0.01
            self._v2[t] = value
            return value

def run_pol():
    p = Policy(400)
    return p.num_pols(399)
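
To quantify the overhead, a quick micro-benchmark with the standard library's timeit can time whichever run_pol is in scope (a sketch; the repeat count is arbitrary):

import timeit

# Times 100 full projections of whichever run_pol is defined above.
elapsed = timeit.timeit(run_pol, number=100)
print(f"100 runs: {elapsed:.3f}s")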

source code generator compatible with AI compilers

might not be a productive use of time to read this, be warned

motivation

memory bottleneck

Because everything is elementwise and there is no big matmul, the models are memory-bound on GPU and do not utilize all the available FLOPS. https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#element-op

use the AI compiler

To get around the bottleneck we can use an AI compiler.
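
For illustration only (assuming the per-timestep arithmetic were expressed as a plain function over tensors, which is not how the models are written today), torch.compile can fuse elementwise operations into fewer kernels:

import torch

# A hypothetical elementwise projection step; torch.compile traces it and
# fuses the multiply and subtract so the intermediate stays on-chip.
@torch.compile
def step(pols_if: torch.Tensor, mort_rate: torch.Tensor):
    pols_death = pols_if * mort_rate
    return pols_if - pols_death, pols_death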

AI compiler doesn't work with memoization

Our models can't be compiled right now. The logic that checks whether values are already in the cache is probably causing what PyTorch calls a "graph break", since arbitrary Python code has to execute between operations.

To avoid graph breaks, I believe we have to work around memoization/recursion and actually write a loop?

source code generation

But writing out the model as a loop would really suck. So we will have to generate the loop.

It would maybe be easier to just generate the unrolled loop, but 22 formulas across 277 timesteps would be roughly 6,000 lines of code. You can't really edit that by hand in any productive way, so we will probably have to actually write the loop, just for ergonomics.

constraints

I don't want to deal with functions that take anything other than a single integer parameter; will probably enforce that.

implementation details

  • cache_graph.graph is currently unused; probably use that sort of thing.
  • Enforce that timesteps never go back more than 1 (i.e. only t-1 is ever referenced).
  • Enforce that the only argument is ever the timestep.

algorithm:

  • Check for data dependencies on t-1: all functions that are ever called as func(t-1) go into the t_prev_list.
  • Collect the graph for t=0, topologically sort it, source-to-source compile, ending with func_t_prev = func_t for every func in the t_prev_list (see the sketch after this list).
    • t=0 is handled separately because of if t == 0 initialization conditions on pols_if.
  • Collect the graph for t=1, sort, compile.
    • Calls to functions at time t-1 reference func_t_prev.
    • We expect no timestep-related conditionals to be in play here, unlike at t=0.
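
A minimal sketch of the per-timestep ordering step, using the standard library's graphlib (the dependency dict and the emitted assignments are hypothetical):

from graphlib import TopologicalSorter

# Same-timestep dependencies only: pols_death(t) calls pols_if(t);
# calls at t-1 go through the *_prev variables, so they add no edges here.
deps = {
    "pols_death": {"pols_if"},
    "pols_if": set(),
}

order = list(TopologicalSorter(deps).static_order())  # ['pols_if', 'pols_death']
body = [f"{name} = compute_{name}(...)" for name in order]
body += [f"{name}_prev = {name}" for name in order]
print("\n".join(body))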

The whole generated function can be parameterized by t, which determines the number of iterations in the loop, or something like that. At the end of the day, the output of the compiler would look like this:

class MyClass:
    def __init__(self):
        # same code as before
        self.mp = ...
        ...
    def run(self, max_t: int):
        # t = 0: handled separately for the initialization conditions
        pols_if = self.mp.pols_if_init
        pols_death = pols_if * assume.mort_rate
        pols_if_prev = pols_if
        pols_death_prev = pols_death
        # t = 1..max_t: recurrence, reading the *_prev values
        for _ in range(max_t):
            pols_if = pols_if_prev - pols_death_prev
            pols_death = pols_if * assume.mort_rate
            pols_if_prev = pols_if
            pols_death_prev = pols_death

Tables (heavytables): String lookups match on incorrect values

If a table has keys 'A', 'B' and 'C', then looking up table['AB'] returns the value for table['B'].

Cause: np.searchsorted places 'AB' between 'A' and 'B'

Ideal behaviour: Should return np.nan or raise an exception if the key doesn't exist.

As keys and data should be aligned, this shouldn't happen in practice, but if incorrect data is passed in, the lookup will not fail.
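
A quick reproduction of the mechanism (a sketch using plain numpy, outside heavytables):

import numpy as np

keys = np.array(['A', 'B', 'C'])
# 'AB' sorts between 'A' and 'B', so searchsorted silently returns index 1
# (the position of 'B') instead of signalling a missing key.
idx = np.searchsorted(keys, 'AB')
print(idx, keys[idx])   # 1 B

# A validation step along the lines of the ideal behaviour:
if keys[idx] != 'AB':
    raise KeyError("'AB' is not a table key")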

What is ideal behavior for band lookups above max value?

I think in the past you wanted to return np.nan, but that changed the dtype of the array.

Currently it throws an error.

I almost think we shouldn't even throw the error. If someone says 999999, surely they mean np.inf anyway?

What is the best possible behavior?
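
One option, sketched below with hypothetical band data, is to make the bounds check optional and otherwise clip to the top band:

import numpy as np

band_upper = np.array([10, 100, 1000])     # hypothetical band upper bounds
band_values = np.array([0.1, 0.2, 0.3])

def band_lookup(x, strict=False):
    idx = np.searchsorted(band_upper, x)
    if strict and np.any(idx >= len(band_upper)):
        raise KeyError("value above maximum band")
    # non-strict: treat anything above the max (e.g. 999999) as the top band
    return band_values[np.minimum(idx, len(band_upper) - 1)]

band_lookup(999999)   # 0.3 rather than an error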

What utility/support items do we need?

A few topics

Inputs

  • preparing inputs (probably a simple dataframe converter; see the sketch after this list)
  • validation (would need to specify the datatype of each input somewhere in the model?)
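
A possible shape for the converter (names hypothetical):

import numpy as np
import pandas as pd

# Turns a model-point DataFrame into a dict of numpy arrays, one per column,
# ready to be passed into a vectorised model as keyword arguments.
def dataframe_to_inputs(df: pd.DataFrame) -> dict:
    return {col: df[col].to_numpy() for col in df.columns}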

Outputs

  • function to extract one model point from a vectorised run (mostly written)
  • function to summarise all model points (mostly written)
  • function to extract specific variables, either aggregated or per model point (pandas.DataFrame.agg style?)
  • exporter that saves to Excel, and includes function definition as a comment/note.

Examples

  • function to generate a new run folder containing demo/example models/model templates (e.g. heavylight.demo.create_sample('numpy_template', 'path/to/folder'))

ban kwargs?

I don't think the model you developed supports them. They are pretty annoying; should I stop supporting them as well? Should you start supporting them? Does it matter?

method level aggregation on LightModel

The LightModel currently aggregates all methods using the storage function. This is not practical because some functions might return different data types; it becomes messy and we end up with if statements and such.

So try using decorators to apply storage functions at the method level. For benchmarking purposes, this allows a reduction in the number of floating-point operations used to calculate results.
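
A minimal sketch of what a method-level decorator could look like (the decorator name and cache integration are hypothetical, not the current LightModel API):

import numpy as np

def agg(storage_func):
    """Attach a per-method aggregation function for the cache to apply."""
    def decorator(method):
        method.agg_func = storage_func
        return method
    return decorator

class TermModel:
    @agg(np.sum)                 # aggregate across model points when storing
    def net_cf(self, t):
        ...

    @agg(lambda arr: arr)        # store the raw array, no aggregation
    def pols_if(self, t):
        ...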

Memory optimization, not optimal

Was seeing cache-size reductions of 80% when larger reductions (> 99%) were expected on the lifelib Term_ME model.

A cached function pols_new_biz that is calculated and then cleared by another function (expenses) will still be called in the loop:

    for t in range(proj_len+1):
        for func in model._single_param_timestep_funcs:
            func(t)

Issue with negative indexes

From another issue

I've found an issue with the indexing in Tables - if you have multiple keys, with an integer key going from 18-90 (say), and you look up 2, you can get a false positive: the offset goes negative and it finds an earlier index. (2 would find the value for 90-16 for the prior key.)

Possible solution: resolve by bounds checking. Possibly make the check optional, like Table(safe=False), with safe=True as the default. safe=True performs any important validations that take time; safe=False skips them.
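
A sketch of the safe lookup (the key/value arrays and the function name are illustrative):

import numpy as np

def int_key_lookup(keys: np.ndarray, values: np.ndarray, x, safe: bool = True):
    # keys must be sorted integers (e.g. ages 18..90), aligned with values
    if safe and (np.any(x < keys[0]) or np.any(x > keys[-1])):
        raise KeyError(f"lookup value outside key range [{keys[0]}, {keys[-1]}]")
    return values[np.searchsorted(keys, x)]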

`_Cache.values` returns insertion order, not time order.

import heavylight

class Non_T_Model(heavylight.Model):
    def get_a(self, a):
        return a * 10

ntm = Non_T_Model(proj_len=10)
ntm.get_a(5)
ntm.get_a(10)
ntm.get_a(7)

After this, ntm.get_a.values will return [50, 100, 70].

I don't think this is a big issue, as non-t functions are unlikely to have a meaningful order; however, it might cause issues.

Aside: I've added a keys property, e.g. ntm.get_a.keys which is aligned to the values.

Memory optimized runs, `generate_actions` and `execute_actions`

If we optimize the model, it basically says: after executing a function, clear some other functions' caches. This is sensitive to the order the functions are executed in, so it is possible that there will be cache misses. This won't happen if both calls are made through RunModel(c), where c is the same for an optimized vs. unoptimized model; the call orders are then the same, so it is not terrible.

But still, this is a footgun. The generate_actions and execute_actions approach avoids this by making model optimizations happen only inside a separate execute_actions context, rather than clearing the cache during normal model execution based on some internal state.

https://modelx.io/blog/2022/03/26/running-model-while-saving-memory/

Do projections need re-run? (i.e. does the cache need to be cleared)

The current approach in heavylight is that projections run when the instance is created, e.g. if the user model is:

class MyModel(heavylight.Model):
    def <user_method>(self, t):
        return <stuff>

Then when this is run with proj1 = MyModel(do_run=True, proj_len=10), the model will be run and the results stored in the cache.

Once the projection has run, users can access values from proj1.<user_method>(t) for individual values, proj1.<user_method>.values for an array, and proj1.ToDataFrame() to pull all single-parameter values into a dataframe (handy for debugging/viewing, as it is easy to copy into Excel). There is also a sum method on the cache which returns the total, e.g. proj1.<user_method>.sum().

The proj_len variable controls how much of the projection is pre-computed (t=0 to proj_len-1); if the user requests a method result from beyond this, the model will run through all the intermediate calculations and cache them.

e.g. proj1.<user_method>(20) would calculate a further 11 values and cache them (10, 11, ..., 20).
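
Putting that together, a minimal usage sketch (the method name is illustrative):

import heavylight

class MyModel(heavylight.Model):
    def net_cf(self, t):
        return 100.0 if t == 0 else self.net_cf(t - 1) * 0.99

proj1 = MyModel(do_run=True, proj_len=10)   # runs t = 0..9 up front
proj1.net_cf(5)           # single cached value
proj1.net_cf.values       # all cached values as an array
proj1.net_cf.sum()        # total over the cache
df = proj1.ToDataFrame()  # single-parameter methods as a dataframe
proj1.net_cf(20)          # extends the cache through t = 20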

The rationale for pre-computing is that deep recursion can overload the Python stack.

I initially allowed the cache to be cleared; however, I found this risky (I use some proprietary software which doesn't always clear the cache correctly 😯), and instead decided that if a new projection is needed, you should just create a new instance.

Thoughts on bringing heavylight under a single API?

Changes I'm thinking of; @MatthewCaseres, I would like your views too.

  1. BeforeRun / AfterRun to be removed as not useful.
  2. Initialisation: remove the need for a user __init__ (i.e. Model behaviour rather than LightModel behaviour).
  3. Backends: combine via a backend parameter on Model, covering:
    • standard (i.e. the current model), non-optimising
    • t-2: a simple aggregator that applies agg_func to all methods taking t as a parameter
    • graph: memory optimised
  4. t-2 and graph both take the agg_func parameter, which defaults to np.sum.
  5. graph requires a pre-run to optimise; this samples 50 points (say) from the dataset, but could accept an optional optimiser_data parameter?
  6. All other keyword arguments become attributes of the class instance (e.g. data/basis), as per current heavylight.Model behaviour.
  7. Use of the backend allows us to add different optimisers in future (potentially even a compiler).

The caller might look like this:

proj = Model(data=data, basis=basis, proj_len=120, backend='t-2', agg_func=np.mean)

renaming the package

heavylight is the name of a famous algorithm (heavy-light decomposition), so maybe this package can't easily be found online. Why not rename it to something like LifeInsurance, so that people looking up "life insurance python" find out about it?

edit: another way to increase discoverability is making a PR to lifelib once the package is more stabilized

Functions using future periods cause RecursionError

As stated in the README, functions that use future periods cause a RecursionError. For example:

import heavylight

class MyModel(heavylight.Model):
    def test(self, t):
        if t == 1440:
            return 1
        return self.test(t+1)

model = MyModel(do_run=True, proj_len=1440)
model_cashflows = model.ToDataFrame()
print(model_cashflows)

results in:

RecursionError: maximum recursion depth exceeded while calling a Python object

If I change line 141 in heavylight.py from:

for t in range(proj_len):

to

for t in range(proj_len, -1, -1):

then it works without any problems.

So maybe the package should check whether the given function calls itself with argument t+..., and if so, run the for-loop backwards.
The ast module can be used to inspect the function's arguments.
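
A sketch of that check using ast and inspect (the function name is hypothetical):

import ast
import inspect
import textwrap

def calls_future(func) -> bool:
    """Return True if the function body contains a call whose first
    argument is t + <something>, suggesting a backwards loop is needed."""
    tree = ast.parse(textwrap.dedent(inspect.getsource(func)))
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and node.args:
            arg = node.args[0]
            if (isinstance(arg, ast.BinOp) and isinstance(arg.op, ast.Add)
                    and isinstance(arg.left, ast.Name) and arg.left.id == "t"):
                return True
    return False

calls_future(MyModel.test)   # True for the example above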

write tests

Tests

basic

  • check run works
  • check cache.values & cache.sum() (see the sketch after these lists)

tables only

  • test tables constructed from dataframes
  • test tables from csv / xlsx
  • test vectorised and non-vectorised lookup

combined

  • test tables & model combined

advanced

  • optimised cache etc
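
A minimal pytest-style sketch of the basic checks (the model and expected values are illustrative):

import heavylight

class TwoStep(heavylight.Model):
    def val(self, t):
        return 1 if t == 0 else self.val(t - 1) + 1

def test_run_and_cache():
    m = TwoStep(do_run=True, proj_len=3)
    assert m.val(2) == 3
    assert list(m.val.values) == [1, 2, 3]
    assert m.val.sum() == 6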

Differences between `Model` and `LightModel`

I'm starting up some docs and it isn't clear how to motivate the difference between the models. From a historical perspective, I wanted to do my own thing and contribute to a group effort without breaking existing stuff.

How do we communicate this to users in a way that isn't confusing? Do I just say

LightModel is like Model but it implements memory optimizations, automatic aggregations, and has a slightly different API

and move on from it? Is there anything to really be done from a source code perspective here that would be easy to do?
