tjhladish / abcsmc

Sequential Monte Carlo Approximate Bayesian Computation with Partial Least Squares parameter estimator

License: GNU General Public License v3.0

C++ 1.83% C 97.97% Makefile 0.01% R 0.17% Shell 0.01% CMake 0.02%

abcsmc's People

Contributors

pearsonca, tjhladish

abcsmc's Issues

Transformations

I'm doing a complete write up of transformations as part of refactor. What is currently supported, as I understand it:

// a = sum of transformed scale addends
// b = prod of transformed scale factors
// c = sum of untransformed scale addends
// d = prod of untransformed scale factors
// (n.b. the transformed / untransformed bit only applies to when a-d applied - the actual values sum/prod'd are all on the transformed scale)
// u = arbitrary untransform function (though limited selection from config files)
//
// rescale.1, rescale.2 = config-specified values
//
// x = value on the _fitting_ scale
// x' = value on the _model_ scale
//
// x' = (rescale.2 - rescale.1) * (u((x + a) * b) + c) * d + rescale.1

That grok as correct?
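
To make the formula concrete, here is a minimal sketch of it as code. The names `Transform`, `to_model_scale` and `logistic` are hypothetical and not AbcSmc's actual API; `u` is assumed to be any callable from double to double, with identity as the default.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Hypothetical container for the quantities named a-d, u, rescale.1 and rescale.2 above.
struct Transform {
    double a = 0.0;         // sum of transformed-scale addends
    double b = 1.0;         // product of transformed-scale factors
    double c = 0.0;         // sum of untransformed-scale addends
    double d = 1.0;         // product of untransformed-scale factors
    double rescale1 = 0.0;  // config-specified lower value
    double rescale2 = 1.0;  // config-specified upper value
    std::function<double(double)> u = [](double v) { return v; };  // identity by default
};

// x is on the fitting scale; the return value is on the model scale.
inline double to_model_scale(const Transform& t, double x) {
    assert(t.u && "an untransform function must be set");
    return (t.rescale2 - t.rescale1) * (t.u((x + t.a) * t.b) + t.c) * t.d + t.rescale1;
}

// One possible choice of u: the standard logistic function.
inline double logistic(double v) { return 1.0 / (1.0 + std::exp(-v)); }
```

With the defaults the function is the identity, which gives a quick sanity check for any refactor of the real implementation.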

DB-less Run Mode

For bundled executables, we want a version that can easily run DB-less: either ingesting a row from the command line, à la the run-via-executable simulation, or iterating over a vector of parameters.

@tjhladish any other design constraint thoughts from today's problem or others?

AbcSmc destructor

There is no dtor for the AbcSmc object. I believe that an explicit dtor is required in this case to address memory leaks related to the _model_pars and _model_mets class members. These pointers will not be cleaned up and deleted without an explicit dtor.
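
A sketch of the two obvious fixes, assuming `_model_pars` and `_model_mets` are vectors of owning raw pointers. `Parameter` and `Metric` here are empty placeholders, and the class names `AbcSmcRaw` / `AbcSmcOwned` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Placeholder types; the real Parameter / Metric classes are richer.
struct Parameter { double value = 0.0; };
struct Metric { double value = 0.0; };

// Option 1: keep raw owning pointers and delete them in an explicit destructor.
class AbcSmcRaw {
  public:
    AbcSmcRaw() = default;
    AbcSmcRaw(const AbcSmcRaw&) = delete;             // copying would double-delete
    AbcSmcRaw& operator=(const AbcSmcRaw&) = delete;
    ~AbcSmcRaw() {
        for (Parameter* p : _model_pars) { delete p; }
        for (Metric* m : _model_mets) { delete m; }
    }
    void add_parameter(Parameter* p) { assert(p != nullptr); _model_pars.push_back(p); }
    void add_metric(Metric* m) { assert(m != nullptr); _model_mets.push_back(m); }
    std::size_t npar() const { return _model_pars.size(); }
  private:
    std::vector<Parameter*> _model_pars;
    std::vector<Metric*> _model_mets;
};

// Option 2: let std::unique_ptr own the objects, so no destructor has to be written.
class AbcSmcOwned {
  public:
    void add_parameter(std::unique_ptr<Parameter> p) {
        assert(p != nullptr);
        _model_pars.push_back(std::move(p));
    }
    std::size_t npar() const { return _model_pars.size(); }
  private:
    std::vector<std::unique_ptr<Parameter>> _model_pars;
    std::vector<std::unique_ptr<Metric>> _model_mets;
};
```

Option 2 is less code to maintain, but changes the type of the members, so it touches every place that currently reads them.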

read-last-set-only filtering mode

Currently, every pass with AbcSmc must ingest the entire history of complete jobs to process the latest set => propose next set. This is because the weights are computed iteratively: the latest set weights actually depend on the entire history of weights. We don't store those weights, nor the parameters for noising draws.

We should continue to store the history, but if we also start storing weights and doubled variances, then we can simplify the code that deals with filtering to just take the last complete step. This could allow us to substantially trim down the AbcSmc object itself, getting rid of all the vectors-of-stuff and the assorted look-up-by-set accessors.

At the moment, we have no fitting-algorithmic reason to use all the data, though we do look at it for diagnostic reasons. We don't generally need an AbcSmc object, however, to do the diagnostic steps.

"fluent" / "flow" interface?

For many of the set_... methods in AbcSmc, it could be useful to return the self pointer, which would enable using the library like:

AbcSmc* abc = (new AbcSmc())->parse_config(args.config_file)->set_simulator(simulator);

instead of the current

AbcSmc* abc = new AbcSmc();
abc->parse_config(args.config_file);
abc->set_simulator(simulator);

The advantage is enabling that sort of fluent/flow syntax (which some people like to use). The main disadvantage is losing the ability to have the methods return true/false "success" indicators (e.g.).
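
One way to soften that disadvantage is to report failure by exception rather than by return value. A minimal sketch, using a stand-in class (`FluentAbc`, `Simulator` and `echo_simulator` are hypothetical names, and the real parsing is omitted):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal simulator signature, used only for this sketch.
using Simulator = std::vector<double> (*)(const std::vector<double>&);

class FluentAbc {
  public:
    // Each setter returns the self pointer so calls can be chained with ->.
    FluentAbc* parse_config(const std::string& filename) {
        if (filename.empty()) { throw std::invalid_argument("config filename is empty"); }
        _config_file = filename;  // real parsing omitted
        return this;
    }
    FluentAbc* set_simulator(Simulator sim) {
        if (sim == nullptr) { throw std::invalid_argument("simulator is null"); }
        _simulator = sim;
        return this;
    }
    const std::string& config_file() const { return _config_file; }
  private:
    std::string _config_file;
    Simulator _simulator = nullptr;
};

// Trivial simulator that returns its parameters unchanged.
inline std::vector<double> echo_simulator(const std::vector<double>& pars) { return pars; }
```

Usage would then look like `abc.parse_config("abc.json")->set_simulator(echo_simulator);`, with a `try` / `catch` around the chain where the caller cares about failure.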

R interface support

Our typical use case for post processing analysis (e.g., visualization, scenario evaluation) is R-based, using a relatively portable collection of tools.

I'm slowly but surely aggregating my big-stochastic-simulation-of-many-scenarios set of generic tooling over here. AbcSmc-specific stuff probably doesn't belong in that, but I think AbcSmc could come with an internal R package gathering up well-prescribed combinations of jsonlite + data.table + RSQLite functions that move from AbcSmc outputs to the kind of stuff that library wants.

The idea would be to provide a user interface that masks the json / SQL syntax + AbcSmc structural decisions with R functions that yield data.table objects.

replace Eigen distribution with submodule approach

reference eigen repo: https://gitlab.com/libeigen/eigen/-/tree/master/

can basically purge Eigen as specific files, replace with the submodule root, and then need to address path issues.

This would entail slightly more disk footprint when building AbcSmc (since it's the whole repo, vs just the headers), but that's pretty marginal. Longer term would simplify updating Eigen (i.e. just pull in the submodule, then update AbcSmc's record of where the submodule is).

Handle getting rid of old database safely

Users may want a clean slate to re-run an experiment, i.e., delete the sqlite db in order to try again. (Currently, this has to be done manually, rm mydb.sqlite)

This can be dangerous, however, as the database could represent an arbitrary amount of compute time, and we don't want to make it too easy to delete dbs.

Suggest adding a command line argument for AbcSmc parser to specify that db should be truncated/deleted, but that this can only be done with interactive input from the user. The inconvenience is worth the precaution against losing a month of work.
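
A sketch of what the interactive check could look like. The function name `confirm_db_delete` and the "type the filename back" rule are assumptions for illustration; the streams are passed in so the prompt can be exercised without a terminal.

```cpp
#include <istream>
#include <ostream>
#include <sstream>  // handy for driving the prompt from a test
#include <string>

// Returns true only if the user types the database filename back exactly.
// Anything else, including end-of-input (e.g. a non-interactive batch job), refuses.
inline bool confirm_db_delete(const std::string& db_name, std::istream& in, std::ostream& out) {
    out << "This will permanently delete '" << db_name << "'.\n"
        << "Type the database filename to confirm: ";
    std::string reply;
    if (!std::getline(in, reply)) { return false; }
    return reply == db_name;
}
```

The caller would only remove or truncate the file when this returns true. Requiring the full filename rather than a single `y` makes an accidental confirmation less likely.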

realizations

For scenario analysis, we often want to see samples in progress. As in, we have, say, samples 1-10 for all scenarios, rather than finishing all the samples for a particular scenario, then moving on to the next.

This would be accomplished currently by using a "realization" pseudo parameter, and making that the last item in the parameter configuration file, and then effectively the outermost iteration loop when processing all the scenario elements.

Three points:

  • how scenario iteration occurs should be documented
  • how scenario iteration occurs should be reflected as some sort of tested requirement
  • relative to "sample" or "realization" variables particularly, while it is useful to support a more generic version as pseudo-parameters, more typically we simply want "N" samples. AbcSmc should offer a more streamlined approach to that. Some nice-to-haves for that implementation would be that the config can be "updated" to a larger N, and that samples are the outer loop (probably meaning that whatever solves this auto-creates the realization parameter in the "last" config position)
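
To pin down what "outermost loop" means here, a sketch of one possible enumeration. It assumes the first configured parameter varies fastest and the last (the realization) varies slowest; that ordering is an assumption of the sketch, not documented AbcSmc behavior, and `scenario_order` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// levels[i] = number of values for parameter i; the last entry plays the role of "realization".
// Returns, for each serial, the level index of every parameter.
inline std::vector<std::vector<std::size_t>> scenario_order(const std::vector<std::size_t>& levels) {
    std::size_t total = 1;
    for (std::size_t n : levels) {
        assert(n > 0);
        total *= n;
    }
    std::vector<std::vector<std::size_t>> plan;
    plan.reserve(total);
    for (std::size_t serial = 0; serial < total; ++serial) {
        std::vector<std::size_t> idx(levels.size());
        std::size_t rest = serial;
        for (std::size_t i = 0; i < levels.size(); ++i) {  // first parameter varies fastest
            idx[i] = rest % levels[i];
            rest /= levels[i];
        }
        plan.push_back(idx);
    }
    return plan;
}
```

With this ordering, raising the number of realizations only appends new serials; the mapping for existing serials does not change, which is what makes "updating" the config to a larger N workable.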

Config specification of posterior parameter(s)

When using "Posterior" type parameters, they should be drawn from the data storage as unified rows.

This would simplify their representation as parameters (get a row at a time, could eliminate a fair chunk of the iteration logic), and reduce their footprint in the database (distinct posterior table, par_vals would point to the key of that, rather than whole table; can still materialize into view when querying).

The config specification itself could then refer to an array of column names, shared ranks, etc.

There's a potential downside in precluding unlinked posterior iteration, but: that's already precluded and I think this approach actually enables that approach. Could define multiple posterior parameters (for different blocks of columns), which would be allowed independent traversal.

configuration distinction between "fitting" and "project" modes

We have typically used AbcSmc in two modes: "fitting" and "project". However, the configuration files generally have no features that unambiguously distinguish the two. Nor is there a core abc executable that takes -f|--fitting vs -p|--project flags.

If we could know that we were in "project" mode, then lots of boilerplate options could be removed and/or set to defaults for the configuration file.

For example, all of the predictive prior / PLS fraction keys are irrelevant (maybe even error inducing). num_samples could be defaulted to all combinations.

prune "ranker.h"

There's a no-longer-maintained library of ranking operations in lib/ranker.h.

However, these are now likely superseded by C++17 standard operations / library.

We should eliminate this dependency.
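
As an illustration of how little is needed, here is a sketch of one ranking operation using only the standard library. Which operations from lib/ranker.h are actually used is not checked here, and the tie-breaking rule (by original position) is an arbitrary choice for the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns 1-based ranks; ties are broken by original position.
inline std::vector<std::size_t> ordinal_rank(const std::vector<double>& values) {
    // order[r] = index of the element that has rank r + 1
    std::vector<std::size_t> order(values.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&values](std::size_t i, std::size_t j) { return values[i] < values[j]; });
    // invert the permutation to get the rank of each original element
    std::vector<std::size_t> ranks(values.size());
    for (std::size_t r = 0; r < order.size(); ++r) { ranks[order[r]] = r + 1; }
    return ranks;
}
```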

run_sql not a valid make target, and a bare `make` fails with errors

Quick start documentation instructs users to run make run_sql; make run_sql; make run_sql. This used to work, but run_sql is no longer a valid make target. Running make in the examples directory with the default target produces the following:

cmake -S . -B demo
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) in CMakeLists.txt:
  No cmake_minimum_required command is present.  A line of code such as

    cmake_minimum_required(VERSION 3.22)

  should be added at the top of the file.  The version specified may be lower
  if you wish to support older CMake versions for this project.  For more
  information run "cmake --help-policy CMP0000".
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done
CMake Warning (dev) at CMakeLists.txt:9 (add_library):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "dice" links to target "GSL::gsl" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:9 (add_library):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "dice" links to target "GSL::gslcblas" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:12 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "dice_game" links to target "GSL::gsl" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:12 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "dice_game" links to target "GSL::gslcblas" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "shared" links to target "GSL::gsl" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "shared" links to target "GSL::gslcblas" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "pseudo" links to target "GSL::gsl" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
  Policy CMP0028 is not set: Double colon in target name means ALIAS or
  IMPORTED target.  Run "cmake --help-policy CMP0028" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  Target "pseudo" links to target "GSL::gslcblas" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
  Policy CMP0003 should be set before this line.  Add code such as

    if(COMMAND cmake_policy)
      cmake_policy(SET CMP0003 NEW)
    endif(COMMAND cmake_policy)

  as early as possible but after the most recent call to
  cmake_minimum_required or cmake_policy(VERSION).  This warning appears
  because target "shared" links to some libraries for which the linker must
  search:

    abc, GSL::gsl, GSL::gslcblas

  and other libraries with known full path:

    /home/tjhladish/work/AbcSmc/examples/demo/libdice.so

  CMake is adding directories in the second list to the linker search path in
  case they are needed to find libraries from the first list (for backwards
  compatibility with CMake 2.4).  Set policy CMP0003 to OLD or NEW to enable
  or disable this behavior explicitly.  Run "cmake --help-policy CMP0003" for
  more information.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /home/tjhladish/work/AbcSmc/examples/demo
cd demo && make direct
make[1]: Entering directory '/home/tjhladish/work/AbcSmc/examples/demo'
make[1]: *** No rule to make target 'direct'.  Stop.
make: *** [Makefile:36: demo/direct] Error 2

MPI and resume directory intent?

From what I can tell re MPI, the intent is to support distributing runs out to various nodes, running the simulations, collecting the results, then running ABC-PLS on the collected results.

"resume" directory seems to serve a related purpose.

These seem like useful approaches to offer, but are currently deprecated clutter. Recommend we just excise them for now, then revisit if necessary in the future. At that point, my imagination of AbcSmc would enable implementing those in a stand-alone way as some mixture of simulator / storage classes.

Simulator signature(s) supported

For broadest use of AbcSmc, we should prefer to only require the minimal arguments, while flexibly supporting receiving more state as desired by the library user.

The minimal arguments might be just the parameter values. Arguably, if someone is using AbcSmc, they're doing stochastic work, so should also be receiving a seed. But we shouldn't force defining a simulator in terms of arguments that they immediately ignore (e.g. serial in some cases, _mp in basically all cases).
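
A sketch of how overloads could accept the smaller signatures and adapt them internally. The "full" signature shown (parameters, seed, serial) is a simplification assumed for illustration, and `SimulatorHolder` is a hypothetical stand-in for the relevant part of AbcSmc.

```cpp
#include <functional>
#include <utility>
#include <vector>

using Values = std::vector<double>;
// Assumed internal signature: parameters, seed, serial.
using FullSimulator = std::function<Values(const Values&, unsigned long, unsigned long)>;

class SimulatorHolder {
  public:
    // Full signature: stored as-is.
    void set_simulator(FullSimulator sim) { _sim = std::move(sim); }
    // Parameters + seed: the serial is dropped by the adapter.
    void set_simulator(std::function<Values(const Values&, unsigned long)> sim) {
        _sim = [sim](const Values& pars, unsigned long seed, unsigned long) {
            return sim(pars, seed);
        };
    }
    // Parameters only: seed and serial are both dropped.
    void set_simulator(std::function<Values(const Values&)> sim) {
        _sim = [sim](const Values& pars, unsigned long, unsigned long) { return sim(pars); };
    }
    Values run(const Values& pars, unsigned long seed, unsigned long serial) const {
        return _sim(pars, seed, serial);
    }
  private:
    FullSimulator _sim;
};
```

Overload resolution works for plain functions and non-generic lambdas because each one is only callable with one of the three argument lists; a generic lambda would be ambiguous and would need an explicit cast.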

Code Style Document

  • write a .github/CONTRIBUTING.md @pearsonca
  • which should cover the following items:
  • guidance for function signatures - e.g. when to use const arguments, when to pass by reference vs pointer, how to do spacing for & and *
  • naming / case guidance
  • ...?

notional `AbcConfig` approach

  • AbcConfig is a friend of AbcSmc (so can access private fields)
  • ... has concretely defined all the defaults, setters, etc necessary to fully specify a runnable AbcSmc state
  • AbcConfig has a template, virtual parse method (possibly also a deserialize? takes plain string filename and inflates to the relevant type for parse?). Sub classes define this parse
  • AbcConfig has a build method (receives filename => deserialize => parse => check state => yields an AbcSmc ready to work)
  • maybe defined AbcConfig static method to serve up relevant parser, based on sniffing filetype?

"manual" example / use case

Being able to use AbcSmc without having to provide a configuration file would be useful.

This kind of approach is currently represented in examples/manual/main.cpp, though is no longer functional as such.

Low priority: make this work again.

json => interpreter skeleton generator

With an established DSL for the configuration files, it should be possible to write a skeleton file which creates a plain-old-data object convenience converter, à la:

struct SOME {
  // ...pertinent fields
  SOME(vector<double> params) // ... some automated assignment to pertinent fields
  {}
};

Basically, a static convenience tool - similar available for parsing scenarios, deparsing metrics, etc. Some of this is irrelevant if AbcSmc adopts an approach of passing around maps instead of vectors, but would enable disentangling model variable choices from storage choices (i.e. these skeletons could be edited to do a bit of more complex transformation from the preferred model value and the preferred analysis / storage value).

lib / build relationships

Ideally, if someone wants to use AbcSmc in some simulator (and not do anything fancy with elements under the hood), they should be able to include a header or two and to indicate the path just to find those.

However, currently the sqdb and jsoncpp libraries leak out interface-wise - as in, if I'm just writing a simulator, I have to know the detailed structure inside AbcSmc to provide the path to those dependencies as well OR we have to write some gnarly bits in those within abc headers.

Proposed solution: expose less of the interface publicly. If we remove the sqdb, jsoncpp elements from the signatures in the AbcSmc header and instead encapsulate talking to those dependencies to AbcSmc.cpp, that will hide (for build purposes) those dependencies from users.
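
One common way to do this is the pimpl idiom. A minimal sketch follows, with both halves shown in one block; `AbcStorage` and its members are hypothetical, and the `std::string` inside `Impl` stands in for whatever sqdb / jsoncpp objects the real class would hold.

```cpp
#include <memory>
#include <string>

// ---- what would live in the public header ----
class AbcStorage {
  public:
    explicit AbcStorage(const std::string& db_name);
    ~AbcStorage();  // declared here, defined where Impl is a complete type
    std::string describe() const;
  private:
    struct Impl;    // forward declaration only: no sqdb / jsoncpp types appear in the header
    std::unique_ptr<Impl> _impl;
};

// ---- what would live in the .cpp file ----
struct AbcStorage::Impl {
    std::string db_name;  // stand-in for e.g. a database handle
};

AbcStorage::AbcStorage(const std::string& db_name) : _impl(new Impl{db_name}) {}
AbcStorage::~AbcStorage() = default;
std::string AbcStorage::describe() const { return "db: " + _impl->db_name; }
```

A simulator author including the header then needs no include paths for the hidden dependencies; only building the library itself does.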

documentation for .json configuration files

Need to establish a specification for .json configuration files (more broadly, other modes of providing those).

Notes:

TODO: ... assorted top level keys, which required / conflicting, type for associated values

num_samples: a positive integer or array of positive integers. Optional in "projection" mode; when provided in "projection" mode, should always be a single integer.

predictive_prior_(fraction|size) keys: mutually exclusive options for specifying how many draws to include when generating a predictive prior (in "fitting" mode). For either fraction or size, may be either a single value or an array. If an array, then must either match the length of the num_samples array OR num_samples must be length 1.
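
The two rules above are simple enough to state as code. A sketch, with hypothetical function names, and treating a scalar as an array of length 1:

```cpp
#include <cstddef>

// predictive_prior_fraction and predictive_prior_size may not both be present.
inline bool prior_keys_exclusive(bool has_fraction, bool has_size) {
    return !(has_fraction && has_size);
}

// Lengths of the num_samples and predictive_prior_* values (a scalar counts as length 1).
// A single predictive prior value is always fine; an array must match num_samples
// in length unless num_samples itself has length 1.
inline bool prior_lengths_compatible(std::size_t num_samples_len, std::size_t pred_prior_len) {
    if (num_samples_len == 0 || pred_prior_len == 0) { return false; }
    return pred_prior_len == 1 || num_samples_len == 1 || pred_prior_len == num_samples_len;
}
```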

"parameters" key

  • must exist (TODO allow for doesn't exist => warn, assume empty?)
  • value must be an array (TODO also allow object, where keys are parameter names, values parameter objects?)
  • ...of objects, which must have keys name, dist_type, num_type, par1, par2 (TODO if allowing object for parameters, this gets tweaked; also worth thinking alternative arrangements. e.g. $name : { $dist_type : { ... dist specific pars ... } }
  • ...and optional keys: step, untransform, short_name
  • for untransforms: either a string (specifying a standard untransformation) OR an object
  • object version must have key: type (only "LOGISTIC" supported currently), min, max, (un)transformed_(addend|factor)

"metrics" key

  • ...
  • (do this when relevant to ongoing changes)

deprecate `AbcSmc::set_sim*` and `AbcSmc::set_executable` in favor of setting from configuration

To move AbcSmc toward being a compiled tool, we should be marking various methods for deprecation, particularly those that support using AbcSmc in compiled-library-with-simulator mode.

The set_sim family is definitely in that category. The key issue for backwards compatibility will be whether the as-a-library mode can correctly find the simulator symbol. That seems eminently solvable: have a sensible "just works" default for finding the symbol if it's named simulator and has the right signature [plus some useful error messages? convenience checkers?], and maybe allow a configuration option to use a different symbol name if people are picky.

cleanup merged branches

Would be useful to delete branches which have been successfully merged. By my count:

  • abclog
  • plssep
  • relog
  • runstat
  • simfunc
  • slices
  • utiling
  • mvn_noise
  • sqlite
  • db_locking

I don't have access to purge them here, but have trimmed off all of these in my locals.

DB normal forms

The current approach to organizing parameters is un-normalized, meaning that adding parameter levels (or additional parameters) typically entails substantial handwork. (Similar problem for metrics, or if we went to split parameters by fitted vs scenario settings, etc).

For any given interaction with a database (that is, for a one off simulation analysis), the unnormalized approach is easier to work with. But for repeated development in particular simulation (e.g. adapting scenarios, metrics), or for developing tools which are generic across simulations, a normalized approach would be more flexible.

So:

  • desire an interface which (from the simulation end, likely also from the post-processing end) presents an unnormalized view.
  • however, overlays a normalized representation, enabling extensions while underway
  • meaning: it would be useful to write a normal-to-unnormal (and vice versa) layer in code. This would enable continued ease-of-use at the practical ends, while supporting flexible additions/subtractions/substitutions to those ends
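
A sketch of what such a layer could look like in memory, independent of any database backend. The types and names (`WideRow`, `LongRow`, `normalize`, `denormalize`) are hypothetical; the wide form has one row per serial with a column per parameter, and the long form has one row per (serial, parameter) pair.

```cpp
#include <map>
#include <string>
#include <vector>

using WideRow = std::map<std::string, double>;  // parameter name -> value
struct LongRow {
    int serial;
    std::string name;
    double value;
};

// Unnormalized (one wide row per serial) -> normalized (one row per serial and parameter).
inline std::vector<LongRow> normalize(const std::map<int, WideRow>& wide) {
    std::vector<LongRow> rows;
    for (const auto& entry : wide) {
        for (const auto& par : entry.second) {
            rows.push_back(LongRow{entry.first, par.first, par.second});
        }
    }
    return rows;
}

// Normalized -> unnormalized; adding a new parameter just means more long rows.
inline std::map<int, WideRow> denormalize(const std::vector<LongRow>& rows) {
    std::map<int, WideRow> wide;
    for (const LongRow& row : rows) { wide[row.serial][row.name] = row.value; }
    return wide;
}
```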

use-as a tool in addition to library

Relative to the prepare, harness, etc. approach in this application, would prefer to have "installing" abc provide 1) a library install AND 2) command line tooling.

The command line tooling would then work roughly like

$ abcsmc (-p|--prepare) some_config.json [-o some_config.sqlite]
$ abcsmc (-s|--simulate) some_config.sqlite some_model.so

Possible solution would be to have abcsmc install 1) a shell script that handles parsing command line args and passing to 2) binaries that are purpose built to the relevant steps
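
For the argument-handling part, a sketch of a parser for just the two invocations shown above. The `Command` struct and `parse_args` are hypothetical, and `args` is assumed to hold the arguments after the program name.

```cpp
#include <string>
#include <vector>

enum class Mode { Prepare, Simulate, Invalid };

struct Command {
    Mode mode = Mode::Invalid;
    std::string input;   // config .json (prepare) or .sqlite database (simulate)
    std::string output;  // optional -o target for prepare
    std::string model;   // shared object for simulate
};

inline Command parse_args(const std::vector<std::string>& args) {
    Command cmd;
    if (args.size() < 2) { return cmd; }
    const std::string& flag = args[0];
    if (flag == "-p" || flag == "--prepare") {
        // abcsmc (-p|--prepare) some_config.json [-o some_config.sqlite]
        if (args.size() == 2) {
            cmd.mode = Mode::Prepare;
            cmd.input = args[1];
        } else if (args.size() == 4 && args[2] == "-o") {
            cmd.mode = Mode::Prepare;
            cmd.input = args[1];
            cmd.output = args[3];
        }
    } else if (flag == "-s" || flag == "--simulate") {
        // abcsmc (-s|--simulate) some_config.sqlite some_model.so
        if (args.size() == 3) {
            cmd.mode = Mode::Simulate;
            cmd.input = args[1];
            cmd.model = args[2];
        }
    }
    return cmd;  // Mode::Invalid signals a usage error to the caller
}
```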

Replace JsonCpp w/ Nlohmann/Json

Propose replacing the json parsing library.

This would entail internal rework, but should have no knock-on consequences (excepting potentially abcsmc + simulator compile approaches that rely on jsoncpp to parse non-abc-related json).

Desired AbcSmc Operational Modes

  • traditional fitting: start with functional priors, do a sequence of waves, get a resulting posterior
  • traditional scenario analysis / projection: start with posterior + scenario grid, run projections for scenarios for each particle in posterior, collect outcomes
  • model selection fitting: start with model grid + priors for parameters (possibly model specific), estimate those parameters, compare model selection criteria => indicate preferred model
  • iterative fitting: start with some priors, some fixed values => do waves => change which priors fixed vs sampled => do waves => ... => get "full" posterior (potentially also change metrics as waves go along)
  • model scenario fitting => projection: start with model grid + priors => fitting => carry forward grid plus distinct posteriors into scenario analysis

fingerprint capability / CCRC32

Desire to have a "fingerprint" capability for users.

Currently do that by including the CCRC32 library and relying on users to know its methods.

A more robust approach would be to encapsulate precisely the capability we're providing, and then export it (e.g. via AbcUtil), document it, and so on.
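
A sketch of what the encapsulated capability could be: a single function that hides the checksum behind a name. This uses the standard CRC-32 (reflected polynomial 0xEDB88320); whether that matches what the bundled CCRC32 library computes is an assumption, and `fingerprint` is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 over the bytes of data.
inline std::uint32_t fingerprint(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // shift right by one bit, applying the polynomial when the low bit was set
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

For the conventional check string, `fingerprint("123456789")` gives `0xCBF43926`.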

scenario DSL

One of the core AbcSmc capabilities is management of scenario projections: AbcSmc will systematically build up a scenario run plan and then run that plan.

However, there are use-case gaps and pain points. AbcSmc currently only directly supports "all combinations" build, so excluded combinations then have to be manually pruned after construction. This is wasteful (when there's lots of pruning), error prone / onerous (because user has to write pruning code), and may not be particularly portable (e.g. if AbcSmc supports different backends in the future, users would have to write different pruning for each backend).
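
A sketch of what pruning at construction time could look like: the plan builder takes a predicate and never materializes excluded combinations. `build_plan`, `Scenario` and the predicate interface are hypothetical; how exclusions would be expressed in the config is left open.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Scenario = std::vector<std::size_t>;  // one level index per setting

// Enumerates all combinations of level indices, keeping only those the predicate accepts.
inline std::vector<Scenario> build_plan(const std::vector<std::size_t>& levels,
                                        const std::function<bool(const Scenario&)>& keep) {
    std::vector<Scenario> plan;
    if (levels.empty()) { return plan; }
    for (std::size_t n : levels) {
        if (n == 0) { return plan; }  // a setting with no levels admits no scenarios
    }
    Scenario current(levels.size(), 0);
    while (true) {
        if (keep(current)) { plan.push_back(current); }
        // odometer-style increment: bump the first index, carrying into later ones
        std::size_t i = 0;
        while (i < levels.size() && ++current[i] == levels[i]) {
            current[i] = 0;
            ++i;
        }
        if (i == levels.size()) { break; }  // carried past the last index: done
    }
    return plan;
}
```

Because the filter runs inside the builder, the same exclusion rule would apply regardless of which storage backend receives the plan.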

AbcSmc also doesn't distinguish between parameters (fitted elements) versus settings (scenario assumptions). That smells a bit funny, though it's hard to specify the precise problems there. For future capabilities, might want to have AbcSmc work either from a computed prior (i.e. the resulting posterior cloud from fitting) or from a sampling one (i.e. just draw a sample from this analytic distribution), and having distinct parameters versus settings seems a natural boundary to set which would make that swap easier.

So, in general, we ought to formally describe what the domain specific language (DSL) for AbcSmc is. This is currently captured ad-hoc in the json configuration files; these blend engineering settings, fitting parameters, targets, scenario parameters, but also often neglect the actual desired outputs.

refactor examples to better distinguish examples from testing

Fairly hot: the examples code needs to be more one-stop-shopping, see all the moving parts (i.e. complete, isolated, minimal working examples). @pearsonca has been using these more as integration tests, which results in more abstraction / reduced code duplication, but tends to be too complex for self-contained MREs.

These two use cases need to be distinguished. Initial implementation can be a bit messy.
