
rddy / mimi


Code for the paper, "First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization"

License: MIT License

Languages: Python 26.45%, Jupyter Notebook 73.55%

mimi's People

Contributors: rddy

Forkers: guyko81, q87718350

mimi's Issues

rollout_policy LunarLander

Running the lander notebook, I got this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-f979eb03d797> in <module>
      4   gp_min_kwargs=gp_min_kwargs,
      5   ep_kwargs=ep_kwargs,
----> 6   reward_model_train_kwargs=mi_model_train_kwargs
      7 )

~\anaconda3\envs\mimienv\lib\site-packages\mimi\opt.py in run(self, n_pols, n_steps_per_pol, n_eps_per_pol, gp_min_kwargs, ep_kwargs, reward_model_train_kwargs)
     80       self.param_bounds,
     81       n_calls=n_pols,
---> 82       **gp_min_kwargs
     83     )
     84     policy = self.policy_from_params(res.x)

~\anaconda3\envs\mimienv\lib\site-packages\skopt\optimizer\gp.py in gp_minimize(func, dimensions, base_estimator, n_calls, n_random_starts, n_initial_points, initial_point_generator, acq_func, acq_optimizer, x0, y0, random_state, verbose, callback, n_points, n_restarts_optimizer, xi, kappa, noise, n_jobs, model_queue_size)
    266         n_restarts_optimizer=n_restarts_optimizer,
    267         x0=x0, y0=y0, random_state=rng, verbose=verbose,
--> 268         callback=callback, n_jobs=n_jobs, model_queue_size=model_queue_size)

~\anaconda3\envs\mimienv\lib\site-packages\skopt\optimizer\base.py in base_minimize(func, dimensions, base_estimator, n_calls, n_random_starts, n_initial_points, initial_point_generator, acq_func, acq_optimizer, x0, y0, random_state, verbose, callback, n_points, n_restarts_optimizer, xi, kappa, n_jobs, model_queue_size)
    297     for n in range(n_calls):
    298         next_x = optimizer.ask()
--> 299         next_y = func(next_x)
    300         result = optimizer.tell(next_x, next_y)
    301         result.specs = specs

~\anaconda3\envs\mimienv\lib\site-packages\mimi\opt.py in cost_of_policy_params(self, policy_params)
     51   def cost_of_policy_params(self, policy_params):
     52     policy = self.policy_from_params(policy_params)
---> 53     rollouts = utils.rollout_policy(
     54       policy,
     55       self.env,

AttributeError: module 'mimi.utils' has no attribute 'rollout_policy'
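
For reference, here is a minimal, hypothetical sketch of what a rollout helper with this call signature typically does; it is not the library's actual implementation, and the keyword arguments (n_steps, ep_kwargs) are assumptions based on the truncated traceback:

def rollout_policy(policy, env, n_steps=1000, ep_kwargs=None):
  """Run one episode of `policy` in `env` and collect transitions."""
  obs = env.reset(**(ep_kwargs or {}))
  rollout = {'obses': [], 'actions': [], 'next_obses': []}
  for _ in range(n_steps):
    action = policy(obs)
    next_obs, _, done, _ = env.step(action)  # classic gym 4-tuple step API
    rollout['obses'].append(obs)
    rollout['actions'].append(action)
    rollout['next_obses'].append(next_obs)
    obs = next_obs
    if done:
      break
  return [rollout]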

format_rollouts

This is so much fun! :)
But there's one more error: in format_rollouts (utils.py) we define 'rewards': []. However, during training, slice_data tries to select the validation indices of every key-value pair, while 'rewards' is never filled (at least not in LunarLander). So it throws an IndexError, because the 'rewards' list remains empty.

I simply commented the rewards definition out; I hope the model still learns (based on the paper it should; I haven't checked that part of the code yet):

def format_rollouts(rollouts, env):
  data = {
    'obses': [],
    'actions': [],
    'next_obses': [],
    #'rewards': []  # commented out so slice_data doesn't index an empty list
  }
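
For illustration, here is a minimal numpy sketch of the failure mode, assuming slice_data indexes every key of the rollout dict with the same validation indices (the exact internals may differ):

import numpy as np

data = {
  'obses': [np.zeros(8)] * 10,
  'actions': [np.zeros(2)] * 10,
  'rewards': [],  # never populated, at least in the LunarLander env
}
val_idxs = np.array([2, 5, 7])
for key, values in data.items():
  try:
    _ = np.array(values)[val_idxs]
  except IndexError as err:
    print(key, err)  # only 'rewards' fails, because the list is empty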

question about setup

First of all, thank you for the great articles and research. I get "No module named 'disvae'" when I run "import disvae.utils.modelIO" in mimi-main\notebooks\mimi\models.py. I tried to install the module with pip, but it failed.

The pip error was: "ERROR: Could not find a version that satisfies the requirement disvae (from versions: none)" and "ERROR: No matching distribution found for disvae".

Please tell me how to install it. I was wondering whether you wrote the module yourself; if so, please tell me where it is and how to install it.

question about setting up the lander notebook

First of all, thank you for the great articles and research.

I ran into the problem "'LunarLander' object has no attribute 'helipad_idx'" while setting up the lander notebook. Please tell me what "self.goal = self.env.helipad_idx" (around line 319 of envs.py) means; maybe I can change or remove that line to fix the problem.

I changed some parts of the code to fix problems I ran into earlier, and I am wondering whether those changes may have caused this one.

1. Around line 304 of envs.py, I changed "self.n_act_dim = self.env.action_space.low.size" to "self.n_act_dim = self.env.action_space" and "self.n_env_obs_dim = self.env.observation_space.low.size" to "self.n_env_obs_dim = self.env.observation_space". I did that to fix the error "'Discrete' object has no attribute 'low'".

2. I use the following code:
mi_model_init_kwargs = {
  'n_env_obs_dim': env.n_min_env_obs_dim,
  'n_user_obs_dim': env.n_user_obs_dim,
  #'n_act_dim': env.n_act_dim,
  'n_act_dim': 4,  # hard-coded for LunarLander's Discrete(4) action space
  'n_layers': 2,
  'layer_size': 64
}

This helped me work around the error "Error converting shape to a TensorShape: Dimension value must be integer or None or have an index method, got value 'Discrete(4)' with type '<class 'gym.spaces.discrete.Discrete'>'".

It seems that I always run into problems with n_act_dim. Please help me fix this; a sketch of how the dimensions could be derived from the gym space types is shown below.
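
A minimal sketch (not the repo's code) of how both dimensions could be read from the space types instead of hard-coding them, assuming env wraps a standard gym environment such as LunarLander:

import gym

def space_dim(space):
  """Flat integer dimension for Box or Discrete gym spaces."""
  if isinstance(space, gym.spaces.Box):
    return space.low.size
  if isinstance(space, gym.spaces.Discrete):
    return space.n  # e.g. Discrete(4) -> 4 for LunarLander-v2
  raise NotImplementedError(type(space))

# In envs.py this could replace the hard-coded values:
# self.n_act_dim = space_dim(self.env.action_space)
# self.n_env_obs_dim = space_dim(self.env.observation_space)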

question: mutual information as a simple reward signal for general RL algorithms

Hi,

I thought it would be easiest to ask here: I was wondering whether one could simply compute the mutual information between the user input $x_t$ and the states $s_t$ and $s_{t+1}$, give it to the system as the reward $r_t$, and then let any general RL algorithm (TD3, DDPG, PPO) use it as its optimization target. By maximizing the mutual information, the agent could then navigate as the user requests. A rough sketch of what I mean is below.
I'm probably missing something important, but could you give me some feedback, please?
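
A rough sketch of the idea, assuming the classic gym step API and a hypothetical mi_reward_fn standing in for whatever estimator (e.g. a MINE-style critic) scores the mutual information contribution of a transition:

import gym

class MIRewardWrapper(gym.Wrapper):
  """Replace the environment reward with an MI-based score so that any
  standard RL algorithm (TD3, DDPG, PPO) can maximize it directly."""

  def __init__(self, env, mi_reward_fn, get_user_input):
    super().__init__(env)
    self.mi_reward_fn = mi_reward_fn      # hypothetical: scores (x_t, s_t, s_{t+1})
    self.get_user_input = get_user_input  # hypothetical: returns the current x_t
    self._last_obs = None

  def reset(self, **kwargs):
    self._last_obs = self.env.reset(**kwargs)
    return self._last_obs

  def step(self, action):
    next_obs, _, done, info = self.env.step(action)
    x_t = self.get_user_input()
    r_t = self.mi_reward_fn(x_t, self._last_obs, next_obs)  # MI-based reward
    self._last_obs = next_obs
    return next_obs, r_t, done, info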

Thanks!

Questions

Hello.

First of all, thank you for the great articles and research. Currently, I'm working on software for people with disabilities and trying to apply your research to it. Unfortunately, I'm not very familiar with the area of mutual information estimation and I have a bunch of questions. I hope you can provide some brief answers.

  • Can we reuse old samples? Can I collect new samples with the current interface but also keep samples collected with the old interface?

  • Can we reuse the old estimator, or does it need to be trained from scratch? Reusing the old estimator could drastically speed up convergence, but it could also introduce bias. We need to provide fast feedback to the user and can't train an estimator/interface for too long, so I'm trying to find ways to overcome this limitation.

  • In my project, I have fairly big and complex raw observations. I train a network that takes them and, in a supervised fashion, estimates the desired action. Can I use the latent representation from that network as input for MIMI? The first network provides fairly good estimates, but they are not always intuitive, so I want to apply MIMI and train a policy that samples more intuitive actions from those estimates. I'm concerned that the latent representation could cause data leakage and make MIMI collapse to a trivial solution. Obviously, I could use other ways to reduce the inputs, but they would require additional computation and might introduce other problems.

  • As I see it, we use one network to score joint (x, y) pairs and n_mine_samp=32 networks to score marginal pairs. Is that just the default implementation, or have you tried separable estimators (f(x, y) = g(x) * h(y), so we process only 2N "items" instead of N * N) and found that they fail? In my opinion, they could be much more efficient, but I don't know how crucial a role bias/variance plays here.
    Theoretically, separable estimators cover all pairs, 32 * N * N instead of just 32 * N, so they can be more restrictive. They have bias/variance issues, but since we use 32 of them, that may not be a problem. At the same time, we would have access to all the marginals, so we could estimate tf.reduce_mean(tf.exp(shuffled_stats), axis=1) more precisely.
    What do you think about that? Of course, it would be better just to test it, but that requires some time and effort. (A rough sketch of the separable-critic idea is shown after this list.)
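
For concreteness, a minimal numpy sketch of the separable-critic idea (not the repo's implementation; G and H are batch embeddings produced by hypothetical networks g and h):

import numpy as np

def dv_bound_separable(G, H):
  """Donsker-Varadhan lower bound with a separable critic f(x, y) = g(x) . h(y).

  G, H: (N, d) arrays of embeddings for the batch of x's and y's. One matrix
  product yields all N * N pair scores: the diagonal holds joint samples and
  the off-diagonal entries serve as marginal ("shuffled") samples.
  """
  scores = G @ H.T
  joint_term = np.mean(np.diag(scores))
  off_diag = scores[~np.eye(len(G), dtype=bool)]
  marginal_term = np.log(np.mean(np.exp(off_diag)))
  return joint_term - marginal_term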

Best regards.
