
Questions about mimi (open, 5 comments)

GreenWizard2015 commented on August 15, 2024
Questions

from mimi.

Comments (5)

rddy commented on August 15, 2024

Here are some projects that I think are cool and have potential in this space:


rddy commented on August 15, 2024
  • Learning from samples collected with an old interface (i.e., off-policy RL) would be a bit difficult in this setting. The problem is that the state of the MDP actually includes the user's internal model of the interface, so when you go back and sample old transitions from a previous interface, you will only get partial observations that do not include this aspect of the state. I think you can address this partial observability by using a recurrent neural network architecture for the policy and value functions that takes a history of observations and commands as input (instead of only the most recent observation and command). You would also need to use importance sampling to correct for the non-stationary state distribution in the replay buffer (see Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS).
  • You could indeed use a warm start to speed up training of the MI estimator, although I haven't tried it yet. If you find that this unduly biases the optimization, then you could try other ways to speed up optimization of the MI estimator, such as taking fewer gradient steps (this tends to be okay since all we typically care about are the relative values of the MI estimates for different interfaces).
  • That sounds like a promising way to do dimensionality reduction. I think it could actually work even in the case where the latent representation contains mostly information about the predicted action, and has discarded most other information that was originally in the high-dimensional command signal. Even though MIMI uses the MI between this latent and the state transition to compute a reward, the interface itself can be a function of the original command signal. Hence, you could initially train the interface via supervised learning, then fine-tune it using MIMI. MIMI would not necessarily converge to the same solution as the supervised pre-training, since the $\mathcal{I}(\mathbf{s}_t, f(\mathbf{x}_t))$ term (where $f$ is your pre-trained embedding model) is not necessarily maximized to begin with.
  • The code isn't super clear, but in this line we are actually reusing the same statistics network $T_{\phi}$ as in this earlier line, so it's just 1 network rather than 1+n_mine_samp networks. n_mine_samp just refers to the number of samples we use to compute a Monte Carlo estimate of the expectation in Equation 2.
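To make the last point concrete, here is a minimal numpy sketch (hypothetical names and a toy statistics function, not the actual mimi code) of a MINE-style lower bound in which a single shared statistics network $T_{\phi}$ is evaluated on one batch of joint samples and on n_mine_samp shuffled batches; the shuffles give a Monte Carlo estimate of the marginal expectation, and no extra networks are created:

```python
import numpy as np

rng = np.random.default_rng(0)

def statistics_network(x, y, w):
    # one shared statistics network T_phi (here a toy bilinear score);
    # every Monte Carlo sample below reuses the same parameters w
    return np.tanh(w * x * y)

def mine_lower_bound(xs, ys, w, n_mine_samp=32):
    # joint term: E_{p(x,y)}[T(x, y)]
    joint = statistics_network(xs, ys, w).mean()
    # marginal term: n_mine_samp shuffles of y form a Monte Carlo
    # estimate of E_{p(x)p(y)}[exp(T(x, y))] -- same network each time
    marg = [np.exp(statistics_network(xs, rng.permutation(ys), w)).mean()
            for _ in range(n_mine_samp)]
    return joint - np.log(np.mean(marg))

# sanity check: correlated data should score higher than independent data
x = rng.normal(size=5000)
y_dep = x + 0.1 * rng.normal(size=5000)
y_ind = rng.normal(size=5000)
print(mine_lower_bound(x, y_dep, w=1.0) > mine_lower_bound(x, y_ind, w=1.0))  # prints True
```

The point of the sketch is only the parameter sharing: increasing n_mine_samp changes the variance of the marginal-term estimate, not the number of networks.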

Happy to discuss further! Also happy to provide more hands-on help with coding or setting up experiments.


GreenWizard2015 commented on August 15, 2024
  • The code isn't super clear, but in this line we are actually reusing the same statistics network Tϕ as in this earlier line, so it's just 1 network rather than 1+n_mine_samp networks. n_mine_samp just refers to the number of samples we use to compute a Monte Carlo estimate of the expectation in Equation 2.

My bad, I thought that each call of build_model would create a unique network, so we would have 32 + 1 + 1 networks. It would be great if you added a note to build_model specifying that it creates a unique MLP per scope.
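The behavior being asked about could be documented with a pattern like the following (a hypothetical pure-Python sketch, not the repo's actual TensorFlow code): repeated calls to build_model with the same scope name return the same network, so there is one $T_{\phi}$ rather than 1 + n_mine_samp copies.

```python
# cache of networks keyed by scope name, mimicking how variable
# scopes with reuse make repeated build_model calls share weights
_models = {}

def build_model(scope, n_params=4):
    """Creates a unique MLP per scope; repeated calls reuse the cached one."""
    if scope not in _models:
        _models[scope] = {"scope": scope, "params": [0.0] * n_params}
    return _models[scope]

t1 = build_model("mine/statistics")
t2 = build_model("mine/statistics")  # same scope -> same network
print(t1 is t2)                      # prints True: a single T_phi
```

Under this reading, the 32 Monte Carlo evaluations all hit the cached network, which matches the "1 network rather than 1 + n_mine_samp networks" explanation above.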

  • Learning from samples collected with an old interface (i.e., off-policy RL) would be a bit difficult in this setting. The problem is that the state of the MDP actually includes the user's internal model of the interface, so when you go back and sample old transitions from a previous interface, you will only get partial observations that do not include this aspect of the state. I think you can address this partial observability by using a recurrent neural network architecture for the policy and value functions that takes a history of observations and commands as input (instead of only the most recent observation and command). You would also need to use importance sampling to correct for the non-stationary state distribution in the replay buffer (see Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS).

I meant reusing samples only for MI training. As you wrote in the paper, we must collect data, train the MI estimator, and then adapt the interface (whether via RL or another approach). The adaptation stage can be performed without entirely new data (offline algorithms, relabeling old data with new rewards, synthetic data, etc.). The main bottleneck is MI training: it requires the user's active participation in order to gather data about the new interface. This raises the problem of using data efficiently for MI training. In theory, only the "intuitiveness" of the transition (s, a) -> s' matters to us, so any transitions could be used, even ones from old interfaces. Am I correct, or are there restrictions on the data used for MI estimation?


rddy commented on August 15, 2024

The problem is that the intuitiveness of the transition (s, a, s') cannot be evaluated in isolation. For example, an intuitive interface for scrolling on a mobile phone could either involve swiping up to scroll up or swiping down to scroll up, but some kind of mixture of the two interfaces would be unintuitive. That being said, it might be possible to speed up MI estimation through warm starts or meta-learning.


GreenWizard2015 commented on August 15, 2024

I completely agree, and I have also thought about the problem of "mirror" interfaces. However, I believe people tend to do what they are used to. For example, if it is more natural for a person to swipe up but the interface requires a swipe down, the person will swipe down with a smaller amplitude or other measurable differences. A person cannot give fully independent feedback for each interface; they will remember the previous one and try to control the new one in the same way. If the command variables are not fully discrete, this difference should be observable. Moreover, your paper rests precisely on this assumption, so I think it should be possible to identify the more convenient actions, not just the more convenient interface. That task is harder, though, so it may turn out to be impractical (it would require more resources).

Thank you for your responses, and for making it clear that there is no fundamental reason not to try reusing the data.

If possible, I would be grateful if you could suggest articles, resources, etc., on using AI to improve accessibility for people with disabilities.

