As of 761e211, DQN fails to converge on an environment with a 2x2 linear transform of the observations, while it is fine without the transform. The reason might be either that too few steps were given (currently 100 steps x 256 episodes) or that the environment is non-Markov.
Concrete next step: try a linear Q-network and see whether it converges to different results with and without the transform.
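A minimal sketch of what I mean, assuming the simplest possible setup: linear Q-learning (Q(s, a) = W[a] @ phi(s)) on a toy 2-d problem where the optimal action is just argmax of the raw observation. The environment here (rewards, dynamics, the particular 2x2 transform) is invented for illustration, not taken from the repo; the point is only to compare the learned greedy policy with and without a fixed linear transform of the observation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_q_learning(transform, n_episodes=256, n_steps=100,
                      alpha=0.05, gamma=0.9, eps=0.1):
    """Q-learning with a linear Q-function: Q(s, a) = W[a] @ (transform @ s)."""
    n_actions = 2
    W = np.zeros((n_actions, 2))
    for _ in range(n_episodes):
        s = rng.normal(size=2)              # toy 2-d observation
        for _ in range(n_steps):
            phi = transform @ s
            q = W @ phi
            # epsilon-greedy action selection
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(q))
            # toy reward: the correct action is argmax of the *raw* observation
            r = 1.0 if a == int(np.argmax(s)) else 0.0
            s2 = rng.normal(size=2)         # toy transition: fresh random state
            td = r + gamma * np.max(W @ (transform @ s2)) - q[a]
            W[a] += alpha * td * phi
            s = s2
    return W

def greedy(W, transform, s):
    """Greedy action of the learned linear Q-function."""
    return int(np.argmax(W @ (transform @ np.asarray(s))))

# train once without and once with a (made-up, invertible) 2x2 transform
T = np.array([[2.0, 1.0], [0.5, 1.5]])
W_id = linear_q_learning(np.eye(2))
W_tr = linear_q_learning(T)
```

Since the transform is invertible, a linear Q-function can represent the same greedy policies in both cases, so if the linear agent also diverges only with the transform, the problem is more likely the environment than the function class.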
Step after that: try policy gradients, as they do not require learning a value function (which might not exist if the environment is non-Markov), only the policy (which is simple: compare two numbers and choose the corresponding action).
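For the policy-gradient route, a sketch of plain REINFORCE with a linear softmax policy on the same kind of invented toy problem as above (again, all environment details are assumptions for illustration). It optimizes the policy directly, so no value function ever needs to exist:

```python
import numpy as np

rng = np.random.default_rng(1)

def reinforce(n_episodes=2000, n_steps=20, alpha=0.01):
    """REINFORCE with a linear softmax policy: logits(s) = theta @ s."""
    theta = np.zeros((2, 2))
    for _ in range(n_episodes):
        grads, rewards = [], []
        for _ in range(n_steps):
            s = rng.normal(size=2)
            logits = theta @ s
            p = np.exp(logits - logits.max())
            p /= p.sum()
            a = int(rng.choice(2, p=p))
            # toy reward, immediate: correct action is argmax of the observation
            r = 1.0 if a == int(np.argmax(s)) else 0.0
            # grad of log pi(a|s) for a linear softmax policy
            g = -np.outer(p, s)
            g[a] += s
            grads.append(g)
            rewards.append(r)
        # mean-reward baseline to reduce variance; no discounting needed
        # because the toy reward is immediate
        baseline = np.mean(rewards)
        for g, r in zip(grads, rewards):
            theta += alpha * (r - baseline) * g
    return theta

theta = reinforce()
```

If this converges with the transformed observations while DQN does not, that would support the "value function is the problem" hypothesis.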
@jbrea, adding you to the issue so you can follow the progress.