isaacimagine | 8 months ago
Most RL researchers consider these approaches not to be "real RL", as they can't assign credit outside the context window, and therefore can't learn infinite-horizon tasks. With 1m+ context windows, perhaps this is less of an issue in practice? Curious to hear thoughts.
highd | 8 months ago
The hardness of the credit assignment problem is a statement about data sparsity. Architecture choices do not "bypass" it.
isaacimagine | 8 months ago
The DT citation [10] is used on a single line, in a paragraph listing prior work, as an "and more". Another paper that uses DTs [53] is also cited in a similar way. The authors do not test or discuss DTs.
> hardness of the credit assignment ... data sparsity.
That's true, but it's not the point I'm making. In the context of long-horizon task modeling, "bypassing credit assignment" refers to using attention to allocate reward across long horizons without a horizon-shrinking discount; it's not a claim about architecture choice per se.
To expand: suppose an environment has a key that unlocks a door thousands of steps later. Q-learning may fail to propagate the reward from opening the door back to the moment of picking up the key, because discounting shrinks the future reward geometrically over that horizon. A decision transformer, however, can attend directly to the moment of picking up the key while opening the door, which sidesteps the problem of establishing that long-horizon causal connection.
(Of course, attention cannot assign reward if the moment the key was picked up is beyond the extent of the context window.)
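To make the discount point concrete, here's a quick back-of-the-envelope sketch (the discount factor and horizon are illustrative numbers I picked, not from any paper):

```python
# How much of a delayed reward survives discounting back to the
# enabling action, T steps earlier.
gamma = 0.99   # a fairly mild discount factor
T = 1000       # steps between picking up the key and opening the door
reward = 1.0   # reward for opening the door

# Credit that reaches the key-pickup step under gamma-discounting:
credit = reward * gamma ** T
print(f"{credit:.2e}")  # roughly 4.3e-05
```

So even with gamma = 0.99, only about four hundred-thousandths of the door reward reaches the key pickup; the signal is effectively gone relative to noise in the value estimates. Attention over a context containing both events has no such geometric attenuation, but, as noted above, only within the context window.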