top | item 44284057

highd | 8 months ago

You can do Q-learning with a transformer: simply define the state as the observation sequence so far. This is in fact the natural thing to do in partially observed settings, so your distinction does not make sense.
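A minimal sketch of the point being made (a hypothetical toy example, not from the thread): tabular Q-learning where the "state" is the whole observation history, represented as a tuple. A transformer Q-network would replace the dict lookup over histories, but the Bellman update itself is unchanged.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # assumed toy hyperparameters

def q_update(Q, history, action, reward, next_history, next_actions):
    """One Bellman backup keyed on the observation sequence (a tuple)."""
    target = reward + GAMMA * max(Q[(next_history, a)] for a in next_actions)
    Q[(history, action)] += ALPHA * (target - Q[(history, action)])

Q = defaultdict(float)
# The history grows by one observation per step, as in a POMDP.
h0 = ("obs0",)
h1 = ("obs0", "obs1")
q_update(Q, h0, "left", 1.0, h1, ["left", "right"])
# Q[(h0, "left")] moves halfway toward the target 1.0 + 0.9 * 0 = 1.0
```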

isaacimagine | 8 months ago

The distinction is the Decision Transformer's reward-to-go conditioning vs. Q-learning's Bellman backup (including the discount), not the choice of architecture for the policy. You could also do DTs with RNNs (though those have their own problems with memory).

Apologies if we're talking past one another.