A first look at the materials
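- The 蘑菇书 ("Mushroom Book", EasyRL) has good form and organization, but its math is poor (the opening of Chapter 2 is already full of errors; in any case I couldn't get through it).
- Sutton and Barto's classic Reinforcement Learning: An Introduction, second edition (2018). It is mainly value-based rather than policy-based, which is what is popular today (policy methods get only twenty-odd pages). Haven't read it.
- Hugging Face also has a deep-rl-class. It oversimplifies and is rather wordy, but it comes with convenient hands-on code.
- OpenAI's Spinning Up gives a brief introduction to policy gradients.
- CS 285 at UC Berkeley, Deep Reinforcement Learning, is more modern. I've read the first part (I skipped the videos and went straight to the slides).
- For more, see Reinforcement Learning Resources in the Stable Baselines3 documentation.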
Below are brief notes: unified notation, my own concise proofs and PyTorch implementations, plus assorted supplementary remarks.
Goals
"Which algorithms? You should probably start with vanilla policy gradient (also called REINFORCE), DQN, A2C (the synchronous version of A3C), PPO (the variant with the clipped objective), and DDPG, approximately in that order. The simplest versions of all of these can be written in just a few hundred lines of code (ballpark 250-300), and some of them even less (for example, a no-frills version of VPG can be written in about 80 lines)." (From Spinning Up.)
First, get a clear grasp of the most popular methods. As for concrete applications... well... I have no particular need; I'm just playing around, so I won't go especially deep and will read as far as I get.