deep reinforcement learning

1 minute read

Published: November 06, 2024

对reinforcement learning和 deep reinforcement learning的粗浅理解

参考：https://ydata123.org/fa24/interml/calendar.html

传统RL 代表为Q table，finite states, 如果有255 state, five sorts of action, so the total number of q table should be (255, 5)，所以我们可以给每一个state 一个最高Q得分的action，注意这里的action 的得分Q不仅取决于当前的reward，并且还会选择这个action后取决于未来选取最优最高的action的reward。下图是Q table update rule，注意这里的Q update 还需要一定减去当前的Q

核心代码实现：

import random
from tqdm import tqdm

alpha = 0.1
gamma = 0.8
epsilon = 0.1
episodes = 20000
q_table = np.zeros([env.observation_space.n, env.action_space.n])

for _ in tqdm(np.arange(episodes)):
    state = env.reset()

    done = False
    while not done:
        greedy = True
        if random.uniform(0, 1) < epsilon:
            greedy = False
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, _ = env.step(action)

        print(f"next_state is {next_state}")
        print(f"reward is {reward}")
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = old_value + alpha*(reward + gamma * next_max - old_value)
        q_table[state, action] = new_value

        print("new_value is:", new_value)

        state = next_state
        epochs += 1
        

这里运行20000 episodes，Exploit learned values时候选择最高得分的action（当前state）action = np.argmax(q_table[state]) ，每一次action 选择后都对 q_table[state, action] 进行一次update。重复epsidoes，更新参数。

关于deep reinforcement learning

个人认为deep reinforcement learning 还是 multiply layer machine, 只不过 loss function is depend on Q value，in which the Q value is determined by actions selected。（还在学习，粗浅理解）

深度强化学习（DRL）中的神经网络和传统的监督学习网络非常相似，只不过损失函数的计算是基于Q值（或者策略函数），而不是直接通过标签（如监督学习中的正确输出）。

在DRL中，损失函数通常是根据贝尔曼方程计算得到的目标值（如Q值的更新公式）来定义的，反映了当前网络输出的价值估计和实际期望价值之间的差距。通过最小化这个差距（即损失函数），我们可以引导神经网络的参数更新，从而使其逐渐学到更好的策略。

具体来说，以 DQN（深度Q网络）为例：

输入：当前状态 ( s )。
输出：该状态下每个可能动作的 Q 值 ( Q(s, a) )。
损失函数：定义为网络预测的 Q 值和目标 Q 值之间的均方误差（MSE），其中目标 Q 值通过贝尔曼方程计算得到。

所以，可以总结为：DRL 中的神经网络的训练过程依赖于 Q 值或策略函数的更新，并通过特定的损失函数来调整网络的参数，使得神经网络输出越来越接近实际的期望回报。

Share on

Twitter Facebook LinkedIn

Tianyi Xiang

deep reinforcement learning

关于deep reinforcement learning

Share on

You May Also Enjoy

Leetcode

9. Palindrome Number

Topics of CV works

Topics of CV works

some interesting RSS workshops