Deep Reinforcement Learning
A basic understanding of reinforcement learning and deep reinforcement learning
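Traditional RL is typified by the Q-table, which assumes a finite set of states. With 255 states and 5 kinds of action, the Q-table has shape (255, 5), so for every state we can pick the action with the highest Q score. Note that an action's Q score depends not only on the immediate reward but also on the reward obtained by choosing the best (highest-scoring) action from the next state onward. The Q-table update rule is

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

Note that the update subtracts the current Q value: the entry moves toward the target \( r + \gamma \max_{a'} Q(s', a') \) by a fraction \( \alpha \) of the difference.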
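To make the Q-table update concrete, here is a minimal sketch of a single update on a tiny hand-made table. The helper name `q_update`, the 3-state by 2-action table, and the sample transition are all illustrative assumptions; only the values alpha = 0.1 and gamma = 0.8 are taken from the training code in this post.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.8):
    """Apply one tabular Q-learning update in place and return the new value."""
    old_value = q_table[state][action]
    next_max = max(q_table[next_state])      # best value reachable from next_state
    td_target = reward + gamma * next_max    # immediate reward + discounted future
    new_value = old_value + alpha * (td_target - old_value)
    q_table[state][action] = new_value
    return new_value


# Hypothetical toy table: 3 states x 2 actions, stored as a list of rows.
q_table = [
    [0.0, 0.0],
    [0.5, 2.0],
    [0.0, 0.0],
]

# Hypothetical transition: in state 0, action 1 gives reward -1 and leads to state 1.
new_value = q_update(q_table, state=0, action=1, reward=-1.0, next_state=1)
print(new_value)  # about 0.06 = 0 + 0.1 * (-1 + 0.8 * 2.0 - 0)
```

Only the entry for the visited (state, action) pair changes; the row for the next state is read but not modified.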
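import random

import numpy as np
from tqdm import tqdm

# Assumes `env` is an already-created Gym-style environment with discrete
# observation and action spaces, using the classic API in which reset()
# returns the state and step() returns (next_state, reward, done, info).

alpha = 0.1       # learning rate
gamma = 0.8       # discount factor
epsilon = 0.1     # exploration probability
episodes = 20000

q_table = np.zeros([env.observation_space.n, env.action_space.n])

for _ in tqdm(range(episodes)):
    state = env.reset()
    done = False
    epochs = 0  # number of steps taken in this episode
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, _ = env.step(action)
        print(f"next_state is {next_state}")
        print(f"reward is {reward}")

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        # Q-learning update: move the old value toward reward + gamma * next_max
        new_value = old_value + alpha * (reward + gamma * next_max - old_value)
        q_table[state, action] = new_value
        print("new_value is:", new_value)

        state = next_state
        epochs += 1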
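Here we run 20000 episodes. With probability epsilon = 0.1 the agent explores a random action; otherwise it exploits the learned values and picks the highest-scoring action for the current state with action = np.argmax(q_table[state]). After every action selection, q_table[state, action] is updated with the Q-learning update rule.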
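The explore/exploit choice can also be isolated into a small helper. The following is a sketch under assumptions: `select_action` and the sample Q values are hypothetical names and numbers, and plain Python lists stand in for a NumPy row of the Q-table.

```python
import random


def select_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over one row of a Q-table (one state's action values)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore: uniform random action
    return max(range(len(q_row)), key=q_row.__getitem__)   # exploit: index of the largest Q


# Hypothetical Q values for a single state with 5 actions.
q_row = [0.1, -0.3, 0.7, 0.0, 0.2]

rng = random.Random(0)  # seeded only so the sketch is repeatable
greedy_action = select_action(q_row, epsilon=0.0, rng=rng)  # epsilon = 0 never explores
sampled = [select_action(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = sampled.count(greedy_action) / len(sampled)
print(greedy_action, greedy_fraction)
```

With epsilon = 0.1 most of the sampled actions are the greedy one, while the remainder are spread over all 5 actions, which is what keeps the agent visiting entries of the table it would otherwise never try.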
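About deep reinforcement learning
In my view, deep reinforcement learning is still a multilayer perceptron (MLP); the difference is that the loss function depends on the Q value, which in turn is determined by the actions selected. (I am still learning, so this is a rough understanding.)
Specifically, taking DQN (Deep Q-Network) as an example: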
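- Input: the current state \( s \).
- Output: the Q value \( Q(s, a) \) of every possible action in that state.
- Loss function: the mean squared error (MSE) between the Q value predicted by the network and the target Q value, where the target Q value is computed from the Bellman equation.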
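The three items above can be sketched without any deep learning library. This is a minimal illustration, not a full DQN: the "network" is a hypothetical stand-in function `q_network` (a single linear layer), the batch of transitions is made up, and gamma = 0.8 is borrowed from the tabular example. It only shows how the Bellman target and the MSE loss are formed; a real implementation would minimize this loss by gradient descent using a framework such as PyTorch or TensorFlow.

```python
def q_network(state, weights):
    """Hypothetical stand-in for a neural network: one linear layer, no bias.

    state is a list of features; weights has one row of coefficients per action.
    Returns one Q value per action.
    """
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def dqn_loss(batch, weights, gamma=0.8):
    """Mean squared error between predicted Q(s, a) and the Bellman target."""
    squared_errors = []
    for state, action, reward, next_state, done in batch:
        predicted = q_network(state, weights)[action]
        # Bellman target: r if the episode ended, else r + gamma * max_a' Q(s', a')
        future = 0.0 if done else gamma * max(q_network(next_state, weights))
        target = reward + future
        squared_errors.append((predicted - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Made-up example: 2 state features, 2 actions.
weights = [
    [1.0, 0.0],  # action 0
    [0.0, 1.0],  # action 1
]
batch = [
    # (state, action, reward, next_state, done)
    ([1.0, 2.0], 0, 1.0, [0.0, 5.0], False),
    ([3.0, 1.0], 1, 2.0, [0.0, 0.0], True),
]
loss = dqn_loss(batch, weights)
print(loss)  # about 8.5: squared errors are (1 - 5)^2 = 16 and (1 - 2)^2 = 1
```

Compared with the tabular case, the quantity `reward + gamma * next_max` plays the same role, but instead of overwriting one table entry it becomes the regression target for the network's output.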
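In summary: training the neural network in DRL depends on updating the Q value or the policy function, and a specific loss function is used to adjust the network's parameters so that its output moves closer and closer to the actual expected return.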