Exercises - Reinforcement learning
Laurent Fainsin
2021 - 2022
""" # ╔═╡ 7fb608fa-da53-47d3-a585-235b4692c4dd md""" # Exercice 1 - Finite Horizon MDP > Revenue management: Littlewood’s model > > An airplane has 20 seats available, and the sell closes in 50 days. At every time epoch, the airplane decides the selling price: Either ``p_1`` = 5, and then it will sell a seat with probability ``q_1`` = 0.1, or ``p_2`` = 1, and then it will sell a seat with probability ``q_2`` = 0.8. """ # ╔═╡ 58baae0a-ae46-443c-8c08-8743a13eb860 md""" D'après l'énoncé on a: $\mathcal{S} = \left\{ s_1, s_2 \right\}$ $\mathcal{A} = \left\{ p_1, p_2 \right\}$ $\mathcal{R} = \left\{ p_1 q_1, p_2 q_2 \right\}$ """ # ╔═╡ dc2fcab2-7b23-47e3-b53a-0ad966645afd md""" D'après l'équation d'optimalité de Bellmann: $V(s) = r(s) + \gamma \sum_{s'} p(s,s')V(s')$ Dans notre cas on a donc (``\gamma = 1``): $V_T(s_1) = p_1 q_1 + q_1 V_{T-1}(s_1) + (1-q_1) V_{T-1}(s_2)$ $V_T(s_2) = p_2 q_2 + q_2 V_{T-1}(s_2) + (1-q_2) V_{T-1}(s_1)$ """ # ╔═╡ a6100c90-9a12-43d8-8ec3-dbf04b4ab5d1 md""" On peut alors poser: $P = \begin{pmatrix} q_1 & 1-q_1 \\ q_2 & 1-q_2 \end{pmatrix}$ $R = \begin{pmatrix} p_1 q_1 \\ p_2 q_2 \end{pmatrix}$ $V_T = \begin{pmatrix} V_T(s_1) \\ V_T(s_2) \end{pmatrix}$ tel que l'on puisse reformuler Bellmann: $V_T = R + \gamma P V_{T-1}$ """ # ╔═╡ a5df336b-3db1-441d-845b-3ddb0aa3213a md""" En selectionnant la valeur maximale de ``V_T`` on en déduit la politique optimal pour ce jour ``T``. Intuitivement on choisi le prix ``p_2`` si l'on souhaite maximiser notre gain. 
We can check this numerically:
"""

# ╔═╡ a30def9e-50f0-4aa2-8e79-6d10ecee5bee
begin
	# Interactive sliders for the two prices and their sale probabilities
	p1_slider = @bind p1 Slider(0:1:10, default=5, show_value=true)
	q1_slider = @bind q1 Slider(0:0.1:1, default=0.1, show_value=true)
	p2_slider = @bind p2 Slider(0:1:10, default=1, show_value=true)
	q2_slider = @bind q2 Slider(0:0.1:1, default=0.8, show_value=true)

	md"""
	``p_1``: $(p1_slider) ``\quad\quad`` ``q_1``: $(q1_slider)

	``p_2``: $(p2_slider) ``\quad\quad`` ``q_2``: $(q2_slider)
	"""
end

# ╔═╡ 3176c1f0-c1fa-4c74-9133-f4371e584b6c
# Transition matrix (2×2): one row per state
P = [
	q1 1-q1
	q2 1-q2
]

# ╔═╡ f5566288-5ef4-495c-8442-a59f17f0814b
# Expected one-step revenue for each price (column vector)
R = [
	p1 * q1
	p2 * q2
]

# ╔═╡ f01bd868-a180-4327-98bf-de2298673523
γ = 1

# ╔═╡ f8c9b0c8-3f81-441b-b673-f07ed2ae3e5a
begin
	text = ""
	T = 50
	V = [0; 0]  # terminal value: nothing is earned once the sale has closed
	text *= "``V_{$(T)} = $(V)``\n\n"
	for t in 1:T
		# One backward-induction step of the matrix Bellman recursion
		V = R + γ * P * V
		# Index of the largest component, reported as the chosen price
		choix = argmax(V)
		V_display = round.(V, digits=2)
		text *= "``V_{$(T-t)} = $(V_display) → p_$(choix)``\n\n"
	end
	Markdown.parse(text)
end

# ╔═╡ 73a0d6a0-94a9-45b3-9db9-6cc0c845ab95
md"""
# Exercise 2 - Infinite Horizon MDP

> See figure
>
> What is the optimal policy (for total discounted reward) for various values of ``\gamma``?