### A Pluto.jl notebook ###
# v0.19.36
using Markdown
using InteractiveUtils
# This Pluto notebook uses @bind for interactivity. When running this notebook outside of Pluto, the following 'mock version' of @bind gives bound variables a default value (instead of an error).
macro bind(def, element)
    quote
        local iv = try Base.loaded_modules[Base.PkgId(Base.UUID("6e696c72-6542-2067-7265-42206c756150"), "AbstractPlutoDingetjes")].Bonds.initial_value catch; b -> missing; end
        local el = $(esc(element))
        global $(esc(def)) = Core.applicable(Base.get, el) ? Base.get(el) : iv(el)
        el
    end
end
# ╔═╡ f70b9ab6-3a97-4d41-b1a1-268085565bcb
# ╠═╡ show_logs = false
# https://github.com/fonsp/Pluto.jl/wiki/%F0%9F%8E%81-Package-management#advanced-set-up-an-environment-with-pkgactivate
begin
    using Pkg
    Pkg.activate()
end
# ╔═╡ ec777da4-8ef0-44b4-8d56-38b31790a5b7
begin
    using PlutoUI # for the Pluto UI objects (sliders, table of contents)
    TableOfContents(depth=4)
end
# ╔═╡ dc4eb528-82b2-11ed-3ca2-c321ff8ef647
html"""
<center>
<strong style="font-size: 2rem;">
Exercices - Reinforcement learning <br/>
Laurent Fainsin <br/>
2021 - 2022
</strong>
</center>
"""
# ╔═╡ 7fb608fa-da53-47d3-a585-235b4692c4dd
md"""
# Exercice 1 - Finite Horizon MDP
> Revenue management: Littlewoods model
>
> An airplane has 20 seats available, and the sale closes in 50 days. At every time epoch, the airline sets the selling price: either ``p_1`` = 5, in which case it sells a seat with probability ``q_1`` = 0.1, or ``p_2`` = 1, in which case it sells a seat with probability ``q_2`` = 0.8.
"""
# ╔═╡ 58baae0a-ae46-443c-8c08-8743a13eb860
md"""
From the problem statement we have:
$\mathcal{S} = \left\{ s_1, s_2 \right\}$
$\mathcal{A} = \left\{ p_1, p_2 \right\}$
$\mathcal{R} = \left\{ p_1 q_1, p_2 q_2 \right\}$
"""
# ╔═╡ dc2fcab2-7b23-47e3-b53a-0ad966645afd
md"""
From Bellman's optimality equation:
$V(s) = r(s) + \gamma \sum_{s'} p(s,s')V(s')$
In our case we therefore have (``\gamma = 1``):
$V_T(s_1) = p_1 q_1 + q_1 V_{T-1}(s_1) + (1-q_1) V_{T-1}(s_2)$
$V_T(s_2) = p_2 q_2 + q_2 V_{T-1}(s_2) + (1-q_2) V_{T-1}(s_1)$
"""
# ╔═╡ a6100c90-9a12-43d8-8ec3-dbf04b4ab5d1
md"""
We can then define:
$P = \begin{pmatrix}
q_1 & 1-q_1 \\
q_2 & 1-q_2
\end{pmatrix}$
$R = \begin{pmatrix}
p_1 q_1 \\
p_2 q_2
\end{pmatrix}$
$V_T = \begin{pmatrix}
V_T(s_1) \\
V_T(s_2)
\end{pmatrix}$
so that we can rewrite Bellman's equation as:
$V_T = R + \gamma P V_{T-1}$
"""
# ╔═╡ a5df336b-3db1-441d-845b-3ddb0aa3213a
md"""
Selecting the maximal component of ``V_T`` gives the optimal price for day ``T``. Intuitively, we choose price ``p_2`` if we want to maximize our gain, since its expected immediate revenue ``p_2 q_2 = 0.8`` exceeds ``p_1 q_1 = 0.5``. We can verify this numerically:
"""
# ╔═╡ a30def9e-50f0-4aa2-8e79-6d10ecee5bee
begin
    p1_slider = @bind p1 Slider(0:1:10, default=5, show_value=true)
    q1_slider = @bind q1 Slider(0:0.1:1, default=0.1, show_value=true)
    p2_slider = @bind p2 Slider(0:1:10, default=1, show_value=true)
    q2_slider = @bind q2 Slider(0:0.1:1, default=0.8, show_value=true)
    md"""
    ``p_1``: $(p1_slider) ``\quad\quad``
    ``q_1``: $(q1_slider)
    ``p_2``: $(p2_slider) ``\quad\quad``
    ``q_2``: $(q2_slider)
    """
end
# ╔═╡ 3176c1f0-c1fa-4c74-9133-f4371e584b6c
P = [
    q1 1-q1
    q2 1-q2
]
# ╔═╡ f5566288-5ef4-495c-8442-a59f17f0814b
R = [
    p1 * q1
    p2 * q2
]
# ╔═╡ f01bd868-a180-4327-98bf-de2298673523
γ = 1
# ╔═╡ f8c9b0c8-3f81-441b-b673-f07ed2ae3e5a
begin
    text = ""
    T = 50      # number of days until the sale closes
    V = [0; 0]  # terminal values: no further revenue once the sale is over
    text *= "``V_{$(T)} = $(V)``\n\n"
    for t in 1:T
        # Backward induction: V_{T-t} = R + γ P V_{T-t+1}
        V = R + γ * P * V
        best_price = argmax(V)  # index of the best price for this day
        V_display = round.(V, digits=2)
        text *= "``V_{$(T-t)} = $(V_display) → p_$(best_price)``\n\n"
    end
    Markdown.parse(text)
end
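# ╔═╡ b7c1e0aa-1d2f-4b6a-9c3e-aa11bb22cc33
md"""
Since ``V_0 = 0``, unrolling the recursion gives the closed form ``V_T = \sum_{k=0}^{T-1} (\gamma P)^k R``. The next cell is an added sanity check (not part of the original hand computation): it evaluates this sum directly and should reproduce the last line of the loop above.
"""
# ╔═╡ c8d2f1bb-2e30-4c7b-8d4f-bb22cc33dd44
begin
    # Closed form of the backward induction: V_T = Σ_{k=0}^{T-1} (γP)^k R.
    V_closed = sum((γ * P)^k for k in 0:T-1) * R
    round.(V_closed, digits=2)
end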
# ╔═╡ 73a0d6a0-94a9-45b3-9db9-6cc0c845ab95
md"""
# Exercice 2 - Infinite Horizon MDP
> See figure
>
> What is the optimal policy (for total discounted reward) for various values of ``\gamma``?
"""
# ╔═╡ 3b87e4e9-81c9-48d9-9f58-e6271f975c40
md"""
Optimality equations for our two states (``s_0`` and ``s_1``):
``V_\star(s_0) = \max( 10 + \gamma V_\star(s_1), 1 + \gamma V_\star(s_0) )``
``V_\star(s_1) = \max( 0 + \gamma V_\star(s_1), -15 + \gamma V_\star(s_0) )``
"""
# ╔═╡ 045e6ef0-84a8-4ad1-bc7c-0494599fd825
md"""
If ``\gamma \approx 0``:
``V_\star(s_0) \approx \max(10, 1) = 10``
``V_\star(s_1) \approx \max(0, -15) = 0``
"""
# ╔═╡ 8cd6c607-2094-4fa0-b1d8-445842ac5091
md"""
If ``\gamma \approx 1`` (the optimal policy stays in ``s_0`` and, from ``s_1``, returns to ``s_0``):
``V_\star(s_0) = 1 + \gamma V_\star(s_0) \implies V_\star(s_0) = \displaystyle\frac{1}{1-\gamma}``
``V_\star(s_1) = -15 + \gamma V_\star(s_0) \implies V_\star(s_1) = -15 + \displaystyle\frac{\gamma}{1-\gamma}``
"""
# ╔═╡ fe6dc66f-2627-4c13-b32a-9751b2ff3a73
md"""
The switching points are where the two actions of a state tie. At ``s_0``, taking the reward 10 (with ``V_\star(s_1) = 0``) ties with staying when ``10 = \frac{1}{1-\gamma}``, i.e.
``\gamma = 0.9``
At ``s_1``, staying ties with returning to ``s_0`` when ``-15 + \frac{\gamma}{1-\gamma} = 0``, i.e.
``\gamma = \displaystyle\frac{15}{16} \approx 0.94``
"""
# ╔═╡ fdd906ab-2224-4c26-9fb9-d9855bfda257
md"""
By case analysis:
"""
# ╔═╡ 65dcb2b0-1021-45a6-afe7-a95f6e139636
md"""
If ``\gamma \in [0, 0.9]`` (take the reward 10, then stay in ``s_1``):
``V_\star(s_0) = 10 + \gamma V_\star(s_1)``
``V_\star(s_1) = \gamma V_\star(s_1)``
"""
# ╔═╡ 69a4815d-1a84-400c-ae2b-4753e6c96abc
md"""
If ``\gamma \in [0.9, 0.94]`` (stay in ``s_0``; from ``s_1``, stay as well):
``V_\star(s_0) = 1 + \gamma V_\star(s_0)``
``V_\star(s_1) = \gamma V_\star(s_1)``
"""
# ╔═╡ 0b1a0996-d98d-43a5-8d4e-3faa6ded79a8
md"""
If ``\gamma \in [0.94, 1]`` (stay in ``s_0``; from ``s_1``, pay 15 to return to ``s_0``):
``V_\star(s_0) = 1 + \gamma V_\star(s_0)``
``V_\star(s_1) = -15 + \gamma V_\star(s_0)``
"""
# ╔═╡ Cell order:
# ╟─f70b9ab6-3a97-4d41-b1a1-268085565bcb
# ╟─ec777da4-8ef0-44b4-8d56-38b31790a5b7
# ╟─dc4eb528-82b2-11ed-3ca2-c321ff8ef647
# ╟─7fb608fa-da53-47d3-a585-235b4692c4dd
# ╟─58baae0a-ae46-443c-8c08-8743a13eb860
# ╟─dc2fcab2-7b23-47e3-b53a-0ad966645afd
# ╟─a6100c90-9a12-43d8-8ec3-dbf04b4ab5d1
# ╟─a5df336b-3db1-441d-845b-3ddb0aa3213a
# ╟─a30def9e-50f0-4aa2-8e79-6d10ecee5bee
# ╟─3176c1f0-c1fa-4c74-9133-f4371e584b6c
# ╟─f5566288-5ef4-495c-8442-a59f17f0814b
# ╟─f01bd868-a180-4327-98bf-de2298673523
# ╟─f8c9b0c8-3f81-441b-b673-f07ed2ae3e5a
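# ╟─b7c1e0aa-1d2f-4b6a-9c3e-aa11bb22cc33
# ╠═c8d2f1bb-2e30-4c7b-8d4f-bb22cc33dd44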
# ╟─73a0d6a0-94a9-45b3-9db9-6cc0c845ab95
# ╟─3b87e4e9-81c9-48d9-9f58-e6271f975c40
# ╟─045e6ef0-84a8-4ad1-bc7c-0494599fd825
# ╟─8cd6c607-2094-4fa0-b1d8-445842ac5091
# ╟─fe6dc66f-2627-4c13-b32a-9751b2ff3a73
# ╟─fdd906ab-2224-4c26-9fb9-d9855bfda257
# ╟─65dcb2b0-1021-45a6-afe7-a95f6e139636
# ╟─69a4815d-1a84-400c-ae2b-4753e6c96abc
# ╟─0b1a0996-d98d-43a5-8d4e-3faa6ded79a8
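# ╟─d9e3a2cc-3f41-4d8c-9e50-cc33dd44ee55
# ╠═eaf4b3dd-4052-4e9d-8f61-dd44ee55ff66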