### A Pluto.jl notebook ###
# v0.19.36
using Markdown
using InteractiveUtils
# This Pluto notebook uses @bind for interactivity. When running this notebook outside of Pluto, the following 'mock version' of @bind gives bound variables a default value (instead of an error).
macro bind(def, element)
    quote
        local iv = try Base.loaded_modules[Base.PkgId(Base.UUID("6e696c72-6542-2067-7265-42206c756150"), "AbstractPlutoDingetjes")].Bonds.initial_value catch; b -> missing; end
        local el = $(esc(element))
        global $(esc(def)) = Core.applicable(Base.get, el) ? Base.get(el) : iv(el)
        el
    end
end
# ╔═╡ f70b9ab6-3a97-4d41-b1a1-268085565bcb
# ╠═╡ show_logs = false
# https://github.com/fonsp/Pluto.jl/wiki/%F0%9F%8E%81-Package-management#advanced-set-up-an-environment-with-pkgactivate
begin
    using Pkg
    Pkg.activate()
end
# ╔═╡ ec777da4-8ef0-44b4-8d56-38b31790a5b7
begin
    using PlutoUI # for the Pluto UI objects
    TableOfContents(depth=4)
end
# ╔═╡ dc4eb528-82b2-11ed-3ca2-c321ff8ef647
html"""
<center>
<strong style="font-size: 2rem;">
Exercises - Reinforcement learning <br/>
Laurent Fainsin <br/>
2021 - 2022
</strong>
</center>
"""
# ╔═╡ 7fb608fa-da53-47d3-a585-235b4692c4dd
md"""
# Exercise 1 - Finite Horizon MDP

> Revenue management: Littlewood's model
>
> An airplane has 20 seats available, and the sale closes in 50 days. At every time epoch, the airline sets the selling price: either ``p_1 = 5``, in which case it sells a seat with probability ``q_1 = 0.1``, or ``p_2 = 1``, in which case it sells a seat with probability ``q_2 = 0.8``.
"""
# ╔═╡ 58baae0a-ae46-443c-8c08-8743a13eb860
md"""
From the problem statement we have:

$\mathcal{S} = \left\{ s_1, s_2 \right\}$

$\mathcal{A} = \left\{ p_1, p_2 \right\}$

$\mathcal{R} = \left\{ p_1 q_1, p_2 q_2 \right\}$
"""
# ╔═╡ dc2fcab2-7b23-47e3-b53a-0ad966645afd
md"""
From the Bellman optimality equation:

$V(s) = r(s) + \gamma \sum_{s'} p(s,s')V(s')$

In our case this gives (with ``\gamma = 1``):

$V_T(s_1) = p_1 q_1 + q_1 V_{T-1}(s_1) + (1-q_1) V_{T-1}(s_2)$

$V_T(s_2) = p_2 q_2 + q_2 V_{T-1}(s_2) + (1-q_2) V_{T-1}(s_1)$
"""
# ╔═╡ a6100c90-9a12-43d8-8ec3-dbf04b4ab5d1
md"""
We can then define:

$P = \begin{pmatrix}
q_1 & 1-q_1 \\
q_2 & 1-q_2
\end{pmatrix}$

$R = \begin{pmatrix}
p_1 q_1 \\
p_2 q_2
\end{pmatrix}$

$V_T = \begin{pmatrix}
V_T(s_1) \\
V_T(s_2)
\end{pmatrix}$

so that the Bellman equation can be rewritten as:

$V_T = R + \gamma P V_{T-1}$
"""
# ╔═╡ a5df336b-3db1-441d-845b-3ddb0aa3213a
md"""
By selecting the maximal component of ``V_T`` we deduce the optimal policy for day ``T``. Intuitively, we should choose price ``p_2`` to maximize our gain. We can verify this numerically:
"""
# ╔═╡ a30def9e-50f0-4aa2-8e79-6d10ecee5bee
begin
    p1_slider = @bind p1 Slider(0:1:10, default=5, show_value=true)
    q1_slider = @bind q1 Slider(0:0.1:1, default=0.1, show_value=true)
    p2_slider = @bind p2 Slider(0:1:10, default=1, show_value=true)
    q2_slider = @bind q2 Slider(0:0.1:1, default=0.8, show_value=true)
    md"""
    ``p_1``: $(p1_slider) ``\quad\quad``
    ``q_1``: $(q1_slider)

    ``p_2``: $(p2_slider) ``\quad\quad``
    ``q_2``: $(q2_slider)
    """
end
# ╔═╡ 3176c1f0-c1fa-4c74-9133-f4371e584b6c
# transition matrix between the two price states
P = [
    q1 1-q1
    q2 1-q2
]
# ╔═╡ f5566288-5ef4-495c-8442-a59f17f0814b
# expected immediate reward for each price
R = [
    p1 * q1
    p2 * q2
]
# ╔═╡ f01bd868-a180-4327-98bf-de2298673523
γ = 1 # undiscounted: the horizon is finite
# ╔═╡ f8c9b0c8-3f81-441b-b673-f07ed2ae3e5a
begin
    text = ""
    T = 50      # number of selling days
    V = [0; 0]  # terminal values: nothing can be earned after the sale closes
    text *= "``V_{50} = $(V)``\n\n"
    for t in 1:T
        # Bellman backup: compute V_{T-t} from V_{T-t+1}
        V = R + γ * P * V
        choix = argmax(V) # index of the optimal price for this day
        V_display = round.(V, digits=2)
        text *= "``V_{$(T-t)} = $(V_display) → p_$(choix)``\n\n"
    end
    Markdown.parse(text)
end
# ╔═╡ 73a0d6a0-94a9-45b3-9db9-6cc0c845ab95
md"""
# Exercise 2 - Infinite Horizon MDP

> See figure
>
> What is the optimal policy (for total discounted reward) for various values of ``\gamma``?
"""
# ╔═╡ 3b87e4e9-81c9-48d9-9f58-e6271f975c40
md"""
Optimality equations for our two states (``s_0`` and ``s_1``):

``V_\star(s_0) = \max( 10 + \gamma V_\star(s_1), 1 + \gamma V_\star(s_0) )``

``V_\star(s_1) = \max( 0 + \gamma V_\star(s_1), -15 + \gamma V_\star(s_0) )``
"""
# ╔═╡ 045e6ef0-84a8-4ad1-bc7c-0494599fd825
md"""
If ``\gamma \approx 0``:

``V_\star(s_0) \approx \max(10, 1) = 10``

``V_\star(s_1) \approx \max(0, -15) = 0``
"""
# ╔═╡ 8cd6c607-2094-4fa0-b1d8-445842ac5091
md"""
If ``\gamma \approx 1``, the optimal actions are to stay in ``s_0`` and to leave ``s_1``:

``V_\star(s_0) = 1 + \gamma V_\star(s_0) \implies V_\star(s_0) = \displaystyle\frac{1}{1-\gamma}``

``V_\star(s_1) = -15 + \gamma V_\star(s_0) \implies V_\star(s_1) = -15 + \displaystyle\frac{\gamma}{1-\gamma}``
"""
# ╔═╡ fe6dc66f-2627-4c13-b32a-9751b2ff3a73
md"""
Equating the two branches of each ``\max`` (the points where the optimal action switches) and solving gives the two thresholds:

``10 = 1 + \displaystyle\frac{\gamma}{1-\gamma} \implies \gamma = 0.9`` (switch for ``s_0``)

``0 = -15 + \displaystyle\frac{\gamma}{1-\gamma} \implies \gamma = \displaystyle\frac{15}{16} \approx 0.94`` (switch for ``s_1``)
"""
# ╔═╡ fdd906ab-2224-4c26-9fb9-d9855bfda257
md"""
By case analysis:
"""
# ╔═╡ 65dcb2b0-1021-45a6-afe7-a95f6e139636
md"""
If ``\gamma \in [0, 0.9]``:

``V_\star(s_0) = 10 + \gamma V_\star(s_1)``

``V_\star(s_1) = \gamma V_\star(s_1)``
"""
# ╔═╡ 69a4815d-1a84-400c-ae2b-4753e6c96abc
md"""
If ``\gamma \in [0.9, 0.94]``:

``V_\star(s_0) = 1 + \gamma V_\star(s_0)``

``V_\star(s_1) = \gamma V_\star(s_1)``
"""
# ╔═╡ 0b1a0996-d98d-43a5-8d4e-3faa6ded79a8
md"""
If ``\gamma \in [0.94, 1]``:

``V_\star(s_0) = 1 + \gamma V_\star(s_0)``

``V_\star(s_1) = -15 + \gamma V_\star(s_0)``
"""
# ╔═╡ Cell order:
# ╟─f70b9ab6-3a97-4d41-b1a1-268085565bcb
# ╟─ec777da4-8ef0-44b4-8d56-38b31790a5b7
# ╟─dc4eb528-82b2-11ed-3ca2-c321ff8ef647
# ╟─7fb608fa-da53-47d3-a585-235b4692c4dd
# ╟─58baae0a-ae46-443c-8c08-8743a13eb860
# ╟─dc2fcab2-7b23-47e3-b53a-0ad966645afd
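# ╟─aa11bb22-cc33-44dd-9ee5-ff6677889900
# ╟─ab12cd34-ef56-4a78-9b0c-d1e2f3a4b5c6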
# ╟─a6100c90-9a12-43d8-8ec3-dbf04b4ab5d1
# ╟─a5df336b-3db1-441d-845b-3ddb0aa3213a
# ╟─a30def9e-50f0-4aa2-8e79-6d10ecee5bee
# ╟─3176c1f0-c1fa-4c74-9133-f4371e584b6c
# ╟─f5566288-5ef4-495c-8442-a59f17f0814b
# ╟─f01bd868-a180-4327-98bf-de2298673523
# ╟─f8c9b0c8-3f81-441b-b673-f07ed2ae3e5a
# ╟─73a0d6a0-94a9-45b3-9db9-6cc0c845ab95
# ╟─3b87e4e9-81c9-48d9-9f58-e6271f975c40
# ╟─045e6ef0-84a8-4ad1-bc7c-0494599fd825
# ╟─8cd6c607-2094-4fa0-b1d8-445842ac5091
# ╟─fe6dc66f-2627-4c13-b32a-9751b2ff3a73
# ╟─fdd906ab-2224-4c26-9fb9-d9855bfda257
# ╟─65dcb2b0-1021-45a6-afe7-a95f6e139636
# ╟─69a4815d-1a84-400c-ae2b-4753e6c96abc
# ╟─0b1a0996-d98d-43a5-8d4e-3faa6ded79a8
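# ╟─0a1b2c3d-4e5f-46a7-88b9-c0d1e2f3a4b5
# ╟─1b2c3d4e-5f6a-47b8-89c0-d1e2f3a4b5c6
# ╟─2c3d4e5f-6a7b-48c9-80d1-e2f3a4b5c6d7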