{"id":3071,"date":"2025-01-15T15:06:17","date_gmt":"2025-01-15T15:06:17","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=3071"},"modified":"2025-12-27T16:02:12","modified_gmt":"2025-12-27T16:02:12","slug":"a-noobs-eye-view-of-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2025\/01\/15\/a-noobs-eye-view-of-reinforcement-learning\/","title":{"rendered":"A noob&#8217;s-eye view of reinforcement learning"},"content":{"rendered":"\n<p>I recently completed the <a href=\"https:\/\/www.coursera.org\/specializations\/reinforcement-learning\">Coursera Reinforcement Learning Specialization<\/a>. These are my notes, still under construction, on some of what I learned. The course was based on Sutton and Barto&#8217;s freely available <a href=\"http:\/\/incompleteideas.net\/book\/the-book-2nd.html\">reinforcement learning book<\/a>, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in the comments.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Agents and Environments<\/h2>\n\n\n\n<p>What differentiates reinforcement learning from other forms of learning is the setting of an <strong>agent<\/strong> acting in an <strong>environment<\/strong><em>.<\/em> The agent generates actions from a <strong>policy<\/strong>. In response, the environment occasionally delivers <strong>rewards<\/strong> (which can be negative i.e. punishment). The agent searches for a policy that maximizes the total reward it receives now and into the future. Since a reward now is typically better than the same reward tomorrow, the quantity of interest is the <em>discounted<\/em> cumulative reward, called the <strong>return<\/strong> $$G_t = R_t + \\gamma R_{t+1} + \\gamma^2 R_{t+2} \\dots$$ The <strong>discount <\/strong>$\\gamma$ adjusts the time-horizon of the agent. A myopic agent that only cares about immediate reward has a discount of 0. 
A far-sighted agent accumulates all future rewards by setting its discount to 1. Because the environment and the agent are not necessarily deterministic, we care about the <strong><em>expected<\/em> return<\/strong>. The task of the reinforcement learning agent is then to <mark style=\"background-color:#9DFF20\" class=\"has-inline-color\">find the policy with the highest expected return:<\/mark><br>$$ \\max_\\pi\\, \\mathbb E \\, (R_t + \\gamma R_{t+1} + \\gamma^2 R_{t+2}\\dots).$$<\/p>\n\n\n\n<p>How do we do this?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Markov Decision Processes<\/h2>\n\n\n\n<p>To determine how to find an optimal policy, we first need to flesh out a model of the environment and the agent&#8217;s interactions with it. <\/p>\n\n\n\n<p>The first notion we need is that of <strong>state<\/strong>. This is meant to capture the properties of the agent and the environment relevant to the task. We will write the state at some $t$ as $s_t$, or $S_t$.<\/p>\n\n\n\n<p>The next notion we need is that of <strong>action<\/strong>. This describes what the agent did. It can be discrete, like moves in a chess game, or continuous, like steering a car. We will write the agent&#8217;s action at time $t$ as $a_t$ or $A_t$.<\/p>\n\n\n\n<p>After the agent acts at time $t$ the clock ticks to $t+1$ and two responses are generated by the <strong>environment<\/strong>. First, it delivers a <strong>reward<\/strong> (possibly 0) which we write as $r_{t+1}$, or $R_{t+1}$.<\/p>\n\n\n\n<p>Second, the state <strong>transitions<\/strong> to $S_{t+1}$, and the process repeats. This produces a history of interactions $(S_t, A_t, R_{t+1}), (S_{t+1}, A_{t+1}, R_{t+2}) \\dots$<\/p>\n\n\n\n<p>Rewards and transitions are often uncertain. Therefore, we can <strong>model<\/strong> the environment as a probability distribution on rewards and next-states. 
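<\/p>

<p>To make the interaction loop just described concrete, here is a minimal Python sketch of an agent acting in a toy environment and of the return computation. Everything in it (the tabular model <code>P<\/code>, the helper names) is hypothetical, invented for illustration rather than taken from the course:<\/p>

```python
import random

# A toy, hypothetical environment: P[s][a] lists (probability, next state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def step(s, a, rng):
    # Sample (S_{t+1}, R_{t+1}) from the environment model given S_t = s, A_t = a.
    outcomes = P[s][a]
    weights = [p for p, _, _ in outcomes]
    _, s_next, r = rng.choices(outcomes, weights=weights)[0]
    return s_next, r

def rollout(policy, s0=0, horizon=20, seed=0):
    # Generate the reward sequence R_1, R_2, ... by repeatedly acting and transitioning.
    rng = random.Random(seed)
    s, rewards = s0, []
    for _ in range(horizon):
        s, r = step(s, policy(s), rng)
        rewards.append(r)
    return rewards

def discounted_return(rewards, gamma=0.9):
    # G = R_1 + gamma R_2 + gamma^2 R_3 + ..., accumulated backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

<p>Running <code>discounted_return(rollout(policy), gamma)<\/code> for different discounts shows how $\gamma$ trades immediate against future reward.<\/p>

<p>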
This can depend on the current state, the action taken, and in principle, all previous interactions, $$ p(R_{t+1}, S_{t+1} | S_t, A_t, (S_{t-1}, A_{t-1}, R_{t}), (S_{t-2}, A_{t-2}, R_{t-1}) \\dots ).$$<\/p>\n\n\n\n<p>Carrying around an ever-growing history of interactions quickly becomes impractical. Therefore, we make the key simplifying <strong>Markov<\/strong> assumption that the reward received and the state transition only depend on the current state and action. That is, we model the environment as the probability distribution<br>$$  p(R_{t+1}, S_{t+1} | S_t, A_t).$$<\/p>\n\n\n\n<p>This framework of sequential decision-making within a Markovian environment is called a Markov Decision Process, or MDP:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"352\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1024x352.png\" alt=\"\" class=\"wp-image-3152\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1024x352.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-300x103.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-768x264.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1536x529.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2048x705.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Choosing actions: Policy <\/h2>\n\n\n\n<p>The <strong>policy<\/strong> describes the action the agent takes. In principle, this could depend on the entire history of interactions with the environment. But we make the simplifying assumption that the action the agent takes only depends on the current state. The policy then becomes a probability distribution over actions, given the current state. 
We typically write this as $$ \\pi(A_t | S_t).$$ Although this is a regular probability distribution, we use $\\pi$ instead of $p$ because (a) there are already so many $p$&#8217;s floating around, and (b) it distinguishes this particular probability distribution as the one that we&#8217;re going to be optimizing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How good is a <strong>state<\/strong>? The state value function<\/h2>\n\n\n\n<p>We&#8217;re interested in maximizing our expected return. For any policy, we can rank states by the expected return when starting at that state and acting under that policy. We encode this information in the <strong>state value function<\/strong>,<br>$$ V_\\pi(s) = \\mathbb E_\\pi (R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\dots | S_t = s).$$ The expectation is with respect to the policy, hence the subscript $\\pi$, and also with respect to the environment. <\/p>\n\n\n\n<p>The value of a state is then the expected return when starting from that state and pursuing the given policy.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How good is an <strong>action<\/strong>? The state-action value function<\/h2>\n\n\n\n<p>We don&#8217;t just want to know how good each state is, but <span style=\"text-decoration: underline;\">how to act <\/span>in whatever state we&#8217;re currently in. 
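<\/p>

<p>A quick sanity check on this definition: since $V_\pi$ is just an expectation, we can estimate it by averaging sampled returns. Below is a minimal sketch (not from the course) using a hypothetical environment that pays reward 1 every step, so the true value is known in closed form:<\/p>

```python
import random

GAMMA = 0.9

def mc_value(step, policy, s0, episodes=300, horizon=50, seed=0):
    # Estimate V_pi(s0) = E[ R_1 + GAMMA R_2 + GAMMA^2 R_3 + ... | S_0 = s0 ]
    # by averaging truncated discounted returns over sampled episodes.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, g, w = s0, 0.0, 1.0
        for _ in range(horizon):
            s, r = step(s, policy(s, rng), rng)
            g += w * r
            w *= GAMMA
        total += g
    return total / episodes

# Trivial hypothetical environment: every step pays reward 1 and stays put,
# so the true value is 1 / (1 - GAMMA) = 10, up to horizon truncation.
constant_step = lambda s, a, rng: (s, 1.0)
v_hat = mc_value(constant_step, lambda s, rng: 0, s0=0)
```

<p>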
How do we end up in states with high expected return?<\/p>\n\n\n\n<p>To determine that, let&#8217;s relate the state value function to the actions we can take: \\begin{align} V_\\pi(s) &amp;=\\mathbb E_\\pi (R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\dots | S_t = s)\\\\ &amp;= \\sum_a \\pi(a|s) \\sum_{s&#8217;, r} p(s&#8217;, r | a, s) \\left[r + \\gamma \\mathbb E_\\pi (R_{t+2} + \\gamma R_{t+3} + \\dots | S_{t+1} = s&#8217;)\\right]\\\\&amp;= \\sum_a \\pi(a|s) \\mathbb E_\\pi(R_{t+1} + \\gamma R_{t+2} +  \\gamma^2 R_{t+3} +  \\dots | S_t = s, A_t = a) \\end{align}<\/p>\n\n\n\n<p>This says that the value of the current state is the expected return after taking an action, averaged over actions according to the policy. We encode the expected return after taking a given action in a given state into the <strong>state-action value function<\/strong>. It gives us the expected return starting from any state-action pair,<br>$$ Q_\\pi(s, a) = \\mathbb E_\\pi (R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\dots | S_t = s, A_t = a).$$<\/p>\n\n\n\n<p>This is the famous Q-function of Q-learning and Deep Q Networks. Why they call it Q, I don&#8217;t know!<\/p>\n\n\n\n<p>The two value functions are related through the policy:<br>$$ V_\\pi(s) = \\sum_a \\pi(a|s) Q_\\pi(s, a).$$<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Dynamic Programming<\/h2>\n\n\n\n<p>If we had the state-action value function, we would have a simple recipe for how to act: pick the action that gives the highest return!<br>$$ A_t = \\argmax_a \\, Q(S_t, a).$$ <\/p>\n\n\n\n<p>But how do we actually compute $Q$? Much of reinforcement learning is about addressing this question.<\/p>\n\n\n\n<p>If we have access to a model of the environment, then we can do so, at least in principle, by relating the value of the current state to that of the next state. 
<\/p>\n\n\n\n<p>Let&#8217;s write the state value function again: \\begin{align} V_\\pi(s) &amp;=\\mathbb E_\\pi (R_{t+1} + \\gamma R_{t+2} + \\gamma^2 R_{t+3} + \\dots | S_t = s)\\\\ &amp;= \\sum_{a, s&#8217;, r} \\pi(a|s) p(s&#8217;, r | a, s) \\left[r + \\gamma \\mathbb E_\\pi (R_{t+2} + \\gamma R_{t+3} + \\dots | S_{t+1} = s&#8217;)\\right]\\end{align}<\/p>\n\n\n\n<p>Notice that the last term is the state value function for the next state, \\begin{align} V_\\pi(s) &amp;=  \\sum_a \\pi(a|s) \\sum_{s&#8217;, r} p(s&#8217;, r | a, s) \\left[r + \\gamma \\color{red} V_\\pi(s&#8217;) \\color{black} \\right]\\end{align} <\/p>\n\n\n\n<p>This is the <strong>Bellman equation<\/strong> for the state value function. <\/p>\n\n\n\n<p>There is also a similar Bellman equation for the state-action value function. $$ \\begin{align} Q_\\pi(s,a) &amp;= \\sum_{s&#8217;, r} p(s&#8217;, r | a, s)\\left[r + \\gamma V_\\pi(s&#8217;)\\right]\\\\&amp;=\\sum_{s&#8217;, r} p(s&#8217;, r | a, s)\\left[r + \\gamma \\sum_{a&#8217;} \\pi(a&#8217;|s&#8217;) \\color{red} Q_\\pi(s&#8217;, a&#8217;) \\color{black}\\right]\\end{align}$$<\/p>\n\n\n\n<p>The Bellman equations show how determining the value of the current state\/action depends on determining the value of subsequent states\/actions. This is a hallmark of <strong>dynamic programming<\/strong>, in which larger problems are solved by combining solutions of subproblems of the same type. <\/p>\n\n\n\n<p>Before we see how dynamic programming can be used to determine the value functions, notice that the Bellman equations are linear in the quantities we&#8217;re after. So we could solve them using standard techniques for linear systems. <\/p>\n\n\n\n<p>However, dynamic programming provides alternative methods that are efficient and that extend to situations where we don&#8217;t have a model of the environment. 
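<\/p>

<p>To see the linearity concretely, here is a sketch that solves the Bellman equation for $V_\pi$ directly as a linear system on a hypothetical two-state MDP with a fixed policy; the matrix <code>P_pi<\/code> and rewards <code>r_pi<\/code> are invented for illustration:<\/p>

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state MDP under a fixed policy: P_pi[s, s2] is the
# policy-averaged transition matrix, r_pi[s] the expected one-step reward.
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([0.0, 1.0])

# The Bellman equation V = r_pi + gamma P_pi V is linear in V,
# so we can solve (I - gamma P_pi) V = r_pi directly:
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

<p>The same computation works for any finite MDP, though the direct solve costs cubically in the number of states, which is one reason the iterative methods matter.<\/p>

<p>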
<\/p>\n\n\n\n<p>These work by simply looping through states and applying the Bellman equations above as updates. For example, iterative <strong>policy evaluation<\/strong> sets the state value function by sweeping through states:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"175\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1-1024x175.png\" alt=\"\" class=\"wp-image-3235\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1-1024x175.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1-300x51.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1-768x131.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-1.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>and continuing until the values are stable.<\/p>\n\n\n\n<p>Once we have the values for each state, we can improve our policy by taking the greedy action in each state. 
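<\/p>

<p>The greedy step can be sketched as follows; the tabular model <code>P[s][a]<\/code> of (probability, next state, reward) triples is hypothetical, invented for illustration:<\/p>

```python
gamma = 0.9
# Hypothetical 2-state, 2-action model: P[s][a] lists (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def greedy(V):
    # pi(s) = argmax_a  sum_{s2, r} p(s2, r | s, a) (r + gamma V(s2))
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

# With state 1 valued highly, the greedy policy steers both states toward it.
pi = greedy({0: 0.0, 1: 10.0})
```

<p>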
This is called <strong>policy improvement<\/strong>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"177\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2-1024x177.png\" alt=\"\" class=\"wp-image-3242\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2-1024x177.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2-300x52.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2-768x133.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-2.png 1258w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Policy iteration<\/strong> chains these two steps together <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Policy <span style=\"text-decoration: underline;\">evaluation<\/span> to estimate $V_\\pi$ for the current policy;<\/li>\n\n\n\n<li>Policy <span style=\"text-decoration: underline;\">improvement<\/span> to improve the policy,<\/li>\n<\/ol>\n\n\n\n<p>until policy improvement no longer changes the policy. 
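<\/p>

<p>The two-step loop can be sketched as follows, again on a hypothetical two-state model with all names and numbers invented for illustration:<\/p>

```python
gamma = 0.9
# Hypothetical tabular model: P[s][a] lists (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def evaluate(pi, sweeps=400):
    # Iterative policy evaluation for a deterministic policy pi[s].
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        for s in P:
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
    return V

def improve(V):
    # Greedy policy improvement with respect to V.
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

pi = {0: 0, 1: 0}        # an arbitrary starting policy
while True:              # policy iteration: evaluate, then improve
    V = evaluate(pi)
    pi_new = improve(V)
    if pi_new == pi:     # policy is stable: optimal policy and values found
        break
    pi = pi_new
```

<p>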
At that point, our value function is already based on picking the best action at each step, so we&#8217;ve found the optimal policy and value function.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"618\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-4-1024x618.png\" alt=\"\" class=\"wp-image-3247\" style=\"width:379px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-4-1024x618.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-4-300x181.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-4-768x463.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/01\/image-4.png 1522w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In policy iteration we start with value functions based on some suboptimal policy, and gradually update them towards their optimal values. What if we could start with the optimal value function right off the bat?<\/p>\n\n\n\n<p>To do so, we note that the optimal value function must correspond to picking the optimal action at each state: $$ V^*(s) = \\max_a \\sum_{s&#8217;,r} p(s&#8217;, r|s, a) (r + \\gamma V^*(s&#8217;)).$$ This is the <strong>Bellman optimality equation<\/strong> for the state value function. <\/p>\n\n\n\n<p>In <strong>value iteration<\/strong>, we apply the Bellman optimality equation as an update until our state value functions have converged to their optimal values. We then set our policy as the greedy one at each state.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Sampling, or what to do without a model<\/h1>\n\n\n\n<p>Dynamic programming has relied on using a model of the environment to determine optimal policies and value functions. As such, it is an example of a <strong>model-based<\/strong> approach. 
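<\/p>

<p>As a recap, value iteration can be sketched like so, once more on a hypothetical two-state model invented for illustration:<\/p>

```python
gamma = 0.9
# Hypothetical tabular model: P[s][a] lists (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def q(V, s, a):
    # One-step lookahead: sum_{s2, r} p(s2, r | s, a) (r + gamma V(s2))
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

V = {s: 0.0 for s in P}
for _ in range(500):
    # The Bellman optimality equation applied as a (synchronous) update:
    V = {s: max(q(V, s, a) for a in P[s]) for s in P}

# Read off the greedy policy with respect to the converged values:
pi = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
```

<p>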
What if we don&#8217;t have a model?<\/p>\n\n\n\n<p><em><strong>Model-free<\/strong><\/em> approaches use interactions with the environment itself to approximate it. They do so by noting that observed state transitions and rewards are <em>samples<\/em> from the probability distribution that describes the environment. <br>$$ (S_{t+1}, R_{t+1}) \\sim p(S_{t+1}, R_{t+1} | S_t, A_t).$$ By replacing the probabilities in dynamic programming with samples, we arrive at the most common reinforcement learning algorithms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">TD Learning<\/h3>\n\n\n\n<p>Temporal difference (TD) learning is a model-free algorithm for estimating the value function. To derive it, we start with the Bellman equation for the state value,<br>$$ V(S_t) = \\sum_a \\pi(a|S_t) \\sum_{s&#8217;, r} p(s&#8217;,r|S_t, a)(r + \\gamma V(s&#8217;)).$$<\/p>\n\n\n\n<p>We then replace the sum over actions with the action we actually took, and the average over next states and rewards with what we actually observed,<br>$$ \\pi(a|S_t) \\approx \\delta(a - A_t), \\quad p(S_{t+1}, R_{t+1}|S_t, A_t) \\approx \\delta(s - S_{t+1}) \\delta(r - R_{t+1}),$$ and get <br>$$ V(S_t) \\approx R_{t+1} + \\gamma V(S_{t+1}).$$<\/p>\n\n\n\n<p>TD then simply updates the value function towards this target, damped by a learning rate $\\alpha$:<br>$$ \\Delta V(S_t) = \\alpha (\\underbrace{R_{t+1} + \\gamma V(S_{t+1}) - V(S_t)}_{\\text{TD error}}).$$<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">SARSA<\/h3>\n\n\n\n<p>The same approach can be applied to update Q-values. 
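<\/p>

<p>As a concrete illustration of the TD(0) update above, here is a sketch on a hypothetical three-state corridor in which the (fixed) policy and the dynamics are folded into one deterministic step:<\/p>

```python
gamma, alpha = 0.9, 0.1

# Hypothetical corridor: the agent always moves right and receives
# reward 1 on entering state 2, which is terminal (its value stays 0).
TERMINAL = 2
V = [0.0, 0.0, 0.0]

for _ in range(2000):          # many episodes
    s = 0
    while s != TERMINAL:
        s_next = s + 1         # policy and dynamics combined (deterministic)
        r = 1.0 if s_next == TERMINAL else 0.0
        # TD(0): nudge V(S_t) toward the target R_{t+1} + gamma V(S_{t+1})
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

<p>On this chain the updates converge to $V(1) = 1$ and $V(0) = \gamma$, as the Bellman equation demands.<\/p>

<p>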
We again start with the Bellman equation for state-action values, \\begin{align} Q(S_t, A_t) &amp;= \\sum_{s&#8217;, r} p(s&#8217;,r|S_t, A_t)(r + \\gamma \\sum_{a&#8217;} \\pi(a&#8217;|s&#8217;)Q(s&#8217;, a&#8217;))\\\\ &amp;\\approx (R_{t+1} + \\gamma Q(S_{t+1}, A_{t+1})).\\end{align}<\/p>\n\n\n\n<p>Just like TD learning, SARSA then performs a damped update of the Q-values towards their targets: $$ \\Delta Q(S_t, A_t) = \\alpha (R_{t+1} + \\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)).$$<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Expected SARSA<\/h3>\n\n\n\n<p>In Expected SARSA we again start with the Bellman equation<br>\\begin{align} Q(S_t, A_t) &amp;= \\sum_{s&#8217;, r} p(s&#8217;,r|S_t, A_t)(r + \\gamma \\sum_{a&#8217;} \\pi(a&#8217;|s&#8217;)Q(s&#8217;, a&#8217;))\\end{align} and note that the last term is an expectation over actions in the next state. So we modify our target to use this expectation over actions, given whatever state we arrived at:<br>$$ \\Delta Q(S_t, A_t) = \\alpha (R_{t+1} + \\gamma \\sum_{a&#8217;} \\pi(a&#8217;|S_{t+1}) Q(S_{t+1}, a&#8217;) - Q(S_t, A_t)).$$ This improves performance by reducing the variability caused by random action selection, though it comes at a higher computational cost.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Q-Learning<\/h3>\n\n\n\n<p>Finally, in Q-learning, we use the optimal Q-values as our target. The Bellman optimality equation for state-action values is<br>$$ Q^*(S_t, A_t) = \\sum_{s&#8217;, r} p(s&#8217;, r | S_t, A_t)(r + \\gamma \\max_{a&#8217;} Q^*(S_{t+1}, a&#8217;)),$$ so we modify our target accordingly:<\/p>\n\n\n\n<p>$$ \\Delta Q(S_t, A_t) = \\alpha (R_{t+1} + \\gamma \\max_{a&#8217;} Q(S_{t+1},a&#8217;) - Q(S_t, A_t)).$$<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently completed the Coursera Reinforcement Learning Specialization. These are my notes, still under construction, on some of what I learned. 
The course was based on Sutton and Barto&#8217;s freely available reinforcement learning book, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,30],"tags":[28,94],"class_list":["post-3071","post","type-post","status-publish","format-standard","hentry","category-blog","category-notes-blog","tag-notes","tag-reinforcement-learning"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3071","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=3071"}],"version-history":[{"count":202,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3071\/revisions"}],"predecessor-version":[{"id":3281,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3071\/revisions\/3281"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=3071"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=3071"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=3071"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}