An MDP agent constructs a policy from the MDP once, and then uses that policy repeatedly to take action. The ALGORITHM keyword specifies the algorithm that is used to create the policy; don't confuse it with the PROGRAM keyword, which decides what actions to take.
type (total-reward)
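As a rough illustration of that behaviour, here is a minimal sketch, not the AIMA agent code: an agent program built as a closure over a policy computed once. The sketch-* name, the policy table, and the :no-op fallback are illustrative assumptions; the percept layout (state reward terminalp) is the one described below.

;;; Illustrative sketch only: an agent program as a closure over a policy
;;; computed once from the MDP.  POLICY is a hash table mapping
;;; state -> action; the percept is assumed to be (state reward terminalp).
(defun make-sketch-policy-agent (policy)
  (lambda (percept)
    (let ((state (first percept)))
      (gethash state policy :no-op))))   ; fall back to a no-op if the state is unknown

;; Usage:
;; (let ((p (make-hash-table :test #'equal)))
;;   (setf (gethash 'a p) :right)
;;   (funcall (make-sketch-policy-agent p) '(a 0.0 nil)))   ; => :RIGHT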
File uncertainty/domains/mdp.lisp
Definitions for Markov decision processes (MDPs).
An MDP is defined by initial state, transition model, rewards, and
distinguished terminal states. Model and rewards are hash tables
indexed by state (after application of the hash-key function).
The entries in the model are alists keyed by action; each action is
associated with an action model: basically a list of transitions.
Markov chains (i.e., stochastic processes with no distinct agent)
can be defined by allowing only a no-op action in the MDP.
type (initial-state model rewards terminal-states hash-key name)
type (transitions times-executed)
type (destination probability times-achieved)
Returns the transitions resulting from executing action a in state s according to model M.
function (s m)
Returns the list of actions feasible in state s according to model M.
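To make the model layout above concrete, here is a small sketch, not the AIMA source: the state and action names, the sketch-* function names, and the simplified transition representation are assumptions for illustration.

;;; Sketch of the layout described above: a hash table indexed by state,
;;; each entry an alist keyed by action, each action paired with a list of
;;; (destination probability) transitions.  Names are invented.
(defparameter *sketch-model* (make-hash-table :test #'equal))

;; From state A, :right reaches B with probability 0.8 and stays in A with
;; probability 0.2; allowing only :no-op would give a plain Markov chain.
(setf (gethash 'a *sketch-model*)
      '((:right . ((b 0.8) (a 0.2)))
        (:no-op . ((a 1.0)))))

(defun sketch-transitions (a s m)
  "Transitions for action A in state S under model M."
  (cdr (assoc a (gethash s m))))

(defun sketch-actions (s m)
  "Actions feasible in state S under model M."
  (mapcar #'car (gethash s m)))

;; (sketch-actions 'a *sketch-model*)            => (:RIGHT :NO-OP)
;; (sketch-transitions :right 'a *sketch-model*) => ((B 0.8) (A 0.2))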
File uncertainty/environments/mdp.lisp
An MDP-environment is driven by an MDP (Markov Decision Process), which (probabilistically) says what state to transition to for each action.
type (state reward terminalp)
A percept gives the current state, the reward received, and whether it is a terminal state.
method ((env mdp-environment) agent)
The percept is the current state, the reward, and whether this is terminal.
method ((env mdp-environment))
We update by transitioning to a new state. When we hit a terminal state, we restart in the initial state (until there are no more epochs left).
method ((env mdp-environment) agent)
Return a number saying how well this agent is doing.
method ((env mdp-environment))
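A rough sketch of the percept/update cycle just described, under the same simplified model representation used above; the sketch-* names and argument conventions are assumptions, this is not the AIMA environment code, and epochs are not modelled.

;;; Sample a successor according to the transition probabilities, build the
;;; (state reward terminalp) percept, and restart from the initial state
;;; when a terminal state is reached.
(defun sketch-sample-transition (transitions)
  "TRANSITIONS is a list of (destination probability); sample one destination."
  (let ((r (random 1.0)))
    (dolist (tr transitions (first (car (last transitions))))
      (decf r (second tr))
      (when (<= r 0) (return (first tr))))))

(defun sketch-step (state action model rewards terminals initial-state)
  "One environment step: returns the next environment state and the percept."
  (let* ((trans (cdr (assoc action (gethash state model))))
         (next (sketch-sample-transition trans))
         (terminalp (member next terminals)))
    (values (if terminalp initial-state next)   ; restart on termination
            (list next (gethash next rewards 0) (and terminalp t)))))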
File uncertainty/algorithms/dp.lisp
Given an environment model M, value iteration determines the values of states U. The basic equation is U(i) <- r(i) + max_a sum_j M(a,i,j)U(j), where U(j) MUST be the old value, not the new one.
function (mdp &optional uold &key epsilon)
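An illustrative value-iteration sketch following that update, under the assumed simplified model layout (state -> alist of (action . list of (destination probability))). The sketch-* name is invented and this is not the AIMA implementation.

;;; Implements U(i) <- r(i) + max_a sum_j M(a,i,j) U(j), always reading the
;;; old utilities and writing a fresh table, as required above.
(defun sketch-value-iteration (model rewards &key (epsilon 1e-6))
  "MODEL and REWARDS are hash tables keyed by state; returns a utility table."
  (let ((u (make-hash-table :test #'equal)))
    (loop
      (let ((delta 0.0)
            (new (make-hash-table :test #'equal)))
        (maphash
         (lambda (s r)
           (let* ((acts (gethash s model))
                  (best (if acts
                            (loop for (nil . trans) in acts
                                  maximize (loop for (dest p) in trans
                                                 sum (* p (gethash dest u 0.0))))
                            0.0)))          ; terminal or sink state: no lookahead
             (setf (gethash s new) (+ r best))
             (setf delta (max delta (abs (- (gethash s new) (gethash s u 0.0)))))))
         rewards)
        (setf u new)
        (when (< delta epsilon) (return u))))))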
A state is a sink if there are no actions that can lead to another state. Sinks can arise by accident during reinforcement learning of an environment model. Because they cause infinite loops, they must be detected.
function (s m)
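A sketch of that sink test under the same assumed model layout (not the AIMA code):

;;; True if no recorded action can move S to a different state; a state with
;;; no recorded actions at all counts as a sink.
(defun sketch-sink-p (s model)
  (loop for (nil . trans) in (gethash s model)
        always (loop for (dest nil) in trans
                     always (equal dest s))))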
Given an initial policy P and initial utilities U, calculate the optimal policy. Do this by value determination alternating with policy update.
function (mdp &optional u)
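A compact sketch of that alternation, not the AIMA code: it evaluates the current policy with a fixed number of sweeps rather than an epsilon test, uses the same assumed model layout, and assumes REWARDS covers every state that appears in MODEL.

;;; Alternate value determination for the current policy with a greedy
;;; policy update, until the policy stops changing.
(defun sketch-policy-iteration (model rewards &key (sweeps 100))
  (let ((u (make-hash-table :test #'equal))
        (policy (make-hash-table :test #'equal)))
    (flet ((q (a s uold)                     ; expected utility of doing A in S
             (loop for (dest p) in (cdr (assoc a (gethash s model)))
                   sum (* p (gethash dest uold 0.0)))))
      ;; initial policy: the first action listed for each state
      (maphash (lambda (s acts) (setf (gethash s policy) (car (first acts)))) model)
      (loop
        ;; value determination for the fixed policy:
        ;; U(i) <- r(i) + sum_j M(P(i),i,j) U(j), reading old values each sweep
        (dotimes (k sweeps)
          (let ((new (make-hash-table :test #'equal)))
            (maphash (lambda (s r)
                       (setf (gethash s new)
                             (+ r (if (gethash s model)
                                      (q (gethash s policy) s u)
                                      0.0))))
                     rewards)
            (setf u new)))
        ;; greedy policy update; stop when no state changes its action
        (let ((changed nil))
          (maphash (lambda (s acts)
                     (let* ((best (car (first acts)))
                            (best-q (q best s u)))
                       (loop for (a . nil) in (rest acts)
                             for qa = (q a s u)
                             when (> qa best-q) do (setf best a best-q qa))
                       (unless (eql best (gethash s policy))
                         (setf (gethash s policy) best changed t))))
                   model)
          (unless changed (return (values policy u))))))))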
Given a fixed policy and a model, calculate the value of each state. This version does it by an iterative process similar to value iteration. The basic equation is U(i) <- r(i) + sum_j M(P(i),i,j)U(j), where U(j) MUST be the old value, not the new one. A better alternative is to set up the value equations and solve them using matrix methods.
function (p uold m r &key epsilon)
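An iterative value-determination sketch along those lines, again under the assumed simplified representation; the sketch-* name is invented and the argument order only loosely mirrors the signature above.

;;; P: hash table state -> action; UOLD: starting utility table (may be empty).
;;; Each sweep reads only the old utilities, as required above.
(defun sketch-value-determination (p uold model rewards &key (epsilon 1e-6))
  (loop
    (let ((delta 0.0)
          (new (make-hash-table :test #'equal)))
      (maphash
       (lambda (s r)
         (let ((trans (cdr (assoc (gethash s p) (gethash s model)))))
           (setf (gethash s new)
                 (+ r (loop for (dest prob) in trans
                            sum (* prob (gethash dest uold 0.0)))))
           (setf delta (max delta (abs (- (gethash s new) (gethash s uold 0.0)))))))
       rewards)
      (setf uold new)
      (when (< delta epsilon) (return uold)))))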
Compute the optimal policy given U and M.
function (u m r)
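A greedy policy-extraction sketch: for each state, pick the action that maximizes sum_j M(a,i,j) U(j). The reward argument of the real function is not needed for the argmax here and is omitted; the name and representation are assumptions.

(defun sketch-optimal-policy (u model)
  "Return a hash table mapping each state in MODEL to its greedy action under U."
  (let ((policy (make-hash-table :test #'equal)))
    (maphash
     (lambda (s acts)
       (let (best (best-q most-negative-single-float))
         (loop for (a . trans) in acts
               for q = (loop for (dest p) in trans sum (* p (gethash dest u 0.0)))
               when (> q best-q) do (setf best a best-q q))
         (setf (gethash s policy) best)))
     model)
    policy))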
The following functions select actions in particular states.
Pick a random action.
function (state p)
Pick the currently best action with tie-breaking.
function (state u m r)
Simply pick a currently best action deterministically.
function (state u m r)
Q(a,s) is the value of doing a in s, calculated by averaging over the utilities of the possible outcomes. Used in several update equations.
function (action state u m r)
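A sketch of the Q computation and of the tie-breaking selection described above, under the same assumed model layout. The reward argument from the real signatures is omitted in this simplification, and the sketch-* names are invented.

;;; Q(a,s) as described above: the expected utility of the outcomes of ACTION.
(defun sketch-q-value (action state u model)
  (loop for (dest p) in (cdr (assoc action (gethash state model)))
        sum (* p (gethash dest u 0.0))))

;;; Pick an action with maximal Q(a,s), breaking ties at random.
(defun sketch-max-choice (state u model)
  (let ((best '()) (best-q nil))
    (loop for (a . nil) in (gethash state model)
          for q = (sketch-q-value a state u model)
          do (cond ((or (null best-q) (> q best-q)) (setf best (list a) best-q q))
                   ((= q best-q) (push a best))))
    (when best (nth (random (length best)) best))))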
File uncertainty/algorithms/stats.lisp
The policy used by the agent in acting.
variable
The policy loss of a utility function U for an MDP is defined as the difference in utility between the corresponding policy and the optimal policy, for the agent's current state. Calculate it using value determination with respect to the current policy.
function (mdp u)
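A minimal policy-loss sketch, assuming the two utility tables (for the agent's current policy and for the optimal policy) have already been computed, e.g. by value determination and value iteration as sketched above; not the AIMA code.

;;; Loss at the current state: optimal utility minus the utility achieved by
;;; the agent's current policy.
(defun sketch-policy-loss (current-state u-policy u-optimal)
  (- (gethash current-state u-optimal 0.0)
     (gethash current-state u-policy 0.0)))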