Intro to AI
Search
 Search problem
 a state space
 a successor function (with actions, costs)
 a start state and a goal test
 Solution
A solution is a sequence of actions (a plan) which transforms the start state to a goal state
 State Space Graphs

Search Trees
 Search Strategy
 DepthFirst Search
 BreastFirst Search
 Iterative deepening
run a DFS with a depth limit
 Uniform Cost Search (BestFirst)
expand a cheapest node first
 A* Search
combine UCS and Greedy
Cost f(n) = g(n) + h(n) where g(n) is pathcost (backward) and h(n) is goal proximity (forward)

Admissibility :
heuristic h is admissible if 0 <= h(n) <= h*(n) := the true cost to a nearest goal

Consistency :
h(A)  h(C) <= cost(A to C)

Constraint Satisfaction Problems
 Search problem
 Initial state
 Successor function : assign a value to an unassigned variable
 Goal test
 Search Strategy
 Backtracking Search
DFS + variableordering + failonviolation
 One variable a time
 Check constraints as you go
 Forward Checking
 Keep track of domains for unassigned variables and cross off values that violate a constraint when added to the existing assignment
 Consistency
 Arc Consistency
An arc X > Y is consistent iff for every x in the tail there is some y in the head which could be assigned without violating a constraint, if there is no such y, delete x
 KConsistency
For each k nodes, any consistent assignment to k1 can be extended to the kth node.
 Strong KConsistency
Also K1, K2, …, 1 consistent
 Arc Consistency
 Ordering
 Minimum remaining values (MRV)
Choose the variable with the fewest legal left values in its domain
 Least Constraining Value
Choose the value that rules out the fewest values in the remaining variables
 Minimum remaining values (MRV)
 Subproblems
 Cutset
try all possibilities of several variables and solve the subproblems
 Beam Search
 Backtracking Search
Adversarial Search
Types
 ZeroSum Games
Agents have opposite utilities
 General Games
Agents have independent utilities
Strategy

MiniMax avec AlphaBeta Pruning
MAX > MIN > MAX > … > Utility
 alpha : MAX’s best utility on path to root
 beta : MIN’s best utility on path to root
 Algo
 If the state is a terminal state, return the state’s utility
 If the next agent is MAX, return the max utility of its successors : maxvalue(state, alpha, beta)
 If the next agent is MIN, return the min utility of its successors : minvalue(state, alpha, beta)

Problem Like DFS, it costs too much. So we can do depthlimited search (or iterative deepening, etc.) and replace terminal utilities with an evaluation function
In practice, the evaluation function is a weighted linear sum of features.

Expectimax Search
MAX > EXP > MAX > EXP > … > Utility
Values now reflect averagecase (expectimax) outcomes, not worstcase (minmax) outcomes.
 Algo
 If the state is a terminal state, return the state’s utility
 If the next agent is MAX, return the max utility of its successors : maxvalue(state)
 If the next agent is EXP, return the min utility of its successors : expvalue(state)
 Algo
Utility
Utility Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]
Rational Preferences

Orderability
A > B or A < B or A ~ B

Transitivity
A > B and B > C => A > C

Continuity
A > B > C => exists p such that [p, A; 1p,C] ~ B

Substitutability
A ~ B => [p, A; 1p,C] ~ [p, B; 1p,C]

Monotonicity
A > B => ( p >=q <=> [p, A; 1p,B] >= [q, A; 1q,B] )
Markov Decision Processes
Problem  Goal  Tech 

Known MDP (offline)  Compute  Value / Policy iteration 
Evaluate a fixed policy $\pi$  Policy evaluation  
Unknown MDP ModelBased  Compute  Value / Policy iteration on approx. MDP 
Evaluate a fixed policy $\pi$  Policy evaluation on approx MDP  
Unknown MDP ModelFree  Compute  Qlearning 
Evaluate a fixed policy $\pi$  Value learning 
Markov Decision Processes
“Markov” generally means that given the present state, the future and the past are independent
 State $s$ in $S$
 Action $a$ in $A$
 Transition function $T(s,a,s’)$
i.e., Probability that a from s leads to $s’$, $P(s’s,a)$
 Reward function $R(s,a,s’)$
 Start state
 (Terminal state)
Offline (T, R known)
state $s$ > take action $a$ > qstate $(s,a)$ > transition $(s,a,s’)$ > state $s’$
 Policy
In deterministic singleagent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal.
Expectimax computes the action for a single state only.
But for MDPs, we want an optimal policy which gives an action for each state and maximizes expected utility if followed.
$\pi^*(s)$ = optimal action from state $s$
 Utility (value)
Sum of (discounted) rewards
Value of rewards decay exponentially with parameter gamma in $(0,1]$

The value of a state $s$:
The expected utility starting in $s$ and acting optimally is

The value of a qstate (s,a):
The expected utility starting out having taken action a from state $s$ and (thereafter) acting optimally is

Algo
Bellman updates
 Value Iteration
Asynchronous Value Iteration :
Here we update every state in each iteration. Actually, any sequences of Bellman updates will converge if every state is visited infinitely often.
 Initialization $V_0(s) = 0$

Iteration

Policy Iteration
Repeats policy evaluation  policy improvement until policy converges
 Policy Evaluation
Compute the utility of states under a fixed policy
 Iteration
 Linear System Solver
 Policy Improvement (Extraction) (onestep lookahead)
Extract a policy by doing miniexpectimax from values
 Policy Evaluation
Reinforcement Learning (T or R unknown)
 Learn an empirical MDP model
 Solve the learned MDP
 Direct Evaluation
Compute values under a fixed policy $\pi$
 Act according to $\pi$
 Note rewards $R(s,a,s’)$ when we experience $(s,a,s’)$
 Average samples
Problems:
 Need to learn each state separately and it takes a long time
 It wastes information about state connections
 Temporal Difference Learning (Value learning)
Compute values under a fixed policy $\pi$

Act accoring to $\pi$ to receive a sample ($a = \pi(s)$)

Update $V(s)$
 $\text{sample} = R(s,a,s’) + \gamma V’(s)$
 $V(s) = V(s) + \alpha (\text{sample}  V(s))$
Problem:
 Can’t turn values into a new policy because we learn values instead of Qvalues


Q Learning
Instead of computing $V(s)$
We compute $Q(s)$

Act accoring to $\pi$ to receive a sample ($a = \pi(s)$)

Update $Q(s,a)$
 $\text{sample} = R(s,a,s’) + \gamma \max_a Q(s’,a’)$
 $Q(s,a) = Q(s,a) + \alpha (\text{sample}  Q(s,a))$
Problem:
 We need to keep a table of all Qvalues
Solution:

We can use a vector of features to describe a state
Often the featurebased policies that work well aren’t the ones that approximate V/Q best. Need to learn policies that maximize rewards, not the values that predict them.


Exploration

Acts randomly with probability eps (eps decreases over time)

Exploration functions
Takes a value estimate u and a visit count n, and returns an optimistic utility
$f(u, n) = u + \frac{k}{n}$
 Regular Qupdate : $\text{sample} = R(s,a,s’) + \gamma \max_a Q(s’,a’)$
 Modified Qupdate : $\text{sample} = R(s,a,s’) + \gamma \max_a f(Q(s’,a’), N(s’,a’))$

Probability
 Bayes’ Rule
 $P(x,y) = P(xy) P(y) = P(yx) P(x)$
 $P(xy) = P(yx) P(x) P^{1}(y)$

Conditional Independence
$X$ is conditionally independent of $Y$ given $Z$
 $P(x,yz) = P(xz)P(yz)$
 $P(xz,y) = P(xz)$

Markov Model
X1 > X2 > X3 > ...
$P(X_1,X_2,…,X_T) = P(X_1)P(X_2X_1)…P(X_TX_{T1})$
Hidden Markov Model
X1 > X2 > X3 > ...
  
E1 E2 E3
 State independent of all past states and all past evidence given the previous state
 Evidence is independent of all past states and all past evidence given the current state
But the evidence variables are not independent. They are correlated by hidden states.
 Past variables independent of future variables given the present
if $t_1 < t_2 < t_3$ then $X_{t_1}$ is conditionally independent of $X_{t_3}$ given $X_{t_2}$
Forward Algorithm
Track the distribution $B(X_t) = P_t(X_te_{1:t})$ over time

Initialization:
Start with $B_1(X)$, usually uniform

Passage of time:
 Observation:
Problem:
 When $X$ is too big or when $X$ is continue, we can’t store all $B(X)$
Viterbi Algorithm
For responding queries like most likely explanation :
Passage of time and observation: (use max instead of sum)
Particle Filtering
Use particles to present the probability distribution

Initialization:
Start with a distribution (usually uniform) and set $N$ particles randomly according to the distribution.

Passage of time:
Each particle is moved by sampling its next position from the transition model. (Each particle moves randomly according to the transition model $P (X’x)$)

Observation:
 Downweight samples based on the evidence.
 Resample the particles :
Choose from the weighted sample distribution $N$ times (i.e. draw with replacement).
Dynamic Bayes Nets

Initialization
Generate prior samples for the t=1 Bayes net

Passage of time
Sample a successor for each particle

Observation

Downweight
Weight each entire sample by the liklyhood of the evidence conditioned on the sample

Resample
Select prior samples (tuples of values) in proportion to their likelihood

Bayes Nets
Definitions

A set of nodes, one per variable X

A directed, acyclic graph

Local Markov property
Each variable is conditionally independent of its nondescendants given its parent variables.
This is equivalent to the following factorization definition:
Conditional Independence
Causual Chains
X > Y > Z
 X is not necessary independent of Z (active)
 X is independent of Z given Y (inactive)
Evidence along the chain “blocks” the influence
Common Cause
Y > X
> Z
 X is not necessary independent of Z (active)
 X is independent of Z given Y (inactive)
Observing the cause blocks influence between effects
Common Effect
X > Z
Y >
 X and Y are independent (inactive)
 X is not necessary independent of Y given Z (active)
Observing an effect activates influence between possible causes
X > P > ... > Z
Y >
 X is not necessary independent of Y given Z (active)
General case
Are X and Y conditionally independent given evidence variables Z ?
Consider all (undirected) paths from X to Y, if there is no active paths, then X and Y are conditionally independent ! If there is an active path, the independence is not guaranteed. (But it may be independent)
A path is active if each triple is active (independence not guaranteed ).
Topology limits distributions
Given some graph topology G, the graph structure guarantees certain (conditional) independences and only certain joint distributions can be encoded.
Inference
We want to know $P(Qe_1,\cdots,e_k)$
where all variables $X_1,\cdots,X_n$ can be divided into three types
 Evidence variables $E_1,\cdots,E_k = e_1,\cdots,e_k$
 Query variable $Q$
 Hidden variables $H_1,\cdots,H_r$
So $P(Qe_1,\cdots,e_k)=\sum\limits_{h_1,\cdots,h_r}P(Qh_1,\cdots,h_r,e_1,\cdots,e_k)$
Example
R > T > L
We have $P(R), P(TR), P(LT)$ and we want to know $P(L)$

Inference by enumeration

$P(R), P(TR) \rightarrow P(R,T)$

$P(R,T),P(LT) \rightarrow P(R,T,L) \rightarrow P(L)$


Variable elimination

$P(R), P(TR) \rightarrow P(R,T) \rightarrow P(T)$

$P(T),P(LT) \rightarrow P(T,L) \rightarrow P(L)$

General Variable Elimination
$P(QE_1=e_1,\cdots,E_k=e_k)$

Start with local CPTs (conditionaly probability table) which are instantiated by evidence

While there are still hidden variables (not query variable or evidence)
 Pick a hidden variable H
 Join all factors mentionubg H
 Eliminate (sum out) H

Join all remaining factors and normalize
The elimination ordering can greatly affect the size of the largest factor. And for polytrees (directed graph with no undirected cycles) we can always find an efficient ordering. For other graphs, we can try to choose a set of variables such that if removed only a polytree remains.
General Example
Given $P(B), P(E), P(AB,E), P(jA), P(mA)$, we want to know $P(Bj,m)$
 Choose A

Join A
$P(AB,E), P(jA), P(mA) \rightarrow P(j,m,AB,E)$

Eliminate A
$P(j,m,AB,E) \rightarrow P(j,mB,E)$

 Choose E

Join E
$P(E), P(j,mB,E) \rightarrow P(j,m,EB)$

Eliminate E
$P(j,m,EB) \rightarrow P(j,mB)$

 Choose B

Join B
$P(B), P(j,mB) \rightarrow P(j,m,B)$

Normalize
$P(j,m,B) \rightarrow P(Bj,m)$

Sampling
Why sampling ?
 Learning: get samples from a distribution we don’t know
 Inference: getting a sample is faster than computing the right answer
Prior Sampling

For $i=1:n$
Sample $x_i$ from $P(X_i  \text{Parents}(X_i))$

Return $(x_1,\cdots,x_n)$
Rejection Sampling
If there are evidences, we reject inconsistent samples.
If evidence is unlikely, rejects lots of samples.

For $i=1:n$
Sample $x_i$ from $P(X_i  \text{Parents}(X_i))$
If $x_i$ is inconsistent with evidence, return and no sample is generated

Return $(x_1,\cdots,x_n)$
Likelihood Weighting
Fix evidence variables and sample the rest
Evidence influences the choice of downstream variables, but not upstream ones (it isn’t more likely to get a value matching the evidence)
The weight can be very small and sum of weights over all samples is indicative of how many “effective” samples were obtained, so want high weight.
 $w=1$

For $i=1:n$

If $X_i = x_i$ is an evidence variable
$w = w \times P(x_i  \text{Parents}(X_i))$

else
Sample $x_i$ from $P(X_i  \text{Parents}(X_i))$

 Return $(x_1,\cdots,x_n), w$
Gibbs Sampling
Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods and MetropolisHastings is one of the more famous MCMC methods
 Fix evidences
 Initialize other variables randomly
 Repeat several times
 Choose a nonevidence variable $X$ randomly
 Resample $X$ from $P( X  \text{all other variables})$
In fact, only CPTs that have resampled variable need to be considered, and joined together
 Return $(x_1,\cdots,x_n)$
Value of Information
MEU = Maximum Expected Utility
VPI = Value of Perfect Information

Assume we have evidence $E = e$, if we act now the value is:
where $s$ is the parent of utility node and $a$ is the action.

Assume we now see that $E’ = e’$, if we act then the value is

BUT $E’$ is a random variable whose value is unknown, so we don’t know what $e’$ will be.
If $E’$ is revealed and then we act:

So the value of information $E’$: how much MEU goes up by revealing E’ first:
VPI Properties

Nonnegative
 Nonadditive
i.e. if we observe a same evidence twice, VPI doesn’t change

Orderindependent
 If the observed evidence is conditionally independent to the parents of utility given current evidences, then VPI = 0.
POMDP
Partially Observable Markov Decision Process
The agent cannot directly observe the underlying state.
Naïve Bayes
Classification
The idea is that each feature depends on the class.
To use Naïve Bayes, we need to calculate $P(Yf_1,\cdots,f_n)$ by using $P(y_k,f_1,\cdots,f_n)$.
Naïve Bayes for Text
Feature $W_i$ is the word at position $i$ and we assume that all positions share the same conditional probability $P(WY)$
To calculate a conditional probability $P(xy)$:

Maximum Likelihood:
$P(xy) = \frac{\text{count}(x,y)}{\text{count}(y)}$

Maximum Likelihood:
$P(xy) = \alpha \hat{P}(xy) + (1\alpha) \hat{P}(x)$

Laplace Smoothing:
$P(xy) = \frac{\text{count}(x,y)+k}{\text{count}(y)+kX}$
Perceptrons
Linear, Oneclass
Use $\text{sgn}(w^Tf(x))$ to classify instance
 $w=0$
 For each training instance
 $y=\text{sgn}(w^Tf(x))$

If $y$ is different from the exact label $y^*$
$w=w+y^*f(x)$
Linear, Multiclass
There are different weights $w_y$ and we use $\arg\max\limits_y {w_y}^Tf(x)$ to classify instance
 $w_y=0$ for all class
 For each training instance
 $y=\arg\max\limits_y {w_y}^Tf(x)$

If $y$ is different from the exact label $y^*$
$w_{y}=w_{y}f(x), \quad w_{y^*}=w_{y^*}+f(x)$
MIRA (Margin Infused Relaxed Algorithm)
To prevent overfitting
Try to minimize the change to weight
such that
If we consider change like:
problem becomes:
such that
and so
In practice, it’s bad to make updates that are too large so we can cap the maximum possible value of $\gamma$ with some constant $C$:
SVM (Support Vector Machines)
such that
The goal is to maximize the margin for all instances. Different from MIRA which optimize for an instance and try to minimize the size of change.
This algorithm (with kernel) can be used for online learning by using prime/dual problem.
Bordes, Antoine, et al. “Fast kernel classifiers with online and active learning.” Journal of Machine Learning Research 6.Sep (2005): 15791619.
Kernel Trick
For a perceptron, the weight is in fact a linear combination of features $f(x_i)$:
So $w_y^T f(x) = \sum\limits_i \alpha_{i,y} f(x_i)^Tf(x) = \sum\limits_i \alpha_{i,y} K(x_i,x)$
Kernels:
 Linear Kernel: $K(x_1,x_2)=x_1^Tx_2$
 Quadratic Kernel: $K(x_1,x_2)=(x_1^Tx_2+1)^2$
 RBF: $K(x_1,x_2)=\exp(\gammax_1x_2^2)$
 Discrete Kernel: e.g. string kernel