Intro to AI


Constraint Satisfaction Problems

Adversarial Search


  1. Zero-Sum Games

    Agents have opposite utilities

  2. General Games

    Agents have independent utilities


  1. Minimax with Alpha-Beta Pruning

    MAX -> MIN -> MAX -> … -> Utility

    • alpha : MAX’s best utility on path to root
    • beta : MIN’s best utility on path to root
    • Algo
      • If the state is a terminal state, return the state’s utility
      • If the next agent is MAX, return the max utility of its successors : max-value(state, alpha, beta)
      • If the next agent is MIN, return the min utility of its successors : min-value(state, alpha, beta)
    • Problem : like DFS, exploring the whole game tree costs too much. So we can do depth-limited search (or iterative deepening, etc.) and replace terminal utilities with an evaluation function

      In practice, the evaluation function is a weighted linear sum of features.
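The recursion above can be sketched on an explicit game tree. This is a minimal illustration, not any course's reference code; the `tree` (state -> children) and `utility` (leaf -> value) dictionaries below are made up.

```python
# Minimax with alpha-beta pruning on an explicit game tree.
def alpha_beta(state, tree, utility, maximizing,
               alpha=float("-inf"), beta=float("inf")):
    if state not in tree:                # terminal state: return its utility
        return utility[state]
    if maximizing:                       # MAX node: raise alpha
        v = float("-inf")
        for child in tree[state]:
            v = max(v, alpha_beta(child, tree, utility, False, alpha, beta))
            if v >= beta:                # MIN above will never allow this branch
                return v
            alpha = max(alpha, v)
        return v
    else:                                # MIN node: lower beta
        v = float("inf")
        for child in tree[state]:
            v = min(v, alpha_beta(child, tree, utility, True, alpha, beta))
            if v <= alpha:               # MAX above will never allow this branch
                return v
            beta = min(beta, v)
        return v

# Toy tree: MAX root -> two MIN nodes -> leaf utilities (made-up numbers).
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
utility = {"a1": 3, "a2": 12, "b1": 2, "b2": 8}
root_value = alpha_beta("root", tree, utility, True)
```

On this tree the second child of "b" is pruned: once "b1" yields 2, MIN's value can only drop below MAX's alpha of 3.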

  2. Expectimax Search

    MAX -> EXP -> MAX -> EXP -> … -> Utility

    Values now reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes.

    • Algo
      • If the state is a terminal state, return the state’s utility
      • If the next agent is MAX, return the max utility of its successors : max-value(state)
      • If the next agent is EXP, return the expected utility of its successors : exp-value(state)
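A minimal expectimax sketch, assuming uniform chance nodes at each EXP level (the tree and leaf utilities below are made up for illustration):

```python
# Expectimax: MAX nodes pick the best child, EXP nodes average their children.
def expectimax(state, tree, utility, node_type):
    if state not in tree:                       # terminal: return its utility
        return utility[state]
    children = [expectimax(c, tree, utility,
                           "exp" if node_type == "max" else "max")
                for c in tree[state]]
    if node_type == "max":                      # MAX node: best child
        return max(children)
    return sum(children) / len(children)        # EXP node: uniform average

tree = {"root": ["l", "r"], "l": ["l1", "l2"], "r": ["r1", "r2"]}
utility = {"l1": 0.0, "l2": 10.0, "r1": 4.0, "r2": 8.0}
value = expectimax("root", tree, utility, "max")   # max(avg(0,10), avg(4,8))
```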


Utility Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]

Rational Preferences

Markov Decision Processes

| Problem | Goal | Technique |
|---|---|---|
| Known MDP (offline) | Compute \(V^*,Q^*,\pi^*\) | Value / policy iteration |
| Known MDP (offline) | Evaluate a fixed policy $\pi$ | Policy evaluation |
| Unknown MDP (model-based) | Compute \(V^*,Q^*,\pi^*\) | Value / policy iteration on the approx. MDP |
| Unknown MDP (model-based) | Evaluate a fixed policy $\pi$ | Policy evaluation on the approx. MDP |
| Unknown MDP (model-free) | Compute \(V^*,Q^*,\pi^*\) | Q-learning |
| Unknown MDP (model-free) | Evaluate a fixed policy $\pi$ | Value learning |

Markov Decision Processes

“Markov” generally means that given the present state, the future and the past are independent

Offline (T, R known)

state $s$ -> take action $a$ -> q-state $(s,a)$ -> transition $(s,a,s')$ -> state $s'$


Bellman updates

  1. Value Iteration

    In standard value iteration, we update every state in each iteration.

    Asynchronous value iteration : states can be updated in any order; any sequence of Bellman updates converges as long as every state is updated infinitely often.

    • Initialization $V_0(s) = 0$
    • Iteration

      \[V_{k+1}(s) = \max_a Q_k(s,a) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]\]
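The iteration above can be sketched on a tiny made-up two-state MDP; the transition model `T[(s, a)] -> [(next_state, prob, reward), ...]`, the rewards, and `gamma` are all assumptions for illustration.

```python
# Value iteration on a tiny two-state MDP.
gamma = 0.9
actions = {"A": ["go", "stay"], "B": ["stay"]}
T = {
    ("A", "go"):   [("B", 1.0, 5.0)],
    ("A", "stay"): [("A", 1.0, 1.0)],
    ("B", "stay"): [("B", 1.0, 0.0)],   # B is absorbing with no further reward
}

V = {s: 0.0 for s in actions}
for _ in range(100):
    # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]
    V = {s: max(sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                for a in actions[s])
         for s in actions}
```

Staying at A forever earns $1/(1-\gamma) = 10$, which beats the one-shot reward of 5 for moving to the absorbing state B, so the values converge to $V(A)=10$, $V(B)=0$.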
  2. Policy Iteration

    Alternates policy evaluation and policy improvement until the policy converges

    • Policy Evaluation

      Compute the utility of states under a fixed policy

      \[V(s) = \sum_{s'} T(s,\pi(s),s') [R(s,\pi(s),s') + \gamma V(s')]\]
      • Iteration
      • Linear System Solver
    • Policy Improvement (Extraction) (one-step lookahead)

      Extract a policy by doing mini-expectimax from values

      \[\pi(s) = \arg\max_a Q(s,a) = \arg\max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V(s')]\]
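Both steps can be sketched on the same style of tiny made-up MDP as above. Policy evaluation here uses repeated Bellman updates rather than a linear-system solver; all numbers are assumptions.

```python
# Policy iteration: evaluate a fixed policy, then improve it, until stable.
gamma = 0.9
actions = {"A": ["go", "stay"], "B": ["stay"]}
T = {("A", "go"):   [("B", 1.0, 5.0)],
     ("A", "stay"): [("A", 1.0, 1.0)],
     ("B", "stay"): [("B", 1.0, 0.0)]}

def q_value(s, a, V):
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])

pi = {"A": "go", "B": "stay"}               # arbitrary initial policy
while True:
    # Policy evaluation: iterate the fixed-policy Bellman update.
    V = {s: 0.0 for s in actions}
    for _ in range(200):
        V = {s: q_value(s, pi[s], V) for s in actions}
    # Policy improvement: one-step lookahead on the evaluated values.
    new_pi = {s: max(actions[s], key=lambda a: q_value(s, a, V))
              for s in actions}
    if new_pi == pi:                        # policy converged
        break
    pi = new_pi
```

After one improvement step the policy at A switches from "go" (value 5) to "stay" (value 10) and then stabilizes.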

Reinforcement Learning (T or R unknown)

  1. Direct Evaluation

    Compute values under a fixed policy $\pi$

    • Act according to $\pi$
    • Note rewards $R(s,a,s')$ when we experience $(s,a,s')$
    • Average samples


    • Each state must be learned separately, which takes a long time
    • It wastes information about the connections between states
  2. Temporal Difference Learning (Value learning)

    Compute values under a fixed policy $\pi$

    • Act according to $\pi$ to receive a sample ($a = \pi(s)$)

    • Update $V(s)$

      1. $\text{sample} = R(s,a,s') + \gamma V(s')$
      2. $V(s) = V(s) + \alpha (\text{sample} - V(s))$


    • Can’t turn values into a new policy because we learn values instead of Q-values
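The two update steps can be sketched on a made-up deterministic chain (the environment, rewards, and the fixed policy below are assumptions for illustration):

```python
# TD(0) value learning for a fixed policy on a chain A -> B -> end.
gamma, alpha = 1.0, 0.1
V = {"A": 0.0, "B": 0.0, "end": 0.0}

def step(s):
    # Hypothetical environment under the fixed policy:
    # A -> B with reward 1, B -> end with reward 2.
    return ("B", 1.0) if s == "A" else ("end", 2.0)

for _ in range(500):                    # many episodes under the fixed policy
    s = "A"
    while s != "end":
        s2, r = step(s)
        sample = r + gamma * V[s2]      # sample = R(s,a,s') + gamma V(s')
        V[s] += alpha * (sample - V[s]) # exponential running average
        s = s2
```

The values converge to the true returns: $V(B)=2$ and $V(A)=1+V(B)=3$.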
  3. Q Learning

    Instead of computing $V(s)$

    \[V(s) = \max_a \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V(s')]\]

    we compute $Q(s,a)$

    \[Q(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma \max_{a'} Q(s',a')]\]
    • Act according to some policy to receive a sample ; Q-learning is off-policy, so the learned Q-values converge to the optimal ones even if the behavior policy is suboptimal or random

    • Update $Q(s,a)$

      1. $\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$
      2. $Q(s,a) = Q(s,a) + \alpha (\text{sample} - Q(s,a))$


    • We need to keep a table of all Q-values
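Such a Q-table and update can be sketched on a made-up two-state chain. The dynamics, rewards, and constants are assumptions; the agent here acts completely at random, yet the Q-values still converge to the optimal ones, which is exactly the off-policy property.

```python
import random

# Tabular Q-learning on a two-state chain with a random behavior policy.
random.seed(0)
gamma, alpha = 0.9, 0.5
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ["A", "B"] for a in actions}

def env(s, a):
    # Hypothetical dynamics: "right" moves A -> B (r=0) and B -> exit (r=10);
    # "left" keeps the agent where it is with no reward.
    if s == "A":
        return ("B", 0.0) if a == "right" else ("A", 0.0)
    return (None, 10.0) if a == "right" else ("B", 0.0)

for _ in range(300):
    s = "A"
    while s is not None:
        a = random.choice(actions)                   # random behavior policy
        s2, r = env(s, a)
        sample = r if s2 is None else \
            r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (sample - Q[(s, a)])    # running average
        s = s2
```

The optimal values here are $Q(B,\text{right})=10$ and $Q(A,\text{right})=0+\gamma\cdot 10=9$.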


    • We can use a vector of features to describe a state

      \[Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)\]

      Often the feature-based policies that work well aren’t the ones that approximate V/Q best. Need to learn policies that maximize rewards, not the values that predict them.
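A sketch of the feature-based version. The notes above only give the linear form of $Q(s,a)$; the standard weight update (an addition here, not spelled out above) moves every weight along its feature value by the same TD error used in the tabular case. Features, weights, and the target are made up.

```python
# Approximate Q-learning with a linear feature representation.
def q_value(w, feats):
    # Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + ... + wn fn(s,a)
    return sum(wi * fi for wi, fi in zip(w, feats))

def update(w, feats, target, alpha=0.5):
    diff = target - q_value(w, feats)            # same TD error as tabular
    return [wi + alpha * diff * fi for wi, fi in zip(w, feats)]

w = [0.0, 0.0]                                   # two made-up features
w = update(w, feats=[1.0, 0.5], target=2.0)      # one observed sample
```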

  4. Exploration

    1. $\epsilon$-greedy : act randomly with probability $\epsilon$ and greedily otherwise ($\epsilon$ decreases over time)

    2. Exploration functions

      Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility

      $f(u, n) = u + \frac{k}{n}$

      • Regular Q-update : $\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$
      • Modified Q-update : $\text{sample} = R(s,a,s') + \gamma \max_{a'} f(Q(s',a'), N(s',a'))$
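A tiny numeric sketch of the modified sample (the Q-estimates, visit counts, and $k$ below are made up): a rarely tried action gets a large bonus and wins the max even though its raw estimate is lower.

```python
k = 2.0                                      # exploration bonus constant

def f(u, n):
    # Optimistic utility: unvisited pairs look infinitely good.
    return u + k / n if n > 0 else float("inf")

Q = {("s2", "a1"): 1.0, ("s2", "a2"): 1.2}   # made-up estimates at state s2
N = {("s2", "a1"): 1,   ("s2", "a2"): 10}    # visit counts
gamma, r = 0.9, 0.0
# a1 wins the max: f = 1 + 2/1 = 3, versus 1.2 + 2/10 = 1.4 for a2.
sample = r + gamma * max(f(Q[sa], N[sa]) for sa in Q)
```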


Hidden Markov Model

X1 -> X2 -> X3 -> ...
|     |     |
E1    E2    E3
\[P(X_1,E_1,...,X_T,E_T) = P(X_1)P(E_1|X_1) P(X_2|X_1)P(E_2|X_2) \cdots P(X_T|X_{T-1})P(E_T|X_T)\]

Forward Algorithm

Track the belief distribution $B(X_t) = P(X_t|e_{1:t})$ over time

\[P(X_{t+1}|e_{1:t}) = \sum_{x_t} P(X_{t+1}|x_t)P(x_t|e_{1:t})\] \[B(X_{t+1}) = P(X_{t+1}|e_{1:t},e_{t+1}) \propto P(e_{t+1}|X_{t+1})P(X_{t+1}|e_{1:t})\]
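The two updates can be sketched on a two-state HMM. The transition, emission, and prior numbers below are assumptions (the standard umbrella-world values, not taken from these notes):

```python
# Forward algorithm: alternate passage of time and observation updates.
states = ["rain", "sun"]
prior = {"rain": 0.5, "sun": 0.5}
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emit = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def forward(evidence):
    B = dict(prior)
    for e in evidence:
        # Passage of time: P(X_{t+1} | e_{1:t}) = sum_x P(X_{t+1} | x) B(x)
        pred = {x2: sum(trans[x][x2] * B[x] for x in states) for x2 in states}
        # Observation: B(X_{t+1}) proportional to P(e_{t+1} | X_{t+1}) * pred
        B = {x: emit[x][e] * pred[x] for x in states}
        z = sum(B.values())
        B = {x: b / z for x, b in B.items()}    # normalize
    return B

B = forward(["umbrella", "umbrella"])
```

Two umbrella observations in a row push the belief in rain up to roughly 0.88.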


Viterbi Algorithm

For answering queries like the most likely explanation :

\[\arg\max_{x_{1:t}} P(x_{1:t} | e_{1:t})\]

Passage of time and observation: (use max instead of sum)

\[P(X_{t+1}|e_{1:t},e_{t+1}) = P(e_{t+1}|X_{t+1}) \max_{x_t} P(X_{t+1}|x_t)P(x_t|e_{1:t})\]
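Replacing the sum by a max and keeping back-pointers gives a runnable sketch (same made-up umbrella-world numbers as a forward-algorithm example would use):

```python
# Viterbi: max-product recursion plus back-pointers for the best path.
states = ["rain", "sun"]
prior = {"rain": 0.5, "sun": 0.5}
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emit = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def viterbi(evidence):
    m = {x: prior[x] * emit[x][evidence[0]] for x in states}
    back = []
    for e in evidence[1:]:
        # For each next state, record which predecessor maximizes the product.
        prev = {x2: max(states, key=lambda x: m[x] * trans[x][x2])
                for x2 in states}
        m = {x2: emit[x2][e] * m[prev[x2]] * trans[prev[x2]][x2]
             for x2 in states}
        back.append(prev)
    # Decode the most likely sequence by following back-pointers.
    best = max(states, key=lambda x: m[x])
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

path = viterbi(["umbrella", "umbrella", "none"])
```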

Particle Filtering

Use particles to represent the probability distribution

Dynamic Bayes Nets


Bayes Nets



Conditional Independence

Causal Chains

X -> Y -> Z

Common Cause

Y -> X
  -> Z

Common Effect

X -> Z
Y ->
X -> P -> ... -> Z
Y ->

General case

Are X and Y conditionally independent given evidence variables Z ?

Consider all (undirected) paths from X to Y : if there is no active path, then X and Y are conditionally independent ! If there is an active path, independence is not guaranteed (but they may still happen to be independent).

A path is active if every triple along it is active (an active path means independence is not guaranteed).

Topology limits distributions

Given some graph topology G, the graph structure guarantees certain (conditional) independences and only certain joint distributions can be encoded.


We want to know $P(Q|e_1,\cdots,e_k)$

where all variables $X_1,\cdots,X_n$ can be divided into three types : the query variable $Q$, the evidence variables $E_1,\cdots,E_k$, and the hidden variables $H_1,\cdots,H_r$

So $P(Q|e_1,\cdots,e_k)=\sum\limits_{h_1,\cdots,h_r}P(Q,h_1,\cdots,h_r|e_1,\cdots,e_k)$


R -> T -> L

We have $P(R), P(T|R), P(L|T)$ and we want to know $P(L)$

General Variable Elimination


The elimination ordering can greatly affect the size of the largest factor. For poly-trees (directed graphs with no undirected cycles) we can always find an efficient ordering. For other graphs, we can try to choose a set of variables such that removing them leaves a polytree (cutset conditioning).

General Example


Given $P(B), P(E), P(A|B,E), P(j|A), P(m|A)$, we want to know $P(B|j,m)$

  1. Choose A
    • Join A

      $P(A|B,E), P(j|A), P(m|A) \rightarrow P(j,m,A|B,E)$

    • Eliminate A

      $P(j,m,A|B,E) \rightarrow P(j,m|B,E)$

  2. Choose E
    • Join E

      $P(E), P(j,m|B,E) \rightarrow P(j,m,E|B)$

    • Eliminate E

      $P(j,m,E|B) \rightarrow P(j,m|B)$

  3. Choose B
    • Join B

      $P(B), P(j,m|B) \rightarrow P(j,m,B)$

    • Normalize

      $P(j,m,B) \rightarrow P(B|j,m)$
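The three choose/join/eliminate steps above can be run directly. The notes give no CPT numbers, so the values below are an assumption (the classic burglary-network numbers); the factor manipulation itself follows the steps exactly.

```python
import itertools

# Variable elimination for P(B | j, m), eliminating A then E.
P_B = {True: 0.001, False: 0.999}                 # P(B)
P_E = {True: 0.002, False: 0.998}                 # P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(A=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.9, False: 0.05}                    # P(j | A)
P_m = {True: 0.7, False: 0.01}                    # P(m | A)

# 1. Join on A, then eliminate A: f1(B, E) = sum_a P(a|B,E) P(j|a) P(m|a)
f1 = {(b, e): sum((P_A[(b, e)] if a else 1 - P_A[(b, e)]) * P_j[a] * P_m[a]
                  for a in [True, False])
      for b, e in itertools.product([True, False], repeat=2)}

# 2. Join on E, then eliminate E: f2(B) = sum_e P(e) f1(B, e)
f2 = {b: sum(P_E[e] * f1[(b, e)] for e in [True, False])
      for b in [True, False]}

# 3. Join on B, then normalize: P(B | j, m) proportional to P(B) f2(B)
unnorm = {b: P_B[b] * f2[b] for b in [True, False]}
z = sum(unnorm.values())
posterior = {b: p / z for b, p in unnorm.items()}
```

With these numbers the posterior probability of a burglary given both calls comes out near 0.284.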


Why sampling ?

Prior Sampling

Rejection Sampling

If there is evidence, we reject samples inconsistent with it.

If the evidence is unlikely, we reject lots of samples.

Likelihood Weighting

Fix evidence variables and sample the rest

Evidence influences the choice of values for downstream variables, but not upstream ones (upstream variables are no more likely to take values consistent with the evidence)

The weights can be very small. The sum of weights over all samples indicates how many “effective” samples were obtained, so we want high weights.
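A sketch on a hypothetical two-node net Rain -> WetGrass with made-up CPTs: the evidence (WetGrass = true) is fixed rather than sampled, and each sample is weighted by the likelihood of that evidence.

```python
import random

# Likelihood weighting: estimate P(Rain=true | WetGrass=true).
random.seed(0)
p_rain = 0.3                             # P(Rain=true), made up
p_wet_given = {True: 0.9, False: 0.2}    # P(WetGrass=true | Rain), made up

weights = {True: 0.0, False: 0.0}
for _ in range(20000):
    rain = random.random() < p_rain      # sample the non-evidence variable
    weights[rain] += p_wet_given[rain]   # weight by evidence likelihood

# Exact answer: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 0.27/0.41, about 0.659.
p_est = weights[True] / (weights[True] + weights[False])
```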

Gibbs Sampling

Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods and Metropolis-Hastings is one of the more famous MCMC methods

Value of Information

MEU = Maximum Expected Utility

VPI = Value of Perfect Information

VPI Properties


Partially Observable Markov Decision Process

The agent cannot directly observe the underlying state.

Naïve Bayes



The idea is that each feature depends only on the class, so the features are conditionally independent given the class.

To use Naïve Bayes, we need to calculate $P(Y|f_1,\cdots,f_n)$ by using $P(y_k,f_1,\cdots,f_n)$.
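That computation can be sketched with made-up priors and per-class feature likelihoods (a toy spam example; none of these numbers come from the notes):

```python
import math

# Naive Bayes: P(y | f1..fn) proportional to P(y) * prod_i P(f_i | y).
prior = {"spam": 0.4, "ham": 0.6}       # P(Y), made up
likelihood = {                          # P(word | class), made up
    "spam": {"free": 0.3, "hello": 0.1},
    "ham":  {"free": 0.05, "hello": 0.2},
}

def posterior(words):
    joint = {y: prior[y] * math.prod(likelihood[y][w] for w in words)
             for y in prior}
    z = sum(joint.values())             # normalize the joint to a posterior
    return {y: p / z for y, p in joint.items()}

post = posterior(["free", "hello"])
```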

Naïve Bayes for Text


Feature $W_i$ is the word at position $i$ and we assume that all positions share the same conditional probability $P(W|Y)$

To calculate a conditional probability $P(x|y)$:


Linear, Oneclass

Use $\text{sgn}(w^Tf(x))$ to classify an instance

Linear, Multiclass

There is a separate weight vector $w_y$ for each class $y$, and we use $\arg\max\limits_y {w_y}^Tf(x)$ to classify an instance
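A minimal sketch of the multiclass decision rule (weight vectors and the feature vector are made up):

```python
# Multiclass linear classification: predict argmax_y (w_y . f(x)).
w = {"spam": [2.0, -1.0], "ham": [0.5, 1.0], "ads": [1.0, 0.0]}

def classify(feats):
    return max(w, key=lambda y: sum(wi * fi for wi, fi in zip(w[y], feats)))

label = classify([1.0, 2.0])    # scores: spam 0.0, ham 2.5, ads 1.0
```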

MIRA (Margin Infused Relaxed Algorithm)

To prevent overfitting

Try to minimize the change to the weights

\[\min\limits_w \frac{1}{2}\sum\limits_y\|w_y-w_{y}'\|^2\]

such that

\[w_{y^*}^T f(x) \geq w_y^T f(x) + 1\]

If we consider a change of the form:

\[w_{y}=w_{y}'-\gamma f(x), \quad w_{y^*}=w_{y^*}'+\gamma f(x)\]

the problem becomes:

\[\min\limits_\gamma \|\gamma f(x)\|^2\]

such that

\[(w_{y^*}'+\gamma f(x))^T f(x) \geq (w_y'-\gamma f(x))^T f(x) + 1\]

and so

\[\gamma = \frac{(w_y'-w_{y^*}')^T f(x)+1}{2f(x)^T f(x)}\]

In practice, it’s bad to make updates that are too large so we can cap the maximum possible value of $\gamma$ with some constant $C$:

\[\gamma = \min(\gamma, C)\]
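The closed-form step size with the cap can be sketched directly (the weight vectors and feature vector below are made up):

```python
# MIRA step size: gamma = ((w_y' - w_y*')^T f(x) + 1) / (2 f(x)^T f(x)),
# capped at C to avoid overly large updates.
def mira_gamma(w_y, w_star, f, C=1.0):
    num = sum((a - b) * fi for a, b, fi in zip(w_y, w_star, f)) + 1.0
    den = 2.0 * sum(fi * fi for fi in f)
    return min(num / den, C)

# Made-up current weights for the predicted class y and true class y*.
gamma = mira_gamma(w_y=[1.0, 0.0], w_star=[0.0, 0.0], f=[1.0, 1.0])
```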

SVM (Support Vector Machines)


\[\min\limits_w \frac{1}{2}\|w\|^2\]

such that

\[\forall i,y \quad w_{y^*}^T f(x_i) \geq w_y^T f(x_i) + 1\]

The goal is to maximize the margin for all instances. This differs from MIRA, which optimizes over one instance at a time and tries to minimize the size of the change.

This algorithm (with a kernel) can be used for online learning via the primal/dual problem.

Bordes, Antoine, et al. “Fast kernel classifiers with online and active learning.” Journal of Machine Learning Research 6.Sep (2005): 1579-1619.

Kernel Trick

For a perceptron, each weight vector is in fact a linear combination of the training feature vectors $f(x_i)$:

\[w_y = \sum\limits_i \alpha_{i,y} f(x_i)\]

So $w_y^T f(x) = \sum\limits_i \alpha_{i,y} f(x_i)^Tf(x) = \sum\limits_i \alpha_{i,y} K(x_i,x)$
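The kernelized score can be sketched with a common quadratic kernel (a hypothetical choice; the training points and dual coefficients $\alpha_{i,y}$ are made up):

```python
# Kernel trick: score a point without ever forming w_y explicitly.
def K(a, b):
    # Quadratic (polynomial) kernel, one common choice.
    return (1.0 + sum(ai * bi for ai, bi in zip(a, b))) ** 2

train = [[1.0, 0.0], [0.0, 1.0]]                  # made-up training points
alpha = {"pos": [1.0, -0.5], "neg": [-1.0, 0.5]}  # made-up dual coefficients

def score(y, x):
    # w_y^T f(x) = sum_i alpha_{i,y} K(x_i, x)
    return sum(a * K(xi, x) for a, xi in zip(alpha[y], train))

label = max(alpha, key=lambda y: score(y, [1.0, 1.0]))
```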