You are currently browsing the tag archive for the ‘teaching’ tag.

Here is the question from Ross’ book that I posted last week

Question 1 We have two coins, a red one and a green one. When flipped, one lands heads with probability {P_1} and the other with probability {P_2}. Assume that {P_1>P_2}. We do not know which coin is the {P_1} coin. We initially attach probability {p} to the red coin being the {P_1} coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor {\beta}. Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is {P_1}. Every period we choose a coin to toss, get a reward and updated our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.

In the classical bandit problem there are {n} arms and each arm {i} provides a reward from an unknown distribution {\theta_i\in\Delta([0,1])}. Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The {\theta_i} are randomized independently according to distributions {\mu_i\in \Delta(\Delta([0,1]))}, and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number {G(\mu)} (the index) to every belief {\mu} about an arm. What is important is that the index of an arm {i} depends only on {\mu_i} — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is {P_1} decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let {r=p/(p+\sqrt{p(1-p)}} so that {r^2/(1-r)^2=p/1-p}. In the easy problem there are again two coins but this time the red coin is {P_1} with probability {r} and {P_2} with probability {1-r} and, independently, the green coin is {P_1} with probability {(1-r)} and {P_2} with probability {r}. The easy problem is easy because it is a bandit problem. We have to keep track of beliefs {p_r} and {p_g} about the red coin and the green coin ({p_r} is the probability that the red coin is {P_1}), starting with {p_r={r}} and {p_g=(1-r)}, and when we toss the red coin we update {p_r} but keep {p_g} fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is {P_1} so the optimal strategy is to play red when {p_r\ge p_g} and green when {p_g\ge p_r}. In particular, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}.

Now here comes the trick. Consider a general strategy {g} that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by {V_d(g)} and {V_e(g)} the rewards that {g} gives in the difficult and easy problems respectively. I claim that

\displaystyle \begin{array}{rcl} &V_e(g)=r(1-r) \cdot P_1/(1-\beta)+ \\ &r(1-r) \cdot P_2/(1-\beta) + (r^2+(1-r)^2) V_d(g).\end{array}

Why is that ? in the easy problem there is a probability {r(1-r)} that both coins are {P_1}. If this happens then every {g} gives payoff {P_1/(1-\beta)}. There is a probability {r(1-r)} that both coins are {P_2}. If this happens then every {g} gives payoff {P_2/(1-\beta)}. And there is a probability {r^2+(1-r)^2} that the coins are different, and, because of the choice of {r}, conditionally on this event the probability of {G} being {P_1} is {p}. Therefore, in this case {g} gives whatever {g} gives in the difficult problem.

So, the payoff in the easy problem is a linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every {p}, the optimal action in the first period is red when {p\ge 1/2} and green with {p\le 1/2}. Now back to the dynamic programming formulation, from standard arguments it follows that the optimal strategy is to keep doing it forever, i.e., at every period to toss the coin that is more likely to be the {P_1} coin given the current information.

See why I said my solution is tricky and specific ? it relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2 Let {0 \le P_1 < P_2 < ... < P_n \le 1}. We are given {n} coins, one of each parameter, all {n!} possibilities equally likely. Each period we have to toss a coin and we get payoff {1} for Heads. What is the optimal strategy ?

Here is the bonus question from the final exam in my dynamic optimization class of last semester. It is based on problem 8 Chapter II in Ross’ book Introduction to stochastic dynamic programming. It appears there as `guess the optimal policy’ without asking for proof. The question seems very natural, but I couldn’t find any information about it (nor apparently the students). I have a solution but it is tricky and too specific to this problem. I will describe my solution next week but perhaps somebody can show me a better solution or a reference ?

 

We have two coins, a red one and a green one. When flipped, one lands heads with probability P1 and the other with probability P2. We do not know which coin is the P1 coin. We initially attach probability p to the red coin being the P1 coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.

One reason I like thinking about probability and statistics is that my raw intuition does not fit well with the theory, so again and again I find myself falling into the same pits and enjoying the same revelations. As an impetus to start blogging again, I thought I should share some of these pits and revelations. So, for your amusement and instruction, here are three statistics questions with wrong answers which I came across during the last quarter. The mistakes are well known, and in fact I am sure I was at some level aware of them, but I still managed to believe the wrong answers for time spans ranging from a couple of seconds to a couple of weeks, and I still get confused when I try to explain what’s wrong. I will give it a shot some other post.

— Evaluating independent evidence —

The graduate students in a fictional department of economics are thrown to the sharks if it can be proved at significance level {\alpha=0.001} that they are guilty of spending less than eighty minutes a day on reading von Mises `Behavioral econometrics’. Before the penalty is delivered, every student is evaluated by three judges, who each monitors the student in a random sample of days and then conducts a statistical hypothesis testing about the true mean of daily minutes the student spend on von Mises:

\displaystyle \begin{array}{rcl} &H_0: \mu \ge 80\\&H_A: \mu<80\end{array}

The three samples are independent. In the case of Adam Smith, a promising grad student, the three judges came up with p-values {0.09, 0.1, 0.08}. Does the department chair have sufficient evidence against Smith ?

Wrong answer: Yup. The p-value in every test is the probability of failing the test under the null. These are independent samples so the probability to end up the three tests with such p-values is {0.09\cdot 0.1\cdot 0.08<0.001}. Therefore, the chair can dispose of the student. Of course it is possible that the student is actually not guilty and was just extremely unlucky to get monitored exactly on the days in which he slacked, but hey, that’s life or more accurately that’s statistics, and the chair can rest assured that by following this procedure he only loses a fraction of {0.001} of the innocent students.

— The X vs. the Y —

Suppose that in a linear regression of {Y} over {X} we get that

\displaystyle Y=4 + X + \epsilon

where {\epsilon} is the idiosyncratic error. What would be the slope in a regression of {X} over {Y} ?
Wrong answer: If {Y= 4 + X + \epsilon} then {X = -4 + Y + \epsilon'}, where {\epsilon'=-\epsilon}. Therefore the slope will be {1} with {\epsilon'} being the new idiosyncratic error.

— Omitted variable bias in probit regression —

Consider a probit regression of a binary response variable {Y} over two explanatory variables {X_1,X_2}:

\displaystyle \text{Pr}(Y=1)=\Phi\left(\beta_0 + \beta_1X_1 + \beta_2X_2\right)

where {\Phi} is the commulative distribution of a standard normal variable. Suppose that {\beta_2>0} and that {X_1} and {X_2} are positively correlated, i.e. {\rho(X_1,X_2)>0}. What can one say about the coefficient {\beta_1'} of {X_1} in a probit regression

\displaystyle \text{Pr}(Y=1)=\Phi\left(\beta_0'+ \beta_1'X_1\right)

of {Y} over {X_1} ?
Wrong answer: This is a well known issue of omitted variable bias. {\beta_1'} will be larger than {\beta_1}. One way to understand this is to consider the different meaning of the coefficients: {\beta_1} reflects the the impact on {Y} when {X_1} increases and {X_2} stays fixed, while {\beta_1'} reflects the impact on {Y} when {X_1} increases without controlling on {X_2}. Since {X_1} and {X_2} are positively correlated, and since {X_2} has positive impact on {Y} (as {\beta_2>0}), it follows that {\beta_1'>\beta_1}.

Recent Comments

Kellogg faculty blogroll