
Improving ``Unconscious'' Behaviors

The rules of a PMA are situation/action pairs. As it turns out, a single situation can be paired with multiple actions. The object of learning here is to determine which of the actions associated with a situation yields a better result, i.e., leaves the pilot in a more desirable situation.
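As an illustration only (the situation and action names are invented here, not taken from ABS), such a rule table might be represented as:

from collections import defaultdict

# One situation is paired with several candidate actions; learning must
# decide which of these candidates works best in that situation.
rules = defaultdict(list)
rules["enemy_ahead_far"].extend(["accelerate", "climb"])
rules["enemy_behind_near"].extend(["break_left", "break_right", "dive"])

def candidate_actions(situation):
    """Return every action the PMA pairs with the given situation."""
    return rules[situation]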

Some situations in ABS are more desirable for the pilot than others, e.g., being right behind the enemy and in shooting range. Let's assume that we can assign to each situation s a goodness value G(s) between 0 and 1. As the pilot makes a move, it finds itself in a new situation. This new situation is not known to the pilot in advance, since it also depends on the other pilot's move. Since the new situation is not uniquely determined by the pilot's move, the pilot's view of the game is not Markovian.
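For illustration, a goodness table and a two-pilot transition could be sketched as follows; the situation names, moves, and transitions are invented for the example and are not part of ABS:

# Goodness values G(s) in [0, 1]; higher means a more desirable situation.
G = {
    "behind_enemy_in_range": 1.0,   # e.g., right behind the enemy, in shooting range
    "head_on_neutral":       0.5,
    "enemy_on_our_tail":     0.1,
}

# The successor situation depends on BOTH pilots' moves, so from one pilot's
# point of view its own action alone does not determine the outcome.
TRANSITIONS = {
    ("head_on_neutral", "climb", "dive"):  "behind_enemy_in_range",
    ("head_on_neutral", "climb", "climb"): "head_on_neutral",
}

def next_situation(s, my_move, enemy_move):
    """Look up the successor situation given both pilots' moves."""
    return TRANSITIONS.get((s, my_move, enemy_move), s)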

Q(s,a) is the evaluation of how appropriate action a is in situation s. R(s,a) is the goodness value of the situation the pilot finds itself in after performing a in situation s; it is determined as the game is played and cannot be determined beforehand. This is called the immediate reward. γ is a parameter between 0 and 1 that we plan to vary to determine how important it is to be in the state that the pilot ends up in after its move. In reinforcement learning this is known as the discount factor. We let Q(s,a) = R(s,a) + γ max_k Q(s',k), where situation s' results after the pilot performs a in s. At the start of the game, all Q(s,a) in the PMA are set to 0. As the game is played, Q is updated. As of this writing we are experimenting with setting appropriate parameters for Q.
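A minimal sketch of this update, assuming a tabular Q and taking the immediate reward to be the goodness of the situation the pilot actually lands in (the function and parameter names here are ours, not the authors'):

from collections import defaultdict

GAMMA = 0.9                       # the discount factor, to be varied experimentally

Q = defaultdict(float)            # Q[(situation, action)], every entry starts at 0

def update_q(s, a, s_next, goodness, actions_in):
    """One application of Q(s,a) = R(s,a) + gamma * max_k Q(s',k).

    goodness(s_next) plays the role of R(s,a): it is only known once both
    pilots have moved and the new situation s_next is observed.
    actions_in(s_next) lists the candidate actions paired with s_next.
    """
    r = goodness(s_next)
    best_next = max((Q[(s_next, k)] for k in actions_in(s_next)), default=0.0)
    Q[(s, a)] = r + GAMMA * best_next

After each move, the pilot would call update_q with the situation it was in, the action it chose, and the situation it then observed.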
