
Artificial Intelligence Wiki

Heuristics (value of a state)

  • Given a roomba's x, y, and theta, we want to compute its potential future value, taking the time until its next flip into account
  • If we can assign to each time-dependent state a value, we can make hypotheses of the form:
    • If the UAV rotates this roomba, is the lifetime value of that roomba increased?
    • At the current timestep, which roomba results in the largest increase in value?
  • Consider the following value assignment scheme, V(x, y, theta, t), where t is the time until the roomba's next flip
    • assume V is linearly separable into V(x, y) and V(theta)
    • V(x, y, theta) is thus a linear combination of V(x, y) and V(theta)
    • compute the time integral of V(x, y, theta) * Gamma(tau) for tau from 0 to t
    • also compute the time integral of V(x, y, -theta) * Gamma(tau) for tau from t to 20
      • the 20-second horizon is arbitrary; theoretically it should be infinite
    • the sum of those two integrals is the value of the roomba's state (a numerical sketch follows this list)
  • V(x,y)
    • See the expression in the rewards section
    • ignore the time component
  • V(theta)
    • = pi - |(theta1 + theta2)/2 - theta|
    • where theta1 is the smallest angle such that the roomba will hit the left edge of the goal
    • where theta2 is the largest angle such that the roomba will hit the right edge of the goal
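A minimal numerical sketch of this scheme. It assumes Gamma(tau) is an exponential discount gamma**tau, an equal-weight combination of V(x, y) and V(theta), and a simple Riemann sum; v_xy, theta1, theta2, and t_flip are placeholders supplied by the caller rather than values defined here.

```python
import math

def v_theta(theta, theta1, theta2):
    """V(theta) = pi - |(theta1 + theta2)/2 - theta|: peaks when the roomba
    heads toward the middle of the goal opening."""
    return math.pi - abs((theta1 + theta2) / 2.0 - theta)

def roomba_value(v_xy, theta, t_flip, theta1, theta2,
                 gamma=0.95, horizon=20.0, dt=0.1):
    """Discounted time integral of the state value: V(x, y, theta) until the
    flip at t_flip, then V(x, y, -theta) out to the (arbitrary) 20 s horizon."""
    value = 0.0
    tau = 0.0
    while tau < horizon:
        heading = theta if tau < t_flip else -theta
        v = v_xy + v_theta(heading, theta1, theta2)   # equal-weight linear combination
        value += v * (gamma ** tau) * dt              # left Riemann sum of V * Gamma
        tau += dt
    return value
```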

Reinforcement Learning

States

  • UAV State
    • position (x, y, z), orientation (yaw, pitch, roll), velocity (vx, vy, vz)
  • Roomba States (10 of these)
    • position (x, y) and velocity (vx, vy)
  • Obstacle States (5 of these)
    • position (x, y) and velocity (vx, vy)
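One possible way to represent this state in code; the class and field names and the flat-vector encoding are assumptions for illustration, not a fixed interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UAVState:
    # position (x, y, z), orientation (yaw, pitch, roll), velocity (vx, vy, vz)
    x: float; y: float; z: float
    yaw: float; pitch: float; roll: float
    vx: float; vy: float; vz: float

@dataclass
class GroundRobotState:
    # shared by the 10 Roombas and the 5 obstacles: position (x, y), velocity (vx, vy)
    x: float; y: float
    vx: float; vy: float

def flatten_state(uav: UAVState,
                  roombas: List[GroundRobotState],
                  obstacles: List[GroundRobotState]) -> List[float]:
    """Concatenate the full world state into one flat feature vector for the learner."""
    vec = [uav.x, uav.y, uav.z, uav.yaw, uav.pitch, uav.roll, uav.vx, uav.vy, uav.vz]
    for robot in roombas + obstacles:
        vec += [robot.x, robot.y, robot.vx, robot.vy]
    return vec
```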

Actions

  • move towards an (x, y, z) coordinate
    • don't move all the way to a Roomba and hold off on actions until arrival, because there's a lot of uncertainty in the state and more fine-grained behavior may be necessary
    • however, moving towards specific Roombas is very intuitive, so it's well suited to the RL algorithm
    • as a compromise, move a fraction of the way towards the chosen Roomba (see the sketch after this list)
    • in this way, the algorithm can adjust to changes in the state as it moves towards its designated Roomba
  • give the UAV the ability to change its decision; treat moving towards Roombas as an autopilot-level efficiency concern
  • null actions to let the UAV continue moving towards the chosen target and take advantage of its built-up velocity
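A small sketch of the "move a fraction of the way" compromise; the 0.25 fraction is an arbitrary placeholder, not a tuned value.

```python
def fractional_waypoint(uav_xy, roomba_xy, fraction=0.25):
    """Waypoint a fixed fraction of the way from the UAV to the chosen Roomba,
    so the policy gets a chance to reconsider before arrival."""
    ux, uy = uav_xy
    rx, ry = roomba_xy
    return (ux + fraction * (rx - ux), uy + fraction * (ry - uy))
```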

Rewards

RL agents receive rewards upon the transition from one state to another. A reward implicitly credits the action that caused the rewarding transition. We also need to decide how to handle rewards arriving at 10 Hz.

  • reward all Roomba movements in the direction of the center of the goal
    • i.e. the component of motion in the direction of the vector from the Roomba to the midpoint of the goal line
    • this could be positive or negative
  • positively reward clustered Roombas
    • it's useful to cluster Roombas together because doing so allows the planning algorithm to quickly change the direction of motion for many Roombas within a (relatively) short period of time
    • need to determine a metric of how clustered the Roomba configuration is
      • score a cluster by summing the squared distances; compare against a configurable threshold that determines whether a given cluster score earns a positive or negative reward (a sketch follows this list)
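A sketch of one possible clustering metric. It assumes "squared distances" means pairwise squared distances between Roombas (so a lower score means a tighter cluster); the threshold of 100 is an arbitrary placeholder.

```python
from itertools import combinations

def cluster_score(roomba_positions):
    """Sum of squared pairwise distances; smaller means more tightly clustered."""
    return sum((x1 - x2) ** 2 + (y1 - y2) ** 2
               for (x1, y1), (x2, y2) in combinations(roomba_positions, 2))

def cluster_reward(roomba_positions, threshold=100.0):
    """Positive reward when the configuration is tighter than the configurable threshold."""
    return 1.0 if cluster_score(roomba_positions) < threshold else -1.0
```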

Reward Function

For individual roombas:

  • Considers x-y position, time, and direction
  • ((y+10)^2 - (5-x)^2 - (-5-x)^2) + (-1)^((t/10)%2) * cos(direction) * some tuning factor (transcribed in the sketch below)
  • the direction/time reward function flips sign every ten seconds, and is maximized when the roomba is aligned with the positive y-axis and minimized when it is aligned with the negative y-axis
  • TODO: make the reward change linearly with time within each 20 second interval
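A direct transcription of the per-roomba expression above, under a few assumptions: t/10 is integer division (so the sign flips every ten seconds), direction is measured from the positive y-axis (so cos(direction) peaks when the roomba points toward the goal side), and the tuning factor defaults to 1.

```python
import math

def roomba_reward(x, y, direction, t, tuning=1.0):
    """Per-roomba reward: x-y position term plus a direction term whose sign
    alternates every ten seconds of mission time t."""
    position_term = (y + 10) ** 2 - (5 - x) ** 2 - (-5 - x) ** 2
    direction_term = (-1) ** ((int(t) // 10) % 2) * math.cos(direction) * tuning
    return position_term + direction_term
```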

Possible reward functions based on overall state:

  • Clusters of roombas