In this new ROS Project you are going to learn Step-by-Step how to create a robot cube that moves and that it learns to move using OpenAI environment.

In this fourth video we continue we talk about the first script we need to do reinforcement learning with OpenAI. Specifically the main script where we import the Robot Environment that we will define in the next video and we will use Qlearn.

OpenAi Course

Moving Cube Git:
https://bitbucket.org/theconstructcore/moving_cube/src/master/
Moving Cube Training AI Git: https://bitbucket.org/theconstructcore/moving_cube_ai/src/master/

If you didn’t follow up, please check the link below for the last post

https://www.theconstructsim.com/ros-projects-openai-moving-cube-robot-gazebo-step-step-part3/

#!/usr/bin/env python

'''
    Training code made by Ricardo Tellez <rtellez@theconstructsim.com>
    Based on many other examples around Internet
    Visit our website at www.theconstruct.ai
'''
import gym
import numpy
import time
import qlearn
from gym import wrappers

# ROS packages required
import rospy
import rospkg

# import our training environment
import old_way_moving_cube_env


if __name__ == '__main__':
    
    rospy.init_node('movingcube_gym', anonymous=True, log_level=rospy.WARN)

    # Create the Gym environment
    env = gym.make('OldMovingCube-v0')
    rospy.loginfo ( "Gym environment done")
        
    # Set the logging system
    rospack = rospkg.RosPack()
    pkg_path = rospack.get_path('moving_cube_training_pkg')
    outdir = pkg_path + '/training_results'
    env = wrappers.Monitor(env, outdir, force=True) 
    rospy.loginfo ( "Monitor Wrapper started")

    last_time_steps = numpy.ndarray(0)

    # Loads parameters from the ROS param server
    # Parameters are stored in a yaml file inside the config directory
    # They are loaded at runtime by the launch file
    Alpha = rospy.get_param("/moving_cube/alpha")
    Epsilon = rospy.get_param("/moving_cube/epsilon")
    Gamma = rospy.get_param("/moving_cube/gamma")
    epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
    nepisodes = rospy.get_param("/moving_cube/nepisodes")
    nsteps = rospy.get_param("/moving_cube/nsteps")

    running_step = rospy.get_param("/moving_cube/running_step")

    # Initialises the algorithm that we are going to use for learning
    qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                    alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
    initial_epsilon = qlearn.epsilon

    start_time = time.time()
    highest_reward = 0

    # Starts the main training loop: the one about the episodes to do
    for x in range(nepisodes):
        rospy.logdebug("############### START EPISODE=>" + str(x))
        
        cumulated_reward = 0  
        done = False
        if qlearn.epsilon > 0.05:
            qlearn.epsilon *= epsilon_discount
        
        # Initialize the environment and get first state of the robot
        observation = env.reset()
        state = ''.join(map(str, observation))
        
        episode_time = rospy.get_rostime().to_sec()
        # for each episode, we test the robot for nsteps
        for i in range(nsteps):
            rospy.loginfo("############### Start Step=>"+str(i))
            # Pick an action based on the current state
            action = qlearn.chooseAction(state)
            rospy.loginfo ("Next action is:%d", action)
            # Execute the action in the environment and get feedback
            observation, reward, done, info = env.step(action)

            rospy.loginfo(str(observation) + " " + str(reward))
            cumulated_reward += reward
            if highest_reward < cumulated_reward:
                highest_reward = cumulated_reward

            nextState = ''.join(map(str, observation))

            # Make the algorithm learn based on the results
            rospy.logwarn("############### state we were=>" + str(state))
            rospy.logwarn("############### action that we took=>" + str(action))
            rospy.logwarn("############### reward that action gave=>" + str(reward))
            rospy.logwarn("############### State in which we will start nect step=>" + str(nextState))
            qlearn.learn(state, action, reward, nextState)

            if not(done):
                state = nextState
            else:
                rospy.loginfo ("DONE")
                last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                break
            rospy.loginfo("############### END Step=>" + str(i))
            #raw_input("Next Step...PRESS KEY")
            #rospy.sleep(2.0)
        m, s = divmod(int(time.time() - start_time), 60)
        h, m = divmod(m, 60)
        rospy.logwarn ( ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+"     Time: %d:%02d:%02d" % (h, m, s)))

    
    rospy.loginfo ( ("\n|"+str(nepisodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |"))

    l = last_time_steps.tolist()
    l.sort()

    #print("Parameters: a="+str)
    rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
    rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))

    env.close()

This the script for training.

From line 42 to line 49, we read the parameters for the q-learn algorithm from the parameter server which we’ll discuss later.

The training process is done from line 59 to line 108. For each step, the algorithm decides which action to take based on the current state, then it measures the observation, decides if the simulation is done and calculates rewards. It will keep learning to optimize the reward it gets.

The implementation of the q-learn algorithm was done by Victor Mayoral Vilches.

'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
https://github.com/vmayoral/basic_reinforcement_learning
 
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
 
        @author: Victor Mayoral Vilches <victor@erlerobotics.com>
'''
import random

class QLearn:
    def __init__(self, actions, epsilon, alpha, gamma):
        self.q = {}
        self.epsilon = epsilon  # exploration constant
        self.alpha = alpha      # discount constant
        self.gamma = gamma      # discount factor
        self.actions = actions

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

    def learnQ(self, state, action, reward, value):
        '''
        Q-learning:
            Q(s, a) += alpha * (reward(s,a) + max(Q(s') - Q(s,a))            
        '''
        oldv = self.q.get((state, action), None)
        if oldv is None:
            self.q[(state, action)] = reward
        else:
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)

    def chooseAction(self, state, return_q=False):
        q = [self.getQ(state, a) for a in self.actions]
        maxQ = max(q)

        if random.random() < self.epsilon:
            minQ = min(q); mag = max(abs(minQ), abs(maxQ))
            # add random values to all the actions, recalculate maxQ
            q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))] 
            maxQ = max(q)

        count = q.count(maxQ)
        # In case there're several state-action max values 
        # we select a random one among them
        if count > 1:
            best = [i for i in range(len(self.actions)) if q[i] == maxQ]
            i = random.choice(best)
        else:
            i = q.index(maxQ)

        action = self.actions[i]        
        if return_q: # if they want it, give it!
            return action, q
        return action

    def learn(self, state1, action1, reward, state2):
        maxqnew = max([self.getQ(state2, a) for a in self.actions])
        self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)

We put the parameter for the q-learn separately in a file called one_disk_walk_openai_params.yaml under the config folder, so you can tweak the parameters much easier.

moving_cube: #namespace

    running_step: 0.04 # amount of time the control will be executed
    pos_step: 0.016     # increment in position for each command
    
    #qlearn parameters
    alpha: 0.1
    gamma: 0.7
    epsilon: 0.9
    epsilon_discount: 0.999
    nepisodes: 500
    nsteps: 1000
    number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits

    running_step: 0.06 # Time for each step
    wait_time: 0.1 # Time to wait in the reset phases

    n_actions: 5 # We have 3 actions

    speed_step: 1.0 # Time to wait in the reset phases

    init_roll_vel: 0.0 # Initial speed of the Roll Disk

    roll_speed_fixed_value: 100.0 # Speed at which it will move forwards or backwards
    roll_speed_increment_value: 10.0 # Increment that could be done in each step

    max_distance: 2.0 # Maximum distance allowed for the RobotCube
    max_pitch_angle: 0.2 # Maximum Angle radians in Pitch that we allow before terminating episode
    max_yaw_angle: 0.1 # Maximum yaw angle deviation, after that it starts getting negative rewards

    init_cube_pose:
      x: 0.0
      y: 0.0
      z: 0.0

    end_episode_points: 1000 # Points given when ending an episode

    move_distance_reward_weight: 1000.0 # Multiplier for the moved distance reward, Ex: inc_d = 0.1 --> 100points
    y_linear_speed_reward_weight: 1000.0 # Multiplier for moving fast in the y Axis
    y_axis_angle_reward_weight: 1000.0 # Multiplier of angle of yaw, to keep it straight

It might take a lot of time to tune the parameters. In ROSDS, we offer the gym computer feature to help you run training with different parameters parallelly. If you are interested, please check our paid program.

If you want to learn more applications with OpenAI in ROS, please check our OpenAI course in the robot ignite academy.

 

Edit by Tony Huang.

[irp posts=”10198″ name=”ROS Projects OpenAI with Moving Cube Robot in Gazebo Step-by-Step #Part5″]