Reinforcement Learning Assignment
1. Introduction
In this assignment, you will implement a reinforcement learning (RL) algorithm to solve the CliffBoxPushing grid-world game. Novel ideas are welcome and will receive bonus credit. You need to implement the code on your own and give a convincing presentation that demonstrates the implemented algorithm.
The following link can help you learn more about current RL algorithms: OpenAI Spinning Up: https://spinningup.openai.com/en/latest/index.html
Figure 1. The Cliff Box Pushing Grid World.
2. The Environment
The environment is a 2D grid world as shown in Fig. 1. The size of the environment is 6×14. In Fig. 1,
A indicates the agent, B stands for the box, G is the goal, and x marks a cliff cell. You need to write code
to implement one of the RL algorithms and train the agent to push the box to the goal position. The
game ends under any of the following three conditions (a minimal check of these conditions is sketched after the list):
1. The agent or the box steps into the dangerous region (the cliff).
2. The current time step reaches the maximum time step of the game.
3. The box arrives at the goal.
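For concreteness, here is a minimal sketch of how these termination conditions could be checked. The cliff cell set, the value of MAX_STEPS, and the function name are illustrative assumptions, not the actual environment code.

```python
MAX_STEPS = 100            # assumed maximum episode length; the real limit is fixed by the environment
CLIFF = {(4, 5), (4, 6)}   # assumed cliff cells for illustration; the real layout follows Fig. 1

def is_terminal(agent_pos, box_pos, goal_pos, t):
    # 1. The agent or the box steps into the dangerous region (cliff).
    if agent_pos in CLIFF or box_pos in CLIFF:
        return True
    # 2. The current time step reaches the maximum time step.
    if t >= MAX_STEPS:
        return True
    # 3. The box arrives at the goal.
    return box_pos == goal_pos
```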
The MDP formulation is described as follows:
- State: The state consists of the positions of the agent and the box. In Python, it is a tuple; for
example, at time step 0 the state is ((5, 0), (4, 1)), where (5, 0) is the position of the agent
and (4, 1) is the position of the box.
- Action: The action space is [1, 2, 3, 4], which corresponds to [up, down, left, right]. The
agent needs to select one of them to navigate in the environment.
- Reward: The reward consists of:
1. a step penalty of -1 received at each time step;
2. the negative value of the distance between the box and the goal;
3. the negative value of the distance between the agent and the box;
4. an additional penalty of -1000 if the agent or the box falls into the cliff.
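As an illustration of the reward terms above, the sketch below computes one possible version of this reward. The use of Manhattan distance and the helper names are assumptions made for illustration; the provided environment may define the distance and the exact combination differently.

```python
def manhattan(p, q):
    # Manhattan distance between two grid cells given as (row, col) tuples.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def compute_reward(agent_pos, box_pos, goal_pos, fell_into_cliff):
    # Hypothetical combination of the four reward terms listed above.
    reward = -1.0                              # 1. step penalty at every time step
    reward -= manhattan(box_pos, goal_pos)     # 2. negative box-to-goal distance
    reward -= manhattan(agent_pos, box_pos)    # 3. negative agent-to-box distance
    if fell_into_cliff:                        # 4. agent or box stepped into the cliff
        reward -= 1000.0
    return reward
```

For example, with the initial state ((5, 0), (4, 1)) and a hypothetical goal at (4, 12), compute_reward((5, 0), (4, 1), (4, 12), False) returns -1 - 11 - 2 = -14.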

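Since the state is a small discrete tuple and there are only four actions, tabular Q-learning is one natural choice of algorithm. The sketch below shows the general idea, keying the Q-table directly on the state tuple; the environment interface (env.reset() and env.step(action) returning (next_state, reward, done)) and the hyperparameter values are assumptions rather than the assignment's provided API.

```python
import random
from collections import defaultdict

ACTIONS = [1, 2, 3, 4]  # up, down, left, right

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table keyed by the state tuple ((agent_row, agent_col), (box_row, box_col)).
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    for _ in range(episodes):
        state = env.reset()                 # assumed to return the initial state tuple
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(Q[state], key=Q[state].get)

            next_state, reward, done = env.step(action)   # assumed step interface

            # Q-learning update: bootstrap from the best action value in the next state.
            best_next = 0.0 if done else max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```

A greedy policy can then be read off the learned table with max(Q[state], key=Q[state].get) at each step.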