Open Access
JNWPU, Volume 39, Number 5, October 2021, pp. 1057-1063. https://doi.org/10.1051/jnwpu/20213951057. Published online: 14 December 2021.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## 2 Algorithm procedure

1) Visual perception algorithm module

2) Decision and control algorithm module

The Actor network optimizes θ^μ by gradient ascent; the gradient with respect to θ^μ is computed as shown in Eq. (15).
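Conceptually, the update in Eq. (15) is the deterministic policy gradient: the critic's gradient with respect to the action is chained through the actor's parameters and ascended. A minimal toy sketch on a 1-D problem with a closed-form critic (all names here are illustrative, not the paper's networks):

```python
import numpy as np

# Toy deterministic policy gradient: assume the critic Q(s, a) = -(a - s)^2
# is known in closed form, and the actor is mu(s) = theta * s.
# The optimum is theta = 1 (i.e. a = s maximizes Q).
rng = np.random.default_rng(0)
theta = 0.0          # actor parameter (stands in for theta^mu)
lr = 0.1             # step size for gradient ascent

for _ in range(200):
    s = rng.uniform(-1.0, 1.0, size=32)   # batch of sampled states
    a = theta * s                         # deterministic actions mu(s)
    dq_da = -2.0 * (a - s)                # gradient of Q w.r.t. the action
    dmu_dtheta = s                        # gradient of mu w.r.t. theta
    grad_j = np.mean(dq_da * dmu_dtheta)  # batch estimate of grad_theta J
    theta += lr * grad_j                  # ascend, since J is maximized
```

In DDPG proper, `dq_da` comes from backpropagating through the critic network, and the update is applied by an optimizer such as Adam rather than plain gradient ascent.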

1. Initialize the DDPG parameters: θ, θ′, μ, μ′;
2. Build the YOLO target-localization training dataset S;
// Train the YOLO target-localization network
3. for episode = 1 to n1 do
4. Randomly sample a batch b from dataset S;
5. Train the YOLO network parameters m;
6. end for;
// Imitation-learning stage
7. for episode = 1 to n2 do
8. Randomly sample a batch b from the demonstration set D;
9. Train the DDPG network parameters θ, μ by supervised learning;
10. end for;
11. Imitation learning completes, yielding the initial policy A;
// Reinforcement-learning stage
12. for episode = 1 to n3 do
13. for t = 1 to T-1 do
14. Capture the input image i with the camera;
15. Locate the target's position in the image with the YOLO network;
16. Obtain the target coordinate information s_t via the perspective-transform algorithm;
17. Obtain the action a_t = A(s_t‖g) using policy A;
18. Execute a_t to obtain the new state s_{t+1} and receive the reward r_t;
19. Store (s_t‖g, a_t, r_t, s_{t+1}‖g) in the replay buffer R;
20. Resample new goals with the HER algorithm, compute their rewards, and store them in R;
21. end for;
22. for t = 0 to n4 do
23. Randomly sample a batch B from the experience replay buffer R;
24. Optimize policy A on B;
25. end for;
26. end for;
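Step 20 is the core of hindsight experience replay: after an episode, transitions are re-stored with a substituted goal (e.g. a state actually reached) and their rewards are recomputed under that goal. A minimal sketch of the "final" relabeling strategy, with a hypothetical transition layout and a sparse reward assumed for illustration:

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    # 0 on success, -1 otherwise (a common sparse formulation; assumed here)
    d = np.linalg.norm(np.asarray(achieved) - np.asarray(goal))
    return 0.0 if d < tol else -1.0

def her_relabel(episode, reward_fn=sparse_reward):
    """Replay an episode as if the goal had been the state reached at the
    end ('final' strategy), recomputing every reward under the new goal.
    Each transition is a dict with keys s, a, r, s_next, g (hypothetical)."""
    new_goal = episode[-1]["s_next"]
    return [
        {**tr, "g": new_goal, "r": reward_fn(tr["s_next"], new_goal)}
        for tr in episode
    ]
```

The relabeled transitions are stored in R alongside the originals, so even a failed episode contributes at least one transition with a non-penalty reward.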

Fig. 1 Perspective transformation yields the target's accurate XOY-plane coordinates relative to the stage
Fig. 2 Overall optimization process of the DDPG algorithm
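The perspective transformation in Fig. 1 maps pixel coordinates to stage-plane (XOY) coordinates via a 3×3 homography estimated from four point correspondences. A self-contained sketch of that estimation (in practice this is usually done with a library such as OpenCV's `cv2.getPerspectiveTransform`; the calibration points in the test below are made up):

```python
import numpy as np

def fit_homography(src, dst):
    """Solve for the 3x3 homography H with H[2,2] = 1 from four
    (pixel -> stage) point correspondences via the standard DLT system."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def pixel_to_stage(H, pt):
    """Apply H to a pixel coordinate and dehomogenize."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

With the four stage corners located in the image, the center of the detected target's bounding box can then be mapped to the XOY coordinates that form the state s_t.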

## 3 Experimental design and analysis

Fig. 3 Simulation environment for robotic-arm control in 3-D space

### 3.1 Target recognition and localization experiment

Fig. 4 Input data for YOLO target detection
Fig. 5 Localization loss, recognition loss, precision, recall, validation-set localization loss, validation-set recognition loss, mAP@0.5 and mAP@[0.5, 0.95]
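The mAP@0.5 metric in Fig. 5 counts a detection as a true positive when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, while mAP@[0.5, 0.95] averages AP over IoU thresholds from 0.5 to 0.95. A minimal IoU sketch for axis-aligned boxes (corner format `(x1, y1, x2, y2)` assumed):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```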

### 3.2 Reinforcement-learning policy control experiment

Fig. 6 Success-rate comparison of IL-DDPG-HER and DDPG-HER on the robotic-arm pick-and-place task
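The success rates compared in Fig. 6 are presumably measured by rolling out the learned policy over many pick-and-place episodes and counting those that end with the object within tolerance of the goal. A hedged sketch of such an evaluation loop (every interface here is hypothetical):

```python
import numpy as np

def success_rate(policy, reset, step, episodes=100, horizon=50, tol=0.05):
    """Fraction of episodes whose final state lies within `tol` of the goal.
    Hypothetical interfaces: reset() -> (state, goal);
    step(action) -> next_state; policy(state, goal) -> action."""
    hits = 0
    for _ in range(episodes):
        s, g = reset()
        for _ in range(horizon):
            s = step(policy(s, g))        # roll the policy out
        if np.linalg.norm(np.asarray(s) - np.asarray(g)) < tol:
            hits += 1                     # episode ended at the goal
    return hits / episodes
```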


