### LQR vs Deep learning

Deep learning is not only fun; it can also be a powerful tool for a wide variety of problems. One of our thesis projects investigated how deep learning algorithms can be used as an alternative to traditional control methods. The students built a unicycle as a platform for the comparison. Does it sound like a fun project? It was! Read the post for a deeper dive into the project.

#### Introduction

Deep learning (DL) is a hot topic and one of the most rapidly growing technical fields today. One of many subjects that could benefit from deep learning is control theory. The nonlinearities of a neural network let it represent a wider range of functions and adapt to more complex systems than a linear controller can. There has been significant progress in generating control policies in simulated environments using reinforcement learning, and such algorithms can robustly solve complex physical control problems in continuous action spaces. Even though these tasks are claimed to have real-world complexity, it is hard to find an example of such high-level algorithms in an actual application. Moreover, we have found that in most applications where these algorithms have been implemented, they were trained on the hardware itself. This not only places high demands on the hardware but can also be time-consuming or even practically infeasible for some systems. In these cases, a more efficient solution would be to train on a simulated system and transfer the algorithm to the real world.

Furthermore, one might wonder whether a traditional control method would perform better or worse on the same system. To judge how well the deep learning algorithm is actually performing, it is useful to compare it to another method on a similar control level.

The main purpose of this project was to provide an example of a fair comparison between a traditional control method and an algorithm based on DL, both run on a benchmark control problem. It should also demonstrate how algorithms developed in simulation can be transferred to a real physical system.

#### Design

Due to its unstable equilibrium point, the inverted pendulum is a commonly used benchmark in control theory. Many variations of this system exist, all based on the same principal dynamics. One example is a unicycle, whose principal dynamics can be viewed as an inverted pendulum in two dimensions. We therefore constructed a unicycle as the platform for our experiments.

Our main focus for the design was to keep it as lightweight and simple as possible. To emphasise the low hardware requirements, we chose the low-cost ESP32 microcontroller to act as the brain of our system. On it, we implemented all sensor fusion and communication to surrounding electronics necessary to easily test the two control algorithms on hardware. We dedicated one core specifically for the two control algorithms and added a button to switch between the two algorithms with a simple press.

For use in simulation and control synthesis, we derived a nonlinear continuous-time mathematical model using Lagrangian dynamics. The unicycle is modelled as three parts: the wheel, the body and the reaction disk, including the inertia from all components in the hardware. It has four degrees of freedom: the spin of the wheel, the movement of the system, the pitch of the system and the rotation of the disk. The external forces on the system come from the disk and wheel motors.
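As a rough reminder of what such a derivation looks like in general, the Lagrangian is formed from the kinetic energy $T$ and potential energy $V$, and the Euler-Lagrange equations give one equation of motion per degree of freedom. The symbols below are generic placeholders chosen for illustration, not the exact notation or model from the report:

$$
L(q, \dot{q}) = T(q, \dot{q}) - V(q), \qquad
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = \tau_i, \quad i = 1, \dots, 4
$$

Here $q$ collects the four generalized coordinates and $\tau_i$ is the generalized force acting on coordinate $q_i$, which in this system originates from the wheel and disk motors. Collecting terms typically yields the standard form $M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) = \tau$, which is convenient both for simulation and for linearizing around the upright equilibrium.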

#### Controller Synthesis

The infinite-horizon linear quadratic regulator (LQR) is a model-based optimal control problem whose solution is a state feedback controller. The feedback gain is determined offline by minimizing, from an arbitrary initial state, a weighted sum of states and inputs over a time horizon that tends towards infinity. The LQR problem is one of the most commonly solved optimal control problems. Because a mathematical model of the system was available, and given these characteristics, we implemented an LQR controller for this project.
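To make the idea of an offline-computed feedback gain concrete, here is a minimal sketch in plain Python. It is not the controller from the project: it assumes a simplified, linearized single-axis inverted pendulum, an Euler discretization, and the discrete-time form of LQR solved by iterating the Riccati recursion. The pendulum length, time step and the weights `Q` and `R` are hypothetical values picked for illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def _gain(A, B, P, R):
    """K = (R + B'PB)^-1 B'PA for a single input (R is a scalar)."""
    BtP = mat_mul(transpose(B), P)            # 1 x n
    s = R + mat_mul(BtP, B)[0][0]             # scalar
    return [v / s for v in mat_mul(BtP, A)[0]]


def dlqr_gain(A, B, Q, R, max_iter=10000, tol=1e-9):
    """Iterate P <- Q + A'PA - A'PB K until P stops changing."""
    n = len(A)
    At = transpose(A)
    P = [row[:] for row in Q]
    for _ in range(max_iter):
        K = _gain(A, B, P, R)
        AtP = mat_mul(At, P)
        AtPA = mat_mul(AtP, A)
        AtPBK = mat_mul(mat_mul(AtP, B), [K])
        P_next = [[Q[i][j] + AtPA[i][j] - AtPBK[i][j] for j in range(n)]
                  for i in range(n)]
        change = max(abs(P_next[i][j] - P[i][j])
                     for i in range(n) for j in range(n))
        P = P_next
        if change < tol:
            break
    return _gain(A, B, P, R)


# Hypothetical linearized pendulum: theta_ddot = (g / l) * theta + u
G_OVER_L = 9.81 / 0.5
DT = 0.01
A = [[1.0, DT], [G_OVER_L * DT, 1.0]]   # state: [pitch, pitch rate]
B = [[0.0], [DT]]
Q = [[10.0, 0.0], [0.0, 1.0]]           # state weights (assumed)
R = 1.0                                 # input weight (assumed)

K = dlqr_gain(A, B, Q, R)


def simulate(theta0, steps=500):
    """Run the closed loop u = -K x on the linear model; return final pitch."""
    theta, omega = theta0, 0.0
    for _ in range(steps):
        u = -(K[0] * theta + K[1] * omega)
        theta, omega = (theta + DT * omega,
                        omega + DT * (G_OVER_L * theta + u))
    return theta


final_theta = simulate(0.2)
print("gain:", K, "final pitch:", final_theta)
```

On the real unicycle the state vector is larger and there are two motor inputs, so the scalar division above would become a small matrix solve, but the structure is the same: the gain is computed once from the model and the runtime controller is just a matrix-vector product.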

For our deep learning control of the unicycle, we chose proximal policy optimization (PPO). The method builds on policy-based reinforcement learning, which offers practical ways of dealing with continuous spaces and an infinite number of actions. PPO has outperformed other policy-based algorithms in complex control tasks and is widely regarded as a state-of-the-art method for reinforcement learning in continuous spaces.
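The core of PPO is its clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. The function below is a minimal sketch of that objective for a batch of samples, written in plain Python. The function name and the default clip range of 0.2 are our choices for illustration, and a real implementation would compute this on tensors inside a deep learning framework.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch.

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    epsilon:    clip range (assumed default, commonly 0.1 to 0.3)
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# A large ratio on a good action is capped at (1 + epsilon) * advantage.
print(ppo_clip_objective([1.5], [1.0]))
# A small ratio on a bad action is capped at (1 - epsilon) * advantage.
print(ppo_clip_objective([0.5], [-1.0]))
```

The training loop maximizes this quantity, so once the probability ratio leaves the clip range in the direction that would improve the objective, the gradient for that sample vanishes.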

To make a long story short, we trained the algorithm by implementing the mathematical model of the unicycle in Python as an environment for the agent to train in. The actions the agent can take are the inputs to the two motors. After taking an action, the agent moves to a new state and receives a reward. After some millions of iterations of taking actions and receiving rewards, the agent eventually learns how to behave in this environment and arrives at a policy that stabilizes the unicycle.
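The action, state and reward loop described above can be sketched with a toy environment. This is not the unicycle model from the project: it assumes a single-axis inverted pendulum with one input, and the class name, reward weights, pitch limit and failure penalty are all hypothetical choices made for illustration.

```python
import math
import random


class PendulumBalanceEnv:
    """Toy single-axis balancing environment (illustrative stand-in)."""

    GRAVITY = 9.81

    def __init__(self, length=0.5, dt=0.01, max_pitch=math.radians(30),
                 max_action=20.0, seed=None):
        self.length = length
        self.dt = dt
        self.max_pitch = max_pitch      # episode ends beyond this pitch
        self.max_action = max_action    # actuator saturation
        self.rng = random.Random(seed)
        self.state = (0.0, 0.0)         # (pitch, pitch rate)

    def reset(self):
        """Start near upright with a small random pitch."""
        self.state = (self.rng.uniform(-0.05, 0.05), 0.0)
        return self.state

    def step(self, action):
        """Apply an action, integrate one time step, return the transition."""
        u = max(-self.max_action, min(action, self.max_action))
        theta, omega = self.state
        alpha = (self.GRAVITY / self.length) * math.sin(theta) + u
        omega += alpha * self.dt
        theta += omega * self.dt
        self.state = (theta, omega)
        done = abs(theta) > self.max_pitch
        if done:
            reward = -100.0             # assumed penalty for falling over
        else:
            # Assumed quadratic cost on pitch, pitch rate and effort.
            reward = -(theta ** 2 + 0.1 * omega ** 2 + 0.001 * u ** 2)
        return self.state, reward, done


def rollout(env, policy, max_steps=1000):
    """Run one episode and return the accumulated reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total


env = PendulumBalanceEnv(seed=0)
print("return with zero action:", rollout(env, lambda state: 0.0))
```

A reinforcement learning library would call `reset` and `step` in exactly this pattern, replacing the fixed policy with the network being trained and using the collected rewards to update it.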

#### Results

Both methods successfully stabilized the system. The LQR outperformed the PPO in most respects where the hardware did not limit the control. For example, in practice the LQR recovered from a maximal pitch deviation of 28 degrees, compared to 20 degrees for the PPO. We observed this suboptimal behaviour of the PPO in several situations. Another example can be seen when applying an external impulse to the system.

Figure 2

As can be seen, the LQR handles the impulse in a largely expected way, while the PPO goes its own way.

This unexpected behaviour is not desirable for this system, but we think it could be beneficial for other systems, for example those where the optimal behaviour is unspecified or even unknown. However, for systems with a specified, known optimal or expected behaviour, we would recommend the good old LQR, if applicable.

Even when exposed to model errors, the PPO did not show any sign of unreliability compared to the LQR in states it had encountered during training. However, when introduced to unknown states, the performance of the PPO suffered. Keeping the limits of the training environment general enough should prevent this from becoming an issue. Still, when dealing with systems with large or even unknown state limits, LQR is probably the safer option.

We believe our project provides a good and fair comparison between these two methods on the same system, as well as an informative example of how a DL algorithm trained in simulation can be transferred to a real physical system. The unicycle is of course only one example of such a system, but we feel we encountered a lot of interesting features that can be generalized and used to benefit other projects. If you have doubts, please read our report!