Policy Optimization in Reinforcement Learning: A Tale of Preconditioning and Regularization
Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). Beyond value maximization, other practical considerations commonly arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term, such as Shannon entropy, Tsallis entropy, or a log-barrier function. Focusing on an infinite-horizon discounted Markov decision process, this talk first shows that entropy-regularized natural policy gradient methods converge globally at a linear rate that is nearly independent of the dimension of the state-action space, whereas the vanilla softmax policy gradient method may take exponential time to converge. Next, a generalized policy mirror descent algorithm is proposed to accommodate a general class of convex regularizers beyond Shannon entropy, even when the regularizer lacks strong convexity and smoothness. Time permitting, we will discuss how these ideas can be leveraged to solve zero-sum Markov games. Our results accommodate a wide range of learning rates and shed light on the role of regularization in enabling fast convergence in RL.
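To make the setting concrete, below is a minimal sketch (not the speaker's code) of entropy-regularized natural policy gradient on a small, randomly generated tabular MDP. The notation (regularization weight tau, learning rate eta, discount gamma) and the specific multiplicative form of the update under the softmax parameterization are our own assumptions for illustration, not taken from the abstract.

```python
import numpy as np

# Toy tabular MDP (an assumed example, not from the talk).
rng = np.random.default_rng(0)
S, A, gamma, tau, eta = 5, 3, 0.9, 0.1, 0.5   # states, actions, discount, reg. weight, step size

P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s']: transition probabilities
r = rng.uniform(size=(S, A))                  # rewards in [0, 1]

def soft_q(pi, iters=1000):
    """Entropy-regularized policy evaluation: soft Q-function of policy pi."""
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * (P @ V)                                    # soft Q[s, a]
        V = (pi * (Q - tau * np.log(pi + 1e-12))).sum(axis=1)     # soft V[s]
    return Q

pi = np.full((S, A), 1.0 / A)                  # uniform initial policy
for t in range(200):
    Q = soft_q(pi)
    # One common form of the soft NPG update (hedged; our assumption):
    # pi_new(a|s)  proportional to  pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta * Q(s,a) / (1-gamma))
    logits = (1 - eta * tau / (1 - gamma)) * np.log(pi + 1e-12) + eta * Q / (1 - gamma)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions per state:", pi.argmax(axis=1))
```

In this sketched form, choosing eta = (1 - gamma) / tau makes the old-policy term vanish, so each iteration reduces to a softmax of Q / tau (soft policy iteration); smaller step sizes geometrically mix the previous policy with that softmax.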
Date and Time
- Date: 23 Mar 2022
- Time: 12:00 PM to 01:00 PM
- All times are (GMT-05:00) US/Eastern
Location
- 1000 River Road
- Teaneck, New Jersey
- United States 07666
- Building: Muscarelle Center
- Room Number: M105
Hosts
- Co-sponsored by the North Jersey Section
Registration
- Starts: 16 February 2022, 07:00 AM
- Ends: 23 March 2022, 12:00 PM
- All times are (GMT-05:00) US/Eastern
- No Admission Charge
Speakers
Dr. Yuejie Chi, Department of Electrical and Computer Engineering, Carnegie Mellon University
Policy Optimization in Reinforcement Learning: A Tale of Preconditioning and Regularization
Biography:
Dr. Yuejie Chi is a Professor in the Department of Electrical and Computer Engineering, and a faculty affiliate with the Machine Learning Department and CyLab, at Carnegie Mellon University. She received her Ph.D. and M.A. from Princeton University, and her B.Eng. (Hon.) from Tsinghua University, all in Electrical Engineering. Her research interests lie in the theoretical and algorithmic foundations of data science, signal processing, machine learning, and inverse problems, with applications in sensing and societal systems, broadly defined. Among other honors, Dr. Chi received the Presidential Early Career Award for Scientists and Engineers (PECASE) and the inaugural IEEE Signal Processing Society Early Career Technical Achievement Award for contributions to high-dimensional structured signal processing, and she held the inaugural Robert E. Doherty Early Career Development Professorship. She was named a Goldsmith Lecturer by the IEEE Information Theory Society and a Distinguished Lecturer by the IEEE Signal Processing Society.