Why do optimal control and reinforcement learning use different notation?

landdenaw 2022-06-24 Answered
Why do optimal control and reinforcement learning use different notation?
In optimal control, state is $x$, control is $u$, and dynamics are $\dot{x} = f(x, u)$. In reinforcement learning, state is $s$, action is $a$, and dynamics are $s' \sim P(s' \mid s, a)$.
I'm curious why these two fields, which are so similar, use such different notation. I've heard that one reason is that optimal control has roots in Russian, where u is the beginning of a relevant word (s for state, a for action, etc.), but I haven't been able to find any paper or textbook that describes the reason for the notational divergence. If you have an answer with a source, that would be most helpful!

Answers (1)

Answered 2022-06-25 Author has 16 answers
First of all, notation is often not standardized in scientific papers. It changes from one field to another, from one author to another within the same field and, even worse, sometimes from one paper to another by the same author. The only time things get standardized is when people build courses and textbooks, so that readers are not confused. Students, in particular, tend to be very easily confused by a slight change of notation.
I have not seen any paper discussing the difference in notation, and I do not think there will be any. Why? Because most people do not care much about such details. As said before, notation for variables is not standardized and, in the end, conveys little meaning. I personally tend to choose notation that is as self-explanatory as I can make it, using my own conventions, which I try to keep constant but update when I find a problem.
Most people are, however, very interested in the origin of the names of objects and tools, as there is often a nice story behind them. For instance, why did Bellman name his theory "Dynamic Programming"? Or where does the term "Martingale" come from? So you will likely find some resources addressing those questions.
In any case, the main reason for the different languages and notations is not that "optimal control is Russian" (which is not completely true, as it is also American) but that optimal control belongs to control theory (which emerged from control engineering, which, in turn, emerged from electrical engineering), whereas reinforcement learning arose from computer science. The former is (at least originally) essentially continuous, whereas the latter is essentially discrete.
In the early works of Pontryagin (Pontryagin's maximum principle) and Bellman (Dynamic Programming), both the state space and the input space are continuous (in fact, Bellman's first works were in discrete time rather than continuous time, but that changed soon after). The cost is, therefore, also continuous. Why? Because control theory was (and still is) addressing the control of continuous processes such as mechanical and electrical processes, all described by differential equations. This was later refined by considering many classes of systems such as discrete-time or hybrid systems and, perhaps more importantly, stochastic systems. This is still an active field of research, and one I am currently working in.
From my reading of the early papers by Lev Pontryagin and Richard Bellman, the state was always called x and the input u. This is now standard all over the planet, at least in control. Sometimes u is replaced by v, but this is close enough. I would say it is easy to speculate that this uniformity is simply due to education: people were taught that way and kept the notation.
As for the origin of u, I would tend to say that it might be because, in the control of electrical motors, the control input is a voltage, which is denoted by u or v (or even U or V), but I have no source for such an origin. I have checked old control textbooks, and I could not find one where the input was not denoted by those letters. I found some old papers by Popov where the input is not denoted by those letters, but the input is not really a control input there.
Reinforcement learning addressed problems in the discrete-time domain with discrete state and output spaces. Why? Because in computer science everything is discrete. The probabilistic setting is also typical of computer science, especially communication networks (i.e. queueing processes). Many of those constraints have since been relaxed to make the theory broader but, in the end, reinforcement learning is a set of tools and methods for solving stochastic optimal control problems. In fact, the value function used in reinforcement learning comes directly from Bellman's theory of Dynamic Programming, which serves as a basis for solving optimal control problems. The main difference is that RL is applied to the control of systems that are not necessarily physical processes (e.g. playing video games), which were not of interest to control engineers. Also, it took quite some time before computers were able to run those algorithms in a realistic manner. Another point is that RL addresses problems for which you only have a very loose model, which justifies the probabilistic point of view, whereas in control we rely heavily on models (at least until very recently, as there is now a lot of research on model-free control and data-driven methods, which was actually fueled by the recent successes of ML/DL/RL).
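To make the correspondence between the two notations concrete, here is a minimal Python sketch, not taken from any of the works mentioned above. It assumes a hypothetical double-integrator system, a forward-Euler discretization, and additive Gaussian noise as the source of randomness; the function names (`f`, `euler_step`, `sample_next_state`) and the parameters `dt` and `noise_std` are made up for illustration.

```python
import numpy as np

# Control-theory notation: state x, input u, dynamics x_dot = f(x, u).
def f(x, u):
    """Hypothetical double integrator: x = [position, velocity], u = force."""
    return np.array([x[1], u])

def euler_step(x, u, dt=0.1):
    """Discretize x_dot = f(x, u) with one forward-Euler step (an assumed scheme)."""
    return x + dt * f(x, u)

# RL notation: state s, action a, next state s' ~ P(. | s, a).
def sample_next_state(s, a, rng, dt=0.1, noise_std=0.0):
    """Sample s' from P(. | s, a): the same Euler step plus assumed Gaussian noise."""
    return euler_step(s, a, dt) + noise_std * rng.standard_normal(s.shape)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])

x_next = euler_step(x, u=2.0)                             # control view
s_next = sample_next_state(x, 2.0, rng)                   # RL view, no noise
s_noisy = sample_next_state(x, 2.0, rng, noise_std=0.5)   # RL view, stochastic
```

With `noise_std=0.0` the two views coincide: the transition distribution collapses to a single point, and `s_next` equals `x_next`. With nonzero noise, the same underlying model is simply being sampled rather than evaluated.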
In the end, different notations, different tools, different fields, but the same ideas and goals. It is unfortunate that those fields do not have more connections, as that would save energy and time for everybody. There is no need to reinvent the wheel, but RL is a hot topic at the moment and optimal control is not.
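As an illustration of "same ideas", the discrete-time Bellman optimality equation can be written in either notation. This is a sketch under assumptions: the reward $r$, discount factor $\gamma$, stage cost $g$, disturbance $w$, and cost-to-go $J$ are common textbook symbols that do not appear in the question, and $f$ here denotes a discrete-time map with a disturbance argument rather than the continuous-time $f$ of $\dot{x} = f(x, u)$.

```latex
% RL notation: state s, action a, reward r, discount \gamma (r, \gamma assumed)
V^*(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]

% Control notation: state x, input u, stage cost g, disturbance w (g, w, J assumed)
J^*(x) = \min_{u} \Big[ g(x,u) + \gamma\, \mathbb{E}_{w}\, J^*\big(f(x,u,w)\big) \Big]
```

Under the identifications $x \leftrightarrow s$, $u \leftrightarrow a$, $g = -r$ and $J^* = -V^*$, with the randomness of $w$ inducing the transition distribution $P$, the two lines state the same recursion.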
In this talk by Michael Jordan, he answers a question about optimal control vs. reinforcement learning. The question is at 45:30, and the answer sums up pretty much everything.


You might be interested in

asked 2021-02-23
Interpreting z-scores: Complete the following statements using your knowledge about z-scores.
a. If the data is weight, the z-score for someone who is overweight would be
b. If the data is IQ test scores, an individual with a negative z-score would have a
-high IQ
-low IQ
-average IQ
c. If the data is time spent watching TV, an individual with a z-score of zero would
-watch very little TV
-watch a lot of TV
-watch the average amount of TV
d. If the data is annual salary in the U.S and the population is all legally employed people in the U.S., the z-scores of people who make minimum wage would be
asked 2022-06-07
How to interpret the imaginary part of an inverse fourier transform
The fourier transform of arbitrary real data can (usually will) result in complex data. If the real data represents samples in time, then the complex FT data represent frequencies with the magnitudes representing the amplitude and the phase angles representing the phase.
If I perform some arbitrary manipulation on the complex FT data prior to applying an inverse fourier transform, the result of the IFT can also be complex. My question is, how do I interpret the imaginary part of the IFT result? My guess is that a complex IFT result is non-physical, and that this must imply that my original arbitrary manipulation was also therefore non-physical.
Is this correct? If so, are there equations which describe whether such a manipulation would be non-physical.
asked 2022-07-04
Generating Sets From Given Information
The problem I am working on is:
The three most popular options on a certain type of new car are a built-in GPS (A), a sunroof (B), and an automatic transmission (C). If 40% of all purchasers request A, 55% request B, 70% request C, 63% request A or B, 77% request A or C, 80% request B or C, and 85% request A or B or C, determine the probabilities of the following events. [Hint: “A or B” is the event that at least one of the two options is requested; try drawing a Venn diagram and labeling all regions.]
a. The next purchaser will request at least one of the three options.
b. The next purchaser will select none of the three options.
c. The next purchaser will request only an automatic transmission and not either of the other two options.
d. The next purchaser will select exactly one of these three options.
I am absolutely positive that $P(A) = 40\%$, $P(B) = 55\%$, $P(C) = 70\%$, and $P(A \cup B \cup C) = 85\%$. However, the pieces of data I am not quite certain about are $P((A \cup B) \setminus C) = 63\%$, $P((A \cup C) \setminus B) = 77\%$, $P((B \cup C) \setminus A) = 80\%$. Do these values correspond to the rest of the data? If so, then it seems nearly impossible to generate the Venn diagram. Could someone help?
EDIT: What I am having a difficult time interpreting is, when they say in the question, "...63% request A or B." To me, that says only A or only B; and under this interpretation I would write $P((A \cup B) \setminus C) = 63\%$. Under André Nicolas' interpretation, "63% request A or B" means $P(A \cup B) = 63\%$. If it is the case that André Nicolas is correct, then it seems like they should have stated in the question, "63% request A or B, A and C, B and C, or A and B and C."
Also, I solved the problem under André Nicolas' assumption, and for part d), I know the answer but I am not sure how to put it in math symbols. How would I do that?
asked 2020-12-15
A racquetball strikes a wall with a speed of 30 m/s and rebounds with a speed of 26 m/s. The collision takes 20 ms. What is the average acceleration of the ball during the collision?
a) 62
b) 50
c) 66
d) 54
e) 58
asked 2022-06-22
Interpreting the differences of two log normal distributions:
I have read a couple of posts and did not see the exact interpretation. I apologize in advance if this is not in the right location.
Purpose: I am preparing a paper on the distribution of litter densities along the shoreline in freshwater environments. The data is collected by hand, and the individual pieces are classified and counted by volunteers. These programs exist in many countries and there are fairly large data sets.
The units are expressed as 'pieces of trash/meter (or foot) of shoreline'.
- The data is collected in the same manner
- The volunteers have the same motivations
- There is no (under counting or over counting)
- Accuracy is basically the same across the spectrum
- The math is correct
- The graph below represents two sets of data:
MCBP: regional results for Lake Geneva (Switzerland), n=100 samples
SLR: results from 'The Swiss Litter Report', n=365 samples
The following code was used to calculate the distributions and present the graphs from a DataFrame in pandas/Python 3.6:
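import numpy as np
from scipy import stats
# df is assumed to be an existing DataFrame with 'Total' and 'Length' columns
df['Density'] = df['Total'] / df['Length']
df['Logs'] = df['Density'].apply(np.log)  # log-transform the skewed data to get it close to normal
mu, sigma = stats.norm.fit(df['Logs'])  # mean and standard deviation of the fitted normal
# repeat for df2 to get the second curve
# build histograms for the two data sets
# plot the two distributions where x = df['Logs']
# and y = stats.norm.pdf(x, loc=mu, scale=sigma)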
The resulting two distributions
mu for the SLR distribution is 0.1564617, which is equal to the 5th percentile of the MCBP distribution.
I am interpreting this as meaning:
There is a 5% probability that a sample from MCBP will be less than the average from SLR.
There is a 95% probability that a sample taken from the MCBP region will be greater than the national average.
In general, I can expect litter densities to be greater in the MCBP region than in the SLR region.
Is this interpretation correct? (It does correlate with observations)
asked 2020-10-28
Given any two numbers, which is greater, the LCM of the numbers or the GCF of the numbers? Why?
asked 2022-06-10
Data that follows a distribution with non-finite expectation
I find it difficult to get my head around the idea that some random variables follow distributions (such as the Cauchy distribution) that do not have finite means.
How does one actually interpret data from such random variables? Do we give up and not investigate sample means or ...? Please help clarify this.
