Sutton R.S., Barto A.G. — Reinforcement Learning





Title: Reinforcement Learning

Authors: Sutton R.S., Barto A.G.

Abstract:

In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability.
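For orientation, the subject index below lists Q-learning among the book's central algorithms. The following is a minimal illustrative sketch of tabular Q-learning with epsilon-greedy exploration in Python; it is not the book's own code, and the environment interface (reset(), step(), actions) and all parameter values are assumptions made for this example.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning (a sketch; `env` interface is assumed).

        `env` is assumed to expose reset() -> state,
        step(action) -> (next_state, reward, done), and a list `actions`
        of discrete actions.
        """
        Q = defaultdict(float)  # Q[(state, action)], initialized to 0.0

        def greedy(state):
            # action with the highest current estimated value in this state
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy action selection
                if random.random() < epsilon:
                    action = random.choice(env.actions)
                else:
                    action = greedy(state)
                next_state, reward, done = env.step(action)
                # Q-learning update: bootstrap from the best next action
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q

Because the update bootstraps from the maximum action value of the next state rather than from the action actually taken, the method learns about the greedy policy while following an exploratory one (compare the "Off-policy methods, Q-learning as" entry in the index below).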


Language: en

Category: Mathematics/

Subject index status: Index with page numbers is complete


Year of publication: 1998

Number of pages: 322

Added to catalog: 10.12.2005

Subject index
N-step backups      164—166 191
N-step returns      165—166 191
N-step TD methods      164—168
N-step TD methods and Monte Carlo methods      164—166 170—171 172
N-step TD methods and TD($\lambda$)      169—172 191
N-step TD methods, problems with      167
Naive Q($\lambda$)      184
Neural networks      21 22 23 225 263—266
Neuroscience      22 192
Non-Markov problems      63 64 153 190—191 258—259
Nonassociative problems and methods      20 25 45 284
Nonbootstrapping methods      220—222
Nonstationarity      30 38—39 128 195
Off-line updating      166
Off-policy methods      126—127 211
Off-policy methods, DP as      216
Off-policy methods, Monte Carlo      124—128
Off-policy methods, problems with bootstrapping in      216—220 222
Off-policy methods, Q-learning as      182
On-line updating      166 175 192
On-policy distribution      196 201 216—217
On-policy distribution vs. uniform distribution      247—249
On-policy methods      122 130
On-policy methods vs. off-policy methods      130 148—150
On-policy methods, Monte Carlo      122—124
On-policy methods, Sarsa as      145—148 179—181
One-step methods      164
One-step methods, backups diagrams for      243
Optimal control      16—17 83—84
Optimality and approximation      80—81
Optimality and value functions      75—80
Optimality of TD(0)      141—145
Optimistic initial values      39—41 215
Partially observable MDPs (POMDPs)      17 49 258—259
Pattern recognition      4 18
Peng's Q($\lambda$)      182 183—185(fig.) 192
Petroleum refinery: example      6—7
Planning      9 227—254 259—260
Planning and DP      229
Planning and heuristic search      250—252
Planning and learning      9 227—254
Planning and reinforcement learning      5 9
Planning and trajectory sampling      246—250
Planning by Q-learning      229—230
Planning in Dyna      230—238
Planning, deliberation vs. reaction in      9 231
Planning, incremental      230 253
Planning, integrated with action and learning      230—238
Planning, partial-order      228
Planning, state-space vs. plan-space      228
Pleasure and pain      7—8 57
Pole-balancing      59 64 83 202 221(fig.) see
Policy      52
Policy evaluation      90—93 94(fig.)
Policy evaluation by DP      90—93
Policy evaluation by Monte Carlo methods      112—117
Policy evaluation by TD methods      133—146
Policy evaluation for action values      116—117 145—146
Policy evaluation of one policy while following another      124—126
Policy evaluation, iterative      91—92
Policy evaluation, termination of      92
Policy improvement      93—98
Policy improvement and Monte Carlo methods      118—119
Policy improvement, theorem of      95—97
Policy iteration      97—100 see
Policy iteration by Monte Carlo methods      118—119
Policy, $\epsilon$-greedy      122
Policy, behavior      126
Policy, deterministic      92 291
Policy, equiprobable random      93
Policy, estimation      126
Policy, greedy      77 96 119
Policy, optimal      75
Policy, soft, $\epsilon$-soft      100 122
Policy, stochastic      52
POMDPs      see "Partially observable MDPs"
Prediction problem      90
Prior knowledge, using      14 56 260
Prioritized sweeping      239—240 241(fig.) 253 254
Prototype states      210
Psychology and reinforcement learning      16 48 254 see "Law "Secondary
Psychology and shaping      260
Psychology and stimulus traces      192
Psychology and stimulus-response associations      7
Pursuit methods      43—45
Q($\lambda$)      182—185 192
Q-functions      84 see
Q-learning      23 148—151 159 224 277
Q-learning and eligibility traces      182—185 192
Q-learning and planning      229—230
Q-learning, convergence of      148 159 192 216 218
Q-planning      229—230 231—232
Quasi-Newton methods      223
Queuing example, access-control      154—155 157
R-learning      153—155 160
Racetrack: example      127—128 131
Radial basis functions (RBFs)      208 224
Random walk: examples with n-step methods on      166—168
Random walk: examples, $\lambda$-return algorithm on      172
Random walk: examples, 19-state      167—168 172 178—179
Random walk: examples, batch updating on      142
Random walk: examples, five state      139—142 166—167
Random walk: examples, TD vs. MC methods on      139—142
Random walk: examples, TD($\lambda$) on      178—179
Random walk: examples, values learned by TD(0) on      140(fig.)
Random-sample one-step tabular Q-planning      229—230
RBFs      see "Radial basis functions"
Reactive decision-making      231
Real-time DP      254
Real-time heuristic search      254
Recursive-least-square methods      223
Recycling robot: examples      6—7 55 56 66—68 78 83
Reference reward      41—42
REINFORCE algorithms      226
Reinforcement comparison methods      41—43 49 152
Reinforcement learning problem      4 51—85
Replacing traces or Replace-trace methods      186—189 221(fig.) see
Residual-gradient methods      224
Return      57—60 81 257
Return and value functions      68—69 75
Return, n-step, or corrected n-step truncated      165
Return, unified notation for      60—61
Rewards      7—8 51—52 53 54 81
Rewards and goals      56—57
Rewards in the n-armed bandit problem      26
Rewards, reference      41
RMS error      see "Root mean-squared error"
Robot examples      52 53 56 57 59 202 270
Robot examples, pick and place      54
Robot examples, recycling      55 56 66—68 78 83
Robot examples, trash collecting      6—7
Rod maneuvering: example      240—242
Root mean-squared error      141(fig.)
Rosenblatt, Frank      19
Rubik's cube: example      53
Sample backups      135 242 246 255
Sample models      129 227—228
Sample-average methods      28
Samuel, Arthur      21 22 109 225 see
Sarsa      145—148 159 210
Sarsa($\lambda$)      179—181 192
Sarsa($\lambda$) and TD($\lambda$)      179—180
Sarsa($\lambda$), linear, gradient-descent      211—213 224
Search control      232
Secondary reinforcers      21
Selectional learning      18 19 20
Selective bootstrap adaptation      20 225—226
Self-play      14 266 268
Semi-Markov decision process      276—277 280—281
Sequential design of experiments      48 83
Shannon, Claude      21 84 267—268
Shortcut maze: example      236—238
Signature tables      225—270
Simulated experience      111
SNARCs (Stochastic Neural-Analog Reinforcement Calculators)      18
Soap bubble: example      116 131
Softmax method      30—31 49
Space shuttle payload processing (SSPP): case study      284—286
Sparse distributed memory      224
State aggregation      199—200 202 223 224 225 259
State nodes      68
State representations      61—62
State representations, augmenting      258—259
State representations, exploiting structure in      260
State values or State value functions      68—73 75—77 84
State(s)      52—55 61—65 81 83 see
State-action eligibility traces      179—180 188
STeLLA system      19
Step-size parameters      12—13 37 38—39
Stochastic approximation convergence conditions      39
Stochastic approximation methods      33
Supervised algorithm      33
Supervised learning      4 19 32 193 195
Sweeps      91 246 see
Symmetries: in Tic-tac-toe      14—15
Synchronous algorithms      110
System identification      254 see
Tabular methods      80—81
Target or Target function      57 134 164—165 194—195
TD error      152 174—175 178 180
TD learning      see "Temporal-difference learning"
TD($\lambda$)      22 169—175 191—192
TD($\lambda$) and Sarsa($\lambda$)      179—180
TD($\lambda$), backward view of      173—175
TD($\lambda$), convergence of      191—192 201
TD($\lambda$), forward view of      169—173
TD($\lambda$), function approximation and      198—200 201 223—224
TD(0)      23 134 135(figs.) 159
TD(0) and Sarsa      145
TD(0) and TD($\lambda$)      175
TD(0) vs. constant-$\alpha$ MC      139—142
TD(0), $\lambda$-return algorithm and      171
TD(0), convergence of      138 159
TD(0), function approximation and      194 223
TD(0), optimality under batch updating of      141—145
TD(1)      175 188 216
TD-Gammon      14 261—267
TD-Gammon and heuristic search      251
TD-Gammon and Samuel's checker player      268
TDNN      see "Time-delay neural network"
Temperature      31 42 49 278 285
Temporal-difference error      see "TD error"
Temporal-difference learning (TD learning)      133—160 256
Temporal-difference learning (TD learning) and GPI      157—158
Temporal-difference learning (TD learning), actor-critic methods and      151—153
Temporal-difference learning (TD learning), advantages of      138—145
Temporal-difference learning (TD learning), bootstrapping and      138
Temporal-difference learning (TD learning), convergence of      138
Temporal-difference learning (TD learning), DP and      134 138
Temporal-difference learning (TD learning), eligibility traces and      163 190—191
Temporal-difference learning (TD learning), Monte Carlo methods and      133—145 175 188
Temporal-difference learning (TD learning), n-step      164—168
Temporal-difference learning (TD learning), off-policy control and      148—151
Temporal-difference learning (TD learning), on-policy control and      145—148
Temporal-difference learning (TD learning), one-step      148 158 159 164 179
Temporal-difference learning (TD learning), roots of      21—23 158—159
Temporal-difference learning (TD learning), Tic-tac-toe example      12—13
Terminal state      58
Testbed, 10-armed      28
Thorndike, Edward      17—18
Tic-tac-toe: example      10—15 see
Tile coding      204—208 215 224 272—273
Time steps      52
Time-delay neural network (TDNN)      287—289
Trace-decay parameter      173 189—190
Trajectory sampling      246—250 253 254
Transient DP      254
Transition graphs      67—68
Transition probabilities      66
Trial-and-error learning      16 17—21
Tsitsiklis and Van Roy's counterexample      219—220
Undiscounted continuing tasks      61 153—155 160 224
Value functions      8—9 68—80 82 229 286
Value functions and DP      89—90
Value functions, function approximation and      193
Value functions, optimal      75—80
Value functions, policy evaluation and      90—93
Value functions, roots of      16 84
Value iteration      100—103
Value(s)      8 69 84
Value(s), in the n-armed bandit problem      26
Value(s), relative      154
Vision system: example      65
Watkins's Q($\lambda$)      182—184 192 211 213 224
Werbos, Paul      23 84 109 159 223
Widrow, Bernard      19 20 49 131 223 225