\documentclass[11pt]{article}
\usepackage{amsmath, amsfonts, amsthm}
\usepackage{amssymb}
\usepackage{fancyhdr,parskip}
\usepackage{fullpage}
\usepackage{mathrsfs}
\usepackage{mathtools}
\usepackage{float}
\usepackage{graphicx} % needed for \includegraphics below
\usepackage{hyperref}

%%
%% Stuff above here is packages that will be used to compile your document.
%% If you've used unusual LaTeX features, you may have to install extra packages by adding them to this list.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setlength{\headheight}{15.2pt}
\setlength{\headsep}{20pt}
\pagestyle{fancyplain}

%%
%% Stuff above here is layout and formatting. If you've never used LaTeX before, you probably don't need to change any of it.
%% Later, you can learn how it all works and adjust it to your liking, or write your own formatting code.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%
% These commands create theorem-like environments.
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{prop}[theorem]{Proposition}
\newtheorem{defn}[theorem]{Definition}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% This section contains some useful macros that will save you time typing.
%%
% Using \displaystyle (or \ds) in a block of math has a number of effects, but most notably, it makes your fractions come out bigger.
\newcommand{\ds}{\displaystyle}

% These lines are for displaying integrals; typing \dx will make the dx at the end of the integral look better.
\newcommand{\is}{\hspace{2pt}}
\newcommand{\dx}{\is dx}

% These commands produce the fancy Z (for the integers) and other letters conveniently.
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\T}{\mathcal{T}}
\newcommand{\B}{\mathcal{B}}

% for the fancy empty set character
\renewcommand{\emptyset}{\varnothing}

% customized commands for future assignments
\newcommand{\imply}{\Rightarrow}
\def\P{\mathscr{P}}
\def\L{\mathscr{L}}
\def\M{\mathscr{M}}
\DeclarePairedDelimiterX{\inp}[2]{\langle}{\rangle}{#1, #2}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% This is the header. It will appear on every page, and it's a good place to put your name, the assignment title, and stuff like that.
%% I usually leave the center header blank to avoid clutter.
%%

\fancyhead[L]{\textbf{CSE5100 Homework 3}}
\fancyhead[C]{\empty}
\fancyhead[R]{Zheyuan Wu}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Actual math starts here!

% Use an enumerated list to write up problems. First we begin a list.
\textbf{Use of GenAI}

This homework was completed with the help of the Windsurf VS Code extension (\url{https://windsurf.com/}).

What was used:

\begin{itemize}
\item Using the autocomplete feature, under human supervision, to generate syntactically correct \LaTeX{} code (each press of the tab key filled no more than 100 characters, and at most $20\%$ of the predicted text was adapted).
\item Using AI to debug the \LaTeX{} code and find unclosed parentheses or other syntax errors.
\item Using AI to autofill parts that follow the same structure as previous parts (for example, case-by-case proofs).
\item Using AI to auto-correct misspelled words or \LaTeX{} commands.
\end{itemize}

What was not used:

\begin{itemize}
\item Using AI directly to generate the solutions in the \LaTeX{} document.
\item Asking AI for hints or solutions to the problems.
\item Selecting part of the document and asking AI to fill in the missing parts.
\end{itemize}

\newpage
\begin{enumerate}
\item [1.3] Deliverables

\begin{enumerate}
\item [1.3.1]
Create two graphs:

\begin{itemize}
\item In the first graph, compare the learning curves (average return vs. number of environment steps) for the experiments running with batch size of 1000. (The small batch experiments.) (15 pts)

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p1311.png}
\caption{Learning Curves for Batch Size of 1000}
\end{figure}

\item In the second graph, compare the learning curves for the experiments running with batch size of 4000. (The large batch experiments.) (15 pts)

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p1312.png}
\caption{Learning Curves for Batch Size of 4000}
\end{figure}

Note that the x-axis should be number of environment steps, not number of policy gradient iterations.
\end{itemize}
\item [1.3.2]
Answer the following questions briefly:

Provide the exact command line configurations you used to run your experiments, including any parameters changed from their defaults.

The best configuration in both the small and large batch size cases should converge to a maximum score of 500.

\begin{itemize}
\item Which value estimator has better performance without advantage normalization: the trajectory-centric one, or the one using reward-to-go? Why? (10 pts)

The reward-to-go estimator performs better without advantage normalization.

Reward-to-go gives finer-grained credit assignment: it estimates the value of the state-action pair at timestep $t$ using only the rewards collected from $t$ onward, dropping the earlier rewards that the action cannot influence. This removes a source of noise and lowers the variance of the gradient estimate (the two estimators are written out after this list).

\item Did advantage normalization help? (10 pts)

Yes, advantage normalization helped.

Normalizing the advantages to zero mean and unit variance keeps the scale of the policy gradient roughly constant across iterations, which acts like an adaptive step size, stabilizes learning, and prevents overly large updates driven by a few high-return trajectories.

\item Did the batch size make an impact? (10 pts)

Yes, the batch size made an impact.

A larger batch size averages the gradient over more data in each update, which reduces the variance of the estimate and helps the agent converge to a better policy, especially when reward-to-go and advantage normalization are used.
\end{itemize}
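
For reference, here is a short sketch of the two estimators in the standard policy-gradient notation ($\pi_\theta$ is the policy, $s_{i,t}$, $a_{i,t}$, and $r$ are the states, actions, and rewards of the $i$-th sampled trajectory; the homework code may differ in minor details such as discounting). The trajectory-centric estimator weights every action in a trajectory by the total return, while reward-to-go weights the action at timestep $t$ only by the rewards from $t$ onward; averaging over a batch of $N$ trajectories also scales the variance of the estimate down by roughly $1/N$:
\begin{align*}
\nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=0}^{T-1} r(s_{i,t'}, a_{i,t'})\right) && \text{(trajectory-centric)}\\
\nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'})\right) && \text{(reward-to-go)}
\end{align*}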

\end{enumerate}

\newpage

\item [2.3] Deliverables

\begin{enumerate}
\item [2.3.1] Plot a learning curve for the baseline loss. (5 pts)

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p231.png}
\caption{Learning Curve for Baseline Loss for Batch Size of 5000}
\end{figure}
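
For context, a brief sketch of what is being plotted, under the usual learned-baseline setup (the exact targets in the homework code may be normalized, but the idea is the same): the baseline is a network $V_\phi$ trained by regression onto the reward-to-go targets, and the curve above is its regression loss,
\[
L(\phi) \;=\; \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\left( V_\phi(s_{i,t}) - \hat{Q}_{i,t} \right)^2,
\qquad
\hat{Q}_{i,t} \;=\; \sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'}),
\]
which should decrease as the baseline learns to predict the expected return from each state.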

\item [2.3.2] Plot a learning curve for the evaluation return. You should expect to converge to the maximum reward of 500. (15 pts)

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p232.png}
\caption{Learning Curve for Evaluation Return for Batch Size of 5000}
\end{figure}

\item [2.3.3]
Run another experiment with a decreased number of baseline gradient steps (-bgs in command line) and/or baseline learning rate (-blr in command line). How does this affect (a) the baseline learning curve and (b) the performance of the policy? (15 pts)

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p2331.png}
\caption{Learning Curve for Baseline Loss for Batch Size of 5000 with Decreased Baseline Gradient Steps and/or Baseline Learning Rate}
\end{figure}

(a) In general, with a decreased number of baseline gradient steps and/or a decreased baseline learning rate, the baseline learning curve is more stable.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p2332.png}
\caption{Learning Curve for Average Return for Batch Size of 5000 with Decreased Baseline Gradient Steps and/or Baseline Learning Rate}
\end{figure}

(b) In general, the performance of the policy is also better with the decreased number of baseline gradient steps and/or the decreased baseline learning rate.

\item [2.3.4]
How does the command line argument -na influence the performance? Why is that the case? (5 pts)

The performance of the policy is better when the command line argument -na is used.

The -na flag normalizes the advantages, which keeps each policy update on a consistent scale, so the policy learns more stably and converges faster (the normalization is written out after the figure below).

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/p234.png}
\caption{Learning Curve for Average Return for Batch Size of 5000 with Command Line Argument -na}
\end{figure}
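
Concretely, in the standard formulation (the homework code may differ slightly, e.g. in the choice of $\epsilon$), -na rescales the batch of advantage estimates to zero mean and unit variance before they enter the policy gradient,
\[
\tilde{A}_{i,t} \;=\; \frac{A_{i,t} - \operatorname{mean}(A)}{\operatorname{std}(A) + \epsilon},
\]
so the size of the policy update no longer depends on the raw scale of the returns.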

\end{enumerate}

\newpage

\item [2.4] Bonus (20 pts)

% \begin{figure}[H]
% \centering
% \includegraphics[width=0.8\textwidth]{images/p241.png}
% \caption{Learning Curve for Average Return for HalfCheetah with Berkeley Parameters}
% \end{figure}

\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Actual math ends here. Don't put any content below the \end{document} line.
%%

\end{document}