course/semester2/pprog/assignment1/report/report.tex

   1 \documentclass[10pt]{article}
   2 \usepackage{graphicx}
   3 \usepackage{caption}
   4 \usepackage{amsmath} % needed for math align
   5 \usepackage{bm} % needed for maths bold face
   6  \usepackage{graphicx}    % needed for including graphics e.g. EPS, PS
   7 \usepackage{fancyhdr}   % needed for header
   8 %\usepackage{epstopdf} % Needed for eps graphics
   9 \usepackage{hyperref}
  10 \usepackage{lscape}
  11
  12  \topmargin -1.5cm        % read Lamport p.163
  13  \oddsidemargin -0.04cm   % read Lamport p.163
  14  \evensidemargin -0.04cm  % same as oddsidemargin but for left-hand pages
  15  \textwidth 16.59cm
  16  \textheight 21.94cm
  17  %\pagestyle{empty}       % Uncomment if don't want page numbers
  18  \parskip 7.2pt           % sets spacing between paragraphs
  19  %\renewcommand{\baselinestretch}{1.5}  % Uncomment for 1.5 spacing between lines
  20  \parindent 0pt           % sets leading space for paragraphs
  21
  22
  23 \newcommand{\vect}[1]{\boldsymbol{#1}} % Draw a vector
  24 \newcommand{\divg}[1]{\nabla \cdot #1} % divergence
  25 \newcommand{\curl}[1]{\nabla \times #1} % curl
  26 \newcommand{\grad}[1]{\nabla #1} %gradient
  27 \newcommand{\pd}[3][ ]{\frac{\partial^{#1} #2}{\partial #3^{#1}}} %partial derivative
  28 \newcommand{\der}[3][ ]{\frac{d^{#1} #2}{d #3^{#1}}} %full derivative
  29 \newcommand{\phasor}[1]{\tilde{#1}} % make a phasor
  30 \newcommand{\laplacian}[1]{\nabla^2 {#1}} % The laplacian operator
  31
  32 \usepackage{color}
  33 \usepackage{listings}
  34
  35 \definecolor{darkgray}{rgb}{0.95,0.95,0.95}
  36 \definecolor{darkred}{rgb}{0.75,0,0}
  37 \definecolor{darkblue}{rgb}{0,0,0.75}
  38 \definecolor{pink}{rgb}{1,0.5,0.5}
  39 \lstset{language=Java}
  40 \lstset{backgroundcolor=\color{darkgray}}
  41 \lstset{numbers=left, numberstyle=\tiny, stepnumber=1, numbersep=5pt}
  42 \lstset{keywordstyle=\color{darkred}\bfseries}
  43 \lstset{commentstyle=\color{darkblue}}
  44 %\lstset{stringsyle=\color{red}}
  45 \lstset{showstringspaces=false}
  46 \lstset{basicstyle=\small}
  47
  48
  49 \begin{document}
  50
  51 \pagestyle{fancy}
  52 \fancyhead{}
  53 \fancyfoot{}
  54
  55 \fancyhead[LO, L]{}
  56 \fancyfoot[CO, C]{\thepage}
  57
  58 \title{\bf A Multithreaded N-Body Simulation}
  59 \author{Sam Moore\\[email protected]}
  60 \date{}
  61 \maketitle
  62
  63 \tableofcontents
  64
  65
  66
  67 \pagebreak
  68
  69 \section{Introduction}
  70
  71 An N-Body simulation's purpose is to solve for the positions and velocities of an arbitrary number of interacting bodies using numerical methods.
  72
  73 Complex systems involving large numbers of bodies, interacting over long periods of time, are often of interest. Hence there is considerable motivation to improve the speed (amount of time simulated over real time required) of N-Body simulations.
  74 This report focuses on increasing the speed of an N-Body simulation by introducing parallelism using threads.
  75
  76 The bodies in this simulation interact according to Newtonian laws of gravity. The force acting on a body $A$ due to body $B$ is given by:
  77 \begin{align*}
  78         \vect{F_{AB}} &= G \frac{m_A m_B}{\left| \vect{r_B} - \vect{r_A} \right|^3} \left(\vect{r_B} - \vect{r_A}\right)
  79 \end{align*}
  80 Where $m_A$ and $m_B$ are the masses of the two bodies; $\vect{r_A}$ and $\vect{r_B}$ are the position vectors of the bodies, and $G$ is the universal gravitational constant.
  81
  82 A template single threaded program was already provided by UWA. Some modifications were made in order to simplify the process of introducing threading to the program, and in one case to correct an error in the program.
  83
  84 \section{Euler's Method}
  85
  86 Euler's method has been used. A continuous time interval is approximated by a series of steps of length $\Delta t$, during which each body undergoes motion with constant acceleration.
  87
  88 The algorithm in pseudo-code is:
  89
  90
  91 \begin{lstlisting}
  92 FOR each time interval of duration dt LOOP
  93         FOR each body A // can be multithreaded
  94                 COMPUTE the force acting on A due to all other bodies B
  95         DONE
  96
  97         FOR each body A // can be multithreaded
  98                 COMPUTE the velocity of A as:
  99                          V += (Force on A / mass of A) * dt
 100                 COMPUTE the position of A as:
 101                          X += V * dt
 102         DONE
 103 DONE
 104 \end{lstlisting}
 105
 106
 107 The pseudo-code is misleading; it may appear that the two loops could be combined into a single loop. However, in the case where bodies are interacting, the force on any given body will depend upon the positions of other bodies. Consider two bodies $A$ and $B$ interacting within a single time step. If the force on $A$ is computed, and $A$'s position is updated before the force on $B$ is computed, then the force experienced by $B$ will not be equal and opposite to that experienced by $A$ - violating Newton's third law.
 108
 109 The template program provided by UWA places force and position updates for each body in the same loop. This can be fixed by moving the position updates to a separate loop. In the multithreaded implementation, it is also important to ensure that \emph{all} forces are computed before any positions are changed, and vice versa before force computations may continue.
 110
 111
 112
 113 \section{Parallelisation}
 114
 115 The original algorithm can be parallelised by dividing each of the two loops indicated between a number of concurrently running threads.
 116 Forces may be updated for separate bodies simultaneously with no risk of conflict, since the force of one body is independent of all others. Similarly, positions may be updated simultaneously. However, as discussed above, force and position updates must be done separately.
 117
 118 The most effective way of parallelising the computations is as follows (not showing graphics related code):
 119
 120 \begin{lstlisting}
 121 CREATE n threads
 122 FOR each thread T assign a subset of bodies
 123
 124 FOR each thread T
 125         FOR each time interval of duration dt LOOP
 126                 FOR each body A in T's subset
 127                         COMPUTE the force acting on A due to all other bodies
 128                 DONE
 129
 130                 BARRIER Wait for all threads to finish computing forces
 131
 132                 FOR each body A in T's subset
 133                         COMPUTE the position of A
 134                 DONE
 135
 136                 BARRIER Wait for all threads to finish computing positions
 137
 138         DONE
 139 DONE
 140 \end{lstlisting}
 141
 142 This algorithm avoids the cost of repeated thread creation and deletion; threads are only created once.
 143 The barriers between force and position updates ensure that forces and positions are never updated at the same time in separate threads.
 144
 145 \subsection{Graphics}
 146
 147 The template program provides graphics using OpenGL and freeglut. In the multithreaded implementations, the graphics can be run in the main thread independent of computations with no effect on the final output of the simulation. However, the following problems may arise if the graphics thread is not synchronised with position updates:
 148
 149 \begin{enumerate}
 150 \item If a body's position is being altered whilst the graphics thread is drawing that body, the body may be drawn partway through the update of the three position coordinates. This will cause the body to be drawn displaced from its actual position by a small amount.
 151
 152 \item Because position updates are finished in any order, with the barriers correctly implemented, the display may show some bodies remain stationary whilst the thread responsible for those bodies waits for other threads to reach the barrier.
 153 \end{enumerate}
 154
 155 To avoid these problems, the graphics thread must not commence drawing if position updates are in progress. Similarly, position updates must not commence whilst the graphics thread is still in the process of drawing.
 156 \pagebreak
 157
 158
 159 \section{POSIX Threads (pthreads)}
 160
 161 \subsection{Barriers}
 162
 163 Because \verb/pthread_barrier/ is unavailable on many UNIX systems, it was necessary to implement a barrier.
 164
 165 The barrier functions as follows:
 166 \begin{enumerate}
 167         \item Initialised as ``inactive'' and with the total number of participating threads specified
 168         \item When a thread calls \verb/Barrier_Join/ the barrier becomes ``active'' (if it wasn't already)
 169         \item Each thread entering the barrier by calling \verb/Barrier_Join/ increments a counter.
 170         \item Threads calling \verb/Barrier_Join/ must sleep using \verb/pthread_cond_wait/ until the counter reaches the total number of participating threads
 171         \item The final thread to enter the barrier awakens the sleeping threads using \verb/pthread_cond_broadcast/, and sets the barrier as ``inactive''.
 172 \end{enumerate}
 173
 174 The concept of ``inactive'' and ``active'' barriers was useful for synchronising the graphics thread with the position computations.
 175 The function \verb/Barrier_Wait/ allows a thread to pass the barrier if it is inactive, but forces the thread to wait if the barrier is active. The calling thread does not affect the barrier's counter of finished threads.
 176
 177 The function \verb/Barrier_Enter/ can be called in a thread to set the barrier as ``active'', and then continue to perform tasks \emph{before} joining the barrier. ie: The barrier may be considered ``active'' by calls to \verb/Barrier_Wait/ before any threads have called \verb/Barrier_Join/.
 178
 179 By calling \verb/Barrier_Enter/ in the position threads before the position updates begin, and \verb/Barrier_Wait/ in the graphics thread before the drawing begins, the graphics thread cannot commence drawing whilst positions are being updated.
 180 Similarly, calling \verb/Barrier_Enter/ in the graphics thread before drawing, and \verb/Barrier_Wait/ in all position threads, ensures that the position updates cannot be started whilst drawing is in progress.
 181
 182 To avoid deadlocks, three separate barriers have been used; one for each of the force, position and drawing updates.
 183
 184 \subsection{Division of Labour}
 185
 186 In the pthread implementation, some mechanism is needed to divide the bodies between threads.
 187 This has been done by creating a data structure to describe an array of bodies, with the length of the array. Division of the array into subsections can be accomplished in the C programming language using pointer arithmetic.
 188
 189 A global array of these data structures (\verb/System/) is constructed before thread creation, with each \verb/System/ containing a subsection of the original bodies array. Each thread is passed a pointer to one of the \verb/System/s through \verb/pthread_create/
 190
 191 Some helper functions have been implemented to perform tasks for a given \verb/System/ of bodies, and these are called where needed in the threads.
 192
 193 \subsection{Single Threaded Section}
 194 One disadvantage of having all computation threads continually running is that it is difficult to execute serial code in only \emph{one} thread. For example, a periodic printing of performance metrics and the updating of the number of steps computed must both be performed at the end of every time step by only one thread. To assist in executing serial code in one thread out of a ``team'', a function has been added to the \verb/Barrier/ implementation; \verb/Barrier_JoinCall/ will cause the last thread entering the barrier to execute a specified function containing serial code. This takes advantage of the fact that serial code can often occur immediately after a barrier.
 195
 196 \section{OpenMP}
 197
 198 The implementation in OpenMP is significantly simpler than pthreads. OpenMP automatically divides labour for threads encountering \verb/omp for/, which eliminates the need for constructing the \verb/System/ array. OpenMP also incorporates implicit barriers after most code sections. The use of \verb/omp single/ directives greatly simplifies the assignment of serial code to one thread in a team.
 199
 200 Because OpenMP does not offer as much low level control over threads, the integration of graphics with the simulation differs. Specifically, because OpenMP does not have a barrier structure (instead using \verb/omp barrier/ directives), all threads involved in a barrier must occur within the same \verb/omp parallel/ section.
 201
 202 Instead of creating a separate thread to run graphics, the OpenMP solution uses an \verb/omp master/ directive to perform graphics related code within the main computation loop. This occurs after the position computation barrier, and is followed by an explicit \verb/omp barrier/. It is important that the graphics related functions be run in the main thread; using \verb/omp single/ will often result in unusual behaviour, such as flickering of the display or the display not being drawn at all.
 203
 204
 205 \section{Barnes Hut Algorithm}
 206
 207 The Barnes Hut algorithm approximates a distant group of bodies by the group's centre of mass. This is done by the construction of a tree representing regions of space containing groups of bodies. Nodes at each level of the tree represent a volume obtained by dividing the volume of the parent. The force on a body can be approximated by a depth first search on the tree, stopping at nodes for which the centre of mass is ``distant'' to the body, otherwise continuing to the nodes children until nodes containing only one body are reached.
 208
 209 Figure \ref{cubes.png} shows a graphical representation of the Barnes Hut algorithm. In this image, only Nodes (depicted as cubes representing the space within the node) containing exactly one body are shown.
 210
 211 It is possible to parallelise the recursive algorithms used in constructing the tree and calculating forces. However, we have only implemented a serial version of the tree. Despite this, the calculation of forces can still be parallelised external to the tree.
 212
 213 {\bf I have reached the page limit...} but I have lots of graphs to talk about still. Feel free to stop reading here.
 214 \begin{center}
 215         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/cubes.png}
 216         \captionof{figure}{Barnes Hut - One body per cube}
 217         \label{cubes.png}
 218 \end{center}
 219
 220 \subsection{Errors}
 221
 222 Figure \ref{error_graph.png} shows the mean ``error'' between the Barnes Hut Algorithm and more exact brute force algorithm. This error is calculated by summing the absolute difference of each body's position components, and averaging over all position components. Note the logarithmic y axis. The error increases roughly linearly. There is little change in error for different values of $\theta$ between $0$ and $10$.
 223
 224 Note: The same error analysis script confirmed that the multithreaded versions of the brute force algorithm gave the same outputs.
 225
 226 \subsection{Performance for different $\theta$ values}
 227
 228 Figure \ref{efficiency.png} shows the time taken for the (single threaded) Barnes Hut Algorithm to run a specified number of steps, for different values of $\theta$. The brute force algorithm is shown for comparison; for $\theta < 1$, the brute force algorithm is generally faster.
 229 Taking into account both figures \ref{efficiency.png} and \ref{error_graph.png}, it would appear that $\theta = 10$ is a good balance between accuracy and speed, for this particular field of bodies.
 230
 231 \begin{landscape}
 232
 233 \begin{center}
 234         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/error_graph.png}
 235         \captionof{figure}{Barnes Hut mean error} \label{error_graph.png}
 236 \end{center}
 237
 238 \begin{center}
 239         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/efficiency.png}
 240         \captionof{figure}{Barnes Hut single threaded performance} \label{efficiency.png}
 241 \end{center}
 242
 243 \end{landscape}
 244
 245 \pagebreak
 246
 247 \section{Performance Analysis}
 248
 249 \subsection{Time vs Steps}
 250
 251 Figure \ref{time_vs_steps.png} shows the time taken to compute a varying number of steps for different programs. The field provided with 500 bodies was used.
 252
 253 Several things can be noted from this graph:
 254 \begin{itemize}
 255         \item With $\theta = 10$, the Barnes Hut algorithm is significantly faster than brute force algorithms
 256         \item Altering the number of threads has a significant effect on the brute force algorithm performance, but little effect on the Barnes Hut algorithm performance. This is because the vast majority of the Barnes Hut algorithm has not yet been multithreaded.
 257         \item The optimum number of threads is equal to the number of cores available on the host machine (Intel Core i5).
 258                 (Pleiades was not available at the time of testing).
 259         \item The pthreads and openmp implementations perform with similar speeds
 260         \item The single threaded, brute force algorithms perform the worst
 261 \end{itemize}
 262
 263 The time taken to run varies with CPU load due to other processes, and so these results should be considered as qualitative.
 264 Measurement of CPU cycles (as opposed to real time) is not a suitable metric, since on many UNIX systems, this measurement includes cycles run by all threads in a process.
 265
 266 \subsection{Time to compute vs Number of Bodies}
 267
 268 Figure \ref{time_vs_n.png} shows the time taken to compute 1000 steps against the number of bodies in the initial field. The brute force algorithms appear $O(n^2)$. Barnes Hut appears to be $O(n log(n))$. The fastest algorithm is Barnes Hut with $\theta = 10$.
 269
 270 This experiment revealed (and was limited by) a flaw in (my implementation of) the Barnes Hut algorithm. When the body fields are randomly generated, the result is usually unbound orbits. As the size of the volume occupied by bodies is increasing, the algorithm must continually resize the root node of the tree, and then all children nodes, and ultimately create additional child nodes to contain single particles. Although the implementation attempts to minimise allocations of memory, it was found that for large numbers of bodies the program often became unresponsive. This may be due to the more closely approaching bodies experiencing extremely large forces, and moving away from the centre of the universe at high velocities.
 271 .
 272
 273 \pagebreak
 274 \begin{landscape}
 275
 276 \begin{center}
 277         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/performance_time_vs_steps.png}
 278         \captionof{figure}{Implementation Performance}
 279         \label{time_vs_steps.png}
 280 \end{center}
 281
 282 \begin{center}
 283         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/time_vs_n.png}
 284         \captionof{figure}{Efficiency vs $n$} \label{time_vs_n.png}
 285 \end{center}
 286 \end{landscape}
 287
 288 \subsection{Analysis with Callgrind}
 289
 290 The tool \emph{callgrind} (part of \emph{valgrind}) can be used to generate information on the performance of a program, and to identify expensive lines of code or the cost of different function calls. Figure \ref{kcachegrind.png} shows a screenshot of \emph{kcachegrind}, a graphical tool to display the results obtained by callgrind. The program analysed is the Barnes Hut N-Body simulator, running for 10 steps, without graphics enabled.
 291
 292 A larger copy of the image is attached with this report, file name ``nbody-bh/kcachegrind.png''. Several things are clear from this display:
 293 \begin{enumerate}
 294         \item The most expensive computations are due to the force computation function \verb/Node_Forces/. Close to 93\% of the program's time is spent in this function.
 295
 296         \item Although there are only 500 bodies, the force calculation algorithm was called 15200000 times. This is significantly greater than the approximate 2500000 force computations for the brute force algorithm.
 297
 298         For this particular initial field and choice of value for $\theta$ ($0.5$), the single threaded Barnes Hut algorithm performed worse than the single threaded brute force algorithm. As shown in Figure \ref{time_vs_n.png} however, the Barnes Hut algorithm performs extremely well for larger numbers of bodies. The ``better'' algorithm to use is determined by the problem.
 299
 300         \item Most of the cost of a force calculation is due to two calculations (distance calculation and the actual force update)
 301                 The calls to \verb/sqrt/, a normally expensive function have a much lower total cost, but are called the same number of times.
 302                 The extra cost of the two calculations might be due to the costs of following the pointers (the code has not been optimised),
 303
 304         \item The \verb/for/ loops, which were used after replacing the original 9 body variables (x, y, z, Vx, ..., Fx, ...) with three arrays, to condense the code, have a small but significant cost associated with them.
 305 \end{enumerate}
 306
 307 \begin{landscape}
 308 \begin{center}
 309         \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/kcachegrind.png}
 310         \captionof{figure}{Kcachegrind reveals why my program is slow} \label{kcachegrind.png}
 311 \end{center}
 312 \end{landscape}
 313
 314 \end{document}
 315