Parallel Programming - Final version

[matches/honours.git] / course / semester2 / pprog / assignment1 / report / report.tex
diff --git a/course/semester2/pprog/assignment1/report/report.tex b/course/semester2/pprog/assignment1/report/report.tex

new file mode 100644 (file)

index 0000000..d9d4e3d
--- /dev/null
+++ b/course/semester2/pprog/assignment1/report/report.tex
@@ -0,0 +1,315 @@
+\documentclass[10pt]{article}
+\usepackage{graphicx}
+\usepackage{caption}
+\usepackage{amsmath} % needed for math align
+\usepackage{bm} % needed for maths bold face
+ \usepackage{graphicx}    % needed for including graphics e.g. EPS, PS
+\usepackage{fancyhdr}  % needed for header
+%\usepackage{epstopdf} % Needed for eps graphics
+\usepackage{hyperref}
+\usepackage{lscape}
+
+ \topmargin -1.5cm        % read Lamport p.163
+ \oddsidemargin -0.04cm   % read Lamport p.163
+ \evensidemargin -0.04cm  % same as oddsidemargin but for left-hand pages
+ \textwidth 16.59cm
+ \textheight 21.94cm 
+ %\pagestyle{empty}       % Uncomment if don't want page numbers
+ \parskip 7.2pt           % sets spacing between paragraphs
+ %\renewcommand{\baselinestretch}{1.5}         % Uncomment for 1.5 spacing between lines
+ \parindent 0pt                  % sets leading space for paragraphs
+
+
+\newcommand{\vect}[1]{\boldsymbol{#1}} % Draw a vector
+\newcommand{\divg}[1]{\nabla \cdot #1} % divergence
+\newcommand{\curl}[1]{\nabla \times #1} % curl
+\newcommand{\grad}[1]{\nabla #1} %gradient
+\newcommand{\pd}[3][ ]{\frac{\partial^{#1} #2}{\partial #3^{#1}}} %partial derivative
+\newcommand{\der}[3][ ]{\frac{d^{#1} #2}{d #3^{#1}}} %full derivative
+\newcommand{\phasor}[1]{\tilde{#1}} % make a phasor
+\newcommand{\laplacian}[1]{\nabla^2 {#1}} % The laplacian operator
+
+\usepackage{color}
+\usepackage{listings}
+
+\definecolor{darkgray}{rgb}{0.95,0.95,0.95}
+\definecolor{darkred}{rgb}{0.75,0,0}
+\definecolor{darkblue}{rgb}{0,0,0.75}
+\definecolor{pink}{rgb}{1,0.5,0.5}
+\lstset{language=Java}
+\lstset{backgroundcolor=\color{darkgray}}
+\lstset{numbers=left, numberstyle=\tiny, stepnumber=1, numbersep=5pt}
+\lstset{keywordstyle=\color{darkred}\bfseries}
+\lstset{commentstyle=\color{darkblue}}
+%\lstset{stringsyle=\color{red}}
+\lstset{showstringspaces=false}
+\lstset{basicstyle=\small}
+
+
+\begin{document}
+
+\pagestyle{fancy}
+\fancyhead{}
+\fancyfoot{}
+
+\fancyhead[LO, L]{}
+\fancyfoot[CO, C]{\thepage}
+
+\title{\bf A Multithreaded N-Body Simulation}
+\author{Sam Moore\\[email protected]}
+\date{}
+\maketitle
+
+\tableofcontents
+
+
+
+\pagebreak
+
+\section{Introduction}
+
+An N-Body simulation's purpose is to solve for the positions and velocities of an arbitrary number of interacting bodies using numerical methods.
+
+Complex systems involving large numbers of bodies, interacting over long periods of time, are often of interest. Hence there is considerable motivation to improve the speed (amount of time simulated over real time required) of N-Body simulations.
+This report focuses on increasing the speed of an N-Body simulation by introducing parallelism using threads. 
+
+The bodies in this simulation interact according to Newtonian laws of gravity. The force acting on a body $A$ due to body $B$ is given by:
+\begin{align*}
+       \vect{F_{AB}} &= G \frac{m_A m_B}{\left| \vect{r_B} - \vect{r_A} \right|^3} \left(\vect{r_B} - \vect{r_A}\right)
+\end{align*}
+Where $m_A$ and $m_B$ are the masses of the two bodies; $\vect{r_A}$ and $\vect{r_B}$ are the position vectors of the bodies, and $G$ is the universal gravitational constant.
+
+A template single threaded program was already provided by UWA. Some modifications were made in order to simplify the process of introducing threading to the program, and in one case to correct an error in the program.
+
+\section{Euler's Method}
+
+Euler's method has been used. A continuous time interval is approximated by a series of steps of length $\Delta t$, during which each body undergoes motion with constant acceleration.
+
+The algorithm in pseudo-code is:
+
+
+\begin{lstlisting}
+FOR each time interval of duration dt LOOP
+       FOR each body A // can be multithreaded
+               COMPUTE the force acting on A due to all other bodies B
+       DONE
+
+       FOR each body A // can be multithreaded
+               COMPUTE the velocity of A as:
+                        V += (Force on A / mass of A) * dt
+               COMPUTE the position of A as:
+                        X += V * dt
+       DONE
+DONE
+\end{lstlisting}
+
+
+The pseudo-code is misleading; it may appear that the two loops could be combined into a single loop. However, in the case where bodies are interacting, the force on any given body will depend upon the positions of other bodies. Consider two bodies $A$ and $B$ interacting within a single time step. If the force on $A$ is computed, and $A$'s position is updated before the force on $B$ is computed, then the force experienced by $B$ will not be equal and opposite to that experienced by $A$ - violating Newton's third law. 
+
+The template program provided by UWA places force and position updates for each body in the same loop. This can be fixed by moving the position updates to a separate loop. In the multithreaded implementation, it is also important to ensure that \emph{all} forces are computed before any positions are changed, and vice versa before force computations may continue.
+
+
+
+\section{Parallelisation}
+
+The original algorithm can be parallelised by dividing each of the two loops indicated between a number of concurrently running threads. 
+Forces may be updated for separate bodies simultaneously with no risk of conflict, since the force of one body is independent of all others. Similarly, positions may be updated simultaneously. However, as discussed above, force and position updates must be done separately.
+
+The most effective way of parallelising the computations is as follows (not showing graphics related code):
+
+\begin{lstlisting}
+CREATE n threads
+FOR each thread T assign a subset of bodies
+
+FOR each thread T
+       FOR each time interval of duration dt LOOP
+               FOR each body A in T's subset
+                       COMPUTE the force acting on A due to all other bodies
+               DONE
+               
+               BARRIER Wait for all threads to finish computing forces
+
+               FOR each body A in T's subset
+                       COMPUTE the position of A
+               DONE
+
+               BARRIER Wait for all threads to finish computing positions
+
+       DONE
+DONE
+\end{lstlisting}
+
+This algorithm avoids the cost of repeated thread creation and deletion; threads are only created once.
+The barriers between force and position updates ensure that forces and positions are never updated at the same time in separate threads.
+
+\subsection{Graphics}
+
+The template program provides graphics using OpenGL and freeglut. In the multithreaded implementations, the graphics can be run in the main thread independent of computations with no effect on the final output of the simulation. However, the following problems may arise if the graphics thread is not synchronised with position updates:
+
+\begin{enumerate}
+\item If a body's position is being altered whilst the graphics thread is drawing that body, the body may be drawn partway through the update of the three position coordinates. This will cause the body to be drawn displaced from its actual position by a small amount.
+
+\item Because position updates are finished in any order, with the barriers correctly implemented, the display may show some bodies remain stationary whilst the thread responsible for those bodies waits for other threads to reach the barrier.
+\end{enumerate}
+
+To avoid these problems, the graphics thread must not commence drawing if position updates are in progress. Similarly, position updates must not commence whilst the graphics thread is still in the process of drawing.
+\pagebreak
+
+
+\section{POSIX Threads (pthreads)}
+
+\subsection{Barriers}
+
+Because \verb/pthread_barrier/ is unavailable on many UNIX systems, it was necessary to implement a barrier.
+
+The barrier functions as follows:
+\begin{enumerate}
+       \item Initialised as ``inactive'' and with the total number of participating threads specified
+       \item When a thread calls \verb/Barrier_Join/ the barrier becomes ``active'' (if it wasn't already)
+       \item Each thread entering the barrier by calling \verb/Barrier_Join/ increments a counter.
+       \item Threads calling \verb/Barrier_Join/ must sleep using \verb/pthread_cond_wait/ until the counter reaches the total number of participating threads
+       \item The final thread to enter the barrier awakens the sleeping threads using \verb/pthread_cond_broadcast/, and sets the barrier as ``inactive''.
+\end{enumerate}
+
+The concept of ``inactive'' and ``active'' barriers was useful for synchronising the graphics thread with the position computations.
+The function \verb/Barrier_Wait/ allows a thread to pass the barrier if it is inactive, but forces the thread to wait if the barrier is active. The calling thread does not affect the barrier's counter of finished threads.
+
+The function \verb/Barrier_Enter/ can be called in a thread to set the barrier as ``active'', and then continue to perform tasks \emph{before} joining the barrier. ie: The barrier may be considered ``active'' by calls to \verb/Barrier_Wait/ before any threads have called \verb/Barrier_Join/.
+
+By calling \verb/Barrier_Enter/ in the position threads before the position updates begin, and \verb/Barrier_Wait/ in the graphics thread before the drawing begins, the graphics thread cannot commence drawing whilst positions are being updated.
+Similarly, calling \verb/Barrier_Enter/ in the graphics thread before drawing, and \verb/Barrier_Wait/ in all position threads, ensures that the position updates cannot be started whilst drawing is in progress.
+
+To avoid deadlocks, three separate barriers have been used; one for each of the force, position and drawing updates.
+
+\subsection{Division of Labour}
+
+In the pthread implementation, some mechanism is needed to divide the bodies between threads.
+This has been done by creating a data structure to describe an array of bodies, with the length of the array. Division of the array into subsections can be accomplished in the C programming language using pointer arithmetic.
+
+A global array of these data structures (\verb/System/) is constructed before thread creation, with each \verb/System/ containing a subsection of the original bodies array. Each thread is passed a pointer to one of the \verb/System/s through \verb/pthread_create/
+
+Some helper functions have been implemented to perform tasks for a given \verb/System/ of bodies, and these are called where needed in the threads.
+
+\subsection{Single Threaded Section}
+One disadvantage of having all computation threads continually running is that it is difficult to execute serial code in only \emph{one} thread. For example, a periodic printing of performance metrics and the updating of the number of steps computed must both be performed at the end of every time step by only one thread. To assist in executing serial code in one thread out of a ``team'', a function has been added to the \verb/Barrier/ implementation; \verb/Barrier_JoinCall/ will cause the last thread entering the barrier to execute a specified function containing serial code. This takes advantage of the fact that serial code can often occur immediately after a barrier.
+
+\section{OpenMP}
+
+The implementation in OpenMP is significantly simpler than pthreads. OpenMP automatically divides labour for threads encountering \verb/omp for/, which eliminates the need for constructing the \verb/System/ array. OpenMP also incorporates implicit barriers after most code sections. The use of \verb/omp single/ directives greatly simplifies the assignment of serial code to one thread in a team.
+
+Because OpenMP does not offer as much low level control over threads, the integration of graphics with the simulation differs. Specifically, because OpenMP does not have a barrier structure (instead using \verb/omp barrier/ directives), all threads involved in a barrier must occur within the same \verb/omp parallel/ section.
+
+Instead of creating a separate thread to run graphics, the OpenMP solution uses an \verb/omp master/ directive to perform graphics related code within the main computation loop. This occurs after the position computation barrier, and is followed by an explicit \verb/omp barrier/. It is important that the graphics related functions be run in the main thread; using \verb/omp single/ will often result in unusual behaviour, such as flickering of the display or the display not being drawn at all.
+
+
+\section{Barnes Hut Algorithm}
+
+The Barnes Hut algorithm approximates a distant group of bodies by the group's centre of mass. This is done by the construction of a tree representing regions of space containing groups of bodies. Nodes at each level of the tree represent a volume obtained by dividing the volume of the parent. The force on a body can be approximated by a depth first search on the tree, stopping at nodes for which the centre of mass is ``distant'' to the body, otherwise continuing to the nodes children until nodes containing only one body are reached.
+
+Figure \ref{cubes.png} shows a graphical representation of the Barnes Hut algorithm. In this image, only Nodes (depicted as cubes representing the space within the node) containing exactly one body are shown.
+
+It is possible to parallelise the recursive algorithms used in constructing the tree and calculating forces. However, we have only implemented a serial version of the tree. Despite this, the calculation of forces can still be parallelised external to the tree.
+
+{\bf I have reached the page limit...} but I have lots of graphs to talk about still. Feel free to stop reading here. 
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/cubes.png}
+       \captionof{figure}{Barnes Hut - One body per cube}
+       \label{cubes.png}
+\end{center}
+
+\subsection{Errors}
+
+Figure \ref{error_graph.png} shows the mean ``error'' between the Barnes Hut Algorithm and more exact brute force algorithm. This error is calculated by summing the absolute difference of each body's position components, and averaging over all position components. Note the logarithmic y axis. The error increases roughly linearly. There is little change in error for different values of $\theta$ between $0$ and $10$.
+
+Note: The same error analysis script confirmed that the multithreaded versions of the brute force algorithm gave the same outputs.
+
+\subsection{Performance for different $\theta$ values}
+
+Figure \ref{efficiency.png} shows the time taken for the (single threaded) Barnes Hut Algorithm to run a specified number of steps, for different values of $\theta$. The brute force algorithm is shown for comparison; for $\theta < 1$, the brute force algorithm is generally faster.
+Taking into account both figures \ref{efficiency.png} and \ref{error_graph.png}, it would appear that $\theta = 10$ is a good balance between accuracy and speed, for this particular field of bodies.
+
+\begin{landscape}
+
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/error_graph.png}
+       \captionof{figure}{Barnes Hut mean error} \label{error_graph.png}
+\end{center}
+
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/efficiency.png}
+       \captionof{figure}{Barnes Hut single threaded performance} \label{efficiency.png}
+\end{center}
+
+\end{landscape}
+
+\pagebreak
+
+\section{Performance Analysis}
+
+\subsection{Time vs Steps}
+
+Figure \ref{time_vs_steps.png} shows the time taken to compute a varying number of steps for different programs. The field provided with 500 bodies was used.
+
+Several things can be noted from this graph:
+\begin{itemize}
+       \item With $\theta = 10$, the Barnes Hut algorithm is significantly faster than brute force algorithms
+       \item Altering the number of threads has a significant effect on the brute force algorithm performance, but little effect on the Barnes Hut algorithm performance. This is because the vast majority of the Barnes Hut algorithm has not yet been multithreaded.
+       \item The optimum number of threads is equal to the number of cores available on the host machine (Intel Core i5).
+               (Pleiades was not available at the time of testing).
+       \item The pthreads and openmp implementations perform with similar speeds
+       \item The single threaded, brute force algorithms perform the worst
+\end{itemize}
+
+The time taken to run varies with CPU load due to other processes, and so these results should be considered as qualitative.
+Measurement of CPU cycles (as opposed to real time) is not a suitable metric, since on many UNIX systems, this measurement includes cycles run by all threads in a process. 
+
+\subsection{Time to compute vs Number of Bodies}
+
+Figure \ref{time_vs_n.png} shows the time taken to compute 1000 steps against the number of bodies in the initial field. The brute force algorithms appear $O(n^2)$. Barnes Hut appears to be $O(n log(n))$. The fastest algorithm is Barnes Hut with $\theta = 10$.
+
+This experiment revealed (and was limited by) a flaw in (my implementation of) the Barnes Hut algorithm. When the body fields are randomly generated, the result is usually unbound orbits. As the size of the volume occupied by bodies is increasing, the algorithm must continually resize the root node of the tree, and then all children nodes, and ultimately create additional child nodes to contain single particles. Although the implementation attempts to minimise allocations of memory, it was found that for large numbers of bodies the program often became unresponsive. This may be due to the more closely approaching bodies experiencing extremely large forces, and moving away from the centre of the universe at high velocities.
+.
+
+\pagebreak
+\begin{landscape}
+
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/performance_time_vs_steps.png}
+       \captionof{figure}{Implementation Performance}
+       \label{time_vs_steps.png}
+\end{center}
+
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/time_vs_n.png}
+       \captionof{figure}{Efficiency vs $n$} \label{time_vs_n.png}
+\end{center}
+\end{landscape}
+
+\subsection{Analysis with Callgrind}
+
+The tool \emph{callgrind} (part of \emph{valgrind}) can be used to generate information on the performance of a program, and to identify expensive lines of code or the cost of different function calls. Figure \ref{kcachegrind.png} shows a screenshot of \emph{kcachegrind}, a graphical tool to display the results obtained by callgrind. The program analysed is the Barnes Hut N-Body simulator, running for 10 steps, without graphics enabled.
+
+A larger copy of the image is attached with this report, file name ``nbody-bh/kcachegrind.png''. Several things are clear from this display:
+\begin{enumerate}
+       \item The most expensive computations are due to the force computation function \verb/Node_Forces/. Close to 93\% of the program's time is spent in this function.
+
+       \item Although there are only 500 bodies, the force calculation algorithm was called 15200000 times. This is significantly greater than the approximate 2500000 force computations for the brute force algorithm. 
+
+       For this particular initial field and choice of value for $\theta$ ($0.5$), the single threaded Barnes Hut algorithm performed worse than the single threaded brute force algorithm. As shown in Figure \ref{time_vs_n.png} however, the Barnes Hut algorithm performs extremely well for larger numbers of bodies. The ``better'' algorithm to use is determined by the problem.
+
+       \item Most of the cost of a force calculation is due to two calculations (distance calculation and the actual force update)
+               The calls to \verb/sqrt/, a normally expensive function have a much lower total cost, but are called the same number of times.
+               The extra cost of the two calculations might be due to the costs of following the pointers (the code has not been optimised),  
+
+       \item The \verb/for/ loops, which were used after replacing the original 9 body variables (x, y, z, Vx, ..., Fx, ...) with three arrays, to condense the code, have a small but significant cost associated with them.
+\end{enumerate}
+
+\begin{landscape}
+\begin{center}
+       \includegraphics[scale=0.50]{/home/sam/Documents/University/honours/course/semester2/pprog/assignment1/nbody-bh/kcachegrind.png}
+       \captionof{figure}{Kcachegrind reveals why my program is slow} \label{kcachegrind.png}
+\end{center}
+\end{landscape}
+
+\end{document}
+