Implementation

All the runs are performed on the CS205 cluster at Harvard SEAS.

Tree representation

How the tree is represented affects how the Barnes-Hut algorithm can be parallelized. We opt to represent the tree with cell objects that refer to their daughter cells; the cells at the bottom of the tree hold individual particle objects. The entire tree can then be traversed starting from the root cell.
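
As an illustration, the representation might look like the following sketch, written in Python for concreteness (the class and attribute names are illustrative, not taken verbatim from our code):

    import numpy as np

    class Particle(object):
        """A point mass with phase-space coordinates."""
        def __init__(self, position, velocity, mass):
            self.position = np.asarray(position, dtype=float)
            self.velocity = np.asarray(velocity, dtype=float)
            self.mass = mass

    class Cell(object):
        """A cubic cell of the oct-tree, referring to its daughters."""
        def __init__(self, center, width):
            self.center = np.asarray(center, dtype=float)  # geometric center of the cube
            self.width = width          # side length of the cube
            self.daughters = None       # list of eight daughter Cells, once created
            self.particle = None        # the single Particle held by a leaf cell
            self.mass = 0.0             # total mass of all particles in the cell
            self.com = np.zeros(3)      # center of mass of all particles in the cell
            self.n_particles = 0        # number of particles added so far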

The construction of the tree differs slightly from the definition of the Barnes-Hut tree given on the Barnes-Hut page. We start with just the root cell and add the particles to it one by one. When a particle is added to a cell, the cell's center of mass is updated and the following steps are taken (a sketch of this insertion routine follows the list):

  1. If this cell is empty, make this cell refer to this particle.
  2. If this cell already had exactly one particle, create this cell’s eight daughters, add the two particles to the appropriate daughters, and make this cell forget the original particle.
  3. If this cell already had more than one particle, its daughters have already been created, so add the new particle to the appropriate daughter.
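
A sketch of this insertion logic, building on the Particle and Cell classes sketched above (again illustrative rather than our exact code):

    import numpy as np

    def add_particle(cell, p):
        # Update the cell's center of mass and total mass to include the new particle.
        cell.com = (cell.mass * cell.com + p.mass * p.position) / (cell.mass + p.mass)
        cell.mass += p.mass
        cell.n_particles += 1

        if cell.n_particles == 1:
            # 1. The cell was empty: make it refer to the particle.
            cell.particle = p
        elif cell.n_particles == 2:
            # 2. The cell held exactly one particle: create the eight daughters,
            #    push both particles down, and forget the original particle.
            create_daughters(cell)
            old = cell.particle
            cell.particle = None
            add_particle(daughter_for(cell, old), old)
            add_particle(daughter_for(cell, p), p)
        else:
            # 3. The cell already had more than one particle, so its daughters
            #    exist: route the new particle to the appropriate daughter.
            add_particle(daughter_for(cell, p), p)

    def create_daughters(cell):
        # Eight sub-cubes of half the width, offset by a quarter width per axis.
        cell.daughters = []
        for dx in (-0.25, 0.25):
            for dy in (-0.25, 0.25):
                for dz in (-0.25, 0.25):
                    center = cell.center + cell.width * np.array([dx, dy, dz])
                    cell.daughters.append(Cell(center, cell.width / 2.0))

    def daughter_for(cell, p):
        # Octant index of the particle's position relative to the cell center.
        octant = (4 * (p.position[0] > cell.center[0])
                  + 2 * (p.position[1] > cell.center[1])
                  + 1 * (p.position[2] > cell.center[2]))
        return cell.daughters[octant]

The full tree is then built by creating a root cell that covers the simulation volume and calling add_particle on it for every particle.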

Thus a tree can be split up by, say, distributing its eight first-level daughters, and then reconstructed by creating a new root cell and making it refer to those eight daughters.
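
In terms of the Cell sketch above, splitting and reassembling a tree might look like this (illustrative):

    def split_tree(root):
        # Hand out the eight first-level daughters as independent sub-trees.
        return root.daughters

    def reassemble_tree(daughters, center, width):
        # Create a fresh root cell and make it refer to the given daughters.
        root = Cell(center, width)
        root.daughters = daughters
        root.mass = sum(d.mass for d in daughters)
        root.n_particles = sum(d.n_particles for d in daughters)
        if root.mass > 0.0:
            root.com = sum(d.mass * d.com for d in daughters) / root.mass
        return root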

An array of particles can easily be obtained by traversing the tree, and the particle objects themselves hold no references back to the tree. They can therefore be split, gathered, and updated independently of the tree. The tree on each process still holds references to the particles on that process, but at each time step the tree must be rebuilt from the most current particle information.
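
With the same Cell sketch, the particle array can be rebuilt by a short recursive traversal, for example:

    def collect_particles(cell):
        # Walk the tree depth-first and return a flat list of the particles it holds.
        if cell.daughters is None:
            return [cell.particle] if cell.particle is not None else []
        particles = []
        for daughter in cell.daughters:
            particles.extend(collect_particles(daughter))
        return particles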

Leapfrog integration

Leapfrog integration is a common time integration method in N-body codes because it is simple and also turns out to be symplectic. Symplectic integrators are attractive because they solve a discretized Hamiltonian that is close to the true Hamiltonian of the system. The energy of the simulation therefore oscillates around the true energy, and the error in the energy can be bounded. These properties only strictly hold for constant time steps, but some aspects can be preserved with adaptive time stepping (see Quinn et al., http://arxiv.org/abs/astro-ph/9710043, for such methods, and our Performance page for a discussion of how constant time steps affect our simulation).

We use a constant time step leapfrog integrator with kick and drift operators. Note that in leapfrog the velocities and positions are staggered by half a time step. We start with v_n and r_{n+0.5}, and then apply a "kick" by calculating the accelerations the particles feel at the positions r_{n+0.5}:

v_{n+1} = v_n + Δt a(r_{n+0.5})

Then a "drift" is applied, moving the particles according to the velocity v_{n+1}:

r_{n+1.5} = r_{n+0.5} + Δt v_{n+1}
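
In code, one kick-drift step might look like the following sketch, where accelerations stands in for the Barnes-Hut force evaluation (the function and argument names are illustrative):

    def leapfrog_step(positions, velocities, dt, accelerations):
        """One constant-time-step kick-drift update.

        positions holds r_{n+0.5}, velocities holds v_n; on return they hold
        r_{n+1.5} and v_{n+1}. accelerations(positions) evaluates a(r) for
        every particle (in our case via the Barnes-Hut tree).
        """
        velocities = velocities + dt * accelerations(positions)   # kick
        positions = positions + dt * velocities                   # drift
        return positions, velocities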

Serial implementation

For the serial implementation, we time the tree construction and the force computation for 100 particles over 100 timesteps:

Tree construction time: 2.277  sec
Force computation time:  20.80  sec

The fact that the force computation takes an order of magnitude longer than the tree construction motivates a parallel approach to the force calculation.

Throughout the explanation below we will refer to a "home process", by which we mean the process that generates and assembles the information needed by the other processes; this role is often called the "root process". We adopt this terminology to avoid confusion with the root of the Barnes-Hut tree.

Basic parallel implementation “Parallel Basic”

Our first parallel implementation is based on the following reasoning: if each process has knowledge of the full oct-tree, then the problem becomes embarrassingly parallel. The strategy is therefore to split the particles evenly between the processes after broadcasting the tree. The steps are as follows (a sketch of the resulting time step loop follows the list):

  1. Compute initial conditions on home process: a three-dimensional distribution of points with associated velocities.
  2. Construct tree on home process.
  3. Home process broadcasts tree to all processes.
  4. To share load, processes choose disjoint subsets of particles. Each process uses the tree to compute the force on its particles and the resulting new phase-space coordinates of its particles.
  5. Home process gathers new particle information.
  6. Repeat steps 2-5 at each timestep.
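
A sketch of the resulting time step loop, written with mpi4py for concreteness; make_initial_conditions, build_tree, and compute_acceleration are placeholders for the initial conditions, tree construction, and Barnes-Hut force routines, collect_particles is the traversal sketched earlier, and the run parameters are illustrative:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    n_particles, n_steps, dt = 100, 100, 0.01   # illustrative run parameters

    # Step 1: initial conditions are computed on the home process only.
    particles = make_initial_conditions(n_particles) if rank == 0 else None

    for step in range(n_steps):
        # Steps 2-3: the home process builds the tree and broadcasts it.
        tree = build_tree(particles) if rank == 0 else None
        tree = comm.bcast(tree, root=0)

        # Step 4: every process extracts the particle list from the broadcast
        # tree, takes a disjoint slice of it, and updates only that slice.
        local = collect_particles(tree)[rank::size]
        for p in local:
            p.velocity += dt * compute_acceleration(tree, p)
            p.position += dt * p.velocity

        # Step 5: the home process gathers the updated slices back together.
        chunks = comm.gather(local, root=0)
        if rank == 0:
            particles = [p for chunk in chunks for p in chunk]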

With this version of the code running 100 particles for 100 timesteps on 16 cores, we obtain the following sample timing measurements:

Tree construction time:  2.312  sec
Broadcast time:  2.002 sec
Force computation:  1.281  sec
Gather time:  0.9778 sec

The tree construction time is almost the same as in the serial implementation above, as it is done identically in both cases (differences at that level can be ascribed to external load on the cluster). As expected, the parallel force computation time is shorter by almost exactly a factor of 16. However, we now have additional overhead from the communication between processes, corresponding to steps 3 and 5 above.

The tree construction is now the dominant contributor to the total run time. This motivates our next step, which is to parallelize the tree construction as well as the force computation.

Parallel tree construction “Parallel Tree”

We want to reduce the tree construction time by splitting the work between processes. We could do this by splitting the tree at its first level, where the domain is divided into 8 octants, but this would limit the strategy to 8 processes. We therefore split the tree at its second level, where the domain is divided into 64 cells, so the code scales to up to 64 processes.

Since we are splitting space evenly, and not particles, load balancing becomes an issue. As the number of processes approaches 64, each process gets only one or two second-level branches of the tree to fill. If the particle distribution is uneven (as it is for a clustered distribution), some processes may be assigned regions containing no particles, an obvious load-balancing problem. On the other hand, when the number of processes is low (<16), each process gets multiple second-level branches, so an uneven split of particles is less likely. To further even out the distribution, the second-level branches are assigned so that spatially adjacent branches go to different processes (see the sketch after the list). Now the steps are as follows:

  1. Compute initial conditions on home process: a three-dimensional distribution of points with associated velocities, and broadcast this information to all processes.
  2. Each process selects branches of the tree to fill out and goes through the full list of particles: for each particle, it decides whether it belongs in its sections of the tree and constructs the appropriate branch(es) as particles are added.
  3. Home process assembles the tree by gathering all tree branches and combining them into a single tree.
  4. Home process broadcasts full tree to all processes.
  5. To share load, processes choose disjoint subsets of particles. Each process uses the tree to compute the force on its particles and the resulting new phase-space coordinates of its particles.
  6. Processes allgather new particle information so that they can independently construct their tree branches in the next timestep.
  7. Repeat steps 2-6 at each timestep.
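
A sketch of the corresponding loop, again with mpi4py and placeholder helper names (make_initial_conditions and compute_acceleration are as before, build_branches fills only the second-level cells listed in its second argument, and assemble_tree attaches the gathered branches under a new root cell). The round-robin assignment that sends adjacent second-level cells to different processes is included at the top:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    n_particles, n_steps, dt = 100, 100, 0.01   # illustrative run parameters

    def my_branches(rank, size, n_branches=64):
        # Round-robin assignment: spatially adjacent second-level cells go to
        # different processes, which helps even out clustered distributions.
        return [b for b in range(n_branches) if b % size == rank]

    # Step 1: initial conditions are computed on the home process and broadcast,
    # so every process starts with the full, current particle list.
    particles = comm.bcast(
        make_initial_conditions(n_particles) if rank == 0 else None, root=0)

    for step in range(n_steps):
        # Step 2: each process fills only its assigned second-level branches.
        branches = build_branches(particles, my_branches(rank, size))

        # Steps 3-4: the home process assembles the branches into a single tree
        # and broadcasts the full tree back to every process.
        gathered = comm.gather(branches, root=0)
        tree = assemble_tree(gathered) if rank == 0 else None
        tree = comm.bcast(tree, root=0)

        # Step 5: disjoint particle subsets, exactly as in the basic version.
        local = particles[rank::size]
        for p in local:
            p.velocity += dt * compute_acceleration(tree, p)
            p.position += dt * p.velocity

        # Step 6: allgather so that every process holds the updated particle
        # list and can fill its branches independently at the next step.
        chunks = comm.allgather(local)
        particles = [p for chunk in chunks for p in chunk]

With 16 processes, for example, process 0 would be responsible for second-level cells 0, 16, 32, and 48, so neighboring cells end up on different processes.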

With this version of the code running 100 particles for 100 timesteps on 16 cores, we obtain the following sample timing measurements:

Tree construction time:  0.1551  sec
Tree assembly time: 1.1932 sec
Broadcast time:  1.933  sec
Force computation:  1.645  sec
Allgather time:  0.3012 sec

The tree construction time has been reduced dramatically, by a factor of 15 rather than 16 because of imperfect load balancing. As expected, the broadcast time is the same (we are still simply broadcasting the full tree to all processes), and the force computation time is similar. The allgather time is similar to the gather time from the first parallel implementation. We would expect allgather to take longer, since it must do everything gather does and more, yet it appears shorter here; because these are the shortest components, run-to-run variations in network performance show up most strongly in them, which explains the discrepancy. The parallelization of the tree construction also introduces a new source of overhead, corresponding to step 3: the home process gathers all the branches and reassembles the tree.

For a more detailed analysis of the performance of our parallel implementations, check our Performance page!