Wavenology HPC (Future Product)

Wavenology integrates high-performance-computing techniques not only by applying standard programming-level acceleration methods (OpenMP, MPI, GPU, etc.), but also by parallelizing the algorithms and software at the fundamental level of the numerical simulation. WCT's R&D team understands the nature of each numerical technique applied in Wavenology and is careful to make maximal use of the available computing resources. WCT clearly understands the key issues in the design of parallel programs: in addition to applying standard acceleration patterns, we put extra effort into evolving the classical numerical techniques from their intrinsic nature.


 

Overview

 
For parallel computing, there are four types of parallelism. Ordered by granularity, from coarse to fine, they are:

1. Task parallelism

Divides the overall problem into several subtasks and executes them simultaneously, as shown in Fig. 1. The subtasks sometimes need to exchange data, so a task manager is required to synchronize them (a combined code sketch of task and data parallelism follows the summary after item 4).

 



Fig. 1. Task parallelism. The problem task is divided into several subtasks. A task manager is required to synchronize the tasks.

2. Data parallelism

Separates independent data into groups and processes the groups simultaneously, as shown in Fig. 2. Here the processing unit can be a computing node, a CPU core, or an arithmetic unit (such as a multiplier) within a CPU.

 



Fig. 2. Data parallelism. The processing unit can be a computing node, a CPU core, or an arithmetic unit within a CPU.

 
3. Instruction parallelism

Reorders instructions so that independent instructions can execute simultaneously in the processor's instruction pipeline.

 
4. Bit parallelism

Determines how many bits can be processed simultaneously, for example, 32 or 64 bits on a 32-bit or 64-bit system.

 
Item 4 above is fixed by the hardware bus width. Item 3 depends on the processor architecture and the compiler's ability. Item 2 depends on the computing algorithm and the compiler's ability. Item 1 depends on how the algorithm is divided into subtasks, and on how those subtasks are executed and synchronized simultaneously on different computing nodes.
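To make items 1 and 2 concrete, the following minimal CUDA sketch (a hypothetical illustration, not Wavenology source code) launches several independent kernels in separate streams, so the kernels act as parallel subtasks, while the threads inside each kernel provide data parallelism over the elements of one array; cudaDeviceSynchronize() plays the role of the task manager.

    // Hypothetical sketch: streams provide task parallelism, threads within
    // each kernel launch provide data parallelism.
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n)
            data[i] *= factor;                          // data-parallel update
    }

    int main()
    {
        const int n = 1 << 20;
        const int numTasks = 4;                         // independent subtasks
        float *d_data[numTasks];
        cudaStream_t streams[numTasks];

        for (int t = 0; t < numTasks; ++t) {
            cudaMalloc(&d_data[t], n * sizeof(float));
            cudaStreamCreate(&streams[t]);
            // Task parallelism: kernels in different streams may run concurrently.
            scaleKernel<<<(n + 255) / 256, 256, 0, streams[t]>>>(d_data[t], 2.0f, n);
        }

        // The "task manager" role: synchronize all subtasks before using results.
        cudaDeviceSynchronize();

        for (int t = 0; t < numTasks; ++t) {
            cudaStreamDestroy(streams[t]);
            cudaFree(d_data[t]);
        }
        return 0;
    }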


 

GPU Acceleration

 
To date, General-Purpose computation on Graphics Processing Units (GPGPU) has become a popular technology for speeding up tasks with high computational density. The basic idea is that the multiple identical execution cores in a Graphics Processing Unit (GPU) can execute instructions in parallel. If a task can be divided into multiple subtasks, each subtask can be delivered to an individual GPU execution core, and all subtasks can then be executed in parallel, reducing the total time cost of the task. Because its memory accesses and operations are local, the FDTD method can take full advantage of this kind of parallelization and obtain a very high speedup.
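The kernel below is a minimal sketch of how such a local update maps onto the GPU: a hypothetical single-field seven-point stencil in the spirit of an FDTD update, not the actual Wavenology EM solver. Each thread owns one (x, y) column and reads only the immediate neighbors of each cell, which is exactly the locality the paragraph above refers to.

    // Hypothetical single-field finite-difference update on a 3D grid.
    __global__ void fdtdUpdate(const float *in, float *out,
                               int nx, int ny, int nz, float coef)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x <= 0 || x >= nx - 1 || y <= 0 || y >= ny - 1)
            return;                                   // skip boundary cells

        // March along z so each thread updates one column of cells.
        for (int z = 1; z < nz - 1; ++z) {
            int c = (z * ny + y) * nx + x;            // linear index of this cell
            out[c] = in[c] + coef * (in[c - 1] + in[c + 1]             // x neighbors
                                   + in[c - nx] + in[c + nx]           // y neighbors
                                   + in[c - nx * ny] + in[c + nx * ny] // z neighbors
                                   - 6.0f * in[c]);
        }
    }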

 
We show the performance of the GPU acceleration in the Wavenology EM package. In the simulation, we use an Nvidia 9600 GSO graphics card, which has 384 MB of local memory and 96 CUDA cores. The maximum number of thread blocks per grid dimension on this GPU is 65,535, and the maximum number of threads in each block is 512. The GPU core clock is 1375 MHz and the GPU memory clock is 800 MHz. The CPU system has a 4-core Intel Q6600 CPU with DDR2-800 memory. In our current implementation, we fix the GPU block size at 12×8 threads per block. We use a single-domain cavity case to compare with a benchmark case: the Nvidia CUDA example FDTD3D, which implements a finite-difference time-domain method for single-field propagation in a 3D space. Our cavity model includes two electric dipole sources and three observers to record the signal; its computation domain is divided into 100×100×100 and 280×280×16 cells in two separate runs. The fields are recorded at every time step.
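For reference, a launch configuration consistent with the fixed 12×8 block size might look as follows (a sketch using the hypothetical kernel above; nx, ny, nz, d_in, d_out, and coef are assumed to be set up by the caller):

    // Hypothetical launch configuration matching the fixed 12x8 block size;
    // grid dimensions are rounded up to cover the whole x-y cross-section.
    dim3 block(12, 8);                                // 96 threads per block
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    fdtdUpdate<<<grid, block>>>(d_in, d_out, nx, ny, nz, coef);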

 
The speedup performances for the cavity cases and CUDA example FDTD3D are shown in the following table.

 

The speedup performance comparison

Case name                                        Speedup factor
FDTD cavity, 100×100×100 cells                   20.2
FDTD cavity, 280×280×16 cells                    22.4
Nvidia CUDA example FDTD3D, 280×280×280 cells    40.1

 
As shown in the table above, the GPU speedup factor of the FDTD solver in Wavenology EM is about 20, while that of the CUDA example FDTD3D is about 40. This difference is reasonable for the following reasons:

 

  1. The example FDTD3D has no source, which in Wavenology EM must be calculated on the CPU and then transferred to the GPU.
  2. The example FDTD3D does not need to read fields back from the GPU at every time step.
  3. The full FDTD update is more complicated than that of the example FDTD3D: Wavenology EM updates 6 field components from 21 memory locations, while FDTD3D updates a single field from 6 memory locations.

 
The table also shows that, in our implementation, a case with a larger x×y cross-section achieves better speedup: the second case, with 280×280×16 cells, has a speedup factor of 22.4, while the first case, with 100×100×100 cells, has a speedup factor of 20.2.
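One plausible contributing factor (our assumption here, not a measured result) is padding waste: if the fixed 12×8 block tiles the x×y cross-section, a larger cross-section wastes a smaller fraction of the launched threads on padding, as the following back-of-the-envelope host code shows.

    // Hypothetical check, assuming the fixed 12x8 block tiles the x-y
    // cross-section of the grid.
    #include <cstdio>

    // Threads launched along one dimension after rounding up to whole blocks.
    static int padded(int n, int b) { return ((n + b - 1) / b) * b; }

    int main()
    {
        const int bx = 12, by = 8;                 // fixed block size
        const int cases[2][2] = { {100, 100}, {280, 280} };

        for (const auto &c : cases) {
            int launched = padded(c[0], bx) * padded(c[1], by);
            int useful   = c[0] * c[1];
            printf("%dx%d cross-section: %.1f%% of launched threads are idle padding\n",
                   c[0], c[1], 100.0 * (launched - useful) / launched);
        }
        return 0;                                  // prints ~11.0% vs ~2.8%
    }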