Parallel computing

cosmos 4th April 2017 at 11:26am
High-performance computing

Nice video about parallel computing

Why can't we keep increasing CPU clock speed? Power has emerged as one of the primary limiting factors in processor design.

Often used in Computer cluster and GPU computing settings. The main application is High-performance computing (see more there)

Fundamental concepts: total time (step complexity) vs total work (work complexity)

We say that a parallel algorithm is work-efficient if its work complexity is asymptotically the same as that of the equivalent serial algorithm
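For example, a parallel tree reduction of n numbers performs Θ(n) additions, the same as a serial loop, so it is work-efficient; a Hillis–Steele scan performs Θ(n log n) additions and is not, whereas the Blelloch scan gets back to Θ(n) work.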

Analysis of parallel algorithms


Parallel programming

Parallel communication patterns

Tasks ↔ Memory (how threads map onto memory locations; a minimal CUDA sketch of a couple of these patterns follows the list)

  • Map. 1-to-1: one thread operates on one part of memory, independently.
  • Scatter. 1-to-many: one thread writes to one or more, possibly scattered, memory locations, independently.
  • Gather. Many-to-1: like scatter, but for reading instead of writing.
    • Stencil. Read from a fixed set of neighbours, and write to one part of memory.
  • Transpose. 1-to-1, but reads and writes follow a fixed reordering of memory locations (e.g. row-major to column-major).
  • Reduce. All-to-1.
  • Scan/sort. All-to-all.
  • More methods
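
A minimal CUDA sketch of the map and gather patterns, roughly in the spirit of the Udacity course (the kernel names, array size and contents are made up for illustration):

```cuda
#include <cstdio>

// Map: 1-to-1 -- thread i reads element i and writes element i, independently.
__global__ void map_square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Gather: many-to-1 -- thread i reads several locations and writes one.
// (A 3-point average; with a fixed neighbourhood like this it is a stencil.)
__global__ void gather_avg3(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 10, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));   // unified memory, for brevity
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    map_square<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("map:    out[3] = %.1f\n", out[3]);   // 9.0

    gather_avg3<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("gather: out[3] = %.1f\n", out[3]);   // (2+3+4)/3 = 3.0

    cudaFree(in); cudaFree(out);
    return 0;
}
```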

Thread divergence
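
Threads in the same warp execute in lockstep, so when they take different branches the branches are serialized and part of the warp sits idle. A minimal sketch (branching on the thread index is just an illustrative choice; launch it like the kernels above):

```cuda
// Threads in the same warp branch on their index, so the warp runs both
// branches one after the other, with half the threads idle in each (divergence).
__global__ void divergent(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = sinf((float)i);   // even-indexed threads take this branch...
    else
        out[i] = cosf((float)i);   // ...odd-indexed threads take this one
}
```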


Introduction to parallel programming by nvidia in Udacity: https://classroom.udacity.com/courses/cs344/lessons/55120467/concepts/671181630923


Latency vs throughput tradeoff

Latency: time for a single unit operation to take place

Throughput: number of operations per second.

Latency has improved more slowly than throughput across hardware technologies: latency lags throughput
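
Illustrative (made-up) numbers: if a memory access takes 100 ns (latency) but accesses are pipelined so that a new result arrives every 1 ns, throughput is 10^9 accesses per second even though each individual access still takes 100 ns. GPUs exploit this by keeping many threads in flight, optimizing for throughput rather than latency.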

Types of parallel computing

  • High-throughput computing, aka embarrassingly parallel computing: lots of *independent* tasks.
  • High-performance computing often refers to one big task divided across many parallel compute nodes; the parts are not totally independent, so issues of communication between them need to be addressed.

Memory models

distributed and shared memory parallel computing models

  • Shared memory: all the cores can see the same memory. OpenMP. Limited to one node in a Computer cluster.
  • Distributed memory: each core has a separate memory it can access; data is exchanged via explicit messages. MPI. Scales to many thousands of cores across several nodes.

Often a combination of both is used; CUDA, for example, exposes fast shared memory within a block and global memory across blocks (see the sketch below).
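
A minimal sketch of that CUDA combination, assuming a power-of-two block size (array size and contents are made up): threads within a block cooperate through on-chip __shared__ memory, and each block publishes its partial result through global memory.

```cuda
#include <cstdio>

// Block-level sum reduction: within a block, threads cooperate via shared
// memory; each block then writes one partial sum to global memory.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float buf[256];                  // visible to this block only
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (all-to-1 within the block).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0]; // cross-block via global memory
}

int main() {
    const int n = 1 << 10, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, partial, n);
    cudaDeviceSynchronize();

    float total = 0.0f;                         // final combine on the host
    for (int b = 0; b < blocks; ++b) total += partial[b];
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(in); cudaFree(partial);
    return 0;
}
```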


  • Clusters and job managers
  • Jobs vs Tasks
    • Creating and submitting them
    • Getting the results
  • Code portability
  • Callback functions
  • Advanced parallelism
    • spmd mode, message passing
    • GPU computing

https://uk.mathworks.com/help/distcomp/how-parallel-computing-products-run-a-job.html