Massively Parallel Algorithms - SS 2014

News

The lecture begins at 8:30 am every Wednesday.

About this Course

There are big changes afoot. The era of increased performance from faster single cores and optimized single-core programs has ended. Instead, highly parallel GPU cores, initially developed for shading, can now run hundreds or thousands of threads in parallel. Consequently, they are increasingly being adopted to offload and augment conventional (albeit multi-core) CPUs. And the technology is getting better, faster, and cheaper. GPUs will probably even become general-purpose computing processors on mobile devices, because they offer more processing power per unit of energy.

The high number of parallel cores, however, poses a great challenge for software and algorithm design, which must expose massive parallelism to benefit from the new hardware architecture. The main purpose of the lecture is to teach practical algorithm design for such parallel hardware.
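Why the exposed parallelism has to be massive can be illustrated with Amdahl's law, one of the topics of the first lecture. The following is a minimal sketch in plain C, not taken from the course material; the function names and the choice of 1000 cores are illustrative assumptions.

```c
#include <stdio.h>

/* Amdahl's law: speedup on p processors when a fraction f (0..1) of the
   sequential running time parallelizes perfectly and the rest stays serial. */
double amdahl_speedup(double f, double p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

/* Prints the speedup for a few parallel fractions.
   The core count is an arbitrary illustrative value (an assumption). */
void print_speedup_table(void)
{
    const double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
    const double cores = 1000.0;
    for (int i = 0; i < 4; ++i) {
        double s = amdahl_speedup(fractions[i], cores);
        printf("f = %.2f  ->  speedup = %.1f\n", fractions[i], s);
    }
}
```

Calling print_speedup_table() from a main function shows the effect: even if 95% of a program parallelizes perfectly, the speedup stays below 20 no matter how many cores are used, because the serial 5% dominates.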

Simulation is widely regarded as the third pillar of science (in addition to experimentation and theory), and it has an ever-increasing demand for high-performance computing. High-performance computing has received a boost with the advent of GPUs, and it is even becoming, to some extent, a commodity.

There are many scientific areas where the knowledge you will gain in this course can be valuable to you:

Computer science (e.g., visual computing, database search)
Computational material science (e.g., molecular dynamics simulation)
Bio-informatics (e.g., alignment, sequencing, ...)
Economics (e.g., simulation of financial models)
Mathematics (e.g., solving large PDEs)
Mechanical engineering (e.g., CFD and FEM)
Physics (e.g., ab initio simulations)
Logistics (e.g., simulation of traffic, assembly lines, or supply chains)

In this course, you will get hands-on experience in developing software for massively parallel computing architectures. For the first half of the lecture, there will be supervised exercises to familiarize yourself with the CUDA parallel programming model and environment. The exercises will comprise, for instance, image processing algorithms, such as you might find in Photoshop or Instagram.

Prerequisites are:

A little bit of experience with C/C++; note that we will need just plain old C during this course, but the concept of pointers should be familiar
A liking for algorithmic thinking

Useful, but not required, is a computer/notebook with an Nvidia GPU that is CUDA capable. You can find a list of all supported GPUs here. If you don't have access to such a computer, you are welcome to work on your assignments in our lab.

Not required are:

Experience with parallel programming
Experience with computer graphics (although we will study several algorithms and applications of massively parallel algorithms in the area of visual computing)

Slides

The following table contains the topics and the accompanying slides (it will be filled step-by-step after the respective lectures).

Week	Topics	Slides	Assignments	Frameworks
1.	Introduction (More Moore, stream programming model, Flynn's taxonomy, overall speedup, Amdahl's law, Gustafson's law)	PDF1 PDF2	CudaFirstSteps Assignment 1
2.	Introduction to CUDA and Fundamental Concepts of Parallel Programming 1 (terminology, control flow, transferring data, blocks, data flow in general)	PDF	Assignment 2	Memcpy and Kernel Launch
3.	Introduction to CUDA and Fundamental Concepts of Parallel Programming 2 (multi-dimensional blocks and grids, classes in CUDA, constant memory, simplistic raytracer, warps, thread divergence)	PDF	Assignment 3	Reverse Array and Fractal Zoomer
4.	Introduction to CUDA and Fundamental Concepts of Parallel Programming 3 (constant memory, more details on the GPU architecture, warps, measuring performance, GPU/CPU synchronization, shared memory, algorithm for dot product, barrier synchronization, race conditions, document similarity)	PDF	Assignment 4	ReduceMaxSum
5.	Introduction to CUDA and Fundamental Concepts of Parallel Programming 4 (coalesced memory access, heat transfer simulation, double buffering pattern, texture memory, GPU's memory hierarchy, parallel histogram computation, atomic operations)	PDF	Assignment 5	Heat Transfer
6.	Dense Matrix Algorithms (matrix vector product, column major storage, four variants of the algorithm, matrix-matrix multiplication, blocked matrix multiplication, arithmetic intensity, All Pairs Shortest Path with matrix mult.)	PDF	Assignment 6	Matrix Operations
7.	Parallel prefix-sum 1 (definition, simple examples, Hillis-Steele algorithm, depth & work complexity, Blelloch's algorithm, Brent's theorem & optimization)	PDF	Assignment 7	Line of Sight
8.	Parallel prefix-sum 2 (Brent's theorem, application to prefix-sum, theoretical speedup based on Brent, split operation, quick introduction to sequential radix sort, radix sort on the GPU, stream compaction, summed area tables, better precision for integral images, depth-of-field rendering, gathering vs. scattering, face detection)	PDF
9.	Parallel sorting 1 (comparator networks, the 0-1 principle, odd-even mergesort, bitonic sorting)	PDF	Assignment 8
10.	Parallel sorting 2 (example networks of bitonic sorters, complexity, (digression: adaptive bitonic sorting), application: BVH construction, space filling curves)	PDF	Assignment 9	Image Integral Sum
11.	Parallel sorting 3 (construction of z-order curve, Morton codes, construction of linear BVHs, faster ray-tracing by coherent ray packets, packing & unpacking arrays of hash values) Random Forests 1 (classification problem, simple solutions, concept of decision trees)	PDF1 PDF2	Assignment 10
12.	Random Forests 2 (information gain, entropy, conditional entropy, problems of decision trees, wisdom of crowds, bootstrapping/subsampling, randomization during construction)	PDF	Assignment 11
13.	Random Forests 3 (recap, applications: digit recognition, Kinect-like body tracking)	PDF

Some very simple examples are provided that can serve as a starting point. They contain a makefile and should compile out-of-the-box, at least under Mac OS X and Linux.

Textbooks

The following textbooks can help review the material covered in class:

Jason Sanders, Edward Kandrot: CUDA by Example. Addison-Wesley, Pearson Education.
Very easy reading, a very gentle introduction to CUDA, requires minimal C knowledge.
David B. Kirk, Wen-Mei W. Hwu: Programming Massively Parallel Processors. Morgan Kaufmann.
NVidia: CUDA C Programming Guide. (There is also a PDF version.)
Russ Miller, Laurence Boxer: Algorithms, Sequential & Parallel. Cengage Learning.
Doesn't talk about technical details of implementing parallel algorithms, but takes a more theoretical, purely algorithmic perspective (uses PRAMs as algorithmic model).

Please note that the course is not based on one single textbook! Some topics might not even be covered in any current textbook! So, I'd suggest you first look at the books in the library before purchasing a copy.

If you plan on buying one of these books, you might want to consider buying a used copy -- they can often be purchased for a fraction of the price of a new one. Two good internet used book shops are Abebooks and BookButler.

Grades and Points achieved by the Assignments

To take part in the so-called "Fachgespräch" (mini oral exam), you need a grade of 4.0 or better from the assignments. You earn this grade by doing the exercises (assignments): you need at least 40% of the total points of all assignments to achieve a grade of 4.0.

Some Additional Literature You Might Want to Read Online

Reevaluating Amdahl's Law and Gustafson's Law (Source)
Herb Sutter's Welcome to the Jungle from 2012 gives a number of inspiring insights into future trends -- and past trends that have ceased to continue. (Herb Sutter is widely respected in the C++ software engineering community.) (Source)
What Every Programmer Should Know About Memory by Ulrich Drepper, 2007 (Source)
Well, I don't think that every programmer really has to know everything the paper explains, but I hope you all are interested in broadening your horizons; who wants to know only the stuff she/he needs?
Performance with constant memory (excerpt from CUDA by Example).
CUDA Documents from NVidia
Introduction to CUDA 5.0 (Source)
Chapter 39. Parallel Prefix Sum (Scan) with CUDA by Mark Harris, Shubhabrata Sengupta, John D. Owens. In GPU Gems 3. (Source)
My article about Adaptive Bitonic Sorting
A very good explanation of BVH construction in this blog article by Tero Karras, with an application to collision detection on the first pages (Source)
Here is a pretty comprehensive Introduction to Random Forests by Criminisi, Shotton, and Konukoglu. (Source)

Course exercise projects created by students in SS2014

Gabriel Zachmann
Last modified: Thu Jul 24 14:01:16 CEST 2014