Distributed and Accelerated

The Container Function Library

The Container Function Library (CFL) is a modern C++ header library for concurrency on distributed systems with GPU acceleration. The library can be used to distribute data and execution across cores, machines and GPUs, using a symbolic notation with zero1 run-time overhead.

A key to concurrent programming is to put computations in containers and treat results from asynchronous or remote functions as proper values. Another key is to define the data-parallel parts of a program with collective operations instead of sequential expressions. Synchronization can otherwise be infeasible when data and execution are distributed and nested on many cores, machines and GPUs.

More profoundly, containers can express the modality of values and computations, and collective operations the dependency between computations. And for computations, dependency is the fundamental constraint for parallelization.

Concretely, CFL augments functions and values with a global address space for distributed and accelerated systems, provides a set of collective operation primitives, and has tools to adapt end-user functions for distributed use. Any function or library can be adapted for use with CFL. In particular, this applies to the Concurrency and Parallelism Technical Specifications from the C++ Standardization Committee.

Features

Central to CFL is the generalization of functions to accept containers as arguments - hence the name Container Function Library. Whether an argument is provided as a plain value or in a container does not matter; only the result of the function changes accordingly. In particular, this allows for a simple syntax to define future continuations of arbitrary arity, i.e. compositions of asynchronous functions.
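
As a sketch of what this enables (using the cfn adaptor and thread literals introduced in the examples below; the 1_thread literal and the exact asynchronous behaviour are assumptions here), a continuation of arity two is written as an ordinary call:

auto f = cfn ([] (int x)        { return x + 1; });
auto g = cfn ([] (int x, int y) { return x + y; });

auto a = 0_thread (f) (1);   // asynchronous result, held in a container
auto b = 1_thread (f) (2);   // another asynchronous result
auto r = g (a, b);           // continuation of arity two: evaluated once
                             // both arguments are available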

Another essential feature is the ability to turn expressions into functions, with composition and adaptors. While this may appear to be syntactic sugar, it is in fact indispensable for defining a program with collective operations. As a consequence, collective operations have value semantics, a significant difference from the standard library. Accordingly, reference semantics and value life-time management are provided through adaptors.

CFL provides:

Like transform and accumulate in the standard library, collective operations can be expressed as higher-order functions. CFL generalizes collective operations to:

The symbolic notation is resolved at compile-time with generic programming, treating functions as first-class objects. Compile-time evaluation allows for optimizations not available at run-time, and type-safety can be preserved across hardware boundaries.

The meta programming facilities of CFL are really a by-product of other library features. Its API is somewhat disguised beneath what appear to be naming conventions, composition and collective operations on tuples. Still, the expected capabilities of compile-time evaluation and type-level operations are available, with a syntax unified with that of run-time computations.

Compatibility

CFL is a small header library and requires only a standard C++11 compiler2. Distributed programs require POSIX and ssh, and GPU acceleration requires NVIDIA CUDA3. Since CFL is a header library, the user remains in full control of the code that is generated.

Using the library does not require any prior knowledge of concurrent, distributed, GPU or meta programming, but the presentation here assumes at least some familiarity with the perils of parallelizing a program.

Example: Hello World!

First, a “Hello World!” program for a GPU-accelerated cluster, with remote and parallel printing. If unfamiliar, expressions like 1__ and 1_node are C++11 user-defined literals, a convenient way to give a number a type. Here, 1__ is an argument placeholder and 1_node an address to a remote machine (or, to be precise, a function adapting another function to execute on the node with address 1).
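
For orientation only, a minimal standard-C++ sketch of the user-defined literal mechanism (these are not the CFL definitions; the names here are hypothetical):

struct node_address { unsigned long long value; };

constexpr node_address operator"" _node_sketch (unsigned long long n)
{
    return node_address { n };   // 2_node_sketch yields node_address {2}
}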

#define CFL_DISTRIBUTED 2       // number of processes (nodes)
#include <cfl/cfl.hpp>          // master header
using namespace cfl;

int main ()
{
    char str [] = "Hello World!\n";

    auto f = cout << 1__;       // operator << as a unary function

    // print whole string
    f (str);                    // locally
    0_gpu (f) (str);            // on gpu #0
    1_node (f) (str);           // on node #1
    0_thread (f) (str);         // in thread #0

    // split printing in parts
    map (f, str);               // single chars locally in sequence
    0_gpu (map) (f, str);       // single chars on gpu #0 in parallel
    all_nodes (map) (f, str);   // in chunks on all nodes in parallel
    all_threads (map) (f, str); // in chunks in all threads concurrently
}

Compile with nvcc -std=c++11 hello_world.cu -lpthread and execute with ./a.out -s 10.0.0.2. Two processes are started, a master on the local machine and a slave on 10.0.0.2 (see the manual page for run). The number of threads (or work queues) defaults to the hardware concurrency4.

Example: Map-Reduce

How to optimize nested collective operations using function composition and adaptors.

Consider an input array u and two functions f and g:

int u [4] = {1, 2, 3, 4};

auto f = [] (int u       ) { return 2 * u; };
auto g = [] (int u, int v) { return u + v; };

A possible map-reduce implementation using the standard library could be:

using namespace std;

int v [4];
transform (begin (u), end (u), begin (v), f);
auto r = accumulate (begin (v), end (v), 0, g);

Instead using the corresponding functions map and fold from CFL:

using namespace cfl;

auto r = fold (g, 0, map (f, u));

As seen above, the intermediate result from map has owning semantics, and does not have to be manually allocated. On the argument side, both map and fold accept arrays without the need to retrieve start and end iterators.

Assignment to a manually allocated intermediate result v is possible through the in-place operator <<=, adapting a function to return by in-place assignment instead of by value:

(v <<= map) (f, u);

Generally, this is meaningful when assignment can be moved into a collective operation, which is the case for map, but not fold. When operating on uninitialized storage, there is a corresponding in-place placement new operator %=.
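
A sketch of the latter, assuming %= mirrors the calling convention of <<= shown above (the raw buffer is just an illustration of uninitialized storage):

alignas (int) unsigned char raw [4 * sizeof (int)];
auto & w = reinterpret_cast <int (&) [4]> (raw);

(w %= map) (f, u);   // elements constructed in place with placement new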

In particular, in-place assignment is supported by bound functions (closures), when their inner components do. This can be used for in-place assignment in nested function calls, and is one of the core features of CFL5.

Most CFL functions support argument placeholders natively, used for partial application and composition:

auto h = fold (g, 0, map (f, 1_c));

Here h is a bound function (or a closure), accepting an array argument. Regarding the placeholder 1_c, there are some subtleties concerning closures and higher-order functions that the standard library cannot express6, so CFL introduces two different types of placeholders, 1__ and 1_c (see the manual page for bind). Any function can be made to accept placeholders using the cfn adaptor, short for container function.

This can be taken a step further, avoiding an intermediate result array altogether, by moving the application of f into the second argument of g. To achieve function composition like this, f and g are adapted to accept argument placeholders using cfn:

auto f = cfn ([] (int u       ) { return 2 * u; });
auto g = cfn ([] (int u, int v) { return u + v; });

auto h = g (1__, f (2_c));

auto r = fold (h, 0, u);

For reference, this composition is provided by the standard library as inner_product. Where the standard library is limited to a number of explicit common compositions, CFL provides arbitrary composition.
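
For comparison, the same fused step written directly against the standard library, using a hand-written lambda over the functions f and g from the beginning of this example:

auto r = accumulate (begin (u), end (u), 0,
                     [&] (int acc, int x) { return g (acc, f (x)); });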

This is how partial application, composition and in-place assignment can conveniently be used in CFL to optimize collective operations, depending on the particular calling situation.

Example: zip

How to manage value lifetime using function adaptors.

zip makes an array of tuples from the elements of several arrays that occur at the same position. In general, zip can be viewed as an example of the conversion between structure-of-arrays and array-of-structs that is sometimes necessary for large data sets and GPU memory layouts. A possible zip implementation f in CFL could be:

auto f = map (tuple, 1__, 2__);

int u [3] = {1, 2, 3};
int v [3] = {4, 5, 6};

auto x = f (u, v); // {(1,4), (2,5), (3,6)};

The default storage in CFL follows the expression categories of C++ strictly. Basically, the categories describe whether an expression yields a value or a reference, and an indication whether it is about to expire or not. See section [basic.lval] in the C++ specification.

As the array element access u [i] yields an lvalue reference, an lvalue reference is precisely what is stored in the tuples of x. Perhaps somewhat unexpected, but consistent with the C++ value categories. Assuming automatic variables and temporaries, this is in fact a least-copies strategy guaranteeing correct value lifetime.
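
A concrete consequence: since x stores lvalue references into u and v, later changes to the source arrays remain visible through x.

u [0] = 10;   // the first tuple of x now refers to the values 10 and 4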

To instead store by value, the prvalue (pure rvalue) category should be used. In CFL this would be expressed using the prfn adaptor, short for prvalue function, modifying a function to use prvalues instead.

auto g = map (prfn (tuple), 1__, 2__);

Note that prfn modifies the tuple function itself, so prfn (tuple) is another function. While not obvious, it is in fact equivalent to modifying the closure f above:

auto h = prfn (f);

There is also a corresponding glfn adaptor, storing by generalized lvalue (glvalue)7 instead, a.k.a. forwarding.
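
By analogy with the prfn example above, a sketch of the glfn variant:

auto k = map (glfn (tuple), 1__, 2__);   // elements stored by glvalue, i.e. forwarded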

prfn and glfn modify functions in a fundamental way, to return by value or by reference. The operations are available for almost all functions in CFL, at no run-time cost, including bound functions (closures). This is something without correspondence in the standard library, and may take some time to get used to. However, for reasons of correctness and efficiency, function storage manipulation is necessary when functions are called in different contexts, such as when part of a temporary expression or when leaving the scope of referenced values.
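
A sketch of the scope problem (make_pairs is a hypothetical helper, not part of CFL): returning the default, reference-storing result of f from a local scope would leave dangling references, whereas the prfn-adapted call stores by value and is safe to return.

auto make_pairs = [&] ()
{
    int u [3] = {1, 2, 3};
    int v [3] = {4, 5, 6};
    // return f (u, v);      // would store lvalue references to locals (dangling)
    return prfn (f) (u, v);  // stores elements by value, safe after return
};

auto y = make_pairs ();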

Example: Remote Execution

How to distribute execution on cluster nodes and GPUs.

GPU acceleration is enabled by using the nvcc compiler from NVIDIA, and cluster distribution by the CFL_DISTRIBUTED macro (see the manual page for run).

A function can be made to execute on another machine using the node and gpu adaptors.

auto f = [] (int arg) { return arg + 1; };
auto g = 1_node (f);
auto r = g (0);

Here, g is a local value, a function, such that when invoked it executes on the node with address 1. r is a local value, holding a reference to the actual result residing on the remote node. Similarly, below f is adapted for execution on the gpu with address 1. Nested locations are permitted as well.

auto h = 1_gpu  (f);         // executes on gpu  #1
auto i = 1_node (1_gpu (f)); // executes on gpu  #1 on node #1 

Some natural restrictions apply to f and its arguments, such as being serializable. This excludes non-empty lambda closures, but CFL closures can be used without restriction, as long as their elements comply. This means the short-hand syntax for composition can be used in remote calls as well8.
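
As a sketch of the difference (the closure follows the cfn pattern from the Map-Reduce example; the exact capture and serialization rules are in the manual):

int n = 1;

auto bad  = [n] (int x) { return x + n; };                       // non-empty lambda closure: not serializable
auto good = cfn ([] (int x, int y) { return x + y; }) (1__, n);  // CFL closure binding n: usable remotely

auto r = 1_node (good) (41);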

Remote execution is also defined for collective operations. Using all nodes in a cluster to map f over an array u:

int u [3] = {0,1,2};

auto g = all_nodes (map);

auto r = g (f, u);

Here all_nodes is a convenience definition denoting all nodes in the cluster. g is a function mapping execution to those nodes, in chunks along the outer-most dimension of u, and r is an array of containers holding the remote results. Instead mapping the execution of f to the threads of a GPU:

auto h = 1_gpu (map);
auto s = h (f, u);

Here u is copied to GPU memory and the application of f is distributed to threads along its inner-most dimension. Generally, coalesced GPU memory accesses require execution to be parallelized along the inner-most dimension - the opposite of the outer-most dimension often suitable for parallelization across nodes or GPUs.

CFL provides a convenient syntax for remote calls. However, depending on the calling situation, the most efficient variant of message and value synchronization may have to be specified manually. For reference, see the manual page for node9. It is the caller's responsibility to distribute execution effectively. For example, there is no restriction on nested collective remote calls, where value synchronization can quickly become costly.
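
For illustration, a sketch of such a nested collective remote call over a two-dimensional array (the closure syntax for the inner map follows the placeholder pattern above, but its use in this nested setting is an assumption): rows are distributed to the cluster nodes, and on each node the row elements are mapped on the GPU, so results must be synchronized across both levels.

int m [2][3] = {{0, 1, 2}, {3, 4, 5}};

auto inner = 0_gpu (map) (f, 1_c);        // GPU map over one row, as a closure
auto r     = all_nodes (map) (inner, m);  // rows distributed across all nodes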


  1. A space and time overhead comparable to that of hand-crafted code. Or, as put by Bjarne Stroustrup, “What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.”

  2. Almost standard. There is some non-essential use of non-standard code for syntactic sugar reasons only. One is the constructor function attribute for automatic initialization in distributed programs.

  3. There are emerging open alternatives to the proprietary CUDA C++ compiler, but so far CFL has only been verified with CUDA. One reason is the single-source solution in CUDA, suitable for generic programming.

  4. Plus one for message processing, if CFL_DISTRIBUTED is defined.

  5. For example, to in-place assign the result of an inner map, use a function composition instead of plain function calls. For some functions f and g and arrays u and r:

    (r <<= map) (f, map (g, u));        // allocates inner result
    (r <<= map  (f, map (g, 1_c))) (u); // in-place inner result
  6. In the standard library, nested bind subexpressions are always evaluated, which is not always desired when passing function arguments to higher-order functions.

  7. Although the built-in array subscripting already has forwarding properties and always yields a glvalue, even for a prvalue array, CFL subscripting does not.

  8. Normally, remote call handlers need not be manually registered. There are some exceptions though, such as remote calls to gcc’s indirect functions, such as sin.

  9. The syntax for remote execution is likely to change somewhat. In part to align syntax to current standardization proposals, and in part to support the revision of work queues and user mode context switching (see About Remote Array Operations).