About Functional Computing

Functional Computing Sweden AB was founded in 2013 in Lund (Sweden) and has developed the Container Function Library (CFL).

The primary goal for the library is to follow standardizations in parallel algorithms and provide a symbolic notation for writing distributed and accelerated programs. Examples include wider implementation of the Parallelism Technical Specification version 1, and standardization of version 2. A secondary goal is compatibility with non-standard libraries, such as Libfabric and HPX.

The primary target group is users who have a few spare machines or GPUs and a data-parallel task that has grown too large for conventional scripted tools, but who are not necessarily well versed in concurrent programming. The secondary target group is experienced users who require a generic library for complex distribution of data and execution.

The evolution of features in CFL is briefly presented below. Some of them have no counterpart in the standard library (or even elsewhere), so each is presented together with its motivation.

Composition

Adapting sequences of remote and collective operations to a specific hardware configuration required a simple syntax for function composition and partial application. To achieve this, CFL introduced a composition placeholder and implemented several core library functions, such as tuple and bind.
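As a rough illustration of what composition and partial application provide, the sketch below uses only standard C++14. The helpers compose and bind_first are hypothetical names chosen for this example. They are not CFL's placeholder syntax or its bind, neither of which is shown in this text.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not CFL's API): compose(f, g)(xs...) == f(g(xs...)).
template <class F, class G>
auto compose(F f, G g) {
    return [f, g](auto&&... xs) {
        return f(g(std::forward<decltype(xs)>(xs)...));
    };
}

// Hypothetical helper (not CFL's bind): fixes the first argument of f.
template <class F, class A>
auto bind_first(F f, A a) {
    return [f, a](auto&&... xs) {
        return f(a, std::forward<decltype(xs)>(xs)...);
    };
}

// Builds "multiply by 3, then add 1" out of two partially applied functions.
inline int scale_then_offset(int x) {
    auto times = [](int a, int b) { return a * b; };
    auto plus  = [](int a, int b) { return a + b; };
    auto pipeline = compose(bind_first(plus, 1), bind_first(times, 3));
    return pipeline(x);
}

// Usage: the composed pipeline behaves like one ordinary function.
inline void composition_example() {
    assert(scale_then_offset(4) == 13);  // 3 * 4 + 1
    assert(scale_then_offset(0) == 1);   // 3 * 0 + 1
}
```

The point of the sketch is that a pipeline can be assembled from small pieces and only later applied to data, which is what makes it adaptable to different hardware configurations.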

Inplace Operations

Composition, in turn, required value semantics for collective operations, i.e. the ability to return by value as well, instead of only by reference as in the standard library. This led to the introduction of inplace operations, which assign and construct the elements of pre-allocated results of collective operations without creating any (possibly large) temporary results.
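The trade-off between returning by value and writing into a pre-allocated result can be sketched in standard C++ as follows. The names map_by_value and map_inplace are assumptions made for this illustration and do not reflect CFL's actual interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// By value: convenient inside a composition, but allocates a fresh result.
template <class F>
std::vector<int> map_by_value(F f, const std::vector<int>& in) {
    std::vector<int> out;
    out.reserve(in.size());
    for (int x : in) out.push_back(f(x));
    return out;
}

// In place: assigns into a result that the caller has already allocated,
// so no temporary container is created.
template <class F>
void map_inplace(std::vector<int>& out, F f, const std::vector<int>& in) {
    assert(out.size() == in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
}

// Usage: both variants produce the same elements.
inline void inplace_example() {
    const std::vector<int> in{1, 2, 3};
    auto square = [](int x) { return x * x; };

    std::vector<int> fresh = map_by_value(square, in);

    std::vector<int> reused(in.size());  // allocated once, up front
    map_inplace(reused, square, in);

    assert(fresh == reused);
}
```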

Value Category Adaptors

Next, value semantics required the adaptation of functions themselves to return a specified value category¹. This can be viewed as function output modification, akin to input modifiers such as std::move. Just as value lifetime and ownership could be described with the introduction of rvalue references in C++11, lifetime in compositions can be described with the value category adaptors prfn and glfn, short for pure rvalue function and generalized lvalue function.
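The idea of modifying a function's output category can be sketched in standard C++14. The wrapper prvalue_fn below is a hypothetical stand-in written for this example: it forces any wrapped function to return a prvalue. It is not the definition of CFL's prfn, which is not given in this text, and no counterpart for glfn is attempted.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical adaptor: whatever f returns, hand back a prvalue (a copy or move).
template <class F>
struct prvalue_fn {
    F f;
    template <class... Xs>
    auto operator()(Xs&&... xs) const
        -> std::decay_t<decltype(f(std::forward<Xs>(xs)...))> {
        return f(std::forward<Xs>(xs)...);
    }
};

// An accessor that returns a reference into its argument.
struct first_element {
    template <class C>
    auto operator()(C& c) const -> decltype(c.front()) { return c.front(); }
};

// Usage: the adapted accessor no longer aliases the container.
inline void value_category_example() {
    std::vector<int> v{1, 2, 3};
    first_element by_ref;
    prvalue_fn<first_element> by_val{by_ref};

    static_assert(std::is_same<decltype(by_ref(v)), int&>::value, "reference into v");
    static_assert(std::is_same<decltype(by_val(v)), int>::value, "independent value");

    by_ref(v) = 10;        // writes through to v
    int copy = by_val(v);  // the copy is unaffected by later changes to v
    v.front() = 20;
    assert(copy == 10);
    assert(v.front() == 20);
}
```

Returning a prvalue makes the result safe to pass along a composition even after the original container is gone, at the price of a copy.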

Meta Programming

During implementation, a whole family of collective operations for tuples emerged. Although it was not a primary goal, the library provides a GPU-compatible² API for meta programming through this family, since its members evaluate both at compile time and at type level. In fact, most functions in CFL evaluate at type level through the _t suffix convention. The pack family of classes also provides type-only operations.
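The pairing of a value-level function with a type-level twin named by a _t suffix can be illustrated with standard C++14. The names tuple_map, tuple_map_t and halve are hypothetical and chosen only for this sketch. CFL's own tuple operations are not shown in this text and may be defined quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

template <class F, class Tuple, std::size_t... I>
constexpr auto tuple_map_impl(F f, const Tuple& t, std::index_sequence<I...>) {
    return std::make_tuple(f(std::get<I>(t))...);
}

// Hypothetical value-level function: applies f to every element of a tuple.
template <class F, class... Ts>
constexpr auto tuple_map(F f, const std::tuple<Ts...>& t) {
    return tuple_map_impl(f, t, std::index_sequence_for<Ts...>{});
}

// Hypothetical type-level twin: the result type, computed without any values.
template <class F, class Tuple>
using tuple_map_t =
    decltype(tuple_map(std::declval<F>(), std::declval<const Tuple&>()));

struct halve {
    constexpr double operator()(int x) const { return x / 2.0; }
    constexpr double operator()(double x) const { return x / 2.0; }
};

// Usage: the same operation at type level, at compile time and at run time.
inline void type_level_example() {
    using input = std::tuple<int, double>;
    static_assert(std::is_same<tuple_map_t<halve, input>,
                               std::tuple<double, double>>::value,
                  "type-level evaluation");

    constexpr auto r = tuple_map(halve{}, input{3, 5.0});
    static_assert(std::get<0>(r) == 1.5 && std::get<1>(r) == 2.5,
                  "compile-time evaluation");

    auto s = tuple_map(halve{}, input{8, 1.0});
    assert(std::get<0>(s) == 4.0);
}
```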

The meta programming capabilities can be generalized to any container type, provided that an element-wise constructor and element access exist. While not difficult to implement within CFL, such a generalization does introduce ambiguities for nested containers, which have to be resolved by the syntax. Efficient construction of nested containers also requires constructors designated for collective operations. The generalization therefore comes at the cost of a less simple syntax, and of either run-time efficiency or intrusive changes to the container types.

Remote Array Operations

Originally, CFL only implemented collective operations on arrays for GPUs and used MPI for calls to other machines. Eventually, MPI was abandoned because of the difficulty of implementing simple and efficient serialization for generic types³. Instead, CFL implements its own process communication and job launcher, which account for the dependencies⁴ on POSIX and ssh. Although it was the first part to be implemented, collective operations on arrays for GPUs and cluster nodes are now the last part to be released.

One of its core features is maintaining a collection of values at remote locations while keeping its shape as a local first-class function. The collection, its values and its shape may all be asynchronous values (futures), but at least the shape can often be kept synchronous⁵. This way a master node can make top-level decisions about how to partition an array and dispatch it to slaves without actually synchronizing the collection. Before release, however, work queues need to support user-mode context switching in order to handle nested asynchronous values better.
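The benefit of a synchronous shape can be sketched as follows. To keep the example single-threaded, a plain deferred callable stands in for a future, and the shape is simplified to a single size rather than a function. The names lazy_array and partition are hypothetical and do not describe CFL's types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a future: the values are produced only on request.
using deferred_values = std::function<std::vector<int>()>;

// Hypothetical collection: the shape is known now, the values are not.
struct lazy_array {
    std::size_t size;        // synchronous shape (simplified to a length)
    deferred_values values;  // asynchronous values (stand-in)
};

// Splits [0, size) into `parts` contiguous ranges using the shape only.
inline std::vector<std::pair<std::size_t, std::size_t>>
partition(const lazy_array& a, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t p = 0; p < parts; ++p)
        ranges.emplace_back(a.size * p / parts, a.size * (p + 1) / parts);
    return ranges;
}

// Usage: partitioning is decided before any value has been computed.
inline void shape_example() {
    bool forced = false;
    lazy_array a{10, [&forced] {
                     forced = true;
                     return std::vector<int>(10, 1);
                 }};

    auto ranges = partition(a, 3);  // decided from the shape alone
    assert(!forced);                // no value has been computed yet
    assert(ranges.size() == 3);
    assert(ranges[0].first == 0);
    assert(ranges[2].second == 10);

    auto v = a.values();            // synchronize only when the data is needed
    assert(forced);
    assert(v.size() == a.size);
}
```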

This is an area of intensive work, and there are several relevant experimental C++ features. However, language and library advances reach the GPU compiler from NVIDIA (nvcc) with some natural delay. As a measure, nvcc gained C++14 support in September 2017, about two years after gcc. There are non-standard context switching implementations from Boost (context, coroutine and fiber). They require a minor installation on the target machines, which is non-standard on many older systems. Alternatively, they can be distributed statically together with the generated user program binary. A standards-conforming solution would be to use POSIX ucontext⁶ to implement context switching in CFL. Of the three, Boost inclusion is currently the most viable solution.
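For readers unfamiliar with user-mode context switching, the following minimal sketch shows the POSIX ucontext variant: a task yields in the middle of its function body and is resumed later. It assumes a system that still ships ucontext.h (for example Linux with glibc). All names in the ctx_sketch namespace are invented for this example, and the sketch says nothing about how CFL's work queues are or will be implemented.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

namespace ctx_sketch {

ucontext_t scheduler_ctx;    // where control returns when the task yields
ucontext_t task_ctx;         // the suspended task
std::vector<int> trace;      // records the order in which code ran
char task_stack[64 * 1024];  // private stack for the task

void task() {
    trace.push_back(1);
    swapcontext(&task_ctx, &scheduler_ctx);  // yield in the middle of the function
    trace.push_back(3);
}   // returning continues at task_ctx.uc_link

// Runs the task in two slices and returns the interleaving that was observed.
std::vector<int> run() {
    trace.clear();
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &scheduler_ctx;
    makecontext(&task_ctx, task, 0);

    swapcontext(&scheduler_ctx, &task_ctx);  // start the task
    trace.push_back(2);
    swapcontext(&scheduler_ctx, &task_ctx);  // resume it after the yield
    trace.push_back(4);
    return trace;
}

}  // namespace ctx_sketch

// Usage: scheduler and task steps interleave as 1, 2, 3, 4.
inline void context_switch_example() {
    assert((ctx_sketch::run() == std::vector<int>{1, 2, 3, 4}));
}
```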

¹ Basically, the categories describe whether an expression yields a value or a reference, and whether it is about to expire or not. See section [basic.lval] in the C++ specification.
² The GPU compiler from NVIDIA (nvcc) does have some issues concerning constexpr evaluation and unevaluated contexts. But this is probably on par with the natural delay in features for the GPU compiler.
³ Although MPI has support for user-defined data types, it is limited to the capabilities of the offsetof macro, whose use soon runs into undefined behaviour.
⁴ Non-standard context switching libraries are considered for inclusion, see section Remote Array Operations.
⁵ Most collective functions can deduce the shape of their result before actually performing any operations on elements.
⁶ Standard POSIX ucontext has been deprecated since the adoption of the 1999 C standard, although most systems support it anyway. It is not easy to be standards compliant!