class: center, middle

Introduction to HPX

Part 2

Overview

Previous: Introduction to HPX - Part 1

??? Click here to view the Presentation


Recap: What's HPX

  • Solidly based on a theoretical foundation – a well-defined, new execution model (ParalleX)
  • Exposes a coherent and uniform, standards-oriented API for ease of programming parallel and distributed applications.
    • Enables writing fully asynchronous code using hundreds of millions of threads.
    • Provides unified syntax and semantics for local and remote operations.
  • Developed to run at any scale
  • Compliant C++ Standard implementation (and more)
  • Open Source: Published under the Boost Software License

Recap: What's HPX

HPX represents an innovative mixture of

  • A global system-wide address space (AGAS - Active Global Address Space)
  • Fine-grained parallelism and lightweight synchronization
  • Combined with implicit, work-queue-based, message-driven computation
  • Full semantic equivalence of local and remote execution, and
  • Explicit support for hardware accelerators (through percolation)

The HPX Programming Model

(figure: the HPX programming model, presented as a progressive build across several slides)

The HPX API

Borrowed from the C++ Standard

| Class | Description |
| --- | --- |
| hpx::thread | Low-level thread of control |
| hpx::mutex | Low-level synchronization facility |
| hpx::lcos::local::condition_variable | Signal a condition |
| hpx::future | Asynchronous result transport (receiving end) |
| hpx::promise, hpx::lcos::local::promise | Asynchronous result transport (producing end) |
| hpx::lcos::packaged_task | Asynchronous result transport (producing end) |
| hpx::async | Spawning tasks (returns a future) |
| hpx::bind | Binding parameters to callables |
| hpx::function | Type-erased function object |
| hpx::tuple | Tuple |
| hpx::any | Type-erased object (similar to void*) |
| hpx::parallel::for_each, etc. | Parallel algorithms |
| hpx::compute::vector | Continuous storage for N elements |

Extensions to the standard APIs where necessary, maintaining full compatibility.


The HPX API

Lightweight Control Objects (LCOs)

  • Objects to synchronize between different threads of execution.
  • Ability to suspend and reactivate tasks
  • Examples:
    • mutex
    • condition_variable
    • channel
    • promise
    • future
    • when_all, when_any
    • ...
  • More on those later (a channel sketch follows below)
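
A minimal channel sketch, assuming the hpx::lcos::local::channel API of this HPX generation (set() sends a value, get() returns a future):

hpx::lcos::local::channel<int> c;

// Producer: another task sends a value into the channel.
hpx::async([&c]() { c.set(42); });

// Consumer: get() returns a future that becomes ready once a
// value has been sent.
hpx::future<int> f = c.get();
std::cout << f.get() << std::endl;   // prints 42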

What is a (the) future?

A future is an object representing a result which has not been calculated yet

.left-column[ ... ]

.right-column[

  • Enables transparent synchronization with producer
  • Hides notion of dealing with threads
  • Makes asynchrony manageable
  • Allows for composition of several asynchronous operations
  • Turns concurrency into parallelism ]
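
A minimal sketch of these points using nothing but hpx::async and get():

// Spawn the computation, keep working, synchronize only when needed.
hpx::future<int> f = hpx::async([]() { return 6 * 7; });

// ... other useful work happens concurrently here ...

std::cout << f.get() << std::endl;   // blocks until ready, prints 42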

Diving into the Future - The (basic) API

template <typename R>
class future
{
    // future constructors
    // Query the state
    // Waiting on the result
};

template <typename R>
class shared_future
{
    // Future constructors
    // Query the state
    // Waiting on the result
};

Diving into the Future - The (basic) API

Constructing a hpx::future<R>

template <typename R>
class future
{
    // Future constructors
    // Construct an empty future.
    future();

    // Move a future to a new one
    future(future&& f);

    // Unwrap a future. The new future becomes ready when
    // the inner and outer future are ready.
    explicit future(future<future<R>>&& f);
    explicit future(shared_future<future<R>>&& f);

    // Turn this future into a shared_future. Invalidates
    // the future!
    shared_future<R> share();

    // Query the state
    // Waiting on the result
};

Diving into the Future - The (basic) API

Querying the state of the future

template <typename R>
class future
{
    // Future constructors

    // Query the state

    // Check if the future is ready yet.
    bool is_ready();

    // Check if the future has a value
    bool has_value();

    // Check if the future has an exception
    bool has_exception();

    // Waiting on the result
};

Diving into the Future - The (basic) API

Waiting for the future to become ready

template <typename R>
class future
{
    // Future constructors
    // Query the state

    // Waiting on the result
    void wait();

    // Get the result. This function might block if the result has
    // not been computed yet. Invalidates the future!
    R get();

    // Attach a continuation. The function f gets called with
    // the (ready) future. Returns a new future with the result of
    // the invocation of f. Invalidates the future!
    template <typename F>
    auto then(F&& f);
};

Diving into the Future - The (basic) API

Constructing a hpx::shared_future<R>

template <typename R>
class shared_future
{
    // Future constructors
    // Construct an empty future.
    shared_future();

    // Move a future to a new one
    shared_future(shared_future&& f);

    // Share ownership between two futures
    shared_future(shared_future const & f);

    // Unwrap a future. The new shared_future becomes ready when
    // the inner and outer future are ready.
    explicit shared_future(future<shared_future<R>>&& f);

    // Query the state
    // Waiting on the result
};

Diving into the Future - The (basic) API

Waiting for the future to become ready

template <typename R>
class shared_future
{
    // Future constructors
    // Query the state

    // Waiting on the result
    void wait();

    // Get the result. This function might block if the result has
    // not been computed yet.
    R const& get();

    // Attach a continuation. The function f gets called with
    // the (ready) future. Returns a new future with the result of
    // the invocation of f.
    template <typename F>
    auto then(F&& f) const;
};

Producing Futures

hpx::async

template <typename F, typename... Ts>
auto async(F&& f, Ts&&... ts)
 -> future<decltype(f(std::forward<Ts>(ts)...))>;

  • F is anything callable with the passed arguments (actions are callable)

template <typename F, typename... Ts>
auto async(launch_policy, F&& f, Ts&&... ts)
 -> future<decltype(f(std::forward<Ts>(ts)...))>;

  • launch_policy can be async, sync, fork, deferred

template <typename Executor, typename F, typename... Ts>
auto async(Executor&& exec, F&& f, Ts&&... ts)
 -> future<decltype(f(std::forward<Ts>(ts)...))>;

  • Executor is a concept to be introduced later
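
A usage sketch of the first two overloads (hpx::launch::deferred delays the callable until the future is waited on):

// Default policy: the runtime may start the task immediately.
hpx::future<int> f1 = hpx::async([]() { return 40; });

// deferred: the callable runs lazily, on the thread that waits on it.
hpx::future<int> f2 = hpx::async(hpx::launch::deferred, []() { return 2; });

std::cout << f1.get() + f2.get() << std::endl;   // prints 42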

Producing Futures

hpx::lcos::local::promise

hpx::lcos::local::promise<int> p;
hpx::future<int> f = p.get_future();
// f.is_ready() == false, f.get(); would lead to a deadlock
p.set_value(42);

// Print 42
std::cout << f.get() << std::endl;

Producing Futures

hpx::lcos::promise

hpx::lcos::promise<int> p;
hpx::future<int> f = p.get_future();
// f.is_ready() == false, f.get(); would lead to a deadlock
hpx::async(
    [](hpx::id_type promise_id)
    {
        hpx::set_lco_value(promise_id, 42);
    }
  , p.get_id());

// Print 42
std::cout << f.get() << std::endl;


Producing Futures

hpx::make_ready_future

  • Producing futures that are ready at construction:

template <typename T>
future<typename std::decay<T>::type> make_ready_future(T&& t);

future<void> make_ready_future();
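
Usage sketch: handy when an API must return a future but the value is already known:

hpx::future<int> compute(int n)
{
    if (n == 0)
        // Value already known: no task needed, the future is ready at once.
        return hpx::make_ready_future(0);

    return hpx::async([n]() { return n * n; });
}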

Producing Futures

And beyond ...

  • Futures are the main interface to express asynchrony
  • Most API functions in HPX return futures
  • This was just an excerpt ... let's see more!

Composing Futures

Sequential Composition

future<int> f1 = hpx::async(...);

// Call continuation once f1 is ready. f2 will become ready once
// the continuation has been run.
future<double> f2 = f1.then(
    [](future<int>&& f) { return f.get() + 0.0; });

--

  • The continuation needs to take the future as parameter to allow for exception handling. Exceptions happening in asynchronous calls will get rethrown on .get()
  • then accepts launch policies as well as executors
  • f1 will get invalidated.

--

No invalidation with shared_future:

shared_future<int> f1 = hpx::async(...);

// Call continuation once f1 is ready. f2 will become ready once
// the continuation has been run.
future<double> f2 = f1.then(
    [](shared_future<int> f) { return f.get() + 0.0; });

Composing Futures

And Composition

future<int> f1 = hpx::async(...);
future<std::string> f2 = hpx::async(...);

auto all_f = hpx::when_all(f1, f2);

future<std::vector<float>> result = all_f.then(
    [](auto f) -> std::vector<float>
    {
        // ...
    });

--

  • Allows for attaching continuations to more than one future
  • f1 and f2 will get invalidated. (Use shared_future if you need them afterwards)
  • Also accepts a std::vector<future<T>> or a pair of iterators
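
A sketch of the container form, assuming the overload takes the vector by reference and moves the futures out:

std::vector<future<int>> fs;
for (int i = 0; i != 4; ++i)
    fs.push_back(hpx::async([i]() { return i; }));

// The input futures are moved into the resulting sequence.
future<std::vector<future<int>>> all = hpx::when_all(fs);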

Composing Futures

Or Composition

std::vector<future<int>> fs = ...;

future<std::vector<future<int>>> ffs = hpx::when_any(fs).then(
    [](auto f)
    {
        auto res = f.get();
        return std::move(res.futures);
    });

--

  • Allows for waiting on any of the input futures
  • Returns a future<when_any_result<Sequence>>:
template <typename Sequence>
struct when_any_result
{
   std::size_t index; // Index to a future that became ready
   Sequence futures;  // Sequence of futures
};
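
Usage sketch: the index member selects the future that actually became ready:

hpx::when_any(fs).then(
    [](auto f)
    {
        auto res = f.get();   // a when_any_result<std::vector<future<int>>>
        std::cout << res.futures[res.index].get() << std::endl;
    });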

Composing Futures

Dataflow

  • Shortcut to when_all(...).then(...)

future<int> f1 = hpx::async(...);
future<std::string> f2 = hpx::async(...);

future<double> f3 = hpx::dataflow(
    [](future<int>, future<std::string>) -> double
    {
        // ...
    },
    std::move(f1), std::move(f2));

--

  • Calls the passed function whenever all arguments that were futures are ready
  • Returns a future that becomes ready once the function has finished execution
  • Accepts launch policies as well as executors as the first parameter
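
A sketch of that mixed-argument behavior (an illustration, not verbatim HPX docs): future arguments are awaited, plain values pass through unchanged:

future<int> partial = hpx::async([]() { return 40; });

// The future argument is awaited; the plain int is passed through.
future<int> total = hpx::dataflow(
    [](future<int> a, int b) { return a.get() + b; },
    std::move(partial), 2);

std::cout << total.get() << std::endl;   // prints 42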

Concepts of Parallelism

Types of Parallelism

  • Data parallelism
    • Parallel Algorithms
    • GPUs
  • Task based asynchronous and continuation style parallelism
    • future<R>, async, etc...
  • Instruction Level Parallelism
    • SIMD instructions

Concepts of Parallelism

Parallel Execution Properties

  • The execution restrictions applicable for work items
  • In what sequence work items have to be executed
  • Where work items should be executed
  • The parameters of the execution environment

Concepts of Parallelism

Bringing it all together

(figure: how execution policies, executors, and execution parameters combine, presented as a progressive build across several slides)

Execution Policies

From the C++ Standard

  • Specify execution guarantees (in terms of thread-safety) for executed parallel tasks:
    • sequential_execution_policy: seq
    • parallel_execution_policy: par
    • parallel_vector_execution_policy: par_vec

--

HPX Extensions

  • Asynchronous Execution Policies:
    • sequential_task_execution_policy: seq(task)
    • parallel_task_execution_policy: par(task)
  • In both cases the formerly synchronous functions return a future<R>
  • Instruct the parallel construct to be executed asynchronously
  • Allows integration with asynchronous control flow
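
A usage sketch of the task policies, assuming this HPX generation's hpx::parallel namespace spelling:

using namespace hpx::parallel;

std::vector<int> v(1000, 1);

// par(task) makes the algorithm return a future instead of blocking.
auto f = for_each(par(task), v.begin(), v.end(),
    [](int& i) { ++i; });

f.wait();   // join with the parallel loop later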

name: executors

Executors

Concept

  • Executors are objects responsible for
    • Creating execution agents on which work is performed (N4466)
    • In N4466 this is limited to parallel algorithms; here it has much broader use
  • Abstraction of the (potentially platform-specific) mechanisms for launching work
  • Responsible for defining the Where and How of the execution of tasks

Executors

Implementation

  • Executors must implement one function: async_execute(F&& f, Args&&... args)
  • Invocation of executors happens through executor_traits, which exposes (emulates) additional functionality:

executor_traits<my_executor_type>::execute(
    my_executor,
    [](std::size_t i) { /* perform task i */ },
    n);

  • Four modes of invocation: single async, single sync, bulk async and bulk sync
  • The async calls return a future
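
A minimal sketch of a conforming executor built on hpx::async (assumption: only the async_execute hook named above is shown; HPX may additionally require marking the type as an executor via a trait or tag, omitted here):

struct my_executor
{
    // The single required hook: launch f(ts...) and return a future.
    template <typename F, typename... Ts>
    hpx::future<typename std::result_of<F&&(Ts&&...)>::type>
    async_execute(F&& f, Ts&&... ts)
    {
        return hpx::async(std::forward<F>(f), std::forward<Ts>(ts)...);
    }
};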

Executors

Examples

  • sequential_executor, parallel_executor:
    • Default executors corresponding to seq, par
  • this_thread_executor
  • thread_pool_executor
    • Specify core(s) to run on (NUMA aware)
  • distribution_policy_executor
    • Use one of HPX’s (distributed) distribution policies, specify node(s) to run on
  • hpx::compute::host::block_executor
    • Use a set of CPUs
  • hpx::compute::cuda::default_executor
    • Use for running things on GPU
  • Etc.

Execution Parameters

  • Allow controlling the grain size of work
    • i.e. the number of iterations of a parallel for_each run on the same thread
    • Similar to OpenMP scheduling policies: static, guided, dynamic
    • Much finer-grained control

Rebind Execution Policies

Execution policies have an associated default executor and default executor parameters

  • par: parallel executor, static chunk size
  • seq: sequential executor, no chunking
    • Rebind executor and executor parameters

numa_executor exec;
static_chunk_size param;

// rebind only the executor
auto policy1 = par.on(exec);

// rebind only the executor parameters
auto policy2 = par.with(param);

// rebind both
auto policy3 = par.on(exec).with(param);
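
Usage sketch: a rebound policy is passed to an algorithm like any other policy (v is assumed to be a std::vector<double>):

// Runs on the NUMA-aware executor from above, in static chunks.
hpx::parallel::for_each(policy3, v.begin(), v.end(),
    [](double& x) { x *= 2.0; });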

Data Placement

Basics

  • Mechanism to tell where to allocate data
  • C++ defines an Allocator concept: std::allocator<T>
  • Extensions:
    • Where do you want to allocate data?
    • Ability to bulk-allocate data (NUMA-aware allocation, GPU device allocation)
  • Data structures to use those allocators
  • Different strategies for different platforms
    • Need an interface to control explicit placement of data:
      • NUMA architectures
      • GPUs
      • Distributed systems

Data Placement

Data Structures

  • hpx::compute::vector<T, Allocator>
    • Same interface as std::vector<T>
    • Manages data locality through allocator
    • Uses execution target objects for data placement
    • Allows for direct manipulation of data on NUMA domains, GPUs, remote nodes, etc.
  • hpx::partitioned_vector<T>
    • Same interface as std::vector<T> (almost)
    • Segmented data store
    • Segments can be hpx::compute::vector<T>
    • Uses distribution_policy for data placement
    • Allows for manipulation of data on several targets
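
A construction sketch for hpx::partitioned_vector, assuming this HPX generation's container_layout distribution policy:

// Spread one million doubles across all localities the program runs on.
hpx::partitioned_vector<double> v(
    1000000,
    hpx::container_layout(hpx::find_all_localities()));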

Execution Targets

  • Opaque types which represent a place in the system
    • Used to identify data placement
    • Used to specify execution site close to data
  • Targets encapsulate architecture specifics
    • CPU sets (NUMA domains), Scratch Pad Memory, GPU devices, Remote nodes
  • Allocators to be initialized from targets
    • Customization of data placement
  • Executors to be initialized from targets as well
    • Make sure code is executed close to placed data
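
A sketch tying targets, allocators, and executors together, reusing the names from the SAXPY example below (numa_nodes(), block_allocator, block_executor):

using namespace hpx::compute::host;

// Query the places (NUMA domains) available on this node ...
auto targets = numa_nodes();

// ... then anchor both data placement and execution to them.
block_allocator<double> alloc(targets);
block_executor<> exec(targets);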

Parallel Algorithms

Parallel Algorithms for C++


Example: SAXPY - The HPX Way

Goal: SAXPY routine with data locality

  • a[i] = b[i] * x + c[i] for i from 0 to N-1
  • Using parallel algorithms
  • Explicit control over data locality
  • No raw loops

Example: SAXPY - The HPX Way

Step 1: Writing the serial version

.left-column[

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

std::transform(b.begin(),
  b.end(), c.begin(), a.begin(),
  [x](double bb, double cc)
  {
*   return bb * x + cc;
  }
);

]

.right-column[

  • bb is b[i]
  • cc is c[i]
  • the calculated value gets written to a[i]
  • Complete code ]

Example: SAXPY - The HPX Way

Step 2: Parallelize it

.left-column[

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

*using hpx::parallel::par;
*hpx::parallel::transform(par,
  b.begin(), b.end(), c.begin(),
  a.begin(),
  [x](double bb, double cc)
  {
    return bb * x + cc;
  }
);

]

.right-column[

  • Replace the standard algorithm with a parallel one
  • Set the parallel execution policy
  • Complete code ]

Example: SAXPY - The HPX Way

Step 3: Adding data locality

.left-column[

using namespace hpx::compute::host;

typedef block_executor<>
    executor;
typedef block_allocator<double>
    allocator;

auto numa_domains = numa_nodes();
executor exec(numa_domains);
allocator alloc(numa_domains);

using hpx::compute::vector;

vector<double, allocator> a = ...;
vector<double, allocator> b = ...;
vector<double, allocator> c = ...;
double x = ...;

]

.right-column[

*using hpx::parallel::par;
*auto policy = par.on(exec);
hpx::parallel::transform(
  policy, b.begin(),
  b.end(), c.begin(), a.begin(),
  [x](double bb, double cc)
  {
    return bb * x + cc;
  }
);
  • Get targets for locality of data and execution
  • Setup Executor and Allocator
  • Run the algorithm on the executor
  • Complete code ]

Example: SAXPY - The HPX Way

Optional Step: Running it on the GPU

.left-column[

using namespace hpx::compute::cuda;

typedef default_executor<>
    executor;
typedef allocator<double>
    allocator;

target device;
executor exec(device);
allocator alloc(device);

using hpx::compute::vector;

vector<double, allocator> a = ...;
vector<double, allocator> b = ...;
vector<double, allocator> c = ...;
double x = ...;

]

.right-column[

*using hpx::parallel::par;
*auto policy = par.on(exec);
hpx::parallel::transform(
  policy, b.begin(),
  b.end(), c.begin(), a.begin(),
  [x] HPX_DEVICE
  (double bb, double cc)
  {
    return bb * x + cc;
  }
);
  • Get targets for locality of data and execution
  • Setup Executor and Allocator
  • Run the algorithm on the executor
  • Complete code
  • Works only for CUDA version 8 :( ]

class: center, middle

Next

Building HPX and Applications