The available computing power of distributed high-performance systems, often comprising several thousand cores, can hardly be mastered with conventional programming models. At the same time, the growing demand for processing huge amounts of data requires better control of the executed workflows, efficient storage management and fault-tolerant workflow execution.
Handling these aspects correctly may exceed the capabilities of a typical user, as it requires expert knowledge. A natural way to address this situation and to ease the task of the domain-specific programmer is to separate the coordination level of a parallel application from the user programming level. This requires a more sophisticated programming model and an execution framework that supports automatic parallelization and scheduling, provides a fault-tolerance mechanism and offers services for job execution control, monitoring, memory management, etc. Putting this concept into practice, we designed and implemented a programming model and an execution framework that clearly realize this separation.
GPI-Space is a complex but flexible software platform, combining several important software components that can also be used either as standalone applications or in combination with other software tools or packages. Each of these components was designed and implemented based on the experience accumulated in other large high-performance computing projects. GPI-Space consists of a distributed runtime system, a workflow engine and a storage layer.
The Distributed Run-Time System
The Distributed Run-Time System (DRTS) is the execution layer of GPI-Space. It is fault-tolerant and can dynamically build arbitrary topologies. It relies on a master-slave architecture in which an agent may have multiple masters, event subscribers and workers.
On top of such an architecture sits the orchestrator, which is responsible for handling user requests and scheduling them on the available agents. The agents may have access to the partitioned global address space and can thus trigger large data transfers from or to the virtual memory.
An agent may be equipped with a workflow engine that interprets sub-workflows, creates intermediate tasks and assembles the results. A graphical representation of a DRTS, deployed as a tree topology, is shown in the figure to the left.
In fact, the agents can form logical communication topologies with much more complex graph structures, thus allowing a straightforward implementation of parallel algorithms that assume a certain structure of the logical communication graph. The agents implement a Staged Event-Driven Architecture (SEDA) and are controlled by finite state machines. The stages are basically thread pools with a shared queue of events; they can exchange messages asynchronously, which makes the agent more flexible and more responsive.
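As an illustration of this staged design, the following sketch (a much simplified model of our own; the names `Stage` and `post` are not part of GPI-Space) represents a stage as a small thread pool draining a shared event queue:

```python
import queue
import threading

class Stage:
    """A SEDA-style stage: a thread pool draining a shared event queue."""
    def __init__(self, name, handler, n_threads=2):
        self.name = name
        self.handler = handler
        self.events = queue.Queue()          # shared queue of events
        for _ in range(n_threads):
            t = threading.Thread(target=self._run, daemon=True)
            t.start()

    def post(self, event):
        """Asynchronously hand an event to this stage."""
        self.events.put(event)

    def _run(self):
        while True:
            event = self.events.get()
            self.handler(event)              # process one event
            self.events.task_done()
```

Stages can be chained by letting one stage's handler `post` to the next stage; since `post` never blocks on processing, the message exchange between stages stays asynchronous.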
The whole distributed run-time system is fault-tolerant: an agent can join or leave the system at any time without stopping the execution of the submitted workflow. The agent was designed with the bridge pattern in mind, decoupling the abstraction from its implementation as much as possible.
An agent is composed of several software components such as
Jobs may be submitted together with a list of requirements, and workers may have different capabilities. Each agent has a scheduler that tries to fairly schedule the jobs on the workers whose capabilities best match the requirements.
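A minimal sketch of such capability matching (our own simplification, not the actual GPI-Space scheduler) could select the workers that satisfy all requirements and rank them so that the tightest fit comes first:

```python
def matching_workers(job_requirements, workers):
    """Return the workers whose capabilities cover all job requirements,
    preferring those with the fewest unused capabilities (best fit).
    `workers` maps a worker name to its set of capabilities."""
    required = set(job_requirements)
    candidates = [w for w, caps in workers.items() if required <= set(caps)]
    # Best match first: fewest capabilities left unused by this job.
    return sorted(candidates, key=lambda w: len(set(workers[w]) - required))
```

A fair scheduler would additionally balance the load among equally good candidates; that aspect is omitted here.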
The Workflow Engine
The coordination level consists of a workflow engine, a workflow description language and a set of tools intended to help the user build workflows.
The workflows are basically high-level Petri nets, described in a proprietary XML-based language that we defined. We also developed a workflow engine capable of concurrently interpreting and executing these workflows. The user may either write the workflow directly in the XML-based language or use an editor (currently under development, with basic functionality already implemented).
Apart from the workflow engine, a number of other tools were developed to assist the user in developing a workflow:
A Petri net compiler, which translates the XML description of the workflow into an intermediate format that the workflow engine can understand.
A verification tool that checks basic properties of the net.
A basic visualization tool, based on the Graphviz software package.
A graphical editor.
The compiler generates the internal representation used by the runtime environment from the XML representation of the Petri net. It checks the semantic validity of the input net, is also able to check other properties such as termination, absence of deadlocks and reachability, and can optionally optimize the net. Typically, a user who wants to write an application for GPI-Space should focus on describing the workflow and on how to logically organize the storage layer.
In GPI-Space, the Petri nets have a double role: they are used not only for controlling the execution of the generated tasks, but also for controlling the access of the tasks to the partitioned global address space, acting as a transactional mechanism and thus guaranteeing data integrity.
From a global point of view, the execution of a workflow in GPI-Space consists of the following steps:
The orchestrator receives a job with the internal representation of a workflow attached and assigns it to one of the available agents.
The agent hands the attached job description received from its master over to the workflow engine, which extracts all executable activities; some of them are executed locally, the others are sent to available workers.
When a job completes, the assigned worker or agent sends the result to the corresponding master. The master hands the result over to its workflow engine, which inserts the corresponding output tokens and looks for newly enabled transitions. If new activities are generated, they are sent to the available workers. The process continues until no transition can fire anymore.
When no further activities can be generated, the net is considered completely processed and the result of the job is sent back to the submitter.
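The steps above amount to a work-generating loop: execute activities, feed the results back, and continue until no new work appears. The sketch below is a deliberate simplification (synchronous, single process, with a hypothetical `run_task` callable), whereas the real system dispatches tasks asynchronously to remote workers:

```python
from collections import deque

def execute_workflow(initial_tasks, run_task):
    """Drive a workflow to completion. `run_task` executes one task and
    returns (outcome, follow_up_tasks); the loop continues until no task
    generates new work, mirroring 'until no transition can fire'."""
    pending = deque(initial_tasks)
    results = []
    while pending:
        task = pending.popleft()
        outcome, follow_ups = run_task(task)
        results.append(outcome)
        pending.extend(follow_ups)   # newly enabled activities
    return results
```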
In his dissertation, Carl Adam Petri examined the possibility of adding resources to running computations on an as-needed basis, without interrupting the process and without informing any of the existing resources about the new ones. He found an elegant solution in which resources are connected only locally to other resources. Consequently, all resources must be able to operate autonomously, i.e., asynchronously. Petri thus needed a way to describe asynchronous, distributed systems, and soon thereafter the development of the nets that bear his name today began.
What makes Petri nets interesting for us, besides their simple graphical and hierarchical structure, is their local nature (i.e., there is no global state), their concurrency (i.e., there is no total ordering of events, just data dependencies) and their reversibility (i.e., the causes can be determined from a result). Incidentally, these are all properties that Petri intentionally borrowed from physics for use in computer science. In addition, Petri nets are often used as a modeling tool in the engineering disciplines, and the theory behind them is very well researched and understood.
The advantages of Petri nets as a mathematical modeling language have been formulated very well by van der Aalst: Petri nets have precise execution semantics that assign a specific meaning to the net, serve as the basis for an agreement, are independent of the tools used, and enable process analysis and solutions. Furthermore, because Petri nets are based not on events but on state transitions, it is possible to differentiate between the activation and the execution of an elementary functional unit. In particular, interruption and restart of applications are easy, which is a fundamental precondition for tolerance to hardware failures. Finally, mature analysis techniques are available that, besides proving correctness, also allow performance predictions.
What do Petri nets look like?
There are boxes for transitions and circles for places. Transitions stand for the elementary functional units, and places are placeholders for the small, black, so-called tokens that represent the data. Places and transitions are connected by directed arcs. A transition can fire when all of its input places contain tokens. When it fires, a token is removed from each input place and a token is produced at each output place. Fig. 1 on the left shows the situations before and after firing Transition “F”.
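This firing rule can be captured in a few lines. The sketch below is a toy model of a plain place/transition net (it ignores the typed tokens, conditions and hierarchy of the high-level nets GPI-Space actually uses):

```python
class PetriNet:
    """Toy Petri net: places hold token counts; a transition consumes one
    token from each input place and produces one on each output place."""
    def __init__(self, marking, transitions):
        self.marking = dict(marking)      # place -> number of tokens
        self.transitions = transitions    # name -> (input places, output places)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        assert self.enabled(name), f"{name} is not enabled"
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1          # consume from every input place
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1  # produce on every output
```

Wiring up transitions “S”, “A”, “B” and “J” in this model reproduces the fork-join behavior of Fig. 4: firing “S” enables “A” and “B”, while “J” only becomes enabled once both have fired.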
Transition “G” in Fig. 2 has two output places, i.e., it produces two tokens when fired.
A kind of automatic parallelism is present, for example, when more than one token is at the input place for Transition “H”. In Fig. 3, three incarnations of “H” can be executed simultaneously.
A different kind of concurrency exists between Transitions “A” and “B” in Fig. 4: when Transition “S” fires, a token is produced at the input places of both “A” and “B”, so each of these transitions can fire in parallel. Transition “J” synchronizes Transitions “A” and “B”: it can only fire after both “A” and “B” have fired.
The Storage Layer
The storage layer within GPI-Space is represented by GPI, a middleware developed at our institute that allows for the execution of parallel applications following the Partitioned Global Address Space (PGAS) programming model. Typically, the memory of the individual nodes in a cluster is aggregated and seen as a single address space in which large objects, often exceeding the capacity of a single cluster node, can be stored. GPI is targeted at RDMA-enabled interconnects such as InfiniBand or Cray Gemini.
The main idea here is that the full performance of the RDMA-enabled networks can be delivered to the application directly, without interrupting the CPUs. GPI constitutes an alternative to the traditional message-passing model for the development of parallel applications that are intended to run on modern multicore systems. GPI focuses on one-sided communication and the development of asynchronous algorithms, leveraging the capabilities of modern interconnects to overlap communication with computation.
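The resulting programming pattern is double buffering: while chunk i is being processed, the transfer of chunk i+1 is already in flight. The sketch below illustrates this overlap in plain Python, with a background thread standing in for the asynchronous one-sided transfer; the callables `transfer` and `compute` are hypothetical placeholders, not the GPI API:

```python
import threading

def process(chunks, transfer, compute):
    """Double buffering: overlap the transfer of chunk i+1 with the
    computation on chunk i. `transfer` plays the role of a one-sided
    read, `compute` the role of the actual kernel."""
    buffers = {0: transfer(chunks[0])}   # fetch the first chunk up front
    results = []
    for i in range(len(chunks)):
        t = None
        if i + 1 < len(chunks):
            def fetch(j):
                buffers[j] = transfer(chunks[j])
            t = threading.Thread(target=fetch, args=(i + 1,))
            t.start()                    # next transfer runs in the background
        results.append(compute(buffers.pop(i)))  # compute on the current chunk
        if t:
            t.join()                     # analogous to waiting for the transfer
    return results
```

With true RDMA the “background thread” is the network adapter itself, so the CPU spends essentially all of its time in `compute`.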
At the node level, the Manycore Threading Package (MCTP) is used, which was developed in our department to better exploit modern multicore architectures. Within GPI-Space, different applications can be combined into a workflow and executed using the virtual memory as a persistent layer that facilitates their interaction with each other.
More information on GPI can be found here.