An Efficient New Static Scheduling Heuristic for Accelerated Architectures


Computational Science – ICCS 2020. 2020 May 26; 12137: 3–16.

Published online 2020 May 26. doi:10.1007/978-3-030-50371-0_1

PMCID: PMC7302258

Guest Editors: Valeria V. Krzhizhanovskaya, Gábor Závodszky, Michael H. Lees, and Peter M. A. Sloot (University of Amsterdam, Amsterdam, The Netherlands); Jack J. Dongarra (University of Tennessee, Knoxville, TN, USA); Sérgio Brissos and João Teixeira (Intellegibilis, Setúbal, Portugal)

Authors: Thomas McSweeney, Neil Walton, and Mawussi Zounon

Abstract

Heterogeneous architectures that use Graphics Processing Units (GPUs) for general computations, in addition to multicore CPUs, are increasingly common in high-performance computing. However, many of the existing methods for scheduling precedence-constrained tasks on such platforms, such as the classic Heterogeneous Earliest Finish Time (HEFT) heuristic, were intended for more diversely heterogeneous clusters. We propose a new static scheduling heuristic called Heterogeneous Optimistic Finish Time (HOFT) which exploits the binary heterogeneity of accelerated platforms. Through extensive experimentation with custom software for simulating task scheduling problems on user-defined CPU-GPU platforms, we show that HOFT can obtain schedules at least 5% shorter than HEFT's for medium-to-large numerical linear algebra application task graphs and around 3% shorter on average for a large collection of randomly-generated graphs.

Keywords: High-Performance Computing, GPU computing, Scheduling, Precedence constraints, Directed Acyclic Graphs

Introduction

Modern High-Performance Computing (HPC) machines typically comprise hundreds or even thousands of networked nodes. These nodes are increasingly likely to be heterogeneous, hosting one or more powerful accelerators—usually GPUs—in addition to multicore CPUs. For example, Summit, which currently heads the Top500 list of the world's fastest supercomputers (see Footnote 1), comprises over 4000 nodes, each with two 22-core IBM Power9 CPUs and six NVIDIA Tesla V100 GPUs.

Task-based parallel programming is a paradigm that aims to harness this processor heterogeneity. Here a program is described as a collection of tasks—logically discrete atomic units of work—with precedence constraints that define the order in which they can be executed. This can be expressed in the form of a graph, where each vertex represents a task and each edge a precedence constraint between two tasks. We are interested only in the case when such task graphs are Directed Acyclic Graphs (DAGs): directed and without any cycles.
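To make this concrete, the following minimal sketch (in Python, the language of our simulator; the four-task graph and helper function are purely illustrative, not code from the paper) represents a task DAG as an adjacency mapping and verifies acyclicity via a topological sort:

    from collections import deque

    # Hypothetical 4-task graph: t0 precedes t1 and t2, which both precede t3.
    dag = {"t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t3"], "t3": []}

    def topological_order(dag):
        """Return a topological order, or raise if the graph has a cycle."""
        in_deg = {t: 0 for t in dag}
        for children in dag.values():
            for c in children:
                in_deg[c] += 1
        ready = deque(t for t, d in in_deg.items() if d == 0)
        order = []
        while ready:
            t = ready.popleft()
            order.append(t)
            for c in dag[t]:
                in_deg[c] -= 1
                if in_deg[c] == 0:
                    ready.append(c)
        if len(order) != len(dag):
            raise ValueError("graph contains a cycle, so it is not a DAG")
        return order

    print(topological_order(dag))  # ['t0', 't1', 't2', 't3']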

The immediate question is: how do we find the optimal way to assign the tasks to a set of heterogeneous processing resources while still respecting the precedence constraints? In other words, what schedule should we follow? This DAG scheduling problem is known to be NP-complete, even for homogeneous processors [17], so typically we must rely on heuristic algorithms that give us reasonably good solutions in a reasonable time.

A fundamental distinction is made between static and dynamic scheduling. Static schedules are fixed before execution, based on the information available at that time, whereas dynamic schedules are determined during runtime. There are generic advantages and disadvantages to both: static scheduling makes fuller use of the available cost data and is therefore superior when that data is sufficiently accurate, whereas dynamic scheduling can react to more recent information. In practice, task scheduling is usually handled by a runtime system, such as OmpSs [11], PaRSEC [6], or StarPU [4]. Most such systems use previous execution traces to predict task execution and data transfer times at runtime. On a single machine, predicting data transfer times is tricky because of shared buses and the possibility of asynchronous data transfers, so at present dynamic scheduling is typically preferred. However, static schedules can be surprisingly robust, even when estimates are poor [1]. Furthermore, their robustness can be improved using timing distribution information [18]. In addition, superior performance can be achieved in dynamic environments by modifying an existing static schedule, rather than computing a new one from scratch [1].

In this paper we therefore focus on the problem of finding good static schedules for multicore and GPU platforms. To facilitate this investigation, we developed an open-source software simulator which allows users to simulate the static scheduling of arbitrary task DAGs on arbitrary CPU-GPU platforms, without worrying about the time or energy usage constraints imposed by real systems.

The popular HEFT scheduling heuristic comprises two phases: a task prioritization phase in which the order tasks are to be scheduled is determined and a processor selection phase in which they are actually assigned to the processing resources. In this article we introduce HOFT, which follows the HEFT framework but modifies both phases in order to exploit accelerated architectures in particular, without significantly increasing the complexity of the algorithm. HOFT works by first computing a table of optimistic estimates of the earliest possible times all tasks can be completed on both processor types and using this to guide both phases. Simulations with real and randomly-generated DAGs on both single and multiple GPU target platforms suggest that HOFT is always at least competitive with HEFT and frequently superior.

Explicitly, the two main contributions of this paper are:

  1. A new static scheduling heuristic that is optimized specifically for accelerated heterogeneous architectures;

  2. Open-source simulation software that allows researchers to implement and evaluate their own scheduling algorithms for user-defined CPU-GPU platforms in a fast and reproducible manner.

The remainder of this paper is structured as follows. In Sect. 2 we summarize the relevant existing literature. Then in Sect. 3 we explicitly define the simulation model we use to study the static task scheduling problem. We describe HEFT in detail in Sect. 4, including benchmarking results with our simulation model and a minor modification to the algorithm that we found performs well. In Sect. 5 we describe our new HOFT heuristic, before detailing the numerical experiments that we undertook to evaluate its performance in Sect. 6. Finally, in Sect. 7 we state our conclusions from this investigation and outline future work that we believe may be useful.

Related Work

Broadly, static scheduling methods can be divided into three categories: mathematical programming, guided-random search, and heuristics. The first is based on formulating the scheduling problem as a mathematical program; see, for example, Kumar's constraint programming formulation in [14]. However, solving such programs is usually so expensive that they are restricted to small task graphs. Guided-random search is a term used for any method that generates a large population of potential schedules and then selects the best among them. Typically these are more general optimization schemes, such as genetic algorithms, refined for the task scheduling problem. As a rule, such methods tend to find very high-quality schedules but take a long time to do so [7].

Heuristics are the most popular approach in practice as they are often competitive with the alternatives and considerably faster. Listing heuristics are in turn the most popular kind. They follow a two-phase structure: an ordered list of all tasks is first constructed (task prioritization) and the tasks are then scheduled in this order according to some rule (processor selection). HEFT is the most prominent example: all tasks are prioritized according to their upward rank and then scheduled on the processor expected to complete their execution at the earliest time; a fuller description is given in Sect. 4. Canon et al. [8] compared twenty different task scheduling heuristics and found that HEFT was almost always among the best in terms of both schedule makespan and robustness.

Many modifications of HEFT have been proposed in the literature, such as HEFT with lookahead from Bittencourt, Sakellariou and Madeira [5], which has the same task prioritization phase but schedules all tasks on the resources estimated to minimize the completion time of their children. This has the effect of increasing the time complexity of the algorithm so, in an attempt to incorporate a degree of lookahead into the HEFT framework without increasing the cost, Arabnejad and Barbosa proposed Predict Earliest Finish Time (PEFT) [3]. The main innovation is that rather than just minimizing the completion time of a task during processor selection, we also try to minimize an optimistic estimate of the time it will take to execute the remaining unscheduled tasks in the DAG.

Like the majority of existing methods for static DAG scheduling in heterogeneous computing, HEFT was originally intended for clusters with diverse nodes. At least one extension specifically targeting accelerated architectures has been proposed before, namely HEFT-NC (No Cross) from Shetti, Fahmy and Bretschneider [15], but our new HOFT heuristic differs from this in both the task prioritization and processor selection phases of the algorithm.

Simulation Model

In this paper we use a simulation model to study the static task scheduling problem for multicore and GPU. The simulator follows the mathematical model described in Sect. 3.1 and therefore facilitates the evaluation of scheduling algorithms for idealized CPU-GPU platforms. The advantage of this approach is that it allows us to compare multiple algorithms and determine how intrinsically well-suited they are to accelerated architectures. Although this model may not capture the full range of real-world behavior, we gathered data from a single heterogeneous node of a local computing cluster to guide its development and retain the most salient features. This node comprises four octacore Intel (Skylake) Xeon Gold 6130 CPUs running at 2.10 GHz with 192 GB RAM and four NVIDIA V100-SXM2-16GB (Volta) GPUs, each with 16 GB GPU global memory, 5120 CUDA cores and NVLink interconnect. We used Basic Linear Algebra Subprograms (BLAS) [10] and Linear Algebra PACKage (LAPACK) [2] kernels for benchmarking as they are widely used in scientific computing applications.

The simulator is implemented in Python and the complete source code is available on GitHub (see Footnote 2). All code used to generate the results presented in this paper is available in the folder simulator/scripts, so interested researchers may repeat our experiments for themselves. In addition, users may modify the simulator to reflect their own target environment more accurately.

Mathematical Model

The simulator software implements the following mathematical model of the problem. Suppose we have a task DAG G consisting of n tasks and e edges that we wish to execute on a target platform H comprising P processing resources of two types: $P_C$ CPU resources and $P_G$ GPU resources, with $P = P_C + P_G$. In keeping with much of the related literature and based on current programming practices, we consider CPU cores individually but regard entire GPUs as discrete [1]. For example, a node comprising 4 GPUs and 4 octacore CPUs would be viewed as 4 GPU resources and $4 \times 8 = 32$ CPU resources.

We assume that all tasks $t_i \in G$ are atomic and cannot be divided across multiple resources or aggregated to form larger tasks. Further, each resource can execute only a single task at any one time and can in principle execute all tasks, albeit with different processing times. Given the increasing versatility of modern GPUs and the parallelism of modern CPUs, the latter is a much less restrictive assumption than it once was.

In our experiments, we found that the spread of kernel processing times was usually tight, with the standard deviation often being two orders of magnitude smaller than the mean. Thus we assume that a task's execution times on all processing resources of a single type are identical. In particular, this means that each task has only two possible computation costs: a CPU execution time $w_C(t_i)$ and a GPU execution time $w_G(t_i)$. When necessary, we denote by $w(t_i, p)$ the processing time of task $t_i$ on the specific resource $p$.

The communication cost between task $t_i$ and its child $t_j$ is the length of time between when execution of $t_i$ is complete and execution of $t_j$ can begin, including all relevant latency and data transfer times. Since this depends on where each task is executed, we view it as a function $c_{ij}(p, q)$ of the resources $p$ and $q$ to which $t_i$ and $t_j$ are mapped. We assume that the communication cost is always zero when $p = q$ and that there are only four possible communication costs between tasks $t_i$ and $t_j$ when this is not the case: $c_{ij}^{CC}$, from a CPU to a different CPU; $c_{ij}^{CG}$, from CPU to GPU; $c_{ij}^{GC}$, from GPU to CPU; and $c_{ij}^{GG}$, from GPU to a different GPU.

A schedule is a mapping from tasks to processing resources, as well as the precise time at which their execution should begin. Our goal is to find a schedule which minimizes the makespan of the task graph, the total execution time of the application it represents. A task with no successors is called an exit task. Once all tasks have been scheduled, the makespan is easily computed as the earliest time all exit tasks will be completed. Note that although we assume that all costs represent time, they could be anything else we wish to minimize, such as energy consumption, so long as this is done consistently. We do not however consider the multi-objective optimization problem of trading off two or more different cost types here.
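This cost model maps onto very simple data structures. The following minimal sketch (illustrative names and numbers, not the simulator's actual classes) encodes per-task CPU/GPU execution times and the four per-edge communication costs, with zero cost on the same resource:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Resource:
        id: int
        kind: str  # "C" for a CPU core, "G" for a GPU

    # Each task has exactly two computation costs, w_C and w_G.
    comp = {"t0": {"C": 10.0, "G": 2.0}, "t1": {"C": 8.0, "G": 1.0}}
    # Each edge has the four possible communication costs.
    comm = {("t0", "t1"): {"CC": 1.0, "CG": 4.0, "GC": 4.0, "GG": 3.0}}

    def comm_cost(edge, src: Resource, dst: Resource) -> float:
        if src.id == dst.id:
            return 0.0  # same resource: no transfer needed
        return comm[edge][src.kind + dst.kind]

    cpu0, gpu0 = Resource(0, "C"), Resource(7, "G")
    print(comm_cost(("t0", "t1"), cpu0, gpu0))  # 4.0 (CPU -> GPU)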

Testing Environments

In the numerical experiments described later in this article, we consider two simulated target platforms: Single GPU, comprising 1 GPU and 1 octacore CPU, and Multiple GPU, comprising 4 GPUs and 4 octacore CPUs. The latter represents the node we used to guide the development of our simulator; the former is considered in order to study how the number of GPUs affects performance. We follow the convention that a CPU core is dedicated to managing each of the GPUs [4], so these two platforms are actually assumed to comprise 7 CPU and 1 GPU resources, and 28 CPU and 4 GPU resources, respectively. Based on our exploratory experiments, we make two further assumptions. First, since communication costs between CPU resources were negligible relative to all other combinations, we assume they are zero, i.e., $c_{ij}^{CC} \equiv 0$. Second, because CPU-GPU communication costs were very similar to the corresponding GPU-CPU and GPU-GPU costs, we take them to be identical, i.e., $c_{ij}^{CG} = c_{ij}^{GC} = c_{ij}^{GG}$. These assumptions will obviously not be representative of all possible architectures, but the simulator software allows users to repeat our experiments with more accurate representations of their own target platforms if they wish.

We consider the scheduling of two different sets of DAGs. The first consists of ten DAGs comprising between 35 and 22,100 tasks, which correspond to the Cholesky factorization of $N \times N$ tiled matrices, where $N = 5, 10, \ldots, 50$. In particular, the DAGs are based on a common implementation of Cholesky factorization for tiled matrices which uses the GEMM (matrix multiplication), SYRK (symmetric rank-k update) and TRSM (triangular solve) BLAS kernels, as well as the POTRF (Cholesky factorization) LAPACK routine [10]. All task CPU/GPU processing times are means of 1000 real timings of the corresponding kernel. Likewise, communication costs are sample means of real communication timings between the relevant task and resource types. All numerical experiments were performed for tile sizes 128 and 1024; the tile size used will always be specified where results are presented. These sizes were chosen because they roughly mark the lower and upper limits of tile sizes typically used for CPU-GPU platforms.

The standard way to quantify the relative amount of communication and computation represented by a task graph is the Computation-to-Communication Ratio (CCR), the mean computation cost of the DAG divided by the mean communication cost. For the Cholesky DAGs, the CCR was about 1 for tile size 128 and about 18 for tile size 1024, with minor variation depending on the total number of tasks in the DAG.

We constructed a set of randomly-generated DAGs with a wide range of CCRs, based on the topologies of the 180 DAGs with 1002 tasks from the Standard Task Graph (STG) set [16]. Following the approach in [9], we selected GPU execution times for all tasks uniformly at random from [1, 100] and computed the corresponding CPU times by multiplying by a random variable from a Gamma distribution. To consider a variety of potential applications, for each DAG we made two copies: low acceleration, for which the mean and standard deviation of the Gamma distribution were both defined to be 5, and high acceleration, for which both were taken to be 50 instead. These values roughly correspond to what we observed in our benchmarking of BLAS and LAPACK kernels with tile sizes 128 and 1024, respectively. Finally, for both parameter regimes, we made three copies of each DAG and randomly generated communication costs such that the CCR fell into each of the intervals [0, 10], [10, 20] and [20, 50]. Thus in total the random DAG set contains $180 \times 2 \times 3 = 1080$ DAGs.
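A sketch of this cost generation follows, under our reading of the parameterization: a Gamma distribution with equal mean and standard deviation has shape $(\text{mean}/\text{std})^2 = 1$ and scale $\text{std}^2/\text{mean} = \text{mean}$. Function names are illustrative:

    import random

    def task_costs(acc_mean, rng):
        """GPU time ~ U[1, 100]; CPU time = GPU time * Gamma multiplier.

        acc_mean is both the mean and standard deviation of the Gamma
        distribution (5 for 'low acceleration', 50 for 'high').
        """
        gpu = rng.uniform(1.0, 100.0)
        shape = 1.0            # (mean / std)^2 with mean == std
        scale = acc_mean       # std^2 / mean
        cpu = gpu * rng.gammavariate(shape, scale)
        return cpu, gpu

    rng = random.Random(42)
    print([task_costs(5.0, rng) for _ in range(2)])    # low acceleration
    print([task_costs(50.0, rng) for _ in range(2)])   # high acceleration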

HEFT

Recall that, as a listing scheduler, HEFT comprises two phases: an initial task prioritization phase, which determines the order in which tasks are to be scheduled, and a processor selection phase, which decides the processing resource each task is to be scheduled on. Here we describe both in order to give a complete description of the HEFT algorithm.

The critical path of a DAG is the longest path through it, and is important because its length gives a lower bound on the optimal schedule makespan. Heuristics for homogeneous platforms often use a task's upward rank, the length of the critical path from that task to an exit task, including the task itself [17], to determine priorities. Computing the critical path is not straightforward for heterogeneous platforms, so HEFT extends the idea by using mean values instead. Intuitively, the task prioritization phase of HEFT can be viewed as an approximate dynamic program applied to a simplified version of the task DAG in which all weights are set to mean values.

More formally, we first define the mean execution cost of a task $t_i$ through

$$\overline{w}_i = \frac{1}{P} \sum_{k=1}^{P} w(t_i, p_k) = \frac{P_C\, w_C(t_i) + P_G\, w_G(t_i)}{P}, \qquad (1)$$

where the second expression is how $\overline{w}_i$ would be computed under the assumptions of our model. Likewise, the mean communication cost $\overline{c}_{ij}$ between $t_i$ and $t_j$ is the average of all such costs over all possible combinations of resources, which under our model (with zero cost on the same resource) becomes

$$\overline{c}_{ij} = \frac{P_C (P_C - 1)\, c_{ij}^{CC} + P_C P_G \left( c_{ij}^{CG} + c_{ij}^{GC} \right) + P_G (P_G - 1)\, c_{ij}^{GG}}{P^2}. \qquad (2)$$

For all tasks $t_i$ in the DAG, we define their upward ranks $\mathrm{rank}_u(t_i)$ recursively, starting from the exit task(s), by

$$\mathrm{rank}_u(t_i) = \overline{w}_i + \max_{t_j \in \mathrm{succ}(t_i)} \left( \overline{c}_{ij} + \mathrm{rank}_u(t_j) \right), \qquad (3)$$

where $\mathrm{succ}(t_i)$ is the set of $t_i$'s immediate successors in the DAG and the $\max$ term is taken to be zero for exit tasks. The task prioritization phase then concludes by listing all tasks in decreasing order of upward rank, with ties broken arbitrarily.
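A compact sketch of this prioritization (illustrative data; the traversal helper is a standard post-order depth-first search, not code from the paper):

    def reverse_topological(dag):
        """Post-order DFS: children appear before their parents."""
        seen, order = set(), []
        def visit(t):
            if t not in seen:
                seen.add(t)
                for child in dag[t]:
                    visit(child)
                order.append(t)
        for t in dag:
            visit(t)
        return order

    def upward_ranks(dag, w_mean, c_mean):
        rank = {}
        for t in reverse_topological(dag):
            tail = max((c_mean[(t, s)] + rank[s] for s in dag[t]), default=0.0)
            rank[t] = w_mean[t] + tail
        return rank

    dag = {"t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t3"], "t3": []}
    w_mean = {"t0": 9.0, "t1": 4.5, "t2": 4.5, "t3": 9.0}   # from (1)
    c_mean = {e: 3.0 for e in [("t0", "t1"), ("t0", "t2"),
                               ("t1", "t3"), ("t2", "t3")]}  # from (2)
    ranks = upward_ranks(dag, w_mean, c_mean)
    print(sorted(dag, key=ranks.get, reverse=True))  # ['t0', 't1', 't2', 't3']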

The processor selection phase of HEFT is now straightforward: we move down the list and assign each task to the resource expected to complete its execution at the earliest time. Let $\mathrm{avail}(p)$ be the earliest time at which the processing resource $p$ is actually free to execute task $t_i$, $\mathrm{pred}(t_i)$ be the set of $t_i$'s immediate predecessors in the DAG, and $AFT(t_j)$ be the time when execution of a task $t_j$ is actually completed (which in the static case is known precisely once it has been scheduled). Then the earliest start time of task $t_i$ on processing resource $p$ is computed through

$$EST(t_i, p) = \max \left\{ \mathrm{avail}(p),\ \max_{t_j \in \mathrm{pred}(t_i)} \left( AFT(t_j) + c_{ji} \right) \right\}, \qquad (4)$$

where $c_{ji}$ denotes the communication cost from the resource on which $t_j$ was scheduled to $p$, and the earliest finish time $EFT(t_i, p)$ of task $t_i$ on $p$ is given by

$$EFT(t_i, p) = w(t_i, p) + EST(t_i, p). \qquad (5)$$

HEFT follows an insertion-based policy that allows tasks to be inserted between two that are already scheduled, assuming precedence constraints are still respected, so $\mathrm{avail}(p)$ may not simply be the latest finish time of all tasks on $p$. A complete description of HEFT is given in Algorithm 1. HEFT has a time complexity of $O(eP)$; for dense DAGs, the number of edges is proportional to $n^2$, where n is the number of tasks, so the complexity is effectively $O(n^2 P)$ [17].
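The insertion policy is easy to get wrong, so a sketch may help. Here each resource's load is represented as a time-sorted list of busy intervals; all names and numbers are illustrative:

    def earliest_start(duration, data_ready, intervals):
        """Insertion policy: first point >= data_ready where `duration`
        fits before the next busy interval on the resource."""
        t = data_ready
        for start, finish in intervals:      # sorted by start time
            if t + duration <= start:        # task fits in this gap
                return t
            t = max(t, finish)
        return t                             # after the last busy interval

    def eft(task_time, data_ready, intervals):
        return earliest_start(task_time, data_ready, intervals) + task_time

    # Hypothetical resource already busy during [0, 4) and [9, 12):
    busy = [(0.0, 4.0), (9.0, 12.0)]
    print(eft(3.0, 2.0, busy))   # starts at 4.0 in the gap, finishes at 7.0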

Benchmarking

Using our simulation model, we investigated the quality of the schedules computed by HEFT for the Cholesky and randomly-generated DAG sets on both the single and multiple GPU target platforms described in Sect.3. The metric used for evaluation was the speedup, the ratio of the minimal serial time (MST)—the minimum execution time of the DAG on any single resource—to the makespan. Intuitively, speedup tells us how well the schedule exploits the parallelism of the target platform.
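As a trivial helper (names are ours; the serial times would be the sums of all task execution times on a single CPU core or a single GPU):

    def speedup(cpu_serial_time, gpu_serial_time, makespan):
        """Minimal serial time (best single resource) over the makespan."""
        return min(cpu_serial_time, gpu_serial_time) / makespan

    # e.g., a DAG taking 900s on one CPU core, 120s on one GPU,
    # and 40s under the schedule being evaluated:
    print(speedup(900.0, 120.0, 40.0))   # 3.0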

Figure 1a shows the speedup of HEFT for the Cholesky DAGs with tile size 128. The most interesting takeaway is the difference between the two platforms. With multiple GPUs the speedup increased uniformly with the number of tasks until a small decline for the very largest DAG, but for a single GPU the speedup stagnated much more quickly. This was because the single GPU becomes continuously busy once the DAGs are sufficiently large, at which point it adds little further value; more GPUs postpone this effect. Results were broadly similar for Cholesky DAGs with tile size 1024, with the exception that the speedup values were uniformly smaller, reaching a maximum of just over four for the multiple GPU platform. This was because the GPUs were so much faster for the larger tile size that HEFT made little use of the CPUs, and so speedup was almost entirely determined by the number of GPUs available.


Fig. 1.

Speedup of HEFT for Cholesky (tile size 128) and random (high acceleration) DAG sets.

Figure 1b shows the speedups for all 540 high acceleration randomly-generated DAGs, ordered by their CCRs; results were broadly similar for the low acceleration DAGs. As for the Cholesky DAGs, the speedups for the single GPU platform are much smaller, with a narrower spread, than for the multiple GPU platform. More surprising is that HEFT sometimes returned a schedule with speedup less than one for DAGs with small CCR values, which we call a failure since this is obviously unwanted behavior. These failures were due to the greediness of the HEFT processor selection phase, which always schedules a task on the processing resource that minimizes its earliest finish time, without considering the communication costs that may later be incurred by doing so. The effect is more pronounced when the CCR is low because the unforeseen communication costs are proportionally larger.

HEFT-WM

The implicit assumption underlying the use of mean values when computing task priorities in HEFT is that the probability of a task being scheduled on a processing resource is identical for all resources. But this is obviously not the case: if a task's GPU execution time is ten times smaller than its CPU execution time then it is considerably more likely to be scheduled on a GPU, even if not precisely ten times so. This suggests that we should weight the mean values according to each task's acceleration ratio $r_i = w_C(t_i) / w_G(t_i)$, the CPU time divided by the GPU time. In particular, for each task $t_i$ we estimate its computation cost to be

$$\widetilde{w}_i = \frac{P_C\, w_C(t_i) + r_i P_G\, w_G(t_i)}{P_C + r_i P_G} \qquad (6)$$

and for each edge (say, between tasks $t_i$ and $t_j$) the estimated communication cost is the analogue of (2), with each resource-type combination weighted by the two tasks' preferences,

$$\widetilde{c}_{ij} = \frac{P_C (P_C - 1)\, c_{ij}^{CC} + r_j P_C P_G\, c_{ij}^{CG} + r_i P_G P_C\, c_{ij}^{GC} + r_i r_j P_G (P_G - 1)\, c_{ij}^{GG}}{(P_C + r_i P_G)(P_C + r_j P_G)}. \qquad (7)$$

We call the modified heuristic which follows Algorithm 1 but uses (6) instead of (1) and (7) instead of (2) HEFT-WM (for Weighted Mean). We were unable to find any explicit reference to this modification in the literature, but given its simplicity we do not think it is a novel idea and suspect it has been used before in practice.
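A sketch of the weighted mean (6), evaluated for a task that is ten times faster on GPU, on the multiple GPU platform (illustrative numbers):

    def weighted_mean_comp(cpu_time, gpu_time, n_cpu, n_gpu):
        r = cpu_time / gpu_time    # acceleration ratio of the task
        return (n_cpu * cpu_time + r * n_gpu * gpu_time) / (n_cpu + r * n_gpu)

    # 28 CPU resources, 4 GPU resources; task times 10.0 (CPU) and 1.0 (GPU):
    print(weighted_mean_comp(10.0, 1.0, 28, 4))  # ~4.7, vs. 8.875 under (1)

The weighted mean is pulled toward the GPU time, reflecting that such a task is far more likely to end up on a GPU.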

HOFT

If we disregard all resource contention, then we can easily compute the earliest possible time that each task could be completed if it were scheduled on either a CPU or a GPU resource, which we call its Optimistic Finish Time (OFT). More specifically, for $T \in \{C, G\}$, we move forward through the DAG and build a table of OFT values by setting $OFT(t_i, T) = w_T(t_i)$ if $t_i$ is an entry task, and recursively computing

$$OFT(t_i, T) = w_T(t_i) + \max_{t_j \in \mathrm{pred}(t_i)} \left\{ \min_{T' \in \{C, G\}} \left( OFT(t_j, T') + \delta_{T T'}\, c_{ji}^{T' T} \right) \right\} \qquad (8)$$

for all other tasks, where $\delta_{T T'} = 1$ if $T \neq T'$ and 0 otherwise. We use the OFT table as the basis for the task prioritization and processor selection phases of a new HEFT-like heuristic optimized for CPU-GPU platforms that we call Heterogeneous Optimistic Finish Time (HOFT). Note that computing the OFT table does not increase the order of HEFT's time complexity.

Among several possible ways of using the OFT to compute a complete task prioritization, we found the most effective to be the following. First, define the weight of each task to be the ratio of its maximum and minimum OFT values,

$$u_i = \frac{\max \left\{ OFT(t_i, C),\ OFT(t_i, G) \right\}}{\min \left\{ OFT(t_i, C),\ OFT(t_i, G) \right\}}. \qquad (9)$$

Now assume that all edge weights are zero, $\overline{c}_{ij} \equiv 0$, and compute the upward rank of all tasks with these values. The upward ranking is used to ensure that all precedence constraints are met. Intuitively, tasks with a strong preference for one resource type, as indicated by a high ratio, should be scheduled first.
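The OFT table is a single forward pass. A dictionary-based sketch with illustrative two-task data follows; under the optimistic assumption, no communication cost is paid when parent and child share a resource type, since they could then share the same resource:

    def oft_table(order, parents, w, c):
        """order: tasks in topological order; w[t][T]: execution time of t
        on type T; c[(u, t)]["T'T"]: comm cost from type T' to type T."""
        oft = {}
        for t in order:
            oft[t] = {}
            for T in ("C", "G"):
                incoming = max(
                    (min(oft[u][U] + (0.0 if U == T else c[(u, t)][U + T])
                         for U in ("C", "G"))
                     for u in parents[t]),
                    default=0.0)
                oft[t][T] = w[t][T] + incoming
        return oft

    def weight(oft_t):                  # the ratio u_i in (9)
        return max(oft_t.values()) / min(oft_t.values())

    order = ["t0", "t1"]
    parents = {"t0": [], "t1": ["t0"]}
    w = {"t0": {"C": 10.0, "G": 2.0}, "t1": {"C": 8.0, "G": 1.0}}
    c = {("t0", "t1"): {"CG": 4.0, "GC": 4.0}}
    oft = oft_table(order, parents, w, c)
    print(oft["t1"], weight(oft["t1"]))  # {'C': 14.0, 'G': 3.0} 4.67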

We also propose a new processor selection phase, which proceeds as follows. Working down the priority list, each task $t_i$ is scheduled on the processing resource $p_{\min}$ with the smallest EFT, as in HEFT, except when $p_{\min}$ is not of the fastest type for that task. In such cases, let $p_{\mathrm{fast}}$ be the resource of the fastest type with the minimal EFT, and compute

$$S = EFT(t_i, p_{\mathrm{fast}}) - EFT(t_i, p_{\min}), \qquad (10)$$

the saving that we expect to make by scheduling $t_i$ on $p_{\min}$ rather than $p_{\mathrm{fast}}$. Suppose that $p_{\min}$ is of type $T_m$. By assuming that each child task $t_j$ of $t_i$ is scheduled on the resource type $T_j$ which minimizes its OFT, and disregarding the potential need to wait for other parent tasks to complete, we estimate $F(p_{\min})$, the earliest finish time of all child tasks given that $t_i$ is scheduled on $p_{\min}$, through

$$F(p_{\min}) = \max_{t_j \in \mathrm{succ}(t_i)} \left( EFT(t_i, p_{\min}) + \delta_{T_m T_j}\, c_{ij}^{T_m T_j} + w_{T_j}(t_j) \right). \qquad (11)$$

Likewise for $p_{\mathrm{fast}}$ we compute $F(p_{\mathrm{fast}})$, and if

$$F(p_{\mathrm{fast}}) \leq F(p_{\min}), \qquad (12)$$

we schedule task $t_i$ on $p_{\mathrm{fast}}$; otherwise, we schedule it on $p_{\min}$. In effect, the immediate saving $S$ on $t_i$ is weighed against the estimated effect on its children: the processor selection always chooses the resource with the smallest EFT unless by doing so we expect to increase the earliest possible time at which all child tasks can be completed.
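The selection rule can be summarized in a few lines. In this sketch, `eft` and `child_finish` stand for equations (5) and (11) and are stubbed with toy values; all names are illustrative:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Res:
        id: int
        kind: str  # "C" or "G"

    def hoft_select(task, resources, eft, fastest_type, child_finish):
        """Pick the min-EFT resource unless a fastest-type resource gives
        the children an earlier estimated finish."""
        p_min = min(resources, key=lambda p: eft(task, p))
        if p_min.kind == fastest_type:
            return p_min                    # agrees with HEFT
        fast = [p for p in resources if p.kind == fastest_type]
        p_fast = min(fast, key=lambda p: eft(task, p))
        if child_finish(task, p_fast) <= child_finish(task, p_min):
            return p_fast                   # the exception: protect the children
        return p_min

    # Toy data: GPU is the fastest type, but a CPU is free earlier.
    res = [Res(0, "C"), Res(1, "G")]
    eft = lambda t, p: {("t", 0): 5.0, ("t", 1): 6.0}[(t, p.id)]
    cf  = lambda t, p: {("t", 0): 14.0, ("t", 1): 11.0}[(t, p.id)]
    print(hoft_select("t", res, eft, "G", cf))  # Res(id=1, kind='G')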

Simulation Results

Figure 2 shows the reduction in schedule makespan, as a percentage of the HEFT schedule, achieved by HOFT and HEFT-WM for the set of Cholesky DAGs, on both the single and multiple GPU target platforms. The overall trend for the multiple GPU platform is that HOFT improves relative to both HEFT variants as the number of tasks in the DAG grows; it is always better than standard HEFT for tile size 1024. For the single GPU platform, HOFT was almost always the best, except for the smallest DAGs and the largest DAG with tile size 128 (for which all three heuristics were almost identical). Interestingly, we found that the processor selection phase of HOFT never actually differed from HEFT's for these DAGs, so the task prioritization phase alone was key.


Fig. 2.

HEFT-WM and HOFT compared to HEFT for Cholesky DAGs.

HOFT achieved smaller makespans than HEFT on average for the set of randomly-generated DAGs, especially those with high acceleration, but was slightly inferior to HEFT-WM, as can be seen from Table 1. Also included in the table is HOFT-WM, the heuristic obtained by using the HEFT-WM task prioritization with the HOFT processor selection. HOFT-WM was identical to HEFT-WM for the Cholesky DAGs but improved on both heuristics for the randomly-generated DAG set, suggesting that the HOFT processor selection phase is generally more effective than HEFT's, no matter which task ranking is used. The alternative processor selection also reduced the failure rate for DAGs with very low CCR by about half on the single GPU platform, although it made little difference on the multiple GPU platform.

Table 1.

Makespan reduction vs. HEFT. Shown are both the average percentage reduction (APR) and the percentage of the 540 DAGs for which each heuristic improved on the HEFT schedule (Better).

    Heuristic    Single GPU                 Multiple GPU
                 Low acc.     High acc.     Low acc.     High acc.
                 APR  Better  APR  Better   APR  Better  APR  Better
    HEFT-WM      0.8  74.8    2.3  69.6     1.6  84.8    2.4  79.8
    HOFT         –    50.3    3.8  83.1     1.4  69.2    2.3  76.5
    HOFT-WM      0.8  70.9    4.6  76.9     1.4  78.1    3.7  81.1

Conclusions

Overall our simulations suggest that HOFT is often superior to—and always competitive with—both standard HEFT and HEFT-WM for multicore and GPU platforms, especially when task acceleration ratios are high. The processor selection phase in particular appears to be more effective, at least on average, for any task prioritization. It should also be noted that HEFT-WM was almost always superior to the standard algorithm in our simulations, suggesting that it should perhaps be the default in accelerated environments.

Although the number of failures HOFT recorded for DAGs with low CCRs was slightly smaller than for HEFT, the failure probability was still unacceptably high. Using the lookahead processor selection from [5] instead reduced the probability of failure further but it was still nonzero and the additional computational cost was considerable. We investigated cheaper sampling-based selection phases that consider only a small sample of the child tasks, selected either at random or based on priorities. These did reduce the failure probability in some cases but the improvement was usually minor. Alternatives to lookahead that we intend to consider in future are the duplication [13] or aggregation of tasks.

Static schedules for multicore and GPU are most useful in practice as a foundation for superior dynamic schedules [1], or when their robustness is reinforced using statistical analysis [18]. This paper was concerned with computing the initial static schedules; the natural next step is to consider their extension to real—i.e., dynamic—environments. This is often called stochastic scheduling, since the costs are usually modeled as random variables from some—possibly unknown—distribution. The goal is typically to find methods that bound variation in the schedule makespan, which is notoriously difficult [12]. Experimentation with existing stochastic scheduling heuristics, such as Monte Carlo Scheduling [18], suggested to us that the main issue with adapting such methods for multicore and GPU is their cost, which can be much greater than computing the static schedule itself. Hence in future work we intend to investigate cheaper ways to make effective use of static schedules for real platforms.

Footnotes

1. https://www.top500.org/

2. https://github.com/mcsweeney90/heterogeneous_optimistic_finish_time

Supported by the Engineering and Physical Sciences Research Council (EPSRC).


References

1. Agullo, E., Beaumont, O., Eyraud-Dubois, L., Kumar, S.: Are static schedules so bad? A case study on Cholesky factorization. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1021–1030 (2016). doi: 10.1109/IPDPS.2016.90

2. Anderson, E., et al.: LAPACK Users' Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)

3. Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014). doi: 10.1109/TPDS.2013.57

4. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput. Pract. Exp. 23(2), 187–198 (2011). doi: 10.1002/cpe.1631

5. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.M.: DAG scheduling using a lookahead variant of the Heterogeneous Earliest Finish Time algorithm. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 27–34 (2010). doi: 10.1109/PDP.2010.56

6. Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Hérault, T., Dongarra, J.J.: PaRSEC: exploiting heterogeneity to enhance scalability. Comput. Sci. Eng. 15(6), 36–45 (2013). doi: 10.1109/MCSE.2013.98

7. Braun, T.D., et al.: A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 61(6), 810–837 (2001). doi: 10.1006/jpdc.2000.1714

8. Canon, L.C., Jeannot, E., Sakellariou, R., Zheng, W.: Comparative evaluation of the robustness of DAG scheduling heuristics. In: Gorlatch, S., Fragopoulou, P., Priol, T. (eds.) Grid Computing, pp. 73–84. Springer, Cham (2008)

9. Canon, L.-C., Marchal, L., Simon, B., Vivien, F.: Online scheduling of task graphs on hybrid platforms. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par 2018: Parallel Processing, pp. 192–204. Springer, Cham (2018)

10. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990). doi: 10.1145/77626.79170

11. Duran, A., et al.: OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21(02), 173–193 (2011). doi: 10.1142/S0129626411000151

12. Hagstrom, J.N.: Computational complexity of PERT problems. Networks 18(2), 139–147 (1988). doi: 10.1002/net.3230180206

13. Ahmad, I., Kwok, Y.-K.: On exploiting task duplication in parallel program scheduling. IEEE Trans. Parallel Distrib. Syst. 9(9), 872–892 (1998). doi: 10.1109/71.722221

14. Kumar, S.: Scheduling of dense linear algebra kernels on heterogeneous resources. Ph.D. thesis, University of Bordeaux (2017). https://tel.archives-ouvertes.fr/tel-01538516/file/KUMAR_SURAL_2017.pdf

15. Shetti, K.R., Fahmy, S.A., Bretschneider, T.: Optimization of the HEFT algorithm for a CPU-GPU environment. In: PDCAT, pp. 212–218 (2013). doi: 10.1109/PDCAT.2013.40

16. Tobita, T., Kasahara, H.: A standard task graph set for fair evaluation of multiprocessor scheduling algorithms. J. Scheduling 5(5), 379–394 (2002). doi: 10.1002/jos.116

17. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002). doi: 10.1109/71.993206

18. Zheng, W., Sakellariou, R.: Stochastic DAG scheduling using a Monte Carlo approach. J. Parallel Distrib. Comput. 73(12), 1673–1689 (2013). doi: 10.1016/j.jpdc.2013.07.019
