
## TABLE OF CONTENTS

TITLE PAGE

DECLARATION

CERTIFICATION

ACKNOWLEDGEMENT

ABSTRACT

TABLE OF CONTENTS

CHAPTER ONE: INTRODUCTION

1.1. Background to the Study

1.2. Research Motivation

1.3. Research Objectives

1.4. Research Methodology

1.5. Organization of the Thesis

1.6. Contribution to Knowledge

1.7. Benefits of this Research Work to Society

CHAPTER TWO: LITERATURE REVIEW

2.1. Introduction

2.2. Parallel Computing Overview

2.2.1. Some Concepts and Terminologies

2.2.2. Message Passing Model

2.2.3. Data Parallel Model

2.3. Analysing Sequential Algorithms

2.4. Analysing Parallel Algorithms

2.5. Review of Cannon’s Algorithm and Block Rhotrix Operations

2.5.1. Rhotrix-Rhotrix Multiplication: Cannon’s Algorithm Method

2.5.2. Communication Steps in Cannon’s Algorithm

2.5.3. Systolic Array

2.6. Background Concepts from Group Theory to Complex Analysis

2.7. A Review of Strassen’s Algorithm

2.8. Conversion of Rhotrix to Coupled Matrix

2.8.1. Solving (m × n) × (n × m) and (m−1) × (n−1) Coupled Matrix Problems

2.8.2. Parallel Computing with Processor Array Overview

2.8.3. Architectural Considerations in Array Processor Design

2.8.4. Mapping Parallel Algorithms onto Locally Interconnected Computing Networks

2.9. Conclusion

CHAPTER THREE: SEQUENTIAL AND PARALLEL ALGORITHMIC DESIGN FOR RHOTRIX MULTIPLICATION

3.1. Introduction

3.2. The Algebra of Vector Dot-Product

3.2.1. Heart-Oriented Rhotrix Multiplication

3.2.2. Row-Wise Heart-Oriented Rhotrix Multiplication

3.3. Row-Column Multiplication of N-Dimensional Rhotrices

3.3.1. Row-Column Definition and Formula Expression for N-Dimensional Rhotrix

Justification of Adopted Method: from Multiset Point of View

3.3.2. Row-Column Multiplication Techniques

3.3.3. Generalized Form of Row-Column Multiplication Technique

3.4. Application of Cannon’s Algorithm in Rhotrix Row-Column Multiplication

3.4.1. Communication Steps in Cannon’s Algorithm

3.4.2. Analysis of Cannon’s Algorithm

3.4.3. Theoretical Computation of Cannon’s Algorithm

3.5. Verification of Strassen’s Algorithm by Divide-and-Conquer Technique

3.5.1. Strassen’s Algorithm and Rhotrix Row-Column Multiplication

3.5.2. Strassen’s Algorithm Computation Approach

3.5.3. Strassen’s Algorithm Fixup Case

3.5.4. Program Task Graph for Strassen’s Algorithm

3.6. Contribution Summary

3.7. Parallel Multiplication of Rhotrices

3.8. Architecture and Data-Parallel Operations for Processor Array

3.8.1. Structure of Processing Interconnection Network

3.8.2. Architectural Features of Processor Arrays (Enabling and Disabling Processors)

3.9. The Mapping Problem

3.9.1. Design of Processor Array Mapping Techniques

3.9.2. Linear Mapping of Processing Element

3.9.3. Communication Vectors

3.9.4. Linear Mapping

3.9.5. Selection of Scheduling Vector ST Based on Scheduling Inequalities

3.9.6. Rhotrix Multiplication and 2-D Processor Array Design

3.10. Rhotrix Multiplication on Process Array Grid

3.10.1. Pseudocode for N-Dimension Rhotrix Multiplication on Process Array Grid

3.10.2. Input Data Movement Pattern for Coupled Matrix

3.10.3. Input Data Movement Pattern for Rhotrix Elements

3.11. Row-Column Multiplication on Master-Worker Platform (Coupled Matrix)

3.11.1. Application Process and Input Data Communication

3.11.2. Master-Worker Network Topology Platform

3.11.3. Minimization of the Communication Volume

3.11.4. The Maximum Re-use Algorithm

3.12. Algorithms for Master-Worker Platforms (Coupled Matrix)

3.12.1. Principle of the Algorithm Design

3.13. Program Task Graph for Heart-Oriented Rhotrix Algorithm

3.14. Memory Usage

3.15. Heart-Oriented MPI Experimental Performance Model

3.16. Heart-Oriented Scalability Analysis Performance Model

3.17. Task-Parallel Approach for Heart-Oriented Algorithm Using MPI

3.18. A Framework of Task-Parallel Approach for Strassen’s Algorithm Using MPI

3.19. Data Structures

3.20. Computational Time Complexity Analysis for Rhotrix Multiplication

3.20.1. Computational Time Complexity for Row-Column Multiplication

3.20.2. Computational Time Complexity for Heart-Oriented Multiplication

3.21. Conclusion

CHAPTER FOUR: IMPLEMENTATION AND PERFORMANCE RESULTS

4.1. Introduction

4.2. MPI Performance Results

4.3. Performance Analyzer

4.4. Process Array Simulation for Coupled Matrix Implementation

4.5. Process Array Simulation for Row-Column Algorithm Implementation

4.6. Conclusion

CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORK

5.1. Summary and Conclusion

5.2. Future Work

REFERENCES

## CHAPTER ONE

INTRODUCTION

1.1. Background to the study

Linear algebra is an area which involves heavy computation and therefore needs high-performance routines to decrease the total computation time. In an effort to make this possible, parallel computers have evolved with different architectures. Parallel computer systems consisting of multiple processors provide much more raw computing power (by orders of magnitude) than traditional uniprocessor systems. Powerful parallel computer systems open up new frontiers for parallel applications that strive to utilize such systems effectively. The challenges for parallel applications, particularly in the field of scientific computing, lie in what the new parallel programming model should be and what kinds of parallelism an algorithm should possess to utilize the parallel computer system effectively.

One model of parallel computer is the multicomputer. A multicomputer is a concurrent distributed-memory system with message passing among processors (Foster, 1995). It consists of a number of von Neumann processors, each with its own memory, and an interconnection network that links the processors through message interchange. The MIMD architectural model, which stands for multiple instruction streams and multiple data streams (Kumar et al., 1994), together with message passing therefore forms an ideal parallel programming model for multicomputers. This combined hardware and software architecture is capable of executing different programs concurrently on the processors and exchanging information among them via message passing.

Among the key requirements for effective parallel computation, the performance requirement is always the most critical, because the primary motivation of parallel applications is to achieve performance that cannot be accomplished on traditional sequential computer systems. However, it is difficult for a single algorithm to achieve high performance over a wide range of problem sizes on computer systems with different characteristics. A non-trivial example is the well-known sorting problem: the quick-sort algorithm works well for long lists, while the bubble-sort algorithm works well for short lists. In order to solve the sorting problem optimally, application developers must take responsibility for formulating the problem and


choosing an appropriate algorithm according to the length of the list. However, this approach puts a heavy burden on application developers for complex problems, as well as imposing a need for predictive performance models. Furthermore, in order to utilize different parallel computer systems effectively, application developers must have deep knowledge of the underlying systems. A better approach to overcoming this dilemma is the "poly-algorithmic approach." The poly-algorithmic approach, a concept introduced by Rice and Rosen (1966), refers to the use of two or more algorithms to solve the same problem, with a high-level decision-making process determining which of a set of algorithms performs best in a given situation. The poly-algorithmic approach helps parallel libraries achieve performance scalability and portability while hiding the complexity of performance-oriented software and the multiplicity of algorithms behind the interface.
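The decision layer of a poly-algorithmic library can be illustrated with the sorting example above. The following C sketch is ours, not from the thesis; in particular, the crossover threshold of 32 elements is an assumed value, not a measured one, since the true crossover depends on the machine:

```c
#include <stdlib.h>

/* Assumed crossover point below which the low-overhead algorithm wins;
   a real poly-algorithmic library would calibrate this per machine. */
#define SMALL_LIST 32

static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Poly-algorithmic sort: a high-level decision step selects the
   algorithm by problem size, hiding both algorithms behind one interface. */
void poly_sort(int *a, size_t n) {
    if (n < SMALL_LIST)
        insertion_sort(a, n);            /* cheap for short lists */
    else
        qsort(a, n, sizeof *a, cmp_int); /* O(n log n) for long lists */
}
```

The caller sees only `poly_sort`; the multiplicity of algorithms and the selection logic stay inside the library, which is exactly the portability argument made above.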

Consequently, the research hypothesis is that no single parallel rhotrix multiplication algorithm can always achieve the best performance for multiplying rhotrices of different sizes on process grids of different shapes; the poly-algorithmic approach is able to meet the requirement of high performance over a range of system and problem parameters. In the present study, we specifically target the problem of rhotrix multiplication on a systolic-array parallel architecture. We will build a collection of general-purpose parallel rhotrix multiplication algorithms and measure the performance of each algorithm for rhotrices of different sizes on simulated process grids (systolic arrays). By using the poly-algorithmic approach, we will show that it is possible to achieve high and uniform performance by choosing, from this collection, the algorithm suitable for a particular situation. This requires multiple algorithms and associated performance models and measurements for representative cases.

The concept of rhotrices was introduced by Ajibade (2003) as an extension of ideas on matrix-tertions and matrix-noitrets discussed in Atanassov and Shannon (1998). Ajibade presented the initial algebra and analysis of rhotrices and established some interesting relationships between rhotrices and their hearts. He defined the multiplication of two rhotrices


$$
R=\left\langle\begin{matrix} & a & \\ b & h(R) & d \\ & e & \end{matrix}\right\rangle
\qquad\text{and}\qquad
Q=\left\langle\begin{matrix} & f & \\ g & h(Q) & j \\ & k & \end{matrix}\right\rangle
$$

as follows:

$$
R\circ Q=\left\langle\begin{matrix} & a\,h(Q)+f\,h(R) & \\ b\,h(Q)+g\,h(R) & h(R)\,h(Q) & d\,h(Q)+j\,h(R) \\ & e\,h(Q)+k\,h(R) & \end{matrix}\right\rangle
$$
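Ajibade's heart-oriented product can be sketched in a few lines of C. This is a minimal illustration of the definition above for 3-dimensional rhotrices only; the struct layout and function name are ours, not from the thesis:

```c
/* A 3-dimensional rhotrix <a, b, h, d, e>, with h the heart. */
typedef struct { double a, b, h, d, e; } Rhotrix3;

/* Heart-oriented product (Ajibade, 2003): each non-heart entry of R is
   scaled by h(Q), each non-heart entry of Q by h(R), the results are
   added entrywise, and the new heart is h(R)*h(Q). */
Rhotrix3 heart_mul(Rhotrix3 R, Rhotrix3 Q) {
    Rhotrix3 P;
    P.a = R.a * Q.h + Q.a * R.h;
    P.b = R.b * Q.h + Q.b * R.h;
    P.h = R.h * Q.h;
    P.d = R.d * Q.h + Q.d * R.h;
    P.e = R.e * Q.h + Q.e * R.h;
    return P;
}
```

For example, with R = ⟨1, 2, 3, 4, 5⟩ and Q = ⟨6, 7, 8, 9, 10⟩ the top entry of the product is 1·8 + 6·3 = 26.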

This multiplication procedure gives interesting initial results but may not allow further development regarding the questions posed in Ajibade (2003). Some of those questions were answered in Atanassov and Shannon (1998) in relation to matrix-tertions and matrix-noitrets, but to the best of our knowledge there has been no extension towards generalizing this concept of rhotrix multiplication to n-dimensional rhotrices. This is why we propose an extension by means of generalization and derivation of mathematical models and algorithms for heart-oriented rhotrix multiplication, "heart-oriented" being the name we use for the multiplication method given by Ajibade.

An alternative multiplication method was proposed by Sani (2004). According to this method, since rhotrices lie somewhere between 2×2 and 3×3 matrices, it is reasonable to define their multiplication similarly to matrix multiplication. To do that, he defined the rows and columns of a rhotrix

$$
R=\left\langle\begin{matrix} & a & \\ b & h(R) & d \\ & e & \end{matrix}\right\rangle
$$

as

$$
\begin{pmatrix} a & d \\ b & e \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} a & b \\ d & e \end{pmatrix}
$$

respectively. He then defined the multiplication of any two rhotrices using the row-column multiplication method of matrices, except for the hearts, which are multiplied directly. The multiplication is defined as follows:

$$
R\circ Q=\left\langle\begin{matrix} & a & \\ b & h(R) & d \\ & e & \end{matrix}\right\rangle
\circ
\left\langle\begin{matrix} & f & \\ g & h(Q) & j \\ & k & \end{matrix}\right\rangle
=\left\langle\begin{matrix} & af+dg & \\ bf+eg & h(R)\,h(Q) & aj+dk \\ & bj+ek & \end{matrix}\right\rangle
$$
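Sani's row-column product admits an equally short C sketch for the 3-dimensional case. Again the struct layout and function name are ours; the comments map each field of Q onto the letters f, g, h(Q), j, k used above:

```c
/* A 3-dimensional rhotrix <a, b, h, d, e>, with h the heart. */
typedef struct { double a, b, h, d, e; } Rhotrix3;

/* Row-column product (Sani, 2004): rows (a d), (b e) of R multiply
   columns (f g)^T, (j k)^T of Q as in 2x2 matrix multiplication,
   while the hearts are multiplied directly.
   Field mapping for Q: a=f, b=g, h=h(Q), d=j, e=k. */
Rhotrix3 rowcol_mul(Rhotrix3 R, Rhotrix3 Q) {
    Rhotrix3 P;
    P.a = R.a * Q.a + R.d * Q.b;  /* af + dg */
    P.b = R.b * Q.a + R.e * Q.b;  /* bf + eg */
    P.h = R.h * Q.h;              /* hearts multiplied directly */
    P.d = R.a * Q.d + R.d * Q.e;  /* aj + dk */
    P.e = R.b * Q.d + R.e * Q.e;  /* bj + ek */
    return P;
}
```

With R = ⟨1, 2, 3, 4, 5⟩ and Q = ⟨6, 7, 8, 9, 10⟩ the top entry is 1·6 + 4·7 = 34, in contrast to 26 under the heart-oriented definition, which makes plain that the two products are genuinely different operations.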


This multiplication, like that of matrices, is non-commutative, but it is associative. As part of this work, we also propose a new approach to this multiplication method by designing a new generalization, mathematical model, and algorithms for row-column rhotrix multiplication, different from those initially given by the author.

Rhotrix multiplication can be seen as a fundamental operation in many numerical linear algebra applications. Its efficient implementation on parallel high-performance computers, together with the implementation of other basic linear algebra operations, can therefore be considered an issue of prime importance when providing these systems with scientific software libraries. Consequently, considerable effort has been devoted in this research work to developing efficient parallel rhotrix multiplication algorithms, and this will remain a task in the future as well.

This research work presents a set of new poly-algorithms for multiplying higher-dimensional rhotrices in both sequential and parallel environments, using conventional programming languages such as C, the Message Passing Interface (MPI), and a process-array architecture for parallel execution.

1.2. Research Motivation

The motivation for this research work was based on two fundamental multiplicative concepts on rhotrices introduced in (Ajibade, 2003), (Sani, 2004), and (Sani, 2008). In each of the aforementioned journal publications, multiplication of rhotrices was introduced and addressed in a unique but separate way. The initial driving force behind the conception of this research was to extend the work done by these scholars and to obtain high performance by taking advantage of grid computer architectures and exploiting inherent parallel techniques such as those found in high-performance computation.

1.3. Research Objectives

The primary objective of this thesis is to devise a distributed computational framework that allows the parallelization and execution of our proposed sequential and parallel rhotrix multiplication algorithms, based on the rhotrix multiplication


methods developed in (Ajibade, 2003) and (Sani, 2004). Particular effort will be placed on achieving this in an affordable way, that is, by avoiding unnecessary use of costly hybrid machines and by taking advantage of the existing computational infrastructure available in the university computer laboratory. More specifically, several mathematical models will be developed and new sequential rhotrix multiplication algorithms designed over the parallelization framework of a functional grid. This study will produce a set of the most effective strategies for parallelizing these and similar algorithms.

In order to validate the usefulness and effectiveness of the implemented framework, two main problems will be discussed:

Heart-oriented multiplication: i.e., direct multiplication of rhotrix elements with corresponding array indices.

Row-column multiplication: i.e., multiplication of rhotrix elements row by column.

1.4. Research Methodology

The method used in this study is as follows:

A review of different rhotrix multiplication methods from the related literature was carried out. This includes row-column multiplication, akin to traditional row-column matrix multiplication, and the related conversion of a rhotrix to a coupled matrix by padding, referred to as rhotrix transformation. A collection of algorithms was designed to implement each of the aforementioned methods. The implementation was done using the C and Java programming languages for the sequential-algorithm test bed, while the Delphi API and MPI message passing were used to simulate the parallel algorithms.

1.5. Organization of the Thesis

The rest of the thesis is organized as follows. Chapter II reviews the relevant literature on the different ways of defining rhotrix multiplication that pertain to this study and discusses the scalable parallel platform models adopted for the study. Chapter III discusses the rhotrix data representation format, algorithm design for row-column and heart-oriented rhotrix multiplication, the data mapping problem, rhotrix data mapping design, and parallel platform performance models. Chapter IV describes the implementation strategy for our parallel rhotrix


multiplication algorithms and provides the experimental results of each algorithm on the adopted platform. Chapter V summarises the present study and presents the conclusion and future work.

1.6. Contribution to Knowledge

The main contributions of this work are twofold:

On the theoretical side, we have proposed and designed four sequential and three parallel algorithms for row-column and heart-oriented computation of rhotrices on computer systems. Furthermore, these algorithms have been checked against experimental results generated from sequential C programs and an MPI message-passing test bed, to further buttress the validity of the algorithms presented in our study.

On the practical side, we have implemented and evaluated two parallel implementations of rhotrix row-column and dot-product computation on process-array and heterogeneous master-worker platforms using the MPI message-passing library. These implementations are based on data allocation and dynamic load-balancing strategies.

1.7. Benefits of this Research Work to Society

Considering the sparseness and arbitrary representation of some scientific and engineering problems, rhotrix multiplication can be regarded as a good candidate for computational chemists representing problems in terms of the states of a chemical system: each rhotrix index corresponds to a different basis state, the rhotrix approximates the Hamiltonian of the system, and a change of basis is accomplished through rhotrix multiplication. A rhotrix could be termed a super matrix, since within every set of rhotrix elements two other matrices are embedded. Rhotrix-vector multiplication could be applied in some transforms used in image processing and in signal processing (convolution) for the electronics and telecommunication industries. Other possible application domains of large rhotrix multiplication are molecular biology, genetic engineering, and cellular computing, where changes in states and trend-monitoring patterns can be accomplished through systems of linear equations derived from rhotrix multiplication.
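The "super matrix" remark can be made concrete for the 3-dimensional case: the rhotrix ⟨a, b, h, d, e⟩ embeds a 2×2 matrix built from its non-heart entries and a 1×1 matrix holding the heart. The following sketch is ours and assumes the row-wise pairing (a d; b e) used in the row-column definition; the thesis's own coupled-matrix construction (Section 2.8) may differ in detail:

```c
/* A 3-dimensional rhotrix <a, b, h, d, e>, with h the heart. */
typedef struct { double a, b, h, d, e; } Rhotrix3;

/* Split a 3-dimensional rhotrix into its two embedded matrices:
   a 2x2 matrix from the non-heart entries, paired row-wise as in the
   row-column definition, and a 1x1 matrix holding the heart. */
void to_coupled(Rhotrix3 R, double major[2][2], double *minor) {
    major[0][0] = R.a;  major[0][1] = R.d;
    major[1][0] = R.b;  major[1][1] = R.e;
    *minor = R.h;
}
```

Under this split, row-column rhotrix multiplication acts as ordinary matrix multiplication on each embedded matrix separately, which is what makes standard parallel matrix algorithms applicable to rhotrices.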