
The Art of Concurrency - Chapter 2



Chapter 2. Concurrent or Not Concurrent?

“To get things started, I want to first talk about two design methods for concurrent algorithms, but I want to do it abstractly. Now, before you roll your eyes too far back and hurt yourself, let me say that there will be plenty of code examples in later chapters to give concreteness to the ideas that are presented here. This is a book on the design of concurrent algorithms, and in this chapter I've collected a lot of the wisdom on initial approaches that apply to a large percentage of code you're likely to encounter (it can get pretty dry without code to look at, so be sure you're well hydrated before you start).” (ArtConc 2009)

“In addition, I want to let you know that not every bit of computation can be made concurrent, no matter how hard you try. To save you the time of trying to take on too many impossible things in the course of your day, I have examples of the kinds of algorithms and computations that are not very amenable to concurrency in the section What's Not Parallel. When any of those examples can be modified to allow for concurrent execution, I've included hints and tips about how to do that.” (ArtConc 2009)

Design Models for Concurrent Algorithms

“If you've got a sequential code that you want to transform into a concurrent version, you need to identify the independent computations that can be executed concurrently. The way you approach your serial code will influence how you reorganize the computations into a concurrent equivalent. One way is task decomposition, in which the computations are a set of independent tasks that threads can execute in any order. Another way is data decomposition, in which the application processes a large collection of data and can compute every element of the data independently.” (ArtConc 2009)

“The next two sections will describe these approaches in more detail and give an example of a problem that falls into each category. These two models are not the only possibilities, but I've found them to be the two most common. For other patterns of computation and how to transform them into concurrent algorithms, read Patterns for Parallel Programming by Timothy G. Mattson et al. (Addison-Wesley, 2004). Many of the ideas presented in the next two sections are rooted in material from that book.” (ArtConc 2009)

Task Decomposition

“When you get right down to it, any concurrent algorithm is going to turn out to be nothing more than a collection of concurrent tasks. Some may be obvious independent function calls within the code. Others may turn out to be loop iterations that can be executed in any order or simultaneously. Still others might turn out to be groups of sequential source lines that can be divided and grouped into independent computations. For all of these, you must be able to identify the tasks and decompose the serial code into concurrently executable work. If you're familiar enough with the source code and the computations that it performs, you may be able to identify those independent computations via code inspection.” (ArtConc 2009)

“As I've implied, the goal of task decomposition, or any concurrent transformation process, is to identify computations that are completely independent. Unfortunately, it is the rare case where the serial computation is made up of sequences of code that do not interact with each other in some way. These interactions are known as dependencies, and before you can make your code run in parallel, you must satisfy or remove those dependencies. The section What's Not Parallel describes some of these dependencies and how you might overcome them.” (ArtConc 2009)

“You will find that, in most cases, you can identify the independent tasks at the outset of the concurrent computation. After the application has defined the tasks, spawned the threads, and assigned the tasks to threads (more details on these steps in a moment), almost every concurrent application will wait until all the concurrent tasks have completed. Why? Well, think back to the original serial code. The serial algorithm did not go on to the succeeding phase until the preceding portion was completed. That's why we call it serial execution. We usually need to keep that sequence of execution in our concurrent solutions in order to maintain the sequential consistency property (getting the same answer as the serial code on the same input data set) of the concurrent algorithm.” (ArtConc 2009)

“The most basic framework for doing concurrent work is to have the main or the process thread define and prepare the tasks, launch the threads to execute their tasks, and then wait until all the spawned threads have completed. There are many variations on this theme. Are threads created and terminated for each portion of parallel execution within the application? Could threads be put to “sleep” when the assigned tasks are finished and then “woken up” when new tasks are available? Rather than blocking after the concurrent computations have launched, why not have the main thread take part in executing the set of tasks? Implementing any of these is simply a matter of programming logic, but they still have the basic form of preparing tasks, getting threads to do tasks, and then making sure all tasks have been completed before going on to the next computation.” (ArtConc 2009)

“Is there a case in which you don't need to wait for the entire set of tasks to complete before going to the next phase of computation? You bet. Consider a search algorithm. If your tasks are to search through a given discrete portion of the overall data space, does it make any sense to continue searching when you have located the item you were looking for? The serial code was likely written to stop searching, so why should the concurrent tasks continue to waste execution resources in an unproductive manner? To curtail the execution of threads before the natural termination point of tasks requires additional programming logic and overhead. Threads will need to periodically check the status of the overarching task to determine whether to continue or wind things up. If the original search algorithm was to find all instances of an item, each thread would examine all assigned data items and not need to worry about early termination.” (ArtConc 2009)

“You may also encounter situations in which new tasks will be generated dynamically as the computation proceeds. For example, if you are traversing a tree structure with some computation at each node, you might set up the tasks to be the traversal of each branch rooted at the current node. For a binary tree, up to two tasks would be created at each internal node. The mechanics of encapsulating these new tasks and assigning them to threads is all a matter of additional programming.” (ArtConc 2009)

“There are three key elements you need to consider for any task decomposition design:” (ArtConc 2009)

What are the tasks and how are they defined?

What are the dependencies between tasks and how can they be satisfied?

How are the tasks assigned to threads?

Each of these elements is covered in more detail in the following sections.

What are the tasks and how are they defined?

“The ease of identifying independent computations within an application is in direct proportion to your understanding of the code and computations being performed by that code. There isn't any procedure, formula, or magic incantation that I know of where the code is input and out pops a big neon sign pointing to the independent computations. You need to be able to mentally simulate the execution of two parallel streams on suspected parts of the application to determine whether those suspected parts are independent of each other (or might have manageable dependencies).” (ArtConc 2009)

“Simulating the parallel or concurrent execution of multiple threads on given source code is a skill that has been extremely beneficial to me in both designing concurrent algorithms and in proving them to be error-free (as we shall see in Chapter 3). It takes some practice, but like everything else that takes practice, the more you do it, the better you will get at doing it. While you're reading my book, I'll show you how I approach the art of concurrent design, and then you'll be better equipped to start doing this on your own.” (ArtConc 2009)

Note

“There is one tiny exception to not having a “magic bullet” that can identify potentially independent computations within loop iterations. If you suspect a loop has independent iterations (those that can be run in any order), try executing the code with the loop iterations running in reverse of their original order. If the application still gets the same results, there is a strong chance that the iterations are independent and can be decomposed into tasks. Beware that there might still be a “hidden” dependency waiting to come out and bite you when the iterations are run concurrently — for example, the intermediate sequence of values stored in a variable may be harmless when the loop iterations are run serially, even backward, but not when they run concurrently.” (ArtConc 2009)

“To get the biggest return on investment, you should initially focus on computationally intense portions of the application. That is, look at those sections of code that do the most computation or account for the largest percentage of the execution time. You want the ratio of the performance boost to the effort expended in transforming, debugging, and tuning of your concurrent code to be as high as possible. (I freely admit that I'm a lazy programmer — anytime I can get the best outcome from the least amount of work, that is the path I will choose.)” (ArtConc 2009)

“Once you have identified a portion of the serial code that can be executed concurrently, keep in mind the following two criteria for the actual decomposition into tasks:” (ArtConc 2009)

“There should be at least as many tasks as there will be threads (or cores).” (ArtConc 2009)

“The amount of computation within each task (granularity) must be large enough to offset the overhead that will be needed to manage the tasks and the threads.” (ArtConc 2009)

“The first criterion is used to assure that you won't have idle threads (or idle cores) during the execution of the application. If you can create the number of tasks based on the number of threads that are available, your application will be better equipped to handle execution platform changes from one run to the next. It is almost always better to have (many) more tasks than threads. This will allow the scheduling of tasks to threads greater flexibility to achieve a good load balance. This is especially true when the execution times of each task are not all the same or the time for tasks is unpredictable.” (ArtConc 2009)

“The second criterion seeks to give you the opportunity to actually get a performance boost in the parallel execution of your application. The amount of computation within a task is called the granularity. The more computation there is within a task, the higher the granularity; the less computation there is, the lower the granularity. The terms coarse-grained and fine-grained are used to describe instances of high granularity and low granularity, respectively. The granularity of a task must be large enough to render the task and thread management code a minuscule fraction of the overall parallel execution. If tasks are too small, execution of the code to encapsulate the task, assign it to a thread, handle the results from the task, and any other thread coordination or management required in the concurrent algorithm can eliminate (best case) or even dwarf (worst case) the performance gained by running your algorithm on multiple cores.” (ArtConc 2009)

Note

“Granularity, defined another way, is the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained concurrency runs the danger of not having enough work assigned to threads to overcome the overhead costs (synchronizations) of using threads. Adding more threads, when the amount of computation doesn't change, only exacerbates the problem. Coarse-grained concurrency has lower relative overhead costs and tends to be more readily scalable to an increase in the number of threads.” (ArtConc 2009)

“Consider the case where the time for overhead computations per task is the same for two different divisions of tasks. If one scheme divides the total work into 16 tasks, and the other uses only 4 tasks, which scheme would run faster on four cores with four threads? Figure 2-1 illustrates the two task decompositions and their execution with overhead added.” (ArtConc 2009)

Figure 2-1. Task granularity example

“You can imagine that the height of each box in Figure 2-1 represents the amount of time required to execute the associated computation. With many small tasks — each requiring a separate overhead computation — Figure 2-1 (a), the fine-grained solution, will take longer to run than the same work divided into fewer larger tasks, as shown in Figure 2-1 (b), the coarse-grained decomposition. Larger tasks also provide better opportunities for other performance benefits (e.g., reuse of cache, more efficient memory access patterns) that are already part of the serial algorithm.” (ArtConc 2009)

“You will need to strike a balance between these two criteria. For a fixed amount of work, the larger you define your tasks to be, the fewer tasks there will be to assign to threads. You may find that in order to satisfy the second criterion, which is the more important of the two, you will need to define a number of tasks that is fewer than the number of threads. Because reducing execution time is the main reason for writing and running concurrent code, having an application that might not utilize all the cores available is more desirable than having an application that performs worse (takes longer to execute) when using all the cores in the platform.” (ArtConc 2009)

“Finally, don't be afraid to go back and rework your task decomposition. If your initial decomposition does not meet the criteria, you should consider alternate decompositions. Also, if you find that you are not achieving the performance levels you expect, you may need to go back and redefine the tasks, the number of threads to be used, or how those tasks are assigned to threads.” (ArtConc 2009)

What are the dependencies between tasks and how can they be satisfied?

“Two types of dependencies can occur between tasks. The first is order dependency, where some task relies on the completed results of the computations from another task. This reliance can be a direct need to use the computed values as input to the succeeding task, or it may simply be the case that the task that follows will be updating the same memory locations as the previous task and you must ensure that all of the previous updates have been completed before proceeding. Both of these cases describe a potential data race, which we need to avoid.” (ArtConc 2009)

“For example, if you are building a house, putting up the roof involves attaching the rafters to the walls, laying down the decking, and applying the shingles. The dependence between these three tasks is one of execution order. You can't put down shingles until the decking is there, and you can't nail down the decking unless you have the rafters in place. So, instead of hiring three teams to do these three tasks in parallel, you can hire one roofing crew to do all three in the order required (there is parallelism within each of the roofing steps, plus the act of putting on the roof is independent of installing the electrical wiring, the plumbing, and putting up drywall).” (ArtConc 2009)

“To satisfy an execution order constraint, you can schedule tasks that have an order dependency onto the same thread and ensure that the thread executes the tasks in the proper sequence. The serial code was written with the order dependency already taken care of. So, the serial algorithm should guide the correct decomposition of the computations into tasks and assignment of those tasks to threads. Still, even after grouping tasks to execute on threads, there may be order constraints between threads. If regrouping tasks to threads is not an option or will severely hurt performance, you will be forced to insert some form of synchronization to ensure correct execution order.” (ArtConc 2009)

“The second type of dependency is data dependency. Identifying potential data dependencies can be straightforward: look for those variables that are featured on the left side of the assignment operator. Data races require that the variable in question have at least one thread that is writing to that variable. Check for any assignment of values to the same variable that might be done concurrently as well as any updates to a variable that could be read concurrently. Of course, using pointers to reference memory locations can make the identification process trickier. There are tools (covered in Chapter 11) that can assist in finding nonobvious data dependencies in your code.” (ArtConc 2009)

“Solving data dependencies can be more complicated than solving execution order dependencies. In the latter, the sequence of execution within the serial code gives us the solution; in the former, the serial code being written with the assumption of a single-threaded execution leads to the problem.” (ArtConc 2009)

“If you're fortunate enough to have code with no dependencies, sometimes called enchantingly parallel, you can skip down to the next section about scheduling. If you're like the rest of us and aren't so fortunate, examine your dependencies to see if they might be removable (recurrences and induction variables) or separable (reduction computations). There are remedies, in many cases, for these dependencies. Those options, as well as a description of these two dependency classes, are discussed in What's Not Parallel.” (ArtConc 2009)

“Now let's go over the two easiest solutions for simple data conflicts between tasks. These are using local variables and adding mutual exclusion code. Consider the pseudocode given in Example 2-1 (if you live in Chicago or New York, substitute some other place like Los Angeles when computing population differences).” (ArtConc 2009)

Example 2-1. Pseudocode with shared variable

popDiff = abs(Population[MyTown] - Population[NewYork]);
DoSomething(popDiff, MyTown, NewYork);
popDiff = abs(Population[MyTown] - Population[Chicago]);
DoSomething(popDiff, MyTown, Chicago);

“If we know that concurrent calls to the DoSomething() function are thread-safe (i.e., there are no side effects or dependencies when there is more than one concurrent call to the function), we can assign the first two lines of the example to one thread and the last two lines to a second thread. This will create a data race on the popDiff variable. Since this variable is used only as a temporary or “work” variable, allocating a local copy to each thread will eliminate the problem.” (ArtConc 2009)

“Depending on the threading model that you are using to implement your concurrent algorithm, there can be several ways to create variables that are accessible only to a given thread. In all cases, if you declare a variable within a function that is executed by threads, those variables will be local to the calling thread when they are allocated. Explicitly allocating space from a thread's stack (say with the alloca() function) is another way. OpenMP has the private clause to generate local copies of variables for each thread within the parallel region to which the clause is attached. Both Windows and POSIX threads include a thread-local storage (TLS) API that will allocate memory to hold copies of variables, one per thread, and allow threads to have access only to the copy that is earmarked for that thread.” (ArtConc 2009)
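As an illustration, here is a minimal sketch (not from the book) of the local-copy approach applied to Example 2-1 with POSIX threads: each thread's popDiff lives on its own stack, so the data race disappears. The thread function CityTask(), the extern declarations, and the two-thread launch are illustrative assumptions.

#include <pthread.h>
#include <stdlib.h>

extern int Population[];                 /* data assumed from Example 2-1 */
extern int MyTown, NewYork, Chicago;     /* town indices, assumed         */
extern void DoSomething(int diff, int townA, int townB);

void *CityTask(void *arg)
{
    int otherTown = *(int *)arg;         /* which comparison this thread does */
    int popDiff;                         /* local copy on this thread's stack */

    popDiff = abs(Population[MyTown] - Population[otherTown]);
    DoSomething(popDiff, MyTown, otherTown);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, CityTask, &NewYork);
    pthread_create(&t2, NULL, CityTask, &Chicago);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}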

Note

“The TLS API is pretty “heavy.” I wouldn't recommend using it for things like local work variables within a single routine. Variables that are allocated and accessed via the TLS are persistent across function scopes. If you need local copies of a variable and that variable and its contents need to be available to different functions or disparate calls to the same function executed by the thread, TLS is the mechanism that can give you the private copy and the persistence of value required.” (ArtConc 2009)

“When all else fails, when you don't have the option to make a local copy of a shared variable, when none of the transformations given in What's Not Parallel can eliminate the data dependency, the only option left is to explicitly provide mutually exclusive access to the shared variable. In most cases a simple lock, or mutex, will suffice. In some instances, you can use a less onerous atomic operation. Different threading models will have different options of synchronization objects, and different algorithms will have different protection requirements.” (ArtConc 2009)

“It should go without saying that you will want to use the option that has the lowest impact on performance, since such added synchronization is overhead that was not in the original serial code. This might mean trying several possibilities and even modifying the initial algorithm or data structures to create the chance to use a better synchronization object. It is your job as the programmer to find the best option for each situation you encounter.” (ArtConc 2009)
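As a minimal sketch (not from the book, and assuming POSIX threads), mutually exclusive access to a shared variable can look like the following; shared_sum and add_partial() are hypothetical names.

#include <pthread.h>

double shared_sum = 0.0;                               /* shared by all threads */
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;  /* protects shared_sum   */

void add_partial(double my_value)
{
    pthread_mutex_lock(&sum_lock);    /* only one thread may update at a time */
    shared_sum += my_value;
    pthread_mutex_unlock(&sum_lock);
}

Where the protected update is a single arithmetic operation like this one, an atomic operation (for example, an OpenMP atomic pragma or a compiler intrinsic) would be the less onerous alternative mentioned above.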

How are the tasks assigned to threads?

“Tasks must be assigned to threads for execution. Perhaps the more correct way to say this is that threads must know which tasks to execute. In either case, you always want to assure that the amount of computation done by threads is roughly equivalent. That is, the load (of computation) is balanced per thread. We can allocate tasks to threads in two different ways: static scheduling or dynamic scheduling.” (ArtConc 2009)

Note

“Under worksharing constructs in OpenMP and the parallel algorithms of Intel Threading Building Blocks (TBB), the actual assignment of tasks to threads is done “under the covers.” The programmer can influence that assignment to some degree, though. Even if you use only one of these two threading libraries for your concurrent coding, you should still read through the advice in this section to help you better influence the task assignments in your applications.” (ArtConc 2009)

“In static scheduling, the division of labor is known at the outset of the computation and doesn't change during the computation. If at all possible, when developing your own concurrent code, try to use a static schedule. This is the easiest method to program and will incur the least amount of overhead.” (ArtConc 2009)

“The mechanics and logic of code needed to implement a static schedule will involve each thread having a unique identification number in the range of [0, N–1] for N threads. This number can be easily assigned at the time a thread is created in the order that threads are created (code that can generate unique identification numbers to threads will be part of several implementation examples in later chapters). If tasks are collections of separate, independent function calls or groups of sequential source lines, you can group those calls or code lines into tasks that are assigned to threads through a thread's ID number (e.g., through a switch statement). If tasks are loop iterations, you can divide the total number of iterations by the number of threads and assign block(s) of iterations to threads, again through the thread's ID number. You will have to add additional logic to compute the start and end values of the loop bounds in order for each thread to determine the block that should be executed.” (ArtConc 2009)
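As an illustration of the switch-statement style of assignment just mentioned, here is a minimal sketch (not from the book); taskA() through taskD() stand in for four hypothetical independent computations.

void do_my_work(int myID)       /* myID is this thread's number in [0, numThreads-1] */
{
    switch (myID) {
    case 0: taskA(); break;
    case 1: taskB(); break;
    case 2: taskC(); break;
    case 3: taskD(); break;
    }
}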

“When assigning loop iterations into blocks, you need to be sure that each thread doesn't overlap execution of an iteration assigned to another thread and that all the loop iterations are covered by some thread. You won't get the correct results if threads execute an iteration multiple times or leave out the computation of some iterations. An alternative to assigning loop iterations into blocks is to use the thread's ID number as the starting value of the loop iterator and increment that iterator by the number of threads, rather than by 1. For example, if you have two threads, one thread will execute the odd-numbered iterations and the other thread will execute the even iterations. Obviously, you will need to make adjustments to where the loop starts and how to compute the next iteration per thread if the loop iterator doesn't start at 0 and is already incremented by something other than 1. However, the implementation of setting up N threads to each do every Nth iteration will involve fewer code changes than dividing the iteration set into separate blocks.” (ArtConc 2009)
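The two loop-assignment schemes just described might look like the following minimal sketch (not from the book); myID, numThreads, N, and do_iteration() are hypothetical names assumed to be set up by the surrounding threading code.

/* Block assignment: each thread executes one contiguous range of iterations. */
int chunk = N / numThreads;
int start = myID * chunk;
int end   = (myID == numThreads - 1) ? N : start + chunk;  /* last thread takes any remainder */
for (int i = start; i < end; i++)
    do_iteration(i);

/* Interleaved assignment: thread myID executes every numThreads-th iteration. */
for (int i = myID; i < N; i += numThreads)
    do_iteration(i);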

“Static scheduling is best used in those cases where the amount of computation within each task is the same or can be predicted at the outset. If you have a case where the amount of computation between tasks is variable and/or unpredictable, then you would be better served by using a dynamic scheduling scheme.” (ArtConc 2009)

“Under a dynamic schedule, you assign tasks to threads as the computation proceeds. The driving force behind the use of a dynamic schedule is to try to balance the load as evenly as possible between threads. Assigning tasks to threads is going to incur overhead from executing the added programming logic to carry out the assignment and from having threads seek out a new task.” (ArtConc 2009)

“There are many different ways to implement a dynamic method for scheduling tasks to threads, but they all require a set of many more tasks than threads. Probably the easiest scheduling scheme involves indexing the tasks. A shared counter is used to keep track of and assign the next task for execution. When seeking a new task, a thread gains mutually exclusive access to the counter, copies the value into a local variable, and increments the counter value for the next thread.” (ArtConc 2009)
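A minimal sketch of that counter scheme (not from the book, assuming POSIX threads; numTasks, execute_task(), and worker() are hypothetical names):

#include <pthread.h>

extern int numTasks;                    /* total number of tasks, assumed     */
extern void execute_task(int taskNum);  /* does the work of one task, assumed */

int nextTask = 0;                                        /* shared task counter */
pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;   /* protects nextTask   */

void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&task_lock);
        int myTask = nextTask++;        /* grab an index and advance the counter */
        pthread_mutex_unlock(&task_lock);

        if (myTask >= numTasks)
            break;                      /* no tasks left */
        execute_task(myTask);
    }
    return NULL;
}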

“Another simple dynamic scheduling method involves setting up a shared container (typically a queue) that can hold tasks and allow threads to pull out a new task once the previous task is complete. Tasks (or adequate descriptions of tasks) must be encapsulated into some structure that can be pushed into the queue. Access to the queue must be mutually exclusive between threads to ensure that threads get unique tasks and no tasks are lost through some corruption of the shared container.” (ArtConc 2009)

“If tasks require some preprocessing before their assignment to threads, or if tasks are not all known at the outset of computation, you may need more complex scheduling methods. You can set one of your threads aside to do the preprocessing of each task or receive new tasks as they arise. If the computation threads rendezvous with this extra thread in order to receive the next task for execution, you have a boss/worker algorithm. By placing a shared container to distribute tasks between the threads preparing tasks and the threads executing the task, you get the producer/consumer method. I mentioned these methods briefly in Chapter 1.” (ArtConc 2009)

Example: numerical integration

“Now that you've seen the criteria used to define and implement a task decomposition, let's put those ideas into practice on a very simple application to compute an approximate value of the constant pi. We won't worry about the details of how to implement the concurrency with threads, but we can identify the design decisions we need to make, as well as other considerations that we need to take into account to carry through with the implementation.” (ArtConc 2009)

“Numerical integration is a method of computing an approximation of the area under the curve of a function, especially when the exact integral cannot be solved. For example, the value of the constant pi can be defined by the following integral. However, rather than solve this integral exactly, we can approximate the solution by use of numerical integration:” (ArtConc 2009)
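The integral itself is not reproduced in this excerpt; reconstructed from the rectangle-height computation in Example 2-2 (4/(1 + x*x) over the interval [0, 1]), it is the standard arctangent integral:

\pi = \int_{0}^{1} \frac{4}{1 + x^{2}} \, dx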

“The code in Example 2-2 is an implementation of the numerical integration midpoint rectangle rule to solve the integral just shown. To compute an approximation of the area under the curve, we must compute the area of some number of rectangles (num_rects) by finding the midpoint (mid) of each rectangle and computing the height of that rectangle (height), which is simply the function value at that midpoint. We add together the heights of all the rectangles (sum) and, once computed, we multiply the sum of the heights by the width of the rectangles (width) to determine the desired approximation of the total area (area) and the value of pi.” (ArtConc 2009)

“I won't create a threaded version of this code, but you're welcome to give it a try on your own based on the task decomposition discussion later.” (ArtConc 2009)

Example 2-2. Numerical integration code to approximate the value of pi

#include <stdio.h>

static long num_rects = 100000;

void main()
{
   int i;
   double mid, height, width, sum = 0.0;
   double area;

   width = 1.0/(double) num_rects;
   for (i = 0; i < num_rects; i++) {
      mid = (i + 0.5) * width;          /* midpoint of the i-th rectangle */
      height = 4.0/(1.0 + mid*mid);     /* function value at the midpoint */
      sum += height;                    /* accumulate rectangle heights   */
   }
   area = width * sum;                  /* total area approximates pi     */
   printf("Computed pi = %f\n", area);
}

“What are the independent tasks in this simple application? The computation of the height for a rectangle is independent of the computation of any other rectangle. In addition, notice that the loop that performs these calculations holds the bulk of the computation for the entire application. The loop is an execution hotspot that you should examine for independent computations.” (ArtConc 2009)

“Are there any dependencies between these tasks and, if so, how can we satisfy them? The two work variables, mid and height, are assigned values in each iteration. If each thread had a local copy, that copy could be used during execution of iterations that are assigned to a thread. Also, the iteration variable, i, is updated for each iteration. Each thread will need a local copy in order to avoid interfering with the execution of iterations within other threads. The sum variable is updated in each iteration, but since this value is used outside of the loop, we can't have each thread work with a local copy that would be discarded after the thread was done. This is a reduction, and you'll find tips on solving such situations later in the section What's Not Parallel. The quick and dirty solution would be to put a synchronization object around the line of code updating sum so that only one thread at a time will write to the variable.” (ArtConc 2009)

“How should you assign tasks to threads? With loop iterations as tasks, we can assign blocks of iterations to threads based on an assigned ID number. Alternatively, we can have threads execute alternating iterations based on the number of threads. Because there are no indexed array elements in the loop, and thus no cache issues, I would recommend the latter approach.” (ArtConc 2009)

“The final piece to consider is adding the results of all the individual loop computations together and storing them in a location that you can print from. This will depend directly on how the reduction operation on sum is handled.” (ArtConc 2009)
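Although the author leaves the threaded version as an exercise, one possible sketch (not from the book) uses an OpenMP reduction to handle the sum dependency and declares the work variables inside the loop so that each thread gets its own copies:

#include <stdio.h>

static long num_rects = 100000;

int main(void)
{
    double width = 1.0/(double) num_rects;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_rects; i++) {
        double mid = (i + 0.5) * width;        /* private work variables per thread */
        double height = 4.0/(1.0 + mid*mid);
        sum += height;                         /* partial sums combined by the reduction */
    }

    printf("Computed pi = %f\n", width * sum);
    return 0;
}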

Data Decomposition

“Before I get started on data decomposition, I want to make sure that you haven't skipped down to this section without reading the previous section on task decomposition. There is a lot of good stuff in that section and I want to make sure you've covered it and absorbed it. Many of the things that are covered there will apply directly to data decomposition, and I won't likely repeat them here. So, even if you think you will only ever be working on data decomposition solutions for the rest of your programming career, be sure to read the section on task decomposition. You'll get a better understanding of the problems that are shared between the two decomposition methods.” (ArtConc 2009)

“When you start to examine a serial application for transformation into an equivalent concurrent solution, the first feature of the computations you might identify is that the execution is dominated by a sequence of update operations on all elements of one or more large data structures. If these update computations are independent of each other, you have a situation where you can express the concurrency of your application by dividing up the data structure(s) and assigning those portions to threads, along with the corresponding update computations (tasks). This method of defining tasks based on assigning blocks of large data structures to threads is known as data decomposition. In Mattson's Patterns for Parallel Programming, this is called “geometric decomposition.”” (ArtConc 2009)

“How you divide data structures into contiguous subregions, or “chunks,” of data will depend on the type of data structure. The most common structure that falls into the data decomposition category is an array. You can divide arrays along one or more of their dimensions. Other structures that use an array as a component (e.g., graph implemented as an adjacency matrix) can be divided into logical chunks as well. It will all depend on what the computations are and how independent the processing is for each chunk.” (ArtConc 2009)

“I would add list structures to the set of decomposable data structures, but only if there is an easy way to identify and access sublists of discrete elements. In a linked list implementation this would require index pointers that reference alternate entry points into the list. For example, given a linked list of people arranged alphabetically, the first person whose name starts with a new letter of the alphabet would be referenced with an external pointer for that letter. If the concurrent version of the code needs to set up these external references as part of its overhead, make sure the amount of computation is sufficient to eclipse the additional overhead time. Otherwise, consider a different approach in either the data structure or how you implement the concurrency.” (ArtConc 2009)

“However you decide to do it, the decomposition into chunks implies the division of computation into tasks that operate on elements of each chunk, and those tasks are assigned to threads. The tasks will then be executed concurrently and each task will update the chunk associated with it. Data within an assigned chunk is readily available and safe to use, since no other tasks will be updating those elements. On the other hand, the update computations may require data from neighboring chunks. If so, we will have to share data between tasks. Accessing or retrieving essential data from neighboring chunks will require coordination between threads.” (ArtConc 2009)

“As with task decomposition, load balance is an important factor to take into consideration, especially when using chunks of variable sizes. If the data structure has a nice, regular organization and all the computations on that structure always take the same amount of execution time, you can simply decompose the structure into chunks with the same number of elements in some logical and efficient way. If your data isn't organized in a regular pattern or the amount of computation is different or unpredictable for each element in the structure, decomposing the structure into tasks that take roughly the same amount of execution time is a much less straightforward affair. Perhaps you should consider a dynamic scheduling of chunks to threads in this case.” (ArtConc 2009)

“The next sections outline the key elements that every successful data decomposition solution must account for. It will also address some thoughts on how to deal with sharing of neighboring data to best assure a load balance across computations on data chunks. The three key elements you need to consider for any data decomposition design are:” (ArtConc 2009)

How should you divide the data into chunks?

How can you ensure that the tasks for each chunk have access to all data required for updates?

How are the data chunks assigned to threads?

Each of these elements is covered in more detail in the following sections.

How should you divide the data into chunks?

“Partitioning the global data structure into chunks is at the heart of a data decomposition design. The mechanics for doing this will depend mainly on the type of data structure you are decomposing. For concreteness, I'll deal with one- and two-dimensional arrays as examples. The ideas here can be applied to other structures with a little ingenuity on your part.” (ArtConc 2009)

“Since each chunk of data will have an associated task, many of the same criteria that we had when defining tasks for task decomposition can be applied to the chunks of data decomposition. Specifically, make sure you have at least one chunk per thread (more is probably better) and ensure that the amount of computation that goes along with that chunk is sufficient to warrant breaking out that data as a separate chunk (now, aren't you glad you read the task decomposition section before starting into this section?).” (ArtConc 2009)

“With arrays of elements, you can divide the data into chunks at the individual element level, at the row or column level, as groups of rows or columns, or blocks of nonoverlapping subranges of rows and columns. Figure 2-2 shows a 4×4 array divided into chunks in several different ways.” (ArtConc 2009)

Figure 2-2. Array decomposition examples

“The amount of computation required by the associated task will be in direct proportion to the number of elements in a chunk. As stated before, this is known as the granularity of the computation and is exactly like what we had when we were considering how to define tasks. However, data decompositions have an additional dimension that you must consider when dividing data structures into chunks. This other dimension is the shape of the chunk.” (ArtConc 2009)

“The shape of a chunk determines what the neighboring chunks are and how any exchange of data will be handled during the course of the chunk's computations. Let's say we have the case that data must be exchanged across the border of each chunk (the term exchange refers to the retrieval of data that is not contained within a given chunk for the purpose of using that data in the update of elements that are in the local chunk). Reducing the size of the overall border reduces the amount of exchange data required for updating local data elements; reducing the total number of chunks that share a border with a given chunk will make the exchange operation less complicated to code and execute.” (ArtConc 2009)

“Large granularity can actually be a detriment with regard to the shape of a chunk. The more data elements there are within a chunk, the more elements that may require exchange of neighboring data, and the more overhead there may be to perform that exchange. When deciding how to divide large data structures that will necessitate data exchanges, a good rule of thumb is to try to maximize the volume-to-surface ratio. The volume defines the granularity of the computations, and the surface is the border of chunks that require an exchange of data. Figure 2-3 illustrates two different divisions of the same 4×8 data structure into two chunks. Both chunks have the same number of elements (16), but the scheme on the left has eight elements that share a border, whereas the scheme on the right has only four. If updates to each chunk relied on accessing data in the other chunk across the border, the division shown on the right would require fewer overall exchanges.” (ArtConc 2009)

Figure 2-3. Volume-to-surface ratio examples

“Irregular shapes may be necessary due to the irregular organization of the data. You need to be more vigilant with chunks of irregular shapes to ensure that a good load balance can be maintained, as well as a high enough granularity within the chunk to lessen the impact of unavoidable overheads.” (ArtConc 2009)

“You may need to revise your decomposition strategy after considering how the granularity and shape of the chunks affect the exchange of data between tasks. The division of data structures into chunks influences the need to access data that resides in another chunk. The next section develops ideas about accessing neighboring data that you should consider when deciding how to best decompose a data structure into chunks.” (ArtConc 2009)

How can you ensure that the tasks for each chunk have access to all data required for updates?

“The updating of elements within a data chunk is typically the overwhelming source of computation within concurrent algorithms that use a data decomposition scheme. Not unheard of, but not as interesting, are applications that only read data from large structures. Even in these cases, before an application has the ability to read it, the data must be created and the elements of the data structure holding any data must be updated. I suspect that for applications that simply input the data and then use it only for reference to support other computations, task decomposition would be the better method for concurrent design.” (ArtConc 2009)

“If a chunk itself contains all the data required to update the elements within the chunk, there is no need to coordinate the exchange of data between tasks. A more interesting case occurs when some data that is required by a given chunk is held within a neighboring chunk. In that case, we must find efficient means to exchange data between these nearby chunks. Two methods for doing this come to mind: copy the data from the nearby chunk into structures local to the task (thread), or access the data as needed from the nearby chunk. Let's look at the pros and cons of each of these.” (ArtConc 2009)

“The most obvious disadvantage for copying the necessary data not held in the assigned chunk is that each task will require extra local memory in order to hold the copy of data. However, once the data has been copied, there will be no further contention or synchronization needed between the tasks to access the copies. Copying the data is best used if the data is available before the update operation and won't change while being copied or during the update computations. This will likely mean some initial coordination between tasks to ensure that all copying has been done before tasks start updating.” (ArtConc 2009)

“The extra local memory resources that are allocated to hold copied data are often known as ghost cells. These cells are images of the structure and contents of data assigned to neighboring chunks. For example, consider the division of data shown in Figure 2-3 (b). If the update computation of an individual element required the data from the two elements on either side of it in the same row, the whole column from the neighboring chunk bordering the split would need to be accessible. Copying these data elements into ghost cells would allow the element to access that data without interfering in the updates of the neighboring chunk. Figure 2-4 shows the allocated ghost cells and the copy operation performed by each thread to fill those cells.” (ArtConc 2009)

“Another factor to consider when thinking about copying the required data is how many times copying will be necessary. This all depends on the nature of the update computation. Are repeated copies required for, say, an iterative algorithm that refines its solution over multiple updates? Or is the data copy only needed once at the outset of the computation? The more times the copy has to be carried out, the greater the overhead burden will be for the update computations. And then there is the matter of the amount of data that needs to be copied. Too many copy operations or too much data per copy might be an indicator that simply accessing the required data directly from a neighboring chunk would be the better solution.” (ArtConc 2009)

Figure 2-4. Using ghost cells to hold copied data from a neighboring chunk
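A minimal sketch (not from the book) of the copy operation in Figure 2-4, assuming each thread owns a block of columns and keeps a one-column ghost buffer; grid, ghost, NROWS, and myLastCol are hypothetical names:

/* Fill the ghost column with the neighboring chunk's border column. */
for (int row = 0; row < NROWS; row++)
    ghost[row] = grid[row][myLastCol + 1];
/* A barrier would follow here so that every thread finishes copying
   before any thread starts its update computations.                 */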

“Accessing data as needed takes full advantage of shared memory communication between threads and the logic of the original serial algorithm. You also have the advantage of being able to delay any coordination between threads until the data is needed. The downside is that you must be able to guarantee that the correct data will be available at the right time. Data elements that are required but located within a neighboring chunk may be in the process of receiving updates concurrently with the rest of the elements in the neighboring chunk. If the local chunk requires the “old” values of nonlocal data elements, how can your code ensure that those values are not the “new” values? To answer this question or to know whether we must even deal with such a situation, we must look at the possible interactions between the exchange of data from neighboring chunks and the update operation of local chunk elements.” (ArtConc 2009)

“If all data is available at the beginning of tasks and that data will not change during the update computation, the solution will be easier to program and more likely to execute efficiently. You can either copy relatively small amounts of data into ghost cells or access the unchanging data through shared memory. In order to perform the copy of nonlocal data, add a data gathering (exchange) phase before the start of the update computations. Try to minimize the execution time of the data-gathering phase, since this is pure overhead that was not part of the original serial code.” (ArtConc 2009)

“If nonlocal data will be accessed (or copied) during update computations, you will need to add code to ensure that the correct data will be found. Mixing exchange and update computations can complicate the logic of your application, especially to ensure correct data is retrieved. However, the serial application likely had this requirement, too, and the solution to the need for accessing correct data concurrently should simply follow the serial algorithm as much as possible.” (ArtConc 2009)

“For example, if you are modeling the distribution of heat from a source through a metal plate, you can simulate the plate by a two-dimensional array of current temperatures at discrete spatial locations. At each time step of the computation, the new value of each discrete location is the average of the current temperature and the temperature of some set of neighboring cells. Since this calculation will update the current temperature of a cell and skew the results of other cells that use this cell's value, the serial code will have a new and old plate array. The values in the old array are used to update the new values. The roles of the plate arrays are switched for the next time iteration. In the concurrent version of this application, the old plate is read (only) to update the current temperatures in the new array. Thus, there is no need for synchronization to access old data and there should be minimal changes to the serial source in order to implement this concurrent solution.” (ArtConc 2009)
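A minimal sketch (not from the book) of one time step of that scheme, assuming a four-neighbor average and hypothetical names oldT, newT, NROWS, and NCOLS:

/* newT is computed purely from the read-only oldT array. */
for (int i = 1; i < NROWS-1; i++)
    for (int j = 1; j < NCOLS-1; j++)
        newT[i][j] = (oldT[i][j] + oldT[i-1][j] + oldT[i+1][j] +
                      oldT[i][j-1] + oldT[i][j+1]) / 5.0;
/* The roles of oldT and newT are swapped before the next time step. */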

How are the data chunks (and tasks) assigned to threads?

“As with task decomposition, the tasks that are associated with the data chunks can be assigned to threads statically or dynamically. Static scheduling is simplest since the coordination needed for any exchange operations will be determined at the outset of the computations. Static scheduling is most appropriate when the amount of computations within tasks is uniform and predictable. Dynamic scheduling may be necessary to achieve a good load balance due to variability in the computation needed per chunk. This will require (many) more tasks than threads, but it also complicates the exchange operation and how you coordinate the exchange with neighboring chunks and their update schedules.” (ArtConc 2009)

“Being the sharp reader that you are, you have no doubt noticed that in most of the discussion over the last four pages or so I have used the term “task” rather than “thread.” I did this on purpose. The tasks, defined by how the data structures are decomposed, identify what interaction is needed with other tasks regardless of which thread is assigned to execute what task. Additionally, if you are using a dynamic schedule of tasks, the number of tasks will outnumber the total number of threads. In such a case, it will not be possible to run all tasks in parallel. You may then come up against the situation where some task needs data from another task that has not yet been assigned to execute on a thread. This raises the complexity of your concurrent design to a whole other level, and I'm going to leave it to you to avoid such a situation.” (ArtConc 2009)

Example: Game of Life on a finite grid

“Conway's Game of Life is a simulation of organisms that live and die within cells arranged as a grid. Each grid cell is either empty or hosts a living organism. Occupied cells are called “alive,” while empty cells are “dead.” (See Figure 2-5 for an example of a portion of a grid with live and dead cells.) The simulation charts the births and deaths of the organisms through successive time steps or generations. For more information, see Wheels, Life, and Other Mathematical Amusements by Martin Gardner (Freeman & Co., 1983) or the tens of millions of words written and pictures drawn since the first computer simulation was written.” (ArtConc 2009)

Figure 2-5. Game of Life example; black dots represent “live” cells

“The state of cells can change from one generation to the next according to the following rules:” (ArtConc 2009)

The neighbors of a cell are the eight cells that border the given cell horizontally, vertically, and diagonally.

“If a cell is alive but has one or fewer neighboring cells alive, or has four or more neighboring cells alive, the cell will die of starvation or overcrowding, respectively, in the next generation.” (ArtConc 2009)

Living cells with two or three living neighbors will remain alive in the next generation.

“If a cell is dead and has exactly three neighboring cells that are alive, there will be a birth and the dead cell will become alive in the next generation. All other dead cells will remain dead in the next generation.” (ArtConc 2009)

All births and deaths take place at exactly the same time. This means that a cell that will be killed is still counted to help give birth to a neighboring cell.

In theory, the grid should be unbounded, but with only finite memory available in computer platforms, the size of the grid “universe” is usually restricted. Still, a very large two-dimensional array can hold current and successive generations of a simulation. The computation to decide whether any cell lives, dies, or is born is independent of the computation for any other cell. Thus, with the very large two-dimensional array and the update of elements in that array, the Game of Life simulation is an excellent candidate for a data decomposition design.

Example 2-3 shows serial code for updating the status of each cell into the next generation from the current generation. I assume that Grid is a data structure that is essentially a two-dimensional array that can hold the values of the user-defined constants ALIVE and DEAD. The grid array has rows indexed by 0 to N+1 and columns indexed by 0 to M+1. The extra rows and columns, whose cells are not considered for update in the given routine, can be thought of as boundary cells of the “universe” that are always dead (think of it as a wall of poison that can't be breached). By adding these boundary cells, I remove the need to deal with the grid edges as a special case. After computing the neighbor count for a cell, the disposition of that cell in the next generation is decided based on the rules just given. As with the task decomposition example, I won't create a threaded version of the code, but you're welcome to give it a try on your own based on the discussion that follows.

Example 2-3. Game of Life routine to compute next generation configuration from current

void computeNextGen (Grid curr, Grid next, int N, int M)
{
   int count;

   for (int i = 1; i <= N; i++) {
      for (int j = 1; j <= M; j++) {
         count = 0;
         if (curr[i-1][j-1] == ALIVE) count++;   // NW neighbor
         if (curr[i-1][j]   == ALIVE) count++;   // N neighbor
         if (curr[i-1][j+1] == ALIVE) count++;   // NE neighbor
         if (curr[i][j-1]   == ALIVE) count++;   // W neighbor
         if (curr[i][j+1]   == ALIVE) count++;   // E neighbor
         if (curr[i+1][j-1] == ALIVE) count++;   // SW neighbor
         if (curr[i+1][j]   == ALIVE) count++;   // S neighbor
         if (curr[i+1][j+1] == ALIVE) count++;   // SE neighbor

         if (count <= 1 || count >= 4)
            next[i][j] = DEAD;
         else if (curr[i][j] == ALIVE && (count == 2 || count == 3))
            next[i][j] = ALIVE;
         else if (curr[i][j] == DEAD && count == 3)
            next[i][j] = ALIVE;
         else
            next[i][j] = DEAD;
      }
   }
   return;
}

What is the large data structure in this application and how can you divide it into chunks? I've already given away the fact that we can divide the grid into chunks whose tasks can be assigned to threads. What is the best way to perform the division? Refer to Figure 2-5, but think bigger. Groups of rows or groups of columns or blocks seem the most natural divisions. You should anticipate the question of which exchanges of data each potential division strategy will entail and let that influence your answer.

Other factors to consider are the layout of the grid in memory, which may be anticipated within the serial code already, and how each division scheme might adversely affect the given layout. The amount of new code and line changes needed to transform the source code can be a factor influencing your decision. Dividing by rows would require modifications to the i loop, dividing by columns would require modifications to the j loop, and using a block division would require you to modify both the i and j loops.

For a chosen data decomposition, what exchange of data between tasks is required and how will this be accomplished? Since you need access to all eight neighbors of a given cell, if any neighbor cells are in a different chunk, there will need to be some form of data exchange. However, the code only reads the curr grid when counting neighbors; that grid is never written during the update. Each task can simply access the data when it is needed without fear of getting the wrong value or causing a data race with another task. On the other hand, if the rules of computing the next generation allowed changes within the same grid that was being used to count neighbors, it would be better to copy data from other chunks into local ghost cells; without such copies, the results of your concurrent code could differ from the serial version.

How should you distribute the data chunks to tasks? Since you can determine the amount of computation for any grid cell and any collection of cells at design time, a static distribution will work well. In fact, I would recommend that the number of chunks be equal to the number of threads available.

What are the dependencies between tasks? This is a question that comes from our discussion of task decomposition and goes beyond the exchange of data needed between chunks. The answers to this question are derived from the serial code and the modifications necessary (from all the previous design decisions you've made). The count is a temporary work variable, so a local copy per thread takes care of that. Each thread will also need a copy of the two loop iteration variables. The global arrays are safe, since curr is only read and the updates to the elements in the next grid will not overlap.
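The author leaves the threaded version as an exercise; one possible sketch (not from the book) divides the grid by rows with an OpenMP worksharing loop, which makes i, j, and count private automatically because they are declared inside the parallel loop. computeNextGenThreaded is a hypothetical name; the loop body is the same as in Example 2-3.

void computeNextGenThreaded(Grid curr, Grid next, int N, int M)
{
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= N; i++) {          /* each thread gets a block of rows */
        for (int j = 1; j <= M; j++) {
            int count = 0;
            /* ...the same eight neighbor tests and update rules
               as in Example 2-3 go here...                       */
        }
    }
}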

Concurrent Design Models Wrap-Up

There often aren't clear, discrete steps to follow when developing a task or data decomposition solution. When considering how to answer the design element questions for your chosen design model, some of the decisions you make will be based on answers to questions that follow and answers that have come before. I hope you got that sense from the discussion of each example given. Even though I had to write down the questions sequentially, you may need to consider more than one thing at a time to devise an efficient concurrent solution.

Design Factor Scorecard

For most of the algorithms that are discussed and analyzed in Chapters 6 through 10, I will include a “Design Factor Scorecard” to discuss the concurrent algorithm and its implementation on four key factors. These factors are taken from Mattson's Patterns for Parallel Programming where they are used as criteria concurrent programmers need to keep in mind when designing and implementing parallel applications through the parallel programming patterns that are presented. For our purposes, I've redefined these terms slightly from how Mattson et al. originally used them. My interpretations of these terms are given shortly.

It is my contention that programmers of concurrent algorithms should keep each of these factors in mind, along with their relative importance and the tradeoffs involved in stressing one factor over another, when designing and implementing concurrent algorithms. The merits and tradeoffs of the algorithms and code given in later chapters will be presented in the Design Factor Scorecard section that follows the description of each algorithm.

Efficiency

Your concurrent applications must run quickly and make good use of processing resources. With regard to a concurrent algorithm, efficiency will examine the overhead computations that you must add to ensure a correct execution, how alternative arrangements of threads or organizations of tasks might work better or worse, and what other problems there could be with the performance of the threaded application.

Simplicity

The simpler your concurrent algorithms are, the easier they will be to develop, debug, verify, and maintain. In terms of concurrent code based on a serial version, discussions of simplicity will focus on how much extra code you have to add to achieve a concurrent solution and how much of the original structure of the serial algorithm remains.

Portability

One of the goals of this book is to be as agnostic as possible with regard to available threading models and which models are used to implement the solution algorithms presented. Portability discussions will examine the tradeoffs that you could encounter if you used a different threading model from the one used in this text. While this book is primarily dedicated to the design and exploration of multithreaded codes, one of the options discussed under portability will be distributed-memory variations of the algorithms.

Scalability

Because the number of cores will only increase as time passes, your concurrent applications should be effective on a wide range of numbers of threads and cores, and sizes of data sets. Scalability refers to what you should expect with regard to how a given concurrent algorithm will behave with changes in the number of cores or size of data sets.

My Two Cents' Worth on the Factors

For me, scalability is the most important of these four factors, with efficiency a close second. This means that I will typically try to design a concurrent algorithm that will maintain its level of performance as the number of cores and threads increases to the detriment of a simpler or more portable algorithm. In order to gain that scalability, the algorithm and its implementation must be efficient, so that is my secondary goal.

There have been many times when the scalability of a concurrent application I've written has peaked and flattened out. This was usually due to the requirements of the algorithm and the paraphernalia provided by the threading model being used. In these cases, I would ask myself if it is worth the extra work to try to discover an alternative (which may not exist) that might scale better, or to rewrite the whole thing in terms of a different threading or design model. These are some of the tradeoffs you face when tackling concurrent algorithms.

What's Not Parallel

In the chapters and pages that follow, we're going to explore quite a few things that can be executed concurrently. Before we get to those, though, I want to impress upon you that not everything can be executed in parallel. I don't want you wasting time beating your head against the wall, endlessly poring over this book and others on parallel algorithms, or pestering friends and colleagues (or me) with phone calls in the wee hours of the morning trying to enlist them in your search for a parallel solution, especially when there isn't one.

Note

One of the more famous illustrations of a situation that cannot be made parallel is cited in Fred Brooks's 1995 book, The Mythical Man-Month: Essays on Software Engineering (Addison-Wesley Professional). The nine-month gestation period for a human is a serial process — you can't get a baby in one month by assigning nine women to the job.

On the other hand, if you wanted to raise a baseball team of players from cradle to Wrigley Field in the shortest amount of time, you could employ 9 women (10 if you want a designated hitter). This would give you a newborn starting lineup after nine months. Asking one woman to do all the work would require about nine years, barring twins and such, and the pitcher (firstborn) would be starting her Little League career when the right fielder (ninth-born) was just arriving.

Algorithms with State

The first example of code that cannot be executed concurrently is an algorithm, function, or procedure that contains state: something kept around from one execution to the next. For example, the seed to a random number generator or the file pointer for I/O would be considered state. Code that carries state cannot be made directly concurrent, so whenever you encounter it, a red flag should go up when you are considering concurrency. However, you may be able to take steps to render the code thread-safe, which may be sufficient.

You can make state-filled code thread-safe by adding some form of synchronization or writing the code to be reentrant (i.e., it can be reentered without detrimental side effects while it is already running). The former option will serialize all concurrent executions of the synchronized code (and add unnecessary overhead when not called concurrently), while the latter option may not be possible if the update of global variables is part of the code.
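
Using the random number seed from above as the state, a minimal sketch of the synchronization option might look like this (rand_r is the POSIX reentrant generator; the function and variable names are mine, not from the book):

#include <pthread.h>
#include <stdlib.h>

// Hedged sketch: the lock makes the call thread-safe, but every call,
// concurrent or not, now pays for acquiring and releasing the mutex.
static unsigned int shared_seed = 12345;
static pthread_mutex_t seed_lock = PTHREAD_MUTEX_INITIALIZER;

int SafeRandom(void)
{
    pthread_mutex_lock(&seed_lock);
    int r = rand_r(&shared_seed);   // state update protected by the lock
    pthread_mutex_unlock(&seed_lock);
    return r;
}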

If the variable(s) holding the state do not have to be shared between threads, you can use thread-local storage (TLS) to overcome this dependence. With TLS, each thread has a separate copy of the state variable(s), accessed in exactly the same way across all threads, which ensures there are no data races on them. Thus, each thread can have a different random number seed and use the same code to generate a separate stream of numbers that will not interfere with any other thread's seed.
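
A minimal sketch of the TLS option, assuming C11's _Thread_local and POSIX rand_r (again, the names are illustrative):

#include <stdlib.h>

// Hedged sketch: each thread has its own copy of tls_seed, so the same
// code yields an independent stream per thread with no data race.
static _Thread_local unsigned int tls_seed = 1;

void SeedThreadRandom(unsigned int s) { tls_seed = s; }

int ThreadRandom(void)
{
    return rand_r(&tls_seed);   // touches only this thread's seed
}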

Recurrences

Recurrence relations within loops feed information forward from one iteration to the next. Prime examples of this are time-stepping loops and convergence loops. No matter how many tea leaves we read, tarot cards we consult, or Magic Eight Ball apps we write, we can't parcel out future time steps to multiple threads for concurrent execution.

A simple code example of a recurrence is given in Example 2-4. The recurrence shown is the read access of the a[i-1] element that was computed in the previous iteration.

Example 2-4. Recurrence relation on array access

for (i = 1; i < N; i++) a[i] = a[i-1] + b[i];

Unfortunately, most recurrences cannot be made concurrent. Prefix sum is a special case of a recurrence that can be made to run concurrently (see Chapter 6 for more details on concurrent algorithms for prefix scan).

If you've got a recurrence relationship that is a hotspot in your code, look for a point “higher” in the call tree that would include execution of the recurrence, and thread there, where possible.

Induction Variables

Induction variables are variables that are incremented on each trip through a loop. Most likely, these are index variables that do not have a one-to-one relation with the value of the loop iterator variable. Example 2-5 shows a code segment with two induction variables, i1 and i2.

Example 2-5. Induction variables

i1 = 4;
i2 = 0;
for (k = 1; k < N; k++) {
    B[i1++] = function1(k,q,r);
    i2 += k;
    A[i2] = function2(k,r,q);
}

As this code stands, even if the calls to function1() and function2() are independent, there's no way to transform this code for concurrency without a few radical alterations to the serial source. Specifically, you would need to replace the array index increment expressions with a calculation based solely on the value of the loop iterator variable.

Without much strain on your brain, I'm sure you can see that you could rewrite the first statement in the loop as shown in Example 2-6.

Example 2-6. Solution for first induction variable increment

B[k+3] = function1(k,q,r);

The second is a bit trickier. Take a moment to see whether you can figure it out on your own. I've put the solution at the end of the next paragraph.

A worse case than the code shown in Example 2-5 is an induction variable with a conditional increment. As an example, say you want to search through a list and copy out all items that have some property, such as all recording artists who released more albums than Pink Floyd. We'll assume that the list is implemented with some random-access data structure; if we used a linked or pointer-based data structure instead, we would already have the problem of how to access the data efficiently and concurrently. Now, the loop index variable runs through the full set of data items, while the induction variable increments only when we find an item that matches our criterion, in order to store that item in a second array (indexed by the induction variable). There is no closed-form relation between the value of the loop index and the value of the induction variable.
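
A hedged sketch of that conditional increment (numItems, item, matches, and MeetsCriterion are illustrative names, not code from the book):

int idx = 0;                           // conditionally incremented
for (k = 0; k < numItems; k++) {
    if (MeetsCriterion(item[k]))       // e.g., out-released Pink Floyd
        matches[idx++] = item[k];      // no closed form ties idx to k
}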

Did you figure out how to set up the second induction variable form from Example 2-5? It is simply the sum of the integers from 1 up to the current value of k. Thus, you can use the code in Example 2-7 in a concurrent version of the loop.

Example 2-7. Solution for second induction variable increment

i2 = (k*k + k)/2;
A[i2] = function2(k,r,q);

This is a contrived example, of course. Your situations may not be so neat and clean, though I hope they are.

Reduction

Reductions take a collection (typically an array) of data and reduce it to a single scalar value through some combining operation. To remove this dependency, the operation must be both associative and commutative, such as addition, multiplication, or some logical operations. The loop in the code fragment shown in Example 2-8 will find the sum of all elements of the c array as well as the largest (maximum) element in the array.

Example 2-8. Reduction code

sum = 0;
big = c[0];
for (i = 0; i < N; i++) {
    sum += c[i];
    big = (c[i] > big ? c[i] : big);  // maximum element
}

At first glance, the loop cannot be made to execute concurrently. Each element, in turn, is added to the running total and compared to the largest element found so far, replacing that largest found element as needed. This is all done in the order of the incrementing index variable i. However, notice that the results would be exactly the same (within limits of rounding and truncation) if the loop were run in reverse order. (Earlier in this chapter, I mentioned running a loop in reverse order as a good initial test of whether the loop may be capable of concurrent execution.)

Taking advantage of the associativity of the operator(s) involved will allow you to create a concurrent version of the reduction algorithm. Divide the loop iterations among the threads to be used and simply compute partial results (sum and big in the preceding example) in local storage. Next, combine each partial result into a global result, taking care to synchronize access to the shared variables. Of course, if you're threading your loop with OpenMP or Intel TBB, you just need to make use of the reduction clause or parallel_reduce algorithm.
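
As a minimal sketch of that OpenMP option applied to Example 2-8 (the max reduction requires OpenMP 3.1 or later):

sum = 0;
big = c[0];
#pragma omp parallel for reduction(+:sum) reduction(max:big)
for (i = 0; i < N; i++) {
    sum += c[i];                      // each thread sums into a private copy
    big = (c[i] > big ? c[i] : big);  // each thread tracks a private maximum
}
// The OpenMP runtime combines the private partial results at the end.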

Loop-Carried Dependence

The final example of code that poses a problem to our efforts to write concurrent applications is known as loop-carried dependence. This dependence occurs when results of some previous iteration are used in the current iteration. Typically, this situation will be evidenced by references to the same array element on both the left- and righthand sides of assignments and a backward reference in some righthand side use of the array element. Obviously, the iterations of such loops are not completely independent. Example 2-9 shows a code fragment that updates corresponding elements of the a and b arrays, but the update of a elements relies on previously updated elements from the b array.

Example 2-9. Loop-carried dependence code

for (k = 5; k < N; k++) {
    b[k] = DoSomething(k);
    a[k] = b[k-5] + MoreStuff(k);
}

Dividing such loop iterations into tasks presents the problem of requiring extra synchronization to ensure that the backward references have been computed before they are used in computation of the current iteration. Recurrence is a special case of a loop-carried dependence where the backward reference is the immediate previous iteration. There is no way to efficiently execute such loop iterations concurrently, since waiting for the backward references to be resolved can require a hefty amount of synchronization.

For example, if your backward reference spans five iterations (as in Example 2-9), you can divide up the loop iterations into chunks with five iterations in each chunk. Once the first iteration of the first chunk has completed, the first iteration in the second chunk can start, since the dependence of the first iteration of the second chunk (iteration 6) has been satisfied. You can daisy-chain loop chunks and threads like this, but it will all need some major code modifications and synchronization to ensure that all prerequisite dependences have been satisfied before execution can start on each separate iteration.

Not-so-typical loop-carried dependence

The loop in the code fragment given in Example 2-10 cannot be made parallel because wrap is carried from one iteration to the next. This is a loop-carried dependence that doesn't follow the typical pattern of obvious backward references, since the backward references are “hidden” in the wrap variable.

Example 2-10. Atypical loop-carried dependence

wrap = a[0] * b[0];
for (i = 1; i < N; i++) {
    c[i] = wrap;
    wrap = a[i] * b[i];
    d[i] = 2 * wrap;
}

Fortunately, you can restructure this simple case to define wrap before use in each iteration and create a loop whose iterations can be executed concurrently. This is possible because you can assign the proper value of wrap based solely on the value of the loop iterator variable. Example 2-11 shows the results of this code restructuring.

Example 2-11. Modified loop-carried dependence

for (i = 1; i < N; i++) {
    wrap = a[i-1] * b[i-1];
    c[i] = wrap;
    wrap = a[i] * b[i];
    d[i] = 2 * wrap;
}

“Disregarding the code used to implement the loop repetition (initializing, incrementing, and testing the loop iterator), you will notice that the modified code is now executing 4N statements, as opposed to the 3N+1 needed in Example 2-10.” (ArtConc 2009)

“Rewriting an existing algorithm to something less efficient in order to get a better chance of concurrency may be necessary. Don't use something that is too far afield of the original, though. Less efficient serial algorithms will tend to add overhead (as seen when comparing the number of statements in the loop bodies in the previous examples).” (ArtConc 2009)
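
As a final hedged sketch (mine, not the book's), Example 2-11 could be threaded with OpenMP once wrap is made private to each thread, since every iteration now computes wrap solely from the loop iterator i:

#pragma omp parallel for private(wrap)
for (i = 1; i < N; i++) {
    wrap = a[i-1] * b[i-1];   // recomputed locally; no value carried over
    c[i] = wrap;
    wrap = a[i] * b[i];
    d[i] = 2 * wrap;
}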
