The Art of Concurrency - Chapter 11
“This chapter mentions some debugging and performance tools that you can use on threaded applications. I haven't dwelled on issues of correctness or performance, except in cases that might be obvious within the development decisions made on the codes presented. As the complexity of the code increases, the use of software tools makes the tracking down and elimination of bugs and performance problems much easier.” (ArtConc 2009)
“The set of tools covered in this chapter is certainly not exhaustive. I expect many other tools to be developed and released after publication of this text. The longevity of several tools presented guarantees that they should still be available by the time you read this, though some of the names of tools may change.” (ArtConc 2009)
“To again avoid looking like a corporate shill, I've tried to give only the barest details on these tools (and keep the marketspeak to a minimum). I can't vouch for the accuracy of any of the commands or details about current versions of tools beyond the time of this writing. For more complete and up-to-date information, please refer to the individual tool manuals and other reference guides.” (ArtConc 2009)
“The most frequent debugging tool in use today is the printf statement. When trying to track down threading errors (e.g., data races, deadlock), adding such statements can cause problems to “disappear.” They're not gone; they're just hiding under the altered execution order of the new application. Traditional debuggers can also mask threading errors. There are better tools to find threading errors and I've mentioned one below. However, if you don't divide up a loop just right, or if you mess up a conditional expression when transforming your serial code to a concurrent version, or if you access a local copy of something when you should be using a global copy, you can use a standard debugging tool to locate these types of errors.” (ArtConc 2009)
“Two popular Linux debuggers, dbx and gdb, are thread-aware and can assist in tracking down logic errors that aren't related directly to the threaded implementation of the code.” (ArtConc 2009)
“In dbx, the thread subcommand displays and controls user threads. By itself, this command displays information about all user threads. Optionally, you can display information about specific threads by adding thread numbers as parameters. You can hold and release thread execution using thread hold and thread unhold, respectively. Both subcommands apply to all threads if no parameters are given, or to the chosen threads with the given thread numbers. To examine the current status of a thread's execution with print, registers, and where, set the current thread by first issuing the command thread current <threadnumber>. To print a list of the threads in the run, suspended, wait, or terminated states, use the run, susp, wait, and term flags, respectively, on the thread subcommand. The mutex and condition subcommands display information on mutexes and condition variables.” (ArtConc 2009)
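As a sketch of the subcommands named above (session transcript is illustrative; thread numbers are hypothetical, and exact syntax varies by dbx version, so check your dbx manual):

```
(dbx) thread                  # display information about all user threads
(dbx) thread hold 2           # hold (suspend) thread 2
(dbx) thread unhold 2         # release thread 2
(dbx) thread current 3        # make thread 3 the current thread
(dbx) where                   # stack trace of the current thread (thread 3)
(dbx) thread run              # list only threads in the run state
(dbx) mutex                   # display mutex information
(dbx) condition               # display condition variable information
```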
“The gdb debugger notifies the user when a new thread is spawned during the debug session. The thread command with a thread number parameter will set the chosen thread as the current thread. All commands requesting information on the program are executed within the framework of the current thread. Issuing an info threads command displays the current status of each thread within the program. This includes the gdb-assigned thread number, the system's thread identifier, and the current stack frame of the thread. An asterisk to the left of the thread number indicates the current thread. To apply a command to other threads in addition to the current thread, use the thread apply command. This command takes a single thread number, a range of thread numbers, or the keyword all before the command that should be applied to the designated threads. You can assign breakpoints to specific threads using the break <linespec> thread <threadnumber> command.” (ArtConc 2009)
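The gdb commands described above might be used like this in practice (the file name, line number, and thread numbers are hypothetical):

```
(gdb) info threads                      # list threads; '*' marks the current thread
(gdb) thread 2                          # make thread 2 the current thread
(gdb) where                             # backtrace of the current thread (thread 2)
(gdb) thread apply all bt               # apply 'bt' to every thread
(gdb) thread apply 2-4 info registers   # apply a command to a range of threads
(gdb) break worker.c:57 thread 3        # breakpoint that triggers only in thread 3
```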
“Other debuggers that can debug multithreaded applications include the Intel Debugger (idb) and Totalview from Totalview Technologies. The Intel debugger has dbx and gdb emulation modes that implement many of the thread-specific commands available in those debuggers. Totalview can debug multiple processes that are executing within a distributed, message-passing environment through MPI (Message-Passing Interface). Also, within a chosen process, you can select, examine, and control multiple threads through the Totalview GUI.” (ArtConc 2009)
Thread Issue Debugger: Thread Checker
“Storage conflicts are the most common errors in multithreaded applications. They can also be the hardest to isolate because of the nondeterministic scheduling of thread execution by the operating system. Running a threaded code on the same system during development and testing may not reveal any problems; yet running the same binary on another system with any slight difference that could affect the order of thread execution may yield unexpected or erroneous results on the very first execution. The Intel Thread Checker is designed to identify storage conflicts, potential deadlocks, thread stalls, and other threading errors within code threaded with Intel TBB, OpenMP, POSIX, or Windows Threads. (This tool would have saved me the two hours I spent tracking down the problem with switching arrays in the straight radix sort code development.)” (ArtConc 2009)
“As a plug-in to the VTune Performance Analyzer, Intel Thread Checker runs a dynamic analysis of a running threaded application. To find storage conflicts, for example, the tool watches all memory accesses during threaded execution. By comparing the addresses accessed by different threads and determining whether or not some form of synchronization is protecting those accesses, Thread Checker can find read-write and write-write conflicts. Dynamic analysis will catch obvious errors of accessing the variables visible to multiple threads, as well as memory locations accessed indirectly through pointers.” (ArtConc 2009)
“To watch memory accesses, Thread Checker must insert instrumentation within the application for that purpose. The instrumentation can be inserted directly into the binary file (binary instrumentation) just before the analysis is run, or it may be inserted at the time of compilation (source instrumentation) if using an Intel Compiler. Regardless of how instrumentation is done, I recommend using a debug build that includes symbols and line numbers, has no optimization, and has a binary that can be relocated (for binary instrumentation). Keeping debug symbols and line numbers will give Thread Checker the chance to point directly to source lines that have possible problems; turning off all optimization will keep the application code closest to the original source order. (If there is a threading error with optimization, but no problem without optimization, the problem is more likely in the compiler and not your threading.) Also, since the code has been instrumented, there will be an increase in binary size and memory usage during execution. More importantly, though, is that the execution time will be increased. Thus, you should use a small data set that will still run through the relevant portions of the threaded code to ensure that results can be generated in a reasonable amount of time.” (ArtConc 2009)
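For gcc on Linux, the build settings recommended above might translate into something like the following (the source and output file names are hypothetical; the exact relocation option depends on your toolchain and the tool's requirements):

```
$ gcc -g -O0 -pthread -o app_debug app.c   # -g keeps symbols and line numbers,
                                           # -O0 keeps code close to source order
```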
“It's all about performance. Concurrent and parallel execution, that is. If you can't get a faster execution time or compute with a larger data set in a fixed amount of time, you're spinning your wheels. Sometimes you need a little help to determine what might be causing your lack of performance. A performance problem might not be a direct outcome from some threading API function or synchronization object, but rather derived from the way data is distributed to threads or how threads are utilizing the finite resources on your execution platform. The tools in this section can point you in the right direction to where performance bottlenecks may be hampering your application or where you should begin your investigation of where to add (more) threading. It will still be up to you to decide on the best remedy, though.” (ArtConc 2009)
Profiling
“The purpose of profiling the execution of an application is to find the hotspots of that application. The hotspots indicate where you should focus your efforts to optimize the code to reduce the impact of negative activities. Parts of the application that take the largest percentage of execution time are good candidates for concurrency, since these hotspots are going to be the most computationally intensive portions of the serial code.” (ArtConc 2009)
“The basic Linux profiling tool, gprof, displays data that is collected during the execution of an application compiled and instrumented for profiling. The −pg flag, used in the cc command, will instrument C code. The instrumented binary will generate a profile data file (gmon.out is the default) when run. The call graph data that gprof outputs includes the amount of time spent in the code of each function and the amount of time spent in the child functions called. By default, the functions are arranged in order of execution time, from largest to smallest. This order gives you a ranked list of the functions that you should examine further for optimization or for parallelization by threads.” (ArtConc 2009)
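The gprof workflow described above boils down to three steps (file names are illustrative):

```
$ cc -pg -o app app.c           # compile with profiling instrumentation
$ ./app                         # run; writes gmon.out in the working directory
$ gprof app gmon.out            # print the flat profile and call graph
```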
“The Intel VTune Performance Analyzer has two primary collectors: sampling and call graph. During sampling runs of the application, the collector interrupts the processor when triggered after a number of microarchitectural events have occurred. Typically this will be ticks of the system clock, but you can set the trigger to many different architectural events. During the interrupt, the collector records the execution context, including the current execution address in memory, operating system process and thread ID executing, and executable module loaded at that address. Once execution of the target application has completed, the VTune Performance Analyzer GUI displays the sampling data for the entire system (including all the processes that were running during the sampling run). You can find hotspot data at the function and even source-line level (if you use the proper compilation and link flags when building the application).” (ArtConc 2009)
“The call graph collector in the VTune Performance Analyzer is similar to the gprof profiler. The target application is instrumented just before execution from within the VTune Performance Analyzer. Unlike the sampling collector, which can take samples from any process or module running during collection, the call graph collector will profile only the application of interest. The instrumentation records the caller of a function, how much time was spent within a function, and which child functions were called, as well as the time spent within those child function calls. The function timing results of call graph are available within a table format, but you can also view them as a graphical representation of the call tree structure resulting from the application run. You can find function execution time and number of callers or receivers by hovering the mouse over different parts of the displayed portions of the call tree. Red arcs highlight the call sequence that leads to the function with the longest execution time, known as the critical path. This provides a graphic indication of the flow of control of your application, including the parts you should consider for optimization or threading.” (ArtConc 2009)
Thread Profiling: Standard Profile Tool (Sample Over Time), Thread Profiler
“Besides viewing the collected sampling data as an aggregate over the course of the entire execution time that was sampled, the VTune Performance Analyzer can display the sampling results over time. That is, it can tally the number of samples taken that are associated with selected modules within discrete time units during the sampling interval. In this way, you can measure the load balance between threads of an application. If more samples are taken of some threads during a given time range than others, the former threads will have typically done more computation within that time frame. While you can deduce some load imbalances from the aggregate data, the sample over time feature allows you to find the section(s) of code — down to source lines — that are the cause.” (ArtConc 2009)
“The Intel Thread Profiler is a more general tool for identifying performance issues that are caused by the threading within an application. Intel Thread Profiler works on codes written with TBB, OpenMP, POSIX, or Windows Threads. Within the OpenMP interface, aggregate data about time spent in serial or parallel regions is given via a histogram. The histogram also represents time spent accessing locks or within critical regions or with threads waiting at implicit barriers for other threads (imbalance). The summary information can be broken down to show execution profiles of individual parallel and serial regions, as well as how individual threads are executed over the entire run. The former display is useful for finding regions that contain more of the undesired execution time (locks, synchronization, imbalance), while the latter is useful for discovering if individual threads are responsible for undesired execution.” (ArtConc 2009)
“For an explicit threading model, Intel Thread Profiler employs critical path analysis. This is unrelated to the critical path of call graph analysis within VTune Performance Analyzer. As the application executes, the Intel Thread Profiler records how threads interact with other threads and notable events, such as spawning new threads, joining terminated threads, holding synchronization objects, waiting for synchronization objects to be released, and waiting for external events. An execution flow is the execution through an application by threads where each of the events noted earlier can split or terminate the flow. The longest flow through the execution is the one that starts as the application is launched and continues until the process terminates. This is dubbed the critical path. Thus, if you were to make any improvement in threaded performance along this path, the total execution time of the application would be reduced, increasing overall performance.” (ArtConc 2009)
“The data recorded along the critical path is the number of threads that are active (running or able to be run if additional core resources were available) and thread interactions over synchronization objects. The Intel Thread Profiler GUI has two major divisions to display the information gathered during the threaded execution: Profile View and Timeline View. Profile View displays a histogram representation of data taken from the critical path. You can organize this histogram with different filters, including concurrency level (how many threads were active along the critical path), object view (what synchronization objects were encountered by threads), and threads view (how each thread spent time on the critical path). These filters and views can help you determine how much parallelism was available during the application execution, locate load imbalances between threads, and determine which synchronization objects were the most contended between threads. Timeline View shows the critical path over the time that the application ran. You can see the critical path switch from one thread to another and how much time threads spent executing or waiting for a synchronization object held by another thread.” (ArtConc 2009)
“Just going through beta testing as I was putting the finishing touches on this book is the Intel Parallel Studio tool. This is a parallel programming tool from Intel that plugs right into the Microsoft Visual Studio environment. The four components of Parallel Studio and their usage are:” (ArtConc 2009)
- Identifies where to insert parallelism, recognizes conflicts, and recommends solutions.
- Enables the incorporation of parallelism with a C/C++ compiler and threaded libraries.
- Finds memory and threading errors.
- Finds multicore performance bottlenecks.
“These four components blend right into the four steps of the threading methodology that I mentioned back in Chapter 1.” (ArtConc 2009)
“With the interest and desire to make parallel programming easier, there is going to be a veritable explosion of software tools made available to assist in the process of writing correct and efficient concurrent applications. I expect that those university research professors not looking to put out a new programming system are developing or have developed a software tool for analyzing concurrent codes.” (ArtConc 2009)
Go Forth and Conquer
“New tools, new programming challenges, and new ways to think about software design. It's a brave new world that we've just begun to enter. I hope you're able to join me in it and come along for the ride, and maybe take the wheel yourself every once in a while. With your gusto, new skills, and new software tools, it should be a time of excitement and wonder. Or at least, with this book, it won't be as scary as you imagined it might be.” (ArtConc 2009)
Fair Use Sources
- Art of Concurrency for Archive Access for Fair Use Preservation, quoting, paraphrasing, excerpting and/or commenting upon
© 1994 - 2024 Cloud Monk Losang Jinpa or Fair Use. Disclaimers