Thursday, October 7, 2010

Optimization - 01

Porting C Code to the CCS - C6416
  1. Initial Modifications
The first set of modifications to the original base C code accommodates multi-channel environments and prevents stack overflow.
  1.1 Code Modifications for Multi-Channel Environments
To accommodate multi-channel systems, eliminate global variables and static data in each C module. For each file that contains static or global data, the following steps are necessary:
  1. Define a data structure containing the static data variables in the header file.
  2. Add this data structure to the general encoder and decoder channel information.
  3. Perform the former static-data initialization explicitly in the encoder/decoder initialization function.
  4. For each function in the original file that used static data:
    • Add a pointer to the data structure to the parameter list.
    • Replace all static variables referenced in the function body with the variables stored in the data structure whose address is received as a parameter.
  5. For each function that calls any of the functions modified in step 4, add a pointer to the data structure to its parameter list. Repeat this process recursively up to the main encoder/decoder function.
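The steps above can be sketched as follows; the structure and function names (ChannelState, enc_init, filter_update) are hypothetical illustrations, not identifiers from the original codec source:

```c
#include <string.h>

typedef short Word16;

/* Hypothetical per-channel state: the former file-scope statics of one
 * C module, collected into a structure declared in the header file. */
typedef struct {
    Word16 mem_w0;        /* was: static Word16 mem_w0;        */
    Word16 old_speech[8]; /* was: static Word16 old_speech[8]; */
} ChannelState;

/* Step 3: the former static initialization, now done explicitly per
 * channel in the encoder initialization function. */
void enc_init(ChannelState *st)
{
    st->mem_w0 = 0;
    memset(st->old_speech, 0, sizeof(st->old_speech));
}

/* Step 4: a function that used the statics now receives a pointer to
 * the structure and references the members through it. */
Word16 filter_update(ChannelState *st, Word16 x)
{
    st->mem_w0 = (Word16)(st->mem_w0 + x);
    return st->mem_w0;
}
```

With one ChannelState per channel, two channels no longer share state through file-scope variables.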
  1.2 Prevention of Stack Overflow
Since the default stack size is fixed, large arrays declared inside functions may cause stack overflow, because local variables are placed on the stack when a function is called. Such variables are also accessed indirectly, which slows the processor. Instead, move the large arrays into a scratch space; this reduces the required stack size and is also a more efficient use of RAM.
To do this, for each function that contains large arrays, the following steps are necessary:
  1. Define a data structure containing the large arrays in the header file.
  2. Add this data structure to the general encoder and decoder channel information.
  3. For each function in the original file with large arrays:
    • Add a pointer to the structure to the parameter list.
    • Replace all large arrays referenced in the function body with the arrays stored in the data structure whose address is received as a parameter.
  4. For each function that calls any of the functions modified in step 3, add a pointer to the structure to its parameter list. Repeat this process recursively up to the main encoder/decoder function.
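A minimal sketch of the scratch-space idea; the structure and function names are hypothetical, and the window length is assumed for illustration:

```c
typedef short Word16;
typedef int   Word32;

#define L_WINDOW 240   /* assumed window length for this sketch */

/* Hypothetical scratch structure: a large work array that used to be a
 * function local, kept once per channel instead of on the stack. */
typedef struct {
    Word16 y[L_WINDOW];   /* was: Word16 y[L_WINDOW]; inside the function */
} Scratch;

/* The function now uses the array through a pointer instead of
 * allocating 480 bytes on the stack at every call. */
Word32 window_energy(Scratch *scr, const Word16 *x)
{
    Word32 sum = 0;
    int i;
    for (i = 0; i < L_WINDOW; i++) {
        scr->y[i] = x[i];                     /* windowing stub */
        sum += (Word32)scr->y[i] * scr->y[i];
    }
    return sum;
}
```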
In addition, the Word16, Word32, and Flag data types used in the base code should be redefined in typedef.h to comply with the processor architecture.
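On the C6000, short is 16 bits and int is 32 bits, so the typedef.h definitions would look roughly like this (a sketch of the mapping, not the exact contents of the original file):

```c
/* typedef.h sketch: map the codec's basic types onto types that are
 * native to the C6416 (short = 16 bits, int = 32 bits on the C6000). */
typedef short Word16;  /* 16-bit signed value  */
typedef int   Word32;  /* 32-bit signed value  */
typedef int   Flag;    /* boolean status flag  */
```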
  2. Intrinsic Functions
The basic math operations defined in basic_op.c were replaced by the corresponding intrinsic functions of the TMS320C6416. The compiler recognizes intrinsic operators, which can be used like functions to produce assembly-language statements that would otherwise not be expressible in standard ANSI C. This optimization allows a call to a DSP-specific intrinsic function to be replaced by a copy of the body of that function, which can then be optimized at the assembly level. It speeds up execution by eliminating function call-and-return overhead and by substituting hand-optimized and compiler-optimized low-level code for the high-level C code. The set of functions defined in the compiler's gsm.h file is substantially richer than the set provided in basic_op.c.
All basic math operations were therefore easily replaced, with some minor modifications, described below, required to handle overflow and saturation. The basic math operations in basic_op.c simulate two processor flags, Overflow and Carry. The Carry flag is used only inside the basic math operations, so it could be removed without problems. The Overflow flag, however, is set in several functions and tested in others, so it must remain. The Control Status Register (CSR) is defined in the compiler's c6x.h file and contains control and status bits; its SAT field is set when an operation saturates. Because the Overflow flag is no longer available as a global variable, the SAT bit is cleared explicitly wherever the original code assigned zero to the Overflow flag.
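For illustration, here is a host-side C model of what the _sadd intrinsic computes (the replacement for the L_add basic op). On the C6416 the real intrinsic compiles to a single saturating-add instruction; this sketch only mirrors its arithmetic and is not the TI implementation:

```c
#include <stdint.h>

typedef int32_t Word32;

/* Model of the C6416 _sadd intrinsic: 32-bit addition that saturates
 * to INT32_MAX / INT32_MIN instead of wrapping on overflow. */
static Word32 sadd_model(Word32 a, Word32 b)
{
    int64_t r = (int64_t)a + (int64_t)b;   /* widen to avoid C overflow */
    if (r > INT32_MAX) return INT32_MAX;   /* positive saturation */
    if (r < INT32_MIN) return INT32_MIN;   /* negative saturation */
    return (Word32)r;
}
```

In the ported code, a basic-op call such as L_add(L_var1, L_var2) simply becomes _sadd(L_var1, L_var2).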
  3. Project-Level Optimizations (Function Inlining)
In general, these techniques improve the performance of both C and assembly code, so function inlining is applied to the entire project.
Function inlining (also referred to as 'inline expansion' or simply 'inlining') is the process of replacing a function call with the body of the function itself. This optimization improves execution time by eliminating function call-and-return overhead, at the expense of larger code size.
Profiler information is used to decide whether a function should be inlined. Small, frequently called functions are the best candidates. In general, functions that take fewer than 25 cycles offer the most speed savings from inlining, since the overhead of calling such a function is typically 20 percent or more of its execution time. Functions that take 25-35 cycles and pass several parameters are also worth inlining. Function inlining is done in one of three ways:
  • Implicitly, allowing the compiler to select the functions to be inlined. This is done by setting the -o3 compiler option.
  • Explicitly, declaring the function inline. To inline a function in several files, the function should be placed in a header file, and the static keyword should be used so that the linker does not generate duplicate global symbols.
  • Manually, replacing a function call with the body of the function.
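A minimal sketch of the explicit, header-based approach; the header name and function are hypothetical, not from the original code:

```c
/* hypothetical header: basic_ops_inline.h
 * static keeps each translation unit's copy local, so the linker sees
 * no duplicate global symbols when several files include this header. */
static inline short saturate16(int x)
{
    if (x > 32767)  return 32767;    /* clamp to Word16 maximum */
    if (x < -32768) return -32768;   /* clamp to Word16 minimum */
    return (short)x;
}
```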

For example, the manual inlining technique:

// The body of L_Extract() is defined as:
void L_Extract(Word32 L_32, Word16 *hi, Word16 *lo)
{
    *hi = extract_h(L_32);
    *lo = extract_l(L_msu(L_shr(L_32, 1), *hi, 16384));
}

// Further, the intrinsic definitions of extract_l() and extract_h() are:
// extract_l(a): ((a) & 0xffff)
// extract_h(a): ((unsigned)(a) >> 16)

// Original: an L_Extract() call to be manually replaced with its body
L_Extract(sum0, &r_h[i], &r_l[i]);

// Modified, step 1: the call replaced with the L_Extract() body
r_h[i] = extract_h(sum0);
r_l[i] = extract_l(L_msu(L_shr(sum0, 1), r_h[i], 16384));

// Finally, step 2: the extract_h() and extract_l() calls replaced with
// their intrinsic definitions, and the lo computation simplified
r_h[i] = (unsigned)(sum0) >> 16;
r_l[i] = (sum0 >> 1) & 0x7fff;

Implicit and explicit inlining require no knowledge of the code and are performed without examining code sections, because the compiler automatically tries to interleave consecutive calls. These techniques are recommended when there is only one call to the inlined function.
Manual inlining is recommended when several identical calls are made in sequence, and it enables the programmer to fully exploit potential parallelism. By replacing a function call with the C statements of the function, the actions of several consecutive identical calls can be placed in parallel by hand, by allocating different variables to the different phases. This technique requires a working knowledge of the code, to identify where a small function is repeatedly called and which actions can execute in parallel. Manual inlining is also used in assembly code.
Within Code Composer Studio, the Build Options are used to set the -o3, -pm -op3, -mr1, and -mt program-level options that affect all project files.
Collect profile data after the project-level optimizations, and use it as a baseline for further analysis.
  4. Function-Level Optimization
The general approach for optimizing each function includes:
  • Establish the function interface and separate the function from the rest of the code to facilitate analysis.
  • Add test code that saves the function input and output values before and after each call.
  • Optimize the function and verify that its output remains identical to the corresponding reference output.
  • Integrate the optimized function with the rest of the code and verify that it passes the test vectors.
Criteria for selecting functions:
The selection of functions for optimization is based primarily on profiling information gathered after the project-level optimizations, focusing on the most time-critical functions. The profiler provides two key figures. The number of calls per frame determines whether a small, frequently called function should be inlined; this is most useful during the global-level optimizations. The total number of cycles per frame is the most useful figure for selecting which functions to optimize.
  5. Loop Unrolling
Loop unrolling explicitly repeats the body of a loop with the corresponding indices, expanding the loop so that consecutive iterations appear in the code. This technique increases the number of instructions available to execute in parallel, and can be used when the operations in a single iteration do not use all of the resources of the C6000 architecture. In some cases loop unrolling proved quite rewarding, giving up to a 2% improvement in the total number of cycles for each of its applications.
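A sketch of unrolling by a factor of two, using a dot product as a stand-in example (it assumes an even element count; the function names are illustrative):

```c
typedef short Word16;
typedef int   Word32;

/* Plain loop: one multiply-accumulate per iteration. */
Word32 dot_plain(const Word16 *x, const Word16 *y, int n)
{
    Word32 sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (Word32)x[i] * y[i];
    return sum;
}

/* Unrolled by two: the two multiply-accumulates per iteration are
 * independent, so the C6000 can schedule them on different functional
 * units in parallel. Assumes n is even. */
Word32 dot_unrolled(const Word16 *x, const Word16 *y, int n)
{
    Word32 sum0 = 0, sum1 = 0;
    int i;
    for (i = 0; i < n; i += 2) {
        sum0 += (Word32)x[i]     * y[i];
        sum1 += (Word32)x[i + 1] * y[i + 1];
    }
    return sum0 + sum1;
}
```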
  6. Loop Merging
Combining two or more loops into a single loop loads the ALU more efficiently and avoids spending overhead on two separate loops. This technique yields up to a 1% improvement in the total number of cycles for each of its applications. An example of this transformation was taken from the Corr_xy2() function.
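The original Corr_xy2() listing is not reproduced here; the following is a generic, hypothetical sketch of the same transformation on two correlation loops:

```c
typedef short Word16;
typedef int   Word32;

/* Before: two loops over the same range, each paying its own loop
 * overhead (counter update and branch). */
void corr_split(const Word16 *x, const Word16 *y1, const Word16 *y2,
                Word32 *s1, Word32 *s2, int n)
{
    int i;
    *s1 = 0;
    *s2 = 0;
    for (i = 0; i < n; i++) *s1 += (Word32)x[i] * y1[i];
    for (i = 0; i < n; i++) *s2 += (Word32)x[i] * y2[i];
}

/* After: one merged loop; both accumulations share the loop overhead
 * and give the compiler more independent work to schedule in parallel. */
void corr_merged(const Word16 *x, const Word16 *y1, const Word16 *y2,
                 Word32 *s1, Word32 *s2, int n)
{
    Word32 a = 0, b = 0;
    int i;
    for (i = 0; i < n; i++) {
        a += (Word32)x[i] * y1[i];
        b += (Word32)x[i] * y2[i];
    }
    *s1 = a;
    *s2 = b;
}
```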
  7. Loop Splitting
Loop splitting refers to the process of breaking a large loop with several variables or pointers into two or more shorter loops and saving the results of the partial computations in local vectors. This technique enables the compiler to allocate registers more efficiently, resulting in substantial performance improvement, especially when combined with other optimization techniques.
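A hypothetical sketch of the technique, with the partial computations saved in a local vector between the two shorter loops:

```c
typedef short Word16;
typedef int   Word32;

#define N 64   /* assumed maximum loop length for this sketch */

/* One large loop juggling several pointers and accumulators can exhaust
 * the register allocator; here the work is split into two shorter loops,
 * with the partial results kept in a local vector in between. */
void weighted_sum_split(const Word16 *x, const Word16 *h,
                        Word32 *out, int n)
{
    Word32 prod[N];   /* local vector of partial computations */
    int i;

    /* First loop: compute and store the partial products. */
    for (i = 0; i < n; i++)
        prod[i] = (Word32)x[i] * h[i];

    /* Second loop: accumulate the stored partials. */
    *out = 0;
    for (i = 0; i < n; i++)
        *out += prod[i];
}
```

Each shorter loop now holds fewer live values at once, which is what lets the compiler allocate registers more efficiently.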
  8. Simplifying the Test Conditions
The test conditions were simplified to reduce the amount of computation; for example, conditions that involved a subtraction were replaced with direct comparisons, saving a subtraction operation:
/* original if construct */
if (sub(max1, max2) < 0) {
/* modified if construct */
if (max1 < max2) {
  9. Declaring Local Variables
Variables are declared as close as possible, in the code hierarchy, to their area of use, to help the compiler identify their life cycles. This improves register allocation but requires more stack memory. For example:
for (i = -pit_max; i < L_frame; i += 2) {
    Word32 sum = 0;
    sum = L_mac(sum, signal[i], signal[i]);
  10. Replacing Functions by Arithmetic Operators
The shift operators >> and << are used with a variable shift amount to save a function call, provided overflow or underflow cannot occur in the shift. Similarly, calls to the addition and subtraction basic ops can be replaced by the + and - operators after analyzing the code. However, the C operators *, +, and - were not applied to operations on fractional values that employ intrinsic functions, because it is difficult to ascertain whether overflow, underflow, or saturation would occur. For example:
// Original: subtraction via a basic op
T0_min = sub(T_op, 3);
// Modified: direct subtraction
T0_min = T_op - 3;
  11. Pointer Addressing
Accessing array elements through pointers rather than indices greatly improves performance by reducing the overhead of address calculation. For example, in a for loop, the overhead of calculating the address of y[i] for every element consists of:
  • loading the base address of the array;
  • loading 'i' from memory;
  • adding the offset i * sizeof(array element).
// Original:
for (i = 0; i < L_WINDOW; i++) {
    sum = L_mac(sum, y[i], y[i]);
}
Instead, this overhead can be moved outside the loop, making the compiler generate faster code:
Word16 *ptr = &y[0];
for (i = 0; i < L_WINDOW; i++) {
    sum = _sadd(sum, _smpy(*ptr, *ptr));
    ptr++;
}
  12. Function Implementation in Linear Assembly
In general, rewriting C functions in assembly increases speed and reduces code size. However, because of improvements in C compiler performance, the trend is to keep most code in C and optimized C and to implement fewer functions in assembly. The functions to be implemented in assembly were selected based on profiling data.
Usually a DSP is programmed in either C or assembly language. The C6000 additionally allows programming in Linear Assembly, which uses exactly the same assembly instructions but is easier to write. Linear Assembly is processed by the Assembly Optimizer instead of the assembler, and its output is a pure assembly-language program. The Assembly Optimizer lets us write assembly code without being concerned with the pipeline structure of the C6000 or with register assignment: it accepts linear assembly code, i.e., assembly that has not been register-allocated and is unscheduled, assigns the registers, and applies loop optimization to turn the linear assembly into highly parallel assembly.
The advantages of Linear Assembly are:
  • The Assembly Optimizer is aware of the delay slots of the different instructions, so NOPs need not be written by hand.
  • Register scheduling and functional-unit specification are also not required; the Assembly Optimizer takes care of them. This means the programmer can use C-like variable names in linear assembly instead of C6000 registers.
  • Most importantly, the Assembly Optimizer takes care of pipelining and parallelism. This is a huge advantage on VLIW (Very Long Instruction Word) machines, whose deep pipelines make it a great challenge for the programmer to write highly pipelined code by hand. Linear Assembly can thus be seen as a slightly higher-level language than pure assembly that retains about 90% of the efficiency of hand-coded pure assembly.
Function selection criteria:
The entire code is profiled, and the most time-critical functions are identified. Estimates of ideal execution times are compared with the C optimization results. Functions that are already near-optimal are retained in the optimized C version; the remainder are candidates for implementation in assembly.
  13. Conclusion
Porting C code to the TMS320C6416 and optimizing it while maintaining the bit-exactness of the original code is important. The recommended approach rests on two key points:
  • C source optimizations for selected functions.
  • Linear Assembly implementation of selected time-critical functions.
C optimization of selected functions is done at both the project and the function level, with the execution-time profile used to select the functions to optimize. The project-level optimizations involve profiling the initially ported code, inlining frequently called functions, and optimizing the C code so that the compiler produces code better adapted to the TMS320C6416 architecture. Function-level optimization involves several techniques, including loop unrolling and loop merging. Linear Assembly implementation is the final phase of development, because it is very difficult to change the structure of an implementation once it is in linear assembly. Functions targeted for linear assembly are primarily those for which the compiler could not produce efficient code and those that execute for a large number of cycles. Particular emphasis is given to loops, where each cycle saved yields a significant reduction in execution time. The optimized C implementation is used to verify the bit-exactness of the linear assembly, and it also provides pseudo-code for the algorithm.

