Showing posts with label C6416. Show all posts
Showing posts with label C6416. Show all posts

Thursday, October 7, 2010

Optimization - 02




Code Optimization - Inline Functions
Inline function is the optimization technique used by the compilers. One can simply prepend inline keyword/specifier to function prototype to make a function inline. This does not change the behavior of a function itself, but instruct compiler to embed the complete body of the function wherever that function is called, instead of being inserted only once and perform a regular call to it. When we declare a function or member function as inline we are trying to avoid the overhead of a function. If done carefully, this can improve the application's performance in exchange for increased compile time and possibly (but not always) an increase in the size of the generated binary executables.

//The format for its declaration is:
inline type name ( arguments ... )
{
// instructions ...
}

and the call is just like the call to any other function. You do not have to include the inline keyword when calling the function, only in its declaration.
Most compilers already optimize code to generate inline functions when it is more convenient. This specifier only indicates the compiler that inline is preferred for this function.
  1. Optimizations
C++ inline functions present a potential optimization to code: make a function inline and you save the overhead of a function call, Which is: loading parameters onto the stack, making a jump, popping parameters off the stack, and then it all in reverse when you return.
Modern CPUs have pipelines and attempt to optimize code execution on the fly. A function call can disrupt this pipeline when it is encountered, and again, when it returns. So that, not only we are avoiding the function call, but also we potentially allow the CPU to further improve execution.
Not only this, but a compiler may be able to optimize code which has been inlined in a way it cannot if there is a function call simply because the optimization becomes visible, e.g. optimizing register allocations. In the case of empty functions - fairly typical of constructors - making the function inline may allow the compiler to remove the call altogether.
Sometimes an inline function may save time and space. Making a function call requires instruction codes to make the call both on the part of the caller and callee. On occasions the number of bytes required for this may actually be greater than the number of bytes in the function body itself.
  1. Comparison to Macros
Traditionally, in languages such as C, inline expansion was accomplished at the source level using parameterized macros. Use of true inline functions, as are available in C99, provides several benefits over this approach:
  • Macro invocations do not perform type checking, or even check that arguments are well-formed, whereas function calls usually do.
  • In C, a macro cannot use the return keyword with the same meaning as a function would do (it would make the function that asked the expansion terminate, rather than the macro). In other words, you cannot make the macro return something which is not the result of the last expression invoked inside it.
  • Since C macros use mere textual substitution, this may result in unintended side-effects and inefficiency due to re-evaluation of arguments and order of operations.
  • Compiler errors within macros are often difficult to understand, because they refer to the expanded code, rather than the code the programmer typed.
  • Many constructs are awkward or impossible to express using macros, or use a significantly different syntax. Inline functions use the same syntax as ordinary functions, and can be inlined and un-inlined at will with ease.
  • Debugging information for inlined code is usually more helpful than that of macro-expanded code.
Many compilers can also inline expand some recursive functions; recursive macros are typically illegal.
Bjarne Stroustrup, the designer of C++, likes to emphasize that macros should be avoided wherever possible, and advocates extensive use of inline functions.
  1. Inline as a Better Macro
C programmers used to simulate inline functions with macros. C++ inline functions where in part, an attempt to encode this as a language feature rather than yet another use of the macro pre-processor. However, they are not an exact replacement.
  • If you use a macro the compiler has no desecration, the function will be inlined.
  • Scope rules are different.
  • Macros, particularly multi-line macros are much more difficult to understand and maintain.
In short, inline is an improvement on macro, but they are not a direct replacement.
  1. Catches
With these kinds of optimizations available we might wonder why we don't just make all functions inline. Simply, there are downsides, and in my opinion they out weigh the advantages.
For starters inline is only a hint to the compiler, it is free to ignore it. It will typically ignore it if:
  • The member is a virtual: it's not really possible to have an inline virtual function as virtual implies an indirect function call - although compiler can sometimes do it if it has type information.
  • The function is recursive or mutually recursive with other functions.
  • The compiler will also ignore the inline if it considers the body of the function "too big." How big is too big depends on your compiler.
The Other Inline
C++ inline functions are not the only "inlines" supported by most compilers. It is also possible to place assembler code within C++ code and this is commonly referred to as "inline assembler". It's normally clear which type is being talked about, but it's more difficult when doing a keyword search in MSDN or a search engine to pin down what you want.
The days when most systems contained some inline assembler to improve speed are fast receding. Traditionally used to speed up performance critical sections of a program it's now doubtful that even this is possible. Increasingly, to get the best performance out of a CPU requires you to know a lot of things about memory layout, cache architecture, pipelining, out of order execution and so on. While these rules can be encoded in a good compiler it is hard for an individual to know and apply all these rules.
  • There are a few occasions where inline assembler is still useful though. Microsoft use a little bit in the ATL to "thunk" a function call, and sometimes hard coding a break point in your code (asm { int 03h; } in Intel speak) has it's uses.
  • Inline code will not be generated when using a variable number of arguments to a function call (e.g. printf would not be inlined) or a variable sized data type, e.g.
void foo(int n)
{
char str[n];
...
}
  • Compilers have a limit to the depth they will inline functions, e.g. if inline A calls inline B, which calls inline C and so on. This is often configurable through a pragma or command line switch.
Of course, every compiler will differ slightly and you or I can't guarantee that there isn't a compiler out that can inline some of these things. Conversely, there are more reasons why your compiler will not inline a function that those given here. These are clear-cut cases but there are several other reasons why I avoid using inline functions. The main culprit is: code bloat, inline functions usually make your code bigger because you have duplicated code for each invocation of the function. Apart from needing a bigger hard disc code bloat has several implications:
  • Increased number of CPU cache misses: a CPU may keep a frequently executed function entirely in it's cache, if you increase the function size with lots of inline functions it may be too big for the cache, and because your functions are bigger fewer may be held in cache at once.
  • Disc cache: if every function is inlined the OS may not load the entire program into memory and your disc will start to thrash.
So, don't be surprised if making methods inline actually makes your program slower!
In addition embedded systems are very sensitive to code bloat: a PC may have a big CPU cache, oodles of fast RAM and a big fast hard disc but an embedded system probably won't. In the extreme this may mean you need to fit your device with a bigger ROM, which means it costs more, which in a competitive market may make the difference between success and failure.
  1. And some more catches...
There are catches which have to do less with an individual piece of code and more with how a system hangs together. These design issues mean you should think carefully before making a method inline:
  • Coupling between headers and implementation increases because a change is more likely to occur in the header file, which causes a bigger rebuild because other files depend on your header files. If the code is in the implementation file by contrast only that file needs to be recompiled. On large systems the difference can be hours.
  • Putting the implementation in the header file reduces the opportunity to forward declare classes, so you have to include another header file, increasing coupling; making the code less flexible and more prone to ripple effects - again increasing build times.
Perhaps a more significant point is what a lot of inline functions says about your design. Inlining is typically used forget and set functions. Classes with lots of such functions aren't really hiding anything, they are just containers for values which begs the question, just how object oriented is your design?
Some other points that are worth noting about inlined functions are:
  • Finding code in source files can be time consuming, if you have lots of inline functions you need to look in the .h files as well as the .cpp files. This may seem trivial but on a large system with several thousand files the time adds up.
  • If some of your code is compiled with inlining enabled and some with them disabled you will encounter link errors.
  • Some compilers have an auto-inline function that lets the compiler decide which functions to inline and which to leave as is. Of course, how aggressive this is, and hence how much difference it makes, is dependent on your compiler.
  • Inlining is usually only enabled when optimised code is being generated, so your debug builds won't have any inlined functions.
  • Although inline is commonly thought of as a C++ optimisation it is also available in C, and other languages, although these may need vendor specific extensions.
  • Theoretically each inline function should have one, and only one implementation body. However, it would be possible to provide different bodies in different files. Stroustrup points out in the Design & Evolution of C++ that most compilers don't check for this, we can hope that in the 7 years since he made his comments the situation has improved but I'm not sure. Before anyone think of a use for this "feature" consider the maintenance nightmare and please don't do it.
Seeing inline at work
The most reliable way to see if a function is being inlined or not is to look at the output from the compiler. Most compilers have a switch to output assembler code for your inspection. In Visual C++ there are several /FA which control this - the "listing file type" in the project C++ settings box.
The other switches worth playing with are the /Ob which control the aggressiveness of inline expansion - the "Inline function expansion" option in C++ settings.

It's interesting to experiment with these switches and the following example:

struct Widget {
int number_;
    Widget();
int t() const;
};

Widget::Widget() : number_(99) { }

int Widget::t() const
{
return number_;
}

int main()
{
Widget w;
int i = w.t();
return i;
}

It's interesting to make the constructor and t function inline and look at the assembler code, or use the /Ob2 switch to force functions to be inline, at this point the compiler reduces the program to:

push 99
pop eax

The program still comes out at 19,968 bytes, which demonstrates the overhead that even the simplest programs may carry.
  1. When to Use Inlines
Despite all this there are occasions when inlines make sense.
  • Templates based code is always inlined: this is one of the reasons people believe templates lead to bigger executables.
  • Sometimes making a function inline will improve performance for all the reasons given at the start.
The orthodox approach to performance today is to design a system for ease of maintenance, flexibility and extendibility, and only consider performance second. Usually, this means producing a working system and then performance profiling, improving the slowest bit and repeating until the system is fast enough.
When you profile your code you may find a hot spot in a function that could benefit from inlining. This is quite understandable, and at this point in the cycle your can afford to sacrifice some maintainability for speed if it is needed. However, after making the change repeat the performance test to ensure it actually does help.
One compromise is to avoid inlining other than for templates. Then as part of the QA or performance cycle use your compilers auto-inline feature. This compromise is not ducking the issue, or delivering code with sub-optimal performance. In fact, it's taking an engineering line, we build something, measure it, and if necessary improve it. (Stroustrup questions if letting the compiler decide is the best policy.)
  1. Advantages
  • It does not require function calling overhead.
  • It also save overhead of variables push/pop on the stack, while function calling.
  • It also save overhead of return call from a function.
  • It increases locality of reference by utilizing instruction cache.
  • After in-lining compiler can also apply intraprocedural optmization if specified. This is the most important one, in this way compiler can now focus on dead code elimination, can give more stress on branch prediction, induction variable elimination etc..
  1. Disadvantages
  • May increase function size so that it may not fit on the cache, causing lots of cahce miss.
  • After in-lining function if variables number which are going to use register increases than they may create overhead on register variable resource utilization.
  • It may cause compilation overhead as if some body changes code inside inline function than all calling location will also be compiled.
  • If used in header file, it will make your header file size large and may also make it unreadable.
  • If somebody used too many inline function resultant in a larger code size than it may cause thrashing in memory. More and more number of page fault bringing down your program performance.
  • Its not useful for embeded system where large binary size is not preferred at all due to memory size constraints.
  1. Performance
Now covering the topic which most the people are interested in the "Performance".
In most of the cases Inline function boost performance if used cautiously as it saves lots of overhead as discussed in our Advantages section above but as we have also discussed its disadvantages one need to be very cautious while using them. Today's modern compiler inline functions automatically, so no need to specify explicitly in most of the cases. Although placing inline keyword only gives compiler a hint that this function can be optimized by doing in-lining, its ultimately compiler decision to make it inline. Though there are ways to instruct compiler too, for making a function call inline like one can use __forceinline to instruct compiler to inline a function while working with microsoft visual c++. I suggest not to use this keyword until you are very sure about performance gain. Making a function inline may or may not give you performance boost, it all depends on your code flows too. Don't expect a magical performance boost by prepending inline keyword before a function to your code as most of the compiler nowadays does that automatically.
As we have seen inline function serves in terms of performacen but one has to use it with extreme cautions.
  1. Uses Guidelines
  • Always use inline function when your are sure it will give performance.
  • Always prefer inline function over macros.
  • Don't inline function with larger code size, one should always inline small code size function to get performance.
  • If you want to inline a function in class, then prefer to use inkine keyword outside the class with the function definition.
  • In c++, by default member function declared and defined within class get linlined. So no use to specify for such cases.
  • Your function will not be inlined in case there is differences between exception handling model. Like if caller function follows c++ structure handling and your inline function follows structured exception handling.
  • For recursive function most of the compiler would not do in-lining but microsoft visual c++ compiler provides a special pragma for it i.e. pragma inline_recursion(on) and once can also control its limit with pragma inline_depth.
  • If the function is virtual and its called virtually then it would not be inlined. So take care for such cases, same hold true for the use of function pointers.
  1. Conclusion
While inline functions may improve performance seldom will you be able to radically change a program's performance by making a bunch of functions inline. Inlines are not free; they will usually increase your program size and may actually impair performance.
They may also increase your development cycle because they increase coupling between files and lead to longer build times and more ripple effects.
Heavy reliance on inline functions may indicate flaws in your design; either a lack of object-orientedness or failure to design a system with the required performance characteristics.

 

Optimization - 01



Porting C Code to the CCS - C6416
  1. Initial Modifications
The first set of modifications to the original base C code is to accommodate multi-channel environments and prevention of stack overflow.
  1. Code Modifications for Multi-Channel Environments
To accommodate multi-channel systems, eliminate global variables and static data in each C module. For each file that contained static or global data, following steps are necessary:
  • Define a data structure containing the static data variables in header-file.
  • Add this generated data structure in header-file to the general encoder and decoder channel information.
  • Perform the former static data initialization explicitly in the encoder/decoder initialization function.
  • For each function in the original file that used static data:
    • Add a pointer to the data structure to the parameter list.
    • Replace all static variables referenced in the function body with the variables stored in the data structure whose addresses are received as parameters.
  • For each function that called any of the functions modified in step 4, add a pointer to the data structure to its parameter list. Repeat this process recursively up to the main encoder/decoder function.
  1. Prevention of stack overflow:
Since default size of stack is fixed K bytes, the large arrays in the functions may cause stack overflow. That is because the local variables are also placed on the stack when the function is called. These are accessed indirectly, which makes processor slow. So add a scratch space for reducing the stack size, which is more efficient way of using the RAM space also.
To perform this, for each function that contained the large arrays, the following steps are necessary to follow:
  • Define a data structure containing the large arrays in the header-file.
  • Add this generated data structure in header-file to the general encoder and decoder channel information.
  • For each function in the original file with the large arrays:
    • Add a pointer to the structure to the parameter list.
    • Replace all large arrays referenced in the function body with variables stored in the data structure whose addresses are received as parameters.
  • For each function that called any of the functions modified in step 3, add a pointer to the structure in that to its parameter list. Repeat this process recursively up to the main encoder/decoder function.
In addition, the Word16, Word32, and Flag data types used in the base code should redefine in typedef.h to comply with processor architecture.
  1. Intrinsic Functions
The basic math operations defined in basic_op.c were replaced by corresponding intrinsic functions of TMS320C6416. The compiler recognizes intrinsic operators which can be used like functions to produce assembly language statements otherwise it would not be expressible through standard ANSI C. This optimization allows a call to a DSP specific intrinsic function to be replaced by a copy of the body of that function which can be optimized at the assembly level. This optimization will speedup execution by eliminating function call & return overhead and substitute hand optimized and compiler optimized low level code for high level C intrinsic code. The set of functions defined in the compiler's gsm.h file is substantially richer than the set provided in basic_op.c.
So all basic math operations were easily replaced, with some minor modifications described below that were required to handle overflow and saturation. The basic math operations in basic_op.c simulate two processor flags, Overflow and Carry. The Carry flag is used only inside the basic math operations, so there was no problem in removing it. However, the Overflow flag is set in several functions, and tested in other functions, so it must remain. The Control Status Register is defined in compiler's c6x.h file. The CSR contains control and status bits. The SAT field in this register is set when there is overflow. Because the overflow flag was no longer available as a global variable, the SAT bit is cleared explicitly when the Overflow flag is assigned zero in the original file.
  1. Project-Level Optimizations (Function Inlining)
In general, these techniques improve the performance of both C and assembly code. So apply function inlining to the entire project.
Function inlining (also referred to as 'inline expansion' or simply 'inlining'), is the process of replacing a function call with the body of the function itself. This optimization technique improves execution time by eliminating function call & return overhead at the expense of larger code size.
Profiler information is used to determine whether a function should be inlined or not. Small, frequently called functions are the best candidates for inlining. In general, functions that take fewer than 25 cycles offer the most speed savings by inlining. The overhead to call such a function is typically 20 percent or greater. Functions that take 25–35 cycles and pass several parameters are also worth inlining. Function inlining is done in one of three ways:
  • Implicitly, allowing the compiler to select the functions to be inlined. This is done by setting the –o3 compiler option.
  • Explicitly, using the #pragma inline C statement. To inline a function in several files, the function should be placed in a header file, and the static keyword should be used in each file to prevent the linker from generating duplicate global symbols.
  • Manually replacing a function call with the body of the function.

For example, manual inlining technique -
// Body of the function of L_Extract()is defined as –
  void L_Extract(Word32 L_32, Word16 *hi, Word16 *lo)
  {
      *hi = extract_h(L_32);
      *lo = extract_l( L_msu( L_shr(L_32, 1) , *hi, 16384));
      return;
  }

//Further, intrinsic functions of extract_l() and extract_h() are:
extract_l ( a ) ((a )& 0xffff); and extract_h ( a ) (unsigned(a) >>16);

// Original:    L_Extract( ) calls were manually replaced with L_Extract( ) body
L_Extract(sum0, &r_h[i], &r_l[i]);

//Modified:     STEP 1
r_h[i] = extract_h(sum0);
r_l[i] = extract_l(L_msu( L_shr(sum0, 1), r_h[i], 16384));

// extract_h( ) and extract_l( ) calls were manually replaced with intrinsic //functions

// Finally:     STEP 2
r_h[i] = (unsigned (sum0) >> 16);
r_l[i] = (sum0 & 0xffff);

Implicit and explicit
inlining requires no knowledge of the code and is performed without examining code sections because the compiler automatically tries to interleave consecutive calls. These techniques are recommended when there is only one call to the inlined function.
Manual inlining is recommended when several identical calls are made in sequence, and enables the programmer to fully exploit potential parallelism. By replacing a function call with the C statements of the function, we can place in parallel, by hand, the actions in the function by allocating different variables and performing several consecutive identical calls on phases. This technique requires a working knowledge of the code to identify where a small function is repeatedly called and which actions are executed in parallel. Manual inlining is also used in assembly code.
Within Code Composer Studio, the Build Options are used to set the -o3, -pm op3, -mr1, -mt program-level options that affect all project files.
Collect profile data after the project-level optimizations, and use this as a baseline for further analysis.
  1. Function Level Optimization
The general approach for optimizing each function includes:
  • Establishing the function interface and separating the function from the rest of the code to facilitate analysis.
  • Add test code that saves the function input and output values before and after each function call.
  • Optimize the function and verifying the output to ensure that it remains the same as the corresponding reference output.
  • Integrate the optimized function with the rest of the code and verify that it passes the test vectors.
Criterion for selecting functions:  
The selection of functions for optimization is primarily based on profiling information after the project-level optimizations, focusing on the most time-critical functions. Information provided by the profiler includes: Number of calls per frame. Based on this information, the decision is made whether a small, frequently-called function should be inlined. This information is most useful during the global-level optimizations. Total number of cycles per frame, is the most useful information for selecting the functions to optimize.
  1. Loop Unrolling
Loop unrolling explicitly repeats the body of a loop with corresponding indices. That is expanding the loops so that, each iteration of the loop appears in the code. This optimization technique increases the number of instructions available to execute in parallel. Loop unrolling can be used when the operations in a single iteration do not use all of the resources of the C6000 architecture. Loop unrolling proves quite rewarding in some cases where it gives up to 2% improvement in terms of total number of cycles for each of its applications.
  1. Loop Merging
Combining two or more loops into a single loop loads the ALU more efficiently, as illustrated below. This technique yields up to 1% improvement in terms of total number of cycles for each of its applications. This avoids spending overhead on two separate loops. An example code from Corr_xy2() function is illustrated.
  1. Loop Splitting
Loop splitting refers to the process of breaking a large loop with several variables or pointers into two or more shorter loops and saving the results of the partial computations in local vectors. This technique enables the compiler to allocate registers more efficiently, resulting in substantial performance improvement, especially when combined with other optimization techniques.
  1. Simplifying the Test Conditions
The test conditions were simplified to reduce the amount of computation. e.g. the conditions which involved subtraction were replaced with direct comparisons. This saves a subtraction operation. For example -
/* original if construct */
if( sub(max1, max2) < 0 ){
}
/* modified if construct */
if(max1 < max2){
}
  1. Declaring Local Variables
Variables are declared as close as possible, in the code hierarchy, to their area of use to help the compiler identify their life cycles. This improves register allocation but requires more stack memory. For example -
for(i= -pit_max; i< L_frame; i+=2) {
Word32 sum = 0;
sum = L_mac(sum, signal[i], signal[i]);
}
  1. Replacing Functions by Arithmetic Operators
The right/left shift operator >> is used in a variable shift displacement to save a function call, if overflow or underflow does not occur after a shift operation. Similarly additions, subtractions can be replaced by addition operator +or subtraction operator - after analyzing the code. However, the C operators *, +, - were not applied to operations on fractional values that only employ intrinsic functions, because it is difficult to ascertain whether overflow, underflow or saturation would occur or not. For example -
// Original:     subtraction
T0_min = sub(T_op, 3);
// Modified:     subtraction
T0_min =T_op- 3;
  1. Pointer Addressing
Accessing array elements using pointers rather than indexing improves the performance greatly. This reduces the overhead of calculating the address. For example, consider the case of a for loop, the overhead involved in calculating the address of y[i] for every element are
  • Loading the base address of the array
  • Loading 'i' into the memory
  • Going to an offset (i) * size (type of array).
// Original:
for (i=0; i<L_WINDOW; i++) {
sum = L_mac(sum, y[i], y[i]);
}
Instead, this overhead can be moved outside the loop, to make the compiler generate faster code as shown in the below example.
//Modified:
Word16 *ptr ;
ptr = &y[0];
for (i=0; i<L_WINDOW; i++) {
sum = _sadd(sum, _smpy(*ptr ++, *ptr++));
}
  1. Function Implementation In Linear Assembly
In general, rewriting C functions in assembly increases speed and reduces code size. However, because of improvements in C compiler performance, the trend is to keep most code in C and optimized C and to implement fewer functions in assembly. The functions to be implemented in assembly were selected based on profiling data.
Usually the user can program a DSP in either C language or in assembly language. The C6000 allows to program in Linear Assembly. Linear Assembly uses exactly the same assembly instructions, but the difference is that it is easier than assembly language to program. This language is processed by the Assembly Optimizer instead of the assembler. The output of which is pure assembly language program. The Assembly Optimizer allows us to write assembly code without being concerned with pipelining structure of C6000 or with assigning of registers. It accepts Linear Assembly code which is assembly code that has not been register allocated and is unscheduled. The Assembly Optimizer assigns register and uses loop optimization to turn linear assembly into highly parallel assembly.
Advantages of Linear Assembly are:
  • Assembly optimizer is delay slot aware for different instructions. So, we need not write NOP's.
  • Register scheduling and functional unit specifications is also not required. The assembly optimizer takes care of this. This means that the programmer can use C like variable name in linear assembly instead using of C6000 registers.
  • Most importantly assembly optimizer takes care of pipelining and parallelism this is huge advantage in VLIW(Very Long Instruction Word) machines with deep pipelines make it a huge challenge for the programmer to write highly pipelined code. We can say that Linear Assembly is a slightly higher level language than pure assembly but retaining about 90% programming efficiency of the hand coded pure assembly.
Functions Selection Criteria:
The entire code is profiled, and the most time-critical functions are identified. The estimates of ideal execution times are compared with the C optimization results. Functions which are near-optimal are retained in the optimized C version and the remainders are candidates to be implemented in assembly.
  1. Conclusion
Porting of C code to the TMS320C6416 and optimizing the code while maintaining the bit-exactness of the original code is important. The recommended approach follows the following key points:
  • C source optimizations for selected functions.
  • Linear Assembly implementation of selected time-critical functions.
Do the C optimization of selected functions at both the project and function level. Execution time profile, provided by the profiler, can be used for selecting functions for optimization. The project-level optimizations involve profiling the initially ported code, inlining frequently-called functions, and optimizing the C code so that the compiler produces code that is better adapted to the TMS320C6416 architecture. In the function-level optimization, several techniques including loop unrolling and loop merging are involved. Linear Assembly implementation is the final phase of development because it is very difficult to change the structure of an implementation in linear assembly. Functions targeted for linear assembly implementation are primarily those for which the simulator could not produce efficient code, and those which execute for large number of cycles. Particular emphasis is given to loops, where each cycle saved results in a significant reduction in execution time. The optimized C implementation is used to verify the bit-exactness for the linear assembly and it also provides a pseudo-code for the algorithm.

.