Over the last two days I went on another adventure in efficient code. My first project, which took a good three hours, is a new vector class template with a variable element type and size. It uses metafunctions to generate efficient inline code. Benchmarks show that, with the proper compiler settings, it keeps pace with hand-written scalar code and here even edges it out. For example:
VectorFast 28526ps
Ideal 29361ps
This is the time each while loop takes to iterate once (each loop ran for 1000 clock() ticks, and the time per iteration was then worked out from the iteration count). The loops:
start = clock();
while ( clock() - start < clock_count ) {
    accum1 += dot( q1, q2 );
    accum1 -= dot( q1, q2 );
    count0++;
}
start = clock();
while ( clock() - start < clock_count ) {
    accum2 += x1*x2 + y1*y2 + z1*z2;
    accum2 -= x1*x2 + y1*y2 + z1*z2;
    count1++;
}
So, in other words, I win. My benchmark code is not ideal, but it definitely shows good performance. Similar results arise with vector normalization, magnitude, and other operations. The “secret”, if you would call it that, is in the code:
template < typename TYPE >
struct UnrollDot {
template < unsigned int INDEX >
static __forceinline TYPE
evaluate( const TYPE* _Left, const TYPE* _Right ) {
return _Left[INDEX]*_Right[INDEX] + UnrollDot<TYPE>::template evaluate<INDEX-1>( _Left, _Right );
}
template <>
static __forceinline TYPE
evaluate<0>( const TYPE* _Left, const TYPE* _Right ) {
return _Left[0]*_Right[0];
}
};
template < typename TYPE, unsigned int LEN >
__forceinline TYPE
dot ( const VectorFast< TYPE, LEN >& _Left, const VectorFast< TYPE, LEN >& _Right )
{
return UnrollDot<TYPE>::template evaluate<LEN-1>( _Left.ptr(), _Right.ptr() );
}
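The recursive template forces the unroll at compile time, and __forceinline forces, well, inlining (it is a Visual C++ keyword, and strictly speaking a very strong request rather than a guarantee). The explicit specialization written inside the struct is also something Visual C++ accepts but not every compiler does. I have produced a few pages of similar Unroll structures for all vector operations.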
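For compilers other than Visual C++, the same unrolling idea can be sketched by moving the index into the struct's template parameters and ending the recursion with a partial specialization. This is only a sketch under assumptions, not the code I benchmarked: Vec is a hypothetical stand-in for VectorFast (only ptr() is assumed), UnrollDotPortable and dot_portable are made-up names, and inlining is left to the optimizer instead of __forceinline.

```cpp
#include <cstddef>

// Hypothetical stand-in for VectorFast; only a ptr() accessor is assumed.
template <typename T, std::size_t N>
struct Vec {
    T data[N];
    const T* ptr() const { return data; }
};

// Primary template: term I plus the unrolled sum of terms I-1 .. 0.
template <typename T, std::size_t I>
struct UnrollDotPortable {
    static T evaluate(const T* left, const T* right) {
        return left[I] * right[I]
             + UnrollDotPortable<T, I - 1>::evaluate(left, right);
    }
};

// Partial specialization: index 0 ends the compile-time recursion.
template <typename T>
struct UnrollDotPortable<T, 0> {
    static T evaluate(const T* left, const T* right) {
        return left[0] * right[0];
    }
};

// Entry point: starts the recursion at the last element.
template <typename T, std::size_t N>
inline T dot_portable(const Vec<T, N>& a, const Vec<T, N>& b) {
    static_assert(N > 0, "vector length must be positive");
    return UnrollDotPortable<T, N - 1>::evaluate(a.ptr(), b.ptr());
}
```

Calling dot_portable on two Vec<double, 3> values expands to a single expression of three multiplies and two adds, with no loop left in the source for the compiler to unroll. Whether the calls actually get inlined still depends on the optimizer settings.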