Archive for the 'Programming' Category

Finally, a good vector class

February 13, 2009

Over the last two days I had an adventure into efficient code again. My first project which took a good three hours is a new vector class template of variable data type and size. The code is designed using metafunctions to form efficient inline code. Benchmarks have shown that my code is not only fast, but it is very fast (with proper compiler settings). For example:

VectorFast    28526ps
Ideal         29361ps

This is the time it takes the while loop of each to iterate once (interated for 1000 clock cycles each, then time figured out per cycle). The loops:

start = clock();
while ( clock() – start < clock_count ) {
accum1 += dot( q1, q2 );
accum1 -= dot( q1, q2 );
count0 ++;
}
start = clock();
while ( clock() – start < clock_count ) {
accum2 += x1*x2 + y1*y2 + z1*z2;
accum2 -= x1*x2 + y1*y2 + z1*z2;
count1 ++;
}

So, in other words, I win. My benchmark code is not ideal, but it definitely shows good performance. Similar results arise with vector normalization, magnitude, and other operations. The “secret”, if you would call it that, is in the code:

template < typename TYPE >
struct UnrollDot {
template < unsigned int INDEX >
static __forceinline TYPE
evaluate( const TYPE* _Left, const TYPE* _Right ) {
return _Left[INDEX]*_Right[INDEX] + UnrollDot<TYPE>::evaluate<INDEX-1>( _Left, _Right );
}
template <>
static __forceinline TYPE
evaluate<0>( const TYPE* _Left, const TYPE* _Right ) {
return _Left[0]*_Right[0];
}
};

template < typename TYPE, unsigned int LEN >
__forceinline TYPE
dot ( const VectorFast< TYPE, LEN >& _Left, const VectorFast< TYPE, LEN >& _Right )
{
return UnrollDot<TYPE>::evaluate<LEN-1>( _Left.ptr(), _Right.ptr() );
}

which forces an unroll, and __forceinline forces, well, inline. I have produced a few pages of similar Unroll structures for all vector operations.

Rays be Damned

December 27, 2008

My big question has been, “How much more efficient could using #define statements be than safe, by-the-book code”. My preliminary code tested (Scalar is defined as double):

struct VectorDefines {
Scalar x, y, z;
VectorDefines( ) { }
VectorDefines( const Scalar& _X, const Scalar& _Y, const Scalar& _Z ) : x( _X ), y( _Y ), z( _Z ) { }
};

By calling one of each of the following operations once in a loop:

#define ADD(s,l,r)  s.x = l.x + r.x;  s.y = l.y + r.y;  s.z = l.z + r.z
#define DOT(l,r) (l.x*r.x + l.y*r.y + l.z*r.z)
#define CROSS(s,l,r) s.x = l.y*r.z – l.z*r.y;  s.y = l.z*r.x – l.x*r.z;   s.z = l.x*r.y – l.y*r.x;

Against

class VectorIdeal {
Scalar mX, mY, mZ;
public:
inline Scalar& x() { return mX; }
inline Scalar& y() { return mY; }
inline Scalar& z() { return mZ; }
inline const Scalar& cx() const { return mX; }
inline const Scalar& cy() const { return mY; }
inline const Scalar& cz() const { return mZ; }
VectorIdeal( ) { }
VectorIdeal( const Scalar& _X, const Scalar& _Y, const Scalar& _Z ) : mX( _X ), mY( _Y ), mZ( _Z ) { }
};
VectorIdeal operator + ( const VectorIdeal& _Left, const VectorIdeal& _Right ) {
return VectorIdeal( _Left.cx() + _Right.cx(), _Left.cy() + _Right.cy(), _Left.cz() + _Right.cz() );
}
Scalar Dot( const VectorIdeal& _Left, const VectorIdeal& _Right ) {
return _Left.cx()*_Right.cx() + _Left.cy()*_Right.cy() + _Left.cz()*_Right.cz();
}
VectorIdeal Cross( const VectorIdeal& _Left, const VectorIdeal& _Right ) {
return VectorIdeal(  _Left.cy()*_Right.cz() – _Left.cz()*_Right.cy(),
_Left.cz()*_Right.cx() – _Left.cx()*_Right.cz(),
_Left.cx()*_Right.cy() – _Left.cy()*_Right.cx() );
}

Which follows proper code formatting (at least, my way of proper formatting). At first, results were questionable. That is, the define method was taking more time than the proper way (about 1000/800 clock cycles ratio). This was because I was instantiating the vectors in the test code and that expansion was being copied through the defines (which I believe says something about the risk of horrible inefficiency with defines). I fixed that, and the next ratio blew my mind:

Defines : 1031
Ideal : 10630

That is, the defines were supposedly taking ~1/10 the processor time. I was about to go through my next ray-tracer iteration, CIMPR III, and re-hack the code when I remembered that I had run the speed test on Debug compiler settings. D’oh. Release settings gives:

Defines : 66
Ideal : 69

Or increasing the iterations of the test loop:

Defines : 824
Ideal : 886

So, all is good with the world and I can write safe code.

My second little project was just playing with particles and gradient fields:

http://img176.imageshack.us/img176/5720/47105234py4.jpg

http://img201.imageshack.us/img201/4462/79783795ut6.jpg

Basically the particles start between (-1,-1) and (1,1) and their velocities are modified by a function over R2, in these cases both combinations of trig functions.

The final little project is a physics one simulating a jelly ball. Basically a ring of springs with volume calculation and normal forcing. I even got bored and put in some crappy accumulation anti-aliasing.

Other than that my time is spent working on a ray-tracer, this time with spacial sorting for speed’s sake. I am just using Octrees for now, but hopefully will man up and figure the kd tree out some day.

Starting From Scratch

September 20, 2008

I have produced a model format which depends upon deflating the various coordinate data and packaging that into a nice container for easy management (including a string name for the planned resource manager/resource packager). The specifications are a WIP, but are as follows (note HTML massacred the nice alignment. Also, the number in brackets is the size in bits):

===================================================================================
STATIC MODEL V0
===================================================================================
[32] Identification
This is the identification of the file, “tm m”.
[8] Version
The version of the file format. The value zero indicates this model format.
[8] Length of Name
The number of characters in the following model name.
[8*Delta] Name
A character-string of length Delta where Delta is the previous field. This
is used when the file is loaded to reference it upon loading.

[4] LOD Count
The number of levels of detail to be specified.
[32*3] Minimum Bounding Box
The smallest vertex position coordinates in IEEE floating-point format,
three 32-bit values: x, y, and z respectively.
[32*3] Maximum Bounding Box
The biggest vertex position coordinates in IEEE floating-point format,
three 32-bit values: x, y, and z respectively.
{Repeat for Each LOD}
[5] LOD for Positions
[5] LOD for Normals
[5] LOD for UVs
[32] Number of Verticies
[32] Number of Polygons
{Repeat for Each Vertex}
[LODp*3] Vertex Position
The fields of this vector are signed.
[LODn*3] Vertex Normal
The fields of this vector are signed.
[LODuv*2] Vertex UV
{End Repeat}

{Repeat for Each Polygon}
[Omega*3] Polygon Vertex Indicies
{End Repeat}
{End Repeat}
[?] Padding
Aligns to the next byte.
[128] Checksum
Just an XORing of all bytes up to and including the padding. Gives a high
probability the data is valid if passes.
[8] EOF Confirmation
The value ‘E’, indicating a safe end of file was reached.

===================================================================================
FAST STATIC MODEL V0
===================================================================================
[32] Identification
This is the identification of the file, “tm m”.
[8] Version
The version of the file format. The value one indicates this model format.
[8] Length of Name
The number of characters in the following model name.
[8*Delta] Name
A character-string of length Delta where Delta is the previous field. This
is used when the file is loaded to reference it upon loading.

[8] LOD Count
The number of levels of detail to be specified.

{Repeat for Each LOD}
[32] Number of Verticies
[32] Number of Polygons
{Repeat for Each Vertex}
[32*3] Vertex Position
Three IEEE 32-bit floating-point values.
[32*3] Vertex Normal
Three IEEE 32-bit floating-point values.
[32*2] Vertex UV
Two IEEE 32-bit floating-point values.
{End Repeat}

{Repeat for Each Polygon}
[32*3] Polygon Vertex Indicies
{End Repeat}
{End Repeat}

===================================================================================
2D MATERIAL v0
===================================================================================
[32] Identification
This is the identification of the file, “tm t”.
[8] Version
The version of the file format. The value zero indicates this format.
[8] Length of Name
The number of characters in the following model name.
[8*Delta] Name
A character-string of length Delta where Delta is the previous field. This
is used when the file is loaded to reference it upon loading.

[4] Top Diffuse Texture Dimension (Two’s Exponent)
The base dimension of the texture to be stored. Id est, if the texture
is 512px by 512px, the value 9 is stored (2^9 = 512). The value zero is
invalid, and anything greater than 11 (2^11 = 2048) is most likely invalid.
[1] Stored MipMaps
If this is 1, mipmaps will be stored, otherwise only the base full texture
is used.
[4] Compression of Diffuse Map and MipMaps
0 – No compression
1 – JFIF (JPEG) lossy compression
2 – Lossless compression
3 – Bit depression
4-15 – Reserved
[4] Top Emission Map Texture Dimension
Same idea as the diffuse map dimension, but for the emission map (emission
maps are generally lower-quality). The value zero specifies that no emission
map is used.
[1] Stored MipMaps for Emission
If this is 1, mipmaps will be stored, otherwise only the base full texture
is used. Generally this is not used, but can be for high-quality textures.
If the emission map dimension is zero this is ignored.
[4] Compression of Emission Maps
Same values as “Compression of Diffuse Map and MipMaps” but for the emission
map. If the emission map dimension is zero this is ignored.
[4] Top Normal Map Texture Dimension
Same idea as the diffuse map dimension, but for the normal map. If zero no
normal mapping is used.
[1] Stored MipMaps for Normal Map
If this is 1, mipmaps will be stored, otherwise only the base full texture
is used. If the normal map dimension is zero this is ignored.
[4] Compression of Normal Maps
Same values as “Compression of Diffuse Map and MipMaps” but for the normal
map. If the normal map dimension is zero this is ignored.

{Repeat for Each LOD of Diffuse}
[?] Texture Data
This data depends on the compression field.
{End Repeat}
{Repeat for Each LOD of Emission}
[?] Texture Data
This data depends on the compression field.
{End Repeat}
{Repeat for Each LOD of Normal}
[?] Texture Data
This data depends on the compression field.
{End Repeat}
[?] Padding
Aligns to the next byte.
[128] Checksum
Just an XORing of all bytes up to and including the padding. Gives a high
probability the data is valid if passes.
[8] EOF Confirmation
The value ‘E’, indicating a safe end of file was reached.

{END OF FILE}

The philosophy behind the static model is that it will compress well while giving predictable results (LOD-wise). The fast static model is meant to be easily loaded into memory (ideally models are shipped/uploaded with the former and converted to the latter). The texture file is going to employ my own attempt at fourier series (JPEG-like) compression and the loss-less (non-patented) method used by the PNG format. The uncompressed format is for debugging primarily, and bit depression is simply reducing the number of bits per channel. The rest are reserved in case I want to bring along alpha maps or something later. I have not included alpha maps as is because they usually cause a dirty render with depth-sorting, which I don’t want to deal with at the moment.

Eventually multiple model/material files will be stored in a resource file and compressed further.

I may add my mesh optimization idea into a new model format which would be tasked with loading quickly and having the optimizations pre-calculated.

The resource file, later, will specify exactly what data is already calculated and still must be. For example, to personify the resource loader: “this model has its LOD and optimizations done, load it quickly. This one is compressed and needs to have its optimizations and LOD calculated and stored for further runs. This one is compressed, but so rarely used just leave it compressed, but on load calculate its LODs every time”.

Bah, in less than 9 hours I move to UCLA.

OH MY GOD FILTER KEYS FREAKED ME OUT WITH THE BEEP.

Until later, this shall all have to wait.

2D FEA “Goo” Comments

July 31, 2008

I have ceased work on the 2D finite element system using the grid-based optimized searching. My last work has been with generating springs between close elements so they “stick” to one another (and can also break apart). This required a pretty nifty spring management system (they are created and deleted on-the-fly cleaning themselves up). I again achieve real-time performance (although it drops below 60 fps if more points are added). The following images are from a simulation (thanks to ImageShack.us for the hosting) in order. Blue dots are elements, green lines springs.

At first a grid is generated which begins to fall. Because the grid is perfectly aligned, the points will not move along the x-axis unless nudged, so I sent a free element flying into the base of the “mass”.

It begins to settle, though the original top corners can still be made out as it deforms to fill the simulation area.

All signs of an original shape are gone, and the simulation has reached a low energy state where there is little to no movement.

To make sure to demonstrate the re-attachment ability, I add a some elements to the center of the “mass” with an initial positive y velocity, sending pieces everywhere.

Finally, the simulation reaches a low energy state again after being mixed up.

For any practical application, this would have to run with orders of magnitude more elements and in 3D. I began work on a 3D version, however it ran far too slow using the FPU to do anything reasonable in real-time and SSE optimizations were causing alignment errors. Given SSE and multi-threading, a theoretical 16x performance may be achieved, however I am ending this as this was only done to prove the grid-based method, which it has done with flying colors. The spacial sorting system remains an efficient way for distance-from-point lookups in n-dimensions with the grid size limitation. Other methods such as KD-tree and OCT-tree are most likely able to achieve greater performance and do so without the grid size limitation.

Particle Collision Optimization

July 29, 2008

This is just something that I had in the back of my head. I assumed it would speed things up and, well, it definitely did.

The problem is a bunch of particles which are supposed to repel only at a close distance, usually for something like fluid approximation, maybe just basic physics, I don’t know. I just wanted to do it. Anyway, it was all I did today (given my wonderful strep throat), and the results are promising:

The red line is is the number of frames per second (peaks at 175, after first five seconds only drops down to 109) and the blue line is the number of particles in the system. The method used is done by solving the new velocity of all particles (given any close interactions), repelling them, solving the new velocity at that position then going back to the original position and using an average of the two new accelerations/velocities, so this is with four system passes (now that I think about it, I could probably combine two…); however, this is not the purpose of it all.

If every point were checked with every other point, the number of distance checks would be N*N, or at 1,500 particles, 2.25 million. My system seems to only have to perform at most 8 per particle (though it could be anywhere from 0 to infinity). The system is divided into a grid and only adjacent grid blocks are tested for distance.

And finally, it working:

The blue dots are particles, the red blocks grid sections. The grid blocks get more red the more particles that are in them (I probably should have disabled grid rendering before benchmarking this… d’oh). Eventually the screen would fill and the ones that leave would be deleted (quite efficiently, I might add). As you can see, each particle testing only adjacent grid blocks probably averages… what, 6? Awesomeness, if you ask me. The grid size is also a parameter, including the grid location. The particle processing functions are callbacks. IE I set the first function, iterate the system, set the second function to move the particles, iterate, etc.

My only sadness comes from the fact that if the repel distance (AKA interaction distance) is greater that the size of one grid block, nothing outside the adjacent 9 cells will be considered. This might be useful for a RTS game engine. It is fast and makes it easy to find objects near one another (though it is limited).

This was done using a quad-core Q6600 CPU at 3.1 GHz, though because it is only single-threaded for now it utilizes 25% of that and the memory use peaked at no greater than 11mb. Double precision was used.

Oh man sneezing with strep throat sucks.

Mesh Optimization Pt. 1

July 25, 2008

I am going to test out two methods for optimizing complex mesh rendering.

Primarily, I am going to work on a method I have used in the past involving storing multiple polygon lists given the angle the model is viewed from. Essentially the model is stored as size separate versions: top, bottom, left, right, front, rear. If the camera is behind the object, the rear list is called et cetera. This is a crude methods but it is simple and from experience works somewhat well (75% of mesh rendered at a time using 4.5x the memory footprint). The effect is almost linear optimization (dot products are fixed CPU cost).

The second method has two branches. IE, optimization via polygon subtraction and addition with groups based on their dot product with stochastic vectors which are pre-processed and depending on how effective they are (IE how many polygons belong to them) possibly eliminated. Polygons belonging to no group will be added to a list which is rendered every time.

These methods assume the camera at no point enters the bounding volume, in this case AABB, of the object. If the camera does enter that close, a list containing all polygons will be called.

There is also possibility for SSE optimization.