Gillius's Programming

Particle System Memory Allocation

Abstract

Date of Original: March 21, 2004

I did some research into memory allocation methods for a particle system. A lot of people believe that a list of pointers is better than a list of objects, and some believe the opposite. Others use a static array. I decided to research into the different methods, using a particle system as the situtation.

Updated March 29 with a lot more results, refactored code, and two new algorithms, VectorAlgor and FlatArray.

Introduction

I decided to research several algorithms for managing memory in particle systems. There is some debate as to which method is best, and I wanted to discover how the algorithms actually compared by benchmarking them. There seem to be three major schools of thought:

  1. Some C programmers create a static array of particles. Simple to implement, they believe it is fast because iteration over an array is fast, and there is no dynamic memory allocation
  2. Some C++ programmers like to use a list of pointers -- it seems a majority do this. These programmers tend to believe that storing a list of pointers avoids an extra, expensive data copy.
  3. Some C++ programmers like to use a list of objects. Most recently this has been my own prefered approach. My rational being that the STL allocator class was smarter than new/delete.

When I discovered that my compiler's implementation of std::allocator was nothing more than a wrapper around new/delete, I began to question my rationale for storing objects directly, and I decided to run these benchmarks.

In addition to the three algorithms above, I decided to try out two new algorithms:

  1. A dual std::list of pointers. One being a dead object list I can allocate from, and one being a live object list.
  2. A custom std::allocator replacement that works by allocating from a pool of objects.

I decided to research particle systems rather than general allocation because usually in game programming particles are very often created and destroyed and are a case that requires special consideration and many are created and destroyed every frame. Normal game objects usually are not created or destroyed anywhere near the rate of objects like particles.

After my first submission of this article, I got a lot of feedback. Two implementations of a packet, flat array algorithm. One is dynamically-sized and is based off of std::vector. The other is statically sized and uses an array.

The Environment

The compiler environment is MSVC.NET 2003 (MSVC 7.1). I am using boost::timer and boost::format from the Boost library, version 1.31.0.

My system statistics:

Results from other systems are presented at the end, so those specifications will be given at that time. Those results have been contributed by other readers.

The Particle

The particle class itself I decided would not offer any rendering capability, but would provide a reasonable update function that a particle class might have. I provided two update functions, one that checks if the particle is dead (for the static array), and one that does not (for the lists and packed arrays).

The particle object itself is 36 bytes and contains the following variables:

The particle as given is a typical particle definition for a modern 3D game engine. The update method provides Euler integration given the vectors but it does not do anything complicated like collection detection or drawing. This is relevant because it is important to know how much time you can save by using a custom memory system -- if only 1% of the time is spent managing the particle list, it is more worth our time to optimize the collision detection (if any) or the particle rendering code.

The Algorithms

Particle Manager

The ParticleManager is a class coded specifically for the Particle class, and is a fixed size. The algorithm is as follows:

The ParticleManager enforces a maximum particle limit. Sometimes this is desireable. Allocation is decently quick when the array is not full, but an update iteration will have to check all of the dead particles.

Some helpful person suggested that I could construct a packed array so that on object death I would copy the last live particle to the dead particle's position, and therefore never iterate over dead elements. Allocation would be constant time, but when particles die we would have to perform a copy.

I'd like to try to implement this algorithm in the future, but not for this initial test.

List of Pointers

This is the standard std::list<Particle*> approach. Particles are allocated using new manually and the pointers are stored in the list. When objects die, the particles are deleted manually.

This algorithm has the advantage of not requiring any copying, but it is harder to implement due to explicit memory management. Also, two allocation steps are needed: one to allocate the list node and another to allocate the particle itself.

Like all the list algorithms, this algorithm benefits by never iterating over any dead particles.

Dual List of Pointers

I wanted to examine if eliminating the particle allocation step (avoid calling new/delete) was worthwhile, so I tried an algorithm similar to list of pointers except that I maintain two lists, a live list and a dead list.

Unlike the standard method, the dual list method will never shrink in memory size. It will use the amount of memory required to store the maximum number of particles that ever existed.

List of Objects

This is the standard std::list<Particle> approach. All memory management is handled the STL. Chances for memory leaks is greatly reduced due to no explicit memory management, and is easy to implement. It requires an extra copy though, through the copy constructor. It may be possible for some compilers to optimize out the copy, I've heard, but I did not examine if this is the case with MSVC.NET 2003 for this article.

BinAllocator

I decided to try to write my own replacement to the std::allocator. This algorithm works similarly to the ParticleManager, except that when the array is filled, a new array is made. The BinAllocator works with constant-sized baskets, and maintains a list. Each basket has a specific size of bins. When all baskets are filled, a new basket is added to the the basket list.

The last element allocated is stored, so that iterations begin at that location. When the iteration reaches the end of a basket, the iteration moves to the next basket. If it returns back to the start, a new basket is made. Else it returns the first empty bin it finds.

For this article I attempted two basket sizes: 10 and 500. The 500 size performed much better, but used more memory. Like the dual list approach, the BinAllocator can only grow, never shrink. When a particle dies, that bin is marked as empty.

I added a small optimization afterwards, to cache the last element deleted. When the allocation occurs, if the last operation was a delete, it reuses that node. This skips the iteration process to find a new node, and almost guarantees that the memory is in the cache, while iterating to the next bin is likely to cause a cache miss as the next bin is likely to have been deallocated longest ago.

I still believe this algorithm can be substantially improved, perhaps by expanding the cache idea, but this this method starts to look like the dual list method, which I discovered to be very slow.

I did only the work I needed for BinAllocator to get it to run with Particles (that can't throw exceptions) in a std::list. Namely I did not examine exception-safety, nor did I allow allocations of multiple elements as would be required with std::vector.

VectorAlgor and FlatArray

These work similarly to the ParticleManager, except they fill in the holes when particles die. Both of these algorithms were contributed from the readers at Allegro.cc. VectorAlgor was contributed by Julien Cugniere. FlatArray was contributed by an Allegro.cc member who goes by the name "Orz."

The theory behind these is to use copy-on-death semantics to keep from having to maintain a linked list (as with BinAllocator), and to never iterate over dead particles. The only state needed for the algorithm is the end of objects in the array, and the array itself. If there are 4 particles, taking up four slots A B C D, if particle B dies, the particle in slot D is copied to overwrite B, then the size is decremented by one.

The assumption is that the objects are small and copy quickly because they have a trivial copy operation. The disadvantage to this algorithm is that the addresses of the particle particles will change because of copy-on-death semantics, and due to resizing of the vector in VectorAlgor.

The only significant difference between VectorAlgor and FlatArray is that VectorAlgor uses std::vector, and is dynamically sized. FlatArray uses a straight array, and refuses new particles if it is filled (rather than searching for the particle that is closest to dying). For these performance, the FlatArray size was set to 50,000 particles, so no particles were dropped (our maximum average is 2500).

Other Ideas

I wanted to try out an algorithm someone suggested to me, and that I've seen in my quick research of other algorithms. It is similar to the BinAllocator in that it gives out single storage units (although there are variations that can allocate arrays). It is dissimilar to the BinAllocator in that it maintains a linked list as part of the nodes. Initially all of the "bins" are in the same linked list -- a free node linked list. Allocation simply takes the top node off the free list, and deallocation simply adds the node to the front of the free list. This has the advantage of not having to copy objects or move them in memory, meaning it should work with larger objects of if you need to keep a pointer to an object while it is alive.

I did not have enough time to implement this algorithm, but I would like to in the future. I believe that it would improve on the performance of the BinAllocator while not having the performance penalty I saw in the Dual List algorithm by having each node contain the actual pointers rather than having a node as a separate object (the Dual List still needed to call new/delete when you moved a pointer between lists, because it had to allocate the actual node of the std::list).

It may also be possible to implement this algorithm as a singly linked list rather than a doubly linked as is std::list. All of the implementations I saw of this algorithm used a doubly linked list, and I'm not sure why, so further study is needed. I've heard this algorithm called "simple segregated storage" for those interested in following up on this.

The Test

I ran each algorithm in a mock particle engine update loop. I ran the loop a certain number of times, using a certain dt value. The dt value is in seconds and is the frame length. One particle is created during each frame, and the particles are created to have a maximum life between 0 and 5 seconds.

This is by no means an exhausive test. Firstly, there are no memory allocations but particles. The other game code might call new with widly varying size requests that may reduce the efficency of new moreso than seen here. Secondly, the peformance may change significantly if the particle creation was not as constant. Thirdly, changes in particle life span might affect our choice in algorithm.

The Code

Here is the source code for this test. The source code itself is in the public domain. I would appreciate if the source code is used (probably only the BinAllocator is useful) that I would be mentioned in the program's credits, but it is not required.

I've refactored the code quite a bit from the original article's release when others started contributing code, so I wanted to clean it up. I rewrote the test framework to output results in a form that did not require hand post-processing, as well as split up all of the algorithms into different headers.

The single CPP file approach I took was for compiling simplicity (as I do not supply a makefile), and to make sure that the proper inlining is done equally in all compilers (MSVC.NET can inline across object file boundaries, GCC cannot still as far as I know).

The source code also contains the spreadsheet in which I placed the results from the original article. The file's format is Excel 2000. The newer results are only on this page.

The download file includes the source code and the executable compiled by MSVC.NET 7.1. Default options for a console application, except that whole program optimizations are turned on. MSVC.NET workspace files are provided.

Download source code: 78k zip, MS-DOS format

The Conclusion

It appears that the VectorAlgor is the best solution for our specific particle system and is memory efficent. Of the algorithms that do not copy or move objects, the best is the BinAllocator. The best solution using a standard STL algorithm is the list of objects. The list of pointers and dual list approach are not worth considering because of their slow performance. The ParticleManager and FlatArray algorithms are statically sized, and dynamic solutions that are equally fast or faster exist for these. A translation of FlatArray should be considered for implementation in other languages, such as C.

My guess is that VectorAlgor's performance will get worse and eventually get worse than the BinAllocator as object size increases, but more research is needed. I think that the BinAllocator can still be greatly improved through the algorithm discussed earlier. So for other memory management tasks that use large objects or need pointers that will not be invalidated, the BinAllocator is likely the best solution, but more research is needed to see if it can be improved by having a growable linked list.

The BinAllocator complies with the std::allocator interface, so it can be used directly with std::list, and with some work it might be appropriate for the other containers such as std::map. It is likely never appropriate (or even usable) for std::vector or std::string as they are probably implemented to allocate arrays of objects.

The Results

The theoritical average number of particles for each column is 25, 250, 500, and 2500.

The columns are dt values (frame length). One particle is created each frame, and particle life is 0 to 5 seconds (random).

The format of the results is a CSV format in aligned columns (thanks to boost::format for making this easy!)

Here are the results from my machine.

All times are in seconds.
10000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.04,     0.08,     0.11,     0.14
ParticleManager 2000,    0.15,     0.191,    0.23,     0.39
List of Pointers,        0.01,     0.04,     0.061,    0.64
List of Objects,         0.011,    0.03,     0.07,     0.47
Dual List,               0.01,     0.05,     0.111,    1.221
Bin List w/cached,       0,        0.04,     0.06,     0.441
VectorAlgor,             0.01,     0.03,     0.05,     0.331
FlatArray 50000,         0.01,     0.03,     0.06,     0.34
Objects At End,          27,       257,      505,      2467

20000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.08,     0.151,    0.16,     0.2
ParticleManager 2000,    0.311,    0.37,     0.481,    0.861
List of Pointers,        0.02,     0.09,     0.231,    4.005
List of Objects,         0.02,     0.05,     0.111,    1.081
Dual List,               0.02,     0.1,      0.231,    3.635
Bin List w/cached,       0.01,     0.06,     0.1,      1.052
VectorAlgor,             0.01,     0.05,     0.1,      0.721
FlatArray 50000,         0.01,     0.06,     0.11,     0.811
Objects At End,          21,       262,      501,      2475

50000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.211,    0.37,     0.401,    0.521
ParticleManager 2000,    0.781,    0.951,    1.192,    2.173
List of Pointers,        0.06,     0.23,     0.551,    9.654
List of Objects,         0.04,     0.15,     0.271,    2.964
Dual List,               0.06,     0.23,     0.581,    8.202
Bin List w/cached,       0.03,     0.13,     0.261,    2.834
VectorAlgor,             0.03,     0.14,     0.25,     1.963
FlatArray 50000,         0.03,     0.15,     0.271,    2.143
Objects At End,          25,       248,      500,      2510

100000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.41,     0.771,    0.822,    1.041
ParticleManager 2000,    1.532,    1.943,    2.323,    4.377
List of Pointers,        0.11,     0.471,    1.121,    22.503
List of Objects,         0.07,     0.3,      0.551,    5.979
Dual List,               0.11,     0.5,      1.182,    20.68
Bin List w/cached,       0.06,     0.27,     0.521,    5.728
VectorAlgor,             0.05,     0.271,    0.51,     4.026
FlatArray 50000,         0.05,     0.281,    0.541,    4.366
Objects At End,          20,       251,      485,      2481

Mika Halttunen from Allegro.cc contributed a partial result set for his machine.

All times are in seconds.
10000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.49,     0.681,    1.242,    1.412
ParticleManager 2000,    3.515,    3.705,    3.956,    7.811
List of Pointers,        0.08,     0.441,    2.814,    14.962
List of Objects,         0.06,     0.41,     1.102,    11.486
Dual List,               0.091,    1.011,    3.265,    16.874
Bin List w/cached,       0.05,     0.381,    1.091,    9.934
VectorAlgor,             0.041,    0.34,     0.661,    5.448
FlatArray 50000,         0.04,     0.35,     0.681,    5.858
Objects At End,          22,       255,      500,      2453

20000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.921,    1.372,    2.264,    2.713
ParticleManager 2000,    5.588,    6.109,    7.05,     14.471
List of Pointers,        0.15,     1.973,    5.148,    34.149
List of Objects,         0.13,     0.781,    2.013,    24.806
Dual List,               0.16,     1.692,    7.291,    36.632
Bin List w/cached,       0.09,     0.781,    2.113,    19.057
VectorAlgor,             0.09,     0.701,    1.422,    13.229
FlatArray 50000,         0.091,    0.781,    1.502,    10.725
Objects At End,          25,       253,      520,      2485

50000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     2.334,    3.585,    5.828,    7.171
ParticleManager 2000,    13.639,   15.402,   17.636,   38.836
List of Pointers,        0.47,     6.159,    18.166,   95.788
List of Objects,         0.29,     1.923,    4.957,    62.27
Dual List,               0.38,     5.298,    17.235,   109.537
Bin List w/cached,       0.471,    2.534,    7.671,    87.986
VectorAlgor,             0.291,    3.084,    5.969,    50.182
FlatArray 50000,         0.23,     2.113,    4.096,    41.77
Objects At End,          24,       255,      495,      2479

100000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     5.909,    9.153,    16.213,   17.686
ParticleManager 2000,    55.51,    49.681,   64.253,   125.841
List of Pointers,        1.091,    10.275,   51.714,   253.345
List of Objects,         1.041,    6.63,     25.096,   230.741

Julien Cugniere was kind to provide a comparison between MSVC and GCC. First the GCC results:

All times are in seconds.
10000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.08,     0.1,      0.13,     0.14
ParticleManager 2000,    0.281,    0.32,     0.351,    0.55
List of Pointers,        0.02,     0.04,     0.081,    0.651
List of Objects,         0.01,     0.03,     0.07,     0.56
Dual List,               0.01,     0.051,    0.09,     0.691
Bin List w/cached,       0.01,     0.03,     0.07,     0.591
VectorAlgor,             0,        0.03,     0.04,     0.28
FlatArray 50000,         0,        0.02,     0.05,     0.281
Objects At End,          24,       251,      510,      2473

20000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.15,     0.22,     0.231,    0.3
ParticleManager 2000,    0.561,    0.631,    0.721,    1.171
List of Pointers,        0.02,     0.111,    0.21,     1.542
List of Objects,         0.02,     0.09,     0.211,    1.342
Dual List,               0.02,     0.1,      0.2,      1.552
Bin List w/cached,       0.02,     0.07,     0.151,    1.311
VectorAlgor,             0.011,    0.05,     0.09,     0.641
FlatArray 50000,         0.01,     0.05,     0.09,     0.641
Objects At End,          32,       253,      471,      2453

50000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.37,     0.541,    0.611,    0.741
ParticleManager 2000,    1.402,    1.592,    1.793,    3.024
List of Pointers,        0.06,     0.291,    0.561,    4.105
List of Objects,         0.041,    0.24,     0.561,    3.555
Dual List,               0.05,     0.27,     0.501,    4.166
Bin List w/cached,       0.04,     0.19,     0.361,    3.555
VectorAlgor,             0.03,     0.13,     0.22,     1.733
FlatArray 50000,         0.03,     0.13,     0.231,    1.732
Objects At End,          26,       252,      497,      2487

100000 Loops 
,                        0.1,      0.01,     0.005,    0.001
ParticleManager 500,     0.741,    1.092,    1.252,    1.492
ParticleManager 2000,    2.804,    3.194,    3.585,    6.129
List of Pointers,        0.12,     0.541,    1.102,    8.402
List of Objects,         0.08,     0.471,    1.101,    7.241
Dual List,               0.1,      0.501,    1.001,    8.532
Bin List w/cached,       0.09,     0.381,    0.721,    7.31
VectorAlgor,             0.061,    0.25,     0.451,    3.515
FlatArray 50000,         0.06,     0.25,     0.471,    3.515
Objects At End,          24,       264,      483,      2493

He then compared for the 100,000 loops case, GCC to MSVC. We looked a bit into GCC's allocation for STL, and it appears to have optimizations for objects below 128 bytes. We suspect MSVC's algorithm is more generic. This causes GCC to compare in very interesting ways.

MSVC vs. GCC
Smaller number means GCC faster.

100000 Loops

ParticleManager 500,     205,26%,  151,46%,  154,38%,  156,89%
ParticleManager 2000,    193,11%,  175,21%,  164,22%,  147,47%
List of Pointers,        120,00%,  114,86%,  103,77%,   50,30%
List of Objects,         114,29%,  142,73%,  189,50%,  127,08%
Dual List,               100,00%,   94,35%,   89,22%,   43,23%
Bin List w/cached,       180,00%,  146,54%,  138,39%,  131,52%
VectorAlgor,             122,00%,   95,79%,   92,04%,   92,60%
FlatArray 50000,         120,00%,   92,59%,   90,40%,   84,78%
Objects At End,           80,00%,  104,35%,   95,27%,   98,73%

GCC really optimizes the pointer versions of the lists! It is slower than MSVC for the list of objects. The end result is that the difference between a list of pointers and objects is no where near as large as when in MSVC. GCC does speed up the VectorAlgor and FlatArray, but the GCC results show that as in MSVC, they are both the same speed, so the VectorAlgor is better as it is dynamically sized.

Old Results

I saved the Excel spreadsheet as an HTML file, and since it generates lots of nasty code I can't paste the table here or clean it up. I apologize for this. If anyone has problems viewing this, let me know. It worked for me with Mozilla 1.6 and IE 6, and viewed properly in my HTML editor, so it should be fine.

The release with optimized checks should be what you look at. I split the update method into update1, which does an isDead check, and update2, which does not.

ParticleManager 500 and 2000 refer to the static array size.

Explaination: The debug was when I ran the system in debug mode. It's more or less useless so I stopped taking the measurements. All of them except the Bin Test was with the old-unified update method.

Bin Test 10 and 500 refer to the basket size. "w/cached" is the version when I added the cached variable.

Old results from the original article