Here is a very interesting paper from Sony Research. It shows how intelligent use of the SPUs together with the GPU can give the PS3 a performance level on par with (or close to) a top-of-the-line PC GPU. It also shows how complex this beast is to program, which makes it a very interesting read. I suggest you at least try to read it, even if you do not understand all of it (I do not understand all of it either); it will enlighten you. And if you have any questions, I will gladly let CPI take them.
Paper link (free access, no login required):
http://research.scea.com/ps3_deferred_shading.pdf
Some sentences from the paper:
This paper studies a deferred pixel shading algorithm implemented on a Cell/B.E.-based computer entertainment system. The pixel shader runs on the Synergistic Processing Elements (SPEs) of the Cell/B.E. and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell/B.E. and GPU to exchange data through shared textures. The SPEs use the Cell/B.E. DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPEs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that the Cell/B.E. can effectively enhance the throughput of a GPU in this hybrid system by alleviating the pixel shading bottleneck.
(...)
We chose an extreme test case that stresses the memory subsystem and generates a significant amount of DMA waiting. Despite this waiting the algorithm scaled efficiently with speedup of 4.33 on 5 SPEs. This indicates the Cell/B.E. can be effective in speeding up this sort of irregular fine-grained shader. These results would carry over to less extreme shaders that have more regular data access patterns.
(...)
We study variations of a Cone Culled Soft Shadow algorithm. This algorithm belongs to a class of algorithms known as shadow mapping algorithms.
(...)
The algorithm is not physically correct and we accept many approximations for the sake of real time performance.
(...)
the dandelion is a challenging test for shadow algorithms. The algorithm correctly reproduced the fine detail at the base of the plant as well as the internal self-shadowing within the leaves.
(...)
The key to running any algorithm on the SPEs is to develop a streaming formulation in which data can be moved through the processor in blocks. We move eye data in scanline order and double buffer the scanline input. While one scanline of pixels is being processed we prefetch the next scanline. As each scanline is completed it is written to the shadow texture. We have measured the DMA waiting for the scanline data and it was negligible.
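To make the double-buffering idea a bit more concrete, here is a minimal Python sketch of the pattern (not the paper's SPE code). A one-worker thread pool stands in for the DMA engine, and `fetch_scanline`, `shade` and `process_image` are hypothetical names made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_scanline(image, y):
    # Hypothetical stand-in for an asynchronous DMA "get" of one scanline.
    return list(image[y])


def shade(pixels):
    # Placeholder per-pixel work; the real shader computes soft shadows.
    return [p * 2 for p in pixels]


def process_image(image):
    """Process scanlines in order while the next one is being prefetched."""
    height = len(image)
    out = [None] * height
    if height == 0:
        return out
    # One worker plays the role of the DMA engine filling the second buffer.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_scanline, image, 0)
        for y in range(height):
            current = pending.result()  # wait until this scanline has arrived
            if y + 1 < height:
                # Start fetching the next scanline before doing any shading.
                pending = dma.submit(fetch_scanline, image, y + 1)
            out[y] = shade(current)  # "write" the finished scanline
    return out


if __name__ == "__main__":
    print(process_image([[1, 2], [3, 4], [5, 6]]))
```

The point is only the ordering: the fetch of scanline y+1 is issued before scanline y is shaded, so the transfer overlaps the computation.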
(...)
By having multiple DMA lists in flight concurrently we buffer fragment data in order to minimize DMA waiting. We experimented with the number and size of the DMA lists in order to minimize runtime. We found that having four DMA lists was optimal and that larger numbers did not reduce the runtime. We found similarly that fetching 128 pixels per DMA list was optimal and that longer DMA lists did not reduce runtime.
We parallelized the computation across multiple SPEs by distributing scanlines to processors. This is straightforward and provides balanced workloads. We scheduled tasks using an event queue abstraction provided by the operating system that is based on one of the Cell/B.E. synchronization primitives, the mailbox. We measured the cost of this abstraction at less than 100 microseconds per frame. When running in parallel on multiple SPEs the individual processors completed their work within 100 microseconds of each other. Each SPE computes a set of scanlines for the shadow texture. They deliver their result directly into GPU memory in order to minimize the final render time.
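A rough sketch of how that work could be split up, in Python. The excerpt does not say whether scanlines are handed out interleaved or in contiguous blocks, so the round-robin scheme below is my assumption, and `assign_scanlines` and `dma_batches` are hypothetical helper names. Only the 128 pixels per DMA list and the 720p dimensions come from the paper.

```python
def assign_scanlines(height, num_spes):
    # Assumption: round-robin (interleaved) assignment of scanlines to SPEs.
    return [list(range(spe, height, num_spes)) for spe in range(num_spes)]


def dma_batches(width, pixels_per_list=128):
    # Split one scanline into [start, end) pixel ranges, one per DMA list.
    return [(start, min(start + pixels_per_list, width))
            for start in range(0, width, pixels_per_list)]


if __name__ == "__main__":
    work = assign_scanlines(720, 5)       # 720p frame, 5 SPEs
    print([len(lines) for lines in work]) # scanlines per SPE
    print(len(dma_batches(1280)))         # DMA lists needed per scanline
```

With 720 scanlines and 5 SPEs every processor gets the same number of lines, which fits the "balanced workloads" remark, although the real cost per scanline depends on the image content.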
(...)
1 SPE: 19 Hz (Hz = frames per second)
2 SPEs: 34 Hz
3 SPEs: 59 Hz
4 SPEs: 75 Hz
5 SPEs: 85 Hz
(...)
The monochromatic shader ran at 85 Hz using 5 SPEs and at 34 Hz using 2 SPEs. Videogames are typically rendered at 30 or 60 frames per second. Shading calculations should generally run at these rates, but for shadow generation it is possible to use lower frame rates without affecting image quality. It would also be possible to use shadows generated at 720p resolution with a base image rendered at a higher 1080p resolution (1920x1080 pixels).
(...)
We implemented the same algorithm on a high end state of the art GPU, the NVIDIA GeForce 7800 GTX running in a Linux workstation. This GPU has 24 fragment shader pipelines running at 430 MHz and processes 24 fragments in parallel. By comparison the 5 SPEs that we used process 20 pixels in parallel in quad-SIMD form. The GeForce required 11.1 ms to complete the shading operation. In comparison the Cell/B.E. required 11.65 ms including the DMA waiting time, and would require only 8.56 ms if the DMA waiting were eliminated. The performance of the Cell/B.E. with 5 SPEs was thus comparable to one of the fastest GPUs currently available, even though our implementation spent 27% of its time waiting for DMA. Results would presumably be even better on 7 SPEs, or on fewer SPEs if we could reduce or eliminate the DMA waiting.
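The numbers in that paragraph can be cross-checked with a few lines of Python. This is just arithmetic on the figures quoted above; the variable names are mine.

```python
# Timings quoted in the paper excerpt (milliseconds per shading pass).
cell_ms = 11.65         # Cell/B.E., 5 SPEs, including DMA waiting
cell_no_dma_ms = 8.56   # Cell/B.E. if DMA waiting were eliminated
gpu_ms = 11.1           # GeForce 7800 GTX

# Fraction of the Cell/B.E. time spent waiting for DMA.
dma_wait_fraction = (cell_ms - cell_no_dma_ms) / cell_ms

# Equivalent rates in Hz (passes per second).
cell_hz = 1000.0 / cell_ms
gpu_hz = 1000.0 / gpu_ms

# 5 SPEs, each working on 4 pixels at once in quad-SIMD form.
pixels_in_flight = 5 * 4

print(f"DMA wait: {dma_wait_fraction:.1%}")
print(f"Cell/B.E.: {cell_hz:.1f} Hz, GPU: {gpu_hz:.1f} Hz")
print(f"Pixels processed in parallel: {pixels_in_flight}")
```

The DMA wait comes out at roughly 26.5%, in line with the 27% quoted, and 11.65 ms corresponds to a bit under 86 Hz, in line with the 85 Hz figure for 5 SPEs.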
(...)
Our initial results are encouraging as they show it is feasible to attain scalable speedup and high performance even for shaders with irregular fine-grained data access patterns. Removing the computation from the GPU effectively increases the frame rate, or more likely, the geometric complexity of the models that can be rendered in real time. We can also conclude that the performance of the Cell/B.E. is superior to a current state of the art high end GPU in that we achieved comparable performance despite performance limitations and despite using only part of the available processing power.
Arnaud