Saturday, June 21, 2014

DOOM on ZPU

The last few days I spent some time porting the beloved game DOOM to the ZPU CPU. For those of you who don't know what I am talking about: DOOM was a groundbreaking game released in the early 90s, and ZPU is not only a Russian AA gun but also a 32-bit open source CPU.

DOOM v1.666 running in ZPU simulator.

So what is the point of doing this?
Well, it is just for fun I guess. A ZPU based game console will have a hard time beating the next Xbox 720 or PS4. In fact the ZPU is pretty slow and not very well suited for graphics.
Having said that, DOOM on ZPU is interesting from a profiling and verification point of view. Without any hard evidence I would say DOOM is a more interesting workload than Dhrystone. It also exercises the toolchain and hardware quite a bit.





Mr DOOM in trouble 
To summarize, these are the learnings/status:
  • With HW acceleration it should be possible to play DOOM on an FPGA
  • long long operations and memcpy perform poorly with the current toolchain
  • The port runs on a modified Phi platform
  • This has only been tested in the emulator

The following is currently not supported in the port:
  • Sound 
  • Input
  • Network
I currently don't have an FPGA board, so I have only run it inside the ZPU simulator, which gets you about 1-2 FPS on a decent laptop. Not very playable, so I didn't bother with input or sound.

Some add-ons to the Phi platform were needed:
  • Mode 13 frame buffer (the blit is sketched below)
  • Doom1.wad preloaded into memory
  • 35Hz timer
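
As a rough illustration, the frame buffer add-on boils down to the port's I_FinishUpdate copying DOOM's 8-bit back buffer into a memory mapped region once per frame. This is just a sketch of mine: FRAMEBUFFER_BASE is an invented placeholder address, not the actual Phi memory map.

#include <string.h>

/* Invented placeholder address -- the real Phi memory map differs. */
#define FRAMEBUFFER_BASE ((unsigned char *)0x80000000)
#define SCREENWIDTH  320
#define SCREENHEIGHT 200

extern unsigned char *screens[5]; /* DOOM's software render buffers */

/* Called once per frame: blit DOOM's back buffer to the Mode 13
   style frame buffer. This is the per-frame memcpy discussed in the
   profiling section below. */
void I_FinishUpdate (void)
{
  memcpy (FRAMEBUFFER_BASE, screens[0], SCREENWIDTH * SCREENHEIGHT);
}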


Performance

So how fast is DOOM on ZPU? 
With the following setup:
  • All instructions implemented
  • 225 MHz 
  • Loads take 2 cycles (i.e. no RAM latency)
At 225 MHz an average of 30.5 FPS is achieved when running 'timedemo demo3', with occasional dips into the 20-25 FPS range. The resolution is the DOOM default, 320x200.

To reach the same performance on a PC, something like a 486 @ 66 MHz with a decent graphics card is needed.

So is the ZPU about 1/4 the speed of a 486?
No, the comparison is a bit unfair:
  • In simulation the frame buffer has no latency; this was NOT true for computers from the 486 era, where a good graphics card could almost double the FPS.
  • The ZPU port does not have any sound.
  • RAM access has 1 cycle latency in simulation; this is far from true for an old PC.
  • The x86 port has assembly optimizations for the inner loops; the ZPU port is C code only.
It would be hard to reach >100 MHz on an FPGA; as far as I know all ZPU implementations barely reach 100 MHz. But with some tricks it would still be possible to achieve decent performance (see the profiling section for details):
  • Some of the inner loops and fixed point arithmetic could be replaced by hardware units.
  • There is a frame buffer memcpy done every frame; this could be removed if a double buffering capable display controller is used.
  • The 64 kB frame buffer could reside in a dual-port on-chip RAM. This would give a very fast frame buffer.


Profiling

The DOOM demo mode is very similar to real gameplay: it is normal gameplay but with the key strokes recorded, so demo runs are deterministic. The profiling has been run using the command line:
-timedemo demo3
(Again, taken from this. They are running the original DOOM port; the ZPU port is based on BOOM.)
Time spent in each frame, shown as FPS. The workload varies a lot between frames.
Frame 1137, 76 FPS. Fastest frame. No time spent on ceilings or floors, and little time spent drawing walls.


Frame index 184, 14 FPS. Slowest frame. Way more fixed point multiplications than in the surrounding frames; not sure why.


A typical frame takes about 7,400,000 cycles on the ZPU platform described above. Stairs, enemies, shooting and transparency are some of the things that increase render time. The table below contains a breakdown for frame 200 of the demo run; this frame is a good, typical frame. It takes 9,562,694 cycles to complete.

Function name         Total cycles  Nbr calls  Avg cycles/call  Description
memcpy (all)          1414058       5356       263              All memcpy calls
memcpy (framebuffer)  224504        1          224504           memcpy to frame buffer
udivsi3               2753956       1578       1745             long long div
muldi3                2073756       2322       893              long long mult
R_DrawSpan            837281        161        5200             Draw ceiling/floor
R_DrawColumn          647947        867        747              Draw walls
(This is just for frame 200, but other frames should be somewhat similar.)

There seems to be something fishy here. The majority of the time should be spent inside the R_DrawSpan and R_DrawColumn functions; these are the functions that do all the pixel drawing. Instead most time is spent doing 64-bit arithmetic.
(See http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for some interesting profiling results).


Frame 200. Used for function call profiling.

udivsi3 & muldi3

DOOM uses 64-bit multiplications as part of its fixed point arithmetic. Many old PCs lacked an FPU, which I believe is the reason why DOOM is integer only.

#define FRACBITS 16 /* 16.16 fixed point */

__inline__ static fixed_t FixedMul(fixed_t a, fixed_t b)
{
  return (fixed_t)((long long) a*b >> FRACBITS);
}

__inline__ static fixed_t FixedDiv(fixed_t a, fixed_t b)
{
  return (abs(a)>>14) >= abs(b) ? (a^b)<0 ? MININT : MAXINT :
    (fixed_t)(((long long) a << FRACBITS) / b);
}
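
A quick worked example (mine, not from the DOOM source): in 16.16 fixed point, 1.5 is represented as 0x00018000 and 2.0 as 0x00020000. FixedMul(0x00018000, 0x00020000) forms the 64-bit product 0x300000000 and shifts it right by FRACBITS, giving 0x00030000, i.e. 3.0.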

The ZPU does not have hardware support for 64-bit multiplication or division, so the compiler relies on software routines to implement the long long operations. A multiplication takes almost 900 cycles; 400 of those cycles are spent inside two 8 byte memcpy calls. The division routine behaves similarly.
There is plenty of room for improvement in these routines. A good approach would be an assembler optimized routine. Better, but not portable, would be to use dedicated hardware for the fixed point functions above. FPGAs usually have a lot of multipliers in them that can be used for the multiplication. Many FPGA vendors supply IPs for dividers, and there are also some available on opencores.com.
These hardware blocks would then be memory mapped; this would be easier than adding a new instruction to the instruction set. The hardware accelerated multiplier would be very quick, the divider a bit slower.
Including function overhead, FixedMul could take less than 50 cycles and FixedDiv less than 100 cycles, but this is just a very wild guess.
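
A minimal sketch of what the memory mapped version could look like from the software side. The register addresses are invented placeholders (no such peripheral exists today), and the scheme assumes the peripheral holds the bus on the result read until the multiplication is done:

/* Hypothetical memory mapped multiplier -- all addresses are invented. */
#define FIXMUL_A   (*(volatile int *)0xC0000000) /* operand a             */
#define FIXMUL_B   (*(volatile int *)0xC0000004) /* operand b             */
#define FIXMUL_RES (*(volatile int *)0xC0000008) /* result: (a * b) >> 16 */

typedef int fixed_t; /* as in DOOM's m_fixed.h */

__inline__ static fixed_t FixedMul_hw(fixed_t a, fixed_t b)
{
  FIXMUL_A = a;      /* write the operands...                     */
  FIXMUL_B = b;      /* ...which starts the multiplication        */
  return FIXMUL_RES; /* the read stalls until the result is ready */
}

Two stores and one load is only a handful of ZPU instructions, which is roughly where the sub-50-cycle guess above comes from.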

memcpy

If the Fixed functions are improved, the majority of the time spent in memcpy will disappear.
The memcpy cost is approximately:

cycles_in_memcpy = 200 + bytes_to_copy * 3.5;
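
Plugging in the numbers: an 8 byte copy, like the ones inside the long long helpers, costs about 200 + 8 * 3.5 = 228 cycles, almost all of it overhead. The 64000 byte frame buffer copy costs about 200 + 64000 * 3.5 = 224200 cycles, which matches the 224504 cycles measured for frame 200.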

Any small copy will have a pretty large overhead, and large copies aren't that fast either. I am having trouble locating which memcpy is being used; the only one I can find is this (zpugcc/toolchain/binutils/libiberty/bcopy.c):
void
bcopy (src, dest, len)
  register char *src, *dest;
  int len;
{
  if (dest < src)
    while (len--)
      *dest++ = *src++;
  else
    {
      char *lasts = src + (len-1);
      char *lastd = dest + (len-1);
      while (len--)
        *(char *)lastd-- = *(char *)lasts--;
    }
}
Looking at the generated assembler it doesn't look very similar, but GCC might be unrolling the loops.
The frame buffer memcpy takes about 2-3% of a frame; this could be removed completely by using hardware double buffering.
I have seen several posts about an optimized memcpy for the ZPU, have any of these reached the repository?
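
For reference, the usual improvement is to copy word-wise whenever alignment allows. The sketch below is generic C of my own, not code from the ZPU toolchain:

#include <stddef.h>

/* Copy len bytes from src to dest, using 32-bit moves when both
   pointers are 4 byte aligned. Generic sketch, not the toolchain memcpy. */
void *memcpy_words(void *dest, const void *src, size_t len)
{
  char *d = dest;
  const char *s = src;

  if ((((size_t)d | (size_t)s) & 3) == 0)
    {
      /* Aligned main loop: 4 bytes per iteration. */
      while (len >= 4)
        {
          *(int *)d = *(const int *)s;
          d += 4;
          s += 4;
          len -= 4;
        }
    }
  /* Byte-wise tail, and fallback for unaligned buffers. */
  while (len--)
    *d++ = *s++;
  return dest;
}

This would help the large aligned copies the most; the ~200 cycle call overhead would still dominate the small 8 byte copies inside the long long helpers.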

R_DrawColumn & R_DrawSpan

These are the functions where most of the time should be spent. They draw the pixel values to the frame buffer (via a buffer). R_DrawColumn draws the walls and R_DrawSpan draws the floors and ceilings (see http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for a good explanation).
This is one of the inner loops inside R_DrawColumn:

do
{
  // Re-map color indices from wall texture column
  //  using a lighting/special effects LUT.

  // heightmask is the Tutti-Frutti fix -- killough

  *dest = colormap[source[frac>>FRACBITS]];
  dest += SCREENWIDTH;
  if ((frac += fracstep) >= heightmask)
    frac -= heightmask;
}
while (--count);    

This loop would be a good candidate for hardware acceleration. The software would not need to wait for the hardware accelerated loop to finish; instead it could go on and do other things, as sketched below.
Each iteration takes about 30 cycles, and while the count value varies greatly, the average for this particular frame is about 20. So around 600 cycles could be saved per call.
R_DrawSpan has a similar but not identical inner loop that would also benefit greatly from hardware acceleration.
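
From the software side, a fire-and-forget column blitter could look something like the sketch below. The register map is entirely invented for illustration; nothing like it exists in the port today:

/* Hypothetical memory mapped column blitter -- all addresses are invented. */
#define BLIT_DEST       (*(volatile unsigned int *)0xC0000100)
#define BLIT_SOURCE     (*(volatile unsigned int *)0xC0000104)
#define BLIT_COLORMAP   (*(volatile unsigned int *)0xC0000108)
#define BLIT_FRAC       (*(volatile unsigned int *)0xC000010C)
#define BLIT_FRACSTEP   (*(volatile unsigned int *)0xC0000110)
#define BLIT_HEIGHTMASK (*(volatile unsigned int *)0xC0000114)
#define BLIT_COUNT      (*(volatile unsigned int *)0xC0000118) /* write starts blit */
#define BLIT_BUSY       (*(volatile unsigned int *)0xC000011C) /* 1 while running   */

/* Hand the inner loop of R_DrawColumn to the hardware. */
static void R_DrawColumn_hw(unsigned char *dest, const unsigned char *source,
                            const unsigned char *colormap,
                            unsigned int frac, unsigned int fracstep,
                            unsigned int heightmask, unsigned int count)
{
  while (BLIT_BUSY) /* wait out any column still in flight */
    ;
  BLIT_DEST       = (unsigned int)dest;
  BLIT_SOURCE     = (unsigned int)source;
  BLIT_COLORMAP   = (unsigned int)colormap;
  BLIT_FRAC       = frac;
  BLIT_FRACSTEP   = fracstep;
  BLIT_HEIGHTMASK = heightmask;
  BLIT_COUNT      = count; /* writing count kicks off the hardware */
  /* The CPU returns immediately and can prepare the next column. */
}

While the hardware walks one column, the CPU can already compute the parameters for the next one, which is where the overlap comes from.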

There is plenty of room for optimizations; software only optimizations would give a nice bump, but HW acceleration is needed to reach 35 FPS on an FPGA board clocked at 50 MHz.

OP code histogram

The emulator has a nice opcode histogram feature; this is useful for getting an understanding of the workload.
A first guess might be that a 3D game would have a pretty characteristic profile when it comes to which instructions are used the most. One might suspect lots of multiplications and divisions, but this is not the case.
Scroll down for a complete list of the opcode histogram. The histogram is calculated from the entire run, not just frame 200.
Here are some comments regarding some of the instructions:

loadsp / storesp / addsp 

The most common instructions are what I call the stack operation instructions: loadsp, storesp and addsp. I like to think of them as a necessity due to the lack of registers, or as the equivalent of registers in a register based architecture. In most workloads these will be among the most common instructions.
This is a typical example:
count = dc_yh - dc_yl; 
   32798: 73           loadsp 12
   32799: 76           loadsp 24
   3279a: 31           sub
   3279b: 55           storesp 20

A register based architecture would have something like:
       sub r20, r24, r12  // subtract r12 from r24 and store in r20

load

The 32-bit load is a very common instruction at 12.5%. 32-bit loads are used for accessing arrays and global variables, and there are many global variables in the code. They are often used in combination with the im instruction:

  fracstep = dc_iscale; 
   32532: af           im 47    
   32533: 99           im 25
   32534: c8           im -56
   32535: 08           load  // load from bccc8
00032536 <.LM6>:
  frac = dc_texturemid + (dc_yl-centery)*fracstep; 
   32536: 79           loadsp 36
   32537: af           im 47
   32538: b0           im 48
   32539: a4           im 36
   3253a: 08           load // load from bd824
   3253b: 31           sub
   3253c: 71           loadsp 4
   3253d: 29           mult
   3253e: af           im 47
   3253f: a0           im 32
   32540: 80           im 0
   32541: 08           load // load from bd000
   32542: 05           add
00032543 <.LM7>:
    register const byte *source = dc_source;            
   32543: af           im 47
   32544: 99           im 25
   32545: c4           im -60
   32546: 08           load // load from bccc4

All global variables are located close together in memory, but they are all fetched using absolute addresses. The im instructions are used to build these absolute addresses; in fact almost all arrays and global variables need 3 im instructions to be addressed, even though most global variables sit close together.
It would be nice to have something like a frame pointer and then a relative load:

   im 47     // set fp to at bccc4
   im 25
   im -60
   popfp

 fracstep = dc_iscale;    
   im -1
   loadfprel // load from bccc8

 frac = dc_texturemid + (dc_yl-centery)*fracstep; 
   loadsp 36
   im 5
   im 80
   loadfprel // load from bd824
   sub
   loadsp 4
   mult
   im 1
   im 79
   loadfprel // load from bd000

 register const byte *source = dc_source;            
   im 0
   loadfprel // load from bccc4

In the above example 3 instructions would be saved by using a loadfprel instruction.
(To clarify, there is no such instruction. But it might be a good addition if there ever is a ZPUv2.)

loadb

Most pixel operations are done through char pointers, and these will in most cases use loadb. It is surprising to see loadb at only 1.7%, given that this instruction is executed a lot in the inner loops of the R_Drawxxxx functions.



Opcode histogram for all frames

im 32.99%
storesp 12.77%
load 12.49%
loadsp 8.39%
add 9.35%
addsp 4.22%
neqbranch 3.17%
store 2.79%
loadb 1.76%
eq 1.39%
and 1.28%
storeb 1.16%
lshiftright 0.98%
ashiftright 0.61%
not 0.56%
sub 0.54%
lessthanorequal 0.53%
mult 0.51%
poppcrel 0.51%
lessthan 0.50%
nop 0.47%
pushspadd 0.36%
ulessthanorequal 0.35%
ulessthan 0.35%
popsp 0.33%
or 0.33%
poppc 0.21%
ashiftleft 0.16%
loadh 0.15%
callpcrel 0.15%
pushsp 0.06%
call 0.05%
storeh 0.05%
flip 0.02%
xor 0.01%
neg 0.009%

What is next?

It would be nice to port this to an FPGA, preferably to an already existing environment. Maybe the ZPUino?
Some things that the platform needs to handle:
  • Code size is ~700 kB
  • File data is ~4 MB
  • 4 MB RAM (either on-chip or off-chip + cache)
  • Display output (VGA/HDMI, preferably 6 bits per plane)
  • FPGA resources for a Mode 13 frame buffer