Was going to do a new thread, but figure it belongs in here.
PSSR: Gleaning More Info from Mark Cerny’s PS5 Pro Technical Seminar
I went back the other day to the video below to look up the technical calculation for deriving the 300 (8-bit) TOPs figure for the PS5 Pro, having been motivated to understand whether the RX 9070’s and the RTX 50xx series’ TOPs compare in an apples-to-apples way, to gain insight into the work involved in porting FSR4 to the PS5 Pro.
From watching it again I was surprised by just how much more info I was gleaning from the same images and the same words.
The low-hanging fruit I caught initially on the rewatch was Mark saying twice that PSSR’s maths/CNN takes ~1 millisecond.
But in effect he says it a third time, with the following quote about reading and writing (256MB in total) the 128MB layer-1 (tensor) image from and to the unified GDDR6 RAM (@ 576GB/s) after modification:
“And just reading and writing take up 0.5 millisecond for this one layer, almost half the time for our entire CNN”
When you do the maths, it does indeed take roughly half a millisecond:
(256 * 1024 * 1024) / (576 * 1024 * 1024 * 1024) = 4.34 * 10^-4 s ≈ 0.43ms
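For anyone who wants to sanity-check that, here is the same arithmetic as a quick Python sketch. It makes the same assumptions as the working above: the 576GB/s figure is treated as binary (GiB/s), and the full 128MB tensor is read once and written once.

# read+write time for the 128MB layer-1 tensor over the 576GB/s bus
tensor_bytes = 128 * 1024**2        # 128MB layer-1 tensor
traffic_bytes = 2 * tensor_bytes    # read it in, then write it back out
bandwidth = 576 * 1024**3           # unified GDDR6 bandwidth in bytes per second
seconds = traffic_bytes / bandwidth
print(round(seconds * 1000, 2))     # ~0.43 ms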
At that point I got a little confused and wondered why the layer-1 image was being read in at all, and why it was also 128MB (i.e. 16 bytes per pixel), especially when the incoming game image for PSSR processing was rendered sparsely, so it wasn’t even 8 MPixels but a quarter of that at 2 MPixels with gaps, even after “a quick upscale of what was rendered in the game” (as Mark put it) to make the 4K raster image complete.
So I was still expecting the read-in image from RAM to be 3 or 4 bytes per pixel (6MB as RGB8, or 8MB if using RGB10_A2), because why would you take the raster image, do a load of matrix maths to generate 16 bytes per pixel, which has to reside in RAM because it is 128MB, then read it back in, in 15MB chunks, to do the CNN, and then write it out to RAM again?
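Just to show where those size expectations come from, here is a quick sketch of the per-image byte counts (the 2 MPixel figure being the quarter-resolution sparse render mentioned above):

# rough image sizes that were driving my confusion
pixels_4k = 3840 * 2160                  # ~8.3 MPixels
print(pixels_4k * 16 / 1024**2)          # ~127MB, i.e. the ~128MB layer-1 tensor
pixels_sparse = pixels_4k // 4           # sparsely rendered frame, ~2 MPixels
print(pixels_sparse * 3 / 1024**2)       # ~6MB as RGB8
print(pixels_sparse * 4 / 1024**2)       # ~8MB as RGB10_A2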
The confusion seemed to stem from me not appreciating that the neural network is recurrent, which doesn’t get highlighted until after the ML hardware coverage, when Mark explains that they chose a U-net solution. After that, the following statement made a lot more sense:
“The first layer does a lot of matrix maths and then outputs another 4K image with substantially more information per pixel”
as it is actually referring to a tensor that contains the new frame’s raster image, the quick upscale, and some of the supplemental (recurrent, 16-byte-per-pixel) data from the previous PSSR pass, plus inferred updates from this pass.
The white arrows in the CNN U-net diagram above show an abstraction of those processed tensor tiles at each level moving across to be combined and written out to RAM, which leads on to another interesting observation about the PSSR algorithm and its running time.
Mark pre-empts the inaccuracy of the CNN diagram by saying it isn’t quite the same as the one for PSSR, but close enough.
Within the talk Mark mentions a 270p tensor level, below the 540p level, that clearly isn’t on the diagram, and I believe that is the inaccuracy in the CNN he is talking about.
What is also interesting is seeing how each tensor level drops to one quarter of the previous resolution while the supplemental tensor data doubles in bytes per pixel as the resolution drops, based on the one data point we have of 1080p at 32 bytes per pixel as shown in the diagram.
Assuming this is a consistent trend across the tensor levels, what we are seeing is these tensors:
4K @ 16B/pixel (128MB),
1080p @ 32B/pixel (64MB),
540p @ 64B/pixel (32MB),
270p @ 128B/pixel (16MB)
which, when added up, gives 240MB, and if we say each is both read and written, that’s 480MB of RAM traffic, giving a time of
(480 * 1024 * 1024) / (576 * 1024 * 1024 * 1024) = 8.138 * 10^-4 s ≈ 0.814ms
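Putting that into the same kind of Python sketch, with the assumptions made explicit (the rounded per-level sizes above, one read and one write per level, and the binary treatment of 576GB/s):

# total tensor traffic if each level is a quarter of the resolution
# at double the bytes per pixel, i.e. half the size of the level above it
level_mb = [128, 64, 32, 16]        # 4K/16B, 1080p/32B, 540p/64B, 270p/128B
traffic_mb = 2 * sum(level_mb)      # read once and write once: 480MB
bandwidth = 576 * 1024**3           # bytes per second
seconds = traffic_mb * 1024**2 / bandwidth
print(round(seconds * 1000, 3))     # ~0.814 ms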
but wait

Mark says that the CNN runs roughly 10,000 ops per pixel of the output 4K image, so with a little bit of TOPs maths (working at 60fps):
We have an image of (3840x2160) ~8 megapixels x 10,000 ops x 60 fps
= 4.8 trillion ops per second, or just 4.8 TOPs
Which as a percentage of the 300 TOPs is
4.8 / 300 = 0.016 = 1.6%
Working out that percentage as compute time over a 16.67ms frame (60fps) gives
16.67 * 0.016 ≈ 0.27ms
And finishing up, adding that to the 0.814ms RAM bandwidth time:
0.814 + 0.27 ≈ 1.08ms
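And the compute side as one last sketch, keeping the rounded 8 MPixel figure from above so it lines up with the 4.8 TOPs number:

# compute time from ~10,000 ops per output pixel at 60fps
pixels = 8_000_000                       # 3840x2160 rounded down to 8 MPixels
ops_per_second = pixels * 10_000 * 60    # 4.8e12, i.e. 4.8 TOPs
fraction = ops_per_second / 300e12       # share of the 300 TOPs budget, ~1.6%
compute_ms = (1000 / 60) * fraction      # ~0.27ms of a 16.67ms frame
total_ms = compute_ms + 0.814            # add the RAM traffic time from earlier
print(round(compute_ms, 2), round(total_ms, 2))   # ~0.27 and ~1.08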
So, assuming I got my maths and assumptions correct, it looks like he sort of said PSSR’s running time is about 1ms four times in the talk.

Although, given the TOPs wastage mentioned for PSSR, I suspect that 0.27ms number is actually at least 40% higher, unless the 10K ops figure already includes the wastage.
I was going to write about more things I’d gleaned, regarding the 3x3 convolutions, the tile sizes and effective tile counts used on the 4K tensor, and how that maps into the 15MB of vector registers across the 30 WGPs, etc, but I was quite surprised by how much work writing up this working-out took.