Was going to do a new thread, but figure it belongs in here.
PSSR: Gleaning More Info from Mark Cerny’s PS5 Pro Technical Seminar
I went back the other day to the video below to look up the technical calculation for deriving the 300 (8-bit) TOPs figure for the PS5 Pro, having been motivated to understand whether the RX 9070’s and the RTX 50xx series’ TOPs compare in an apples-to-apples way, to gain insight into the work involved in porting FSR4 to the PS5 Pro.
From watching it again I was surprised by just how much more info I was gleaning from the same images and the same words.
The low-hanging fruit I caught initially on the rewatch was Mark saying twice that PSSR’s maths/CNN takes ~1 millisecond.
But in effect he says it a third time, with the following quote about reading and writing (256MB in total) the 128MB layer-1 (tensor) image from and to the unified GDDR6 RAM (@ 576GB/s) after modification:
“And just reading and writing take up 0.5 millisecond for this one layer, almost half the time for our entire CNN”
When you do the maths, it does indeed take roughly half a millisecond:
(256 * 1024 * 1024) / (576 * 1024 * 1024 * 1024) = 4.34 * 10^-4 s ≈ 0.43ms
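For anyone who wants to sanity-check that, here is the same arithmetic as a quick Python sketch. It makes the same assumptions as the working above: the 576GB/s figure is treated as binary (GiB/s), and the full 128MB tensor is read once and written once.

# read+write time for the 128MB layer-1 tensor over the 576GB/s bus
tensor_bytes = 128 * 1024**2        # 128MB layer-1 tensor
traffic_bytes = 2 * tensor_bytes    # read it in, then write it back out
bandwidth = 576 * 1024**3           # unified GDDR6 bandwidth in bytes per second
seconds = traffic_bytes / bandwidth
print(round(seconds * 1000, 2))     # ~0.43 ms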
At that point I got a little confused and wondered why the layer-1 image was being read in at all, and why it was also 128MB (i.e. 16 bytes per pixel), especially when the incoming game image for PSSR processing was rendered sparsely, so it wasn’t even 8 MPixels but a quarter of that at 2 MPixels with gaps, even after “a quick upscale of what was rendered in the game” (as Mark put it) to make the 4K raster image complete.
So I was still expecting the read-in image from RAM to be 3 or 4 bytes per pixel (6MB as RGB8, or 8MB if using RGB10_A2), because why would you take the raster image, do a load of matrix maths to generate 16 bytes per pixel, which has to reside in RAM because it is 128MB, then read it back in, in 15MB chunks, to do the CNN, and then write it out to RAM again?
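Just to show where those size expectations come from, here is a quick sketch of the per-image byte counts (the 2 MPixel figure being the quarter-resolution sparse render mentioned above):

# rough image sizes that were driving my confusion
pixels_4k = 3840 * 2160                  # ~8.3 MPixels
print(pixels_4k * 16 / 1024**2)          # ~127MB, i.e. the ~128MB layer-1 tensor
pixels_sparse = pixels_4k // 4           # sparsely rendered frame, ~2 MPixels
print(pixels_sparse * 3 / 1024**2)       # ~6MB as RGB8
print(pixels_sparse * 4 / 1024**2)       # ~8MB as RGB10_A2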
The confusion seemed to stem from me not appreciating that the neural network is recurrent, which doesn’t get highlighted until after the ML hardware coverage, when Mark explains that they chose a U-net solution. After that, the following statement made a lot more sense:
“The first layer does a lot of matrix maths and then outputs another 4K image with substantially more information per pixel”
as it is actually referring to a tensor that contains the new frame’s raster image, the quick upscale, and some of the supplemental (recurrent, 16-byte-per-pixel) data from the previous PSSR pass, plus inferred updates from this pass.
The white arrows in the CNN U-net diagram above show an abstraction of those processed tensor tiles at each level moving across to be combined and written out to RAM, which leads on to another interesting observation about the PSSR algorithm and its running time.
Mark pre-empts the inaccuracy of the CNN diagram by saying it isn’t quite the same as the one for PSSR, but close enough.
Within the talk Mark mentions a 270p tensor level, below the 540p level, that clearly isn’t on the diagram, and I believe that is the inaccuracy in the CNN he is talking about.
What is also interesting is seeing how each tensor level drops to one quarter of the previous resolution while the supplemental tensor data doubles in bytes per pixel as the resolution drops, based on the one data point we have of 1080p at 32 bytes per pixel as shown in the diagram.
Assuming this is a consistent trend across the tensor levels, what we are seeing is these tensors:
4K @ 16B/pixel (128MB),
1080p @ 32B/pixel (64MB),
540p @ 64B/pixel (32MB),
270p @ 128B/pixel (16MB)
which, when added up, gives 240MB, and if we say each is both read and written, that’s 480MB of RAM traffic, giving a time of
(480 * 1024 * 1024) / (576 * 1024 * 1024 * 1024) = 8.138 * 10^-4 s ≈ 0.814ms
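Putting that into the same kind of Python sketch, with the assumptions made explicit (the rounded per-level sizes above, one read and one write per level, and the binary treatment of 576GB/s):

# total tensor traffic if each level is a quarter of the resolution
# at double the bytes per pixel, i.e. half the size of the level above it
level_mb = [128, 64, 32, 16]        # 4K/16B, 1080p/32B, 540p/64B, 270p/128B
traffic_mb = 2 * sum(level_mb)      # read once and write once: 480MB
bandwidth = 576 * 1024**3           # bytes per second
seconds = traffic_mb * 1024**2 / bandwidth
print(round(seconds * 1000, 3))     # ~0.814 ms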
but wait

Mark says that the CNN runs roughly 10,000 ops per pixel of the output 4K image, so with a little bit of TOPs maths (working at 60fps):
We have an image of (3840x2160) ~8 megapixels x 10,000 ops x 60 fps
= 4.8 trillion ops per second, or just 4.8 TOPs
Which as a percentage of the 300 TOPs is
4.8 / 300 = 0.016 = 1.6%
Working out that percentage as compute time over a 16.67ms frame (60fps) gives
16.67 * 0.016 ≈ 0.27ms
And finishing up, adding that to the 0.814ms RAM bandwidth time:
0.814 + 0.27 ≈ 1.08ms
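And the compute side as one last sketch, keeping the rounded 8 MPixel figure from above so it lines up with the 4.8 TOPs number:

# compute time from ~10,000 ops per output pixel at 60fps
pixels = 8_000_000                       # 3840x2160 rounded down to 8 MPixels
ops_per_second = pixels * 10_000 * 60    # 4.8e12, i.e. 4.8 TOPs
fraction = ops_per_second / 300e12       # share of the 300 TOPs budget, ~1.6%
compute_ms = (1000 / 60) * fraction      # ~0.27ms of a 16.67ms frame
total_ms = compute_ms + 0.814            # add the RAM traffic time from earlier
print(round(compute_ms, 2), round(total_ms, 2))   # ~0.27 and ~1.08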
So, assuming I got my maths and assumptions correct, it looks like he sort of said PSSR’s running time is about 1ms four times in the talk.

Although, given the TOPs wastage mentioned for PSSR, I suspect that 0.27ms number is actually at least 40% higher, unless the 10K ops figure already includes the wastage.
I was going to write about more things I’d gleaned, regarding the 3x3 convolutions, the tile sizes and effective tile counts used on the 4K tensor, and how that maps into the 15MB of vector registers across the 30 WGPs, etc, but I was quite surprised by how much work writing up this working-out took.