Imagine considering the PS4's 1.84 TFlops, with its Tahiti GPU and Jaguar CPU, as even being close to the Ampere architecture paired with ARM A78 cores.
GCN was a fucking joke. Concurrency was a shitshow and the cache management a fiasco. Nvidia's Kepler was beating the living shit out of AMD's solution just 2 months later, with 56% smaller dies and much better power efficiency. The whole RDNA project exists to fix that fuckup, mainly the cache, which needed a huge rework. For comparison, Kepler had 1/3 the local memory latency of Tahiti. These architectures weren't even doing FP32 and INT32 concurrently, something that is oh so very important in games. GCN could issue an instruction per wave only every 4 cycles, while Kepler issued one every cycle. GCN was a compute monster: it handled large work sizes with long durations well, but very few game workloads fall into that category. Simple geometry left the GPU unsaturated (idle), and it had simultaneous bit commands that created huge buffers, basically kneecapping parallelism. The larger GPU on PS4 also meant the SE:CU ratio (shader engines to compute units) filled slower, favouring longer-running waves, which is again the antithesis of most gaming workloads.
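Where does that 4-cycle figure come from? A GCN SIMD is 16 lanes wide and a wavefront is 64 threads, so every vector instruction gets pumped through the SIMD four times. Back-of-napkin sketch, using only the numbers quoted in this post (nothing measured):

```cuda
#include <cstdio>

// Issue cadence, derived from GCN's published SIMD/wave widths.
// Plain host code, purely illustrative.
int main() {
    const int wavefront  = 64;  // GCN wavefront = 64 threads
    const int simd_width = 16;  // one GCN SIMD = 16 lanes
    // 64 threads / 16 lanes -> each vector instruction takes 4 cycles per wave.
    printf("GCN:    1 instruction per wave every %d cycles\n", wavefront / simd_width);
    printf("Kepler: 1 instruction per warp every 1 cycle\n");  // the claim above
    return 0;
}
```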
Now, we're still talking 2012 Nvidia-vs-AMD architectures here. That's what's inside the PS4.
A shitload happened between Kepler → Maxwell → Pascal → Volta → Turing → Ampere
Ampere was a paradigm shift for Nvidia, and no, not because of the doubled CUDA core count, although that confusion is really about Turing's CUDA core nomenclature versus Pascal's more than about Ampere.
The paradigm shift is how efficient that architecture is at sustaining high occupancy.
- Improvements in concurrent operations (concurrent raster/RT/ML, which Turing could not do)
- Asynchronous barriers that keep execution units near full occupancy
- Asynchronous memory copy straight from global memory into shared memory, reducing memory traffic
- Which also serves to hide data copy latency (rough sketch after this list)
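If you've never seen what that async copy buys you, here's a minimal CUDA sketch (my own toy kernel, the tile size and names are made up): on Ampere the bytes go global → shared directly instead of bouncing through registers, and the block is free to keep working until the wait().

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Toy kernel: each 256-thread block stages one tile into shared memory.
__global__ void scale_tile(const float* in, float* out, float k) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Ampere: bytes move global -> shared directly, no register round-trip.
    // On older GPUs the same call still compiles, it just stages through registers.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);

    // ...the scheduler can run other independent work here while the copy flies...

    cg::wait(block);  // only stall when the data is actually needed
    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * k;
}
```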
The FP32/INT32 CUDA core added alongside the dedicated FP32 one is pretty much a Pascal core. Not exactly a slouch. It is there for exactly one purpose: to continue the trend of keeping the GPU at near full occupancy. There's less compute than shading in gaming, almost always, and the extra FP32 throughput once the INT32 work is done is there to finish tasks ASAP. AMD tried something similar after with RDNA 3's dual-issue, which leaned on the compiler and failed miserably. So even if you somehow count only half the CUDA cores, that's not how games behave. It is not a loss in occupancy; you're still getting that performance, just with integer work no longer carrying the MASSIVE performance penalty that pre-Turing architectures had.

Occupancy is the main driver of modern GPUs. Idle is not wanted. Like Ampere's asynchronous barrier, wtf is that? Well, on RDNA 2, when you get a call that needs data written by a compute shader, RDNA 2's synchronous barrier prevents it from executing until ALL the threads in that compute shader have finished executing. That makes the WGP idle, there's not enough thread-level parallelism left to hide latency, and bye bye efficiency. RDNA 2 performs better than Ampere at low occupancy, but chokes at high occupancy. Don't even start to insert ML & RT into that poor fucking pipeline. Now imagine Tahiti-era efficiency.
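For the curious, here's roughly what that split barrier looks like in CUDA (a minimal toy of mine, not anyone's shipping code; cuda::barrier is CUDA 11+, with hardware acceleration on Ampere): threads signal "my data is written" with arrive() and only block at wait() when they actually need everyone else's writes, doing independent work in between instead of idling.

```cuda
#include <cuda/barrier>

// Toy producer/consumer inside one block.
__global__ void split_barrier_demo(float* out) {
    __shared__ float buf[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0) init(&bar, blockDim.x);
    __syncthreads();  // everyone sees the initialized barrier

    buf[threadIdx.x] = threadIdx.x * 2.0f;   // "producer" write
    auto token = bar.arrive();               // signal: my write is done, don't block

    float independent = threadIdx.x * 0.5f;  // unrelated work keeps the SM fed
                                             // while other threads catch up

    bar.wait(std::move(token));              // stall only here, when we truly
                                             // need everyone else's writes
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        buf[(threadIdx.x + 1) % blockDim.x] + independent;
}
```

An RDNA 2-style synchronous barrier is the arrive() and wait() glued together into one hard stop: everybody stalls at the same point, and the WGP twiddles its thumbs.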
Then Switch 2's T239 (again, going off the 2022 rumours) vs PS4's baseline:
- A78 ARM cores. They completely destroy the Jaguar cores
- 12 GB of RAM
- UFS 3.1 storage, much more power efficient than an SSD at similar speeds, ~2100 MB/s reads
- New I/O, either from Ampere or the rumoured decompression engine added to the chip
- RT cores & ML tensor cores
PC portables are the best examples of these inefficiencies. RDNA 2 is super sensitive to memory latency, and LPDDR5 with no Infinity Cache in Van Gogh kneecaps what it could have been.
The Asus ROG Ally's Z1 Extreme is supposed to be so much more powerful; the raw specs are better in every way for the Z1 Extreme than for Van Gogh, with its "up to 8.6 TFlops" RDNA 3 GPU.
Is it even close to a PS5? A Series S?
Everyone knows that answer. The ROG Ally runs on inefficient laptop trash, not even close to customized for a handheld. That fucking chipset is 25B transistors! At low handheld TDPs it completely chokes.
And PC portables are bloated by the Windows OS or the Proton layer.
This is the first time we'll see Ampere outside of Windows, running NVN, Nvidia's in-house close-to-the-metal API. Buckle the fuck up, regardless of TFlops.