Hudo
Gold Member
> Nvidia has once again jumped way ahead of the competition with DLSS 4. DLSS 4 uses a transformer-based model for image upscaling, compared to the CNN-based model used for DLSS 2 & 3.
> So where does this leave PSSR? PSSR uses a CNN-based model, and you have to wonder whether this is ultimately a dead end. Should Sony reassess their decision to go with a CNN-based model? Perhaps they seriously need to consider building a transformer model alongside improving their current CNN model?
> Maybe the future should be a transformer-based PSSR for the PlayStation 6? What do you think?

It entirely depends on their training data and what they want to achieve. A transformer model only begins to make sense if you have a lot of training data. The biggest advantage of a transformer model lies in its QKV self-attention, which lets the model relate information across "big" distances without being hindered by "direction", so to speak. Another advantage is that, mathematically, it has unlimited context length, which RNNs and LSTMs do not have. Realistically, you are still limited by your hardware; it's just that you don't have to worry about exploding or vanishing gradients to the same extent. Transformers are also nice if you want to merge multi-modal data into one embedding space. And maybe that's the reason Nvidia actually switched architectures: they are much easier to scale with hardware, since you "just" add more tokens/patches, so they scale more easily in both width and depth. A CNN is much easier to scale in depth, but not as easy (sometimes not even feasible) to scale in width.
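To make the QKV self-attention point concrete, here's a minimal numpy sketch of scaled dot-product self-attention. All the names, shapes, and weights here are illustrative, not taken from DLSS or PSSR:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x: (n_tokens, d_model). Every token attends to every other token in a
    single step, regardless of distance or direction in the sequence.
    """
    q = x @ Wq                                  # queries (n, d)
    k = x @ Wk                                  # keys    (n, d)
    v = x @ Wv                                  # values  (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (n, n) all-pairs similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                          # each output mixes all tokens

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

The `(n, n)` score matrix is also where the quadratic cost comes from, which is what flash/linear attention variants try to tame.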
A CNN is by no means "outdated". CNNs are still actively researched and very much in use, especially if you don't have a lot of data, because they are intrinsically suited to visual data and visual problems thanks to how hierarchical convolutions work. It's much easier to extract features from a CNN and analyze them, and much easier to extend and modify them. For visual problems they are much more efficient than transformers (I know there are optimizations like flash attention, linear attention, etc., but those come with trade-offs). A transformer model is also a lot more heavyweight. And we don't know enough about PSSR; we don't know, for example, whether it also uses attention blocks or residual blocks (most likely), etc.
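The "distance" contrast between the two architectures is easy to quantify: a stack of stride-1 3x3 convolutions has a receptive field that only grows linearly with depth, while one attention layer is global. A quick back-of-the-envelope check (layer counts are just for illustration):

```python
def receptive_field(n_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each extra k x k layer widens the receptive field by (k - 1) pixels:
    r = 1 + n_layers * (kernel - 1).
    """
    return 1 + n_layers * (kernel - 1)

# For two pixels 256 apart to interact, stride-1 3x3 convs need ~128 layers
# (real CNNs use pooling/strides/dilation to get there faster); a single
# self-attention layer relates them directly.
print(receptive_field(1))    # 3
print(receptive_field(128))  # 257
```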
In fact, SOTA segmentation models like SAM2, or Faster/Mask R-CNN-based models, use convolutions, but often together with transformer blocks/modules, or at least attention blocks, which are, for example, also used in the encoder-decoder of diffusion models. You can even solve all the problems that transformers or CNNs solve with plain MLPs (Universal Approximation Theorem), but you will find that you need a lot more of them, and they won't be as good as other models (see e.g. MLP-Mixers). Or what about state-space models like Mamba?
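The "MLPs can do it too, but you need a lot more of them" point shows up directly in parameter counts. A toy comparison (the input size and layer widths are made up for illustration):

```python
def dense_params(n_in, n_out):
    # weights + biases of one fully connected layer
    return n_in * n_out + n_out

def conv_params(k, c_in, c_out):
    # weights + biases of one k x k convolution; independent of image size
    # because the same kernel is shared across all spatial positions
    return k * k * c_in * c_out + c_out

h, w, c = 224, 224, 3
# One dense layer mapping the flattened image to 1000 features:
print(dense_params(h * w * c, 1000))   # 150529000
# One 3x3 conv producing 64 feature maps, applied everywhere:
print(conv_params(3, c, 64))           # 1792
```

That weight sharing (plus translation equivariance) is exactly the visual inductive bias that makes CNNs data-efficient compared to MLPs or transformers trained from scratch.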
TL;DR: It depends on your goals, your training data, and your hardware. CNNs are not obsolete and certainly not a dead end. Hell, the U-Net, which is 10 years old at this point (it was introduced at MICCAI 2015), is still a popular architecture that gets new variations every year. Transformers are not suited for every problem, or are at least less efficient at solving some of them, and the same goes for CNNs. Although it is remarkable how prevalent transformers have become, considering they were originally intended to solve the problems LSTMs had in natural language processing.