We present a comprehensive study of the alignment between subjective human perception and objective computational metrics for super-resolution (SR) models. Through psychophysical experiments and systematic evaluation of four state-of-the-art models (ResShift, BSRGAN, Real-ESRGAN, and SwinIR), we bridge the gap between computational metrics and human visual quality assessment. Our findings reveal significant discrepancies between traditional metrics such as PSNR/SSIM and human preference, with ResShift demonstrating superior performance in both objective metrics and subjective evaluations. This research provides critical insights for developing more perceptually aligned evaluation frameworks for super-resolution systems.
Experiment 1: Quick Evaluation Interface
Experiment 2: Pairwise Comparison
Main Interface: User interface for the psychophysical experiments
Age Distribution: Demographics of study participants across both experiments
Session Statistics: Completion rates showing high participant engagement
Key Finding: ResShift consistently outperformed the other models across most objective metrics, achieving the highest PSNR (25.01) and best LPIPS (0.231) scores. However, BSRGAN's competitive PSNR/SSIM scores contrasted sharply with its poor subjective performance.
Comprehensive Metric Analysis: Comparison of PSNR, SSIM, LPIPS, and CLIPIQA across all models
PSNR Results on DIV2K Dataset: Our experimental validation
Literature Comparison: Results from original model papers
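For reference, the fidelity metric reported above is straightforward to compute. The sketch below is a minimal PSNR implementation on hypothetical image arrays; it is not the study's evaluation pipeline, and the SSIM, LPIPS, and CLIPIQA computations (which require dedicated libraries and pretrained networks) are not reproduced here.

```python
import numpy as np

def psnr(reference, restored, data_range=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# Hypothetical 8-bit grayscale image and a noisy "restoration" of it
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + rng.integers(-10, 11, size=ref.shape),
                0, 255).astype(np.uint8)
print(round(psnr(ref, noisy), 2))
```

Note that PSNR rewards pixel-wise fidelity only, which is exactly why it can diverge from human preference as reported above.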
Experiment 1 (54 observers): ResShift was chosen 624 times, significantly outperforming SwinIR (377), Real-ESRGAN (351), and BSRGAN (268). Participants cited sharpness, absence of artifacts, and color fidelity as key factors.
Experiment 2 (15 observers, 900 comparisons): Controlled environment validation confirmed ResShift's dominance with 309 selections, followed by Real-ESRGAN (228), SwinIR (220), and BSRGAN (143).
Borda Count Rankings: Statistical ranking confirming ResShift's consistent superiority
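The Borda count aggregates per-observer rankings into a single score: on each ballot, a model earns one point for every model ranked below it. A minimal sketch (the ballots here are illustrative, not the study's raw data):

```python
from collections import defaultdict

def borda_scores(rankings):
    """Borda count: each candidate earns (n - 1 - position) points per ballot."""
    scores = defaultdict(int)
    for ballot in rankings:
        n = len(ballot)
        for pos, model in enumerate(ballot):
            scores[model] += n - 1 - pos
    return dict(scores)

# Hypothetical example ballots, best-to-worst
ballots = [
    ["ResShift", "SwinIR", "Real-ESRGAN", "BSRGAN"],
    ["ResShift", "Real-ESRGAN", "SwinIR", "BSRGAN"],
    ["SwinIR", "ResShift", "Real-ESRGAN", "BSRGAN"],
]
print(borda_scores(ballots))
# → {'ResShift': 8, 'SwinIR': 6, 'Real-ESRGAN': 4, 'BSRGAN': 0}
```

Borda aggregation is robust to individual outlier ballots, which makes it a natural choice for summarizing preference experiments of this kind.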
Experiment 1 Results: Clear preference hierarchy across 54 participants
Experiment 2 Results: Controlled validation with 900 pairwise comparisons
Preference Heatmap: Statistical significance confirmed by Chi-Square test (χ² = 61.40, p < 0.001)
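The reported statistic can be reproduced from the Experiment 2 selection counts, assuming a chi-square goodness-of-fit test against the null hypothesis of equal preference (225 selections per model out of 900):

```python
# Experiment 2 selection counts (900 pairwise comparisons)
counts = {"ResShift": 309, "Real-ESRGAN": 228, "SwinIR": 220, "BSRGAN": 143}

expected = sum(counts.values()) / len(counts)  # 225 under equal preference
chi2 = sum((obs - expected) ** 2 / expected for obs in counts.values())
print(round(chi2, 2))  # → 61.4
```

With 3 degrees of freedom, the critical value at alpha = 0.001 is 16.27, so the observed statistic of 61.40 gives p < 0.001, consistent with the reported result.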
Decision Time Distribution
Age Impact on Decision Time
Algorithm Preference by Age Group
Content-Dependent Preferences
User Feedback Analysis: Key terms include "sharp," "clear," "natural," and "detailed"
Decision Factors: Users prioritized visual naturalness over pixel-perfect accuracy
This research bridges the gap between quantitative metrics and perceptual quality, informing the development of SR models that perform well in real-world applications. Future work will focus on:
SCIA 2025 Poster: A comprehensive visual summary of our research findings, experimental setup, and key insights. The poster provides an overview of both objective metrics and subjective evaluation results.