Aligning Subjective and Objective Assessments in
Super-Resolution Models

NTNU - Norwegian University of Science and Technology

Abstract

We present a comprehensive study investigating the alignment between subjective human perception and objective computational metrics in super-resolution models. Through psychophysical experiments and a systematic evaluation of four state-of-the-art super-resolution models (ResShift, BSRGAN, Real-ESRGAN, and SwinIR), we bridge the gap between computational metrics and human visual quality assessment. Our findings reveal significant discrepancies between traditional metrics such as PSNR/SSIM and human preference, with ResShift demonstrating superior performance across both objective metrics and subjective evaluations. This research provides critical insights for developing more perceptually aligned evaluation frameworks for super-resolution systems.

Experimental Setup

Experiment 1 Setup

Experiment 1: Quick Evaluation Interface

  • Total Images: 30
  • Estimated Time: 15 minutes
  • Low-Resolution Image: 255×169 pixels (displayed in the center)
  • High-Resolution Images: 4 images, 1020×676 pixels each (4× the LR resolution)
  • Task: Choose the best HR image from randomized comparisons (see the sketch below)
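
For illustration, a minimal sketch of how such a randomized forced-choice trial list can be generated; the scene identifiers and helper names are hypothetical, not the study's actual implementation:

```python
import random

# Hypothetical scene IDs; the study used 30 test images.
SCENES = [f"scene_{i:02d}" for i in range(30)]
MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def build_trials(seed=None):
    """One observer's trial list: every scene shown once, in random order,
    with the four HR candidates shuffled independently per trial."""
    rng = random.Random(seed)
    trials = []
    for scene in rng.sample(SCENES, k=len(SCENES)):
        candidates = MODELS[:]
        rng.shuffle(candidates)  # hide model identity behind screen position
        trials.append({"scene": scene, "candidates": candidates})
    return trials
```
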
Experiment 2 Setup

Experiment 2: Pairwise Comparison

  • Method: Pairwise comparisons
  • Images: 10 images; with 4 models, C(4,2) = 6 model pairings per image, giving 60 pairs per person
  • Display: BenQ calibrated monitor, sRGB, D65, 80 cd/m²
  • Estimated Time: 15 minutes
  • Task: Subjective quality assessment (pair generation sketched below)
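
The 60 pairs follow directly from the design: 10 images × C(4,2) = 6 unordered model pairings each. A minimal sketch (image and model identifiers are illustrative):

```python
import random
from itertools import combinations

IMAGES = [f"img_{i:02d}" for i in range(10)]  # 10 test images
MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def build_pairs(seed=None):
    """10 images x C(4,2) = 6 model pairings -> 60 comparisons per observer,
    shuffled both in presentation order and in left/right placement."""
    rng = random.Random(seed)
    pairs = []
    for img in IMAGES:
        for a, b in combinations(MODELS, 2):
            left, right = rng.sample((a, b), k=2)  # randomize side
            pairs.append((img, left, right))
    rng.shuffle(pairs)
    return pairs

assert len(build_pairs(seed=0)) == 60
```
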
Main Interface

Main Interface: User interface for the psychophysical experiments

Participant Demographics & Methodology

Age Distribution

Age Distribution: Demographics of study participants across both experiments

Session Completion

Session Statistics: Completion rates showing high participant engagement

Objective Metrics Analysis

Key Finding: ResShift consistently outperformed the other models across most objective metrics, achieving the highest PSNR (25.01) and the best (lowest) LPIPS (0.231). However, BSRGAN's competitive PSNR/SSIM scores contrasted sharply with its poor subjective performance.
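
As a point of reference, the full-reference metrics can be computed with widely used open-source tools. The sketch below assumes scikit-image for PSNR/SSIM and the lpips package for LPIPS, which may differ from the study's exact tooling; CLIPIQA, being a no-reference metric, requires a separate model (available, for example, in the pyiqa package).

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

loss_fn = lpips.LPIPS(net="alex")  # lower LPIPS = perceptually closer

def evaluate(sr: np.ndarray, hr: np.ndarray) -> dict:
    """sr, hr: HxWx3 uint8 images of identical size (SR output vs. ground truth)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=2, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = loss_fn(to_t(sr), to_t(hr)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```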

Objective Analysis

Comprehensive Metric Analysis: Comparison of PSNR, SSIM, LPIPS, and CLIPIQA across all models

PSNR Distribution

PSNR Results on the DIV2K Dataset: Our experimental validation
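
For reference, PSNR for 8-bit images is defined in terms of the mean squared error between the super-resolved image $\hat{x}$ and the ground truth $x$:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right),
\qquad
\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(x_{ij} - \hat{x}_{ij}\right)^2
```

The reported 25.01 dB therefore corresponds to a root-mean-square error of about 255/10^(25.01/20) ≈ 14 gray levels per pixel.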

Paper PSNR

Literature Comparison: PSNR values reported in the original model papers

Subjective Evaluation Results

Experiment 1 (54 observers): ResShift was chosen 624 times, significantly outperforming SwinIR (377), Real-ESRGAN (351), and BSRGAN (268). Participants cited sharpness, absence of artifacts, and color fidelity as key factors.

Experiment 2 (15 observers, 900 comparisons): Validation in a controlled environment confirmed ResShift's dominance with 309 selections, followed by Real-ESRGAN (228), SwinIR (220), and BSRGAN (143).

Borda Count Rankings

Borda Count Rankings: Statistical ranking confirming ResShift's consistent superiority
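For clarity, a minimal Borda-count sketch over ranked ballots; the example ballots are invented for illustration and are not the study's data:

```python
from collections import Counter

MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def borda(ballots, candidates=MODELS):
    """Borda count: each ballot ranks candidates best-to-worst;
    rank r (0-based) among n candidates earns n - 1 - r points."""
    n = len(candidates)
    scores = Counter({c: 0 for c in candidates})
    for ballot in ballots:
        for r, c in enumerate(ballot):
            scores[c] += n - 1 - r
    return scores.most_common()

# Invented example ballots, for illustration only:
ballots = [
    ["ResShift", "SwinIR", "Real-ESRGAN", "BSRGAN"],
    ["ResShift", "Real-ESRGAN", "SwinIR", "BSRGAN"],
]
print(borda(ballots))
# [('ResShift', 6), ('Real-ESRGAN', 3), ('SwinIR', 3), ('BSRGAN', 0)]
```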

Most Picked Models

Experiment 1 Results: Clear preference hierarchy across 54 participants

Pairwise Results

Experiment 2 Results: Controlled validation with 900 pairwise comparisons

Preference Matrix

Preference Heatmap: Statistical significance confirmed by a Chi-Square test (χ² = 61.40, df = 3, p < 0.001)
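
The reported statistic is reproducible from the Experiment 2 selection counts under a uniform no-preference null (900/4 = 225 expected selections per model):

```python
from scipy.stats import chisquare

# Experiment 2 selection counts: ResShift, Real-ESRGAN, SwinIR, BSRGAN
observed = [309, 228, 220, 143]
stat, p = chisquare(observed)  # default expectation: uniform, 225 each
print(f"chi2 = {stat:.2f}, p = {p:.2e}")  # chi2 = 61.40, p << 0.001
```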

User Behavior & Demographic Analysis

Time Distribution

Decision Time Distribution

Time vs Age

Age Impact on Decision Time

Algorithm vs Age

Algorithm Preference by Age Group

Algorithm per Image

Content-Dependent Preferences

Qualitative Insights

Feedback Word Cloud

User Feedback Analysis: Key terms include "sharp," "clear," "natural," and "detailed"

Reasons Word Cloud

Decision Factors: Users prioritized visual naturalness over pixel-perfect accuracy

Key Findings & Implications

Critical Insights:

  1. ResShift's Robustness: Consistent performance across both objective metrics and subjective evaluations confirms its suitability for real-world applications
  2. Metric Limitations: BSRGAN's poor subjective performance despite competitive PSNR/SSIM highlights the inadequacy of traditional metrics for perceptual quality
  3. Perceptual Alignment: LPIPS and CLIPIQA showed better correlation with human preferences than PSNR/SSIM
  4. Content Dependency: Optimal model selection varies significantly with image content type
  5. Statistical Validation: Bradley-Terry modeling and Chi-Square tests (p < 0.001) confirm the reliability of the subjective preferences (see the sketch below)
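
A minimal sketch of a Bradley-Terry fit via Hunter's MM (Zermelo) updates. The win matrix below is an assumed breakdown whose margins match the reported Experiment 2 totals (309/228/220/143 wins, 150 comparisons per pairing); the per-pair split is illustrative, not the actual data.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """MLE strengths p_i for the Bradley-Terry model,
    P(i beats j) = p_i / (p_i + p_j), via the MM update
    p_i <- w_i / sum_j n_ij / (p_i + p_j)."""
    games = wins + wins.T              # n_ij: comparisons between i and j
    w = wins.sum(axis=1)               # total wins per model
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        p = w / (games / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()                   # fix the scale (only ratios matter)
    return p

# Assumed win matrix (rows/cols: ResShift, Real-ESRGAN, SwinIR, BSRGAN);
# margins match Experiment 2, but the per-pair split is illustrative.
W = np.array([[  0,  95,  98, 116],
              [ 55,   0,  78,  95],
              [ 52,  72,   0,  96],
              [ 34,  55,  54,   0]])
print(bradley_terry(W).round(3))  # ResShift receives the largest strength
```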

Future Directions

This research bridges the gap between quantitative metrics and perceptual quality, contributing to the development of SR models that excel in real-world applications. Future work will focus on:

  • Dataset Expansion: Include diverse demographic groups and image types to enhance generalizability
  • Failure Case Analysis: Detailed investigation of problematic outputs to identify improvement areas
  • Hybrid Metrics Development: Integrate objective and subjective components for holistic SR evaluation
  • Adaptive Evaluation: Content-specific metrics that account for varying perceptual requirements

Conference Poster

SCIA 2025 Poster: A comprehensive visual summary of our research findings, experimental setup, and key insights. The poster provides an overview of both objective metrics and subjective evaluation results.