Aligning Subjective and Objective Assessments in
Super-Resolution Models

NTNU - Norwegian University of Science and Technology

Abstract

We present a comprehensive study investigating the alignment between subjective human perception and objective computational metrics in super-resolution models. Through psychophysical experiments and a systematic evaluation of four state-of-the-art super-resolution models (ResShift, BSRGAN, Real-ESRGAN, and SwinIR), we bridge the gap between computational metrics and human visual quality assessment. Our findings reveal significant discrepancies between traditional metrics such as PSNR/SSIM and human preference, with ResShift demonstrating superior performance across both objective metrics and subjective evaluations. This research provides critical insights for developing more perceptually aligned evaluation frameworks for super-resolution systems.

Experimental Setup

Experiment 1 Setup

Experiment 1: Quick Evaluation Interface

  • Total Images: 30
  • Estimated Time: 15 minutes
  • Low-Resolution Image: 255×169 pixels (displayed in the center)
  • High-Resolution Images: 4 images, 1020×676 pixels each (4× the LR resolution)
  • Task: Choose the best HR image from randomized comparisons (see the sketch below)
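
For illustration, a minimal sketch of how such a randomized forced-choice trial list can be generated; the scene identifiers and helper names are hypothetical, not the study's actual implementation:

```python
import random

# Hypothetical scene IDs; the study used 30 test images.
SCENES = [f"scene_{i:02d}" for i in range(30)]
MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def build_trials(seed=None):
    """One observer's trial list: every scene shown once, in random order,
    with the four HR candidates shuffled independently per trial."""
    rng = random.Random(seed)
    trials = []
    for scene in rng.sample(SCENES, k=len(SCENES)):
        candidates = MODELS[:]
        rng.shuffle(candidates)  # hide model identity behind screen position
        trials.append({"scene": scene, "candidates": candidates})
    return trials
```
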
Experiment 2 Setup

Experiment 2: Pairwise Comparison

  • Method: Pairwise comparisons
  • Images: 10 images; with 4 models, C(4,2) = 6 model pairings per image, giving 60 pairs per person
  • Display: BenQ calibrated monitor, sRGB, D65, 80 cd/m²
  • Estimated Time: 15 minutes
  • Task: Subjective quality assessment (pair generation sketched below)
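
The 60 pairs follow directly from the design: 10 images × C(4,2) = 6 unordered model pairings each. A minimal sketch (image and model identifiers are illustrative):

```python
import random
from itertools import combinations

IMAGES = [f"img_{i:02d}" for i in range(10)]  # 10 test images
MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def build_pairs(seed=None):
    """10 images x C(4,2) = 6 model pairings -> 60 comparisons per observer,
    shuffled both in presentation order and in left/right placement."""
    rng = random.Random(seed)
    pairs = []
    for img in IMAGES:
        for a, b in combinations(MODELS, 2):
            left, right = rng.sample((a, b), k=2)  # randomize side
            pairs.append((img, left, right))
    rng.shuffle(pairs)
    return pairs

assert len(build_pairs(seed=0)) == 60
```
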
Main Interface

Main Interface: User interface for the psychophysical experiments

Participant Demographics & Methodology

Age Distribution

Age Distribution: Demographics of study participants across both experiments

Session Completion

Session Statistics: Completion rates showing high participant engagement

Objective Metrics Analysis

Key Finding: ResShift consistently outperformed the other models across most objective metrics, achieving the highest PSNR (25.01) and the best (lowest) LPIPS (0.231). However, BSRGAN's competitive PSNR/SSIM scores contrasted sharply with its poor subjective performance.
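
As a point of reference, the full-reference metrics can be computed with widely used open-source tools. The sketch below assumes scikit-image for PSNR/SSIM and the lpips package for LPIPS, which may differ from the study's exact tooling; CLIPIQA, being a no-reference metric, requires a separate model (available, for example, in the pyiqa package).

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

loss_fn = lpips.LPIPS(net="alex")  # lower LPIPS = perceptually closer

def evaluate(sr: np.ndarray, hr: np.ndarray) -> dict:
    """sr, hr: HxWx3 uint8 images of identical size (SR output vs. ground truth)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=2, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = loss_fn(to_t(sr), to_t(hr)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```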

Objective Analysis

Comprehensive Metric Analysis: Comparison of PSNR, SSIM, LPIPS, and CLIPIQA across all models

PSNR Distribution

PSNR Results on the DIV2K Dataset: Our experimental validation
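
For reference, PSNR for 8-bit images is defined in terms of the mean squared error between the super-resolved image $\hat{x}$ and the ground truth $x$:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right),
\qquad
\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(x_{ij} - \hat{x}_{ij}\right)^2
```

The reported 25.01 dB therefore corresponds to a root-mean-square error of about 255/10^(25.01/20) ≈ 14 gray levels per pixel.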

Paper PSNR

Literature Comparison: PSNR values reported in the original model papers

Subjective Evaluation Results

Experiment 1 (54 observers): ResShift was chosen 624 times, significantly outperforming SwinIR (377), Real-ESRGAN (351), and BSRGAN (268). Participants cited sharpness, absence of artifacts, and color fidelity as key factors.

Experiment 2 (15 observers, 900 comparisons): Validation in a controlled environment confirmed ResShift's dominance with 309 selections, followed by Real-ESRGAN (228), SwinIR (220), and BSRGAN (143).

Borda Count Rankings

Borda Count Rankings: Statistical ranking confirming ResShift's consistent superiority
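For clarity, a minimal Borda-count sketch over ranked ballots; the example ballots are invented for illustration and are not the study's data:

```python
from collections import Counter

MODELS = ["ResShift", "BSRGAN", "Real-ESRGAN", "SwinIR"]

def borda(ballots, candidates=MODELS):
    """Borda count: each ballot ranks candidates best-to-worst;
    rank r (0-based) among n candidates earns n - 1 - r points."""
    n = len(candidates)
    scores = Counter({c: 0 for c in candidates})
    for ballot in ballots:
        for r, c in enumerate(ballot):
            scores[c] += n - 1 - r
    return scores.most_common()

# Invented example ballots, for illustration only:
ballots = [
    ["ResShift", "SwinIR", "Real-ESRGAN", "BSRGAN"],
    ["ResShift", "Real-ESRGAN", "SwinIR", "BSRGAN"],
]
print(borda(ballots))
# [('ResShift', 6), ('Real-ESRGAN', 3), ('SwinIR', 3), ('BSRGAN', 0)]
```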

Most Picked Models

Experiment 1 Results: Clear preference hierarchy across 54 participants

Pairwise Results

Experiment 2 Results: Controlled validation with 900 pairwise comparisons

Preference Matrix

Preference Heatmap: Statistical significance confirmed by a Chi-Square test (χ² = 61.40, df = 3, p < 0.001)
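
The reported statistic is reproducible from the Experiment 2 selection counts under a uniform no-preference null (900/4 = 225 expected selections per model):

```python
from scipy.stats import chisquare

# Experiment 2 selection counts: ResShift, Real-ESRGAN, SwinIR, BSRGAN
observed = [309, 228, 220, 143]
stat, p = chisquare(observed)  # default expectation: uniform, 225 each
print(f"chi2 = {stat:.2f}, p = {p:.2e}")  # chi2 = 61.40, p << 0.001
```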

User Behavior & Demographic Analysis

Time Distribution

Decision Time Distribution

Time vs Age

Age Impact on Decision Time

Algorithm vs Age

Algorithm Preference by Age Group

Algorithm per Image

Content-Dependent Preferences

Qualitative Insights

Feedback Word Cloud

User Feedback Analysis: Key terms include "sharp," "clear," "natural," and "detailed"

Reasons Word Cloud

Decision Factors: Users prioritized visual naturalness over pixel-perfect accuracy

Key Findings & Implications

Critical Insights:

  1. ResShift's Robustness: Consistent performance across both objective metrics and subjective evaluations confirms its suitability for real-world applications
  2. Metric Limitations: BSRGAN's poor subjective performance despite competitive PSNR/SSIM highlights the inadequacy of traditional metrics for perceptual quality
  3. Perceptual Alignment: LPIPS and CLIPIQA showed better correlation with human preferences than PSNR/SSIM
  4. Content Dependency: Optimal model selection varies significantly with image content type
  5. Statistical Validation: Bradley-Terry modeling and Chi-Square tests (p < 0.001) confirm the reliability of the subjective preferences (see the sketch below)
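
A minimal sketch of a Bradley-Terry fit via Hunter's MM (Zermelo) updates. The win matrix below is an assumed breakdown whose margins match the reported Experiment 2 totals (309/228/220/143 wins, 150 comparisons per pairing); the per-pair split is illustrative, not the actual data.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """MLE strengths p_i for the Bradley-Terry model,
    P(i beats j) = p_i / (p_i + p_j), via the MM update
    p_i <- w_i / sum_j n_ij / (p_i + p_j)."""
    games = wins + wins.T              # n_ij: comparisons between i and j
    w = wins.sum(axis=1)               # total wins per model
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        p = w / (games / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()                   # fix the scale (only ratios matter)
    return p

# Assumed win matrix (rows/cols: ResShift, Real-ESRGAN, SwinIR, BSRGAN);
# margins match Experiment 2, but the per-pair split is illustrative.
W = np.array([[  0,  95,  98, 116],
              [ 55,   0,  78,  95],
              [ 52,  72,   0,  96],
              [ 34,  55,  54,   0]])
print(bradley_terry(W).round(3))  # ResShift receives the largest strength
```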

Future Directions

This research bridges the gap between quantitative metrics and perceptual quality, contributing to the development of SR models that excel in real-world applications. Future work will focus on:

  • Dataset Expansion: Include diverse demographic groups and image types to enhance generalizability
  • Failure Case Analysis: Detailed investigation of problematic outputs to identify improvement areas
  • Hybrid Metrics Development: Integrate objective and subjective components for holistic SR evaluation
  • Adaptive Evaluation: Content-specific metrics that account for varying perceptual requirements

Conference Poster

SCIA 2025 Poster: A comprehensive visual summary of our research findings, experimental setup, and key insights. The poster provides an overview of both objective metrics and subjective evaluation results.