GleSAM++

Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement

1Northwestern Polytechnical University, 2Max Planck Institute for Informatics, 3University of Chinese Academy of Sciences

Abstract

Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM++, which uses Generative Latent space Enhancement to boost robustness on low-quality images and thereby generalizes across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representations, thereby improving segmentation. Additionally, to improve compatibility between the pre-trained diffusion model and the segmentation framework, we introduce two techniques, i.e., Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE). However, these components lack explicit guidance about the degree of degradation: the model is forced to implicitly fit a complex noise distribution spanning conditions from mild noise to severe artifacts, which substantially increases the learning burden and leads to suboptimal reconstructions. To address this issue, we further introduce a Degradation-aware Adaptive Enhancement (DAE) mechanism. The key principle of DAE is to decouple the reconstruction of arbitrary-quality features into two stages: degradation-level prediction and degradation-aware reconstruction. This design reduces the optimization difficulty of the model and consequently improves the effectiveness of feature reconstruction. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing efficient optimization. We also construct the LQSeg dataset, with a greater diversity of degradation types and levels, for training and evaluating the model. Extensive experiments demonstrate that GleSAM++ significantly improves segmentation robustness under complex degradations while maintaining generalization to clear images. Furthermore, GleSAM++ also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

[Figure: qualitative comparison across degradation levels]

Qualitative comparison on low-quality images with varying degradation levels from an unseen dataset. To generate images at different degradation levels, we progressively add Gaussian noise, re-sampling noise, and then more severe Gaussian noise to an image. The results indicate that the baseline SAM shows limited robustness to degradation. Although RobustSAM retains some resilience against simpler degradations, it struggles with more complex and unfamiliar ones. In contrast, our method consistently demonstrates strong robustness across images of varying quality.


Method

GleSAM++ contains two key components: Generative Latent Space Enhancement and Degradation-aware Adaptive Enhancement (DAE). DAE decouples the reconstruction of arbitrary-quality features into two stages, degradation-level prediction and degradation-aware reconstruction, as sketched below.
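A minimal sketch of the two-stage idea in PyTorch. The module name, the pooling-based predictor, and the linear mapping from predicted level to denoising timestep are illustrative assumptions made for this sketch, not the paper's exact design:

import torch
import torch.nn as nn

class DegradationPredictor(nn.Module):
    """Stage 1: regress a scalar degradation level from LQ latent features."""
    def __init__(self, channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # pool the feature map to a global descriptor
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),              # level in [0, 1]: 0 ~ clean, 1 ~ severe
        )

    def forward(self, lq_feat: torch.Tensor) -> torch.Tensor:
        # lq_feat: (B, C, H, W) latent features from the frozen image encoder
        return self.head(lq_feat)     # (B, 1) predicted degradation level

def level_to_strength(level: torch.Tensor, max_t: int = 1000) -> torch.Tensor:
    """Stage 2 entry point (assumed mapping): a higher predicted level selects a
    later diffusion timestep, so heavier degradation triggers stronger denoising."""
    return (level.squeeze(-1) * (max_t - 1)).long().clamp(min=1)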

Overview of GleSAM++

[Figure: GleSAM++ pipeline overview]

Given an input image, GleSAM++ performs accurate segmentation through image encoding, generative and adaptive latent space enhancement, and mask decoding. During training, HQ-LQ image pairs are fed into the frozen image encoder to extract the corresponding HQ and LQ latent features. We then adaptively reconstruct high-quality representations in the latent space of SAM by efficiently fine-tuning a generative denoising U-Net with LoRA layers. A degradation-aware prediction module explicitly estimates the degradation level of the input features and uses this estimate to dynamically regulate the denoising strength. Latent space alignment bridges the gaps in feature distribution and structure between the pre-trained latent diffusion model and SAM. Finally, the decoder is fine-tuned with a segmentation loss to adapt to the enhanced latent representations. Built upon SAMs, GleSAM++ inherits prompt-based segmentation and performs well on images of any quality.
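The paragraph above maps to the following schematic forward pass. The module internals are injected as plain nn.Modules, and the level-to-timestep mapping is the illustrative one from the DAE sketch; treat this as a reading aid under those assumptions, not the released implementation:

import torch
import torch.nn as nn

class GleSAMppSketch(nn.Module):
    """Schematic pipeline: encode -> predict degradation -> denoise -> decode."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module,
                 unet: nn.Module, decoder: nn.Module, max_t: int = 1000):
        super().__init__()
        self.encoder = encoder       # frozen SAM/SAM2 image encoder
        self.predictor = predictor   # degradation-aware prediction module
        self.unet = unet             # denoising U-Net, fine-tuned via LoRA layers
        self.decoder = decoder       # mask decoder, fine-tuned with a segmentation loss
        self.max_t = max_t
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # the image encoder stays frozen

    def forward(self, image: torch.Tensor, prompts) -> torch.Tensor:
        lq_feat = self.encoder(image)                  # LQ latent features
        level = self.predictor(lq_feat)                # (B, 1) degradation level
        t = (level.squeeze(-1) * (self.max_t - 1)).long().clamp(min=1)
        hq_feat = self.unet(lq_feat, t)                # degradation-aware reconstruction
        return self.decoder(hq_feat, prompts)          # prompt-based mask prediction

During training, the HQ features extracted from the paired high-quality image would supervise the reconstructed hq_feat, while the segmentation loss fine-tunes the decoder; the latent space alignment step is omitted here for brevity.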


Low-Quality Image Segmentation Dataset

We construct a comprehensive low-quality image segmentation dataset, dubbed LQSeg, that encompasses more complex, multi-level degradations rather than a single degradation type per image. The dataset is composed of images from several existing datasets combined with our synthesized degradations.
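As a concrete illustration, multi-level degradations of this kind can be synthesized by composing several corruptions at increasing severity. The operators and parameter values below (re-sampling scale, noise sigma, JPEG quality) are assumptions chosen for the sketch, not the exact recipe behind LQSeg:

import io
import numpy as np
from PIL import Image

def degrade(img: Image.Image, level: int) -> Image.Image:
    """Compose re-sampling, Gaussian noise, and JPEG artifacts at level 1..3."""
    assert level in (1, 2, 3)
    img = img.convert("RGB")
    w, h = img.size
    # Re-sampling degradation: downsample, then upsample back to the original size.
    scale = (1.0, 0.5, 0.25)[level - 1]
    if scale < 1.0:
        img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BICUBIC)
        img = img.resize((w, h), Image.BICUBIC)
    # Additive Gaussian noise with a level-dependent standard deviation.
    arr = np.asarray(img).astype(np.float32)
    sigma = (5.0, 15.0, 30.0)[level - 1]
    arr += np.random.normal(0.0, sigma, arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # JPEG compression with a level-dependent quality factor.
    quality = (70, 40, 15)[level - 1]
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")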

LQSeg Dataset

[Figure: LQSeg dataset examples]

Examples from the LQSeg dataset illustrating images at three levels of synthetic degradation: LQ-1, LQ-2, and LQ-3. These samples show the progressive quality deterioration used to evaluate the robustness of segmentation models.


Visualizations

Due to the challenging degradations, the original SAM and the enhanced RobustSAM struggle to segment these objects accurately, missing fine details and mispredicting background regions. In contrast, GleSAM and GleSAM++ effectively recover finer details and achieve more precise segmentation results.

Visual Comparisons

[Figure: visual comparisons]

Visual comparisons on the unseen ECSSD, Robust-Seg, and BDD-10K datasets. The results demonstrate the superior generalization of GleSAM++ to unseen degradations not included in the training set.


BibTeX

@inproceedings{guo2025segment,
  title={Segment Any-Quality Images with Generative Latent Space Enhancement},
  author={Guo, Guangqian and Guo, Yong and Yu, Xuehui and Li, Wenbo and Wang, Yaoxing and Gao, Shan},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  pages={2366--2376},
  year={2025}
}