VNS-SAM

Boosting Segment Anything Model to Generalize to Visually Non-Salient Scenarios
Pengfei Chen², Huafeng Chen¹, Boqiang Zhang⁴
¹Northwestern Polytechnical University, ²University of Chinese Academy of Sciences, ³Max Planck Institute for Informatics (MPI-INF), ⁴University of Science and Technology of China

A comparison of masks predicted by SAM and VNS-SAM under three typical non-salient scenarios. SAM often struggles with (a) camouflaged objects, where the object blends seamlessly into its surroundings, (b) polyps, where polyp tissue and normal tissue share the same texture, posing challenges for medical image analysis, and (c) objects in low-light conditions, where the targets lack significant color contrast with their backgrounds. SAM fails to accurately identify object boundaries and complete structures, leading to missing segmentation details and incorrect background predictions. In contrast, VNS-SAM produces more accurate segmentations.

Abstract

Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged in what we refer to as visually non-salient (VNS) scenarios, where there is low contrast between foreground and background. In these cases, existing methods often fail to capture accurate contours and produce unsatisfactory segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), which aims to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: a Mask-Edge Token Interactive decoder and a Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only a marginal increase in parameters and computation. The additional parameters of VNS-SAM can be optimized within 4 hours on 4 GPUs, demonstrating its feasibility and practicality for typical research laboratories. On the data side, we establish VNS-SEG, a unified dataset covering diverse VNS scenarios with more than 36K images, in contrast to previous single-task adaptations. It is designed both to make the model learn more robust VNS features and to comprehensively benchmark segmentation performance and generalizability in VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications.


VNS-SAM: Visually Non-Salient Segment Anything Model

VNS-SAM contains two key components, i.e., the Mask-Edge Token Interactive (METI) decoder and the Non-Salient Feature Mining (NSFM) module, which encourage SAM to learn VNS characteristics.
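The METI decoder is described here only at a high level, so a small illustration may help. The PyTorch sketch below shows one way a learnable edge token and the decoder's mask tokens could exchange information through self-attention before attending to the image features; every class and variable name is hypothetical, and this is not the released implementation.

import torch
import torch.nn as nn

class MaskEdgeTokenInteraction(nn.Module):
    """Toy interaction step: a learnable edge token and the mask tokens
    exchange information via self-attention (illustrative only)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.edge_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable edge query
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mask_tokens):
        # mask_tokens: (B, N_mask, dim) query tokens of a SAM-style decoder
        b = mask_tokens.shape[0]
        edge = self.edge_token.expand(b, -1, -1)                 # (B, 1, dim)
        tokens = torch.cat([mask_tokens, edge], dim=1)           # joint mask/edge token set
        attended, _ = self.token_attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)                    # residual + norm
        return tokens[:, :-1], tokens[:, -1:]                    # updated mask tokens, edge token

In the actual METI decoder the updated edge token presumably goes on to interact with the image embedding so that edge prediction can complement mask prediction; the sketch only shows the token-level exchange.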

Overview of VNS-SAM.


Overview of the proposed VNS-SAM. First, it upgrades SAM's original decoder into the Mask-Edge Token Interactive (METI) decoder through the interaction of edge semantics and a dual-level enhancement of the decoder layers. Second, a lightweight NSFM module is designed to mine inconspicuous discriminative features from the image encoder layers, which serve as complementary features for the prediction layer. During training, the parameters of the pre-trained SAM are frozen and only the newly added parameters in VNS-SAM are trained. During inference, VNS-SAM outputs both a more precise VNS mask and the original SAM outputs. The prompt encoder and prompt tokens are omitted here.
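The caption notes that the pre-trained SAM weights are frozen and only the newly added VNS-SAM parameters are trained. A minimal sketch of that setup, assuming the pre-trained SAM lives under a submodule prefix such as "sam." (a hypothetical name), could look like this:

import torch
import torch.nn as nn

def build_vns_optimizer(model: nn.Module, frozen_prefix="sam.", lr=1e-4):
    """Freeze every pre-trained SAM parameter and return an optimizer that
    updates only the newly added VNS-SAM parameters (e.g., METI and NSFM)."""
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefix):
            param.requires_grad = False        # pre-trained SAM stays frozen
        else:
            trainable.append(param)            # only added modules receive gradients
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=1e-4)

Freezing the backbone in this way is what keeps training light (the abstract cites roughly 4 hours on 4 GPUs), since only the small set of added parameters needs gradients and optimizer state.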

Overview of the proposed NSFM module.


Details of the Non-Salient Feature Mining (NSFM) module. The multi-level features extracted from the backbone are first decomposed into different frequency components. The most informative high-frequency and low-frequency components are then selected, and the multi-level features are aggregated for edge and mask feature extraction.
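The caption does not specify how the decomposition is realized; as one plausible illustration, the sketch below splits each backbone feature map into low- and high-frequency parts with a 2D FFT, projects both parts, and aggregates them across levels. Every name here (split_frequencies, NSFMSketch, the radius cutoff) is an assumption made for illustration, not the paper's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

def split_frequencies(feat, radius=0.25):
    """Split a (B, C, H, W) feature map into low- and high-frequency parts by
    masking the centered 2D spectrum inside / outside a radial cutoff."""
    fft = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    h, w = feat.shape[-2:]
    ys = torch.linspace(-1, 1, h, device=feat.device).view(-1, 1)
    xs = torch.linspace(-1, 1, w, device=feat.device).view(1, -1)
    low_mask = ((ys ** 2 + xs ** 2).sqrt() <= radius).to(fft.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(fft * low_mask, dim=(-2, -1)), norm="ortho").real
    high = feat - low                                        # high-frequency residual
    return low, high

class NSFMSketch(nn.Module):
    """Toy aggregation of low/high-frequency components of multi-level features."""

    def __init__(self, dims, out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(2 * d, out_dim, kernel_size=1) for d in dims])
        self.fuse = nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) feature maps from the image encoder
        target = feats[0].shape[-2:]
        merged = 0
        for f, proj in zip(feats, self.proj):
            low, high = split_frequencies(f)                 # frequency decomposition
            comp = proj(torch.cat([low, high], dim=1))       # mix both bands per level
            merged = merged + F.interpolate(comp, size=target, mode="bilinear", align_corners=False)
        return self.fuse(merged)                             # complementary feature for prediction

A wavelet transform or a learned filter bank would be an equally valid way to realize the decomposition; the intuition is that the high-frequency components presumably carry edge cues while the low-frequency components carry region-level context for the mask.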

VNS-SEG: Visually Non-Salient Segmentation Dataset

To enable segmentation models to effectively learn VNS characteristics, we meticulously construct a unified dataset, VNS-SEG, for training and benchmarking segmentation models on diverse visually non-salient scenarios.

Data composition of the training set of our VNS-SEG.


Data composition of the eval set of our VNS-SEG.


Results on VNS-SEG Benchmark

Detailed results on VNS-SEG, including seen-set and unseen-set evaluation. Three types of prompts are used to comprehensively assess each model. Our models consistently outperform the baseline SAM and other competitors on diverse seen and unseen datasets.

Results on the Eval-Seen-Set.


Results on the Eval-Unseen-Set.


Visual Comparisons.
