Breaking Boundaries in Computer Vision: Segment Anything Model (SAM)

Skann.ai
7 min read · Sep 21, 2023

In the world of computer vision, image segmentation plays a crucial role in identifying and localizing objects within an image. While traditional segmentation models require extensive training with labeled data, the Segment Anything Model (SAM), introduced by Meta AI Research in 2023, does not need any of that! With over 200 citations in a short span, SAM has gained a lot of attention for its groundbreaking zero-shot transfer capabilities. In my opinion, SAM has the potential to revolutionize image segmentation and open new horizons in the world of computer vision. In this article, I will cover the fundamentals of SAM, its architecture, the SA-1B dataset, and how it is reshaping the way we approach image segmentation. Most of the information I share here comes from the paper published by Meta AI and their GitHub repository.

What is SAM?

SAM is designed to be promptable: given a simple prompt such as a point or a box, it can segment objects it has never seen before, without the need for task-specific training data. The concept takes inspiration from large language models in natural language processing (NLP), which can transfer their knowledge to new tasks.

The architecture of SAM:

At the core of SAM’s architecture are three key components: the image encoder, the prompt encoder, and the mask decoder.

The image encoder, pre-trained with a Masked Autoencoder (MAE) approach, efficiently computes image embeddings, providing a rich representation of the input image.

The prompt encoder can handle a wide range of prompts, including points, bounding boxes, masks, and text. And here's the best part: SAM can even work with text prompts encoded by a CLIP model, making it incredibly versatile and powerful. In my opinion, this is a groundbreaking feature that truly sets SAM apart, because accepting both an image and free-form text as input is rare among segmentation models.

Lastly, the mask decoder maps the image and prompt embeddings to accurate, high-quality segmentation masks.
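To make this division of labor concrete, here is a minimal sketch using the official segment-anything package (the detailed setup follows later in this article; the checkpoint path and the box coordinates are placeholders): set_image runs the heavy image encoder once, while each predict call only runs the lightweight prompt encoder and mask decoder.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM model (the checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the image encoder runs exactly once here

# The prompt encoder and mask decoder run per prompt; here, a single bounding box.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),  # hypothetical XYXY box around an object
    multimask_output=False,
)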

The architecture of SAM

Resolving Ambiguity with SAM

SAM boasts a unique feature that allows it to predict multiple output masks when presented with ambiguous prompts. This capability is particularly useful when segmenting specific parts of objects or the intersection of multiple objects. SAM addresses such ambiguity by generating three mask outputs, each with a confidence score (predicted IoU) to rank their relevance. This way, SAM ensures accurate and diverse segmentation under various prompting conditions.
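Assuming a predictor has been set up as in the sketch above, a single ambiguous foreground point returns three candidate masks together with their predicted IoU scores:

import numpy as np

# A single point is inherently ambiguous (part, whole object, or surrounding region).
point = np.array([[250, 250]])  # hypothetical pixel coordinate
label = np.array([1])           # 1 marks a foreground point

masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,      # ask SAM for its three candidate masks
)

for i, score in enumerate(scores):
    print(f"mask {i}: predicted IoU = {score:.3f}")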

The SA-1B Dataset

Comprising over 1.1 billion high-quality masks from 11 million licensed and privacy-respecting images, SA-1B is the largest segmentation dataset to date. The dataset was collected in three stages: an assisted-manual stage, a semi-automatic stage, and a fully automatic stage. This collection process combined human annotation with model-assisted annotation to ensure a large set of accurately annotated images.
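If you do download a shard of SA-1B, the masks ship in per-image JSON files as COCO run-length encodings (RLE). Here is a minimal sketch of decoding one, assuming pycocotools is installed (the file name is a placeholder):

import json
from pycocotools import mask as mask_utils

# Each SA-1B image has a matching JSON file of annotations (placeholder file name).
with open("sa_000000.json") as f:
    annotations = json.load(f)["annotations"]

# Every annotation stores its mask as a COCO run-length encoding.
first_mask = mask_utils.decode(annotations[0]["segmentation"])  # HxW uint8 array
print(first_mask.shape, int(first_mask.sum()), "foreground pixels")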

Zero-Shot Transfer Results

When defining SAM, I mentioned its zero-shot transfer capabilities; in this section, let's look at how well SAM performs thanks to this ability. SAM's zero-shot transfer has been evaluated on a suite of 23 segmentation datasets, and it outperformed the strong interactive baseline RITM on 16 of them. SAM also shows excellent performance on tasks like edge detection, object proposal generation, instance segmentation, and even text-to-mask applications. Its ability to perform well without task-specific training demonstrates its adaptability and potential for a variety of real-world applications.

Conclusion

SAM, the Segment Anything Model, is an ambitious new development in the field of computer vision. With its promptable, zero-shot transfer capabilities, SAM can segment almost any object without the extensive training on labeled data that traditional models require, which makes it remarkably efficient to apply. The SA-1B dataset underpins SAM's capabilities, making it useful for researchers and developers in different fields, including medical imaging and 3D rendering. As SAM continues to evolve, I believe we can expect even more advancements in the world of computer vision. SAM has undoubtedly set a new standard for segmentation models, unlocking exciting possibilities in image analysis and interaction with the digital world.

Now that we have explored the Segment Anything Model (SAM), let's delve into how we can implement it in practical applications. Implementing SAM involves a series of steps, from setting up the necessary libraries and dependencies to loading a pre-trained checkpoint and generating masks. Below are the basic steps for using SAM for image segmentation.

Step 1: Setting Up the Environment

Before we begin, a word about the environment in which you plan to run your code. Running SAM at a reasonable speed practically requires a GPU, so working locally means configuring CUDA yourself, which you may prefer to avoid. I suggest using https://colab.research.google.com/ to execute your code with GPU support and with most of the necessary libraries pre-installed.

We will need the PyTorch, OpenCV, and Matplotlib libraries to load our images, process them, and visualize the results.

!pip install torch
!pip install opencv-python
!pip install matplotlib

Step 2: Preprocessing your image

import cv2
import matplotlib.pyplot as plt

image = cv2.imread('micra.jpeg')                # load the image (OpenCV reads in BGR order)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # convert to RGB for Matplotlib and SAM

plt.figure(figsize=(10, 10))
plt.imshow(image)
plt.axis('off')
plt.show()  # see our image before segmenting

Step 3: Obtaining SAM Model and SA-1B Dataset

To utilize SAM, we need to download the pre-trained SAM model checkpoint. You can find the model checkpoint on the official SAM website (https://segment-anything.com/). Download the checkpoint file and save it to your preferred directory.

!pip install git+https://github.com/facebookresearch/segment-anything.git

Similarly, you can find the necessary information and download links for the SA-1B dataset on the same website. Note that if you only want to run a pre-trained SAM model, you do not need to download the dataset at all, since the released checkpoints are already trained on it.
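For reference, the ViT-H checkpoint used later in this article can be fetched directly with wget. The URL below is the one listed in the official repository README at the time of writing; double-check it there if the download fails.

!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth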

Step 4: Loading and Configuring SAM Model

Now that we have the SAM model checkpoint, we can load it into our Python environment using PyTorch's built-in functions.

import torch
from segment_anything import sam_model_registry

sam_checkpoint = "/content/drive/MyDrive/sam_vit_h_4b8939.pth"
model_type = "vit_h"  # I used the ViT-H model; there are two other variants (vit_l and vit_b)
device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

Step 5: Performing Segmentation with SAM

Our environment is set up and the model is loaded, so we can now perform segmentation with SAM by giving it our image. At this point, it is important to adjust the parameters of the SamAutomaticMaskGenerator class to get more accurate masks for your particular image. I tried different values for my image, and the ones below worked well.

from segment_anything import SamAutomaticMaskGenerator

mask_generator_ = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,
    pred_iou_thresh=0.97,             # (IoU) raise this threshold to avoid overlapping or duplicate masks
    stability_score_thresh=0.96,
    crop_n_layers=1,
    crop_n_points_downscale_factor=2,
    min_mask_region_area=10000,       # adjust according to your image size
)

masks = mask_generator_.generate(image)
print(len(masks))  # to see how many masks we have

SAM will process the image and generate a segmentation mask for every object it finds, using a regular grid of points as prompts.
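Each entry in masks is a dictionary. As a quick sanity check, you can inspect the first one; the keys listed below are the ones documented in the segment-anything repository.

first = masks[0]
print(first.keys())
# typically: 'segmentation' (HxW boolean mask), 'area', 'bbox' (XYWH),
# 'predicted_iou', 'point_coords', 'stability_score', 'crop_box'
print(first['segmentation'].shape, first['area'], first['predicted_iou'])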

Step 6: Post-processing and Visualization

Now let’s visualize the segmentation masks on the original image using Matplotlib or any other image visualization library.

import numpy as np

def show_anns(anns):  # helper from the segment-anything demo notebook to overlay the masks
    if len(anns) == 0:
        return
    sorted_anns = sorted(anns, key=(lambda x: x['area']), reverse=True)
    ax = plt.gca()
    ax.set_autoscale_on(False)
    for ann in sorted_anns:
        m = ann['segmentation']                        # HxW boolean mask
        img = np.ones((m.shape[0], m.shape[1], 3))
        color_mask = np.random.random((1, 3)).tolist()[0]
        for i in range(3):
            img[:, :, i] = color_mask[i]               # fill the overlay with a random color
        ax.imshow(np.dstack((img, m * 0.35)))          # use the mask as a 35% alpha channel

plt.figure(figsize=(10, 10))
plt.imshow(image)
show_anns(masks)
plt.axis('off')
plt.show()

After this code snippet, we can see our labeled image. As an example, I included an image from the SA-1B dataset, which is shown with its masks on the official segment-anything website; you can explore many more labeled images there.

An example labeled image from the SA-1B dataset

In the implementation above, we used a pre-trained model. Now, I'd like to address another aspect: fine-tuning SAM. Fine-tuning lets you adapt SAM for better performance on your custom dataset and real-world applications. Below are the basic steps of fine-tuning; by following them, you can adapt SAM to the requirements of your specific image segmentation task. Remember to experiment with different prompts and loss functions to achieve the best results for your project.

Creating a Custom Dataset: If the SA-1B dataset does not meet your requirements, you can create your own dataset or utilize open-source datasets.

Bounding boxes and masks: Extract the bounding box coordinates from your data, which will be fed into SAM as prompts, and then extract the corresponding ground-truth segmentation masks.

Preprocessing the image: We need to convert the images to the format that SAM’s built-in functionalities require.

Adding an optimizer and a loss function: Define an optimizer over the parameters you want to update and a segmentation loss that compares the predicted masks with the ground truth.

Running the fine-tuning: In the training loop, first compute the image embeddings with SAM's image encoder, then obtain the prompt embeddings with SAM's prompt encoder, and generate masks from both using the mask decoder. Then post-process the predicted masks, calculate the loss against the ground truth, and step the optimizer each epoch; a sketch of such a loop is shown after this list. In this way, we obtain our fine-tuned model.

Testing and comparing: We can compare the fine-tuned model with the original one. For inference with the fine-tuned model, we can wrap it in a SamPredictor.
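To make these steps more concrete, here is a minimal, hedged sketch of such a training loop, loosely following common community fine-tuning recipes. It reuses the sam model and device loaded in Step 4, trains only the mask decoder with bounding-box prompts, and treats train_samples, the learning rate, the number of epochs, and the choice of binary cross-entropy on the mask logits as assumptions you should adapt to your own data.

import numpy as np
import torch
from torch.nn.functional import binary_cross_entropy_with_logits
from segment_anything.utils.transforms import ResizeLongestSide

# Reuses `sam` and `device` from Step 4. Only the mask decoder is trained.
transform = ResizeLongestSide(sam.image_encoder.img_size)
optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-5)  # placeholder learning rate
num_epochs = 5  # placeholder

# `train_samples` is a placeholder iterable of (HxWx3 uint8 RGB image, XYXY box, HxW binary mask).
for epoch in range(num_epochs):
    for image, box, gt_mask in train_samples:
        original_size = image.shape[:2]

        # 1) Preprocess the image and compute image embeddings (frozen image encoder).
        input_image = transform.apply_image(image)
        input_tensor = torch.as_tensor(input_image, device=device).permute(2, 0, 1)[None]
        input_size = tuple(input_tensor.shape[-2:])
        with torch.no_grad():
            image_embedding = sam.image_encoder(sam.preprocess(input_tensor))

            # 2) Encode the bounding-box prompt (frozen prompt encoder).
            box_np = transform.apply_boxes(np.asarray(box, dtype=np.float32)[None, :], original_size)
            box_torch = torch.as_tensor(box_np, dtype=torch.float, device=device)
            sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box_torch, masks=None)

        # 3) Predict a mask with the (trainable) mask decoder.
        low_res_masks, _ = sam.mask_decoder(
            image_embeddings=image_embedding,
            image_pe=sam.prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse_emb,
            dense_prompt_embeddings=dense_emb,
            multimask_output=False,
        )

        # 4) Upscale to the original resolution, compute the loss, and update the decoder.
        pred = sam.postprocess_masks(low_res_masks, input_size, original_size)
        target = torch.as_tensor(gt_mask, dtype=torch.float, device=device)[None, None]
        loss = binary_cross_entropy_with_logits(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Freezing the image and prompt encoders keeps memory use manageable and preserves SAM's general-purpose features while the decoder adapts to your masks.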

For further code implementation of fine-tuning you may take a look at this link.

I have talked about SAM and its features, and I hope you now understand the power of SAM and why I find it so useful.

Thanks for reading and hope to see you in another article!

Gizem Ayaz
