Hello Reader,

Welcome to another edition of the PYCAD newsletter, where we cover interesting topics in Machine Learning and Computer Vision applied to Medical Imaging. The goal of this newsletter is to help you stay up-to-date and learn important concepts in this amazing field! I've got some cool insights for you below ↓

Vision Transformer Made Simpler

I used to think that CNNs were a must when it comes to vision tasks. I thought that transformers would not be able to fully replace them for vision.

I was wrong. Here’s why.

There is a new vision transformer model from Meta AI. It’s called Hiera.

This vision transformer outperforms previous models in both accuracy and speed.

The performance was measured on image classification and video classification tasks.

But what makes Hiera impressive is its simplicity in design.

This model does NOT have:

Convolutional layers.
Shifted windows.
Attention bias.

Adding these techniques has historically been necessary to make transformers work well enough for vision tasks.

Why were these techniques necessary to make vision transformers work well?

Because they addressed some important drawbacks that vanilla vision transformers had. For example: lack of inductive bias.

So how did Hiera address these drawbacks without the use of these techniques?

By using a self-supervised learning technique for learning visual representations. The technique is called “masked pretraining”.
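To make the idea of masked pretraining a bit more concrete, here is a minimal sketch in plain Python. It is not Hiera's actual code: the function names `random_patch_mask` and `masked_mse`, the patch grid size, and the mask ratio are all assumptions chosen for illustration. The sketch hides a random subset of patches and scores a reconstruction only on the hidden ones, which is the general recipe behind masked autoencoder style training.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.6, seed=None):
    """Split patch indices of a grid_h x grid_w grid into visible and masked sets.

    mask_ratio is an illustrative default, not a value taken from the newsletter.
    """
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = int(n_patches * mask_ratio)
    ids = list(range(n_patches))
    rng.shuffle(ids)
    masked = sorted(ids[:n_masked])    # hidden from the encoder
    visible = sorted(ids[n_masked:])   # the only patches the encoder sees
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked patches only.

    pred and target are lists of patches, each patch a flat list of pixel values.
    """
    sq_errors = [
        (p - t) ** 2
        for i in masked
        for p, t in zip(pred[i], target[i])
    ]
    return sum(sq_errors) / len(sq_errors) if sq_errors else 0.0


if __name__ == "__main__":
    visible, masked = random_patch_mask(grid_h=7, grid_w=7, mask_ratio=0.6, seed=0)
    print(len(visible), "visible patches,", len(masked), "masked patches")
```

In a real setup, the encoder would process only the visible patches and a lightweight decoder would predict the masked ones, so the model has to learn spatial structure from the data itself.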

Below is an image that shows the full architecture of the model when trained with this technique.

Hiera achieves some impressive results for image and video tasks. For example:

On the ImageNet-1K dataset, it had a 0.3% accuracy gain while being almost twice as fast during inference compared to ConvNeXt V2-B.

On the Kinetics-400 dataset, it had a 2.5% accuracy gain while being almost 3 times faster compared to ViT-B.

You can check out the paper here and the code here.

Improving Object Detection Results without Training

Detecting small objects in images has always been a difficult task for deep learning models. There is an approach that can improve your results considerably.

This approach is called Slicing Aided Hyper Inference or SAHI for short.

It is extremely simple, yet very effective.

You basically take your input image and divide it into patches.

You then resize these patches and pass everything to your model: the original image and the resized patches.

Then you aggregate the results and filter out duplicate detections based on an IoU (intersection over union) threshold.
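The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the SAHI library's API: `make_slices`, `iou`, `merge`, and `sliced_inference` are hypothetical names, and `detect` stands in for your own model. It is assumed to take a window `(x0, y0, x1, y1)`, handle cropping and resizing internally, and return `(box, score)` pairs in window-local pixel coordinates. The default overlap and IoU threshold are illustrative values.

```python
def make_slices(img_w, img_h, size, overlap=0.2):
    """Return (x0, y0, x1, y1) windows that tile the image with some overlap."""
    step = max(1, int(size * (1 - overlap)))
    xs = list(range(0, max(img_w - size, 0) + 1, step))
    ys = list(range(0, max(img_h - size, 0) + 1, step))
    # Add a final window so the right and bottom edges are always covered.
    if xs[-1] + size < img_w:
        xs.append(img_w - size)
    if ys[-1] + size < img_h:
        ys.append(img_h - size)
    return [(x, y, min(x + size, img_w), min(y + size, img_h)) for y in ys for x in xs]


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge(dets, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the best box, drop its duplicates."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def sliced_inference(detect, img_w, img_h, size, overlap=0.2, iou_thresh=0.5):
    """Run `detect` on the full image and on every slice, then merge the results."""
    dets = list(detect((0, 0, img_w, img_h)))  # full-image pass
    for x0, y0, x1, y1 in make_slices(img_w, img_h, size, overlap):
        for (bx0, by0, bx1, by1), score in detect((x0, y0, x1, y1)):
            # Shift slice-local boxes back into full-image coordinates.
            dets.append(((bx0 + x0, by0 + y0, bx1 + x0, by1 + y0), score))
    return merge(dets, iou_thresh)
```

The key detail is the coordinate shift: each slice reports boxes relative to its own top-left corner, so they have to be moved back into the full image before duplicates can be compared and filtered.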

The technique can be used directly with your trained object detection model without any finetuning.

It can also be used as a data augmentation technique during training.

From what the paper has reported, the results are very impressive.

If it's used without finetuning, you get an AP increase of 6.8%, 5.1% and 5.3% for the FCOS, VFNet and TOOD detectors, respectively.

With finetuning, you get a cumulative AP increase of 12.7%, 13.4% and 14.5% in the same order.

You can read more about it in the original paper. You can also check the code here.

Cool AI Tools (Affiliates)

BlogAssistant helps minimize AI content flags, so you can produce articles in over 30 languages that draw readers in without giving off creepy robot vibes. It also lets you retrieve images for your content that are always 100% royalty-free!
