VGG-TSwinformer: Transformer-based deep learning model for early Alzheimer’s disease prediction.


Patient suffering from Alzheimer’s disease


Zhentao Hu, Zheng Wang, Yong Jin, Wei Hou (School of Artificial Intelligence, Henan University, Zhengzhou 450046, China; College of Computer and Information Engineering, Henan University, Kaifeng 475004, China)


Mild cognitive impairment (MCI) is a transitional phase between normal aging and Alzheimer's disease (AD); about 44% of people with MCI go on to develop AD within three years.
Patients with MCI show both anatomical brain abnormalities and cognitive deterioration, and those who progress to AD exhibit the fastest rate of brain atrophy.
Because progressive MCI (pMCI) patients have a higher risk of developing AD, differentiating pMCI from stable MCI (sMCI) is essential for early AD diagnosis and treatment.
Classifying pMCI versus sMCI is harder than distinguishing AD from cognitively normal (CN) individuals, because the differences in cognition and brain structure between pMCI and sMCI are subtle.

Problem and Limitations:

The field of computer-aided diagnosis for Alzheimer's disease (AD) has made great progress thanks to deep learning techniques, which in some settings have even surpassed manual diagnosis.
Two Groups of AD Diagnostic Techniques: AD diagnostic methods fall into two groups: classification models and progression models. Classification models predict diagnostic labels, such as AD versus cognitively normal (CN), while progression models try to quantify the course of the disease, such as differentiating progressive MCI (pMCI) from stable MCI (sMCI).
Current Methodologies for Progression Models: A number of approaches have been proposed for progression models, including 3D CNN architectures, unsupervised learning models, and sparse regression combined with deep learning; their accuracy in differentiating pMCI from sMCI has varied.
Problems with Longitudinal Studies of MCI: Longitudinal MCI studies must process long sequences of 3D medical image data, which leads to high-dimensional inputs and overfitting on small datasets. Current methods, such as manually extracting features and combining them with an RNN, may not fully capture the brain features associated with AD.
Transformer-Based Models' Potential: Transformer-based models, first used in natural language processing, have shown promise in resolving the difficulties of longitudinal studies. In particular, the Swin Transformer has demonstrated state-of-the-art performance in computer vision tasks, offering a viable way to handle short-term longitudinal MRI data.
VGG-TSwinformer Model Overview: The proposed VGG-TSwinformer model combines CNN and Transformer architectures to capture both spatial and temporal information for accurate prediction of MCI progression.


VGG-16 Architecture as discussed in the study by Zhentao Hu, Zheng Wang, Yong Jin, Wei Hou

The VGG-16 convolutional neural network, well known for effective feature extraction, serves as the backbone for brain image processing. It comprises 13 convolutional layers and 5 pooling layers, with additional convolutional layers for channel extension and feature mapping. A convolutional layer with ReLU activation maps each input slice from slice series T1 and T2 to a 3×3×512 output feature map; max pooling produces this map, with its size dictated by the stride and pooling window. The low-level features extracted by convolution yield 256 features per slice, which are mapped to tokens, producing two token series that represent the spatial features of T1 and T2. After position embedding, these token series are passed through temporal attention blocks for further analysis, enabling prediction of disease progression.
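As a rough numpy sketch of this slice-to-token mapping (the 3×3×512 feature-map shape and the 256-dimensional token size come from the description above; the random stand-in features and the flatten-then-project step are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per the description: VGG-16 maps each axial slice to a 3x3x512 feature
# map, which is then reduced to a 256-dimensional token (token dim C = 256).
N_SLICES, C = 80, 256                               # 40 slices x 2 time points
feat = rng.standard_normal((N_SLICES, 3, 3, 512))   # stand-in for VGG-16 output

# Hypothetical linear projection from the flattened feature map to a token.
W = rng.standard_normal((3 * 3 * 512, C)) * 0.01
tokens = feat.reshape(N_SLICES, -1) @ W             # (80, 256)

# Split back into the two slice series T1 and T2 (40 tokens each).
t1_tokens, t2_tokens = tokens[:40], tokens[40:]
print(tokens.shape, t1_tokens.shape)                # (80, 256) (40, 256)
```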


Self-attention process

Self-Attention (SA):
SA captures internal correlations within a data sequence, such as tokens or image patches.
Queries, keys, and values are derived from the input tokens.
Token relevance is determined by weight coefficients computed from the similarity between each query and key.
The weighted sum of the value vectors represents fused information from all tokens.
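These steps correspond to standard scaled dot-product self-attention, sketched below in numpy (the weights are random and the 40×256 token shape is an assumption borrowed from the slice-token setup; this is an illustration, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one token series."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv   # queries, keys, values
    d = Q.shape[-1]
    # Weight coefficients: similarity between each query and every key.
    attn = softmax(Q @ K.T / np.sqrt(d))
    # Each output token is a weighted sum of the value vectors.
    return attn @ V

rng = np.random.default_rng(0)
C = 256
tokens = rng.standard_normal((40, C))       # e.g. 40 slice tokens of dim 256
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.05 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)                            # (40, 256)
```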

Multi-Head Self-Attention (MSA):
MSA enhances SA by adding multiple independent query, key, and value groups.
Each head captures distinct characteristics of the data, improving understanding.
Concatenating the head outputs creates a full representation.

Transformer Block:
Consists of layer normalization (LN), SA or MSA, and a multi-layer perceptron (MLP).
LN stabilizes gradients, and residual connections combine the input with the SA/MSA output.
The MLP refines features using GELU activation, increasing the model's representational capability.
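A minimal numpy sketch of such a block, combining LN, MSA, and a GELU MLP with residual connections (all weights are random and the head count H = 8 is an assumption; the real model's layer ordering and dimensions may differ):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, heads, Wq, Wk, Wv):
    """Multi-head self-attention: heads attend independently, outputs concatenated."""
    d = x.shape[1] // heads
    outs = []
    for h in range(heads):
        Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]     # (N, d) each
        outs.append(softmax(Q @ K.T / np.sqrt(d)) @ V)
    return np.concatenate(outs, axis=-1)              # (N, C)

def transformer_block(x, heads, Wq, Wk, Wv, W1, W2):
    x = x + msa(layer_norm(x), heads, Wq, Wk, Wv)     # LN -> MSA -> residual
    x = x + gelu(layer_norm(x) @ W1) @ W2             # LN -> MLP (GELU) -> residual
    return x

rng = np.random.default_rng(0)
N, C, H = 40, 256, 8
x = rng.standard_normal((N, C))
Wq = [rng.standard_normal((C, C // H)) * 0.05 for _ in range(H)]
Wk = [rng.standard_normal((C, C // H)) * 0.05 for _ in range(H)]
Wv = [rng.standard_normal((C, C // H)) * 0.05 for _ in range(H)]
W1 = rng.standard_normal((C, 4 * C)) * 0.05
W2 = rng.standard_normal((4 * C, C)) * 0.05
y = transformer_block(x, H, Wq, Wk, Wv, W1, W2)
print(y.shape)                                        # (40, 256)
```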

VGG-TSwinformer model:

A VGG-16 CNN extracts high-level features from the slices in T1 and T2, preparing them for further processing.

Attention Mechanisms: To improve the model's capacity to capture longitudinal changes, ten attention blocks (five spatial and five temporal) are used for feature integration.

Temporal Attention and MSA: MSA is performed in the temporal attention block to extract local longitudinal features, which are essential for identifying minute changes in brain morphology over time.

Refinement with RSwin and LSwin Blocks: RSwin and LSwin blocks divide the token series into windows and perform MSA within them, ensuring thorough integration across spatial dimensions and refining feature fusion.

Complete Feature Integration: The model concludes with MSA over every token in T2, ensuring complete feature integration for precise predictions in MCI research.
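A hedged numpy sketch of window-restricted attention in this spirit (tokens are used directly as queries, keys, and values for brevity; the window size, the shift, and the LSwin/RSwin correspondence are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def window_attention(tokens, window, shift=0):
    """Attend only within fixed-size windows of the token series.
    shift=0 mimics one partition (e.g. LSwin); a nonzero shift offsets the
    windows so adjacent ones exchange information (e.g. RSwin)."""
    x = np.roll(tokens, -shift, axis=0)
    N, C = x.shape
    out = np.empty_like(x)
    for s in range(0, N, window):
        w = x[s:s + window]                      # tokens in this window
        attn = softmax(w @ w.T / np.sqrt(C))     # attention restricted to window
        out[s:s + window] = attn @ w
    return np.roll(out, shift, axis=0)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((40, 256))
fused = window_attention(tokens, window=8)                    # plain partition
fused_shifted = window_attention(tokens, window=8, shift=4)   # shifted partition
print(fused.shape)                                            # (40, 256)
```

Alternating plain and shifted partitions lets information propagate between neighboring windows without ever computing full-sequence attention, which is the core idea behind Swin-style blocks.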


The study was performed on the Alzheimer's Disease Neuroimaging Initiative (ADNI) [36] database to verify the performance of the proposed model.

Normalization with FSL: Using the FMRIB Software Library (FSL), the sMRI images were normalized into the MNI152 standard space, yielding uniform dimensions of 182 × 218 × 182 (X × Y × Z) with a spatial resolution of 1 × 1 × 1 mm³ per voxel.

Skull Stripping: To improve the quality of later analysis, non-brain tissues were removed from the spatially normalized sMRI images via skull stripping.

Unified Bias Field Correction with ANTs: Unified bias field correction was applied using Advanced Normalization Tools (ANTs) to reduce intensity variations caused by imaging errors and improve the accuracy of subsequent processing stages.

Axial Slice Selection: From the preprocessed sMRI images, axial slices were chosen along the vertical axial direction, beginning in the middle and extending toward both ends. Each image thus contributed 40 axial slices, for a total of 80 slices per sample across the two time points.

Formation of Slice Series T1 and T2: After slicing, each sample comprised two 3D sMRI images corresponding to slice series T1 and T2. Both series contained the same number of slices, each of size 182 × 218 × 1.
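The slice-selection step can be sketched with numpy indexing (the exact symmetric-about-the-middle index choice is an assumption consistent with the description above; the volume here is an empty placeholder, not real ADNI data):

```python
import numpy as np

# Preprocessed sMRI volume in MNI152 space: 182 x 218 x 182 (X x Y x Z).
volume = np.zeros((182, 218, 182), dtype=np.float32)

# Choose 40 axial (Z-plane) slices, starting at the middle of the volume
# and extending symmetrically toward both ends.
n_slices = 40
mid = volume.shape[2] // 2                          # middle axial index
zs = range(mid - n_slices // 2, mid + n_slices // 2)
slices = np.stack([volume[:, :, z] for z in zs])    # (40, 182, 218)

# The two time points (T1 and T2) of a sample give 80 such slices in total,
# each of size 182 x 218 (x 1 channel).
print(slices.shape)                                 # (40, 182, 218)
```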


The 823 samples were divided into three subsets: 65% for training, 20% for validation, and 15% for testing. Because of GPU restrictions, we set the token dimension C = 256 and the number of slices N = 80 per sample across the two slice series. The model was trained for 100 epochs with a learning rate of 1e-5 and a momentum of 9e-5, using SGD [37] as the optimizer with a weight decay of 0.1 to prevent overfitting. Cross-entropy [38] was the loss function of choice. Five experiments were conducted, each randomly selecting different training, validation, and test subsets in order to eliminate randomness.
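A sketch of the stated training setup, assuming the standard SGD-with-momentum update and softmax cross-entropy (the hyperparameter values are taken verbatim from the text; the update rule and helper functions are illustrative, not the authors' training script):

```python
import numpy as np

# Hyperparameters as reported above.
lr, momentum, weight_decay, epochs = 1e-5, 9e-5, 0.1, 100

def cross_entropy(logits, label):
    """Softmax cross-entropy from raw logits (e.g. pMCI vs. sMCI)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label])

def sgd_step(w, grad, vel):
    """One SGD-with-momentum update, with weight decay added to the gradient."""
    grad = grad + weight_decay * w
    vel = momentum * vel - lr * grad
    return w + vel, vel

w, vel = np.ones(4), np.zeros(4)
w, vel = sgd_step(w, np.full(4, 0.5), vel)   # weights nudged down slightly
print(cross_entropy(np.array([2.0, -1.0]), 0))   # ~0.0486
```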


Experiment with Replaced Slice Series: In each sample, all slices from slice series T1 were replaced with slices from slice series T2. Five controlled experiments showed that when samples contain no information about longitudinal changes in the brain anatomy of individuals with mild cognitive impairment, the model's performance is not competitive with other algorithms.

Model Sensitivity and Specificity Comparison: The original experiment's sensitivity was higher than the control experiment's, suggesting that the model is responsive to alterations in brain anatomy. Its specificity, however, was lower, indicating a larger percentage of misdiagnoses; since a missed diagnosis is more unacceptable in clinical settings than a misdiagnosis, the higher sensitivity is the more valuable property.
Impact of Pre-trained CNN: Experiments loading VGG-16 with pre-trained weights in VGG-TSwinformer showed that the accuracy, specificity, and AUC of the model trained from scratch were higher than those of the pre-trained model. This suggests that prediction performance is not enhanced by using a pre-trained VGG.

Comparing Various Plane Slices: MRI volumes offer three plane views: sagittal, coronal, and axial. Comparing the model's performance with slices from each plane showed varying effects. Although axial slices let the model perform most comprehensively, a single plane is insufficient to extract all the information in a 3D MRI; combining the three planes might work better.


Accuracy graph and Sensitivity vs Specificity as discussed in the study by Zhentao Hu, Zheng Wang, Yong Jin, Wei Hou

Model Superiority: The proposed VGG-TSwinformer model outperforms cross-sectional deep learning methods, achieving an accuracy of 77.2%, a sensitivity of 79.97%, and an AUC of 0.8153.

VGG-TSwinformer Summary and Confirmation: For longitudinal MCI studies, VGG-TSwinformer extracts sMRI features with a VGG-16 CNN and captures anatomical brain changes using sliding-window methods and temporal attention. Validated on the ADNI database, its diagnostic efficacy exceeds that of cross-sectional approaches. Limitations include problems with feature fusion, suggesting that future research should use multimodal biomarkers.

Alka Gupta

I am a B.Tech, CSE AIML, 3rd year student from the KIET Group of Institutions, Ghaziabad.

I am proficient in languages such as C/C++, Python, HTML, CSS, and JavaScript; use developer tools like Jupyter, Google Colab, and Anaconda; am adept in frameworks such as TensorFlow, Scikit-Learn, Keras, and OpenCV; and have experience with database technologies, including SQL and MySQL.