Predicting Brain Microstructure from bSSFP MRI with Deep Learning

This post was AI-generated from the project’s source code, thesis, and documentation. It is an automated summary, not original writing.

The Dataset: DOVE

The research leverages the DOVE dataset (acquired at the Department of High-field Magnetic Resonance, Max Planck Institute for Biological Cybernetics), which is described as one of the largest collections of phase-cycled bSSFP brain images ever acquired. Key specifications:

| Parameter | bSSFP | DWI | T1w (MP2RAGE) |
|---|---|---|---|
| Subjects | 120 healthy | 120 healthy | 120 healthy |
| Sessions | 3 per subject | 4 per subject | 1 per subject |
| Resolution | 1.4 mm isotropic | 2 mm isotropic | 1 mm isotropic |
| Phase cycles | 12 (0-360 deg) | - | - |
| TR / TE | 4.8 / 2.4 ms | 3500 / 62 ms | 5000 / 2.98 ms |
| b-value | - | 1000 s/mm^2 | - |
| B0 field | 3T | 3T | 3T |
| Scanner | Siemens Prisma | Siemens Prisma | Siemens Prisma |

The dataset was organized following the BIDS standard (Brain Imaging Data Structure), converted from DICOM using BIDScoin, and all modalities were coregistered to the T1-weighted images using SPM.

By cross-pairing bSSFP and DWI sessions within each subject (3 bSSFP sessions x 4 DWI sessions = 12 pairs per subject), the authors constructed approximately 1,077 training samples — a creative way to multiply the effective dataset size from a limited number of scans.
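
The cross-pairing amounts to a Cartesian product of the two session lists per subject. A minimal sketch (the session labels are illustrative, not the dataset's actual identifiers):

```python
from itertools import product

# Hypothetical session labels; DOVE pairs 3 bSSFP sessions with
# 4 DWI sessions within each subject.
bssfp_sessions = ["bssfp-1", "bssfp-2", "bssfp-3"]
dwi_sessions = ["dwi-1", "dwi-2", "dwi-3", "dwi-4"]

pairs = list(product(bssfp_sessions, dwi_sessions))
print(len(pairs))  # 12 source/target pairs per subject
```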


Preprocessing: From Raw Scans to Neural Network Input

The preprocessing pipeline is surprisingly involved, reflecting the real-world complexity of working with multi-modal MRI data.

bSSFP Preprocessing

  1. Phase rescaling: Scanner output [0, 4095] converted to radians [-pi, pi]
  2. Phase correction: Each phase cycle multiplied by a complex exponential to remove global phase offsets
  3. Gibbs ringing removal: Using DiPy to suppress truncation artifacts
  4. Resampling: Complex data resampled to match DWI resolution (2 mm isotropic)
  5. Masking: Brain mask derived from DWI applied
  6. Normalization: Magnitude and phase each normalized to [0, 1] across the entire dataset
  7. Channel interleaving: Magnitude and phase alternated, producing 24-channel 3D volumes (12 magnitude + 12 phase)
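
Steps 1, 6, and 7 can be sketched with NumPy on a toy volume (the shapes are illustrative; the thesis normalizes across the entire dataset, not per volume):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one subject's 12 phase-cycled bSSFP volumes,
# shape (cycles, x, y, z); real volumes are far larger.
n_cycles = 12
mag = rng.random((n_cycles, 4, 4, 4))
phase_raw = rng.integers(0, 4096, size=(n_cycles, 4, 4, 4)).astype(float)

# Step 1: rescale scanner phase [0, 4095] to radians [-pi, pi]
phase = phase_raw / 4095.0 * 2 * np.pi - np.pi

# Step 6: min-max normalize magnitude and phase to [0, 1] each
def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

mag_n, phase_n = minmax(mag), minmax(phase)

# Step 7: interleave magnitude and phase into a 24-channel volume
channels = np.empty((2 * n_cycles, 4, 4, 4))
channels[0::2] = mag_n
channels[1::2] = phase_n
print(channels.shape)  # (24, 4, 4, 4)
```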

DWI Preprocessing

  1. Distortion correction: FSL’s TOPUP for susceptibility-induced distortions
  2. Brain extraction: FSL’s BET
  3. Eddy current correction: FSL’s eddy
  4. Tensor estimation: FSL’s DTIFIT to compute the 6 independent tensor elements
  5. Normalization: Diagonal and off-diagonal elements normalized separately to [0, 1]
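
Step 5's separate normalization of diagonal and off-diagonal elements can be sketched as follows (the channel ordering is assumed from the Dxx, Dxy, Dxz, Dyy, Dyz, Dzz listing used elsewhere in this post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 6-channel tensor volume in the order (Dxx, Dxy, Dxz, Dyy, Dyz, Dzz);
# channels 0, 3, 5 are the diagonal elements.
dt = rng.normal(size=(6, 4, 4, 4))
diag_idx, offdiag_idx = [0, 3, 5], [1, 2, 4]

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

# Diagonal and off-diagonal groups are scaled independently, because
# their value ranges differ by roughly an order of magnitude.
dt_n = dt.copy()
dt_n[diag_idx] = minmax(dt[diag_idx])
dt_n[offdiag_idx] = minmax(dt[offdiag_idx])
```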

T1-weighted Preprocessing


The Model: A Conditional GAN with Modality-Specific Heads

Generator Architecture

The generator follows an encoder-decoder (UNet) design built on MONAI’s BasicUNet, adapted for 3D volumetric data with several key modifications:

Input heads: Each source modality gets its own ResNet-style input block — a residual block with 3 convolutional layers, ReLU activations, and instance normalization. This maps the modality-specific channel count (24 for bSSFP, 6 for DWI/T1w) to a standardized 24-channel representation that feeds into the shared backbone.
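
A minimal PyTorch sketch of such an input head, assuming the layer arrangement described above (the exact ordering of norm/activation in the thesis's code may differ):

```python
import torch
import torch.nn as nn

class InputHead(nn.Module):
    """Hypothetical modality-specific input head: a residual block of
    three 3D convolutions with instance normalization and ReLU, mapping
    the modality's channel count to a shared 24-channel representation."""

    def __init__(self, in_channels: int, out_channels: int = 24):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, 3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, 3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, 3, padding=1),
            nn.InstanceNorm3d(out_channels),
        )
        # 1x1x1 conv matches channel counts for the residual connection
        self.skip = nn.Conv3d(in_channels, out_channels, 1)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

head = InputHead(in_channels=6)  # e.g. a DWI head (6 tensor channels)
y = head(torch.randn(1, 6, 8, 8, 8))
print(y.shape)  # torch.Size([1, 24, 8, 8, 8])
```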

UNet backbone: 5 contracting and 5 expanding blocks with feature maps progressing as 48 -> 96 -> 192 -> 384 -> 768 -> 384 -> 192 -> 96 -> 48 -> 24. Key modifications from the standard UNet:

Output: 6 channels representing the upper-triangular elements of the symmetric diffusion tensor (Dxx, Dxy, Dxz, Dyy, Dyz, Dzz).
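
Because the diffusion tensor is symmetric, those 6 channels fully determine the 3x3 matrix at each voxel. Reassembling one voxel (values are illustrative):

```python
import numpy as np

# Six upper-triangular elements predicted for one voxel
dxx, dxy, dxz, dyy, dyz, dzz = 1.2, 0.1, 0.05, 0.9, 0.02, 0.8

# Mirror them across the diagonal to recover the symmetric tensor
D = np.array([
    [dxx, dxy, dxz],
    [dxy, dyy, dyz],
    [dxz, dyz, dzz],
])
assert np.allclose(D, D.T)  # symmetric by construction
```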

The full model has approximately 51 million trainable parameters, requiring roughly 14 GiB of GPU VRAM.

Discriminator Architecture

A PatchGAN-style discriminator receives the concatenation of input image and tensor (real or generated):

Five progressive convolutional layers with LeakyReLU(0.2) and batch normalization downsample the input, producing a spatial map of real/fake predictions rather than a single scalar — this encourages the generator to produce locally realistic textures.
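
A 3D PatchGAN along these lines could be sketched as below (channel widths and kernel sizes are illustrative assumptions; only the five-layer LeakyReLU/batch-norm structure and the spatial logit map come from the description above):

```python
import torch
import torch.nn as nn

def block(cin, cout, norm=True):
    layers = [nn.Conv3d(cin, cout, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm3d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchDiscriminator(nn.Module):
    """Hypothetical 3D PatchGAN sketch: input is the concatenation of the
    source image (24 ch) and a diffusion tensor (6 ch); output is a
    spatial map of per-patch real/fake logits, not one scalar."""

    def __init__(self, in_channels: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512),
            nn.Conv3d(512, 1, 4, padding=1),  # fifth layer: logit map
        )

    def forward(self, source, tensor):
        return self.net(torch.cat([source, tensor], dim=1))

d = PatchDiscriminator()
logits = d(torch.randn(1, 24, 32, 32, 32), torch.randn(1, 6, 32, 32, 32))
print(logits.shape)  # a 5D map of patch logits
```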

Loss Function Design

The total generator loss combines four complementary objectives:

  1. L1 Reconstruction Loss: Captures low-frequency, voxel-wise accuracy. Simple but essential — it ensures the predicted tensor values are numerically close to ground truth.

  2. SSIM Loss (1 - SSIM): Penalizes differences in luminance, contrast, and structural patterns using a sliding-window comparison. Computed over local patches, it captures perceptual similarity that pixel-wise metrics miss.

  3. Perceptual Loss: This is where things get interesting. Rather than using a VGG network pre-trained on ImageNet (the standard in computer vision), the authors use MedicalNet — a ResNet-10 pre-trained on a large corpus of medical imaging data across different organs, modalities, and pathologies. Both the generated and ground-truth tensors are passed through MedicalNet, and the L2 distance between their deep feature representations becomes a loss term. This captures “medical image-ness” — structural patterns characteristic of real anatomical data — that neither L1 nor SSIM can quantify.

  4. Adversarial Loss: BCEWithLogitsLoss from the PatchGAN discriminator, pushing the generator to produce tensors that are locally indistinguishable from real data.

These losses are weighted and summed, with the perceptual component receiving a 1000x multiplier to ensure it meaningfully influences the gradient.
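
The weighted sum can be sketched as follows. The 1000x perceptual multiplier is from the thesis; the other weights, the function signature, and passing the SSIM value in precomputed (e.g. from a library SSIM routine) are simplifications for illustration:

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, target, ssim_value, fake_logits,
                   feat_pred, feat_target, w_perceptual=1000.0):
    # L1 reconstruction: voxel-wise numerical accuracy
    l1 = F.l1_loss(pred, target)
    # SSIM loss: 1 - SSIM (ssim_value computed externally here)
    ssim_loss = 1.0 - ssim_value
    # Perceptual: L2 distance between deep (MedicalNet) feature maps,
    # weighted 1000x so it meaningfully influences the gradient
    perceptual = w_perceptual * F.mse_loss(feat_pred, feat_target)
    # Adversarial: push the discriminator's patch logits toward "real"
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    return l1 + ssim_loss + perceptual + adv

# Toy example: the perceptual term dominates here due to its weight
pred, target = torch.zeros(1, 6, 8, 8, 8), torch.ones(1, 6, 8, 8, 8)
loss = generator_loss(pred, target, ssim_value=0.8,
                      fake_logits=torch.zeros(1, 1, 3, 3, 3),
                      feat_pred=torch.zeros(1, 16),
                      feat_target=torch.ones(1, 16))
```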


Training Strategy: Why Naive Training Fails

Single-Stage Training (Direct)

The first approach was straightforward: initialize the network randomly and train end-to-end on the source-to-DT task, using AdamW with a learning rate of 1e-4, distributed across 4 NVIDIA RTX 5000 GPUs via PyTorch Lightning's DDP strategy, with early stopping (patience of 5 epochs).

The results were sobering. Looking at the test loss breakdown (Figure 4.1), an interesting pattern emerges: the SSIM loss dominates the total loss across all modalities (~0.03–0.05), while the L1 and perceptual losses are comparatively tiny. The auto-encoding task (DT-to-DT) has the lowest total loss, while pc-bSSFP performs worst. But despite these seemingly reasonable loss values, the predictions contained visible artifacts across all input modalities (clearly shown in Figure 4.2, where red circles highlight checkerboard-like artifacts in brain regions). Quantitatively:

The off-diagonal elements encode the subtle directional information in diffusion — the fiber crossings, the orientation of tracts. They have smaller absolute values than diagonal elements, making them proportionally harder to predict and more sensitive to relative error.
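
A quick worked example with illustrative magnitudes (not measured values from the thesis) shows why: the same absolute error is a tenfold larger relative error on an off-diagonal-scale element than on a diagonal-scale one.

```python
# Hypothetical magnitudes: diagonal elements ~2e-3 mm^2/s, off-diagonal
# elements roughly an order of magnitude smaller.
abs_err = 5e-5
diag_val, offdiag_val = 2.0e-3, 2.0e-4

print(round(abs_err / diag_val * 100, 2))     # 2.5  (% relative error)
print(round(abs_err / offdiag_val * 100, 2))  # 25.0
```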

Multi-Stage Training (The Breakthrough)

The key insight was that the network needs to first “understand” what a valid diffusion tensor looks like before it can learn to predict one from a foreign modality.

Stage 1: Autoencoder Pre-training. Train the full UNet as a DT-to-DT autoencoder. The network takes a real diffusion tensor as input and must reproduce it. This forces the decoder to learn the statistical structure and spatial patterns of valid diffusion tensors — the correlations between elements, the typical ranges in different brain regions, the spatial smoothness properties.

Stage 2: Input Head Training. Swap the DT input head for the target modality’s input head (e.g., bSSFP). Freeze all pre-trained weights and train only the new input head. This teaches the input block to map the source modality into the representation space that the pre-trained backbone expects, without disturbing the learned tensor generation capabilities.

Stage 3: Fine-tuning. Unfreeze all weights, reduce the learning rate to 1e-5, and fine-tune the entire network end-to-end. The pre-trained weights serve as a strong initialization, allowing the network to make subtle adaptations without catastrophic forgetting of what constitutes a valid diffusion tensor.
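
The freeze/unfreeze mechanics of stages 2 and 3 can be sketched in PyTorch. The model structure and attribute names below are illustrative stand-ins, not the project's actual classes:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Hypothetical stand-in: per-modality input heads feeding a shared
    backbone (here a single conv, standing in for the full UNet)."""
    def __init__(self):
        super().__init__()
        self.input_heads = nn.ModuleDict({
            "dt": nn.Conv3d(6, 24, 1),      # stage-1 autoencoder head
            "bssfp": nn.Conv3d(24, 24, 1),  # new head for stage 2
        })
        self.backbone = nn.Conv3d(24, 6, 1)

def stage2_optimizer(model, head):
    # Stage 2: freeze everything, then train only the new input head
    for p in model.parameters():
        p.requires_grad = False
    for p in model.input_heads[head].parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.input_heads[head].parameters(), lr=1e-4)

def stage3_optimizer(model):
    # Stage 3: unfreeze all weights, fine-tune at a 10x lower rate
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=1e-5)

gen = Generator()
opt2 = stage2_optimizer(gen, "bssfp")  # after stage-1 pre-training
```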

Interestingly, the multi-stage test losses (Figure 4.5) are comparable in magnitude to the direct-training losses. However, the DT autoencoder now performs dramatically better (loss ~0.005), and pc-bSSFP as input (~0.02) is clearly superior to single-phase bSSFP (~0.04), with T1w at ~0.007. The key takeaway: similar loss magnitudes can mask profoundly different prediction quality — the multi-stage approach doesn’t just reduce the loss, it changes what the network learns to represent.


Results: What the Network Learned

Quantitative Performance

Multi-stage training produced dramatically better results. Reading the error plots in the thesis (Figures 4.3–4.9), we can extract the following approximate numbers:

Normalized Tensor Element Errors (Figure 4.6):

| Element Type | DT (autoencoder) | pc-bSSFP | bSSFP | T1w |
|---|---|---|---|---|
| Diagonal (CSF/GM/WM) | ~3/3/5% | ~7/8/10% | ~7/8/7% | ~7/7/7% |
| Off-diagonal (CSF/GM/WM) | ~5/5/5% | ~20/20/55% | ~25/35/60% | ~20/25/60% |

Scalar Map Errors (Figures 4.7–4.9):

| Scalar | DT | pc-bSSFP | bSSFP | T1w |
|---|---|---|---|---|
| Radial Diffusivity | 2–5% | 15–20% | 15–20% | 15–20% |
| Axial Diffusivity | 1–3% | 8–14% | 8–12% | 7–10% |
| Mean Diffusivity | 2–5% | 10–17% | 8–15% | 8–15% |
| Fractional Anisotropy | 8–12% | 15–27% | 15–25% | 10–15% |
| Inclination | <0.05 deg | <0.5 deg | <0.5 deg | <0.5 deg |
| Azimuth | <0.25 deg | <2.0 deg | <1.75 deg | <1.75 deg |

Compare these multi-stage results to single-stage training (Figures 4.3–4.4):

The angular accuracy is perhaps the most striking result. The network predicts the direction of the primary diffusion eigenvector — which corresponds to the orientation of white matter fiber bundles — with sub-degree precision. This means the major white matter tracts (corpus callosum, corticospinal tract, arcuate fasciculus, etc.) are faithfully captured in the predicted tensors.

Modality Comparison

Four input modalities were tested:

  1. DT (autoencoder): Best performance, as expected — it’s reconstructing its own modality
  2. Phase-contrast bSSFP (12 phase cycles): Best among non-DT inputs, confirming that the complex bSSFP signal indeed encodes microstructural information
  3. bSSFP (single phase cycle repeated): Surprisingly competitive, suggesting that even a single bSSFP contrast carries substantial diffusion-related information
  4. T1-weighted (MP2RAGE): Competitive with bSSFP, despite carrying no explicit diffusion sensitivity

That pc-bSSFP outperforms single-phase bSSFP validates the hypothesis that the full frequency response profile — sampled across 12 phase cycles — contains richer microstructural information than any single contrast.

Regional Analysis

Error patterns varied systematically across brain tissue types:

Qualitative Assessment

The thesis presents side-by-side comparisons (ground truth on left, prediction on right) for the pc-bSSFP input modality. These are among the most compelling figures in the work:

What works well (Figures 4.10, 4.12, 4.14–4.15):

Where it struggles (Figures 4.11, 4.13):


Technical Implementation

Data Pipeline

The training pipeline uses TorchIO for 3D medical image I/O and augmentation:

Training Infrastructure

Evaluation Pipeline

Post-training evaluation is comprehensive and parallelized:

  1. Tensor denormalization: Inverse min-max scaling using saved normalization parameters
  2. Eigendecomposition: Full 3x3 tensor eigenanalysis at every voxel using NumPy
  3. Scalar computation: FA, MD, AD, RD, inclination, azimuth, and RGB direction maps
  4. Error analysis: Per-voxel relative error (and absolute angular error for directional metrics), stratified by tissue type using probabilistic segmentation maps from FSL
  5. Aggregation: Median, 1st/25th/75th/99th percentiles, mean, and standard deviation per subject, per ROI
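
Steps 2 and 3 for a single voxel can be sketched with NumPy using the standard DTI definitions (the helper function is illustrative, not the project's actual evaluation code):

```python
import numpy as np

def scalar_maps(D):
    """Eigenanalysis of one 3x3 diffusion tensor: FA, MD, AD, RD and the
    primary eigenvector (fiber direction), per the standard definitions."""
    evals, evecs = np.linalg.eigh(D)   # eigenvalues in ascending order
    md = evals.mean()                  # mean diffusivity
    ad = evals[2]                      # axial diffusivity (largest)
    rd = (evals[0] + evals[1]) / 2     # radial diffusivity
    fa = np.sqrt(1.5 * ((evals - md) ** 2).sum() / (evals ** 2).sum())
    v1 = evecs[:, 2]                   # primary eigenvector
    return fa, md, ad, rd, v1

# Sanity check: an isotropic tensor has FA = 0 and AD = RD = MD
fa, md, ad, rd, v1 = scalar_maps(np.eye(3) * 7e-4)
```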

Limitations and Honest Assessment

The thesis is refreshingly candid about the limitations of the approach:

Statistical Power

Only one model instance was trained per modality. With random weight initialization, different training runs can converge to different solutions. Drawing definitive conclusions about which modality is “best” requires multiple training runs with different random seeds — a standard practice in deep learning research that was omitted here, likely due to computational constraints (each training run requires significant GPU time).

Model Complexity

At 51 million parameters and 14 GiB of VRAM, the model is large. The thesis discusses a spectrum of approaches: voxel-wise regression (as in prior work by Birk et al.) at one extreme, and full-volume processing at the other. Patch-based approaches offer a middle ground, with patch size as a tunable knob controlling the trade-off between structural context and computational cost.

Data Representation

The current approach treats the complex bSSFP data as real-valued multi-channel images, discarding the inherent complex structure. Complex-valued neural networks could preserve the mathematical properties of the MRI signal (phase relationships, magnitude-phase correlations) and potentially improve prediction quality.

Similarly, the 12 phase cycles are treated as independent channels rather than as a temporal sequence sampling a frequency response. Temporal modeling via 1D convolutions, LSTMs, or vision transformers could explicitly capture the sequential structure of phase-cycle data.

Segmentation Quality

The probabilistic tissue segmentations used for regional error analysis (from FSL) were acknowledged to be of poor quality. The thesis notes that improved segmentations using FreeSurfer were being generated at the time of writing.

Perceptual Quality

While the perceptual loss and adversarial training help capture some high-frequency detail, the predictions remain smoother than ground truth. The thesis acknowledges that a PatchGAN discriminator loss (as described in Pix2Pix) might further improve detail but at the cost of more parameters and slower training.


Future Directions

The thesis outlines several promising research avenues:

Better Generative Models

Diffusion models (the deep learning kind, confusingly sharing terminology with diffusion MRI) and latent diffusion models have shown stunning results in image generation. Applying these to medical image synthesis could dramatically improve prediction quality, though at significant computational cost. Flow matching, using neural ODEs, offers a potentially more efficient alternative.

Architecture Innovations

Multi-Modal Foundation Models

Instead of training separate models for each modality pair, a foundation model could learn general MRI-to-MRI translation, then be fine-tuned for specific tasks. Recent work on vision transformers with convolutional components (ViT, Swin Transformer) and parameter-efficient fine-tuning (adapters, LoRA) makes this increasingly feasible.

Alternative Prediction Targets

Rather than predicting the full diffusion tensor, the network could predict:

Broader Context: The Image-to-Image Translation Landscape

The thesis situates this work within the broader landscape of image-to-image translation methods (surveyed extensively in Chapter 2 with a comprehensive taxonomy from Pang et al., 2021). This problem — translating images from one domain to another — has spawned a remarkable diversity of approaches: supervised and unsupervised, two-domain and multi-domain, using GANs, VAEs, flow-based models, and diffusion models. Prior MRI-to-MRI synthesis work has largely focused on single-channel modalities (T1 to T2, PD to T1), while this thesis tackles a many-to-many channel problem (24-channel bSSFP to 6-channel DT), which is substantially more challenging both in terms of information content and computational requirements.


The Bigger Picture

This work sits at an exciting intersection of MRI physics, computational neuroscience, and deep learning. The core finding — that bSSFP imaging carries recoverable diffusion information — has implications beyond this specific implementation:

Clinical impact: If perfected, bSSFP-derived diffusion tensors could provide microstructural information from scans that are already being acquired for other purposes (T2/T1 mapping, functional MRI preparation), effectively making diffusion data available “for free” in terms of scan time.

Physics insight: The fact that a neural network can learn this mapping provides indirect evidence that the bSSFP signal encodes microstructural information, complementing the theoretical and experimental work by Miller et al. While the network doesn’t provide an analytical formula relating bSSFP to diffusion, its success suggests that such a relationship exists and might be partially derivable.

Methodological template: The multi-stage training strategy — autoencoder pre-training followed by modality-specific fine-tuning — is a general recipe for cross-modal medical image synthesis that could be applied to many other modality pairs.

The predictions aren’t perfect. The off-diagonal elements remain noisy, FA maps lack the crispness of the real thing, and the model hasn’t been validated on pathological data. But the directional accuracy is remarkable, the major tracts are clearly resolved, and the multi-stage training breakthrough shows that the approach has room to grow.

For neuroscientists, radiologists, and MRI physicists, this work opens a door: one where the rich microstructural information of diffusion imaging might be accessible from faster, more robust, and more widely available MRI sequences.


This project was developed as a Master of Science thesis (Tübingen, March 2024) at the Graduate Training Centre of Neuroscience, University of Tübingen (Faculty of Science and Faculty of Medicine), in collaboration with the Department of High-field Magnetic Resonance at the Max Planck Institute for Biological Cybernetics. The code is implemented in Python using PyTorch Lightning, MONAI, and TorchIO, and is available as open source.


Full document: Thesis (PDF)


Technical Reference

Key Technologies: PyTorch 2.2.1, PyTorch Lightning 2.2.1, MONAI 1.3.0, TorchIO 0.19.6, nibabel, PyBIDS, Weights & Biases

Model: Modified 3D BasicUNet (MONAI) + PatchGAN discriminator, ~51M parameters

Training: AdamW (lr=1e-4), DDP across 4x NVIDIA RTX 5000, early stopping (patience=5), multi-stage: autoencoder pre-train -> input head transfer -> full fine-tune (lr=1e-5)

Data: DOVE dataset, 120 subjects, 1077 paired samples, 108 unseen test samples, 64-channel receive head coils

Evaluation: Per-voxel relative error stratified by tissue type (CSF/GM/WM), PSNR, SSIM, FID (MedicalNet features), eigendecomposition-derived scalar maps