SDPA Enabling for Custom Model

Authors: Su Yang, Xiake Sun, Fiona Zhao

Introduction

To enable SDPA fusion on GPU, we first need to convert the model into OpenVINO IR that contains the SDPA op.

Create a new class SdpaAttention in modeling_custom_model.py that uses torch.nn.functional.scaled_dot_product_attention. This PyTorch op can be matched and converted into the OpenVINO SDPA op.

Fig1: from Attention Is All You Need


Refer to Phi3SdpaAttention: this module inherits from Phi3Attention, as the weights of the module stay untouched. The only changes are in the forward pass, which is adapted to the SDPA API.
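
As a reference, here is a minimal, self-contained sketch of this pattern. The class and parameter names (CustomModelAttention, hidden_size, num_heads, attention_dropout_rate) are placeholders for whatever the real modeling_custom_model.py defines, not the actual custom model, and KV-cache handling is omitted:

import torch
from torch import nn

# Sketch only: CustomModelAttention stands in for the original attention class,
# and SdpaAttention overrides forward() to call the SDPA API.
class CustomModelAttention(nn.Module):
    def __init__(self, hidden_size=1024, num_heads=16, attention_dropout_rate=0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.attention_dropout_rate = attention_dropout_rate
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # The original eager forward (matmul/softmax/dropout) is omitted here.

    def _split_heads(self, x, bsz, q_len):
        # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        return x.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)


class SdpaAttention(CustomModelAttention):
    # Weights stay untouched; only the forward pass is adapted to SDPA.
    def forward(self, hidden_states, attention_mask=None):
        bsz, q_len, _ = hidden_states.size()
        query_states = self._split_heads(self.q_proj(hidden_states), bsz, q_len)
        key_states = self._split_heads(self.k_proj(hidden_states), bsz, q_len)
        value_states = self._split_heads(self.v_proj(hidden_states), bsz, q_len)
        # Single fused op that OpenVINO can match and convert to its SDPA op
        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attention_mask,
            dropout_p=self.attention_dropout_rate if self.training else 0.0,
            is_causal=attention_mask is None and q_len > 1,
        )
        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, -1)
        return self.o_proj(attn_output)

Because only forward() is overridden, existing checkpoints load into SdpaAttention without any weight changes.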

torch.scaled_dot_product_attention

Starting from the equivalent implementation, the target is to replace the related PyTorch ops (softmax, matmul, and dropout) with torch.nn.functional.scaled_dot_product_attention.

Fig2: implementation equivalent from PyTorch 2.3 documentation

For some custom model, the equivalent code is as follows:

# Eager attention: scores = Q·Kᵀ, softmax, dropout, then a weighted sum with V
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3))
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=query_states.dtype)
…
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout_rate, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)

The corresponding implementation with scaled_dot_product_attention is below. Note scale=1: since the eager code above applies no 1/sqrt(head_dim) scaling, the default SDPA scale must be overridden to match:

attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
    scale=1,
)

SDPA’s scaling factor with different head_size

For a PyTorch model with a different head_size, whose eager implementation scales the attention scores by 1/sqrt(head_dim), the scaled_dot_product_attention call needs to be modified with the matching scaling factor.

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
…
attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
    scale=1 / math.sqrt(self.head_dim),
)

The SDPA operator itself has already been implemented. SDPA fusion on GPU supports head size = 128 as of the OV24.2 release.

The OV24.3 release relaxes the SDPA head size limitation for LLMs from 128 only to a range of 64 to 256, thanks to Sergey’s PR.

Usage

Replace the original modeling_custom_model.py in the PyTorch model folder with the new script (containing SdpaAttention).
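
As an illustration only, here is a hedged sketch of one possible conversion flow after the file swap, assuming the custom model loads through transformers with trust_remote_code; the folder name and example inputs are placeholders, and your actual conversion route may differ:

import openvino as ov
import torch
from transformers import AutoModelForCausalLM

# Assumption: "custom_model/" is the PyTorch model folder that now contains
# the patched modeling_custom_model.py with the SdpaAttention class.
model = AutoModelForCausalLM.from_pretrained("custom_model", trust_remote_code=True)
model.eval()
model.config.use_cache = False  # keep the traced graph simple for this sketch

example_input = {
    "input_ids": torch.randint(0, model.config.vocab_size, (1, 8)),
    "attention_mask": torch.ones(1, 8, dtype=torch.long),
}
ov_model = ov.convert_model(model, example_input=example_input)
ov.save_model(ov_model, "custom_model/openvino_model.xml")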

Notice:

  • After converting the model again, check for the SDPA layers (“aten::scaled_dot_product_attention”) in the OV IR .xml file (a minimal check sketch follows this list).
  • Double-check the OpenVINO executable graph to confirm that SDPA is enabled.
  • Don’t forget to check the accuracy of PyTorch model inference after any modification to modeling_custom_model.py.
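
For the first check, here is a minimal sketch using the OpenVINO Python API; the IR path below is a placeholder, not a file from this blog:

import openvino as ov

core = ov.Core()
ov_model = core.read_model("custom_model/openvino_model.xml")  # placeholder IR path

# SDPA should appear as ScaledDotProductAttention ops in the converted IR;
# per the note above, the .xml should also reference "aten::scaled_dot_product_attention".
sdpa_ops = [op for op in ov_model.get_ops()
            if op.get_type_name() == "ScaledDotProductAttention"]
print(f"Found {len(sdpa_ops)} SDPA ops:",
      [op.get_friendly_name() for op in sdpa_ops])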

Conclusion

In this blog, we introduced how to use torch.nn.functional.scaled_dot_product_attention to enable SDPA for a custom model.

The performance improvement from SDPA on the MTL iGPU depends on the model structure. Enabling SDPA for the custom model is the foundation for further optimizations such as Paged Attention.