Azure AI Speech voice control with OpenAI GPT4 and SSML

Azure AI Speech can utilise SSML to control voice behaviour and can be used to control OpenAI's neural voices with strict limitations

SSML, or Speech Synthesis Markup Language, allows us to control the spoken behaviour of a synthesised voice in our application. Within Azure AI Speech, we can use SSML with the synthetic and neural voices available from Microsoft and also with the neural voices from OpenAI (with limitations).

In this small post, I will demonstrate how we can use this in near-real-time speech applications. I will be using a chat/dialogue where a user speaks in their own voice to a Neural Voice, and the Neural Voice dynamically adjusts its tone based on the sentiment found in the text it is about to speak. This can be done with a combination of Azure AI Speech Services, which provides the range of voices and the tone customisation with SSML, and the OpenAI Chat Completions API with GPT4 (now updated below to GPT-4o). The OpenAI model creates the raw text response to be spoken and decides which SSML values are appropriate for the user's query; an Azure AI Speech neural voice then speaks the response using the SSML decorated with GPT4's choices. This post uses only basic SSML, namely the 'voice' element with its 'name' attribute and the 'mstts:express-as' element with its 'style' and 'styledegree' attributes, in order to intentionally keep the output subtle as a demonstration.

Pre-requisites

  • Azure AI Speech Service resource in Azure (Free or S0 tier for more neural voices).
  • OpenAI API account with available funds (calls to larger models are more expensive)
  • Microphone for voice input

First, to set up the dialogue between a user (whose microphone input is converted from speech to text) and a Neural Voice, we can write the following:

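// Requires the Microsoft.CognitiveServices.Speech and Newtonsoft.Json NuGet packages.
// If the classes below live in separate files, move the relevant using directives with them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public class Program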
{
    private static string subscriptionKey = "{YOUR_AZURE_SPEECHKEY}";
    private static string region = "{YOUR_AZURE_SPEECH_REGION}";
    private static SpeechRecognizer recognizer;
    private static List<Message> chatHistory { get; set; }
    private static GPTHelper gpt = new GPTHelper();
    //must be static because it is read from the static RespondToUserAsync method
    private static string openAISecretDevKey = "{YOUR_OPENAI_APIKEY}";

    public static async Task Main(string[] args)
    {
        //Initialisation and acquire GPT system message
        var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
        var audioConfig = AudioConfig.FromDefaultMicrophoneInput();

        recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        recognizer.Recognized += Recognizer_Recognized;

        chatHistory = gpt.GetSystemMessage();

        await ListenForSpeechAsync();
        Console.ReadLine();
    }

    private static async Task ListenForSpeechAsync()
    {
        //ask user to say something
        Console.WriteLine("Say something...");
        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
    }

    private static async Task RespondToUserAsync(string userSpeech)
    {
        // Stop listening for speech recognition
        await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);

        var synthesisConfig = SpeechConfig.FromSubscription(subscriptionKey, region);

        HttpClient client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", openAISecretDevKey);

        //add user's spoken message to chat history
        chatHistory.Add(new Message { Role = "user", Content = userSpeech });

        //craft new request containing previous ENTIRE chat
        //history and post request back to OpenAI Api
        StringContent? requestBody = gpt.CraftPromptRequestBody(chatHistory);

        var response = 
          await client.PostAsync("https://api.openai.com/v1/chat/completions",
          requestBody);
          
        string gptJsonResult = await response.Content.ReadAsStringAsync();

        SSMLData ssmlData = gpt.ProcessGptResponse(gptJsonResult);

        //craft ssml xml structure and decorate with gpt results
        //customise this to your liking with extra tags such as prosody
        //note the space before each attribute, and that the response text is
        //XML-escaped so characters such as & or < cannot break the SSML document
        var ssmlString = "<speak version=\"1.0\"" +
                " xmlns=\"http://www.w3.org/2001/10/synthesis\"" +
                " xmlns:mstts=\"https://www.w3.org/2001/mstts\"" +
                " xml:lang=\"en-US\">\r\n " +
                $"<voice name=\"{ssmlData.VoiceName}\">\r\n " +
                $"<mstts:express-as style=\"{ssmlData.Style}\" " +
                $"styledegree=\"{ssmlData.StyleDegree}\">\r\n " +
                $"{System.Security.SecurityElement.Escape(ssmlData.SpeechSentence)}\r\n " +
                "</mstts:express-as>" +
                "</voice>\r\n</speak>";

        //add the response from OpenAI back into chat history
        chatHistory.Add(new Message { Role = "assistant", Content =
            ssmlData.SpeechSentence });

        //speak here; the synthesizer is disposed when the method exits
        using var synthesizer = new SpeechSynthesizer(synthesisConfig);
        await synthesizer.SpeakSsmlAsync(ssmlString);

        // Start listening for speech again
        await ListenForSpeechAsync();
    }

    private static async void Recognizer_Recognized(object sender,
    SpeechRecognitionEventArgs e)
    {   
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {   //show what the user said and respond to it
            Console.WriteLine($"You said: {e.Result.Text}");

            if (!string.IsNullOrWhiteSpace(e.Result.Text))
            {
                await RespondToUserAsync(e.Result.Text);
            }
        }
    }
}
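
As an alternative to concatenating the SSML by hand in RespondToUserAsync, the same document can be built with System.Xml.Linq, which takes care of attribute spacing, namespace declarations and escaping of the response text. This is only a minimal sketch: the SsmlBuilder class and its Build method are hypothetical names of my own, and it assumes the same element and attribute names used in the SSML string above.

```csharp
using System.Xml.Linq;

// Hypothetical helper (not part of the Speech SDK) that builds the same
// SSML structure as the hand-written string in the post.
public static class SsmlBuilder
{
    // Namespaces copied from the SSML string used in the post.
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";
    private static readonly XNamespace Mstts = "https://www.w3.org/2001/mstts";

    public static string Build(string voiceName, string style,
        string styleDegree, string sentence)
    {
        var speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            // Declares the mstts prefix so express-as serialises as mstts:express-as.
            new XAttribute(XNamespace.Xmlns + "mstts", Mstts.NamespaceName),
            new XAttribute(XNamespace.Xml + "lang", "en-US"),
            new XElement(Ssml + "voice",
                new XAttribute("name", voiceName),
                new XElement(Mstts + "express-as",
                    new XAttribute("style", style),
                    new XAttribute("styledegree", styleDegree),
                    sentence))); // text content is XML-escaped on serialisation

        // No pretty-printing, so no extra whitespace ends up in the spoken text.
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string could then be passed to SpeakSsmlAsync in place of ssmlString, for example SsmlBuilder.Build(ssmlData.VoiceName, ssmlData.Style, ssmlData.StyleDegree, ssmlData.SpeechSentence).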

The Message class is written as:

public class Message
{
    public string Role { get; set; }
    public string Content { get; set; }
}

In our SSMLData class:

public class SSMLData
{
    public string Style {  get; set; }
    public string StyleDegree { get; set; }
    public string SpeechSentence { get; set; }
    public string VoiceName { get; set; }
}

Within our GPTHelper class:

public class GPTHelper
{

    public List<Message> GetSystemMessage()
    {
        //create a very detailed and elaborate system message
        //(each fragment ends with a space so sentences do not run together when concatenated)
        string systemMessage = "Can you respond to and answer the user's message that you receive. " +
            "The goal for you is to ultimately pick a voicename, a Style, a styledegree and the responsemessage, where these elements represent items that are meant to " +
            "be inserted in an Azure SSML snippet for voice synthesis. The voicename is the voice to be used for speaking back to the user and consider them as agents that all work in the same customer service company called Cypher One but with varying character traits. If the user asks for a specific agent by their known name, then choose that voicename. " +
            "For styledegree, the range can only be between 0.01 to 2 in increments or decrements of 0.01 where the lower the number, the more subdued the Style is and the higher the number the stronger the Style is emphasised when speaking. " +
            "For the Style, you can pick the following ONLY: friendly , chat , hopeful , unfriendly , customerservice , assistant , embarrassed , empathetic , gentle. " +
            "For voicename, you can ONLY pick from these: en-US-AndrewNeural , en-US-JennyNeural , en-US-JasonNeural , en-US-JaneNeural. " +
            "Consider that voicename en-US-JasonNeural and en-US-JaneNeural, known as Jason and Jane respectively, are young interns and are only just learning to be helpful so their responsemessage leans towards acute in either direction; their styledegree is either very high or very low. Their Style cannot be assistant or customerservice. " +
            "Consider that en-US-AndrewNeural, aka Andrew, is the line manager of both en-US-JasonNeural and en-US-JaneNeural. en-US-AndrewNeural refers to them as Jason and Jane and is allowed to delegate back to them where logically reasonable to do so in response to the user message. en-US-AndrewNeural refers to himself as Andy. " +
            "Consider that en-US-JennyNeural, aka Jenny or Jen, is the Head manager of all voicenames and is highly skilled in customer service, is quite sympathetic and helpful. ONLY en-US-JennyNeural can have a customerservice or assistant Style. " +
            "All agents can identify themselves by their human names in the responsemessage should they be asked. " +
            "For your responsemessage, this is where you answer the user's message and try to make the responsemessage short and simple where it logically applies given the chosen voicename and what I have told you about it and medium length should the user ask for some detail. Where you ask a question back to the user in your responsemessage to get more detail, you MUST then pick the same voicename in the follow up answer to the next question. " +
            "ONLY change the voicename should the user ask you to speak to another agent, otherwise use the voicename from your previous responsemessage. " +
            "Do not add other text around your answer, you MUST present your answer ALWAYS in the format <voicename>|<Style>|<styledegree>|<responsemessage> ";

        var chatHistory = new List<Message>
        {
            new Message { Role = "system", Content = systemMessage }
        };
        return chatHistory;
    }

    public StringContent? CraftPromptRequestBody(List<Message> chatHistory)
    {
        var requestPayload = new
        {
            model = "gpt-4o", //base gpt4 and gpt4o are good for reasoning
            temperature = 0.8, //allow for some creativity with higher temp
            messages = chatHistory.Select(m => new { role = m.Role,
                content = m.Content }).ToList()
        };

        var jsonPayload = JsonConvert.SerializeObject(requestPayload);
        var requestData = new StringContent(jsonPayload,
            Encoding.UTF8, "application/json");

        return requestData;
    }

    public SSMLData ProcessGptResponse(string gptJsonResult)
    {
        //extract values we want from response
        //ps - might be better to use new GPT JSON capabilities here
        JObject jsonObject = JObject.Parse(gptJsonResult);
        string value = (string)jsonObject["choices"][0]["message"]["content"];

        //split into at most 4 parts so a '|' inside the response message is preserved
        string[] responseParts = value.Split('|', 4);
        if (responseParts.Length < 4)
        {
            throw new FormatException($"Unexpected GPT response format: {value}");
        }

        return new SSMLData
        {
            VoiceName = responseParts[0].Trim(),
            Style = responseParts[1].Trim(),
            StyleDegree = responseParts[2].Trim(),
            SpeechSentence = responseParts[3].Trim()
        };
    }
}
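
Since the model's reply goes straight into SSML attributes, it may be worth validating it before use. The sketch below is one possible way to do that, not the parsing used in this demo: GptReplyParser and TryParse are hypothetical names, the allow-lists are copied from the system message above (with "embarrassed" spelled as Azure documents the style), and clamping styledegree to the 0.01 to 2 range is my own assumption about how out-of-range values should be handled.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical validator for the "<voicename>|<Style>|<styledegree>|<responsemessage>" reply.
public static class GptReplyParser
{
    // Allow-lists taken from the system message in the post.
    private static readonly string[] AllowedVoices =
    {
        "en-US-AndrewNeural", "en-US-JennyNeural", "en-US-JasonNeural", "en-US-JaneNeural"
    };

    private static readonly string[] AllowedStyles =
    {
        "friendly", "chat", "hopeful", "unfriendly", "customerservice",
        "assistant", "embarrassed", "empathetic", "gentle"
    };

    public static bool TryParse(string reply, out string voice, out string style,
        out string styleDegree, out string sentence)
    {
        voice = style = styleDegree = sentence = string.Empty;
        if (string.IsNullOrWhiteSpace(reply)) return false;

        // At most 4 parts, so a '|' inside the response message is kept.
        string[] parts = reply.Split('|', 4);
        if (parts.Length != 4) return false;

        string v = parts[0].Trim();
        string s = parts[1].Trim();
        if (!AllowedVoices.Contains(v) || !AllowedStyles.Contains(s)) return false;

        // Parse with the invariant culture so "1.5" works regardless of OS locale.
        if (!double.TryParse(parts[2].Trim(), NumberStyles.Float,
                CultureInfo.InvariantCulture, out double degree)) return false;

        // Assumption: clamp rather than reject values outside 0.01 to 2.
        degree = Math.Clamp(degree, 0.01, 2.0);

        voice = v;
        style = s;
        styleDegree = degree.ToString("0.##", CultureInfo.InvariantCulture);
        sentence = parts[3].Trim();
        return sentence.Length > 0;
    }
}
```

When TryParse returns false, the caller could fall back to a default voice and style, or ask the model again, instead of sending malformed SSML to the synthesizer.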

The use of GPT4 here is really for coherence and instruction understanding. I was finding that even GPT3.5 Turbo was at times failing to understand that it should not switch the current service agent unless the user asked for that (as instructed in the system message). Finally, I specifically hand-picked some known Azure AI Speech neural voices that support certain SSML voice control attributes, most importantly Style and Style degree. OpenAI Voices are omitted because, at the time of writing, OpenAI Voices used through Azure Speech Services do not support any significant Azure SSML tags that control emotional range, such as style or prosody, because this is ongoing research on OpenAI's part.
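
Because a model can occasionally ignore parts of the system message, some of its rules can also be enforced in code after the reply is parsed. The following is a small sketch of that idea for one rule only, namely that only en-US-JennyNeural may use the customerservice or assistant styles. AgentStyleRules, Normalize and the "friendly" fallback are hypothetical choices of mine, not something the demo app does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical guard that re-applies one rule from the system message in code.
public static class AgentStyleRules
{
    private const string Jenny = "en-US-JennyNeural";

    // Styles the system message reserves for Jenny.
    private static readonly HashSet<string> JennyOnlyStyles =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "customerservice", "assistant"
        };

    // Assumption: "friendly" is a reasonable substitute when the rule is broken.
    public const string FallbackStyle = "friendly";

    public static string Normalize(string voiceName, string style)
    {
        bool isJenny = string.Equals(voiceName, Jenny, StringComparison.OrdinalIgnoreCase);

        // Any other agent asking for a Jenny-only style gets the fallback instead.
        if (!isJenny && JennyOnlyStyles.Contains(style))
        {
            return FallbackStyle;
        }

        return style;
    }
}
```

For example, AgentStyleRules.Normalize(ssmlData.VoiceName, ssmlData.Style) could be called just before the SSML is assembled.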

Microsoft's documentation has more on this, as does OpenAI's own word on controlling the emotional range of their voice models (Update: OpenAI voices can now be directly instructed/prompted to show certain emotional emphasis with GPT-4o; we are sure this will come to Azure SSML too).

Results

And here is a sample of an interaction with the Azure AI Speech neural voices 😀 (not OpenAI voices). The inference latency is due in part to invoking the generally large GPT4 model. The emotional range showcased in the responses is quite subtle and not immediately evident, but this is largely dictated by the OpenAI API system message I wrote for this demo app, as seen above, and by keeping in line with developing Responsible AI. I intentionally used a very limited number of SSML tags for control in my system message. There are many more SSML tags that can be added to control the emotional output of the voices, but these require more thought on ethical grounds and consideration of potential misuse or consequential harm (recording done with the base GPT4 model creating the response text, 1 day before the release of GPT-4o):

[Audio recording: Human to neural (approx. 1 min 47 s)]

Conclusion

The principle works in the sense that we can very carefully and dynamically modify how the neural voices speak back to the user. It should also be noted that the OpenAI model here is assessing the user's sentiment from the text that they produce rather than from their voice; there is no voice analysis taking place. Finally, costs: it's important to note that in the above implementation the entire chat history is resent with every request, so the payloads get progressively larger in a linear fashion, making each subsequent request to OpenAI more expensive than the last. This matters especially when invoking more expensive models such as GPT4 and higher to take advantage of much better reasoning and understanding of the system message and overall instruction. 🌟
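
One common way to keep that cost growth in check is to cap how much history is sent. The sketch below keeps the first entry (the system message, as returned by GetSystemMessage) plus the most recent messages. ChatHistoryTrimmer and KeepSystemAndRecent are hypothetical names, and the trade-off is that the model forgets older turns, which is my assumption about what is acceptable for a short customer-service style chat.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that bounds the size of the chat history sent to OpenAI.
public static class ChatHistoryTrimmer
{
    // Assumes history[0] is the system message, as in GetSystemMessage().
    public static List<T> KeepSystemAndRecent<T>(IReadOnlyList<T> history, int maxRecent)
    {
        if (maxRecent < 0) throw new ArgumentOutOfRangeException(nameof(maxRecent));

        // Nothing to trim yet: return a copy of the whole history.
        if (history.Count <= maxRecent + 1) return history.ToList();

        // Keep the system message, then the last maxRecent messages in order.
        var trimmed = new List<T>(maxRecent + 1) { history[0] };
        trimmed.AddRange(history.Skip(history.Count - maxRecent));
        return trimmed;
    }
}
```

In the demo this could wrap the call that builds the request, for example gpt.CraftPromptRequestBody(ChatHistoryTrimmer.KeepSystemAndRecent(chatHistory, 10)), where 10 is an arbitrary illustrative limit.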