Overview

A turn-based conversation is a conversation model for applications in which the user provides a complete utterance and the agent responds with a complete reply. This differs from streaming conversations, which aim to mimic the continuous back-and-forth of natural human speech; instead, it suits applications driven by discrete user input. For example, in a voice memo application, the user records a message and the agent generates a full response.

A turn-based conversation system is a good fit for applications that don't require real-time responses or constant back-and-forth. Treating each user input as a discrete event reduces complexity, keeps the conversation flow controlled, and gives the system time to generate and deliver a complete, meaningful response.
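
Conceptually, each turn is a short pipeline: capture the whole utterance, transcribe it, generate a reply, synthesize speech, and play it back. The sketch below expresses that pipeline with generic callables; it is an illustration of the model, not Vocode code.

# A conceptual sketch of one turn, not Vocode code: every step is a callable
# you supply, so nothing here maps to a specific Vocode API.
def run_turn(record, transcribe, respond, synthesize, play):
    audio = record()            # capture the user's complete utterance
    text = transcribe(audio)    # speech-to-text
    reply = respond(text)       # generate the full reply
    speech = synthesize(reply)  # text-to-speech
    play(speech)                # deliver the response in one piece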

Turn-based quickstart

The full code for this example is shown below.

import logging
from dotenv import load_dotenv
from vocode import getenv
from vocode.helpers import create_turn_based_microphone_input_and_speaker_output
from vocode.turn_based.agent.chat_gpt_agent import ChatGPTAgent
from vocode.turn_based.synthesizer.azure_synthesizer import AzureSynthesizer
from vocode.turn_based.synthesizer.eleven_labs_synthesizer import ElevenLabsSynthesizer  # optional alternative to AzureSynthesizer
from vocode.turn_based.transcriber.whisper_transcriber import WhisperTranscriber
from vocode.turn_based.turn_based_conversation import TurnBasedConversation

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

load_dotenv()

# Voice ID for the optional ElevenLabsSynthesizer (see https://api.elevenlabs.io/v1/voices)
ADAM_VOICE_ID = "pNInz6obpgDQGcFmaJgB"

if __name__ == "__main__":
    # With use_default_devices=False, the helper prompts you to choose the
    # microphone and speaker devices at startup.
    (
        microphone_input,
        speaker_output,
    ) = create_turn_based_microphone_input_and_speaker_output(use_default_devices=False)

    conversation = TurnBasedConversation(
        input_device=microphone_input,
        output_device=speaker_output,
        transcriber=WhisperTranscriber(api_key=getenv("OPENAI_API_KEY")),
        agent=ChatGPTAgent(
            system_prompt="The AI is having a pleasant conversation about life",
            initial_message="Hello!",
            api_key=getenv("OPENAI_API_KEY"),
        ),
        synthesizer=AzureSynthesizer(
            api_key=getenv("AZURE_SPEECH_KEY"),
            region=getenv("AZURE_SPEECH_REGION"),
            voice_name="en-US-SteffanNeural",
        ),
        logger=logger,
    )
    print("Starting conversation. Press Ctrl+C to exit.")
    while True:
        try:
            input("Press enter to start recording...")
            conversation.start_speech()
            input("Press enter to end recording...")
            conversation.end_speech_and_respond()
        except KeyboardInterrupt:
            break

This example demonstrates a turn-based conversation that uses a ChatGPTAgent for text generation, a WhisperTranscriber for speech-to-text, and an AzureSynthesizer for text-to-speech. Pressing enter starts and ends each recording, signaling the system when to listen and when to respond.
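
The ElevenLabsSynthesizer import and the ADAM_VOICE_ID constant are included so you can swap out the synthesizer. The sketch below shows that swap; the api_key/voice_id parameter names and the ELEVEN_LABS_API_KEY variable are assumptions here, so check the ElevenLabsSynthesizer constructor in your installed vocode version.

# A minimal sketch, assuming ElevenLabsSynthesizer accepts api_key and voice_id;
# verify against your installed vocode version.
elevenlabs_synthesizer = ElevenLabsSynthesizer(
    api_key=getenv("ELEVEN_LABS_API_KEY"),
    voice_id=ADAM_VOICE_ID,
)

Pass synthesizer=elevenlabs_synthesizer to TurnBasedConversation in place of the AzureSynthesizer above.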

Remember to set OPENAI_API_KEY, AZURE_SPEECH_KEY, and AZURE_SPEECH_REGION in your environment or in a .env file (loaded by load_dotenv() at startup). You can also customize the voice, system prompt, and initial message as needed.
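
For example, a .env file in the working directory might look like the following (the values are placeholders; add ELEVEN_LABS_API_KEY only if you swap in the ElevenLabs synthesizer):

OPENAI_API_KEY=sk-...
AZURE_SPEECH_KEY=your-azure-speech-key
AZURE_SPEECH_REGION=eastus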

React turn-based quickstart

🚧 Under construction

If you want to work on a sample React app for this, reach out to us!