Realtime diarization

Learn how to use the Speechmatics API to separate speakers in real-time

To learn more about diarization as a feature, check out the diarization page.

Overview

Real-time diarization offers the following ways to separate speakers in audio:

  • Speaker diarization — Identifies each speaker by their voice. Useful when there are multiple speakers in the same audio stream.

  • Channel diarization — Transcribes each audio channel separately. Useful when each speaker is recorded on their own channel.

  • Channel & speaker diarization — Combines both methods. Each channel is transcribed separately, with unique speakers identified within each channel. Useful when multiple speakers are present across multiple channels.

Speaker diarization

Speaker diarization picks out different speakers from the audio stream based on acoustic matching.

To enable speaker diarization, set diarization to speaker in the transcription config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}
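
If you're driving the WebSocket API directly rather than through a client library, this config is sent inside the StartRecognition message. Below is a minimal Python sketch using the third-party websockets package; the endpoint URL and auth header are illustrative placeholders, not documented values, so substitute the details for your deployment:

import asyncio
import json

import websockets  # third-party: pip install websockets

URL = "wss://example.speechmatics.com/v2"  # placeholder endpoint for illustration
API_KEY = "YOUR_API_KEY"                   # placeholder credential

start_recognition = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_f32le", "sample_rate": 48000},
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
    },
}

async def main():
    # extra_headers is the pre-v14 websockets keyword; newer releases name it additional_headers
    async with websockets.connect(
        URL, extra_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        await ws.send(json.dumps(start_recognition))
        print(await ws.recv())  # expect a RecognitionStarted reply

asyncio.run(main())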

When diarization is enabled, each word and punctuation object in the transcript includes a speaker property that identifies who spoke it. There are two types of labels:

  • S# – S stands for speaker, and # is a sequential number identifying each speaker. S1 appears first in the results, followed by S2, S3, and so on.
  • UU – Used when the speaker cannot be identified or diarization is not applied, for example, if background noise is transcribed as speech but no speaker can be determined.
  "results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "hello",
"language": "en",
"speaker": "S1"
}
],
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "hi",
"language": "en",
"speaker": "S2"
}
],
}]
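
As an illustration of consuming these labels, the sketch below (plain Python, no SDK assumed; the function name is ours) folds a results array into consecutive speaker turns:

def speaker_turns(results):
    """Group consecutive words by their speaker label."""
    turns = []  # list of (speaker, [words])
    for result in results:
        alt = result["alternatives"][0]
        speaker = alt.get("speaker", "UU")  # UU when no speaker could be determined
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(alt["content"])
        else:
            turns.append((speaker, [alt["content"]]))
    return [(speaker, " ".join(words)) for speaker, words in turns]

# For the results above: [("S1", "hello"), ("S2", "hi")]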

Channel diarization

Channel diarization processes audio with multiple channels and returns a separate transcript for each one. This gives you perfect speaker separation at the channel level and more accurate handling of cross-talk.

To enable channel diarization, set diarization to channel and provide a label for each channel in channel_diarization_labels in the transcription config of the StartRecognition message:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["New_York", "Shanghai"]
  }
}

You should see a channels field in the RecognitionStarted message, which lists all the channels you requested:

{
  "message": "RecognitionStarted",
  ...
  "channels": ["New_York", "Shanghai"]
}

Send audio to a channel

To send audio for a specific channel, you can use the AddChannelAudio message. You'll need to encode the data in base64 format:

{
  "message": "AddChannelAudio",
  "channel": "New_York",
  "data": <base_64_encoded_data>
}
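
For example, a small helper (the function name is ours, not part of the API) that base64-encodes a chunk of raw audio and wraps it in the message shape above:

import base64
import json

def add_channel_audio(channel, audio_bytes):
    """Build an AddChannelAudio message for one chunk of raw audio."""
    return json.dumps({
        "message": "AddChannelAudio",
        "channel": channel,
        "data": base64.b64encode(audio_bytes).decode("ascii"),
    })

# e.g. await ws.send(add_channel_audio("New_York", chunk))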

You should get an acknowledgement in the form of a ChannelAudioAdded message from the server, with a corresponding sequence number for the channel:

{
  "message": "ChannelAudioAdded",
  "channel": "New_York",
  "seq_no": 10
}

Transcript response

Transcripts are returned independently for each channel, with the channel property identifying the channel.

{
  "message": "AddTranscript",
  "channel": "New_York",
  ...
  "results": [
    {
      "type": "word",
      "start_time": 1.45,
      "end_time": 1.8,
      "alternatives": [
        {
          "language": "en",
          "content": "Hello,",
          "confidence": 0.98
        }
      ]
    }
  ]
}

The channel property is returned only for AddTranscript and AddPartialTranscript messages. Features such as audio events, translation, and end-of-turn detection do not currently include this property. To request it, please contact support.
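
One way to use this property on the client side is to keep a separate running transcript per channel; a minimal sketch (names are illustrative):

import json
from collections import defaultdict

transcripts = defaultdict(list)  # channel label -> words received so far

def handle_message(raw):
    msg = json.loads(raw)
    # Only final transcripts; AddPartialTranscript results may still change
    if msg.get("message") == "AddTranscript":
        channel = msg.get("channel", "")
        for result in msg.get("results", []):
            transcripts[channel].append(result["alternatives"][0]["content"])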

Channel and speaker diarization

Channel and speaker diarization combines speaker diarization and channel diarization, splitting transcripts per channel whilst also separating individual speakers in each channel.

To enable this mode, follow the steps in speaker diarization and set the diarization mode to channel_and_speaker.

To send audio to a channel, follow the instructions in send audio to a channel.

Transcripts are returned in the same way as channel diarization, but with individual speakers identified:

{
  "message": "AddTranscript",
  "channel": "New_York",
  "results": [
    {
      "alternatives": [
        {
          "content": "Hello",
          "confidence": 0.98,
          "speaker": "S1"
        }
      ]
    },
    ...
    {
      "alternatives": [
        {
          "content": "Hi",
          "confidence": 0.98,
          "speaker": "S2"
        }
      ]
    }
  ]
}

When using channel_and_speaker diarization, speaker labels are scoped to each channel, even when the labels themselves coincide: S1 on channel 1 is not necessarily the same speaker as S1 on channel 2.
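
If you aggregate results across channels, treat the (channel, speaker) pair as the speaker's identity rather than the label alone. A small sketch:

def speakers_seen(messages):
    """Collect the distinct (channel, speaker) pairs from AddTranscript messages."""
    seen = set()
    for msg in messages:
        for result in msg.get("results", []):
            speaker = result["alternatives"][0].get("speaker")
            if speaker:
                seen.add((msg.get("channel"), speaker))
    return seen  # e.g. {("New_York", "S1"), ("Shanghai", "S1")}: two different people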

Limits

For SaaS customers, the maximum number of channels is 2.

For On-prem Container customers, the maximum number of channels depends on your Multi-session container's maximum number of connections.

The Speechmatics Python client CLI is currently limited to transcribing multi-channel audio from files; it does not support multi-channel streaming/raw audio.

Configuration

You can customize diarization to match your use case by adjusting settings for sensitivity, limiting the maximum number of speakers, preferring the current speaker to reduce false switches, and controlling how punctuation influences accuracy.
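
All of these settings (described individually below) live in speaker_diarization_config and can be combined in a single request, for example:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6,
      "prefer_current_speaker": true,
      "max_speakers": 10
    }
  }
}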

Speaker sensitivity

You can configure the sensitivity of speaker detection by using the speaker_sensitivity setting in the speaker_diarization_config section of the transcription config, as shown below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6
    }
  }
}

This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood of more unique speakers being detected.

Prefer current speaker

You can reduce the likelihood of incorrectly switching between similar sounding speakers by setting the prefer_current_speaker flag in the speaker_diarization_config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "prefer_current_speaker": true
    }
  }
}

By default this is false. When set to true, the system will stay with the previous word's speaker if that speaker closely matches the speaker of the new word.

This can reduce instances where the system inadvertently alternates between different speaker labels within a single-speaker audio segment.

However, it may also cause some short speaker turns between similar-sounding speakers to be missed.

Max. speakers

You can prevent too many speakers from being detected by using the max_speakers setting in the StartRecognition message as shown below:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 48000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "max_speakers": 10
    }
  }
}

The default value is 50, but it can take any integer value between 2 and 100 inclusive.

This restricts the number of unique speaker labels that may be output by the system.

Note that accuracy may decline once this limit is reached. It is advisable to set the value to at least the expected number of speakers, and preferably slightly higher.

Speaker diarization and punctuation

Speaker diarization uses punctuation to improve the accuracy of speaker change points. Small adjustments to speaker labels may be applied based on sentence boundaries.

For example, consider a case where diarization marks a speaker change one word after a full stop:

S1: Hello my name is John. And
S2: my name is Alice.

In this case, the speaker change point would be moved to match the end of the sentence:

S1: Hello my name is John.
S2: And my name is Alice.

Speaker diarization may also insert punctuation when a speaker change occurs without a corresponding sentence-ending punctuation mark in the transcription result.

These adjustments are only applied when punctuation is enabled. Disabling punctuation via the permitted_marks setting in punctuation_overrides can reduce diarization accuracy.

Adjusting punctuation sensitivity can also affect how accurately speakers are identified.
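
For reference, the permitted_marks setting sits in punctuation_overrides in the transcription config. A sketch of a config that restricts the permitted marks (narrowing or emptying this list limits or disables punctuation, which, as noted above, can reduce diarization accuracy):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "punctuation_overrides": {
      "permitted_marks": [".", ",", "?"]
    }
  }
}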

Speaker change (legacy)

The Speaker Change Detection feature was removed in July 2024. The speaker_change and channel_and_speaker_change parameters are no longer supported. Use the Speaker diarization feature for speaker labeling.

For API-related questions, contact support.

On-prem

To run channel or channel_and_speaker diarization with an on-prem deployment, configure your environment as follows:

  • Use a GPU Speech-to-Text container. Handling multiple audio streams is computationally intensive and benefits from GPU acceleration.
  • Set the SM_MAX_CONCURRENT_CONNECTIONS environment variable to match the number of channels you want to process, as shown in the sketch below.
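
As a sketch, a container might be started like this; the image name and port are hypothetical placeholders, so use the image and options from your on-prem deployment docs:

docker run --gpus all \
  -e SM_MAX_CONCURRENT_CONNECTIONS=2 \
  -p 9000:9000 \
  example-registry/speechmatics-rt-asr-gpu:latest   # hypothetical image name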

For more details on container setup, see the on-prem deployment docs.