Transcribe audio on iOS & macOS: WhisperKit

WhisperKit is an Apple-optimized implementation of OpenAI's Whisper speech recognition model, designed specifically for iOS and macOS applications. It provides efficient, on-device speech transcription capabilities with impressive accuracy, making it ideal for developers looking to add audio-to-text features to their apps.
Introduction to WhisperKit and its capabilities
WhisperKit leverages Core ML to run Whisper models natively on Apple Silicon, providing robust transcription capabilities across multiple languages. It supports various audio formats and can handle challenging audio environments, making it suitable for diverse applications without requiring cloud connectivity.
Setting up WhisperKit on iOS and macOS
To integrate WhisperKit into your apps, you'll need to set up the proper environment:
Prerequisites
- macOS 14.0 or later
- Xcode 15.0 or later
- Apple Silicon Mac for development (Intel Macs are not supported)
- iOS 16.0+ or macOS 13.0+ for deployment
Installation
Add WhisperKit to your project using Swift Package Manager:
- Open your Xcode project.
- Navigate to File > Add Packages.
- Enter the repository URL: https://github.com/argmaxinc/WhisperKit.git
- Select the latest stable version (0.9.0 or newer) and add it to your project.
Alternatively, add it directly to your Package.swift file:

```swift
dependencies: [
    .package(url: "https://github.com/argmaxinc/WhisperKit.git", from: "0.9.0"),
]
```
Step-by-step guide to transcribing audio files
Here's how to implement audio transcription with WhisperKit in your Swift application:
```swift
import WhisperKit

// Basic transcription example
Task {
    do {
        // Initialize WhisperKit with default settings
        let pipe = try await WhisperKit()

        // Transcribe an audio file
        let audioPath = "path/to/your/audio.mp3" // Supports .wav, .mp3, .m4a, .flac
        let result = try await pipe.transcribe(audioPath: audioPath)

        // Access the transcribed text
        print("Transcription: \(result?.text ?? "No transcription available")")
    } catch {
        print("Error transcribing audio: \(error)")
    }
}
```
Using a specific model
```swift
import WhisperKit

Task {
    do {
        // Configure WhisperKit with a specific model
        let config = WhisperKitConfig(model: "large-v3", modelRepo: "argmaxinc/whisperkit-coreml")
        let pipe = try await WhisperKit(config)

        // Transcribe audio
        let result = try await pipe.transcribe(audioPath: "recording.m4a")
        print(result?.text ?? "No transcription available")
    } catch {
        print("Error: \(error)")
    }
}
```
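Note that on first use, WhisperKit downloads the requested model files from the Hugging Face model repository (argmaxinc/whisperkit-coreml by default), so the initial launch will take noticeably longer than subsequent runs.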
Optimizing transcription accuracy and performance
To achieve optimal results with WhisperKit:
Audio quality considerations
- Use clear, high-quality audio recordings when possible.
- Minimize background noise in recording environments.
- For voice recordings, position microphones closer to speakers.
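If you capture audio yourself on iOS, you can bias the system toward speech-friendly input at recording time. Here is a minimal sketch using standard AVFoundation APIs; the function name is ours, and the exact settings should be tuned to your app:

```swift
import AVFoundation

// A minimal sketch: configure the shared audio session for speech capture (iOS).
func configureSessionForSpeech() throws {
    let session = AVAudioSession.sharedInstance()

    // The .measurement mode minimizes system-applied signal processing
    // that can color or distort speech
    try session.setCategory(.record, mode: .measurement, options: [])

    // Whisper models operate on 16 kHz mono audio, so requesting a matching
    // input format reduces resampling work downstream
    try session.setPreferredSampleRate(16_000)
    try session.setPreferredInputNumberOfChannels(1)

    try session.setActive(true)
}
```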
Model selection
WhisperKit offers several model options with different size/performance tradeoffs:
- `tiny-en`: Smallest model, English only, fastest but least accurate.
- `base-en`: Small model, English only, a bit more accurate than `tiny-en`.
- `small-en`: Good accuracy with reasonable performance for English.
- `medium-en`: High accuracy with moderate performance impact for English.
- `large-v3`: Highest accuracy, but requires more memory and processing power. Supports multiple languages.
Performance considerations
- Memory requirements vary significantly by model size (from ~30 MB for `tiny-en` to ~1.5 GB for `large-v3`).
- Processing time scales with model size and audio length.
- Apple Silicon devices (M1/M2/M3) provide significantly better performance than older devices.
- Consider using smaller models for real-time applications.
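One practical approach is to pick a model variant at runtime based on the device's memory. The helper and thresholds below are illustrative, not part of WhisperKit's API:

```swift
import Foundation

// A rough heuristic sketch: choose a model variant based on physical memory.
// Thresholds are illustrative; profile on your target devices before shipping.
func recommendedModel() -> String {
    let gigabytes = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    switch gigabytes {
    case ..<4:  return "tiny-en"   // older devices / tight memory budgets
    case ..<8:  return "small-en"  // most modern iPhones
    default:    return "large-v3"  // Apple Silicon Macs and high-end iPads
    }
}

// Usage:
// let config = WhisperKitConfig(model: recommendedModel())
```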
Practical use cases and examples
WhisperKit can be effectively used in:
- Voice note applications with automatic transcription.
- Accessibility features for hearing-impaired users.
- Meeting and interview transcription tools.
- Podcast and video content transcription.
- Language learning applications.
Troubleshooting common issues
Model download failures
Issue: Models fail to download or initialize.
Solution: Ensure proper internet connectivity and verify you have sufficient storage space. You can also pre-download models and include them in your app bundle:
```swift
let config = WhisperKitConfig(
    model: "large-v3",
    modelFolder: Bundle.main.resourceURL?.appendingPathComponent("Models").path
)
```
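Keep in mind that bundling a model increases your app's download size by roughly the model sizes listed above, which is why many apps download models on first launch instead.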
Memory pressure
Issue: App crashes due to memory limitations.
Solution: Use a smaller model or implement transcription in smaller audio chunks:
```swift
// Process audio in 30-second segments.
// splitAudioIntoChunks is a helper you supply yourself; one possible
// implementation is sketched below.
let audioChunks = try await splitAudioIntoChunks(audioPath: path, chunkDuration: 30)
var fullTranscription = ""
for chunk in audioChunks {
    let result = try await pipe.transcribe(audioPath: chunk)
    fullTranscription += result?.text ?? ""
}
```
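Since `splitAudioIntoChunks` is not a WhisperKit API, here is one hypothetical implementation using AVFoundation's export sessions:

```swift
import AVFoundation

// A hypothetical chunking helper: exports fixed-duration segments of the
// source file to temporary .m4a files and returns their paths.
func splitAudioIntoChunks(audioPath: String, chunkDuration: Double) async throws -> [String] {
    let asset = AVURLAsset(url: URL(fileURLWithPath: audioPath))
    let totalSeconds = try await asset.load(.duration).seconds
    var chunkPaths: [String] = []
    var start = 0.0

    while start < totalSeconds {
        let length = min(chunkDuration, totalSeconds - start)
        guard let export = AVAssetExportSession(asset: asset,
                                                presetName: AVAssetExportPresetAppleM4A) else { break }
        let outputURL = FileManager.default.temporaryDirectory
            .appendingPathComponent("chunk-\(Int(start)).m4a")
        export.outputURL = outputURL
        export.outputFileType = .m4a
        // Export only the time range covered by this chunk
        export.timeRange = CMTimeRange(
            start: CMTime(seconds: start, preferredTimescale: 600),
            duration: CMTime(seconds: length, preferredTimescale: 600))
        await export.export()
        chunkPaths.append(outputURL.path)
        start += chunkDuration
    }
    return chunkPaths
}
```

Note that naive chunking can split words at segment boundaries; overlapping chunks slightly and de-duplicating the overlapping text can mitigate this.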
Slow transcription
Issue: Transcription takes too long for your use case.
Solution: Use a smaller model, or constrain the decoder's token search via WhisperKit's DecodingOptions:

```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "small-en"))

// A lower topK narrows the token search: faster, but potentially less accurate
let options = DecodingOptions(topK: 1)
let result = try await pipe.transcribe(audioPath: "recording.m4a", decodeOptions: options)
```
Streaming transcription
WhisperKit supports streaming audio for near real-time transcription. Here's how to use the CLI tool for streaming:
```sh
# For streaming audio directly from a microphone
swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --stream
```
Refer to the WhisperKit documentation for implementing streaming transcription programmatically within your app.
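As a rough illustration of one programmatic approach, the sketch below captures microphone samples with AVAudioEngine and periodically re-transcribes the accumulated buffer. It assumes the transcribe(audioArray:) overload and glosses over resampling and thread safety, so treat it as a starting point rather than WhisperKit's official streaming API:

```swift
import AVFoundation
import WhisperKit

// A polling sketch for near real-time transcription.
final class StreamingTranscriber {
    private let engine = AVAudioEngine()
    private var samples: [Float] = []
    private var timer: Timer?

    func start(pipe: WhisperKit) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        // Append incoming mic samples to a growing buffer. A production
        // implementation would synchronize access to `samples` and resample
        // to the 16 kHz mono format Whisper expects.
        input.installTap(onBus: 0, bufferSize: 4096, format: format) { [weak self] buffer, _ in
            guard let channel = buffer.floatChannelData?[0] else { return }
            self?.samples.append(contentsOf: UnsafeBufferPointer(start: channel,
                                                                 count: Int(buffer.frameLength)))
        }
        try engine.start()

        // Re-transcribe the accumulated audio every two seconds
        timer = Timer.scheduledTimer(withTimeInterval: 2.0, repeats: true) { [weak self] _ in
            guard let samples = self?.samples, !samples.isEmpty else { return }
            Task {
                let result = try await pipe.transcribe(audioArray: samples)
                print(result?.text ?? "")
            }
        }
    }
}
```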
Transloadit's speech transcription capabilities
If you need a cloud-based solution without managing infrastructure, Transloadit offers a powerful speech transcription service as part of our Artificial Intelligence service. Our 🤖 speech/transcribe Robot efficiently transcribes speech in audio or video files with these advantages:
- Support for multiple audio and video formats (MP3, WAV, MP4, MOV, etc.).
- Multiple provider options (AWS, GCP, Replicate, FAL, Transloadit).
- Various output formats (JSON, SRT, text, WebVTT).
- Support for over 100 languages.
- Automatic language detection.
- Speaker diarization options.
This cloud-based approach can be ideal for processing large files or when on-device processing isn't feasible.
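For example, an Assembly using the Robot might look like this (the parameter values are illustrative; check the Robot's documentation for the full parameter list):

```json
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "gcp",
      "format": "srt"
    }
  }
}
```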
Happy coding!