Dev.to | Headlines | 2026-05-13 00:04

[E2E Test] Deploying a Real-Time Voice-Controlled AI Assistant on a Raspberry Pi

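Summary

This tutorial shows how to build a real-time voice assistant on a low-power edge device such as a Raspberry Pi. By dropping the cloud dependency and running Whisper-small alongside a lightweight TensorFlow Lite model, you get local AI with no network round trip and with audio that never leaves the device.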

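Key Points

Edge computing offers low latency (no cloud round trip), strong privacy, and offline reliability without an internet connection.
The real-time voice assistant pipeline combines audio streaming with `sounddevice`, local transcription with Whisper-small, and a lightweight intent classifier.
On edge hardware such as the Raspberry Pi, use a lightweight library such as `tflite-runtime` for inference rather than the full TensorFlow package.
When running Whisper on a CPU, consider greedy (non-beam) decoding to reduce latency; the article also suggests TorchScript as an optional optimization.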

Why Run a Voice Assistant on the Edge?

Performing speech-to-text and intent detection locally gives you:

Low latency: no round trip to the cloud.
Privacy: audio never leaves the device.
Offline reliability: the assistant keeps working when the internet connection drops.

In this tutorial we combine OpenAI's Whisper (small model) for transcription, a small TensorFlow Lite intent classifier, and a real-time audio pipeline that runs entirely on a Raspberry Pi 4 (2 GB or more). By the end you will have a Python script that listens for a command such as "turn on the lamp" and runs a local function in response.

What You'll Need

Item	Reason
Raspberry Pi 4 (2 GB+) with Raspberry Pi OS (64-bit)	Enough RAM for Whisper-small
Micro-USB or USB-C microphone	Audio capture
Python 3.10+	Modern language features
ffmpeg	Required by Whisper
git, pip, virtualenv	Standard development tools
Optional: GPIO-controlled relay	To demonstrate a real command

Tip: On a Pi Zero, use a lighter Whisper model (e.g., tiny.en) instead of Whisper-small, or run only the intent recognizer.
1. Set Up the Development Environment

Update the OS and install system dependencies

sudo apt update && sudo apt upgrade -y
sudo apt install -y git python3-pip python3-venv ffmpeg libportaudio2

Create a clean virtual environment

python3 -m venv venv
source venv/bin/activate  # activate the virtual environment

Upgrade pip and install the core libraries

pip install --upgrade pip
pip install numpy sounddevice tqdm

Install Whisper

Whisper ships as a Python package that downloads the model on first use.
pip install git+https://github.com/openai/whisper.git

Install the TensorFlow Lite runtime

The full TensorFlow package is heavy for the Pi. Use the lightweight runtime instead:
pip install tflite-runtime
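Before moving on, it can help to confirm that the packages installed above are importable from the virtual environment. The following is a minimal sketch using only the standard library; the helper name `check_modules` is hypothetical, and the module names assume the default import names of the packages installed in this section.

```python
import importlib.util

def check_modules(names):
    """Return {module_name: True/False} depending on whether each module can be found."""
    found = {}
    for name in names:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for dotted names whose parent package is missing
            found[name] = False
    return found

if __name__ == "__main__":
    required = ["numpy", "sounddevice", "tqdm", "whisper", "tflite_runtime"]
    for name, ok in check_modules(required).items():
        print(f"{name:15s} {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING points at an install step that should be repeated inside the activated virtual environment.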
2. Real-Time Audio Capture
We use sounddevice to stream 16 kHz mono audio, delivered as NumPy blocks, into a rolling buffer. Whisper expects 16 kHz input, so the sample rate is set to match.
import sounddevice as sd
import numpy as np
from collections import deque

SAMPLE_RATE = 16000
CHUNK_DURATION = 0.5  # seconds
CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)  # samples per callback block

# Thread-safe circular buffer that keeps the last 5 seconds of audio
audio_buffer = deque(maxlen=int(5 * SAMPLE_RATE))

def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice for each audio block."""
    if status:
        print(f"Audio status: {status}")
    audio_buffer.extend(indata[:, 0])  # mono channel

stream = sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype='float32',
    blocksize=CHUNK_SIZE,
    callback=audio_callback,
)
stream.start()
print("🔊 Listening…")
The buffer always holds the most recent audio. On each loop iteration we pull out a 2-second slice and feed it to Whisper.
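Pulling the most recent N samples out of the rolling buffer can be sketched as below. This is an illustrative helper, not part of the original script: `latest_window` is a hypothetical name, and a plain sequence of integers stands in for the audio samples so the snippet runs without a microphone.

```python
from collections import deque
from itertools import islice

def latest_window(buffer, num_samples):
    """Return the newest `num_samples` items of `buffer` as a list,
    or None if the buffer does not hold that many samples yet."""
    if num_samples <= 0 or len(buffer) < num_samples:
        return None
    start = len(buffer) - num_samples
    return list(islice(buffer, start, None))

# Toy example: a 5-sample buffer fed with 8 samples keeps only the last 5.
demo = deque(maxlen=5)
demo.extend(range(8))          # buffer now holds 3, 4, 5, 6, 7
print(latest_window(demo, 2))  # [6, 7]
print(latest_window(demo, 6))  # None (not enough audio yet)
```

In the assistant, `num_samples` would be `2 * SAMPLE_RATE`, and the returned list would be converted to a float32 NumPy array before transcription.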
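3. Running Whisper on the Edge
Whisper-small (~244 M parameters) fits in the RAM of a Pi 4 and reportedly runs CPU-only at roughly 2x real time; treat that figure as something to verify on your own board. To keep latency down, we use greedy (non-beam) decoding.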
import whisper
import torch

# Load Whisper-small on the CPU
model = whisper.load_model("small", device="cpu")

def transcribe_chunk(chunk):
    """Take a NumPy array of shape (samples,) and return the transcribed text."""
    # Whisper expects float32 audio normalized to [-1, 1]
    audio = torch.from_numpy(chunk).float()
    # beam_size is left unset so decoding is greedy; fp16=False suits CPU inference
    result = model.transcribe(audio, language="en", word_timestamps=False, fp16=False)
    return result["text"].strip()  # transcribe() returns a dict
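Whether a given model keeps up with live audio is easiest to judge by measuring its real-time factor: processing time divided by audio duration, where values below 1.0 mean faster than real time. The sketch below is only an illustration under stated assumptions: `real_time_factor` is a hypothetical helper, and the dummy transcriber stands in for `transcribe_chunk` so the snippet runs anywhere.

```python
import time

def real_time_factor(transcribe_fn, samples, sample_rate=16000):
    """Time one call to transcribe_fn(samples) and return
    (processing_seconds / audio_seconds, result)."""
    if len(samples) == 0:
        raise ValueError("need at least one sample")
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    result = transcribe_fn(samples)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds, result

# Stand-in for transcribe_chunk: sleeps briefly and returns fixed text.
def fake_transcribe(samples):
    time.sleep(0.05)
    return "turn on the lamp"

rtf, text = real_time_factor(fake_transcribe, [0.0] * 32000)  # 2 s of silence
print(f"RTF = {rtf:.3f} for {text!r}")  # below 1.0 means faster than real time
```

On the Pi, passing `transcribe_chunk` and a real 2-second slice gives a direct measurement for the chosen model size.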
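TorchScript for Reduced Compute (Optional)
For a modest speedup, the article suggests scripting the model once:
scripted = torch.jit.script(model)

and then swapping `model` for `scripted` in transcribe_chunk. Treat this as experimental: torch.jit.script compiles `forward` and methods marked for export, so Whisper's Python-level `transcribe` helper is generally not carried over, and scripting the stock model may fail.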

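4. Build a Tiny Intent Classifier
Instead of parsing full sentences, we use a keyword-spotting model that maps short utterances to intents. The architecture is two 1-D convolutions followed by global average pooling and a dense layer, which comes to just under 2k parameters with four classes.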
import tensorflow as tf  # full TensorFlow: train on a desktop or laptop, not on the Pi

def build_intent_model(num_classes=4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16000, 1)),  # 1 second of raw waveform
        tf.keras.layers.Rescaling(1.0 / 32768.0),  # normalize int16-range samples to [-1, 1]
        tf.keras.layers.Conv1D(8, 13, strides=2, activation='relu'),
        tf.keras.layers.Conv1D(16, 13, strides=2, activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    return model

Training Data (quick example)
Create a tiny dataset with four commands: ["turn on the lamp", "turn off the lamp", "what time is it", "stop listening"]. Record a few seconds for each command, label them, and train for a handful of epochs.

# Assume X_train has shape (samples, 16000, 1) with int16-range values, and y_train is one-hot encoded
model = build_intent_model(num_classes=4)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=15, batch_size=8)

Convert to TensorFlow Lite and Quantize
Quantization shrinks the model to roughly 30 KB, and it reportedly runs at more than 100 inferences per second on the Pi.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()
with open("intent_classifier.tflite", "wb") as f:
    f.write(tflite_model)
print("✅ Saved quantized TFLite model")

Load the TFLite Model

import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="intent_classifier.tflite")
interpreter.allocate_tensors()
input_idx = interpreter.get_input_details()[0]['index']
output_idx = interpreter.get_output_details()[0]['index']

def predict_intent(waveform):
    """
    waveform: np.ndarray of shape (16000,), float32 in [-1, 1] as captured by sounddevice
    """
    # The model rescales int16-range input itself, and dynamic-range quantization
    # keeps the input tensor float32, so scale up and reshape to (1, 16000, 1).
    input_data = (waveform * 32767.0).astype(np.float32).reshape(1, -1, 1)
    interpreter.set_tensor(input_idx, input_data)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_idx)[0]
    intent_id = int(np.argmax(probs))
    return intent_id, float(probs[intent_id])
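As a sanity check on the model size, the trainable parameter count of the layers above can be worked out by hand: a Conv1D layer has kernel_size × in_channels × filters weights plus one bias per filter, and a Dense layer has inputs × outputs weights plus one bias per output. The helper names below are hypothetical; the layer sizes are the ones used in build_intent_model.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus biases of a 1-D convolution layer."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs

conv1 = conv1d_params(13, 1, 8)    # 13*1*8 + 8   = 112
conv2 = conv1d_params(13, 8, 16)   # 13*8*16 + 16 = 1680
dense = dense_params(16, 4)        # 16*4 + 4     = 68 (after global average pooling)
total = conv1 + conv2 + dense
print(conv1, conv2, dense, total)  # 112 1680 68 1860
```

The rescaling and pooling layers add no trainable parameters, so the four-class network totals 1,860 parameters.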

Map IDs to human-readable intents:

INTENT_MAP = {
0: "TURN_ON",
1: "TURN_OFF",
2: "GET_TIME",
3: "STOP"
}

5. Glue It All Together
Now we combine the audio stream, Whisper transcription, and intent classifier into a single loop.
We’ll use a 1-second sliding window for intent detection (fast) and a 2-second window for Whisper (more accurate).
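One practical consequence of a sliding window is that the same utterance stays inside the last second of audio for several loop iterations, so a confident prediction can fire more than once. A cooldown is one common way to hedge against that. The sketch below is not from the original script: `IntentDebouncer` and its parameters are hypothetical names, and the clock is injectable purely to make the behavior easy to check.

```python
import time

class IntentDebouncer:
    """Suppress repeats of the same intent for `cooldown` seconds."""

    def __init__(self, cooldown=1.5, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock          # injectable so the class is easy to test
        self._last_fired = {}       # intent name -> time it last fired

    def should_fire(self, intent):
        now = self.clock()
        last = self._last_fired.get(intent)
        if last is not None and now - last < self.cooldown:
            return False            # still inside the cooldown window
        self._last_fired[intent] = now
        return True
```

In the main loop this would wrap the action, for example `if debouncer.should_fire(intent): execute_intent(intent)`.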
import time
import datetime

def execute_intent(intent):
    if intent == "TURN_ON":
        print("💡 Turning lamp ON")
        # Example GPIO call:
        # import RPi.GPIO as GPIO
        # GPIO.output(LAMP_PIN, GPIO.HIGH)
    elif intent == "TURN_OFF":
        print("💡 Turning lamp OFF")
    elif intent == "GET_TIME":
        now = datetime.datetime.now().strftime("%H:%M")
        print(f"🕒 The time is {now}")
    elif intent == "STOP":
        print("👋 Stopping assistant")
        raise KeyboardInterrupt

last_transcription = float("-inf")  # monotonic time of the most recent Whisper run

try:
    while True:
        # ---- Intent detection (fast) ----
        if len(audio_buffer) >= SAMPLE_RATE:  # need at least 1 sec
            recent = np.array(list(audio_buffer)[-SAMPLE_RATE:])  # last 1 sec
            intent_id, confidence = predict_intent(recent)
            if confidence > 0.85:  # ignore low-confidence guesses
                intent = INTENT_MAP[intent_id]
                print(f"[Intent] {intent} ({confidence:.2f})")
                execute_intent(intent)

        # ---- Whisper transcription (every 2 sec) ----
        now = time.monotonic()
        if len(audio_buffer) >= 2 * SAMPLE_RATE and now - last_transcription >= 2.0:
            last_transcription = now
            chunk = np.array(list(audio_buffer)[-2 * SAMPLE_RATE:])
            text = transcribe_chunk(chunk)
            if text:
                print(f"[Transcription] {text}")

        time.sleep(0.2)  # tiny pause to keep CPU happy
except KeyboardInterrupt:
    print("\n🛑 Assistant stopped")
finally:
    stream.stop()
    stream.close()

What's happening?

The audio callback continuously fills audio_buffer.
Every loop we grab a 1-second slice, run the quantized intent model, and act immediately on high-confidence predictions.
Every 2 seconds we feed a larger slice to Whisper for a full transcription, which is useful for debugging or for commands that need more context.
The script exits gracefully on "stop listening".

6. Optimizing for Real-World Use

Area	Quick win
CPU usage	Set torch.set_num_threads(2) to limit Whisper's thread count.
Power usage	Enable a low-power mode on the Pi (the article cites "pico mode" under sudo raspi-config → Performance → Low-Power; menu entries vary between OS releases).
Audio quality	Add a simple high-pass filter (scipy.signal.butter) to remove rumble.
Model size	Switch from Whisper-small to Whisper-tiny if RAM becomes the bottleneck.
Keyword detection	Keep the intent model always on and invoke Whisper only after a keyword is detected (e.g., "hey pi").

7. Deploy as a System Service
Running the script by hand is fine for testing, but a production-grade assistant should start automatically at boot.
sudo nano /etc/systemd/system/voice-assistant.service

Paste the following:

[Unit]
Description=Edge Voice Assistant
After=network.target

[Service]
WorkingDirectory=/home/pi/voice-assistant
ExecStart=/home/pi/voice-assistant/venv/bin/python3 assistant.py
Restart=on-failure
User=pi

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable voice-assistant.service
sudo systemctl start voice-assistant.service

Check the logs with journalctl -u voice-assistant -f.

Key Takeaways

Whisper-small can reportedly run in real time on a Raspberry Pi 4 with CPU-only inference when decoding is kept greedy (no beam search).
A tiny 1-D ConvNet with post-training quantization in TensorFlow Lite delivers millisecond-scale intent detection.
sounddevice with a circular buffer lets you stream audio without dropping frames.
Combining a fast intent classifier with periodic Whisper transcription gives both low latency and high accuracy.
Packaging the script as a systemd service makes the assistant start automatically and stay running.
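The keyword-gating idea from the optimization table (keep the intent model always on, call Whisper only after a wake word) can be sketched as a small state machine. Everything here is illustrative: `WakeGate`, the 5-second listening window, and the injectable clock are assumptions, not part of the original script.

```python
import time

class WakeGate:
    """Open a listening window for `window` seconds after a wake word."""

    def __init__(self, window=5.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self._opened_at = None      # None means the gate is closed

    def wake(self):
        """Call when the always-on keyword model detects the wake word."""
        self._opened_at = self.clock()

    def is_open(self):
        """True while the heavier Whisper pass should still run."""
        if self._opened_at is None:
            return False
        if self.clock() - self._opened_at >= self.window:
            self._opened_at = None  # window expired, close the gate
            return False
        return True
```

In the main loop, the Whisper branch would then be guarded by `gate.is_open()`, with `gate.wake()` called when the intent model reports the wake word, so the expensive transcription runs only for a few seconds after each trigger.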

AI-generated content
