
HN Summary · 2026. 04. 25. 17:47

Transformer Avatars You Can Video-Call: Introducing Lemon Slice

Summary

Lemon Slice uses a custom diffusion transformer (DiT) model to turn a single photo into an avatar you can talk to in real time. Unlike existing platforms such as HeyGen, it requires no per-character training and no human operator: an image in any style can be put on a video call immediately. A temporal-consistency preservation technique overcomes the roughly 5-second chunk limit of existing video models, enabling indefinitely long video generation, and the system achieves an end-to-end latency of 3-6 seconds from user input to avatar response.

Key Points

  • A custom diffusion transformer (DiT) model generates real-time, conversational avatar video from a single photo.
  • Unlike existing platforms, uploading a single image is enough to put an avatar of any style on a video call immediately, with no per-character training and no human in the loop.
  • A temporal-consistency preservation technique overcomes the ~5-second chunk limit of existing models (Sora, Runway, etc.) and enables indefinitely long video generation.
  • A streaming architecture combining Deepgram, Modal, and other components achieves a low end-to-end latency of 3-6 seconds from user input to avatar response.

Show HN: Lemon Slice Live – Have a video call with a transformer model

Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We’ve trained a custom diffusion transformer (DiT) model that achieves video streaming at 25fps and wrapped it into a demo that allows anyone to turn a photo into a real-time, talking avatar. Here’s an example conversation from co-founder Andrew:
https://www.youtube.com/watch?v=CeYp5xQMFZY.

Try it for yourself at:
https://lemonslice.com/live.

(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)

Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style, from photorealistic to cartoons, paintings, and more.

To build this demo, we had to solve the following (among other things, but these were the hardest):

  1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation (a minimal sketch of this step follows the list below). The distilled model achieves 25fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4k resolution.

  2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a clip by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models experience quality degradation after multiple extensions due to the accumulation of generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. Our technique significantly reduces artifact accumulation and allows us to generate indefinitely-long videos (a chunked-generation sketch follows the list below).

  3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar zoom call requires several building blocks, including voice transcription, LLM inference, and text-to-speech generation in addition to video generation. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to help build a parallel processing pipeline that orchestrates everything via continuously streaming chunks (a pipeline sketch follows below). Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response. Our target is <2 second latency.
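To make point 1 above concrete, here is a minimal sketch of a teacher-student distillation step for a video diffusion model. It is illustrative only: the module interfaces (teacher.sample, the student call signature) and the flow-matching-style noising schedule are assumptions, not Lemon Slice's actual training code. The idea is that a frozen, many-step teacher produces the regression target that a one-step student learns to match.

    # Minimal teacher-student distillation sketch (hypothetical interfaces, not the
    # actual Lemon Slice training code). A frozen teacher denoises with many steps;
    # the student is trained to reproduce that result in a single forward pass.
    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, optimizer, x0, audio, teacher_steps=32):
        """One distillation update on a batch of video latents x0 conditioned on audio.

        x0: (B, C, T, H, W) clean video latents; audio: matching audio features.
        """
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device)   # random diffusion times in [0, 1)
        tb = t.view(-1, 1, 1, 1, 1)
        xt = (1.0 - tb) * x0 + tb * noise                # simple linear noising (assumed schedule)

        with torch.no_grad():
            target = teacher.sample(xt, t, audio, steps=teacher_steps)  # slow, high-quality rollout

        pred = student(xt, t, audio)                     # one step: this is what makes 25fps feasible
        loss = F.mse_loss(pred, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()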
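For point 2, the sketch below shows the generic chunked, autoregressive extension scheme the paragraph describes: each ~5-second chunk is conditioned on the tail frames of the previous one. The model.generate interface, chunk length, and overlap are assumptions for illustration; the temporal-consistency technique itself is not spelled out in the post, so only the baseline loop it improves on is shown.

    # Chunked autoregressive video extension (illustrative; interfaces are assumed).
    # Without additional safeguards, small errors in the context frames compound
    # across chunks, which is the degradation the consistency technique addresses.
    import torch

    @torch.no_grad()
    def generate_stream(model, photo, audio_chunks, frames_per_chunk=125, overlap=8):
        """Yield ~5-second chunks (125 frames at 25fps) indefinitely.

        photo: (C, H, W) source image; audio_chunks: iterable of audio features,
        one entry per chunk; model.generate(context, audio) is a hypothetical call
        returning a (frames_per_chunk, C, H, W) tensor.
        """
        context = photo.unsqueeze(0).repeat(overlap, 1, 1, 1)  # bootstrap context from the still photo
        for audio in audio_chunks:                             # e.g. streaming TTS output
            chunk = model.generate(context, audio)
            context = chunk[-overlap:]                         # tail of this chunk seeds the next one
            yield chunk                                        # frames go straight to the call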
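And for point 3, here is a toy asyncio version of the streaming pipeline: transcription, LLM inference, TTS, and video generation run concurrently and hand off small chunks through queues, so downstream stages start before upstream ones finish. The stage callables (asr, llm, tts, video) are placeholders, not the real Deepgram/Modal/Daily.co/Pipecat APIs; the point is the chunked, parallel hand-off that keeps end-to-end latency in the seconds range.

    # Toy streaming pipeline (placeholder callables, not the production Pipecat setup).
    # Every stage consumes and produces small chunks, so the avatar can start
    # responding before the user's utterance has been fully processed.
    import asyncio

    async def run_pipeline(mic_frames, asr, llm, tts, video):
        text_q, token_q, audio_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()

        async def transcribe():                    # streaming speech-to-text (Deepgram's role)
            async for frame in mic_frames:
                if (utterance := await asr(frame)) is not None:
                    await text_q.put(utterance)
            await text_q.put(None)

        async def respond():                       # LLM streams tokens as they decode
            while (utterance := await text_q.get()) is not None:
                async for token in llm(utterance):
                    await token_q.put(token)
            await token_q.put(None)

        async def speak():                         # TTS turns token runs into short audio chunks
            while (token := await token_q.get()) is not None:
                await audio_q.put(await tts(token))
            await audio_q.put(None)

        async def render():                        # DiT renders video for each audio chunk
            while (audio := await audio_q.get()) is not None:
                await video(audio)                 # push frames into the live call

        await asyncio.gather(transcribe(), respond(), speak(), render())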

More technical details here: https://lemonslice.com/live/technical-report.

Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what they see to create a more natural and engaging conversation.

We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!

We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.
