arXiv paper, 2026-05-26 12:52

AdvantageFlow: Advantage-Weighted Least Squares for RL on Flow Models

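Summary

This paper proposes AdvantageFlow, a new reinforcement learning algorithm for rectified flow models. It optimizes a forward-process prediction loss using advantage-weighted least squares. In image generation experiments with Stable Diffusion 3.5 Medium, it outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline.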

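Key points

Forward-process RL based on advantage-weighted least squares
Rollout policy regularization to stabilize the optimization
Evaluated on image generation with Stable Diffusion 3.5 Medium
Outperforms Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning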

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative, because the loss then becomes non-convex. We stabilize it with rollout policy regularization, which arises from reducing variance and from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.
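The instability described above can be illustrated with a small sketch. This is not the paper's loss: the abstract does not give its exact form, so the code assumes a per-sample objective of the form A * ||pred - target||^2 + beta * ||pred - rollout_pred||^2, where the second term is a hypothetical quadratic stand-in for rollout policy regularization. The function names, the `beta` parameter, and the interpolation convention x_t = (1 - t) * x0 + t * noise are all assumptions made for illustration. With a negative advantage and beta = 0, the objective is concave in the prediction and unbounded below. Once A + beta > 0, it is strictly convex and has a closed-form minimizer.

```python
def interpolate(x0, noise, t):
    """Assumed rectified-flow forward process: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Velocity target under the assumed convention above: noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def awls_loss(pred, target, rollout_pred, advantage, beta):
    """Hypothetical per-sample objective (an assumption, not the paper's formula).

    advantage * ||pred - target||^2  is the advantage-weighted least-squares term.
    beta * ||pred - rollout_pred||^2 keeps pred close to the rollout policy's
    prediction and stands in for rollout policy regularization.
    """
    return advantage * sq_dist(pred, target) + beta * sq_dist(pred, rollout_pred)


def awls_minimizer(target, rollout_pred, advantage, beta):
    """Closed-form minimizer of awls_loss over pred.

    Setting the gradient to zero gives
        pred = (advantage * target + beta * rollout_pred) / (advantage + beta),
    which is a minimum only when advantage + beta > 0.
    """
    weight = advantage + beta
    if weight <= 0.0:
        raise ValueError("objective is not strictly convex: advantage + beta <= 0")
    return [(advantage * y + beta * r) / weight for y, r in zip(target, rollout_pred)]


if __name__ == "__main__":
    target = [1.0, -2.0]
    rollout_pred = [0.5, -1.0]

    # Negative advantage, no regularizer: moving pred away from the target
    # keeps lowering the loss, so there is no finite minimizer.
    for scale in (1.0, 10.0, 100.0):
        pred = [scale, scale]
        print(scale, awls_loss(pred, target, rollout_pred, advantage=-0.5, beta=0.0))

    # Same advantage with beta large enough that advantage + beta > 0:
    # the minimizer is finite and sits on the far side of rollout_pred from target.
    print(awls_minimizer(target, rollout_pred, advantage=-0.5, beta=2.0))
```

In this sketch a negative-advantage sample pushes the prediction away from its target, but only by a bounded amount set by the ratio of `advantage` to `beta`. That is one simple way a regularizer toward the rollout policy can restore a well-posed least-squares problem.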

AI-generated content
