DDPM Forward Process

Forward diffusion refers to transforming a complex distribution into a simple one via a map $\mathcal{T}:\mathbb{R}^d\mapsto\mathbb{R}^d$, i.e.:

$$\mathbf{x}_0\sim p_\mathrm{complex}\Longrightarrow \mathcal{T}(\mathbf{x}_0)\sim p_\mathrm{prior}$$

In DDPM, this process is defined as a Markov chain that repeatedly adds Gaussian noise to a sample $\mathbf{x}_0\sim p_\mathrm{complex}$ drawn from the complex distribution. Each noising step can be written as $q(\mathbf{x}_t\vert\mathbf{x}_{t-1})$:

$$\begin{aligned} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) &= \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t\mathbf{I})\\ \mathbf{x}_t&=\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}+\sqrt{\beta_t}\,\boldsymbol\epsilon, \quad \boldsymbol\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \end{aligned}$$

Here $\{\beta_t\in(0,1)\}^T_{t=1}$ are hyperparameters.
Starting from $\mathbf{x}_0$ and repeatedly applying $q(\mathbf{x}_t\vert\mathbf{x}_{t-1})$ for a sufficiently large number of steps $T$, we eventually obtain pure noise $\mathbf{x}_T$:

$$\mathbf{x}_0\sim p_\mathrm{complex}\rightarrow \mathbf{x}_1\rightarrow \cdots \rightarrow\mathbf{x}_t\rightarrow\cdots\rightarrow \mathbf{x}_T\sim p_\mathrm{prior}$$
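As a concrete illustration, here is a minimal NumPy sketch of this iterative noising; the linear $\beta$ schedule and the toy 2-D sample are illustrative assumptions rather than part of the original setup:

```python
import numpy as np

def forward_diffusion(x0, betas, rng=np.random.default_rng(0)):
    """Iteratively apply q(x_t | x_{t-1}) and return the whole noising trajectory."""
    xs = [x0]
    x = x0
    for beta_t in betas:
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps  # one noising step
        xs.append(x)
    return xs

# toy example: 1000 steps applied to a 2-D "data point"
betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule (assumption)
trajectory = forward_diffusion(np.array([1.5, -0.7]), betas)
print(trajectory[-1])                      # close to a standard normal sample
```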

Instead of applying $q(\mathbf{x}_t\vert\mathbf{x}_{t-1})$ iteratively, we can also jump directly to any step using $q(\mathbf{x}_t\vert\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$. The proof below relies on the fact that a linear combination of independent Gaussian variables is still Gaussian:

$$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\,\boldsymbol\epsilon_{t-1} &&;\alpha_t=1-\beta_t\\ &= \sqrt{\alpha_t \alpha_{t-1}}\, \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\boldsymbol\epsilon}_{t-2} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol\epsilon &&;\boldsymbol\epsilon\sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\ \bar{\alpha}_t=\prod_{i=1}^t \alpha_i \end{aligned}$$

Typically the hyperparameters $\beta_t$ are chosen so that $0<\beta_1<\cdots<\beta_T<1$, which implies $\bar{\alpha}_1 > \cdots > \bar{\alpha}_T\to 0$, so $\mathbf{x}_T$ retains essentially nothing but noise.
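A quick numerical check of this behavior, assuming the same illustrative linear schedule as above: $\bar\alpha_t$ is the cumulative product of $\alpha_t=1-\beta_t$, and $\bar\alpha_T$ ends up close to zero, so the one-shot formula $q(\mathbf{x}_t\vert\mathbf{x}_0)$ yields near-pure noise at $t=T$:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # \bar{alpha}_t = prod_{i<=t} alpha_i
print(alpha_bars[0], alpha_bars[-1])        # ~0.9999 and ~4e-5: x_T is essentially pure noise

def q_sample(x0, t, alpha_bars, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x_500 = q_sample(np.array([1.5, -0.7]), t=500, alpha_bars=alpha_bars)
```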

DDPM Reverse Process

The forward diffusion process gives us:

$$\mathbf{x}_0\sim p_\mathrm{complex}\rightarrow \mathbf{x}_1\rightarrow \cdots \rightarrow\mathbf{x}_t\rightarrow\cdots\rightarrow \mathbf{x}_T\sim p_\mathrm{prior}$$

If we can invert the forward diffusion process, we obtain a mapping from the simple distribution back to the complex one. The reverse diffusion process does exactly this: draw a random sample from the simple distribution, then iteratively apply $q(\mathbf{x}_{t-1}\vert\mathbf{x}_t)$ until a sample from the complex distribution emerges, i.e.:

$$\mathbf{x}_T\sim p_\mathrm{prior}\rightarrow \mathbf{x}_{T-1}\rightarrow \cdots \rightarrow\mathbf{x}_t\rightarrow\cdots\rightarrow \mathbf{x}_0\sim p_\mathrm{complex}$$

To obtain $q(\mathbf{x}_{t-1}\vert\mathbf{x}_t)$, apply Bayes' rule:

$$q(\mathbf{x}_{t-1}\vert\mathbf{x}_t)=\frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)}$$

Unfortunately, $q(\mathbf{x}_{t-1})$ and $q(\mathbf{x}_t)$ in this expression are intractable. Using DDPM's Markov assumption, we can instead condition on $\mathbf{x}_0$ and work with the tractable posterior $q(\mathbf{x}_{t-1}\vert\mathbf{x}_t,\mathbf{x}_0)$. (It can also be shown that if the $\beta_t$ of the forward process are small enough, $q(\mathbf{x}_{t-1}\vert\mathbf{x}_t)$ is approximately Gaussian.)

$$\begin{aligned} q(\mathbf{x}_{t-1}\vert\mathbf{x}_t,\mathbf{x}_0)&=\frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1},\mathbf{x}_0)\,q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)}{q(\mathbf{x}_t\vert\mathbf{x}_0)}\\ &=\frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)}{q(\mathbf{x}_t\vert\mathbf{x}_0)}\\ &=\mathcal{N}(\mathbf{x}_{t-1};\mu_t(\mathbf{x}_t,\mathbf{x}_0),\sigma_t^2\mathbf I) \end{aligned}$$

Here $\mu_t(\mathbf{x}_t,\mathbf{x}_0)$ is the mean of this Gaussian, and both the mean and the variance $\sigma_t^2$ can be expressed in terms of the hyperparameters:

$$\begin{aligned} \mu_t(\mathbf{x}_t,\mathbf{x}_0)&=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+ \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0\\ \sigma_t^2&=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\cdot\beta_t \end{aligned}$$
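For reference, a small sketch (under the same NumPy conventions as above, with 0-indexed timesteps and $t\ge 1$) that evaluates this posterior mean and variance:

```python
import numpy as np

def posterior_mean_var(x_t, x0, t, betas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0); assumes 0-indexed arrays and t >= 1."""
    alpha_t = 1.0 - betas[t]
    a_bar_t, a_bar_prev = alpha_bars[t], alpha_bars[t - 1]
    mean = (np.sqrt(alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t \
         + (np.sqrt(a_bar_prev) * betas[t] / (1 - a_bar_t)) * x0
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]
    return mean, var
```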

In this expression, $\mathbf{x}_0$ can be recovered by inverting $\mathbf x_t=\sqrt{\bar{\alpha}_t}\mathbf x_0+\sqrt{1-\bar{\alpha}_t}\,\boldsymbol\epsilon_t$:

$$\mathbf x_0=\frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\boldsymbol\epsilon_t\right)$$
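In code, this inversion is a one-liner; a hypothetical `predict_x0` helper (following the array conventions above) would look like:

```python
import numpy as np

def predict_x0(x_t, eps, t, alpha_bars):
    """Estimate x_0 from x_t and the (predicted) noise by inverting the closed-form noising."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
```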

Substituting this in:

$$\begin{aligned} \mu_t(\mathbf{x}_t,\mathbf{x}_0)&=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+ \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0\\ &=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+ \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\cdot\frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\boldsymbol\epsilon_t\right)\\ &=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_t\right) \end{aligned}$$

At inference time, however, $\boldsymbol\epsilon_t$ is unknown, so a neural network $\boldsymbol\epsilon_\theta(\mathbf x_t, t)$ is trained to predict it. Putting everything together, one reverse diffusion step is:

$$\begin{aligned} p_\theta(\mathbf{x}_{t-1}\vert\mathbf{x}_t)&=\mathcal{N}(\mathbf{x}_{t-1};\mu_\theta(\mathbf{x}_t,t),\sigma_t^2\mathbf I)\\ &=\mathcal{N}\left(\mathbf x_{t-1};\frac{1}{\sqrt{\alpha_t}}\left(\mathbf x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf x_t, t)\right),\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t\,\mathbf I\right)\\ \mathbf x_{t-1}&=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf x_t, t)\right)+\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t}\;\boldsymbol\epsilon,\quad\boldsymbol\epsilon\sim\mathcal N(\mathbf 0, \mathbf I) \end{aligned}$$
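Putting the reverse step into a full ancestral sampling loop, here is a minimal sketch; `eps_model` stands in for a trained noise-prediction network, and the dummy lambda in the usage line is only for illustration:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=np.random.default_rng(0)):
    """Ancestral DDPM sampling: start from pure noise and repeatedly apply p_theta(x_{t-1} | x_t)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)                        # predicted noise
        mean = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            var = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean                                     # no noise is added at the final step
    return x

# usage with a dummy noise predictor; a real model would be a trained neural network
sample = ddpm_sample(lambda x, t: np.zeros_like(x), shape=(2,), betas=np.linspace(1e-4, 0.02, 1000))
```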

DDPM Training

DDPM is trained by minimizing the negative log-likelihood of the training data:

$$\begin{aligned} -\log p_\theta(\mathbf x_0) &\le -\log p_\theta(\mathbf x_0) + \mathrm{KL}\left(q(\mathbf x_{1:T}\vert\mathbf x_0)\Vert p_\theta(\mathbf x_{1:T}\vert\mathbf x_0)\right) &&;\mathrm{KL}(\cdot\Vert\cdot)\ge 0\\ &=-\log p_\theta(\mathbf x_0)+\mathbb{E}_{\mathbf x_{1:T}\sim q(\mathbf x_{1:T}\vert\mathbf x_0)}\left[\log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})/p_\theta(\mathbf x_0)}\right]&&;p_\theta(\mathbf x_{1:T}\vert\mathbf x_0)=\frac{p_\theta(\mathbf x_{0:T})}{p_\theta(\mathbf x_0)}\\ &=-\log p_\theta(\mathbf x_0)+\mathbb{E}_{\mathbf x_{1:T}\sim q(\mathbf x_{1:T}\vert\mathbf x_0)}\left[\log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})}+\log p_\theta(\mathbf x_0)\right]\\ &=\mathbb{E}_{\mathbf x_{1:T}\sim q(\mathbf x_{1:T}\vert\mathbf x_0)}\left[\log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})}\right] \end{aligned}$$

Here $p_\theta(\mathbf x_{1:T}\vert\mathbf x_0)$ is the network's approximation of the distribution $q$ (variational inference). Defining $\mathcal{L}_{\mathrm{VLB}}\triangleq\mathbb{E}_{q(\mathbf x_{0:T})}\left[\log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})}\right]\ge-\mathbb{E}_{q(\mathbf x_0)}\log p_\theta(\mathbf x_0)$, the VLB is an upper bound on the negative log-likelihood of the training data, so minimizing the VLB minimizes the negative log-likelihood. Decomposing the VLB further:

$$\begin{aligned} \mathcal{L}_{\mathrm{VLB}}&=\mathbb{E}_{q(\mathbf x_{0:T})}\left[\log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})}\right]\\ &=\mathbb{E}_q\left[\log\frac{\prod_{t=1}^{T}q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_T)\prod_{t=1}^{T}p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}\right]\\ &=\mathbb{E}_q\left[-\log p_\theta(\mathbf x_T)+\sum\limits^{T}_{t=1}\log\frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}\right]\\ &=\mathbb{E}_q\left[-\log p_\theta(\mathbf x_T)+\sum\limits^{T}_{t=2}\log\frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}+\log\frac{q(\mathbf x_1\vert\mathbf x_0)}{p_\theta(\mathbf x_0\vert\mathbf x_1)}\right]\\ &=\mathbb{E}_q\left[-\log p_\theta(\mathbf x_T)+\sum\limits^{T}_{t=2}\log\frac{q(\mathbf x_t\vert\mathbf x_{t-1}, \mathbf x_0)}{p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}+\log\frac{q(\mathbf x_1\vert\mathbf x_0)}{p_\theta(\mathbf x_0\vert\mathbf x_1)}\right] &&;q(\mathbf x_t\vert\mathbf x_{t-1})=q(\mathbf x_t\vert\mathbf x_{t-1}, \mathbf x_0)\\ &=\mathbb{E}_q\left[-\log p_\theta(\mathbf x_T)+\sum\limits^{T}_{t=2}\log\left(\frac{q(\mathbf x_{t-1}\vert\mathbf x_{t}, \mathbf x_0)}{p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}\cdot \frac{q(\mathbf x_t\vert\mathbf x_0)}{q(\mathbf x_{t-1}\vert\mathbf x_0)}\right)+\log\frac{q(\mathbf x_1\vert\mathbf x_0)}{p_\theta(\mathbf x_0\vert\mathbf x_1)}\right] &&;\text{Bayes' theorem}\\ &=\mathbb{E}_q\left[\log\frac{q(\mathbf x_T\vert\mathbf x_0)}{p_\theta(\mathbf x_T)}+\sum_{t=2}^{T}\log\frac{q(\mathbf x_{t-1}\vert\mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1}\vert\mathbf x_t)}-\log p_\theta(\mathbf x_0\vert\mathbf x_1)\right]\\ &=\mathbb{E}_q\left[\underbrace{\mathrm{KL}(q(\mathbf x_T\vert\mathbf x_0) \Vert p_\theta(\mathbf x_T))}_{\mathcal{L}_T} + \sum_{t=2}^{T}\underbrace{\mathrm{KL}(q(\mathbf x_{t-1}\vert\mathbf x_t, \mathbf x_0) \Vert p_\theta(\mathbf x_{t-1}\vert\mathbf x_t))}_{\mathcal{L}_{t-1}}\underbrace{-\log p_\theta(\mathbf x_0\vert\mathbf x_1)}_{\mathcal{L}_0}\right]\\ &=\mathbb{E}_q\left[\mathcal{L}_T+\sum_{t=2}^{T}\mathcal{L}_{t-1}+\mathcal{L}_0\right] \end{aligned}$$
  1. Since $\mathbf x_T$ is pure noise, $\mathcal{L}_T$ is a constant (it contains no trainable parameters).
  2. For $\mathcal{L}_0$, DDPM designs a dedicated $p_\theta(\mathbf x_0\vert\mathbf x_1)$.
  3. Each $\mathcal{L}_t\triangleq\mathrm{KL}(q(\mathbf x_t\vert\mathbf x_{t+1}, \mathbf x_0) \Vert p_\theta(\mathbf x_t \vert \mathbf x_{t+1}))$ for $1\le t \le T-1$ is a KL divergence between two Gaussians and therefore has a closed-form solution. DDPM uses the following simplified loss instead (a minimal training sketch follows the equation below):
$$\mathcal{L}_t^{\mathrm{simple}}=\mathbb{E}_{t\sim[1,T],\,\mathbf x_0,\,\boldsymbol\epsilon_t}\left[\Vert\boldsymbol\epsilon_t-\boldsymbol\epsilon_\theta(\sqrt{\bar{\alpha}_t}\mathbf x_0+\sqrt{1-\bar{\alpha}_t}\,\boldsymbol\epsilon_t,\,t)\Vert^2_2\right]$$
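A minimal sketch of how this simplified objective is evaluated on one minibatch; `eps_model` is again a placeholder for the noise-prediction network, and timesteps are 0-indexed here:

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bars, rng=np.random.default_rng(0)):
    """L_simple: sample t and noise, form x_t in closed form, regress the model output onto the noise."""
    B = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bars), size=B)                     # t ~ Uniform over diffusion steps
    eps = rng.standard_normal(x0_batch.shape)
    a_bar = alpha_bars[t].reshape(B, *([1] * (x0_batch.ndim - 1)))   # broadcast over data dimensions
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1 - a_bar) * eps       # x_t ~ q(x_t | x_0)
    return np.mean((eps - eps_model(x_t, t)) ** 2)                   # ||eps - eps_theta(x_t, t)||^2

# example call with a dummy predictor and an 8-sample 2-D batch
alpha_bars = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))
loss = simple_loss(lambda x, t: np.zeros_like(x), np.random.randn(8, 2), alpha_bars)
```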

DDPM Summary

Putting it all together, DDPM's training and sampling/inference procedures are summarized in the figure below (DDPM.png).

The Connection Between Diffusion Models and Score-Based Generative Models

For a Gaussian variable $x=\mu+\sigma\epsilon$, the score is

$$\nabla_x\log p(x)=\nabla_x\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]=-\frac{x-\mu}{\sigma^2}=-\frac{(\mu+\sigma\epsilon)-\mu}{\sigma^2}=-\frac{\epsilon}{\sigma}$$

Since $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ with $\epsilon\sim\mathcal{N}(0,\mathbf I)$, it follows that

$$\nabla_{x_t}\log p(x_t)=\mathbb{E}_{p(x_0\vert x_t)}\left[\nabla_{x_t}\log p(x_t\vert x_0)\right]\approx-\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}}$$

Therefore the diffusion model's noise estimator and the score differ only by a scale factor of $-\frac{1}{\sqrt{1-\bar{\alpha}_t}}$.
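So converting a noise prediction into a score estimate is a single rescaling; a tiny sketch under the same conventions as above:

```python
import numpy as np

def score_from_eps(eps_pred, t, alpha_bars):
    """Turn a predicted noise eps_theta(x_t, t) into an estimate of grad_{x_t} log p(x_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bars[t])
```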

Classifier Guidance

To make a diffusion model generate conditionally, we need to model the joint distribution of the data and the condition; in other words, the model must estimate the score of this joint distribution:

$$\begin{aligned}\nabla_{x_t}\log p(x_t,y)&=\nabla_{x_t}\left[\log\big(p(y|x_t)\,p(x_t)\big)\right]\\ &=\nabla_{x_t}\left[\log p(y|x_t)+\log p(x_t)\right]\\ &=\nabla_{x_t}\log p(y|x_t)+\nabla_{x_t}\log p(x_t)\end{aligned}$$

$p(y|x_t)$ is a classifier, so we train a classifier $f_\phi(y|x_t)$ on noisy inputs $x_t$ to estimate this term:

$$\nabla_{x_t}\log p(y|x_t)+\nabla_{x_t}\log p(x_t)\approx \nabla_{x_t}\log f_\phi(y|x_t)-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t,t)$$

This yields a new score estimator:

$$\begin{aligned}\nabla_{x_t}\log p(x_t,y)&=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\tilde\epsilon_\theta(x_t,t,y)\\ &=\nabla_{x_t}\log f_\phi(y|x_t)-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t,t)\\ \Leftrightarrow\ \tilde\epsilon_\theta(x_t,t,y)&=\epsilon_\theta(x_t,t)-\sqrt{1-\bar{\alpha}_t}\,\nabla_{x_t}\log f_\phi(y|x_t)\end{aligned}$$

To control the strength of the classifier's guidance, a guidance scale $w$ is introduced; writing the (possibly conditional) noise predictor as $\epsilon_\theta(x_t,t,y)$:

$$\tilde\epsilon_\theta(x_t,t,y)=\epsilon_\theta(x_t,t,y)-w\sqrt{1-\bar{\alpha}_t}\,\nabla_{x_t}\log f_\phi(y|x_t)$$
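A sketch of one guided prediction. The callable `classifier_grad` is a hypothetical stand-in that returns $\nabla_{x_t}\log f_\phi(y\vert x_t)$ (in practice computed by backpropagating through a noise-aware classifier):

```python
import numpy as np

def classifier_guided_eps(eps_model, classifier_grad, x_t, t, y, alpha_bars, w=1.0):
    """Classifier guidance: shift the noise prediction along the classifier's gradient."""
    eps = eps_model(x_t, t, y)                             # (conditional) noise prediction
    grad = classifier_grad(x_t, t, y)                      # grad_{x_t} log f_phi(y | x_t)
    return eps - w * np.sqrt(1.0 - alpha_bars[t]) * grad   # guided noise estimate
```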

Classifier-Free Guidance

Starting from classifier guidance and applying Bayes' rule:

$$\begin{aligned}\nabla_{x_t}\log p(y|x_t)&=\nabla_{x_t}\log\left[\frac{p(x_t|y)\,p(y)}{p(x_t)}\right]\\ &=\nabla_{x_t}\log p(x_t|y)+\nabla_{x_t}\log p(y)-\nabla_{x_t}\log p(x_t)\\ &=\nabla_{x_t}\log p(x_t|y)-\nabla_{x_t}\log p(x_t)\\ &=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t)\right)\end{aligned}$$

Substituting this into the classifier-guidance formula:

$$\begin{aligned}\tilde\epsilon_\theta(x_t,t,y)&=\epsilon_\theta(x_t,t,y)-w\sqrt{1-\bar{\alpha}_t}\,\nabla_{x_t}\log f_\phi(y|x_t)\\ &=\epsilon_\theta(x_t,t,y)+w\left[\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t)\right]\end{aligned}$$
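In code the combination is just two forward passes of the same network, one with the condition and one without; the convention that `y=None` selects the unconditional branch is an assumption of this sketch:

```python
import numpy as np

def cfg_eps(eps_model, x_t, t, y, w=7.5):
    """Classifier-free guidance: push the conditional prediction away from the unconditional one."""
    eps_cond = eps_model(x_t, t, y)       # epsilon_theta(x_t, t, y)
    eps_uncond = eps_model(x_t, t, None)  # epsilon_theta(x_t, t), enabled by condition dropout at training
    return eps_cond + w * (eps_cond - eps_uncond)
```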

In other words, the direction that the explicit classifier provides in classifier guidance is equivalent to $\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t)$: a direction that moves toward the conditional prediction and away from the unconditional one.

Further Thoughts

Taking this further, the magnitude of $\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t)$ indicates how well the data aligns with the condition; it acts as an implicit classifier, and this property can itself be exploited for classification. Going one step further, if we replace the unconditional score network $\epsilon_\theta(x_t,t)$ with one conditioned on another condition $y'$, i.e. $\epsilon_\theta(x_t,t,y')$, then:

$$\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t,y')=\left[\epsilon_\theta(x_t,t,y)-\epsilon_\theta(x_t,t)\right]-\left[\epsilon_\theta(x_t,t,y')-\epsilon_\theta(x_t,t)\right]$$

Two implicit classifiers appear here: the first measures how well the data aligns with condition $y$, and the second measures how well it aligns with condition $y'$. Replacing the earlier implicit classifier with this pair makes the generation process align the data with $y$ as much as possible while moving away from $y'$. This trick is widely used in Stable Diffusion as the negative prompt, which is typically set to terms the model should avoid, such as "low quality" or "ugly" (a sketch of this follows below). And since one positive and one negative prompt can guide generation simultaneously, guidance with m prompts (implicit classifiers) is also possible: Compositional Visual Generation with Composable Diffusion Models.
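A sketch of both ideas, assuming an `eps_model` that returns NumPy arrays and accepts `None` for the unconditional branch; the weighted multi-condition form follows the composable-diffusion formulation:

```python
import numpy as np

def negative_prompt_eps(eps_model, x_t, t, y_pos, y_neg, w=7.5):
    """Guide toward a positive condition and away from a negative one (negative prompt)."""
    eps_pos = eps_model(x_t, t, y_pos)
    eps_neg = eps_model(x_t, t, y_neg)
    return eps_pos + w * (eps_pos - eps_neg)

def composed_eps(eps_model, x_t, t, conditions, weights):
    """Compose guidance from m conditions, each with its own guidance weight."""
    eps_uncond = eps_model(x_t, t, None)
    eps = eps_uncond
    for y_i, w_i in zip(conditions, weights):
        eps = eps + w_i * (eps_model(x_t, t, y_i) - eps_uncond)
    return eps
```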

The CFG Scale in Classifier-Free Guidance

The $w$ in classifier-free guidance is usually called the CFG scale. A larger value pushes samples closer to the condition (improving fidelity), while a smaller value pushes them toward the unconditional distribution (improving diversity); it can be tuned according to the application.

As an aside, a diffusion model can of course be trained as a conditional model $\epsilon_\theta(x_t,t,y)$ in the same way as a cGAN or cVAE. When the condition $y$ is simple (e.g. just a class label), a good backbone alone can already achieve conditional generation. But when the condition becomes complex (e.g. text or sketches), classifier-free guidance becomes essential, and it is currently the mainstream approach.


When reposting, please include the original URL: https://dw-dengwei.cn/posts/diffusion