Continuous Bernoulli distribution

In probability theory, statistics, and machine learning, the continuous Bernoulli distribution^[1]^[2]^[3] is a family of continuous probability distributions parameterized by a single shape parameter $\lambda \in (0,1)$ , defined on the unit interval $x\in [0,1]$ , by:

p(x|\lambda )\propto \lambda ^{x}(1-\lambda )^{1-x}.

The continuous Bernoulli distribution arises in deep learning and computer vision, specifically in the context of variational autoencoders,^[4]^[5] for modeling the pixel intensities of natural images. As such, it defines a proper probabilistic counterpart for the commonly used binary cross entropy loss, which is often applied to continuous, $[0,1]$ -valued data.^[6]^[7]^[8]^[9] This practice amounts to ignoring the normalizing constant of the continuous Bernoulli distribution, since the binary cross entropy loss only defines a true log-likelihood for discrete, $\{0,1\}$ -valued data.
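The role of the normalizing constant can be illustrated numerically. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and it assumes the closed form $C(\lambda )=2\tanh ^{-1}(1-2\lambda )/(1-2\lambda )$ with $C(1/2)=2$. It evaluates the negative binary cross entropy with and without $\log C(\lambda )$ and integrates both over $[0,1]$ with a simple midpoint rule.

```python
import math


def log_norm_const(lam: float) -> float:
    """log C(lam), with C(lam) = 2*atanh(1 - 2*lam) / (1 - 2*lam) and C(1/2) = 2."""
    u = 1.0 - 2.0 * lam
    if abs(u) < 1e-6:
        # atanh(u)/u = 1 + u**2/3 + ..., which avoids the 0/0 at lam = 1/2.
        return math.log(2.0) + math.log1p(u * u / 3.0)
    return math.log(2.0 * math.atanh(u) / u)


def neg_bce(x: float, lam: float) -> float:
    """Negative binary cross entropy: x*log(lam) + (1 - x)*log(1 - lam)."""
    return x * math.log(lam) + (1.0 - x) * math.log(1.0 - lam)


def log_pdf(x: float, lam: float) -> float:
    """Continuous Bernoulli log-density: unnormalized term plus log C(lam)."""
    return neg_bce(x, lam) + log_norm_const(lam)


def integrate(f, n: int = 20000) -> float:
    """Midpoint rule on [0, 1] (illustrative accuracy only)."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


if __name__ == "__main__":
    lam = 0.9
    print("integral without C:", integrate(lambda x: math.exp(neg_bce(x, lam))))
    print("integral with C:   ", integrate(lambda x: math.exp(log_pdf(x, lam))))
```

Only the version that includes $\log C(\lambda )$ integrates to one, which is why the binary cross entropy on its own is not a log-likelihood for $[0,1]$-valued data.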

The continuous Bernoulli also defines an exponential family of distributions. Writing $\theta =\log \left(\lambda /(1-\lambda )\right)$ for the natural parameter, the density can be rewritten in canonical form: $p(x|\theta )\propto \exp(\theta x)$.^[10]

Statistical inference

Given an independent sample of $n$ points $x_{1},\dots ,x_{n}$, with $x_{i}\in [0,1]$ for all $i$, drawn from a continuous Bernoulli distribution, the log-likelihood of the natural parameter $\theta$ is

{\mathcal {L}}(\theta )=\theta \sum _{i=1}^{n}x_{i}-n\log\{(e^{\theta }-1)/\theta \}

and the maximum likelihood estimator of the natural parameter $\theta$ is the solution of ${\mathcal {L}}'(\theta )=0$ , that is, ${\hat {\theta }}$ satisfies

{\frac {e^{\hat {\theta }}}{e^{\hat {\theta }}-1}}-{\frac {1}{\hat {\theta }}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}

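where the left-hand side $e^{\hat {\theta }}/(e^{\hat {\theta }}-1)-{\hat {\theta }}^{-1}$ is the expected value of a continuous Bernoulli distribution with parameter ${\hat {\theta }}$ (taken as its limit $1/2$ when ${\hat {\theta }}=0$). Although ${\hat {\theta }}$ does not admit a closed-form expression, the left-hand side is strictly increasing in ${\hat {\theta }}$, so the equation has a unique solution whenever the sample mean lies strictly between 0 and 1, and that solution can be found by numerical root finding.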
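The estimating equation above can be solved with any one-dimensional root finder. The sketch below uses plain bisection, chosen here for simplicity rather than speed. The names `cb_mean` and `fit_theta` are hypothetical, and the code assumes the sample mean lies strictly inside $(0,1)$.

```python
import math


def cb_mean(theta: float) -> float:
    """Mean e^theta/(e^theta - 1) - 1/theta, with the limit 1/2 at theta = 0."""
    if theta < 0.0:
        # Reflecting x -> 1 - x maps theta -> -theta, so mean(-t) = 1 - mean(t).
        return 1.0 - cb_mean(-theta)
    if theta < 1e-4:
        # First-order Taylor expansion around 0 avoids cancellation.
        return 0.5 + theta / 12.0
    # e^theta/(e^theta - 1) rewritten as 1/(1 - e^-theta) to avoid overflow.
    return 1.0 / -math.expm1(-theta) - 1.0 / theta


def fit_theta(xs, iterations: int = 200) -> float:
    """Maximum likelihood estimate of theta by bisection on cb_mean(theta) = mean(xs)."""
    xbar = sum(xs) / len(xs)
    if not 0.0 < xbar < 1.0:
        raise ValueError("sample mean must lie strictly inside (0, 1)")
    lo, hi = -1.0, 1.0
    while cb_mean(lo) > xbar:  # widen the bracket to the left if needed
        lo *= 2.0
    while cb_mean(hi) < xbar:  # widen the bracket to the right if needed
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cb_mean(mid) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    theta_hat = fit_theta([0.2, 0.9, 0.7, 0.8])
    lam_hat = 1.0 / (1.0 + math.exp(-theta_hat))  # back to the lambda scale
    print(theta_hat, lam_hat)
```

Because the mean is monotone in $\theta$, bisection always converges here. Newton's method would also work, since the derivative of the mean is the variance.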

Further properties

The entropy of a continuous Bernoulli distribution is

\operatorname {H} [X]={\begin{cases}0&{\text{ if }}\lambda ={\frac {1}{2}}\\{\frac {\lambda \log \left(\lambda \right)-\left(1-\lambda \right)\log \left(1-\lambda \right)}{1-2\lambda }}-\log \left({\frac {2\tanh ^{-1}\left(1-2\lambda \right)}{e\left(1-2\lambda \right)}}\right)&{\text{ otherwise}}\end{cases}}\!
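As a sanity check on the closed form, the sketch below evaluates the entropy expression and compares it with a direct midpoint-rule approximation of $-\int _{0}^{1}p(x)\log p(x)\,dx$. The function names are hypothetical, and the density is assumed to be $C(\lambda )\lambda ^{x}(1-\lambda )^{1-x}$ with $C(\lambda )=2\tanh ^{-1}(1-2\lambda )/(1-2\lambda )$.

```python
import math


def cb_entropy(lam: float) -> float:
    """Closed-form differential entropy of the continuous Bernoulli."""
    u = 1.0 - 2.0 * lam
    if abs(u) < 1e-6:
        return 0.0  # uniform case: entropy of U[0, 1]
    first = (lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)) / u
    second = math.log(2.0 * math.atanh(u) / (math.e * u))
    return first - second


def cb_log_pdf(x: float, lam: float) -> float:
    """Log-density for lam != 1/2 (assumed closed form of C(lam))."""
    u = 1.0 - 2.0 * lam
    log_c = math.log(2.0 * math.atanh(u) / u)
    return log_c + x * math.log(lam) + (1.0 - x) * math.log(1.0 - lam)


def numeric_entropy(lam: float, n: int = 20000) -> float:
    """Midpoint-rule approximation of -integral of p*log(p) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        lp = cb_log_pdf((i + 0.5) * h, lam)
        total -= math.exp(lp) * lp
    return h * total


if __name__ == "__main__":
    for lam in (0.1, 0.3, 0.9):
        print(lam, cb_entropy(lam), numeric_entropy(lam))
```

The entropy is a differential entropy, so it can be negative. It is largest (zero) in the uniform case $\lambda =1/2$.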

Related distributions

Bernoulli distribution

The continuous Bernoulli can be thought of as a continuous relaxation of the Bernoulli distribution, which is defined on the discrete set $\{0,1\}$ by the probability mass function:

p(x)=p^{x}(1-p)^{1-x},

where $p$ is a scalar parameter between 0 and 1. Applying this same functional form on the continuous interval $[0,1]$ results in the continuous Bernoulli probability density function, up to a normalizing constant.

Uniform distribution

The uniform distribution on the unit interval $[0,1]$ is the special case of the continuous Bernoulli with $\lambda =1/2$, or equivalently $\theta =0$.

Exponential distribution

An exponential distribution with rate $\Lambda >0$, truncated to the unit interval $[0,1]$ and renormalized, is a continuous Bernoulli distribution with natural parameter $\theta =-\Lambda <0$.
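
This correspondence can be checked directly, using only the exponential density $\Lambda e^{-\Lambda x}$ and the canonical form given earlier. Truncating to $[0,1]$ divides the density by its mass on that interval, $1-e^{-\Lambda }$, and substituting $\theta =-\Lambda$ gives

p(x)={\frac {\Lambda e^{-\Lambda x}}{1-e^{-\Lambda }}}={\frac {\theta e^{\theta x}}{e^{\theta }-1}}=\exp \left(\theta x-\log\{(e^{\theta }-1)/\theta \}\right),\qquad x\in [0,1],

which is the continuous Bernoulli density in its natural parameterization.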

Continuous categorical distribution

The multivariate generalization of the continuous Bernoulli is called the continuous categorical distribution.^[11]

References

[1] Loaiza-Ganem, G., & Cunningham, J. P. (2019). The continuous Bernoulli: fixing a pervasive error in variational autoencoders. In Advances in Neural Information Processing Systems (pp. 13266-13276).

[2] PyTorch Distributions. https://pytorch.org/docs/stable/distributions.html#continuousbernoulli

[3] Tensorflow Probability. https://www.tensorflow.org/probability/api_docs/python/tfp/edward2/ContinuousBernoulli Archived 2020-11-25 at the Wayback Machine

[4] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[5] Kingma, D. P., & Welling, M. (2014, April). Stochastic gradient VB and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR (Vol. 19).

[6] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016, June). Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning (pp. 1558-1566).

[7] Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2017, August). Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1965-1972).

[8] PyTorch VAE tutorial: https://github.com/pytorch/examples/tree/master/vae.

[9] Keras VAE tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.

[10] Lee, C. J.; Dahl, B. K.; Ovaskainen, O.; Dunson, D. B. (2025). Scalable and robust regression models for continuous proportional data. arXiv preprint arXiv:2504.15269. https://arxiv.org/abs/2504.15269

[11] Gordon-Rodriguez, E., Loaiza-Ganem, G., & Cunningham, J. P. (2020). The continuous categorical: a novel simplex-valued exponential family. In 36th International Conference on Machine Learning, ICML 2020. International Machine Learning Society (IMLS).

Continuous Bernoulli distribution
	In terms of $\lambda$	In terms of $\theta$
Parameters	$\lambda =1/(1+e^{-\theta })\in (0,1)$	$\theta \in \mathbb {R}$ , natural parameter
Support	$x\in [0,1]$	$x\in [0,1]$
PDF	$p(x\mid \lambda )=C(\lambda )\lambda ^{x}(1-\lambda )^{1-x}$ , where $C(\lambda )={\begin{cases}2&{\text{if }}\lambda ={\frac {1}{2}}\\{\frac {2\tanh ^{-1}(1-2\lambda )}{1-2\lambda }}&{\text{otherwise}}\end{cases}}$	$f(x\mid \theta )={\begin{cases}1&\theta =0\\\exp(x\theta -\log\{(e^{\theta }-1)/\theta \})&\theta \neq 0\end{cases}}$
CDF	$F(x\mid \lambda )={\begin{cases}x,&\lambda ={\tfrac {1}{2}}\\[6pt]{\dfrac {\lambda ^{x}(1-\lambda )^{1-x}+\lambda -1}{2\lambda -1}},&{\text{otherwise}}\end{cases}}$	$F(x\mid \theta )={\begin{cases}x&\theta =0\\(e^{\theta x}-1)/(e^{\theta }-1)&\theta \neq 0\end{cases}}$
Mean	$\operatorname {E} [X]={\begin{cases}{\tfrac {1}{2}},&\lambda ={\tfrac {1}{2}}\\[6pt]{\dfrac {\lambda }{2\lambda -1}}+{\dfrac {1}{2\tanh ^{-1}(1-2\lambda )}},&{\text{otherwise}}\end{cases}}$	$\operatorname {E} [X]={\begin{cases}1/2&\theta =0\\e^{\theta }/(e^{\theta }-1)-\theta ^{-1}&\theta \neq 0\end{cases}}$
Variance	$\operatorname {Var} (X)={\begin{cases}{\tfrac {1}{12}},&\lambda ={\tfrac {1}{2}}\\[6pt]-{\dfrac {\lambda (1-\lambda )}{(1-2\lambda )^{2}}}+{\dfrac {1}{(2\tanh ^{-1}(1-2\lambda ))^{2}}},&{\text{otherwise}}\end{cases}}$	$\operatorname {Var} (X)={\begin{cases}1/12&\theta =0\\(2-e^{\theta }-e^{-\theta })^{-1}+\theta ^{-2}&\theta \neq 0\end{cases}}$
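
Because the CDF in the natural parameterization, $F(x\mid \theta )=(e^{\theta x}-1)/(e^{\theta }-1)$, has a closed-form inverse, samples can be drawn by inverse transform sampling. The sketch below is illustrative only: the function names are hypothetical, and it assumes a moderate $|\theta |$ so that $e^{\theta }$ does not overflow.

```python
import math
import random


def cb_cdf(x: float, theta: float) -> float:
    """CDF (e^(theta*x) - 1) / (e^theta - 1), equal to x when theta = 0."""
    if theta == 0.0:
        return x
    return math.expm1(theta * x) / math.expm1(theta)


def cb_quantile(u: float, theta: float) -> float:
    """Inverse CDF: solve F(x | theta) = u for x."""
    if theta == 0.0:
        return u
    return math.log1p(u * math.expm1(theta)) / theta


def cb_sample(theta: float, n: int, rng=None):
    """Draw n samples by pushing uniform draws through the quantile function."""
    if rng is None:
        rng = random.Random()
    return [cb_quantile(rng.random(), theta) for _ in range(n)]


if __name__ == "__main__":
    draws = cb_sample(2.0, 10000, random.Random(0))
    print(sum(draws) / len(draws))  # should be close to the mean for theta = 2
```

Using `expm1` and `log1p` keeps the computation accurate for small $|\theta |$, where the distribution is close to uniform.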