A Nearest neighbor Approach to Estimating Divergence Between Continuous Random Vectors

1. Introduction

The Shannon (or differential) entropy of a continuously distributed random variable (r.v.) X with probability density function (pdf) f is widely used in probability theory and information theory as a measure of uncertainty. It is defined as the negative mean of the logarithm of the density function, i.e.,

$H (f) = - E_{f} [ln f (X)]$

(1)

k-Nearest neighbor (knn) density estimators were proposed by Mack and Rosenblatt [1]. Penrose and Yukich [2] studied the laws of large numbers for k-nearest neighbor distances. The nearest neighbor entropy estimators when

X \in R^{p}

were studied by Kozachenko and Leonenko [3]. Singh et al. [4] and Leonenko et al. [5] extended these estimators using k-nearest neighbors. Mnatsakanov et al. [6] studied knn entropy estimators for variable rather than fixed k. Eggermontet et al. [7] studied the kernel entropy estimator for univariate smooth distributions. Li et al. [8] studied parametric and nonparametric entropy estimators for univariate multimodal circular distributions. Neeraj et al. [9] studied knn estimators of circular distributions for the data from the Cartesian product, that is,

{[0, 2 π)}^{p}

. Recently, Mnatsakanov et al. [10] proposed an entropy estimator for hyperspherical data based on the moment-recovery (MR) approach (see also Section 4.3).

In this paper, we propose k-nearest neighbor entropy, cross-entropy and KL-divergence estimators for hyperspherical random vectors defined on a unit p-hypersphere

S^{p - 1}

centered at the origin in p-dimensional Euclidean space. Formally,

$S^{p - 1} = \{x \in R^{p} : ∥ x ∥ = 1\}$

(2)

The surface area

S_{p}

of the hypersphere is well known:

S_{p} = \frac{2 π^{p / 2}}{Γ (\frac{p}{2})}

, where Γ is the gamma function. For a part of the hypersphere, the area of a cap with solid angle ϕ relative to its pole is given by Li [11] (cf. Gray [12]):

$S (ϕ) = \frac{1}{2} S_{p} [1 - sgn (cos ϕ) I_{{cos}^{2} ϕ} (\frac{1}{2}, \frac{p - 1}{2})]$

(3)

where sgn is the sign function, and

I_{x} (α, β)

is the regularized incomplete beta function.

For a random vector from the unit circle

S^{1}

, the von Mises distribution vM

(μ, κ)

is the most widely used model:

$f_{vM} (x; μ, κ) = \frac{1}{2 π I_{0} (κ)} e^{κ μ^{T} x}$

where T is the transpose operator,

| | μ | | = 1

and

κ \geq 0

are the mean direction vector and concentration parameters, and

I_{0}

is the zero-order modified Bessel function of the first kind. Note that the von Mises distribution has a single mode. The multimodal extension to the von Mises distribution is the so-called generalized von Mises model. Its properties are studied by Yfantis and Borgman [13] and Gatto and Jammalamadaka [14].

The generalization of von Mises distribution onto

S^{p - 1}

is the von Mises-Fisher distribution (also known as Langevin distribution) vMF

_{p} (μ, κ)

with pdf,

$f_{p} (x; μ, κ) = c_{p} (κ) e^{κ μ^{T} x}$

(4)

where the normalization constant is

$c_{p} (κ) = \frac{κ^{p / 2 - 1}}{{(2 π)}^{p / 2} I_{p / 2 - 1} (κ)}$

and

I_{ν} (x)

is the ν-order modified Bessel function of the first kind. See Mardia and Jupp [15] (p. 167) for details.

Since von Mises-Fisher distributions are members of the exponential family, by differentiating the cumulant generating function, one can obtain the mean and variance of

μ^{T} X

$\begin{matrix} E_{f} [μ^{T} X] & = A_{p} (κ) \end{matrix}$

and

$\begin{matrix} V_{f} [μ^{T} X] & = A_{p}^{'} (κ) \end{matrix}$

where

A_{p} (κ) = I_{p / 2} (κ) / I_{p / 2 - 1} (κ)

, and

A_{p}^{'} (κ) = \frac{d}{d κ} A_{p} (κ) = 1 - A_{p} {(κ)}^{2} - (p - 1) / κ A_{p} (κ)

. See Watamori [16] for details. Thus the entropy of

f_{p}

is:

$H (f_{p}) = - E_{f} [ln f_{p} (X)] = - ln c_{p} (κ) - κ E_{f} [μ^{T} X] = - ln c_{p} (κ) - κ A_{p} (κ)$

(5)

and

$V_{f} [ln f_{p} (X)] = κ^{2} V_{f} [μ^{T} X] = κ^{2} A_{p}^{'} (κ)$

(6)

Spherical distributions have been used to model the orientation distribution functions (ODF) in HARDI (High Angular Resolution Diffusion Imaging). Knutsson [17] proposed a mapping from (

p = 3

) orientation to a continuous and distance preserving vector space (

p = 5

). Rieger and Vilet [18] generalized the orientation in any p-dimensional spaces. McGraw et al. [19] used vMF

_{3}

mixture to model the 3-D ODF and Bhalerao and Westin [20] applied

{vMF}_{5}

mixture to 5-D ODF in the mapped space. Entropy of the ODF is proposed as a measure of anisotropy (Özarslan et al. [21], Leow et al. [22]). McGraw et al. [19] used Rényi entropy for the

{vMF}_{3}

mixture since it has a closed form. Leow et al. [22] proposed an exponential isotropy measure based on the Shannon entropy. In addition, KL-divergence can be used to measure the closeness of two ODF's. A nonparametric entropy estimator based on knn approach for hyperspherical data provides an easy way to compute the entropy related quantities.

In Section 2, we will propose the knn based entropy estimator for hyperspherical data. The unbiasedness and consistency are proved in this section. In Section 3, the knn estimator is extended to estimate cross entropy and KL-divergence. In Section 4, we present simulation studies using uniform hyperspherical distributions and aforementioned vMF probability models. In addition, the knn entropy estimator is compared with the MR approach proposed in Mnatsakanov et al. [10]. We conclude this study in Section 5.

2. Construction of knn Entropy Estimators

Let

X \in S^{p - 1}

be a random vector having pdf f and

X_{1}, X_{2}, \dots, X_{n}

be a set of i.i.d. random vectors drawn from f. To measure the nearness of two vectors x and

y

, we define a distance measure as the angle between them:

ϕ = arccos (x^{T} y)

and denote the distance between

X_{i}

and its k-th nearest neighbor in the set of n random vectors by

ϕ_{i} : = ϕ_{n, k, i}

With the distance measure defined above and without loss of generality, the naïve k-nearest neighbor density estimate at

X_{i}

is thus,

$f_{n} (X_{i}) = \frac{k / n}{S (ϕ_{i})}$

(7)

where

S (ϕ_{i})

is the cap area as expressed by (3).

Let

L_{n, i}

be the natural logarithm of the density estimate at

X_{i}

$L_{n, i} = ln f_{n} (X_{i}) = ln \frac{k / n}{S (ϕ_{i})}$

(8)

and thus we construct a similar k-nearest neighbor entropy estimator (cf. Singh et al. [4]):

$\begin{matrix} H_{n} (f) = - \frac{1}{n} \sum_{i = 1}^{n} [L_{n, i} - ln k + ψ (k)] = \frac{1}{n} \sum_{i = 1}^{n} ln [n S (ϕ_{i})] - ψ (k) \end{matrix}$

(9)

where

ψ (k) = \frac{Γ^{'} (k)}{Γ (k)}

is the digamma function.

In the sequel, we shall prove the asymptotic unbiasedness and consistency of

H_{n} (f)

2.1. Unbiasedness of $H_{n}$

To prove the asymptotic unbiasedness, we first introduce the following lemma:

Lemma 2.1. For a fixed integer $k < n$ , the asymptotic conditional mean of $L_{n, i}$ given $X_{i} = x$ , is

$E [lim_{n \to \infty} L_{n, i} | X_{i} = x] = ln f (x) + ln k - ψ (k)$

(10)

Proof.

\forall ℓ \in R

, consider the conditional probability

$\begin{matrix} P {L_{n, i} < ℓ | X_{i} = x} & = P {f_{n} (X_{i}) < e^{ℓ} | X_{i} = x} \end{matrix}$

$\begin{matrix} = P {S (ϕ_{i}) > \frac{k}{n} e^{- ℓ}} \end{matrix}$

(11)

Equation (11) implies that there are at most k samples falling within the cap

C_{i}

centered at

X_{i} = x

with area

S_{c_{i}} = \frac{k}{n} e^{- ℓ}

If we let

$p_{n, i} = \int_{C_{i}} f (y) d y$

and

Y_{n, i}

be the number of samples falling onto the cap

C_{i}

, then

Y_{n, i} \sim B I N (n, p_{n, i})

, is a binomial random variable. Therefore,

$P {L_{n, i} < ℓ | X_{i} = x} = P {Y_{n, i} < k}$

If we let

\frac{k}{n} \to 0

n \to \infty

, then

p_{n, i} \to 0

n \to \infty

. It is reasonable to consider the Poisson approximation of

Y_{n, i}

with mean

λ_{n, i} = n p_{n, i} = \frac{k e^{- ℓ}}{S_{c_{i}}} p_{n, i}

. Thus, the limiting distribution of

Y_{n, i}

is a Poisson distribution with mean:

$λ_{i} = lim_{n \to \infty} λ_{n, i} = k e^{- ℓ} lim_{n \to \infty} \frac{p_{n, i}}{S_{c_{i}}} = k e^{- ℓ} f (x)$

(12)

Define a random variable

L_{i}

having the conditional cumulative density function,

$F_{L_{i}, x} (ℓ) = lim_{n \to \infty} P {L_{n, i} < ℓ | X_{i} = x}$

then

$F_{L_{i}, x} (ℓ) = \sum_{j = 0}^{k - 1} \frac{{[k f (x) e^{- ℓ}]}^{j}}{j!} e^{- k f (x) e^{- ℓ}}$

By taking derivative w.r.t. ℓ, we obtain the conditional pdf of

L_{i}

$f_{L_{i}, x} (ℓ) = \frac{{[k f (x) e^{- ℓ}]}^{k}}{(k - 1)!} e^{- k f (x) e^{- ℓ}}$

(13)

The conditional mean of

L_{i}

$E [L_{i} | X_{i} = x] = \int_{- \infty}^{\infty} ℓ \cdot \frac{{[k f (x) e^{- ℓ}]}^{k}}{(k - 1)!} e^{- k f (x) e^{- ℓ}} d ℓ$

By change of variable,

z = k f (x) e^{- ℓ}

$\begin{matrix} E [L_{i} | X_{i} = x] & = \int_{0}^{\infty} [ln f (x) + ln k - ln z] \frac{z^{k - 1}}{(k - 1)!} e^{- z} d z \end{matrix}$

$\begin{matrix} = ln f (x) + ln k - \int_{0}^{\infty} ln z \frac{z^{k - 1}}{(k - 1)!} e^{- z} d z \end{matrix}$

$\begin{matrix} = ln f (x) + ln k - ψ (k) \end{matrix}$

(14)

☐

Corollary 2.2. Given $X_{i} = x$ , let $η_{n, k, x} : = n S (ϕ_{i}) = k e^{- L_{n, i}}$ , then $ln η_{n, k, x} = ln k - L_{n, i}$ converges in distribution to $ln η_{k, x} = ln k - L_{i}$ , and

$E [ln η_{k, x}] = - ln f (x) + ψ (k)$

Moreover, $η_{k, x}$ is a gamma r.v. with the shape parameter k and the rate parameter $f (x)$ .

Theorem 2.3. If a pdf f satisfies the following conditions: for some

ϵ > 0

(A_{1}) : \int_{S^{p - 1}} {| ln f (x) |}^{1 + ϵ} f (x) d x < \infty

(A_{2}) : \int_{S^{p - 1}} \int_{S^{p - 1}} {|ln [1 - I_{{(x^{T} y)}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{1 + ϵ} f (x) f (y) d x d y < \infty

, then the estimator proposed in (9) is asymptotically unbiased.

Proof. According to Corollary 2.2 and condition (

A_{2}

), we can show (see (16)–(22)) that for almost all values of

x \in S^{p - 1}

, there exists a positive constant C such that

(i) E [| ln η_{n, k, x} |^{1 + ϵ}] < C

for all sufficiently large n.

Hence, applying the moment convergence theorem [23] (p. 186), it follows that

$lim_{n \to \infty} E [ln η_{n, k, x}] = E [ln η_{k, x}] = - ln f (x) + ψ (k)$

for almost all values of

x \in S^{p - 1}

. In addition, using Fatou's lemma and condition (

A_{1}

), we have that

$\begin{matrix} \underset{n \to \infty}{lim sup} \int_{S^{p - 1}} {| E [ln η_{n, k, x}] |}^{1 + ϵ} f (x) d x \\ \leq \int_{S^{p - 1}} \underset{n \to \infty}{lim sup} {| E [ln η_{n, k, x}] |}^{1 + ϵ} f (x) d x \\ = \int_{S^{p - 1}} {| - ln f (x) + ψ (k) |}^{1 + ϵ} f (x) d x \\ \leq C_{ϵ} (\int_{S^{p - 1}} {| - ln f (x) |}^{1 + ϵ} f (x) d x + {| ψ (k) |}^{1 + ϵ}) < \infty \end{matrix}$

where

C_{ϵ}

is a constant. Therefore,

$\begin{matrix} lim_{n \to \infty} E [H_{n} (f)] = lim_{n \to \infty} E_{f} [ln (n S (ϕ_{i}))] - ψ (k) \\ = lim_{n \to \infty} \int_{S^{p - 1}} E [ln η_{n, k, x}] f (x) d x - ψ (k) \\ = \int_{S^{p - 1}} lim_{n \to \infty} E [ln η_{n, k, x}] f (x) d x - ψ (k) \\ = \int_{S^{p - 1}} E [ln η_{k, x}] f (x) d x - ψ (k) \\ = \int_{S^{p - 1}} [- ln f (x) + ψ (k)] f (x) d x - ψ (k) \\ = H (f) \end{matrix}$

To show

(i)

, one can follow the arguments similar to those used in the proof of Theorem 1 in [24]. Indeed, we can first establish

$(i i) E [| ln η_{2, 1, x} |^{1 + ϵ}] < C .$

Namely, we justify that

(i)

is valid when

n = 2

and

k = 1

. But the inequality

(i i)

follows immediately from the condition

(A_{2})

and

$\begin{matrix} E [{|ln η_{2, 1, x}|}^{1 + ϵ}] & = E [{|ln [2 S (ϕ_{1, 2})]|}^{1 + ϵ} | X_{1} = x] \end{matrix}$

$\begin{matrix} = E [{|ln (S_{p} [1 - sgn (x^{T} X_{2}) I_{{(x^{T} X_{2})}^{2}} (\frac{1}{2}, \frac{p - 1}{2})])|}^{1 + ϵ}] \end{matrix}$

$\begin{matrix} \leq C_{ϵ} {| ln S_{p} |}^{1 + ϵ} + C_{ϵ} {| ln 2 |}^{1 + ϵ} + C_{ϵ} E_{f} [{|ln [1 - I_{{(x^{T} X_{2})}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{1 + ϵ} 1 (x^{T} X_{2} > 0)] \end{matrix}$

$\begin{matrix} = C_{ϵ} ({| ln S_{p} |}^{1 + ϵ} + {| ln 2 |}^{1 + ϵ}) + \frac{1}{2} C_{ϵ} E_{f} [{|ln [1 - I_{{(x^{T} X_{2})}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{1 + ϵ}] \end{matrix}$

(15)

Here

ϕ_{1, 2} = arccos (X_{1}^{T} X_{2})

and

1 (\cdot)

is the indicator function.

Now let us denote the distribution function of

η_{n, k, x}

$\begin{matrix} G_{n, k, x} (u) = P (η_{n, k, x} \leq u) = P (n S (ϕ_{n, k, 1}) \leq u | X_{1} = x) \\ = 1 - \sum_{j = 0}^{k - 1} (\binom{n - 1}{j}) {(\int_{C_{x} (ϕ_{n} (u))} f (y) d y)}^{j} {(1 - \int_{C_{x} (ϕ_{n} (u))} f (y) d y)}^{n - 1 - j} \end{matrix}$

where

ϕ_{n} (u) = S^{- 1} (\frac{u}{n})

and

C_{x} (ϕ)

is a cap

{y \in S^{p - 1} : y^{T} x \geq cos ϕ}

with the pole x and base radius

sin ϕ

. Note also that the functions

S (ϕ)

(see (3)) and

ϕ_{n} (u) = S^{- 1} (\frac{u}{n})

are both increasing functions.

Now, one can see (cf. (66) in [24]):

$\begin{matrix} E [| ln η_{n, k, x} |^{1 + ϵ}] \leq I_{1} + I_{2} + I_{3} \end{matrix}$

(16)

where

$\begin{matrix} I_{1} = (1 + ϵ) \int_{0}^{1} {(ln \frac{1}{u})}^{ϵ} u^{- 1} G_{n, k, x} (u) d u \end{matrix}$

$\begin{matrix} I_{2} = (1 + ϵ) \int_{1}^{\sqrt{n}} {(ln u)}^{ϵ} u^{- 1} (1 - G_{n, k, x} (u)) d u \end{matrix}$

$\begin{matrix} I_{3} = (1 + ϵ) \int_{\sqrt{n}}^{n S_{p}} {(ln u)}^{ϵ} u^{- 1} (1 - G_{n, k, x} (u)) d u \end{matrix}$

It is easy to see that for sufficiently large n and almost all

x \in S^{p - 1}

$\begin{matrix} I_{1} < (1 + ϵ) f (x) Γ (1 + ϵ) < \infty \end{matrix}$

(17)

and

$\begin{matrix} I_{2} \leq (1 + ϵ) \sum_{j = 0}^{k - 1} {[sup_{y \in S^{p - 1}} f (y)]}^{j} f {(x)}^{- j - ϵ} Γ (j + ϵ) < \infty \end{matrix}$

(18)

(cf. (89) and (85) in [24], respectively).

Finally, let us show that

I_{3} \to 0

n \to \infty

. For each x with

f (x) > 0

, if we choose a

δ \in (0, f (x))

, then for all sufficiently large n,

\sqrt{n} \int_{C_{x} (ϕ_{n} (\sqrt{n}))} f (y) d y > f (x) - δ

, since the area of

C_{x} (ϕ_{n} (\sqrt{n}))

is equal to

\frac{1}{\sqrt{n}}

. Using arguments similar to those used in (69)–(72) from [24], we have

$\begin{matrix} I_{3} \leq & (1 + ϵ) n^{k - 1} k e^{- (n - k - 1) (f (x) - δ) \frac{1}{\sqrt{n}}} \end{matrix}$

$\begin{matrix} \times \int_{\sqrt{n}}^{n S_{p}} {(ln u)}^{ϵ} u^{- 1} (1 - \int_{C_{x} (ϕ_{n} (u))} f (y) d y) d u \end{matrix}$

(19)

The integral in (19) after changing the variable,

t = \frac{2 u}{n}

, takes the form

$\begin{matrix} \int_{\frac{2}{\sqrt{n}}}^{2 S_{p}} {(ln \frac{n t}{2})}^{ϵ} t^{- 1} (1 - G_{2, 1, x} (t)) d t \end{matrix}$

$\begin{matrix} = (\int_{\frac{2}{\sqrt{n}}}^{1} + \int_{1}^{2 S_{p}}) {(ln \frac{n t}{2})}^{ϵ} t^{- 1} (1 - G_{2, 1, x} (t)) d t \end{matrix}$

(20)

since

ϕ_{n} (\frac{n t}{2}) = S^{- 1} (\frac{t}{2}) = ϕ_{2} (t)

and

1 - \int_{C_{x} (ϕ_{2} (t))} f (y) d y = 1 - G_{2, 1, x} (t))

. The first integral in the right side of (20) is bounded as follows:

$\int_{\frac{2}{\sqrt{n}}}^{1} {(ln \frac{n t}{2})}^{ϵ} t^{- 1} (1 - G_{2, 1, x} (t)) d t \leq \frac{\sqrt{n}}{2} {(ln \frac{n}{2})}^{ϵ}$

(21)

while for the second one, we have

$\int_{1}^{2 S_{p}} {(ln \frac{n t}{2})}^{ϵ} t^{- 1} (1 - G_{2, 1, x} (t)) d t \leq C_{ϵ} {(ln \frac{n}{2})}^{ϵ} E [η_{2, 1, x}] + C_{ϵ} {(ln \frac{n}{2})}^{ϵ} B$

(22)

where

$B = \int_{1}^{2 S_{p}} {(ln t)}^{ϵ} t^{- 1} (1 - G_{2, 1, x} (t)) d t = \frac{1}{1 + ϵ} E {| ln η_{2, 1, x} |}^{1 + ϵ}$

Combination of (15)–(22) and

(i i)

yields

(i)

Remark. Note that

$\begin{matrix} 1 - I_{t^{2}} (\frac{1}{2}, \frac{p - 1}{2}) & \approx \frac{1}{B (\frac{1}{2}, \frac{p - 1}{2})} {(t^{2})}^{- \frac{1}{2}} {(1 - t^{2})}^{\frac{p - 1}{2}} \\ \approx \frac{2^{\frac{p - 1}{2}}}{B (\frac{1}{2}, \frac{p - 1}{2})} {(1 - t)}^{\frac{p - 1}{2}} as t ↑ 1 \end{matrix}$

where

B (\cdot, \cdot)

is the beta function. Hence, in the conditions

(A_{j}), j = 2

, 4, 6 and 8, the difference

1 - I_{{(x^{T} y)}^{2}} (\frac{1}{2}, \frac{p - 1}{2})

can be replaced by

1 - x^{T} y

2.2. Consistency of $H_{n}$

Lemma 2.4. Under the following conditions: for some

ϵ > 0

(A_{3}) : \int_{S^{p - 1}} {| ln f (x) |}^{2 + ϵ} f (x) d x < \infty

(A_{4}) : \int_{S^{p - 1}} \int_{S^{p - 1}} {|ln [1 - I_{{(x^{T} y)}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{2 + ϵ} f (x) f (y) d x d y < \infty

the asymptotic variance of $L_{n, i}$ is finite and equals $V_{f} [ln f (X)] + ψ_{1} (k)$ , where $ψ_{1} (k)$ is the trigamma function.

Proof. The conditions

(A_{3})

and

(A_{4})

, and the argument similar to the one used in the proof of Theorem 2.3, yields

$lim_{n \to \infty} E [L_{n, i}^{2} | X_{i} = x] = E [L_{i}^{2} | X_{i} = x]$

Therefore, it is sufficient to prove that

V_{f} [L_{i}] = V_{f} (ln f (X)) + ψ_{1} (k)

. Similarly to (14), we have

$\begin{matrix} E [L_{i}^{2} | X_{i} = x] & = \int_{0}^{\infty} {[ln f (x) + ln k - ln z]}^{2} \frac{z^{k - 1}}{(k - 1)!} e^{- z} d z \end{matrix}$

$\begin{matrix} = {[ln f (x) + ln k]}^{2} - 2 [ln f (x) + ln k] ψ (k) + Γ^{''} (k) / Γ (k) \end{matrix}$

(23)

Since

Γ^{''} (k) / Γ (k) = ψ^{2} (k) + ψ_{1} (k)

$\begin{matrix} E [L_{i}^{2} | X_{i} = x] & = {[ln f (x) + ln k - ψ (k)]}^{2} + ψ_{1} (k) \end{matrix}$

(24)

After some algebra, it can be shown that

$\begin{matrix} V_{f} [L_{i}] & = E_{f} [{(ln f (X))}^{2}] - {(E_{f} [ln f (X)])}^{2} + ψ_{1} (k) \end{matrix}$

$\begin{matrix} = V_{f} [ln f (X)] + ψ_{1} (k) \end{matrix}$

(25)

☐

Lemma 2.5. For a fixed integer $k < n$ , $L_{n, i}$ are asymptotically pairwise independent.

Proof. For a pair of random variables

L_{n, i}

and

L_{n, j}

with

i \neq j

and

X_{i} \neq X_{j}

, following the similar argument for Lemma 2.1,

C_{i}

and

C_{j}

shrink as n increases. Thus, it is safe to assume that

C_{i}

and

C_{j}

are disjoint for large n, and

L_{n, i}

and

L_{n, j}

are independent. Hence Lemma 2.5 follows. ☐

Theorem 2.6. Under the conditions $(A_{1})$ through $(A_{4})$ , the variance of $H_{n} (f)$ decreases with sample size n, that is

$lim_{n \to \infty} V_{f} [H_{n} (f)] = 0$

(26)

and $H_{n} (f)$ is a consistent estimator of $H (f)$ .

Theorem 2.6 can be established by using Theorem 2.3 and Lemmas 2.4 and 2.5, and

$lim_{n \to \infty} V_{f} [H_{n} (f)] = lim_{n \to \infty} \frac{1}{n} {V_{f} [ln f (X)] + ψ_{1} (k)} = 0$

For a finite sample, the variance of

H_{n} (f)

can be approximated by

\frac{1}{n} {V_{f} [ln f (x)] + ψ_{1} (k)}

. For instance, for the uniform distribution,

V_{f} [ln f (x)] = 0

and

V [H_{n} (f)] \approx ψ_{1} (k) / n

and for a vMF

_{p} (μ, κ)

V [H_{n} (f)] \approx \frac{1}{n} [κ^{2} A_{p}^{'} (κ) + ψ_{1} (k)]

. See the illustration in Figure 1. The simulation was done with sample size

n = 1000

and the number of simulations was

N = 10, 000

. Since

ψ_{1} (k)

is a decreasing function, the variance of

H_{n} (f)

decreases when k increases.

Figure 1. Variances of

H_{n} (f)

by simulation and approximation.

Figure 1. Variances of

H_{n} (f)

by simulation and approximation.

Entropy 13 00650 g001

3. Estimation of Cross Entropy and KL-divergence

3.1. Estimation of Cross Entropy

The definition of cross entropy between continuous pdf's f and g is,

$H (f, g) = - \int f (x) ln g (x) d x$

(27)

Given a random sample of size n from f, {

X_{1}, X_{2}, \dots, X_{n}

}, and a random sample of size m from g, {

Y_{1}, Y_{2}, \dots, Y_{m}

}, on a hypersphere, denote the knn density estimator of g by

g_{m}

. Similarly to (7),

$g_{m} (X_{i}) = \frac{k / m}{S (φ_{i})}$

(28)

where

φ_{i}

is the distance from

X_{i}

to its k-th nearest neighbor in {

Y_{1}, Y_{2}, \dots, Y_{m}

}. Analogously to the entropy estimator (9), the cross entropy can be estimated by:

$H_{n, m} (f, g) = \frac{1}{n} \sum_{i = 1}^{n} ln S (φ_{i}) + ln m - ψ (k)$

(29)

Under the conditions

(A_{1})

–

(A_{4})

, for a fixed integer

k < min (n, m)

, one can show that

H_{n, m} (f, g)

is asymptotically unbiased. Moreover, by similar reasoning applied for

H_{n} (f)

, one can show that

H_{n, m} (f, g)

is also consistent and

V [H_{n, m} (f, g)] \approx \frac{1}{n} {V_{f} [ln g (x)] + ψ_{1} (k)}

. For example, when both f and g are vMF with the same mean direction and different concentration parameters,

κ_{1}

and

κ_{2}

, respectively, the approximate variance will be

\frac{1}{n} [κ_{2}^{2} A_{p}^{'} (κ_{1}) + ψ_{1} (k)]

. Figure 2 shows the approximated and simulated variance of the knn estimators for cross entropy are close to each other and both decrease with k. The simulation is done with sample size

n = m = 1000

and the number of simulations was

N = 10, 000

Figure 2. Variances of

H_{n, m}

by simulation and approximation.

Figure 2. Variances of

H_{n, m}

by simulation and approximation.

Entropy 13 00650 g002

3.2. Estimation of KL-Divergence

KL-divergence is also known as relative entropy. It is used to measure the similarity of two distributions. Wang et al. [24] studied the knn estimator of KL-divergence for distributions defined on

R^{p}

. Here we propose the knn estimator of KL-divergence of continuous distribution f from g defined on a hypersphere. The KL-divergence is defined as:

$K L (f ∥ g) = E_{f} [ln f (X) / g (X)] = \int f (x) ln \frac{f (x)}{g (x)} d x$

(30)

Equation (30) can also be expressed as

K L (f ∥ g) = H (f, g) - H (f)

. Then the knn estimator of KL-divergence is constructed as

H_{n, m} (f, g) - H_{n} (f)

, i.e.,

$K L_{n, m} (f ∥ g) = \frac{1}{n} \sum_{i = 1}^{n} ln \frac{f_{n} (X_{i})}{g_{m} (X_{i})} = \frac{1}{n} \sum_{i = 1}^{n} ln \frac{S (φ_{i})}{S (ϕ_{i})} + ln \frac{m}{n}$

(31)

where

g_{m} (X_{i})

is defined as in (28). Besides, for finite samples, the variance of the estimator,

V [K L_{n, m}]

, is approximately

\frac{1}{n} {V_{f} [ln f (X)] + V_{f} [ln g (X)] - 2 {Cov}_{f} [ln f (X), ln g (X)] + 2 ψ_{1} (k)}

. When f and g are vMF as mentioned above, with concentration parameter

κ_{1}

and

κ_{2}

, respectively, we have:

$\begin{matrix} V_{f} [ln f (X)] = κ_{1}^{2} A_{p}^{'} (κ_{1}) \\ V_{f} [ln g (X)] = κ_{2}^{2} A_{p}^{'} (κ_{1}) \end{matrix}$

and

$\begin{matrix} {Cov}_{f} [ln f (X), ln g (X)] = κ_{1} κ_{2} A_{p}^{'} (κ_{1}) \end{matrix}$

So the approximate variance is

\frac{1}{n} [{(κ_{1} - κ_{2})}^{2} A_{p}^{'} (κ_{1}) + 2 ψ_{1} (k)]

. Figure 3 shows the approximated and simulated variance of the knn estimators for KL-divergence. The approximation for von Mises-Fisher distribution is not as good as the one for uniform distributions. This could be due to the modality of von Mises-Fisher distributions or the finitude of sample sizes. The larger the sample size, the closer the approximation is to the true value.

Figure 3. Variances of

K L_{n, m}

by simulation and approximation.

Figure 3. Variances of

K L_{n, m}

by simulation and approximation.

Entropy 13 00650 g003

In summary, we have

Corollary 3.1. (1) Under conditions $(A_{1}), (A_{2})$ and for some $ϵ > 0$ ,

(A_{5}) : \int_{S^{p - 1}} {| ln g (x) |}^{1 + ϵ} f (x) d x < \infty

(A_{6}) : \int_{S^{p - 1}} \int_{S^{p - 1}} {|ln [1 - I_{{(x^{T} y)}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{1 + ϵ} f (x) g (y) d x d y < \infty

for a fixed integer $k < min (n, m)$ , the knn estimator of KL-divergence given in (31) is asymptotically unbiased.

(2) Under condition $(A_{3}), (A_{4})$ and for some $ϵ > 0$ ,

(A_{7}) : \int_{S^{p - 1}} {| ln g (x) |}^{2 + ϵ} f (x) d x < \infty

(A_{8}) : \int_{S^{p - 1}} \int_{S^{p - 1}} {|ln [1 - I_{{(x^{T} y)}^{2}} (\frac{1}{2}, \frac{p - 1}{2})]|}^{2 + ϵ} f (x) g (y) d x d y < \infty

for a fixed integer $k < min (n, m)$ , the knn estimator of KL-divergence given in (31) is asymptotically consistent.

To prove the last two corollaries, one can follow the similar steps proposed in Wang et al. [24].

4. Simulation Study

To demonstrate the proposed knn entropy estimators and assess their performance for finite samples, we conducted simulations for the uniform distribution and von Mises-Fisher distributions with the p-coordinate unit vector,

e_{p}

, as the common mean direction for

p = 3

and 10. For each distribution, we drew samples of size

n = 100

, 500 and 1000. All simulations were repeated

N = 10, 000

times. Bias, standard deviation (SD) and root mean squared error (RMSE) were calculated.

4.1. Bias and Standard Deviation

Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 show simulated bias and standard deviation of the proposed entropy, cross-entropy and KL-divergence estimators along different k. The pattern for the standard deviation is clear. It decreases sharply then slowly as k increases. This is consistent with the variance approximations described in Section 2 and Section 3. The pattern for bias is diverse. For uniform distributions, the bias term is very small. When the underlying distribution has a mode, for example, vMF models used in the current simulations, the relation between bias and k becomes complex and the bias term can be larger for larger k values.

Figure 4.

| Bias |

(dashed line) and standard deviation (solid line) of entropy estimate

H_{n}

for uniform distributions.

Figure 4.

| Bias |

(dashed line) and standard deviation (solid line) of entropy estimate

H_{n}

for uniform distributions.

Entropy 13 00650 g004

Figure 5.

| Bias |

(dashed line) and standard deviation (solid line) of entropy estimate

H_{n}

for vMF

_{p} (e_{p}, 1)

distributions.

Figure 5.

| Bias |

(dashed line) and standard deviation (solid line) of entropy estimate

H_{n}

for vMF

_{p} (e_{p}, 1)

distributions.

Entropy 13 00650 g005

Figure 6.

| Bias |

(dashed line) and standard deviation (solid line) of cross entropy estimate

H_{n, m}

for uniform distributions.

Figure 6.

| Bias |

(dashed line) and standard deviation (solid line) of cross entropy estimate

H_{n, m}

for uniform distributions.

Entropy 13 00650 g006

Figure 7.

| Bias |

(dashed line) and standard deviation (solid line) of cross entropy estimate

H_{n, m}

for

f = {vMF}_{p} (e_{p}, 1)

and

g =

uniform distributions.

Figure 7.

| Bias |

(dashed line) and standard deviation (solid line) of cross entropy estimate

H_{n, m}

for

f = {vMF}_{p} (e_{p}, 1)

and

g =

uniform distributions.

Entropy 13 00650 g007

Figure 8.

| Bias |

(dashed line) and standard deviation (solid line) of KL-divergence estimate

K L_{n, m}

for uniform distributions.

Figure 8.

| Bias |

(dashed line) and standard deviation (solid line) of KL-divergence estimate

K L_{n, m}

for uniform distributions.

Entropy 13 00650 g008

Figure 9.

| Bias |

(dashed line) and standard deviation (solid line) of KL-divergence estimate

K L_{n, m}

for

f = {vMF}_{p} (e_{p}, 1)

and

g =

uniform distributions.

Figure 9.

| Bias |

(dashed line) and standard deviation (solid line) of KL-divergence estimate

K L_{n, m}

for

f = {vMF}_{p} (e_{p}, 1)

and

g =

uniform distributions.

Entropy 13 00650 g009

4.2. Convergence

To validate the consistency, we conducted simulations of different sample size n from 10 to 100,000 for the distribution models used above. Figure 10 and Figure 11 shows the estimates and theoretical values of entropy, cross-entropy and KL-divergence for different sample sizes with

k = 1

and

k = ⌊ ln n + 0.5 ⌋ =

2–12, respectively. The proposed estimators converge to the corresponding theoretical values quickly. Thus the consistency of these estimators are verified. The choice of k is an open problem for knn based estimation approaches. These figures show that using lager k, e.g., the logarithm of n, for lager n, is giving a slightly better preference.

Figure 10. Convergence of estimates with sample size n using the first nearest neighbor. For vMF

_{p}

κ = 1

Figure 10. Convergence of estimates with sample size n using the first nearest neighbor. For vMF

_{p}

κ = 1

Entropy 13 00650 g010

Figure 11. Convergence of estimates with sample size n using

k = ⌊ ln n + 0.5 ⌋

nearest neighbors. For vMF

_{p}

κ = 1

Figure 11. Convergence of estimates with sample size n using

k = ⌊ ln n + 0.5 ⌋

nearest neighbors. For vMF

_{p}

κ = 1

Entropy 13 00650 g011

4.3. Comparison with the Moment-Recovered Construction

Another entropy estimator for hyperspherical data was developed recently by Mnatsakanov et al. [10] using MR approach. We call this estimator the MR entropy estimator and denote it by

H_{n}^{(M R)} (f)

$H_{n}^{(M R)} (f) = - \frac{1}{n} \sum_{i = 1}^{n} ln P_{n, t} (X_{i}) + ln S (arccos t)$

(32)

where

P_{n, t} (X_{i})

is the estimated probability of the cap

{y \in S^{p - 1} : y^{T} X_{i} \geq t}

defined by the revolution axis

X_{i}

and t is the distance from the cap base to the origin and acts as a tuning parameter. Namely, (see Mnatsakanov et al. [10]),

$P_{n, t} (X_{i}) = \frac{1}{n - 1} \sum_{j = 1, j \neq i}^{n} \sum_{k = ⌊ n t ⌋ + 1}^{n} (\binom{n}{k}) {(X_{j}^{T} X_{i})}^{k} {(1 - X_{j}^{T} X_{i})}^{n - k}$

(33)

Via simulation study, the empirical comparison between

H_{n} (f)

and

H_{n}^{(M R)} (f)

was done for the uniform and vMF distributions. The results are presented in Table 1. The values of k and t listed in the table are the optimal ones in the sense of minimizing RMSE. Z-tests and F-tests (at

α = 0.05

) were performed to compare the bias, standard deviation (variance) and RMSE (MSE) between the knn estimators and corresponding MR estimators. In general, for uniform distributions, there are no significant difference for biases. Among other comparisons, the differences are significant. Specifically, knn achieves slightly smaller bias and RMSE values than those of the MR method. The standard deviations of knn method are also smaller for the uniform distribution but larger for vMF distributions than those based on MR approach.

Table 1. Comparison of knn and moment methods by simulations for spherical distributions.

**Table 1.** Comparison of knn and moment methods by simulations for spherical distributions.
Method		knn				MR
p	n	k	bias	SD	RMSE	t	bias	SD	RMSE
Uniform:
3	100	99	0.00500	0.00147	0.00521	0.01	0.00523	0.01188	0.01298
3	500	499	0.00100	0.00013	0.00101	0.01	0.00107	0.00233	0.00257
3	1000	999	0.00050	0.00005	0.00050	0.01	0.00051	0.00120	0.00130
10	100	99	0.00503	0.00130	0.00520	0.01	0.00528	0.01331	0.01432
10	500	499	0.00100	0.00011	0.00101	0.01	0.00102	0.00264	0.00283
10	1000	999	0.00050	0.00004	0.00050	0.01	0.00052	0.00130	0.00140
vMF $_{p} (e_{p}, 1)$ :
3	100	71	0.01697	0.05142	0.05415	0.30	0.02929	0.04702	0.05540
3	500	337	0.00310	0.02336	0.02356	0.66	0.00969	0.02318	0.02512
3	1000	670	0.00145	0.01662	0.01668	0.74	0.00620	0.01658	0.01770
10	100	46	0.02395	0.02567	0.03511	0.12	0.02895	0.02363	0.03737
10	500	76	0.00702	0.01361	0.01531	0.40	0.01407	0.01247	0.01881
10	1000	90	0.00366	0.01026	0.01089	0.47	0.01115	0.00907	0.01437

5. Discussion and Conclusions

In this paper, the knn based estimators for entropy, cross-entropy and Kullback-Leibler divergence are proposed for distributions on hyperspheres. Asymptotic properties such as unbiasedness and consistency are proved and validated by simulation studies using uniform and von Mises-Fisher distribution models. The variances of these estimators decrease with k. For uniform distributions, variance is dominant and bias is negligible. When the underlying distributions are modal, the bias can be large if k is large. In general, we conclude that the behavior of knn and MR entropy estimators have similar performance in terms of root mean square error.

Acknowledgements and Disclaimer

The authors thank the anonymous referees for their helpful comments and suggestions. The research of Robert Mnatsakanov was supported by NSF grant DMS-0906639. The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the National Institute for Occupational Safety and Health.

austinhistalle.blogspot.com

Source: https://www.mdpi.com/1099-4300/13/3/650/htm

A Nearest neighbor Approach to Estimating Divergence Between Continuous Random Vectors

1. Introduction

2. Construction of knn Entropy Estimators

2.1. Unbiasedness of $H_{n}$

2.2. Consistency of $H_{n}$

3. Estimation of Cross Entropy and KL-divergence

3.1. Estimation of Cross Entropy

3.2. Estimation of KL-Divergence

4. Simulation Study

4.1. Bias and Standard Deviation

4.2. Convergence

4.3. Comparison with the Moment-Recovered Construction

5. Discussion and Conclusions

Acknowledgements and Disclaimer

0 Response to "A Nearest neighbor Approach to Estimating Divergence Between Continuous Random Vectors"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel

A Nearest neighbor Approach to Estimating Divergence Between Continuous Random Vectors

1. Introduction

2. Construction of knn Entropy Estimators

2.1. Unbiasedness of H n

2.2. Consistency of H n

3. Estimation of Cross Entropy and KL-divergence

3.1. Estimation of Cross Entropy

3.2. Estimation of KL-Divergence

4. Simulation Study

4.1. Bias and Standard Deviation

4.2. Convergence

4.3. Comparison with the Moment-Recovered Construction

5. Discussion and Conclusions

Acknowledgements and Disclaimer

0 Response to "A Nearest neighbor Approach to Estimating Divergence Between Continuous Random Vectors"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel

2.1. Unbiasedness of $H_{n}$

2.2. Consistency of $H_{n}$