Quantified Maximum Entropy MemSys5 Users' Manual

CHAPTER 2. CLASSIC MAXIMUM ENTROPY
\[ = \sum_{j=1}^{M} R_{kj} \sum_{i=1}^{L} C_{ji} h_i \]
and hence
\[ F = R C h \]
and
\[ \frac{\partial F}{\partial h} = R C. \]
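The chain from hidden space to data space can be sketched numerically. The following is a minimal illustration, assuming dense NumPy arrays for C (M × L, hidden to visible) and R (N × M, visible to data), where L, M and N are the dimensions of hidden, visible and data space. The sizes and random matrices are purely illustrative and are not taken from MemSys5. It checks by finite differences that the derivative of F with respect to h is the constant matrix RC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: hidden (L), visible (M) and data (N) dimensions.
L, M, N = 4, 6, 3
C = rng.normal(size=(M, L))        # hidden -> visible (assumed dense here)
R = rng.normal(size=(N, M))        # visible -> data (assumed dense here)
h = rng.uniform(0.1, 1.0, size=L)  # a hidden-space distribution


def forward(h):
    """Mock data F = R C h, applied as two successive linear maps."""
    return R @ (C @ h)


F = forward(h)
jacobian = R @ C  # dF/dh, an N x L matrix independent of h

# Finite-difference check, one hidden-space coordinate at a time.
eps = 1e-6
fd = np.column_stack([(forward(h + eps * e) - F) / eps for e in np.eye(L)])
print(np.allclose(fd, jacobian, atol=1e-6))
```

Applying C and then R, rather than forming RC explicitly, keeps the cost at two matrix-vector products; the explicit product appears here only for the check.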
Note again the dimensions of the three spaces:
    L = dimension of hidden space
    M = dimension of visible space
    N = dimension of data space.

2.6 Inferences about the noise level and other variables
The classic maximum entropy analysis carries with it the overall probability of the data from (2.5)
on page 19:
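\[
\begin{aligned}
\Pr(D) &= \int d\alpha \, \Pr(\alpha \mid D) = \Pr(\hat\alpha \mid D) \\
       &= (2\pi)^{-N/2} \det[\sigma^{-1}] \exp\bigl(\hat\alpha S(\hat h) - L(\hat h)\bigr) (\det B)^{-1/2}.
\end{aligned}
\]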
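In practice an expression of this kind is best evaluated in logarithmic form, since the exponential and the determinants can easily overflow or underflow. The sketch below is one way to do that, under stated assumptions: `log_evidence` is a hypothetical helper (not a MemSys5 routine), [σ⁻¹] is taken to be diagonal so that its log-determinant is minus the sum of log σ over the N data, and the values of S and L at the maximum, together with the matrix B, are supplied by the caller.

```python
import numpy as np


def log_evidence(alpha, S_hat, L_hat, sigma, B):
    """log of (2 pi)^(-N/2) det[sigma^-1] exp(alpha*S - L) (det B)^(-1/2).

    Hypothetical helper. Assumes [sigma^-1] is diagonal, with `sigma`
    holding the N standard deviations, and that S_hat and L_hat are the
    entropy and likelihood terms evaluated at the maximum.
    """
    sigma = np.asarray(sigma, dtype=float)
    n_data = sigma.size
    sign, logdet_B = np.linalg.slogdet(B)
    if sign <= 0:
        raise ValueError("B must have a positive determinant")
    return (-0.5 * n_data * np.log(2.0 * np.pi)
            - np.sum(np.log(sigma))          # log det[sigma^-1]
            + alpha * S_hat - L_hat
            - 0.5 * logdet_B)


# Illustrative numbers only.
value = log_evidence(alpha=2.0, S_hat=-0.5, L_hat=1.5,
                     sigma=[0.5, 2.0], B=np.diag([2.0, 8.0]))
print(value)
```

Using `numpy.linalg.slogdet` avoids forming det B directly, which matters when B is large.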
This expression becomes useful when it is realised that, like any other probabilistic expression,
it is conditional on the underlying assumptions. In fact all the probabilities derived in the classic
maximum entropy analysis have been conditional upon a choice of model m, experimental variables
defining the response functions, noise amplitudes σ, and so on, so that Pr(D) is really a shorthand
for
Pr(D | m, R, σ, . . . ).
If such variables are imperfectly known, these conditional probability values can be used to refine
our knowledge of them, by using Bayes’ theorem in the form
Pr(variables | D) = constant × Pr(variables) Pr(D | variables).
Ideally, one would set up a full prior for unknown variables and integrate them out in order to
determine Pr(h | D). In practice, though, with large datasets it usually suffices to select the single
“best” values of the variables, just as was the case for the regularisation constant α.
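As a small illustration of this use of Bayes' theorem, the sketch below takes a handful of candidate settings of some unknown variable, combines a prior with the corresponding values of log Pr(D | variables), and normalises. The function name and the numbers are hypothetical; in a real analysis each log-evidence value would come from a separate classic maximum entropy run.

```python
import numpy as np


def posterior_over_candidates(log_prior, log_evidence):
    """Pr(variables | D) on a discrete set of candidates.

    Implements constant * Pr(variables) * Pr(D | variables) in log form,
    subtracting the maximum before exponentiating for numerical safety.
    """
    log_post = np.asarray(log_prior, float) + np.asarray(log_evidence, float)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical log Pr(D | variables) for three candidate settings,
# with a uniform prior over the three.
log_ev = [-105.0, -100.0, -103.0]
log_prior = np.log(np.full(3, 1.0 / 3.0))
post = posterior_over_candidates(log_prior, log_ev)
best = int(np.argmax(post))  # index of the single "best" candidate
print(post, best)
```

When one candidate dominates the posterior, as tends to happen with large datasets, picking `best` loses little compared with integrating over all candidates.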
A common special case of this concerns experimental data in which the overall noise level
is uncertain, so that all the standard deviations σ in Pr(D) above should be scaled with some
coefficient c. Rescaling α to α/c² for convenience gives
\[ \Pr(D \mid \alpha, c) = (2\pi c^2)^{-N/2} \det[\sigma^{-1}] \exp\bigl((\alpha S(\hat h) - L(\hat h))/c^2\bigr) (\det B)^{-1/2}. \]
For this case, the maximum entropy trajectory itself is unaltered and parameterised by the
same values of α, though the probability clouds for h are of different overall size. The Evidence is
maximised over c when
\[ c^2 = 2 (L - \alpha S)/N. \]
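This value follows from the c-dependent part of the log Evidence, −(N/2) log(2πc²) + (αS − L)/c²: setting its derivative with respect to c² to zero gives −N/(2c²) + (L − αS)/c⁴ = 0. The sketch below checks the result numerically on a grid. The values of α, S, L and N are hypothetical, chosen only so that L − αS is positive, and the terms that do not depend on c are omitted.

```python
import numpy as np


def log_pr_c_terms(c2, alpha, S_hat, L_hat, n_data):
    """c-dependent part of log Pr(D | alpha, c), as a function of c^2.

    det[sigma^-1] and det B do not depend on c and are left out.
    """
    return -0.5 * n_data * np.log(2.0 * np.pi * c2) + (alpha * S_hat - L_hat) / c2


# Hypothetical values for illustration only.
alpha, S_hat, L_hat, n_data = 3.0, -2.0, 40.0, 50

# Closed-form maximum: c^2 = 2 (L - alpha S) / N.
c2_formula = 2.0 * (L_hat - alpha * S_hat) / n_data

# Brute-force maximum over a fine grid of c^2 values.
grid = np.linspace(0.5, 5.0, 100001)
c2_grid = grid[np.argmax(log_pr_c_terms(grid, alpha, S_hat, L_hat, n_data))]

print(c2_formula, c2_grid)
```

The two values agree to within the grid spacing, which is all a check of this kind can show.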
At this scaling, we note that (for linear data) the χ² misfit statistic
\[ \chi^2 = (D - F)^T [c^{-2} \sigma^{-2}] (D - F) \]