Download BinClass: A Software Package for Classifying Binary Vectors User's

Transcript
2 Theory
2.1 Clustering
The main focus of BinClass is on classifying (clustering) data consisting of
binary vectors. By clustering we mean division of the set of vectors to a
set of disjoint subsets (clusters, classes) in a such way that the cost of the
classication is minimal. There are many methods available and some of
them are implemented in BinClass.
The clustering problem is an important eld of research in computer science. Classication of data is a fundamental tool in pattern recognition and
vector quantization, which are applications of image processing and computer vision [17]. Classication is also important in neural and self learning
systems [31]. Their applications vary widely from business to medicine. Our
research group has been interested in the problem in the taxonomic sense,
especially bacterial taxonomy [15]. The ability to classify things is undoubtly
one of the key features of human intelligence. It is also well known that the
clustering problem is a dicult one, and we have to resort on approximate
solutions [8, 14].
We rst discuss methods to measure the cost of classication. There are
simple error measures such as MSE which are quite commonly used in many
applications, and more complex ones such as stochastic complexity. We shall
rst discuss these and then the algorithms.
2.1.1 Minimum Error
Let us suppose that a set of binary vectors X t of form x(l) = (x1(l) ; x2(l) ; : : : ; xd(l) ),
xi(l) 2 [0; 1], denoted by X t = fx(l) jl = 1; 2; : : : ; tg is classied to a set of k
disjoint classes C = fC1; : : : ; Ck g, Cj = fx(l) jl = 1; 2; : : : ; tj g, j 2 [1; : : : ; k]
by some method. Then for each class Cj we compute the number of the ones
in each column i by
tj
X
tij = xi(l)
l=1
and dene the centroid of the class Cj by
^j = (^1j ; : : : ; ^dj ); ^ij = ttij :
j
(1)
Furthermore, we dene the Hypothetical Mean Organism (HMO), by rounding ^j
aj = (a1j ; : : : ; adj ); aij = b^ij + 1=2c
(2)
4