RSES 2.2 User’s Guide
Warsaw University
http://logic.mimuw.edu.pl/~rses
January 19, 2005
Contents

1 Introduction to RSES
  1.1 The history of RSES creation
  1.2 Aims and capabilities of RSES
  1.3 Technical requirements and installation of RSES
    1.3.1 Installation in MS Windows
    1.3.2 Installation in Linux

2 Using RSES
  2.1 Managing projects
  2.2 Objects
  2.3 Main system menu
  2.4 Toolbar
  2.5 Context menu
  2.6 Status and progress of calculation

3 Objects in project - quick reference
  3.1 Tables
  3.2 Reduct sets
  3.3 Rule sets
  3.4 Cut sets
  3.5 Linear combinations
  3.6 Decomposition trees
  3.7 LTF Classifiers
  3.8 Results

4 Main data analysis methods in RSES
  4.1 Missing value completion
  4.2 Cuts, discretization and grouping
  4.3 Linear combinations
  4.4 Reducts and decision rules
  4.5 Data decomposition
  4.6 k-NN classifiers
  4.7 LTF Classifier
  4.8 Cross-validation method

5 RSES scenario examples
  5.1 Train-and-test scenarios
    5.1.1 Rule based classifier
    5.1.2 Rule based classifier with discretization
    5.1.3 Decomposition tree
    5.1.4 k-NN classifier
    5.1.5 LTF Classifier
  5.2 Testing with use of cross-validation
  5.3 Scenario for decision generation

A Selected RSES 2.2 file formats
  A.1 Data sets
  A.2 Reduct sets
  A.3 Rule sets
  A.4 Set of cuts
  A.5 Linear combinations
  A.6 LTF-C
  A.7 Classification results

Bibliography

Index
Chapter 1
Introduction to RSES
RSES 2.2 - Rough Set Exploration System 2.2 - is a software tool that provides the means for analysis of tabular data sets with the use of various methods, in particular those based on Rough Set Theory (see [22]).
The RSES system was created by the research team supervised by Professor Andrzej Skowron. Currently, the RSES R&D team consists of: Jan Bazan (University of Rzeszów), Rafał Latkowski (Warsaw University), Michał Mikołajczyk (Warsaw University), Nguyen Hung Son (Warsaw University), Nguyen Sinh Hoa (Polish-Japanese Institute of Information Technology), Andrzej Skowron (Warsaw University), Dominik Ślęzak (University of Regina and Polish-Japanese Institute of Information Technology), Piotr Synak (Polish-Japanese Institute of Information Technology), Marcin Szczuka (Warsaw University), Arkadiusz Wojna (Warsaw University), Marcin Wojnarski (Warsaw University), Jakub Wróblewski (Polish-Japanese Institute of Information Technology).
The RSES system is freely available (for non-commercial use) on the Internet. The software and information about it can be downloaded from:
http://logic.mimuw.edu.pl/~rses
1.1 The history of RSES creation
Back in 1993, as a project accompanying the master's theses of Krzysztof Przyłucki and Joanna Słupek (supervised by Andrzej Skowron at Warsaw University), a piece of software named decision table analyzer was created. This early system was written in C++ for the Windows 3.11 platform. The creation of this software was also supported by: Jan Bazan, Tadeusz Gąsior, and Piotr Synak.
In the following year (1994) the first version of RSES (version 1.0) was created. Written in C++ for HP-UX (a Unix flavor), it was available only for Apollo workstations by Hewlett-Packard. Version 1.0 of the RSES system was developed by: Jan Bazan, Agnieszka Chądzyńska, Nguyen Hung Son, Nguyen Sinh Hoa, Adam Cykier, Andrzej Skowron, Piotr Synak, Marcin Szczuka, and Jakub Wróblewski.
In 1996 the RSES-lib 1.0 library of computational methods was put together. Written in C++, it ran on both Unix and Microsoft Windows platforms. The RSES-lib 1.0 library was used as a part of the computational kernel of ROSETTA (Rough Set Toolkit for Analysis of Data). The ROSETTA system was developed between 1996 and 1998 as a result of cooperation between Warsaw University and the Norwegian University of Science and Technology (NTNU) in Trondheim. The ROSETTA system was initially developed for Microsoft Windows 9x/NT (see, e.g., [21]). The creators of RSES-lib 1.0 include: Jan Bazan, Nguyen Hung Son, Nguyen Sinh Hoa, Adam Cykier, Andrzej Skowron, Piotr Synak, Marcin Szczuka, and Jakub Wróblewski. The ROSETTA system owes its concept and initial development to Jan Komorowski and Aleksander Øhrn (both of the Knowledge Systems Group/NTNU at that time). The ROSETTA system is still in use today; for details refer to [26].
The next version of RSES-lib, i.e., RSES-lib 2.0, was created in 1998 and 1999, mostly to satisfy the demand for a newer and more versatile tool to be used in computations done for the research project ESPRIT-CRIT2 (funded by the European Commission). One of the CRIT2 sub-projects, devoted to data analysis and knowledge discovery, was realized at the Group of Logic, Warsaw University, under the supervision of Professor Andrzej Skowron. The work on RSES-lib 2.0 was done by: Jan Bazan, Nguyen Hung Son, Nguyen Sinh Hoa, Andrzej Skowron, Piotr Synak, Marcin Szczuka, and Jakub Wróblewski.
In 2000 a new version of RSES emerged. This time it was equipped with a Graphical User Interface (GUI) for Microsoft Windows 9x/NT/2000/Me. The system was written in C++ and used RSES-lib 2.0 as its computational backbone. As this version of RSES was the first to be equipped with a Microsoft Windows GUI, it was named version 1.0. The developers of this version include: Jan Bazan, Nguyen Hung Son, Nguyen Sinh Hoa, Andrzej Skowron, Piotr Synak, Marcin Szczuka, and Jakub Wróblewski.
The year 2002 brought the next major version of RSES (version 2.0), significantly different from the previous ones. This time the system was written in Java, although some parts of the computational kernel still use elements of RSES-lib 2.0 (written in C++). The C++ part of the computational kernel has been partly re-written in order to comply with the standards of the GCC compiler (GNU C++). In this way, by using Java and GCC, the portability of the system was achieved. Starting with this version (2.0) the RSES system is distributed for both Microsoft Windows 9x/NT/2000/Me/XP and Linux/i386.
One year after RSES 2.0 the next version, RSES 2.1, was put together. It featured a newer, improved, more versatile and more user-friendly GUI as well as several new computational methods.
1.2 Aims and capabilities of RSES
The main aim of RSES is to provide a tool for performing experiments on
tabular data sets.
In general, the RSES system offers the following capabilities:
• import of data from text files,
• visualization and pre-processing of data including, among others, methods for discretization and missing value completion,
• construction and application of classifiers for both smaller and vast
data sets, together with methods for classifier evaluation.
The RSES system is a software tool with an easy-to-use interface, at the same time featuring a collection of methods that make it possible to perform compound, non-trivial experiments in data exploration with the use of Rough Set methods.
1.3 Technical requirements and installation of RSES
In order to run RSES we recommend at least:
• CPU - Pentium 200 MHz;
• 128 MB RAM;
• Java Virtual Machine version 1.4.1;
• Operating system MS Windows 9x/NT/2000/Me/XP or Linux/i386
(kernel 2.2 or newer).
The majority of RSES is written in Java. Therefore, in order to run it, an appropriate version of the JVM (Java Virtual Machine) is required. Moreover, as part of the computation is done by methods from RSES-lib 2.0 (written in C++, see 1.1), the compiled version of this code is distributed with the installation bundle as an executable file named RSES.kernel. Please note that this part of RSES is platform-dependent. Figure 1.1 illustrates the general architecture of RSES. Notice that the Java part of RSES comprises two logical sub-parts: the RSES 2.2 GUI and the computational kernel powered by the RSES-lib 3.0 library. The GUI part is responsible for interaction with the user, while RSES-lib 3.0 serves as the computational engine and a wrapper for the RSES-lib 2.0 computational routines.
Figure 1.1: RSES internal architecture. [Layered diagram: the RSES GUI (Java Swing) and RSES-lib 3.0 (Java) run on the Java Virtual Machine, wrapping RSES-lib 2.0 (C++), on top of the operating system (MS Windows, Linux).]
On the RSES homepage http://logic.mimuw.edu.pl/~rses one will find the installation bundles for both MS Windows and Linux/i386. Those bundles contain the RSES executables as well as several demonstration data sets. These data sets are provided in order to help the user start working with RSES without the necessity of preparing data in advance. We hope that the example data sets provided in the installation bundles will make the first steps in RSES easier for the user and help him/her in preparing his/her own data for experiments.
1.3.1 Installation in MS Windows
MS Windows users who already have the Java Virtual Machine installed can directly download and run the RSES installation bundle. If the JVM is not present, or its version is older than 1.4, we recommend that it be installed prior to RSES installation. The current version of Java (either SDK or JRE) may be obtained at no cost from the Sun Microsystems website http://java.sun.com.
After downloading the RSES installation bundle, which is a single executable file, it is enough to double-click on it in order to launch the installer program. The installation is performed according to the standard MS Windows procedure. During the installation process the user may choose the installation directory and some other basic installation parameters. In case of confusion during the installation process we recommend accepting the default values proposed by the installer.
1.3.2 Installation in Linux
Please note that the RSES installation bundle for Linux contains, as an executable binary file, the part of the computational kernel that is written in C++ and statically built. This binary file was created with GCC ver. 2.95 for machines running Linux kernel 2.2 or newer on the i386 architecture.¹
In order to run RSES on Linux, the Java Virtual Machine version 1.4.1 or newer has to be installed. If the JVM is not present, or its version is older than 1.4, we recommend that it be installed prior to RSES installation. The current version of Java for Linux (either SDK or JRE) may be obtained at no cost from the Sun Microsystems website http://java.sun.com. The RSES installation bundle for Linux is provided as a single .tgz file containing tarred and compressed files. In order to install it, the user has to unpack the files into a directory of choice, preserving the structure and names of directories from the .tgz file.
The easiest way to unpack RSES properly, inside the directory of choice, is by executing from the command line:

tar -xvzf rses_22.tgz

After unpacking it is important to check that the files startRSES and rses.kernel have the Executable attribute set. Once this is done, we are ready to launch RSES (refer to the beginning of the next chapter).
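If the executable attribute was not preserved during unpacking, it can usually be restored with a standard command such as:

chmod +x startRSES rses.kernel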
¹ Tested with vanilla and customized kernels up to 2.6.7.
Chapter 2
Using RSES
To start RSES 2.2 on the Microsoft Windows platform one has to open the menu Start/Programs/Rses2 and select Rough Set Exploration System ver. 2.2.
To start RSES 2.2 in Linux one has to open a terminal window in the directory where Rses.jar is located and execute the command:

java -jar Rses.jar

or use the provided simple script to start RSES from the command line:

./startRSES

To stop the application one has to select /File/Exit from the menu (Alt+fe) and confirm the decision to quit. Thanks to this obligatory confirmation it is quite impossible to close the program by accident without saving the results of current work.
Once the program is launched, the main RSES 2.2 window appears, consisting of the main menu, toolbar and project workspace.
The main menu contains the most general options offered by RSES. Some of its functionality is also accessible via the toolbar and context menus.
2.1 Managing projects
The user may work on several experiments at the same time; the RSES system is capable of handling several projects simultaneously. However, only one of the experiments may be active (performing computation) at any given moment.
At this point it is worth mentioning that it is possible to run several RSES experiments at the same time in a distributed environment. For this purpose a special, additional module named Dixer (DIstributed eXEcutoR) has been developed. For the sake of convenience and versatility the Dixer subsystem is equipped with a separate graphical user interface. The simple interface of Dixer was designed in order to provide an easy way of setting up multiple distributed experiments (see http://logic.mimuw.edu.pl/~rses).

Figure 2.1: RSES 2.2 - main window just after start

Figure 2.2: Main Menu

Figure 2.3: Toolbar
A new project in RSES may be created in one of three possible ways:
• by choosing /File/New project from the main menu,
• by using the keyboard shortcut Alt+fn,
• by clicking on the first (from the top) toolbar button.
Figure 2.4: New project in RSES
Once a new (empty) project has been created, we may place in it various objects such as data tables, decision rule sets, result summaries, etc. To find out more about possible objects in projects refer to section 2.2.
In the upper part of the window the user can see tabs with the names of the currently active (lighter tab) and other (darker tabs) existing projects. By clicking on a tab the user can access his/her projects. This feature simplifies work when several projects are being developed.
In the lower part of each project workspace two tabs are present. These
tabs are used to switch between two types of project views:
• Design view – standard view for working with project,
• History view – this view presents all registered events that happened during the operations on the project.
At the bottom and on the right side of the project window (while in Design view) scrollbars are placed. They simplify work with large projects that cannot fit into a small window.
In the /File menu the user will find options for saving and restoring a project to/from a file. A detailed description of the main menu options is given in section 2.3.
By right-clicking within the project's workspace area (white area) the user invokes a context menu. With the use of this context menu the user can insert new objects (entities) into the project (see subsection 2.5).
2.2 Objects
Objects that can be placed in projects fall into the following categories:
• Data Tables
• Reduct Sets
• Rule Sets
• Cut Sets
• Linear Combinations
• Decomposition Trees
• LTF-C (Local Transfer Function Classifiers)
• Classification Results (Experiments’ effects)
To create an object we may (besides using the context menu) choose a corresponding option from the menu (by using the mouse or a keyboard shortcut) or click the corresponding button on the toolbar. If we introduce a new object into the project using the context menu, then the object is placed at the position of the mouse cursor.¹ In the case of objects introduced with use of the /Insert menu or toolbar, the objects emerge in the central part of the visible workspace. Each new object is shifted a few pixels from the central point, so that occlusions do not occur when we introduce several objects at the same time.
Dependencies between objects within a project are marked by connecting such objects with arrows. For example, if we calculate decision rules for some data table, then an arrow originating in the table and pointing at the set of rules appears.
Figure 2.5: Two objects bound by dependence (table and rules).
To move an object within a project the user has to select it with a mouse click and then, holding the left mouse button, move the mouse to the desired position. Releasing the mouse button causes the object to be placed at the new position.
¹ The context menu for the main project window will be further referred to as the general menu.
The user can also move several objects at the same time. To do so one has to mark several objects (using the mouse) and then click and hold the left mouse button on one of the selected objects. The group of objects is then moved and, once the mouse button is released, the objects are placed at the new position.
To select several objects in a project it is enough to press the left mouse button while the mouse cursor is within the project space and then, still holding the mouse button, mark a rectangular area within the project. Objects that are inside this rectangular area are instantly selected.
Another way of selecting/deselecting objects in groups is by clicking on an object while holding the Ctrl key. The object marked in this way becomes selected or deselected (if it was previously selected) without changing the selection status of the other objects in the project.
Each object, as well as each group of objects, has a corresponding context menu attached. With this menu it is possible to change the name, duplicate, save, restore, remove, show components, and perform other operations (specific to the object's type) on an object. The context menu is accessible by right-clicking on a selected object. A detailed description of objects' context menus is provided in section 3. A description of the context menu for groups of objects is given in subsection 2.5.
Selected objects and object groups may be removed from the project using the Delete key. Each deletion, in order to avoid loss of data, has to be confirmed by clicking “OK” in the corresponding dialog box.
2.3 Main system menu
An option from the RSES 2.2 main menu may be applied by selecting it from a drop-down menu with the use of the mouse, or by using a keyboard shortcut. Keyboard shortcuts are constructed according to a simple principle: the combination to be used is Alt+<letter>, where <letter> is the (only) underlined letter in the name of the option we want to choose. By holding the Alt key and selecting subsequent options we may navigate through the entire menu structure. For instance, by pressing Alt+<f>+<n> we create a new project (/File/New project).
Some of the menu options are accompanied by small icons. These icons directly correspond to buttons on the toolbar that have the same functionality. Such icons are also present in the general (context) menu.
A short description of the options in the main menu is presented below.
– File
    – New project
    – Open project
    – Close project
    – Close all projects
    – Save project
    – Save project As
    – Exit
– Insert
    – Insert table
    – Insert reduct set
    – Insert rule set
    – Insert cut set
    – Insert linear combination set
    – Insert decomposition tree
    – Insert LTF-C
    – Insert results
– Help
    – About

Figure 2.6: Layout of the main menu
• File – projects’ management
– New project – creates new project
– Open project – restores previously saved project from the disk
– Close project – closes active project
– Close all projects – closes all currently open projects
– Save project – saves active project to a file on disk
– Save project As – saves active project to the specified file on disk
– Exit – terminates RSES
• Insert – inserting new objects into active project
– Insert table – inserts data table
– Insert reduct set – inserts reduct set
– Insert rule set – inserts rule set
– Insert cut set – inserts cut and/or attribute partition set
– Insert linear combination set – inserts a set of linear combinations
– Insert decomposition tree – inserts decomposition tree
– Insert LTF-C – inserts LTF-C ( Local Transfer Function Classifier)
– Insert results – inserts an object for viewing experiment results
• Help – help and information
– About – basic information about RSES
Notice! All objects added to the project with use of the menu or toolbar appear in the central part of the project workspace, at a position that is randomly shifted by a few pixels from the actual center. This is to avoid occlusion (when inserting several objects at the same time) and to make selection of the new object easier while working with the mouse.
2.4 Toolbar
The toolbar contains buttons corresponding to selected options from the main and general menus. In this way the RSES user has instant access to the most common actions.
Figure 2.7: Toolbar and corresponding menu options
• New project – creates new project
• Open project – restores previously saved project from the disk
• Save project – saves active project to a file on disk
• Exit – terminates RSES
• Insert table – inserts data table
• Insert reduct set – inserts reduct set
• Insert rule set – inserts rule set
• Insert cut set – inserts cut and/or attribute partition set
• Insert linear combination set – inserts a set of linear combinations
• Insert decomposition tree – inserts decomposition tree
• Insert LTF-C – inserts LTF-C ( Local Transfer Function Classifier)
• Insert results – inserts an object for viewing experiment results
• About – basic information about RSES
Notice! All objects added to the project with use of the menu or toolbar appear in the central part of the project workspace, at a position that is randomly shifted by a few pixels from the actual center. This is to avoid occlusion (when inserting several objects at the same time) and to make selection of the new object easier while working with the mouse.
2.5 Context menu
A context menu is associated with every object (or group of objects) in the project workspace. It can be accessed by right-clicking on the object.
The contents of the context menu depend on the kind of object (or objects). We briefly present the contents of context menus below, and discuss their options in more detail further in this manual.
Along with the context menus for particular objects there exists a context menu for the entire project space. We call it the general menu. To access the general menu the user has to right-click on an empty area within the project workspace (white area).
Options from the general menu are also accessible from the main program menu and the toolbar.
List of options in general menu:
• Insert table – inserts data table
• Insert reduct set – inserts reduct set
– Insert table
– Insert reduct set
– Insert rule set
– Insert cut set
– Insert linear combination set
– Insert decomposition tree
– Insert LTF-C
– Insert results

Figure 2.8: Layout of general menu (context menu for the project)
• Insert rule set – inserts rule set
• Insert cut set – inserts cut and/or attribute partition set
• Insert linear combination set – inserts a set of linear combinations
• Insert decomposition tree – inserts decomposition tree
• Insert LTF-C – inserts LTF-C ( Local Transfer Function Classifier)
• Insert results – inserts an object for viewing experiment results
The context menu for a group of (selected) objects can be accessed by clicking on any of the selected objects with the right mouse button.
– Align Left
– Align Center
– Align Right
– Align Top
– Align Middle
– Align Bottom
– Remove

Figure 2.9: Layout of context menu for a group of objects
Options in context menu for a group of objects:
• Align Left – align selected objects horizontally to the leftmost object
• Align Center – center selected objects horizontally
• Align Right – align selected objects horizontally to the rightmost object
• Align Top – align selected objects vertically to the one on the top
• Align Middle – center selected objects vertically
• Align Bottom – align selected objects vertically to the one at the bottom
• Remove – removes all selected objects
Note that using Align Left, Align Center, Align Right, Align Top, Align Middle or Align Bottom causes the selected objects to appear in one (horizontal or vertical) line. As a result some objects may be occluded by others.
2.6 Status and progress of calculation
The user has access to information about the history of operations performed in the project. This information is collected in the history view, reachable via the History tab at the bottom of the project workspace. The history view stores information about operations on objects, performed calculations, errors, and calculation terminations.
Figure 2.10: Computation progress control
During each calculation a new control window appears. In this window the user can see the advancement of the current calculation shown with use of a progress bar. The user can instantly terminate the currently running computation by clicking the Terminate button. After such a termination a dialog box with information appears, and the fact of termination is logged in the project's history.
Chapter 3
Objects in project - quick reference
In this chapter we present a quick review of the objects appearing in RSES projects, together with a short description of the operations that may be performed on them. More detailed descriptions of the algorithms and their options are presented in chapter 4.
3.1 Tables
Tables are the most important entities in a project. They represent data tables (tabular data sets) and allow for their examination and editing, and for launching computations on the data.
Figure 3.1: Icon representing decision table
The user can view the data contained in a table by double clicking on it or by selecting View from the table object's context menu.
The context menu for a table contains the following options:
• Load – load data from a file into the table object. The file may be in one of the following formats: RSES, RSES 1.0, Rosetta, or Weka.
• Save As – save data to file in RSES format.
• View – view the contents of table (see figure 3.3). The user may scroll,
and rearrange the view window.
– Load
– Save As
– View
– Change name
– Change decision attribute
– Remove
– Split in Two
– Select subtable
– Complete
    – Remove objects with missing values
    – Complete with most common or mean value
    – Complete with most common or mean value w.r.t. decision class
– Discretize
    – Generate cuts
    – Discretize table
– Linear combinations
    – Generate linear combinations
    – Add linear combinations as new attributes
– Reducts/Rules
    – Calculate reducts or rules
    – Calculate dynamic reducts
– Make decomposition
– Create LTF-C
– Classify
    – Test table using rule set
    – Test table using decomposition tree
    – Test table using k-NN
    – Test table using LTF-C
    – Cross-validation method
– Statistics
    – Statistics for attributes
    – Comparison of two selected attributes
– Positive region

Figure 3.2: Layout of context menu for data table object
• Change name – change the table name (see figure 3.4). This name is saved together with the data. The table name does not have to be identical with the name of the file used to store the table on disk. The table name can also be altered by double-clicking on the table name appearing below the icon.
• Change decision attribute – selects the decision attribute. The selected attribute is moved to the end of the table (becomes the last attribute).
• Remove – removes the table (after separate confirmation).
• Split in Two – randomly splits the table into two disjoint subtables (see figure 3.5).
Figure 3.3: View of data table contents
Figure 3.4: Changing table name
The Split factor parameter between 0.0 and 1.0 – specified by the user – determines the size of the first subtable. The other subtable is the complement of the first. For instance, setting Split factor to 0.45 means that the first subtable will contain 45% of the objects from the original table, while the other will contain the remaining 55%. The new table objects in the project, resulting from the split operation, are automatically assigned names composed of the original table's name and the value of the Split factor (or its complement to 1 - see figure 3.5).
• Select subtable – creation of a subtable (see figure 3.6) by selection of a subset of attributes. In case all attributes are selected, a copy of the original table is created.
• Complete – completion of missing data (see subsection 4.1). The user provides the name for a new table, which is created by the selected algorithm on the basis of the original one. This new table contains no missing values.

Figure 3.5: Splitting table in two

Figure 3.6: Selecting subtable
• Discretize – data discretization (cf. [5]) and generation of cuts for attributes (refer to subsection 4.2).
• Linear combinations – generation of new attributes in the table. New
attributes are generated as linear combinations of existing ones (see
subsection 4.3)
• Reducts/Rules – calculation of reducts, dynamic reducts and decision rules from the data (refer to subsection 4.4).
• Make decomposition – this option initiates calculation of a decomposition tree for the data table (refer to subsection 4.5).
• Create LTF-C – creates an LTF-C (Local Transfer Function Classifier) based on an artificial neural network (see subsection 4.7).
• Classify – launches classification with use of a selected classifier. The classifier has to be constructed and trained prior to its use. Regardless of the classifier chosen, the user can use one of two classification modes:
– Generate confusion matrix – calculates a matrix that stores summarized classification errors;
– Classify new cases – classifies new cases (with no known decision) and stores the result as a new (decision) column in the table.
To select the classification mode the user has to select the appropriate option in the General test mode field. This choice is available for all classification methods.
The user has a choice of the following classification methods:
– Test table using rule set – classification with use of decision rules
(see subsection 4.4);
– Test table using decomposition tree – classification with use of decomposition tree (see subsection 4.5);
– Test table using k-NN – classification of selected table with use of
Nearest Neighbors method (see subsection 4.6);
– Test table using LTF-C – classification with use of LTF-C (see subsection 4.7);
– Cross-validation method – classification with use of cross-validation
method applied to any of the classifiers mentioned above (see subsection 4.8).
• Statistics – displays basic information about the table and attributes.
– Statistics for attributes opens the window shown in figure 3.7. Clicking the Show chart button results in a graphical display of the selected attribute's value distribution in the form of a bar chart (see figure 3.8). For symbolic attributes the occurrences of all attribute values are counted and presented as a histogram. In case of numerical (continuous) attributes the attribute value space is divided into several equal intervals and the histogram for these intervals is displayed. The number of intervals is selected by the user with the Number of intervals option. As the user may be interested only in some part of the attribute value space, it is possible to alter the attribute range by clicking on the Change limits button and inputting the desired values (limits). The graphs with attribute distributions may be exported to an external file. Currently, RSES 2.2 allows saving them in PNG, JPG and HTML formats. To export a graph the user has to click the Export button in the graph window (see figure 3.8).
Figure 3.7: Information on data table
– Comparison of two selected attributes opens a dialog window that permits the user to choose two attributes to be compared (see figure 3.9). First, the user selects the types of attributes to be compared. This can be done by marking one of: Numeric and numeric, Symbolic and symbolic or Symbolic and numeric. Once the types are chosen, the system filters the attributes with respect to their type and places them in two columns (see figure 3.9). The user has to select one attribute in each column and hit the Show chart button. Depending on the types of attributes chosen (symbolic/numeric), different plots may be produced as a result. In case two numeric attributes are compared, the result is a scatter plot as shown in figure 3.10. Each axis on the plot corresponds to one of the compared attributes. Comparison of two symbolic attributes results in a bar chart as shown in figure 3.11, whereas comparison of a symbolic and a numeric attribute yields a scatter plot as in figure 3.12.
Figure 3.9: Settings for comparison of two attributes
Figure 3.10: Distribution of two numeric attributes presented as graph.
• Positive region – calculates the positive region for the table. (Notice: the last column in the table is assumed to be the decision attribute. The precedence of columns cannot be changed in the table view. It is, however, possible to move a column to the last position using the option Change decision attribute from the table's context menu.)
Figure 3.11: Distribution of two symbolic attributes presented as graph.
Figure 3.12: Comparison of two attributes: symbolic and numeric.
3.2 Reduct sets
A reduct for an information system is a subset of attributes which preserves all the discernibility information from the information system, and such that none of its proper subsets has this ability (see [22]).
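To make this definition concrete, the following minimal sketch (plain Python, not part of RSES; it assumes rows given as tuples of attribute values indexed by position) checks the reduct property by comparing indiscernibility partitions:

    from itertools import combinations

    def partition(rows, attrs):
        # Group row indices by their values on the chosen attributes.
        blocks = {}
        for i, row in enumerate(rows):
            blocks.setdefault(tuple(row[a] for a in attrs), set()).add(i)
        return set(frozenset(b) for b in blocks.values())

    def is_reduct(rows, all_attrs, subset):
        # A reduct preserves the full partition, and no proper subset may do so.
        full = partition(rows, all_attrs)
        if partition(rows, subset) != full:
            return False
        return all(partition(rows, c) != full
                   for r in range(len(subset))
                   for c in combinations(subset, r))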
Figure 3.13: Icon representing reduct set
Double clicking on the icon representing a reduct set corresponds to the View option in the context menu, and results in the contents of this object being displayed.
– View
– Change name
– Remove
– Filter
– Shorten
– Generate rules
– Save As
– Load
– Append
– Statistics

Figure 3.14: Layout of context menu for reduct set
Options in the context menu for reduct set:
• View – displays the contents of the reduct set (see figure 3.15). The user can scroll and resize this window according to requirements.
The reduct set view window consists of five columns. The first of these columns stores the identification number; the others have the following meaning (for a single row):
– Size – size of the reduct, number of participating attributes.
– Pos.Reg. – the positive region for the table after reduction, i.e.
after removing attributes from outside the reduct.
Figure 3.15: Viewing contents of reduct set
– SC – value of the Stability Coefficient (SC) for the reduct. This value is used to determine the stability of the reduct in the dynamic case (see [2] and subsection 4.4).
– Reducts – reduct presented as a list of attributes.
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes reduct set (additional confirmation required).
• Filter – filters the reduct set. The user can remove reducts on the basis of the stability coefficient (SC). Before using this option it is recommended to examine the statistics for the set of reducts to be filtered.
• Shorten – shortening of reducts. The user provides a coefficient between 0 and 1, which determines how “aggressive” the shortening procedure should be. A coefficient equal to 1.0 means that no shortening occurs. If the Shortening ratio is near zero, the algorithm attempts to maximally shorten reducts. This shortening ratio is in fact a threshold imposed on the relative size of the positive region after shortening.
• Generate rules – generates a set of decision rules on the basis of the
reduct set and selected data table (see also subsection 4.4).
• Save As – saves the set of reducts to a file.
• Load – loads previously stored reduct set from a file.
• Append – extends the current reduct set with reducts from a file. Repeated entries appear only once.
• Statistics – presents basic statistics on the reduct set (see figure 3.16). It also provides the ability to display the core (the intersection of all reducts).
The user may also review graphical information on the distribution of reduct lengths as well as on the frequency and role of particular attributes in the construction of reducts.
Figure 3.16: Information on reduct set
3.3 Rule sets
Decision rules make it possible to classify objects, i.e. assign a value of the decision attribute. Having a collection of rules pointing at different decisions, we may perform voting, obtaining in this way a simple rule-based decision support system.
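As an illustration of such voting, here is a minimal sketch (plain Python, not RSES code; the representation of a rule as a triple of a condition dictionary, a decision value and a support count is a made-up convention, with votes weighted by support):

    from collections import Counter

    def classify(obj, rules):
        # Collect support-weighted votes from all rules matching the object.
        votes = Counter()
        for conditions, decision, support in rules:
            if all(obj.get(attr) == val for attr, val in conditions.items()):
                votes[decision] += support
        return votes.most_common(1)[0][0] if votes else None  # None: unrecognized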
Double clicking on the icon representing rule set corresponds to the View
option in the context menu, and results in the contents of this object being
displayed.
Figure 3.17: Icon representing rule set
– View
– Change name
– Remove
– Filter
– Shorten
– Generalize
– Save As
– Load
– Append
– Statistics

Figure 3.18: Layout of context menu for rule set.
List of options in context menu for rule set:
• View – displays contents of the rule set (see figure 3.19). The user can
scroll and resize this window according to requirements.
Figure 3.19: Viewing contents of rule set
The rule set view window consists of three columns. The first of these columns stores the identification number; the others have the following meaning (for a single row):
– Match – number of objects from the training table matching the conditional part of the rule (the support of the rule).
– Decision rules – the rule itself, presented as a logical formula.
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes the rule set (additional confirmation required).
• Filter – filters the rule set. The user can remove rules on the basis of support, or remove rules pointing at a particular decision class. Before using this option it is advisable to examine the statistics for the set of rules to be filtered.
• Shorten – shortening of rules. The user provides a coefficient between 0 and 1, which determines how “aggressive” the shortening procedure should be. A coefficient equal to 1.0 means that no shortening occurs. If the Shortening ratio is near zero, the algorithm attempts to maximally shorten rules (cf. [2], [5]).
• Generalize – makes rules more general. The user provides a coefficient (a ratio) between 0 and 1, which determines how “aggressive” the generalization procedure should be. A coefficient equal to 1.0 means that the generalization must preserve the precision level of the original rules. For coefficients closer to zero, the generalization may cause rules to lose precision for the sake of greater applicability.
• Save As – saves the set of rules to a file.
• Load – loads previously stored rule set from a file.
• Append – extends the current rule set with rules from a file. Repeated entries appear only once.
• Statistics – presents basic statistics on the rule set (see figure 3.20). The user may also review graphical information on the distribution of rule lengths as well as on the distribution of rules between decision classes.
Figure 3.20: Information about rule set
3.4 Cut sets
By cuts we understand definitions of decompositions of attribute value sets. In case of numerical attributes being discretized in order to produce a collection of intervals, the cuts are the thresholds defining these intervals. In case of symbolic attributes being grouped (quantized), cuts define disjoint subsets of the original attribute values.
Figure 3.21: Icon representing cut set
Double clicking on the icon representing a cut set corresponds to the View option in the context menu, and results in the contents of this object being displayed.
– View
– Change name
– Remove
– Save As
– Load

Figure 3.22: Layout of context menu for cut set
List of options in context menu for cut set:
• View – displays the contents of the cut set (see figure 3.23). The user can scroll and resize this window according to requirements.
Figure 3.23: Viewing contents of cut set
The cut set view window consists of four columns. The first of these columns stores the identification number; the others have the following meaning (for a single row):
– Attribute – name of the attribute for which the cuts have been calculated.
– Size – number of cuts used for this attribute.
– Description – list of values representing the cuts; * represents the absence of cuts for an attribute.
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes cut set (additional confirmation required).
• Save As – saves the set of cuts to a file.
• Load – loads previously stored cut set from a file.
3.5 Linear combinations
A linear combination is an attribute (newly) created as a weighted sum of selected existing attributes. We may have several such attributes for different weight settings and different attributes participating in the weighted sum.
Linear combinations are created on the basis of a collection of attribute sets consisting of k elements. Those k-element attribute sets, as well as the parameters of the combination (weights), are generated automatically by an adaptive optimization algorithm implemented in RSES. As a measure for optimization we may use one of three possibilities. The measures take into account the potential quality of decision rules constructed on the basis of the newly created linear combination attribute. For details on these measures please turn to [23]. The user may specify, by inputting a collection of numbers in the Pattern of linear combinations field, the number of new attributes to be constructed and the number of original attributes to be used in each linear combination. In this field the user states how many attributes should appear in a particular combination. For instance, by entering the sequence “223344” the user orders the algorithm to generate 6 (as there are 6 numbers in the sequence) linear combinations using two (first couple), three (third and fourth) and four (last two) original attributes, respectively.
Notice! The algorithm sometimes returns fewer combinations than it was ordered to, or returns combinations with fewer components. This may happen if there are no combinations that comply with the specification and, at the same time, make any sense for the given data.
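The value of such a new attribute is simply a weighted sum, as in this minimal sketch (plain Python, not RSES code; the dictionary mapping attribute positions to weights is a made-up illustration):

    def linear_combination(row, weights):
        # e.g. weights = {0: 0.5, 3: -1.2} combines attributes 0 and 3
        return sum(w * row[a] for a, w in weights.items())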
Figure 3.24: Icon representing a set of linear combinations
Double clicking on the icon representing the set of linear combinations
corresponds to the View option in the context menu, and results in the contents of this object being displayed.
– View
– Change name
– Remove
– Save As

Figure 3.25: Layout of context menu for a set of linear combinations
List of options in context menu for a set of linear combinations:
• View – displays contents of the set (see figure 3.26). The user can scroll
and resize this window according to requirements.
Figure 3.26: Viewing the contents of linear combination set
The linear combination view window consists of just two columns. The first of these columns stores the identification number; the other stores the linear combination itself, written as an arithmetic formula.
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes set of linear combinations (additional confirmation
required).
• Save As – saves the set of linear combinations to a file.
3.6 Decomposition trees
Decomposition trees are used to split a data set into fragments not larger than a predefined size. These fragments, represented after decomposition as leaves of the decomposition tree, are supposed to be more uniform and easier to cope with decision-wise. For more information on the underlying methods please turn to [20] and [4]. Usually the subsets of data in the leaves of a decomposition tree are used for the calculation of decision rules (cf. [5]).
Figure 3.27: Icon representing decomposition tree
Double clicking on the icon representing a decomposition tree corresponds to the View option in the context menu, and results in the opening of a window with a graph of the tree.
– View
– Change name
– Remove
Figure 3.28: Layout of context menu for decomposition tree
List of options in context menu for a decomposition tree:
• View – opens a new window with the display of the decomposition tree (see figure 3.29). The user can scroll and resize this window according to requirements.
The decomposition tree view window displays information on the number of tree nodes and the tree itself. By placing the mouse cursor over any tree node we may get (after a second, a tag appears) information about the template (pattern) which is matched by the subset of objects corresponding to this node. Each internal (green) tree node is associated with a simple context menu consisting of just one option, View node info. This context menu allows for inspection of the template corresponding to each node. The same information may be accessed by double-clicking on the selected internal node. In case of leaf (red) nodes the context menu contains two more options, View rules and Save rules. The View rules option opens a new window that lists the rules corresponding to the leaf. The rules are displayed in the standard RSES way (see figure 3.19 in subsection 3.3). The Save rules option lets the user store the rules in a file on disk.

Figure 3.29: Decomposition tree view
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in the project file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes decomposition tree from project (additional confirmation required).
3.7 LTF Classifiers
LTF-C (Local Transfer Function Classifier) (see [24]) is a classification-oriented artificial neural network model similar to that of a Radial Basis Function (RBF) network. It consists of two layers of computational neurons (plus an input layer). The first computational layer (hidden layer) consists of neurons that correspond to clusters of objects from the same decision class. Each of these neurons has a decision class assigned and tries to construct a cluster of objects from this class. The second computational layer (output layer) consists of neurons which gather the information from the hidden (cluster-related) neurons, sum it up and produce the final network output.
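To give an intuition for this architecture, here is a minimal sketch of an RBF-like forward pass (plain Python; the Gaussian transfer function and the flat list-of-tuples network representation are illustrative assumptions, not the exact LTF-C formulas, which are given in [24]):

    import math

    def hidden_activation(x, center, recip_radii):
        # Response of one hidden neuron; recip_radii are reciprocals of the
        # reception-field widths, one per input dimension (cf. @RecipOfRadii below).
        d2 = sum(((xi - ci) * ri) ** 2
                 for xi, ci, ri in zip(x, center, recip_radii))
        return math.exp(-d2)

    def classify(x, neurons):
        # neurons: list of (decision_class, center, recip_radii, height);
        # each output neuron sums the activations of its class's hidden neurons.
        scores = {}
        for cls, center, rr, height in neurons:
            scores[cls] = scores.get(cls, 0.0) + height * hidden_activation(x, center, rr)
        return max(scores, key=scores.get)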
Figure 3.30: Icon representing LTF-C

Double clicking on the icon representing the LTF-C corresponds to the View option in the context menu, and results in the opening of a window with the network description.
– View
– Change name
– Remove
– Save As

Figure 3.31: Layout of context menu for LTF-C
List of options in context menu for LTF-C:
• View – opens a new window with the display of the network parameters (see figure 3.32). The user can scroll and resize this window according to requirements.
The window with information on the LTF-C displays the contents of the file that is created during the network's learning process. This file is split into two parts. The first part stores the parameters used for learning; the other describes the network architecture (settings for both the whole structure and single neurons).
Each parameter in this description is given as @ followed by the parameter's name. Among the dozen or so learning parameters in the file (starting with @MaxNeur up to @DeltaTr) only @EtaUse and @UseTr are important for the user. They are set equal to the value input by the user in the field Threshold for neuron removal in the window used to create the LTF-C. The value of this threshold controls the process of removing redundant hidden neurons during network training. The larger the threshold, the more easily neurons are dropped. For a large threshold the resulting neural network is more compact. However, the threshold should not be too high, in order to obtain a relevant and accurate classifier.
The network description starts at the line @Inputs:
Figure 3.32: LTF-C view
– @Inputs – number of attributes, i.e. size of input
– @Outputs – number of outputs, i.e. number of decision classes
– @Neurons – number of neurons
– @SumCycles – number of training cycles (each cycle is a single presentation of a training object). This number is equal to the value input by the user in the Number of training cycles field in the network creation window.
In the following lines the neurons are described (see fig. 3.33). Important parameters describing a neuron:
– @Class – number of the decision class to which this neuron corresponds; a number between 0 and (@Outputs-1)
@<Neuron0
@Class          1
@Life           2999
@AvgUsefuln     0.04410261
@AvgActiv       0.00400000
@EtaEta         0.000
@Out            0.000
@Height         1.0000
@Weights        0.8408  0.3130  0.3921  0.3070
@RecipOfRadii   0.3955  0.7920  1.9009  1.5605
@>

Figure 3.33: Example of the neuron description (view) in LTF-C
– @Life – neuron’s lifespan in number of training cycles. For most
of neurons this number is smaller than @SumCycles as there is only
one neuron in initial networks, and others are being automatically
added if necessary.
– @Weights – neuron’s weights, i.e. coordinates of the center of cluster which correspond to the neuron. These coordinates describe
input vector for which the neuron produces maximal output value.
Notice! If during network training the Normalize each numeric attribute was active then weights are normalized as well as corresponding attributes. The attributes are normalized to have zero as
expected value and variance equal to 1.
– @RecipOfRadii – reciprocal of neuron’s radii, i.e. width of its
reception field (in all dimensions). Smaller value corresponds to
wider reception field (w.r.t a given dimension) which is an indicator of decreased importance of this dimension for the neurons
final output.
As with weights, the values of radii depend on normalization (Normalize each numeric attribute option set by user).
Notice! Neurons which were removed during training are not present in the network description; only neurons which survived to the end of the training process are there.
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes LTF-C (additional confirmation required).
• Save As – saves the LTF-C to a file.
3.8 Results
The “Results” object is used to store the outcome of a classification experiment.
Figure 3.34: Icon representing results
Double clicking on the icon representing the set of results corresponds
to the View option in the context menu, and results in the contents of this
object being displayed.
– View
– Change name
– Remove
– Save As

Figure 3.35: Layout of context menu for a set of results
List of options in the context menu for results:
• View – opens a new window with the display of the result summary (see figure 3.36). The user can scroll and resize this window according to requirements.
Figure 3.36: View of classification results
The window presenting classification results provides various information. The central part is occupied by the confusion matrix, which in our example is of the form:

                Predicted
Actual            1      0
  1             210      6
  0              52    164
Rows in this matrix correspond to actual decision classes (all possible values of the decision) while columns represent decision values as returned by the classifier in question. In our example (above) we may see that the constructed classifier sometimes mistakes objects from class 1 for those from class 0 (6 such cases). We may also see that the classifier mistakes class 0 for class 1 several times more frequently than the other way round (52 cases as compared to 6). The values on the diagonal represent correctly classified cases. If all non-zero values in the confusion matrix appear on the diagonal, we conclude that the classifier makes no mistakes for the given set of data.
On the right side of the matrix there are additional columns with information such as:
– No. of obj. – number of objects in the data set that belong to the decision class corresponding to the current row.
– Accuracy – ratio of the number of correctly classified objects from the class to the number of all objects assigned to the class by the classifier.
– Coverage – ratio of the number of classified (recognized by the classifier) objects from the class to the number of all objects in the class.
The last row in the table contains the True positive rate for each decision class.
Below the table a number of additional values is presented:
– Total number of tested objects – number of test objects used to obtain this result.
– Total accuracy – ratio of the number of correctly classified cases (the sum of values on the diagonal of the confusion matrix) to the number of all tested cases (as in the previous point).
– Total coverage – percentage of test objects that were recognized by the classifier.
In our example Total coverage equals 1, which means that all objects have been recognized (classified). Such total coverage is not always the case, as the constructed classifier may not be able to recognize previously unseen objects. If some test objects remain unclassified, the Total coverage value is less than 1. (A small sketch computing these total values follows this option list.)
• Change name – changes the object's name (see figure 3.4); the name is stored together with the contents of the object in a file. The name of the object does not need to be identical with the name of the file used to store it. The name can also be changed by double clicking on the name tag below the icon representing the object.
• Remove – removes results (additional confirmation required).
• Save As – saves results to a file.
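As a worked illustration of the total values above, the following minimal sketch (plain Python, not RSES code; the dictionary representation of the confusion matrix is a made-up convention) recomputes them for the example matrix:

    def totals(matrix, unrecognized=0):
        # matrix: {(actual, predicted): count}; `unrecognized` counts test
        # objects the classifier could not classify at all.
        tested = sum(matrix.values()) + unrecognized
        correct = sum(n for (actual, predicted), n in matrix.items()
                      if actual == predicted)
        return correct / tested, sum(matrix.values()) / tested

    # For the matrix above: accuracy = (210+164)/432 ≈ 0.866, coverage = 1.0
    accuracy, coverage = totals({(1, 1): 210, (1, 0): 6, (0, 1): 52, (0, 0): 164})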
Chapter 4
Main data analysis methods in RSES
This part presents a very brief description of the main data analysis methods (algorithms) available in RSES 2.2. It contains a brief listing of the options used to control the implemented algorithms as well as a short introduction to the underlying theories, with references to appropriate literature.
Details concerning the meaning and construction of the objects appearing in RSES projects were already introduced in section 3.
4.1 Missing value completion
The missing elements in a data table may be denoted as MISSING, NULL or '?' (not case sensitive).
RSES offers four approaches to the issue of missing values. These are as
follows:
• removal of objects with missing values, by using the Complete/Remove objects with missing values option from the data table context menu,
• filling the missing part of the data in one of two ways (see [11]):
– filling the empty (missing) places with the most common value in
case of nominal attributes, and with the mean over all attribute
values in the data set in case of numerical attributes. This procedure
is invoked by selecting Complete/Complete with most common or
mean value from the data table context menu.
– filling the empty (missing) places with the most common value for
the decision class in case of nominal attributes, and with the
mean over all attribute values in the decision class in case of numerical attributes. This procedure is invoked by selecting Complete/Complete with most common or mean value w.r.t. decision
class from the data table context menu.
• analysis of the data without taking into account those objects that have
an incomplete description (contain missing values). Objects with missing values (and the indiscernibility information involving them) are disregarded during
rule/reduct calculation. This result is achieved by activating the corresponding options in the dialog windows for reduct/rule calculation.
• treating the missing data as information (NULL is treated as yet another
regular value of the attribute).
The two filling strategies are illustrated by the sketch after this list.
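The sketch below illustrates the two filling strategies. It is a hypothetical illustration written with pandas; the file name and the use of pandas are our own assumptions (RSES performs this internally):

import pandas as pd

# '?', MISSING and NULL mark missing values, as in RSES data files.
df = pd.read_csv("therapy.csv", na_values=["?", "MISSING", "NULL"])
dec = df.columns[-1]                       # decision is the last column

for col in df.columns[:-1]:                # conditional attributes only
    if pd.api.types.is_numeric_dtype(df[col]):
        # variant 1: mean over the whole data set
        df[col] = df[col].fillna(df[col].mean())
    else:
        # variant 1: most common value in the whole data set
        df[col] = df[col].fillna(df[col].mode()[0])

# Variant 2 (w.r.t. decision class) would instead fill within each class:
# df[col] = df.groupby(dec)[col].transform(lambda s: s.fillna(s.mean()))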
4.2 Cuts, discretization and grouping
With use of the Discretize/Generate cuts option from the data table context menu we may
generate decompositions of attribute value sets. With these decompositions,
further referred to as cuts, we may perform the next step, i.e. discretization of
numerical attributes or grouping (quantization) of nominal attributes.
The user may set several parameters that control the discretization/grouping
procedure:
• Method choice – choice of discretization method from:
– Global method – global method (see, e.g. [5]),
– Local method – local method, slightly faster than the global one
but generating many more cuts in some cases (see, e.g. [5]).
• Include symbolic attributes – option available only for the local method.
Causes the algorithm to perform grouping of nominal (symbolic) attributes
in parallel with discretization of numerical ones.
• Cuts to – determines the name of the object in the project used to store cuts.
The Discretize/Discretize table option (in the data table context menu) makes it
possible to discretize (group) attributes in the data table with use of previously calculated cuts. The user sets the set of cuts to be used and the
name of the object to store the resulting discretized table. The set of cuts must be
present in the project prior to its use, which means it has to be calculated in
a previous step or loaded from a file. The sketch below illustrates the effect of applying cuts to a numerical attribute.
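The effect of a single cut on a numerical attribute can be sketched as follows (the cut value 38.5 is taken from the example in appendix A.4; the helper itself is ours, not RSES code):

import bisect

def discretize(value, cuts):
    # Return the index of the interval into which the sorted cuts put value.
    return bisect.bisect_left(cuts, value)   # 0 .. len(cuts)

temperature_cuts = [38.5]
print(discretize(36.7, temperature_cuts))    # -> 0 (interval below 38.5)
print(discretize(38.7, temperature_cuts))    # -> 1 (interval above 38.5)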
Figure 4.1: Options in generation of cuts.
4.3 Linear combinations
RSES 2.2 provides an algorithm for construction of new attributes based on
linear combinations of existing ones (see [23]). To perform a search for an appropriate linear combination the user has to use the Linear combinations/Generate
linear combinations option from the context menu for a data table.
The dialog window associated with creation of linear combinations provides the user with several options (see figure 4.2):
• Pattern of linear combinations – a scheme for the new attributes to be constructed. In this field the user states how many attributes should appear in
each particular combination. For instance, by entering the sequence “22344”
the user orders the algorithm to generate 5 linear combinations using
two (first couple), three (third) and four (last two) original attributes,
respectively.
• Measure choice – selection of one of the measures of combination quality
(see [23]).
• Linear combinations to – selection of the object in the project which will be used
to store the result (linear combinations).
Figure 4.2: Options for generation of linear combinations

In order to make use of derived linear combinations the user has to invoke
Linear combinations/Add linear combinations as new attributes from the context
menu for the data table. Use of this option results in new attributes, based on
linear combinations, being added to the decision table. The linear combinations used for extending the data table have to be calculated beforehand
(a small sketch of adding such an attribute follows).
Please note that by using the Select subtable option from the decision table’s
context menu it is possible to manually select the attributes (columns). In
this way the user may create a table that contains, for instance, only the newly
created attributes and the decision.
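Adding an attribute defined by a linear combination amounts to evaluating the stored formula on every object. A minimal sketch, using coefficients of the form stored in direction files (see appendix A.5); the row representation and the attribute name lin1 are our own:

rows = [
    {"temperature": 38.7, "headache": 7.0},
    {"temperature": 36.7, "headache": 1.0},
]
direction = {"temperature": 0.707, "headache": 0.707}  # one stored combination

for row in rows:
    row["lin1"] = sum(coef * row[attr] for attr, coef in direction.items())

print(rows[0]["lin1"])   # 0.707 * 38.7 + 0.707 * 7.0 = 32.3099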
4.4 Reducts and decision rules
Given a data table we may want to derive reducts and/or decision rules. For
this purpose we use the Reducts/Rules/Calculate reducts or rules option from the
decision table’s context menu.
The user should determine whether rules or reducts are to be calculated and which options should be active during calculation (see figure 4.3).
The setting of options controls the behavior of the algorithms for reduct/rule
calculation. The options are:
• Reduct/Rule choice – choice of the operation mode:
– Reducts – calculation of reducts,
– Rules – induction of decision rules.
• Discernibility matrix settings – properties of the discernibility matrix (option
available only for reduct calculation):
– Full discernibility – full (global) discernibility,
– Object related discernibility – discernibility w.r.t. a single object in
the table is used (relative discernibility),
– Modulo decision – discernibility w.r.t. decision only is used (decision/class discernibility),
• Method – choice of calculation method (algorithm):
– Exhaustive algorithm – an exhaustive, deterministic algorithm constructing all reducts or all minimal decision rules, depending on
the setting of the Reduct/Rule choice option. The calculated rules are
those with a minimal number of descriptors in the conditional part (for
more information see [5]).
– Genetic algorithm – genetic algorithm for calculation of some reducts/rules
(see, e.g. [5]).
– Covering algorithm – covering algorithm (rule calculation only –
see [5]),
– LEM2 algorithm – LEM2 algorithm (rule calculation only – see [10]),
• Genetic algorithm settings – parameters for calculation of reducts/rules
with use of the genetic algorithm (options active only if the genetic algorithm
is chosen as the calculation method):
– High speed – quick mode – fast, but may result in less accurate
output.
– Medium speed – regular mode.
– Low speed – thorough mode (slow).
– Number of reducts – maximum number of reducts to be calculated
by the algorithm (size of the population for the genetic algorithm).
• Cover parameter – expected degree of coverage of the training set by
derived rules (option available only for the covering and LEM2 methods),
• Don’t discern with missing values – do not consider two objects to be
discernible on some attribute when at least one of them has
NULL as the value of this attribute. See also subsection 4.1.
• Generate to set – name of the object in the project that is used to store results
of the calculation (reducts or rules). It may be either an existing or a completely
new object. In the latter case an appropriate object is created at the
end of the calculation.
Figure 4.3: Options for calculating reducts/rules
It is also possible to generate dynamic reducts. Dynamic reducts are
reducts that remain reducts for many subtables of the original decision
table (see [2], [5]). The process of finding dynamic reducts is computationally more costly, as it requires several subtables to be examined in
order to find the frequently repeating minimal subsets of attributes (dynamic reducts). To invoke dynamic reduct calculation the user has to select the
Reducts/Rules/Calculate dynamic reducts option from the context menu associated
with a data table.
The dynamic reduct calculation process is controlled by the following options
(see Fig. 4.4):
• Number of sampling levels – number of sampling levels for selection of
subtables (samples).
• Number of subtables to sample per level – number of subtables (samples)
chosen at random on each level.
• Smallest subtable size (in % of whole table) – size of the smallest allowable
subtable used in the calculation, expressed as a percentage of the whole,
original decision table.
• Largest subtable size (in % of whole table) – size of the largest allowable
subtable used in the calculation, expressed as a percentage of the whole,
original decision table.
• Include whole table – permission for the whole table to be used as a sample.
In such a case the calculated dynamic reducts have to be proper reducts
for the entire decision table.
• Modulo decision – only discernibility based on decision (class) is considered.
• Don’t discern with missing values – ignore missing values. Do not consider two objects to be discernible on some attribute when at
least one of them has NULL as the value of this attribute. See also
subsection 4.1.
• Dynamic reducts to – name of the object in the project that is used to store
results of the calculation (dynamic reducts). It may be either an existing or a
completely new object. In the latter case an appropriate object is
created at the end of the calculation.
Figure 4.4: Dynamic reducts – options

Once the reducts are calculated, one may use them to generate decision
rules. To do that the user has to use the Generate rules option from the context
menu associated with a reduct set.
The following options are available while creating rules from reducts (see
figure 4.5):
• Train table from – data table to be used as the training sample (this table
must already exist in the project!).
• Decision rules to – name of the object used to store derived rules.
• Rules generation setting – choice of the rule generating method:
– Global rules – calculation of global rules. The algorithm scans the
training sample object by object and produces rules by matching objects against reducts. The resulting rule has the attributes from
a reduct in its conditional part, with the values of the currently considered object, and points at the decision corresponding to
this training object. Note that for large tables and large reduct
sets the resulting set of rules may be quite large.
– All local rules – calculation of rules with use of the local approach. For
each reduct a subtable, containing only the attributes present in
this reduct, is selected. For this subtable the algorithm calculates a
set of minimal rules (rules with a minimal number of descriptors in
the conditional part – see, e.g. [5]) w.r.t. decision. Finally, the rule
sets for all reducts are summed up to form the result.

Figure 4.5: Options for building rules from reduct set
In order to use a rule set for classification of objects the user should invoke
the Classify/Test table using rule set option from the context menu of the table
used to store the objects to be classified. The set of rules to be used must
already exist in the project.
The parameters that may be set by the user when starting classification
are (see figure 4.6):
• General test mode – test mode (decision on expected outcome). The
possibilities are:
– Generate confusion matrix – calculates the confusion matrix (see subsection 3.8). Applicable only to tables that already contain a
decision column.
– Classify new cases – classifies objects and adds the derived decisions
(as the last column) to the original table.
• Conflict resolve by – method for resolving conflicts when different rules
give contradictory results:
– Simple voting – decision is chosen by counting votes cast in favor
of each possibility (one matching rule – one vote).
– Standard voting – standard voting solution (each rule has as many
votes as supporting objects). A small sketch comparing the two
voting schemes follows figure 4.6.
• Rules from set – rule set to be used (must exist in project).
• Results to – object used to store results.
Figure 4.6: Options for classification with decision rules or decomposition tree
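The difference between the two conflict resolution schemes can be sketched as follows (the rule representation is our own simplification; the supports are made up for the example):

from collections import Counter

# Rules matching a tested object: (decision, number of supporting objects).
matched = [("angina", 1), ("cold", 5), ("angina", 2)]

simple = Counter(dec for dec, _ in matched)      # one matching rule - one vote
standard = Counter()
for dec, support in matched:
    standard[dec] += support                     # one supporting object - one vote

print(simple.most_common(1)[0][0])    # 'angina' (2 rules against 1)
print(standard.most_common(1)[0][0])  # 'cold'   (5 votes against 3)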
4.5 Data decomposition
In order to perform data decomposition, i.e. to construct a decomposition
tree that can be further used as a classifier (see [20], [4]), the user should
invoke the Make decomposition option from the table’s context menu.
The parameters that may be set by the user when starting the decomposition
algorithm are (see figure 4.7):
• Maximal size of leaf – maximal number of objects (rows) in the subtable
corresponding to a leaf of the decomposition tree. If this number is not
exceeded in any of the tree nodes, the algorithm stops splitting.
• Discretization in leafs – by activating this option the user orders the
algorithm to perform on-the-fly discretization of attributes during decomposition. The decision on discretization of a particular attribute is
taken automatically by the algorithm, but the user has control over
some of the discretization parameters (see subsection 4.2).
• Shortening ratio – rule shortening ratio. The ratio of shortening applied
to rules that are calculated by the algorithm for subtables corresponding
to decomposition tree nodes. For the value of 1.0 no shortening occurs.
The smaller this ratio, the more “aggressive” the shortening (see subsections
3.3 and 3.6).
• Decomposition tree to – name of the object in the project used to store the
resulting decomposition tree.
Figure 4.7: Data table decomposition – options
In order to use a decomposition tree as a classifier the user has to call
the Classify/Test table using decomposition tree option from the context menu for
a decision table. There has to exist at least one decomposition tree object in the
project, which means that it has to be calculated prior to classification.
Both the dialog window (see figure 4.6) and the available classification
options are similar to those for rule based classification:
• General test mode – test mode (decision on expected outcome). The
possibilities are:
– Generate confusion matrix – calculates the confusion matrix (see subsection 3.8). Applicable only to tables that already contain a
decision column.
– Classify new cases – classifies objects and adds the derived decisions
(as the last column) to the original table.
• Conflict resolve by – method for resolving conflicts when different rules
give contradictory results:
– Simple voting – decision is chosen by counting votes cast in favor
of each possibility (one matching rule – one vote).
– Standard voting – standard voting solution (each rule has as many
votes as supporting objects).
• Decomposition tree from – the name of decision tree object to be used
(must exist in project).
• Results to – object used to store results.
4.6 k-NN classifiers
The k-NN (k Nearest Neighbor) classification method, as implemented in
RSES, does not require a separate training step. It is an instance-based classification technique. All the necessary steps, including initialization and parameter setting, are done at the beginning of the classification (test) run. Therefore, to classify objects with use of k-NN the user only has to invoke Classify/Test table using k-NN from the table’s context menu. The only prerequisite
for this method is the existence (within the project) of a nonempty table object
that will be used as the training sample.
The algorithm constructs a distance measure on the basis of the training
sample (table). Then, for each object in the tested sample, it chooses the decision
on the basis of the decisions for the k training objects that are nearest to the tested
one with respect to the calculated distance (see [8, 9]).
Parameters for the k-NN method (see figure 4.8):
• General test mode – test mode (decision on expected outcome). The
possibilities are:
– Generate confusion matrix – calculates the confusion matrix (see subsection 3.8). Applicable only to tables that already contain a
decision column.
– Classify new cases – classifies objects and adds the derived decisions
(as the last column) to the original table.
Figure 4.8: Options for k-NN classification
• Train table from – determines the table to be used as the training sample.
• Results to – object used to store results.
• Metric type – choice of metric (parameterized distance measure) to be
used:
– SVD – Simple Value Difference metric. Each attribute value is assigned the decision
distribution vector determined from the training set. The distance
between two objects is defined as a weighted linear sum of the distances
over all the attributes. The distance between two objects w.r.t.
a single attribute is defined by the difference between the decision
distribution vectors corresponding to the attribute values of these
objects.
The SVD metric is controlled by the Value vicinity parameter that is used to
define the distance for numerical attributes. For each numerical
value w, the value vicinity defines how many attribute values close
to w in the training set are used for determining the decision
distribution for w. For nominal values the decision distribution is
determined as described in [?, ?].
– City-SVD – a metric that combines the city metric (also known as
the Manhattan metric) for numerical attributes with the Simple Value
Difference metric for symbolic attributes. For numerical attributes the
absolute value of the difference between attribute values is taken; for
symbolic attributes the difference between the corresponding decision
distribution vectors from the training sample is used.
The user has to set the Normalization parameter for this metric. This
parameter determines the value to be used for normalization of the
absolute value of the difference for numerical attributes. The choices
are:
None – the absolute value of the difference for numerical attributes is taken
as-is (no normalization).
Std.dev. – the absolute value of the difference for a numerical attribute is
divided by the standard deviation of this attribute calculated from the
training sample.
Range – the absolute value of the difference for a numerical attribute is divided by the length of the range of this attribute in the
training set. This length is derived as the difference between the largest and
smallest observed values of the attribute in the training sample.
(A sketch of a City-SVD-like distance follows this parameter list.)
• Attribute weighting – choice of attribute weighting (scaling) method. In
order to avoid dominance of some attributes over the whole distance,
one may use the following options:
– None – no scaling/weighting.
– Distance-based – iterative method for choosing the weights so that
they optimize the distance to correctly recognized training objects.
– Accuracy-based – iterative method for choosing the weights so that
they optimize the accuracy of decision prediction for training objects.
– Number of iterations – number of iterations during weight optimization.
• Number of neighbours – number of neighbours used to classify the object
(the k in the name of the k-NN method):
– Search optimal between 1 and ... – if this option is active, the algorithm determines the optimal size of the neighborhood automatically.
• Voting – choice of voting method. The methods that can be used to
choose the decision as a result of voting among objects in the neighborhood are:
– Simple – the object is assigned the decision value that is most frequent in the neighborhood.
– Distance-weighted – each neighbor casts a vote in favor of its decision value. The votes are weighted with use of the distance between
the neighbor and the object to be classified. The final decision is the
one with the largest total of weighted votes.
• Filter neighbors using rules – if this option is selected, the algorithm
excludes from voting those neighbor objects which generate a local rule
incoherent with other members of the neighborhood (see [8, 9]).
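To make the City-SVD construction concrete, here is a rough sketch under simplifying assumptions of our own (range normalization, decision distributions precomputed per attribute value); it is not the RSES implementation:

def city_svd_distance(x, y, numeric, ranges, ddist):
    # numeric: names of numerical attributes
    # ranges:  attribute -> (max - min) over the training sample
    # ddist:   (attribute, value) -> decision distribution from the training set
    d = 0.0
    for a in x:
        if a in numeric:
            d += abs(x[a] - y[a]) / ranges[a]    # city part, Range-normalized
        else:
            p, q = ddist[(a, x[a])], ddist[(a, y[a])]
            d += sum(abs(pi - qi) for pi, qi in zip(p, q))  # value difference
    return d

x = {"temperature": 38.7, "cough": "no"}
y = {"temperature": 36.7, "cough": "yes"}
print(city_svd_distance(x, y, {"temperature"}, {"temperature": 2.0},
                        {("cough", "no"): (0.75, 0.25),
                         ("cough", "yes"): (0.0, 1.0)}))   # 1.0 + 1.5 = 2.5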
4.7 LTF Classifier
In order to build an LTF-C (Local Transfer Function Classifier) the user has
to use the Create LTF-C option from the context menu associated with a decision
table.
The LTF-C is based on a dedicated artificial neural network architecture.
The LTF-C training process is based on slight alterations of the network’s structure
and parameters made after presentation of each training object. There are
four basic kinds of alterations:
1. adjusting the coordinates of the centers of the Gaussian functions;
2. correcting the width of the Gaussian functions for each hidden neuron and
each component of the input independently;
3. adding a new neuron to the hidden layer;
4. removing redundant neurons from the hidden layer.
Before the start of the learning procedure the network contains no hidden
neurons. Hidden neurons are automatically added during learning when
necessary. Therefore, the user does not have to determine the network architecture, as the input and output layers are determined by the data dimensions. A rough sketch of the classification step of such a network is given below.
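As a very loose illustration (our own guesses for names and the decision rule; training is omitted entirely), the classification step of such a network might look like this: each hidden neuron holds a Gaussian receptive field attached to one decision class, and the strongest response wins:

import math

def activation(x, center, widths):
    # Product of one-dimensional Gaussians, one per input component.
    return math.exp(-sum(((xi - ci) / wi) ** 2
                         for xi, ci, wi in zip(x, center, widths)))

def classify(x, neurons):   # neurons: list of (decision class, center, widths)
    return max(neurons, key=lambda n: activation(x, n[1], n[2]))[0]

neurons = [("angina",  (38.7, 7.0), (0.5, 2.0)),
           ("healthy", (36.7, 1.0), (0.5, 2.0))]
print(classify((38.5, 6.0), neurons))   # -> 'angina'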
The following options may be used to control LTF-C creation (see figure 4.9):
• LTF-C to – name of the object that will store the resulting LTF-C.
• Use default training parameters – choice between default and custom
settings of learning parameters. The customizable parameters are:
– Number of training cycles – the number of LTF-C’s training cycles.
– Threshold for neurons removal – level of error that, if exceeded by a
neuron, results in removal of this hidden neuron.
Figure 4.9: LTF-C building options
In order to use an LTF-C for classification of new objects (from some table)
the user has to call the Classify/Test table using LTF-C option from the context
menu associated with the decision table object. The project should already
contain a nonempty LTF-C object.
During classification the user may use the following options (see figure
4.10):
• General test mode – test mode (decision on expected outcome). The
possibilities are:
– Generate confusion matrix – calculates the confusion matrix (see subsection 3.8). Applicable only to tables that already contain a
decision column.
– Classify new cases – classifies objects and adds the derived decisions
(as the last column) to the original table.
• LTF-C from – name of LTF-C object to be used.
• Results to – name of object used to store results (if Generate confusion
matrix is checked).
Figure 4.10: Options for classification with LTF-C
4.8 Cross-validation method
Cross-validation (see, e.g. [12]) is frequently used as a method for evaluation
of classification models. It consists of several training and testing runs.
The data set is first split into several, possibly equal in size, disjoint parts.
Then, one of the parts is taken as the test sample and the remainder (the sum of
all other parts) becomes the training sample. The classifier is constructed with
use of the training sample and its performance is checked on the test sample. These
steps are repeated as many times as there are data parts, so that each of the parts
is used as the test set once. The final result of the cross-validation procedure is
the average of the scores from the subsequent steps. A minimal sketch of this loop follows.
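The sketch below abstracts the classifier away behind a caller-supplied function; it is an illustration of the loop, not the RSES implementation:

import random

def cross_validate(objects, folds, train_and_score):
    # Each of the `folds` parts serves as the test set exactly once;
    # train_and_score(train, test) builds a classifier and returns its score.
    objects = objects[:]
    random.shuffle(objects)
    parts = [objects[i::folds] for i in range(folds)]   # near-equal parts
    scores = []
    for i, test in enumerate(parts):
        train = [o for j, part in enumerate(parts) if j != i for o in part]
        scores.append(train_and_score(train, test))
    return sum(scores) / folds                          # average of all runs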
In order to use the cross-validation method the user has to select the Classify/Cross validation method option from the context menu associated with a data table.
The following options are available to the user (see figure 4.11):
• Type of classifier – the classifier to be used:
– Decision rules – classifier based on decision rules.
– Decomposition tree – decomposition tree based classifier.
– k-NN – classification with k-NN method.
– LTF-C – classification with use of LTF-C.
• Number of Folds – number of runs, i.e. number of equal parts into which
the original data will be split.
• Discretization – when this option is chosen, the discretization of data
is performed automatically if such need arises (option not available in
case of LTF-C).
• Change discretization method – choice of discretization method. The
discretization options are identical with the Generate cuts dialog described
in subsection 4.2 (available only when the Discretization option is active).
• Change parameters of ... – changing the parameters of the classifier
to be used in cross-validation. The parameters are the same as those
described before for each of the classifier types.
• Change shortening ratio – change of rule shortening ratio for rule based
classifiers and decomposition trees (option available only when rule set
or decomposition tree is the current classification method).
• Change method of resolving conflicts – choice of conflict resolution method
(voting method) for rule based classifiers and decomposition trees (option available only when rule set or decomposition tree is the current
classification method).
• Results to – name of object used to store the cumulative cross-validation
results.
Figure 4.11: Cross-validation options
Figure 4.12: View of cross-validation results
For greater ease of use, the Parameters field in the main cross-validation control
window lists the current parameter settings (e.g. discretization type, shortening
ratio, etc.).
The summarized results of cross-validation are stored in the project with
use of a typical result set object. It contains the confusion matrix and some
additional information, as described in subsection 3.8. The only difference
is that the values presented are the averages over all iterations of cross-validation. Therefore, non-integer values may appear in the confusion matrix
(see figure 4.12).
Chapter 5
RSES scenario examples
In this part we present several example scenarios for data analysis with use
of methods implemented in RSES.
5.1 Train-and-test scenarios
One of the simplest scenarios in data analysis with use of rough set methods
is train-and-test. In the event of having a data table for which we want
to construct and evaluate a classifier, we may start by splitting this data into
two disjoint parts. One part of the data becomes the training sample, the other
is used as a testing set for the classifier constructed from the training data. As a
result of such a procedure we obtain classification results for the test subtable and
the structure of the classifier we have learned from the training data.
The train-and-test process, as described above, may be parameterized
by choice of preprocessing method (discretization, completing missing values
etc.) and choice of classifier to be constructed (decision rules, decomposition
tree etc.).
5.1.1 Rule based classifier
In this scenario we use the train-and-test procedure to construct and evaluate
a rule based classifier. We do not perform any preprocessing. The scenario
consists of the following steps:
1. Launch RSES and create new project.
2. Insert new table to project.
3. Load data into table object (Load option from context menu); in our
example we have chosen irys data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. We split the data into two parts containing 60% and 40% of the original table,
respectively. To do that we select Split in Two from the context menu and
set 0.6 as the value of Split factor. As a result we obtain two new
table objects in the project, named irys_0.6 and irys_0.4. The first of them
(irys_0.6) is used as the training set, the other (irys_0.4) as the test
set.
5. We calculate decision rules for table irys_0.6 (from the context menu
for this table we select the Reducts/Rules/Calculate reducts or rules option).
As the data set is small, we can afford calculation of all decision rules.
We select Exhaustive algorithm in the rule calculation control window.
A new object is created – a set of rules with the same name as the
originating table, i.e. irys_0.6.
6. We test the rules we have just calculated using the test table irys_0.4. We
open the context menu for irys_0.4 and select the Classify/Test table using rule
set option. In the window that opens we set the Rules from set value to
irys_0.6 and Results to to irys_0.4 (default). For the Conflict
resolve by option we choose Standard voting. Additionally, the General
test mode option value is Generate confusion matrix. A new result object
named irys_0.4 emerges after we hit the OK button.
7. We can now examine the test results and the rules we have calculated by double-clicking on the result object named irys_0.4 and the rule set object named
irys_0.6, respectively.
Figure 5.1 shows RSES windows after the end of train-and-test scenario
with use of rule based classifier.
5.1.2 Rule based classifier with discretization
The previous scenario may be extended and enriched by application of a discretization procedure to the training and testing samples. This is particularly
important in case of decision tables with numerical attributes. In order to
perform the previous scenario with added discretization we need to:
1. Launch RSES and create new project.
2. Insert new table to project.
Figure 5.1: RSES after train-and-test scenario with use of rule based classifier
3. Load data into table object (Load option from context menu); in our
example we have chosen irys data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. We split the data into two parts containing 60% and 40% of the original table,
respectively. To do that we select Split in Two from the context menu and
set 0.6 as the value of Split factor. As a result we obtain two new
table objects in the project, named irys_0.6 and irys_0.4. The first of them
(irys_0.6) is used as the training set, the other (irys_0.4) as the test
set.
5. We calculate cuts for table irys_0.6. For that we select the Discretize/Generate
cuts option from the context menu and in the dialog that pops up we select a
method (e.g. global) and choose the object that will store the resulting
cuts (we choose the default irys_0.6 in the Cuts to field). A new set of cuts
named irys_0.6 is created within the project.
6. We discretize the irys_0.6 table. This is done by selecting the Discretize/Discretize
table option from the context menu. We accept the default values in the next
window (cuts from irys_0.6 and resulting table to irys_0.6D). As a result a new table object irys_0.6D is created in the project. This object
contains the discretized training table.
7. We discretize the irys_0.4 table. This is done by selecting the Discretize/Discretize
table option from its context menu. We accept the default values in the
next window (cuts from irys_0.6 and resulting table to irys_0.4D). As
a result a new table object irys_0.4D is created in the project. This object
contains the discretized test table.
8. We calculate decision rules for table irys_0.6D (from the context menu
for this table we select the Reducts/Rules/Calculate reducts or rules option).
We select LEM2 algorithm in the rule calculation control window. A new
object is created – a set of rules with the same name as the originating
table, i.e. irys_0.6D.
9. We test the rules we have just calculated using the test table irys_0.4D. We
open the context menu for irys_0.4D and select the Classify/Test table using
rule set option. In the window that opens we set the Rules from set value to
irys_0.6D and Results to to irys_0.4D (default). For the Conflict
resolve by option we choose Standard voting. Additionally, the General
test mode option value is Generate confusion matrix. A new result object
named irys_0.4D emerges after we hit the OK button.
10. We can now examine the test results and the cuts we have calculated by double-clicking on the result object named irys_0.4D and the cut set object named
irys_0.6, respectively.
Figure 5.2 shows RSES windows after the end of train-and-test scenario
with use of rule based classifier and discretization.
5.1.3 Decomposition tree
In case of larger data sets (more than 1000 objects) we may find that
traditional methods for classifier construction do not work too well. This
is due to complexity issues associated with, e.g., calculation of rules. To address this problem a decomposition method has been implemented in RSES.
This method makes it possible to effectively process even very large data
tables (a hundred thousand objects and more).
Here we would like to present typical brief scenario for creating and using
a classifier based on decomposition tree.
1. Launch RSES and create new project.
2. Insert new table to project.
Figure 5.2: RSES after train-and-test scenario with use of rule based classifier
and discretization
3. Load data into table object (Load option from context menu); in our
example we have chosen sat_trn data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. We create a decomposition tree for table sat_trn. This is done by
selecting the Make decomposition option from its context menu. We set the
control values in the next window: data from sat_trn; resulting tree
to a new tree object named sat_trn; discretization is on (checkbox is
selected); 500 as maximum size of leaf; Shortening ratio set to 1.0 (no
shortening). As a result a new tree object sat_trn is created in the project.
5. We insert new table to project. This table will contain test sample.
6. We load data into table object (Load option from context menu); in our
example we have chosen sat_tst data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
7. Now we can test the constructed tree. To do that we choose Classify/Test table using decomposition tree from the context menu associated with the sat_tst table. In the control window that pops up we set the value
of Decomposition tree from to sat_trn, the value of Results to to sat_tst
and we choose Standard voting as the conflict resolution method. Additionally, we select Generate confusion matrix in the General test mode field. A new
result object sat_tst is created.
8. We can now examine the classification results and the decomposition tree in
separate windows. For that we double-click on the sat_tst result set and
the tree object sat_trn, respectively.
Figure 5.3 shows RSES after this scenario is finished.
Figure 5.3: RSES after train-and-test scenario with use of decomposition tree
5.1.4 k-NN classifier
For many data sets the k Nearest Neighbor classifier produces very reasonable
results. RSES contains algorithms for construction of k-NN classifiers.
These algorithms utilize a version of k-NN that is enriched with several sophisticated methods for selecting the right distance measure and weighting
scheme (see subsection 4.6). Moreover, thanks to the implementation of some
clever algorithms for data decomposition and representation, the version of
k-NN that RSES delivers is capable of dealing with relatively large data sets.
Here we would like to present a typical, brief scenario for using a classifier
based on the k-NN approach.
1. Launch RSES and create new project.
2. Insert new table to project. This will be our training table.
3. Load data into table object (Load option from context menu); in our
example we have chosen sat_trn data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. Now we insert new table into project. This will be our test table. We
load data to this table using Load option from context menu and selecting sat_tst file from DATA folder in main RSES installation directory.
5. We test the table sat_tst using the training sample sat_trn. Note
that there is no separate classifier learning phase in case of constructing a k-NN classifier. To perform the test we open the context menu for the
table sat_tst and select the Classify/Test table using k-NN option. In the
window that appears we set Train table from to sat_trn and Confusion
matrix to to sat_tst. We also have to set a few control parameters for
the k-NN algorithm. In our example we set Metric type -> City-SVD and
Normalization -> Range. To speed up calculation we select in this window Attribute weighting -> None, Voting -> Simple, Number of neighbours -> 1 and deactivate the Search optimal between 1 and ... option. After we
click OK a new result object named sat_tst emerges in the project.
6. Now we can examine classification results by double-clicking on sat_tst
result object.
Figure 5.4 shows the outlook of RSES after this scenario is finished.
5.1.5 LTF Classifier
Empirical study has shown that LTF-C (Local Transfer Function Classifier), based on an artificial neural network, produces encouraging results for
many data sets, especially in case of predominantly numerical attributes (see
[24]). The LTF-Cs in RSES may be treated as an alternative to classical
approaches based on rough sets, especially in situations where the latter
are only partially applicable.
Here we would like to present a typical, brief scenario for constructing and
using a classifier based on an LTF neural network.
1. Launch RSES and create new project.
2. Insert new table to project. This will be our training table.
Figure 5.4: RSES after scenario with use of k-NN classification method
3. Load data into table object (Load option from context menu); in our
example we have chosen sat_trn data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. We create an LTF-C for table sat_trn. This is done by selecting the Create
LTF-C option from its context menu. We set the (default) control values
in the next window: data from sat_trn; resulting LTF-C to a new object
named sat_trn; Use default training parameters is on; Normalize each
numeric attribute is active. As a result (after a while) a new LTF-C
object sat_trn is created in the project.
5. We insert new table to project. This table will contain test sample.
6. We load data into table object (Load option from context menu); in our
example we have chosen sat_tst data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
7. Now we can test our LTF-C. To do that we choose Classify/Test table
using LTF-C from the context menu associated with the sat_tst table. In the
control window that pops up we set the value of LTF-C from to sat_trn and the
value of Results to to sat_tst. Additionally, we select Generate confusion
matrix in the General test mode field. A new result object sat_tst is created.
8. We can now examine classification results by double clicking on sat_tst
result set.
Figure 5.5 presents the outlook of RSES after the scenario with use of
LTF-C is finished.
Figure 5.5: RSES after scenario with use of LTF-C
5.2 Testing with use of cross-validation
For many data sets, especially if there are no more than 1000 objects in the
table, we may effectively use cross-validation to evaluate the quality of the learned model (see
subsection 4.8).
Cross-validation in RSES consists of several training and testing runs.
The data set is first split into several, possibly equal in size, disjoint parts.
Then, one of the parts is taken as the test sample and the remainder (the sum of
all other parts) becomes the training sample. The classifier is constructed with
use of the training sample and its performance is checked on the test sample. These
steps are repeated as many times as there are data parts, so that each of the parts
is used as the test set once. The final result of the cross-validation procedure is
the average of the scores from the subsequent steps.
Here we present a brief scenario for evaluating a classifier based on a set of
decision rules. In general, each classification method implemented in RSES
can be used in this place.
1. Launch RSES and create new project.
2. Insert new table to project.
3. Load data into table object (Load option from context menu); in our
example we have chosen heart data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. Now we perform the cross-validation test. For that we select Classify/Cross
validation method from the context menu of the heart table object. In
the window that pops up we select Decision rules as the value of Type of
classifier and set Number of folds to 10. As we have chosen to use
decision rules, we need to set some parameters of the rule-based classifier
to be tested. Using the checkbox we activate discretization and select
the global discretization method by clicking on the Change discretization method
button and selecting the global method in the dialog box that appears.
With use of the Change parameters of rule generation button we may alter
the way the rules are calculated, with use of Change shortening ratio –
the level of rule shortening, and with use of Change method of resolving
conflicts – the scheme for voting among conflicting rules. In our example
all the values are set to the defaults of Exhaustive algorithm, 1.0,
and Standard voting, respectively. Finally, we enter heart in the Results to
field. The Parameters field in the bottom left part of the window keeps
us informed about the current settings. After clicking the OK button the
cross-validation procedure commences and as a result we see a new result
object heart. See figure 5.6.
5. We can now examine the results by double clicking on heart result set.
5.3 Scenario for decision generation
Up to this point, all the described scenarios used data tables with an already existing
decision column. In practice we frequently face a different
situation. We may be given some set of labeled objects – the training sample –
but what we want to do is classify a set of objects for which the decision
value is not known. In other words, we have to generate the decision column
for a table that does not have one. RSES may be used for this purpose. It
may assign objects to decision classes (produce a decision column for a table)
using a classifier created beforehand. The decision value generated by RSES
is appended to an object at the last position. In this way the decision
generated by the system is always stored in a newly created last column in the resulting
data table.
Figure 5.6: Example of parameter setting for cross-validation method
In order to present the capability of generating a decision column in RSES
we will need two tables. The first of them will be the training table (training
sample); for the second we will be generating the decision attribute. We will use
the RSES capability of subtable selection in order to perform the following
scenario. This scenario is very simplistic and meant only as an example of
using selected RSES capabilities.
1. Launch RSES and create new project.
2. Insert new table to project.
3. Load data into table object (Load option from context menu); in our
example we have chosen heart data file from the data sets distributed
with RSES (DATA folder in main RSES installation directory).
4. From the heart table we extract a subtable that contains no decision column.
To do that we select the Select subtable option from its context menu. In
the window which appears we select all but the last attribute (column),
i.e. only attr13 is not selected. Once we confirm our choice of attributes,
a new table object named heart_SUB appears in the project.
5. Now we need to create the classifier. For that we calculate decision
rules for table heart (from the context menu for this table we select
Reducts/Rules/Calculate reducts or rules option). We select Exhaustive
algorithm in the rule calculation control window. A new object is
created – a set of rules with the same name as the originating table, i.e.
heart.
6. Now we apply the calculated rules to heart_SUB in order to generate the decision column. To do that we perform testing without generating a confusion matrix. Namely, we select the Classify/Test table using rule set option
from the context menu of the heart_SUB table object. In the window that
appears we have to select Classify new cases as the value of General test
mode, Standard voting as the method of conflict resolution, and set the Results
to field to heart_SUB. As a result (after confirming with OK) the new
decision column is appended at the end of the heart_SUB table.
In this simple example we have calculated all rules for a consistent (discrepancy-free) data
table. Therefore, the decision columns in the heart_SUB and heart tables should
be identical after this simple experiment, as both tables contain the very
same set of objects.
Appendix A
Selected RSES 2.2 file formats
This appendix contains a short description of the essential file formats used
by RSES 2.2 to store information.
In order to improve readability of the listings, extra spacing has been introduced. Extra spaces have no impact on the way RSES 2.2 reads and
interprets the contents of its files.
A.1 Data sets
Data sets to be used in analysis with RSES 2.2 should follow the format
presented in figure A.1.
Table and attribute names as well as attribute values may contain spaces,
but if such a string contains a space, it must be placed inside quotes or double
quotes. For instance, the name of the attribute serum cholestoral should be entered
as ”serum cholestoral” or ’serum cholestoral’. RSES accepts characters defined in the ISO-8859-1 (Western) and ISO-8859-2 (CEE-Latin2)
standards.
An example of a data set is presented in figure A.1. The first row:
TABLE therapy
contains information about the name of the table. A unique name assigned to a
data set simplifies management of entities in an RSES project.
The second row, stating:
ATTRIBUTES 5
provides information on the number of attributes (columns) in our
data, including the decision attribute. In the following rows we have definitions
of the attributes:
TABLE therapy
ATTRIBUTES 5
temperature numeric 1
headache numeric 0
cough symbolic
catarrh symbolic
disease symbolic
OBJECTS 4
38.7     7        no   no   angina
38.3     MISSING  yes  yes  influenza
MISSING  3        no   no   cold
36.7     1        no   no   healthy

Figure A.1: Example of RSES 2.1 data set
temperature numeric 1
headache numeric 0
cough symbolic
catarrh symbolic
disease symbolic
Each row describes a single attribute (column). The description of each
attribute consists of the attribute’s name and type. There are two possible types
of attributes – symbolic and numeric. Obviously, symbolic is used to
mark attributes that have a limited set of symbolic (e.g. string) values. For
numerical attributes the type declaration is followed by a number which determines
the number of digits in the decimal expansion of attribute values. For instance,
numeric 1 describes a numerical attribute that has one digit after the decimal point, while
numeric 0 corresponds to an attribute with integer values.
The line below the attribute descriptions contains information on the number
of objects (rows) in the data table:
OBJECTS 4
After that the objects (rows) follow. The values in the rows correspond to the attribute types described before. (In this example extra spaces are added to
increase readability; these spaces have no influence on the way RSES interprets
the input data set.)
38.7     7        no   no   angina
38.3     MISSING  yes  yes  influenza
MISSING  3        no   no   cold
36.7     1        no   no   healthy
The MISSING symbol denotes a missing attribute value, i.e. a place in the
data table where the attribute value is not known. Such a missing value may
also be denoted using the NULL or ’?’ symbols.
The default extension for RSES file containing data is .tab.
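For illustration, a small helper (ours, not an RSES tool) that writes a table in the format described above:

def write_tab(path, name, attrs, rows):
    # attrs: list of (name, type) pairs, e.g. ("temperature", "numeric 1");
    # rows:  tuples of values; MISSING, NULL or '?' may mark gaps.
    with open(path, "w") as f:
        f.write("TABLE %s\n" % name)
        f.write("ATTRIBUTES %d\n" % len(attrs))
        for aname, atype in attrs:
            f.write("%s %s\n" % (aname, atype))
        f.write("OBJECTS %d\n" % len(rows))
        for row in rows:
            f.write(" ".join(str(v) for v in row) + "\n")

write_tab("therapy.tab", "therapy",
          [("temperature", "numeric 1"), ("headache", "numeric 0"),
           ("cough", "symbolic"), ("catarrh", "symbolic"),
           ("disease", "symbolic")],
          [(38.7, 7, "no", "no", "angina"),
           (36.7, 1, "no", "no", "healthy")])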
A.2 Reduct sets
RSES 2.2 can store sets of reducts in text files. Example contents of such a file
are shown in figure A.2.
REDUCTS (5)
{ temperature, headache } 1.0
{ headache, cough } 1.0
{ headache, catarrh } 1.0
{ cough } 0.25
{ catarrh } 0.25
Figure A.2: Set of reducts in RSES 2.1 format
In this example the first row:
REDUCTS (5)
contains the number of reducts stored in this file. In our case there are five
reducts.
Subsequent rows contain reducts (one per row).
{ temperature, headache } 1.0
{ headache, cough } 1.0
{ headache, catarrh } 1.0
{ cough } 0.25
{ catarrh } 0.25
The description of each reduct (each row) consists of a list of attributes placed
between { and } and a number. The number shows the positive region for
the table reduced to only those listed attributes.
In our example the first reduct contains two attributes: temperature and
headache. The value of the positive region for this reduct is 1.0, which means
that the reduced table – with the original decision attribute added – is consistent
(contains no discrepancies).
The default extension for RSES file containing reducts is .red.
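A sketch of reading such a file back (our own helper for the documented layout; it assumes one reduct per line, as above):

import re

def read_reducts(path):
    reducts = []
    with open(path) as f:
        f.readline()                      # header row, e.g. 'REDUCTS (5)'
        for line in f:
            m = re.match(r"\{(.*)\}\s*([\d.]+)", line.strip())
            if m:                         # (attribute set, positive region)
                attrs = {a.strip() for a in m.group(1).split(",")}
                reducts.append((attrs, float(m.group(2))))
    return reducts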
A.3 Rule sets
Figure A.3 presents example contents of a file used in RSES 2.2 to store
decision rules.
The first row:
RULE_SET Demo
contains a unique name of the rule set. This name is used to identify the rule set
within a project.
Subsequent rows contain attribute information. This information is essential if rules are to be applied to some data set. Before checking the data
against the rules, the system has to verify whether the attributes in the data are identical
(in terms of format) to those appearing in the rules. The description of attributes
follows the format already described for data files (see subsection A.1).
ATTRIBUTES 5
temperature numeric 1
headache numeric 0
cough symbolic
catarrh symbolic
disease symbolic
Right below the rows with the (conditional) attribute descriptions appears information pertaining to the decision attribute, i.e. the last attribute (column) of the
decision table.
DECISION_VALUES 4
angina
influenza
cold
healthy
RULE_SET Demo
ATTRIBUTES 5
temperature numeric 1
headache numeric 0
cough symbolic
catarrh symbolic
disease symbolic
DECISION_VALUES 4
angina
influenza
cold
healthy
RULES 16
(temperature=38.7)&(headache=7)=>(disease=angina[1]) 1
(temperature=38.3)&(headache=MISSING)=>
(disease=influenza[1]) 1
(temperature=MISSING)&(headache=3)=>(disease=cold[1]) 1
(temperature=36.7)&(headache=1)=>(disease=healthy[1]) 1
(headache=7)&(cough=no)=>(disease=angina[1]) 1
(headache=MISSING)&(cough=yes)=>(disease=influenza[1]) 1
(headache=3)&(cough=no)=>(disease=cold[1]) 1
(headache=1)&(cough=no)=>(disease=healthy[1]) 1
(headache=7)&(catarrh=no)=>(disease=angina[1]) 1
(headache=MISSING)&(catarrh=yes)=>(disease=influenza[1]) 1
(headache=3)&(catarrh=no)=>(disease=cold[1]) 1
(headache=1)&(catarrh=no)=>(disease=healthy[1]) 1
(cough=no)=>(disease={angina[1],cold[1],healthy[1]}) 3
(cough=yes)=>(disease=influenza[1]) 1
(catarrh=no)=>(disease={angina[1],cold[1],healthy[1]}) 3
(catarrh=yes)=>(disease=influenza[1]) 1
Figure A.3: Example set of rules in RSES 2.1 format
First, the number of possible decision values is declared. Then, the admissible
decision values are listed, one in each row.
The remainder of the rule set file stores the actual rules. In this part the
number of stored rules is declared first.
RULES 16
In our example 16 rules are present. They are listed in the next rows (one
per row).
(temperature=38.7)&(headache=7)=>(disease=angina[1]) 1
(temperature=38.3)&(headache=MISSING)=>
(disease=influenza[1]) 1
(temperature=MISSING)&(headache=3)=>(disease=cold[1]) 1
(temperature=36.7)&(headache=1)=>(disease=healthy[1]) 1
...
(catarrh=no)=>(disease={angina[1],cold[1],healthy[1]}) 3
...
In our example the first rule states that if a patient has a fever
of 38.7°C and a serious headache (value of the headache attribute is 7), then his
sickness is recognized as angina, and there exists one case that supports this
rule. By supporting we understand the objects in the training data that match
the conditional part of the rule and have a decision identical with that
suggested by the rule. The number of supporting objects is written at the end
of each rule.
In case of generalized rules there may be several decision values listed
after the => sign, as in:
...
(catarrh=no)=>(disease={angina[1],cold[2],healthy[1]}) 4
...
In this case the support for each decision value is given in square brackets
after this value, and the total (summarized) support appears at the end of the row.
In our example angina[1] means that there is one case (out of the four that
match the rule) which supports this decision.
The default extension for RSES file containing rules is .rul.
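A sketch of taking one rule of the simple single-decision form apart (our own helper; it does not cover the generalized multi-decision form):

import re

def parse_rule(line):
    lhs, rhs = line.split("=>")
    conds = dict(re.findall(r"\((\w+)=([^)]+)\)", lhs))   # conditional part
    m = re.match(r"\s*\((\w+)=(\w+)\[\d+\]\)\s+(\d+)", rhs)
    decision_attr, decision_val, support = m.groups()
    return conds, (decision_attr, decision_val), int(support)

rule = "(temperature=38.7)&(headache=7)=>(disease=angina[1]) 1"
print(parse_rule(rule))
# ({'temperature': '38.7', 'headache': '7'}, ('disease', 'angina'), 1)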
A.4 Set of cuts
Figure A.4 presents example contents of a file used in RSES 2.2 to store
cuts used in discretization/grouping.
The first row:
CUT_SET demo_global
CUT_SET demo_global
ATTRIBUTES 4
INCLUDED_SYMBOLIC false
temperature numeric 1
[ 38.5 ]
headache numeric 0
[ 5.0 ]
cough symbolic
[ ]
catarrh symbolic
[ ]
Figure A.4: Example of file used to store cuts in RSES 2.1
declares a unique name for this cut set.
The second row:
ATTRIBUTES 4
informs us about the number of attributes to be discretized.
The third row:
INCLUDED_SYMBOLIC false
states whether symbolic attribute values are grouped. This option (corresponding to the value true) is only available in case of the local discretization method.
The next rows contain information about the attribute name and type and about the cuts
that have been generated for this attribute. For each attribute there are two
rows: the first contains the attribute description, the second – the cuts.
temperature numeric 1
[ 38.5 ]
headache numeric 0
[ 5.0 ]
cough symbolic
[ ]
catarrh symbolic
[ ]
Empty brackets [ ] denote a lack of cuts for the corresponding attribute, i.e.
this attribute will not undergo discretization.
In case of a numerical attribute, a cut is an attribute value (threshold) that
splits the attribute domain into two sets. Using several such thresholds makes
it possible to decompose the set of values of a numerical attribute into several
disjoint parts.
In our example the attribute temperature is discretized with use of a
single cut, i.e. the attribute value space is decomposed into two smaller
intervals. The first interval contains all values that are greater than 38.5, the
other those below 38.5.
In case of symbolic attributes we can find cuts (decompositions) only
if we use the local discretization method with the Include symbolic attributes
option active. The local discretization method can decompose the set of values of a
symbolic attribute into two disjoint subsets. Therefore, to fully describe such a
decomposition it suffices to know only one of the resulting subsets, as the other is
simply its complement. In figure A.5 we present a set of cuts generated for the
same decision table as in figure A.4. The difference is in the method chosen
for cut generation (global vs. local) and the inclusion of symbolic attributes.
CUT_SET demo_local
ATTRIBUTES 4
INCLUDED_SYMBOLIC true
temperature numeric 1
[ 37.5 ]
headache numeric 0
[ 2.0 5.0 ]
cough symbolic
[ { MISSING,no } ]
catarrh symbolic
[ ]
Figure A.5: Example of file for a set of cuts obtained by local method with
inclusion of symbolic attributes
Note that the attribute cough is decomposed into two subsets. The first subset gathers the values MISSING and no, the other – all the remaining values
of this attribute. Also, compared with figure A.4, the cut for temperature
has changed and the attribute headache is now decomposed into three subsets.
This shows the difference between the global and local methods of cut generation.
The default extension for RSES file containing cuts is .cut.
A.5 Linear combinations
Figure A.6 presents example contents of a file used in RSES 2.2 to store
linear combinations used in creation of new attributes.
DIRECTIONS (4)
temperature*0.707+headache*0.707
temperature*0.705+headache*0.705+disease*(-0.062)
temperature*0.707+headache*0.707
temperature*0.696+headache*0.696+disease*0.174
Figure A.6: Example of file containing linear combinations
The first row:
DIRECTIONS (4)
informs us that this file is used to store linear combinations (directions)
and how many such combinations appear in this file (in our case 4).
Subsequent rows list these linear combinations, one per row.
temperature*0.707+headache*0.707
temperature*0.705+headache*0.705+disease*(-0.062)
temperature*0.707+headache*0.707
temperature*0.696+headache*0.696+disease*0.174
The description of each combination is presented as a formula and may be used
to create a new attribute.
The default extension for RSES file containing linear combinations (directions) is .dir.
A.6 LTF-C
The file used to store an LTF classifier (LTF-C) is quite large and complicated, as it contains a lot of information. Therefore, we will describe its various
parts separately. The Local Transfer Function Classifier (LTF-C) is based on a
special kind of artificial neural network. Some description of LTF-C and its
parameters is given in section 3.7.
The description subdivides into: header, network parameters, description of neurons, and an extended description of attributes from the analyzed table.
The header contains exactly two rows. The first contains the classifier’s name,
in our case demo. The second row of the header gives the number of lines (rows in the file)
used to describe the classifier, in our case 62.
LTF_CLASSIFIER demo
62
The next part provides a description of the network architecture, given as a
comment. It is put here in order to provide the user with an easily accessible
description right at the beginning of the file. The information contained in this
comment is repeated further in the file, but in a less readable form. Note that the
$ sign denotes the beginning of a comment line.
$------------------------------------------------------
$ LTF-C neural network
$ Number of inputs:  2
$ Number of outputs: 4
$ Number of neurons: 1
$------------------------------------------------------

The value of each parameter is preceded by @ and the parameter’s name. Of
all the parameters (from @FileType to @SumCycles) those really important
for the user are @EtaUse and @UseTr.
@FileType     net
...
@EtaUse       0.013
@UseTr        0.013
...
@SumCycles    120
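As a small illustration (again only a sketch, assuming that each parameter
and its value share a line, as reconstructed above), one could collect the
top-level @-parameters like this, skipping comment lines and the per-neuron
blocks described below:

def read_ltf_parameters(path):
    # Collect top-level '@Name value' pairs from an .ltf file (assumed
    # layout); '$' lines are comments, @<NeuronN ... @> delimit neurons.
    params = {}
    in_neuron = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("$"):
                continue
            if line.startswith("@<"):
                in_neuron = True
            elif line == "@>":
                in_neuron = False
            elif line.startswith("@") and not in_neuron:
                name, _, value = line.partition(" ")
                params[name.lstrip("@")] = value.strip()
    return params

For the fragment above, params.get("EtaUse") and params.get("UseTr") would
both return the string '0.013'.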
The next part gives the description of neurons. The details are given in
section 3.7.
@<Neuron0
@Class    1
@Life     0
...
@>
The last part of the LTF-C file contains information about the training
data that was used to construct the classifier. This includes a description of
the decision attribute (name and type) and of its values (number of values
and their names). Also, for numerical conditional attributes the file stores
information on whether these attributes were normalized or not (true/false).
For all numerical attributes the mean values and standard deviations on the
training data set are stored. These values form two blocks. Each block starts
with the number of attributes, in our example 4.
disease symbolic
DECISION_VALUES 4
angina
influenza
cold
healthy
true
4
37.73333333333333
3.0
0.0
0.0
4
0.8617811013631386
2.1908902300206643
0.0
0.0
The default extension for an RSES file containing an LTF classifier is .ltf.
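The guide does not state the exact normalization formula, but with a stored
mean m and standard deviation s the standard choice is the z-score
(x - m) / s. Below is a minimal sketch under that assumption, using the
values stored for temperature in the example above:

def normalize(x, mean, std):
    # Z-score normalization; assumed to be the transformation that the
    # stored means and standard deviations are intended for.
    return (x - mean) / std if std else 0.0

# Mean and standard deviation of 'temperature' from the block above.
print(normalize(38.7, 37.73333333333333, 0.8617811013631386))  # about 1.12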
A.7 Classification results
Figure A.7 presents example contents of a file used to store results of classification in RSES 2.2. This file stores almost exactly the same information
as the result view window (see figure 3.36).
TEST RESULTS:
Global coverage=0.8333333333333334
Global accuracy=0.8333333333333334
Decision classes:
angina influenza cold healthy
Confusion matrix:
0.0 0.0 0.0 0.0
0.0 0.6666666666666666 0.0 0.0
0.0 0.3333333333333333 0.0 0.0
0.0 0.0 0.0 0.6666666666666666
True positive rates for decision classes:
0.0 0.5 0.0 0.6666666666666666
Accuracy for decision classes:
0.0 0.6666666666666666 0.0 0.6666666666666666
Coverage for decision classes:
0.0 0.66666666666666 0.33333333333333 0.66666666666666
Figure A.7: RSES 2.2 classification result file contents
The very first row (TEST RESULTS:) states the purpose of this file.
The next two rows present the values of coverage and accuracy for the entire
testing data set.
Global coverage=0.8333333333333334
Global accuracy=0.8333333333333334
Then all admissible values of the decision (decision class names) are listed.
Decision classes:
angina influenza cold healthy
The next part of the file presents the complete confusion matrix.
Confusion matrix:
0.0 0.0 0.0 0.0
0.0 0.6666666666666666 0.0 0.0
0.0 0.3333333333333333 0.0 0.0
0.0 0.0 0.0 0.6666666666666666
Finally, some statistics for the decision classes are presented. For each
collection of values a pair of rows is used. The first row in each pair names
the kind of measurement, the second gives the values. The order of values
corresponds to the order of decision classes defined earlier in the file.
True positive rates for decision classes:
0.0 0.5 0.0 0.6666666666666666
Accuracy for decision classes:
0.0 0.6666666666666666 0.0 0.6666666666666666
Coverage for decision classes:
0.0 0.66666666666666 0.33333333333333 0.66666666666666
The default extension for an RSES file containing classification results is
.res.
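Result files of this shape are easy to load back for further analysis. The
following sketch (our own illustration, matching the section labels of figure
A.7 verbatim, under the assumption that the layout is exactly as shown) turns
such a file into a Python dictionary:

def parse_results(path):
    # Parse a classification result file shaped like figure A.7.
    res = {}
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("Global"):
            key, value = line.split("=")
            res[key] = float(value)
        elif line == "Decision classes:":
            i += 1
            res["classes"] = lines[i].split()
        elif line == "Confusion matrix:":
            n = len(res["classes"])
            res["matrix"] = [[float(v) for v in lines[i + k + 1].split()]
                             for k in range(n)]
            i += n
        elif line.endswith("for decision classes:"):
            key = line[:-1]                      # drop the trailing ':'
            i += 1
            res[key] = [float(v) for v in lines[i].split()]
        i += 1
    return res

Note that res["classes"] preserves the class order that is needed to interpret
the rows and columns of the confusion matrix and the per-class statistics.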
Index
attribute
    cuts, 27, 50
    decision, 84
    description, 82
    new, 39
    numeric, 82
    statistics, 28
    symbolic, 82
    type, 82
classification
    storage format, 91
computations, 23
confusion matrix, 47
    storage format, 92
cross-validation, 27, 65
cuts, 37
    global method, 89
    local method, 88, 89
    numerical attributes, 88
    storage format, 86
    symbolic attributes, 88
data, 23
    examples, 8
    format, 82
    MISSING, 49, 83
    missing values, 49
    NULL, 49, 83
    statistics, 28
data format, 82
decomposition, 27, 41, 58
discretization, 27, 37, 50
distributed environment, 13
Dixer, 13
dynamic reducts, 54
generating cuts, 50
grouping, 50
icons, 16
k-NN, 27, 60
linear combinations, 27, 39, 51
    attributes, 89
    storage format, 89
LTF-C, 27, 42, 63
    storage format, 90
menu
    context, 16, 19
    context for object group, 16, 20
    general, 15, 19
    icons, 16
    layout, 16
    main, 11, 16
    project's context, 15
methods
    description, 49
MISSING, 49, 83
missing values, 49
    completion, 26
    in data, 49
    various approaches, 49
neuron, 44
NULL, 49, 83
objects, 14
    moving, 15
    selecting, 16
project, 11
    create, 13
    history, 14, 21
    project view, 14
    saving and restoring, 14
reducts, 27, 32, 52
    core, 34
    statistics, 34
    storage format, 83
results, 46
    storage format, 91
RSES, 5
RSES-lib, 6
rules, 27, 34, 52, 86
    statistics, 36
    storage format, 84
    support, 86
starting RSES, 11
statistics
    attribute, 28
    data, 28
    reducts, 34
    rules, 36
SVD, 61
    City-SVD, 62
table, 23
    attribute description, 82
    attribute's type, 82
    format, 82
    name, 24, 81
    number of objects, 82
    statistics, 28
toolbar, 18
trees
    decomposition, 27, 41, 58