Structural Learning
At this point, the Learning Wizard is ready to learn the structure of
the model from the (possibly preprocessed) data and with the help of
the possible structural constraints specified.
Two pieces of information are needed in order to start the learning
process:
Note that, in this step of the Learning Wizard, only the structure of
the network is determined. The conditional probability tables (CPTs)
to be associated with the learned structure, will be learned in a
subsequent step.
The structural learning algorithms are based on making dependence
tests that calculate a test statistic which is asymptotically
chi-squared distributed assuming (conditional) independence. If the
test statistic is large for a given independence hypothesis, the
hypothesis is rejected; otherwise, it is accepted. The probability of
rejecting a true independence hypothesis is given by the level of
significance:
The level of significance can be entered into this text field. The
default value is set to 0.05, which should be appropriate for most
learning sessions, but it can take on any value in the open interval
(0;1). In general, the higher the significance level the more links
will be included in the learned structure. Reducing the level of
significance will usually also reduce the running time of the
structural learning algorithms.
The PC algorithm, which is a variant of the original PC algorithm due
to Peter Spirtes and Clark Glymour, belongs to the class of
constraint-based learning algorithms. The basic idea of these
algorithms is to derive a set of conditional independence and
dependence statements (CIDs) by statistical tests.
The algorithm performs the following steps:
- Statistical tests for conditional independence are performed for
all pairs of variables (except for those pairs for which a structural
constraint has been specified).
- An undirected link is added between each pair of variables for
which no conditional independences were found. The resulting
undirected graph is referred to as the skeleton of the
learned structure.
- Colliders are then identified, ensuring that no directed cycles
occur. (A collider is a pair of links directed such that they
meet in a node.) For example, if we find that A and B are dependent,
B and C are dependent, but A and C are conditionally independent given
S, not containing B, then this can be represented by the structure
A --> B <-- C.
- Next, directions are enforced for those links whose direction can
be derived from the conditional independences found and the colliders
identified.
- Finally, the remaining undirected links are directed randomly,
ensuring that no directed cycles occur.
One important thing to note about the PC algorithm is that, in
general, it will not be able to derive the direction of all the links
from data, and thus some links will be directed randomly. This
means that the learned structure should be inspected, and if any links
seem counterintuitive (e.g., sweat causes fever, instead of the other
way around), one might consider going back and insert a constraint
specifying the direction of the link, or use the NPC
algorithm, which allows the user to interactively decide on the
directionality of undirected links.
Traditional constraint-based learning algorithms produce provably
correct structures under the assumptions of infinite data sets, perfect
tests, and DAG faithfulness (i.e., that the data can be assumed to be
simulated from a DAG). In the case of limited data sets, however,
these algorithms often derive too many conditional independence
statements. Also, they may in some cases leave out important
dependence relations.
Generally, it is recommended to use the NPC algorithm, as the
resulting graph will be a better map of the (conditional) independence
relations represented in the data. In particular, when the data set is
small, the NPC algorithm should be the one preferred. The NPC
algorithm, however, has longer running times than the PC algorithm.
NPC stands for Necessary Path Condition, and it is a
criterion developed by researchers at Siemens in Munich for solving
some of the problems of constraint-based learning algorithms like the
PC algorithm. The basic machinery is the same in the PC and the NPC
algorithms (i.e., they are both based on generating a skeleton derived
trough statistical tests for conditional independence).
The NPC algorithm seeks to repair the deficiencies of the PC
algorithm, which occur especially in the face of limited data sets.
The solution provided by the NPC algorithm is based on inclusion of a
criterion known as the necessary path
condition. This criterion forms the basis for introducing the
notion of ambiguous regions, which in turn provide
a language for selecting among sets of inter-dependent uncertain
links. The resolution of ambiguous regions is performed in
interaction with the user (see the next page of the Learning Wizard).
In constraint-based learning algorithms, the skeleton of the graph is
constructed by not including a link in the induced graph whenever the
corresponding nodes are found to be conditional independent. There
can, however, be inconsistencies among the set of conditional
independence and dependence statements (CIDs) derived from limited
data sets. That is, not all CIDs can be represented simultaneously.
The inconsistencies are assumed to stem solely from sampling noise;
i.e., we still assume that there exists a perfect independence map of
the (unknown) probability distribution from which the data is
generated.
The number of inconsistencies in the set of CIDs reflects structural
model uncertainty. Thus, the number of uncertainties is a confidence
measure for the learned structure and can as such be used as an
indication of whether or not sufficient data has been used to perform
the learning. The inconsistent CIDs produce multiple solutions when
inducing a directed acyclic graph (DAG) from them. These solutions
differ with respect to the set of links included.
To resolve the inconsistencies, the NPC algorithm relies on user
interaction where the user gets the opportunity to decide on
directionality of undirected links and to resolve the ambiguous regions.
Informally, the necessary path condition says that in order for two
variables X and Y to be independent conditional on a set S, with no
proper subset of S for which this holds, there must exist a path
between X and every Z in S (not crossing Y) and between Y and every Z
in S (not crossing X). Otherwise, the inclusion of Z in S is
unexplained. Thus, in order for an independence statement to be valid,
a number of links are required to be present in the graph.
When the absence of a link, a, depends on the presence of another
link, b, and vice versa, we define a and b to be inter-dependent. Both
a and b are constitute what we call uncertain links. An
ambiguous region is a maximal set of inter-dependent
links. That is, an ambiguous region consists of a set of uncertain
links.
The main goal is to obtain as few and small ambiguous regions as
possible. It should be noted that deterministic relations between
variables will also produce ambiguous regions.
The user is offered the possibility of providing information as to how
the ambiguous regions should be resolved. The next page of the
Learning Wizard will provide an intuitive graphical interface for this
interaction.
The interested reader is referred to the literature for a more
detailed description of the notions behind the NPC algorithm. See e.g.
- Steck, H. (2001). Constrained-Based Structural Learning in
Bayesian Networks Using Finite Data Sets, PhD Thesis, Institut
für der Informatik der Technischen Universität München.
- Steck, H. and Tresp, V. (1999). Bayesian Belief Networks for Data
Mining, Proceedings of The 2nd Workshop on Data Mining und Data
Warehousing als Grundlage Moderner Entschidungsunterstuezender
Systeme, DWDW99, Sammelband, Universität Magdeburg, September
1999.
- Steck, H., Hofmann, R., and Tresp, V. (1999). Concept for the
PRONEL Learning Algorithm, Siemens AG, Munich.