Hierarchical Naive Bayes

Naive Bayes models are a popular type of classification model. The target (or class) variable has a state for each possible class, and the attribute nodes are the observations used to decide the class. In a Naive Bayes model, the attribute nodes are children of the target node. When this structure has been created and case data has been entered, it is simply a matter of running the EM algorithm to estimate the parameters of the model.
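To make the factorization concrete, the following is a minimal Python sketch (purely illustrative; the node names and probability tables are hypothetical and not produced by HUGIN) of how a Naive Bayes model combines the prior on the target with the attribute tables to classify a case:

    # P(Target): hypothetical prior over the class variable
    prior = {"c1": 0.6, "c2": 0.4}

    # P(Attribute | Target): one hypothetical table per attribute node
    cpt = {
        "A1": {"c1": {"low": 0.7, "high": 0.3}, "c2": {"low": 0.2, "high": 0.8}},
        "A2": {"c1": {"no": 0.9, "yes": 0.1}, "c2": {"no": 0.4, "yes": 0.6}},
    }

    def posterior(evidence):
        """Return P(Target | evidence) under the Naive Bayes factorization."""
        scores = {}
        for c, p in prior.items():
            score = p
            for attr, value in evidence.items():
                score *= cpt[attr][c][value]
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    print(posterior({"A1": "high", "A2": "yes"}))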

One of the problems with the Naive Bayes model is that some of the attribute variables can be more or less correlated. If two attribute variables are strongly correlated, then their combined effect on the target variable can be overestimated. We can compensate for this overestimation by introducing an extra (latent) variable: this variable becomes the (only) parent of the two correlated attribute variables, and the latent variable itself becomes a child of the target variable (or of some other latent variable). This is called a Hierarchical Naive Bayes model by Langseth and Nielsen (2006).
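Structurally, the change can be pictured as a small edit to the edge list of the network (node names are again hypothetical):

    # Plain Naive Bayes: every attribute is a child of the target.
    naive_bayes_edges = [("Target", "A1"), ("Target", "A2"), ("Target", "A3")]

    # Hierarchical Naive Bayes: a latent variable L mediates the two strongly
    # correlated attributes A1 and A2, so their combined effect on the target
    # is no longer counted twice.
    hnb_edges = [("Target", "L"), ("L", "A1"), ("L", "A2"), ("Target", "A3")]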

HUGIN provides a simplified version of the algorithm proposed by Langseth and Nielsen (2006). The algorithm successively identifies pairs of variables to be combined via a latent variable.

One iteration of the algorithm follows this scheme:

  • Identify a pair of variables (the list of eligible variables initially contains all non-target variables) that are strongly correlated given the target variable. This is done using the same test that is used to determine [in]dependence in the PC algorithm; a sketch of such a test is given after this list. The pair with the highest score (correlation) is selected.

  • The algorithm stops when no pair of correlated variables can be found, that is, when the test of the PC algorithm indicates independence for all remaining pairs of variables. The threshold value for this decision is controlled by the significance level.

  • Let X and Y be the selected pair of variables. Create a latent variable L with X and Y as children. The states of L are defined as one state for each combination of states of X and Y, and the conditional probability tables for X and Y are defined (according to the meaning of the states of L) to be deterministic functions of L. It is then tested (see Langseth and Nielsen (2006) for details) whether some of the states of L contribute the same information with respect to the classification task; such states are collapsed into a single state. (If X and Y are irrelevant to the classification task, then all states of L will be collapsed into a single state.) An additional latent node L’ is created with L as its only child to represent the collapsed state space, and a conditional probability table for L is defined in the natural way. (If the collapsed state space is identical to the original state space, L’ is not created.) Since the states of L and L’ are deterministically determined by X and Y, case data for L and L’ can be generated and used in the following iterations of the algorithm. A sketch of this construction is given after the list.

    Finally, X and Y are removed from the list of eligible variables from which the next pair of variables will be chosen, and L’ is added to that list (if L’ has not been created, L is added instead).
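The pair-selection and stopping steps can be illustrated with a generic conditional independence test of the kind used by the PC algorithm: the G² (likelihood-ratio) statistic for X and Y given the target, compared against a chi-square distribution at the chosen significance level. This is only a sketch of such a test, not HUGIN's implementation; the case-data format and the ranking of pairs are assumptions.

    import math
    from collections import Counter
    from itertools import combinations

    from scipy.stats import chi2

    def g2_given_target(cases, x, y, target):
        """G^2 statistic for testing X independent of Y given the target,
        computed from a list of case dictionaries (variable name -> state)."""
        c_xyt = Counter((r[x], r[y], r[target]) for r in cases)
        c_xt = Counter((r[x], r[target]) for r in cases)
        c_yt = Counter((r[y], r[target]) for r in cases)
        c_t = Counter(r[target] for r in cases)
        g2 = 0.0
        for (xv, yv, tv), observed in c_xyt.items():
            expected = c_xt[(xv, tv)] * c_yt[(yv, tv)] / c_t[tv]
            g2 += 2.0 * observed * math.log(observed / expected)
        df = ((len({r[x] for r in cases}) - 1)
              * (len({r[y] for r in cases}) - 1)
              * len({r[target] for r in cases}))
        return g2, df

    def most_correlated_pair(cases, eligible, target, alpha=0.05):
        """Return the most strongly dependent eligible pair given the target,
        or None if the test indicates independence for every remaining pair
        (the stopping criterion above)."""
        best_p, best_pair = None, None
        for x, y in combinations(eligible, 2):
            g2, df = g2_given_target(cases, x, y, target)
            p = chi2.sf(g2, df)            # p-value of the independence test
            if p < alpha and (best_p is None or p < best_p):
                best_p, best_pair = p, (x, y)
        return best_pair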
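The latent-variable construction in the third step can be sketched as follows: the states of L enumerate the joint states of X and Y, the case data for L follows deterministically, and states judged to carry the same information about the target are merged into a single state of L’. The statistical test of Langseth and Nielsen (2006) for "same information" is not reproduced here; it is stood in for by a user-supplied predicate, and all names are illustrative.

    from itertools import product

    def build_latent_variable(cases, x, y):
        """Create a latent variable L whose states are the joint states of X
        and Y, and extend each case with the (deterministic) value of L."""
        x_states = sorted({r[x] for r in cases})
        y_states = sorted({r[y] for r in cases})
        l_states = [f"{xv}&{yv}" for xv, yv in product(x_states, y_states)]
        for r in cases:
            r["L"] = f"{r[x]}&{r[y]}"      # L is a deterministic function of X and Y
        return l_states

    def collapse_states(cases, l_states, same_information):
        """Group states of L that contribute the same information about the
        target (the predicate stands in for the test of Langseth and Nielsen
        (2006)) and record the resulting state of L' in the case data."""
        groups = []                        # each group becomes one state of L'
        for s in l_states:
            for g in groups:
                if same_information(cases, g[0], s):
                    g.append(s)
                    break
            else:
                groups.append([s])
        mapping = {s: "|".join(g) for g in groups for s in g}
        for r in cases:
            r["L_prime"] = mapping[r["L"]]  # case data for L' is also deterministic
        return sorted(set(mapping.values()))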

When the algorithm above stops, the nodes (other than the target node) that have no parents are given the target node as a parent. These nodes also have no conditional probability tables. To finalize the model, these nodes can be equipped with experience tables before estimating the conditional probability tables using the EM algorithm. This last part must be done explicitly (i.e., parameter estimation is not performed as part of the algorithm).
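As a structural illustration of this finalization step (node names hypothetical), connecting the target to the remaining parentless nodes amounts to the following; fitting the tables with EM is then a separate, explicit step:

    def finalize_structure(edges, nodes, target="Target"):
        """Give the target as parent to every non-target node without a parent."""
        has_parent = {child for _, child in edges}
        for node in nodes:
            if node != target and node not in has_parent:
                edges.append((target, node))
        # Parameter estimation (e.g., running EM on the case data) is not done
        # here; as noted above, it must be performed explicitly afterwards.
        return edges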

References

    H. Langseth and T. D. Nielsen (2006). Classification using Hierarchical Naive Bayes models.