Paradigms of Expert Systems¶
Section author: Finn V. Jensen, Dept. of Computer Science, Aalborg University, Denmark
This is a brief overview of the three main paradigms of expert systems.
Rule-Based Systems¶
A rule is an expression of the form
if A then B
where A is an assertion and B can be either an action or another assertion. For instance, the following three rules could be part of a larger set of rules for troubleshooting water pumps:
1. If pump failure then the pressure is low
2. If pump failure then check oil level
3. If power failure then pump failure
A rule-based system consists of a library of such rules. These rules reflect essential relationships within the domain, or rather: they reflect ways to reason about the domain.
When specific information about the domain becomes available, the rules are used to draw conclusions and to point out appropriate actions. This is called inference. Inference takes place as a kind of chain reaction. In the above example, if you are told that there is a power failure, Rule 3 states that there is a pump failure, and Rule 1 then tells you that the pressure is low. Rule 2 also gives a (useless in this case) recommendation to check the oil level.
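This chain reaction can be sketched as a minimal forward-chaining loop. The sketch below is illustrative only; the rule representation (premise, conclusion pairs) is an assumption, not a feature of any particular rule-based system.

```python
# A minimal forward-chaining sketch. Each rule is a (premise, conclusion)
# pair; known facts trigger rules repeatedly until no new conclusions
# (or recommended actions) can be derived.

RULES = [
    ("pump failure", "pressure is low"),   # Rule 1
    ("pump failure", "check oil level"),   # Rule 2 (an action)
    ("power failure", "pump failure"),     # Rule 3
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:                # keep firing until nothing new appears
        changed = False
        for premise, conclusion in rules:
            if premise in facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(sorted(forward_chain({"power failure"}, RULES)))
# → ['check oil level', 'power failure', 'pressure is low', 'pump failure']
```

Feeding in "power failure" triggers Rule 3, whose conclusion then triggers Rules 1 and 2, exactly as in the example above.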
Rules can also be used in the opposite direction. Suppose you are told that the pressure is low; then Rule 1 states that this can be due to a pump failure, while Rule 3 states that a pump failure can be caused by a power failure. You should also be able to use Rule 2 to recommend checking the oil level, but it is very difficult to control such a mixture of inference back and forth in the same session.
Often the connections reflected by the rules are not absolutely certain, and also the gathered information is often subject to uncertainty. In such cases, a certainty measure is added to the premises as well as to the conclusions in the rules of the system. Now, a rule gives a function that describes how much a change in the certainty of the premise will change the certainty of the conclusion. In its simplest form, this looks like:
4. If A (with certainty x) then B (with certainty f(x))
There are many schemes for treating uncertainty in rule-based systems. The most common are fuzzy logic, certainty factors, and (adapted versions of) Dempster-Shafer belief functions. Common to all of these schemes is that uncertainty is treated locally. That is, the treatment is connected directly to the incoming rules and the uncertainty of their elements. Imagine, for example, that in addition to Rule 4 we have the rule
5. If C (with certainty x) then B (with certainty g(x))
If we now get the information that A holds with certainty a and C holds with certainty c, what is then the certainty of B?
There are different algebras for such a combination of uncertainties, depending on the scheme. Common to all these algebras is that in many cases they come to incorrect conclusions. This is because the combination of uncertainty is not a local phenomenon, but it is strongly dependent on the entire situation (in principle a global matter).
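As a concrete instance of such a local algebra, here is a sketch in the style of MYCIN-like certainty factors (restricted to values in [0, 1] for simplicity). The attenuation factors 0.8 and 0.6 are invented examples of f and g; the point is that the combination uses only the two rule outputs, regardless of how A and C are actually related in the domain.

```python
# Sketch of one local uncertainty scheme (certainty-factor style).
# Rules 4 and 5 each pass an attenuated certainty on to B, and the two
# contributions are combined locally -- with no regard to whether A and C
# are correlated in the wider domain.

def rule_4(x):          # If A (certainty x) then B (certainty f(x))
    return 0.8 * x      # attenuation factor 0.8 is an invented example

def rule_5(x):          # If C (certainty x) then B (certainty g(x))
    return 0.6 * x

def combine(cf1, cf2):
    # Parallel combination of two positive certainty factors.
    return cf1 + cf2 - cf1 * cf2

a, c = 0.9, 0.5                       # observed certainties of A and C
cf_b = combine(rule_4(a), rule_5(c))
print(round(cf_b, 3))                 # → 0.804
```

The answer is the same whether A and C are independent, identical, or mutually exclusive, which is exactly the local flaw described above.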
Neural Networks¶
(Only the so-called feed-forward networks are treated.)
A neural network consists of several layers of nodes: at the top a layer of input nodes, at the bottom a layer of output nodes, and in between normally one or two hidden layers. Except for the output nodes, all nodes in a layer are in principle connected to all nodes in the layer immediately below. A node together with its incoming edges is called a perceptron.
A neural network performs pattern recognition. You could for instance imagine a neural network that reads handwritten letters. By automatic tracking, a handwritten letter can be transformed into a set of findings on curves (not a job for the network). The network will have an input node for every possible kind of finding and an output node for each letter in the alphabet. When a set of findings is fed into the network, the system will match the pattern of findings with equivalent patterns of the different letters.
Technically, the input nodes are given a value (0 or 1). This value is transmitted to the nodes in the next layer. Each of these nodes performs a weighted sum of the incoming values, and if this sum exceeds a certain threshold, the node fires downward with the value 1. The values of the output nodes determine the letter.
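The computation of a single node is easy to write down. In this sketch the weights and the threshold are invented for illustration; in a real network they would be learnt by training.

```python
# One perceptron as described above: a weighted sum of the incoming
# values, followed by a hard threshold.

def perceptron(inputs, weights, threshold):
    s = sum(i * w for i, w in zip(inputs, weights))
    return 1 if s > threshold else 0

# Three input nodes feeding one hidden node:
print(perceptron([1, 0, 1], [0.4, 0.9, 0.3], threshold=0.5))
# → 1  (the weighted sum 0.7 exceeds the threshold 0.5)
```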
So, apart from the architecture of the network (the number of layers and the number of nodes in each layer), the weights and the thresholds determine its behavior. Weights and thresholds are set so that the network performs as well as possible. This is achieved by training: you take a large number of examples for which both input values and output values are known and feed them into the training algorithm of the network. This algorithm determines weights and thresholds such that the distance between the outputs computed by the network and the desired outputs from the examples becomes as small as possible.
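A toy version of such training is the classical perceptron learning rule, shown below on a deliberately tiny task: learning the logical AND of two inputs. This is a sketch only; modern multi-layer networks are trained by gradient descent, not by this rule, and the learning rate and epoch count are arbitrary choices.

```python
# Fitting weights and a threshold with the classical perceptron learning
# rule. The bias acts as a learnable stand-in for the (negative) threshold.

examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

w = [0.0, 0.0]
bias = 0.0
lr = 0.1            # learning rate (arbitrary choice)

for _ in range(20):                      # a few passes over the examples
    for x, target in examples:
        out = 1 if sum(xi * wi for xi, wi in zip(x, w)) + bias > 0 else 0
        err = target - out               # nudge weights toward the target
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        bias += lr * err

print([1 if sum(xi * wi for xi, wi in zip(x, w)) + bias > 0 else 0
       for x, _ in examples])
# → [0, 0, 0, 1]  (the network now reproduces AND)
```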
There is nothing preventing the use of neural networks in domains requiring the handling of uncertainty. If relations are uncertain (for example in medical diagnosis), a neural network with proper training will be able to give the most probable diagnosis for a given set of symptoms. However, you will not be able to read the uncertainty of the conclusion from the network; you will not be able to get the next-most probable diagnosis; and - probably the most severe setback - you will not know under which assumptions about the domain the suggested diagnosis is the most probable.
Bayesian Networks¶
Bayesian networks are also called Bayes nets, causal probabilistic networks (CPNs), Bayesian belief networks (BNs), or belief networks.
A Bayesian network consists of a set of nodes and a set of directed edges between these nodes. Edges reflect cause-effect relations within the domain. These effects are normally not completely deterministic (e.g. disease -> symptom). The strength of an effect is modeled as a probability:
6. If tonsillitis then P(temp>37.9) = 0.75
7. If whooping cough then P(temp>37.9) = 0.65
One could be led to read these statements as rules, but they should not be. So, a different notation is used:
P(temp>37.9 | tonsillitis) = 0.75
P(temp>37.9 | whooping cough) = 0.65
If 6. and 7. are read as ‘If otherwise healthy and … then …’, there also needs to be a specification of how the two causes combine. That is, we need the probability of fever if both diseases are present, and also if the patient is completely healthy. All in all, you have to specify the conditional probabilities:
P(temp>37.9 | whooping cough, tonsillitis),
where ‘whooping cough’ and ‘tonsillitis’ each can take the states ‘yes’ and ‘no’. So, for any node you must specify its probability given each combination of states of its possible causes.
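Written out explicitly, the conditional probability table for this node has four entries. In the sketch below, 0.65 and 0.75 come from statements 6 and 7, while the entries 0.90 and 0.01 are invented placeholders for the two combinations the text says still have to be specified.

```python
# The full conditional probability table P(temp>37.9 | whooping cough,
# tonsillitis), one entry per combination of parent states.

cpt_fever = {
    # (whooping cough, tonsillitis): probability of temp>37.9
    ("yes", "yes"): 0.90,   # assumed: both causes present
    ("yes", "no"):  0.65,   # from statement 7
    ("no",  "yes"): 0.75,   # from statement 6
    ("no",  "no"):  0.01,   # assumed: background rate when healthy
}

print(cpt_fever[("no", "yes")])
# → 0.75
```

With two binary parents the table has 2 × 2 entries; in general the table grows exponentially in the number of possible causes.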
Fundamentally, Bayesian networks are used to update probabilities whenever information becomes available. The mathematical basis for this is Bayes’ theorem:
P(A | B) P(B) = P(B | A) P(A)
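A concrete instance of this updating, using the fever example: rewriting Bayes’ theorem as P(W | fever) = P(fever | W) P(W) / P(fever). The prior P(W) and the fever rate among non-sufferers below are invented numbers, purely to make the arithmetic runnable.

```python
# Updating the probability of whooping cough (W) on observing a fever.

p_w = 0.05                      # assumed prior P(whooping cough)
p_fever_given_w = 0.65          # from the text
p_fever_given_not_w = 0.10      # assumed background fever rate

# P(fever) by the law of total probability:
p_fever = p_fever_given_w * p_w + p_fever_given_not_w * (1 - p_w)

# Bayes' theorem: P(W | fever) = P(fever | W) P(W) / P(fever)
p_w_given_fever = p_fever_given_w * p_w / p_fever
print(round(p_w_given_fever, 3))
# → 0.255
```

Observing the fever raises the probability of whooping cough from the prior 0.05 to roughly 0.25 under these assumed numbers.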
Contrary to the methods of rule-based systems, the updating method of Bayesian networks takes a global perspective, and if model and information are correct, it can be proved that the method computes the updated probabilities correctly (correct with respect to the axioms of classical probability theory).
Any node in the network can receive information, as the method does not distinguish between inference along the direction of the edges and inference against it. Nor does simultaneous input of information into several nodes pose a problem for the updating algorithm.
An essential difference between rule-based systems and systems based on Bayesian networks is that in rule-based systems you try to model the expert’s way of reasoning (hence the name expert systems), while with Bayesian networks you try to model dependencies in the domain itself. Systems of the latter type are often called decision support systems or normative expert systems.
Comparing Neural Networks and Bayesian Networks¶
The fundamental difference between the two types of networks is that a perceptron in the hidden layers does not in itself have an interpretation in the domain of the system, whereas all the nodes of a Bayesian network represent concepts that are well defined with respect to the domain.
The meaning of a node and its probability table can be subject to discussion, regardless of their function in the network. But it does not make any sense to discuss the meaning of the nodes and the weights in a neural network. Perceptrons in the hidden layers only have a meaning in the context of the functionality of the network.
This means that the construction of a Bayesian network requires detailed knowledge of the domain in question. If such knowledge can only be obtained through a series of examples (i.e., a database of cases), neural networks seem to be an easier approach. This might be true in cases such as the reading of handwritten letters, face recognition, and other areas where the activity is a ‘craftsman-like’ skill based solely on experience.
A frequent criticism is that in order to construct a Bayesian network you have to ‘know’ too many probabilities. However, this number is not considerably different from the number of weights and thresholds that have to be ‘known’ in order to build a neural network, and those can only be learnt by training. It is an enormous weakness of neural networks that you are unable to utilize the knowledge you might have in advance.
Probabilities, on the other hand, can be assessed using a combination of theoretical insight, empirical studies independent of the constructed system, training, and various more or less subjective estimates.
Finally, it should be mentioned that in the construction of a neural network the routes of inference are fixed: it is decided in advance which relations information is gathered about and which relations the system is expected to compute. Bayesian networks are much more flexible in that respect.