Bayesian Network Classifiers in Lisp

Dr Michael Madden
Senior Lecturer in Information Technology
College of Engineering & Informatics
National University of Ireland, Galway.


Instructions: see below.

Right-click on a file to download it.

madden_bn_v36.lsp
Lisp implementation of the various Bayesian Network classifiers.
weather_defines.lsp, weather_data.lsp
Small sample input files.
name2bn, data2bn, mlc2bn
Perl programs to convert files from MLC++ format to this software's format.
lcurve
Perl program for running a batch of analyses.
runperl.bat
For Windows users, a handy batch file for running Perl programs that have no file extension (as in Unix): copy this file, give it the same name as the Perl program (e.g. lcurve.bat to go with lcurve above), and put it in the same folder as the Perl program.

Instructions

This describes the basics of how to use my implementations of General Bayesian Network classifiers and other Bayesian classifiers. For further details, please read the code.

For a description of the classifiers, and a discussion of their relative performance, see: “On the Classification Performance of TAN and General Bayesian Networks”, Michael G. Madden, Knowledge-Based Systems, 2009. http://datamining.it.nuigalway.ie/images/pubs/kbs-jan2009-mmadden.pdf.

Please cite that paper if you use this software in your work.

If you have problems using this code or if you think you have found a bug in it, you are welcome to contact me.

Requirements:

  1. A Common Lisp implementation -- I use CLISP (Bruno Haible) on Unix and Windows
  2. For supporting scripts, Perl Version 5
  3. DOT/DOTTY (part of GraphViz), for displaying network structures
  4. To produce data for learning curves on continuous-valued data, the lcurve program below uses the MLC++ discretize utility

Input File Formats:

See the example files for the tiny Weather (a.k.a. PlayTennis) dataset: weather_defines.lsp and weather_data.lsp. Note that, unrealistically, the testing data in this example is just a copy of the training data.

The following global variables must be defined before running the functions:

*nodes*
list of node names
*assmnts*
For each node, a list of possible values that can be assigned to it
*class_node*
Index (from 0) of the classification node in *nodes*
*train*
List of training cases, where each case is a list of value assignments for each node
*test*
List of testing cases (same format as *train*)
verbose
0: very little output. 1: normal output. 2+: debug output. 0 recommended if running functions in a batch.
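As an illustration, a defines file for the Weather/PlayTennis dataset might look like the sketch below. The node names and value lists here are the standard PlayTennis attributes and are illustrative only; consult weather_defines.lsp for the actual definitions used by the sample files.

```lisp
;; Illustrative sketch only -- see weather_defines.lsp for the real file.
(setq *nodes* '(outlook temperature humidity wind play))

;; For each node, the list of values it can take, in the same order as *nodes*.
(setq *assmnts* '((sunny overcast rain)
                  (hot mild cool)
                  (high normal)
                  (weak strong)
                  (yes no)))

;; The class node 'play' is at index 4 (counting from 0) in *nodes*.
(setq *class_node* 4)

(setq verbose 0)   ;; very little output, e.g. for batch runs
```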

Generating and Evaluating Bayesian Classifiers

The procedure for all of the classifiers is the same apart from one function call and one function parameter. The full procedure is described for the Naive Bayes classifier, and for the others only the differences are described.

In these instructions, I assume the current version of the BN Lisp software has been compiled and is loaded. In CLISP, this is done with the following commands (assuming the file containing the software is called madden_bn_v36.lsp):

(compile-file "madden_bn_v36")
(load "madden_bn_v36")

Format of Structure Representation

The functions to construct Bayesian structures -- (naive), (tanbn), (fggtan), (k2 N) -- all return a representation of the structure and also define a global variable with the return value; poor programming practice perhaps, but useful when experimenting with the different structures. The structure representation is a list of lists, where the head (car) of the inner list is a node and its tail (cdr) is a list of parent nodes. Nodes are represented numerically according to their position in *nodes*. To visualise the structure, use the DOT function described below. The name of the global variable depends on which classifier is used: see the descriptions of the classifiers below for details.
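For example, with *nodes* = (outlook temperature humidity wind play) and the class node at index 4, a Naive Bayes structure (every attribute has the class as its sole parent) would be represented as in the sketch below. The node names and the variable name example-structure are illustrative, not part of the software's API:

```lisp
;; Each inner list is a node followed by its parents:
;; car = node index, cdr = list of parent indices.
(setq example-structure
      '((0 4)     ;; outlook     <- play
        (1 4)     ;; temperature <- play
        (2 4)     ;; humidity    <- play
        (3 4)     ;; wind        <- play
        (4)))     ;; play has no parents

;; Reading a single entry:
(car (first example-structure))   ;; => 0, the node itself
(cdr (first example-structure))   ;; => (4), its list of parents
```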

Naive Bayes Classifier

Inside your Lisp environment:
(load "weather_defines") ;; NOTE: Change these for a different data set 
(load "weather_data")    ;; define *nodes*, *assmnts*, *class_node*, *train*, *test*    
(setq verbose 1)             ;; Verbose mode: set to 0 if running in a batch 
(naive)                      ;; Construct Naive structure -- result is *naive*  
(probs *naive*)              ;; Generate probabilities for structure in *naive* 
(test *naive* *probs* *test*)  ;; Test the network given by structure in *naive*
                               ;; and probabilities in *probs* on the test dataset *test*
(stats *testres*)              ;; Display statistics from the result of the test,
                               ;; which has been stored in *testres*

Tree Augmented Naive Bayes Classifier:

Just as above, except the call to construct the structure is:
(tanbn)            ;; Construct TAN structure -- result is *tanstruct*  
Then use *tanstruct* rather than *naive* in subsequent function calls, e.g.
(probs *tanstruct*) 
Note that this uses whatever metric is currently specified. If you want to construct a TAN classifier using Friedman et al.'s original Conditional Mutual Information metric, either set *metric* to CMI or use the (fggtan) convenience function.
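Putting the differences together, a complete TAN run looks like the sketch below, assuming the software and a dataset have been loaded as in the Naive Bayes example:

```lisp
(tanbn)                            ;; Construct TAN structure -- result is *tanstruct*
(probs *tanstruct*)                ;; Generate probabilities for the TAN structure
(test *tanstruct* *probs* *test*)  ;; Test it on the test dataset
(stats *testres*)                  ;; Display statistics for the run
```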

General Bayesian Network Structure:

K2 is used to construct General Bayesian Networks (GBNs). The call to construct the structure is:
(k2 99)            ;; Construct K2 structure -- result is *k2struct* 
where the function parameter is the maximum number of parents any node may have.

Then use *k2struct* in subsequent function calls, e.g.

(probs *k2struct*) 
Incidentally, on the very small sample weather dataset, K2 does not work well.

Note that you can use this function to create a Bayesian Network augmented Naive Bayes structure, by adding an additional parameter with value T:

(k2 99 T)            ;; Start with a Naive Bayes structure and augment it with K2
;; -- result is *k2struct*
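As with the other classifiers, a complete K2 run looks like the sketch below, assuming the software and a dataset have been loaded as in the Naive Bayes example:

```lisp
(k2 99)                            ;; Construct GBN structure -- result is *k2struct*
(probs *k2struct*)                 ;; Generate probabilities for the K2 structure
(test *k2struct* *probs* *test*)   ;; Test it on the test dataset
(stats *testres*)                  ;; Display statistics for the run
```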

Other Features of the Code

To time programs:

(tstart)    ;; put this call in before constructing structure and calculating probs 
(tstop)     ;; displays elapsed real time and CPU time, calls to g-function  
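For example, to time the construction of a K2 structure and its probability tables (assuming a dataset is loaded):

```lisp
(tstart)            ;; start the timers
(k2 99)             ;; construct the structure
(probs *k2struct*)  ;; generate the probability tables
(tstop)             ;; report elapsed real time, CPU time, and g-function calls
```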

To display a confusion matrix:

(confuse *naive* *probs* *test*) 

To generate a graphical display of the network:

(dot *k2struct* "filestem") 
This function writes out a DOT file to filestem.dot, which can be displayed on screen (from your command line, not in Lisp) using
	dotty filestem.dot 
or converted to PostScript (or another format, e.g. PNG) using
	dot -Tps filestem.dot -o filestem.ps 
Important: If using DOT, you cannot have '-' characters in your node names.

To generate data for a ROC curve:

(roc structure probabilities testdata classindex filename)
This function writes to filename the (x,y) points of a ROC curve with 200 evenly-spaced interpolated points.

For multiclass problems, the ROC curve is based on one class only: classindex is the index (counted from 0) of the classification variable category for which the ROC curve is to be generated.
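For instance, assuming a Naive Bayes model has been built as in the earlier example, the following call would write ROC points for the class value at index 0; the output filename here is illustrative:

```lisp
(roc *naive* *probs* *test* 0 "weather_roc.txt")
```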

Supporting Programs

I have written some Perl programs to facilitate performing analyses with the BN software. You can use them as they are or modify them to suit your own needs.

If you run any of these programs without arguments, they display a usage message. If you run them with a -v switch, they display verbose messages.

To convert files from MLC++ format

You can download some well-known datasets from the MLC++ web-page. MLC++ files are organised as follows: filestem.names describes the data, filestem.data contains training data, filestem.test contains testing data, and filestem.all contains all data, if using a procedure that will automatically split into training and testing sets.

There are three conversion programs:

name2bn
name2bn opt:-v filestem
Converts filestem.names into defines.lsp
data2bn
data2bn opt:-v filestem.all training-percentage rand-seed
OR:
data2bn opt:-v filestem.data
Converts MLC++ data files into data.lsp.
In the first form, you specify filestem.all and the program divides it into training and testing data, according to the training-percentage and random number seed that you specify. In the second form, you specify filestem.data but filestem.test must also exist, and the program takes the training and testing data from these.
mlc2bn
mlc2bn opt:-v opt:-n opt:-d filestem train-percentage rand-seed
Without either the -n or -d switch, this operates like the first form of data2bn above.
With -n, it also calls name2bn to produce defines.lsp.
With -d, it calls the MLC++ discretize program to discretize the data. This is useful if the data files contain continuous-valued data, which the BN implementation cannot handle. It is simplest to use -d always, as it does not harm data files whose variables are already discrete/categorical, but the option exists in case discretize is not installed.

Note: To change the discretization algorithm, edit the program and change $discopts.

To perform a set of analyses for a learning curve

The program lcurve performs a set of analyses and stores raw data from which a learning curve can be plotted for a given dataset and algorithm. Usage:
	lcurve datafile_stem alg_name 
Here, datafile_stem is the name of a dataset in MLC++ format (.names and .all files must exist), and alg_name is one of naive, tan, or k2.

This program uses the mlc2bn and name2bn programs already described, and optionally the MLC++ discretize program. It operates as follows:

  • Repeat for several training set sizes:
    • Repeat N times:
      • Make training/testing data file
      • In LISP, run the training algorithm and test it
      • Store the run details and individual results
    • End repeat
  • End repeat
To specify the training set sizes and how many times the procedure is repeated for each size, you must edit the file and change the parameters in its clearly-marked "Configuration Variables" section. In the same section, you can specify the path of the Lisp code, the command to run your Lisp interpreter, and whether to discretize the data.

The program writes out files called alg_name.csv and alg_name.log. The CSV file lists, for each run, the training set size, correct, incorrect and 'unknown' classifications (unknown = equal probability for all classes). You just need to average the sets of results to plot a learning curve in your favourite graphing package. The LOG file essentially contains an echo of the information displayed on screen.