Bayesian Network Classifiers in Lisp
Dr Michael Madden
Senior Lecturer in Information Technology
College of Engineering & Informatics
National University of Ireland, Galway.
Instructions: see below.
Right-click on a file to download it.
|madden_bn_v36.lsp||LISP implementation of the various Bayesian Network classifiers.|
|Small sample input files|
|Perl programs to convert files from MLC++ format to this software format|
|lcurve||Perl program for running a batch of analyses|
|runperl.bat||For Windows users, a handy batch file for running Perl programs that don't have any file extension (as in Unix): copy this file, give it the same name as the Perl program (e.g. lcurve.bat to go with lcurve above), and put it in the same folder as the Perl program.|
InstructionsThis describes the basics of how to use my implementations of General Bayesian Network classifiers and other Bayesian classifiers. For further details, please read the code.
For a description of the classifiers, and a discussion of their relative performance, see: “On the Classification Performance of TAN and General Bayesian Networks” , Michael G. Madden. Knowledge Based Systems, 2009. http://datamining.it.nuigalway.ie/images/pubs/kbs-jan2009-mmadden.pdf.
Please cite that paper if you use this software in your work.
If you have problems using this code or if you think you have found a bug in it, you are welcome to contact me.
- A Common Lisp implementation -- I use CLISP (Bruno Haible) on Unix and Windows
- For supporting scripts, Perl Version 5
- DOT/DOTTY (part of GraphViz), for displaying network structures
- To produce data for learning curves on continuous-valued data, the lcurve program below uses the MLC++ discretize utility
Input File Formats:See example files for the tiny Weather (a.k.a. PlayTennis) datset: weather_defines.lsp and weather_data.lsp. Note that, unrealistically, the testing data is just a copy of the training data in this example.
The following global variables must be defined before running the functions:
- list of node names
- For each node, a list of possible values that can be assigned to it
- Index (from 0) of the classification node in *nodes*
- List of training cases, where each case is a list of value assignments for each node
- List of testing cases (same format as *train*)
- 0: very little output. 1: normal output. 2+: debug output. 0 recommended if running functions in a batch.
Generating and Evaluating Bayesian ClassifiersThe procedure for all of the classifiers is the same apart from one function call and one function parameter. The full procedure is described for the Naive Bayes classifier, and for the others only the differences are described.
In these instructions, I assume the current version of the BN Lisp software has been compiled and is loaded. In CLISP, this is done with the following commands (assuming the file containing the software is called madden_bn_v36.lsp):
(compile-file "madden_bn_v36") (load "madden_bn_v36")
Format of Structure RepresentationThe functions to construct Bayesian structures -- (naive), (tanbn), (fggtan), (k2 N) -- all return a representation of the structure and also define a global variable with the return value; poor programming practice perhaps, but useful when experimenting with the different structures. The structure representation is a list of lists, where the head (car) of the inner list is a node and its tail (cdr) is a list of parent nodes. Nodes are represented numerically according to their position in *nodes*. To visualise the structure, use the DOT function described below. The name of the global variable depends on which classifier is used: see the descriptions of the classifiers below for details.
Naive Bayes ClassifierInside your Lisp environment:
(load "weather_defines") ;; NOTE: Change these for a different data set
(load "weather_data") ;; define *nodes*, *assmnts*, *class_node*, *train*, *test*
(setq verbose 1) ;; Verbose mode: set to 0 if running in a batch
(naive) ;; Construct Naive structure -- result is *naive*
(probs *naive*) ;; Generate probabilities for structure in *naive*
(test *naive* *probs* *test*) ;; Test the network given by structure in *naive*
;; and probablities in *prob* on the test dataset *test*
(stats *testres*) ;; Display statistics from the result of the test,
;; which have been stored in *testres*
Tree Augmented Naive Bayes Classifier:Just as above, except the call to construct the structure is:
(tanbn) ;; Construct TAN structure -- result is *tanstruct*Then use *tanstruct* rather than *naive* in subsequent function calls, e.g.
(probs *tanstruct*)Note that this uses whatever the currently-specified metric is. If you want to construct a TAN classifer using Friedman et al's original Conditional Mutual Information metric, either set *metric* to CMI or use the (FGGTAN) convenience funtion.
General Bayesian Network Structure:K2 is used to construct General Bayesian Networks (GBNs). The call to construct the structure is:
(k2 99) ;; Construct K2 structure -- result is *k2struct*where the function parameter is the maximum number of parents any node may have.
Then use *k2struct* in subsequent function calls, e.g.
(probs *k2struct*)Incidentally, on the very small sample weather dataset, K2 does not work well.
Note that you can use this function to create a Bayesian Network augmented Naive Bayes structure, by adding an additional parameter with value T:
(k2 99 T) ;; Start with a Naive Bayes structure and augment it with K2
;; -- result is *k2struct*
Other Features of the Code
To time programs:
(tstart) ;; put this call in before constructing structure and calculating probs
(tstop) ;; displays elapsed real time and CPU time, calls to g-function
To display a confusion matrix:
(confuse *naive* *probs* *test*)
To generate a graphical display of the network:
(dot *k2struct* "filestem")This function writes out a DOT file to filestem.dot, which can be displayed on screen (from your command line, not in Lisp) using
dotty filestem.dotor converted to PostScript (or other format, eg PNG) using
dot -Tps filestem.dot -o filestem.psImportant: If using DOT, you cannot have '-' characters in your node names.
To generate data for a ROC curve:
(roc structure probabilities testdata classindex filename)This function writes to filename the (x,y) points of a ROC curve with 200 evenly-spaced interpolated points.
For multiclass problems, the ROC curve is based on one class only: classindex is the index (counted from 0) of the classification variable category for which the ROC curve is to be generated.
Supporting ProgramsI have written some Perl programs to facilitate performing analyses with the BN software. You can use them as they are or modify them to suit your own needs.
If you run any of these programs without arguments, they display a usage message. If you run them with a -v switch, they display verbose messages.
To convert files from MLC++ formatYou can download some well-known datasets from the MLC++ web-page. MLC++ files are organised as follows: filestem.names describes the data, filestem.data contains training data, filestem.test contains testing data, and filestem.all contains all data, if using a procedure that will automatically split into training and testing sets.
There are three conversion programs:
name2bn opt:-v filestemConverts filestem.names into defines.lsp
name2bn opt:-v filestem.all training-percentage rand-seedOR:
name2bn opt:-v filestem.dataConverts MLC++ data files into data.lsp.
In the first form, you specify filestem.all and the program divides it into training and testing data, according to the training-percentage and random number seed that you specify. In the second form, you specify filestem.data but filestem.test must also exist, and the program takes the training and testing data from these.
mlc2bn opt:-v opt:-n opt:-d filestem train-percentage rand-seedWithout either the -n or -d switch, this operates like the first form of data2bn above.
With -n, it also calls name2bn to produce defines.lsp.
With -d, it calls the MLC++ disretize program to discretize the data. This is useful if the data files contain continuous-valued data, as the BN implementation cannot handle it. It is simplest to use -d always, as it doesn't hurt the data files if all variables are already discrete/categorical, but the option is there in case discretize is not installed.
Note: To change the discretization algorithm, edit the program and change $discopts.
To perform a set of analyses for a learning curveThe program lcurve performs a set of analyses and stores raw data from which a learning curve can be plotted for a given dataset and algorithm. Usage:
lcurve datafile_stem alg_nameHere, datafile_stem is the name of a dataset in MLC++ format (.names and .all files must exist), and alg_name is one of naive, tan, or k2.
This program uses the mlc2bn and name2bn programs already described, and optionally the MLC++ discretize program. It operates as follows:
- Repeat for several training set sizes:
- Repeat N times:
- Make training/testing data file
- In LISP, run the training algorithm and test it
- Store the run details and individual results
- End repeat
- Repeat N times:
- End repeat
The program writes out files called alg_name.csv and alg_name.log. The CSV file lists, for each run, the training set size, correct, incorrect and 'unknown' classifications (unknown = equal probability for all classes). You just need to average the sets of results to plot a learning curve in your favourite graphing package. The LOG file essentially contains an echo of the information displayed on screen.