Instructions: see below.
Right-click on a file to download it.
|madden_bn_v36.lsp||LISP implementation of the various Bayesian Network classifiers.|
|Small sample input files|
|Perl programs to convert files from MLC++ format to this software format|
|lcurve||Perl program for running a batch of analyses|
|runperl.bat||For Windows users, a handy batch file for running Perl programs that don't have any file extension (as in Unix): copy this file, give it the same name as the Perl program (e.g. lcurve.bat to go with lcurve above), and put it in the same folder as the Perl program.|
For a description of the classifiers, and a discussion of their relative performance, see: “On the Classification Performance of TAN and General Bayesian Networks”, Michael G. Madden. Knowledge-Based Systems, 2009. http://datamining.it.nuigalway.ie/images/pubs/kbs-jan2009-mmadden.pdf.
Please cite that paper if you use this software in your work.
If you have problems using this code or if you think you have found a bug in it, you are welcome to contact me.
The following global variables must be defined before running the functions:
In these instructions, I assume the current version of the BN Lisp software has been compiled and is loaded. In CLISP, this is done with the following commands (assuming the file containing the software is called madden_bn_v36.lsp):
(compile-file "madden_bn_v36")
(load "madden_bn_v36")
(load "weather_defines") ;; NOTE: Change these for a different data set
(load "weather_data") ;; define *nodes*, *assmnts*, *class_node*, *train*, *test*
(setq verbose 1) ;; Verbose mode: set to 0 if running in a batch
(naive) ;; Construct Naive structure -- result is *naive*
(probs *naive*) ;; Generate probabilities for structure in *naive*
(test *naive* *probs* *test*) ;; Test the network given by structure in *naive*
;; and probabilities in *probs* on the test dataset *test*
(stats *testres*) ;; Display statistics from the result of the test,
;; which have been stored in *testres*
(tanbn) ;; Construct TAN structure -- result is *tanstruct*
Then use *tanstruct* rather than *naive* in subsequent function calls, e.g.
(probs *tanstruct*)
Note that this uses whichever metric is currently specified. To construct a TAN classifier using Friedman et al.'s original Conditional Mutual Information metric, either set *metric* to CMI or use the (FGGTAN) convenience function.
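For reference, the Conditional Mutual Information between two attributes X and Y given the class C is the sum over all (x, y, c) of P(x,y,c) log [P(x,y|c) / (P(x|c) P(y|c))]. The following is a minimal standalone sketch of how that quantity can be estimated from frequency counts. It is not taken from madden_bn_v36.lsp: the function names and the input format (a list of (x y c) triples) are assumptions made purely for illustration.

```lisp
;; Illustrative sketch only; names and input format are hypothetical.
(defun count-keys (keys)
  "Return an EQUAL hash table mapping each distinct key to its frequency."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (key keys table)
      (incf (gethash key table 0)))))

(defun conditional-mutual-information (cases)
  "Estimate I(X;Y|C) in nats from CASES, a list of (x y c) triples."
  (let ((n (length cases))
        (n-xyc (count-keys cases))
        (n-xc (count-keys (mapcar (lambda (r) (list (first r) (third r))) cases)))
        (n-yc (count-keys (mapcar (lambda (r) (list (second r) (third r))) cases)))
        (n-c (count-keys (mapcar #'third cases)))
        (total 0.0d0))
    ;; P(x,y,c) P(c) / (P(x,c) P(y,c)) reduces to a ratio of raw counts.
    (maphash (lambda (key count)
               (destructuring-bind (x y c) key
                 (let ((ratio (/ (* count (gethash c n-c))
                                 (* (gethash (list x c) n-xc)
                                    (gethash (list y c) n-yc)))))
                   (incf total (* (/ count n)
                                  (log (float ratio 1.0d0)))))))
             n-xyc)
    total))
```

When X and Y are independent within every class the estimate is zero; when one attribute determines the other within each class it equals the conditional entropy of that attribute.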
(k2 99) ;; Construct K2 structure -- result is *k2struct*
Here, the function parameter is the maximum number of parents any node may have.
Then use *k2struct* in subsequent function calls, e.g.
(probs *k2struct*)
Incidentally, K2 does not work well on the very small sample weather dataset.
Note that you can also use this function to create a Bayesian Network augmented Naive Bayes structure, by passing an additional parameter with value T:
(k2 99 T) ;; Start with a Naive Bayes structure and augment it with K2
;; -- result is *k2struct*
(tstart) ;; put this call in before constructing structure and calculating probs
(tstop) ;; displays elapsed real time and CPU time, calls to g-function
(confuse *naive* *probs* *test*)
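The output of (confuse ...) is not described here. Assuming it reports a confusion matrix (counts of test cases for each pair of actual and predicted class), the following standalone sketch shows how such a table can be built. The function names and the list-based input are hypothetical and are not part of madden_bn_v36.lsp.

```lisp
;; Illustrative sketch only; names and input format are hypothetical.
(defun confusion-matrix (actual predicted)
  "Return an EQUAL hash table mapping (actual-label predicted-label) to a count.
ACTUAL and PREDICTED are parallel lists of class labels."
  (let ((table (make-hash-table :test #'equal)))
    (loop for a in actual
          for p in predicted
          do (incf (gethash (list a p) table 0)))
    table))

(defun confusion-count (table actual-label predicted-label)
  "Number of cases of ACTUAL-LABEL classified as PREDICTED-LABEL (0 if none)."
  (gethash (list actual-label predicted-label) table 0))

;; Example: two YES cases and three NO cases.
;; (confusion-count
;;   (confusion-matrix '(yes yes no no no) '(yes no no no yes))
;;   'no 'no)
```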
(dot *k2struct* "filestem")
This function writes out a DOT file to filestem.dot, which can be displayed on screen (from your command line, not in Lisp) using
dotty filestem.dot
or converted to PostScript (or another format, e.g. PNG) using
dot -Tps filestem.dot -o filestem.ps
Important: If using DOT, you cannot have '-' characters in your node names.
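To illustrate the file format and the restriction on '-' characters, here is a minimal standalone sketch that writes a list of (parent child) edges as a DOT digraph, replacing each '-' in a node name with '_'. It is not the (dot ...) function from madden_bn_v36.lsp: the function names, the edge-list representation, and the renaming strategy are all assumptions made for illustration.

```lisp
;; Illustrative sketch only; names and edge representation are hypothetical.
(defun dot-safe-name (name)
  "Return NAME as a string with every '-' replaced by '_'."
  (substitute #\_ #\- (string name)))

(defun write-dot (edges stream &optional (graph-name "bn"))
  "Write EDGES, a list of (parent child) pairs, to STREAM as a DOT digraph."
  (format stream "digraph ~A {~%" graph-name)
  (dolist (edge edges)
    (format stream "  ~A -> ~A;~%"
            (dot-safe-name (first edge))
            (dot-safe-name (second edge))))
  (format stream "}~%"))

;; Writing to a file that dotty or dot can then read:
;; (with-open-file (out "filestem.dot" :direction :output
;;                                     :if-exists :supersede)
;;   (write-dot '(("play" "outlook") ("play" "wind-speed")) out))
```

Renaming is only one option; changing the node names in the defines file before loading the data avoids the problem at the source.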
(roc structure probabilities testdata classindex filename)
This function writes to filename the (x,y) points of a ROC curve with 200 evenly-spaced interpolated points.
For multiclass problems, the ROC curve is based on one class only: classindex is the index (counted from 0) of the classification variable category for which the ROC curve is to be generated.
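The idea behind a one-class ROC curve can be sketched as follows: sort the test cases by the predicted probability of the chosen class, then lower a threshold through them, recording the false-positive and true-positive rates at each step. This standalone sketch is not the (roc ...) function itself; the name roc-points and the (score . positive-p) input format are assumptions, and it returns only the raw points without the 200-point interpolation described above.

```lisp
;; Illustrative sketch only; name and input format are hypothetical.
(defun roc-points (scored)
  "SCORED is a list of (score . positive-p) pairs, where SCORE is the predicted
probability of the chosen class and POSITIVE-P is true if the case really
belongs to it.  Return a list of (false-positive-rate . true-positive-rate)
points running from (0 . 0) to (1 . 1)."
  (let* ((sorted (sort (copy-list scored) #'> :key #'car))
         (pos (count-if #'cdr sorted))
         (neg (- (length sorted) pos))
         (tp 0)
         (fp 0)
         (points (list (cons 0 0))))
    ;; Both rates are undefined unless each kind of case is present.
    (assert (and (plusp pos) (plusp neg)))
    (loop for (item . rest) on sorted
          do (if (cdr item) (incf tp) (incf fp))
             ;; Emit a point only after the last case with this score,
             ;; so that tied scores cross the threshold together.
             (when (or (null rest)
                       (/= (car item) (car (first rest))))
               (push (cons (/ fp neg) (/ tp pos)) points)))
    (nreverse points)))
```

For a multiclass problem, positive-p would be true exactly for the cases whose actual class is the one selected by classindex.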
If you run any of these programs without arguments, they display a usage message. If you run them with a -v switch, they display verbose messages.
There are three conversion programs:
name2bn opt:-v filestem
Converts filestem.names into defines.lsp
data2bn opt:-v filestem.all training-percentage rand-seed
OR:
data2bn opt:-v filestem.data
Converts MLC++ data files into data.lsp.
mlc2bn opt:-v opt:-n opt:-d filestem train-percentage rand-seed
Without either the -n or -d switch, this operates like the first form of data2bn above.
Note: To change the discretization algorithm, edit the program and change $discopts.
lcurve datafile_stem alg_name
Here, datafile_stem is the name of a dataset in MLC++ format (.names and .all files must exist), and alg_name is one of naive, tan, or k2.
This program uses the mlc2bn and name2bn programs already described, and optionally the MLC++ discretize program. It operates as follows:
The program writes out files called alg_name.csv and alg_name.log. The CSV file lists, for each run, the training set size, correct, incorrect and 'unknown' classifications (unknown = equal probability for all classes). You just need to average the sets of results to plot a learning curve in your favourite graphing package. The LOG file essentially contains an echo of the information displayed on screen.
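The averaging step can be sketched in Lisp as below. This is a standalone illustration, not part of the distributed software. It assumes each run has already been read from the CSV file into a list of the form (train-size correct incorrect unknown), and it assumes that 'unknown' classifications count against accuracy. The function name and both assumptions are hypothetical, so check them against your own alg_name.csv.

```lisp
;; Illustrative sketch only; name, row format and the treatment of
;; 'unknown' classifications are assumptions.
(defun learning-curve (rows)
  "ROWS is a list of (train-size correct incorrect unknown) lists, one per run.
Return a list of (train-size . mean-accuracy) pairs sorted by train-size."
  (let ((sums (make-hash-table)))  ; train-size -> (accuracy-sum . run-count)
    (dolist (row rows)
      (destructuring-bind (size correct incorrect unknown) row
        (let ((entry (or (gethash size sums)
                         (setf (gethash size sums) (cons 0 0)))))
          (incf (car entry) (/ correct (+ correct incorrect unknown)))
          (incf (cdr entry)))))
    (let ((curve '()))
      (maphash (lambda (size entry)
                 (push (cons size (/ (car entry) (cdr entry))) curve))
               sums)
      (sort curve #'< :key #'car))))
```

The resulting pairs are exact rationals; apply float to the accuracies before exporting them to a graphing package.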