I have recently used Weka for trying out different classifiers for a vision-based task. Unfortunately, when I tried to re-implement the trained classifier in Python in order to integrate it within my experimental framework, I had some problems:
- Not all Weka classifiers output the trained "model" as a complete textual representation (e.g. a J48 decision tree is printed in full, while Random Forests are not).
- Every classifier has its own output format (e.g. I quickly hacked a J48 tree parser, but ADTrees already look slightly different and require a modified parser).
- With the MultilayerPerceptron, my reimplementation would not give the same results as the Weka original, although I cannot see where one could possibly make a mistake with such a simple classifier.
- In any case, some effort is needed to integrate other classifiers into one's own program, since the only common, reusable format for Weka's classifiers is what "save model" produces: serialized Java objects, which cannot be loaded with other languages.
Thus, I was glad when I found weka python glue, a small hack (version number 0.1, probably unmaintained) that uses the Java Native Interface (JNI) from C to implement a small Python module. The module starts a JVM and offers functions to load a serialized Weka model and classify data with it from Python. With some modifications (mostly to the Makefile), I got it to work, but I was not really satisfied:
- It is absolutely not object-oriented, i.e. it is not possible to load several classifiers in parallel, because the current classifier is a global state of the module.
- It requires the instances to be classified to be converted into a list of strings (i.e. tuples of floats are not accepted), only to have the numbers parsed again in Java.
- It does not support missing values.
- I suspect that it leaks memory.
- Its source code is much too complex for the simple task it performs.
Ideally, I thought, one should be able to access Weka's classes from Python in an object-oriented manner. Thinking about it, it should be possible to write a Python-Java bridge that is general enough to give quick access to all desired classes: after all, Java has a suitable reflection API, and Python's support for dynamic attributes makes it possible to do this really elegantly.
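To illustrate what dynamic attributes buy you here, this is a minimal sketch and not how any real bridge is implemented: the hypothetical class `JavaNameProxy` uses `__getattr__` to build up a dotted Java name step by step and hands it to a callback when called. A real bridge would use Java's reflection API at that point to look up the class or method; here the callback `record_call` (also a made-up name) merely records what was requested.

```python
class JavaNameProxy(object):
    """Hypothetical proxy that turns attribute access into a dotted Java name."""

    def __init__(self, resolve, name=""):
        self._resolve = resolve  # callback standing in for a reflection lookup
        self._name = name        # dotted name collected so far

    def __getattr__(self, attr):
        # Only invoked for attributes that are not found the normal way.
        if attr.startswith("_"):
            raise AttributeError(attr)
        full_name = attr if not self._name else self._name + "." + attr
        return JavaNameProxy(self._resolve, full_name)

    def __call__(self, *args):
        # A real bridge would invoke the Java method here.
        return self._resolve(self._name, args)


calls = []

def record_call(name, args):
    # Stand-in for the Java side: just remember what was requested.
    calls.append((name, args))

java = JavaNameProxy(record_call, "java")
java.lang.System.out.println("hello world")
```

After the last line, `calls` holds the fully qualified name `java.lang.System.out.println` together with its argument tuple, which is all the information a reflection-based lookup would need.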
A web search turned up the shiny solution: JPype. It indeed contains everything I dreamt of, and allows things such as:
    import jpype
    from jpype import java

    jpype.startJVM(jpype.getDefaultJVMPath())
    java.lang.System.out.println("hello world")
    jpype.shutdownJVM()
However, there is one problem that results in a ClassNotFoundException being raised when trying to de-serialize objects using an ObjectInputStream. Fortunately, Steve Ménard - the author of JPype - told me that this is a well-known problem with ObjectInputStream, which uses Class.forName without passing the optional "classLoader" argument. Thus, some strange class loader from a local (JPype) context is used, which cannot be set up and does not find any non-builtin classes. The solution (given in JPype's sourceforge tracker) is to extend ObjectInputStream, overriding the resolveClass method to use the desired (i.e. the system's) class loader.
I have put the relevant code into a JPypeObjectInputStream.java file that looks like this:
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;
    import java.io.IOException;

    // see http://sourceforge.net/tracker/index.php?func=detail&aid=1799807&group_id=109824&atid=655012
    public class JPypeObjectInputStream extends ObjectInputStream {
        public JPypeObjectInputStream(InputStream in) throws IOException {
            super(in);
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            return Class.forName(desc.getName(), true,
                                 ClassLoader.getSystemClassLoader());
        }
    }
Compiled with javac, this produces a .class file that can then be loaded from within Python using JPype (assuming it is on your classpath):
    import os, jpype
    from jpype import java
    ...
    JPypeObjectInputStream = jpype.JClass("JPypeObjectInputStream")
    ois = JPypeObjectInputStream(
        java.io.FileInputStream("example.model"))
    classifier = ois.readObject()
Using this, the above-mentioned weka python glue can be re-implemented very quickly in Python, easily supporting numeric and nominal attributes as well as missing values in an object-oriented fashion:
    import os, jpype
    from jpype import java

    if not jpype.isJVMStarted():
        _jvmArgs = ["-ea"]  # enable assertions
        _jvmArgs.append("-Djava.class.path=" + os.environ["CLASSPATH"])
        jpype.startJVM(jpype.getDefaultJVMPath(), *_jvmArgs)

    weka = jpype.JPackage("weka")
    JPypeObjectInputStream = jpype.JClass("JPypeObjectInputStream")

    class WekaClassifier(object):
        def __init__(self, modelFilename, datasetFilename):
            # the dataset file provides the attribute definitions;
            # the last attribute is taken as the class attribute
            self.dataset = weka.core.Instances(
                java.io.FileReader(datasetFilename))
            self.dataset.setClassIndex(self.dataset.numAttributes() - 1)

            # a single instance object is re-used for every classification
            self.instance = weka.core.Instance(self.dataset.numAttributes())
            self.instance.setDataset(self.dataset)

            # de-serialize the trained model
            ois = JPypeObjectInputStream(
                java.io.FileInputStream(modelFilename))
            self.model = ois.readObject()

        def classify(self, record):
            # None marks a missing value
            for i, v in enumerate(record):
                if v is None:
                    self.instance.setMissing(i)
                else:
                    self.instance.setValue(i, v)
            # map the numeric result back to the class label string
            return self.dataset.classAttribute().value(
                int(self.model.classifyInstance(self.instance)))

    # jpype.shutdownJVM() is not called ATM
This module is closely modeled after weka python glue; in my application, I do not actually use exactly this code (e.g. I use the same dataset for all models, and I do not need the classification result as string, but I use the float result directly). However, it should be usable more or less as a drop-in solution and be easily extensible thanks to JPype.
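One detail worth hedging against: the module reads `os.environ["CLASSPATH"]` directly, which raises a `KeyError` if the variable is not set. The following is a minimal sketch of a more forgiving way to assemble the JVM options. The helper name `build_jvm_args` and its parameters are my own invention for this illustration and are not part of JPype or of the downloadable module.

```python
import os


def build_jvm_args(extra_entries=(), environ=None):
    """Hypothetical helper: assemble JVM options without requiring CLASSPATH."""
    environ = os.environ if environ is None else environ
    # Split an existing CLASSPATH into entries, ignoring empty ones.
    entries = [e for e in environ.get("CLASSPATH", "").split(os.pathsep) if e]
    # Append additional entries, e.g. the Weka jar or the directory
    # that contains JPypeObjectInputStream.class.
    entries.extend(extra_entries)
    args = ["-ea"]  # enable assertions, as in the module above
    if entries:
        args.append("-Djava.class.path=" + os.pathsep.join(entries))
    return args
```

The result could then be passed on as `jpype.startJVM(jpype.getDefaultJVMPath(), *build_jvm_args(["weka.jar", "."]))`, where `weka.jar` is only a placeholder for wherever Weka is installed on your system.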
Here are the above files for your convenient download (developed, tested, and used with JPype 0.5.3):
| File: JPypeObjectInputStream.java | (575 bytes; Thu, Mar/06/2008) |
| File: weka_classifier.py | (1057 bytes; Thu, Mar/06/2008) |