Using Weka from Python

I recently used Weka to try out different classifiers for a vision-based task. Unfortunately, when I tried to re-implement the trained classifier in Python in order to integrate it into my experimental framework, I ran into some problems:

  1. Not all Weka classifiers output the trained "model" as a complete textual representation (e.g. a J48 decision tree is printed in full, while Random Forests are not).
  2. Every classifier has its own output format (e.g. a J48 tree parser was quickly hacked together, but ADTrees already look slightly different and require a modified parser).
  3. With the MultilayerPerceptron, my reimplementation would not give the same results as the Weka original, although I cannot see where one could possibly make a mistake with such a simple classifier.
  4. In any case, some effort is needed to integrate other classifiers into one's own program, since the only common, reusable format for Weka's classifiers is what "save model" produces: serialized Java objects, which cannot be loaded with other languages.
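To give an idea of the kind of per-classifier parser meant in point 2, here is a minimal sketch for J48's indented text format. It is not the parser I actually used; the sample tree is hand-written in J48's style (not real Weka output), and the names parse_j48 and classify are made up for this illustration. It assumes attribute names without spaces and ignores everything outside the tree section.

```python
import re

# One line of the tree: optional "|   " indent markers, a test such as
# "width <= 0.6", and (on leaves only) ": label (count[/errors])".
_LINE = re.compile(
    r"^(?P<indent>(?:\|\s+)*)"
    r"(?P<attr>\S+) (?P<op><=|>=|!=|=|<|>) (?P<value>[^:]+?)"
    r"(?:: (?P<label>.+?) \([\d./]+\))?$"
)


def parse_j48(text):
    """Parse the indented tree section into nested lists of branches.

    Each branch is a tuple (attr, op, value, child), where child is either
    a class label (str) or a list of sub-branches.
    """
    root = []
    stack = [root]  # stack[d] collects the branches at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        m = _LINE.match(line.rstrip())
        if m is None:
            raise ValueError("unrecognized line: %r" % line)
        depth = m.group("indent").count("|")
        del stack[depth + 1:]  # leave any deeper subtrees
        label = m.group("label")
        child = label if label is not None else []
        stack[depth].append((m.group("attr"), m.group("op"),
                             m.group("value").strip(), child))
        if label is None:
            stack.append(child)  # following lines are nested below this one
    return root


def classify(tree, record):
    """Walk the parsed tree; record maps attribute names to values."""
    for attr, op, value, child in tree:
        actual = record[attr]
        if op in ("=", "!="):  # nominal test
            hit = (str(actual) == value) == (op == "=")
        else:  # numeric test
            threshold = float(value)
            hit = {"<=": actual <= threshold, ">": actual > threshold,
                   "<": actual < threshold, ">=": actual >= threshold}[op]
        if hit:
            return child if isinstance(child, str) else classify(child, record)
    return None  # no branch matched (e.g. unseen nominal value)


# Hand-written sample in J48's style, for illustration only
SAMPLE = """\
width <= 0.6: small (50.0)
width > 0.6
|   length <= 4.9: medium (48.0/1.0)
|   length > 4.9: large (52.0/2.0)
"""

tree = parse_j48(SAMPLE)
label = classify(tree, {"width": 1.5, "length": 5.5})
```

Even this small sketch shows the problem: the regular expression is tied to J48's layout, so every other classifier needs its own variant.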

Thus, I was glad when I found weka python glue, a small hack (version number 0.1, probably unmaintained) that uses the Java Native Interface (JNI) to implement a small Python module which starts a JVM and offers functions to load a serialized Weka model and classify data with it from Python. With some modifications (mostly to the Makefile), I got it to work, but I was not really satisfied:

Ideally, I thought, one should be able to access Weka's classes from Python in an object-oriented manner, and thinking about it, it should be possible to write a Python-Java bridge that is general enough to allow quick access to all desired classes. After all, Java has a suitable reflection API, and Python's support for dynamic attributes makes it possible to wrap this in a really beautiful interface.
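To illustrate what I mean by dynamic attributes, here is a toy sketch of the idea in pure Python. It is not how any real bridge is implemented: PackageProxy and the registry dictionary are invented for this example, and the registry merely stands in for a lookup via Java reflection.

```python
class PackageProxy(object):
    """Turns attribute access into dotted Java-style names."""

    def __init__(self, name, resolve):
        self._name = name
        # resolve: callable mapping a dotted name to a class object,
        # or to None if the name is (so far) only a package prefix
        self._resolve = resolve

    def __getattr__(self, attr):
        # Only invoked for attributes that normal lookup does not find
        if attr.startswith("_"):
            raise AttributeError(attr)
        dotted = self._name + "." + attr
        found = self._resolve(dotted)
        if found is not None:
            return found
        return PackageProxy(dotted, self._resolve)


# Stand-in for a reflection lookup; a real bridge would ask the JVM instead
registry = {"weka.core.Instances": "<class weka.core.Instances>"}

weka = PackageProxy("weka", registry.get)
resolved = weka.core.Instances  # looks up "weka.core.Instances"
```

Each attribute access extends the dotted name by one component until the resolver recognizes a class, so no package or class has to be declared on the Python side in advance.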

Some web search brought the shiny solution: JPype. This indeed contains everything I dreamt of, and allows things such as:

import jpype
from jpype import java
jpype.startJVM(jpype.getDefaultJVMPath())
java.lang.System.out.println("hello world")
jpype.shutdownJVM()

However, there is one problem: a ClassNotFoundException is raised when trying to de-serialize objects using an ObjectInputStream. Fortunately, Steve Ménard, the author of JPype, told me that this is a well-known problem with ObjectInputStream, which uses Class.forName without passing the optional "classLoader" argument. As a result, some strange class loader from a local (JPype) context is used, which cannot be configured and does not find any non-builtin classes. The solution (given in JPype's sourceforge tracker) is to extend ObjectInputStream, overriding the resolveClass method to use the desired (i.e. the system's) class loader.

I have put the relevant code into a JPypeObjectInputStream.java file that looks like this:

import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.io.IOException;

// see http://sourceforge.net/tracker/index.php?func=detail&aid=1799807&group_id=109824&atid=655012
public class JPypeObjectInputStream extends ObjectInputStream
{
    public JPypeObjectInputStream(InputStream in) throws IOException
    {
        super(in);
    }

    protected Class<?> resolveClass(ObjectStreamClass desc) throws
        IOException, ClassNotFoundException
    {
        return Class.forName(desc.getName(), true,
                             ClassLoader.getSystemClassLoader());
    }
}

Using javac, this produces a .class file that one can then load from within Python using JPype (assuming it is in your classpath):

import os, jpype
from jpype import java

...

JPypeObjectInputStream = jpype.JClass("JPypeObjectInputStream")

ois = JPypeObjectInputStream(
              java.io.FileInputStream("example.model"))

classifier = ois.readObject()
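Regarding the classpath requirement: the directory containing JPypeObjectInputStream.class and the Weka jar both have to be visible to the JVM when it is started. A small sketch of how one might assemble the corresponding JVM option; the helper name jvm_classpath_option and the example paths are hypothetical.

```python
import os


def jvm_classpath_option(entries, env=None):
    """Build a -Djava.class.path=... option from directories and jar files.

    Entries from the CLASSPATH environment variable are appended after the
    explicitly given ones; duplicates are dropped, order is preserved.
    """
    env = os.environ if env is None else env
    inherited = [p for p in env.get("CLASSPATH", "").split(os.pathsep) if p]
    combined = []
    for path in list(entries) + inherited:
        if path not in combined:
            combined.append(path)
    return "-Djava.class.path=" + os.pathsep.join(combined)


# Hypothetical locations: current directory (for the .class file) and weka.jar
option = jvm_classpath_option([".", "/path/to/weka.jar"], env={})

# The result would be passed when starting the JVM, e.g.:
#   jpype.startJVM(jpype.getDefaultJVMPath(), option)
```

Using os.pathsep keeps the separator correct on both Unix-like systems (":") and Windows (";").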

Using this, the above-mentioned weka python glue can be re-implemented quickly in Python, easily supporting numeric and nominal attributes and missing values in an object-oriented fashion:

import os, jpype
from jpype import java

if not jpype.isJVMStarted():
    _jvmArgs = ["-ea"] # enable assertions
    _jvmArgs.append("-Djava.class.path="+os.environ["CLASSPATH"])
    jpype.startJVM(jpype.getDefaultJVMPath(), *_jvmArgs)

weka = jpype.JPackage("weka")

JPypeObjectInputStream = jpype.JClass("JPypeObjectInputStream")

class WekaClassifier(object):
    def __init__(self, modelFilename, datasetFilename):
        self.dataset = weka.core.Instances(
            java.io.FileReader(datasetFilename))
        self.dataset.setClassIndex(self.dataset.numAttributes() - 1)

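        # One reusable instance, refilled on every classify() call.
        # (In Weka 3.7 and later, Instance is an interface; use
        # weka.core.DenseInstance there instead.)
        self.instance = weka.core.Instance(self.dataset.numAttributes())
        self.instance.setDataset(self.dataset)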

        # De-serialize the trained model with the class-loader-aware stream
        ois = JPypeObjectInputStream(
            java.io.FileInputStream(modelFilename))
        self.model = ois.readObject()
        ois.close()

    def classify(self, record):
        for i, v in enumerate(record):
            if v is None:
                self.instance.setMissing(i)
            else:
                self.instance.setValue(i, v)
        return self.dataset.classAttribute().value(
            int(self.model.classifyInstance(self.instance)))

# Note: jpype.shutdownJVM() is currently never called

This module is closely modeled after weka python glue; in my application, I do not use exactly this code (e.g. I use the same dataset for all models, and I do not need the classification result as a string, but use the float result directly). However, it should be usable more or less as a drop-in solution and be easily extensible thanks to JPype.
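The variation just mentioned can be sketched roughly as follows. This is not my actual code: fill_instance and classify_raw are names chosen for this illustration, and FakeInstance and FakeModel are stand-ins so that the logic can be tried without a JVM. With JPype, one would pass the Weka instance and the de-serialized model instead.

```python
def fill_instance(instance, record):
    """Copy a Python record into a reusable instance; None means missing."""
    for i, v in enumerate(record):
        if v is None:
            instance.setMissing(i)
        else:
            instance.setValue(i, v)
    return instance


def classify_raw(model, instance, record):
    """Return the classifier's float result without mapping it to a label."""
    return model.classifyInstance(fill_instance(instance, record))


class FakeInstance(object):
    """Stand-in offering the two setter methods used above."""

    def __init__(self, n):
        self.values = [None] * n

    def setMissing(self, i):
        self.values[i] = None

    def setValue(self, i, v):
        self.values[i] = v


class FakeModel(object):
    """Stand-in 'classifier' that just counts the non-missing values."""

    def classifyInstance(self, instance):
        return float(sum(1 for v in instance.values if v is not None))


shared = FakeInstance(3)  # one instance shared by all models
result = classify_raw(FakeModel(), shared, [1.0, None, "a"])
```

Because every position of the record is written on each call, either as a value or as missing, reusing one instance does not leak values from a previous record, provided the record covers all attributes that the model looks at.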

Here are the above files for your convenient download (developed, tested, and used with JPype 0.5.3):

File: JPypeObjectInputStream.java (575 bytes; Thu, Mar/06/2008)
File: weka_classifier.py (1057 bytes; Thu, Mar/06/2008)

This page was last modified: Monday, August 04, 2008