Oracle introduces Tribuo, a Java Machine Learning library
Oracle recently announced Tribuo, a machine learning library written in Java. It provides tools for classification, regression, clustering, and model development, along with a unified interface to many popular third-party machine learning libraries.
Oracle said it has spent years deploying machine learning models to large-scale production systems. In the process, it found that there is often a gap between what companies expect and what existing machine learning libraries provide.
For example, large software systems usually want self-describing building blocks that can tell when their input or output is invalid. However, most machine learning libraries train models on plain floating-point arrays. At deployment time, the input is a float array, and another float array is produced as the predicted output. The meaning of these arrays, or the description of what the input and output numbers should be, is left to wikis and bug trackers, or written in code comments. Oracle pointed out that developers do not want to add another database table for each machine learning model just to explain the meaning of the output float array.
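To make the contrast concrete, here is a minimal sketch in plain Java of what a self-describing output could look like. The class and method names (LabeledScores, topLabel, score) are hypothetical and are not part of Tribuo's API; the sketch only shows the idea of attaching label names to a raw float array and rejecting output of the wrong shape.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not a Tribuo class.
final class LabeledScores {
    // Label name -> score, in the order the labels were supplied.
    private final Map<String, Double> scores = new LinkedHashMap<>();

    // Pairs each raw score with the label it belongs to and rejects mismatched lengths.
    LabeledScores(List<String> labels, float[] raw) {
        if (labels.size() != raw.length) {
            throw new IllegalArgumentException(
                "Expected " + labels.size() + " scores but got " + raw.length);
        }
        for (int i = 0; i < raw.length; i++) {
            scores.put(labels.get(i), (double) raw[i]);
        }
    }

    // Returns the label with the highest score, or null if there are no labels.
    String topLabel() {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    // Looks up a score by label name instead of by array position.
    double score(String label) {
        Double s = scores.get(label);
        if (s == null) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        return s;
    }

    public static void main(String[] args) {
        // A bare float[] says nothing about what each position means.
        float[] raw = {0.1f, 0.7f, 0.2f};
        List<String> labels = Arrays.asList("setosa", "versicolor", "virginica");
        LabeledScores out = new LabeledScores(labels, raw);
        System.out.println("Predicted: " + out.topLabel());
    }
}
```

With the names carried alongside the numbers, a caller no longer needs a wiki page to learn that position 1 means "versicolor", and an output of the wrong length fails immediately rather than being silently misread.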
On the other hand, tracking the model in production is also tricky because it requires an external system to maintain the link between the deployed model and the training process and data. Oracle’s machine learning research team believes that it would be much better to embed these additional requirements directly into the machine learning library.
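One way to picture "embedding" this tracking in the library is a model object that carries a record of how it was built. The following is a rough sketch in plain Java under that assumption; TrackedModel, its fields, and the example file name are all hypothetical, and the "training" is deliberately trivial (it just averages the data). It is not how Tribuo implements this.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not a Tribuo class.
final class TrackedModel {
    private final double mean;
    private final Map<String, String> provenance;

    private TrackedModel(double mean, Map<String, String> provenance) {
        this.mean = mean;
        // Read-only view, so the record cannot be edited after training.
        this.provenance = Collections.unmodifiableMap(provenance);
    }

    // "Trains" a trivial model (the mean of the data) and records how it was built.
    static TrackedModel train(String dataSource, double[] data, String trainerName) {
        if (data.length == 0) {
            throw new IllegalArgumentException("No training data");
        }
        double sum = 0.0;
        for (double d : data) {
            sum += d;
        }
        Map<String, String> p = new LinkedHashMap<>();
        p.put("trainer", trainerName);
        p.put("data-source", dataSource);
        p.put("data-sha256", sha256(Arrays.toString(data)));
        p.put("num-examples", Integer.toString(data.length));
        return new TrackedModel(sum / data.length, p);
    }

    // The link between model and training data travels with the model itself.
    Map<String, String> getProvenance() {
        return provenance;
    }

    double predict() {
        return mean;
    }

    // Hex-encoded SHA-256 digest of the given text.
    private static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TrackedModel m = TrackedModel.train(
            "sensor-readings.csv", new double[]{1.0, 2.0, 3.0}, "mean-baseline");
        System.out.println("prediction = " + m.predict());
        m.getProvenance().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Because the record is stored inside the model, no external system is needed to answer "which data and which trainer produced this deployed model?".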
From the perspective of programming languages, most popular machine learning libraries are written in dynamic languages such as Python and R, while most enterprise systems are built in statically typed languages such as Java. This forces companies to write and operate code in several languages at once, which adds code maintenance costs and system overhead.
The open-source Tribuo is designed to address these problems. It has a data loading pipeline, a text processing pipeline, and feature-level transformations that can be applied once the data is loaded. It knows what its inputs and outputs are, and can describe the range and type of each one.
Tribuo also makes it convenient to deploy models from other systems and languages. It provides interfaces to ONNX Runtime, TensorFlow, and XGBoost. Among them, support for ONNX models allows models trained with Python packages such as PyTorch to be deployed in Java. Tribuo currently supports Java 8 and higher.
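As a rough illustration of what "knowing the range of each input" can buy, the sketch below records the observed range of each named feature during training and uses it to flag suspicious inputs later. It is plain Java with hypothetical names (FeatureDomain, observe, check, describe), it only covers real-valued features, and it is not Tribuo's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not a Tribuo class.
final class FeatureDomain {
    // Observed range of one real-valued feature.
    static final class Range {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void observe(double v) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        boolean contains(double v) {
            return v >= min && v <= max;
        }

        @Override
        public String toString() {
            return "[" + min + ", " + max + "]";
        }
    }

    private final Map<String, Range> ranges = new LinkedHashMap<>();

    // Called for every feature value seen while loading training data.
    void observe(String name, double value) {
        ranges.computeIfAbsent(name, k -> new Range()).observe(value);
    }

    // Returns a description of the problem, or null if the value looks valid.
    String check(String name, double value) {
        Range r = ranges.get(name);
        if (r == null) {
            return "unknown feature '" + name + "'";
        }
        if (!r.contains(value)) {
            return "feature '" + name + "' value " + value
                + " is outside the training range " + r;
        }
        return null;
    }

    // Human-readable summary of every known feature and its range.
    String describe() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Range> e : ranges.entrySet()) {
            sb.append(e.getKey()).append(": real ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FeatureDomain domain = new FeatureDomain();
        domain.observe("age", 18);
        domain.observe("age", 65);
        domain.observe("income", 20000);
        domain.observe("income", 90000);

        System.out.print(domain.describe());
        System.out.println(domain.check("age", 120));    // out-of-range warning
        System.out.println(domain.check("height", 1.8)); // unknown-feature warning
    }
}
```

The design choice here is that validation uses metadata gathered during training, so the description of the inputs cannot drift out of sync with the model the way a wiki page can.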