Idiot's Artificial Intelligence

Reference to spline modelIAI - 0
Reference to piecewise linear model IAI - 1

Yes, it is right, this is an Idiot's guide to Kolmogorov-Arnold Network (KAN), which is now recognized as a new form of Artificial Intelligence (AI). There are some people who inflate the complexity of KAN, this article will show how wrong they are. Prerequisite for reading this article is bachelor degree in any engineering field and basic coding skills.

Piecewise constant approximation

The starting point is identification of several additive functions $f_i$ in the following model $$ z = \sum_{i = 1}^n f_i (x_i) $$ Features $x_i$ and targets $z$ are tabulated numbers. Let all functions be piecewise constant defined on the same grid

than each function value $v$ can be found at run-time by an inline computed index as $v = f[ (int)((x - x_{min}) / \Delta)]$. The set of sought functions form matrix $F$, which can be addressed via indexes $i$ and $x_i = (int)((x - x_{min}) / \Delta)$ $$ z = \sum_{i = 1}^n F[ i ] [ x_i ].$$ We need to find matrix elements for known partial sums and indexes which is trivial task of linear algebra. If we stretch all matrix elements into a vector $$ F^T = [ \:F_{1,1} \:F_{1,2} ... F_{n, m}\:]$$ than features form a sparse binary vector $$ b^T = [\:0 \:0 \:0 \:1 \:0 ... 0 \:0 \:1\:] $$ where data records define positions of non zero elements and inner products of $b^T F$ must be equal to $z$. The training can be performed by projection descent (Kaczmarz 1937). The current approximation which is the point (numerically is a vector) is projected sequentially to each next hyperplane built on the new data and, since almost all of them are pairwise orthogonal, the method converges very fast to approximate solution. Numerically, it needs to find the difference between current model and observed object $$ \Delta z = z - \sum_{i = 1}^n F[ i ] [ x_i ], $$ divide it by the number of involved functions which is $n$ and add this $\alpha \Delta z / n$ to each involved matrix element $F[ i ] [ x_i ]$ ($\alpha$ is regularization parameter for computational stability).

Piecewise linear model

The piecewise constant model is obviously an approximation, it also can't be used in KAN as is. It should be upgraded to at least piecewise linear. The model, in this case, is slightly changed. The matrix $F[ i ] [ x_i ]$ will contain the node points of the functions. The function value is computed by linear interpolation. The index $(int)((x - x_{min}) / \Delta)$ defines the linear block and, having relative distances to the left and right nodes, the function value can be quickly computed. In identification step the difference is distributed between left and right nodes according to relative distances. This method is still projection descent (Kaczmarz 1937) for system of linear algebraic equations where in previously explained binary vector $b$ each non zero value is replaces by pairs, such as $0.3, 0.7$ or $0.2, 0.8$ representing relative distances within certain linear segment (the sum in these pairs is always 1). The hyperplanes are still near pairwise orthogonal and convergence is extremely fast.

Kolmogorov-Arnold Network

The models explained in the previous sections have two names Generalized Additive Model (GAM) or Discrete Urysohn Operator (DUO). The latter is a finite difference approximation of the Urysohn integral equation.

KAN is a tree of GAMs or DUOs where outputs of some operators are the inputs of the others. The simplest case is the following $$ z = g [ f_1(x_1) + f_2(x_2) ], $$ where $z, x_1, x_2$ are tabulated values in dataset and $g, f_1, f_2$ are functions to estimate in training process. This elementary model is not strong enough, usually something more complex is required, such as this $$ y_1 = f_{1,1}(x_1) + f_{1,2}(x_2) $$ $$ y_2 = f_{2,1}(x_1) + f_{2,2}(x_2) $$ $$ z = g_{1}(y_1) + g_{2}(y_2). $$ The tree may have more than two layers, the same inputs are passed via different Urysohns, called inner blocks. Their outputs are arguments in one top block, called outer, which produce model of the target $z$.

The complexity of this model is deceptive. It is absolutely elementary and estimation of all functions can be resolved easily, quicky, effectively and accurately and it needs half page of explanation, few hundred lines in codes, and can be implemented by common programmer for a few hours.

We can start from a random initialization of all involved piecewise linear functions. Assume we have a dataset with input vectors. Having them given, we can compute intermediate values $y_1, y_2$ and apply few improvement steps to outer operator. However, we apply dual action. Since outer operator is piecewise linear, all functions have derivatives in their definition fields. Having them we can find how to modify intermediate values $y_1, y_2$ for obtaining better approximation to $z$. After some research we found (with my coauthor) that each intermediate value $y$ must get an addend $$ \frac{\Delta z}{n} \:\: \frac{dg}{dy}. $$ That has a similarity with back propagation in Multilayer Perceptrons (MLP). So, the end of the story is the following. For the current model we estimate the output $\hat z$, compute above addend for each individual $y_k$, update each inner operator and explained in the previous section and outer operator as well. That is it, quick simple and effective.

The fully functional code for piecewise linear version can be found by link at the top. The second example is a spline model which needs a little more explanation. I advice to start learning from piecewise linear version.

Several testing datasets are generated by three formulas $$ z = exp(sin(\pi x) + y^2) $$ $$ z = exp((sin(\pi (x_1^2 + x_2^2) ) + sin(\pi (x_3^2 + x_4^2) ))/2) $$ $$ y = \frac{2 + 2 x_3}{3\pi} \left[ {arctan}\left( 20( x_1 - \frac{1}{2} + \frac{x_2}{6})\exp\left(x_5\right) \right) + \frac{\pi}{2} \right] + \frac{2 + 2 x_4}{3\pi} \left[ {arctan}\left( 20( x_1 - \frac{1}{2} - \frac{x_2}{6})\exp\left(x_5\right) \right) + \frac{\pi}{2} \right] $$ and another one is testing for epistemic uncertainty. The targets are areas of triangles generated for three randomly selected points, the features are coordinates of the points. The accurate modeling for short training datasets is impossible, since features are not correlated to targets. The accuracy for both KAN and MLP is about 4 to 6%.


Urysohn modeling
Kolmogorov-Arnold network