
Idiot's Artificial Intelligence
Reference to spline model: IAI 0
Reference to piecewise linear model: IAI 1
Yes, that is right: this is an Idiot's guide to the Kolmogorov-Arnold Network (KAN), which is now recognized as a new form of
Artificial Intelligence (AI). Some people inflate the complexity of KAN; this article will show how wrong they are.
The only prerequisites for reading it are a bachelor's degree in any engineering field and basic coding skills.
Piecewise constant approximation
The starting point is the identification of several additive functions $f_i$ in the following model
$$ z = \sum_{i = 1}^n f_i (x_i). $$
Features $x_i$ and targets $z$ are tabulated numbers. Let all functions be piecewise constant and defined on
the same grid; then each function value $v$ can be found at runtime by an inline computed index as $v = f[(int)((x - x_{min}) / \Delta)]$. The set of sought
functions forms a matrix $F$, which can be addressed via the indexes $i$ and $x_i = (int)((x - x_{min}) / \Delta)$:
$$ z = \sum_{i = 1}^n F[i][x_i]. $$
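The lookup-and-sum above can be sketched in a few lines of Python. This is our own minimal illustration, not the author's code; the names `piecewise_constant_model`, `x_min` and `delta`, and the clamping of the index to the grid, are our assumptions.

```python
import numpy as np

def piecewise_constant_model(x, F, x_min, delta):
    """Evaluate z = sum_i F[i][index(x_i)] for one data record x.

    F is an n-by-m matrix of function values; x_min and delta define
    the uniform grid shared by all functions.
    """
    z = 0.0
    for i, xi in enumerate(x):
        j = int((xi - x_min) / delta)       # inline computed index
        j = min(max(j, 0), F.shape[1] - 1)  # clamp to the grid (our choice)
        z += F[i][j]
    return z
```

With `F = [[1, 2], [3, 4]]`, `x_min = 0` and `delta = 0.5`, the record `x = [0.1, 0.6]` selects columns 0 and 1 and sums `F[0][0] + F[1][1]`.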
We need to find the matrix elements for known partial sums and indexes, which is a trivial task of linear algebra. If we stretch all matrix elements
into a vector
$$ F^T = [ \:F_{1,1} \:F_{1,2} \:...\: F_{n, m}\:], $$
then the features form a sparse binary vector
$$ b^T = [\:0 \:0 \:0 \:1 \:0 \:...\: 0 \:0 \:1\:], $$
where the data records define the positions of the nonzero elements and the inner products $b^T F$ must equal $z$. The training can be performed by
projection descent (Kaczmarz 1937). The current approximation, which is a point (numerically a vector), is projected sequentially
onto each next hyperplane built on the new data and, since almost all of them
are pairwise orthogonal, the method converges very quickly to an approximate solution. Numerically, it needs to find the difference between the current
model and the observed output
$$ \Delta z = z - \sum_{i = 1}^n F[i][x_i], $$
divide it by the number of involved functions, which is $n$, and add this $\alpha \Delta z / n$ to each involved matrix element $F[i][x_i]$
($\alpha$ is a regularization parameter for computational stability).
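One projection-descent step can be sketched as follows. Again this is our own illustration under the same assumed grid parameters; the default value of `alpha` is an arbitrary choice.

```python
import numpy as np

def kaczmarz_update(F, x, z, x_min, delta, alpha=0.5):
    """One Kaczmarz step for the piecewise constant additive model:
    distribute the residual equally over the n involved matrix
    elements, scaled by the regularization parameter alpha.
    """
    n = F.shape[0]
    idx = [min(max(int((xi - x_min) / delta), 0), F.shape[1] - 1) for xi in x]
    residual = z - sum(F[i][j] for i, j in enumerate(idx))
    for i, j in enumerate(idx):
        F[i][j] += alpha * residual / n
    return F
```

Iterating this update over the records of the dataset (possibly for several passes) is the whole training procedure.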
Piecewise linear model
The piecewise constant model is obviously a rough approximation, and it also can't be used in KAN as is. It should be upgraded to at least piecewise
linear. The model, in this case, changes slightly. The matrix $F[i][x_i]$ now contains the node points of the functions, and the function value is
computed by linear interpolation. The index $(int)((x - x_{min}) / \Delta)$ defines the linear block and, given the relative distances
to the left and right nodes, the function value can be computed quickly. In the identification step the difference is distributed between the
left and right nodes according to these relative distances. This method is still projection descent (Kaczmarz 1937) for a system of linear
algebraic equations where, in the previously explained binary vector $b$, each nonzero value is replaced by a pair, such as $0.3, 0.7$ or
$0.2, 0.8$, representing relative distances within a certain linear segment (the sum in each pair is always 1). The hyperplanes
are still nearly pairwise orthogonal and convergence is extremely fast.
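A sketch of the piecewise linear version, combining evaluation and the node update in one step (our own illustration; the function name and the default `alpha` are assumptions):

```python
import numpy as np

def linear_eval_and_update(F, x, z, x_min, delta, alpha=0.5):
    """Evaluate the piecewise linear additive model by interpolating
    between the node points stored in F, then distribute the Kaczmarz
    correction between the left and right nodes by relative distances.
    Returns the model output computed BEFORE the update.
    """
    n, m = F.shape
    lefts, weights, z_hat = [], [], 0.0
    for i, xi in enumerate(x):
        p = (xi - x_min) / delta
        j = min(max(int(p), 0), m - 2)   # left node of the linear block
        w = p - j                        # relative distance from the left node
        z_hat += (1 - w) * F[i][j] + w * F[i][j + 1]
        lefts.append(j)
        weights.append(w)
    correction = alpha * (z - z_hat) / n
    for i, (j, w) in enumerate(zip(lefts, weights)):
        F[i][j] += (1 - w) * correction
        F[i][j + 1] += w * correction
    return z_hat
```

Repeated calls on the same record drive the residual toward zero, exactly as in the piecewise constant case.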
Kolmogorov-Arnold Network
The models explained in the previous sections have two names: Generalized Additive Model (GAM) and Discrete Urysohn Operator (DUO). The latter
is a finite difference approximation of the Urysohn integral equation.
KAN is a tree of GAMs or DUOs where the outputs of some operators are the inputs of others. The simplest case is the following
$$ z = g [ f_1(x_1) + f_2(x_2) ], $$
where $z, x_1, x_2$ are tabulated values in the dataset and $g, f_1, f_2$ are functions to estimate in the training process. This elementary model
is usually not expressive enough; something more complex is typically required, such as this
$$ y_1 = f_{1,1}(x_1) + f_{1,2}(x_2) $$
$$ y_2 = f_{2,1}(x_1) + f_{2,2}(x_2) $$
$$ z = g_{1}(y_1) + g_{2}(y_2). $$
The tree may have more than two layers; the same inputs are passed through different Urysohn operators, called inner blocks. Their outputs are the arguments
of one top block, called the outer block, which produces the model of the target $z$.
The complexity of this model is deceptive. It is absolutely elementary: the estimation of all functions can be resolved easily, quickly, effectively and
accurately; it needs half a page of explanation and a few hundred lines of code, and can be implemented by an average programmer in a few hours.
We can start from a random initialization of all involved piecewise linear functions. Assume we have a dataset with input vectors. Given them,
we can compute the intermediate values $y_1, y_2$ and apply a few improvement steps to the outer operator. However, we apply a dual action. Since the outer operator
is piecewise linear, all its functions have derivatives over their domains. Having those, we can find how to modify the intermediate values $y_1, y_2$
to obtain a better approximation to $z$. After some research we found (with my coauthor) that each intermediate value $y$ must get the addend
$$ \frac{\Delta z}{n} \:\: \frac{dg}{dy}. $$
This has a similarity with back propagation in Multilayer Perceptrons (MLP). So, the end of the story is the following. For the current model we
estimate the output $\hat z$, compute the above addend for each individual $y_k$, update each inner operator as explained in the previous section, and
update the outer operator as well. That is it: quick, simple and effective.
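The whole training step can be sketched as below. This is our own minimal reconstruction, not the authors' published code: the class `PLFunction`, the function `train_step`, the node ranges and the step size `alpha` are all our assumptions; only the update rule (Kaczmarz step on the outer block plus the $(\Delta z / n)\, dg/dy$ addend pushed to the inner blocks) follows the text.

```python
import numpy as np

class PLFunction:
    """Piecewise linear function on a uniform grid (minimal helper)."""
    def __init__(self, m, lo, hi, rng):
        self.nodes = rng.uniform(-1.0, 1.0, m)  # random initialization
        self.lo, self.delta = lo, (hi - lo) / (m - 1)

    def _locate(self, x):
        p = (x - self.lo) / self.delta
        j = int(min(max(p, 0.0), len(self.nodes) - 2))
        return j, p - j                         # left node, relative distance

    def value(self, x):
        j, w = self._locate(x)
        return (1 - w) * self.nodes[j] + w * self.nodes[j + 1]

    def derivative(self, x):
        j, _ = self._locate(x)
        return (self.nodes[j + 1] - self.nodes[j]) / self.delta

    def update(self, x, dv):
        j, w = self._locate(x)
        self.nodes[j] += (1 - w) * dv
        self.nodes[j + 1] += w * dv

def train_step(inner, outer, x, z, alpha=0.2):
    """One dual update for a two-layer KAN: Kaczmarz step on the outer
    operator plus the (dz/n) * dg/dy addend pushed to the inner ones."""
    y = [sum(f.value(xi) for f, xi in zip(row, x)) for row in inner]
    z_hat = sum(g.value(yk) for g, yk in zip(outer, y))
    dz = alpha * (z - z_hat)
    n = len(outer)
    for g, yk, row in zip(outer, y, inner):
        dy = (dz / n) * g.derivative(yk)        # the addend for y_k
        for f, xi in zip(row, x):
            f.update(xi, dy / len(row))         # split over inner functions
        g.update(yk, dz / n)                    # Kaczmarz step on outer block
    return z_hat
```

Looping `train_step` over the dataset records is the entire training procedure; for a single repeated record the residual contracts at every step.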
The fully functional code for the piecewise linear version can be found via the links at the top. The second example is a spline model, which needs a
little more explanation; I advise starting from the piecewise linear version.
Several testing datasets are generated by three formulas
$$ z = \exp(\sin(\pi x) + y^2) $$
$$ z = \exp((\sin(\pi (x_1^2 + x_2^2)) + \sin(\pi (x_3^2 + x_4^2)))/2) $$
$$
y = \frac{2 + 2 x_3}{3\pi} \left[ \arctan\left( 20 \left( x_1 - \frac{1}{2} + \frac{x_2}{6} \right) \exp(x_5) \right) + \frac{\pi}{2} \right] + \frac{2 + 2 x_4}{3\pi} \left[ \arctan\left( 20 \left( x_1 - \frac{1}{2} - \frac{x_2}{6} \right) \exp(x_5) \right) + \frac{\pi}{2} \right]
$$
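For reference, the three formulas can be turned into synthetic datasets as below. The sampling range $[0, 1]$ for all inputs and the function name `make_dataset` are our assumptions; the source does not state the input distribution.

```python
import numpy as np

def make_dataset(formula, size, rng):
    """Generate (features, target) samples for one of the three test
    formulas; inputs are drawn uniformly from [0, 1] (our assumption)."""
    if formula == 1:
        X = rng.uniform(0.0, 1.0, (size, 2))
        z = np.exp(np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2)
    elif formula == 2:
        X = rng.uniform(0.0, 1.0, (size, 4))
        z = np.exp((np.sin(np.pi * (X[:, 0]**2 + X[:, 1]**2))
                    + np.sin(np.pi * (X[:, 2]**2 + X[:, 3]**2))) / 2)
    else:
        X = rng.uniform(0.0, 1.0, (size, 5))
        a1 = np.arctan(20 * (X[:, 0] - 0.5 + X[:, 1] / 6) * np.exp(X[:, 4])) + np.pi / 2
        a2 = np.arctan(20 * (X[:, 0] - 0.5 - X[:, 1] / 6) * np.exp(X[:, 4])) + np.pi / 2
        z = (2 + 2 * X[:, 2]) / (3 * np.pi) * a1 + (2 + 2 * X[:, 3]) / (3 * np.pi) * a2
    return X, z
```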
and another one tests for epistemic uncertainty. The targets are the areas of triangles generated from three randomly selected points; the features are the coordinates
of the points. Accurate modeling on short training datasets is impossible, since the features are not correlated with the targets. The accuracy for both KAN and
MLP is about 4 to 6%.
References
Urysohn modeling
Kolmogorov-Arnold network
Preprint
 
