Kolmogorov-Arnold network with cubic spline interpolation

Reference to published paper Kolmogorov-Arnold
Reference to published code NOT_MIT_KAN
Code version with function plotting

The univariate functions $U_i$, $f_{i,j}$ used in K-A model as adequate replacement of multivariate $M$ $$ M(x_1, x_2, x_3, ... , x_n) = \sum_{i=1}^{2n+1} U_i\left(\sum_{j=1}^{n} f_{i,j}(x_{j})\right) $$ can be presented in the different forms. When dataset is the result of observation of physical system, it has uncertainties due to unaccounted factors (called epistemic), features and targets may also be recorded with errors. On that reason piecewise linear representation is sufficient, since modeling errors are less than data inaccuracies. But there are some cases when data represent unknown but very accurate model and we wish to reduce modeling errors. It may be needed for using K-A for solving partial differential equations.

The piecewise linear version was published by authors of this page in March 2021. And there is also friendlier explanation by video.

Here I explain how to increase accuracy of modeling by using cubic splines for all involved functions $U_i$, $f_{i,j}$. The explanation has 3 sections. First is identification of a single univariate function using elementary dataset, which is sequence of values $x_i$, $y_i$, implemented in the same way as it is used in K-A. Second is identification of a single generalized additive model or discrete Urysohn operator presented by $$ z(x_1, x_2, ..., x_n) = \sum_{j=1}^n f_j (x_j) $$ and also implemented in a way used in K-A model. And the third is the method of combining multiple GAMs in a tree known as K-A model. The methods for all 3 sections are used in a form of nested dolls. Single $f$ is used for each $f_j$ in a GAM and GAMs are used as the nested objects in K-A. Programmatically, they are 3 classes, used as objects in variable length lists at run-time.

Univariate function

Dataset is a list of values $x_i$, $y_i$. It is assumed that at least approximate transformation $f(x_i) = y_i$ exists. We divide entire interval of $[x_{min}, x_{max}]$ into even segments. The more segments, the higher the accuracy, but the slower identification and there is a risk of overparameterization. Typical number is 5 segments. I will use 5 in explanation, not $N$, for making all as clear as possible.

We introduce 5 unit vectors $e_1^T = [1 0 0 0 0], e_2^T = [0 1 0 0 0], ..., e_5^T = [0 0 0 0 1]$ and make splines for each of them. It is very old and known method, there are tons of documentation, but I recommend James Keesling. My implementation in code is sitting in SplineGenerator. Two of these spline approximations are shown in the image below

I denote these spline approximations as $S_1, S_2, ... S_5$. Each $S_k$ is built for $e_k$. Now we consider these $S_k$ as basis functions and build approximation to $y_i$ as linear combination $$ \hat y_i = \sum_{k=1}^5 c_k S_k (x_{i}) $$ Vector of coefficients $C^T = [c_1 c_2 c_3 c_4 c_5]$ is updated for each new pair randomly selected $x_i$, $y_i$. The method is called Projection Descent. It is very quick and effective for near orthogonal data vectors $S^T = [S_1(x_i), S_2(x_i), S_3(x_i), S_4(x_i), S_5(x_i)]$. In the code it is Univariate.Update method.

This part can be researched and tested independently. The coding sample is published.

GAM or discrete Urysohn

Input or features are vectors ${X^i}^T = [x_{1,i} \: x_{2,i} \: x_{3,i} \: x_{4,i} \: x_{5,i}]$, the model is below $$ \hat z_i = \sum_{k=1}^n f_k (x_{k, i}). $$ The identification is iterative update of every function for each difference between model and dataset value $\Delta = z_i - \hat z_i$. The difference $\Delta$ is evenly divided between all above univariate functions and each of them is updated according to previous section. It is also very quick and effective algorithm published in article of the authors of this page. In the code it is Urysohn object.

This part can also be researched and tested independently. The coding sample is published.


We introduce intermediate variables $\theta_i $ and split Kolmogorov-Arnold representation into set of equations: $$ M = \sum_{i=1}^{2n+1} U_i\left(\theta_i\right) $$ $$ \theta_i = \left(\sum_{j=1}^{n} f_{i,j}(x_{j})\right) $$ We initialize model randomly. For each record we have difference between computed and actual value $\Delta M$. We divide this difference according to derivatives of the outer operator $$ D_i = \Delta M \frac{dU_i}{d\theta_i} $$ and pass these differences to inner operators, which are updated according to previous section. The outer operator is then also updated in the same way using $D_i$ as new inputs.

The theoretical part and proof of local convergence is published by authors also.

Some notes on research

In addition to what already has been cited we can provide more links KASAM: Spline Additive Models for Function Approximation and KAN: Kolmogorov-Arnold Networks.

Comparison test

The coding example on the top has a self test, which uses built in data generation, similar to latter example. Two formulas are used for generating datasets: $$ z = exp[sin(\pi x) + y^2] $$ and $$ z = exp\left[[sin(\pi (x_1^2 + x_2^2)) + sin(\pi (x_3^2 + x_4^2))] / 2 \right] $$ The concept published in KAN: Kolmogorov-Arnold Networks is different but the accuracy and performance for these two examples is about the same.

The entire code is very simple. It does not use any additional libraries or any programming tricks and is 710 lines long, including building splines which involves tridiagonal matrix inversion. It has room for run-time optimization, but is written in the most simple way possible for people to learn and practice the concept.