Urysohn Adaptive Filter

This is an interactive article by Andrew Polar and Mike Poluektov offering a quick demonstration of their novel nonlinear adaptive filtering algorithm. It can be used in the field of data modelling, including identification of nonlinear dynamic systems. The explanation is backed up by a numerical simulation, which is implemented in the JavaScript of this single HTML file. The report attached at the bottom of this article is the result of executing client-side JavaScript code, which can be inspected with the browser's view-source option. JavaScript must be enabled. Those who have disabled JavaScript for security reasons can save this file locally and open it in the browser.

Patent pending: "Method for identifying discrete Urysohn models", application 15/998381, filed 08/11/2018.

Quick introduction

The starting point in all data modelling is linear regression. It is usually tried as the first step, without great expectations:

$$y_i = \sum_{j=1}^m h_j \cdot x_{ij} = \mathbf{h^T x_i}.$$

$\mathbf{x_i}$ is the $i$-th input vector or data sample with elements $x_{ij}$, the scalars $y_i$ are outputs, and $\mathbf{h}$ is the model vector with elements $h_j$. If real-time modelling is applied, meaning identification proceeds along with data reading, the first choice is the Least Mean Squares (LMS) algorithm. It is assumed that the reader is familiar with its concept, so it is not repeated in this article. When the accuracy of the model does not meet expectations, it may be replaced by a nonlinear model with a corresponding change in the algorithm. The usual choices are Volterra LMS or Kernel LMS. These algorithms replace the original samples with elements $x_{ij}$ by new data. The elements in these new data samples may appear in different powers and in products (such as $x_{kn} \cdot x_{kp}$), but the model itself nonetheless stays linear with respect to the estimated parameters.
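For reference, a single step of the baseline linear algorithm can be sketched in a few lines of JavaScript. This is a minimal normalized-LMS sketch, not the code of the simulation below; the function name lmsStep, the learning rate mu, and the normalization term are illustrative choices of ours.

    // One step of a normalized LMS filter for the linear model y = h^T x.
    // h is the model vector (modified in place), x is the input sample,
    // y is the desired output; mu is an illustrative learning rate.
    function lmsStep(h, x, y, mu) {
        let yModel = 0.0;
        let norm = 1.0e-12;                  // guard against an all-zero input
        for (let j = 0; j < h.length; ++j) {
            yModel += h[j] * x[j];
            norm += x[j] * x[j];
        }
        const e = y - yModel;                // error of the current model
        for (let j = 0; j < h.length; ++j) {
            h[j] += mu * e * x[j] / norm;    // correction step
        }
        return e;
    }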

The authors of this article approached the improvement of the model from a completely different perspective. The idea is to replace each linear term in the above model by an unknown function and to identify all of these functions from the data:

$$y_i = \sum_{j=1}^m f_j (x_{ij}). $$

The only limitation imposed on the functions is piecewise linearity. The functions are continuous, defined on the entire interval $[x_{min}, x_{max}]$ of the argument $x$, and represented by several blocks of connected straight lines. The linear model above is a particular case of this one: when every function has only one linear block and the line in each of them passes through the origin, the latter turns into the linear model.

In the case of two inputs $\mathbf{x_i}$ and $\mathbf{y_i}$ the model becomes blockwise bilinear, and it can likewise be identified from samples of inputs and outputs, with the only limitation of blockwise bilinearity imposed on the estimated functions $g_j(\cdot,\cdot)$:

$$z_i = \sum_{j=1}^m g_j (x_{ij}, y_{ij}). $$

Quick explanation

There are two common steps in all adaptive filtering algorithms, both linear and nonlinear: computation of the model output and correction of the model. When these two steps are repeated for incoming data samples, the model is adjusted to produce the right result, provided the adaptation concept is basically true. The same two steps are involved in the identification of the two models introduced above. Let us start with the single-input model.

For the explanation we consider a function defined by a single straight line between argument points $x^{left}$ and $x^{right}$ and known by two function values $f^{left}$ and $f^{right}$. For every argument from this interval the function value is then computed as

$$f = f^{left} \cdot (1 - x) + f^{right} \cdot x,$$

where $x$ is the relative distance from the left abscissa of the interval. We call the values $1-x$ and $x$ the weight coefficients of the linear block and denote them $w^{left}$ and $w^{right}$. The distinct symbols are introduced to distinguish variables from their values, so when we write $x_{ik} = w^{left}$, we mean that the variable $x_{ik}$ equals the number $w^{left}$.
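As an illustration, the relative distance and the two weights can be computed as follows. This sketch assumes equally spaced nodes on $[x_{min}, x_{max}]$ (the blocks need not be equal in general), and the helper name blockWeights is ours, not taken from the DEMO code.

    // Locate the linear block containing x and compute its two weights,
    // assuming 'nodes' equally spaced nodes on [xmin, xmax].
    function blockWeights(x, xmin, xmax, nodes) {
        const step = (xmax - xmin) / (nodes - 1);
        let left = Math.floor((x - xmin) / step);
        left = Math.min(Math.max(left, 0), nodes - 2);  // clamp to the grid
        const rel = (x - (xmin + left * step)) / step;  // distance from the left node
        return { left: left, wLeft: 1.0 - rel, wRight: rel };
    }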

The rest of the identification algorithm is even more straightforward. The model comprises multiple functions, and each function may have multiple linear blocks. If we stretch all arguments and function values into vectors and place them into a table, along with the weights for a particular data sample, we can see the following:

              |              $f_1$                      | other functions |              $f_m$
    Arguments | $x^{1,1}$ $x^{1,2}$ $x^{1,3}$ $\ldots$  |    $\ldots$     | $x^{m,1}$ $x^{m,2}$ $x^{m,3}$ $\ldots$
    Functions | $f^{1,1}$ $f^{1,2}$ $f^{1,3}$ $\ldots$  |    $\ldots$     | $f^{m,1}$ $f^{m,2}$ $f^{m,3}$ $\ldots$
    Weights   | $0$ $w^{1,left}$ $w^{1,right}$ $\ldots$ |    $\ldots$     | $w^{m,left}$ $w^{m,right}$ $0$ $\ldots$

The rows of arguments and functions represent the current model; they are all numbers. The row of weights is built according to a particular input sample. The inner product of the weights and the function values must be equal to the output, and if it is not, the function values are corrected according to the classic LMS concept. Each argument falls into only a single linear interval, so only two weight coefficients are different from zero within each function block. That makes the weight rows for different data samples less linearly dependent than in linear adaptive filters and provides fast convergence. Obviously, the order of columns in the table is free; only the weight coefficients must be placed in the right positions. In the suggested software DEMO this table is not even built: the model is stored as a two-dimensional array and the values of the model are corrected as array elements. The table is brought up for explanation only.
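Put together, one adaptation step for the single-input model might look as in the following sketch, which stores the model as a two-dimensional array and reuses the hypothetical blockWeights helper from above. The learning rate mu and the normalization are again illustrative choices; with this normalization, mu = 1 corresponds to a Kaczmarz-type projection step.

    // One adaptation step for the model y = f_1(x_1) + ... + f_m(x_m).
    // model[j] is the array of node values of f_j on [xmin, xmax].
    function urysohnStep(model, x, y, xmin, xmax, mu) {
        const hits = [];
        let yModel = 0.0;
        let norm = 1.0e-12;
        for (let j = 0; j < model.length; ++j) {
            const b = blockWeights(x[j], xmin, xmax, model[j].length);
            yModel += model[j][b.left] * b.wLeft + model[j][b.left + 1] * b.wRight;
            norm += b.wLeft * b.wLeft + b.wRight * b.wRight;
            hits.push(b);
        }
        const e = y - yModel;                      // compare with the actual output
        for (let j = 0; j < model.length; ++j) {   // correct only the two nodes
            const b = hits[j];                     // touched in each function block
            model[j][b.left] += mu * e * b.wLeft / norm;
            model[j][b.left + 1] += mu * e * b.wRight / norm;
        }
        return e;
    }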

The model with two inputs is not much different in principle. According to the suggested concept, each function $g_j(\cdot,\cdot)$ is bilinear within multiple rectangular fragments of the domain $[x_{min}, x_{max}] \times [y_{min}, y_{max}]$. These rectangular fragments cover the entire domain without gaps or overlapping, and they need not all be equal in size. The arguments of each function $g_j$ fall into a single rectangular fragment. If we name the corner points of this fragment North-West, North-East, South-West and South-East, then the function value is computed by the bilinear formula:

$$g = NW + (NE - NW) \cdot x + (SW - NW) \cdot y + ((SE + NW) - (NE + SW)) \cdot x \cdot y.$$

Here $x$ is the relative distance from the West edge and $y$ is the relative distance from the North edge. The weight coefficients are defined in the following way:

$$w^{NW} = (1 - x) (1 - y)$$ $$w^{NE} = x (1 - y)$$ $$w^{SW} = (1 - x) y$$ $$w^{SE} = x y$$
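These four weights are straightforward to compute; the following helper (bilinearWeights is our name for this illustration) returns them for given relative distances. The inner product of the weights with the four corner values reproduces the bilinear formula above.

    // The four weights of a rectangular fragment for relative distances
    // x (from the West edge) and y (from the North edge).
    function bilinearWeights(x, y) {
        return {
            NW: (1 - x) * (1 - y),
            NE: x * (1 - y),
            SW: (1 - x) * y,
            SE: x * y
        };
    }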

The next step is obvious: a similar table can be arranged for the two-input model. The elements of the model can be stretched into a vector, the weight coefficients built from the inputs can also be placed into the table, and the inner product of the model and the weights must be equal to the output; if it is not, the model is corrected.

The bilinear model does not assume that all four corner points belong to the same plane, although they may in a particular case.
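For completeness, here is a sketch of one adaptation step for the two-input model, under the same assumptions as before (uniform grids, illustrative learning rate mu) and reusing the hypothetical blockWeights and bilinearWeights helpers. Each grid model[j] stores the corner values of $g_j$, with rows running North to South and columns West to East.

    // One adaptation step for the model z = g_1(x_1, y_1) + ... + g_m(x_m, y_m).
    // lim holds the domain limits { xmin, xmax, ymin, ymax }.
    function urysohnStep2(model, x, y, z, lim, mu) {
        const hits = [];
        let zModel = 0.0;
        let norm = 1.0e-12;
        for (let j = 0; j < model.length; ++j) {
            const g = model[j];
            const bx = blockWeights(x[j], lim.xmin, lim.xmax, g[0].length);
            const by = blockWeights(y[j], lim.ymin, lim.ymax, g.length);
            const w = bilinearWeights(bx.wRight, by.wRight);  // relative distances
            zModel += g[by.left][bx.left] * w.NW + g[by.left][bx.left + 1] * w.NE
                    + g[by.left + 1][bx.left] * w.SW + g[by.left + 1][bx.left + 1] * w.SE;
            norm += w.NW * w.NW + w.NE * w.NE + w.SW * w.SW + w.SE * w.SE;
            hits.push({ bx, by, w });
        }
        const e = z - zModel;
        for (let j = 0; j < model.length; ++j) {   // correct the four corners
            const { bx, by, w } = hits[j];         // of the active fragment
            const g = model[j];
            g[by.left][bx.left] += mu * e * w.NW / norm;
            g[by.left][bx.left + 1] += mu * e * w.NE / norm;
            g[by.left + 1][bx.left] += mu * e * w.SW / norm;
            g[by.left + 1][bx.left + 1] += mu * e * w.SE / norm;
        }
        return e;
    }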

Redundancy of the models

If we look at the table again, it is possible to notice that the sum of all columns within the same function block always gives a column vector with all elements equal to 1, independently of the data. The same is true for the bilinear model. That means that for any large set of input samples such a matrix has $m-1$ linearly dependent columns, and if we build such a system and apply the least-squares technique for estimation of the model vector, the matrix will always have incomplete rank. So when such a solution is sought, regularization is necessary. When adaptive filtering is applied, the minimum-norm solution can be obtained as well by setting the initial model to all zeros. From a practical perspective that means the models are redundant: if one solution exists, there is an infinity of other solutions with the same accuracy, and a difference in some coefficients of the model may be compensated by others.
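This property is easy to check numerically: within each function block the two non-zero weights are $1-x$ and $x$, so every block contributes exactly 1 to the column sum, whatever the input. For example, with the hypothetical blockWeights helper from above:

    // The two non-zero weights of any block always sum to 1.
    const b = blockWeights(0.37, 0.0, 1.0, 11);   // an arbitrary sample
    console.log(b.wLeft + b.wRight);              // 1 (up to floating-point rounding)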

Relation to integral equations

The model with the single input can be obtained by applying a quadrature formula to the continuous nonlinear integral equation of the Urysohn type

$$y(t) = \int_{0}^T U[x(t,s),s]ds, $$

which means that the suggested technique can be used for identification of the kernel of this operator in real time, by processing input/output sequences. The model with two inputs is also a discrete form of a Urysohn equation with a different kernel:

$$z(t) = \int_{0}^T U[x(t,s),y(t,s),s]ds. $$

That makes the suggested technique a convenient choice for identification of nonlinear dynamic systems. For dynamic systems the arguments $x$ and $y$ depend on the difference of $t$ and $s$, i.e. $x(t-s)$ and $y(t-s)$, which mathematically describes the specific properties of such systems, but the suggested algorithm works in the generic case.
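As a one-step illustration of this connection (using a simple rectangle rule, which is just one possible quadrature choice), discretizing the first equation at nodes $s_j$ with step $\Delta s$ gives

$$y(t_i) \approx \sum_{j=1}^m U[x(t_i,s_j),s_j] \, \Delta s = \sum_{j=1}^m f_j(x_{ij}), \quad f_j(\cdot) = U[\cdot,s_j] \, \Delta s,$$

which is exactly the single-input model introduced above, with one unknown function per quadrature node.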

Conclusion

The real-time identification of the introduced models falls into the category of adaptive filtering. The model must be defined upfront as a multidimensional grid structure. The input data not only define the coefficients used in the identification step, they also define the subset of grid elements to be modified, i.e. they are used for finding the addresses of the elements subjected to update. That is the principal difference of the suggested approach from anything the authors have seen so far.
Since the models are actually kernels of Urysohn equations, it is suggested to call this technique Urysohn Adaptive Filtering. This is a brief explanation and many important details are omitted; they are published in archived articles, available online:

Modelling of Non-linear Control Systems using the Discrete Urysohn Operator
Canonical block-oriented model

Many coding samples can be found on the personal site of the author of this article:

ezcodesample.com

There are also examples of identification of physical objects from recorded input/output values.

Numerical simulation

The numerical simulation is conducted for a single-input dynamic object of the Urysohn type. The input is a series of adjacent values of an assumed continuous signal. The operator kernel is a function of two variables, or a table in the discrete case. The identified values are shown as a rectangular matrix; each column of the table is a piecewise linear function given by the points sitting in its rows.

The simulation program is written in JavaScript and executed by the browser. It reruns with new data each time the browser is refreshed. The code can be obtained with the view-source option. The program generates a Urysohn operator and two input samples, computes the two corresponding outputs, uses one input/output realization for identification (it can be called the training sample), and uses the other realization (unseen data) for validation.

The accuracy of the model is estimated by a cost function using the modelled outputs $\hat{y}$ and the actual outputs $y$:

$$ error = \sqrt{\frac{\sum_{j=1}^N (y_j - \hat{y}_j)^2}{N}} \cdot \frac{1}{y_{max} - y_{min}}. $$

This computed error is shown in the report and is usually below one percent. The processes shown in the picture below are fragments of the input and the output. The modelled and given outputs match so well that there is no visible discrepancy between them, and for that reason the modelled output is not shown. The identified kernel can be seen as a 3D surface for visualization.
[Image: identified kernel shown as a 3D surface]
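The error measure reported above can be computed as in the following short sketch (relativeError is our name for this illustration; the DEMO code may organize this computation differently):

    // Relative RMSE between the actual outputs y and the modelled outputs yHat.
    function relativeError(y, yHat) {
        let sum = 0.0, ymin = y[0], ymax = y[0];
        for (let i = 0; i < y.length; ++i) {
            const d = y[i] - yHat[i];
            sum += d * d;
            ymin = Math.min(ymin, y[i]);
            ymax = Math.max(ymax, y[i]);
        }
        return Math.sqrt(sum / y.length) / (ymax - ymin);
    }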