10 Modeling
In this section, the fourth main step of control-oriented Cluster-based Network Modeling (CNMc), i.e., modeling, is elaborated. The data and workflow are described in figure 10.1. It comprises two main sub-tasks: modeling the Centroid Position Evolution (CPE) and modeling the transition properties tensors \(\boldsymbol Q / \boldsymbol T\). The settings are, as usual, defined in settings.py and the extracted attributes are distributed to the sub-tasks. Modeling the CPE and the \(\boldsymbol Q/ \boldsymbol T\) tensors can be executed separately from each other. If the output of one of the two modeling sub-steps is at hand, CNMc is not forced to recalculate both modeling sub-steps. Since the tracked states are used as training data to run the modeling, they are prerequisites for both modeling parts. The modeling of the centroid position shall be explained in the upcoming subsection 10.0.1, followed by the explanation of the transition properties in subsection 10.0.2. A comparison between this CNMc version and the first CNMc version is provided at the end of the respective subsections. The results of both modeling steps can be found in sections 13 and 14.
10.0.1 Modeling the centroid position evolution
In this subsection, the modeling of the CPE is described. The objective is to find a surrogate model which returns all \(K\) centroid positions for an unseen \(\beta_{unseen}\). The training data for this are the tracked centroids from the previous step, which is described in section 9. To explain the modeling of the CPE, figure 10.2 shall be inspected. The model parameter values used to train the model, \(\vec{\beta_{tr}}\), are used for generating a so-called candidate library matrix \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\). The candidate library matrix \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is obtained by making use of a function of pySindy (Silva et al. 2020; Kaptanoglu et al. 2022; S. L. Brunton, Proctor, and Kutz 2016). In (S. L. Brunton, Proctor, and Kutz 2016) the term \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is explained well. However, in brief terms, it allows the construction of a matrix which comprises the output of defined functions. These functions could be, e.g., linear, polynomial, trigonometric or any other non-linear functions. Custom functions that include logical conditions can also be applied.
Since the goal is not to explain how to operate pySindy (S. L. Brunton, Proctor, and Kutz 2016), the curious reader is referred to pySindy's very extensive online documentation and (Silva et al. 2020; Kaptanoglu et al. 2022). Nevertheless, to understand \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\), equation 10.1 shall be considered. In this example, 3 different functions, denoted as \(f_i\) in the first row, are employed. The remaining rows are the output for the chosen \(f_i\). Furthermore, \(n\) is the number of samples or the size of \(\vec{\beta_{tr}}\), i.e., \(n_{\beta,tr}\), and \(m\) denotes the number of features, i.e., the number of functions \(f_i\). \[ \begin{equation} \boldsymbol{\Theta_{example(n \times m )}}(\,\vec{\beta_{tr}}) = \renewcommand\arraycolsep{10pt} \begin{bmatrix} f_1 = \beta & f_2 = \beta^2 & f_3 = \cos(\beta)^2 - \exp\,\left(\dfrac{\beta}{-0.856} \right) \\[1.5em] 1 & 1^2 & \cos(1)^2 - \exp\,\left(\dfrac{1}{-0.856} \right) \\[1.5em] 2 & 2^2 & \cos(2)^2 - \exp\,\left(\dfrac{2}{-0.856} \right) \\[1.5em] \end{bmatrix} \label{eq_20} \end{equation} \tag{10.1}\]
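To make the construction of equation 10.1 concrete, the following is a minimal sketch that builds the toy candidate library directly with NumPy; pySindy's feature library module offers equivalent functionality, and the sample values and function choices are purely illustrative.

```python
import numpy as np

# Toy training model parameter values (n = 2 samples, as in equation 10.1)
beta_tr = np.array([1.0, 2.0])

# The m = 3 candidate functions f_i from equation 10.1
candidate_functions = [
    lambda b: b,                                  # f_1 = beta
    lambda b: b**2,                               # f_2 = beta^2
    lambda b: np.cos(b)**2 - np.exp(b / -0.856),  # f_3
]

# Theta(beta_tr) has shape (n x m): one row per sample, one column per f_i
Theta = np.column_stack([f(beta_tr) for f in candidate_functions])
print(Theta)
```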
The actual candidate library matrix \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) incorporates a quadratic polynomial, the inverse \(\frac{1}{\beta}\), the exponential \(\exp(\beta)\) and 3 frequencies of \(\cos\) and \(\sin\), i.e., \(\cos(\vec{\beta}_{freq}), \ \sin(\vec{\beta}_{freq})\), where \(\vec{\beta}_{freq} = [1, \, 2,\, 3]\). Much more complex choices of \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) are available in CNMc and can be selected if desired. Nonetheless, the default \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is chosen as described above. Once \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is generated, the system of equations 10.2 is solved. Note, this is very similar to solving the well-known system of equations \(\boldsymbol A \, \vec{x} = \vec{y}\). The difference is that while the unknowns and targets can also be vectors \(\vec{x}, \, \vec{y}\) in the case of 10.2, in general they are the matrices \(\boldsymbol{X}\) and \(\boldsymbol Y\), respectively. The solution matrix \(\boldsymbol{X}\) is the desired output. It contains the coefficients which assign importance to the used functions \(f_i\). The matrix \(\boldsymbol Y\) contains the targets or the known output for the chosen functions \(f_i\). Comparing \(\boldsymbol A\) and \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) mathematically, no difference exists.
\[ \begin{equation} \boldsymbol{\Theta}\,(\vec{\beta_{tr}}) \: \boldsymbol X = \boldsymbol Y \label{eq_21} \end{equation} \tag{10.2}\]
Staying within the pySindy environment, the system of equations 10.2 is solved by means of the optimizer SR3, which is implemented in pySindy. Details and some advantages of the SR3 optimizer can be found in (Zheng et al. 2018). Nevertheless, two main points shall be stated. It is highly unlikely that \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}}),\: \boldsymbol X,\, \boldsymbol Y\) will form a well-posed problem, i.e., one where the number of equations equals the number of unknowns and a unique solution exists. In most cases the configuration will be ill-posed, i.e., the number of equations is not equal to the number of unknowns. In the latter case, two scenarios are possible: the configuration could result in an over- or under-determined system of equations.
For an over-determined system, there are more equations than unknowns. Thus, generally, no outcome that satisfies equation 10.2 exists. In order to find a representation that comes close to a solution, an error metric is defined as the objective function for optimization. There are many error metrics or norms; however, some commonly used ones (S. Brunton and Kutz 2019) are given in equations 10.3 to 10.5, where \(f(x_k)\) are the true values of a function and \(y_k\) are their corresponding predictions. The under-determined system has more unknown variables than equations, thus infinitely many solutions exist. To find one prominent solution, again, optimization is performed. Note, for practical applications, penalization or regularization parameters are exploited as additional constraints within the definition of the optimization problem. For more about over- and under-determined systems as well as deploying optimization to find a satisfying result, the reader is referred to (S. Brunton and Kutz 2019). \[ \begin{equation} E_{\infty} = \max_{1 \le k \le n} |f(x_k) -y_k | \quad \text{Maximum Error} \;(l_{\infty}) \label{eq_22} \end{equation} \tag{10.3}\]
\[ \begin{equation} E_{1} = \frac{1}{n} \sum_{k=1}^{n} |f(x_k) -y_k | \quad \text{Mean Absolute Error} \;(l_{1}) \label{eq_23} \end{equation} \tag{10.4}\]
\[ \begin{equation} E_{2} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} |f(x_k) -y_k |^2 } \quad \text{Least-squares Error} \;(l_{2}) \label{eq_24} \end{equation} \tag{10.5}\]
The aim of modeling the CPE is to obtain a regression model which is sparse, i.e., it is described through a small number of functions \(f_i\). For this to work, the coefficient matrix \(\boldsymbol X\) must be sparse, i.e., most of its entries are zero. Consequently, most of the used functions \(f_i\) would be inactive and only a few \(f_i\) are actively applied to capture the CPE behavior. The \(l_1\) norm as defined in 10.4 and the \(l_0\) norm are metrics which promote sparsity. In simpler terms, they are leveraged to find only a few important and active functions \(f_i\). The \(l_2\) norm as defined in 10.5 is known for its opposite effect, i.e., it assigns importance to a high number of \(f_i\). The SR3 optimizer is a sparsity-promoting optimizer, which deploys \(l_0\) and \(l_1\) regularization.
The second point which shall be mentioned about the SR3 optimizer is that it can cope with over- and under-determined systems and solves them without any additional input. One important note regarding the use of pySindy is that, in this thesis, pySindy is not used in the way it commonly is. For modeling the CPE, only the modules for generating the candidate library matrix \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) and the SR3 optimizer are utilized.
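As a rough sketch of this workflow, the snippet below uses pySindy's SR3 optimizer as a standalone sparse regressor for equation 10.2; the toy data, the threshold value and the `thresholder` argument (as exposed by pySindy 1.x) are assumptions, not the thesis configuration.

```python
import numpy as np
import pysindy as ps

# Toy stand-ins for the candidate library Theta (n x m) and targets Y
rng = np.random.default_rng(0)
Theta = rng.random((50, 8))
x_true = np.array([[1.5], [0.0], [0.0], [0.0], [-0.7], [0.0], [0.0], [0.0]])
Y = Theta @ x_true                     # targets generated by a sparse X

# SR3 with l1 regularization promotes sparsity in the coefficient matrix
opt = ps.SR3(threshold=0.1, thresholder="l1")
opt.fit(Theta, Y)
X_coef = opt.coef_.T                   # coefficient matrix X of equation 10.2
```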
Going back to the data and workflow in figure 10.2, it has been explained how the candidate library matrix \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is generated, how it is passed to pySindy and how SR3 is used to find a solution. It can be observed that \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is also passed to a Linear and an Elastic net block. The Linear block is used to solve the system of equations 10.2 through linear regression. The Elastic net solves the same system of equations with the elastic net approach, in which the optimization is penalized with both an \(l_1\) and an \(l_2\) norm. In other words, it combines Lasso (Tibshirani 1996; S. Brunton and Kutz 2019) and Ridge (S. Brunton and Kutz 2019) regression, respectively. The linear and elastic net solvers are invoked from the Scikit-learn (Pedregosa et al. 2011) library.
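The Linear and Elastic net blocks can be sketched analogously with Scikit-learn, reusing the toy `Theta` and `Y` from the SR3 sketch above; `alpha` and `l1_ratio` are illustrative values, not the defaults used in CNMc.

```python
from sklearn.linear_model import LinearRegression, ElasticNet

# Linear block: plain least-squares solution of Theta X = Y
linear = LinearRegression().fit(Theta, Y)

# Elastic net block: l1- and l2-penalized regression (Lasso + Ridge)
enet = ElasticNet(alpha=1e-3, l1_ratio=0.5).fit(Theta, Y)
```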
The next step is not depicted in figure 10.2: the linear regression model is built with the full data. For pySindy and the elastic net, the models are trained with \(90 \%\) of the training data and the remaining \(10 \%\) are used to test or validate the model. For pySindy, \(20\) different models with linearly distributed thresholds starting from \(0.1\) and ending at \(2.0\) are generated. The model with the least mean absolute error 10.4 is selected as the pySindy model. The mean absolute errors of the linear, the elastic net and the selected pySindy model are compared against each other. The regression model with the lowest mean absolute error is selected as the final model.
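A hedged sketch of this selection loop, continuing the toy `Theta`, `Y` and `ps` names from the sketches above, could look as follows; the 90/10 split and the threshold grid match the description, while everything else is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 90/10 split of the training data
Theta_tr, Theta_te, Y_tr, Y_te = train_test_split(Theta, Y, test_size=0.1)

# 20 SR3 models with linearly distributed thresholds in [0.1, 2.0]
best_mae, best_opt = np.inf, None
for threshold in np.linspace(0.1, 2.0, 20):
    opt = ps.SR3(threshold=threshold, thresholder="l1")
    opt.fit(Theta_tr, Y_tr)
    mae = mean_absolute_error(Y_te, Theta_te @ opt.coef_.T)
    if mae < best_mae:                  # keep the model with the lowest MAE
        best_mae, best_opt = mae, opt
```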
The described process is executed multiple times. In 3 dimensions, the location of a centroid is given by its coordinates along the 3 axes. Since the CPE across the 3 different axes can deviate significantly, capturing the entire behavior in one model would require a complex model. A complex model, however, is not sparse anymore. Thus, a regression model for each of the \(K\) labels and for each of the 3 axes is required. In total, \(3 \, K\) regression models are generated.
Finally, first CNMc and the present CNMc shall be compared. First, first CNMc used only pySindy, with a different built-in optimizer. Second, the CPE modeling was specifically designed for the Lorenz system 7.1. Third, first CNMc entirely relies on pySindy; no linear and elastic net models are calculated and used for comparison. Fourth, first CNMc performed prediction by transforming the active \(f_i\) with their coefficients into equations such that SymPy could be applied. The disadvantage is that if \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) is changed, modifications for SymPy are necessary. Also, \(\boldsymbol{\Theta}\,(\vec{\beta_{tr}})\) can be used with arbitrarily defined functions \(f_i\), whereas SymPy is restricted to some predefined functions. In CNMc it is also possible to get the active \(f_i\) as equations. However, the prediction is obtained with a regular matrix-matrix multiplication as given in equation 10.6. The variables are denoted as the predicted outcome \(\boldsymbol{\tilde{Y}}\), the testing data for which the prediction is desired \(\boldsymbol{\Theta_s}\) and the coefficient matrix \(\boldsymbol X\) from equation 10.2. \[ \begin{equation} \boldsymbol{\tilde{Y}} = \boldsymbol{\Theta_s} \, \boldsymbol X \label{eq_25} \end{equation} \tag{10.6}\]
By leveraging equation 10.6, the limitations imposed through SymPy are removed.
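A minimal, self-contained illustration of equation 10.6 follows; the function choices and coefficients are made up for demonstration.

```python
import numpy as np

# Candidate functions f_i (same toy choices as in the Theta sketch)
candidate_functions = [
    lambda b: b,
    lambda b: b**2,
    lambda b: np.cos(b)**2 - np.exp(b / -0.856),
]
X_coef = np.array([[1.2], [0.0], [-0.4]])    # toy solution of equation 10.2

# Evaluate the library at unseen beta values and predict via equation 10.6
beta_unseen = np.array([1.5, 2.5])
Theta_s = np.column_stack([f(beta_unseen) for f in candidate_functions])
Y_tilde = Theta_s @ X_coef                   # no SymPy involved
```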
10.0.2 Modeling Q/T
In this subsection, the goal is to explain how the transition properties are modeled. The transition properties are the two tensors \(\boldsymbol Q\) and \(\boldsymbol T\), which consist of the transition probabilities from one centroid to another and the corresponding transition times, respectively. For further details about the transition properties, the reader is referred to section 4. Modeling \(\boldsymbol Q / \boldsymbol T\) means finding surrogate models that capture the trained behavior and can predict the tensors for unseen model parameter values, \(\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})\). To go through the data and workflow, figure 10.3 shall be considered.
First, the tracked state data is loaded and adapted such that CNM's data format is obtained. After that, Cluster-based Network Modeling (CNM) can be executed on the tracked state data. The outcome of CNM are the transition property tensors for all the provided model parameter values, \(\boldsymbol Q(\vec{\beta}) ,\, \boldsymbol T(\vec{\beta})\). However, CNM does not return the tensors as NumPy (Harris et al. 2020) arrays, but as Python dictionaries. Thus, the next step is to transform the dictionaries into NumPy arrays. \(\boldsymbol Q / \boldsymbol T\) are highly sparse, i.e., \(85 \% - 99\%\) of the entries can be zero. The \(99\%\) case is seen with a great model order, which for the Lorenz system 7.1 was found to be \(L \approx 7\). Furthermore, with an increasing \(L\), saving the dictionaries as NumPy arrays becomes inefficient and at some \(L\) impossible. With \(L>7\) the storage cost exceeds multiple hundreds of gigabytes of RAM. Therefore, the dictionaries are converted into sparse matrices.
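As a hedged sketch of the dictionary-to-sparse-matrix conversion (the dictionary layout below is an assumption for illustration; CNM's actual format may differ):

```python
from scipy import sparse

# Toy transition probability dictionary: (from_cluster, to_cluster) -> prob
K = 10
Q_dict = {(0, 1): 0.75, (0, 3): 0.25, (1, 2): 1.0}

# Build a sparse matrix instead of a dense K x K NumPy array
Q = sparse.dok_matrix((K, K))
for (i, j), prob in Q_dict.items():
    Q[i, j] = prob
Q = Q.tocsr()   # compressed sparse row format for storage and algebra
```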
Thereafter, the sparse matrices are reshaped or stacked into a single matrix, such that a modal decomposition method can be applied. This is followed by training a regression model for each of the mode coefficients. The idea is that the regression models receive a \(\beta_{unseen}\) and return all the corresponding predicted modes. The regression models are saved and, if desired, plots can be enabled via settings.py.
In this version of CNMc, two modal decomposition methods are available, namely the Singular Value Decomposition (SVD) and the Non-negative Matrix Factorization (NMF). The difference between the two is given in (Lee and Seung 1999). The SVD is stated in equation 10.7, where the variables are designated as the input matrix \(\boldsymbol A\) which shall be decomposed, the left singular matrix \(\boldsymbol U\), the diagonal singular matrix \(\boldsymbol{\Sigma}\) and the right singular matrix \(\boldsymbol V^T\). The singular matrix \(\boldsymbol{\Sigma}\) is conventionally ordered in descending order such that the highest singular value is the first diagonal element. The intuition behind the singular values is that they assign importance to the modes in the left and right singular matrices \(\boldsymbol U\) and \(\boldsymbol {V^T}\), respectively. \[ \begin{equation} \boldsymbol A = \boldsymbol U \, \boldsymbol{\Sigma} \, \boldsymbol {V^T} \label{eq_26} \end{equation} \tag{10.7}\]
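A minimal NumPy sketch of equation 10.7 and of truncating the result to \(r\) modes, as discussed next; the matrix is a random stand-in for the stacked transition-property data.

```python
import numpy as np

# Toy stand-in for the stacked Q/T matrix
A = np.random.default_rng(1).random((200, 50))

# Economical SVD: zero singular values and matching columns are dropped
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-r truncation reuses the same decomposition, no recomputation needed
r = 10
A_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
```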
The big advantage of the SVD is observed when the so-called economical SVD is calculated. The economical SVD removes all zero singular values, thus the dimensionality of all 3 matrices can be reduced. Moreover, with the economical SVD as a basis, the output for any number of modes \(r\) is available. There is no need to perform any additional SVD to get the output for \(r\) modes; rather, the economical SVD is truncated at the number \(r\) for this purpose, as in the sketch above. NMF, given in equation 4.5, on the other hand, has the disadvantage that there is no such thing as an economical NMF. For every change in the number of modes \(r\), a full NMF must be recalculated.
\[ \begin{equation} \boldsymbol {A_{i \mu}} \approx \boldsymbol A^{\prime}_{i \mu} = (\boldsymbol W \boldsymbol H)_{i \mu} = \sum_{a = 1}^{r} \boldsymbol W_{ia} \boldsymbol H_{a \mu} \tag{4.5} \end{equation} \]
The issue with NMF is that the solution is obtained through an iterative optimization process. The number of iterations can be in the order of millions and higher to meet the convergence criteria. Because the optimal \(r_{opt}\) depends on the dynamical system, there is no general rule for acquiring it directly. Consequently, NMF must be run with multiple different \(r\) values to find \(r_{opt}\). Apart from the mentioned parameter study, one single NMF execution was found to be more computationally expensive than SVD. In (Pierzyna 2021) NMF was found to be the performance bottleneck of first CNMc, which became more evident when \(L\) was increased. In subsection 14.0.1 a comparison between NMF and SVD regarding computational time is given.
Nevertheless, if the user wants to apply NMF, only one attribute in settings.py needs to be modified. Because of that and the overall modular structure of CNMc, the implementation of any other decomposition method should be straightforward. In CNMc the study for finding \(r_{opt}\) is automated and thus testing CNMc on various dynamical systems with NMF should be manageable. The benefit of applying NMF is that the entries of the output matrices \(\boldsymbol W_{ia},\, \boldsymbol H_{a \mu}\) are all non-negative. This enables interpreting the \(\boldsymbol W_{ia}\) matrix, since both \(\boldsymbol Q / \boldsymbol T\) tensors cannot contain negative entries, i.e., no negative probabilities and no negative transition times.
Depending on whether NMF or SVD is chosen, \(r_{opt}\) is found through a parameter study or based on \(99 \%\) of the information content, respectively. The \(99 \%\) condition is met when the leading \(r_{opt}\) modes add up to \(99 \%\) of the total sum of the modes. For SVD, \(r_{opt}\) is automatically detected and does not require any new SVD execution. A comparison between SVD and NMF regarding prediction quality is given in section 14.0.2. After the decomposition has been performed, modes that capture characteristic information are available. If the modes can be predicted for any \(\beta_{unseen}\), the predicted transition properties \(\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})\) are obtained. For this purpose, CNMc has 3 built-in regression methods, namely Random Forest (RF), AdaBoost and Gaussian Process.
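For the SVD branch, the \(99\%\) criterion can be sketched as below, reusing the singular values `S` from the SVD sketch above; interpreting the information content as the cumulative sum of singular values is an assumption.

```python
import numpy as np

# Smallest r_opt whose leading singular values reach 99% of the total sum
energy = np.cumsum(S) / np.sum(S)
r_opt = int(np.searchsorted(energy, 0.99)) + 1
```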
First, RF is based on decision trees, but additionally deploys a technique called bootstrap aggregation or bagging. Bagging creates multiple sets from the original dataset, which are equivalent in size. However, some features are duplicated in the new datasets, whereas others are neglected entirely. This allows RF to approximate very complex functions and reduces the risk of overfitting, which is encountered commonly with regular decision trees. Moreover, it is such a powerful tool that, e.g., Kilian Weinberger, a well-known professor for machine learning at Cornell University, considers RF in one of his lectures to be one of the most powerful regression techniques that the state of the art has to offer. Furthermore, RF proved to be able to approximate the training data acceptably well, as shown in (Pierzyna 2021). However, as mentioned in subsection 4.0.1, it faced difficulties approximating spike-like curves. Therefore, it was desired to test alternatives as well.
The two alternatives were chosen to be AdaBoost and Gaussian Process. Both methods are well recognized and used in many machine learning applications. Thus, instead of motivating them and giving theoretical explanations, the reader is referred to (CAO et al. 2013) and (Rasmussen 2004; Bishop 2006) for AdaBoost and Gaussian Process, respectively. As for the implementation, all 3 methods are invoked through Scikit-learn (Pedregosa et al. 2011). The weak learner for AdaBoost is Scikit-learn's default decision tree regressor. The kernel utilized for the Gaussian Process is the Radial Basis Function (RBF). A comparison of these 3 methods in terms of prediction capabilities is provided in section 14.0.2.
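A self-contained sketch of the 3 regression methods as invoked through Scikit-learn; the toy data and the hyperparameters (library defaults plus an RBF kernel) are illustrative, not the thesis settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy training data: model parameter values and one mode coefficient
beta_train = np.linspace(25.0, 35.0, 11).reshape(-1, 1)
mode_coeff = np.sin(beta_train).ravel()

models = {
    "RF": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),       # default weak learner: decision tree
    "GP": GaussianProcessRegressor(kernel=RBF()),
}
for name, model in models.items():
    model.fit(beta_train, mode_coeff)
    print(name, model.predict([[28.3]]))   # prediction for a beta_unseen
```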
Since the predicted \(\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})\) are based on regression techniques, their output will deviate somewhat from the original \(\boldsymbol{Q}(\vec{\beta}_{unseen}) ,\, \boldsymbol{T}(\vec{\beta}_{unseen})\). Due to that, the physical requirements given in equations 10.8 may be violated.
\[ \begin{equation} \begin{aligned} 0 \leq \, \boldsymbol Q \leq 1\\ \boldsymbol T \geq 0 \\ \boldsymbol Q \,(\boldsymbol T > 0) > 0 \\ \boldsymbol T(\boldsymbol Q = 0) = 0 \end{aligned} \label{eq_31_Q_T_prediction} \end{equation} \tag{10.8}\]
To manually enforce these physical constraints, the rules defined in equation 10.9 are applied. The smallest allowed probability is defined to be 0, thus negative probabilities are set to zero. The biggest probability is 1, hence, overshooting values are set to 1. Also, negative transition times would result in moving backward, therefore, they are set to zero. Furthermore, it is important to verify that a probability is zero if its corresponding transition time is less than or equal to zero. In general, the deviation is in the order of \(\mathcal{O}(1 \mathrm{e}{-2})\), such that the modification following equation 10.9 can be considered reasonable.
\[ \begin{equation} \begin{aligned} & \boldsymbol Q < 0 := 0 \\ & \boldsymbol Q > 1 := 1 \\ & \boldsymbol T < 0 := 0\\ & \boldsymbol Q \, (\boldsymbol T \leq 0) := 0 \end{aligned} \label{eq_32_Rule} \end{equation} \tag{10.9}\]
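A minimal sketch of applying the rules of equation 10.9 to dense NumPy arrays; the function name is made up for illustration.

```python
import numpy as np

def enforce_physical_constraints(Q_pred, T_pred):
    """Clip predicted Q/T entries to the ranges required by equation 10.9."""
    Q = np.clip(Q_pred, 0.0, 1.0)   # 0 <= Q <= 1
    T = np.maximum(T_pred, 0.0)     # T >= 0
    Q[T <= 0.0] = 0.0               # Q(T <= 0) := 0
    return Q, T
```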
In conclusion, it can be said that for modeling \(\boldsymbol Q / \boldsymbol T\), CNMc is equipped with two different modal decomposition methods, SVD and NMF. To choose between them, one attribute in settings.py needs to be modified. The application of NMF is automated through the integrated parameter study. For the mode surrogate models, 3 different regression methods are available. Selecting between them is kept convenient, i.e., by editing one property in settings.py.