Prediction of User Throughput in the Mobile Network Along the Motorway and Trunk Road

The main goal of this research is to create a machine learning model for predicting user throughput in the mobile 4G network of the network provider M:tel Banja Luka, Bosnia and Herzegovina. The geographical area of the research is limited to the section of Motorway "9th January" (M9J) Banja Luka - Doboj, between the node Johovac and the town of Prnjavor (P-J section), and the area of the section of trunk road M17, between the node Johovac and the town of Doboj (J-D section). Based on the set of collected data, several models based on machine learning techniques were trained and tested together with the application of the Correlation-based Feature Selection (CFS) method to reduce the space of input variables. The test results showed that the models based on k-Nearest Neighbors (k-NN) have the lowest relative prediction error, for both sections, while the model created for the trunk road section has significantly better performance.


Introduction
The development of wireless network technologies, the enormous increase in the number of mobile applications and the expansion of the range of telecommunications services have created a need for continually improving network performance. Today, telecommunications service providers and network applications have the task of providing a reliable and secure connection for users in all geographical areas and at all times, and appropriate throughput in the network to meet the growing need for streaming services. These requirements primarily apply to areas around important roads, such as motorways and trunk roads. Special attention is paid to the throughput in downlink (DL) traffic, which makes up the largest part of generated network traffic. To meet customer needs and improve the Quality of user Experience (QoE), telecommunications providers must predict key performance indicators, such as throughput, which determine the direction of development and expansion of network capacity and resources in the future. This task is solved by predictive modeling. Prediction is defined as a foresight from the present to the future based on data from the past.
For the case study in this research, the selected geographical area comprises the road zone of Motorway "9th January" (M9J), Banja Luka-Doboj section, between the town of Prnjavor and the Johovac node (P-J section), and the area of the M17 trunk road section between the Johovac node and the town of Doboj (J-D section). The observed sections are of great importance in the road network of the Republic of Srpska and Bosnia and Herzegovina. The M9J Banja Luka-Doboj is a key road connecting the western and eastern parts of the Republic of Srpska, and the M17 trunk road is one of the busiest roads in Bosnia and Herzegovina. The geographical area of the research is covered by the 4G Long Term Evolution (LTE) telecommunications network managed by the M:tel BL provider.
The main goal of the research is to create models based on machine learning techniques for the prediction of the average user traffic throughput in the M:tel network in the observed geo-area. The sections are exposed to different conditions: different average speeds of users' vehicles, different number of handovers, different concentrations of users, different cell sizes, etc. Therefore, to increase prediction accuracy, it is necessary to create predictive models separately, for each section individually.
The paper consists of five sections. After the introduction (Section 1), Section 2 deals with the scientific background of the research. Section 3 describes the materials and research methods. The main emphasis is on Section 4, which provides the most important research results and discussion. Concluding remarks are given in Section 5. The paper ends with a list of references.

Scientific background of the research
Throughput is one of the key performance indicators (KPI) of any communication network and can be defined as the speed at which data can be transferred through the network between two endpoints [1]. According to [2], the average throughput, which is the focus of this research as the target (dependent) variable, is usually defined as the ratio of the average number of bits sent (or received) to the average duration of data transmission. Throughput depends on a number of factors, such as signal strength, level of interference, number of users in the network, noise, location, etc. As a result, it fluctuates frequently in the mobile network, and these fluctuations are more pronounced than in wired networks, especially at high speeds of mobile devices [3,4]. This phenomenon is even more pronounced in the area around roads due to high mobility and traffic dynamics, i.e. a variable number of users. On the other hand, users' need for a permanent internet connection and the increasingly widespread use of video streaming services have created the need for forecasting and modeling future throughput. Therefore, the prediction of the average throughput in the telecommunications network, as the subject of this research, is a very important task with several goals: to satisfy the estimated demand for services, to provide a certain level of Quality of Service (QoS) and QoE, to reduce costs, and to plan future system expansions or reserve capacity. It is thus crucial to estimate future network performance based on existing data. This engineering task is solved in this research by creating predictive models based on different machine learning techniques.
Throughput prediction in the cellular network at locations related to roads is the subject of numerous studies. In the previous period, a large number of scientific papers referring to this topic were published. According to [3], methods for predicting user throughput can be divided into three groups: methods based on formulas, methods based on historical data and methods based on machine learning techniques. In [5], the authors presented the Random Forest (RF) model for throughput prediction in the LTE network, emphasizing its application in maintaining a reliable connection of autonomous vehicles with infrastructure. The same machine learning model was proposed in [6] for LTE network throughput prediction, for different mobility scenarios. Additionally, the RF model was created in [7] to predict Video Streaming throughput. As a result of research in [8], different models were created based on Deep Neural Network (DNN) approaches, which enable throughput prediction in areas where there is no previous data on mobile network performance. When using a Live Streaming service, especially at high vehicle speeds, frequent fluctuations of Uplink connection throughput occur, which causes service delays. As a possible solution to this problem, the authors in [9] suggest PERCEIVE, a bandwidth prediction framework based on the Long Short-Term Memory (LSTM) model. Throughput prediction in data transmission between vehicles in future 6G networks is the subject of research in [10]. For this purpose, the authors created several models: Artificial Neural Network (ANN), RF and Support Vector Machine (SVM). Ur Rehman et al. in [11] modeled downlink throughput in the LTE network based on several independent variables related to the conditions of radio networks (traffic) using multilayer neural networks. 
The paper [12] presents the connection between road traffic and mobile networks in urban areas through predictive models of traffic flows, which were created based on mobile network signal data. In [13], the authors presented several machine learning and deep learning models for throughput prediction, which is crucial for delay reduction in online streaming services. Palaios et al. in the paper [14] investigate the influence of spatial, temporal and network characteristics on the prediction of throughput in the uplink and downlink direction.

Materials and methods
Following the basic goal of the research, the paper uses the Data-Driven Prediction approach to create predictive models. The use of this approach has expanded in recent years along with the enormous increase in the availability of Big Data. Owing to their ability to learn from data, machine learning models establish connections between dependent (output) and independent (input) variables and, based on "learned" functions, generate output values for given inputs. The most well-known machine learning techniques are ANN, Decision Trees, SVM, k-Nearest Neighbors (k-NN), etc. The data-driven prediction approach implies the existence of a data set that is divided into two parts: a data set for training and a data set for model testing. Therefore, data collection is the first step of the methodological research procedure, which is shown as an algorithm in Figure 1. In the second step of the research process, statistical methods are applied. First, a correlation analysis is performed, and then, in order to simplify the model, the Correlation-based Feature Selection method is applied to reduce the space of input variables. Based on the selected subset of independent variables, predictive models are trained and tested, after which the prediction results are compared. The ultimate goal is to choose the model that gives the most accurate prediction.
Figure 2 shows the geo-area of the research. The M9J Banja Luka-Doboj section from Prnjavor to the Johovac node is 35 km long and is marked in blue in Figure 2. The 12 km section of the M17 trunk road between the Johovac node and the town of Doboj is marked in red in the same figure. Research data were obtained from the mobile operator M:tel based on a previously submitted official request specifying the necessary variables [15]. The obtained database for the LTE network contains data from a total of 71053 measurements.
The values of the variables were registered over a period of 30 days (from 15 December 2020 to 15 January 2021), with a sampling interval of one hour [15]. The data are structured into input/output vectors, where 17 independent variables, listed in Table 1, represent the input part, and the dependent variable, average user throughput (USER_DL_TR), represents the output part of the vector [15]. In addition to the names of the variables, Table 1 also provides abbreviated labels (V…) used further in the paper. By extracting data from the obtained M:tel database, a set of 9886 input/output vectors was formed for the P-J section, while 2301 input/output vectors were selected for the J-D section. This difference in the number of vectors between the two sets is due to the different lengths of the sections, and thus the different number of cells covering them.

Correlation Analysis
The linear correlation between two variables is quantified by the Pearson correlation coefficient (r), which is defined as the ratio of the covariance of the two variables to the product of their standard deviations. The values of the Pearson correlation coefficient range from -1, which represents a perfect negative linear correlation, to 1, which represents a perfect positive linear correlation. A value equal to zero means that there is no linear correlation between the variables. Any value from the specified interval can be interpreted according to the scale shown in Table 2.
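As an illustration of the definition above, the following minimal sketch computes the Pearson coefficient directly from the covariance and the standard deviations. The function name `pearson_r` and the sample values are hypothetical and are not taken from the M:tel data set.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the variances cancel, so sums suffice.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)

# Hypothetical hourly samples: an input variable and the average user throughput.
load = [10, 20, 30, 40, 50]
throughput = [40.0, 35.0, 30.0, 25.0, 20.0]
r = pearson_r(load, throughput)
print(r)
```

In this made-up example the throughput falls linearly as the input grows, so the coefficient sits at the negative end of the interval.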

Correlation-based Feature Selection (CFS) Method
The CFS method reduces the number of input (independent) variables on the basis of the previously performed correlation analysis, in order to simplify the machine learning model. Pearson correlation coefficients help identify independent variables that may have a stronger influence on the dependent variable. Thus, a higher correlation coefficient means that the observed independent variable can be considered a strong predictor of the dependent variable [17]. According to [18], a set of variables is representative for the prediction model if, in addition to a strong correlation between independent and dependent variables, the correlation among the independent variables is as low as possible. This trade-off is expressed by the heuristic evaluation function Merit:
$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$
where $M_S$ is the heuristic evaluation function of the subset $S$ containing $k$ variables, $\overline{r_{cf}}$ is the arithmetic mean of the correlations between the independent variables and the dependent variable, and $\overline{r_{ff}}$ is the arithmetic mean of the correlations between the independent variables.
There are three heuristic strategies for finding the best subset (the one with the largest Merit): forward selection, backward elimination, and best first [17]. This paper uses the forward selection strategy, which is presented as an algorithm in Figure 3. The algorithm starts with an empty subset of variables and with the correlation coefficients ranked from the largest to the smallest. In the next step, the independent variable with the highest correlation coefficient with the dependent variable is added to the existing subset and Merit is calculated. The resulting value is saved. If the subset does not yet contain all the variables of the initial set, the algorithm returns to step 2 and adds the next variable to the subset. When all the variables of the initial set have been added, the subset with the highest value of the heuristic Merit function is selected, and the algorithm ends.
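The forward selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: absolute values of the correlation coefficients are assumed, the mean inter-variable correlation of a single-variable subset is taken as zero, and the variable names `A`, `B`, `C` and all correlation values are hypothetical.

```python
from math import sqrt

def merit(subset, r_cf, r_ff):
    """Merit of a subset: k * mean|r_cf| / sqrt(k + k(k-1) * mean|r_ff|)."""
    k = len(subset)
    mean_cf = sum(abs(r_cf[v]) for v in subset) / k
    # All unordered pairs of variables inside the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    # Assumption: with a single variable there are no pairs, so mean r_ff is 0.
    mean_ff = sum(abs(r_ff[frozenset(p)]) for p in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

def forward_selection(r_cf, r_ff):
    """Add variables in order of decreasing |r_cf|; return the subset with the largest Merit."""
    ranked = sorted(r_cf, key=lambda v: abs(r_cf[v]), reverse=True)
    subset, best_subset, best_merit = [], [], float("-inf")
    for variable in ranked:
        subset.append(variable)
        current = merit(subset, r_cf, r_ff)
        if current > best_merit:
            best_subset, best_merit = list(subset), current
    return best_subset, best_merit

# Hypothetical correlations with the dependent variable and between variables.
r_cf = {"A": -0.8, "B": -0.7, "C": 0.3}
r_ff = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.9,
}
best_subset, best_merit = forward_selection(r_cf, r_ff)
print(best_subset, round(best_merit, 3))
```

With these made-up values, adding `B` to `A` raises Merit, while adding the weakly predictive but highly redundant `C` lowers it, which mirrors the decrease in Merit for larger subsets reported in the paper.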

Creating predictive models and comparative analysis of prediction results
Models for predicting average user throughput in the network were created in the SPSS Modeler software package. This software platform is one of the leading solutions in the field of Data Science, and especially machine learning. Supported techniques include Neural Networks, Classification and Regression Tree (C&R), Chi-square Automatic Interaction Detection (CHAID), linear regression, generalized linear regression, logistic regression, Bayesian Network, SVM and k-NN. A key role in this paper is played by the method of automatic modeling, which simultaneously examines several machine learning models with different parameters according to the supervised learning paradigm. SPSS Modeler automatically ranks the offered solutions, which can be done by correlation, relative error or the number of variables used. The comparative analysis of the offered solutions in the Results and discussion section is based on relative error, which is the ratio of the sum of squared errors of the model to the sum of squared errors of the null model:
$$\text{Relative error} = \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2} \qquad (2)$$
where $y_i$ is the actual USER_DL_TR value of the i-th vector, $\hat{y}_i$ is the predicted value for the i-th vector, and $\bar{y}$ is the arithmetic mean of the variable USER_DL_TR.
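The relative error defined above can be computed with a few lines of code. The sketch below is illustrative only: the function name `relative_error` and the sample values are hypothetical, and the null model is taken to be the one that always predicts the arithmetic mean of the actual values.

```python
def relative_error(actual, predicted):
    """Sum of squared model errors divided by the sum of squared errors of the null (mean) model."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    mean_y = sum(actual) / len(actual)
    sse_model = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sse_null = sum((y - mean_y) ** 2 for y in actual)
    if sse_null == 0:
        raise ValueError("relative error is undefined when all actual values are equal")
    return sse_model / sse_null

# Hypothetical throughput values and predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
re_model = relative_error(actual, predicted)
print(re_model)
```

A value close to 0 indicates that the model is far better than predicting the mean, while a value of 1 means that the model is no better than the null model.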

Results and discussion
This section presents the most important research results. According to the methodological steps in Figure 1, first the results of correlation analysis are given, then the results of the application of the CFS method, and finally prediction results with comparative analysis of solutions are presented.

Results of correlation analysis
As an initial step in the statistical processing of data, a correlation analysis was performed to determine the degree of linear correlation between the research variables. The matrix of Pearson correlation coefficients between the research variables for the M9J P-J section is shown in Figure 4. Based on the values of the correlation coefficients shown in Figure 4, it is concluded that all independent variables, except V2, have a negative linear correlation with USER_DL_TR. According to the scale shown in Table 2, the variables V3, V4, V6, V8, V10, V13 and V16 have a very low correlation with the dependent variable. Variables V5, V7, V9, V11, V12, V15 and V17 have a low correlation with the average user throughput. Variables V2 and V14 have a moderate correlation with the observed output variable, which is the strongest correlation between independent and dependent variables for this section of the motorway. Accordingly, no correlation coefficient higher than 0.45 (V2) was found between the independent and dependent variables for the observed section. Figure 5 shows Pearson correlation coefficients between the observed variables for the M17 J-D section.
From Figure 5, it can be concluded that all independent variables except V2 are negatively correlated with user throughput, as is the case with the M9J P-J section (Figure 4). It is also evident that there is a very low correlation between the variable V6 and user throughput. Variables V3 and V4 have slightly higher coefficients (r=-0.35) and, according to the scale shown in Table 2, have a low correlation with the dependent variable. Variables V2, V7 and V16 have a moderate correlation with USER_DL_TR. Most of the independent variables have a high correlation with the observed dependent variable, namely V5, V8, V9, V10, V11, V12, V13, V14, V15 and V17. Variable V1, with a Pearson coefficient of r=-0.80, has a very high negative linear correlation with the average user throughput.

Results of CFS application
The Correlation-based Feature Selection (CFS) method was applied to reduce the dimensionality of the space of independent variables, based on the correlation coefficients presented in the previously given analysis. The subset of independent variables in the initial step consists only of the variable that has the highest correlation with the dependent variable. In the next step, the subset is expanded with a variable that has the second largest correlation coefficient with the average user throughput (from Figure 4 and Figure 5). This step is repeated until all available independent variables are included in the subset. For each subset, Merit is calculated according to Eq. (1). The final subset of independent variables, used to create machine learning models, is the one with the highest Merit value.
Table 3 shows the calculated Merit values for each subset of variables on the M9J P-J section. Based on the values given in Table 3, it is obvious that a subset consisting of variables V2 and V14 (0.508) has the largest Merit. With the expansion of the subset, the decrease of Merit is evident, to a final value of 0.338. Table 4 provides an overview of the calculated Merit values for the subsets of independent variables, for the M17 J-D section. A subset of variables V1 and V17 has the highest Merit value (0.825), according to the results given in Table 4. Also, as is the case with the variables on the P-J section, the expansion of the subset leads to a decrease in Merit values. The complete set, with all independent variables, has Merit equal to 0.756.

Prediction results and comparative analysis of the results
Given that the average user throughput is a continuous variable, training and testing data are processed in the SPSS Modeler software environment using the Auto Numeric option, which automatically creates different predictive models. In this way, in just one pass through the modeling process, Auto Numeric examines models based on different machine learning techniques and different combinations of parameters for each of these models, and ranks the solutions according to relative prediction error. Table 5 presents the three best machine learning models for both observed road sections. Based on the results, it is obvious that the models created for the J-D trunk road section have significantly higher prediction accuracy. The best model for this section is based on the k-NN machine learning technique and is characterized by a relative error of 0.183. The most accurate model for the M9J P-J section is based on the same technique, but its relative error is three times higher, at 0.549.
Figure 6 shows the scatter plot of the prediction results of the k-Nearest Neighbors model for the M9J P-J section. It is obvious that the predicted values deviate considerably from the actual values. The coefficient of determination ($R^2$), as an indicator of the quality of the model, is 0.416, which can be considered a moderate correlation [19]. Figure 7 shows a scatter plot of the prediction results of the k-Nearest Neighbors model for the J-D trunk road section. In Figure 7, it can be seen that the points are largely concentrated in the vicinity of the line shown, which is reflected in the value of $R^2$, equal to 0.8005. This coefficient of determination indicates a high level of correlation between the predicted values and the actual values of the dependent variable [19].
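To make the k-NN technique behind the best-ranked models more concrete, the following minimal sketch predicts a continuous target as the mean of the k nearest training vectors. It is not the SPSS Modeler implementation, whose settings (value of k, distance metric, weighting) are not stated in the paper. Euclidean distance, an unweighted mean, inputs already scaled to a common range, and all data values are assumptions made for illustration.

```python
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict a continuous target as the unweighted mean of the k nearest training targets."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training vectors")
    # Indices of training vectors ordered by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical training set with two scaled inputs per vector, matching the
# two-variable subsets selected by CFS; targets stand in for throughput values.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.5, 0.5)]
train_y = [40.0, 38.0, 12.0, 10.0, 25.0]

prediction = knn_predict(train_x, train_y, (0.15, 0.15), k=2)
print(prediction)
```

Because the prediction depends directly on distances, inputs measured on very different scales would usually need to be normalized first, and the choice of k trades off noise sensitivity against smoothing.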

Conclusion
In this paper, several machine learning models were created for average user throughput prediction in the mobile network in the observed geo-area of the research. Based on the criterion of relative prediction error, the best solutions were ranked and one model was selected for each of the sections. The results showed much higher prediction accuracy for the selected k-NN model on the trunk road section, between the Johovac node and the town of Doboj. One of the potential reasons for this is frequent throughput fluctuations, which are more pronounced in conditions of high user mobility, such as traveling on the motorway. Vehicle speed on the M9J section is limited to 130 km/h, and on the trunk road to 80 km/h. This may be one of the challenges for future research. Other important reasons are lower vehicle speeds on the trunk road, fewer handovers and fewer cells covering the 12 km long section. The research results and developed models have innovative theoretical and considerable practical significance, especially in terms of the needs of telecommunications service providers in the geo-area of the network that covers the observed roads. User throughput prediction in the network enables more precise planning and allocation of network resources in the future to meet user requirements.
In relation to the previously published studies, which are listed in the introductory section, this paper is characterized by the following novelties: an original methodological approach to the application of machine learning methods in combination with modern statistical methods has been created; the selected predictive model is simple, because it has only two inputs, yet it achieves satisfactory accuracy; the applied method of automatic modeling facilitates and accelerates the process of examining models based on various machine learning techniques; a combined geo-area in the zone of roads with different conditions for predicting telecommunication traffic performance has been selected; a representative set of correlated research variables in bimodal traffic has been identified. Future research may focus on finding other models for average user throughput prediction on the M9J P-J section in order to achieve a smaller relative prediction error. Future research should also examine the prediction accuracy of different models with a larger number of input variables.