SGD-SVR: Your (ir)regular predictive model.

A fascinating attempt to optimize our popular regression algorithm...

Akha / Mar 03, 2024

5 min read • 930 word(s)


Photo by Enayet Raheem on Unsplash

  Support Vector Regression (SVR) is one of the most frequently used regression models in machine learning. SVR is highly regarded for its versatility in handling datasets of various sizes and shapes, encompassing both linear and non-linear relationships between variables. This adaptability is particularly beneficial in real-world scenarios where data can exhibit complex patterns that defy simple linear modeling. SVR's flexibility arises from its ability to utilize different kernel functions, enabling it to capture intricate structures within the data and provide accurate predictions. However, despite its numerous advantages, SVR has certain limitations that need to be considered. One notable drawback is its computational complexity, especially when dealing with large datasets; it also requires careful tuning of hyperparameters, which may pose challenges for users without extensive machine learning expertise.

  In this post, we'll take a deep dive into optimization algorithms, specifically Stochastic Gradient Descent (SGD), as a way to mitigate these downsides: a practical route to making SVR more effective, more scalable to larger datasets, and more "processor-friendly".

How it works...

  If you've worked with machine learning at least once, you'll know what a Support Vector Machine (SVM) is. It's a supervised machine learning algorithm that uses labeled data for classification and regression tasks. The main idea of SVR is to find a line—or a plane in higher dimensions—that best represents the trend in the data. SVR helps you find this line while minimizing the error between the actual data points and the line, with the help of a margin of tolerance.

Two dashed-lines as margin in SVR
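To make the margin idea concrete, here is a minimal sketch using scikit-learn's off-the-shelf SVR on toy data. The data, kernel, and hyperparameter values are purely illustrative and are not the implementation discussed later in this post.

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D data: a noisy linear trend (illustrative only)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 0.8 * X.ravel() + rng.normal(scale=0.3, size=50)

# epsilon is the width of the margin ("tube"): residuals smaller
# than epsilon incur no penalty at all
model = SVR(kernel="rbf", C=3.0, epsilon=0.1)
model.fit(X, y)

# Points that fall outside the margin become support vectors
print("number of support vectors:", len(model.support_))
```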

Suppose a labeled dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. With the help of SVR, you can find the predicted data $\hat{y}$ by solving the optimization problem—or objective function $W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*)$—written as

$$\max_{\boldsymbol{\alpha},\,\boldsymbol{\alpha}^*} W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(\mathbf{x}_i, \mathbf{x}_j) - \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i(\alpha_i - \alpha_i^*)$$

Subject to constraint

$$\sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C$$

So that we can achieve

$$\hat{y} = f(\mathbf{x}) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)K(\mathbf{x}_i, \mathbf{x}) + b$$

Where $K(\cdot,\cdot)$ is some kernel function like the polynomial or radial basis function (RBF). In the equations above, we have hyperparameters that need to be tuned, such as the dual parameters $\alpha_i, \alpha_i^*$, the width of the margin $\varepsilon$, and the trade-off $C$. And now, this is where SGD comes into play...
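As a rough sketch of what this objective looks like in code, the snippet below evaluates $W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*)$ with an RBF kernel using NumPy. The function names and the kernel width `gamma` are my own illustrative choices, not taken from the original implementation.

```python
import numpy as np

def rbf_kernel(X, gamma=0.1):
    """Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(X**2, axis=1)[None, :]
          - 2.0 * X @ X.T)
    return np.exp(-gamma * sq)

def dual_objective(alpha, alpha_star, K, y, eps):
    """SVR dual objective W(alpha, alpha*) to be maximized."""
    diff = alpha - alpha_star
    return (-0.5 * diff @ K @ diff
            - eps * np.sum(alpha + alpha_star)
            + y @ diff)

def predict(X_train, X_new, alpha, alpha_star, b=0.0, gamma=0.1):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    sq = (np.sum(X_train**2, axis=1)[:, None]
          + np.sum(X_new**2, axis=1)[None, :]
          - 2.0 * X_train @ X_new.T)
    K_new = np.exp(-gamma * sq)
    return (alpha - alpha_star) @ K_new + b
```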

  SGD plays a significant role in determining the dual parameters of SVR by estimating the gradient of the objective function using a smaller subset of the data, making it computationally more efficient. By randomly selecting mini-batches of data for each iteration, SGD introduces randomness into the optimization process, which helps escape local minima and can lead to faster convergence. Since the dual objective is maximized, each SGD step moves in the direction of the gradient and can be written as

$$\theta^{(t+1)} = \theta^{(t)} + \eta\,\frac{\partial W}{\partial \theta}\Bigg|_{\theta = \theta^{(t)}}$$

With $\theta$ representing both $\alpha_i$ and $\alpha_i^*$ for each data point, and $\eta$ as the learning rate. The partial derivatives of the objective function with respect to $\alpha_i$ and $\alpha_i^*$ can be written as

$$\frac{\partial W}{\partial \alpha_i} = -\sum_{j=1}^{n}(\alpha_j - \alpha_j^*)K(\mathbf{x}_i, \mathbf{x}_j) - \varepsilon + y_i, \qquad \frac{\partial W}{\partial \alpha_i^*} = \sum_{j=1}^{n}(\alpha_j - \alpha_j^*)K(\mathbf{x}_i, \mathbf{x}_j) - \varepsilon - y_i$$

This process is repeated until the stopping criterion is met, i.e. the degree of tolerance. After that, the optimized $\alpha_i$ and $\alpha_i^*$ can be used to calculate the predicted values $\hat{y}$.
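A minimal sketch of that loop, under my own simplifying assumptions, might look like the following: only a random mini-batch of dual variables is updated per iteration, the box constraint $[0, C]$ is enforced by clipping, the equality constraint $\sum_i(\alpha_i - \alpha_i^*) = 0$ is ignored for brevity (it only affects the bias term), and the tolerance check uses the change in the objective between iterations.

```python
import numpy as np

def sgd_svr(K, y, C=3.0, eps=0.1, lr=0.02, tol=1e-3,
            batch_size=8, max_iter=1000, seed=0):
    """Illustrative mini-batch gradient ascent on the SVR dual objective."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)
    alpha_star = np.zeros(n)
    prev_obj = -np.inf

    for _ in range(max_iter):
        batch = rng.choice(n, size=batch_size, replace=False)
        diff = alpha - alpha_star
        common = K[batch] @ diff              # sum_j (a_j - a_j*) K(x_i, x_j)
        grad_alpha = -common - eps + y[batch]       # dW/d(alpha_i)
        grad_alpha_star = common - eps - y[batch]   # dW/d(alpha_i*)

        # Ascent step, then clip back into the box constraint [0, C]
        alpha[batch] = np.clip(alpha[batch] + lr * grad_alpha, 0.0, C)
        alpha_star[batch] = np.clip(alpha_star[batch] + lr * grad_alpha_star, 0.0, C)

        # Stopping rule: change in the objective below the degree of tolerance
        diff = alpha - alpha_star
        obj = -0.5 * diff @ K @ diff - eps * np.sum(alpha + alpha_star) + y @ diff
        if abs(obj - prev_obj) < tol:
            break
        prev_obj = obj

    return alpha, alpha_star
```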

Simple flowchart of SGD-SVR

The real deal.

  Let's begin applying all of those algorithms to some real data. The data comes from Badan Pusat Statistik (BPS), namely Indonesia's GDP growth rate data from Q2 2010 to Q2 2023, totaling 53 observations. The data consist of 9 independent variables $\mathbf{x}$ and 1 dependent variable $y$.

First 10 rows of Indonesia's GDP growth rate data

Next, we can use tools like TensorFlow or scikit-learn to speed up the training stage. Personally, I used TensorFlow with libraries such as Tensorflow Constrained Optimization to help me solve the SVR optimization problem. The training stage can be further improved with the help of a Grid Search algorithm to find the optimal hyperparameters.
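For readers who just want a quick baseline, here is a hedged sketch of a grid search using scikit-learn's SVR and GridSearchCV rather than the TensorFlow Constrained Optimization pipeline used in this post; the placeholder data, the parameter grid, and the `gamma` values are my own assumptions, only the 53-by-9 shape and the 90/10 split mirror the setup described here.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

# Placeholder data with the same shape as the BPS dataset (53 samples, 9 features);
# substitute the real GDP growth rate data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 9))
y = rng.normal(size=53)

# 90% training / 10% testing, matching the split used below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)

param_grid = {
    "C": [1, 3, 10],              # trade-off
    "epsilon": [0.05, 0.1, 0.2],  # width of margin
    "gamma": ["scale", 0.1, 0.01],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("train MSE:", mean_squared_error(y_train, search.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, search.predict(X_test)))
```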

The entire process can be viewed here.

  Based on some trial-and-error attempts, I found the optimal values of the hyperparameters (trade-off $C$, width of margin $\varepsilon$, learning rate $\eta$, and degree of tolerance) to be 3, 0.1, 0.02, and 0.001, respectively, with the data split into 90% training data and 10% testing data. The training stage yields a mean squared error (MSE) of 0.2805 after a total of 360 iterations, while the testing stage yields an MSE of 0.0325. Hence, the model is NOT overfitting.

Comparison of actual and predicted data in both stages.

Summary

  In conclusion, SGD-SVR offers a compelling solution to some of the most pressing challenges in machine learning. By harnessing SGD's efficient optimization technique and SVR's robust regression capabilities, this combination addresses issues of scalability and computational complexity while maintaining high predictive accuracy. The dynamic nature of SGD allows SVR to adapt swiftly to the intricacies of large datasets, facilitating faster convergence and enhanced model performance.

·····

References