Abbas Jafar (Jafar Abbas)
이명호 (Myungho Lee)†
(Dept. of Computer Engineering, Myongji University)
Copyright © The Korean Institute of Electrical Engineers (KIEE)
Key words
Hyperparameter Optimization (HPO), Bayesian Optimization (BO), Optuna, HyperOpt, Keras Tuner, Machine Learning, Convolutional Neural Network (CNN)
1. Introduction
Machine learning (ML) is becoming increasingly popular and is used in various applications such as natural
language processing (1), self-driving cars, computer vision, healthcare, and user behavior analytics (2), among many others. ML algorithms solve various data-specific problems such as classification,
regression, and clustering. Each algorithm has a number of hyperparameters
that determine its structure and guide learning during training. These hyperparameters
directly impact the learning process, and their selection has direct control over the
performance of the models. Therefore, researchers focus on designing an ML algorithm
with a proper set of hyperparameters (3).
Hyperparameter Optimization (HPO) is the process of choosing a suitable set of hyperparameters
to tune ML models and of selecting optimal values that enhance the accuracy and effectiveness
of the models. Optimizing these hyperparameters helps prevent overfitting, which reduces
the cost and speeds up the computation. Traditionally, the selection of hyperparameters
relies on the experience of individuals. However, manual optimization is unreliable and
costly because it requires a lot of reasoning for complex problems (4). To minimize the cost and the computing time, it is essential to develop automatic
HPO approaches that automate the tuning process.
Automatic HPO approaches improve the performance of the models, lead to lightweight
models with fewer parameters, and assist in selecting appropriate hyperparameters (5), (6). The most common HPO approaches are grid search (GS) and random search (RS), which
attempt to find the best hyperparameters to minimize the loss function (7), (8). Bayesian Optimization (BO) (9) finds the configuration of hyperparameters from the previous search space and avoids
evaluations that do not directly impact the model's performance. Unlike the GS and
RS, BO finds the optimal hyperparameters within the limited evaluations. Tree-structured
Parzen Estimator (TPE) is an iterative algorithm to optimize conditional hyperparameters
based on historical measurements (7). It finds the optimal configuration of hyperparameters that achieves the best performance.
Recently, automatic HPO frameworks have been developed which consist of more than
one optimization algorithm. For example, Optuna has GS, RS, TPE, Hyperband, and Pruning
optimization algorithms. Automatic HPO frameworks provide a user-friendly interface
for the implementation of optimization methods and help increase the models' efficiency
and accuracy. Thus, they solve large-scale problems efficiently.
This paper evaluates the comparative performance of the latest HPO frameworks such
as BO, Optuna, HyperOpt, and Keras tuner (9)–(12). In order to find an optimal combination of hyperparameters for each framework and
improve the performance of various models, two different sets of experiments were
carried out. First, different ML classifiers were optimized using the HPO frameworks
on publicly available datasets. The selected classifiers are Random Forest (RF), Extreme
Gradient Boosting (XGB), and Support Vector Machine (SVM) with their own set of hyperparameters
to tune. Classifiers were trained and optimized using HPO frameworks on dry beans,
raisin, and nomao datasets in order to obtain the best combination of hyperparameters.
Secondly, a CNN architecture was built and optimized using HPO frameworks on the CIFAR-10
dataset by optimizing various hyperparameters such as convolutional layers, fully
connected layers, the number of nodes, batch size, learning rate, etc. The accuracy,
F1 score, and computing time were considered as the performance metrics. The obtained
results show that the ML models and CNN optimized with HPO frameworks led to improved
performance. For the ML model, Optuna and HyperOpt performed well and found the best
combinations efficiently. Both frameworks used the TPE optimization algorithms and
achieved an accuracy of 93.97% and 94.12%, respectively, on the nomao dataset. HyperOpt
was a good choice when accurate prediction matters most. For a smaller task
where the cost matters a lot, BO was quite efficient. On the other hand, Optuna was
effective and worked well for large search spaces that required more computing time. Considering
the trade-off between accuracy and computing time, Optuna obtained the optimal
set of hyperparameters within a shorter computing time than the others. For the
CNN model, almost all the HPO frameworks performed well. HyperOpt-TPE was the best,
improving the training accuracy of the CNN model by 34%, higher than all the other
HPO frameworks.
The rest of the paper is organized as follows: Section 2 introduces the overview of
HPO and its techniques. The HPO frameworks, such as BO, Optuna, HyperOpt, and Keras
Tuner are explained in Section 3. Section 4 presents the comparative performance evaluations
with experimental results and analyses. Section 5 reviews the previous research on
performance studies using state-of-the-art HPO frameworks. Section 6 concludes the
paper.
2. Hyperparameter Optimization
HPO is an important process to select the best combinations of hyperparameters that
result in the best performance for ML models. Several automatic HPO techniques have
been developed to tune the hyperparameters for designing efficient ML models. We first
describe the popular hyperparameters and then HPO techniques.
2.1 Hyperparameters
Hyperparameters are the variables of a model that determine its structure and behavior.
Recently, hyperparameters have received remarkable attention as a way to deal with the computational
complexity of the models. Each ML algorithm has its own specific set of hyperparameters
that need to be tuned. Studies such as [34–36] explain the hyperparameters in detail.
There are two major categories of hyperparameters: model hyperparameters and optimizer hyperparameters.
Model hyperparameters define the structure of the model, while optimizer hyperparameters
control the training optimization algorithms.
The activation function is a model-specific hyperparameter that introduces non-linearity
and transforms the input signals so that the network can capture complicated functions.
Activation functions such as Sigmoid, Softmax, Tanh, and Rectified Linear Units
(ReLU) are commonly used (16). ReLU is suggested as a default activation function because it mitigates the vanishing
gradient problem and converges about six times faster than the Tanh function. The learning
rate (LR) is a well-known hyperparameter that controls how quickly the network updates
its weights during learning. The LR adjusts the weights of the hidden layers, which
determine how the network extracts complex features. Optimal LR values need to be
selected in order not to miss the local minimum. Recently, LR annealing approaches
have been developed to find an optimal value during training
(17), (18). However, in most cases, the users manually set LR values. Data augmentation is a
data generation technique that creates modified copies of the training data through
operations such as transformation, rotation, cropping, etc. It helps neural networks (NNs)
improve their performance during learning.
Optimization algorithms play a prominent role in the tuning of NNs;
they enable the networks to learn and perform better. In deep learning (DL), Gradient Descent (GD),
Root Mean Square Propagation (RMSprop), and Adaptive momentum estimation (Adam) (19)–(21) are widely used optimizers. Gradient Descent minimizes the cost function by iteratively
updating the model's weights and biases in the direction opposite to the gradient
until a local minimum is reached.
RMSprop adapts the gradient step by balancing the momentum (step size) and decreasing
the step size for large gradients (20). Another state-of-the-art optimizer is Adam, which combines two optimizers, Adaptive
Gradient Descent (AdaGrad) (22) and RMSprop. It computes an adaptive LR for each parameter and is often used as the default
optimization algorithm for training.
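To make the update rule concrete, here is a minimal, hedged sketch of one vanilla gradient-descent step; the grad_fn argument and the toy quadratic loss are illustrative placeholders, not part of the paper's experiments.

```python
import numpy as np

def gradient_descent_step(weights, grad_fn, learning_rate=0.01):
    """One vanilla gradient-descent update: w <- w - lr * dL/dw."""
    gradient = grad_fn(weights)  # gradient of the loss w.r.t. the weights
    return weights - learning_rate * gradient

# Toy usage: minimize L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v, learning_rate=0.1)
print(w)  # approaches the minimum at [0, 0]
```

The learning rate here plays exactly the role described above: a larger value takes bigger steps and may overshoot a minimum, while a smaller value converges more slowly.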
2.2 HPO Techniques
In this section, we overview the HPO techniques such as GS, RS, BO, Genetic Algorithm
(GA), and TPE [8, 9, 31-33].
2.2.1 Grid Search
Grid search (GS) is an optimization technique that searches the configuration space by
checking all possible combinations of hyperparameters (26). In GS, a user selects the search space and divides it into a grid. Each hyperparameter
in the grid has the same probability of affecting the optimization process. The selection
of hyperparameters requires the users' prior knowledge. GS can find optimal combinations
with limited resources and achieve accurate predictions for different tasks (23). GS can be parallelized, and the results of one trial do not affect the
others.
However, searching a high-dimensional space with GS requires a lot of training time (see table 1). Also, GS is not sensitive to hyperparameter scaling, which may affect the
model performance. Furthermore, the method is only suitable for tuning a single model, not
for model selection.
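For illustration, a small, hedged grid-search sketch using scikit-learn's GridSearchCV is shown below; the SVM grid and the iris data are arbitrary examples, not the search spaces used in Section 4.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in this grid (3 x 2 = 6 settings) is evaluated with 5-fold CV.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```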
2.2.2 Random Search
Random Search (RS) was proposed by (8) to address the limitations of GS. RS samples hyperparameter values randomly and continues
searching until the desired fitness function is achieved. RS is suitable for lower
budgets and investigates a larger search space than GS, although it may still require considerable
computational resources (8). RS allows stopping criteria for the experiments; it stops when the required output
(objective function) is achieved. Overall, RS performs better than GS and can
also run in parallel.
RS is faster than GS but is still time-consuming when training complex models.
The method selects combinations randomly, which may not find the optimal combination
due to its lack of flexibility. It also has limitations in hyperparameter scaling
and may not find a global minimum, as shown in table 1. Additionally, this method is less efficient in considering the relationships between
the hyperparameters within the search space (27).
2.2.3 Bayesian Optimization
BO is a robust iterative algorithm in ML (9). Unlike GS and RS, BO is more efficient because it utilizes past results to regulate
future evaluations. This allows BO to find the global minimum with fewer iterations.
It uses the Bayesian statistical method to search the best combinations of hyperparameters.
Two primary elements of BO are surrogate models and an acquisition function (28). The surrogate models approximate the objective function with a probability distribution
based on the samples. At the same time, the acquisition function determines the distribution
and balances the tradeoff between explorations and exploitation. Surrogate models
allow the BO method to work efficiently by minimizing the number of expensive function
evaluations to achieve the optimum. The popular surrogate models include BO Gaussian
Process (BO-GP) (29), BO Random Forest (BO-RF) (30), and BO Tree Parzen Estimator (BO-TPE) (31).
The BO method can be computationally expensive where the objective function must be
evaluated frequently. The method took a long time to converge where the objective
function has local optima. Additionally, BO is conceptually complex and challenging
to parallelize (see table 1).
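As a hedged illustration of the surrogate/acquisition loop, the sketch below uses the open-source bayesian-optimization package (the fmfn/BayesianOptimization library listed later in table 2); the quadratic objective and the bounds are made-up placeholders for a real training-and-validation objective.

```python
from bayes_opt import BayesianOptimization

# Black-box objective to maximize; in HPO this would wrap model training + validation.
def black_box(x, y):
    return -(x - 2) ** 2 - (y + 1) ** 2  # maximum at x = 2, y = -1

optimizer = BayesianOptimization(
    f=black_box,
    pbounds={"x": (-5, 5), "y": (-5, 5)},  # search bounds for each parameter
    random_state=1,
)
# A Gaussian-process surrogate plus an acquisition function picks the next points.
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)  # best parameters and objective value found
```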
2.2.4 Genetic Algorithm
GA is a search optimization algorithm to solve multi-objective optimization problems
(24). GA works as a biological evolution process and selects those species capable of
adopting environmental changes. Later, these species reproduced in the upcoming generations
and inherited characteristics. The generations with better performance would survive
longer, and worse ones would disappear gradually. The population in each generation
represents the search space and the individual is a character. Thus, in each iteration,
the individuals are the hyperparameters and nature is the real input value. The selection
of individuals is based on the optimal value of fitness functions (32).
In HPO, GA is easy to implement and robust to optimize over a large population. However,
GA has limitations in configuring the additional hyperparameters with fitness functions
(see table 1). Moreover, the algorithm is difficult to perform in parallel due to its sequential
execution nature.
Many HPO algorithms, such as TPE and Hyperband, minimize the objective function for independent data.
TPE (7) is a BO-based algorithm used to optimize hyperparameter configurations (e.g., for quantization settings)
to achieve better performance. The applicability of TPE in ML, with its key advantages
and disadvantages, is shown in table 1. Hyperband is another optimization algorithm, formulated to solve a pure-exploration,
infinite-armed bandit problem (33).
Table 1. Comparison of HPO techniques and the complexity (n: values of hyperparameters; k: the total number of hyperparameters used)

HPO Approaches | Advantages | Disadvantages | Complexity | Applicability for DNN
Grid Search | Simple baseline method; supports parallelism | Time-consuming; high computational cost; poor scaling | O(n^k) | When only a few hyperparameters need to be tuned
Random Search | Supports parallelism; early stopping methods; computational efficiency | Low efficiency; limited search space; lack of flexibility; hard to find local optima | O(n) | Convenient for the early stage; random combinations
Bayesian Optimization | Fast and reliable; efficient and flexible; the foundation of other algorithms | Difficult to parallelize; computational cost; convergence; robustness | O(n log n) | Default algorithm; variants of BO are applicable
Genetic Algorithm | Fast convergence speed; efficient and flexible; no need for optimal initialization of values | Lack of parallelism; long time to get the best model; lack of interpretability | O(n^2) | Mutation testing; filtering and signal processing; learning fuzzy rules
TPE | Efficient search method; better with conditional dependencies; flexibility | Poor parallelization performance; computational cost; convergence and robustness | O(n log n) | Quantization configuration
3. Hyperparameter Optimization Frameworks
Hyperparameter optimization frameworks are the automatic tools to tune the hyperparameters
of ML models. Each tool typically includes a set of optimization techniques and a
user-friendly interface for defining search space, evaluating the objective functions,
and monitoring the performance of models. These frameworks are in high demand to tackle
complex machine learning problems. In this section, we briefly overview the HPO frameworks
that we have used: Bayesian optimization (9), Optuna (10), HyperOpt (11), and Keras Tuner (12). Note that BO is not described again here, as it was already explained in detail in Section 2.
3.1 Optuna
Optuna (10) is an open-source tool released in 2019 by a Japanese AI company for ML and DL applications.
The Optuna framework is built in Python and enables efficient optimization of complex problems (10). Existing HPO tools have many limitations, such as requiring the search space
of each model to be constructed individually, a lack of pruning approaches, and difficulty
applying previous techniques within the allocated resources (6). Optuna addresses these problems and provides a better solution:
it is a next-generation framework that allows users to create the search space dynamically.
Optuna delivers a lot of user-customized sampling, searching, and pruning algorithms
for an efficient implementation. The versatile architecture of Optuna itself is easy
to set up and can be deployed for different types of problems, ranging from scalable
to lightweight experiments. The overall framework of Optuna for an ML model is shown
in fig 1, where it automatically finds optimal combinations of hyperparameters using specific
HPO sampling algorithms (GS, RS, BO, Hyperband). Later, the ML model is evaluated
with a validation strategy to produce the final results.
Optuna is designed to address the problems and limitations of black-box optimization
frameworks. The implementation of Optuna is based on studies and trials. A study requires
an objective function, controls how many trials are run, and returns
the optimal values of the specified hyperparameters, whereas a trial is a single execution
of the objective function. The key features and available HPO algorithms of Optuna
are highlighted in table 2.
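The study/trial interface can be sketched as follows. This is a minimal, illustrative example only; the classifier and the search ranges are placeholders and not the exact configuration used in Section 4.

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(trial):
    # Each trial samples one hyperparameter configuration from the search space.
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 2, 20)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")  # default sampler is TPE
study.optimize(objective, n_trials=50)             # each call of objective() is one trial
print(study.best_params, study.best_value)
```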
Fig. 1. Overall framework of the Optuna HPO process
3.2 HyperOpt
HyperOpt (11) is a Python-based HPO tool built on sequential model-based optimization. It
provides a user-friendly interface to configure the variables of the hyperparameter search space
and to evaluate the objective function. The configured variables can be
continuous or discrete, conditional, and scaled in different ways (uniform, log scaling).
HyperOpt helps find the best values for the selected variables, which
define the hyperparameter configuration search space that minimizes the objective
function, as shown in fig 2.
HyperOpt's key components are a search space, an objective function, and an optimization algorithm.
A search space in HyperOpt is defined with random variables or parameters whose prior distributions
describe the likely combinations of values. It includes Python functions and operators
to combine random parameter values for the specific objective function. The objective
function can contain any conditional structure and maps sampled parameter values to a score
that the selected optimization algorithm minimizes. HyperOpt supports the following optimization
algorithms: RS, TPE, and Adaptive TPE. HyperOpt's search function selects the optimization
algorithm, configures the best-performing hyperparameters, and stores the configuration
results (see table 2).
Fig. 2. Overall framework of the HyperOpt HPO process
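A minimal, hedged sketch of these three components with HyperOpt's fmin interface follows; the quadratic objective and the single-variable search space are placeholders for a real model-training objective.

```python
from hyperopt import fmin, hp, tpe, Trials

# Search space: one continuous variable sampled uniformly in [-10, 10].
space = {"x": hp.uniform("x", -10, 10)}

def objective(params):
    # fmin minimizes this score; in HPO it would be a validation loss.
    return (params["x"] - 3) ** 2

trials = Trials()  # stores the scores and configurations of every evaluation
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print(best)  # e.g. {'x': ~3.0}
```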
Table 2. Comparison of HPO frameworks with key features and components, along with GitHub repository information

HPO Frameworks | Key Features | Available HPO Algorithms | Essential Components for Optimization | GitHub Link
Bayesian Optimization | Efficient; versatile; useful for high-cost functions | Bayesian Optimization | Defining and changing the bounds; building the surrogate model; acquisition function; updating the model over the search space | (34)
Optuna | Easy parallelization; quick visualization; versatile, platform-agnostic architecture | GS, RS, TPE, Hyperband, pruning algorithms | Objective function with trials; creation of the Optuna study; obtaining the optimal search space; visualization of results | (35)
HyperOpt | High speed & parallelization; complex search spaces; persisting and resuming the optimization process | Random Search, TPE, Adaptive TPE | Defining the objective function and search space; minimizing the objective over the space; database for scores and configurations | (36)
Keras Tuner | Intuitive and efficient; lightweight; distributed optimization; dashboard | Random Search, Hyperband, Bayesian Optimization | Choice of hyperparameters; selection of optimization algorithms; performing the tuning | (12)
|
3.3 Keras Tuner
In ML, there is no fixed way to select model parameters such as the number of
layers and the kernel size, or optimization parameters such as the LR, decay, normalization,
etc. when building a model. Keras is an open-source Python API that led to the development
of Keras Tuner, an HPO framework (12), for defining the search space of hyperparameters and finding the optimal combination
of values for training ML models. These hyperparameters play a vital role in helping
the models generalize and perform better.
Keras Tuner is a library for tuning sets of hyperparameters to obtain high performance,
for example in imaging studies (37). The idea of Keras Tuner is to define a range of values for each hyperparameter
and obtain the optimal combination that improves the validation performance of the model,
as shown in fig 3. Moreover, it helps build a lightweight and efficient ML model by selecting the
optimal search space configuration to perform tuning (see table 2). Keras Tuner has various built-in HPO methods: RS, GS, BO, Hyperband, and evolutionary
algorithms, which allows researchers to conduct experiments with different techniques.
The RS, GS, and BO methods were explained in detail in Section 2. The Hyperband tuner
in Keras Tuner is an extended version of RS with an early stopping function to optimize
the speed (38). The framework is helpful for tuning CNN models and allows tuning of
the model's hyperparameters (convolutional layers, number of
neurons and epochs, learning rate, etc.). Keras Tuner has also been used as an HPO framework
to optimize a long short-term memory (LSTM) network for earthquake prediction (39).
Fig. 3. Overall framework of Keras Tuner to perform HPO
4. Comparative Performance Evaluation of HPO Frameworks
In this section, we present experimental results for the comparative performance evaluations
of four different HPO frameworks such as BO, Optuna, HyperOpt, and Keras Tuner described
in Section 3. We conducted ML experiments using BO, Optuna, and HyperOpt frameworks
to classify the datasets and make predictions. (Previous research suggests that Keras
Tuner is primarily used for image classification rather than continuous classification
tasks, thus we did not consider it for the ML tasks.) For the DL CNN model experiments
on the image dataset, we used Keras Tuner along with the other HPO frameworks. We
first explain the experimental setting that includes the datasets and system specifications
used for the experiments. Experimental results are presented along with the analyses.
4.1 Experimental Settings
For the performance evaluation, we selected four publicly available datasets:
dry beans (40), raisin (41), nomao (42), and CIFAR-10 (43). The datasets were selected based on real-world scenarios. The details
of each dataset, including the number of samples, classes, and numerical and categorical
features, are summarized in table 3. CIFAR-10 is the only image dataset. It consists of 60K images in ten different
classes, with 6K images per class. The dataset was extracted from the 80 million
tiny images database, where each RGB image is 32 × 32 pixels. The dataset was split
into training and testing sets of 50K and 10K images, respectively.
Table 3. Details of datasets for the experiments

Dataset | Classes | Samples | Attribute characteristics | Year Donated | Reference
dry beans | 17 | 13611 | Integer, Real | 2020 | (40)
raisin | 8 | 900 | Integer, Real | 2021 | (41)
nomao | 120 | 34465 | Real | 2012 | (42)
CIFAR-10 | 10 | 60000 | 3072 | 2009 | (43)
We used Python programming language with all the libraries needed to conduct the experiments.
The training and testing of ML models were performed on a computer system with an
Intel® Core™ i7-8700 CPU with a clock speed of 3.20 GHz, 16 GB of DRAM, and an NVIDIA
GeForce GTX 1060 GPU with 8GB GDRAM. We evaluated accuracy, F1 score, and computing
time as the performance evaluation metrics for the models.
4.2 Machine Learning Classifiers and the Experimental Results
We implemented three machine learning classification models including RF, XGB, and
SVM, and evaluated their performance on the classification datasets (dry beans, raisin,
nomao).
Random Forest (RF)
Random forest is an ensemble learning method for classification and regression
problems. It extends decision tree algorithms by creating multiple trees,
each trained on a subset of the data. The results of all trees
are merged to build the final prediction, resulting in a more robust and accurate model.
RF can efficiently mitigate overfitting. It has multiple hyperparameters
to tune, as listed in table 4.
Extreme Gradient Boosting (XGB)
XGB is a supervised ML algorithm that improves traditional gradient boosting
using decision trees. It is a highly efficient classifier that has won numerous data science
challenges. It builds a sequence of decision trees iteratively; each new tree
corrects the errors of the previous trees, which improves performance.
The key hyperparameters of XGB to be optimized are gamma, n_estimators,
and max_depth.
Table 4. Configuration search space of hyperparameters with type, space, and range of values

ML Classifiers | Hyperparameters | Type | Space | Range
RF | n_estimators | integer | linear | [100, 1000]
RF | max_depth | real | - | [2, 20]
RF | min_samples_split | integer | - | [0.1, 1.0]
RF | min_samples_leaf | - | - | [0.1, 0.5]
RF | max_features | - | - | [0.1, 1.0]
RF | criterion | real | - | -
XGB | colsample_bytree | real | - | [0.6, 1.0]
XGB | gamma | integer | - | [0, 1]
XGB | max_depth | real | - | [2, 20]
XGB | min_child_weight | - | - | -
XGB | n_estimators | integer | linear | [50, 1000]
XGB | subsample | numeric | - | [0.5, 1]
SVM | C | integer | float | [0.1, 10]
SVM | degree | - | - | [2, 4]
SVM | kernel | discrete | linear | -
SVM | gamma | numeric | - | [True, False]
|
Table 5. Experimental results of ML classifiers using selected HPO frameworks

Datasets | HPO Frameworks | Accuracy (%) | F1 score (%) | Computing Time (min)
dry beans | BO | 87.17 | 86.23 | 22
dry beans | Optuna-TPE | 86.23 | 87.45 | 25
dry beans | HyperOpt-TPE | 87.45 | 87.44 | 28
raisin | BO | 87.44 | 87.33 | 20
raisin | Optuna-TPE | 87.33 | 87.33 | 23
raisin | HyperOpt-TPE | 87.33 | 90.12 | 18
nomao | BO | 90.12 | 93.97 | 27
nomao | Optuna-TPE | 93.97 | 94.12 | 24
nomao | HyperOpt-TPE | 94.12 | 92.17 | 30
Support Vector Machine (SVM)
SVM is an ML method commonly used for classification and regression analysis.
It finds the best boundary (hyperplane) that separates the classes in the data; new data points fall
on one side of the boundary and can be classified easily. The data points closest
to the boundary are called support vectors. The SVM classifier works well for
smaller datasets and may struggle with high-dimensional data.
Each classifier has its own configuration space of hyperparameters that need to be
tuned. The configuration space includes the tuned hyperparameters with their type, space, and
range of values (see table 4). The complete process involves the following steps: data pre-processing, selection
of the search space and its values, and selection of the algorithm used for the implementation.
In data pre-processing, we handled missing values by imputation (mean, forward,
backward), applied one-hot encoding to the categorical features, used a label encoder for the
target features, and applied Min-Max scaling to all numerical features in the dataset.
The dataset was split into training and testing sets using K-fold cross-validation.
We used 5-fold cross-validation and 50 iterations to tune the hyperparameters
in each fold. Hyperparameter tuning was performed with the BO, Optuna, and HyperOpt
frameworks for the machine learning tasks; as noted earlier, Keras Tuner
was not considered because it is primarily used for image classification rather than
continuous classification tasks. As mentioned earlier, our experiments used the accuracy,
F1 score, and computing time as performance metrics.
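As a hedged sketch of this workflow (a TPE search with 5-fold cross-validation inside each evaluation), the example below tunes the RF hyperparameters from table 4 with HyperOpt; the wine dataset stands in for dry beans/raisin/nomao, and the preprocessing steps described above are omitted.

```python
from hyperopt import fmin, hp, tpe, Trials
from sklearn.datasets import load_wine  # placeholder for dry beans/raisin/nomao
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Search space roughly following the RF ranges in table 4.
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 1000, 1),
    "max_depth": hp.quniform("max_depth", 2, 20, 1),
    "min_samples_split": hp.uniform("min_samples_split", 0.1, 1.0),
}

def objective(params):
    clf = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        min_samples_split=params["min_samples_split"],
    )
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    return -acc  # fmin minimizes, so negate the mean CV accuracy

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)
```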
Experiments were conducted on the benchmark classification datasets. The configuration
search space of each classifier is based on the list of hyperparameters provided in
table 4. In each experiment, the models with the highest accuracy were returned and are highlighted
in table 5. HyperOpt-TPE performed the best, achieving the highest accuracy on dry beans and nomao
and a score similar to the others on the raisin dataset. On
the other hand, BO had the lowest accuracy on the nomao dataset.
Comparing the performance of HyperOpt and Optuna in terms of accuracy, both frameworks
performed well and achieved nearly identical results, with HyperOpt having a small advantage:
HyperOpt achieved an accuracy of 94.12% and Optuna 93.97%.
Both frameworks used TPE to tune the hyperparameters. TPE requires upper and lower
limits and a distribution, which makes the optimization easier. Compared with RS, GS, BO,
and many other optimization algorithms, TPE can better utilize resources by evaluating
the hyperparameters efficiently, can handle non-linear relationships between the selected
hyperparameters and objective functions more effectively, and requires lower computational
cost to find the best hyperparameters.
The raisin dataset has a smaller number of instances than the other datasets, and
all frameworks achieved satisfactory accuracy on it. BO showed the highest accuracy
of 87.44% despite a longer run time than others. The use of the acquisition function
to find the optimal combinations on smaller datasets led to better accuracy and optimization.
The results in table 5 highlight that BO achieved quite reasonable accuracies of 87.17% and 87.44% for the
dry beans and raisin datasets. Optuna and HyperOpt achieved the same results for the
raisin dataset, as shown in fig 4.
On the high-dimensional nomao dataset, BO had a lower score, while HyperOpt had a higher
score with a slightly longer runtime; HyperOpt achieved an accuracy of 94.12%.
Each framework took longer to optimize on this dataset than on the others due to its
large number of samples. Optuna achieved results similar to HyperOpt. Optuna generally
performed well in terms of computing time due to its adaptive sampling feature, which
adjusts the search space dynamically and requires less computation to find the optimal
combinations. Overall, HyperOpt took more computing time than the others due
to the number of iterations for the classifiers during the experiments, as shown in
fig 5. It explored more of the search space and evaluated one set of hyperparameters at a time
rather than simultaneously, which resulted in a longer computing time.
HyperOpt and Optuna had good performance scores on all datasets; however, HyperOpt
had a longer runtime. The performance scores of HyperOpt and Optuna in terms of accuracy
were almost the same, but Optuna had a shorter run time. From the experiments, we ranked
Optuna as the best choice for HPO considering the trade-off between accuracy, F1 score,
and the computing time. HyperOpt can also be a good choice when accurate prediction matters,
because it prioritizes accuracy over speed; furthermore, it has the potential for parallel
computation for complex problems with large data. In the end, it is important to note
that the performance of an HPO framework relies on many factors, such as the selected
HPO method, the size of the dataset, the chosen hyperparameter search space and its values,
and the computational resources.
Fig. 4. Accuracy of all selected HPO frameworks in percentage (%)
Fig. 5. Computing time (min) of HPO frameworks during the training
4.3 Deep Learning CNN Model and the Experimental Results
In this section, we build a CNN architecture for the CIFAR-10 benchmark image dataset.
A traditional CNN architecture has various layers, such as convolution, dense, pooling,
and fully connected layers. The selection of such layers and the hyperparameters within
each layer, as well as the number of neurons and the padding, were made using the HPO frameworks.
The schematic diagram of the CNN model is presented in fig 6. The convolutional block in the CNN architecture consists of a convolutional layer
with the ReLU activation function and a MaxPooling layer. Each convolutional layer
uses a 5 × 5 convolution filter, zero padding, and a stride of 1. The pooling
layer uses a MaxPooling filter of size 2 × 2. Both input and output images have
the same size. The convolutional block takes the input images, extracts salient
features, and forwards them to the hidden layers. These features are converted into a single
feature vector and passed to the fully connected layers of the network. Finally, the
resulting output is classified using the Softmax activation function to compute the
classification score.
A good choice of hyperparameters (learning rate, momentum, convolution and dense layers,
hidden units, batch size, etc.) can lead to higher performance. All the possible hyperparameters
used to build the CNN model are highlighted in table 6. Each optimization algorithm has its specific set of hyperparameters with a range
of values; table 6 provides the search space of hyperparameters with their values, together with the optional hyperparameters
of the CNN. The configured values were obtained in the process of using the HPO frameworks.
For example, using BO resulted in the following configured values: learning_rate 0.000544,
conv_layers 2, dense_layers 1, activation relu, and num_nodes 512. The optimal combinations
of hyperparameters were validated on the testing set to predict the performance. The
performance of each framework was measured and evaluated using the accuracy. Additionally,
the computing time during the training of the CNN with each HPO method was measured to
evaluate the efficiency.
Fig. 6. CNN architecture for our experimental setup (the number of convolutional blocks
and dense layers are selected from hyperparameters' search space configuration)
Table 6. Configuration search space and optional hyperparameters to optimize the CNN model

DL Model | Hyperparameters | Type | Space | Range | Optional Hyperparameters
CNN | learning_rate | real | log | [1e-6, 1e-2] | kernel_size, strides, momentum, padding, and dropout
CNN | dense_layers | integer | linear | [1, 3] |
CNN | conv_layers | - | - | [1, 3] |
CNN | num_nodes | - | - | [5, 512] |
CNN | batch_size | - | - | [10, 250] |
Experiments were conducted using the CIFAR-10 dataset. The performance results of
the CNN with and without the HPO frameworks are shown in
table 7. First, the CNN model was trained with default hyperparameters
for each optimization algorithm; the default values were chosen based on prior
knowledge, and the model was trained for 30 epochs. Then, the CNN model was trained with the
search space of hyperparameters assigned to each HPO framework. The performance results of the
optimized CNN are analyzed using the accuracy metric and the computing time.
Table 7. Experimental results of the CNN model using selected HPO frameworks on the CIFAR-10 dataset (Acc.: Accuracy; M: million; K: thousand; %: percentage; h: hour; m: minute)

HPO Framework | No. of Training Parameters | Training Acc. with default hyperparameters | Training Acc. with HPO | Testing Acc. with HPO | Computing Time
BO | 1.20M | 45.87% | 72.88% | 71.42% | 1h 6m
Optuna-TPE | 1.24M | 48.65% | 74.15% | 72.68% | 2h 32m
HyperOpt-TPE | 364K | 42.28% | 76.98% | 71.62% | 2h 24m
Keras Tuner-Hyperband | 53K | 72.37% | 90.76% | 84.66% | 18m
|
The performance comparison with the four different HPO frameworks is summarized in
table 7. With the default values of hyperparameters, the Keras Tuner-Hyperband achieved the
best performance with 72.37% training accuracy, while BO, Optuna-TPE, and HyperOpt-TPE
achieved 45.87%, 48.65%, and 42.28% training accuracy respectively. With HPO, HyperOpt-TPE
improved the training accuracy by 34%, which was higher than all the other HPO frameworks.
Using BO, we achieved 72.88% training accuracy and 71.42% testing accuracy while it
took 1 hour and 6 minutes of computing time. BO explored the search space of optimal
hyperparameters effectively using a probabilistic model. Its efficient exploration,
ability to handle both categorical and continuous values, and focus on the
most promising regions of the search space allowed BO to improve its performance within a
short training time.
Keras Tuner improved the CNN performance by a considerable margin. The complexity of
the Keras Tuner framework is low due to its fewer training parameters and smaller
search space. Keras Tuner-Hyperband achieved the highest testing accuracy of
84.66%, outperforming the other frameworks, although it showed the smallest improvement
in training accuracy (18.39%). With fewer parameters to optimize, it took only
18 minutes to complete the 30 epochs. Features such as the easy-to-use
interface, the ability to perform parallelization and early stopping, and multiple
runs of the optimization process led to better results for the CNN.
For Optuna and HyperOpt, the selected optimization algorithm was TPE. The achieved
training accuracies for Optuna-TPE and HyperOpt-TPE were 74.15% and 76.98%, respectively,
as shown in fig 7. The search space of both frameworks was larger than that of Keras Tuner
and BO, which increased the number of parameters and the complexity and thus resulted
in a higher computing time (Optuna took 2 hours and 32 minutes). The CNN
performance improvement with Optuna was due to its regularization and pruning
strategies: by allowing regularization, which prevents overfitting, Optuna improved the
generalization of the CNN model, and its pruning strategy eliminated unpromising
trials during the CNN optimization, which improved both the accuracy and the computing
time. HyperOpt-TPE detected the best combinations with configured values early and
improved the training accuracy by 34%, higher than all the other HPO frameworks.
It efficiently explored the larger space to identify the combinations of hyperparameters
that improved the overall performance of the CNN model.
Fig. 7. Training Accuracy comparison of all selected HPO Frameworks for CNN model
To summarize, the experimental results show that the CNN with HPO frameworks achieved significant
improvements in training accuracy and detected optimal combinations of hyperparameters.
The training accuracies with and without HPO shown in
table 7 indicate that all frameworks produced significant improvements: Keras Tuner improved by
18.39%, while BO, Optuna, and HyperOpt improved by 27.01%, 25.5%, and 34%, respectively.
Keras Tuner and BO were simple to implement for detecting the optimal configuration
of hyperparameters for a smaller task where the cost matters a lot. Although BO took
more computing time due to its poor parallelization, there is still a reasonable trade-off between
accuracy and computing time when using BO. On the other hand, Optuna and HyperOpt
are effective optimization frameworks that worked well for large spaces and reliably
detected the optimal combinations with the configured values. Larger spaces may include
unimportant hyperparameters that increase the complexity of the problem, resulting
in more computing time.
5. Previous Research
This section reviews previous research on using HPO frameworks (Optuna, BO, HyperOpt,
and Keras Tuner) for performance studies.
In (44), Optuna was used to optimize the hyperparameters of an XGB classifier for diagnosing cardiovascular
disease. Multiple hyperparameters such as n_estimators, max_depth, gamma, learning
rate, etc. were tuned to improve the evaluation performance of XGB. The model achieved
an accuracy of 94.7% on the Cleveland dataset, outperforming previous approaches.
In (45), Optuna improved the performance of a LightGBM model by tuning its hyperparameters
for the prediction of circuit impedance values; hyperparameters such as n_estimators,
learning_rate, max_depth, lambda, max_leaves, etc. were optimized. Optuna provided the
optimal values for LightGBM, which outperformed other models and achieved an R2 value of
0.79. In (46), the Optuna tool was used for a comprehensive performance study of SVM,
decision tree (DT), and RF classifiers for a multi-class problem; the RS, GS, TPE, and CMA-ES optimization
techniques within Optuna were analyzed. In (47), (48), Optuna was used to optimize the architecture and multiple hyperparameters
of DL models such as CNN and LSTM to improve their performance.
In (49), the authors attempted to select a learning algorithm together with its hyperparameters and tune
them automatically using BO; different classifiers were optimized to select an
optimal algorithm with appropriate hyperparameters, achieving better results.
In (28), BO was used to automatically select ML algorithms and their hyperparameters
in the WEKA approach; the approach was applied to multiple datasets and the improved
experimental results were compared with other optimization techniques. In (50), BO searched the configuration space of a CNN model for gastroenterology,
which resulted in 10% higher accuracy compared with the previous method.
In (49), a comparative study was conducted using HyperOpt to improve the performance of ML
classifiers. The performance of HyperOpt-BO was compared with GS and RS for DT, XGBoost,
C-SVM, RF, and NN on six different state-of-the-art datasets. Multiple
hyperparameters of each classifier were tuned, and XGBoost with HyperOpt-BO selected
the best combinations of hyperparameters and achieved high accuracy within a short
computing time. In (51), the authors improved the overall performance of SVM, AdaBoost, logistic regression,
RF, and NNs using the HyperOpt tool for drug prediction. The selected hyperparameters
were tuned over different ranges of values to obtain the configuration set; the
models were then trained with the configured hyperparameters, and 33 out of 36 models improved
their validation performance. In (52), the HyperOpt-TPE algorithm was used to tune the hyperparameters of an LSTM network
that predicts future taxi demand from the New York City dataset; the optimized LSTM
achieved an MSE of 0.172, outperforming other prediction models.
In (53), a CNN was optimized using Keras Tuner to obtain the optimal combination of hyperparameters
and achieve better performance within a short computing time; the optimized CNN achieved
94% accuracy on the fer2013 dataset for emotion detection. In (39), Keras Tuner was used to optimize a long short-term memory (LSTM) DL network;
the model's hyperparameters were optimized to build an efficient architecture, and
74.67% accuracy was achieved in predicting earthquakes.
Although automated optimization approaches perform well, manual optimization approaches
can also be effective for NNs in DL tasks. Studies such as (54), (55) used manual tuning of hyperparameters and improved the classification accuracy
by a considerable margin.
6. Conclusion
In this paper, a comparative performance evaluation study was conducted to analyze
the direct impact of the choice of hyperparameters when optimizing ML models. For
this purpose, the hyperparameters of each model were optimized using the latest HPO
frameworks: BO, Optuna, HyperOpt, and Keras Tuner. Each of these frameworks
consists of multiple state-of-the-art HPO algorithms. Two different sets of experiments were
conducted to obtain the best configurations of hyperparameters, and the resulting performance
was analyzed. First, multiple ML classifiers were optimized with the HPO frameworks on
publicly available datasets. Second, a CNN model was built and optimized with the HPO
frameworks for an image classification task. Experimental results showed that HyperOpt-TPE
achieved the highest accuracy for the ML models, while Optuna offered the best trade-off
between accuracy and computing time; for the CNN model, HyperOpt-TPE produced the largest
improvement in training accuracy among the HPO frameworks.
Acknowledgements
This work was supported by the Supercomputer Development Leading Program of the National
Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (No. 2020M3H6A1084984).
References
T. Young, D. Hazarika, S. Poria, E. Cambria, 2018, Recent Trends in Deep Learning
Based Natural Language Processing, IEEE Computational Intelligence Magazine, Vol.
13, pp. 55-75
M. I. Jordan, T. M. Mitchell, Jul 2015, Machine learning: Trends, perspectives, and
prospects, Science, Vol. 349, No. 6245, pp. 255-260
R. Elshawi, M. Maher, S. Sakr, 2019, Automated Machine Learning: State-of-The-Art
and Open Challenges, ArXiv190602287 Cs Stat
S. Abreu, 2019, Automated Architecture Design for Deep Neural Networks, ArXiv
K. He, X. Zhang, S. Ren, J. Sun, 2016, Deep Residual Learning for Image Recognition,
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778
N. Ma, X. Zhang, H.-T. Zheng, J. Sun, 2018, ShuffleNet V2: Practical Guidelines for
Efficient CNN Architecture Design, CoRR, Vol. abs/1807.11164, pp. -
J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, 2011, Algorithms for Hyper-Parameter
Optimization, Advances in Neural Information Processing Systems, Vol. 24, pp. 2546-2554
J. Bergstra, Y. Bengio, 2012, Random Search for Hyper- Parameter Optimization, J.
Mach. Learn. Res., Vol. 13, No. 10, pp. 281-305
B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, 2016, Taking the Human
Out of the Loop: A Review of Bayesian Optimization, Proc. IEEE, Vol. 104, No. 1, pp.
148-175
T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, 2019, Optuna: A Next-generation
Hyperparameter Optimization Framework, Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp. 2623-2631
J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, 2015, Hyperopt: a Python
library for model selection and hyperparameter optimization, Comput. Sci. Discov.,
Vol. 8, No. 1, pp. 014008-
P. Probst, A.-L. Boulesteix, B. Bischl, 2019, Tunability: Importance of Hyperparameters
of Machine Learning Algorithms, J. Mach. Learn. Res., Vol. 20, No. 53, pp. 1-32
M. Claesen, B. De Moor, Apr. 06, 2015, Hyperparameter Search in Machine Learning,
arXiv
H. J. P. Weerts, A. C. Mueller, J. Vanschoren, Jul. 15, 2020, Importance of Tuning
Hyperparameters of Machine Learning Algorithms, arXiv
V. Nair, G. E. Hinton, 2010, Rectified linear units improve restricted boltzmann machines,
in Proceedings of the 27th International Conference on International Conference on
Machine Learning, Madison, WI, USA, pp. 807-814
J. Brownlee, Jan. 22, 2019, How to Configure the Learning Rate When Training Deep
Learning Neural Networks, Machine Learning Mastery
Y. Bengio, 2012, Practical Recommendations for Gradient-Based Training of Deep Architectures,
in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G. B. Orr, and
K.-R. Müller, Eds. Berlin, Heidelberg: Springer, pp. 437-478
S. Agrawal, 2021, Hyperparameters in Deep Learning, Medium
[Coursera] Neural Networks for Machine Learning (University of Toronto) (neuralnets)
D. P. Kingma, J. Ba, 2015, Adam: A Method for Stochastic Optimization, in 3rd International
Conference on Learning Representations, San Diego, CA, USA, May 7-9, 2015, Conference
Track Proceedings
J. Duchi, E. Hazan, Y. Singer, 2011, Adaptive Subgradient Methods for Online Learning
and Stochastic Optimization, Journal of machine learning research, Vol. 12, No. 7,
pp. 39-
P. Liashchynskyi, P. Liashchynskyi, 2019, Grid search, random search, genetic algorithm:
a big comparison for NAS, arXiv preprint arXiv:1912.06059
M. A. J. Idrissi, H. Ramchoun, Y. Ghanou, M. Ettaouil, 2016, Genetic algorithm for
neural network architecture optimization, in 2016 3rd International Conference on
Logistics Operations Management (GOL), pp. 1-4
J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, 2011, Algorithms for Hyper-Parameter
Optimization, in Advances in Neural Information Processing Systems, Vol. 24
R. Joseph, 2018, Grid Search for model tuning, Medium
M.-A. Zöller, M. F. Huber, 2021, Benchmark and Survey of Automated Machine Learning
Frameworks, ArXiv190412054 Cs Stat
A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, 2017, Fast Bayesian Optimization
of Machine Learning Hyperparameters on Large Datasets, in Proceedings of the 20th
International Conference on Artificial Intelligence and Statistics, pp. 528-536
M. Seeger, 2004, Gaussian processes for machine learning, Int. J. Neural Syst., Vol.
14, No. 2, pp. 69-106
F. Hutter, H. H. Hoos, K. Leyton-Brown, 2011, Sequential Model-Based Optimization
for General Algorithm Configuration, in Learning and Intelligent Optimization, pp.
507-523
D. Maclaurin, D. Duvenaud, R. Adams, 2015, Gradient-based Hyperparameter Optimization
through Reversible Learning, in Proceedings of the 32nd International Conference on
Machine Learning, pp. 2113-2122
A. S. Wicaksono, A. A. Supianto, 2018, Hyper Parameter Optimization using Genetic
Algorithm on Machine Learning Methods for Online News Popularity Prediction, Int.
J. Adv. Comput. Sci. Appl. IJACSA, Vol. 9, No. 12, pp. 33-31
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, 2022, Hyperband: Bandit-Based
Configuration Evaluation for Hyperparameter Optimization, International Conference
on Learning Representations
2022, GitHub - fmfn/BayesianOptimization: A Python implementation of global optimization
with gaussian processes, https://github.com/fmfn/BayesianOptimization
Mar. 18, 2022, Optuna: A hyperparameter optimization framework, optuna
Jan. 12, 2023, Hyperopt: Distributed Hyperparameter Optimization, hyperopt
K. Team, Jan. 13, 2023, Keras documentation: KerasTuner
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, 2017, Hyperband: A
Novel Bandit-Based Approach to Hyperparameter Optimization, pp. 6765-6816
Md. H. A. Banna, 2021, Attention-Based Bi-Directional Long-Short Term Memory Network
for Earthquake Prediction, IEEE Access, Vol. 9, pp. 56589-56603
Murat Koklu, Ilker Ali Ozkan, 2020, Multiclass classification of dry beans using computer
vision and machine learning techniques, Comput. Electron. Agric., Vol. 174, pp. 105507-
İ. Çinar, M. Koklu, P. D. Ş. Taşdemi̇r, Dec. 2020, Classification of Raisin Grains
Using Machine Vision and Artificial Intelligence Methods, Gazi Mühendis. Bilim. Derg.,
Vol. 6, No. 3, pp. -
L. Candillier, V. Lemaire, Aug. 2013, Active learning in the real-world design and
analysis of the Nomao challenge, in The 2013 International Joint Conference on Neural
Networks (IJCNN), pp. 1-8
A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012, ImageNet Classification with Deep
Convolutional Neural Networks, in Advances in Neural Information Processing Systems,
Vol. 25
P. Srinivas, R. Katarya, Mar. 2022, hyOPTXg: OPTUNA hyper- parameter optimization
framework for predicting cardiovascular disease using XGBoost, Biomed. Signal Process.
Control, Vol. 73, pp. 103456-
J.-P. Lai, Y.-L. Lin, H.-C. Lin, C.-Y. Shih, Y.-P. Wang, P.-F. Pai, Feb. 2023, Tree-Based
Machine Learning Models with Optuna in Predicting Impedance Values for Circuit Analysis,
Micromachines, Vol. 14, No. 2, pp. -
J. Joy, M. P. Selvan, 2022, A comprehensive study on the performance of different
Multi-class Classification Algorithms and Hyperparameter Tuning Techniques using Optuna,
in 2022 International Conference on Computing, Communication, Security and Intelligent
Systems (IC3SIS), pp. 1-5
Y. Nishitsuji, J. Nasseri, Mar. 2022, LSTM with forget gates optimized by Optuna for
lithofacies prediction,
I. Ekundayo, 2020, OPTUNA Optimization Based CNN-LSTM Model for Predicting Electric
Power Consumption, masters, Dublin, National College of Ireland
S. Putatunda, K. Rama, 2018, A Comparative Analysis of Hyperopt as Against Other Approaches
for Hyper-Parameter Optimization of XGBoost, in Proceedings of the 2018 International
Conference on Signal Processing and Machine Learning, Shanghai China, pp. 6-10
R. J. Borgli, H. Kvale Stensland, M. A. Riegler, P. Halvorsen, 2019, Automatic Hyperparameter
Optimization for Transfer Learning on Medical Image Datasets Using Bayesian Optimization,
in 2019 13th International Symposium on Medical Information and Communication Technology
(ISMICT), pp. 1-6
J. Zhang, Q. Wang, W. Shen, Dec 2022, Hyper-parameter optimization of multiple machine
learning algorithms for molecular property prediction using hyperopt library, Chin.
J. Chem. Eng., Vol. 52, No. , pp. -
N. Schwemmle, T.-Y. Ma, May 2021, Hyperparameter Optimization for Neural Network based
Taxi Demand Prediction, presented at the BIVEC-GIBET Benelux Interuniversity Association
of Transport Researchers: Transport Research Days 2021
B. Abdellaoui, A. Moumen, Y. Idrissi, A. Remaida, 2021, Training the Fer2013 Dataset
with Keras Tuner., pp. 412-
A. Jafar, M. Lee, 2021, High-speed hyperparameter optimization for deep ResNet models
in image recognition, in Cluster Computing, pp. 1-9
A. Jafar, L. Myungho, Aug. 2020, Hyperparameter Optimization for Deep Residual Learning
in Image Classification, in 2020 IEEE International Conference on Autonomic Computing
and Self-Organizing Systems Companion (ACSOS-C), pp. 24-29
About the Authors
Abbas Jafar received his B.S. in Software Engineering from Government College University
Faisalabad, Pakistan.
He joined Myongji University, Korea for a Master's degree, which he completed, and he is
now enrolled in the Ph.D. program. Currently, he is a Research Assistant
in the HPC Lab at Myongji University.
His research interests are AI in healthcare systems, deep learning, high-performance
computing, and performance optimization, with a special interest in GPU computing.
Myungho Lee received his B.S. in Computer Science and Statistics from Seoul National
University, Korea, and his M.S. in Computer Science and Ph.D. in Computer Engineering from the
University of Southern California, USA.
He was a Staff Engineer in the Scalable Systems Group at Sun Microsystems, Inc., Sunnyvale,
California, USA.
He is currently a Full Professor in the Dept. of Computer Science & Engineering at
Myongji University.
His research interests are in High-Performance Computing: architecture, compiler,
and applications.