3.1 The M-SSD model
                  	
In this paper, an improved SSD model is designed. The SSD approach uses a feed-forward convolutional network to produce a fixed-size collection of bounding boxes, together with scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step that yields the final detections. The model uses the Visual Geometry Group network (VGG-16) as its base structure. However, it discards the last fully connected layers and adds a set of auxiliary convolutional layers that extract features at multiple scales, with the feature map size decreasing at each subsequent layer. Compared with other existing algorithms, it can improve the detection accuracy for small objects. With this kind of model structure, however, the number of network weights is large, much disk space is required, and the detection speed is slow. Therefore, it is not suitable for real-time detection systems with limited computing power and small storage.
                     
                  
                  
Wei Liu (17) analyzed the SSD model structure and pointed out that the forward-pass time is spent mainly in the base network (nearly 80%). Therefore, for real-time applications, using a faster base network can reduce the amount of computation and greatly improve the speed. ResNet (18) was first proposed by Kaiming He and has proven to be an efficient network. Lili Chen (19) replaced the base feature extraction model with ResNet-34 and achieved a fast detection speed in vehicle counting. Note that, in our single former USV object detection system, it is unnecessary to use too many network layers for feature extraction. We therefore choose ResNet-18 as the base feature extraction network in order to obtain real-time detection performance.
                     
                  
                  
The whole ResNet-18 model structure comprises a convolutional layer, four basic block layers and a final fully connected layer, as shown in detail in Fig. 1. This structure avoids the vanishing gradient problem caused by deepening the neural network. The introduced basic blocks also improve its efficiency.
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 1. The flowchart of ResNet-18
                                 
                              
                            
                        
                     
                     
In real-time object detection tasks, large and excessive convolution kernels increase the computational cost, dilute the effective features and reduce the real-time control accuracy. The authors of (20), (21) show that kernels of size 1×1 and 3×3 have fewer parameters but stronger feature generalization abilities than kernels of size 5×5 and 7×7. In addition, a block of two convolutional layers with a 3×3 kernel size covers the same receptive field as one 5×5 convolutional layer as the convolutional window scans the input. The original throughput is kept, the number of parameters is smaller, and the stacked convolutional layers yield a better result.
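The parameter saving from stacking 3×3 kernels can be checked with a short calculation. The sketch below is illustrative only: the helper names and the channel width of 64 are assumptions chosen for the example, not values taken from the M-SSD model.

```python
from typing import Sequence


def conv_params(kernel: int, in_ch: int, out_ch: int, bias: bool = True) -> int:
    """Number of learnable weights in one 2-D convolutional layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def receptive_field(kernels: Sequence[int]) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for k in kernels:
        field += k - 1
    return field


channels = 64  # hypothetical channel width, used only for illustration

single_5x5 = conv_params(5, channels, channels)
stacked_3x3 = 2 * conv_params(3, channels, channels)

# Both configurations see a 5x5 input window.
print(receptive_field([5]), receptive_field([3, 3]))
# The stacked 3x3 layers need fewer weights than the single 5x5 layer.
print(single_5x5, stacked_3x3)
```

With equal input and output channel counts, the two stacked layers hold 18 kernel weights per channel pair against 25 for the single 5×5 layer, which is where the saving comes from.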
                     
                  
                  
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 1. Parameters of M-SSD from FC6 to conv9_2 layer
                                 
                              
                           
                           
                              
                              
                              
                                    
                                       
| Layer | Input size | Output size | Kernel size | Input channel | Output channel |
| --- | --- | --- | --- | --- | --- |
| FC6 | 38×38 | 19×19 | 3×3 | 256 | 512 |
| FC7 | 19×19 | 19×19 | 1×1 | 512 | 512 |
| conv6_1 | 19×19 | 10×10 | 1×1 | 512 | 256 |
| conv6_2 | 19×19 | 10×10 | 3×3 | 128 | 256 |
| conv7_1 | 10×10 | 5×5 | 1×1 | 256 | 128 |
| conv7_2 | 10×10 | 5×5 | 3×3 | 64 | 128 |
| conv8_1 | 5×5 | 3×3 | 1×1 | 128 | 128 |
| conv8_2 | 5×5 | 3×3 | 3×3 | 64 | 64 |
| conv9_1 | 3×3 | 1×1 | 3×3 | 128 | 128 |
| conv9_2 | 3×3 | 1×1 | 1×1 | 64 | 64 |
                                    
                                 
                              
                           
                         
                        
                     
                     
Inspired by these methods from the literature, two modifications are made to the original SSD model: (a) we retain the SSD structure but replace VGG-16 with ResNet-18 as the base feature extraction network, followed by several convolutional layers to detect the object; (b) we replace the convolutional kernels from the FC6 to conv9_2 layers and use convolutional kernels of size 1×1 to classify the object. The layer specifications from FC6 to conv9_2 are presented in detail in Table 1, while the M-SSD model structure is presented in Fig. 2.
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 2. The overall structure of the M-SSD
                                 
                              
                            
                        
                     
                     
In contrast to the SSD model, we choose the res3d, fc6, fc7, conv6_1, conv7_1, conv8_1 and conv9_1 layers as the regression feature map layers to classify the object. In each feature map layer, 1×1 represents the size of the convolutional kernel, 3 or 6 represents the number of prior boxes and 4 represents the number of bounding box values.
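As a rough illustration of how these numbers combine, the sketch below computes the output channel counts of the 1×1 prediction convolutions for 3 and 6 prior boxes. It assumes the usual SSD head layout, in which each prior box receives 4 box values and one score per class. The function name and the class count of 2 are hypothetical and are not stated in the text.

```python
from typing import Tuple


def head_channels(num_priors: int, num_classes: int) -> Tuple[int, int]:
    """Output channels of the 1x1 localization and classification heads."""
    loc = num_priors * 4             # 4 bounding box values per prior box
    conf = num_priors * num_classes  # one score per class per prior box
    return loc, conf


NUM_CLASSES = 2  # hypothetical: one object class plus background

for priors in (3, 6):
    loc, conf = head_channels(priors, NUM_CLASSES)
    print(f"{priors} prior boxes -> {loc} box channels, {conf} score channels")
```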
                     
                     
                  
                  
                     Afterwards, the M-SSD model parameters are set for the proposed real-time detection
                     system as follows:
                     
                  
                  
Ⅰ. Select default box parameters: in a CNN, the feature maps located in different layers have receptive fields of different sizes. To correctly detect moving targets at different scales, some algorithms convert the input image to different scales, then process the converted images and fuse the detection results (22), (23). The strategy proposed in (24) is based on the fact that the default frames do not need to be mapped one to one with the receptive fields of the feature maps.
                     
                  
                  
The default frames at different positions correspond to different regions and target sizes. Assuming that $m$ feature maps are used for prediction, the default frame size in the $k$-th feature map is calculated as:

$$S_{k}=S_{\min}+\frac{S_{\max}-S_{\min}}{m-1}(k-1),\quad k\in[1,m]$$

where $S_{\min}=0.1$ is the default frame size of the lowest layer and $S_{\max}=0.96$ is the default frame size of the highest layer in the network structure.
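The sketch below illustrates how the default frame sizes could be computed. It assumes the sizes are spaced linearly between $S_{\min}$ and $S_{\max}$ over the $m$ feature maps, as in the original SSD formulation. The function name is hypothetical, and the choice of $m=7$ (one size per regression feature map layer named earlier in this section) is an assumption.

```python
from typing import List


def default_scales(m: int, s_min: float = 0.1, s_max: float = 0.96) -> List[float]:
    """Linearly spaced default frame sizes for m feature maps (k = 1..m)."""
    if m < 2:
        raise ValueError("at least two feature maps are needed")
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * (k - 1) for k in range(1, m + 1)]


# Assumed m = 7: one size per regression feature map layer.
scales = default_scales(7)
for k, s in enumerate(scales, start=1):
    print(f"feature map {k}: default frame size {s:.3f}")
```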
                     
                  
                  
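The sizes of the layers in between are spaced at regular intervals. The width-to-height ratio of the default frame is $a_{r}\in\{1,\: 2,\: 3,\: 1/2,\: 1/3\}$. With $S_{k}$ denoting the default frame size of the $k$-th feature map, the width and height of each default frame are respectively given by:

$$w_{k}^{a}=S_{k}\sqrt{a_{r}},\qquad h_{k}^{a}=\frac{S_{k}}{\sqrt{a_{r}}}$$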
                     
                     
Ⅱ. Choose the matching strategy: when the M-SSD detection model is generated, this strategy selects a default box to match each true label box. By re-adjusting the Jaccard overlap coefficient, it then finds, for each true label, the candidate default box with the highest Jaccard overlap.
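A minimal sketch of the highest-Jaccard matching step is given below. The box format `(x_min, y_min, x_max, y_max)`, the function names and the sample boxes are assumptions made for illustration. The re-adjustment of the overlap coefficient mentioned above is not modeled.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def jaccard(a: Box, b: Box) -> float:
    """Jaccard overlap (intersection over union) of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_default_box(truth: Box, defaults: Sequence[Box]) -> Tuple[int, float]:
    """Index and overlap of the default box that best matches a true label box."""
    overlaps = [jaccard(truth, d) for d in defaults]
    best = max(range(len(defaults)), key=overlaps.__getitem__)
    return best, overlaps[best]


# Hypothetical default boxes and true label box in normalized coordinates.
defaults = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75), (0.5, 0.5, 1.0, 1.0)]
truth = (0.3, 0.3, 0.7, 0.7)

index, overlap = best_default_box(truth, defaults)
print(index, round(overlap, 3))
```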
                     
                  
                  
Ⅲ. Select the loss function: the Softmax loss $l_{i}= -\log(e^{S_{y_i}}/\sum_{j} e^{S_{j}})$ is selected as the loss function, where $S_{j}$ is the score of class $j$ and $y_{i}$ is the true label of the real object. The total loss function $L$ is then given by:
                     
                  
                  
                     
                     
                     
                     
                        
                        
                        
                        
                        
                     
                     
                     where $N$ is the total number of images. 
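The per-sample Softmax loss $l_{i}$ can be sketched in a few lines. The example below evaluates it in a numerically stable way and, as an assumption, takes the total loss to be the mean of $l_{i}$ over the $N$ images. Any additional terms in the total loss are not modeled, and the function names are hypothetical.

```python
import math
from typing import Sequence


def softmax_loss(scores: Sequence[float], true_label: int) -> float:
    """l_i = -log(exp(S_{y_i}) / sum_j exp(S_j)), computed stably."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - scores[true_label]


def mean_loss(batch_scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Assumed total loss: the average of l_i over the N samples."""
    losses = [softmax_loss(s, y) for s, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# Two hypothetical samples with two class scores each.
print(mean_loss([[2.0, 0.5], [0.1, 1.2]], [0, 1]))
```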
                     
                  
                  
An objective function always exists during model training. The loss function is optimized until the loss value reaches its minimum. The M-SSD training model is developed with the TensorFlow deep learning framework.
                     
                  
                  
                     Based on this design, the algorithm complexity is reduced. The advantage of the proposed
                     design will be shown in the following comparative analysis.
                     
                  
                  
We use SSD for object detection because the SSD network framework is designed to be independent of the base network and is used to accurately classify and locate targets. It can run on any base network (such as VGG, ResNet or MobileNet). Therefore, we can use different base networks for neural network learning and different numbers of regression layers (from 6 to 8) to estimate their accuracy. It is a very useful neural network framework for improving detection accuracy and speed. YOLO and its improved versions, YOLO v3 and YOLO v5, have been proposed for multi-object detection. However, for real-time detection, they are mainly applied to tasks on mobile terminals. The SSD network framework is still a better choice, since its overall balance of accuracy and speed is particularly outstanding when it is used as a lightweight network to detect objects.
                     	
                  
                
               
                     3.2 M-SSD model training/testing
                  	
The next step consists of training and testing the proposed M-SSD model for object detection. The hardware specifications of the experimental environment are shown in Table 2. The M-SSD model is trained on the CPU with 16 GB of RAM, and the GPU greatly improves the training speed. Note that library functions from CUDA 10.0/cuDNN 8.0.0 and platforms such as Python 3.6/TensorFlow 1.8 are used to train the model quickly and effectively. The trained model runs on the Ubuntu 18.04 operating system and uses a camera with a resolution of 1024×768 to capture objects in real time.
                     
                  
                  
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 2. Hardware specification
                                 
                              
                           
                           
                              
                              
                              
                                    
                                       
| Hardware device | Parameter |
| --- | --- |
| CPU | Intel(R) Core(TM) i7-8750H |
| RAM | 16GB |
                                    
                                    
                                          | 
                                             
                                          			
                                           GPU 
                                          			
                                        | 
                                       
                                             
                                          			
                                           NVIDIA GeForce GTX1060 
                                          			
                                        | 
                                    
                                    
                                          | 
                                             
                                          			
                                           Operate system 
                                          			
                                        | 
                                       
                                             
                                          			
                                           Ubuntu 18.04 
                                          			
                                        | 
                                    
                                    
                                          | 
                                             
                                          			
                                           CUDA/CUDNN 
                                          			
                                        | 
                                       
                                             
                                          			
                                           CUDA 10.0/CUDNN 8.0.0 
                                          			
                                        | 
                                    
                                    
                                          | 
                                             
                                          			
                                           Platform 
                                          			
                                        | 
                                       
                                             
                                          			
                                           Python, TensorFlow 
                                          			
                                        | 
                                    
                                    
                                          | 
                                             
                                          			
                                           Camera 
                                          			
                                        | 
                                       
                                             
                                          			
                                           USB HD, resolution1024×768 
                                          			
                                        | 
                                    
                                 
                              
                           
                         
                        
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 3. Parameter initialization

| Parameters | Value |
| --- | --- |
| base_lr | 0.0001 |
| max_iter | 50000 |
| lr_policy | Step |
| Gamma | 0.1 |
| Momentum | 0.9 |
| weight_decay | 0.0005 |
| image_size | 300×300 |
| Type | SGD |
| BN | 32 |
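
The step learning-rate policy and the SGD settings listed in Table 3 can be illustrated with a short sketch. It is not the training code used for the experiments: the function names are hypothetical, the step interval `step_size` is not given in the paper and is set to an arbitrary value here, and the momentum update is written in one common formulation that is assumed rather than confirmed by the source.

```python
def step_learning_rate(iteration, base_lr=0.0001, gamma=0.1, step_size=20000):
    """Step policy: multiply the learning rate by gamma every step_size iterations.

    base_lr and gamma come from Table 3; step_size is an assumed value,
    because the paper does not report it.
    """
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay for a scalar weight.

    Assumed formulation: v <- momentum * v - lr * (grad + weight_decay * w),
    followed by w <- w + v.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


# Learning rate at the start, after one step, and near max_iter = 50000.
for it in (0, 20000, 49999):
    print(it, step_learning_rate(it))

# A single update on a toy scalar weight with a made-up gradient.
new_weight, new_velocity = sgd_momentum_step(
    weight=1.0, grad=0.5, velocity=0.0, lr=step_learning_rate(0)
)
print(new_weight, new_velocity)
```

With a different `step_size` the learning rate drops at different iterations, but the shape of the schedule stays the same.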
                                    
                                 
                              
                           
                         
                        
                     
                     
                     An image database containing 2000 images was built. These images were collected under
                     different external environments and illumination intensities, with a 3:1 ratio of positive
                     images (containing the USV) to negative images (without the object). Some of the images
                     were flipped, stretched or compressed to make the data set more general. 80% of the images
                     were used for training, and the remaining 20% were used for testing the network. The images
                     captured by the camera were resized to 300×300 before being input to the base network. The
                     model is trained using stochastic gradient descent (SGD) with an initial learning rate
                     (base_lr) of 0.0001, a momentum of 0.9, a weight decay of 0.0005 and a batch size of 32
                     (listed as BN in Table 3). The network was trained for 50,000 iterations and converged
                     successfully. The other parameters are detailed in Table 3.
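
The data set proportions described above can be sketched as follows. The paper does not state how the images were assigned to the two subsets, so the seeded random shuffle, the function name `split_dataset` and the integer labels are assumptions made only for illustration.

```python
import random


def split_dataset(num_images=2000, positive_ratio=0.75, train_fraction=0.8, seed=0):
    """Label image indices 3:1 (positive:negative) and split them 80/20.

    The random split is an assumption; the paper only reports the proportions.
    """
    num_positive = round(num_images * positive_ratio)
    # Label 1 marks an image containing the USV, label 0 an image without it.
    samples = [(i, 1 if i < num_positive else 0) for i in range(num_images)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(samples)
    num_train = round(len(samples) * train_fraction)
    return samples[:num_train], samples[num_train:]


train_set, test_set = split_dataset()
print(len(train_set), len(test_set))
```

With the figures reported in the text, this yields 1600 training images and 400 test images, drawn from 1500 positive and 500 negative images.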
                     
                  
                  
                     Some of the labeled images used for training/validation are illustrated in Fig. 3. The
                     experiment was carried out in a pool area of Kyungnam University in South Korea. The
                     training/validation accuracy of the proposed model is presented in Fig. 4. It can be seen
                     that the classification accuracy reaches 96.75%. Some classification and accuracy results
                     for successful detections are shown in Fig. 5.
                     
                  
                  
                     To evaluate the performance of the proposed detection system, the following four evaluation
                     criteria are used:
                     
                  
                  
                     
                     
                     
                     
                        
                        
                        
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                        
                     
                     
                     
                     
                     
                        
                        
                        
                        
                        
                     
                     
                     where $a$ and $n$ respectively represent the number of misclassified samples and the
                     total number of samples, TP (true positive) is a positive sample that is correctly
                     detected, FP (false positive) is a negative sample that is wrongly predicted to be
                     positive (a false alarm), FN (false negative) is a positive sample that is wrongly
                     predicted to be negative (a missed detection), and TN (true negative) is a negative
                     sample that is correctly predicted to be negative.
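
As a minimal sketch of how the four criteria (precision, recall, accuracy and F1) can be computed from these counts, the following assumes the standard definitions: precision $= TP/(TP+FP)$, recall $= TP/(TP+FN)$, accuracy $= 1 - a/n$ with $a = FP + FN$ and $n = TP + FP + FN + TN$, and F1 as the harmonic mean of precision and recall. The function name `detection_metrics` is hypothetical, and the example counts are made up for illustration; they are not results from this paper.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    n = tp + fp + fn + tn  # total number of samples
    a = fp + fn            # number of misclassified samples
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = 1.0 - a / n if n else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts, used only to exercise the function.
metrics = detection_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, accuracy computed as $1 - a/n$ is identical to $(TP + TN)/n$.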
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 3. Part of the images used for training
                                 
                              
                            
                        
                     
                     
                     The proposed M-SSD model is compared with SSD (10), R-SSD (24) and F-SSD (18), using the
                     four criteria mentioned above: precision, recall, accuracy and F1. The results are shown
                     in Fig. 6. It can be observed that the proposed M-SSD model achieves a higher detection
                     performance than SSD, reaching an accuracy of 96.75%. This is because ResNet-18, whose
                     residual structure extracts features more effectively, is used to extract the basic
                     feature information. However, M-SSD has a lower detection performance than R-SSD and
                     F-SSD, because the proposed model has fewer layers than R-SSD (ResNet-50) and F-SSD
                     (ResNet-34). This in turn indicates that a higher accuracy requires deeper network
                     layers. However, a higher accuracy alone does not mean a better overall detection
                     performance. The computation time, given in Table 4, is another measure of performance.
                     It can be seen from Table 4 that the computation time of the proposed M-SSD model is
                     424.36 s, which is 26.35% less than that of the SSD model, and much less than that of
                     R-SSD and F-SSD. Compared with SSD, the proposed design improves both the detection
                     performance and the detection speed. It can also be implemented on mobile terminals
                     such as the Raspberry Pi and Jetson Nano.
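
The relative reduction quoted above follows directly from the values in Table 4. The short check below recomputes it for each baseline; the helper name `relative_reduction` is illustrative only.

```python
# Computation times in seconds, copied from Table 4.
computation_time = {
    "SSD": 576.25,
    "R-SSD": 824.36,
    "F-SSD": 720.64,
    "M-SSD": 424.36,
}


def relative_reduction(baseline, proposed):
    """Return the percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    name: relative_reduction(t, computation_time["M-SSD"])
    for name, t in computation_time.items()
    if name != "M-SSD"
}
for name, value in reductions.items():
    print(f"M-SSD vs {name}: {value:.2f}% less computation time")
```

For SSD this gives about 26.36%, which agrees with the reported 26.35% up to rounding. The reductions relative to F-SSD and R-SSD are larger.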
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 4. Accuracy results of the proposed model
                                 
                              
                            
                        
                     
                     
                     
                        
                        
                              
                              
Fig. 5. Output of the M-SSD testing
                                 
                              
                            
                        
                     
                     
                     For our collected USV data set, SSD runs at about 67 FPS with an input resolution of
                     300×300, and the proposed M-SSD model runs at about 86 FPS with the same input resolution.
                     When the trained model is deployed on the Jetson Nano mobile terminal, the proposed model
                     runs at about 32 FPS, which achieves real-time former USV detection.
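
A frame rate of this kind is typically obtained by timing repeated inference calls. The sketch below shows one way to do so under stated assumptions: `measure_fps` is a hypothetical helper, `dummy_detector` is a stand-in for the trained model, and the small placeholder frames replace real camera images. It does not reproduce the measurement procedure used in the paper.

```python
import time


def measure_fps(infer, frames, warmup=5):
    """Return the average number of frames processed per second by `infer`."""
    # Warm-up calls are excluded from timing, since first calls are often slower.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_detector(frame):
    """Stand-in for the detector; a real model would return boxes and scores."""
    return sum(frame)


# Placeholder frames; a real test would use 300×300 camera images.
frames = [[0] * 1000 for _ in range(50)]
fps = measure_fps(dummy_detector, frames)
print(f"{fps:.1f} FPS")
```

Measured in this way, the reported figures correspond to a speed-up of roughly 86/67 ≈ 1.28 for M-SSD over SSD at the same input resolution.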
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 6. Performance comparison of different models
                                 
                              
                            
                        
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 4. Computation time of the methods (s)

| Method | Basic network | Time |
| --- | --- | --- |
| SSD (10) | VGG-16 | 576.25 |
| R-SSD (24) | ResNet-50 | 824.36 |
| F-SSD (18) | ResNet-34 | 720.64 |
| M-SSD | ResNet-18 | 424.36 |
                                    
                                 
                              
                           
                         
                        
                     
                     
                     Some of the images on which detection failed are shown in Fig. 7. It can be observed that
                     indistinct object characteristics and sharp changes of the ambient light around the
                     detected object may cause a detection failure. Another labeled image data set is used to
                     verify this conjecture and to train a high-accuracy model for further studies. This data
                     set mainly comprises images on which detection previously failed, as well as images
                     collected in similar environments. A part of the new data set is shown in Fig. 8. The
                     re-training loss for the new data set is presented in Fig. 9. It can be seen from Fig. 9
                     that the training loss remains slightly high during the re-training process, and a better
                     training loss could not be obtained after 50,000 iterations. This is because the features
                     in the data set are unclear, so the basic network structure cannot extract enough features
                     of the USV object to train the model. This leads to a low object classification accuracy.
                     
                     
                  
                  
                     In summary, for blurred or unclear images, the network cannot learn enough features
                     and the loss function cannot converge to zero. It is concluded that images with clear
                     features are required to train the model so that the network can achieve good accuracy.
                     For former USV object detection, the proposed model achieves a higher detection accuracy
                     and a faster speed than the original SSD model by replacing the basic network VGG-16 with
                     ResNet-18 and using 1×1 convolutional kernels to return 6 feature maps. Although there is
                     no significant improvement in accuracy, the computation time is 26.35% less than that of
                     the original SSD structure. In addition, the model has the advantage that it can run on
                     platforms with limited computing capability.
                     
                  
                  
                     
                     
                        
                        
                              
                              
Fig. 7. Part of the failure detection images
                                 
                              
                            
                        
                     
                     
                     
                        
                        
                              
                              
Fig. 8. New data set for training images
                                 
                              
                            
                        
                     
                     
                     
                        
                        
                              
                              
Fig. 9. The re-train loss for the new data set