Abstract: Scab, frogeye spot, and cedar rust are three common types of apple leaf diseases, and their rapid diagnosis and accurate identification play an important role in the development of apple production. In this work, an improved model based on VGG16 is proposed to identify apple leaf diseases, in which a global average pooling layer is used to replace the fully connected layers to reduce the parameters and a batch normalization layer is added to improve the convergence speed. A transfer learning strategy is used to avoid a long training time. The experimental results show that the overall accuracy of apple leaf classification based on the proposed model can reach 99.01%. Compared with the classical VGG16, the model parameters are reduced by 89%, the recognition accuracy is improved by 6.3%, and the training time is reduced to 0.56% of that of the original model. Therefore, the deep convolutional neural network model proposed in this work provides a better solution for the identification of apple leaf diseases with higher accuracy and a faster convergence speed.
Keywords: apple leaf diseases; transfer learning; deep learning; convolutional neural networks
1. Introduction
Leaf diseases are one of the main obstacles to apple production. Among them, scab, frogeye spot, and cedar rust are the three most common types of apple leaf diseases and have a harmful impact on apple growing. Therefore, the detection of apple leaf diseases has attracted more and more attention, and the early identification of apple leaf disease is very important for timely treatment. In the past, disease identification methods were generally divided into manual identification and expert systems. However, both are highly dependent on fruit growers and experts, are time-consuming, and usually generalize poorly.
With the development of machine learning methods, some computational models have been proposed for plant disease diagnosis based on different algorithms. Some studies have found diseased regions by K-means clustering-based segmentation and built disease recognition models using supervised learning methods, including the random forest, support vector machine (SVM), and K-nearest neighbor methods [1–3]. Rothe et al. used an active contour model for image segmentation and extracted Hu’s moments as features for the training of an adaptive neuro-fuzzy inference system, by which a classification accuracy of 85% can be achieved [4]. Gupta et al. proposed an autonomously modified SVM-CS model in which an SVM model was trained and optimized using the concept of a cuckoo search [5]. However, these classification features depend heavily on manual selection, and the recognition rates are not satisfactory.
In recent years, convolutional neural networks (CNNs) have shown good results in recognition tasks by reducing the need for image preprocessing and improving the identification accuracy [6–13]. Leaf disease recognition based on CNNs has become a new hotspot in the agricultural informatization area [14–16]. Lu et al. proposed a rice disease identification method based on deep CNN techniques and achieved an accuracy of 95.48% on a dataset of 500 natural images of diseased and healthy rice leaves [17]. Zhang et al. proposed improved GoogLeNet and Cifar10 models and obtained average identification accuracies of 98.9% and 98.8%, respectively [18]. Liu et al. designed a novel architecture based on AlexNet to detect apple leaf diseases, and the experimental results showed that this approach achieved an overall accuracy of 97.62% for disease identification [19]. Although the recognition accuracy of these CNN models is higher than that of traditional machine learning methods, there are still some shortcomings, such as high model complexity, many more parameters, and a long training time, which prevent their application in real environments.
In this work, we propose a method for apple leaf disease identification based on an improved deep convolutional neural network architecture that can effectively reduce the model complexity and training time. The proposed network adopts the concept of transfer learning, starting from a pre-trained VGG16 network, and adjusts the network structure by removing the three fully connected layers and adding a global average pooling layer, a batch normalization layer, and a fully connected layer. On a benchmark dataset, the proposed model, which has 89% fewer parameters than the original VGG16 model, greatly reduced the training time and achieved a higher accuracy rate.
The dataset in this work is from the “2018 ’AI Challenger’ Global Challenge” and includes 10 kinds of plants with 27 categories of diseases. This work addresses the automatic identification of apple leaf diseases; therefore, only apple leaves are selected from this dataset. There are four categories of apple leaf images within the dataset, and Figure 1 shows some of them. In addition to healthy leaves, three types of disease images (scab, frogeye spot, and cedar rust) are collected within the dataset. Typically, the lesions on scab leaves are gray-brown and nearly round or radial, frogeye spot lesions are tan and shaped like flakes or dots, and cedar rust leaves have round orange-yellow lesions with red edges. Some frogeye spot and cedar rust lesions are similar in color and shape, which increases the difficulty of recognition by computational methods.
In this work, 2446 pictures are collected within our dataset, of which 1340 are healthy, 411 are scab, 487 are frogeye spot, and 208 are cedar rust. The original dataset was divided into two subsets: 2141 pictures for model training and the remaining 305 for testing. The details of the dataset are shown in Table 1.
With the rapid development of deep learning, CNNs have been applied widely in different fields, especially in image classification and recognition and in target location and detection [20]. A CNN is a special multi-layer perceptron (MLP), or multilayered feed-forward neural network, which generally consists of an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. The convolution layer realizes dimensionality reduction and feature extraction through two design concepts: local perception and parameter sharing. The pooling layer reduces the size of the data; this down-sampling is also invariant to local linear transformations, which enhances the generalization ability of convolutional neural networks. The fully connected layer acts as a classifier for the whole neural network. It is common for multiple fully connected layers to be used after several rounds of convolution, with the output of the last convolutional layer flattened before it is passed to them [21,22].
The VGG16 contains 13 convolutional layers with very small receptive fields, 3 × 3, and five max-pooling layers of size 2 × 2 for carrying out spatial pooling, followed by three fully connected layers. A classical VGG16 model involves about 138 million parameters, where rectification nonlinearity (ReLU) activation is applied to all hidden layers and the softmax function is applied in the final layer [23]. The model also uses dropout regularization in the fully connected layers. A schematic of the VGG16 architecture is shown in Figure 2, where the marked red box shows a classifier consisting of three fully connected layers.
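The size of the classical network can be checked with a short count. The following is a minimal sketch, assuming the standard VGG16 configuration (13 convolutional layers of 3 × 3 kernels in five blocks, then three fully connected layers), a 224 × 224 RGB input, and the 1000-class ImageNet output; the function and constant names are illustrative only.

```python
# Channel widths of the 13 convolutional layers in the standard VGG16,
# grouped into the five blocks that are each followed by 2 x 2 max pooling.
VGG16_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
                [512, 512, 512], [512, 512, 512]]


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one kernel x kernel convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def vgg16_param_counts(input_size=224, n_classes=1000):
    """Return (convolutional, fully connected) parameter counts."""
    conv_total, channels, size = 0, 3, input_size
    for block in VGG16_BLOCKS:
        for width in block:
            conv_total += conv_params(channels, width)
            channels = width
        size //= 2  # each 2 x 2 max-pooling layer halves height and width
    flat = size * size * channels  # 7 * 7 * 512 = 25088 for a 224 x 224 input
    fc_total = (dense_params(flat, 4096) + dense_params(4096, 4096)
                + dense_params(4096, n_classes))
    return conv_total, fc_total


conv_total, fc_total = vgg16_param_counts()
print(conv_total, fc_total, conv_total + fc_total)
# 14714688 123642856 138357544
```

Under these assumptions, roughly 89% of the parameters sit in the three fully connected layers, which is why the classifier is the natural place to reduce the model size.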
CNNs typically require a large annotated image dataset to achieve a high predictive accuracy. However, the acquisition of such data is difficult and labeling them is costly in many areas. In light of these challenges, transfer learning has been adopted in many previous studies to solve cross-domain image classification problems and has been shown to be very useful. In this approach, the “off-the-shelf” features of well-established CNNs, such as VGG16, AlexNet, and GoogLeNet, are obtained by pre-training on large-scale annotated natural image datasets, such as ImageNet, which involves 15 million images [24–27].
One common strategy of transfer learning is feature transfer, which removes the last layer of the pre-trained network and sends the activation values of the preceding layer, which can be regarded as feature vectors, into a classifier for training. Another is parameter transfer, which re-initializes only a few layers of the network, such as the last layer, while the other layers directly use the weight parameters of the pre-trained network; a new dataset is then used to fine-tune the network parameters [28–30].
Because of the small amount of data in this work, training a neural network from scratch would take a long time, and the insufficient data would easily cause over-fitting, which would make the model less robust. Therefore, we use the idea of transfer learning, where a model pre-trained on ImageNet is used to optimize the classification and recognition of apple leaf diseases. Herein, VGG16 is fine-tuned to fit our own data, which saves a lot of training time.
A classical VGG16 network has a strong ability for image feature extraction and recognition. Its core idea is to use smaller convolution kernels to increase the depth of the network, which was the key to winning the runner-up position in the positioning and classification tasks of the ILSVRC Challenge in 2014. However, the VGG16 model has a huge number of parameters, which causes a slow convergence speed, a long training time, and large storage requirements in practical applications.
To address these problems, this work improves the VGG16 model by using a global average pooling layer, a batch normalization layer, and a fully connected layer to replace the three fully connected layers in the original model. The global average pooling layer replaces the fully connected layers to reduce the parameters, and the batch normalization layer is added to improve the convergence speed. To avoid a long training time, the weights of the convolution layers are those pre-trained by VGG16 on ImageNet. The stochastic gradient descent (SGD) optimizer is replaced by the adaptive moment estimation (Adam) optimizer to accelerate the convergence of the network. The network structure is shown in Figure 3, where the improved classifier, consisting of a global average pooling layer, a batch normalization layer, and a fully connected layer, is shown within the marked green box.
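The effect of this replacement on the classifier size can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes a 7 × 7 × 512 feature map after the last pooling layer, a four-class output, 4096 units in each hidden fully connected layer of the original classifier, and four values per channel in the batch normalization layer (scale, shift, running mean, running variance). The exact count in a given implementation depends on such details.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


N_CLASSES = 4                # healthy, scab, frogeye spot, cedar rust
HEIGHT, WIDTH, CHANNELS = 7, 7, 512   # assumed size of the last feature maps

# Original classifier: flatten -> FC(4096) -> FC(4096) -> FC(N_CLASSES)
original_head = (dense_params(HEIGHT * WIDTH * CHANNELS, 4096)
                 + dense_params(4096, 4096)
                 + dense_params(4096, N_CLASSES))

# Modified classifier: global average pooling (no parameters) -> BN -> FC
bn_params = 4 * CHANNELS     # assumption: four stored values per channel
improved_head = bn_params + dense_params(CHANNELS, N_CLASSES)

print(original_head, improved_head, original_head - improved_head)
# 119562244 4100 119558144
```

Under these assumptions the classifier shrinks from about 1.2 × 10^8 parameters to a few thousand, because global average pooling feeds the final layer 512 values instead of 25,088.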
Global average pooling (GAP) regularizes the whole network structure to prevent over-fitting and reduces the dimensions from 3D to 1D [31,32]. In this work, the feature maps in the last convolution layer are each averaged into a single value, giving a 1D output, as shown in Figure 4. GAP avoids flattening the feature maps into vectors and processing them with fully connected layers, and therefore greatly reduces the number of parameters. The advantage of GAP over a fully connected layer is that it better preserves the convolution structure by enforcing the correspondence between the feature maps and the categories, making the classification based on the feature maps more credible and easier to interpret.
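The averaging operation itself is very simple. The following is a minimal sketch in plain Python, with feature maps stored as nested lists indexed as [channel][row][column]; the function name and the toy values are illustrative and not taken from the experiments.

```python
def global_average_pooling(feature_maps):
    """Average each H x W feature map to a single value.

    feature_maps is indexed as [channel][row][column]; the result has one
    entry per channel, so a C x H x W input becomes a vector of length C.
    """
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled


# Two toy 2 x 2 feature maps (made-up values).
maps = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]]]
pooled = global_average_pooling(maps)
print(pooled)  # [2.5, 2.0]
```

The layer has no trainable weights, so all of its parameter savings come from the smaller input it passes to the following fully connected layer.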
2.3.2. Batch Normalization (BN)
In deep learning, because the number of layers in the network is very large, a significant deviation in the data distribution at one layer is amplified as the network deepens, which increases the difficulty of model optimization. Normalization helps to alleviate this problem. During training, the data are divided into several mini-batches and the parameters are updated batch by batch. The data in one batch jointly determine the direction of the gradient, which reduces the randomness of the descent. In addition, because the number of samples in a batch is much smaller than the entire dataset, the amount of calculation per update also drops significantly. The batch normalization layer normalizes the inputs to a layer over each mini-batch before the activation function is applied, which can alleviate the problems of shift and growth in the scale of the input data [33].
With the BN algorithm, the inputs to each layer are normalized, so the activation function no longer distorts the distribution of the neuron inputs. The importance of individual neurons is weakened and some of them may be removed automatically. Because normalization is applied to every mini-batch, the risk of parameter changes caused by a shifting data distribution is reduced and the convergence speed is accelerated.
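The normalization step can be illustrated for a single feature over one mini-batch. This is a minimal sketch of the training-time forward computation only; the scale gamma and shift beta are the learnable parameters of the standard formulation, and the input values are made up for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]   # made-up pre-activation values
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(round(norm_mean, 6), round(norm_var, 4))
```

After normalization the batch has approximately zero mean and unit variance regardless of the scale of the incoming values, which is what keeps the inputs of deeper layers in a stable range.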
Adam is an extension of the stochastic gradient descent algorithm that iteratively updates the neural network weights based on the training data [34,35]. It stores an exponentially decaying average of past squared gradients (the second-order moment estimate) as well as an exponentially decaying average of past gradients (the first-order moment estimate), and uses them to compute a separate adaptive learning rate for each parameter. In contrast, optimization algorithms such as SGD maintain a single learning rate during the training process. When the parameters are updated during backpropagation, Adam can therefore adjust the step size better. As a result, Adam converges quickly and learns effectively. It can also mitigate problems found in other optimization techniques, such as a vanishing learning rate, slow convergence, or loss fluctuation caused by high-variance parameter updates.
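The update rule can be written in a few lines. The following is a minimal sketch of one Adam step for a single scalar parameter, using the commonly cited default decay rates (0.9 and 0.999); the toy objective f(x) = x^2 and the learning rate of 0.1 are chosen only for illustration and are not the settings used in the experiments.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
history = [x]
for t in range(1, 6):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
    history.append(x)
print(history)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by almost exactly the learning rate; later steps shrink or grow according to the gradient history of that parameter.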
3. Results and Discussion
In this work, the proposed model was implemented with the Keras deep learning framework on an Intel® Core™ i7-8750H CPU (LENOVO, Jiangsu, China). The ImageNet pre-trained VGG16 CNN implemented within Keras Applications takes a default image input size of 224 × 224. Therefore, all the pictures in our dataset were cropped to the same size of 224 × 224.
The proposed CNN is trained on 2141 training pictures and tested on 305. The resulting classification is almost entirely accurate: only one healthy picture is misclassified as scab, and one picture in each of the scab and frogeye spot categories is misclassified as healthy, so 302 of the 305 test pictures are classified correctly.
For each of the three misclassified pictures in the original dataset, Figure 5 shows the original picture, the visualization of the last convolution layer, and the heat map superimposed on the original picture. Some insights can be drawn from these pictures. In Figure 5b, the strong light and small disease features may lead to the inaccurate extraction of disease features by the model. The frogeye spots in Figure 5c are small and light in color, which leads to prediction errors because, in comparison with the dark areas, light regions are strongly learned by the network and therefore receive a larger weight.
To evaluate the performance of the proposed VGG model, four typical convolutional neural networks (AlexNet, GoogLeNet, ResNet-34, and VGG16) are also implemented. Another apple leaf disease recognition structure presented by Liu et al., in which the inception structure was added to the AlexNet framework, is also compared. The recognition accuracy of the different models is shown in Figure 6.
It can be found that the accuracy of AlexNet and the original VGG16 is 93.11%, that of ResNet-34 is 95.73%, and that of GoogLeNet reaches 97.70%. When the inception structure is combined with AlexNet, the identification accuracy increases to 97.05%, which is higher than that of the original AlexNet. Our model achieves the highest accuracy in the identification of apple leaf diseases, 99.01%, which demonstrates the effectiveness of the proposed model. Compared with the other five models, our model also achieves the highest values of precision, recall, and F1-score.
Table 3 shows the precision, recall, F1-score, and accuracy achieved by the different models for the four categories of apple images. The table shows that AlexNet does not learn the features of scab well enough, so its detection performance is poor; the improved Alex + Inception model recognizes the classes better than the original AlexNet; moreover, the original VGG16 network learns each feature worst. For these four leaf types, all the networks have the best recognition rate for healthy leaves and the lowest for scab. In terms of both the overall accuracy and the detection indices of each leaf type, our model achieved the best results.
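For reference, the per-class indices used in this comparison can be derived from a confusion matrix as sketched below. The matrix here is a hypothetical three-class example with made-up counts, not the results of this work, and the function names are illustrative.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) per class.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Hypothetical confusion matrix (illustrative counts only).
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 0, 10]]
metrics = per_class_metrics(toy)
print(accuracy(toy), metrics[0])
```

Precision penalizes samples wrongly assigned to a class, recall penalizes samples of that class that are missed, and the F1-score is their harmonic mean, so a class such as scab can have a low F1-score even when the overall accuracy is high.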
3.2. Convergence Rate Analysis
The loss values in this work are calculated by cross entropy. Figure 7 shows the accuracy and loss values of the models during training. The experimental results show that AlexNet, ResNet-34, GoogLeNet, Alex + Inception, and our convolutional neural network converge within 60 training epochs, while VGG16 converges slowly. The proposed network structure converges in 10 training epochs, which is faster than the other five CNN models. The training process of GoogLeNet is similar to that of ResNet-34, and both converge after 20 training epochs, whereas AlexNet and the Alex + Inception model tend to be stable after 40 epochs.
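As a reminder of how this loss behaves, the following minimal sketch computes the cross entropy of a softmax output for a single sample with four classes. The class scores are made up for illustration; only the class names come from the dataset description.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index] + eps)


CLASSES = ["healthy", "scab", "frogeye spot", "cedar rust"]
target = CLASSES.index("scab")

undecided = cross_entropy(softmax([0.0, 0.0, 0.0, 0.0]), target)
confident = cross_entropy(softmax([0.0, 5.0, 0.0, 0.0]), target)
print(round(undecided, 4))  # 1.3863, i.e. ln(4) for four equally likely classes
print(confident < undecided)  # True
```

A network that has not yet learned anything starts near ln(4) on a four-class problem, and the loss approaches zero as the probability assigned to the correct class approaches one.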
Table 4 shows the number of parameters of each model and the training time required for the model to become stable. The classical VGG16 model has the most parameters and the longest training time, the Alex + Inception model has the fewest trainable parameters, and AlexNet has the shortest training time. Our improved model has 119,534,592 fewer trainable parameters than the original VGG16 model, and fewer trainable parameters than AlexNet, ResNet-34, and VGG16. The training time of the proposed model is 692 s, which is similar to that of ResNet-34 and GoogLeNet.
The optimization algorithm is of great importance for the model performance. In this work, the SGD optimization algorithm in the original VGG16 is replaced by the Adam optimization algorithm to improve the convergence rate. Figure 8 shows the training process of these two optimization algorithms with the same learning rate of 1 × 10^-5. The results show that the model using the Adam algorithm has a faster convergence speed. The testing accuracy is 98.03% when the SGD algorithm is used, while that of the Adam algorithm is 99.01%. From the loss curve in Figure 8, it can be seen that the Adam algorithm converges quickly and is more stable than SGD.
In this work, the dataset includes only 2446 pictures, which is very small in comparison to the dataset on which VGG16 was pre-trained. To evaluate the performance of the proposed method further, a data augmentation strategy is adopted to enlarge the original dataset, and the classification performance is tested on it. The augmented dataset is generated from the original dataset by geometric transformation, color changing, and noise addition, which increases the size of the training dataset from 2141 to 21,410.
Image rotation and flipping are two types of geometric transformations in which only the location of each pixel is changed. Rotating the pictures by different angles and flipping them increases the diversity of directions. It is generally difficult to capture each picture from different directions; therefore, to simulate this situation and eliminate the effect of direction on picture recognition, we rotated each original image around its center point by 90°, 180°, and 270° and flipped it horizontally. As shown in Figure 9, rotation and flipping generate 4 times as many pictures as the original dataset.
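These four geometric variants can be sketched in plain Python on an image stored as a list of pixel rows. The clockwise direction of rotation is an assumption made here for illustration, and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def geometric_variants(image):
    """Return the 90, 180, and 270 degree rotations and the horizontal flip."""
    r90 = rotate90(image)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [r90, r180, r270, flip_horizontal(image)]


toy = [[1, 2],
       [3, 4]]
variants = geometric_variants(toy)
for variant in variants:
    print(variant)
# [[3, 1], [4, 2]]
# [[4, 3], [2, 1]]
# [[2, 4], [1, 3]]
# [[2, 1], [4, 3]]
```

Only pixel positions change, so the lesion colors and shapes that the classifier relies on are preserved in every variant.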
Adjusting the brightness, contrast, and hue of the image is another common augmentation method widely used in image processing. During image acquisition, pictures may be affected by different weather conditions and exposed to different intensities of light, which may affect the experimental results. To simulate image collection under different lighting conditions, we adjusted the brightness and contrast, as shown in Figure 10, which generated 4 times as many pictures as the original dataset.
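One simple way to realize such adjustments is sketched below for an 8-bit grayscale image. Scaling intensities for brightness and stretching them around the image mean for contrast are common choices assumed here for illustration; the factors actually used in this work are not specified in this section, and the pixel values are made up.

```python
def clip(value, low=0, high=255):
    """Keep an intensity inside the valid 8-bit range."""
    return max(low, min(high, value))


def adjust_brightness(image, factor):
    """Scale every pixel intensity by factor (>1 brightens, <1 darkens)."""
    return [[clip(round(p * factor)) for p in row] for row in image]


def adjust_contrast(image, factor):
    """Stretch (>1) or compress (<1) intensities around the image mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[clip(round(mean + factor * (p - mean))) for p in row]
            for row in image]


gray = [[50, 100],
        [150, 200]]   # made-up 8-bit grayscale values
brighter = adjust_brightness(gray, 1.5)
stronger = adjust_contrast(gray, 2.0)
print(brighter)  # [[75, 150], [225, 255]]
print(stronger)  # [[0, 75], [175, 255]]
```

Values that leave the 0 to 255 range are clipped, which mimics the saturation of over-exposed or under-exposed regions in real photographs.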
In the same experimental setup, the proposed model is trained on the 21,410 augmented images and the final classification accuracy reaches 99.34%, compared with 99.01% when the model is trained on the original dataset. Figure 12 shows the recognition accuracy. After the data expansion, all the measures are slightly improved for the proposed model.
An improved convolutional neural network model based on VGG16 is proposed in this work. The classifier of the classical VGG16 network is replaced by a global average pooling layer, a batch normalization layer, and a fully connected layer to accelerate convergence and reduce the trainable parameters. The proposed model is trained on the 2141 apple leaf pictures in the training set to identify apple leaf diseases. The experimental results show that the test accuracy of the model reaches 99.01% after 692 s of training. Compared with the classical VGG16 network, the model parameters are reduced by 119,534,592, and the accuracy is improved by 6.3%.
Although the training time is longer than that of AlexNet and ResNet-34, our model has fewer parameters and a higher accuracy. Compared with GoogLeNet and Alex + Inception, some parameters and training time are sacrificed, but our model has the highest accuracy, up to 99.01%. After data expansion, the accuracy of the model increases to 99.34%. The convolutional neural network proposed in this work can identify apple leaf diseases quickly and accurately and provides a feasible scheme for their identification.
In the future, our work can be improved in the following aspects: (1) collecting more kinds and quantities of apple disease pictures to enrich the datasets and train better models, (2) trying other deep convolutional neural networks to improve the accuracy and speed of recognition, and (3) trying other deep learning methods and applying them to the real-time detection of apple disease.