Two-stage Deep Learning Model with LSTM-based Autoencoder and CNN for Crop Classification Using Multi-temporal Remote Sensing Images

  • Kwak, Geun-Ho (Department of Geoinformatic Engineering, Inha University)
  • Park, No-Wook (Department of Geoinformatic Engineering, Inha University)
  • Received : 2021.08.09
  • Accepted : 2021.08.17
  • Published : 2021.08.31

Abstract

This study proposes a two-stage hybrid classification model for crop classification using multi-temporal remote sensing images; the model combines feature embedding by an autoencoder (AE) with a convolutional neural network (CNN) classifier to fully utilize informative temporal and spatial signatures. A long short-term memory (LSTM)-based AE (LAE) is fine-tuned using class label information to extract latent features that contain less noise and useful temporal signatures. The CNN classifier is then applied to effectively account for the spatial characteristics of the extracted latent features. A crop classification experiment with multi-temporal unmanned aerial vehicle images is conducted to illustrate the potential application of the proposed hybrid model. The classification performance of the proposed model is compared with various combinations of conventional deep learning models (CNN, LSTM, and convolutional LSTM) and different inputs (original multi-temporal images and features from a stacked AE). In the crop classification experiment, the best classification accuracy was achieved by the proposed model, which used the latent features extracted by the fine-tuned LAE as input to the CNN classifier. The latent features, which contain useful temporal signatures and less noise, could increase the class separability between crops with similar spectral signatures, thereby leading to superior classification accuracy. The experimental results demonstrate the importance of effective feature extraction and the potential of the proposed classification model for crop classification using multi-temporal remote sensing images.


1. Introduction

Remote sensing imagery has been regarded as one of the most important sources for crop monitoring and agricultural thematic mapping owing to its ability to provide periodic information on crops and agricultural environments at various spatial scales (Na et al., 2017; Kwak et al., 2020; Weiss et al., 2020). Among various thematic maps, crop type maps generated by remote sensing image classification have been widely used as inputs for crop yield prediction and physical models (Li and Xu, 2020; Kwak et al., 2021). It is therefore critical to generate highly accurate crop type maps from remote sensing imagery because any errors contained within the crop type maps affect the quality of the model output (Hao et al., 2015; Lee et al., 2018).

The classification of multiple crop types usually requires multi-temporal remote sensing images to completely account for the sequential change in vegetation vitality according to the growth cycles of crops of interest (Zhong et al., 2019; Zhou et al., 2019; Kwak et al., 2020). To achieve satisfactory classification performance, advanced classifiers that can effectively model temporal information are required in addition to the use of multi-temporal remote sensing images. Recently, long short-term memory (LSTM), an advanced recurrent neural network (RNN) that processes sequential temporal information for classification, has proven to be effective in crop classification using multi-temporal remote sensing images (Rußwurm and Körner, 2018; Kwak et al., 2020).

Despite the efficiency of LSTM for sequential information modeling, its classification performance greatly depends on both the information content and the noise of multi-temporal images (Rußwurm and Körner, 2018). For example, radiometric inconsistency between multi-temporal images caused by different image acquisition conditions may degrade classification performance. In addition, even the same crop may show different growth characteristics from field to field depending on differences in sowing time (Na et al., 2018). Consequently, advanced classifiers such as LSTM may not achieve satisfactory classification performance when using only the original multi-temporal spectral bands. Therefore, it is necessary to extract informative features that contain useful information for discriminating crop types while reducing the noise effect. Although deep learning models can extract features automatically, it is still important to extract useful features prior to classification, especially when the input data contain noise and redundant information (Guo et al., 2020).

To resolve this issue, an autoencoder (AE), an unsupervised learning technique for compression or reconstruction of high-dimensional data, can be combined with LSTM to extract features that still contain sequential information useful for classification, but with less noise (Hamidi et al., 2020; Kalinicheva et al., 2020). The latent features extracted by LSTM-based AE (hereafter, referred to as LAE) can be used as inputs for convolutional neural networks (CNNs) that consider spatial contextual information for classification. The combination of LAE with CNN has great potential for crop classification as it can reduce the noise effect contained in the original multi-temporal images. In addition, it can fully utilize both spatial and temporal contextual information to identify different crops with similar spectral signatures. However, to the best of our knowledge, the potential of a deep learning framework in which feature embedding by LAE is combined with CNN-based classification has not yet been evaluated for crop classification.

This study presents an advanced hybrid classification model that combines LAE with CNN for crop classification with multi-temporal remote sensing images. First, the AE is trained through a two-stage process consisting of pre-training and fine-tuning. The properties of the features embedded by the optimized AE and their effectiveness for classification are further evaluated using class separability measures. The classification performance of the proposed model is quantitatively compared with that of conventional deep learning models using the original multi-temporal images as input, such as CNN, LSTM, and convolutional LSTM (CLSTM). Furthermore, the effectiveness of LAE-based feature embedding for crop classification is also compared with classification using a stacked AE (SAE), which is a basic AE model. The methodological development and potential of the proposed hybrid model are demonstrated through a crop classification experiment using multi-temporal unmanned aerial vehicle (UAV) images in Anbandegi, Korea.

2. Study Area and Data

The case study area is located in Anbandegi, Kangwon Province, one of the major highland Kimchi cabbage cultivation areas in Korea. In addition to the cultivation of highland Kimchi cabbage, cabbage and potato are also cultivated in the study area, with some fields managed as fallow (Fig. 1). Four multi-temporal UAV images acquired from July to August 2017 were selected as inputs for crop classification based on the growth cycles of the crops in the study area (Table 1). The UAV images were acquired using a fixed-wing eBee unmanned aerial system (senseFly, Switzerland) with a Canon IXUS/ELPH camera (Canon, USA) and preprocessed using Pix4Dmapper (Pix4D, Switzerland) by the National Institute of Agricultural Sciences. Three visible bands with a spatial resolution of 50 cm were used for crop classification as the crops in the study area can be discriminated using the visible bands, particularly the blue band, without the near-infrared band (Kwak et al., 2020).


Fig. 1. Multi-temporal UAV images and a ground truth map in the study area.

Table 1. Multi-temporal UAV images used for crop classification


As the classification targets of the present study are crops, four classes, namely highland Kimchi cabbage, cabbage, potato, and fallow, were considered for supervised classification. Non-crop areas, including facilities and roads, were masked out using a land-cover map provided by the Environmental Geographic Information Service (Ministry of Environment, Korea). Training and test samples for each class were prepared from a ground truth map provided by the National Institute of Agricultural Sciences (Fig. 1). The numbers of training and test samples are listed in Table 2. When a single crop type is cultivated in one field, all pixels within each field are highly autocorrelated in space (Zhong et al., 2019). To avoid dependency between training and test samples, all crop fields were first randomly partitioned into spatially exclusive training and test fields, as described in Kwak et al. (2020). A total of 500 training samples, with the same number of pixels for each class, were randomly selected from the training fields, as sketched below. Since each crop field consists of homogeneous pixels, the selection of redundant samples for major crop types may lead to biased classification results (Demir et al., 2014). To mitigate this bias towards the major crop types, a predefined equal number of training samples was selected for all classes. All the pixels (1,130,202) within the predefined test fields, which were independent of the training fields, were used as test samples for the quantitative evaluation of classification performance (Table 2).
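As a minimal sketch of the equal-count sampling described above, the following could be used, assuming the training-field pixels and their class labels are already available as NumPy arrays; the array names, the random seed, and the helper itself are illustrative assumptions rather than the exact procedure used in this study.

```python
import numpy as np

def sample_training_pixels(features, labels, n_total=500, seed=0):
    """Draw an equal number of training pixels per class from the training fields.

    features: (n_pixels, n_features) pixel values from the training fields (assumed name)
    labels:   (n_pixels,) integer class indices (assumed name)
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    n_per_class = n_total // len(classes)  # e.g., 500 samples / 4 classes = 125 per class
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ])
    return features[idx], labels[idx]
```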

Table 2. Number of training and test samples for each class


3. Classification Methodology

1) Convolutional Neural Network (CNN)

CNN is a popular deep learning model that classifies images using spatial features extracted from input images (LeCun et al., 2015). CNN was selected as the primary classifier for this study because it exhibited promising crop classification accuracy for major crop cultivation areas in Korea (Kwak et al., 2019).

The key feature of the CNN is the use of spatial features that are automatically extracted by applying a series of convolution filters. The architecture of the CNN model for spatial feature extraction consists of (1) convolutional layers, in which a convolution operation is applied to either predefined input patches or the spatial features of previous layers, and (2) pooling layers, in which the most activated responses in the extracted spatial features are summarized (Fig. 2). In particular, various types of spatial features can be extracted from the input image by applying different filters in the convolutional layers. The extracted spatial features are transformed into a 1-dimensional vector and transferred to the fully connected layer (Kwak et al., 2021). The final classification result is generated using the probabilities of multiple classes obtained by applying a softmax function to the fully connected layer. In this study, a 2D-CNN was employed based on our previous studies (Kwak et al., 2019; 2021).
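As a minimal sketch of this architecture, a patch-based 2D-CNN could be assembled in Keras as follows; the patch size, filter counts, and layer depth are illustrative assumptions rather than the configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

PATCH_SIZE = 5    # assumed spatial patch extracted around each pixel
NUM_BANDS = 12    # e.g., 4 dates x 3 visible bands stacked along the channel axis
NUM_CLASSES = 4   # highland Kimchi cabbage, cabbage, potato, fallow

def build_cnn(patch_size=PATCH_SIZE, n_bands=NUM_BANDS, n_classes=NUM_CLASSES):
    model = models.Sequential([
        layers.Input(shape=(patch_size, patch_size, n_bands)),
        # Convolutional layers extract spatial features from the input patch.
        layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
        # Pooling summarizes the most activated responses in the feature maps.
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
        # The feature maps are flattened and passed to a fully connected layer.
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        # Softmax returns class probabilities for the center pixel of the patch.
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```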


Fig. 2. Illustration of convolution and pooling operations.

2) Long Short-Term Memory (LSTM)

LSTM, which was developed to alleviate the weakness of regular RNNs that cannot learn long-term dependencies due to vanishing and exploding gradients, is an efficient deep learning model for sequence classification (Hochreiter and Schmidhuber, 1997).

A common LSTM unit consists of hidden and cell states (Fig. 3). The hidden state passes the input of the current time step to the forget, input, and output gates. The cell state combines the information learned from the forget and input gates with information from the previous time step and passes it on to the current time step. The forget gate first determines whether to remember or forget the information in the cell state vector. The input gate then determines which information should enter the current cell state by regulating the amount of information to be retained. Finally, the output gate determines the current hidden state to be passed to the next time step, based on the previous hidden state and the current cell state determined by the input gate. The LSTM unit operating with these three gates forms a chain of repeating modules. The classification result is obtained by applying a softmax function, which returns the probabilities for all classes, to the hidden state extracted from the final time step.
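A pixel-wise LSTM classifier over the multi-temporal sequence could be sketched as follows; the number of time steps, bands, and hidden units are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS = 4   # number of acquisition dates
NUM_BANDS = 3    # visible bands per date
NUM_CLASSES = 4

def build_lstm(time_steps=TIME_STEPS, n_bands=NUM_BANDS, n_classes=NUM_CLASSES):
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_bands)),
        # The LSTM unit processes the sequence date by date; only the hidden
        # state of the final time step is kept for classification.
        layers.LSTM(64, return_sequences=False),
        # A softmax layer over the final hidden state yields class probabilities.
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```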


Fig. 3. Basic structure of the LSTM unit (modified from Rußwurm and Körner (2018)).

The LSTM takes as input the multi-temporal values of individual pixels arranged as 1-dimensional vectors. A point-wise operator is applied to perform multiplication between these 1-dimensional vectors (Fig. 3). Thus, LSTM cannot account for the spatial characteristics of the input data. As a variant of LSTM, convolutional LSTM was proposed to mitigate this weakness of conventional LSTM (Shi et al., 2015). CLSTM has the same learning process as LSTM, but uses patches extracted from multi-temporal images as input. Moreover, for the internal matrix multiplication, CLSTM uses a convolutional operator instead of a point-wise operator. Thus, CLSTM has the advantage of learning temporal and spatial features simultaneously.
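A minimal CLSTM classifier taking a sequence of image patches could look like the following; the patch size, filter count, and number of time steps are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, PATCH_SIZE, NUM_BANDS, NUM_CLASSES = 4, 5, 3, 4  # assumed sizes

def build_clstm():
    model = models.Sequential([
        layers.Input(shape=(TIME_STEPS, PATCH_SIZE, PATCH_SIZE, NUM_BANDS)),
        # ConvLSTM2D replaces the point-wise products of a standard LSTM with
        # convolutions, so spatial and temporal features are learned together.
        layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                          return_sequences=False),
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```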

3) Feature Embedding by Autoencoder (AE)

AE is an unsupervised neural network model for dimensionality reduction (Hinton and Salakhutdinov, 2006). It is composed of encoder and decoder networks. The encoder network compresses high-dimensional input data into low-dimensional features, and the compressed features can be recovered by the decoder network (Fig. 4). The two networks are trained together by minimizing the difference between the original input data and the reconstructed data (Hinton and Salakhutdinov, 2006). The AE can embed meaningful information from the input data by mapping the compressed features to target values. Here, features compressed to a dimension lower than the input dimension are called 'latent features'. A key property of the AE is its ability to suppress the impact of noise that is irrelevant to the underlying information in the input data, thereby improving the generalization capability. As the AE compresses the input data according to the information used to train the model, the compression or embedding capability of the AE depends on the data on which the model is trained. In this study, pixels representing individual crops were randomly extracted and used as input for the AE, and the number of randomly extracted pixels was set to 5,000 in total, in consideration of the computational cost.


Fig. 4. Basic structure of the autoencoder. X and X’ denote the input data and reconstructed input data, respectively.

In an AE, the encoder and decoder can be composed of various layers, such as dense layers and LSTM layers. In this study, LSTM units were employed as the encoder and decoder layers to learn temporal features for crop classification, which corresponds to the LAE described in the Introduction. An SAE that stacks several dense layers in the encoder and decoder was also applied for comparison with LAE-based features.
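A minimal LAE sketch in Keras could look like the following, with an LSTM encoder that compresses each pixel's temporal sequence into a latent vector and an LSTM decoder that reconstructs the sequence; the latent dimension and unit sizes are illustrative assumptions, not the values reported in Table 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, NUM_BANDS, LATENT_DIM = 4, 3, 8  # assumed sizes

def build_lae():
    inputs = layers.Input(shape=(TIME_STEPS, NUM_BANDS))
    # Encoder: an LSTM layer compresses the temporal sequence into latent features.
    latent = layers.LSTM(LATENT_DIM, name="latent")(inputs)
    # Decoder: the latent vector is repeated over the time axis and decoded back
    # to the original sequence (X' in Fig. 4).
    x = layers.RepeatVector(TIME_STEPS)(latent)
    x = layers.LSTM(LATENT_DIM, return_sequences=True)(x)
    outputs = layers.TimeDistributed(layers.Dense(NUM_BANDS))(x)

    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, latent)
    # Pre-training minimizes the reconstruction error between input and output.
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder
```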

4) Combination of AE-based Features with CNN

The AE described in the previous subsection corresponds to the pre-training process. As no class label information is used during the feature reconstruction stage, informative features specific to individual crops may not be extracted from the pre-training alone.

For informative feature extraction for crop classification, this study adds a fine-tuning stage in which the weights of the pre-trained network are further tuned using class label information from training samples (Fig. 5). More specifically, the weights that produce the latent features learned by the pre-trained AE are updated by adding a softmax layer after the pre-trained latent layer and classifying the training samples. Through this fine-tuning stage, the latent features can suppress the noise effect and contain class-specific information that is useful for identifying crop types. The latent features extracted by the fine-tuned AE are then used as inputs for the CNN classifier, which can consider spatial features during classification (Fig. 5). By leveraging the advantages of both AE-based feature embedding and the CNN classifier, both temporal and spatial contextual information can be properly utilized to discriminate different crop types.
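Assuming the `build_lae` and `build_cnn` sketches above, the fine-tuning stage and the hand-off to the CNN classifier could be sketched as follows; the data array names and training settings are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def fine_tune_encoder(encoder, x_train_seq, y_train, n_classes=4, epochs=50):
    """Update the pre-trained encoder weights using class label information."""
    # A softmax layer is attached after the pre-trained latent layer and the
    # network is trained to classify the training samples.
    clf = models.Sequential([encoder,
                             layers.Dense(n_classes, activation="softmax")])
    clf.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    clf.fit(x_train_seq, y_train, epochs=epochs, batch_size=128, verbose=0)
    return encoder  # the encoder now shares the fine-tuned weights

# After fine-tuning, every pixel sequence is mapped to latent features, which are
# rearranged into image form so that spatial patches can be fed to the 2D-CNN
# (hypothetical array names):
# latent_image = encoder.predict(all_pixel_sequences).reshape(rows, cols, LATENT_DIM)
```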


Fig. 5. Proposed deep learning framework combining LAE with fine-tuning and CNN.

5) Classification and Evaluation Procedures

Fig. 6 shows the overall workflow for crop classification applied in this study. The informative latent features for crop classification were first extracted from the multi-temporal UAV images using AE-based feature embedding. To extract useful features that convey as much of the information content of the input images as possible, it is necessary to determine the optimal hyper-parameters of the AE model. In this study, three hyper-parameters (the numbers of layers and filters in the encoder and decoder, and the number of latent features) were optimized for the two AE models through ten-fold cross-validation based on the class separability during fine-tuning (Table 3). After randomly dividing the training samples into ten partitions, nine partitions were used for AE model training and the remaining partition was used for model validation. These procedures were repeated ten times, and the optimal hyper-parameters were determined based on the validation accuracy (Table 3), as sketched below.
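A generic ten-fold cross-validation loop for the hyper-parameter search could be sketched as follows; the candidate grid and the `build_and_score` callable (which would wrap AE pre-training, fine-tuning, and validation) are assumptions, not the exact procedure of this study.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_hyperparameters(x, y, candidates, build_and_score, n_splits=10, seed=0):
    """candidates: list of hyper-parameter dicts (assumed grid);
    build_and_score(params, x_tr, y_tr, x_val, y_val) -> validation accuracy."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_params, best_acc = None, -np.inf
    for params in candidates:
        # Train on nine partitions and validate on the held-out partition,
        # repeating over all ten splits.
        fold_accs = [build_and_score(params, x[tr], y[tr], x[va], y[va])
                     for tr, va in kf.split(x)]
        mean_acc = float(np.mean(fold_accs))
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc
```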


Fig. 6. Overall classification procedures applied in this study.

Table 3. Tested and optimal hyper-parameters of SAE and LAE


Once the optimal hyper-parameters for SAE and LAE were determined, the latent features extracted by both pre-training and fine-tuning were further compared with the original multi-temporal images from two perspectives. Firstly, to measure the reduction in the amount of noise contained in the original multi-temporal images by feature embedding, the Mahalanobis distance (MD) was used as a statistical measure to detect the class-specific noise of latent features extracted by AE. The MD is known to be effective for detecting outliers in multivariate data by comparing the distance between a specific value and the distribution of the entire data (De Maesschalck et al., 2000). As the MD is calculated for training samples per crop class, it is likely that the MD value of each training sample for the same crop class does not deviate significantly from the mean of the MD values from all the training samples. When the MD value of a training sample was greater than the mean of MD values for a certain crop class, the training sample was regarded as a noise pixel.
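A minimal sketch of this MD-based criterion is given below: for each class, the Mahalanobis distance of every training sample to its class distribution is computed, and samples whose distance exceeds the class-mean distance are counted as noise pixels; the array names and shapes are assumptions.

```python
import numpy as np

def count_noise_pixels(features, labels):
    """features: (n_samples, n_features); labels: (n_samples,) class indices."""
    counts = {}
    for c in np.unique(labels):
        x = features[labels == c]
        mean = x.mean(axis=0)
        # A pseudo-inverse guards against a singular covariance matrix.
        cov_inv = np.linalg.pinv(np.cov(x, rowvar=False))
        diff = x - mean
        # Mahalanobis distance of each sample to its class distribution.
        md = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
        # Samples whose MD exceeds the class-mean MD are flagged as noise pixels.
        counts[int(c)] = int(np.sum(md > md.mean()))
    return counts
```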

Secondly, the effectiveness of the latent features extracted by the fine-tuned AE was quantitatively evaluated using the Jeffries-Matusita distance (JMD). The JMD has been widely utilized to measure class separability by considering the distance between class means and the spread of values around the means (Bruzzone et al., 1995; Hao et al., 2015). In this study, the JMD was used as the statistical separability criterion to compare the class separability of the AE-based latent features with that of the original multi-temporal images. The JMD values range between 0 and 2, and the higher the JMD value, the higher the separability between the two classes (Bruzzone et al., 1995). Similar to the calculation of MD, the JMD was calculated using the 500 training samples. For a fair comparison of MD and JMD, which have different value ranges, all the MD and JMD values were normalized to values between 0 and 1.
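Under a Gaussian assumption, the JMD between two classes can be computed from the Bhattacharyya distance, as in the following sketch; the array names are assumptions.

```python
import numpy as np

def jeffries_matusita(x1, x2):
    """x1, x2: (n_samples, n_features) samples of two classes; returns JMD in [0, 2]."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    diff = m1 - m2
    # Bhattacharyya distance between the two Gaussian class distributions.
    b = 0.125 * diff @ np.linalg.pinv(c) @ diff \
        + 0.5 * np.log(np.linalg.det(c) /
                       np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    # The JMD saturates at 2 for perfectly separable classes.
    return 2.0 * (1.0 - np.exp(-b))
```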

The classification accuracy of the proposed classification model was quantitatively compared with that of conventional classifiers with different input features. For the classifiers using the original multi-temporal images, CNN, LSTM, and CLSTM, which use spatial, temporal, and both spatial and temporal features, respectively, were applied and compared. Only the CNN was employed for classification using the AE-based latent features because the latent features compress the temporal information of each pixel into a 1-dimensional vector. The optimal hyper-parameters determined in our previous work in the study area (Kwak et al., 2019) were employed for classification using CNN, LSTM, and CLSTM. To account for the stochastic nature of deep learning-based classifiers, the classification was repeated ten times for each classifier, and the average and standard deviation of the ten overall accuracy values were used as quantitative measures for comparison, as sketched below. In addition to these quantitative measures of classification accuracy, the spatial distributions of the classification results were visually compared with the ground truth map shown in Fig. 1.
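A minimal sketch of this repeated evaluation, assuming a model factory such as the `build_cnn` sketch above and hypothetical data arrays, is:

```python
import numpy as np

def repeated_overall_accuracy(build_model, x_train, y_train, x_test, y_test,
                              n_runs=10, epochs=100):
    """Train and test a classifier n_runs times and summarize overall accuracy."""
    accs = []
    for _ in range(n_runs):
        model = build_model()  # fresh random initialization each run
        model.fit(x_train, y_train, epochs=epochs, batch_size=128, verbose=0)
        y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
        accs.append(np.mean(y_pred == y_test))  # overall accuracy on test samples
    return float(np.mean(accs)), float(np.std(accs))
```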

4. Result and Discussion

1) Impact of Feature Reconstruction by AE

Table 4 lists the number of noise pixels detected by the MD-based criterion for the five different inputs. Compared with the case using the original multi-temporal images, the number of noise pixels for the latent features extracted by the fine-tuned SAE and LAE decreased substantially because the pixel values were adjusted through the feature embedding process, with the greatest decrease observed for cabbage. In particular, the latent features extracted by the fine-tuned LAE significantly reduced the number of noise pixels compared with the other input features. As the fine-tuned LAE embeds input features with less noise so that all sample values in the same class are similar, classification using those features is expected to be less prone to misclassification. In contrast, the number of noise pixels for the latent features from the pre-trained SAE and LAE was similar to, or even greater than, that of the original multi-temporal images. When various crop types have similar spectral signatures, pre-training alone cannot significantly reduce the noise pixels that have a great impact on classification accuracy. The latent features from the pre-trained SAE and LAE still contain noise and may fail to provide information useful for discriminating crop types. This emphasizes the necessity of fine-tuning during AE-based feature embedding.

Table 4. Number of noise pixels detected by an MD-based criterion for different inputs


Fig. 7 shows the JMD-based class separability measures for the five input features. As expected, the latent features extracted by the fine-tuned SAE and LAE, which use label information from the training samples, exhibited the highest class separability for all class pairs. Similar to the MD-based analysis, in which the fine-tuned SAE and LAE significantly reduced noise for cabbage (Table 4), the class separability of cabbage from potato and fallow was greatly improved compared with the case using the original multi-temporal images. In contrast, the class separability of the latent features from the pre-trained SAE and LAE was similar to that of the original multi-temporal images, and the separability of fallow from cabbage and potato even decreased. These results can be explained by the fact that the pre-trained AE, based on unsupervised learning, fails to properly account for intra-class similarity and inter-class dissimilarity. Based on the MD and JMD analysis results, classification using the latent features from the AE with only pre-training is likely to exhibit similar or lower classification accuracy compared with classification using the original multi-temporal images.


Fig. 7. Comparison of JMD values for all class pairs with respect to different inputs.

2) Evaluation of Classification Results

Fig. 8 shows the average overall accuracy and standard deviation for different combinations of classifiers and input features. The classification map with the best overall accuracy among ten classification results for each combination is also shown in Fig. 9.


Fig. 8. Average overall accuracy and standard deviation of different combinations of classifiers and inputs.


Fig. 9. Classification results with the highest overall accuracy for different combinations of classifiers and inputs.

In the classification results using the original multi-temporal images, LSTM exhibited the highest overall accuracy (90.97%), with improvements of 2.21%p and 0.84%p over CNN and CLSTM, respectively (Fig. 8). As shown in Fig. 9, however, isolated misclassified pixels were observed in all fields in the LSTM-based classification result, which considers only temporal features. Although spatial features were accounted for during classification, severe misclassifications occurred in the fallow and potato fields in the classification results of CNN and CLSTM. When inputs containing noise or less informative features are used for classification, satisfactory classification performance cannot be achieved even if spatial or temporal features are considered by advanced classifiers. Thus, informative features that contain less noise and useful information for classification are required to improve the classification accuracy. As implied by the MD- and JMD-based analyses, the classification accuracy values of the cases using the latent features from the pre-trained SAE and LAE were 74.02% and 87.56%, respectively, which are lower than the overall accuracy of the classification using the original multi-temporal images (Fig. 8). In particular, classification using the features extracted by the pre-trained SAE exhibited the lowest classification accuracy, and its largest standard deviation indicates unstable classification performance.

As shown in Fig. 9, misclassification between highland Kimchi cabbage and cabbage was dominant in the classification result using the latent features of the pre-trained SAE. Moreover, in the classification result using the latent features of the pre-trained LAE, most pixels in the northwest cabbage field were misclassified as fallow. These misclassifications are mostly due to the loss of information useful for classification when the AE is trained by pre-training alone.

The CNN-based classification using the latent features from the fine-tuned SAE and LAE showed overall accuracy values of 93.15% and 96.37%, respectively, with the lowest standard deviations, implying the most stable classification performance (Fig. 8). The best classification performance, achieved using the latent features from the fine-tuned AE, indicates that the proposed classification model including the AE with the fine-tuning stage could not only effectively remove noise contained in the original multi-temporal images, but also extract latent features with information useful for classification. When comparing the classification results using the features from the fine-tuned SAE and LAE, the SAE-based result included misclassifications in some potato and fallow fields. In contrast, the classification result based on the LAE features closely matches the ground truth, except for one highland Kimchi cabbage field in the south (Fig. 9). These analyses demonstrate the superiority of the two-stage classification model proposed in this study for crop classification using multi-temporal remote sensing images.

5. Conclusions

A two-stage deep learning-based model that combines LAE-based feature extraction with a CNN is presented in this study for crop classification using multi-temporal remote sensing images. The key feature of the proposed model is to extract latent features containing less noise and useful temporal information and then to utilize them as inputs for classification. A crop classification experiment using multi-temporal UAV images demonstrated the potential application of the proposed model for crop classification. Based on statistical measures of noise detection and class separability, the fine-tuned LAE could extract latent features from the multi-temporal remote sensing images that contain significant temporal signatures for crop classification. Using the latent features as input to the CNN classifier yielded the best classification performance compared with conventional deep learning models. These experimental results indicate that the extraction of informative features from the input multi-temporal images through fine-tuning with class label information is critical for crop classification using multi-temporal images. The findings of this study indicate that the proposed model may also be applicable to other classification tasks, such as forest classification. Hence, future research will focus on an extensive evaluation of the proposed model for supervised classification of multi-temporal remote sensing images.

Acknowledgements

This work was carried out with the support of the "Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01350004)" of the Rural Development Administration, Republic of Korea. The authors thank Drs. Chan-won Lee, Kyung-do Lee, Sang-il Na, and Ho-yong Ahn for providing the preprocessed UAV images and ground truth data.

References

  1. Bruzzone, L., F. Roli, and S.B. Serpico, 1995. An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection, IEEE Transactions on Geoscience and Remote Sensing, 33(6): 1318-1321. https://doi.org/10.1109/36.477187
  2. De Maesschalck, R., D. Jouan-Rimbaud, and D.L. Massart, 2000. The Mahalanobis distance, Chemometrics and Intelligent Laboratory Systems, 50(1): 1-18. https://doi.org/10.1016/S0169-7439(99)00047-7
  3. Demir, B., L. Minello, and L. Bruzzone, 2014. An effective strategy to reduce the labeling cost in the definition of training sets by active learning, IEEE Geoscience and Remote Sensing Letters, 11(1): 79-83. https://doi.org/10.1109/LGRS.2013.2246539
  4. Guo, J., H. Li, J. Ning, W. Han, W. Zhang, and Z.-S. Zhou, 2020. Feature dimension reduction using stacked sparse auto-encoders for crop classification with multi-temporal, quad-pol SAR data, Remote Sensing, 12(2): 321. https://doi.org/10.3390/rs12020321
  5. Hamidi, M., A. Safari, and S. Homayouni, 2020. An auto-encoder based classifier for crop mapping from multitemporal multispectral imagery, International Journal of Remote Sensing, 42(3): 986-1016.
  6. Hao, P., Y. Zhan, L. Wang, Z. Niu, and M. Shakir, 2015. Feature selection of time series MODIS data for early crop classification using random forest: A case study in Kansas, USA, Remote Sensing, 7(5): 5347-5369. https://doi.org/10.3390/rs70505347
  7. Hinton, G.E. and R.R. Salakhutdinov, 2006. Reducing the dimensionality of data with neural networks, Science, 313(5786): 504-507. https://doi.org/10.1126/science.1127647
  8. Hochreiter, S. and J. Schmidhuber, 1997. Long short-term memory, Neural Computation, 9(8): 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
  9. Kalinicheva, E., J. Sublime, and M. Trocan, 2020. Unsupervised satellite image time series clustering using object-based approaches and 3D convolutional autoencoder, Remote Sensing, 12(11): 1816. https://doi.org/10.3390/rs12111816
  10. Kwak, G.-H., C.-W. Park, H.-Y. Ahn, S.-I. Na, K.-D. Lee, and N.-W. Park, 2020. Potential of bidirectional long short-term memory networks for crop classification with multitemporal remote sensing images, Korean Journal of Remote Sensing, 36(4): 515-525 (in Korean with English abstract). https://doi.org/10.7780/kjrs.2020.36.4.2
  11. Kwak, G.-H., C.-W. Park, K.-D. Lee, S.-I. Na, H.-Y. Ahn, and N.-W. Park, 2021. Potential of hybrid CNN-RF model for early crop mapping with limited input data, Remote Sensing, 13(9): 1629. https://doi.org/10.3390/rs13091629
  12. Kwak, G.-H., M.-G. Park, C.-W. Park, K.-D. Lee, S.-I. Na, H.-Y. Ahn, and N.-W. Park, 2019. Combining 2D CNN and bidirectional LSTM to consider spatio-temporal features in crop classification, Korean Journal of Remote Sensing, 35(5-1): 681-692 (in Korean with English abstract). https://doi.org/10.7780/kjrs.2019.35.5.1.5
  13. LeCun, Y., Y. Bengio, and G. Hinton, 2015. Deep learning, Nature, 521(7553): 436-444. https://doi.org/10.1038/nature14539
  14. Lee, J., B. Seo, and S. Kang, 2018. Development of a biophysical rice yield model using all-weather climate data, Korean Journal of Remote Sensing, 33(5-2): 721-732 (in Korean with English abstract). https://doi.org/10.7780/KJRS.2017.33.5.2.11
  15. Li, K. and E. Xu, 2020. Cropland data fusion and correction using spatial analysis techniques and the Google Earth Engine, GIScience & Remote Sensing, 57(8): 1026-1045. https://doi.org/10.1080/15481603.2020.1841489
  16. Na, S.-I., C.-W. Park, K.-H. So, H.-Y. Ahn, and K.-D. Lee, 2018. Application method of unmanned aerial vehicle for crop monitoring in Korea, Korean Journal of Remote Sensing, 34(5): 829-846 (in Korean with English abstract). https://doi.org/10.7780/KJRS.2018.34.5.10
  17. Na, S.-I., C.-W. Park, K.-H. So, J.-M. Park, and K.-D. Lee, 2017. Satellite imagery based winter crop classification mapping using hierarchical classification, Korean Journal of Remote Sensing, 33(5-2): 677-687 (in Korean with English abstract). https://doi.org/10.7780/KJRS.2017.33.5.2.7
  18. Rußwurm, M. and M. Körner, 2018. Multi-temporal land cover classification with sequential recurrent encoders, ISPRS International Journal of Geo-Information, 7(4): 129. https://doi.org/10.3390/ijgi7040129
  19. Shi, X., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, Proc. of the Advances in Neural Information Processing Systems, Montreal, QC, CA, Dec. 7-12, pp. 802-810.
  20. Weiss, M., F. Jacob, and G. Duveiller, 2020. Remote sensing for agricultural applications: A meta-review, Remote Sensing of Environment, 236: 111402. https://doi.org/10.1016/j.rse.2019.111402
  21. Zhong, L., L. Hu, and H. Zhou, 2019. Deep learning based multi-temporal crop classification, Remote Sensing of Environment, 221: 430-443. https://doi.org/10.1016/j.rse.2018.11.032
  22. Zhou, Y., J. Luo, L. Feng, Y. Yang, Y. Chen, and W. Wu, 2019. Long-short-term-memory-based crop classification using high-resolution optical images and multi-temporal SAR data, GIScience & Remote Sensing, 56(8): 1170-1191. https://doi.org/10.1080/15481603.2019.1628412