Optimizing Image Size of Convolutional Neural Networks for Producing Remote Sensing-based Thematic Map

  • Jo, Hyun-Woo (Department of Environmental Science and Ecological Engineering, Korea University) ;
  • Kim, Ji-Won (Department of Climatic Environment, Korea University) ;
  • Lim, Chul-Hee (Institute of Life Science and Natural Resources, Korea University) ;
  • Song, Chol-Ho (Department of Environmental Science and Ecological Engineering, Korea University) ;
  • Lee, Woo-Kyun (Department of Environmental Science and Ecological Engineering, Korea University)
  • Received : 2018.08.09
  • Accepted : 2018.08.22
  • Published : 2018.08.31

Abstract

This study aims to develop a methodology for producing thematic maps from remote sensing data using convolutional neural networks (CNNs). Optimizing the image size for CNNs was studied, since the size of the image affects the accuracy, acting as a hyper-parameter. The selected study area is Mt. Ung, located in Dangjin-si, Chungcheongnam-do, South Korea, consisting of both coniferous and deciduous forest. Spatial structure analysis and forest type classification using CNNs were carried out in the study area at a diverse range of scales. As a result of the spatial structure analysis, the local variance (LV) was found to be high in the range of 7.65 m to 18.87 m, meaning that the size of the objects in the image is likely to fall within this range. As a result of the classification, the image measuring 15.81 m, belonging to the range with the highest LV values, had the highest classification accuracy of 85.09%. Also, there was a positive correlation between LV and the accuracy in the range under 15.81 m, which was judged to be the optimal image size. Therefore, the trial-and-error selection of the optimum image size can be minimized by choosing the result of the spatial structure analysis as the starting point. This study estimated the optimal image size for CNNs using spatial structure analysis and found that this approach can be used to promote the application of deep-learning in remote sensing.


1. Introduction

As deep-learning has recently shown remarkable achievements in the fields of image and voice recognition, there have been many attempts to apply deep-learning techniques to various fields that require data analysis (Szegedy et al., 2017). The applicability of deep-learning to remote sensing is also increasing as the spatial, spectral, and temporal resolution of the acquired data improves, enabling large-scale learning based on big data. Deep-learning is a machine learning method based on artificial neural networks, and high classification performance can be expected when it is applied to big data that enables large-scale learning. In particular, convolutional neural networks (CNNs), which mimic the optic nerves of living organisms, are known to be a structure well suited to image processing (LeCun and Bengio, 1995).

CNNs have been actively used in the field of image recognition, with the main classification target being a single high-resolution image (Simonyan and Zisserman, 2014). However, the general method for creating thematic maps from remote sensing data is to classify subdivided images of the target area using uniformly sized grids. For example, in the case of low-resolution satellite images, a number of studies have classified images after dividing them into single pixels (Yuan et al., 2005; Kashung et al., 2018). In the case of high-resolution satellite images, many studies have applied texture analyses after combining adjacent pixels with a moving window technique (Chica-Olmo and Abarca-Hernandez, 2000). Therefore, when using CNNs to create a thematic map from remotely sensed data, the image segmentation size used as the analysis unit must be determined. The image segmentation size acts as a hyper-parameter that must be selected by the researcher to obtain optimal classification performance. Hyper-parameters cannot be determined from the data themselves, so trial-and-error tuning is necessary to find their optimal values. While a number of studies have sought the most efficient tuning process in the image recognition field (Ilievski et al., 2017), basic research regarding the application of deep learning to remote sensing remains insufficient.

With regard to studies applying CNNs to remote sensing data, Castelluccio et al. (2015) and Marmanis et al. (2016) used CNNs to classify objects and land-use status observed in the UC-Merced Land Use Dataset (UCML), an aerial photo dataset labeled by subject. UCML, however, is made by extracting sections of aerial photos according to specific objects, such as buildings or ports, or land-use patterns, so these studies classify already segmented images. Thus, although they used remote sensing data, their task differs from creating thematic maps, such as land-cover or forest type maps, which classify consecutive regions of the same image. Song and Kim (2017) also applied CNNs to hyperspectral aerial photos to categorize crop types and building uses. The pixels were classified continuously from a single image with a high accuracy (> 95%), highlighting the potential of CNNs in remote sensing. However, to promote further studies utilizing deep-learning, studies on the basic methodologies, such as data preprocessing and parameter optimization, need to be completed alongside those on specific applications.

Therefore, this study aims to propose a method for selecting the optimal image segmentation size in CNNs for creating thematic maps based on remote sensing data. This methodological study is based on a forest type classification using high-resolution ortho-corrected aerial photos. The local variance (LV) trend was analyzed through spatial structure analysis to estimate the optimal image size, one that sufficiently contains the projected objects. The results of this study set the groundwork for applying deep-learning techniques to the field of remote sensing.

2. Materials and Study area

1) Materials

In this study, forest type classification was carried out using ortho-corrected aerial photos with a 0.51 m spatial resolution and a forest type map (1:5,000 scale), distributed by the National Geographic Information Institute and the Korea Forest Service, respectively (Table 1). The ortho-corrected aerial photos were taken in 2016 and are composed of three bands (red, green, and blue), using the Transverse Mercator projection and the GRS80 reference ellipsoid. The forest type map was produced from a combination of field surveys and the interpretation of aerial photos, and contains information regarding the existence of forests, as well as their origin, type, age class, diameter class, and crown density. The CNN analysis was conducted using the Keras library in Python 3.5.

Table 1. Data used for the analysis

OGCSBN_2018_v34n4_661_t0001.png

2) Study area

The selected study area covered Mt. Ung in Dangjin-si, Chungcheongnam-do, South Korea, and included both coniferous and deciduous forest. Two 200 m × 200 m plots, centered at 126°37′13.206″E, 36°48′46.0656″N and 126°37′2.514″E, 36°48′23.4468″N, were selected as sites A and B, respectively (Fig. 1). Site A was used as the training area; 20% of site B, selected at random programmatically, was used for validation, while the rest of site B was used as the testing area.

OGCSBN_2018_v34n4_661_f0001.png

Fig. 1. Study area: (a) Site A, used for training; (b) Site B, used for validation and testing.

3. Method

1) Spatial structure analysis using local variance

According to Woodcock and Strahler (1987), the spatial structure of images must be understood to select the appropriate data scale, and this spatial structure can be characterized by studying the relationship between spatial resolution and LV. LV is the average of the variances measured in a 3 × 3 pixel window moved across the image with the moving window technique (Eq. 1). The higher the spatial resolution, the greater the number of pixels with similar spectral characteristics, because a single object is composed of more pixels, resulting in a lower LV. As the spatial resolution decreases, the LV increases until the size of a single pixel approaches the size of the objects contained in the image. However, once the spectral properties of multiple objects begin to be mixed within a single pixel, the LV decreases as the spatial resolution decreases further (Fig. 2).

OGCSBN_2018_v34n4_661_f0002.png

Fig. 2. Relation between spatial structure and local variance as a function of spatial resolution.

\(L V=\frac{1}{n} \sum_{j} \sum_{i} w_{j i}, \qquad w_{j i}=\frac{1}{9} \sum_{y=j-1}^{j+1} \sum_{x=i-1}^{i+1}\left(d_{y x}-\bar{d}_{j i}\right)^{2}\)       (1)

n : Total number of pixels

j : Row (vertical) index of a pixel

i : Column (horizontal) index of a pixel

w : Variance within the 3 × 3 window centered on pixel (j, i)

d : Digital number value of a pixel

\(\bar{d}\) : Mean digital number value within the window

The size of objects was estimated by analyzing the spatial structure of the image, and the relationship between the size of objects, the image segmentation size, and the classification performance was analyzed. The cubic convolution resampling method was applied to the ortho-corrected aerial photos to create 27 different low-resolution images. The resolution of the images was degraded from 1.53 m to 28.05 m at intervals of 1.02 m, which is the size of two pixels in the original image. Since each optical image had three bands, single-band panchromatic images were created by image fusion to calculate the LV.
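As an illustration of this preprocessing step, the sketch below computes the LV of Eq. (1) over a 3 × 3 moving window and evaluates it on a series of degraded images. It is a minimal Python/NumPy sketch, not the authors' code: the panchromatic band is approximated as the mean of the three bands, and SciPy's cubic interpolation stands in for the cubic convolution resampling; array names and default values are assumptions based on the figures reported above.

```python
import numpy as np
from scipy.ndimage import uniform_filter, zoom

def local_variance(band: np.ndarray) -> float:
    """Mean of the 3x3-window variances over the whole image (Eq. 1)."""
    band = band.astype(np.float64)
    local_mean = uniform_filter(band, size=3)            # d-bar for each window
    local_mean_sq = uniform_filter(band ** 2, size=3)    # mean of d^2 for each window
    w = local_mean_sq - local_mean ** 2                  # variance of each 3x3 window
    return float(w.mean())                               # LV: average over all n pixels

def lv_curve(rgb: np.ndarray, native_res: float = 0.51) -> dict:
    """LV for each of the 27 degraded resolutions (1.53 m to 28.05 m in 1.02 m steps)."""
    pan = rgb.mean(axis=2)                               # assumed pan-fusion: band average
    sizes_m = np.arange(1.53, 28.05 + 1e-6, 1.02)
    curve = {}
    for size in sizes_m:
        factor = native_res / size                       # e.g. 0.51 / 1.53 = 1/3
        coarse = zoom(pan, factor, order=3)              # cubic interpolation, standing in
                                                         # for cubic convolution resampling
        curve[round(float(size), 2)] = local_variance(coarse)
    return curve
```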

2) Forest type classification by CNNs

Whereas traditional machine learning techniques perform classification based on features devised by researchers, deep-learning techniques perform end-to-end learning and find features by themselves. In other words, in earlier image classification methods, researchers extracted features with hand-designed convolution filters, emphasizing the contours of objects with high-pass filters or blurring them with low-pass filters. For CNNs, on the other hand, once the researcher specifies an architecture consisting of the number and size of the convolution filters, the values in the filters are automatically tuned to extract the optimal features for image classification through the backpropagation process, which applies the chain rule of differentiation.
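To make this contrast concrete, the short sketch below applies a fixed, hand-designed high-pass filter to an image patch, as in the classical workflow, and notes in comments how a CNN differs; the kernel, array sizes, and library calls are illustrative assumptions rather than anything specified in this study.

```python
import numpy as np
from scipy.ndimage import convolve

# A hand-designed 3x3 high-pass (Laplacian) kernel: in the classical workflow the
# researcher fixes these weights in advance to emphasize object contours.
high_pass = np.array([[ 0, -1,  0],
                      [-1,  4, -1],
                      [ 0, -1,  0]], dtype=np.float64)

patch = np.random.rand(64, 64)       # stand-in for a single-band image patch
edges = convolve(patch, high_pass)   # feature map produced by the fixed filter

# In a CNN, filters of the same 3x3 shape start from random values and are tuned
# by backpropagation so that the extracted features best separate the classes.
```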

Thus, deep learning can be useful when combined with big data, from which it is difficult for researchers to extract features manually (Najafabadi et al., 2015). For the same reason, CNNs, as a deep-learning technique specializing in image processing, are suitable for classifying high-resolution images. In this study, a forest type classification was performed using CNNs to analyze high-resolution ortho-corrected aerial photos with diverse segmentation sizes. The analysis was performed by increasing the length of one side of the square-shaped images by 1.02 m, from 1.53 m to 28.05 m. The segmentation sizes are the same as the spatial resolutions of the low-resolution images previously produced for the spatial structure analysis.
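A minimal sketch of how such square segments might be cut from the photo and labeled is shown below; it assumes simple non-overlapping tiling of NumPy arrays with a label taken from the forest type map at each patch centre, since the exact segmentation procedure is not detailed in the paper.

```python
import numpy as np

def tile_image(rgb: np.ndarray, forest_map: np.ndarray,
               patch_m: float, native_res: float = 0.51):
    """Cut the photo into non-overlapping square patches of side patch_m and
    label each patch with the forest type at its centre pixel (assumed layout)."""
    patch_px = int(round(patch_m / native_res))        # e.g. 15.81 m -> 31 px
    rows, cols = forest_map.shape
    patches, labels = [], []
    for r in range(0, rows - patch_px + 1, patch_px):
        for c in range(0, cols - patch_px + 1, patch_px):
            patches.append(rgb[r:r + patch_px, c:c + patch_px])
            labels.append(forest_map[r + patch_px // 2, c + patch_px // 2])
    return np.stack(patches), np.array(labels)
```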

The architecture of the CNNs is composed of 5 convolution layers, 1 pooling layer, and 2 fully connected layers, as shown in Fig. 3. The convolution layers were designed as stacks of 3 × 3 pixel filters, and the values in the filters were optimized for classification as the model learned from the data through the backpropagation process. The pooling layer used the max-pooling method, which extracts the maximum value from each block of four adjacent pixels, thereby speeding up the learning process and generalizing features so that the same result is produced regardless of the position and orientation of an object. The fully connected layers combine the extracted features and output a vector whose length equals the number of forest types to be classified.

OGCSBN_2018_v34n4_661_f0003.png

Fig. 3. CNNs architecture.
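The following Keras sketch reproduces the layer counts described above (five 3 × 3 convolution layers, one 2 × 2 max-pooling layer, and two fully connected layers). The number of filters per layer, the position of the pooling layer, the activations, and the optimizer are assumptions, as Fig. 3 and the text do not specify them.

```python
from tensorflow.keras import layers, models

def build_cnn(patch_px: int, n_classes: int = 2) -> models.Sequential:
    """Five 3x3 convolution layers, one 2x2 max-pooling layer, and two fully
    connected layers, following the layer counts of Fig. 3 (a sketch)."""
    model = models.Sequential([
        layers.Conv2D(32, 3, padding="same", activation="relu",
                      input_shape=(patch_px, patch_px, 3)),    # RGB input patch
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                                 # single 2x2 max-pooling layer
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                   # fully connected layer 1
        layers.Dense(n_classes, activation="softmax"),          # fully connected layer 2
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: 15.81 m patches are 31 x 31 pixels at the 0.51 m native resolution.
model = build_cnn(patch_px=31)
```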

The segmented images from site A were used for model training, and 20% of the segmented images from site B were selected randomly for the validation process used to monitor learning. The remaining 80% of the images from site B were used to test the classification results against the forest type map in order to assess the accuracy of the model.

The classification result derived from the CNNs represents the center of each segmented image, so classification cannot be performed at the boundary of the study area. Therefore, the larger the image segmentation size, the greater the boundary area excluded from the analysis. In this study, based on the classifiable area of the 28.05 m image, which loses the most area at the boundary, classification was performed only for areas 23 pixels or more away from the edge of the image, so that the accuracies could be compared under the same conditions.
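A simple way to impose this common evaluation area is sketched below: a boolean mask marking pixels at least 23 pixels from every edge, applied identically for every segmentation size. The array shape is an assumption (a 200 m plot is roughly 392 × 392 pixels at the 0.51 m native resolution).

```python
import numpy as np

def evaluation_mask(shape: tuple, margin_px: int = 23) -> np.ndarray:
    """Boolean mask of pixels at least margin_px pixels away from every edge,
    so that all segmentation sizes are scored over the same interior area."""
    mask = np.zeros(shape, dtype=bool)
    mask[margin_px:-margin_px, margin_px:-margin_px] = True
    return mask

# Assumed shape: a 200 m plot is roughly 392 x 392 pixels at 0.51 m resolution.
mask = evaluation_mask((392, 392))
```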

4. Results and Discussion

The results of the spatial structure analysis on the low-resolution images produced by resampling are shown in Fig. 4. Although the training site A and the validation and testing site B showed some differences, the two sites followed similar trends overall. Averaging the values of sites A and B, the LV increased from a resolution of 1.53 m to 7.65 m, corresponding to an increase from 3 to 15 pixels of the original image. Beyond 7.65 m, the LV fluctuated repeatedly but remained at a similar level or increased slightly up to around 18.87 m, corresponding to 37 pixels of the original image, and then decreased.

OGCSBN_2018_v34n4_661_f0004.png

Fig. 4. Local variance as a function of cell size.

It can be inferred that the objects in the images, identified by their spectral characteristics, have an approximate size between 7.65 m and 18.87 m. Visual interpretation also suggests that a number of trees have a crown width in this estimated range (Fig. 5). In addition, the LV fluctuated repeatedly beyond the 7.65 m resolution because each low-resolution image had different pixel centers as a result of the resampling process. If the center of a pixel is located close to the center of an object, the LV increases as the value of that pixel differs from those of the surrounding pixels; if the center of a pixel is located at the border of an object, the LV decreases as the spectral values of the object and its background are blended.

OGCSBN_2018_v34n4_661_f0005.png

Fig. 5. Measured canopy width by visual detection.

The accuracy of the forest type classification using CNNs for each image segmentation size is shown in Fig. 6. There were some slight fluctuations, but the accuracy tended to increase with image size up to 15.81 m, at which the classification recorded its peak accuracy of 84.53%. Beyond this size, the accuracy was maintained at a similar level. Therefore, a size close to 15.81 m is the optimal image size, consuming the least computing resources among the sizes with high classification performance. Fig. 7 shows the classification results obtained with an image segmentation size of 15.81 m.

OGCSBN_2018_v34n4_661_f0006.png

Fig. 6. Accuracy of the CNN model as a function of image size.

OGCSBN_2018_v34n4_661_f0007.png

Fig. 7. Forest type classification with 15.81 m square image.

These results highlight the applicability of spatial structure analysis in estimating the optimal image segmentation size. The optimum image size falls within the range of 7.65 m to 18.87 m, for which high LV values were recorded in the spatial structure analysis, and it can be presumed that the accuracy increases when the size of the segmented image is similar to the size of the objects, as sufficient information for classifying each object is then provided. In order to statistically examine the relationship between the LV and classification accuracy, the equivalent pixel size used in the spatial structure analysis and the image size used in the CNNs are compared in Fig. 8. The LV and accuracy were normalized, and linear regression analysis was performed after dividing the pixel or image sizes into two groups at 15.81 m, the optimum segmentation size.

OGCSBN_2018_v34n4_661_f0008.png

Fig. 8. Correlation between local variance and accuracy.
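The comparison in Fig. 8 can be reproduced along the lines of the sketch below, which min-max normalizes LV and accuracy and fits separate regressions below and above the 15.81 m optimum. The normalization method, the group boundary handling, and the use of SciPy's linregress are assumptions, as the paper does not state how the regression was computed.

```python
import numpy as np
from scipy.stats import linregress

def normalize(x) -> np.ndarray:
    """Min-max normalization to [0, 1] (assumed; the paper does not state the method)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def lv_accuracy_regression(sizes_m, lv, acc, optimum_m: float = 15.81) -> dict:
    """Regress normalized accuracy on normalized LV separately for sizes up to
    and beyond the optimum segmentation size."""
    sizes_m = np.asarray(sizes_m, dtype=float)
    lv_n, acc_n = normalize(lv), normalize(acc)
    results = {}
    # Whether the optimum size itself belongs to the lower group is an assumption.
    for name, sel in (("up to optimum", sizes_m <= optimum_m),
                      ("beyond optimum", sizes_m > optimum_m)):
        fit = linregress(lv_n[sel], acc_n[sel])
        results[name] = {"slope": fit.slope, "r_squared": fit.rvalue ** 2}
    return results
```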

As a result, the trend line of the group with a cell size less than the optimal image segmentation size recorded a slope of 1.6462 and an R2 value of 0.7633, indicating a positive correlation between the LV and model accuracy. Meanwhile, for the group exceeding the optimum image segmentation size, the slope of the trend line was 0.0455 and the R2 value was 0.0297, indicating that no statistically significant relationship exists between LV and accuracy. This implies that LV, showing clear linearity with the accuracy, could be an effective indicator for selecting an optimum image size. It can also be interpreted that, as the image size increases up to a certain level, the added object information aids in extracting features; once the image size exceeds that level, the image contains information pertaining to other objects or the background, which is unnecessary for extracting features.

5. Conclusion

In the process of creating a thematic map using CNNs, the image segmentation size acts as a hyper-parameter. As such, this variable must be determined, and this study aimed to suggest a method for optimizing the image segmentation size. As a result of calculating the LV across a range of spatial resolutions in a study area consisting of both coniferous and deciduous forests, the LV stabilized at high values at spatial resolutions ranging from 7.65 m to 18.87 m, and it was possible to estimate that the size of the objects, identified by their spectral characteristics, lies within this range. In addition, the accuracy of the CNNs for forest type classification was highest (84.53%) for a square image with a segmentation size of 15.81 m. Therefore, the correspondence between the optimal image segmentation size and the object size estimated by spatial structure analysis leads to the following conclusions.

First, the size of distinguishable objects, identified by their spectral characteristics, can be estimated by producing images at various spatial resolutions using the resampling method and analyzing the change in LV with spatial resolution. However, in the range where the spatial resolution is similar to the size of the objects, the spectral properties can be mixed depending on the position of the pixel centers, causing the LV to fluctuate. Thus, this method yields a range of spatial resolutions from which the size of objects can be inferred, rather than an exact figure.

Second, the accuracy of the CNNs by image segmentation size increases up to a certain level as the image size increases, and then plateaus. Thus, sufficiently large images must be used to achieve high accuracy, but the larger the image size, the more time and computing resources are consumed in the analysis. Therefore, it is desirable to find the optimum image size, where the accuracy stabilizes.

Finally, the relationship between the LV and classification accuracy was analyzed by pairing the pixel sizes used in the spatial structure analysis with the image segmentation sizes used in the CNNs. As a result, it was found that there is a positive correlation between the LV and classification accuracy up to the optimum segmentation size, and no statistically significant relationship once the segmentation size exceeds the optimum value. This implies that, when the segmented images are just large enough to include the objects in the image, the features required for classification can be extracted efficiently; if the image size is larger than the objects, on the other hand, the accuracy is not likely to increase, as unnecessary information is added. Thus, it is possible to select the estimated size of objects as a starting point from which to tune the image segmentation size, since the former can be calculated relatively simply.

In this study, the classification of a forest was performed in an area where both coniferous and deciduous forests exist. As a result, it has been shown that the optimum image segmentation size could be estimated by spatial structure analysis in order to increase the forest type classification performance and to minimize computing resources. However, because the classification was only performed on limited subjects and data, additional analyses should be conducted with subjects other than forest type in various study areas with diverse image types to conclude whether the method is generally applicable in CNNs. Moreover, since different hyper-parameters can affect one another, it is possible that the accuracy at each segmentation size may change when other hyperparameters, such as the architecture of CNNs, are modified. Therefore, additional research is required to find the optimal combination of multiple hyper-parameters.

Acknowledgement

This study was carried out with the support of the “2016 Public Policy Development based on Environmental Policy (Project No. 2016000210001)” provided by Korea Environmental Industry and Technology Institute (Republic of Korea).

References

  1. Castelluccio, M., G. Poggi, C. Sansone, and L. Verdoliva, 2015. Land use classification in remote sensing images by convolutional neural networks, https://arxiv.org, Accessed on Aug. 22, 2018.
  2. Chica-Olmo, M. and F. Abarca-Hernandez, 2000. Computing geostatistical image texture for remotely sensed data classification, Computers & Geosciences, 26(4): 373-383. https://doi.org/10.1016/S0098-3004(99)00118-1
  3. Kashung, Y., B. Das, S. Deka, R. Bordoloi, A. Paul, and O. P. Tripathi, 2018. Geospatial technology based diversity and above ground biomass assessment of woody species of West Kameng district of Arunachal Pradesh, Forest Science and Technology, 14(2): 84-90. https://doi.org/10.1080/21580103.2018.1452797
  4. Ilievski, I., T. Akhtar, J. Feng, and C. A. Shoemaker, 2017. Efficient Hyperparameter Optimization for Deep Learning Algorithms Using Deterministic RBF Surrogates, Proc. of 2017 Association for the Advancement of Artificial Intelligence Conference, San Francisco, CA, Feb. 4-9, pp. 822-829.
  5. LeCun, Y. and Y. Bengio, 1995. Convolutional networks for images, speech, and time series, MIT Press, Cambridge, MA, USA.
  6. Marmanis, D., M. Datcu, T. Esch, and U. Stilla, 2016. Deep learning earth observation classification using ImageNet pretrained networks, IEEE Geoscience and Remote Sensing Letters, 13(1): 105-109. https://doi.org/10.1109/LGRS.2015.2499239
  7. Najafabadi, M. M., F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, 2015. Deep learning applications and challenges in big data analytics, Journal of Big Data, 2(1): 1.
  8. Simonyan, K. and A. Zisserman, 2014. Very deep convolutional networks for large-scale image recognition, https://arxiv.org, Accessed on Aug. 22, 2018.
  9. Song, A. R. and Y. I. Kim, 2017. Deep Learning-based Hyperspectral Image Classification with Application to Environmental Geographic Information Systems, Korean Journal of Remote Sensing, 33(6): 1061-1073 (in Korean with English abstract). https://doi.org/10.7780/kjrs.2017.33.6.2.3
  10. Szegedy, C., S. Ioffe, V. Vanhoucke, and A. A. Alemi, 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning, Association for the Advancement of Artificial Intelligence, 4: 12.
  11. Woodcock, C. E. and A. H. Strahler, 1987. The factor of scale in remote sensing, Remote Sensing of Environment, 21(3): 311-332. https://doi.org/10.1016/0034-4257(87)90015-0
  12. Yuan, F., K. E. Sawaya, B. C. Loeffelholz, and M. E. Bauer, 2005. Land cover classification and change analysis of the Twin Cities (Minnesota) Metropolitan Area by multitemporal Landsat remote sensing, Remote Sensing of Environment, 98(2-3): 317-328. https://doi.org/10.1016/j.rse.2005.08.006