
Multi-Modal Deep Learning with Sentinel-1 and Sentinel-2 Data for Urban Mapping and Change Detection

Time: Wed 2022-06-15 09.00

Location: U1, Brinellvägen 26, Stockholm

Language: English

Subject area: Geodesy and Geoinformatics, Geoinformatics

Doctoral student: Sebastian Hafner, Geoinformatics

Opponent: Professor Paolo Gamba, University of Pavia

Supervisor: Professor Yifang Ban, Geoinformatics

Driven by the rapid growth in population, urbanization is progressing at an unprecedented rate in many places around the world. Earth observation has become an invaluable tool to monitor urbanization on a global scale by either mapping the extent of cities or detecting newly constructed urban areas within and around cities. In particular, the Sentinel-1 (S1) Synthetic Aperture Radar (SAR) and Sentinel-2 (S2) MultiSpectral Instrument (MSI) missions offer new opportunities for urban mapping and urban Change Detection (CD) thanks to their capability of systematically acquiring wide-swath, high-resolution imagery with frequent revisits globally.

Current trends in both urban mapping and urban CD have shifted from employing traditional machine learning methods to Deep Learning (DL) models, specifically Convolutional Neural Networks (CNNs). Recent urban mapping efforts achieved promising results by training CNNs on available built-up data using S2 images. Likewise, DL models have been applied to urban CD problems using S2 data with promising results.

However, the quality of current methods strongly depends on the availability of local reference data for supervised training, especially since CNNs applied to unseen areas often produce unsatisfactory results due to their insufficient across-region generalization ability. Since multitemporal reference data are even more difficult to obtain, unsupervised learning has been suggested for urban CD. While unsupervised models may perform more consistently across different regions, they often perform considerably worse than their supervised counterparts. To alleviate these shortcomings, it is desirable to leverage Semi-Supervised Learning (SSL), which exploits unlabeled data to improve upon supervised learning, especially because satellite data are plentiful. Furthermore, the integration of SAR data into current optical frameworks (i.e., data fusion) has the potential to produce models with better generalization ability, because the representation of urban areas in SAR images is largely invariant across cities, while spectral signatures vary greatly.

In this thesis, a novel Domain Adaptation (DA) approach using SSL is first presented. The DA approach jointly exploits Multi-Modal (MM) S1 SAR and S2 MSI data to improve across-region generalization for built-up area mapping. Specifically, two identical sub-networks are incorporated into the proposed model to perform built-up area segmentation from SAR and optical images separately. Assuming that consistent built-up area segmentation should be obtained across data modalities, an unsupervised loss for unlabeled data was designed that penalizes inconsistent segmentation from the two sub-networks. The complementary data modalities thus serve as real-world perturbations for Consistency Regularization (CR). For the final prediction, the model takes both data modalities into account. Experiments conducted on a test set comprising sixty representative sites across the world showed that the proposed DA approach achieves strong improvements (F1 score 0.694) over supervised learning from S1 SAR data (F1 score 0.574), S2 MSI data (F1 score 0.580), and their input-level fusion (F1 score 0.651). A comparison with two state-of-the-art global human settlement maps, namely GHS-S2 and WSF2019, showed that our model is capable of producing built-up area maps of comparable or even better quality.
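The unsupervised consistency term can be illustrated with a minimal sketch (plain NumPy, toy probability maps; the exact loss formulation in the thesis may differ). The key point is that the loss penalizes disagreement between the built-up probability maps of the two sub-networks and requires no labels:

```python
import numpy as np

def consistency_loss(p_sar, p_opt):
    """Unsupervised consistency term: mean squared difference between the
    built-up probability maps predicted by the SAR and optical sub-networks.
    No reference labels are needed, so it can be computed on unlabeled data."""
    return float(np.mean((p_sar - p_opt) ** 2))

# Toy per-pixel built-up probabilities from the two sub-networks
# (hypothetical values, not thesis results)
p_sar = np.array([[0.9, 0.1], [0.8, 0.2]])
p_opt = np.array([[0.8, 0.1], [0.2, 0.3]])

loss = consistency_loss(p_sar, p_opt)  # loss ≈ 0.095
```

When the two modalities agree everywhere, the term vanishes, so minimizing it pushes the sub-networks toward modality-invariant predictions on unlabeled regions.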

For urban CD, a new network architecture for the fusion of SAR and optical data is proposed. Specifically, a dual-stream concept was introduced to process the different data modalities separately before combining the extracted features at a later decision stage. The individual streams are based on the U-Net architecture. The proposed strategy outperformed other U-Net-based approaches using uni-modal data or MM data with feature-level fusion, and achieved state-of-the-art performance on a popular urban CD dataset (F1 score 0.600).
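The dual-stream, decision-level fusion idea can be sketched in a few lines of NumPy. Here, toy per-pixel linear projections stand in for the two U-Net streams, and the channel counts, names, and fusion rule (a simple average) are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stream(x, w):
    # Toy stand-in for a U-Net stream: per-pixel linear projection + sigmoid,
    # mapping C input channels to a single change probability per pixel.
    z = np.tensordot(x, w, axes=([0], [0]))  # (C, H, W) -> (H, W)
    return 1.0 / (1.0 + np.exp(-z))

H, W = 4, 4
sar = rng.standard_normal((2, H, W))   # hypothetical S1 input (2 polarizations)
opt = rng.standard_normal((4, H, W))   # hypothetical S2 input (4 bands)

w_sar = rng.standard_normal(2)         # weights of the SAR stream
w_opt = rng.standard_normal(4)         # weights of the optical stream

# Each modality is processed by its own stream...
p_sar = stream(sar, w_sar)
p_opt = stream(opt, w_opt)

# ...and the streams are only combined at the decision stage,
# here simply by averaging the per-pixel change probabilities.
p_fused = 0.5 * (p_sar + p_opt)
change_map = (p_fused > 0.5).astype(int)
```

Keeping the streams separate until the decision stage lets each one learn modality-specific features, in contrast to input-level fusion, where SAR and optical bands are stacked before the first layer.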

Finally, a new network architecture is proposed to adapt Multi-Modal Consistency Regularization (MMCR) to urban CD. Using bi-temporal S1 SAR and S2 MSI image pairs as input, the MM Siamese Difference (Siam-Diff) Dual-Task (DT) network not only predicts changes with a difference decoder, but also segments buildings in each image with a semantic decoder. The network is trained in a semi-supervised fashion using the underlying idea of MMCR, namely that building segmentation should be consistent across sensor modalities, to learn more robust features. The proposed method was tested on an urban CD task using the 60 sites of the SpaceNet7 dataset. A domain gap was introduced by only using labels for sites located in the Western World, where geospatial data are typically less sparse than in the Global South. MMCR achieved an average F1 score of 0.444 when applied to sites outside of the source domain, a considerable improvement over several supervised models (F1 scores between 0.107 and 0.424).
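A rough sketch of how such a semi-supervised objective might be assembled (all names are hypothetical and the consistency term here is a simple MSE; the thesis's exact loss may differ): labeled sites contribute a supervised change-detection loss, while every site, labeled or not, contributes the cross-modal building-segmentation consistency term:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # Pixel-wise binary cross-entropy, averaged over the map.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def mmcr_loss(pred, labels, lam=0.5):
    """Hypothetical combined loss for a dual-task network.
    pred holds the change map plus per-date building maps from each modality;
    labels is None for sites outside the labeled (source) domain."""
    # Consistency: building segmentation should agree across SAR and optical,
    # at both acquisition dates. Computable without any labels.
    cons = np.mean((pred["sem_sar_t1"] - pred["sem_opt_t1"]) ** 2) \
         + np.mean((pred["sem_sar_t2"] - pred["sem_opt_t2"]) ** 2)
    if labels is None:  # unlabeled site: consistency term only
        return lam * float(cons)
    sup = bce(pred["change"], labels["change"])  # supervised change loss
    return sup + lam * float(cons)

# Toy 1x1 predictions for a single site (illustrative values only)
pred = {
    "sem_sar_t1": np.array([[0.8]]), "sem_opt_t1": np.array([[0.6]]),
    "sem_sar_t2": np.array([[0.2]]), "sem_opt_t2": np.array([[0.2]]),
    "change": np.array([[0.9]]),
}
unlabeled = mmcr_loss(pred, None)                          # consistency only
labeled = mmcr_loss(pred, {"change": np.array([[1.0]])})   # adds supervised term
```

Because the consistency term is defined on predictions alone, gradient signal is available even for target-domain sites without reference labels, which is what lets the model adapt across the domain gap.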

The combined findings of this thesis contribute to the mapping and monitoring of cities on a global scale, which is crucial to support sustainable urban planning and the monitoring of urban Sustainable Development Goal (SDG) indicators.