

Dataset download instructions.


The OpenSentinelMap dataset contains Sentinel-2 imagery and per-pixel semantic label masks derived from OpenStreetMap. It is described in this paper.


Data Access

The dataset may be freely downloaded from SharePoint here.

As a backup option, or for faster download speeds, the dataset is also available on Amazon S3. You can use the following command to download it, but be aware that the requester pays for the transfer: Amazon will charge your AWS account about $40 in data transfer fees (roughly 9 cents per GB for 445 GB in total).

aws s3 cp s3://vsi-open-sentinel-map/ ./open-sentinel-map --recursive --request-payer

Data Format


Image data is separated by year, from 2017 to 2020. Each year’s worth of Sentinel imagery is compressed into an osm_sentinel_imagery_{YEAR}.tgz file. These files can be untarred using the following command.

tar -xvzf osm_sentinel_imagery_{YEAR}.tgz
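For scripted pipelines, the same extraction can be done from Python with the standard-library tarfile module. This is a minimal sketch; the function name extract_year and the destination directory are illustrative and not part of the dataset tooling.

```python
import tarfile
from pathlib import Path


def extract_year(archive_path, dest_dir):
    """Extract one yearly imagery archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" matches the gzip-compressed .tgz archives described above.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling extract_year("osm_sentinel_imagery_2017.tgz", "imagery") once per year from 2017 to 2020 would unpack the full imagery set.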

The untarred folders of sentinel imagery will have the format


where each .npz file is a compressed NumPy file containing the 32-bit float Bottom-of-Atmosphere imagery data. This file can be loaded from Python using the numpy.load function, and the bands accessed via their keys. The bands are grouped by spatial resolution and are accessible using the key “gsd_{RESOLUTION}” (i.e. “gsd_10”, “gsd_20”, “gsd_60”).

The “gsd_10” array bands have the order blue, green, red, and then NIR. The “gsd_20” array contains four vegetation red edge bands, followed by two SWIR bands. The “gsd_60” array consists of the coastal aerosol and water vapour bands. The exact corresponding bands from the Sentinel-2 platform are listed in the table below. Find more information about these spectral bands here.

Data Key | Sentinel-2 Bands
gsd_10   | B02, B03, B04, B08
gsd_20   | B05, B06, B07, B8A, B11, B12
gsd_60   | B01, B09
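As an illustration of the key layout above, the following sketch loads one imagery file with numpy.load and picks out a single band by its Sentinel-2 name. The axis ordering of the stored arrays is not stated here, so the hypothetical helper band_axis guesses the band axis from the expected band count and raises an error if that guess is ambiguous; the helper names are assumptions, not part of the dataset.

```python
import numpy as np

# Band names per key, copied from the table above.
BANDS = {
    "gsd_10": ["B02", "B03", "B04", "B08"],
    "gsd_20": ["B05", "B06", "B07", "B8A", "B11", "B12"],
    "gsd_60": ["B01", "B09"],
}


def load_bands(npz_path):
    """Return {key: array} for the three resolution groups in one .npz file."""
    with np.load(npz_path) as data:
        return {key: data[key] for key in BANDS}


def band_axis(array, key):
    """Guess the band axis by matching the expected band count (assumption)."""
    n = len(BANDS[key])
    matches = [ax for ax, size in enumerate(array.shape) if size == n]
    if len(matches) != 1:
        raise ValueError(f"cannot identify band axis for {key}: shape {array.shape}")
    return matches[0]


def select_band(array, key, band):
    """Return the 2-D slice for one named band, e.g. select_band(a, "gsd_10", "B04")."""
    axis = band_axis(array, key)
    return np.take(array, BANDS[key].index(band), axis=axis)
```

Printing the shape of each loaded array before relying on band_axis is a sensible first check, since a tile whose height or width happens to equal the band count would make the guess ambiguous.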

The image files also contain an “scl” band and a “bad_percent” value. The “scl” band contains the Scene Classification Layer values, which indicate the quality of each pixel at 20 m resolution. These values are described in Figure 3 here.
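Because the “scl” band is at 20 m resolution, masking the 10 m bands requires upsampling it first. The sketch below assumes the “scl” array is 2-D and that the 10 m arrays are exactly twice its size in each spatial dimension. The set of classes treated as bad is an illustrative choice based on the standard Sentinel-2 SCL class codes; it is not necessarily the definition used to build this dataset.

```python
import numpy as np

# Illustrative choice only: no data (0), saturated or defective (1),
# cloud shadow (3), cloud medium/high probability (8, 9), thin cirrus (10).
EXAMPLE_BAD_CLASSES = (0, 1, 3, 8, 9, 10)


def bad_mask_20m(scl, bad_classes=EXAMPLE_BAD_CLASSES):
    """Boolean mask at 20 m, True where the SCL class is in bad_classes."""
    return np.isin(scl, bad_classes)


def upsample_2x(mask):
    """Nearest-neighbour upsample of a 2-D mask by 2 along each axis (20 m to 10 m)."""
    return np.repeat(np.repeat(mask, 2, axis=0), 2, axis=1)
```

The upsampled mask can then be used to index or zero out pixels of a single 10 m band of matching shape.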

The “bad_percent” value is a float between 0 and 1 which gives the fraction of pixels within the “scl” band that we’ve determined to be bad data. Currently we include images with up to 25% bad data (a value of 0.25). You can use this key to filter the dataset using a lower threshold.
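Filtering on a stricter threshold could look like the following sketch. The function names and the default threshold of 0.10 are illustrative; only the “bad_percent” key comes from the description above.

```python
import numpy as np


def keep_image(npz_path, max_bad=0.10):
    """True if the file's bad_percent is at or below max_bad (a value in [0, 1])."""
    with np.load(npz_path) as data:
        # .item() converts a 0-d or single-element array to a Python float.
        bad = data["bad_percent"].item()
    return bad <= max_bad


def filter_images(paths, max_bad=0.10):
    """Keep only the paths whose bad_percent passes the threshold."""
    return [p for p in paths if keep_image(p, max_bad)]
```

For example, filter_images(paths, max_bad=0.05) would keep only images with at most 5% bad pixels.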


The label images can be untarred using the command

tar -xvzf osm_label_images.tgz

These images are in PNG format, with label values as described in the osm_categories.json file.

Auxiliary Data

The spatial_cell_info CSV file contains metadata for each spatial cell: the lon/lat bounds, the MGRS tile it is within, and the training split it belongs to. Note that the current data split was performed at the MGRS tile level to prevent data leakage. Use caution if performing your own train/test split.
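If you do build your own split, one way to keep it leak-free is to assign whole MGRS tiles to a split and then verify that no tile ends up in two splits. The sketch below uses only the standard library. The column names "MGRS_tile" and "split" are hypothetical placeholders, so check the CSV header and pass the real names; the hash-based assignment is one possible scheme, not the method used for the published split.

```python
import csv
import hashlib
from collections import defaultdict


def read_rows(csv_path):
    """Read the spatial cell CSV into a list of dicts keyed by column name."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))


def tiles_by_split(rows, tile_col="MGRS_tile", split_col="split"):
    """Map each split name to the set of MGRS tiles it contains."""
    out = defaultdict(set)
    for row in rows:
        out[row[split_col]].add(row[tile_col])
    return dict(out)


def leaking_tiles(rows, tile_col="MGRS_tile", split_col="split"):
    """Return tiles that appear in more than one split (empty set means no leakage)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[tile_col]].add(row[split_col])
    return {tile for tile, splits in seen.items() if len(splits) > 1}


def assign_split(tile, test_fraction=0.2):
    """Deterministically assign a whole tile to 'train' or 'test' from a hash of its name."""
    digest = hashlib.sha256(tile.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"
```

Because assign_split depends only on the tile name, every spatial cell within a tile lands in the same split, which is the property the tile-level split is meant to guarantee.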

The osm_categories JSON file details the exact mapping from OpenStreetMap tags to OpenSentinelMap labels.


This dataset is made available under the MIT license and is freely available for both academic and commercial use.

Access to Sentinel data is free, full and open for the broad Regional, National, European and International user community. View Terms and Conditions.

OpenStreetMap® is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF).



How to Cite


    author    = {Johnson, Noah and Treible, Wayne and Crispell, Daniel},
    title     = {OpenSentinelMap: A Large-Scale Land Use Dataset Using OpenStreetMap and Sentinel-2 Imagery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {1333-1341}


This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2021-2011000004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.