Refers to EO data that has been processed to a level that allows the direct utilization of the data in analytic operations such as time series analysis. The exact specification of what constitutes these relevant pre-processing steps is to a certain degree user and application dependent. CEOS provides specifications of ARD for Radar backscatter, surface reflectance and surface temperature data in the context of land applications.
The goal is to make problems and applications encountered in Earth Observation accessible to a much wider community of users, including especially people with backgrounds in artificial intelligence, machine learning, computer science or applied mathematics. This involves removing any obstacles related to the pre-processing of EO data into an ARD format and making it technically accessible so that the data can be easily incorporated into development environments.
AI is intelligence exhibited by machines that can observe, perceive and act upon their environment to maximize their chance of success at some goal. It refers to the capacity of an algorithm to assimilate information in order to perform tasks that are characteristic of human intelligence, such as recognizing objects and sounds, contextualizing language, learning from the environment, and problem solving. AI systems in development today remain mainly “optimizing” tools, operating as specialized expert systems that use a database of knowledge to make decisions (mainly through inference, not really “Intelligence”). Researchers are, however, working on a “strong” AI in which machines can perform the full range of human cognitive capabilities. Within this report, the term “AI” is therefore used as a generic term that refers mainly to Machine Learning adapted to work with geospatial data.
A model is a mathematical and machine readable representation of a “problem space” generated by algorithms processing training data that is representative of this problem space. A model can then be used to make predictions based on some input data.
A set of elements, metadata, data model and controlled vocabularies that a given AIREO TDS follows.
A number of profiles may apply to a given AIREO TDS, e.g. GeoreferencedImage and ReferenceData.
A benchmark dataset is a reference dataset that is widely considered to be of good quality and to cover all parts of the “problem space” sufficiently, to the extent that it can be used to evaluate the performance of different models so that these can be measured against one another.
The term benchmarking is used in machine learning (ML) to refer to the evaluation and comparison of ML methods regarding their ability to learn patterns in ‘benchmark’ datasets that have been adopted as ‘standards’. Benchmarking could be thought of simply as a sanity check to confirm that a new method runs as expected and can reliably find simple patterns that existing methods are known to identify. A more rigorous way to view benchmarking is as an approach to identify the respective strengths and weaknesses of a given methodology in contrast with others. Comparisons could be made over a range of evaluation metrics, e.g. power to detect signal, prediction accuracy, computational complexity, and model interpretability. This approach to benchmarking would be important for demonstrating new methodological abilities or simply for guiding the selection of an appropriate ML method for a given problem.
A controlled vocabulary is a carefully selected list of words and phrases. Restricting terms to this list avoids ambiguity and enables automation, such as automated consistency checking.
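The automated consistency checking mentioned above can be sketched as follows. This is a minimal illustration, not part of any AIREO specification: the "profiles" metadata key and the check_profiles helper are hypothetical, and the two allowed terms are simply the profile names used as examples elsewhere in this glossary.

```python
# Hypothetical controlled vocabulary: the only accepted profile names.
ALLOWED_PROFILES = {"GeoreferencedImage", "ReferenceData"}


def check_profiles(metadata: dict) -> list:
    """Return the profile names in `metadata` that are not in the vocabulary."""
    return [p for p in metadata.get("profiles", []) if p not in ALLOWED_PROFILES]


# The second entry differs only in capitalisation, so it is flagged.
record = {"profiles": ["GeoreferencedImage", "referencedata"]}
unknown_terms = check_profiles(record)
print(unknown_terms)
```

Because the vocabulary is a closed list, a near-miss such as a differently capitalised term is detected mechanically rather than by a human reader.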
Documentation as used here refers to human readable information about the dataset, used to present general information about the dataset, such as how it was generated, how it should be used and who is responsible for maintaining it.
An abstraction of the real world which incorporates only those properties thought to be relevant to the application at hand. The data model would normally define specific groups of entities, their attributes, and the relationships between these entities.
Description of metadata common to satellite-based or remotely sensed EO products.
A machine-readable script for generating the specified dataset from a data source.
In 2016, the “FAIR Guiding Principles for scientific data management and stewardship” were published in Scientific Data. The authors intended to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.
Members not described in the specification (“foreign members”) may be used in a GeoJSON document. The support for foreign members can vary across different implementations, and no normative processing model for foreign members is defined. Accordingly, implementations that rely too heavily on the use of foreign members might experience reduced interoperability with other implementations.
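As a rough illustration of a foreign member, the sketch below builds a GeoJSON Feature as a Python dictionary and lists the top-level keys that the GeoJSON specification does not define for a Feature. The "aireo_note" key, the coordinates and the "label" property are invented for the example.

```python
import json

# A GeoJSON Feature with one extra top-level key that the specification
# does not define ("aireo_note" is a hypothetical foreign member).
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-6.26, 53.35]},
    "properties": {"label": "urban"},
    "aireo_note": "hypothetical foreign member",
}

# Top-level members the GeoJSON specification defines for a Feature object.
SPEC_MEMBERS = {"type", "geometry", "properties", "id", "bbox"}

foreign = sorted(key for key in feature if key not in SPEC_MEMBERS)
print(foreign)

# The document still serialises and parses as ordinary JSON; whether another
# implementation keeps or drops the foreign member is not guaranteed.
decoded = json.loads(json.dumps(feature))
```

Placing application-specific values inside "properties" rather than as foreign members is one way to reduce the interoperability risk described above.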
Features are the variables that compose the data input of a machine learning (ML) model. In ML, they are typically referred to as input variables or independent variables. In the context of EO these can be the pixel values of different spectral bands in optical data or the values of backscatter coefficients in different polarisations for SAR data. EO features can also be derived variables, such as vegetation indices (e.g. NDVI) or HH polarization ratios. The process of using domain knowledge to extract features suitable for training an ML model from raw input data is termed feature engineering.
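A derived feature such as NDVI can be computed per pixel from the red and near-infrared bands as (NIR - Red) / (NIR + Red). The sketch below uses plain Python lists and made-up reflectance values; returning 0.0 where both bands are zero is an assumption made here to avoid division by zero, and a real pipeline would more likely use an array library.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalised Difference Vegetation Index for a single pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: report 0 where both bands are zero
    return (nir - red) / total


# Made-up 2x2 reflectance values for the red and near-infrared bands.
red_band = [[0.1, 0.2], [0.0, 0.3]]
nir_band = [[0.5, 0.2], [0.0, 0.1]]

# Apply the index pixel by pixel to build a derived feature layer.
ndvi_band = [
    [ndvi(n, r) for n, r in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir_band, red_band)
]
print(ndvi_band)
```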
ML is a branch of AI which uses algorithms to develop models based on data and human interactions (e.g. supervision). These models can then be used to make predictions. The data-based models generated by ML algorithms can also be interpreted as an organisation or structure in the data space, thus uncovering unknown properties and patterns: this process is known as data mining. ML relies on a wide variety of algorithms (supervised and unsupervised), ranging from simple symbolic regression, neural networks, decision trees and Support Vector Machines (SVM) up to genetic programming and ensemble methods such as random forest.
A (syntactic) schema that describes metadata, allowing it to be checked for correctness.
Data about data or a service. Documentation of the data or service, typically specified with a controlled vocabulary.
In the train/test split practice, this is the part of the AIREO TDS that has been set aside to train the ML model. Unlike a test or validation set, this data is used to update the parameters of the model.
A collection of standards, with parameters, options, classes, or subsets, necessary for building a complete computer system, application, or function. An implementation case of a more general standard or set of standards.
Also referred to as Label Data or Target Data in the Machine Learning context, Reference Data is a set of measurements that accurately describe a phenomenon of interest, to be predicted as the output of an ML model. Other names for reference data in the context of EO include ground truth data or Calibration/Validation data. The reference data for a given problem in EO should fully describe all the possible values in the “problem space”. Reference data can come from a wide range of sources including in-situ measurements, interpreted very high resolution EO imagery, IoT devices and others.
A TDS is self-contained and includes pairs of labels/annotations as well as the EO and/or other input data, whether this is data in the form of imagery or alphanumeric data. Once a user has an AIREO TDS, they should be able to train their AI model without any extra data. This does not mean that a user could not add external data to aid training.
The representation of spatial data as a matrix of valued cells.
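The idea of a matrix of valued cells can be sketched with a small grid and a lookup from map coordinates to a cell. The origin, cell size and cell values below are invented for illustration, and the dictionary layout is a hypothetical stand-in for a real raster format.

```python
# A tiny raster: a 3x3 matrix of cell values plus the georeferencing needed
# to place it in a coordinate space (all values here are hypothetical).
raster = {
    "origin_x": 0.0,   # x coordinate of the upper-left corner
    "origin_y": 3.0,   # y coordinate of the upper-left corner
    "cell_size": 1.0,  # width and height of a square cell
    "cells": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ],
}


def value_at(raster: dict, x: float, y: float) -> int:
    """Return the value of the cell that contains the point (x, y)."""
    col = int((x - raster["origin_x"]) // raster["cell_size"])
    row = int((raster["origin_y"] - y) // raster["cell_size"])  # rows run downwards
    cells = raster["cells"]
    if not (0 <= row < len(cells) and 0 <= col < len(cells[0])):
        raise IndexError("point lies outside the raster extent")
    return cells[row][col]


value = value_at(raster, 1.5, 2.5)
print(value)
```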
Acquisition of data about aspects of the Earth-Ocean-Atmosphere System using sensors at a distance.
The machine learning task of learning a function that maps features to reference data based on example feature-reference pairs. In supervised learning, each example is a pair consisting of a feature and a desired reference data value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
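As a minimal sketch of this idea, the example below infers a straight-line function from four feature-reference pairs using ordinary least squares and applies it to a new example. The data values and the fit_line helper are invented for illustration; least squares is only one of many supervised learning algorithms.

```python
def fit_line(features, references):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns the inferred function, which maps a new feature to a prediction.
    """
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(references) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, references))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Example feature-reference pairs (made-up values lying on y = 2x + 1).
features = [0.0, 1.0, 2.0, 3.0]
references = [1.0, 3.0, 5.0, 7.0]

predict = fit_line(features, references)
prediction = predict(4.0)  # apply the inferred function to a new example
print(prediction)
```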
Splittable is a desirable property of an AIREO TDS data format. A splittable file format allows smaller fragments of an AIREO TDS to be accessed from a central AIREO TDS without the need to pull the entirety of the AIREO TDS.
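One simple way to picture a splittable layout is a file of fixed-size records, where any single sample can be read by seeking to its byte offset. The record layout (four 32-bit floats per sample), the file name and the read_sample helper below are assumptions made for this sketch and do not describe an actual AIREO format.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size record: one sample stored as four float32 values.
RECORD = struct.Struct("<4f")


def read_sample(path: str, index: int) -> tuple:
    """Read a single sample by seeking to its offset, without loading the file."""
    with open(path, "rb") as fh:
        fh.seek(index * RECORD.size)
        return RECORD.unpack(fh.read(RECORD.size))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tds.bin")
    # Write 1000 samples to stand in for a large dataset file.
    with open(path, "wb") as fh:
        for i in range(1000):
            fh.write(RECORD.pack(i, i + 0.5, i + 1.0, i + 1.5))
    # Only the 16 bytes of sample 500 are read back.
    sample = read_sample(path, 500)

print(sample)
```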
To evaluate how the trained model performs in reality, it must be tested on previously unseen data. The test set of an AIREO TDS is therefore the data that is in neither the training set nor the validation set. It should reflect the distribution of the data expected in reality, and various scores pertaining to the accuracy of the model are measured on this set.
Training is the process of identifying ideal parameters of an AI model using an AIREO TDS.
The validation set of the AIREO TDS is used to evaluate model performance during training. It is also used to choose the hyperparameters of the model. Keeping the validation set separate from the training data, and using the same validation set at each training iteration, keeps the evaluation consistent and not biased by the data used to update the model parameters.
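The training, validation and test sets described in this glossary can be produced by shuffling the samples once and cutting the result into three disjoint parts. The sketch below assumes a 70/15/15 split and a fixed random seed; both are illustrative choices, and split_dataset is a hypothetical helper.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


samples = list(range(20))  # stand-in for 20 feature-reference pairs
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

A purely random split is only one option. For EO data, samples that are close in space or time are often correlated, so a split by region or acquisition date may give a more realistic test set.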
A representation of the spatial extent of geographic features using geometric elements such as points, polylines and polygons in a coordinate space.
To subscribe to the AIREO network or to contact the AIREO Team please email email@example.com