
- How does LabCore help project partners to standardize and manage experimental data?
The LabCore Digital Notebook Platform was designed specifically to satisfy the EU's FAIR data principles. Thanks to our parsers, scientific data can be incorporated into digital notebooks directly from raw data files and stored together with its metadata. The parsed data is formatted consistently across the entire platform, regardless of its source format, and can be readily collated into machine-learning-ready datasets. Sharing notebooks is easy and secure, allowing partners to quickly compare and work on their data directly within the platform. This significantly accelerates research and ensures that the original data, and the metadata necessary to understand it, are never lost.
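As an illustration of the idea only (the actual LabCore parser interfaces are not described here, so all names below are hypothetical), a parser of this kind might turn a raw instrument export into a source-independent record that keeps the measured values and their metadata together:

```python
# Hypothetical sketch of the parsing idea described above: a raw instrument file
# is read once, and the measurements plus their metadata end up in a single,
# source-independent record. Names do not reflect the real LabCore API.
import csv
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParsedMeasurement:
    """Source-independent record: data values plus the metadata needed to interpret them."""
    sample_id: str
    quantity: str                  # e.g. "surface_area"
    values: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_raw_csv(path: str, sample_id: str, quantity: str) -> ParsedMeasurement:
    """Read a simple single-column raw CSV export, keeping '#' header lines as metadata."""
    values, metadata = [], {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if not row:
                continue
            if row[0].startswith("#"):                      # e.g. "# instrument: XYZ"
                key, _, val = row[0].lstrip("# ").partition(":")
                metadata[key.strip()] = val.strip()
            else:
                values.append(float(row[0]))
    return ParsedMeasurement(sample_id, quantity, values, metadata)
```

Records of this shape can then be collated across sources into one consistent, machine-learning-ready table, which is the behaviour described above.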
- What have been the main challenges for NANO regarding the generation and harmonization of FAIR data, both from newly created measurements and from data already existing within the consortium, and what are your plans to overcome those challenges in order to guarantee the quality and consistency of the database?
We had to identify the descriptor features that capture enough information about nanoporous materials to make reliable predictions with machine learning models. However, not many of these features are consistently available across all measurements, and some are highly correlated with each other, making them redundant. The current set of descriptor features has been developed together with the Mast3rboost partners, alongside lists of the critical measurements needed to compile the database. This is an ongoing process that requires continuous support and benefits greatly from the use of LabCore.
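As a minimal sketch of how such redundancy can be spotted, assuming the candidate descriptors sit in a table of materials versus features (the column names and values below are made up for illustration, not project data), highly correlated pairs can be flagged like this:

```python
# Illustrative redundancy check: flag pairs of candidate descriptor features whose
# values are so strongly correlated that one of them adds little new information.
# The DataFrame "candidates" is a hypothetical materials-by-features table.
import pandas as pd

candidates = pd.DataFrame({
    "surface_area_m2_g":  [950, 1200, 1800, 2100],
    "pore_volume_cm3_g":  [0.45, 0.60, 0.85, 1.00],
    "micropore_fraction": [0.70, 0.60, 0.50, 0.40],
})

corr = candidates.corr().abs()          # pairwise |Pearson correlation|
threshold = 0.95
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(redundant_pairs)                  # for each pair, one feature is a candidate for dropping
```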
- Concerning the design of computational descriptors, which approach do you take to decide which physicochemical properties are to be converted into numerical vectors and how do you determine the performance of such ensemble descriptors within your machine-learning models?
We started by including every characterisation result available for the largest set of materials: the composition of the precursor mixture, its processing temperature and other processing details, and the chemical analysis of the resulting sample. Most of these are already numerical. Several others are not, but take only a limited set of possible values (e.g. true/false), which can be mapped to numbers (e.g. 0/1). With principal component analysis (PCA) we were able to determine which combinations of features carry the most information about the materials, avoiding redundancy and reducing the complexity of the ML training process.
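A minimal sketch of that workflow, assuming the descriptors are collected in a pandas DataFrame (the column names and values below are invented for illustration), could look like the following; scikit-learn's PCA then reports how much of the variance each component carries:

```python
# Sketch of the approach described above: map boolean-like features to 0/1,
# standardise everything, and let PCA show which combinations of features
# carry the most information. All names and values are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

materials = pd.DataFrame({
    "precursor_carbon_wt_pct": [60.0, 75.0, 82.0, 68.0],
    "processing_temp_C":       [700, 850, 900, 800],
    "activated":               [True, False, True, True],   # limited-value feature
    "oxygen_content_wt_pct":   [12.0, 8.5, 5.0, 9.0],
})

X = materials.assign(activated=materials["activated"].astype(int))  # true/false -> 1/0
X_scaled = StandardScaler().fit_transform(X)                        # common scale for all features

pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)   # fraction of the total variance carried by each component
```

Components with a negligible explained-variance ratio point to redundant feature combinations that can be left out of the ML training.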
- Please explain how unsupervised and supervised machine learning are being combined to make interactive materials maps and to investigate the relationships between fabrication processes, physical properties, environmental conditions, and hydrogen storage performance.
We used t-distributed stochastic neighbour embedding (t-SNE) to visualise similarities between materials. Each material is described by many numerical features and can be thought of as a point in a high-dimensional space. Our minds can grasp 3D visualisations at best, and are much more comfortable with 2D ones, so a dataset with this many dimensions cannot be comprehended directly. t-SNE projects the distribution of materials onto a 2D plane while preserving their similarity relationships: materials that are close in the high-dimensional space end up near each other on the 2D plane. These material maps are easy to understand visually. By colouring the points according to the performance measurement of the corresponding materials, we may find regions of high- or low-performing samples, and this can guide experimentalists in their exploratory work. Performance measurements are long and expensive experiments, and such sample screening would reduce the number of measurements we have to perform. Avoiding large sets of lengthy experiments means that we can more efficiently construct a dataset for training supervised ML models capable of directly estimating the hydrogen storage performance of a material before it is even synthesised.
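The following is an illustrative sketch of such a map, using scikit-learn's t-SNE on random stand-in descriptor vectors rather than project data; each projected point is coloured by a hypothetical performance value:

```python
# Sketch of the materials-map idea: project high-dimensional descriptor vectors
# onto 2D with t-SNE and colour each point by a measured performance value.
# The arrays below are random stand-ins, not Mast3rboost data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 15))   # 200 materials x 15 descriptor features
performance = rng.random(200)              # e.g. hydrogen uptake, arbitrary units

# Similar rows of `descriptors` land near each other in the 2D embedding.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(descriptors)

plt.scatter(embedding[:, 0], embedding[:, 1], c=performance, cmap="viridis")
plt.colorbar(label="measured performance")
plt.title("t-SNE materials map (illustrative)")
plt.show()
```

Clusters of well-performing points on such a map suggest which regions of the materials space are worth measuring next, which is the screening idea described above.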