Creating and loading materials science datasets¶
Before performing any data analysis, pre-processing steps (e.g. descriptor calculation) are often needed to transform materials science data in a suitable form for the algorithm of choice, being it for example a neural network. This pre-processing is usually a computationally demanding step, especially if hundred of thousands of structures needs to be calculated, possible for different parameters setting.
Since hyperparameter tuning of the regression/classification algorithm typically requires to run the model several times (for a given pre-processed dataset), it is thus highly beneficial to be able to save and re-load the pre-processed results in a consistent and traceable manner.
Here we provide functions to pre-process (in parallel), save and subsequently load materials science datasets; this not only eases the traceability and reproduciblity of data analysis on materials science data, but speeds up the prototyping of new models.
Example: diffraction fingerprint calculation for multiple structures¶
The code below illustrates how to compute a descriptor for multiple crystal structures using multiple processors, save the results to file, and reload the file for later use (e.g. for classification).
As illustrative example we calculate the two-dimensional diffraction fingerprint [1] of pristine (e.g. perfect) and highly defective (50% of missing atoms) crystal structures. In particular, the four crystal structures considered are: body-centered cubic (bcc), face-centered cubic(fcc), diamond (diam), and hexagonal closed packed (hcp) structures; more than 80% of elemental solids adopt one of these four crystal structures under standard conditions.
The steps performed in the code below are the following:
- define the folders where the results are going to be saved
- build the four crystal structures (bcc, fcc, diam, hcp) using the ASE package
- create a pristine supercell using the function
ai4materials.utils.utils_crystals.create_supercell
- create a defective supercell (50% of atoms missing) using the function
ai4materials.utils.utils_crystals.create_vacancies
- calculate the two-dimensional diffraction fingerprint for all (eight) crystal structures using
ai4materials.wrappers.calc_descriptor
- save the results to file
- reload the results from file
- generate a texture atlas with the two-dimensional diffraction fingerprints of all structures and write it to file.
Implementation details of the two-dimensional diffraction fingerprint can be found at
ai4materials.descriptors.diffraction2d
.
from ase.spacegroup import crystal
from ai4materials.descriptors.diffraction2d import Diffraction2D
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import create_supercell
from ai4materials.utils.utils_crystals import create_vacancies
from ai4materials.utils.utils_data_retrieval import generate_facets_input
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
import os.path
# set configs
configs = set_configs(main_folder='./multiple_2d_diff_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)
# setup folder and files
desc_file_name = 'fcc_bcc_diam_hcp_example'
# build crystal structures
fcc_al = crystal('Al', [(0, 0, 0)], spacegroup=225, cellpar=[4.05, 4.05, 4.05, 90, 90, 90])
bcc_fe = crystal('Fe', [(0, 0, 0)], spacegroup=229, cellpar=[2.87, 2.87, 2.87, 90, 90, 90])
diamond_c = crystal('C', [(0, 0, 0)], spacegroup=227, cellpar=[3.57, 3.57, 3.57, 90, 90, 90])
hcp_mg = crystal('Mg', [(1. / 3., 2. / 3., 3. / 4.)], spacegroup=194, cellpar=[3.21, 3.21, 5.21, 90, 90, 120])
# create supercells - pristine
fcc_al_supercell = create_supercell(fcc_al, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell = create_supercell(bcc_fe, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell = create_supercell(diamond_c, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell = create_supercell(hcp_mg, target_nb_atoms=128, cell_type='standard_no_symmetries')
# create supercells - vacancies
fcc_al_supercell_vac = create_vacancies(fcc_al, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell_vac = create_vacancies(bcc_fe, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell_vac = create_vacancies(diamond_c, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell_vac = create_vacancies(hcp_mg, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
ase_atoms_list = [fcc_al_supercell, fcc_al_supercell_vac,
bcc_fe_supercell, bcc_fe_supercell_vac,
diamond_c_supercell, diamond_c_supercell_vac,
hcp_mg_supercell, hcp_mg_supercell_vac]
# calculate the descriptor for the list of structures and save it to file
descriptor = Diffraction2D(configs=configs)
desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
nb_jobs=-1)
# load the previously saved file containing the crystal structures and their corresponding descriptor
target_list, structure_list = load_descriptor(desc_files=desc_file_path, configs=configs)
# create a texture atlas with all the two-dimensional diffraction fingerprints
df, texture_atlas = generate_facets_input(structure_list=structure_list, desc_metadata='diffraction_2d_intensity',
target_list=target_list,
sprite_atlas_filename=desc_file_name,
configs=configs, normalize=True)
This are the calculated two-dimensional diffraction fingerprints for all crystal structures in the list :
Example: atomic feature retrieval for multiple structures¶
It was recently shown in Ref. [2] that the crystal structure of binary compounds can be predicted using compressed-sensing technique using atomic features only.
The code below illustrates how to retrieve atomic features, performing the following steps:
- build a list of crystal structure using the ASE package
- retrieve atomic features using the descriptor
ai4materials.descriptors.atomic_features.AtomicFeatures
for all crystal structures - save the results to file
- reload the results from file
- construct a table
df_atomic_features
containing the atomic features using the functionai4materials.descriptors.atomic_features.get_table_atomic_features
- write the atomic feature table as csv file
- build a heatmap of the atomic feature table
import sys
import os.path
atomic_data_dir = os.path.abspath(os.path.normpath("/home/ziletti/nomad/nomad-lab-base/analysis-tools/atomic-data"))
sys.path.insert(0, atomic_data_dir)
from ase.spacegroup import crystal
import matplotlib.pyplot as plt
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import get_spacegroup_old
from ai4materials.utils.utils_binaries import get_binaries_dict_delta_e
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
from ai4materials.descriptors.atomic_features import AtomicFeatures
from ai4materials.descriptors.atomic_features import get_table_atomic_features
import seaborn as sns
# set configs
configs = set_configs(main_folder='./dataset_atomic_features_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)
desc_file_name = 'atomic_features_try1'
# build atomic structures
group_1a = ['Li', 'Na', 'K', 'Rb']
group_1b = ['F', 'Cl', 'Br', 'I']
group_2a = ['Be', 'Mg', 'Ca', 'Sr']
group_2b = ['O', 'S', 'Se', 'Te']
ase_atoms_list = []
for el_1a in group_1a:
for el_1b in group_1b:
ase_atoms_list.append(crystal([el_1a, el_1b], [(0, 0, 0), (0.5, 0.5, 0.5)], spacegroup=225, cellpar=[5.64, 5.64, 5.64, 90, 90, 90]))
for el_2a in group_2a:
for el_2b in group_2b:
ase_atoms_list.append(crystal([el_2a, el_2b], [(0, 0, 0), (0.5, 0.5, 0.5)], spacegroup=225, cellpar=[5.64, 5.64, 5.64, 90, 90, 90]))
selected_feature_list = ['atomic_ionization_potential', 'atomic_electron_affinity',
'atomic_rs_max', 'atomic_rp_max', 'atomic_rd_max']
# define and calculate descriptor
kwargs = {'feature_order_by': 'atomic_mulliken_electronegativity', 'energy_unit': 'eV', 'length_unit': 'angstrom'}
descriptor = AtomicFeatures(configs=configs, **kwargs)
desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
selected_feature_list=selected_feature_list,
nb_jobs=-1)
target_list, ase_atoms_list = load_descriptor(desc_files=desc_file_path, configs=configs)
df_atomic_features = get_table_atomic_features(ase_atoms_list)
# write table to file
df_atomic_features.to_csv('atomic_features_table.csv', float_format='%.4f')
# plot the table with seaborn
df_atomic_features = df_atomic_features.set_index('ordered_chemical_symbols')
mask = df_atomic_features.isnull()
fig = plt.figure()
sns.set(font_scale=0.5)
sns_plot = sns.heatmap(df_atomic_features, annot=True, mask=mask)
fig = sns_plot.get_figure()
fig.tight_layout()
fig.savefig('atomic_features_plot.png', dpi=200)
This is the table containing the atomic features obtained using the code above:
ordered_chemical_symbols | atomic_ionization_potential(A) | atomic_electron_affinity(A) | atomic_rs_max(A) | atomic_rp_max(A) | atomic_rd_max(A) | atomic_ionization_potential(B) | atomic_electron_affinity(B) | atomic_rs_max(B) | atomic_rp_max(B) | atomic_rd_max(B) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | LiF | -5.3291 | -0.6981 | 1.6500 | 2.0000 | 6.9300 | -19.4043 | -4.2735 | 0.4100 | 0.3700 | 1.4300 |
1 | KBr | -4.4332 | -0.6213 | 2.1300 | 2.4400 | 1.7900 | -12.6496 | -3.7393 | 0.7500 | 0.8800 | 1.8700 |
2 | KI | -4.4332 | -0.6213 | 2.1300 | 2.4400 | 1.7900 | -11.2571 | -3.5135 | 0.9000 | 1.0700 | 1.7200 |
3 | RbF | -4.2889 | -0.5904 | 2.2400 | 3.2000 | 1.9600 | -19.4043 | -4.2735 | 0.4100 | 0.3700 | 1.4300 |
4 | RbCl | -4.2889 | -0.5904 | 2.2400 | 3.2000 | 1.9600 | -13.9018 | -3.9708 | 0.6800 | 0.7600 | 1.6700 |
5 | RbBr | -4.2889 | -0.5904 | 2.2400 | 3.2000 | 1.9600 | -12.6496 | -3.7393 | 0.7500 | 0.8800 | 1.8700 |
6 | RbI | -4.2889 | -0.5904 | 2.2400 | 3.2000 | 1.9600 | -11.2571 | -3.5135 | 0.9000 | 1.0700 | 1.7200 |
7 | BeO | -9.4594 | 0.6305 | 1.0800 | 1.2100 | 2.8800 | -16.4332 | -3.0059 | 0.4600 | 0.4300 | 2.2200 |
8 | BeS | -9.4594 | 0.6305 | 1.0800 | 1.2100 | 2.8800 | -11.7951 | -2.8449 | 0.7400 | 0.8500 | 2.3700 |
9 | BeSe | -9.4594 | 0.6305 | 1.0800 | 1.2100 | 2.8800 | -10.9460 | -2.7510 | 0.8000 | 0.9500 | 2.1800 |
10 | BeTe | -9.4594 | 0.6305 | 1.0800 | 1.2100 | 2.8800 | -9.8667 | -2.6660 | 0.9400 | 1.1400 | 1.8300 |
11 | LiCl | -5.3291 | -0.6981 | 1.6500 | 2.0000 | 6.9300 | -13.9018 | -3.9708 | 0.6800 | 0.7600 | 1.6700 |
12 | MgO | -8.0371 | 0.6925 | 1.3300 | 1.9000 | 3.1700 | -16.4332 | -3.0059 | 0.4600 | 0.4300 | 2.2200 |
13 | MgS | -8.0371 | 0.6925 | 1.3300 | 1.9000 | 3.1700 | -11.7951 | -2.8449 | 0.7400 | 0.8500 | 2.3700 |
14 | MgSe | -8.0371 | 0.6925 | 1.3300 | 1.9000 | 3.1700 | -10.9460 | -2.7510 | 0.8000 | 0.9500 | 2.1800 |
15 | MgTe | -8.0371 | 0.6925 | 1.3300 | 1.9000 | 3.1700 | -9.8667 | -2.6660 | 0.9400 | 1.1400 | 1.8300 |
16 | CaO | -6.4280 | 0.3039 | 1.7600 | 2.3200 | 0.6800 | -16.4332 | -3.0059 | 0.4600 | 0.4300 | 2.2200 |
17 | CaS | -6.4280 | 0.3039 | 1.7600 | 2.3200 | 0.6800 | -11.7951 | -2.8449 | 0.7400 | 0.8500 | 2.3700 |
18 | CaSe | -6.4280 | 0.3039 | 1.7600 | 2.3200 | 0.6800 | -10.9460 | -2.7510 | 0.8000 | 0.9500 | 2.1800 |
19 | CaTe | -6.4280 | 0.3039 | 1.7600 | 2.3200 | 0.6800 | -9.8667 | -2.6660 | 0.9400 | 1.1400 | 1.8300 |
20 | SrO | -6.0316 | 0.3431 | 1.9100 | 2.5500 | 1.2000 | -16.4332 | -3.0059 | 0.4600 | 0.4300 | 2.2200 |
21 | SrS | -6.0316 | 0.3431 | 1.9100 | 2.5500 | 1.2000 | -11.7951 | -2.8449 | 0.7400 | 0.8500 | 2.3700 |
22 | LiBr | -5.3291 | -0.6981 | 1.6500 | 2.0000 | 6.9300 | -12.6496 | -3.7393 | 0.7500 | 0.8800 | 1.8700 |
23 | SrSe | -6.0316 | 0.3431 | 1.9100 | 2.5500 | 1.2000 | -10.9460 | -2.7510 | 0.8000 | 0.9500 | 2.1800 |
24 | SrTe | -6.0316 | 0.3431 | 1.9100 | 2.5500 | 1.2000 | -9.8667 | -2.6660 | 0.9400 | 1.1400 | 1.8300 |
25 | LiI | -5.3291 | -0.6981 | 1.6500 | 2.0000 | 6.9300 | -11.2571 | -3.5135 | 0.9000 | 1.0700 | 1.7200 |
26 | NaF | -5.2231 | -0.7157 | 1.7100 | 2.6000 | 6.5700 | -19.4043 | -4.2735 | 0.4100 | 0.3700 | 1.4300 |
27 | NaCl | -5.2231 | -0.7157 | 1.7100 | 2.6000 | 6.5700 | -13.9018 | -3.9708 | 0.6800 | 0.7600 | 1.6700 |
28 | NaBr | -5.2231 | -0.7157 | 1.7100 | 2.6000 | 6.5700 | -12.6496 | -3.7393 | 0.7500 | 0.8800 | 1.8700 |
29 | NaI | -5.2231 | -0.7157 | 1.7100 | 2.6000 | 6.5700 | -11.2571 | -3.5135 | 0.9000 | 1.0700 | 1.7200 |
30 | KF | -4.4332 | -0.6213 | 2.1300 | 2.4400 | 1.7900 | -19.4043 | -4.2735 | 0.4100 | 0.3700 | 1.4300 |
31 | KCl | -4.4332 | -0.6213 | 2.1300 | 2.4400 | 1.7900 | -13.9018 | -3.9708 | 0.6800 | 0.7600 | 1.6700 |
and this is its corresponding heatmap:
Example: dataset creation for data analytics¶
The code below illustrates how to compute a descriptor (the two-dimensional diffraction fingerprint [1]) for multiple crystal structures, save the results to file, and reload the file for later use (e.g. for classification).
The steps performed in the code below are the following:
- define the folders where the results are going to be saved
- build the four crystal structures (bcc, fcc, diam, hcp) using the ASE package
- create a pristine supercell using the function
ai4materials.utils.utils_crystals.create_supercell
- create a defective supercell (50% of atoms missing) using the function
ai4materials.utils.utils_crystals.create_vacancies
- calculate the two-dimensional diffraction fingerprint for all (eight) crystal structures
- save the results to file
- reload the results from file
- define a user-specified target variable (i.e. the variable that one want to predict with the classification/regression model); in this case this variable is the crystal structure type (‘fcc’, ‘bcc’, ‘diam’, ‘hcp’)
- create a dataset containing the specified
desc_metadata
(this needs to be compatible with the descriptor choice) - save the dataset to file in the folder
dataset_folder
, including data (numpy array), target variable (numpy array), and metadata regarding the dataset (JSON format) - re-load from file the saved dataset to be used for example in a classification task
from ase.spacegroup import crystal
from ai4materials.dataprocessing.preprocessing import load_dataset_from_file
from ai4materials.dataprocessing.preprocessing import prepare_dataset
from ai4materials.descriptors.diffraction2d import Diffraction2D
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import create_supercell
from ai4materials.utils.utils_crystals import create_vacancies
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
import os.path
# set configs
configs = set_configs(main_folder='./dataset_2d_diff_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)
# setup folder and files
dataset_folder = os.path.join(configs['io']['main_folder'], 'my_datasets')
desc_file_name = 'fcc_bcc_diam_hcp_example'
# build crystal structures
fcc_al = crystal('Al', [(0, 0, 0)], spacegroup=225, cellpar=[4.05, 4.05, 4.05, 90, 90, 90])
bcc_fe = crystal('Fe', [(0, 0, 0)], spacegroup=229, cellpar=[2.87, 2.87, 2.87, 90, 90, 90])
diamond_c = crystal('C', [(0, 0, 0)], spacegroup=227, cellpar=[3.57, 3.57, 3.57, 90, 90, 90])
hcp_mg = crystal('Mg', [(1. / 3., 2. / 3., 3. / 4.)], spacegroup=194, cellpar=[3.21, 3.21, 5.21, 90, 90, 120])
# create supercells - pristine
fcc_al_supercell = create_supercell(fcc_al, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell = create_supercell(bcc_fe, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell = create_supercell(diamond_c, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell = create_supercell(hcp_mg, target_nb_atoms=128, cell_type='standard_no_symmetries')
# create supercells - vacancies
fcc_al_supercell_vac = create_vacancies(fcc_al, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell_vac = create_vacancies(bcc_fe, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell_vac = create_vacancies(diamond_c, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell_vac = create_vacancies(hcp_mg, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
ase_atoms_list = [fcc_al_supercell, fcc_al_supercell_vac,
bcc_fe_supercell, bcc_fe_supercell_vac,
diamond_c_supercell, diamond_c_supercell_vac,
hcp_mg_supercell, hcp_mg_supercell_vac]
# calculate the descriptor for the list of structures and save it to file
descriptor = Diffraction2D(configs=configs)
desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
nb_jobs=-1)
# load the previously saved file containing the crystal structures and their corresponding descriptor
target_list, structure_list = load_descriptor(desc_files=desc_file_path, configs=configs)
# add as target the spacegroup (using spacegroup of the "parental" structure for the defective structure)
targets = ['fcc', 'fcc', 'bcc', 'bcc', 'diam', 'diam', 'hcp', 'hcp']
for idx, item in enumerate(target_list):
item['data'][0]['target'] = targets[idx]
path_to_x, path_to_y, path_to_summary = prepare_dataset(
structure_list=structure_list,
target_list=target_list,
desc_metadata='diffraction_2d_intensity',
dataset_name='bcc-fcc-diam-hcp',
target_name='target',
target_categorical=True,
input_dims=(64, 64),
configs=configs,
dataset_folder=dataset_folder,
main_folder=configs['io']['main_folder'],
desc_folder=configs['io']['desc_folder'],
tmp_folder=configs['io']['tmp_folder'],
notes="Dataset with bcc, fcc, diam and hcp structures, pristine and with 50% of defects.")
x, y, dataset_info = load_dataset_from_file(path_to_x=path_to_x, path_to_y=path_to_y,
path_to_summary=path_to_summary)
In the code above, the numpy array x contains the specified desc_metadata
, the numpy array y contains
the specified targets, and dataset_info is a dictionary containing information regarding the dataset was just loaded:
{
"data":[{
"target_name": "target",
"n_bins": 100,
"path_to_summary": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_summary.json",
"creation_date": "2018-06-20T18:42:07.110239",
"numerical_labels": [
2,
2,
0,
0,
1,
1,
3,
3
],
"classes": [
"bcc",
"diam",
"fcc",
"hcp"
],
"nb_classes": 4,
"path_to_y": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_y.pkl",
"path_to_x": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_x.pkl",
"text_labels": [
"fcc",
"fcc",
"bcc",
"bcc",
"diam",
"diam",
"hcp",
"hcp"
],
"target_categorical": true,
"disc_type": null,
"notes": "Dataset with bcc, fcc, diam and hcp structures, pristine and with 50% of defects.",
"dataset_name": "bcc-fcc-diam-hcp"
}
] }
[1] | (1, 2) A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, “Insightful classification of crystal structures using deep learning,” Nature Communications, vol. 9, pp. 2775, 2018. [Link to article] |
[2] | L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, “Big Data of Materials Science: Critical Role of the Descriptor,” Physical Review Letters, vol. 114, no. 10, p. 105503 . [Link to article] |
Section author: Angelo Ziletti <angelo.ziletti@gmail.com>