Creating and loading materials science datasets

Before any data analysis, pre-processing steps (e.g. descriptor calculation) are often needed to transform materials science data into a form suitable for the algorithm of choice, for example a neural network. This pre-processing is usually computationally demanding, especially if descriptors for hundreds of thousands of structures need to be calculated, possibly for several parameter settings.

Since hyperparameter tuning of the regression/classification algorithm typically requires running the model several times on a given pre-processed dataset, it is highly beneficial to be able to save and re-load the pre-processed results in a consistent and traceable manner.
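The save-and-reload idea can be sketched with the Python standard library alone. The snippet below is a minimal illustration, not the ai4materials implementation: the helpers cache_path and load_or_compute and the function toy_descriptor are hypothetical names introduced here. The file name encodes a hash of the pre-processing parameters, so a result can be traced back to the settings that produced it.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def cache_path(cache_dir, params):
    """Build a file name that encodes the pre-processing parameters."""
    # Serialize with sorted keys so that equal parameter sets hash identically.
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    return Path(cache_dir) / f"descriptor_{digest}.pkl"


def load_or_compute(cache_dir, params, compute):
    """Return cached results if present; otherwise compute and store them."""
    path = cache_path(cache_dir, params)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute(params)
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def toy_descriptor(params):
    """Hypothetical stand-in for an expensive descriptor calculation."""
    calls.append(params)
    return [i * params["scale"] for i in range(params["n_points"])]


cache_dir = tempfile.mkdtemp()
params = {"n_points": 4, "scale": 2}
first = load_or_compute(cache_dir, params, toy_descriptor)   # computed
second = load_or_compute(cache_dir, params, toy_descriptor)  # read from disk
```

With this pattern, repeated model runs during hyperparameter tuning reuse the stored result instead of recomputing the descriptor.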

Here we provide functions to pre-process (in parallel), save, and subsequently load materials science datasets; this not only eases the traceability and reproducibility of data analysis on materials science data, but also speeds up the prototyping of new models.
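As a rough illustration of what parallel pre-processing means, the sketch below maps a descriptor function over a list of structures with a worker pool from the standard library. It does not show how ai4materials parallelizes internally: toy_descriptor and calc_descriptors_parallel are hypothetical, and structures are represented as plain lists of coordinates. Threads are used here only to keep the example portable; for CPU-bound descriptors a process pool would usually be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_descriptor(structure):
    """Hypothetical descriptor: number of atoms and mean x-coordinate."""
    xs = [position[0] for position in structure]
    return {"n_atoms": len(structure), "mean_x": sum(xs) / len(xs)}


def calc_descriptors_parallel(structures, nb_workers=4):
    """Apply toy_descriptor to each structure; results keep the input order."""
    with ThreadPoolExecutor(max_workers=nb_workers) as pool:
        return list(pool.map(toy_descriptor, structures))


structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
results = calc_descriptors_parallel(structures)
```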

Example: diffraction fingerprint calculation for multiple structures

The code below illustrates how to compute a descriptor for multiple crystal structures using multiple processors, save the results to file, and reload the file for later use (e.g. for classification).

As an illustrative example, we calculate the two-dimensional diffraction fingerprint [1] of pristine (i.e. perfect) and highly defective (50% of atoms missing) crystal structures. The four crystal structures considered are body-centered cubic (bcc), face-centered cubic (fcc), diamond (diam), and hexagonal close-packed (hcp); more than 80% of elemental solids adopt one of these four crystal structures under standard conditions.
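To make "50% of atoms missing" concrete, the following sketch removes a random half of the sites from a list of atomic positions. It is only an illustration under simple assumptions: remove_random_atoms is a hypothetical helper, not the ai4materials create_vacancies function, and the 128-site grid stands in for a real supercell.

```python
import random


def remove_random_atoms(positions, vacancy_ratio, seed=0):
    """Return a copy of positions with a fraction of the atoms removed at random."""
    if not 0.0 <= vacancy_ratio <= 1.0:
        raise ValueError("vacancy_ratio must be between 0 and 1")
    nb_to_remove = round(vacancy_ratio * len(positions))
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    removed = set(rng.sample(range(len(positions)), nb_to_remove))
    return [pos for idx, pos in enumerate(positions) if idx not in removed]


# 4 x 4 x 8 grid of 128 sites as a stand-in for a pristine supercell
pristine = [(i, j, k) for i in range(4) for j in range(4) for k in range(8)]
defective = remove_random_atoms(pristine, vacancy_ratio=0.50)
```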

The steps performed in the code below are the following:

  • define the folders where the results are going to be saved
  • build the four crystal structures (bcc, fcc, diam, hcp) using the ASE package
  • create a pristine supercell using the function ai4materials.utils.utils_crystals.create_supercell
  • create a defective supercell (50% of atoms missing) using the function ai4materials.utils.utils_crystals.create_vacancies
  • calculate the two-dimensional diffraction fingerprint for all (eight) crystal structures using ai4materials.wrappers.calc_descriptor
  • save the results to file
  • reload the results from file
  • generate a texture atlas with the two-dimensional diffraction fingerprints of all structures and write it to file.

Implementation details of the two-dimensional diffraction fingerprint can be found at ai4materials.descriptors.diffraction2d.

from ase.spacegroup import crystal
from ai4materials.descriptors.diffraction2d import Diffraction2D
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import create_supercell
from ai4materials.utils.utils_crystals import create_vacancies
from ai4materials.utils.utils_data_retrieval import generate_facets_input
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
import os.path

# set configs
configs = set_configs(main_folder='./multiple_2d_diff_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)

# setup folder and files
desc_file_name = 'fcc_bcc_diam_hcp_example'

# build crystal structures
fcc_al = crystal('Al', [(0, 0, 0)], spacegroup=225, cellpar=[4.05, 4.05, 4.05, 90, 90, 90])
bcc_fe = crystal('Fe', [(0, 0, 0)], spacegroup=229, cellpar=[2.87, 2.87, 2.87, 90, 90, 90])
diamond_c = crystal('C', [(0, 0, 0)], spacegroup=227, cellpar=[3.57, 3.57, 3.57, 90, 90, 90])
hcp_mg = crystal('Mg', [(1. / 3., 2. / 3., 3. / 4.)], spacegroup=194, cellpar=[3.21, 3.21, 5.21, 90, 90, 120])
# create supercells - pristine
fcc_al_supercell = create_supercell(fcc_al, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell = create_supercell(bcc_fe, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell = create_supercell(diamond_c, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell = create_supercell(hcp_mg, target_nb_atoms=128, cell_type='standard_no_symmetries')
# create supercells - vacancies
fcc_al_supercell_vac = create_vacancies(fcc_al, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell_vac = create_vacancies(bcc_fe, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell_vac = create_vacancies(diamond_c, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell_vac = create_vacancies(hcp_mg, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')

ase_atoms_list = [fcc_al_supercell, fcc_al_supercell_vac,
                  bcc_fe_supercell, bcc_fe_supercell_vac,
                  diamond_c_supercell, diamond_c_supercell_vac,
                  hcp_mg_supercell, hcp_mg_supercell_vac]

# calculate the descriptor for the list of structures and save it to file
descriptor = Diffraction2D(configs=configs)
desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
                                 desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
                                 nb_jobs=-1)

# load the previously saved file containing the crystal structures and their corresponding descriptor
target_list, structure_list = load_descriptor(desc_files=desc_file_path, configs=configs)

# create a texture atlas with all the two-dimensional diffraction fingerprints
df, texture_atlas = generate_facets_input(structure_list=structure_list, desc_metadata='diffraction_2d_intensity',
                                          target_list=target_list,
                                          sprite_atlas_filename=desc_file_name,
                                          configs=configs, normalize=True)

These are the calculated two-dimensional diffraction fingerprints for all crystal structures in the list:

_images/fcc_bcc_diam_hcp_example.png
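A texture atlas is a single image in which many small images are tiled on a regular grid. The sketch below shows the tiling logic on nested lists; build_atlas is a hypothetical helper written for this illustration and makes no claim about how generate_facets_input builds its atlas.

```python
def build_atlas(images, nb_cols, fill=0):
    """Tile equally sized 2D images (lists of rows) into one 2D grid, row-major."""
    height, width = len(images[0]), len(images[0][0])
    nb_rows = -(-len(images) // nb_cols)  # ceiling division
    atlas = [[fill] * (width * nb_cols) for _ in range(height * nb_rows)]
    for idx, image in enumerate(images):
        top = (idx // nb_cols) * height
        left = (idx % nb_cols) * width
        for r, row in enumerate(image):
            # Copy one image row into its slot in the atlas.
            atlas[top + r][left:left + width] = row
    return atlas


# three 2 x 2 "images" filled with the values 1, 2 and 3
images = [[[value] * 2 for _ in range(2)] for value in (1, 2, 3)]
atlas = build_atlas(images, nb_cols=2)
```

Unused grid cells (here the fourth slot) keep the fill value.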

Example: atomic feature retrieval for multiple structures

It was recently shown in Ref. [2] that the crystal structure of binary compounds can be predicted with a compressed-sensing technique using atomic features only.

The code below illustrates how to retrieve atomic features, performing the following steps:

  • build a list of crystal structures using the ASE package
  • retrieve atomic features using the descriptor ai4materials.descriptors.atomic_features.AtomicFeatures for all crystal structures
  • save the results to file
  • reload the results from file
  • construct a table df_atomic_features containing the atomic features using the function ai4materials.descriptors.atomic_features.get_table_atomic_features
  • write the atomic feature table to a CSV file
  • build a heatmap of the atomic feature table

import sys
import os.path

# path to the local atomic-data directory (machine-specific; adjust to your installation)
atomic_data_dir = os.path.abspath(os.path.normpath("/home/ziletti/nomad/nomad-lab-base/analysis-tools/atomic-data"))
sys.path.insert(0, atomic_data_dir)

from ase.spacegroup import crystal
import matplotlib.pyplot as plt
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import get_spacegroup_old
from ai4materials.utils.utils_binaries import get_binaries_dict_delta_e
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
from ai4materials.descriptors.atomic_features import AtomicFeatures
from ai4materials.descriptors.atomic_features import get_table_atomic_features
import seaborn as sns

# set configs
configs = set_configs(main_folder='./dataset_atomic_features_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)

desc_file_name = 'atomic_features_try1'

# build atomic structures
group_1a = ['Li', 'Na', 'K', 'Rb']
group_1b = ['F', 'Cl', 'Br', 'I']
group_2a = ['Be', 'Mg', 'Ca', 'Sr']
group_2b = ['O', 'S', 'Se', 'Te']

ase_atoms_list = []
for el_1a in group_1a:
    for el_1b in group_1b:
        ase_atoms_list.append(crystal([el_1a, el_1b], [(0, 0, 0), (0.5, 0.5, 0.5)], spacegroup=225, cellpar=[5.64, 5.64, 5.64, 90, 90, 90]))

for el_2a in group_2a:
    for el_2b in group_2b:
        ase_atoms_list.append(crystal([el_2a, el_2b], [(0, 0, 0), (0.5, 0.5, 0.5)], spacegroup=225, cellpar=[5.64, 5.64, 5.64, 90, 90, 90]))

selected_feature_list = ['atomic_ionization_potential', 'atomic_electron_affinity',
                         'atomic_rs_max', 'atomic_rp_max', 'atomic_rd_max']


# define and calculate descriptor
kwargs = {'feature_order_by': 'atomic_mulliken_electronegativity', 'energy_unit': 'eV', 'length_unit': 'angstrom'}

descriptor = AtomicFeatures(configs=configs, **kwargs)

desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
                                 desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
                                 selected_feature_list=selected_feature_list,
                                 nb_jobs=-1)

target_list, ase_atoms_list = load_descriptor(desc_files=desc_file_path, configs=configs)
df_atomic_features = get_table_atomic_features(ase_atoms_list)

# write table to file
df_atomic_features.to_csv('atomic_features_table.csv', float_format='%.4f')

# plot the table with seaborn
df_atomic_features = df_atomic_features.set_index('ordered_chemical_symbols')
mask = df_atomic_features.isnull()
fig = plt.figure()
sns.set(font_scale=0.5)
sns_plot = sns.heatmap(df_atomic_features, annot=True, mask=mask)
fig = sns_plot.get_figure()
fig.tight_layout()
fig.savefig('atomic_features_plot.png', dpi=200)

This is the table containing the atomic features obtained using the code above:

  ordered_chemical_symbols atomic_ionization_potential(A) atomic_electron_affinity(A) atomic_rs_max(A) atomic_rp_max(A) atomic_rd_max(A) atomic_ionization_potential(B) atomic_electron_affinity(B) atomic_rs_max(B) atomic_rp_max(B) atomic_rd_max(B)
0 LiF -5.3291 -0.6981 1.6500 2.0000 6.9300 -19.4043 -4.2735 0.4100 0.3700 1.4300
1 KBr -4.4332 -0.6213 2.1300 2.4400 1.7900 -12.6496 -3.7393 0.7500 0.8800 1.8700
2 KI -4.4332 -0.6213 2.1300 2.4400 1.7900 -11.2571 -3.5135 0.9000 1.0700 1.7200
3 RbF -4.2889 -0.5904 2.2400 3.2000 1.9600 -19.4043 -4.2735 0.4100 0.3700 1.4300
4 RbCl -4.2889 -0.5904 2.2400 3.2000 1.9600 -13.9018 -3.9708 0.6800 0.7600 1.6700
5 RbBr -4.2889 -0.5904 2.2400 3.2000 1.9600 -12.6496 -3.7393 0.7500 0.8800 1.8700
6 RbI -4.2889 -0.5904 2.2400 3.2000 1.9600 -11.2571 -3.5135 0.9000 1.0700 1.7200
7 BeO -9.4594 0.6305 1.0800 1.2100 2.8800 -16.4332 -3.0059 0.4600 0.4300 2.2200
8 BeS -9.4594 0.6305 1.0800 1.2100 2.8800 -11.7951 -2.8449 0.7400 0.8500 2.3700
9 BeSe -9.4594 0.6305 1.0800 1.2100 2.8800 -10.9460 -2.7510 0.8000 0.9500 2.1800
10 BeTe -9.4594 0.6305 1.0800 1.2100 2.8800 -9.8667 -2.6660 0.9400 1.1400 1.8300
11 LiCl -5.3291 -0.6981 1.6500 2.0000 6.9300 -13.9018 -3.9708 0.6800 0.7600 1.6700
12 MgO -8.0371 0.6925 1.3300 1.9000 3.1700 -16.4332 -3.0059 0.4600 0.4300 2.2200
13 MgS -8.0371 0.6925 1.3300 1.9000 3.1700 -11.7951 -2.8449 0.7400 0.8500 2.3700
14 MgSe -8.0371 0.6925 1.3300 1.9000 3.1700 -10.9460 -2.7510 0.8000 0.9500 2.1800
15 MgTe -8.0371 0.6925 1.3300 1.9000 3.1700 -9.8667 -2.6660 0.9400 1.1400 1.8300
16 CaO -6.4280 0.3039 1.7600 2.3200 0.6800 -16.4332 -3.0059 0.4600 0.4300 2.2200
17 CaS -6.4280 0.3039 1.7600 2.3200 0.6800 -11.7951 -2.8449 0.7400 0.8500 2.3700
18 CaSe -6.4280 0.3039 1.7600 2.3200 0.6800 -10.9460 -2.7510 0.8000 0.9500 2.1800
19 CaTe -6.4280 0.3039 1.7600 2.3200 0.6800 -9.8667 -2.6660 0.9400 1.1400 1.8300
20 SrO -6.0316 0.3431 1.9100 2.5500 1.2000 -16.4332 -3.0059 0.4600 0.4300 2.2200
21 SrS -6.0316 0.3431 1.9100 2.5500 1.2000 -11.7951 -2.8449 0.7400 0.8500 2.3700
22 LiBr -5.3291 -0.6981 1.6500 2.0000 6.9300 -12.6496 -3.7393 0.7500 0.8800 1.8700
23 SrSe -6.0316 0.3431 1.9100 2.5500 1.2000 -10.9460 -2.7510 0.8000 0.9500 2.1800
24 SrTe -6.0316 0.3431 1.9100 2.5500 1.2000 -9.8667 -2.6660 0.9400 1.1400 1.8300
25 LiI -5.3291 -0.6981 1.6500 2.0000 6.9300 -11.2571 -3.5135 0.9000 1.0700 1.7200
26 NaF -5.2231 -0.7157 1.7100 2.6000 6.5700 -19.4043 -4.2735 0.4100 0.3700 1.4300
27 NaCl -5.2231 -0.7157 1.7100 2.6000 6.5700 -13.9018 -3.9708 0.6800 0.7600 1.6700
28 NaBr -5.2231 -0.7157 1.7100 2.6000 6.5700 -12.6496 -3.7393 0.7500 0.8800 1.8700
29 NaI -5.2231 -0.7157 1.7100 2.6000 6.5700 -11.2571 -3.5135 0.9000 1.0700 1.7200
30 KF -4.4332 -0.6213 2.1300 2.4400 1.7900 -19.4043 -4.2735 0.4100 0.3700 1.4300
31 KCl -4.4332 -0.6213 2.1300 2.4400 1.7900 -13.9018 -3.9708 0.6800 0.7600 1.6700

and this is its corresponding heatmap:

_images/atomic_features_plot.png
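The layout of the table, with one row per binary compound and each atomic feature repeated for element A and element B, can be sketched as follows. This is an illustration only: binary_row is a hypothetical helper, not get_table_atomic_features, and the dictionary contains just a few values copied from the LiF and NaCl rows of the table above.

```python
import csv
import io

# Values copied from the table above (rows LiF and NaCl).
ATOMIC_FEATURES = {
    "Li": {"atomic_ionization_potential": -5.3291, "atomic_rs_max": 1.65},
    "F": {"atomic_ionization_potential": -19.4043, "atomic_rs_max": 0.41},
    "Na": {"atomic_ionization_potential": -5.2231, "atomic_rs_max": 1.71},
    "Cl": {"atomic_ionization_potential": -13.9018, "atomic_rs_max": 0.68},
}


def binary_row(element_a, element_b):
    """Flatten the features of elements A and B into one table row."""
    row = {"ordered_chemical_symbols": element_a + element_b}
    for label, element in (("A", element_a), ("B", element_b)):
        for name, value in ATOMIC_FEATURES[element].items():
            row[f"{name}({label})"] = value
    return row


rows = [binary_row("Li", "F"), binary_row("Na", "Cl")]

# Write the table as CSV text with four decimals for the numeric columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
for row in rows:
    writer.writerow({k: f"{v:.4f}" if isinstance(v, float) else v for k, v in row.items()})
csv_text = buffer.getvalue()
```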

Example: dataset creation for data analytics

The code below illustrates how to compute a descriptor (the two-dimensional diffraction fingerprint [1]) for multiple crystal structures, save the results to file, and reload the file for later use (e.g. for classification).

The steps performed in the code below are the following:

  • define the folders where the results are going to be saved
  • build the four crystal structures (bcc, fcc, diam, hcp) using the ASE package
  • create a pristine supercell using the function ai4materials.utils.utils_crystals.create_supercell
  • create a defective supercell (50% of atoms missing) using the function ai4materials.utils.utils_crystals.create_vacancies
  • calculate the two-dimensional diffraction fingerprint for all (eight) crystal structures
  • save the results to file
  • reload the results from file
  • define a user-specified target variable (i.e. the variable that one wants to predict with the classification/regression model); in this case this variable is the crystal structure type (‘fcc’, ‘bcc’, ‘diam’, ‘hcp’)
  • create a dataset containing the specified desc_metadata (this needs to be compatible with the descriptor choice)
  • save the dataset to file in the folder dataset_folder, including data (numpy array), target variable (numpy array), and metadata regarding the dataset (JSON format)
  • re-load from file the saved dataset to be used for example in a classification task

from ase.spacegroup import crystal
from ai4materials.dataprocessing.preprocessing import load_dataset_from_file
from ai4materials.dataprocessing.preprocessing import prepare_dataset
from ai4materials.descriptors.diffraction2d import Diffraction2D
from ai4materials.utils.utils_config import set_configs
from ai4materials.utils.utils_config import setup_logger
from ai4materials.utils.utils_crystals import create_supercell
from ai4materials.utils.utils_crystals import create_vacancies
from ai4materials.wrappers import calc_descriptor
from ai4materials.wrappers import load_descriptor
import os.path

# set configs
configs = set_configs(main_folder='./dataset_2d_diff_ai4materials/')
logger = setup_logger(configs, level='INFO', display_configs=False)

# setup folder and files
dataset_folder = os.path.join(configs['io']['main_folder'], 'my_datasets')
desc_file_name = 'fcc_bcc_diam_hcp_example'

# build crystal structures
fcc_al = crystal('Al', [(0, 0, 0)], spacegroup=225, cellpar=[4.05, 4.05, 4.05, 90, 90, 90])
bcc_fe = crystal('Fe', [(0, 0, 0)], spacegroup=229, cellpar=[2.87, 2.87, 2.87, 90, 90, 90])
diamond_c = crystal('C', [(0, 0, 0)], spacegroup=227, cellpar=[3.57, 3.57, 3.57, 90, 90, 90])
hcp_mg = crystal('Mg', [(1. / 3., 2. / 3., 3. / 4.)], spacegroup=194, cellpar=[3.21, 3.21, 5.21, 90, 90, 120])
# create supercells - pristine
fcc_al_supercell = create_supercell(fcc_al, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell = create_supercell(bcc_fe, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell = create_supercell(diamond_c, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell = create_supercell(hcp_mg, target_nb_atoms=128, cell_type='standard_no_symmetries')
# create supercells - vacancies
fcc_al_supercell_vac = create_vacancies(fcc_al, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
bcc_fe_supercell_vac = create_vacancies(bcc_fe, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
diamond_c_supercell_vac = create_vacancies(diamond_c, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')
hcp_mg_supercell_vac = create_vacancies(hcp_mg, target_vacancy_ratio=0.50, target_nb_atoms=128, cell_type='standard_no_symmetries')

ase_atoms_list = [fcc_al_supercell, fcc_al_supercell_vac,
                  bcc_fe_supercell, bcc_fe_supercell_vac,
                  diamond_c_supercell, diamond_c_supercell_vac,
                  hcp_mg_supercell, hcp_mg_supercell_vac]

# calculate the descriptor for the list of structures and save it to file
descriptor = Diffraction2D(configs=configs)
desc_file_path = calc_descriptor(descriptor=descriptor, configs=configs, ase_atoms_list=ase_atoms_list,
                                 desc_file=str(desc_file_name)+'.tar.gz', format_geometry='aims',
                                 nb_jobs=-1)

# load the previously saved file containing the crystal structures and their corresponding descriptor
target_list, structure_list = load_descriptor(desc_files=desc_file_path, configs=configs)

# add the crystal structure type as target (a defective structure inherits the label of its "parent" structure)
targets = ['fcc', 'fcc', 'bcc', 'bcc', 'diam', 'diam', 'hcp', 'hcp']
for idx, item in enumerate(target_list):
    item['data'][0]['target'] = targets[idx]

path_to_x, path_to_y, path_to_summary = prepare_dataset(
    structure_list=structure_list,
    target_list=target_list,
    desc_metadata='diffraction_2d_intensity',
    dataset_name='bcc-fcc-diam-hcp',
    target_name='target',
    target_categorical=True,
    input_dims=(64, 64),
    configs=configs,
    dataset_folder=dataset_folder,
    main_folder=configs['io']['main_folder'],
    desc_folder=configs['io']['desc_folder'],
    tmp_folder=configs['io']['tmp_folder'],
    notes="Dataset with bcc, fcc, diam and hcp structures, pristine and with 50% of defects.")

x, y, dataset_info = load_dataset_from_file(path_to_x=path_to_x, path_to_y=path_to_y,
                                            path_to_summary=path_to_summary)

In the code above, the numpy array x contains the specified desc_metadata, the numpy array y contains the specified targets, and dataset_info is a dictionary containing information about the dataset that was just loaded:

    {
      "data": [
        {
          "target_name": "target",
          "n_bins": 100,
          "path_to_summary": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_summary.json",
          "creation_date": "2018-06-20T18:42:07.110239",
          "numerical_labels": [2, 2, 0, 0, 1, 1, 3, 3],
          "classes": ["bcc", "diam", "fcc", "hcp"],
          "nb_classes": 4,
          "path_to_y": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_y.pkl",
          "path_to_x": "/home/ziletti/Documents/calc_xray/2d_nature_comm/my_datasets/bcc-fcc-diam-hcp_x.pkl",
          "text_labels": ["fcc", "fcc", "bcc", "bcc", "diam", "diam", "hcp", "hcp"],
          "target_categorical": true,
          "disc_type": null,
          "notes": "Dataset with bcc, fcc, diam and hcp structures, pristine and with 50% of defects.",
          "dataset_name": "bcc-fcc-diam-hcp"
        }
      ]
    }
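
The text_labels, classes, and numerical_labels fields in the summary are consistent with an alphabetical label encoding: the unique labels are sorted, and each label is replaced by its position in the sorted list. The sketch below reproduces the values shown in the summary under that assumption; it is not taken from the ai4materials source.

```python
text_labels = ["fcc", "fcc", "bcc", "bcc", "diam", "diam", "hcp", "hcp"]

# Sort the unique labels so that the class order is reproducible.
classes = sorted(set(text_labels))
class_to_index = {name: idx for idx, name in enumerate(classes)}

# Replace each text label by the index of its class.
numerical_labels = [class_to_index[name] for name in text_labels]
nb_classes = len(classes)
```

Storing classes alongside numerical_labels makes it possible to map model predictions back to the structure names.
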
[1] A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, “Insightful classification of crystal structures using deep learning,” Nature Communications, vol. 9, p. 2775, 2018.
[2] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, “Big Data of Materials Science: Critical Role of the Descriptor,” Physical Review Letters, vol. 114, no. 10, p. 105503, 2015.

Section author: Angelo Ziletti <angelo.ziletti@gmail.com>