ai4materials.models.sis module

class ai4materials.models.sis.SIS(P, D, feature_list, feature_unit_classes=None, target_unit='eV', control=None, output_log_file='/home/beaker/.beaker/v1/web/tmp/output.log', rm_existing_files=False, if_print=True, check_only_control=False)

Bases: object

Python interface to the fortran SIS+(Sure Independent Screening)+L0/L1L0 code.

SIS+(Sure Independent Screening)+L0/L1L0 is a greedy algorithm. It extends orthogonal matching pursuit (OMP): instead of considering only the feature vector closest to the residual in each step, it collects the closest ‘n_SIS’ feature vectors. After a given number of iterations, the final model is built by determining the (approximately) best linear combination of the collected features using the L0 (L1-L0) algorithm.
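The greedy scheme described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the fortran implementation: the function name sis_l0, the ranking by absolute correlation with the residual, and the exhaustive least-squares L0 search are choices made for this sketch, and no algebraic feature combinations are generated.

```python
from itertools import combinations

import numpy as np


def sis_l0(D, P, n_sis, max_dim):
    """Toy SIS+L0 (hypothetical helper, not part of ai4materials)."""
    n_sample = D.shape[0]
    ones = np.ones((n_sample, 1))
    # Standardize the columns so that |D_std.T @ residual| ranks the
    # features by their correlation with the residual.
    D_std = (D - D.mean(axis=0)) / D.std(axis=0)
    collected = []
    residual = P - P.mean()
    models = []
    for dim in range(1, max_dim + 1):
        # SIS step: collect the n_sis not yet collected features that
        # are closest to the current residual.
        scores = np.abs(D_std.T @ residual)
        ranked = [int(i) for i in np.argsort(scores)[::-1]]
        collected.extend([i for i in ranked if i not in collected][:n_sis])
        # L0 step: exhaustive least-squares search over all subsets of
        # size dim drawn from the collected features.
        best = None
        for subset in combinations(collected, dim):
            X = np.hstack([D[:, list(subset)], ones])
            coef, *_ = np.linalg.lstsq(X, P, rcond=None)
            rmse = float(np.sqrt(np.mean((X @ coef - P) ** 2)))
            if best is None or rmse < best[0]:
                best = (rmse, subset, coef)
        rmse, subset, coef = best
        # The residual of the best model drives the next SIS step.
        residual = P - np.hstack([D[:, list(subset)], ones]) @ coef
        models.append({'features': subset, 'coefficients': coef, 'rmse': rmse})
    return models
```

As in the documented results, the last coefficient of each model belongs to the intercept column of ones.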

To execute the code, folder paths are needed in addition to the SIS code parameters. For a remote run, account information for the remote machine on which the code is executed is needed as well.

P : array [n_sample] or list [n_sample]

P is the target (label). If ptype = ‘quali’, a list of ints is required.

D : array, [n_sample, n_features]

D refers to the feature matrix. The SIS code calculates algebraic combinations of the features and then applies the SIS+L0/L1L0 algorithm.

feature_list : list of strings

List of feature names. They must be in the same order as the feature vectors (columns) in D. Feature names must be strings contained in F_unit (see above).

feature_unit_classes : None or list of {integers or the string ‘no_unit’}

Integers correspond to the unit classes of the features in feature_list. ‘no_unit’ is reserved for dimensionless features.

output_log_file : string

File path for the logger output.

rm_existing_files : bool

If True, an existing SIS_input_path on the local machine or remote_input_path on the remote machine is removed. Otherwise it is renamed to SIS_input_path_$number.

control : dict of dicts (of dicts)

Dict tree:

    {
        'local_paths': {'local_path': str, 'SIS_input_folder_name': str},
        ('local_run', 'remote_run'): (
            {'SIS_code_path': str, 'mpi_command': str},
            {'SIS_code_path': str, 'username': str, 'hostname': str, 'remote_path': str, 'eos': bool, 'mpi_command': str, 'nodes': int, ('key_file', 'password'): (str, str)}
        ),
        'parameters': {'n_comb': int, 'n_sis': int, 'max_dim': int, 'OP_list': list},
        'advanced_parameters': {'FC': FC_dic, 'DI': DI_dic, 'FCDI': FCDI_dic}
    }

Here the tuples (.,.) mean that exactly one of the two keys has to be set. For the forms of FC_dic, DI_dic and FCDI_dic, check FC_tuplelist, DI_tuplelist and FCDI_tuplelist above in PARAMETERS REFERENCE.
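As an illustration of the tree above, a control dict for a local run could look as follows. This is a sketch only: the paths and parameter values are placeholders, and the final check merely mirrors the "exactly one of the two keys" rule rather than the validation done by the class.

```python
# Placeholder paths; adjust them to the local installation.
local_paths = {
    'local_path': '/home/username/',
    'SIS_input_folder_name': 'SIS_input',
}
# local_run only needs the code path and the MPI command.
local_run = {
    'SIS_code_path': '/home/username/SIS_code/',
    'mpi_command': '',
}
parameters = {
    'n_comb': 2,
    'n_sis': 10,
    'max_dim': 2,
    'OP_list': ['+', '-', '*', '/'],
}
SIS_control = {
    'local_paths': local_paths,
    'local_run': local_run,
    'parameters': parameters,
}

# Exactly one of 'local_run' and 'remote_run' may be present.
run_modes = [key for key in ('local_run', 'remote_run') if key in SIS_control]
if len(run_modes) != 1:
    raise ValueError("set exactly one of 'local_run' and 'remote_run'")
```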

start : -

starts the code

get_results : list [max_dim] of dicts {‘D’, ‘coefficients’, ‘P_pred’}

get_results[model_dim-1][‘D’] : pandas data frame [n_sample, model_dim+1]: Descriptor matrix whose columns are algebraic combinations of the input feature matrix. The column names are thus strings of the algebraic combinations of the strings in the input feature_list. The last column is full of ones, corresponding to the intercept.
get_results[model_dim-1][‘coefficients’] : array [model_dim+1]: Optimized coefficients.
get_results[model_dim-1][‘P_pred’] : array [n_sample]: Fit: np.dot(np.array(D), coefficients)

For remote_run the library nomad_sim.ssh_code is needed. If the remote machine is eos, the (key: value) pair ‘eos’: True has to be set in the dict control[‘remote_run’]. In addition, for example, ‘nodes’: 1 and ‘mpi_run -np 32’ can then be set.

Paths (say name: path) are all set in the initialization part as self.path and used in other functions via self.path. In general, the other variables are passed directly as arguments to the functions. There are a few exceptions, such as self.ssh.

>>> import numpy as np
>>> from ai4materials.models.sis import SIS
>>> # Specify where on the local machine the input files for the SIS fortran code shall be created.
>>> Local_paths = {
...     'local_path': '/home/beaker/',
...     'SIS_input_folder_name': 'SIS_input',
... }
>>> # Information for the ssh connection. Instead of 'password', 'key_file' with the
>>> # path to an rsa key file is also possible.
>>> Remote_run = {
...     'mpi_command': '',
...     'remote_path': '/home/username/',
...     'SIS_code_path': '/home/username/SIS_code/',
...     'hostname': 'hostname',
...     'username': 'username',
...     'password': 'XXX',
... }
>>> # Parameters for the SIS fortran code. If a different 'OP_list' shall be used at each
>>> # iteration, set a list of n_comb lists, e.g. [['+', '-', '*'], ['/', '*']] if n_comb = 2.
>>> Parameters = {
...     'n_comb': 2,
...     'OP_list': ['+', '|-|', '-', '*', '/', 'exp', '^2'],
...     'max_dim': 2,
...     'n_sis': 10,
... }
>>> # Final control dict for the SIS class. Instead of remote_run, local_run can also be set
>>> # (with different keys). advanced_parameters can be set as well, but this should be done
>>> # only if the parameters of the SIS fortran code are understood.
>>> SIS_control = {'local_paths': Local_paths, 'remote_run': Remote_run, 'parameters': Parameters}
>>> # Target (label) vector P, feature_list and feature matrix D. The values are made up.
>>> P = np.array([1, 2, 3, -2, -9])
>>> feature_list = ['r_p(A)', 'r_p(B)', 'Z(A)']
>>> D = np.array([[7, -11, 3],
...               [-1, -2, 4],
...               [2, 20, 3],
...               [8, 1, 8],
...               [-3, 4, 1]])
>>> # Use the code
>>> sis = SIS(P, D, feature_list, control=SIS_control, output_log_file='/home/ahmetcik/codes/beaker/output.log')
>>> sis.start()
>>> results = sis.get_results()
>>> coef_1dim = results[0]['coefficients']
>>> coef_2dim = results[1]['coefficients']
>>> D_1dim = results[0]['D']
>>> D_2dim = results[1]['D']
>>> print(coef_2dim)
[-3.1514 -5.9171 3.9697]
>>> print(D_2dim)
   ((rp(B)/Z(A))/(rp(A)+rp(B)))  ((Z(A)/rp(B))/(rp(B)*Z(A)))  intercept
0                      0.916670                     0.008264        1.0
1                      0.166670                     0.250000        1.0
2                      0.303030                     0.002500        1.0
3                      0.013889                     1.000000        1.0
4                      4.000000                     0.062500        1.0

ask_periodically(sc, seconds, counter, username): Recursive function that runs self.check_status periodically, once every ‘seconds’ seconds.
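A self-rescheduling poll of this kind can be sketched with the standard library sched module. This is an assumption-laden sketch: that sc is a sched.scheduler is a guess based on the parameter name, and the is_running callback and max_checks limit are hypothetical stand-ins for self.check_status and its arguments. A fake clock is used so that the demo does not actually sleep.

```python
import sched


def ask_periodically(sc, seconds, counter, is_running, max_checks=100):
    """Re-schedule itself every `seconds` until is_running() returns False."""
    if is_running() and counter < max_checks:
        sc.enter(seconds, 1, ask_periodically,
                 (sc, seconds, counter + 1, is_running, max_checks))


# Fake clock: "sleeping" just advances the time, so the demo runs instantly.
clock = [0.0]


def fake_time():
    return clock[0]


def fake_sleep(delay):
    clock[0] += delay


# Stand-in for the status check: running twice, then finished.
answers = iter([True, True, False])
calls = []


def is_running():
    state = next(answers)
    calls.append(state)
    return state


scheduler = sched.scheduler(fake_time, fake_sleep)
scheduler.enter(5, 1, ask_periodically, (scheduler, 5, 0, is_running))
scheduler.run()
```

With a real clock one would pass time.time and time.sleep to sched.scheduler instead of the fake functions.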

check_(k)

check_DI(file_path): Check DI.out to see whether the calculation has finished.

check_FC(file_path)

Check FC.out to see whether the calculation has finished, and read the feature space sizes.

calc_finished : bool: If the calculation has finished, there should be a ‘Have a nice day !’.
featurespace : integer: Total size of the generated feature space, before the redundancy check.
n_collected : integer: The number of features collected in the current iteration. Should be n_sis.
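The finished check described above amounts to looking for the closing line in the output file. A minimal sketch, assuming only that the marker string appears verbatim somewhere in the file (the helper name calc_finished is hypothetical, and parsing of the feature space sizes is left out because the file layout is not documented here):

```python
import os

FINISH_MARKER = 'Have a nice day !'


def calc_finished(file_path):
    """Return True if the output file exists and contains the finish marker."""
    if not os.path.isfile(file_path):
        return False
    with open(file_path) as handle:
        return FINISH_MARKER in handle.read()
```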

check_OP_list(control)

Checks the form and items of control[‘parameters’][‘OP_list’].

control[‘parameters’][‘OP_list’] must be a list of operation strings or a list of n_comb lists of operation strings. It is also checked whether the operation strings are items of available_OPs (see above).

control : dict

control : dict with manipulated control[‘parameters’][‘OP_list’]
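The two accepted forms of OP_list can be normalized along the following lines. This is a sketch under assumptions: AVAILABLE_OPS is a hypothetical stand-in for available_OPs that contains only the operators appearing in the class example, and expanding a flat list into n_comb copies is a guess at what "manipulated" means.

```python
# Hypothetical stand-in for available_OPs.
AVAILABLE_OPS = ('+', '-', '|-|', '*', '/', 'exp', '^2')


def normalize_op_list(op_list, n_comb):
    """Return OP_list as a list of n_comb lists of operation strings."""
    if not isinstance(op_list, list) or not op_list:
        raise ValueError("OP_list must be a non-empty list")
    if all(isinstance(op, str) for op in op_list):
        # Flat list: use the same operations in every iteration.
        per_iteration = [list(op_list) for _ in range(n_comb)]
    elif all(isinstance(ops, list) for ops in op_list) and len(op_list) == n_comb:
        per_iteration = [list(ops) for ops in op_list]
    else:
        raise ValueError("OP_list must be a list of strings or of n_comb lists")
    for ops in per_iteration:
        unknown = [op for op in ops if op not in AVAILABLE_OPS]
        if unknown:
            raise ValueError("unknown operations: %r" % (unknown,))
    return per_iteration
```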

check_OP_strings(OPs): Check whether all items of OPs are items of available_OPs.

check_arrays(P_in, D, feature_list, feature_unit_classes, ptype): Check the arrays/lists P, D and feature_list.

check_control(par_in, par_ref, par_in_path)

Recursive function to check the input control dict tree.

For example, check_control(control, control_ref, ‘control’) goes through the dict tree control and compares it with control_ref, checking whether the correct keys are set (mandatory keys, non-mandatory keys, typos in key strings) and whether the values are of the correct type or items of a list of allowed options. It reports errors with hints about what is wrong and what is needed.

par_in : any key: If par_in is a dict, the function recurses.
par_ref : any key: Compared to par_in to check that both are of the same type. If par_in and par_ref are dicts, their keys are also compared.
par_in_path : string: The dict tree path at which an error occurs, e.g. control[key_1][key_2]… When using the function from outside, start with the name of the input dict, e.g. ‘control’.
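The recursion and the role of par_in_path can be illustrated with a reduced version. This sketch assumes that the leaves of the reference tree are Python types and that all keys are mandatory; tuple keys, optional keys and lists of allowed values, which the real function handles, are left out.

```python
def check_control(par_in, par_ref, par_in_path):
    """Recursively compare par_in with the reference tree par_ref (sketch)."""
    if isinstance(par_ref, dict):
        if not isinstance(par_in, dict):
            raise TypeError("%s must be a dict" % par_in_path)
        for key in par_ref:
            if key not in par_in:
                raise ValueError("%s is missing the key %r" % (par_in_path, key))
        for key, value in par_in.items():
            if key not in par_ref:
                raise ValueError("%s has the unexpected key %r" % (par_in_path, key))
            # Extend the path so that errors point at the offending entry.
            check_control(value, par_ref[key], "%s[%r]" % (par_in_path, key))
    elif not isinstance(par_in, par_ref):
        raise TypeError("%s must be of type %s" % (par_in_path, par_ref.__name__))


control_ref = {'parameters': {'n_comb': int, 'n_sis': int}}
check_control({'parameters': {'n_comb': 2, 'n_sis': 10}}, control_ref, 'control')
```

An invalid entry such as 'n_sis': 'ten' raises a TypeError whose message starts with control['parameters']['n_sis'].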

check_feature_space_size(feature_list, n_target=5, upper_bound=300000000)

check_feature_units(feature_unit_classes)

Check feature units

Checks which

feature_unit_classes : list of integers: The list must be sorted.

unit_strings : list of strings: In the form [‘(1:3)’,’(4:8)’,..], where the indices start from 1.
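The mapping from a sorted list of unit classes to 1-based index ranges can be sketched with itertools.groupby. The helper name unit_class_ranges is hypothetical, and the handling of ‘no_unit’ entries is not covered here.

```python
from itertools import groupby


def unit_class_ranges(feature_unit_classes):
    """Turn a sorted list of unit classes into 1-based '(start:end)' strings."""
    ranges = []
    start = 1
    for _, group in groupby(feature_unit_classes):
        size = len(list(group))
        ranges.append('(%d:%d)' % (start, start + size - 1))
        start += size
    return ranges
```

For example, three features of class 1 followed by five features of class 2 give the two ranges shown in the description above.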

check_files(iter_folder_name, dimension)

Check which file is missing and maybe why.

If something went wrong, this function finds out where the problem occurred. Returns an error string.

check_keys(par_in, par_ref, par_in_path)

Compares the dicts par_in and par_ref.

Collects which keys are missing (only if the keys are not in not_mandotary) and which keys are not expected (for example because of a typo).

If there are missing or unexpected keys, an error message lists them.

par_in : dict

par_ref : dict

par_in_path : string: Dictionary path string for error message, e.g ‘control[key_1][key_2]’.

check_l0_steps(max_dim, n_sis, upper_limit=10000): Check whether the number of l0 steps is larger than upper_limit.

check_quali_dim(control): Check that desc_dim=2 if ptype is ‘quali’.

check_status(filename, username)

Check whether the calculation on eos has finished.

filename : str: The qstat output is written into this file, which is then read.
username : str: The file is searched for this username. If it does not appear, the calculation has finished.

status : bool: True if the calculation is still running.

check_type(par_in, par_ref, par_in_path, if_also_none=False)

Check type of par_in and par_ref.

If par_ref is a tuple, par_in must be an item of par_ref; otherwise they must have the same type.

convert_2_fortran(parameter, parameter_value)

Convert parameters to SIS fortran code style.

Converts e.g. True to string ‘.true.’ or a string ‘s’ to “‘s’”, and other special formats. Returns the converted parameter.
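The two conversions named above can be sketched as follows. Only the bool and string cases come from the description; the helper name to_fortran and the fallback to str() for everything else are assumptions, and the "other special formats" are not covered.

```python
def to_fortran(value):
    """Convert a Python value to a fortran-style input string (sketch)."""
    # bool must be tested before other types, since bool is a subclass of int.
    if isinstance(value, bool):
        return '.true.' if value else '.false.'
    if isinstance(value, str):
        return "'%s'" % value
    return str(value)
```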

convert_feature_strings(feature_list)

Convert feature strings.

Puts an ‘sr’ for reals and an ‘si’ for integers at the beginning of a string. Returns the list with the changed strings.

do_transfer(ssh=None, eos=None, username=None, CPUs=None)

Run the calculation on the remote machine.

First checks whether the folder self.remote_input_path already exists on the remote machine; if so, it is deleted or renamed. Then copies the file system self.SIS_input_path with the SIS fortran code files into the folder self.remote_input_path. Finally runs the calculations on the remote machine and copies back the file system with the results. If the remote machine is eos, writes a submission script, submits it and checks qstat to see whether the calculation has finished.

ssh : object: Must be from the code nomad_sim.ssh_code.
eos : bool: Whether the remote machine is eos. To write submission script and submit …
username : string: Needed to check qstat on eos.
CPUs : int: Used to reserve the right number of CPUs in the eos submission script.

estimate_calculation_expense(feature_list): Check the expense of the SIS+l0 calculations

estimate_feature_space(n_comb, n_features, ops, rate=1.0, n_comb_start=0)

flatten(list_in)

Returns the list_in collapsed into a one dimensional list

list_in : list/tuple of lists/tuples of …
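A recursive flattening of nested lists and tuples, as described above, might look like this. It is a sketch that treats every non-list, non-tuple item (including strings) as a leaf, which is an assumption about the intended behaviour.

```python
def flatten(list_in):
    """Collapse nested lists/tuples into a one-dimensional list."""
    flat = []
    for item in list_in:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat
```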

get_OPs(OP_list): Convert OP_list to the special format for the SIS fortran input.

get_arrays_of_top_descriptors(top_indices)

get_des(x): Change the descriptor strings read from the output DI.out. Remove characters such as ‘:’, ‘si’, ‘sr’. Then convert the feature strings for printing.

get_indices_of_top_descriptors()

get_next_size(n_features, ops)

get_results(ith_descriptor=0)

Method to get the results from the file system.

ith_descriptor: int: Return the ith best descriptor.

out : list [max_dim] of dicts {‘D’, ‘coefficients’, ‘P_pred’}

out[model_dim-1][‘D’] : pandas data frame [n_sample, model_dim+1]: Descriptor matrix whose columns are algebraic combinations of the input feature matrix. The column names are thus strings of the algebraic combinations of the strings in the input feature_list. The last column is full of ones, corresponding to the intercept.
out[model_dim-1][‘coefficients’] : array [model_dim+1]: Optimized coefficients.
out[model_dim-1][‘P_pred’] : array [n_sample]: Fit: np.dot(np.array(D), coefficients)
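The relation between ‘D’, ‘coefficients’ and ‘P_pred’ can be checked by hand. The sketch below uses a made-up stand-in for one entry of the results list (a plain NumPy array instead of a pandas data frame, with invented numbers) and computes the fit and its root mean square error against a made-up target.

```python
import numpy as np

# Made-up stand-in for out[1]: two descriptor columns plus the intercept column.
model = {
    'D': np.array([[0.5, 2.0, 1.0],
                   [1.5, 1.0, 1.0],
                   [2.5, 0.0, 1.0]]),
    'coefficients': np.array([2.0, -1.0, 0.5]),
}
P = np.array([-0.4, 2.5, 5.4])  # made-up target values

# Fit as documented: the last coefficient multiplies the column of ones.
P_pred = np.dot(np.array(model['D']), model['coefficients'])
rmse = np.sqrt(np.mean((P_pred - P) ** 2))
```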

get_strings_of_top_descriptors(top_indices)

get_type(value)

get_value_from_dic(dictionary, key_tree_path)

Returns value of the dict tree

dictionary : dict or ‘dict tree’ such as control_ref: In a dict tree, a key can be a tuple of keys, with the value being a tuple of the corresponding values.
key_tree_path: list of keys: Must be in the correct order beginning from the top of the tree/dict.

Examples

>>> print(get_value_from_dic(control_ref, ['local_run', 'SIS_code_path']))
<type 'str'>
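The lookup through tuple keys can be sketched as below. This is an illustration under assumptions: the reduced control_ref is a hypothetical excerpt of the reference tree, and raising KeyError for an unknown key is a choice made for this sketch.

```python
def get_value_from_dic(dictionary, key_tree_path):
    """Walk a dict tree in which a key may be a tuple of alternative keys."""
    value = dictionary
    for key in key_tree_path:
        if key in value:
            value = value[key]
            continue
        # Look for a tuple key containing `key` and pick the value at the
        # same position of the corresponding tuple of values.
        for tuple_key, tuple_value in value.items():
            if isinstance(tuple_key, tuple) and key in tuple_key:
                value = tuple_value[tuple_key.index(key)]
                break
        else:
            raise KeyError(key)
    return value


# Hypothetical excerpt of a reference tree.
control_ref = {
    'local_paths': {'local_path': str, 'SIS_input_folder_name': str},
    ('local_run', 'remote_run'): (
        {'SIS_code_path': str, 'mpi_command': str},
        {'SIS_code_path': str, 'username': str, 'nodes': int},
    ),
}
```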

manipulate_descriptor_string(d)

ncr(n, r): Binomial coefficient
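A binomial coefficient helper can be written with exact integer arithmetic as sketched below. The connection to the cost of the L0 step is an illustration only: an exhaustive search over subsets of size dim drawn from n collected features has C(n, dim) candidates, but whether check_l0_steps counts steps exactly this way is an assumption.

```python
from math import factorial


def ncr(n, r):
    """Binomial coefficient C(n, r) using exact integer arithmetic."""
    if r < 0 or r > n:
        return 0
    return factorial(n) // (factorial(r) * factorial(n - r))


# Illustration: with n_sis = 10 and two iterations, 20 features are collected,
# so an exhaustive 2D search has C(20, 2) candidate pairs.
n_pairs = ncr(2 * 10, 2)
```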

read_results(iter_folder_name, dimension, task, tsizer)

Read results from DI.out.

iter_folder : string: Name of the iter_folder containing the outputs of the corresponding iteration of SIS+l1/l1l0, e.g. ‘iter01’, ‘iter02’.
dimension : integer: In iteration three, for example, DI.out provides 1- to 3-dimensional descriptors. Choose here which dimension should be returned.
task : integer < 100: For multi task; must still be worked on.
tsizer : integer: Number of samples, e.g. the number of rows of D or P.

RMSE : float: Root mean square error of the model.
Des : list of strings: List of the descriptors.
coef : array [model_dim+1]: Coefficients including the intercept.
D : array [n_sample, model_dim+1]: Matrix with columns being the selected features (descriptors) for the model. The last column is full of ones, corresponding to the intercept.

read_results_quali()

Read the results for the 2D descriptor from calculations with a qualitative run.

results : list of lists: Each sublist characterizes a separate model (if multiple models have the same score/cost, all of them are returned). A sublist contains [descriptor_strings, D, n_overlap], where D (D.shape = (n_sample, 2)) is an array with the descriptor vectors.

return_OP_error(): Error message if control[‘parameters’][‘OP_list’] has the wrong form.

set_SIS_parameters(desc_dim=2, subs_sis=100, rung=1, opset=['+', '-', '/', '^2', 'exp'], ptype='quanti', advanced_parameters=None)

Set the SIS fortran code parameters.

If advanced_parameters is passed, its values are used; otherwise default values are used. max_dim, n_sis, n_comb, and OP_list can also be overwritten by advanced_parameters if specified.

set_local_run(SIS_code_path='~/codes/SIS_code/', mpi_command=''): Set and check the local environment if local_run is used.

set_logger(output_log_file): Set the logger for outputs such as errors, warnings and infos.

set_main_settings(P, D, feature_list, feature_unit_classes, local_path='/home/beaker/', SIS_input_folder_name='input_folder'): Set the local environment and P, D and feature_list.

set_ssh_connection(hostname=None, username=None, port=22, key_file=None, password=None, remote_path=None, SIS_code_path=None, eos=False, nodes=1, mpi_command=''): Set the ssh connection. Set and check the remote environment if remote_run is used.

start(): Method which starts the calculations after init.

string_descriptor(RMSE, features, coefficients, target_unit): Make a string with the model and its RMSE for output in the terminal.

write_P_D(P, D, feature_list): Writes ‘train.dat’ as SIS fortran code input with P, D and the feature strings.

write_parameters(): Write the parameters into the SIS fortran code input files. The parameters are converted into the special format beforehand.

write_submission_script(CPUs): Writes the eos job submission script.

ai4materials.models.sis.converted_2_standard = {'disA': 'd(A)', 'disAB': 'd(AB)', 'disB': 'd(B)', 'eaA': 'EA(A)', 'eaB': 'EA(B)', 'ebA': 'E_b(A)', 'ebAB': 'E_b(AB)', 'ebB': 'E_b(B)', 'hlgapA': 'HL_gap(A)', 'hlgapAB': 'HL_gap(AB)', 'hlgapB': 'HL_gap(B)', 'homoA': 'E_HOMO(A)', 'homoB': 'E_HOMO(B)', 'ipA': 'IP(A)', 'ipB': 'IP(B)', 'lumoA': 'E_LUMO(A)', 'lumoB': 'E_LUMO(B)', 'periodA': 'period(A)', 'periodB': 'period(B)', 'rdA': 'r_d(A)', 'rdB': 'r_d(B)', 'rpA': 'r_p(A)', 'rpB': 'r_p(B)', 'rpiAB': 'r_pi(AB)', 'rsA': 'r_s(A)', 'rsB': 'r_s(B)', 'rsigmaAB': 'r_sigma(AB)', 'valA': 'Z_val(A)', 'valB': 'Z_val(B)', 'zA': 'Z(A)', 'zB': 'Z(B)'}