Data documentation
Dataset construction
- supernnova.data.make_dataset.build_traintestval_splits(settings)[source]
Build the dataset splits as follows:
Downsample each class so that it has the same cardinality as the lowest-cardinality class
Randomly assign lightcurves to an 80/10/10 train/test/val split (except Out-of-distribution data: 1/1/98)
- OOD:
Uses the complete sample for testing; does not require settings.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
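The downsample-then-split logic described above can be sketched as follows. This is a minimal illustration, not the actual SuperNNova implementation; the `target` and `split` column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

def balanced_splits(df, rng=np.random.default_rng(0)):
    """Downsample each class to the smallest class size, then assign
    an 80/10/10 train/val/test split (hypothetical helper, not the
    library's actual function)."""
    # Downsample: every class ends up with the lowest class cardinality
    n_min = df["target"].value_counts().min()
    df = (
        df.groupby("target", group_keys=False)
        .sample(n=n_min, random_state=0)
        .reset_index(drop=True)
    )
    # Shuffle, then cut at the 80% and 90% boundaries
    idx = rng.permutation(len(df))
    n_train, n_val = int(0.8 * len(df)), int(0.9 * len(df))
    df["split"] = ""
    df.loc[idx[:n_train], "split"] = "train"
    df.loc[idx[n_train:n_val], "split"] = "val"
    df.loc[idx[n_val:], "split"] = "test"
    return df
```

Downsampling first guarantees class balance in every split; the random permutation then makes the split assignment independent of class.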
- supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]
Carry out preprocessing on a FITS file and save the results to pickle. Pickle is preferred to csv as it is faster to read and write.
Join columns from header files
Select columns that will be useful later on
Compute SNID to tag each light curve
Compute delta times between measurements
Filter preprocessing
Remove delimiter rows
- Parameters:
file_path (str) – path to a .FITS file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]
Carry out preprocessing on a csv file and save the results to pickle. Pickle is preferred to csv as it is faster to read and write.
Compute delta times between measurements
Filter preprocessing
- Parameters:
file_path (str) – path to a .csv file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.preprocess_data(settings)[source]
Preprocess the FITS data
Use multiprocessing/threading to speed up data processing
Preprocess every FITS file in the raw data dir
Also save a DataFrame of Host Spe for publication plots
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]
Carry out the pivot: group time-wise close observations onto the same row, so that each row in the dataframe holds a value for each flux and flux-error column
All observations within 8 hours of each other are assigned the same MJD
Results are cached with pickle
- Parameters:
df (pandas.DataFrame) – dataframe of pre-processed lightcurve data
settings (ExperimentSettings) – controls experiment hyperparameters
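The grouping step behind this pivot can be sketched as below. Note this is an illustrative approximation: it chains consecutive gaps rather than reproducing the library's exact 8-hour MJD assignment, and the `MJD`, `FLT`, and `FLUXCAL` column names are assumed from SNANA-style conventions.

```python
import pandas as pd

def group_close_observations(df, window_days=8 / 24):
    """Group observations taken within ~8 hours onto the same row,
    one flux column per filter (sketch, not the library's exact code)."""
    df = df.sort_values("MJD").copy()
    # Start a new time group whenever the gap to the previous
    # observation exceeds the window
    new_group = df["MJD"].diff().fillna(0) > window_days
    df["time_group"] = new_group.cumsum()
    # One row per time group, one FLUXCAL column per filter
    return df.pivot_table(
        index="time_group", columns="FLT", values="FLUXCAL", aggfunc="mean"
    )
```

After this pivot, each row represents a single effective epoch with one flux value per band, which is the shape the RNN training code expects.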
- supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]
Use multiprocessing/threading to speed up data processing
Pivot every file in list_files and cache the result with pickle
- Parameters:
list_files (list) – list of .pickle files containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters
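The fan-out pattern used here can be sketched with a thread pool. The worker below (`square`) is a hypothetical stand-in for the real per-file pivot function; the point is the pool-over-a-file-list structure, not the worker itself.

```python
from multiprocessing.pool import ThreadPool

def square(x):
    # Stand-in for a per-file worker such as pivot_dataframe_single_from_df
    return x * x

def process_batch(items, worker, n_workers=4):
    """Fan a list of work items out over a pool of workers
    (illustrative sketch of the multiprocessing/threading pattern)."""
    with ThreadPool(n_workers) as pool:
        return pool.map(worker, items)
```

Because each file is preprocessed independently, the work is embarrassingly parallel and `pool.map` preserves the order of `items` in its results.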
- supernnova.data.make_dataset.make_dataset(settings)[source]
Main function for data processing
Create the train test val splits
Preprocess all the FITS data, then pivot
Save all of the processed data to a single HDF5 database
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
Data utilities
- class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)
Bases:
tuple
- arr_mean
Alias for field number 1
- arr_min
Alias for field number 0
- arr_std
Alias for field number 2
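The field-number aliases above correspond to a plain `collections.namedtuple`; a minimal equivalent definition (the values here are made up for illustration):

```python
from collections import namedtuple

# Field order matches the documented aliases:
# field 0 = arr_min, field 1 = arr_mean, field 2 = arr_std
LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])

stats = LogStandardized(arr_min=-1.0, arr_mean=0.0, arr_std=1.0)
```

Fields can be read positionally (`stats[0]`) or by name (`stats.arr_min`), which is why the documentation lists each name as an alias for a field number.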
- supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path)[source]
Load a FITS file and cast it to a pandas DataFrame
- Parameters:
fit_file_path (str) – path to the FITS file
- Returns:
(pandas.DataFrame) dataframe loaded from the FITS file
- supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]
Match the target class (an integer in {0, …, 6}) to the name of the class, e.g. “SN Ia” or “SN CC”
- Parameters:
target (int) – specifies the classification target
settings (ExperimentSettings) – custom class to hold hyperparameters
simplify (bool) – if True, do not show all classes
- Returns:
(str) the name of the class
- supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]
Create classes based on a type column
Depending on the number of classes (2 or all), distinct target columns are created
- Parameters:
df (pandas.DataFrame) – the input dataframe
settings (ExperimentSettings) – controls experiment hyperparameters
type_column (str) – the type column in df
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
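The binary-target case can be sketched as below. The type code used here (101 for SN Ia) is a hypothetical SNANA-style value chosen for the example; the real mapping comes from `settings`.

```python
import pandas as pd

# Hypothetical mapping: assume type code 101 means SN Ia and
# everything else is non-Ia. The library reads this from settings.
SNTYPE_IA = {101}

def tag_binary_type(df, type_column="TYPE"):
    """Add a binary target column (0 = Ia, 1 = non-Ia) based on the
    type column. Sketch only: the real tag_type also builds a
    multi-class target when more than 2 classes are requested."""
    df = df.copy()
    df["target_2classes"] = (~df[type_column].isin(SNTYPE_IA)).astype(int)
    return df
```

Keeping the targets as extra columns (rather than replacing the type column) lets the same dataframe serve both the binary and multi-class training configurations.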
- supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]
Load the FITOPT file as a pandas dataframe
Pickle it for future use (it is faster to load as a pickled dataframe)
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
verbose (bool) – whether to display logging message. Default:
True
- Returns:
(pandas.DataFrame) dataframe with FITOPT data
- supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]
Read the HEAD FITS file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header FIT file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default:
None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]
Read the HEAD csv file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header csv file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default:
None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.add_redshift_features(settings, df)[source]
Add redshift features to pandas dataframe.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe with FITS data
- Returns:
(pandas.DataFrame) the dataframe, possibly with added redshift features
- supernnova.utils.data_utils.compute_delta_time(df)[source]
Compute the delta time between two consecutive observations
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe holding lightcurve data with delta_time features
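A delta-time computation of this kind is a one-line grouped diff; a minimal sketch, assuming the preprocessed `SNID` and `MJD` columns and a first-observation delta of 0:

```python
import pandas as pd

def compute_delta_time(df):
    """Add the time elapsed since the previous observation of the
    same lightcurve (sketch of the documented behavior)."""
    df = df.copy()
    # diff() within each SNID group; the first row of each lightcurve
    # has no predecessor, so fill its delta with 0
    df["delta_time"] = df.groupby("SNID")["MJD"].diff().fillna(0)
    return df
```

Grouping by `SNID` is essential: without it, the diff would leak across lightcurve boundaries and the first observation of each new object would get a spurious huge delta.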
- supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]
Remove rows in the same light curve after a gap > 150 days. Rationale: if no signal has been recorded within a 150-day window, it is unlikely there is much left afterwards
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe where large delta time rows have been removed
- supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]
Load physical information related to the created database of lightcurves
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- Returns:
(pandas.DataFrame) dataframe holding physics information about the dataset
- supernnova.utils.data_utils.log_standardization(arr)[source]
Normalization strategy for the fluxes and flux errors
Log transform the data
Mean and std dev normalization
- Parameters:
arr (np.array) – data to normalize
- Returns:
(LogStandardized) namedtuple holding normalization data
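The two-step strategy (log transform, then mean/std normalization) can be sketched as below. The small positive offset used before the log is an assumption for the example, not necessarily the library's exact convention.

```python
import numpy as np

def log_standardize(arr):
    """Log-transform, then normalize to zero mean and unit std
    (sketch of the strategy; the offset choice is an assumption)."""
    # Shift so the minimum maps to a small positive value before the log,
    # since fluxes can be negative or zero
    arr_min = arr.min()
    log_arr = np.log(arr - arr_min + 1e-5)
    arr_mean, arr_std = log_arr.mean(), log_arr.std()
    return (log_arr - arr_mean) / arr_std, (arr_min, arr_mean, arr_std)
```

The three statistics are returned alongside the normalized array because they must be stored (as in the `LogStandardized` namedtuple) to apply the identical transform at inference time and to invert it for plotting.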
- supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]
Save the processed dataframe to HDF5
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe holding processed data