Data documentation

Dataset construction

supernnova.data.make_dataset.build_traintestval_splits(settings)[source]

Build dataset split in the following way

  • Downsample each class so that it has the same cardinality as the lowest cardinality class

  • Randomly assign lightcurves to an 80/10/10 train/test/val split (except out-of-distribution data, which uses 1/1/98)

OOD:

Uses the complete sample for testing; does not require settings.

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
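The two steps above (class balancing, then random assignment) can be sketched as follows. The column names, split fractions, and helper name are assumptions for illustration, not the actual implementation:

```python
import numpy as np
import pandas as pd

def balanced_splits(df, rng=np.random.default_rng(0)):
    # Downsample each class to the cardinality of the smallest class
    n_min = df["target"].value_counts().min()
    df = (
        df.groupby("target", group_keys=False)
        .sample(n=n_min, random_state=0)
        .reset_index(drop=True)
    )
    # Randomly assign each lightcurve to an 80/10/10 train/test/val split
    df["split"] = rng.choice(
        ["train", "test", "val"], size=len(df), p=[0.8, 0.1, 0.1]
    )
    return df

df = pd.DataFrame({"SNID": range(10), "target": [0] * 6 + [1] * 4})
out = balanced_splits(df)
# Each class is downsampled to 4 lightcurves (the smallest class cardinality)
```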

supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]

Carry out preprocessing on FITS file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

  • Join column from header files

  • Select columns that will be useful later on

  • Compute SNID to tag each light curve

  • Compute delta times between measurements

  • Filter preprocessing

  • Removal of delimiter rows

Parameters:
  • file_path (str) – path to .FITS file

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]

Carry out preprocessing on csv file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

  • Compute delta times between measurements

  • Filter preprocessing

Parameters:
  • file_path (str) – path to .csv file

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.preprocess_data(settings)[source]

Preprocess the FITS data

  • Use multiprocessing/threading to speed up data processing

  • Preprocess every FIT file in the raw data dir

  • Also save a DataFrame of Host Spe for publication plots

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]

Carry out pivot: time-wise close observations are grouped on the same row, and each row in the dataframe shows a value for each flux and flux error column

  • All observations within 8 hours of each other are assigned the same MJD

  • Results are cached with pickle

Parameters:
  • df (pandas.DataFrame) – dataframe containing pre-processed data

  • settings (ExperimentSettings) – controls experiment hyperparameters
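The 8-hour grouping step can be sketched as below; the column name and helper are assumptions for illustration. Consecutive observations whose MJD differs by less than 8 hours (1/3 of a day) are collapsed onto the MJD of the first observation in the group:

```python
import pandas as pd

def group_mjd(df, window=8 / 24):
    # Start a new group whenever the gap to the previous observation
    # exceeds the window (8 hours expressed in days)
    df = df.sort_values("MJD").reset_index(drop=True)
    new_group = df["MJD"].diff().fillna(0) > window
    df["MJD"] = df.groupby(new_group.cumsum())["MJD"].transform("first")
    return df

df = pd.DataFrame({"MJD": [55000.0, 55000.2, 55001.0, 55001.1]})
out = group_mjd(df)
# First two observations share MJD 55000.0; last two share 55001.0
```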

supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]

Pivot a batch of pre-processed files
  • Use multiprocessing/threading to speed up data processing

  • Pivot every file in list_files and cache the result with pickle

Parameters:
  • list_files (list) – list of .pickle files containing pre-processed data

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.make_dataset(settings)[source]

Main function for data processing

  • Create the train test val splits

  • Preprocess all the FITS data, then pivot

  • Save all of the processed data to a single HDF5 database

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

Data utilities

class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)

Bases: tuple

arr_mean

Alias for field number 1

arr_min

Alias for field number 0

arr_std

Alias for field number 2
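A minimal sketch of how such a namedtuple is defined and used; the field order matches the aliases above (arr_min is field 0, arr_mean is field 1, arr_std is field 2):

```python
from collections import namedtuple

# Field order matches the field-number aliases listed above
LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])

stats = LogStandardized(arr_min=-3.0, arr_mean=0.5, arr_std=1.2)
# Fields are accessible by name (stats.arr_min) or by position (stats[0])
```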

supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path)[source]

Load a FIT file and cast it to a pandas DataFrame

Parameters:

fit_file_path (str) – path to FIT file

Returns:

(pandas.DataFrame) dataframe loaded from the FIT file

supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]

Match the target class (an integer in {0, …, 6}) to the name of the class, i.e. something like “SN Ia” or “SN CC”

Parameters:
  • target (int) – specifies the classification target

  • settings (ExperimentSettings) – custom class to hold hyperparameters

  • simplify (bool) – if True, do not show all classes

Returns:

(str) the name of the class

supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]

Create classes based on a type column

Depending on the number of classes (2 or all), we create distinct target columns

Parameters:
  • df (pandas.DataFrame) – the input dataframe

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • type_column (str) – the type column in df

Returns:

(pandas.DataFrame) the dataframe, with new target columns
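A hedged sketch of the binary (2-class) case; the type codes, column names, and helper name are assumptions for illustration, not the actual mapping used by the library:

```python
import pandas as pd

# Hypothetical type codes: assume code 101 marks SN Ia, all others are non-Ia
SNIA_CODES = {101}

def tag_type_binary(df, type_column="TYPE"):
    # Binary target column: 0 for SN Ia, 1 for everything else
    df = df.copy()
    df["target_2classes"] = (~df[type_column].isin(SNIA_CODES)).astype(int)
    return df

df = pd.DataFrame({"TYPE": [101, 120, 101, 132]})
out = tag_type_binary(df)
# target_2classes → [0, 1, 0, 1]
```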

supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]

Load the FITOPT file as a pandas dataframe

Pickle it for future use (it is faster to load as a pickled dataframe)

Parameters:
  • settings (ExperimentSettings) – controls experiment hyperparameters

  • verbose (bool) – whether to display logging message. Default: True

Returns:

(pandas.DataFrame) dataframe with FITOPT data

supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]

Read the HEAD FIT file, add target columns and return in pandas DataFrame format

Parameters:
  • file_path (str) – the path to the header FIT file

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]

Read the HEAD csv file, add target columns and return in pandas DataFrame format

Parameters:
  • file_path (str) – the path to the header csv file

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.add_redshift_features(settings, df)[source]

Add redshift features to pandas dataframe.

Parameters:
  • settings (ExperimentSettings) – controls experiment hyperparameters

  • df (pandas.DataFrame) – pandas DataFrame with FIT data

Returns:

(pandas.DataFrame) the dataframe, possibly with added redshift features

supernnova.utils.data_utils.compute_delta_time(df)[source]

Compute the delta time between two consecutive observations

Parameters:

df (pandas.DataFrame) – dataframe holding lightcurve data

Returns:

(pandas.DataFrame) dataframe holding lightcurve data with delta_time features
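A minimal sketch of this computation, with column names assumed for illustration: the delta time is the per-lightcurve difference between consecutive MJD values, with the first observation of each lightcurve set to 0:

```python
import pandas as pd

def add_delta_time(df):
    # Delta time between consecutive observations of the same lightcurve;
    # the first observation of each lightcurve gets delta_time = 0
    df = df.sort_values(["SNID", "MJD"]).reset_index(drop=True)
    df["delta_time"] = df.groupby("SNID")["MJD"].diff().fillna(0)
    return df

df = pd.DataFrame({
    "SNID": [1, 1, 1, 2, 2],
    "MJD": [55000.0, 55002.0, 55005.0, 55010.0, 55011.5],
})
out = add_delta_time(df)
# delta_time → [0.0, 2.0, 3.0, 0.0, 1.5]
```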

supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]

Remove rows in the same light curve after a gap > 150 days. Reason: if no signal has been recorded in a time frame of 150 days, it is unlikely there is much left afterwards

Parameters:

df (pandas.DataFrame) – dataframe holding lightcurve data

Returns:

(pandas.DataFrame) dataframe where large delta time rows have been removed
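The truncation can be sketched as follows, assuming a `delta_time` column like the one produced above (column names and the helper are assumptions): once a gap larger than 150 days appears in a lightcurve, that row and every later row of the same lightcurve are dropped:

```python
import pandas as pd

def trim_after_gap(df, max_gap=150):
    # Flag each row from the first gap > max_gap days within its lightcurve;
    # the int cast makes the grouped cummax robust across pandas versions
    gap = df["delta_time"] > max_gap
    after_gap = gap.astype(int).groupby(df["SNID"]).cummax().astype(bool)
    return df[~after_gap]

df = pd.DataFrame({
    "SNID": [1, 1, 1, 1],
    "delta_time": [0.0, 10.0, 200.0, 5.0],
})
out = trim_after_gap(df)
# Rows from the 200-day gap onward are removed → 2 rows remain
```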

supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]

Load physical information related to the created database of lightcurves

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

Returns:

(pandas.DataFrame) dataframe holding physics information about the dataset

supernnova.utils.data_utils.log_standardization(arr)[source]

Normalization strategy for the fluxes and flux errors

  • Log transform the data

  • Mean and std dev normalization

Parameters:

arr (np.array) – data to normalize

Returns:

(LogStandardized) namedtuple holding normalization data
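The two bullets above can be sketched like this; the exact log transform (in particular the offset that keeps negative fluxes finite) is an assumption for illustration:

```python
import numpy as np

def log_standardize(arr):
    # Shift by the minimum before the log so negative fluxes stay finite
    # (an assumed offset scheme), then apply mean/std normalization
    arr_min = arr.min()
    arr_log = np.log(arr - arr_min + 1e-5)
    arr_mean, arr_std = arr_log.mean(), arr_log.std()
    normalized = (arr_log - arr_mean) / arr_std
    return normalized, (arr_min, arr_mean, arr_std)

arr = np.array([-5.0, 0.0, 10.0, 100.0])
normed, stats = log_standardize(arr)
# The normalized array has mean ~0 and standard deviation ~1
```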

supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]

Save the processed dataframe to HDF5

Parameters:
  • settings (ExperimentSettings) – controls experiment hyperparameters

  • df (pandas.DataFrame) – dataframe holding processed data