Data documentation

Dataset construction

supernnova.data.make_dataset.build_traintestval_splits(settings)[source]

Build dataset split in the following way

  • Downsample each class so that it has the same cardinality as the lowest cardinality class

  • Randomly assign light curves to an 80/10/10 train/test/val split (except out-of-distribution data, which is split 1/1/98)

OOD: the complete out-of-distribution sample is used for testing, and this does not require settings.

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
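
The two steps listed above (class balancing, then a random 80/10/10 assignment) can be illustrated with a minimal, standard-library sketch. It is not the SuperNNova implementation: the function name `balanced_split`, the `labels` dictionary and the choice to split within each class are assumptions made for illustration, and the OOD 1/1/98 case is not covered.

```python
import random
from collections import defaultdict


def balanced_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Downsample every class to the smallest class size, then split.

    labels: dict mapping a light-curve ID to its class label (hypothetical input).
    Returns a dict mapping each kept ID to "train", "test" or "val".
    """
    rng = random.Random(seed)

    # Group light-curve IDs by class.
    by_class = defaultdict(list)
    for snid, label in labels.items():
        by_class[label].append(snid)

    # Every class is cut down to the size of the rarest class.
    n_keep = min(len(ids) for ids in by_class.values())

    assignment = {}
    for ids in by_class.values():
        # Random subset of the class, in random order.
        kept = rng.sample(sorted(ids), n_keep)
        n_train = int(fractions[0] * n_keep)
        n_test = int(fractions[1] * n_keep)
        for i, snid in enumerate(kept):
            if i < n_train:
                assignment[snid] = "train"
            elif i < n_train + n_test:
                assignment[snid] = "test"
            else:
                assignment[snid] = "val"
    return assignment
```

With 30 light curves of one class and 10 of another, this keeps 10 of each and assigns 8/1/1 per class.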

supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]

Carry out preprocessing on FITS file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

  • Join column from header files

  • Select columns that will be useful later on

  • Compute SNID to tag each light curve

  • Compute delta times between measures

  • Filter preprocessing

  • Removal of delimiter rows

Parameters:
  • file_path (str) – path to .FITS file

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]

Carry out preprocessing on csv file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

  • Compute delta times between measures

  • Filter preprocessing

Parameters:
  • file_path (str) – path to .csv file

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.preprocess_data(settings)[source]

Preprocess the FITS data

  • Use multiprocessing/threading to speed up data processing

  • Preprocess every FIT file in the raw data dir

  • Also save a DataFrame of Host Spe for publication plots

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]

Pivot the dataframe: observations that are close in time are grouped onto the same row, so that each row holds a value for each flux and flux error column

  • All observations within 8 hours of each other are assigned the same MJD

  • Results are cached with pickle

Parameters:
  • df (pandas.DataFrame) – dataframe containing pre-processed data

  • settings (ExperimentSettings) – controls experiment hyperparameters
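
To make the 8-hour grouping concrete, here is a minimal sketch of a pivot on plain tuples. It is an illustration under assumptions, not the library code: the grouping rule (a new row starts when the gap to the previous observation exceeds 8 hours), the use of the first observation's MJD for the row, and the column names `FLUXCAL_<band>` and `FLUXCALERR_<band>` are all hypothetical.

```python
EIGHT_HOURS_IN_DAYS = 8 / 24


def pivot_observations(observations, max_gap=EIGHT_HOURS_IN_DAYS):
    """Group observations that are close in time onto one row.

    observations: iterable of (mjd, band, flux, flux_err) tuples.
    Returns a list of dict rows, one per group of nearby observations.
    """
    rows = []
    previous_mjd = None
    for mjd, band, flux, flux_err in sorted(observations):
        # Start a new row when the gap to the previous observation is too large.
        if previous_mjd is None or mjd - previous_mjd > max_gap:
            rows.append({"MJD": mjd})
        previous_mjd = mjd
        # One flux and one flux-error column per band (hypothetical names).
        rows[-1][f"FLUXCAL_{band}"] = flux
        rows[-1][f"FLUXCALERR_{band}"] = flux_err
    return rows
```

Two observations taken 0.1 days apart in different bands end up on the same row, while an observation a day later starts a new row.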

supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]
  • Use multiprocessing/threading to speed up data processing

  • Pivot every file in list_files and cache the result with pickle

Parameters:
  • list_files (list) – list of .pickle files containing pre-processed data

  • settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.parse_sntypes_from_readme(raw_dir)[source]

Parse GENTYPE_TO_NAME block from .README files in raw_dir.

Looks for files matching *.README in raw_dir and extracts supernova type mappings from the GENTYPE_TO_NAME block. For each GENTYPE number N found, two entries are created: N and N+100 (the photo-ID convention used by SNANA simulations).

The expected block format is:

GENTYPE_TO_NAME:  # GENTYPE-integer (non)Ia transient-Name FITS-prefix
  1:   Ia       SALT3              SNIaMODEL00
  20:  nonIa    SNIIP              NONIaMODEL03

Column mapping (after splitting each data line on whitespace):

  • Column 1 – GENTYPE number (the key, e.g. 1:)

  • Column 2 – Ia / nonIa category

  • Column 3 – transient-Name (e.g. SNIIP)

For Ia types (column 2 == “Ia”) the type name is taken from column 2 directly (“Ia”). For non-Ia types the type name is taken from column 3 (the transient-Name, e.g. “SNIIP”).

Parameters:

raw_dir (str) – Path to the raw data directory.

Returns:

Parsed {sntype_number: type_name} mapping, or None when no README is found or the block is absent / empty.

Return type:

OrderedDict or None
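
The parsing rules described above can be sketched on a README string as follows. This is an illustration only: `parse_gentype_block` is a hypothetical helper that works on text rather than on files in `raw_dir`, keys are kept as integers for simplicity (the real function may use another key type), and the rule that the first non-data, non-comment line ends the block is an assumption.

```python
from collections import OrderedDict


def parse_gentype_block(text):
    """Extract {sntype_number: type_name} from README text, or None."""
    sntypes = OrderedDict()
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("GENTYPE_TO_NAME:"):
            in_block = True
            continue
        if not in_block:
            continue
        fields = stripped.split()
        # A data line starts with an integer key followed by a colon, e.g. "20:".
        if len(fields) < 3 or not fields[0].rstrip(":").isdigit():
            if stripped and not stripped.startswith("#"):
                break  # assumption: any other non-empty line ends the block
            continue
        gentype = int(fields[0].rstrip(":"))
        # Ia types are named "Ia"; non-Ia types use the transient name.
        name = "Ia" if fields[1] == "Ia" else fields[2]
        sntypes[gentype] = name
        sntypes[gentype + 100] = name  # photo-ID convention: N and N+100
    return sntypes or None
```

Applied to the two-line example block above, this yields entries for 1, 101, 20 and 120.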

supernnova.data.make_dataset.resolve_sntypes(settings)[source]

Resolve settings.sntypes when not explicitly provided by the user.

Priority order:

  1. Explicit --sntypes on CLI / config → already set, nothing to do.

  2. .README file in raw_dir → parse GENTYPE_TO_NAME block.

  3. Built-in DEFAULT_SNTYPES fallback.

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
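
The priority order above amounts to a simple fallback chain. The sketch below shows that logic on plain values. The function name and its three arguments are hypothetical stand-ins for `settings.sntypes`, the parsed README mapping and `DEFAULT_SNTYPES`.

```python
def resolve_sntypes_sketch(explicit, parsed_from_readme, default):
    """Return the first available mapping, in priority order."""
    if explicit:
        # 1. The user supplied a mapping: nothing to do.
        return explicit
    if parsed_from_readme:
        # 2. A README with a non-empty GENTYPE_TO_NAME block was found.
        return parsed_from_readme
    # 3. Fall back to the built-in defaults.
    return default
```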

supernnova.data.make_dataset.detect_contaminant_types(settings)[source]

Pre-scan raw data files to detect types not in settings.sntypes.

Any type found in the data but missing from settings.sntypes is automatically added as ‘contaminant’. This must run before any column-name computation so that target column names are consistent throughout the pipeline.

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
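
As a rough illustration of the idea, the sketch below adds every type seen in the data but absent from the mapping under the name 'contaminant'. It works on an in-memory list of type codes rather than on raw data files. `add_contaminants` and its arguments are hypothetical names.

```python
def add_contaminants(sntypes, types_in_data):
    """Return a copy of sntypes extended with unknown types from the data."""
    updated = dict(sntypes)  # keep the caller's mapping untouched
    for type_code in sorted(set(types_in_data)):
        if type_code not in updated:
            updated[type_code] = "contaminant"
    return updated
```

Known types keep their names and their original order. Unknown types are appended after them.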

supernnova.data.make_dataset.make_dataset(settings)[source]

Main function for data processing

  • Create the train test val splits

  • Preprocess all the FITS data, then pivot

  • Save all of the processed data to a single HDF5 database

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

Data utilities

class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)

Bases: tuple

arr_mean

Alias for field number 1

arr_min

Alias for field number 0

arr_std

Alias for field number 2

supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path, columns=None, multidim='drop')[source]

Load a FIT file and convert it to a pandas DataFrame.

FITS tables can contain vector / array-valued columns (TDIM > 1). astropy.Table.to_pandas raises ValueError on such columns because pandas DataFrames cannot natively hold n-dim arrays in a column. This helper handles them so the pipeline does not crash on otherwise valid FITS files.

Parameters:
  • fit_file_path (str) – path to FIT file

  • columns (list, optional) – if given, only these columns are kept before the pandas conversion. Acts as a whitelist that avoids loading multi-D columns we don’t care about into pandas in the first place. Names that are not present in the FITS table are silently ignored.

  • multidim (str) –

    strategy for multi-dimensional columns that survive the columns filter. One of:

    • "drop" (default): remove them with a yellow warning listing which columns were skipped.

    • "error": raise ValueError so the caller is forced to deal with them explicitly.

Returns:

(pandas.DataFrame) dataframe loaded from the FIT file

supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]

Match the target class (an integer in {0, …, 6}) to the name of the class, i.e. something like “SN Ia” or “SN CC”

Parameters:
  • target (int) – specifies the classification target

  • settings (ExperimentSettings) – custom class to hold hyperparameters

  • simplify (Boolean) – if True do not show all classes

Returns:

(str) the name of the class

supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]

Create classes based on a type column

Depending on the number of classes (2 or all), we create distinct target columns

Parameters:
  • df (pandas.DataFrame) – the input dataframe

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • type_column (str) – the type column in df

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]

Load the FITOPT file as a pandas dataframe

Pickle it for future use (it is faster to load as a pickled dataframe)

Parameters:
  • settings (ExperimentSettings) – controls experiment hyperparameters

  • verbose (bool) – whether to display logging message. Default: True

Returns:

(pandas.DataFrame) dataframe with FITOPT data

supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]

Read the HEAD FIT file, add target columns and return in pandas DataFrame format

Parameters:
  • file_path (str) – the path to the header FIT file

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]

Read the HEAD csv file, add target columns and return in pandas DataFrame format

Parameters:
  • file_path (str) – the path to the header csv file

  • settings (ExperimentSettings) – controls experiment hyperparameters

  • columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.compute_delta_time(df)[source]

Compute the delta time between two consecutive observations

Parameters:

df (pandas.DataFrame) – dataframe holding lightcurve data

Returns:

(pandas.DataFrame) dataframe holding lightcurve data with delta_time features
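
A minimal sketch of the delta-time computation on plain lists follows. It assumes that rows are already sorted by time within each light curve and that the first observation of a light curve gets a delta of 0. Both are assumptions, and `delta_times` is a hypothetical helper rather than the library function.

```python
def delta_times(snids, mjds):
    """Time elapsed since the previous observation of the same light curve."""
    deltas = []
    previous = {}  # last MJD seen for each light-curve ID
    for snid, mjd in zip(snids, mjds):
        # First observation of a light curve: no previous point, so delta is 0.
        deltas.append(mjd - previous[snid] if snid in previous else 0.0)
        previous[snid] = mjd
    return deltas
```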

supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]

Remove rows in the same light curve after a gap > 150 days.

Reason: if no signal has been saved in a time frame of 150 days, it is unlikely there is much left afterwards.

Parameters:

df (pandas.DataFrame) – dataframe holding lightcurve data

Returns:

(pandas.DataFrame) dataframe where large delta time rows have been removed
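
The truncation rule can be sketched for a single light curve as below. `truncate_after_gap` is a hypothetical helper operating on a list of delta times. Treating the row that carries the large delta as the first removed row is an assumption about how "after a gap" is interpreted.

```python
def truncate_after_gap(delta_times, max_gap=150.0):
    """Return one keep/drop flag per row of a single light curve."""
    keep = []
    gap_seen = False
    for delta in delta_times:
        # Once a gap larger than max_gap occurs, drop this row and all later ones.
        if delta > max_gap:
            gap_seen = True
        keep.append(not gap_seen)
    return keep
```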

supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]

Load physical information related to the created database of lightcurves

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters

Returns:

(pandas.DataFrame) dataframe holding physics information about the dataset

supernnova.utils.data_utils.log_standardization(arr)[source]

Normalization strategy for the fluxes and flux errors

  • Log transform the data

  • Mean and std dev normalization

Parameters:

arr (np.array) – data to normalize

Returns:

(LogStandardized) namedtuple holding normalization data
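
The two steps (log transform, then mean and standard deviation normalization) can be illustrated with the standard library. This is a sketch under assumptions: the data are shifted by their own minimum plus a small `eps` before the log so that the argument stays positive, which may differ from the shift used by the library, and `apply_log_standardization` is a hypothetical companion helper. The field order of the namedtuple matches the `LogStandardized` class documented above.

```python
import math
from collections import namedtuple
from statistics import mean, pstdev

LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])


def log_standardization_sketch(values, eps=1e-5):
    """Compute the statistics needed to log-standardize `values`."""
    arr_min = min(values)
    # Shift so that every value is strictly positive before taking the log.
    logged = [math.log(v - arr_min + eps) for v in values]
    return LogStandardized(arr_min, mean(logged), pstdev(logged))


def apply_log_standardization(values, stats, eps=1e-5):
    """Log-transform, then subtract the mean and divide by the std dev."""
    return [
        (math.log(v - stats.arr_min + eps) - stats.arr_mean) / stats.arr_std
        for v in values
    ]
```

Storing the statistics in a namedtuple lets the same transformation be applied to other arrays later on.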

supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]

Save processed dataframe to HDF5

Parameters:
  • settings (ExperimentSettings) – controls experiment hyperparameters

  • df (pandas.DataFrame) – dataframe holding processed data