Data documentation

Dataset construction

supernnova.data.make_dataset.build_traintestval_splits(settings)[source]

Build the dataset splits as follows:

Downsample each class so that it has the same cardinality as the lowest-cardinality class
Randomly assign lightcurves to an 80/10/10 train/test/val split (except out-of-distribution data, which is split 1/1/98)

OOD: uses the complete sample for testing; does not require settings.

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters
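
The two steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the SuperNNova implementation: the function name balanced_split, the labels dictionary, and the per-class assignment are assumptions made for the example.

```python
import random
from collections import defaultdict


def balanced_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Downsample every class to the smallest class size, then assign
    each kept light curve to train / test / val.

    labels: dict mapping a light-curve ID to its class label (hypothetical input).
    fractions: (train, test, val) shares, 80/10/10 by default.
    Returns a dict mapping light-curve ID -> "train" | "test" | "val".
    """
    rng = random.Random(seed)

    # Group the light-curve IDs by class
    by_class = defaultdict(list)
    for snid, label in sorted(labels.items()):
        by_class[label].append(snid)

    # Every class is reduced to the size of the rarest class
    n_keep = min(len(ids) for ids in by_class.values())
    n_train = int(round(fractions[0] * n_keep))
    n_test = int(round(fractions[1] * n_keep))

    assignment = {}
    for ids in by_class.values():
        # rng.sample both downsamples and shuffles the IDs
        kept = rng.sample(ids, n_keep)
        for i, snid in enumerate(kept):
            if i < n_train:
                assignment[snid] = "train"
            elif i < n_train + n_test:
                assignment[snid] = "test"
            else:
                assignment[snid] = "val"
    return assignment
```

With 30 light curves of one class and 20 of another, each class is reduced to 20 and split 16/2/2.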

supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]

Carry out preprocessing on FITS file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

Join columns from header files
Select columns that will be useful later on
Compute SNID to tag each light curve
Compute delta times between measures
Filter preprocessing
Removal of delimiter rows

Parameters:

file_path (str) – path to .FITS file
settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]

Carry out preprocessing on csv file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.

Compute delta times between measures
Filter preprocessing

Parameters:

file_path (str) – path to .csv file
settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.preprocess_data(settings)[source]

Preprocess the FITS data

Use multiprocessing/threading to speed up data processing
Preprocess every FITS file in the raw data dir
Also save a DataFrame of Host Spe for publication plots

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]

Pivot the dataframe: observations that are close in time are grouped on the same row, so that each row holds a value for each flux and flux error column.

All observations within 8 hours of each other are assigned the same MJD
Results are cached with pickle

Parameters:

filename (str) – path to a .pickle file containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters
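
The grouping rule can be illustrated with a short sketch. It is an assumption-laden illustration, not the library code: the function name pivot_observations, the tuple layout, the FLUXCAL_<band> / FLUXCALERR_<band> column names, and the choice to measure the 8-hour window from the first observation of each row are all made up for this example.

```python
EIGHT_HOURS_IN_DAYS = 8.0 / 24.0


def pivot_observations(observations, window=EIGHT_HOURS_IN_DAYS):
    """Group observations of one light curve that are close in time.

    observations: list of (mjd, band, flux, flux_err) tuples.
    Returns a list of rows; each row is a dict with the shared "MJD" and
    one flux / flux error entry per band observed in that window.
    """
    rows = []
    group_start = None
    for mjd, band, flux, flux_err in sorted(observations):
        # Open a new row when this observation falls outside the window
        # that started with the first observation of the current row.
        if group_start is None or mjd - group_start > window:
            group_start = mjd
            rows.append({"MJD": group_start})
        rows[-1]["FLUXCAL_" + band] = flux
        rows[-1]["FLUXCALERR_" + band] = flux_err
    return rows
```

Two observations taken 0.1 days apart in different bands end up on one row that shares the earlier MJD, while an observation half a day later starts a new row.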

supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]

Use multiprocessing/threading to speed up data processing
Pivot every file in list_files and cache the result with pickle

Parameters:

list_files (list) – list of .pickle files containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters

supernnova.data.make_dataset.parse_sntypes_from_readme(raw_dir)[source]

Parse GENTYPE_TO_NAME block from .README files in raw_dir.

Looks for files matching *.README in raw_dir and extracts supernova type mappings from the GENTYPE_TO_NAME block. For each GENTYPE number N found, two entries are created: N and N+100 (the photo-ID convention used by SNANA simulations).

The expected block format is:

GENTYPE_TO_NAME:  # GENTYPE-integer (non)Ia transient-Name FITS-prefix
  1:   Ia       SALT3              SNIaMODEL00
  20:  nonIa    SNIIP              NONIaMODEL03

Column mapping (after splitting each data line on whitespace):

Column 1 – GENTYPE number (the key, e.g. 1:)
Column 2 – Ia / nonIa category
Column 3 – transient-Name (e.g. SNIIP)

For Ia types (column 2 == “Ia”) the type name is taken from column 2 directly (“Ia”). For non-Ia types the type name is taken from column 3 (the transient-Name, e.g. “SNIIP”).

Parameters: raw_dir (str) – Path to the raw data directory.
Returns: Parsed {sntype_number: type_name} mapping, or None when no README is found or the block is absent / empty.
Return type: OrderedDict or None
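
The column mapping described above can be sketched on a README string. This is an illustrative parser under stated assumptions, not the function's actual source: the name parse_gentype_block is hypothetical, keys are assumed to be integers, and the block is assumed to end at the first line that is not of the form "N: category name ...".

```python
import re
from collections import OrderedDict


def parse_gentype_block(readme_text):
    """Parse a GENTYPE_TO_NAME block from the text of a README file.

    Returns an OrderedDict {sntype_number: type_name}, or None when the
    block is absent or empty.
    """
    sntypes = OrderedDict()
    in_block = False
    for line in readme_text.splitlines():
        if line.strip().startswith("GENTYPE_TO_NAME:"):
            in_block = True
            continue
        if not in_block:
            continue
        fields = line.split()
        # A data line looks like "20:  nonIa  SNIIP  NONIaMODEL03".
        # Assumption: the block ends at the first line that is not one.
        if len(fields) < 3 or not re.fullmatch(r"\d+:", fields[0]):
            break
        gentype = int(fields[0].rstrip(":"))
        # Ia types take their name from column 2, non-Ia from column 3
        name = "Ia" if fields[1] == "Ia" else fields[2]
        # Register both N and N + 100 (photo-ID convention)
        sntypes[gentype] = name
        sntypes[gentype + 100] = name
    return sntypes or None
```

Applied to the two example lines shown earlier, the sketch yields entries for 1, 101, 20 and 120.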

supernnova.data.make_dataset.resolve_sntypes(settings)[source]

Resolve settings.sntypes when not explicitly provided by the user.

Priority order:

Explicit --sntypes on CLI / config → already set, nothing to do.
.README file in raw_dir → parse GENTYPE_TO_NAME block.
Built-in DEFAULT_SNTYPES fallback.

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters
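
The priority order amounts to a simple fallback chain, sketched below. The function name choose_sntypes, its arguments, and the placeholder default mapping are hypothetical; the real DEFAULT_SNTYPES is defined by SuperNNova and is not reproduced here.

```python
# Placeholder standing in for the built-in fallback (NOT the real DEFAULT_SNTYPES)
PLACEHOLDER_DEFAULT_SNTYPES = {0: "placeholder"}


def choose_sntypes(explicit, parsed_from_readme, default=PLACEHOLDER_DEFAULT_SNTYPES):
    """Return (mapping, origin) following the documented priority order."""
    if explicit:
        # 1. Explicitly provided by the user: nothing to do
        return explicit, "explicit"
    if parsed_from_readme:
        # 2. Mapping parsed from a .README file in raw_dir
        return parsed_from_readme, "readme"
    # 3. Built-in fallback
    return default, "default"
```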

supernnova.data.make_dataset.detect_contaminant_types(settings)[source]

Pre-scan raw data files to detect types not in settings.sntypes.

Any type found in the data but missing from settings.sntypes is automatically added as ‘contaminant’. This must run before any column-name computation so that target column names are consistent throughout the pipeline.

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters
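
The effect on the type mapping can be shown with a small sketch. The helper add_contaminants and its inputs are hypothetical; how the real function scans the raw files is not shown in this page.

```python
def add_contaminants(sntypes, types_in_data):
    """Return a copy of sntypes in which every type present in the data
    but missing from the mapping is labelled 'contaminant'."""
    completed = dict(sntypes)
    for sntype in sorted(set(types_in_data)):
        # setdefault leaves types that are already mapped untouched
        completed.setdefault(sntype, "contaminant")
    return completed
```

For a mapping that only knows type 101 and data that also contains types 120 and 132, the two unknown types are added as 'contaminant' and the entry for 101 is unchanged.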

supernnova.data.make_dataset.make_dataset(settings)[source]

Main function for data processing

Create the train test val splits
Preprocess all the FITS data, then pivot
Save all of the processed data to a single HDF5 database

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters

Data utilities

class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)

Bases: tuple

arr_mean: Alias for field number 1

arr_min: Alias for field number 0

arr_std: Alias for field number 2

supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path, columns=None, multidim='drop')[source]

Load a FITS file and convert it to a pandas DataFrame.

FITS tables can contain vector / array-valued columns (TDIM > 1). astropy.table.Table.to_pandas raises ValueError on such columns because pandas DataFrames cannot natively hold n-dim arrays in a column. This helper handles them so the pipeline does not crash on otherwise valid FITS files.

Parameters:

fit_file_path (str) – path to FIT file
columns (list, optional) – if given, only these columns are kept before the pandas conversion. Acts as a whitelist that avoids loading multi-D columns we don’t care about into pandas in the first place. Names that are not present in the FITS table are silently ignored.
multidim (str) – strategy for multi-dimensional columns that survive the columns filter. One of:
- "drop" (default): remove them with a yellow warning listing which columns were skipped.
- "error": raise ValueError so the caller is forced to deal with them explicitly.

Returns:

(pandas.DataFrame) dataframe loaded from the FITS file

supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]

Match the target class (an integer in {0, …, 6}) to the name of the class, e.g. “SN Ia” or “SN CC”.

Parameters:

target (int) – specifies the classification target
settings (ExperimentSettings) – custom class to hold hyperparameters
simplify (bool) – if True, do not show all classes

Returns:

(str) the name of the class

supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]

Create classes based on a type column

Depending on the number of classes (2 or all), we create distinct target columns

Parameters:

df (pandas.DataFrame) – the input dataframe
settings (ExperimentSettings) – controls experiment hyperparameters
type_column (str) – the type column in df

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]

Load the FITOPT file as a pandas dataframe

Pickle it for future use (it is faster to load as a pickled dataframe)

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
verbose (bool) – whether to display logging messages. Default: True

Returns:

(pandas.DataFrame) dataframe with FITOPT data

supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]

Read the HEAD FIT file, add target columns and return in pandas DataFrame format

Parameters:

file_path (str) – the path to the header FIT file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]

Read the HEAD csv file, add target columns and return in pandas DataFrame format

Parameters:

file_path (str) – the path to the header csv file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default: None

Returns:

(pandas.DataFrame) the dataframe, with new target columns

supernnova.utils.data_utils.compute_delta_time(df)[source]

Compute the delta time between two consecutive observations

Parameters: df (pandas.DataFrame) – dataframe holding lightcurve data
Returns: (pandas.DataFrame) dataframe holding lightcurve data with delta_time features
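
The computation can be illustrated without pandas. The sketch below works on plain dictionaries rather than a DataFrame, and the name add_delta_time, the "SNID" / "MJD" keys, and the value 0.0 for the first observation of a light curve are assumptions made for the example.

```python
def add_delta_time(rows):
    """Add the time elapsed since the previous observation of the same light curve.

    rows: list of dicts with "SNID" and "MJD", ordered in time within
    each light curve. Returns new dicts with a "delta_time" entry.
    """
    out = []
    previous = {}  # last MJD seen for each SNID
    for row in rows:
        snid, mjd = row["SNID"], row["MJD"]
        # Assumption: the first observation of a light curve gets 0.0
        delta = mjd - previous[snid] if snid in previous else 0.0
        previous[snid] = mjd
        out.append({**row, "delta_time": delta})
    return out
```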

supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]

Remove rows in the same light curve after a gap > 150 days. Reason: if no signal has been saved in a time frame of 150 days, it is unlikely there is much left afterwards.

Parameters: df (pandas.DataFrame) – dataframe holding lightcurve data
Returns: (pandas.DataFrame) dataframe where large delta time rows have been removed
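
A dependency-free sketch of the truncation rule follows. It assumes each row already carries a delta_time value (time since the previous observation of the same light curve); the function name truncate_after_large_gap and the dictionary layout are hypothetical.

```python
MAX_GAP_DAYS = 150.0


def truncate_after_large_gap(rows, max_gap=MAX_GAP_DAYS):
    """Drop every observation of a light curve that comes after a large gap.

    rows: list of dicts with "SNID" and "delta_time", ordered in time
    within each light curve.
    """
    kept = []
    truncated = set()  # light curves in which a large gap was already seen
    for row in rows:
        snid = row["SNID"]
        if snid in truncated:
            continue
        if row["delta_time"] > max_gap:
            # This row and all later rows of the same light curve are removed
            truncated.add(snid)
            continue
        kept.append(row)
    return kept
```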

supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]

Load physical information related to the created database of lightcurves

Parameters: settings (ExperimentSettings) – controls experiment hyperparameters
Returns: (pandas.DataFrame) dataframe holding physics information about the dataset

supernnova.utils.data_utils.log_standardization(arr)[source]

Normalization strategy for the fluxes and flux errors

Log transform the data
Mean and std dev normalization

Parameters: arr (np.array) – data to normalize
Returns: (LogStandardized) namedtuple holding normalization data
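
The two steps can be sketched in plain Python. The namedtuple fields match the LogStandardized class documented above, but the rest is an assumption: shifting by the array minimum plus a small eps before taking the log, the function names, and the use of lists instead of NumPy arrays are illustrative choices, not the library's confirmed behaviour.

```python
import math
from collections import namedtuple

LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])


def log_standardize(values, eps=1e-5):
    """Compute the statistics needed for log standardization."""
    arr_min = min(values)
    # Shift so that every value is strictly positive before taking the log
    logged = [math.log(v - arr_min + eps) for v in values]
    arr_mean = sum(logged) / len(logged)
    arr_std = math.sqrt(sum((x - arr_mean) ** 2 for x in logged) / len(logged))
    return LogStandardized(arr_min, arr_mean, arr_std)


def apply_log_standardization(values, stats, eps=1e-5):
    """Log transform, then subtract the mean and divide by the std dev."""
    return [
        (math.log(v - stats.arr_min + eps) - stats.arr_mean) / stats.arr_std
        for v in values
    ]
```

After applying the statistics to the data they were computed from, the transformed values have zero mean and unit variance up to floating-point error.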

supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]

Save processed dataframe to HDF5

Parameters:

settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe holding processed data