Data documentation
Dataset construction
- supernnova.data.make_dataset.build_traintestval_splits(settings)[source]
Build dataset split in the following way
Downsample each class so that it has the same cardinality as the lowest cardinality class
Randomly assign lightcurves to a 80/10/10 train test val split (except Out-of-distribution data 1/1/98)
- OOD:
Will use the complete sample for testing, does not require settings.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
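The two steps above (balance the classes, then split at random) can be sketched as follows. This is a minimal illustration with pandas, not the SuperNNova implementation: the helper name `balanced_split`, the column name `target`, and the fixed seed are assumptions made for the example, and the out-of-distribution case is not covered.

```python
import numpy as np
import pandas as pd


def balanced_split(df, target_col="target", fractions=(0.8, 0.1, 0.1), seed=0):
    """Downsample every class to the smallest class size, then split at random.

    Hypothetical helper: names and defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n_min = df[target_col].value_counts().min()

    # Keep n_min randomly chosen rows per class
    balanced = pd.concat(
        [
            group.sample(n=n_min, random_state=int(rng.integers(1 << 31)))
            for _, group in df.groupby(target_col)
        ]
    )

    # Shuffle, then cut into train / test / val according to `fractions`
    shuffled = balanced.sample(frac=1.0, random_state=int(rng.integers(1 << 31)))
    n_train = int(fractions[0] * len(shuffled))
    n_test = int(fractions[1] * len(shuffled))
    return {
        "train": shuffled.iloc[:n_train],
        "test": shuffled.iloc[n_train:n_train + n_test],
        "val": shuffled.iloc[n_train + n_test:],
    }


# Toy data: 100 light curves of class 0 and 50 of class 1
df = pd.DataFrame({"SNID": range(150), "target": [0] * 100 + [1] * 50})
splits = balanced_split(df)
```

Downsampling before splitting keeps the class balance approximately equal in all three subsets, at the cost of discarding light curves from the larger classes.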
- supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]
Carry out preprocessing on FITS file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.
Join column from header files
Select columns that will be useful later on
Compute SNID to tag each light curve
Compute delta times between measures
Filter preprocessing
Removal of delimiter rows
- Parameters:
file_path (str) – path to .FITS file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]
Carry out preprocessing on csv file and save results to pickle. Pickle is preferred to csv as it is faster to read and write.
Compute delta times between measures
Filter preprocessing
- Parameters:
file_path (str) – path to .csv file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.preprocess_data(settings)[source]
Preprocess the FITS data
Use multiprocessing/threading to speed up data processing
Preprocess every FIT file in the raw data dir
Also save a DataFrame of Host Spe for publication plots
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]
Carry out pivot: we will group time-wise close observations on the same row and each row in the dataframe will show a value for each of the flux and flux error column
All observations within 8 hours of each other are assigned the same MJD
Results are cached with pickle
- Parameters:
filename (str) – path to a .pickle file containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters
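The grouping and pivot can be illustrated on a single light curve. The sketch below is one plausible reading of the description, not the library code: it assumes SNANA-style column names (MJD, FLT, FLUXCAL, FLUXCALERR), the helper name `pivot_light_curve` is hypothetical, and a new group is started whenever two consecutive observations are more than 8 hours apart.

```python
import pandas as pd

EIGHT_HOURS_IN_DAYS = 8 / 24


def pivot_light_curve(df):
    """Group observations that are close in time and pivot to one row per group.

    Hypothetical sketch for one light curve with assumed columns
    MJD, FLT, FLUXCAL and FLUXCALERR.
    """
    df = df.sort_values("MJD").reset_index(drop=True)

    # Start a new group whenever the gap to the previous observation exceeds 8 hours
    new_group = df["MJD"].diff().fillna(0) > EIGHT_HOURS_IN_DAYS
    df["group"] = new_group.cumsum()

    # Every observation in a group takes the MJD of the group's first observation
    df["MJD"] = df.groupby("group")["MJD"].transform("first")

    # One row per group, one column per (quantity, filter) pair
    wide = df.pivot_table(index="MJD", columns="FLT", values=["FLUXCAL", "FLUXCALERR"])
    wide.columns = [f"{quantity}_{flt}" for quantity, flt in wide.columns]
    return wide.reset_index()


obs = pd.DataFrame(
    {
        "MJD": [100.00, 100.10, 101.00, 101.05],
        "FLT": ["g", "r", "g", "r"],
        "FLUXCAL": [1.0, 2.0, 3.0, 4.0],
        "FLUXCALERR": [0.1, 0.2, 0.3, 0.4],
    }
)
pivoted = pivot_light_curve(obs)
```

Here the four observations collapse into two rows (one per night), each with a flux and flux error column for filters g and r. Filters that were not observed in a group would show up as NaN.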
- supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]
Use multiprocessing/threading to speed up data processing
Pivot every file in list_files and cache the result with pickle
- Parameters:
list_files (list) – list of .pickle files containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.parse_sntypes_from_readme(raw_dir)[source]
Parse GENTYPE_TO_NAME block from .README files in raw_dir.
Looks for files matching *.README in raw_dir and extracts supernova type mappings from the GENTYPE_TO_NAME block. For each GENTYPE number N found, two entries are created: N and N+100 (the photo-ID convention used by SNANA simulations).
The expected block format is:
GENTYPE_TO_NAME: # GENTYPE-integer (non)Ia transient-Name FITS-prefix
  1: Ia SALT3 SNIaMODEL00
  20: nonIa SNIIP NONIaMODEL03
Column mapping (after splitting each data line on whitespace):
Column 1 – GENTYPE number (the key, e.g. 1:)
Column 2 – Ia / nonIa category
Column 3 – transient-Name (e.g. SNIIP)
For Ia types (column 2 == “Ia”) the type name is taken from column 2 directly (“Ia”). For non-Ia types the type name is taken from column 3 (the transient-Name, e.g. “SNIIP”).
- Parameters:
raw_dir (str) – Path to the raw data directory.
- Returns:
Parsed {sntype_number: type_name} mapping, or None when no README is found or the block is absent / empty.
- Return type:
OrderedDict or None
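The parsing rules described above can be sketched on a README string. This is an illustration only: the helper `parse_gentype_block` is hypothetical, it works on text rather than on files in raw_dir, and the rule used to detect the end of the block (the first line that does not start with an integer key) is an assumption.

```python
from collections import OrderedDict


def parse_gentype_block(text):
    """Extract a {sntype_number: type_name} mapping from README text.

    Hypothetical helper; returns None when the block is absent or empty.
    """
    sntypes = OrderedDict()
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("GENTYPE_TO_NAME:"):
            in_block = True
            continue
        if not in_block or not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        key = fields[0].rstrip(":")
        # Assumed end-of-block rule: stop at the first line without an integer key
        if len(fields) < 3 or not key.isdigit():
            break
        gentype = int(key)
        # Ia types are named from column 2, non-Ia types from column 3
        name = "Ia" if fields[1] == "Ia" else fields[2]
        sntypes[gentype] = name
        sntypes[gentype + 100] = name  # photo-ID convention: N and N+100
    return sntypes or None


readme_text = """\
DOCUMENTATION:
  GENTYPE_TO_NAME:   # GENTYPE-integer (non)Ia transient-Name FITS-prefix
    1: Ia SALT3 SNIaMODEL00
    20: nonIa SNIIP NONIaMODEL03
  OTHER_KEY: value
"""
parsed = parse_gentype_block(readme_text)
```

For the two data lines shown, the mapping has four entries: 1 and 101 map to "Ia", and 20 and 120 map to "SNIIP".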
- supernnova.data.make_dataset.resolve_sntypes(settings)[source]
Resolve settings.sntypes when not explicitly provided by the user.
Priority order:
Explicit --sntypes on CLI / config → already set, nothing to do.
.README file in raw_dir → parse GENTYPE_TO_NAME block.
Built-in DEFAULT_SNTYPES fallback.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
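The priority order amounts to a simple fallback chain. The sketch below shows the idea with plain dictionaries; the function name `resolve_with_fallback` and the placeholder default mapping are hypothetical and do not reflect the actual contents of DEFAULT_SNTYPES.

```python
def resolve_with_fallback(explicit, parsed_from_readme, default):
    """Return the first available mapping together with the name of its source."""
    if explicit:
        return explicit, "explicit"
    if parsed_from_readme:
        return parsed_from_readme, "readme"
    return default, "default"


# Placeholder values for illustration only
placeholder_default = {101: "Ia", 120: "IIP"}
from_readme = {1: "Ia", 101: "Ia"}

chosen, source = resolve_with_fallback(None, from_readme, placeholder_default)
```

With no explicit mapping, the README mapping wins. If the README mapping were also None, the default would be returned.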
- supernnova.data.make_dataset.detect_contaminant_types(settings)[source]
Pre-scan raw data files to detect types not in settings.sntypes.
Any type found in the data but missing from settings.sntypes is automatically added as ‘contaminant’. This must run before any column-name computation so that target column names are consistent throughout the pipeline.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
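The core of the contaminant detection is a set difference between the types seen in the data and the configured types. A minimal sketch, assuming the types found in the raw files have already been collected into a list (the helper `add_contaminants` is hypothetical and does not read any files):

```python
def add_contaminants(sntypes, types_in_data):
    """Return a copy of sntypes where unknown types are labelled 'contaminant'."""
    completed = dict(sntypes)
    for sntype in sorted(set(types_in_data)):
        # setdefault leaves already configured types untouched
        completed.setdefault(sntype, "contaminant")
    return completed


configured = {101: "Ia", 120: "SNIIP"}
completed = add_contaminants(configured, [101, 120, 135, 135])
```

Because the number of classes determines the target column names, completing the mapping once up front keeps those names stable for every later step.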
- supernnova.data.make_dataset.make_dataset(settings)[source]
Main function for data processing
Create the train test val splits
Preprocess all the FITs data, then pivot
Save all of the processed data to a single HDF5 database
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
Data utilities
- class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)
Bases: tuple
- arr_mean
Alias for field number 1
- arr_min
Alias for field number 0
- arr_std
Alias for field number 2
- supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path, columns=None, multidim='drop')[source]
Load a FIT file and cast it to a PANDAS dataframe.
FITS tables can contain vector / array-valued columns (TDIM > 1). astropy.Table.to_pandas raises ValueError on such columns because pandas DataFrames cannot natively hold n-dim arrays in a column. This helper handles them so the pipeline does not crash on otherwise valid FITS files.
- Parameters:
fit_file_path (str) – path to FIT file
columns (list, optional) – if given, only these columns are kept before the pandas conversion. Acts as a whitelist that avoids loading multi-D columns we don’t care about into pandas in the first place. Names that are not present in the FITS table are silently ignored.
multidim (str) – strategy for multi-dimensional columns that survive the columns filter. One of:
"drop" (default): remove them with a yellow warning listing which columns were skipped.
"error": raise ValueError so the caller is forced to deal with them explicitly.
- Returns:
(pandas.DataFrame) load dataframe from FIT file
- supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]
Match the target class (an integer in {0, …, 6}) to the name of the class, e.g. “SN Ia” or “SN CC”
- Parameters:
target (int) – specifies the classification target
settings (ExperimentSettings) – custom class to hold hyperparameters
simplify (Boolean) – if True do not show all classes
- Returns:
(str) the name of the class
- supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]
Create classes based on a type column
Depending on the number of classes (2 or all), we create distinct target columns
- Parameters:
df (pandas.DataFrame) – the input dataframe
settings (ExperimentSettings) – controls experiment hyperparameters
type_column (str) – the type column in df
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]
Load the FITOPT file as a pandas dataframe
Pickle it for future use (it is faster to load as a pickled dataframe)
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
verbose (bool) – whether to display logging messages. Default: True
- Returns:
(pandas.DataFrame) dataframe with FITOPT data
- supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]
Read the HEAD FIT file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header FIT file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default: None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]
Read the HEAD csv file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header csv file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default: None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.compute_delta_time(df)[source]
Compute the delta time between two consecutive observations
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe holding lightcurve data with delta_time features
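A delta-time feature of this kind can be computed with a grouped difference. The sketch below is an assumption-laden illustration rather than the library code: it assumes columns named SNID and MJD, writes the result to a column called delta_time, and sets the first observation of each light curve to 0.

```python
import pandas as pd


def add_delta_time(df):
    """Add the time elapsed since the previous observation of the same light curve.

    Hypothetical helper with assumed column names SNID and MJD.
    """
    df = df.sort_values(["SNID", "MJD"]).reset_index(drop=True)
    # diff() runs within each light curve; the first observation has no
    # predecessor, so its delta is set to 0
    df["delta_time"] = df.groupby("SNID")["MJD"].diff().fillna(0.0)
    return df


lightcurves = pd.DataFrame(
    {
        "SNID": [1, 1, 1, 2, 2],
        "MJD": [10.0, 12.0, 15.0, 100.0, 101.0],
    }
)
with_delta = add_delta_time(lightcurves)
```

Grouping by SNID prevents the difference from leaking across light curves, which would otherwise produce a spurious large gap at every boundary.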
- supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]
Remove rows in the same light curve after a gap > 150 days.
Reason: if no signal has been saved in a time frame of 150 days, it is unlikely there is much left afterwards.
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe where large delta time rows have been removed
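One way to express this truncation is to flag the first row that follows a large gap and then propagate that flag to all later rows of the same light curve. The sketch below assumes columns named SNID and delta_time and rows already sorted in time within each light curve; the helper name `drop_after_large_gap` is hypothetical.

```python
import pandas as pd

MAX_GAP_DAYS = 150


def drop_after_large_gap(df):
    """Drop every row at or after the first gap larger than MAX_GAP_DAYS.

    Hypothetical helper; expects assumed columns SNID and delta_time.
    """
    large_gap = (df["delta_time"] > MAX_GAP_DAYS).astype(int)
    # cummax within a light curve stays at 1 once a large gap has been seen
    after_gap = large_gap.groupby(df["SNID"]).cummax().astype(bool)
    return df[~after_gap].reset_index(drop=True)


lightcurves = pd.DataFrame(
    {
        "SNID": [1, 1, 1, 1, 2, 2],
        "delta_time": [0.0, 10.0, 200.0, 5.0, 0.0, 20.0],
    }
)
kept = drop_after_large_gap(lightcurves)
```

In this toy input, light curve 1 loses its last two rows (the one that follows the 200-day gap and everything after it), while light curve 2 is left untouched.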
- supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]
Load physical information related to the created database of lightcurves
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- Returns:
(pandas.DataFrame) dataframe holding physics information about the dataset
- supernnova.utils.data_utils.log_standardization(arr)[source]
Normalization strategy for the fluxes and flux errors
Log transform the data
Mean and std dev normalization
- Parameters:
arr (np.array) – data to normalize
- Returns:
(LogStandardized) namedtuple holding normalization data
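The two steps (log transform, then mean and standard deviation normalization) can be sketched with NumPy. This is an illustration under stated assumptions, not the library code: the functions `log_standardize` and `apply_log_standardization` are hypothetical, and shifting by the array minimum plus a small `eps` is one possible way to keep the argument of the logarithm positive.

```python
from collections import namedtuple

import numpy as np

# Same field order as the documented namedtuple: min (0), mean (1), std (2)
LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])


def log_standardize(arr, eps=1e-5):
    """Compute the statistics needed to log-standardize arr (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    arr_min = arr.min()
    # Assumed shift: subtract the minimum and add eps so the log is defined
    arr_log = np.log(arr - arr_min + eps)
    return LogStandardized(arr_min=arr_min, arr_mean=arr_log.mean(), arr_std=arr_log.std())


def apply_log_standardization(arr, stats, eps=1e-5):
    """Apply previously computed statistics to arr."""
    arr_log = np.log(np.asarray(arr, dtype=float) - stats.arr_min + eps)
    return (arr_log - stats.arr_mean) / stats.arr_std


flux = np.array([-5.0, 0.0, 10.0, 250.0])
stats = log_standardize(flux)
normalized = apply_log_standardization(flux, stats)
```

Returning the statistics rather than the transformed array lets the same values be reused later, for example to normalize validation or test data consistently with the training set.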
- supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]
Save processed dataframe to HDF5
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe holding processed data