Data documentation
Dataset construction
- supernnova.data.make_dataset.build_traintestval_splits(settings)[source]
Build the dataset splits as follows:
Downsample each class so that it has the same cardinality as the lowest-cardinality class
Randomly assign lightcurves to an 80/10/10 train/test/val split (except Out-of-distribution data: 1/1/98)
- OOD:
Uses the complete sample for testing; does not require settings.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
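The downsample-then-split logic described above can be sketched as follows. This is a minimal illustration, not the actual SuperNNova implementation; the `target` and `split` column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

def balanced_splits(df, rng=np.random.default_rng(0)):
    """Downsample each class to the smallest class size, then assign
    an 80/10/10 train/val/test split (hypothetical helper, not the
    library's actual function)."""
    # Downsample: every class ends up with the lowest class cardinality
    n_min = df["target"].value_counts().min()
    df = (
        df.groupby("target", group_keys=False)
        .sample(n=n_min, random_state=0)
        .reset_index(drop=True)
    )
    # Shuffle, then cut at the 80% and 90% boundaries
    idx = rng.permutation(len(df))
    n_train, n_val = int(0.8 * len(df)), int(0.9 * len(df))
    df["split"] = ""
    df.loc[idx[:n_train], "split"] = "train"
    df.loc[idx[n_train:n_val], "split"] = "val"
    df.loc[idx[n_val:], "split"] = "test"
    return df
```

Downsampling first guarantees class balance in every split; the random permutation then makes the split assignment independent of class.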
- supernnova.data.make_dataset.process_single_FITS(file_path, settings)[source]
Carry out preprocessing on a FITS file and save the results to pickle. Pickle is preferred to csv as it is faster to read and write.
Join columns from header files
Select columns that will be useful later on
Compute SNID to tag each light curve
Compute delta times between measurements
Filter preprocessing
Remove delimiter rows
- Parameters:
file_path (str) – path to a .FITS file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.process_single_csv(file_path, settings)[source]
Carry out preprocessing on a csv file and save the results to pickle. Pickle is preferred to csv as it is faster to read and write.
Compute delta times between measurements
Filter preprocessing
- Parameters:
file_path (str) – path to a .csv file
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.preprocess_data(settings)[source]
Preprocess the FITS data
Use multiprocessing/threading to speed up data processing
Preprocess every FITS file in the raw data dir
Also save a DataFrame of Host Spe for publication plots
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- supernnova.data.make_dataset.pivot_dataframe_single_from_df(df, settings)[source]
Carry out the pivot: group time-wise close observations onto the same row, so that each row in the dataframe holds a value for each flux and flux-error column
All observations within 8 hours of each other are assigned the same MJD
Results are cached with pickle
- Parameters:
df (pandas.DataFrame) – dataframe of pre-processed lightcurve data
settings (ExperimentSettings) – controls experiment hyperparameters
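The grouping step behind this pivot can be sketched as below. Note this is an illustrative approximation: it chains consecutive gaps rather than reproducing the library's exact 8-hour MJD assignment, and the `MJD`, `FLT`, and `FLUXCAL` column names are assumed from SNANA-style conventions.

```python
import pandas as pd

def group_close_observations(df, window_days=8 / 24):
    """Group observations taken within ~8 hours onto the same row,
    one flux column per filter (sketch, not the library's exact code)."""
    df = df.sort_values("MJD").copy()
    # Start a new time group whenever the gap to the previous
    # observation exceeds the window
    new_group = df["MJD"].diff().fillna(0) > window_days
    df["time_group"] = new_group.cumsum()
    # One row per time group, one FLUXCAL column per filter
    return df.pivot_table(
        index="time_group", columns="FLT", values="FLUXCAL", aggfunc="mean"
    )
```

After this pivot, each row represents a single effective epoch with one flux value per band, which is the shape the RNN training code expects.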
- supernnova.data.make_dataset.pivot_dataframe_batch(list_files, settings)[source]
Use multiprocessing/threading to speed up data processing
Pivot every file in list_files and cache the result with pickle
- Parameters:
list_files (list) – list of .pickle files containing pre-processed data
settings (ExperimentSettings) – controls experiment hyperparameters
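The fan-out pattern used here can be sketched with a thread pool. The worker below (`square`) is a hypothetical stand-in for the real per-file pivot function; the point is the pool-over-a-file-list structure, not the worker itself.

```python
from multiprocessing.pool import ThreadPool

def square(x):
    # Stand-in for a per-file worker such as pivot_dataframe_single_from_df
    return x * x

def process_batch(items, worker, n_workers=4):
    """Fan a list of work items out over a pool of workers
    (illustrative sketch of the multiprocessing/threading pattern)."""
    with ThreadPool(n_workers) as pool:
        return pool.map(worker, items)
```

Because each file is preprocessed independently, the work is embarrassingly parallel and `pool.map` preserves the order of `items` in its results.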
- supernnova.data.make_dataset.make_dataset(settings)[source]
Main function for data processing
Create the train test val splits
Preprocess all the FITS data, then pivot
Save all of the processed data to a single HDF5 database
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
Data utilities
- class supernnova.utils.data_utils.LogStandardized(arr_min, arr_mean, arr_std)
Bases:
tuple
- arr_mean
Alias for field number 1
- arr_min
Alias for field number 0
- arr_std
Alias for field number 2
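The field-number aliases above correspond to a plain `collections.namedtuple`; a minimal equivalent definition (the values here are made up for illustration):

```python
from collections import namedtuple

# Field order matches the documented aliases:
# field 0 = arr_min, field 1 = arr_mean, field 2 = arr_std
LogStandardized = namedtuple("LogStandardized", ["arr_min", "arr_mean", "arr_std"])

stats = LogStandardized(arr_min=-1.0, arr_mean=0.0, arr_std=1.0)
```

Fields can be read positionally (`stats[0]`) or by name (`stats.arr_min`), which is why the documentation lists each name as an alias for a field number.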
- supernnova.utils.data_utils.load_pandas_from_fit(fit_file_path)[source]
Load a FITS file and cast it to a pandas DataFrame
- Parameters:
fit_file_path (str) – path to the FITS file
- Returns:
(pandas.DataFrame) dataframe loaded from the FITS file
- supernnova.utils.data_utils.sntype_decoded(target, settings, simplify=False)[source]
Match the target class (an integer in {0, …, 6}) to the name of the class, e.g. “SN Ia” or “SN CC”
- Parameters:
target (int) – specifies the classification target
settings (ExperimentSettings) – custom class to hold hyperparameters
simplify (bool) – if True, do not show all classes
- Returns:
(str) the name of the class
- supernnova.utils.data_utils.tag_type(df, settings, type_column='TYPE')[source]
Create classes based on a type column
Depending on the number of classes (2 or all), distinct target columns are created
- Parameters:
df (pandas.DataFrame) – the input dataframe
settings (ExperimentSettings) – controls experiment hyperparameters
type_column (str) – the type column in df
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
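The binary-target case can be sketched as below. The type code used here (101 for SN Ia) is a hypothetical SNANA-style value chosen for the example; the real mapping comes from `settings`.

```python
import pandas as pd

# Hypothetical mapping: assume type code 101 means SN Ia and
# everything else is non-Ia. The library reads this from settings.
SNTYPE_IA = {101}

def tag_binary_type(df, type_column="TYPE"):
    """Add a binary target column (0 = Ia, 1 = non-Ia) based on the
    type column. Sketch only: the real tag_type also builds a
    multi-class target when more than 2 classes are requested."""
    df = df.copy()
    df["target_2classes"] = (~df[type_column].isin(SNTYPE_IA)).astype(int)
    return df
```

Keeping the targets as extra columns (rather than replacing the type column) lets the same dataframe serve both the binary and multi-class training configurations.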
- supernnova.utils.data_utils.load_fitfile(settings, verbose=True)[source]
Load the FITOPT file as a pandas dataframe
Pickle it for future use (it is faster to load as a pickled dataframe)
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
verbose (bool) – whether to display logging message. Default:
True
- Returns:
(pandas.DataFrame) dataframe with FITOPT data
- supernnova.utils.data_utils.process_header_FITS(file_path, settings, columns=None)[source]
Read the HEAD FITS file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header FIT file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default:
None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.process_header_csv(file_path, settings, columns=None)[source]
Read the HEAD csv file, add target columns and return in pandas DataFrame format
- Parameters:
file_path (str) – the path to the header csv file
settings (ExperimentSettings) – controls experiment hyperparameters
columns (list) – list of columns to keep. Default:
None
- Returns:
(pandas.DataFrame) the dataframe, with new target columns
- supernnova.utils.data_utils.add_redshift_features(settings, df)[source]
Add redshift features to pandas dataframe.
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe with FITS data
- Returns:
(pandas.DataFrame) the dataframe, possibly with added redshift features
- supernnova.utils.data_utils.compute_delta_time(df)[source]
Compute the delta time between two consecutive observations
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe holding lightcurve data with delta_time features
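A delta-time computation of this kind is a one-line grouped diff; a minimal sketch, assuming the preprocessed `SNID` and `MJD` columns and a first-observation delta of 0:

```python
import pandas as pd

def compute_delta_time(df):
    """Add the time elapsed since the previous observation of the
    same lightcurve (sketch of the documented behavior)."""
    df = df.copy()
    # diff() within each SNID group; the first row of each lightcurve
    # has no predecessor, so fill its delta with 0
    df["delta_time"] = df.groupby("SNID")["MJD"].diff().fillna(0)
    return df
```

Grouping by `SNID` is essential: without it, the diff would leak across lightcurve boundaries and the first observation of each new object would get a spurious huge delta.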
- supernnova.utils.data_utils.remove_data_post_large_delta_time(df)[source]
Remove rows in the same light curve after a gap > 150 days. Rationale: if no signal has been recorded within a 150-day window, it is unlikely there is much left afterwards
- Parameters:
df (pandas.DataFrame) – dataframe holding lightcurve data
- Returns:
(pandas.DataFrame) dataframe where large delta time rows have been removed
- supernnova.utils.data_utils.load_HDF5_SNinfo(settings)[source]
Load physical information related to the created database of lightcurves
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
- Returns:
(pandas.DataFrame) dataframe holding physics information about the dataset
- supernnova.utils.data_utils.log_standardization(arr)[source]
Normalization strategy for the fluxes and flux errors
Log transform the data
Mean and std dev normalization
- Parameters:
arr (np.array) – data to normalize
- Returns:
(LogStandardized) namedtuple holding normalization data
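The two-step strategy (log transform, then mean/std normalization) can be sketched as below. The small positive offset used before the log is an assumption for the example, not necessarily the library's exact convention.

```python
import numpy as np

def log_standardize(arr):
    """Log-transform, then normalize to zero mean and unit std
    (sketch of the strategy; the offset choice is an assumption)."""
    # Shift so the minimum maps to a small positive value before the log,
    # since fluxes can be negative or zero
    arr_min = arr.min()
    log_arr = np.log(arr - arr_min + 1e-5)
    arr_mean, arr_std = log_arr.mean(), log_arr.std()
    return (log_arr - arr_mean) / arr_std, (arr_min, arr_mean, arr_std)
```

The three statistics are returned alongside the normalized array because they must be stored (as in the `LogStandardized` namedtuple) to apply the identical transform at inference time and to invert it for plotting.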
- supernnova.utils.data_utils.save_to_HDF5(settings, df)[source]
Save the processed dataframe to HDF5
- Parameters:
settings (ExperimentSettings) – controls experiment hyperparameters
df (pandas.DataFrame) – dataframe holding processed data