Data walkthrough¶
Recommended code organization structure:
├── snndump (to save the data)
│
├── SuperNNova
│ ├── supernnova
│ ├── env
│ ├── docs
│ ├── tests
To build the database:
Ensure you have raw data saved to {raw_dir}/raw.
The default settings assume the raw data and fits are saved to snndump/raw.
You can save the data in any folder, but you then have to specify the dump_dir with the --dump_dir XXX option. You can point to a different location for the raw data with the --raw_dir XXX option, and to a different location for the fits to the data with the --fits_dir XXX option.
Activate the environment¶
Either use docker
cd env && python launch_docker.py (--use_cuda optional)
Or activate your conda environment
source activate <conda_env_name>
Creating a debugging database¶
Using the command line:
python run.py --data --dump_dir tests/dump --raw_dir tests/raw
This creates a database for a very small subset of all available data. It is intended for debugging purposes (training and validation run very fast with this small database).
The database is saved to tests/dump/processed.
Additional SALT2 fits can be provided with --fits_dir tests/fits for training of the Random Forest (RF) and for interpretation.
Using yaml:
python run_yaml.py <yaml_file_with_config> --mode data
An example <yaml_file_with_config> is provided in configs_yml.
Creating a database¶
Using the command line:
python run.py --data --dump_dir <path/to/full/database/> --raw_dir <path/to/raw/data/>
Additional SALT2 fits can be provided with --fits_dir <path/to/fits/> for training of the Random Forest (RF) and for interpretation.
Using yaml: modify the configuration file and run
python run_yaml.py <yaml_file_with_config> --mode data
You DO NEED to download the raw data for this database or point to where your data is.
This creates a database for all the available data with 80/10/10 train/validate/test splits. Splits can be changed using --data_training (use the data only for training and validation) or --data_testing (use the data only for testing). For yaml, just add data_training: True or data_testing: True.
The database is saved to the specified dump_dir, in the processed subfolder.
There is no need to specify a SALT2 fits file to make the dataset; it can be used if available but it is not required (--fits_dir <empty/path/>).
Raw data can be in csv format with the following columns:
DES_PHOT.csv: SNID, MJD, FLUXCAL, FLUXCALERR, FLT
DES_HEAD.csv: SNID, PEAKMJD, HOSTGAL_PHOTOZ, HOSTGAL_PHOTOZ_ERR, HOSTGAL_SPECZ, HOSTGAL_SPECZ_ERR, SIM_REDSHIFT_CMB, SIM_PEAKMAG_z, SIM_PEAKMAG_g, SIM_PEAKMAG_r, SIM_PEAKMAG_i, SNTYPE
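For illustration, here is a minimal pandas sketch that writes raw csv files with the columns listed above; all values, the SNID and the SNTYPE code are made-up placeholders, not real data:
# Minimal sketch: write raw csv files with the required columns.
# All values below are illustrative placeholders, not real measurements.
from pathlib import Path
import pandas as pd

Path("snndump/raw").mkdir(parents=True, exist_ok=True)

phot = pd.DataFrame({
    "SNID": [1, 1, 1],
    "MJD": [56171.0, 56172.1, 56173.2],
    "FLUXCAL": [120.3, 180.7, 210.4],
    "FLUXCALERR": [5.1, 4.8, 5.3],
    "FLT": ["g", "r", "i"],
})
phot.to_csv("snndump/raw/DES_PHOT.csv", index=False)

head = pd.DataFrame({
    "SNID": [1],
    "PEAKMJD": [56180.0],
    "HOSTGAL_PHOTOZ": [0.32],
    "HOSTGAL_PHOTOZ_ERR": [0.05],
    "HOSTGAL_SPECZ": [0.31],
    "HOSTGAL_SPECZ_ERR": [0.001],
    "SIM_REDSHIFT_CMB": [0.31],
    "SIM_PEAKMAG_z": [22.1],
    "SIM_PEAKMAG_g": [23.0],
    "SIM_PEAKMAG_r": [22.5],
    "SIM_PEAKMAG_i": [22.3],
    "SNTYPE": [101],  # hypothetical type code
})
head.to_csv("snndump/raw/DES_HEAD.csv", index=False)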
Creating a database for testing a trained model¶
This is how to create a database with only lightcurves to evaluate.
python run.py --dump_dir <path/to/save/database/> --data --data_testing --raw_dir <path/to/raw/data/>
Note that using the --data_testing option will generate a 100% testing set (see below for more details).
Using yaml: modify the configuration file with data_testing: True and use --mode data.
Creating a database using some SNIDs for testing and the rest for training and validating¶
This is how to create a database using a list of SNIDs for testing.
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --testing_ids <path/to/ids/file>
You can provide the SNIDs in .csv or .npy format. The .csv must contain a column SNID.
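As an illustration, here is a minimal sketch of how such an ID file could be produced (the SNID values and file names are placeholders):
# Minimal sketch: build a testing-IDs file in either accepted format.
# The SNID values and file names are placeholders.
import numpy as np
import pandas as pd

test_snids = [1, 42, 1337]

# .csv format: must contain a column named SNID
pd.DataFrame({"SNID": test_snids}).to_csv("testing_ids.csv", index=False)

# .npy format: assumed here to be a plain array of IDs
np.save("testing_ids.npy", np.array(test_snids))
The resulting file is then passed with --testing_ids testing_ids.csv (or testing_ids.npy).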
Creating a database with photometry limited to a time window¶
Photometric measurements may span a larger time range than the one desired for classification. For example, a year of photometry is much longer than the usual SN timespan. Therefore, it may be desirable to use just a subset of this photometry (observed-epoch cuts). To do so:
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --photo_window_files <path/to/csv/with/peakMJD> --photo_window_var <name/of/variable/in/csv/to/cut/on> --photo_window_min <negative/int/indicating/days/before/var> --photo_window_max <positive/int/indicating/days/after/var>
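As a hedged sketch, the peak-time file could be a csv with one row per SNID and a column holding the variable passed to --photo_window_var (the column name PEAKMJD, the file name and all values below are assumptions for illustration):
# Hypothetical peak-time csv for --photo_window_files.
# Column names (SNID, PEAKMJD), file name and values are assumptions.
import pandas as pd

peaks = pd.DataFrame({
    "SNID": [1, 42, 1337],
    "PEAKMJD": [56180.0, 56990.5, 57321.2],
})
peaks.to_csv("peak_mjd.csv", index=False)
Such a file could then be used with, e.g., --photo_window_files peak_mjd.csv --photo_window_var PEAKMJD --photo_window_min -30 --photo_window_max 100 (the window values are illustrative).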
Creating a database with different survey¶
The default filter set is the one from the Dark Energy Survey Supernova Survey: g, r, i, z. If you want to use your own survey, you’ll need to specify your filters.
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --list_filters <your/filters>
Using a different redshift label¶
The default redshift label is either HOSTGAL_SPECZ/HOSTGAL_PHOTOZ (with the zspe/zpho option). If you want to use your own label, you’ll need to specify it. Beware, this will also override SIM_REDSHIFT_CMB, which is used for the title of plotted light-curves.
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --redshift_label <your/label>
e.g. --redshift_label REDSHIFT_FINAL.
Using a different sntype label¶
The default sntype label is SNTYPE. If you want to use your own label, you’ll need to specify it.
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --sntype_var <your/label>
e.g. --sntype_var SIM_SNTYPE.
Masking photometry¶
The default is to use all available photometry for classification. However, we support masking photometric epochs with a power-of-two mask: any epoch whose mask value contains one of the listed power-of-two integers, alone or in combination with other bits, will be eliminated from the database.
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --phot_reject <your/label> --phot_reject_list <list/to/reject>
e.g. --phot_reject PHOTFLAG --phot_reject_list 8 16 32 64 128 256 512.
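To make the bitmask behaviour concrete, here is a minimal sketch of the rejection logic as described above (the PHOTFLAG values are made up; this is an illustration, not the pipeline's actual code):
# Illustration of power-of-two masking: an epoch is dropped if its flag
# contains any of the rejected bits, possibly combined with other bits.
reject_bits = [8, 16, 32, 64, 128, 256, 512]

def is_rejected(photflag):
    return any(photflag & bit for bit in reject_bits)

for flag in [0, 4, 8, 24, 4096 + 16]:  # made-up PHOTFLAG values
    print(flag, "rejected" if is_rejected(flag) else "kept")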
Adding another training variable¶
You may want to add another feature for training and classification from the metadata (the HEAD file for .fits data).
python run.py --dump_dir <path/to/save/database/> --data --raw_dir <path/to/raw/data/> --additional_train_var <additional_column_name>
e.g. --additional_train_var MWEBV.
Under the hood¶
Preparing data splits¶
We first compute the data splits:
By default the HEAD FITS/csv files are analyzed to compute 80/10/10 train/test/val splits.
You can switch to 99.5/0.5/0.5 train/test/val splits using the --data_training option.
You can switch to 0/0/100 train/test/val splits using the --data_testing option. Beware, this option has other consequences.
The splits are different for the salt/photometry datasets.
The splits are different depending on the classification target.
We downsample the dataset so that, for a given classification task, all classes have the same cardinality (see the sketch after this list).
The supernova/light-curve types supported can be changed using --sntypes. The default contains 7 classes. If a class is not given as input in --sntypes, it will be assigned to the last available tag. If an ‘Ia’ class exists in the provided --sntypes, it will be taken as tag 0 in the data splits; otherwise the first class will be used.
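The class-balancing step mentioned above can be pictured with a minimal numpy sketch (not the pipeline's actual implementation): each class is randomly downsampled to the size of the smallest class.
# Minimal sketch of downsampling so all classes have the same cardinality.
# `targets` is a made-up array of class labels, one per light curve.
import numpy as np

rng = np.random.default_rng(0)
targets = rng.integers(0, 3, size=1000)  # hypothetical labels for 3 classes

n_min = min(np.sum(targets == c) for c in np.unique(targets))
balanced_idx = np.concatenate([
    rng.choice(np.where(targets == c)[0], n_min, replace=False)
    for c in np.unique(targets)
])
print(len(balanced_idx), "light curves kept after balancing")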
Preprocessing¶
We then pre-process each FITS/csv file:
Join columns from the header files
Select columns that will be useful later on
Compute an SNID to tag each light curve
Compute delta times between measurements
Remove delimiter rows
Pivot¶
We then pivot each preprocessed file: observations that are close in time are grouped on the same row, and each row in the dataframe shows a value for each flux and flux error column
All observations within 8 hours of each other are assigned the same MJD
Results are cached with pickle for faster loading
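A minimal pandas sketch of that grouping idea (not the pipeline's actual code): observations whose times differ by less than 8 hours receive a common MJD and are then pivoted so that each row holds one flux value per filter.
# Minimal sketch of the pivot step: cluster observations closer than
# 8 hours in time, assign them a common MJD, then pivot fluxes per filter.
import pandas as pd

df = pd.DataFrame({  # made-up photometry
    "MJD": [56171.00, 56171.20, 56171.25, 56175.10],
    "FLT": ["g", "r", "i", "g"],
    "FLUXCAL": [120.3, 180.7, 210.4, 95.2],
}).sort_values("MJD")

# start a new group whenever the gap to the previous observation exceeds 8 hours
group = (df["MJD"].diff().fillna(0) > 8 / 24).cumsum()
df["MJD"] = df.groupby(group)["MJD"].transform("min")
pivoted = df.pivot_table(index="MJD", columns="FLT", values="FLUXCAL")
print(pivoted)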
HDF5¶
The processed database is saved to dump_dir/processed in HDF5 format for convenient use in the ML pipeline.
The HDF5 file is organized as follows:
├── data (variable length array to store time series)
│
│
├── dataset_photometry_2classes (0: train set, 1: valid set, 2: test set, -1: not used)
├── dataset_photometry_7classes (0: train set, 1: valid set, 2: test set, -1: not used)
│
├── target_photometry_2classes (integer between 0 and 1, inclusive)
├── target_photometry_7classes (integer between 0 and 6, inclusive)
│
│
├── features (array of str: feature names to be used)
├── normalizations
│ ├── FLUXCAL_g
│ ├── min
│ ├── mean Normalization coefficients for that feature
│ ├── std
│ ...
├── normalizations_global
│ ├── FLUXCAL
│ ├── min
│ ├── mean Normalization coefficients for that feature
│ ├── std In this scheme, the coefficients are shared between fluxes and flux errors
│ ...
│
├── SNID The ID of the lightcurve
├── PEAKMJD The MJD value at which a lightcurve reaches peak light
├── SNTYPE The type of the lightcurve (120, 121...)
│
... (Other metadata / features about lightcurves)
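As a minimal h5py sketch, the processed file can be inspected as follows (the file name below is a placeholder; point it to the HDF5 file written under dump_dir/processed):
# Minimal sketch: inspect the processed HDF5 database with h5py.
# The file path is a placeholder for the file found under dump_dir/processed.
import h5py

with h5py.File("snndump/processed/database.h5", "r") as hf:
    print(list(hf.keys()))  # top-level datasets
    splits = hf["dataset_photometry_2classes"][:]  # 0: train, 1: valid, 2: test, -1: unused
    targets = hf["target_photometry_2classes"][:]  # class labels
    snids = hf["SNID"][:]
    features = [f.decode() if isinstance(f, bytes) else f for f in hf["features"][:]]
    print(len(snids), "light curves,", (splits == 0).sum(), "in the train split")
    print("features:", features)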
The features used for classification are the following:
FLUXCAL_g (flux)
FLUXCAL_i (flux)
FLUXCAL_r (flux)
FLUXCAL_z (flux)
FLUXCALERR_g (flux error)
FLUXCALERR_i (flux error)
FLUXCALERR_r (flux error)
FLUXCALERR_z (flux error)
delta_time (time elapsed since previous observation in MJD)
HOSTGAL_PHOTOZ (photometric redshift)
HOSTGAL_PHOTOZ_ERR (photometric redshift error)
HOSTGAL_SPECZ (spectroscopic redshift)
HOSTGAL_SPECZ_ERR (spectroscopic redshift error)
g (boolean flag indicating which band is present at a specific time step)
gi (boolean flag indicating which band is present at a specific time step)
gir (boolean flag indicating which band is present at a specific time step)
girz (boolean flag indicating which band is present at a specific time step)
giz (boolean flag indicating which band is present at a specific time step)
gr (boolean flag indicating which band is present at a specific time step)
grz (boolean flag indicating which band is present at a specific time step)
gz (boolean flag indicating which band is present at a specific time step)
i (boolean flag indicating which band is present at a specific time step)
ir (boolean flag indicating which band is present at a specific time step)
irz (boolean flag indicating which band is present at a specific time step)
iz (boolean flag indicating which band is present at a specific time step)
r (boolean flag indicating which band is present at a specific time step)
rz (boolean flag indicating which band is present at a specific time step)
z (boolean flag indicating which band is present at a specific time step)