Example of workflow - ΛCDM
Start by cloning CONNECT
git clone https://github.com/AarhusCosmology/connect_public.git
Then run the setup script from within the repository
cd connect_public
./setup.sh
Answer yes to all questions and leave the paths blank.
Your CONNECT installation is now ready to create neural networks.
The first thing you want to do is to create a parameter file in the input/ folder. For the first run, it is a good idea to use the input/example.param file (with iterative sampling), which also serves as a helpful template when creating new parameter files. Open the parameter file in your favourite text editor and make sure that the parameter mcmc_sampler is set to the MCMC sampler you want to use.
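For concreteness, the relevant line in the parameter file looks something like the following (the exact value strings accepted by mcmc_sampler should be checked against the comments in input/example.param; the two samplers referred to in this guide are Monte Python and Cobaya):
mcmc_sampler = 'montepython'
If you want to keep the example file untouched, you can first copy it, e.g. cp input/example.param input/my_lcdm.param, where the new name is purely illustrative; the rest of this walkthrough uses input/example.param directly.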
If using a cluster with SLURM, you can use the jobscript jobscripts/example.js. Open this in a text editor and adjust the SLURM parameters to fit your cluster. Now submit the job
sbatch jobscripts/example.js
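For orientation, the SLURM settings you adjust in the jobscript are the standard directives; a generic sketch (placeholders only, not the literal contents of example.js) might contain
#SBATCH --job-name=connect_example
#SBATCH --ntasks=100
#SBATCH --time=10:00:00
#SBATCH --partition=<your_partition>
where the core count and walltime match the resource suggestion given later in this section, and the partition name depends on your cluster.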
Once the job starts, you can monitor the progress in the data/<jobname>/output.log file. This tells you how far the iterative sampling has progressed and what the code is currently doing. The first thing the code does is to create an initial model from a Latin hypercube sampling. The output from this will look like
No initial model given
Calculating 10000 initial CLASS models
Training neural network
1/1 - 0s - loss: 228.5294 - 58ms/epoch - 58ms/step
Test loss: 228.5294189453125
Initial model is example
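To follow this log live while the job is running, a standard command such as
tail -f data/<jobname>/output.log
streams output.log as new lines are written.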
Once the initial model is in place, the code begins the iterative process, and each iteration will look something like
Beginning iteration no. 1
Temperature is now 5.0
Running MCMC sampling no. 1...
MCMC sampling stopped since R-1 less than 0.05 has been reached.
Number of accepted steps: 12340
Keeping only last 5000 of the accepted Markovian steps
Comparing latest iterations...
Calculating 5000 CLASS models
Training neural network
1/1 - 0s - loss: 7.6460 - 34ms/epoch - 34ms/step
Test loss: 7.645951747894287
New model is example_1
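Because each iteration announces itself in the log, a quick way to check how far the run has progressed is, for example,
grep "Beginning iteration" data/<jobname>/output.log | tail -n 1
which prints the most recent iteration line.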
The full iterative process should not take more than 3-5 iterations with the setup in input/example.param, so using 100 CPU cores with a walltime of 8-10 hours should be sufficient. The computationally heaviest step is Calculating N CLASS models, but it parallelises very well, so given enough CPU cores this will be fast. The more time-consuming bottleneck is the MCMC sampling, which can (as of now) only utilise a few cores at a time, since it does not parallelise well.
If the walltime was set too low, or the iterative sampling was interrupted before finishing for some other reason, it is possible to resume the sampling from the last iteration. This is done by adding the following line to your parameter file and submitting the job again
resume_iterations = True
This can also be used if you want to continue a job with new settings (different loss function, architecture, etc.).
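As a minimal sketch of resuming (appending with echo is just one way to add the line; editing the parameter file by hand works just as well):
echo "resume_iterations = True" >> input/example.param
sbatch jobscripts/example.js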
When your job has finished, you can look in the data/<jobname>/output.log file for the name of the last model. This will generally be a good model that you can use for MCMC and similar analyses, but if you want to train a new model for more epochs or with another architecture, you can do so on the same data collected by the iterative process. This is done by changing the training parameters in the parameter file (example.param in this example) and running
python connect.py train input/example.param
either in a jobscript similar to jobscripts/example.js or locally with CPUs or GPUs (remember to load cuda if using GPUs).
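On a module-based cluster, a GPU training run might look like the following (the module name cuda is an assumption and depends on your system):
module load cuda
python connect.py train input/example.param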
Once a neural network has been trained, it can be used as described in the section Using a trained neural network for MCMC.
Useful commands for monitoring the iterative sampling
While the iterative process is running, each individual step can be monitored through different .log files.
All errors can be seen in the SLURM output file defined in the job script.
When calculating CLASS models, the number of computed models can be monitored with the command
cat data/<jobname>/number_<iteration>/model_params_data/*.txt | wc -l
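If you prefer a count that refreshes on its own, the same command can be wrapped in watch, e.g.
watch -n 60 'cat data/<jobname>/number_<iteration>/model_params_data/*.txt | wc -l'
which reruns the count every 60 seconds.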
When an MCMC sampling is running, the output from either Monte Python or Cobaya can be viewed with
cat data/<jobname>/number_<iteration>/montepython.log
or
cat data/<jobname>/number_<iteration>/cobaya.log
When training the neural network, the progress can be followed with
cat data/<jobname>/number_<iteration>/training.log
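Any of these logs can also be followed live rather than printed once, e.g.
tail -f data/<jobname>/number_<iteration>/training.log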