WEBVTT

00:00.000 --> 00:13.240
Thank you very much. Hi. My name is Christian. I'm a research software engineer with

00:13.240 --> 00:18.480
the Centre for Advanced Research Computing at University College London. And I'm also a

00:18.480 --> 00:24.680
particle physicist and I work with collision data from the Large Hadron Collider, or LHC.

00:24.680 --> 00:29.520
And today I want to talk to you a little bit about how we preserve LHC analyses, once

00:29.520 --> 00:35.400
they're done, for the long run, using a tool called Rivet. So this talk is meant to give you

00:35.400 --> 00:41.600
a little bit of an insight into why we need a common tool to analyze Monte Carlo simulations,

00:41.600 --> 00:46.440
how Rivet has become that standard tool, and some of the challenges that we've encountered

00:46.440 --> 00:53.040
along the way. But before we get into that, let me set the stage a little better.

00:53.040 --> 00:57.920
So the LHC is the world's largest particle accelerator. It's housed in a 27-kilometer

00:57.920 --> 01:03.560
tunnel beneath the Swiss-French border. You can see that sort of indicated in this aerial view

01:03.560 --> 01:07.480
here. If you squint your eyes, you can see the city of Geneva in the background, along

01:07.480 --> 01:14.080
with the Alps. And inside this tunnel, we accelerate protons to nearly the speed of light and

01:14.080 --> 01:18.960
then we smash them together at four interaction points. And that's where the LHC experiments

01:18.960 --> 01:27.680
are located. We have ALICE, CMS, LHCb and the ATLAS experiment, which is the one that I'm working on.

01:27.680 --> 01:35.000
The LHC generates enormous amounts of data. Just last year in July, both ATLAS and CMS each

01:35.000 --> 01:40.560
surpassed the one exabyte threshold, which is maybe a unit that we don't use that often. But

01:40.560 --> 01:47.840
ChatGPT tells me it's the equivalent of about 3,000 years of uninterrupted Netflix streaming.

01:47.840 --> 01:56.320
So I'm not sure it makes it more tangible. But even that is only about 10% of the full data

01:56.320 --> 02:01.000
set that we expect to record over the lifetime of the LHC. It's a huge data set. This data

02:01.000 --> 02:08.640
doesn't just include the experimental results. It also includes simulated data, coming from

02:08.640 --> 02:13.240
the Monte Carlo event generators. And these are used to model the physics processes that

02:13.240 --> 02:22.120
we expect to observe in nature. Simulated data allows us to compare our experimental measurements

02:22.200 --> 02:28.120
against theoretical predictions. And so they play a crucial role in understanding our data

02:28.120 --> 02:35.960
and interpreting our data. Just to illustrate the scope of what we measure a little

02:35.960 --> 02:40.600
bit better, this plot that I'm showing you here illustrates the various processes that the

02:40.600 --> 02:45.600
ATLAS experiment has observed so far. So on the horizontal axis, you just see the various

02:45.680 --> 02:53.360
processes that we look for listed there. And the vertical axis represents the cross section,

02:53.360 --> 02:59.520
which is a measure for how likely a process is to occur. And I want you to note that this

02:59.520 --> 03:13.520
axis spans an incredible 15 orders of magnitude. So from the most common processes on the left,

03:13.600 --> 03:19.040
to the extremely rare ones on the right. I think this is probably my favorite plot that the

03:19.040 --> 03:23.840
ATLAS experiment has released, and they sort of update this a couple of times a year. And even though

03:23.840 --> 03:31.200
I've worked on some of these processes, I still find the range of this mind-blowing. So understanding

03:31.200 --> 03:39.120
and precisely measuring these processes requires a robust framework to compare data and theory

03:39.200 --> 03:46.960
hence the need for a tool like Rivet. We compare our experimental measurements to the

03:46.960 --> 03:51.120
theoretical predictions, for example, from the Standard Model of particle physics, which is sort of

03:52.160 --> 03:59.120
our current best quantum field theory describing the fundamental interactions. And these theoretical

03:59.120 --> 04:05.600
simulations are created with complex theoretical tools that are developed by the

04:05.680 --> 04:14.480
theory community. And they include Monte Carlo event generators, but also parton distribution

04:14.480 --> 04:20.720
functions, which are a thing that sort of describes how the momentum of the proton is shared

04:20.720 --> 04:29.360
between its constituents. However, these theory tools continuously improve over time,

04:30.320 --> 04:37.040
which makes it really important that we preserve our past analyses in such a way that we can

04:37.040 --> 04:45.360
go back and reinterpret them with updated theoretical models in the future.

04:47.600 --> 04:54.080
Without that, we risk losing the ability to go back and even compare our old data to new physics

04:54.080 --> 04:59.680
models, because at some level the theory is lagging a bit behind here, compared to the sheer

04:59.680 --> 05:06.720
precision from this huge experimental data set. At the same time, producing high quality simulations

05:06.720 --> 05:13.040
is computationally really expensive, and so large-scale validation is crucial in order to ensure

05:13.040 --> 05:19.760
that our tools remain reliable. And what I'm showing you here is a simplified view of a sort of

05:19.760 --> 05:24.640
typical Monte Carlo event generation chain to simulate these collisions. We have various

05:27.520 --> 05:33.120
sophisticated specialized tools that focus on different aspects of quantum mechanical interactions

05:34.160 --> 05:38.640
and typically you have more than one option available to reflect different modeling approaches.

05:38.640 --> 05:43.120
And these are then linked to simulate a collision event that could occur at the LHC,

05:44.720 --> 05:48.560
and at the end this is written out in a standard format called HepMC. You can see that sort of

05:48.560 --> 05:54.320
at the bottom, filtering down to the HepMC format, and once such an event is generated you can

05:54.320 --> 05:59.120
either analyze it directly with a tool like Rivet, or you can first pass it through dedicated

05:59.120 --> 06:04.480
detector simulation software to model how the experimental apparatus would have responded.

06:05.680 --> 06:09.760
And then finally there's a bunch of statistical inference tools available to

06:11.600 --> 06:16.800
further assess the agreement with the data, which then feeds back into the theoretical development.

06:19.520 --> 06:26.160
So analyses in high energy physics face several challenges: for one, complex analysis workflows,

06:26.160 --> 06:32.240
which you've just seen on the previous slide, but also complex event records. So each of these

06:32.240 --> 06:40.160
simulated events contains hundreds of particles and interaction vertices and different generators

06:40.160 --> 06:45.120
typically choose to combine them slightly differently, which makes standardized analysis

06:45.200 --> 06:54.000
quite difficult. Limited portability: so many experiment-specific analysis frameworks

06:54.640 --> 07:02.640
cannot easily be shared outside the collaboration, which then restricts broader reuse. And risk of knowledge

07:05.280 --> 07:11.840
loss, because without proper preservation we risk losing valuable analysis insights

07:11.840 --> 07:19.120
once the original authors leave the field. So what we really need is a tool that allows us to

07:20.080 --> 07:26.800
analyze Monte Carlo events in a standardized, experiment-independent way, and that's exactly

07:26.800 --> 07:32.400
where Rivet comes in. So Rivet stands for Robust Independent Validation of Experiment and Theory.

07:33.440 --> 07:39.440
Since its first release in 2007 it's been sort of now widely adopted as a common language to analyze

07:39.440 --> 07:47.040
Monte Carlo analysis, Monte Carlo generator events, I should say, and last year we released Rivet

07:47.040 --> 07:55.440
4, the latest major version. Rivet provides a framework that allows theorists and experimental

07:55.440 --> 08:04.560
physicists to analyze Monte Carlo events in a standard, reproducible way. Rivet was designed with

08:04.640 --> 08:10.240
three key principles in mind. First of all, ease of use, because it's ultimately physicists and not

08:10.240 --> 08:23.600
software developers who are the primary users. Flexibility: it's written in C++ with additional

08:23.680 --> 08:37.680
Python bindings and supports essentially all of the main Monte Carlo generators and the specific

08:37.680 --> 08:42.800
analysis routines, I'll come back to those later, can be loaded as plugins as well.

08:44.240 --> 08:48.560
Efficiency and scalability is another one. So we have built-in caching systems

08:49.440 --> 08:53.760
which help optimize performance, and that makes it then suitable for large data sets.

08:55.280 --> 09:01.680
So the way this works in practice is that Rivet takes the generated events from the Monte Carlo

09:01.680 --> 09:09.760
generators and processes them with a bunch of predefined analysis routines and then writes out

09:09.760 --> 09:16.160
histograms, effectively in the YODA format, and those can then be used for further statistical analysis

09:16.240 --> 09:21.200
or plotting with a built-in plotting machinery that produces plots that look a bit like this.

09:23.360 --> 09:27.920
Speaking of YODA, by that I mean the statistics library that Rivet uses internally

09:28.720 --> 09:34.960
for histogramming, that's another great open source tool that we use in high energy physics.

09:34.960 --> 09:43.120
And if you're interested in YODA's design and capabilities I encourage you to check out the

09:43.200 --> 09:51.120
YODA talk tomorrow morning in the HPC track. So what is a Rivet routine? It's essentially

09:52.720 --> 10:00.960
a C++ class which is compiled into a shared object library and it encapsulates the analysis

10:00.960 --> 10:09.040
logic in software, in code, for a given analysis, and these can be plugged in at runtime.

10:09.360 --> 10:18.000
All of these analysis classes inherit from Rivet's Analysis base class and they follow a

10:18.000 --> 10:23.200
standard structure. So there's a little mini example here, very simplified but still,

10:23.840 --> 10:30.400
and they all have a sort of initialization phase where you book histograms and define standard algorithms.

10:30.960 --> 10:36.960
There's a per event analysis which is used to construct the variables, observables and fill the

10:37.040 --> 10:45.120
histograms and then there's a finalization phase which is used to normalize the histograms

10:45.120 --> 10:51.280
or to construct derived quantities like ratios and efficiencies and such. So this standard

10:51.280 --> 11:00.000
structure ensures consistency across analyses and it helps to hide complexity or technical noise

11:00.000 --> 11:05.120
from the user. You can see an example here at the bottom of a little preprocessor macro

11:05.200 --> 11:13.040
that hides the technical, noisy, complex logic that the shared object library needs in order to

11:13.040 --> 11:17.120
register itself to the core library but physicists don't care about that so we'll hide that away from

11:17.120 --> 11:24.560
them so they can concentrate on the physics. So by now the Rivet library has grown to about

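NOTE
The three-phase structure described above (initialization, per-event analysis, finalization) and the plugin registration can be sketched roughly as follows. This is a self-contained Python illustration only, not Rivet's actual C++ API: the names register_analysis, ToyAnalysis, Histogram1D and run are hypothetical, and the events are made-up numbers.
```python
from dataclasses import dataclass, field
# Hypothetical plugin registry: stands in for what a registration macro would do.
ANALYSES = {}
def register_analysis(cls):
    """Make an analysis class discoverable by name at run time."""
    ANALYSES[cls.__name__] = cls
    return cls
@dataclass
class Histogram1D:
    """Toy fixed-width histogram (an assumption for this sketch, not YODA)."""
    nbins: int
    lo: float
    hi: float
    counts: list = field(default_factory=list)
    def __post_init__(self):
        self.counts = [0.0] * self.nbins
    def fill(self, x, weight=1.0):
        # Values outside [lo, hi) are ignored.
        if self.lo <= x < self.hi:
            width = (self.hi - self.lo) / self.nbins
            self.counts[int((x - self.lo) / width)] += weight
    def scale(self, factor):
        self.counts = [c * factor for c in self.counts]
@register_analysis
class ToyAnalysis:
    def init(self):
        # Initialization: book histograms and reset counters.
        self.hist = Histogram1D(nbins=4, lo=0.0, hi=100.0)
        self.sum_weights = 0.0
    def analyze(self, event):
        # Per-event analysis: build the observable, apply a cut, fill.
        self.sum_weights += event["weight"]
        leading_pt = max(event["pts"], default=0.0)
        if leading_pt > 20.0:
            self.hist.fill(leading_pt, event["weight"])
    def finalize(self):
        # Finalization: normalize to the total sum of event weights.
        if self.sum_weights > 0:
            self.hist.scale(1.0 / self.sum_weights)
def run(name, events):
    """Look the analysis up by name and drive the three phases."""
    analysis = ANALYSES[name]()
    analysis.init()
    for event in events:
        analysis.analyze(event)
    analysis.finalize()
    return analysis.hist.counts
events = [
    {"pts": [35.0, 12.0], "weight": 1.0},
    {"pts": [15.0], "weight": 1.0},
    {"pts": [80.0, 60.0], "weight": 1.0},
    {"pts": [], "weight": 1.0},
]
result = run("ToyAnalysis", events)
print(result)
```
The decorator plays the role of the registration step that is hidden from the user, so the analysis author only writes the three methods.
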
11:24.560 --> 11:30.400
2000 of these routines covering decades of different measurements from different experiments

11:30.480 --> 11:38.560
I think going back 70 years or so, and all of these are open source, providing clear documentation

11:38.560 --> 11:50.560
of the analysis logic. Having a Rivet routine makes it really easy to reproduce

11:50.560 --> 11:58.320
a measurement and that helps ensure that they get cited as well, which also enhances the visibility

11:58.320 --> 12:02.720
for the experimental results. So there's a sort of feedback loop between people being able to use

12:02.720 --> 12:07.520
that and then actually citing our results as well, which makes it popular.

12:12.560 --> 12:17.920
So over time Rivet has sort of become kind of the de facto standard for comparing

12:17.920 --> 12:28.160
Monte Carlo event generators with LHC data. It provides a common framework that allows theorists

12:28.160 --> 12:38.320
and experimentalists. It provides a common framework for theorists and experimentalists that sort of

12:38.320 --> 12:49.520
has helped but also continues to help align best practices between them. It's maintained, and workshops

12:49.520 --> 12:55.120
and training sessions and community engagement are driving Rivet's evolution over time.

12:56.000 --> 13:01.920
Also the LHC experiments have embedded Rivet by now into their official analysis preservation

13:01.920 --> 13:10.160
efforts, which is great news for the community. Rivet is also useful for searching for new physics.

13:10.800 --> 13:18.000
The Standard Model, which is sort of our main theory, is not complete, so there is a deeper theory,

13:18.000 --> 13:22.960
and so theorists that develop extensions to the Standard Model, they might want to compare

13:22.960 --> 13:29.040
their models against the existing data, against the LHC data. Now Rivet has a large pool of that

13:29.040 --> 13:40.240
data and that makes it easy for them to scan a parameter space in the theory to quickly scan

13:40.240 --> 13:45.840
over that and check which bits of the parameter space have already been ruled out by the existing

13:45.840 --> 13:53.280
data. That's kind of illustrated here for some theory model, details don't really matter, that has sort of

13:53.280 --> 13:59.840
two parameters, and the different colors essentially refer to different processes that have been measured

13:59.840 --> 14:04.560
and that we have Rivet routines for, and you can see how different processes become sensitive

14:04.560 --> 14:08.960
depending on what region of the parameter space you're in. Then you can do a likelihood fit for

14:08.960 --> 14:15.360
example and compare that against experimental data that has been measured and rule out parts of

14:15.360 --> 14:23.120
this parameter space at 95 or 68% confidence level which is what the lines mean and this whole process

14:23.120 --> 14:28.320
can be extended over time as new measurements come in or as the theoretical calculations become more

14:28.320 --> 14:39.520
precise. So the key takeaways from developing Rivet: balance generality and usability,

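NOTE
The parameter-space scan described above can be illustrated with a deliberately simplified sketch. Assumptions, none of which come from the talk: a single measured bin with a Gaussian uncertainty, a made-up signal model predicted_signal(mass, coupling), invented numbers throughout, and a one-degree-of-freedom chi-square threshold of 3.84 for 95% confidence. Real reinterpretation tools combine many measurements and use more careful statistics.
```python
# Toy inputs (all numbers are invented for illustration).
MEASURED = 10.0        # measured cross section in one bin
UNCERTAINTY = 1.0      # total Gaussian uncertainty on the measurement
SM_PREDICTION = 10.0   # Standard Model prediction for the same bin
CHI2_95 = 3.84         # 95% quantile of chi-square with 1 degree of freedom
def predicted_signal(mass, coupling):
    """Hypothetical new-physics contribution: grows with coupling, falls with mass."""
    return 400.0 * coupling ** 2 / mass
def is_excluded(mass, coupling):
    """A point is excluded if SM + signal disagrees with the data beyond the threshold."""
    total = SM_PREDICTION + predicted_signal(mass, coupling)
    chi2 = ((MEASURED - total) / UNCERTAINTY) ** 2
    return chi2 > CHI2_95
# Scan a small grid of the two model parameters.
masses = [100.0, 200.0, 400.0]
couplings = [0.5, 1.0, 2.0]
exclusion = {(m, g): is_excluded(m, g) for m in masses for g in couplings}
for m in masses:
    print(m, ["excluded" if exclusion[(m, g)] else "allowed" for g in couplings])
```
Adding a new measurement or a more precise prediction only changes the inputs, so the same scan can be rerun as they come in.
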
14:40.000 --> 14:46.400
because the target audience are physicists and not software developers. Interfacing with experiment

14:46.400 --> 14:53.680
and theory is important, and having a common framework fosters collaboration.

14:56.080 --> 15:02.720
Robust validation is essential: we use automated regression tests to ensure that the Rivet routines,

15:02.880 --> 15:12.880
once submitted, remain correct over time. Standardization matters: consistency enables best practices.

15:15.440 --> 15:22.560
Community engagement is vital for sustainability and likewise sustainability requires effort

15:23.440 --> 15:27.440
academic software needs active maintenance and onboarding.

15:27.920 --> 15:35.440
So just to summarize, Rivet has become over time a standard tool for

15:36.160 --> 15:40.000
Monte Carlo event generator analysis and LHC analysis preservation.

15:40.960 --> 15:47.840
It bridges experiment and theory by providing a shared framework, and looking ahead we want to focus on

15:48.560 --> 15:55.680
automation, usability improvements and also adapting to the next generation of LHC data challenges

15:56.240 --> 16:00.720
coming our way. And of course strengthening analysis preservation efforts remains a key priority.

16:01.760 --> 16:04.720
Thanks very much for your attention. I'll take any questions.

16:11.680 --> 16:12.720
That's great.

16:26.080 --> 16:29.440
The root files are a bit only given a file for that matter.

16:30.320 --> 16:41.360
Do I have to do some where or is it just a really just tool of storing it from my side for just

16:41.360 --> 16:49.120
storing and generating randoms for myself to share with everyone. And then after I did that

16:50.000 --> 16:57.040
let's say a Geant simulation and we get my detector hits out of that. Do I also store the work

16:57.040 --> 17:06.160
Rivet? No, none of this in fact. So the question is how this works in principle. What is the

17:06.160 --> 17:13.040
thing that you actually have to provide? I guess that's a fair way to summarize it. The analysis logic,

17:13.040 --> 17:16.560
what you actually do in the analysis. What cuts do you apply? How do you select an event?

17:16.560 --> 17:21.440
You have this constant flux of events coming in. Somewhere in your analysis you decide this is the

17:21.440 --> 17:24.560
thing that is interesting, that I want to measure, so you're filtering that out, you apply selection

17:24.560 --> 17:28.800
cuts, filtering, whatever. And this is described in a paper in a way that is essentially

17:28.800 --> 17:33.200
not reproducible because you have this small section and if you're trying to code this up

17:33.200 --> 17:37.920
you're almost certainly going to get it wrong. What you need is a C++ snippet. It doesn't have to

17:37.920 --> 17:42.400
be C++ but in this case that's what we use and that's what the experiment frameworks

17:42.480 --> 17:50.160
are written in. And essentially it's just a library of C++ snippets that encode the analysis

17:50.160 --> 17:55.600
logic. How did the analysis actually work? And so in the future, if someone comes and says, well, you used

17:55.600 --> 18:00.480
this or that calculation back then, we've got something that is much better nowadays. How does

18:00.480 --> 18:05.280
that compare to the stuff that you've measured? Then you need to be able to reproduce that analysis

18:05.280 --> 18:11.040
logic and this is what Rivet helps you with. It maintains the logic for the future and helps you

18:11.040 --> 18:17.360
analyze this in an automated, reproducible way so that you can then go back to your

18:17.360 --> 18:21.760
data that you've actually measured and reinterpret it with much better calculations in the future.

18:23.120 --> 18:24.160
That's the principle.

18:41.280 --> 18:49.920
Is this the format? Yeah it depends a little bit on the experiment. So sorry the question is

18:49.920 --> 18:54.640
are there guidelines or standard practices on how you extract that code logic from your original

18:54.640 --> 19:00.080
analysis framework? It depends on the analysis. So the collaboration sorry so I know that for

19:00.080 --> 19:06.160
instance CMS allows you to just use the Rivet classes directly in the original analysis and

19:06.160 --> 19:12.560
it basically becomes a copy-pasting exercise at some level. The people in ATLAS tend to write it from scratch

19:12.560 --> 19:16.560
because they realize it's an independent framework and we can actually cross-check that we haven't

19:16.560 --> 19:21.280
got any bugs and everything and occasionally they find things and are able to fix it before they

19:21.280 --> 19:28.880
actually release the result. So it depends how people want to use it but yeah. That's a question now.

19:28.880 --> 19:34.640
I work at CERN, in IT, in the Open Source Program Office as well. So I have two

19:34.640 --> 19:39.360
questions. One is on analyzing the old data from the data preservation. So there's

19:39.360 --> 19:45.760
all this, like, HEPData preservation move, but there is a little bit of a push also on the

19:45.760 --> 19:53.360
CERN side, so do you work with them, etc. And the second one is, do you work with the

19:53.360 --> 20:00.000
REANA people, with reproducible analysis frameworks, etc. And if the answer is no maybe we can

20:00.000 --> 20:08.400
also have it because I know sometimes connecting it is a bit tricky and yeah that's a good

20:08.400 --> 20:14.000
question. So HEPData, the question is, you know, there are other tools available surrounding that

20:14.000 --> 20:17.840
kind of thing, such as HEPData, I'll explain in a second what that is for the people who don't know,

20:18.400 --> 20:23.600
and whether we engage with these other communities. HEPData is a great tool. I don't have any

20:23.600 --> 20:27.920
time unfortunately in my talk to talk about it. It's essentially an online database where the

20:27.920 --> 20:33.600
people, once they've made a measurement, they use the digitized data, I mean this is all just floating

20:33.600 --> 20:39.200
points ultimately what we measure but hundreds of them and so that people don't have to actually

20:39.200 --> 20:44.160
go to the paper and copy the tables we don't do tables anymore we put a plot and we upload

20:44.720 --> 20:51.040
the digitized record to a central repo essentially which is open source and everyone can go

20:51.040 --> 20:55.840
and download all the particle physics measurements from the past, I don't know, 70 years or so and do

20:55.840 --> 21:00.720
with them whatever they want because they're open. Fantastic tool and we work with the

21:00.720 --> 21:05.840
HEPData developers quite frequently. We sync our repo against that because we pull the

21:05.840 --> 21:11.680
snapshot, for every C++ snippet that encodes the analysis we pull a snapshot of the numerical

21:11.680 --> 21:17.120
data so that we can then superimpose it on plots like this automatically. So I'm happy with this

21:17.200 --> 21:24.800
product we helped some guides were like open source in some previews like let analysis framework

21:24.800 --> 21:29.120
yeah these are really for the experimental data a lot more and I think there's a bit of

21:29.120 --> 21:35.120
overlap but more than that. Yeah, HEPData is more about the numerical points that you want to put

21:35.120 --> 21:41.680
in your plot, the actual values of that. Rivet is more about how did you arrive at that histogram,

21:41.760 --> 21:46.240
what do you have to do in a few moments

21:47.280 --> 21:50.960
go interrupt it by drum and you're not here

21:51.040 --> 22:04.480
you just use a git to store the rivets of it yes get up but yeah

22:04.640 --> 22:13.920
and do you happen to know people actually have a rivet or a gsi found that are already also

22:15.120 --> 22:22.720
Rivet or is it leaving an atmosphere. No, it's all of HEP, we've got all of the LEP stuff, we've got

22:22.720 --> 22:30.240
bass as well rake all of it and we welcome people to contribute more we don't have all of it unfortunately

22:30.320 --> 22:36.320
but we've got a big coverage of stuff, we'd love to have all of it, we don't yet, but

22:37.520 --> 22:42.560
we are very keen for experiments to actually contribute their analysis themselves because they know

22:42.560 --> 22:44.400
best what they've done.

22:52.000 --> 22:54.400
Spread the word spread the word

22:54.400 --> 23:12.000
Thank you

