WEBVTT

00:00.000 --> 00:11.000
Welcome, I'm going to do the first presentation, so I think everything is going to be

00:11.000 --> 00:18.000
in crescendo, but I'm going to present you statistical path coverage, which is a technique

00:18.000 --> 00:23.000
or a method that we started working on in the SIL2LinuxMP project, led by OSADL.

00:23.000 --> 00:32.200
This was a project based on researching Linux for safety, and we started with statistical

00:32.200 --> 00:37.600
path coverage there. It was very focused on safety-related systems, but we think it's interesting

00:37.600 --> 00:43.600
for all types of critical systems, not only safety ones, and it may be interesting for

00:43.600 --> 00:52.000
improving the testing strategy of these systems. So, first of all, let's put it a little

00:52.000 --> 00:57.400
bit in context. The critical systems that we are building today, we say the next generation

00:57.400 --> 01:02.600
critical systems, but we are already building them, are running deep learning algorithms,

01:02.600 --> 01:08.200
we see the trend with the outbreak of AI. Also, they have...

01:08.200 --> 01:24.200
Okay, okay, so louder? Okay, I'm going to start again. So, first of all, the context is that the critical

01:24.200 --> 01:30.200
systems that we are building currently have different requirements: they are running deep learning

01:30.200 --> 01:36.200
algorithms, some of them; security also plays a key role in all these applications, and

01:36.200 --> 01:43.000
they have high performance requirements. Different industries aspire to employ the commercial

01:43.000 --> 01:49.000
off-the-shelf multi-core, the performance provided by commercial off-the-shelf multi-core processors,

01:49.000 --> 01:54.400
but this hardware was designed for maximum average performance and not for critical

01:54.400 --> 02:00.400
systems. For example, if we check it for safety-related systems, right now we don't have certified

02:00.600 --> 02:09.200
multi-core hardware to use. With all these points, in SIL2LinuxMP, we saw that there

02:09.200 --> 02:16.000
was a need for an operating system to run complex algorithms with high performance requirements

02:16.000 --> 02:24.800
and also with security requirements. If we check the Linux kernel, I would say we can see that

02:25.200 --> 02:34.200
it's the king, I would say, of the non-critical domain. It's the leading operating system

02:34.200 --> 02:41.200
in embedded systems, in smartphones, it's almost the only solution in supercomputers, and it

02:41.200 --> 02:47.200
works also in the majority of web servers. There are different reasons for this, but the

02:47.200 --> 02:53.000
main ones are that it has wide hardware support. We have lots of drivers in the Linux kernel, and

02:53.000 --> 02:57.920
this provides all this support. It has really good, or significantly better than all

02:57.920 --> 03:03.480
others, multi-core capabilities, and also security capabilities are important, and that's

03:03.480 --> 03:10.200
why different companies or even governments choose Linux. Also, it has a developed

03:10.200 --> 03:16.760
ecosystem and a wide developer community, like here, right? So the question that

03:16.760 --> 03:23.960
arises here is, can we use Linux for critical systems? Right now, we have thousands of Linux

03:23.960 --> 03:29.400
based computers orbiting the Earth. So different companies and governments rely on Linux to

03:29.400 --> 03:35.560
build their satellites. We can find Linux even in a space rocket, in control units of a space

03:35.560 --> 03:41.000
rocket, and we can find Linux even on other planets, like on Mars. NASA deployed a small

03:41.080 --> 03:48.440
drone, or a small helicopter based on Linux, and a Snapdragon multicore. Also, different

03:48.440 --> 03:56.600
governments and companies rely on Linux for telecommunications, critical servers, and for

03:56.600 --> 04:03.720
instance, banking systems. So can we use it for critical systems? I would say that, yeah, we are

04:03.720 --> 04:12.760
already using it. But how do we test Linux for this? So here, when we started in SIL2LinuxMP, we

04:12.760 --> 04:18.680
identified some problems: we know that testing and analysis of a critical

04:18.680 --> 04:24.520
system is crucial, and that we need to quantify this testing effort. When we are doing testing,

04:24.520 --> 04:30.520
we need to know how much we test. Traditionally, the objective of the testing process was to get

04:30.520 --> 04:37.240
100% test coverage, but we are going to see that in the presentation, sometimes this is not possible,

04:37.240 --> 04:46.520
not feasible, and not even desirable. We don't need 100% test coverage. I'm going to give some

04:46.520 --> 04:51.240
definitions from functional safety. It's only for functional safety, but I think it's interesting

04:51.240 --> 04:59.000
to have these ideas for all critical systems. And when we talk about functional safety, we are talking

04:59.160 --> 05:05.320
about systems with credible risk, or freedom from unacceptable risk. Sometimes we can hear from

05:05.320 --> 05:12.360
some people that safety means systems without any risk, and any developer in this room will say that

05:12.360 --> 05:18.920
no system has no bugs, right? So every system has some risk, but we need to achieve an acceptable

05:18.920 --> 05:25.800
risk. So therefore, in system safety, when we are doing engineering processes, tools, the

05:25.880 --> 05:34.280
objective is to achieve this acceptable risk. For that, we use testing and analysis, and testing helps us

05:34.280 --> 05:40.360
to fix bugs, or mitigate them; also we understand the system better while we are doing this testing,

05:41.080 --> 05:46.600
and therefore, consequently, we need to quantify this testing effort, right? This testing process.

05:46.600 --> 05:50.760
We need to know how much we know about the system, or how tested is it?

05:51.640 --> 05:59.080
To quantify this testing effort, we always talk about test coverage, and there are different

05:59.080 --> 06:05.320
metrics for test coverage. We can find branch coverage, function coverage, line coverage, there are

06:05.320 --> 06:11.720
different types. For the SIL2LinuxMP project, as the objective was SIL 2, we focused on path coverage,

06:11.720 --> 06:18.120
but that's not an important point here, and the presentation is about path coverage, but it's not

06:18.120 --> 06:26.360
the main idea. The important note here is that, for instance, in IEC 61508, we can consider that

06:26.360 --> 06:32.520
it is the generic standard for safety-related systems. It notes that where 100% coverage cannot be achieved,

06:33.160 --> 06:39.240
an appropriate explanation should be given. This is one point that we need to remember throughout

06:39.240 --> 06:46.520
the presentation. So let's try to get 100% test coverage, right? We take the Linux

06:46.520 --> 06:51.880
kernel, we start like that, let's download the Linux kernel and check it. Linux has

06:51.880 --> 06:57.400
over 27 million lines of code, so it's a huge project. It is treated from this

06:58.680 --> 07:05.480
almost 13 million lines of code, or maybe by now it's over 30, I don't know. Half of it

07:05.480 --> 07:11.560
is for drivers, it's for hardware, so we are not going to use all these 27 million lines of code,

07:11.560 --> 07:17.000
right? Also, our application is going to use only part of this kernel, only some features that

07:17.000 --> 07:24.760
we need only for our application, so not all of it is going to be exercised. It's also important to know

07:24.760 --> 07:29.400
that Linux is continuously evolving. And these critical systems need to be

07:29.400 --> 07:34.680
updated. Traditionally we had critical systems and we didn't update them, but now it's

07:34.680 --> 07:41.240
going to be important to update the system for the security requirements. We have a six

07:41.240 --> 07:46.840
patches per hour rate of update, and Linux is being developed every day, all year,

07:46.840 --> 07:54.840
weekdays, weekends, every day. So what is going to be exercised in the Linux kernel for

07:54.840 --> 08:01.400
our critical application? We started like that; it was like, okay, just take a static

08:01.400 --> 08:10.120
code analysis and see what is going to be exercised. We got this huge call graph, enormous call graphs,

08:11.000 --> 08:15.720
they were not useful for anything. You cannot deal with that, it's enormous, it makes no sense.

08:16.920 --> 08:23.240
Moreover, these call graphs only give partial results because they cannot resolve some things that are

08:23.240 --> 08:28.680
selected at runtime. They cannot resolve indirect calls, they cannot resolve aliases, they include

08:28.680 --> 08:33.560
dead code also, which is not going to be used and therefore is not going to be tested, and they

08:33.560 --> 08:41.000
cannot resolve assembly code. To this problem, we need to add another feature of the Linux

08:41.000 --> 08:48.520
kernel, which is non-determinism. The Linux kernel is non-deterministic; this means that the same application

08:48.520 --> 08:55.560
with the same inputs may follow different execution paths. So traditionally, we have deterministic

08:55.560 --> 09:01.160
systems. This means that for the same input, the function sequence that was

09:01.160 --> 09:07.160
executed was always the same and therefore it was easier to test. But right now, the Linux kernel

09:07.160 --> 09:13.880
is non-deterministic and this is due to the global state of the system. It will select which

09:13.880 --> 09:23.160
execution path to follow depending on the global state of the system. So as an example, if we have an

09:23.160 --> 09:29.320
application that is writing to /dev/null, for instance, we write some string to /dev/null, we have

09:29.400 --> 09:38.280
a really clean and nice looking execution trace like this. Very short, taking into account

09:38.280 --> 09:44.040
that the Linux kernel is designed or developed from a performance point of view, it makes perfect sense. But

09:44.040 --> 09:51.640
sometimes there will be asynchronous events like RCU that we can find almost anywhere in the

09:51.640 --> 09:58.600
execution trace. So for non-determinism, it's important to take

09:58.600 --> 10:03.880
into account that it's not possible to force the execution of a specific path because it

10:03.880 --> 10:08.280
doesn't rely only on the input. We cannot force the execution because, for a given input,

10:08.280 --> 10:14.680
there can be many execution paths. It relies on the state of the system. And this

10:14.680 --> 10:19.960
state is not generally reproducible due to the complexity. There is an enormous amount of asynchronous

10:19.960 --> 10:28.040
events and concurrency going on in hardware and software. So with all this, we identified the

10:28.040 --> 10:33.400
main issues we will have in testing. We see that one iteration of testing is not enough.

10:33.400 --> 10:37.880
And this is why many kernel developers struggle finding some bugs. They know that

10:38.840 --> 10:45.880
sometimes a bug is happening, but they are not able to force the execution of that bug. And they

10:45.880 --> 10:54.360
find it once per week and something is going on. Therefore, we need to execute all the tests

10:54.360 --> 11:00.040
repeatedly. And one conclusion is that we need continuous testing, continuous testing is okay.

11:01.400 --> 11:05.560
But the problem is that we don't know which traces can be executed,

11:05.560 --> 11:12.360
which execution paths can be executed. So the question that arises here is, which traces need to

11:12.680 --> 11:19.000
be tested, how many traces do we need to test? And how do we quantify this testing effort?

11:20.920 --> 11:29.320
And that's why we started thinking about it and said, let's go to a statistical world. Let's use probabilities

11:29.320 --> 11:36.360
to do it. So we are going to show statistical path coverage, which is based on statistical analysis

11:36.360 --> 11:41.880
to do test quantification and also to be able to estimate the residual risk in software.

11:43.080 --> 11:49.800
The approach is based on probabilities and not only on possibilities;

11:50.680 --> 11:57.080
we are going to focus on the credible risk of our critical system and not all risks,

11:57.720 --> 12:03.240
because it makes no sense. We are going to focus on the credible risk of our system.

12:04.760 --> 12:09.400
And therefore, for doing that, we are going to record the behavior of our system. With this

12:09.400 --> 12:14.120
recording, we are going to get some data, on which we can perform some statistical analysis.

12:14.840 --> 12:19.960
And finally, we can quantify the testing by software execution likeliness.

12:23.880 --> 12:31.080
Statistical path coverage we can divide into three phases: collect data, model data and risk estimation.

12:31.400 --> 12:41.080
For the first one, data collection, we built a tool called DB4SIL2

12:41.880 --> 12:48.520
that is based on dynamic data collection, that is, ftrace. It's well known among kernel developers.

12:48.520 --> 12:54.440
It's included in the Linux kernel and it allows recording the execution that is going on in the Linux kernel.

12:55.400 --> 12:59.960
And this tool is publicly available. It's in our first repository on GitLab.

13:00.920 --> 13:07.320
So you can view it or you can test it if you want. And basically what it does, it has a client and a server.

13:07.960 --> 13:11.560
On the client we have our critical system, which is recording what is going on.

13:12.200 --> 13:17.080
Once we have all these recordings, we send it to the server and we post-process them.

13:18.040 --> 13:21.640
In this post-processing, we identify the system calls right now.

13:21.640 --> 13:24.920
This is not complete, but as a prototype, it works well.

13:25.480 --> 13:28.520
Because we consider the system calls the entry points into the kernel.

13:31.400 --> 13:39.240
We do an analysis of independence between system calls and we check that system calls are independent.

13:39.800 --> 13:43.480
And we calculate the MD5 hash of these execution traces.

13:44.360 --> 13:50.040
Why do we calculate the hash? Because this allows us to do the analysis much easier.

13:50.040 --> 13:54.360
We can see how each sequence is called, the frequency of each system call.

13:55.560 --> 14:01.000
To do the statistical analysis later, it's much easier. So we deal with MD5.
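
NOTE
Editor's sketch, not part of the talk. It illustrates the idea of hashing each recorded
execution trace with MD5 so that identical paths can be counted cheaply. The function
names in the sample traces are hypothetical, and this is not the code of the speaker's tool.
```python
import hashlib
from collections import Counter
def trace_hash(functions):
    """Return the MD5 hex digest of one system call's kernel function sequence."""
    return hashlib.md5("\n".join(functions).encode("utf-8")).hexdigest()
def trace_frequencies(traces):
    """Count how often each distinct execution path (by hash) was observed."""
    return Counter(trace_hash(t) for t in traces)
# Hypothetical recordings of a write to /dev/null: a common path seen twice, a rare path once.
common = ["ksys_write", "vfs_write", "write_null"]
rare = ["ksys_write", "vfs_write", "async_event_example", "write_null"]
freq = trace_frequencies([common, rare, common])
print(freq.most_common())
```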

14:03.480 --> 14:08.120
We are also working on a graphical user interface for the tool, to make it much easier.

14:08.840 --> 14:11.800
This is not yet publicly available, but I think it is nice to have it.

14:12.520 --> 14:19.240
There we can show different data. In this data we can show, for instance, the sequence diagram.

14:19.240 --> 14:23.560
The execution sequence diagrams by system calls. So we have the system calls.

14:23.560 --> 14:27.240
In black lines we find the common one, the one that I showed you before for instance.

14:28.280 --> 14:32.840
And in red lines, it's around here, I don't know if the color is the best one,

14:33.800 --> 14:39.400
we identify the rare paths that have happened and when they happened, with which frequency.

14:41.400 --> 14:48.760
Also we are able to plot the histograms. So we see how many times each trace has been executed within the testing process.

14:49.960 --> 14:55.480
And we get an idea that normally, as Linux is developed from an average performance point of view.

14:55.480 --> 14:59.800
The common path is executed a lot of times, but we have some cases.

15:00.520 --> 15:05.160
Sometimes there are some rare traces that are happening and there is a huge number of these rare traces.

15:08.440 --> 15:14.280
So once we have all this data, we can continue to phase two, that is, modeling the data.

15:15.480 --> 15:22.440
And for this data model we chose to have a parametric approach. Why a parametric approach?

15:22.440 --> 15:28.840
Because if we get a model with fixed parameters that describe the behavior of the system,

15:29.480 --> 15:32.840
we are able to extrapolate from the model we have.

15:33.800 --> 15:41.640
So if we get this model, we can extrapolate to say, okay, I got the data during 10,000 hours of testing.

15:42.360 --> 15:43.720
Let's go to infinity, right?

15:46.520 --> 15:50.920
So for these events, I'm not going to go deep into the statistics. If you want, you

15:50.920 --> 15:57.880
can ask me or send me an email, but we just chose, for rare events, the

15:57.880 --> 16:03.160
Poisson distribution, but for that we need to focus on the execution traces that happened

16:03.160 --> 16:10.760
rarely. To select these execution traces, the rare ones, we use entropy theory, where basically

16:10.760 --> 16:16.920
we divide the traces into two groups with the same amount of information. And we can focus on them.

16:17.880 --> 16:22.760
So here, in the plot, we show the number of traces that have appeared, the rare traces that have

16:22.760 --> 16:27.080
appeared during the testing process, during the test cycles that we have managed,

16:27.080 --> 16:32.760
test cycles or test campaigns. And we see that while we are testing the system,

16:32.760 --> 16:37.720
the number of these rare traces is decreasing; it makes sense, right? Because we know more

16:37.720 --> 16:44.360
about our system. Therefore, if we are able to model it, we get a model that is decreasing.

16:44.760 --> 16:52.040
And afterwards we can extrapolate this model and think, okay, instead of 250 test

16:52.040 --> 16:58.360
campaigns or test cycles, if we go to infinity, how many traces would appear? And just by doing an

16:58.360 --> 17:05.720
improper integral, calculating the area, we can do this estimation. So in this use case that we

17:05.720 --> 17:14.120
had, which was an autonomous emergency braking system, we got a test coverage of 85%. So 15% was

17:14.120 --> 17:25.640
not tested in 10,000 hours of testing process. Is this 15% an acceptable risk or not? That's the

17:25.640 --> 17:32.600
question, right? We have 85, but what does 85 mean in this case? We need to know if it is an acceptable risk

17:32.600 --> 17:38.440
or not. So for that we need to estimate the risk. And if you remember, we were saying that where

17:38.520 --> 17:44.200
100% coverage cannot be achieved, an appropriate explanation should be given. So let's calculate

17:44.200 --> 17:50.360
the risk as an explanation. And therefore, we can see if it is freedom from unacceptable risk or not.
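
NOTE
Editor's sketch, not part of the talk. It illustrates the extrapolation by improper
integral described in the cues above. The talk does not give the model's functional
form, so an exponential decay a*exp(-b*t) for the rate of newly seen rare traces is
an assumption here, and the parameter values are hypothetical.
```python
import math
def expected_new_traces(a, b, t_end):
    """Integral of a*exp(-b*t) from 0 to t_end; t_end may be math.inf (b must be > 0)."""
    return a / b * (1.0 - math.exp(-b * t_end))
def estimated_coverage(a, b, cycles_done):
    """Share of the traces expected at infinity that were already seen after cycles_done."""
    return expected_new_traces(a, b, cycles_done) / expected_new_traces(a, b, math.inf)
# Hypothetical fitted parameters and 250 completed test cycles.
coverage = estimated_coverage(a=2.0, b=0.01, cycles_done=250)
print(f"estimated coverage: {coverage:.3f}")
```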

17:51.880 --> 17:57.880
Risk can be calculated as probability times severity. So let's go to the worst-case scenario, where

17:57.880 --> 18:04.200
the severity is one because we consider that the execution of an untested trace is unacceptable,

18:04.280 --> 18:10.200
or it's catastrophic, right? We need to calculate the probability for that. And to calculate

18:10.200 --> 18:15.720
the probability of an event that didn't happen, we can use simple Good-Turing, which is

18:15.720 --> 18:22.600
known in statistics, and we calculate that probability. So now we know the probability

18:22.600 --> 18:33.080
of executing one of the unknown traces, right? And in hardware, it is common to find in different

18:33.080 --> 18:39.560
standards or in manuals what is known as probability of failure per hour. But in software, it's not that common.

18:39.560 --> 18:46.920
There are some standards, but software developers don't talk about that much, right?

18:47.560 --> 18:51.800
And I think that for this case, it's interesting to have these ranges. So we can take these ranges

18:51.800 --> 18:55.480
and say, okay, this is acceptable or it is not acceptable.
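
NOTE
Editor's sketch, not part of the talk. It illustrates the simple Good-Turing idea
mentioned above: the probability mass of unseen events is estimated as the number of
events seen exactly once divided by the total number of observations. The counts are
hypothetical, and the severity of 1 follows the worst case assumed in the talk.
```python
from collections import Counter
def unseen_probability(trace_counts):
    """Good-Turing estimate of the probability that the next trace is a never-seen one."""
    n_total = sum(trace_counts.values())
    n_once = sum(1 for c in trace_counts.values() if c == 1)
    return n_once / n_total
# Hypothetical histogram: one dominant path, one occasional path, two paths seen once.
counts = Counter({"path_a": 9990, "path_b": 8, "path_c": 1, "path_d": 1})
p_untested = unseen_probability(counts)
risk = p_untested * 1.0  # risk = probability times severity, worst-case severity of 1
print(p_untested, risk)
```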

18:56.360 --> 19:03.880
Furthermore, this probability can be improved, and this can be improved by capitalizing on the complexity

19:03.880 --> 19:11.080
of the systems. So if we know that the Linux kernel is not deterministic, we can use

19:11.080 --> 19:18.040
redundant architectures that are well known in critical systems and safety systems. And let's say,

19:18.040 --> 19:24.760
okay, execute the application at the same time in two containers, for instance, and the probability that

19:24.840 --> 19:30.280
two containers execute an untested trace at the same time, that probability will decrease.

19:30.280 --> 19:36.040
So we can also reduce this probability by using or capitalizing on the non-determinism

19:36.040 --> 19:41.400
of the Linux kernel, using redundant architectures. Instead of two channels, if we have three

19:41.400 --> 19:51.960
channels or four channels, this risk will also keep decreasing. So just to end, some conclusions about

19:52.840 --> 20:00.920
the presentation and the method: we proposed this statistical method to estimate the number of

20:00.920 --> 20:05.800
traces that we are going to exercise and have a relevant probability of being executed.
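
NOTE
Editor's sketch, not part of the talk. It illustrates the redundant-architecture
argument made a few cues earlier: if several channels run the same application, the
chance that all of them hit an untested trace at the same time shrinks. Statistical
independence between channels is an assumption made here, not something the talk
establishes, and the per-channel probability is hypothetical.
```python
def joint_untested_probability(p_single, channels):
    """Probability that all channels execute an untested trace together, if independent."""
    return p_single ** channels
# Hypothetical per-channel probability, compared across 1 to 4 redundant channels.
p_single = 1e-3
for n in (1, 2, 3, 4):
    print(n, joint_untested_probability(p_single, n))
```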

20:06.760 --> 20:13.240
We can estimate this execution probability of what we haven't tested and therefore we can calculate

20:13.240 --> 20:25.560
also the residual risk of these untested traces. I would say that this can be a problem or also a

20:25.560 --> 20:31.000
possibility: if we capitalize on complexity and we don't focus on all possible risks but

20:31.000 --> 20:37.240
on credible risk, not on all possible paths but on paths with a probability of being executed,

20:38.200 --> 20:44.680
we can shift to a probability world and capitalize on that. So it will be an opportunity and not only

20:44.680 --> 20:52.360
a problem, and we are open to other statistics. The idea was not to select a statistic, which one to

20:52.360 --> 20:59.480
use. The idea was to see if it is feasible to use a statistical world for this and we think the

20:59.480 --> 21:06.200
objective was achieved and we know that the technique is only possible if continuous monitoring

21:06.200 --> 21:11.720
is done, but I think for next-generation critical systems, continuous monitoring

21:11.720 --> 21:19.480
would be mandatory anyway. And it would be great to have additional expert review or certification

21:19.480 --> 21:27.160
authority review. As future work, we want to extend the analysis, not only by system call, but we want

21:27.160 --> 21:33.640
to include all the calls that are in the Linux kernel. We know that this is limited. We want to

21:33.640 --> 21:41.160
publish the graphical interface in the public repository as soon as it's working adequately, and also

21:41.160 --> 21:47.160
we know that these systems are going to be updated continuously, and we know that our statistic has

21:47.160 --> 21:54.200
some parameters that will detect if an update changes the behavior of the system. Right now it's working

21:54.200 --> 22:01.400
correctly but we need further analysis on this also and this is the next big step that we are doing

22:01.400 --> 22:09.640
in this statistical path coverage. And this is an in-context technique that depends on the use case.

22:09.640 --> 22:15.080
It would be great to have different use cases to analyze, and continuous monitoring to do it.

22:17.320 --> 22:22.360
So thank you very much. If anyone has any questions, you can ask them right now or you can send

22:22.360 --> 22:32.360
me an email if you want.

22:32.360 --> 22:37.880
I would like to ask a question about the unique traces and experiments as a whole. Did you have

22:37.880 --> 22:44.920
updates in the kernel version or user space during these 10,000 hours? No, in these 10,000 hours it was

22:45.080 --> 22:53.800
a question. Okay. So he asked, regarding the results, whether the system was being updated, and

22:53.800 --> 23:00.200
it was not; it was a static kernel that was not being updated. So the update part is being

23:00.200 --> 23:05.800
tested right now, we are dealing with that. So all these statistical models have some parameters that

23:05.800 --> 23:14.840
will detect if a change in the behavior of the system happens. Right now, in these results,

23:14.840 --> 23:27.160
it was not. Okay, thank you. All clear? Okay, great.

23:27.160 --> 23:35.480
What is the question? Do you have a target for the reduction in the number of tests or hours of testing

23:35.480 --> 23:43.960
when you start, to expect that to save 20 percent of the time? No, no, not about the percentage of

23:43.960 --> 23:49.800
the test coverage, no, about the residual risk. That's why I talk about probability of

23:49.800 --> 23:56.440
failure per hour. There are some in hardware, well known; IEC 61508, which is the generic

23:56.440 --> 24:04.120
standard, shows some values, and we did this comparison with these values, and using redundant

24:04.120 --> 24:11.320
architecture we can achieve them in this case. But there are no well-known ranges selected right now.

24:14.920 --> 24:24.120
You gave the example and showed us that for the test that you did, you got about 85 percent

24:24.120 --> 24:29.480
test coverage and you asked if that was enough. Do you have any metrics to find out whether or not

24:29.480 --> 24:35.560
that is enough, did you get any results on that? Can you repeat it? Do you have any anecdotes of whether

24:35.560 --> 24:41.720
or not 85 percent is a good number for testing in this kernel? Or is this enough in use cases

24:42.600 --> 24:48.760
Okay, so traditionally, if you go to a tradition... I'm sorry, so he asked if the 85 percent

24:48.760 --> 24:57.080
test coverage is enough or not. So traditionally one will say that if you have not 100 percent

24:57.080 --> 25:03.000
it's not enough. I say that's why we do the risk estimation, and that's the important thing.

25:03.000 --> 25:10.360
Test coverage can give you some feeling of whether it is enough or not, but the important

25:10.360 --> 25:18.680
thing is the risk. So I can say that it's 99, but if this one percent is catastrophic, 99

25:18.680 --> 25:25.800
it's not enough. If this 10 percent is non-catastrophic, it's enough. So it's not about getting

25:25.800 --> 25:32.600
the number, and this happens to every developer. When you are testing, the objective of testing

25:32.600 --> 25:39.800
it's to fix it and not to get 100 percent, but sometimes you are doing the test and you need to

25:39.800 --> 25:44.840
be objective on that. You are doing the test to get the number and that doesn't make sense, right?

25:45.880 --> 25:52.840
It's not about getting the number, it's about making the system safe. Thank you.

25:53.880 --> 26:00.920
Yeah? You talked about needing continuous monitoring; what do you mean by continuous monitoring

26:01.160 --> 26:07.880
and for what purpose? Okay, so he's asking about continuous monitoring.

26:10.840 --> 26:16.040
So continuous monitoring here should be, I think, decided in the context of the use case;

26:16.040 --> 26:20.760
we need to adapt that because the tests will be designed depending on the use case,

26:21.720 --> 26:28.360
but like in real-time Linux they are using continuous monitoring to see how

26:28.360 --> 26:34.200
real-time Linux is working; we will also need to do that for these systems.

26:34.200 --> 26:42.440
And before, for instance, we deploy an update to the critical system that is on the road or in the

26:42.440 --> 26:47.720
street or in an industry, we need to do this continuous monitoring in our labs. So that's why

26:47.720 --> 26:52.360
we need these systems running in our labs, being tested continuously and checking all these

26:52.360 --> 27:03.720
topics, whether it's going to be okay or not okay. That's it. Thank you very much.

