WEBVTT

00:00.000 --> 00:08.320
All right, okay, I don't want to swallow it, all right.

00:08.320 --> 00:13.520
So we are a young startup, like five months out of stealth, I am Rene, and this is my

00:13.520 --> 00:21.320
colleague, Guillaume, and yeah, we want to show you what we've been cooking.

00:21.320 --> 00:29.920
So before we get started, let's have a look at the AI game from a special point of view.

00:29.920 --> 00:33.680
We have the world divided in two sections, basically.

00:33.680 --> 00:39.320
We have training, we have the training guy on one hand, and we have the inference guy

00:39.320 --> 00:44.040
on the other hand, and introducing our players.

00:44.040 --> 00:50.200
We have training, training is typically done as a research endeavor.

00:50.200 --> 00:56.080
You have one of something, which means basically you train your next best greatest model

00:56.080 --> 00:58.640
of all time.

00:58.640 --> 01:04.760
More is better, you want bigger models, you want more modalities, and everything, and yeah,

01:04.760 --> 01:12.080
obviously you want to iterate fast, and yeah, you love Python.

01:12.080 --> 01:16.720
On the other hand, we have inference.

01:16.720 --> 01:23.640
Inference is run in production, so this guy operates in production mode, running thousands

01:23.640 --> 01:27.280
and millions of models doing billions of requests.

01:27.360 --> 01:33.600
In this scenario, less is actually better, you want to consume less resources per model,

01:33.600 --> 01:41.000
and smaller containers, and everything, you want predictable latency, and Python there

01:41.000 --> 01:46.680
is, yeah, this is another story, yeah.

01:46.680 --> 01:54.640
And so when you write the framework, any AI framework, you should obviously prioritize

01:54.720 --> 02:01.280
training, right, because when you have a training framework, you get inference for free,

02:01.280 --> 02:08.720
and this is what most of the frameworks have been doing, but the experience is often like

02:08.720 --> 02:10.080
this.

02:10.080 --> 02:18.760
So the Python ecosystem is sometimes not the most friendly one to get started running

02:18.760 --> 02:19.760
inference.

02:19.760 --> 02:27.080
So when we talk to potential customers, they are typically AI-flavored backend engineers,

02:27.080 --> 02:34.080
and they have very strong demands, like accelerator agnosticism.

02:34.080 --> 02:40.160
They don't want to run only on, or be forced to only run on, Nvidia.

02:40.160 --> 02:48.240
They want compiled models with static typing, cross compiling, runtime sandboxing,

02:48.240 --> 02:49.720
et cetera.

02:49.720 --> 02:54.440
Parallel IO, Async IO, Kubernetes, and all the good stuff.

02:54.440 --> 02:59.440
And this is why we created ZML.

02:59.440 --> 03:03.840
So how can we best describe ZML?

03:03.840 --> 03:12.800
It's resting on four pillars: Zig as a programming language, producing MLIR, using

03:12.800 --> 03:20.280
OpenXLA to compile the models to the accelerators, and yes, we use Bazel to orchestrate all of

03:20.280 --> 03:22.920
this.

03:22.920 --> 03:31.160
So we use Zig as a front end, so you write your model source code in Zig, we will show some

03:31.160 --> 03:37.660
examples of that later, and your Zig program then produces MLIR, which is passed to

03:37.700 --> 03:46.700
OpenXLA, which then targets your target platform, and you have a handy Bazel interface to do

03:46.700 --> 03:48.900
all that.

03:48.900 --> 03:54.500
Speaking of Zig, our framework is for inference only.

03:54.500 --> 03:59.580
At the moment, we don't even bother with training because the inference demand is so high,

03:59.580 --> 04:03.780
so it's actually okay to specialize on that.

04:03.780 --> 04:11.620
So as I said, you write your models in Zig, and what we achieved was zero Python in our

04:11.620 --> 04:12.620
stack.

04:12.620 --> 04:19.060
We can still load PyTorch models, we have a Zig implementation of that, so that's good.

04:19.060 --> 04:26.380
And the focus is on producing readable, maintainable, modular code, that's statically compiled,

04:26.380 --> 04:30.580
and where it feels more like, if you are a systems engineer, it feels more like you're

04:30.580 --> 04:33.340
doing proper programming, right?

04:33.340 --> 04:37.460
Nothing against Python though, so.

04:37.460 --> 04:39.460
So with that, I hand over.

04:39.460 --> 04:40.460
Okay.

04:40.460 --> 04:41.460
Okay.

04:41.460 --> 04:52.500
So this is like the main.zig file, I'm not sure it's very readable anyway, so the main

04:52.500 --> 04:57.060
point of this slide is to show that it looks like regular Zig code, so we have

04:57.060 --> 05:03.020
fine-grained control over allocation, but we also bring a few modern features, such as

05:03.060 --> 05:09.460
asynchronism, so typically what we do here is that we first open the PyTorch files, we

05:09.460 --> 05:15.060
extract all the shapes, but we don't load the weights yet, then we kick off the compilation

05:15.060 --> 05:23.020
of the model, based on the shapes, and yeah, asynchronously we also load the weights onto

05:23.020 --> 05:26.900
the device.

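NOTE
Editorial sketch, not from the talk: one way to picture the start-up sequence described above (read the shapes first, then compile while the weights load).
It is written in Python purely for illustration. The names read_shapes, compile_model and load_weights, the file name and the tensor names are hypothetical; they are not ZML's API, which is written in Zig.
```python
import asyncio
# Hypothetical stand-ins for illustration; none of these names come from ZML.
async def read_shapes(path):
    # Pretend to read only checkpoint metadata (tensor names and shapes), no tensor data.
    await asyncio.sleep(0)
    return {"fc1.weight": (128, 784), "fc1.bias": (128,)}
async def compile_model(shapes):
    # Compilation needs shapes only, so it can start before any weight is loaded.
    await asyncio.sleep(0.01)
    return "executable for %d tensors" % len(shapes)
async def load_weights(path, shapes):
    # Stand-in for copying tensors to the device; here we only count elements.
    await asyncio.sleep(0.01)
    sizes = {}
    for name, shape in shapes.items():
        count = 1
        for dim in shape:
            count *= dim
        sizes[name] = count
    return sizes
async def main(path):
    shapes = await read_shapes(path)
    # Run compilation and weight loading concurrently and wait for both.
    return await asyncio.gather(compile_model(shapes), load_weights(path, shapes))
exe, weights = asyncio.run(main("model.pt"))
print(exe, weights)
```
Under these assumptions, the point is that compilation depends only on shapes, so it can overlap with the slower weight transfer.
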
05:26.900 --> 05:32.140
But now let's look a bit more at what the model code looks like, so the idea is to make

05:32.140 --> 05:38.140
something which is familiar, if you're coming from PyTorch or other high-level frameworks,

05:38.140 --> 05:42.420
and even though it's still Zig, statically compiled and so on, we made it so that

05:42.420 --> 05:49.740
you don't need to handle allocation, so it feels very high-level, but we still try

05:49.740 --> 05:57.220
to add a few goodies to make you want to write this code, and one

05:57.220 --> 06:01.300
of them is axis tagging, basically what we do is we give names to the different

06:01.300 --> 06:08.260
axes of a tensor, and we can propagate them through the different operations, like matrix multiplication

06:08.260 --> 06:13.780
and so on, and it means in practice it simplifies quite a lot of the model code because

06:13.780 --> 06:20.660
you need fewer transpositions, especially around matrix multiplication, so to be concrete,

06:20.660 --> 06:24.740
like if you have an image tensor, so it's a tensor with three axes, you have the width,

06:24.740 --> 06:33.700
the height, and the channels, so usually you would refer to the width by its offset, so offset

06:33.700 --> 06:37.780
zero, and you have to remember in your code that offset zero is the width, and sometimes it becomes

06:37.780 --> 06:43.780
offset one or offset two, depending on what you do, but if you give it names, then even if you

06:43.780 --> 06:49.940
transpose, you can just use the name to refer to the width and always be consistent

06:49.940 --> 06:55.460
with the rest of the program, it also means you don't really need to write transpose one zero two,

06:55.460 --> 07:01.540
you just say I want to transpose to the height and the width, then the channels, and it gives you

07:01.540 --> 07:08.980
that, and for matrix multiplication, here we have A and B, which are two matrices, but they are

07:08.980 --> 07:14.580
not in the textbook layout for matrix multiplication, so we cannot just use like the

07:14.580 --> 07:22.100
matmul operator, which you usually find in frameworks, so what we do is we give them names,

07:22.100 --> 07:29.140
and since A and B both have an axis named K, we can say I want to multiply A with B and contract

07:29.140 --> 07:36.340
over the K axis, and it just does what you want, and it scales particularly well to even more

07:36.420 --> 07:45.380
complicated operations like self-attention, here we just say we want to multiply the queries

07:45.380 --> 07:52.820
with the keys, over the axis named head dimension, then we want to compute the softmax over

07:52.820 --> 07:58.340
all the keys, and then we aggregate all the values, weighted by the attention weights,

07:59.380 --> 08:05.860
and if you look up this code in other frameworks, you have a lot of transposes everywhere to make

08:05.860 --> 08:11.780
sure you can do the matmuls, here that just goes away, if you want to learn more,

08:12.500 --> 08:19.860
you can check out the docs, we have tutorials, built with the open source tool,

08:19.860 --> 08:25.780
Zine, and now I'm going to give the mic back to Rene for the rest of the presentation.

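NOTE
Editorial sketch, not from the talk: a toy illustration of the axis tagging idea described above, where axes are addressed by name for both transposition and contraction.
It is written in Python for illustration only. The Tagged class and its transpose and dot methods are hypothetical and limited to 2-D matrices; they are not ZML's actual API.
```python
class Tagged:
    """A 2-D matrix whose axes carry names (hypothetical helper, not ZML's API)."""
    def __init__(self, rows, names):
        assert len(names) == 2 and len(set(names)) == 2
        self.rows = [list(r) for r in rows]
        self.names = tuple(names)
    def transpose(self, *names):
        # Reorder axes by name instead of by position.
        assert sorted(names) == sorted(self.names)
        if tuple(names) == self.names:
            return Tagged(self.rows, self.names)
        return Tagged([list(col) for col in zip(*self.rows)], names)
    def dot(self, other, axis):
        # Contract over the axis called `axis`, wherever it sits in each operand.
        (keep_a,) = [n for n in self.names if n != axis]
        (keep_b,) = [n for n in other.names if n != axis]
        a = self.transpose(keep_a, axis).rows   # rows indexed by keep_a, columns by axis
        b = other.transpose(keep_b, axis).rows  # rows indexed by keep_b, columns by axis
        out = [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]
        return Tagged(out, (keep_a, keep_b))
# A is stored as (k, m) and B as (n, k): neither is in the textbook (m, k) x (k, n) layout.
A = Tagged([[1, 2], [3, 4], [5, 6]], ("k", "m"))
B = Tagged([[1, 0, 1], [0, 1, 0]], ("n", "k"))
C = A.dot(B, "k")
print(C.names, C.rows)
```
The caller never writes a positional transpose; naming the contracted axis is enough, and the result keeps the remaining names ("m", "n").
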
08:26.260 --> 08:40.100
Thank you, so that was Zig, let's come to OpenXLA, well OpenXLA is a huge ecosystem,

08:40.100 --> 08:46.180
and it's also backed by the who's who of AI, what else can you say?

08:46.180 --> 09:02.020
Yeah, so, in combination with the produced MLIR, you can use OpenXLA, or ZML uses

09:02.020 --> 09:11.060
OpenXLA to produce highly optimized code for your target, which could be an Nvidia GPU,

09:11.060 --> 09:18.900
it could be a TPU, it could be an AMD GPU, yeah, so it supports the important things,

09:18.900 --> 09:25.140
like kernel fusion, collect memory allocations, it's all highly optimized, auto tuning,

09:27.700 --> 09:36.340
and it produces very mature and stable MLIR, so this is not experimental code, this is

09:36.980 --> 09:44.100
like industry grade MLIR, you see a picture of it here, actually you don't see it,

09:44.100 --> 09:50.740
but it will be in the slides, you can download the PDF, it's very colorful and very nice,

09:50.740 --> 09:59.620
and very professional, yeah, and we also add Bazel to the mix, it's like the user interface

09:59.700 --> 10:07.300
for you on the command line, and you can do some amazing things, for example cross compiling,

10:07.300 --> 10:13.140
this is something that Zig does out of the box, but Bazel also supports it, and we pull in so many

10:13.140 --> 10:23.540
third-party libraries, so the whole cross compilation story is covered by this, and you can do it

10:23.620 --> 10:29.460
from Bazel, so that means if for example, here you're on a MacBook, you can compile your

10:29.460 --> 10:39.700
model, cross compile it for a Linux AMD 64 server, and then just copy to the server and it will run.

10:41.860 --> 10:51.380
We do an awful lot of runtime trimming and sandboxing, so some of those frameworks like

10:51.380 --> 10:59.060
CUDA, or the runtime you need for ROCm, they can get pretty large, and especially for the AMD

10:59.060 --> 11:08.340
ecosystem, we managed to reduce that by roughly 90%, so we just take out only the required shared

11:08.340 --> 11:15.940
object files and whatever is needed, and bundle it together with your executable, so you can

11:15.940 --> 11:20.820
create self-deployable archives, like I just said, even with cross compiling, you can for example

11:20.820 --> 11:26.500
create a tar archive from any machine, that you then just copy to your server,

11:26.500 --> 11:36.020
and run the executable, and it will start doing what you want. Obviously, we can produce OCI images

11:36.020 --> 11:50.020
with Bazel, and yeah, and ready for Kubernetes deployments, and speaking about CUDA, yeah, all of the

11:50.100 --> 12:00.260
things I just said is probably well illustrated by this meme here. There are many such cases where

12:01.300 --> 12:07.620
you get version incompatibilities, and you have your server or your machine, you installed the

12:07.940 --> 12:18.580
CUDA driver, and then the user-space libraries, your TensorFlow, your PyTorch stack, they

12:18.580 --> 12:24.180
have a conflict, and it's sometimes really, really painful, and this is what we basically

12:25.620 --> 12:34.660
get rid of with ZML by doing our sandboxing, so actually you don't need to do any provisioning

12:34.820 --> 12:40.340
for ZML models, you just need to copy them, so it's just a deploy stage, there's no special

12:40.340 --> 12:48.180
provisioning, no special setup on the server is required, and this is what it looks like, okay,

12:48.980 --> 12:55.620
I'm not sure if you can read it, but it's in the slides anyway, basically we show three

12:56.420 --> 13:00.180
Bazel command line examples here. The first one just says bazel run,

13:01.140 --> 13:11.780
optimize MNIST, and produce an executable that works with the CUDA runtime. The second one is a

13:11.780 --> 13:20.420
bit more involved, we pass to the bazel build command that we want to create an archive, which means

13:20.500 --> 13:32.820
a tar archive that supports the ROCm platform for AMD, and also the Linux AMD64

13:32.820 --> 13:40.260
host architecture, and you can run this from any machine, you can run it from an arm or whatever

13:40.260 --> 13:46.900
M3 MacBook, and it will still cross compile everything, and go from your development machine,

13:46.900 --> 13:54.340
just copy the stuff over to your server, and it will work. Copying stuff over, why would you want

13:54.340 --> 13:59.140
to do that if you can just push an image, so that's the third example, you just say bazel run,

13:59.140 --> 14:07.140
MNIST push for CUDA, and for TPU, so what you get is a container, you can pull on a machine that has

14:08.420 --> 14:14.740
an Nvidia GPU, but you can also pull it on a machine that has a TPU inside, and the

14:14.740 --> 14:21.460
container will auto detect and start up and run just fine, all with just one Bazel command.

14:24.100 --> 14:29.300
Yeah, so speaking of open source, we are on GitHub, very easy to find, ZML,

14:29.300 --> 14:37.860
ZML, if you want to check out the code, please do so.

14:38.260 --> 14:47.380
And something we are quite proud of is that Yann LeCun, who is also often referred to as

14:47.380 --> 14:58.660
the Godfather of AI, thinks that ZML is, what is it, impressive, and this man is a genius,

14:58.740 --> 15:11.380
so I won't argue with him. So, what's next? Obviously, we are young, so we want to support

15:11.380 --> 15:23.380
more chips, more models, we just want more, more modalities, more integrations, and we are also

15:23.380 --> 15:30.980
working on an LLM server, we are pretty far with it, we call it LLMD, and it's super fast, super small,

15:30.980 --> 15:37.140
and super cool, and we hope that soon we will also give you a chance to play with it.

15:38.980 --> 15:49.860
And speaking of playing with it, yeah, this was a short intro to ZML, and this is basically

15:49.860 --> 15:58.580
the gist of how you can run models on your machine, and basically that's it, and now we have

15:58.580 --> 16:12.100
a lot of time for questions. So, I would suggest if you have a question and don't mind,

16:12.740 --> 16:18.100
come close, so I can hear you, and then I can repeat the question for you.

16:20.260 --> 16:23.380
And if you don't have any questions, we'll have a longer break, okay.

16:42.100 --> 16:56.020
Yeah, so I try to repeat a very long question, and it went along the lines of how do we find,

16:56.020 --> 16:59.940
so for example, we showed the MNIST example, and the question was how do we find

16:59.940 --> 17:07.460
the symbolic dimensions, and I'll try to answer the question, maybe I misunderstood it,

17:07.460 --> 17:13.780
but we have enough time. So basically, what we do is, so one thing that's very important,

17:13.780 --> 17:27.060
Zig is a compiled language, and in the compilation step, you already need to know when you run it,

17:28.100 --> 17:33.220
you need to know the shapes to produce the MLIR, right? And so you need to get those shape dimensions

17:33.300 --> 17:37.620
from somewhere. And in this case, like in the MNIST case, or in the Llama case, we get them

17:37.620 --> 17:42.420
by loading the weights from disk. So if they are safetensors, they have meta information,

17:42.420 --> 17:48.100
we just grab the meta information first, compile the code to MLIR, then keep loading

17:48.100 --> 17:54.660
asynchronously the rest of the weights, and then it's done. Do you want to add to that, or

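NOTE
Editorial sketch, not from the talk: how shapes can be read from a safetensors file without touching the weight data, which is the step described above.
It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names and the demo tensor names are hypothetical, and this is Python for illustration, not ZML's Zig implementation.
```python
import io
import json
import struct
def read_safetensors_shapes(f):
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by that many bytes of JSON describing every tensor.
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name: (info["dtype"], tuple(info["shape"])) for name, info in header.items() if name != "__metadata__"}
def make_demo_file():
    # Build a tiny in-memory file so the sketch runs without a real checkpoint.
    header = {
        "fc.weight": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]},
        "fc.bias": {"dtype": "F32", "shape": [2], "data_offsets": [24, 32]},
    }
    raw = json.dumps(header).encode("utf-8")
    return io.BytesIO(struct.pack("<Q", len(raw)) + raw + bytes(32))
shapes = read_safetensors_shapes(make_demo_file())
print(shapes)
```
Only the header is read here; the 32 bytes of tensor data at the end of the demo file are never touched.
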
17:54.980 --> 18:06.500
typically the batch size is an input to the compilation, so yeah. So we get the actual dimensions

18:06.500 --> 18:19.460
by loading the weights. So we use symbolic. So the follow-up question is, I hand over to

18:19.460 --> 18:24.340
Guillaume. For sequence length, typically what you do is you have a max sequence length,

18:24.340 --> 18:29.780
and then you compile for that. And if you really want more dynamism, you can compile several

18:29.780 --> 18:36.340
batch versions of the kernel. Yeah, there are some approaches, like with paged attention, where

18:36.340 --> 18:40.900
the sequence length is not that much of a problem, so you can just compile for very long sequence length,

18:40.900 --> 18:46.500
and you can trade off at runtime between batch size and sequence length, but yeah.

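NOTE
Editorial sketch, not from the talk: one common way to handle variable sequence lengths with statically shaped executables, in the spirit of compiling several versions as mentioned above.
The bucket sizes, the padding id and the function names are hypothetical; the talk does not say which sizes or padding scheme ZML uses.
```python
import bisect
# Hypothetical bucket sizes; each one stands for a separately compiled executable.
BUCKETS = [128, 512, 2048]
PAD_ID = 0
def pick_bucket(seq_len, buckets=BUCKETS):
    # Smallest compiled length that can hold the request.
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError("sequence of %d tokens exceeds max length %d" % (seq_len, buckets[-1]))
    return buckets[i]
def pad_to_bucket(tokens, buckets=BUCKETS, pad_id=PAD_ID):
    # Static shapes mean the input must be padded up to the compiled length.
    size = pick_bucket(len(tokens), buckets)
    return tokens + [pad_id] * (size - len(tokens)), len(tokens)
padded, real_len = pad_to_bucket(list(range(1, 131)))
print(len(padded), real_len)
```
A request of 130 tokens lands in the 512 bucket here; fewer buckets mean fewer compilations but more padding per request.
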
18:50.340 --> 18:53.620
Okay. Yes.

19:09.300 --> 19:17.460
Okay, so the question is, since we are using Zig, why are we not using the, yeah, the Zig build system,

19:18.420 --> 19:24.820
the reason for that is that Bazel is more mature, and some of the dependencies are also using

19:24.820 --> 19:33.140
Bazel, so typically OpenXLA, and we compile part of LLVM, so with Bazel it's easier to do this.

19:34.900 --> 19:41.460
We also generate like the container images and Bazel is convenient for that for now, so

19:41.860 --> 19:50.820
we need to find a, we are working on a way to make it easier to call our stuff from build.zig,

19:50.820 --> 19:57.460
but it's not ready yet, so yeah, but it's a common request from people using Zig, but yeah, we are

19:57.460 --> 20:14.580
working on it. So I'm not sure on the, I think the question is, how do we import

20:14.580 --> 20:23.060
PyTorch models? Yeah, so for PyTorch models, you have two things, you have the weights,

20:23.060 --> 20:28.660
we can load the weight formats typically used for PyTorch models, so either the, I mean, the PyTorch official

20:28.660 --> 20:35.940
format or also safetensors, but then you also need to port the code, we don't, I mean, yeah, you need to

20:36.020 --> 20:44.980
rewrite the inference code. We tried to make this easy, it's a bit of work, but usually,

20:44.980 --> 20:51.620
I mean, you probably already have seen like Llama implementations in one file, so it's not that much

20:51.620 --> 20:57.780
work, if you know what you're doing, and also sometimes the inference code is different from the

20:57.860 --> 21:06.660
training code anyway, because you want to optimize for varying sequence lengths or prefill versus

21:06.660 --> 21:14.660
generation, so yeah, so this rewrite is often a bit needed anyway when you go to production.

21:22.980 --> 21:25.860
Thank you very much. We start in a three,

21:27.780 --> 21:33.140
three.

