WEBVTT

00:00.000 --> 00:13.600
So our next talk here is presented by Lisa Kede, about "To Mine or Not to Mine".

00:13.600 --> 00:15.600
So welcome everyone.

00:15.600 --> 00:17.600
Thank you very much.

00:17.600 --> 00:24.080
Thank you very much for being here this room was packed and I'm really excited to be here

00:24.080 --> 00:29.040
as well and this is probably not even as much of an open source topic even though

00:29.040 --> 00:35.120
of course there's a lot of open source going on in AI as well at the moment but I want to be

00:35.120 --> 00:41.040
talking about some hot topics in copyright law actually which are going on at least in Germany

00:41.040 --> 00:51.520
at the moment when we're talking about AI so most of you probably have heard about the issue

00:51.520 --> 00:56.720
around training and using an AI model when it comes to copyright so this is at least

00:56.720 --> 01:01.600
in Germany one of the topics that is being talked a lot about at the moment and let's just look at

01:01.600 --> 01:11.040
the process for a second very overly simplified process of data in AI models so usually when we're

01:11.040 --> 01:16.560
training an AI model we do need a lot of data which can be images text which can be music

01:16.560 --> 01:23.680
code whatever can be open source code as well obviously most of the data that is being used

01:23.680 --> 01:30.960
for AI training is obviously copyrighted which will be fed into the AI model for training so on

01:30.960 --> 01:36.640
the left hand side is what we would call training data and in the center we do have the AI model usually

01:36.640 --> 01:42.400
embedded in some kind of an AI system such as ChatGPT or any other AI system which can then be used

01:42.400 --> 01:48.800
in turn by a user via prompting so we have input that goes into the AI model it does not necessarily

01:48.800 --> 01:54.720
have to be a text prompt you can also have an AI model for example analyze an image for the

01:54.720 --> 02:00.720
content have it create a caption for an image for example and then we have output which is on the

02:00.720 --> 02:07.600
right hand side which can be probability values but can also be text music code images whatever

02:08.960 --> 02:17.040
so to sum it all up in order to create an AI system we need a lot of data

02:17.920 --> 02:26.800
and to create a good system we need even more data when we're talking about generative AI systems

02:26.800 --> 02:32.880
we usually require works what people call copyrighted works so some kind of copyrighted

02:32.880 --> 02:40.000
artwork for example where a human being has put effort into it and usually we are collecting these

02:40.080 --> 02:47.040
works online so we have to download them to create a copy at some point of these works in order

02:47.040 --> 02:53.520
to be able to feed them into our AI model and since we are talking about the legal aspects here

02:54.160 --> 02:59.520
we will obviously notice that creating the copies and also maybe modifying the works

02:59.520 --> 03:08.160
in most cases require a license and there we have arrived at one of the main issues in AI

03:08.160 --> 03:18.080
use at the moment so I'll be talking about the situation in Europe since about 2021 we do have

03:18.080 --> 03:22.800
an exception in European copyright law which was introduced by a directive so it's not

03:23.920 --> 03:29.520
immediately binding for the member states of the EU but in 2021 each and every member state of the

03:29.520 --> 03:35.680
EU had to implement this directive in their local copyright laws and there was one article in this

03:35.760 --> 03:43.440
directive which is called the text and data mining exception so this exception basically requires

03:44.080 --> 03:49.920
that the member states shall provide for an exception or limitation for the reproduction and

03:49.920 --> 03:56.160
extraction of lawfully accessible works and other subject matter for the purposes of text and data mining

03:56.160 --> 04:00.640
it does not have a definition in here it's in another part of this directive I'll be talking

04:00.640 --> 04:09.360
about this in a second and it does have an additional provision which states that while they have

04:09.360 --> 04:14.720
to be creating an exception they also have to provide for the authors to have a way to opt out

04:14.720 --> 04:21.200
of this exception so the authors of copyrighted works should be able to declare by some kind of

04:21.200 --> 04:26.800
reservation that they do not want their works to be used for text and data mining so they do not

04:26.800 --> 04:37.280
want their works to be reproduced especially for text and data mining as I said this is

04:38.560 --> 04:45.840
based on the Digital Single Market Directive and it constitutes a statutory exception

04:45.840 --> 04:54.000
so it's basically an exception which allows us to copy copyrighted works for the purposes of

04:54.000 --> 05:01.600
text and data mining without having to obtain a license of the author and it also as I announced

05:01.600 --> 05:06.560
has a definition of what actually is text and data mining because we are talking about AI here so

05:06.560 --> 05:10.880
why are we talking about text and data mining all of a sudden so the definition of text and data

05:10.880 --> 05:17.760
mining in this directive is the automated analytical technique aimed at analyzing text and data

05:17.760 --> 05:23.600
in digital form in order to generate information which includes but is not limited to patterns

05:23.600 --> 05:29.760
trends and correlations and when we're looking at what happens in AI I feel like this sounds

05:29.760 --> 05:34.400
vaguely familiar what we're doing when we're using AI systems especially for example when we're

05:34.400 --> 05:40.720
using AI systems to analyze an image to figure out what is the contents of the image what would

05:40.720 --> 05:47.200
be the good caption for this image does the image contain some kind of content that we do

05:47.200 --> 05:52.160
not want to display on our website for example so there's a lot of analysis of course going on

05:52.160 --> 06:00.000
in AI and this is why a lot of scholars jumped at this text and data mining exception when the

06:00.000 --> 06:06.720
question arose especially with the rise of all of these generative AI systems of how we can

06:07.520 --> 06:14.640
somehow figure out why it should be allowed to train AI systems even when there's never been

06:14.720 --> 06:23.680
a license for doing so of course as I announced there is a way to even if we assume that this is

06:23.680 --> 06:36.880
applicable to AI to opt out of this exception so it is aimed at providing a way to opt out of the use

06:36.880 --> 06:42.800
of the copying of data for text and data mining so if people transfer this to AI a way to opt

06:43.680 --> 06:51.440
out of having your data used for AI training it should be declared by the right holders so it doesn't

06:51.440 --> 06:57.840
necessarily have to be the author but whoever they assign their rights to it should be declared in an

06:57.840 --> 07:05.920
appropriate manner and machine readable in online context whatever machine readable is it is not

07:05.920 --> 07:14.080
defined at all in this directive which will be one of the issues I will be talking about in a second

07:14.720 --> 07:21.040
and it can be declared only ex nunc which mainly means for the future only so you can have your work

07:21.040 --> 07:26.080
on the internet without any declared reservation and at some point you decide that you do not want

07:26.080 --> 07:31.040
any more that your work can be copied for text and data mining purposes you can just declare

07:31.120 --> 07:36.800
the reservation and everyone who obtains the work after that point in time has to respect your

07:36.800 --> 07:48.800
reservation there is a discussion going on especially in Germany and I thought

07:48.800 --> 07:52.880
this was a good opportunity to be talking to you about this topic because I think there's a lot of people

07:52.880 --> 07:57.760
from different countries and right now in Germany we are very much focused on what's going on in

07:57.760 --> 08:02.480
Germany and we don't really know what's going on in the other European countries which also

08:02.480 --> 08:06.880
of course somehow have to be talking about this text and data mining exception because it's in their

08:06.880 --> 08:16.240
copyright law as well so the discussion in Germany for example is that there are some scholars who

08:16.240 --> 08:23.680
would assume that the text and data mining exception cannot be applied to AI training and AI analysis

08:23.760 --> 08:31.840
because text and data mining is about a different kind of gathering information than it would be

08:31.840 --> 08:39.520
in AI training and I cannot really relate to this kind of view because from what I know about

08:39.520 --> 08:47.280
AI and text and data mining it would be that text and data mining in its basis is one of the

08:47.280 --> 08:52.160
precursors of what we would be doing with AI and machine learning at some point so

08:53.120 --> 08:59.440
there is a bit of a discussion going on in the legal landscape at the moment in Germany whether

08:59.440 --> 09:04.560
we can use the exception for text and data mining or not and there also even has been a first

09:04.560 --> 09:09.520
court case in Germany which I'll be talking about in a second but the discussion as I said

09:09.520 --> 09:14.720
is going around the question of whether the use of AI so not even training of AI systems but

09:14.720 --> 09:21.920
merely the use of AI systems constitutes text and data mining and the other question would be

09:21.920 --> 09:27.120
whether training of AI systems would constitute text and data mining in the sense of the text and

09:27.120 --> 09:32.880
data mining exception both of which obviously would have a huge impact because if we were to answer

09:32.880 --> 09:39.840
these questions with a no then we would need a license for each and every copyrighted work that would

09:39.840 --> 09:45.760
be going into machine learning models as training data and we could not for example simply

09:45.760 --> 09:51.840
download an image from the internet and analyze it by means of a machine learning system without

09:51.840 --> 09:59.440
the permission of the author and this is an issue especially if we are only talking about this

09:59.440 --> 10:05.360
in our national jurisdiction because web scraping for example for training data is obviously

10:05.360 --> 10:10.880
not something that ends at national borders because it's performed online it does not even

10:10.880 --> 10:16.880
end at European borders but we do have a common ground to start from there so I think this is

10:16.880 --> 10:23.280
something that we all have to discuss on a European level and where it also becomes clear that

10:23.280 --> 10:32.080
implementing copyright in the sense of directives versus regulations is not the ideal way when we have

10:32.160 --> 10:40.640
cross country subject matter because when we have a directive like the Digital

10:40.640 --> 10:46.720
Single Market Directive which brought the text and data mining exception it has to be implemented

10:46.720 --> 10:53.120
in the national jurisdictions and it can be different in each country each country could interpret

10:53.120 --> 11:00.080
this exception differently whereas if we have a regulation such as the GDPR or the AI Act which is currently

11:00.080 --> 11:06.320
being talked about a lot this immediately is effective in each and every country of the European

11:06.320 --> 11:17.200
Union and would have to be also interpreted in the same way so there are actually some lawsuits

11:17.200 --> 11:21.840
going on in Germany and I would love to hear about whether there are some lawsuits going on

11:21.840 --> 11:25.600
in other European countries as well I know that there's a lot going on in the US at the moment

11:26.480 --> 11:32.800
so there's one lawsuit I would like to be talking about which is the first one on this list

11:32.800 --> 11:41.840
Kneschke against LAION which is a lawsuit that finished in its first round in September of last

11:41.840 --> 11:47.520
year so it's quite fresh and it's also being appealed so the final word has not been spoken

11:47.520 --> 11:53.200
but nevertheless I think it's really interesting to see what the courts thought about AI

11:53.840 --> 12:00.000
and the text and data mining exception as of now so what was going on in this case is that a photo

12:00.000 --> 12:06.240
producer published some of their photos on a website on the internet it was an agency website so

12:06.240 --> 12:14.080
it was not his own website but it was a different website and on this website they also had

12:14.080 --> 12:19.520
terms and conditions which said that automatic scraping of content of this website is not allowed

12:19.520 --> 12:26.800
and then there's Common Crawl which many of you probably also know which regularly basically

12:26.800 --> 12:33.680
it's a backup of the internet at least of HTML and metadata they do not store images for example

12:33.680 --> 12:42.400
but links to all of the images and this Common Crawl organization provides their data set

12:42.560 --> 12:49.280
on their website and then there's LAION LAION is a non-profit organization which creates

12:49.280 --> 12:56.960
data sets for machine learning basically and they use the Common Crawl data set to create their own

12:56.960 --> 13:03.120
image data set which is basically a table which you can imagine as a table of entries where each

13:03.120 --> 13:08.160
entry would be an image and they have a URL to the image and a description of what is in the

13:08.160 --> 13:15.360
image whether it's like for example child safe content or whether what's the caption what's

13:15.360 --> 13:20.480
a description of the image and where you can obtain the image they do not have a copy of the

13:20.480 --> 13:26.880
image in their data set and they are publishing the data set which is in turn being used for

13:26.880 --> 13:36.640
example by Stability AI and the photo producer one of whose images that he published on their

13:36.640 --> 13:43.600
website also was in the Common Crawl data set and consequently

13:43.600 --> 13:51.520
also was in the LAION data set did not like that their image was used in this

13:52.240 --> 14:01.040
data set which was in turn being used by AI creators and he sued LAION so he sued the organization

14:01.040 --> 14:08.400
creating the data set and the court therefore had to deal with the issue of whether downloading

14:08.400 --> 14:18.080
his image which LAION had to download because they were using their own AI to

14:18.080 --> 14:24.480
analyze the images for the caption so they could use the information they got out of the image for

14:24.480 --> 14:32.800
example the caption in their data set so they had to download the image to do so and he is now

14:32.800 --> 14:39.600
suing them for doing that because he claims that the terms and conditions on this

14:39.600 --> 14:48.400
agency website which prohibit scraping of data from the website for whatever use

14:49.280 --> 14:54.720
constitute a reservation according to the text and data mining exception and the court says

14:54.720 --> 15:01.840
actually that they assume this is one of the first I think main conclusions of the case

15:02.560 --> 15:08.400
that using AI or machine learning or any kind of software to analyze an image to get information

15:08.400 --> 15:13.600
out of the image constitutes text and data mining which would mean that the text and data mining

15:13.600 --> 15:20.960
exception is applicable to the first of our questions and they were also talking about even though

15:20.960 --> 15:25.680
it's not relevant to the case that they would be assuming that training an AI system would most

15:25.680 --> 15:30.560
likely also constitute text and data mining but as I said this is not relevant to the case so it's

15:30.560 --> 15:37.120
not legally binding and since it's being appealed it's not legally binding anyways and they were also

15:37.120 --> 15:45.360
talking about the reservation even though the reservation in this case was not relevant why

15:45.360 --> 15:49.920
because there is another provision which I haven't been talking about yet also in copyright law

15:49.920 --> 15:58.080
which allows for research institutions which are doing nonprofit research to do text and data mining

15:58.080 --> 16:04.000
without having to respect any reservations and they were assuming that LAION is one of those research

16:04.080 --> 16:09.600
institutions and they do not have to respect reservations but nevertheless they were talking quite

16:09.600 --> 16:15.680
extensively about the reservation that was on this website and their conclusion was that even though

16:16.640 --> 16:24.480
it's a sentence in English on a third party website buried somewhere in the terms and

16:25.440 --> 16:35.040
conditions it constitutes a machine readable reservation according to article 4 paragraph 3 of the text and data

16:35.040 --> 16:43.040
mining exception which is I think amazing because their reasoning was that

16:44.640 --> 16:51.200
the data is supposed to be used for AI training so the people who are

16:51.520 --> 16:57.760
scraping the data should also be using AI to analyze whether there is some kind of reservation

16:57.760 --> 17:04.560
this is actually one of the arguments so they said that natural language nowadays is machine

17:04.560 --> 17:10.720
readable because we do have systems that can read natural language and interpret it but what they in my opinion

17:10.720 --> 17:19.520
did not take into account was that there are a lot of different ways to express a reservation so it's not

17:19.520 --> 17:25.920
at all a help for any of the authors because they do not have certainty on what kind of sentence

17:25.920 --> 17:31.280
would constitute a valid reservation they can just basically write anything but then in each case

17:31.280 --> 17:36.560
a court would have to decide whether this is sufficiently concrete or not and on the

17:36.560 --> 17:45.360
other hand it's obviously a huge effort for everybody doing web crawling for images

17:45.440 --> 17:52.240
because they have to search entire web sites and their sub websites and sub pages for some kind

17:52.240 --> 17:58.560
of sentences that might say that text and data mining is not allowed and this is obviously

17:58.560 --> 18:05.440
very inefficient and cannot be why the EU initially put the machine readable requirement somewhere

18:05.440 --> 18:14.800
in there and also if this stands and if the next court also holds that this is actually

18:14.960 --> 18:22.080
the case in Germany this would mean that for example in Germany it would be required to search

18:22.080 --> 18:30.880
entire websites everybody crawling German websites would have to be checking for natural language

18:30.880 --> 18:38.160
reservation against text and data mining but if the crawling is happening in France and France has

18:39.040 --> 18:48.160
different requirements then not only could anybody basically ignore natural language reservations but also

18:48.160 --> 18:53.680
the author would not be protected for example in France if they were only declaring the reservation

18:53.680 --> 18:59.280
in natural language so this is why I'm saying we do have to go for a European approach here

18:59.280 --> 19:05.920
and it's not really helping anyone if national courts just decide about these issues and I think

19:06.000 --> 19:14.080
I've already talked about this I compiled some kind of an overview of machine

19:14.960 --> 19:20.240
actually machine readable reservation options you can find them online on our website

19:21.040 --> 19:28.320
there are a lot of ideas going on for example you could be using the robots.txt

19:28.960 --> 19:37.120
file as we've been doing for centuries almost for decades for search engines so it's a quite

19:37.120 --> 19:44.560
good place to start then there's actually a text and data mining reservation protocol group

19:44.560 --> 19:55.200
going on at the W3C so there are standards which are seemingly being established at the moment

19:55.200 --> 20:02.160
and which can be used and which can be useful and if you're thinking about applying this to your

20:02.160 --> 20:10.080
own work to your own website to for example marketing company images on your website it is I think

20:10.080 --> 20:16.720
it's really a good idea to start by using actually machine readable reservations and the options

20:16.720 --> 20:24.800
that are already out there so I think I talked a lot and if there's any questions and any comments

20:24.800 --> 20:29.360
and any insight on how it's going in your country I'm more than happy to talk about this

20:38.400 --> 20:45.440
yeah thanks for the presentation I have a question where are the limits of text and data mining

20:45.840 --> 20:52.880
when it comes to verbatim copies and I have seen this using tools like Copilot the GitHub

20:52.880 --> 21:00.160
Copilot I will say give me code for a problem blah blah blah yeah and it spits out a page of code

21:00.960 --> 21:06.880
and then when I do the effort of trying to find where that code came from I find the page in

21:06.880 --> 21:14.000
the internet with exactly the same code yeah so the AI spit out a verbatim copy and that code in the

21:14.000 --> 21:21.360
internet might have a license or it might not have what's then the legal viewpoint can I say it

21:21.360 --> 21:27.760
was generated by AI or do I have to go with the real source the thing is that the text and data mining

21:27.760 --> 21:32.560
exception in this case wouldn't even apply because the text and data mining exception covers

21:32.560 --> 21:38.640
copying data as training data which would go into the AI model at some point and then if there's

21:38.640 --> 21:44.560
a copy in the output of the AI system it's never covered by the

21:44.560 --> 21:48.720
text and data mining exception because the text and data mining exception only allows copies which

21:48.720 --> 21:54.480
are required to perform text and data mining but this would be a copy which happens way after

21:54.480 --> 21:59.040
that and these reproductions which would happen in the output of the AI systems are not covered

21:59.040 --> 22:14.720
by the text and data mining exception okay one question and one remark so one more place

22:14.720 --> 22:19.840
where also these discussions about a standard for opting out are taking place is the IETF they have

22:19.840 --> 22:28.880
a new working group called AI controls I think and a question so in order for copyright to apply

22:28.880 --> 22:35.440
there needs to be an act of reproduction so it's quite clear when you're creating a training data

22:35.440 --> 22:41.760
set that an act of reproduction is taking place but if the training data set is already available

22:41.760 --> 22:47.840
and you're kind of training your AI model on the fly is there even an act of reproduction

22:47.840 --> 22:53.520
that requires permission what is your opinion on that yeah good question I mean you do have to

22:53.520 --> 23:02.960
obtain the data set somehow so we would have to think about whether even if you're just

23:03.680 --> 23:12.720
using it from a remote server this would constitute like a volatile copy or whether this is

23:12.720 --> 23:18.480
a permanent copy and if it's a volatile copy then there's another exception in copyright which would

23:18.480 --> 23:25.520
apply so I think this is a good question I've asked myself the same question it's not really

23:25.520 --> 23:30.400
being discussed at the moment because I think the technical requirements are still that you have

23:30.400 --> 23:37.440
a copy like a permanent copy somewhere but I think it would definitely be worth

23:37.440 --> 23:45.920
thinking about whether this is even relevant yeah so I'm wondering what happens

23:45.920 --> 23:54.640
when this intersects with other laws so at the university some researchers used images available online

23:54.640 --> 24:03.600
for training the model to predict you know or detect the level of facial palsy that was possible

24:03.600 --> 24:08.880
and they had to stop doing it because essentially they were just looking up on the internet using

24:08.880 --> 24:14.720
images from like you know people that posted online and other websites and the reasoning was

24:14.720 --> 24:21.920
essentially that it was a GDPR violation but they were looking to develop an AI model to kind of train

24:21.920 --> 24:29.920
the model would this negate that or so this is merely a copyright exception so

24:29.920 --> 24:36.480
you do still have to obviously observe any kind of GDPR related requirements so this does not

24:36.480 --> 24:45.920
allow you to process personal data yeah there's copyright and then there's data protection

24:47.920 --> 24:49.920
okay thank you
