We perceive the world through our senses, but how do machines perceive the world? Computers use different types of sensors, such as microphones, cameras, radars, or GPS receivers, to receive information from the environment around them and build a representation of their surroundings. But computers only understand numbers, so all the information they receive from their sensors has to be stored as a set of numbers.

For example, a black-and-white image is encoded as a matrix of numbers, where each value indicates the brightness of one pixel. If the image is in color, three numbers are stored for each pixel, representing the brightness of the red, green, and blue components. Sounds are also encoded as a series of numbers indicating the waveform values at different moments, taking hundreds or thousands of samples per second.

So does the fact that a machine can receive information from the world already make it an artificial intelligence system? Well, no. For us to consider it one, it needs to be able to extract meaning from that information. Think of a supermarket door that opens when a sensor detects movement. The system is too simple to perceive who or what is entering and make decisions based on that meaning.
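The encodings described above can be sketched in a few lines of NumPy. All the values here (the tiny 4x4 image, the 440 Hz tone, the 8000 samples per second) are illustrative choices, not figures from the video:

```python
import numpy as np

# A tiny 4x4 black-and-white image: each number is the brightness
# of one pixel (0 = black, 255 = white).
gray = np.array([
    [  0,  50, 200, 255],
    [ 10,  80, 220, 255],
    [  0,  60, 210, 250],
    [  5,  70, 215, 255],
], dtype=np.uint8)

# A color image stores three numbers per pixel: the red, green,
# and blue components. Here, a single bright-red pixel:
red_pixel = np.array([255, 0, 0], dtype=np.uint8)

# Sound is a series of waveform samples taken thousands of times
# per second; here, one second of a 440 Hz tone at 8000 samples/s.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)

print(gray.shape)       # (4, 4): one number per pixel
print(red_pixel.shape)  # (3,): three numbers for one color pixel
print(wave.shape)       # (8000,): one second of audio samples
```

In every case, what the computer actually stores is just an array of numbers; the "image" or "sound" is our interpretation of that array.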
And thanks to this limitation, we can enjoy wonderful videos of wild animals strolling through supermarket aisles, as Turesky and Garner joke in their chapter on AI literacy in this magnificent work.

But how do computers extract meaning from a set of numbers that represents, say, an image? This signal-to-meaning transformation occurs in progressive stages, through a process called feature extraction.

On the screen, we have an image of the digit 4 written by a person, which the computer has already encoded into a matrix of numbers from its camera. But how could it know that it is a 4 and not a 1 or a 7? By looking for specific combinations of values representing light and dark pixels in small areas of the image, in this case 3x3 pixels, the location and orientation of different edges in the image can be detected. Thus, the result of applying a filter to detect left edges is shown in the image on the right, where areas detected as left edges appear marked in red. Opposite areas are shown in blue, meaning in this case the right edges. Now let's apply a filter to detect upper edges. See? So, through this staged process of feature extraction, where different types of filters are used and combined, a signal is transformed into meaning.
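The edge filters described above can be sketched as a small cross-correlation. The 6x6 image and the Prewitt-style kernel below are illustrative assumptions, not the exact filters from the video, but they show the same idea: one sign of the response marks edges in one direction, the opposite sign marks the other (the red and blue areas on screen):

```python
import numpy as np

# Hypothetical 6x6 image: dark (0) on the left, bright (1) on the
# right, so there is a single vertical edge down the middle.
img = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

# A 3x3 vertical-edge filter: it responds positively where brightness
# increases from left to right, and negatively where it decreases.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def apply_filter(image, k):
    """Slide the 3x3 filter over the image (no padding)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * k)
    return out

edges = apply_filter(img, kernel)
print(edges[0])  # [0. 3. 3. 0.]: strong response only near the edge
```

The filter output is large only where the 3x3 neighborhood straddles the dark-to-light boundary; everywhere else it is zero. Stacking and combining many such filters is what turns raw pixel values into progressively more meaningful features.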
Something very similar is done with sounds, for example in speech recognition, since each vowel and each consonant can be associated with different patterns in a spectrogram, a visual representation that shows the variations in frequency and intensity of the sound over time. But there are AI systems that not only can transcribe audio into text, but also seem to understand those texts. How is this possible? Well, that's precisely what we'll see in the next video.
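The spectrogram mentioned above can be sketched by splitting a signal into short frames and taking the magnitude of the FFT of each frame. The frame length, hop size, and test tone below are illustrative assumptions, not values from the video:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Minimal spectrogram: FFT magnitudes of short, overlapping frames."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Magnitude of the positive frequencies in this time slice
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (time slices, frequency bins)

# One second of a 1000 Hz tone sampled at 8000 samples per second
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
# The loudest frequency bin should correspond to 1000 Hz
peak_hz = spec[0].argmax() * sr / 256
print(peak_hz)  # 1000.0
```

Each row of the result describes the sound's frequency content during one short time slice; plotted as an image, those rows form the patterns that speech-recognition systems associate with vowels and consonants.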