We perceive the world through our senses, but how do machines perceive the world? Computers use different types of sensors, such as microphones, cameras, radars, or GPS receivers, to receive information from the environment around them and build a representation of their surroundings. But computers only understand numbers, so all the information they receive from their sensors has to be stored as a set of numbers.

For example, a black-and-white image is encoded as a matrix of numbers, where each value indicates the brightness of one pixel. If the image is in color, three numbers are stored for each pixel, representing the brightness of the red, green, and blue components. Sounds are also encoded as a series of numbers indicating the waveform values at different moments, taking hundreds or thousands of samples per second.

So does the fact that a machine can receive information from the world already make it an artificial intelligence system? Well, no. For us to consider it one, it needs to be able to extract meaning from that information. Think of a supermarket door that opens when a sensor detects movement. The system is too simple to perceive who or what is entering and make decisions based on that meaning.
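The encodings described above can be sketched in a few lines of NumPy. All the values here (the tiny 4x4 image, the 440 Hz tone, the 8000 samples per second) are illustrative choices, not figures from the video:

```python
import numpy as np

# A tiny 4x4 black-and-white image: each number is the brightness
# of one pixel (0 = black, 255 = white).
gray = np.array([
    [  0,  50, 200, 255],
    [ 10,  80, 220, 255],
    [  0,  60, 210, 250],
    [  5,  70, 215, 255],
], dtype=np.uint8)

# A color image stores three numbers per pixel: the red, green,
# and blue components. Here, a single bright-red pixel:
red_pixel = np.array([255, 0, 0], dtype=np.uint8)

# Sound is a series of waveform samples taken thousands of times
# per second; here, one second of a 440 Hz tone at 8000 samples/s.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)

print(gray.shape)       # (4, 4): one number per pixel
print(red_pixel.shape)  # (3,): three numbers for one color pixel
print(wave.shape)       # (8000,): one second of audio samples
```

In every case, what the computer actually stores is just an array of numbers; the "image" or "sound" is our interpretation of that array.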
And thanks to this limitation, we can enjoy wonderful videos of wild animals strolling through supermarket aisles, as Turesky and Garner joke in their chapter on AI literacy in this magnificent work.

But how do computers extract meaning from a set of numbers that represents, say, an image? This signal-to-meaning transformation occurs in progressive stages, through a process called feature extraction.

On the screen, we have an image of the digit 4 written by a person, which the computer has already encoded into a matrix of numbers from its camera. But how could it know that it is a 4 and not a 1 or a 7? By looking for specific combinations of values representing light and dark pixels in small areas of the image, in this case 3x3 pixels, the location and orientation of different edges in the image can be detected. Thus, the result of applying a filter to detect left edges is shown in the image on the right, where areas detected as left edges appear marked in red. Opposite areas are shown in blue, meaning in this case the right edges. Now let's apply a filter to detect upper edges. See? So, through this staged process of feature extraction, where different types of filters are used and combined, a signal is transformed into meaning.
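The edge filters described above can be sketched as a small cross-correlation. The 6x6 image and the Prewitt-style kernel below are illustrative assumptions, not the exact filters from the video, but they show the same idea: one sign of the response marks edges in one direction, the opposite sign marks the other (the red and blue areas on screen):

```python
import numpy as np

# Hypothetical 6x6 image: dark (0) on the left, bright (1) on the
# right, so there is a single vertical edge down the middle.
img = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

# A 3x3 vertical-edge filter: it responds positively where brightness
# increases from left to right, and negatively where it decreases.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def apply_filter(image, k):
    """Slide the 3x3 filter over the image (no padding)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * k)
    return out

edges = apply_filter(img, kernel)
print(edges[0])  # [0. 3. 3. 0.]: strong response only near the edge
```

The filter output is large only where the 3x3 neighborhood straddles the dark-to-light boundary; everywhere else it is zero. Stacking and combining many such filters is what turns raw pixel values into progressively more meaningful features.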
Something very similar is done with sounds, for example in speech recognition, since each vowel and each consonant can be associated with different patterns in a spectrogram, a visual representation that shows the variations in frequency and intensity of the sound over time. But there are AI systems that not only can transcribe audio into text, but also seem to understand those texts. How is this possible? Well, that's precisely what we'll see in the next video.
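The spectrogram mentioned above can be sketched by splitting a signal into short frames and taking the magnitude of the FFT of each frame. The frame length, hop size, and test tone below are illustrative assumptions, not values from the video:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Minimal spectrogram: FFT magnitudes of short, overlapping frames."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Magnitude of the positive frequencies in this time slice
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (time slices, frequency bins)

# One second of a 1000 Hz tone sampled at 8000 samples per second
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
# The loudest frequency bin should correspond to 1000 Hz
peak_hz = spec[0].argmax() * sr / 256
print(peak_hz)  # 1000.0
```

Each row of the result describes the sound's frequency content during one short time slice; plotted as an image, those rows form the patterns that speech-recognition systems associate with vowels and consonants.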