When you access the application at v2-learningml.org, which offers text, image and number recognition, you'll see there is a new type of recognition: sound. Let's see how it works.

We click on it and the three phases of supervised learning appear, just as in the other recognitions: the training phase to collect data, the learning phase to build the model, and the testing phase.

We're going to build a model capable of distinguishing my voice from a whistle and from the background noise in the room. So we create the three classes we need, in this case: voice, whistle and background.

Good, and now it's a matter of adding examples of voice, whistle and background sound. I'm going to start with the voice, because while I speak and explain how recording works we'll be collecting voice samples. When we want to collect sound samples we simply click Record, and you'll see it starts collecting sample recordings of about one second each, automatically; that is, it keeps recording until we stop it. If we stop it, the recording stops. In this case it has collected 12 recordings of approximately one second of my voice. If we want to play them back to see what was recorded, we click here and we can hear the different things I've been saying. The interesting thing is the timbre, because that's what this tool recognizes quite well: the timbre of sounds.
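The data-collection step described above, fixed-length clips gathered under class labels, can be sketched in Python. This is only an illustration of the idea, not LearningML's internal code: the sample rate, the clip storage, and the random "recordings" standing in for real microphone audio are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed rate; LearningML's actual rate is not stated in the video
CLIP_SECONDS = 1.0     # the app records clips of roughly one second

def make_clip(rng):
    """Stand-in for one ~1 s recording: a vector of raw audio samples.
    A real app would capture these from the microphone instead."""
    return rng.standard_normal(int(SAMPLE_RATE * CLIP_SECONDS))

# A labeled dataset mirroring the three classes created in the app,
# with 12 samples per class.
rng = np.random.default_rng(0)
dataset = {label: [make_clip(rng) for _ in range(12)]
           for label in ("voice", "whistle", "background")}

# Deleting a bad sample, like removing clip number 12 with the trash-can button:
del dataset["voice"][11]
```

The point of the sketch is the shape of the data: each class name maps to a list of equal-length clips, and reviewing or deleting a sample is just list manipulation.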
If we don't like any of the samples we can simply delete it. Imagine we don't want number 12: we click the trash-can button and it's deleted. Now we're going to take whistle samples. Since the samples are about one second long, it's very important that during that second the recording really captures what we want; that's why it's good to review the samples afterwards, to check that what we intended was actually recorded. We always have to keep in mind that data quality is fundamental to obtaining a good model.

Well, let's collect whistle sounds. I'll stop it. Well, 13 samples; since the number 13 brings bad luck and we're going to be a little superstitious, we'll take the opportunity to delete the last one. Good, and now we're going to take 12 background samples. I'll simply press Record and stay quiet, and it will capture whatever ambient noise there is: a bit of the fan motor... anyway, there's always noise wherever we go. Good, 12 samples, more or less.

Remember that it's important, whatever the data may be, whether sounds, texts, images or numbers, that each class has more or less the same number of samples: what's called a balanced dataset.

Good, we now have the sample dataset. Now it's time for learning, that is, building the model. We click here and the machine learning algorithm analyzes that data to build a model capable of recognizing those three timbres. Good, it has been trained.
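Two ideas from this part, checking that the dataset is balanced and then building a model from it, can be sketched as follows. The video doesn't say which algorithm LearningML uses, so the nearest-centroid classifier below is only a toy stand-in for the learning phase, and the 20% balance tolerance is an arbitrary choice for illustration.

```python
import numpy as np

def is_balanced(counts, tolerance=0.2):
    """True if the largest class has at most `tolerance` proportionally
    more samples than the smallest one (tolerance chosen arbitrarily)."""
    lo, hi = min(counts.values()), max(counts.values())
    return hi <= lo * (1 + tolerance)

def train(dataset):
    """Learning phase, toy version: summarize each class by the mean
    of its feature vectors. dataset: {label: [vector, ...]}."""
    return {label: np.mean(vectors, axis=0) for label, vectors in dataset.items()}

def classify(model, vector):
    """Predict the label whose centroid is closest to the vector."""
    return min(model, key=lambda label: np.linalg.norm(model[label] - vector))
```

Usage follows the same flow as the app: check the classes are balanced, call `train` once, then call `classify` on each new sample.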
It took 9.3 seconds, and now we're going to test it. To test it, well, we do the same as when we collected data: we press the record button, in this case the one in the testing phase, and see what happens. First we stay quiet, to see if it picks up the background. Perfect, it recognized the background noise. Now I'm going to speak. Hello, hello, hello. And again it got it right: it recognized the voice. And now I'm going to make a small whistle. And we see it recognized the whistle. Well, this is how to build sound recognition models.

Next I'm going to make a program with Scratch that uses the model we just created for sound recognition. We click on the cat and we'll see that among the LearningML blocks there is a new block called "record audio". This block works very much like the Record button: when executed, it records a sound of approximately one second, and that sound is converted into a multidimensional vector, which is what is really passed to the machine learning algorithm for recognition. And how is classification performed? Just as with the rest of the classification problems, with the "classify item" block; the difference is that here we place the recorded audio as its argument. Let's try it. First we execute it in silence, to see if it detects the background. Very good; now I'm going to execute it while speaking.
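The step where the recorded sound "is converted into a multidimensional vector" can also be sketched. How LearningML actually embeds audio is not documented in the video, so the averaged magnitude spectrum below is just one plausible way to turn a raw one-second clip into a fixed-length feature vector; the band count of 32 is an assumption.

```python
import numpy as np

def audio_to_vector(clip, n_bands=32):
    """Reduce a raw audio clip to a fixed-length feature vector by
    averaging its magnitude spectrum over n_bands frequency bands.
    (Illustrative only; not LearningML's actual embedding.)"""
    spectrum = np.abs(np.fft.rfft(clip))          # real-input FFT magnitudes
    bands = np.array_split(spectrum, n_bands)     # group bins into bands
    return np.array([band.mean() for band in bands])
```

Whatever the exact embedding, the key property is the same as in the transcript: every clip, regardless of its content, becomes a vector of the same length, which is what the classifier consumes.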
Hello, hello, hello, hello. And now I'm going to execute it while whistling. As we can see, it works exactly the same as the rest of the recognitions, but in this case recording samples of one second.

And with this we could make all kinds of programs. For example, imagine building a model that recognizes the words "up", "down", "left" and "right", and then making a Scratch program that moves the cat based on what the user says: it goes up when "up" is said, down when "down" is said, and so on. Well, that will be the subject of a later video. For now we'll stop here, so you get an idea of how this new LearningML functionality works.