When you access the application at v2-learningml.org, which offers text, image and number recognition, you'll see there is a new type of recognition: sound. Let's see how it works.

We click on it and the three phases of supervised learning appear, just as in the other recognitions: the training phase to collect data, the learning phase to build the model, and the testing phase.

We're going to build a model capable of distinguishing my voice from a whistle and from the background noise in the room. So we create the three classes we need, in this case: voice, whistle and background.

Good, and now it's a matter of adding examples of voice, whistle and background sound. I'm going to start with the voice, because while I speak and explain how recording works we'll be collecting voice samples. When we want to collect sound samples we simply click Record, and you'll see it starts collecting sample recordings of about one second each, automatically; that is, it keeps recording until we stop it. If we stop it, the recording stops. In this case it has collected 12 recordings of approximately one second of my voice. If we want to play them back to see what was recorded, we click here and we can hear the different things I've been saying. The interesting thing is the timbre, because that's what this tool recognizes quite well: the timbre of sounds.
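The data-collection step described above, fixed-length clips gathered under class labels, can be sketched in Python. This is only an illustration of the idea, not LearningML's internal code: the sample rate, the clip storage, and the random "recordings" standing in for real microphone audio are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed rate; LearningML's actual rate is not stated in the video
CLIP_SECONDS = 1.0     # the app records clips of roughly one second

def make_clip(rng):
    """Stand-in for one ~1 s recording: a vector of raw audio samples.
    A real app would capture these from the microphone instead."""
    return rng.standard_normal(int(SAMPLE_RATE * CLIP_SECONDS))

# A labeled dataset mirroring the three classes created in the app,
# with 12 samples per class.
rng = np.random.default_rng(0)
dataset = {label: [make_clip(rng) for _ in range(12)]
           for label in ("voice", "whistle", "background")}

# Deleting a bad sample, like removing clip number 12 with the trash-can button:
del dataset["voice"][11]
```

The point of the sketch is the shape of the data: each class name maps to a list of equal-length clips, and reviewing or deleting a sample is just list manipulation.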
If we don't like any of the samples we can simply delete it. Imagine we don't want number 12: we click the trash-can button and it's deleted. Now we're going to take whistle samples. Since the samples are about one second long, it's very important that during that second the recording really captures what we want; that's why it's good to review the samples afterwards, to check that what we intended was actually recorded. We always have to keep in mind that data quality is fundamental to obtaining a good model.

Well, let's collect whistle sounds. I'll stop it. Well, 13 samples; since the number 13 brings bad luck and we're going to be a little superstitious, we'll take the opportunity to delete the last one. Good, and now we're going to take 12 background samples. I'll simply press Record and stay quiet, and it will capture whatever ambient noise there is: a bit of the fan motor... anyway, there's always noise wherever we go. Good, 12 samples, more or less.

Remember that it's important, whatever the data may be, whether sounds, texts, images or numbers, that each class has more or less the same number of samples: what's called a balanced dataset.

Good, we now have the sample dataset. Now it's time for learning, that is, building the model. We click here and the machine learning algorithm analyzes that data to build a model capable of recognizing those three timbres. Good, it has been trained.
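Two ideas from this part, checking that the dataset is balanced and then building a model from it, can be sketched as follows. The video doesn't say which algorithm LearningML uses, so the nearest-centroid classifier below is only a toy stand-in for the learning phase, and the 20% balance tolerance is an arbitrary choice for illustration.

```python
import numpy as np

def is_balanced(counts, tolerance=0.2):
    """True if the largest class has at most `tolerance` proportionally
    more samples than the smallest one (tolerance chosen arbitrarily)."""
    lo, hi = min(counts.values()), max(counts.values())
    return hi <= lo * (1 + tolerance)

def train(dataset):
    """Learning phase, toy version: summarize each class by the mean
    of its feature vectors. dataset: {label: [vector, ...]}."""
    return {label: np.mean(vectors, axis=0) for label, vectors in dataset.items()}

def classify(model, vector):
    """Predict the label whose centroid is closest to the vector."""
    return min(model, key=lambda label: np.linalg.norm(model[label] - vector))
```

Usage follows the same flow as the app: check the classes are balanced, call `train` once, then call `classify` on each new sample.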
It took 9.3 seconds, and now we're going to test it. To test it, well, we do the same as when we collected data: we press the record button, in this case the one in the testing phase, and see what happens. First we stay quiet, to see if it picks up the background. Perfect, it recognized the background noise. Now I'm going to speak. Hello, hello, hello. And again it got it right: it recognized the voice. And now I'm going to make a small whistle. And we see it recognized the whistle. Well, this is how to build sound recognition models.

Next I'm going to make a program with Scratch that uses the model we just created for sound recognition. We click on the cat and we'll see that among the LearningML blocks there is a new block called "record audio". This block works very much like the Record button: when executed, it records a sound of approximately one second, and that sound is converted into a multidimensional vector, which is what is really passed to the machine learning algorithm for recognition. And how is classification performed? Just as with the rest of the classification problems, with the "classify item" block; the difference is that here we place the recorded audio as its argument. Let's try it. First we execute it in silence, to see if it detects the background. Very good; now I'm going to execute it while speaking.
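The step where the recorded sound "is converted into a multidimensional vector" can also be sketched. How LearningML actually embeds audio is not documented in the video, so the averaged magnitude spectrum below is just one plausible way to turn a raw one-second clip into a fixed-length feature vector; the band count of 32 is an assumption.

```python
import numpy as np

def audio_to_vector(clip, n_bands=32):
    """Reduce a raw audio clip to a fixed-length feature vector by
    averaging its magnitude spectrum over n_bands frequency bands.
    (Illustrative only; not LearningML's actual embedding.)"""
    spectrum = np.abs(np.fft.rfft(clip))          # real-input FFT magnitudes
    bands = np.array_split(spectrum, n_bands)     # group bins into bands
    return np.array([band.mean() for band in bands])
```

Whatever the exact embedding, the key property is the same as in the transcript: every clip, regardless of its content, becomes a vector of the same length, which is what the classifier consumes.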
Hello, hello, hello, hello. And now I'm going to execute it while whistling. As we can see, it works exactly the same as the rest of the recognitions, but in this case recording samples of one second.

And with this we could make all kinds of programs. For example, imagine building a model that recognizes the words "up", "down", "left" and "right", and then making a Scratch program that moves the cat based on what the user says: it goes up when "up" is said, down when "down" is said, and so on. Well, that will be the subject of a later video. For now we'll stop here, so you get an idea of how this new LearningML functionality works.