Abstract—Crying is how infants communicate, since they cannot yet use language. Parents or nannies who have no experience of caring for a baby become anxious when the infant cries. A method is therefore needed to interpret an infant’s cry and respond to it. This research develops a system that classifies infant’s cry sound by voice type, using MFCC (Mel-Frequency Cepstrum Coefficients) feature extraction and a BNN (Backpropagation Neural Network). The sound is classified into 3 classes: hungry, discomfort, and tired. Before that, the voice input is verified to be an infant’s cry sound using 3 extracted features: pitch (with 2 approaches: Modified Autocorrelation Function and Cepstrum Pitch Determination), energy, and harmonic ratio. The MFCC coefficients are then classified by the Backpropagation Neural Network. The experiment shows that the system classifies infant’s cry sound with 96.67% accuracy on the test data, using 30 coefficients and 10 neurons in the hidden layer.

Keywords—infant’s cry sound; pitch; energy; harmonic ratio; mel-frequency cepstrum coefficients; backpropagation neural network

I. INTRODUCTION

Parents and nannies face many problems because they do not understand an infant’s language. A system that can show the meaning of an infant’s language is therefore needed. Understanding the infant’s language helps to reduce irritation, anxiety, and similar reactions. In a panic, a parent or nanny may take any action to calm the infant, even an abusive one. This research therefore discusses how to classify infant’s cry sound (based on voice type) and what solution can be offered in response.

Classification of infant’s cry sound is needed because some parents or nannies, especially young parents, have no experience. They feel uncomfortable when the infant is crying. They do not know what the infant wants.
As a result, the infant keeps crying.

In [12], infant cries were identified using the Mel-Frequency Cepstrum Coefficients (MFCC) algorithm, implemented in Matlab. The identification succeeded, but that study assumed the input was certainly an infant’s cry, whereas in the real world other sounds, such as a cat’s, can resemble an infant’s cry. The study in [1] is similar, but it classifies infant’s cry into two kinds: physiological status and medical disease. The study in [4] distinguishes “cry” from “no cry” using more than 3 features. This research only observes the limits of the feature values of infant’s cry. The study in [8] classifies infant’s cry into 3 kinds: normal, hypoacoustic, and asphyxia. It uses acoustic characteristic extraction techniques, Linear Prediction Coefficients (LPC) and MFCC, as features on 1-second samples, with 16 coefficients for every 50 ms frame, and an Adaptive Backpropagation Neural Network as the classifier. The reported accuracy is up to 98.67%.

Besides using MFCC as the feature to classify infant’s cry sound based on voice type, we propose adding multi-feature extraction. There are 3 features: pitch (with 2 approaches), energy, and Harmonic Ratio. With them, the classification of infant’s cry sound is expected to be more accurate. This research shows how accurately infant’s cry sound can be classified based on voice type, which helps users to know the meaning of the cry and offers a solution.

The remainder of this paper is organized as follows. Section II presents the methodology, Section III the system design, Section IV the experimental results, and Section V the conclusion and future work.

II. METHODOLOGY

A methodology is the technique used to collect and analyze data. The data collected have to be related to the objective and the problem statement. Two methods were used in this study to obtain the relevant data: data collection and interview.

A. Data Collection

We collected infant’s cry sounds (infants 0-3 months old) over 4 months. The recordings have an average duration of 5 seconds and are stored as .wav files. Each sound was labeled by the infant’s parents. The data set consists of 180 recordings in 3 classes: hungry, discomfort, and tired. Of these, 150 are training data and 30 are testing data. As negative data (not infant’s cry sound), we collected 25 recordings, such as applause, laughing, dialog, silence, and a woman’s cry.

2016 2nd International Conference on Science and Technology-Computer (ICST), Yogyakarta, Indonesia

B. Interview

The data were validated by interviewing different experts (parents or nannies, a midwife, and a nurse). The ground truth required agreement between at least 2 experts. If the validation results differed, we used the midwife’s validation. The final classification of the data is shown in Table I.

TABLE I. DATA COLLECTION OF INFANT’S CRY SOUND

Class        Number of Data
             Training   Testing
Hungry       46         11
Discomfort   53         10
Tired        51         9
Total        150        30

III. DESIGN SYSTEM

Figure 1 shows the general system architecture, which is divided into 4 parts.

1) Classification of infant’s cry sound, using 3 features: pitch, energy, and HR (Harmonic Ratio).
2) Preprocessing: frame blocking and windowing using a window function.
3) Feature extraction using Mel-Frequency Cepstrum Coefficients (MFCC).
4) Classification of infant’s cry sound based on voice type using a Backpropagation Neural Network (BNN), with various numbers of neurons in the hidden layer.

Fig. 1. General System Architecture

In the first step, the system loads a sound file. The sound file type is .wav (waveform audio format).
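The paper does not show how the file is loaded. As a minimal sketch, assuming 16-bit PCM .wav files and using only the Python standard library, the loading step could look as follows. The names `read_wav` and `to_mono` are hypothetical helpers, not part of the described system. `to_mono` averages the channel values in each row, which is how the system reduces a stereo input to a single column.

```python
import struct
import wave


def read_wav(path):
    """Return (sample_rate, rows); each row holds one value per channel in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        # Assumption of this sketch: samples are 16-bit signed PCM.
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # Samples are stored interleaved: ch0, ch1, ch0, ch1, ...
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    rows = [
        [v / 32768.0 for v in values[i:i + channels]]
        for i in range(0, len(values), channels)
    ]
    return rate, rows


def to_mono(rows):
    """Average the channel values of each row, giving a single-column signal."""
    return [sum(row) / len(row) for row in rows]
```

A mono file passes through `to_mono` unchanged, so the same call can be applied to every input.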
A sound has 2 kinds of channels: mono and stereo. A recording is usually stereo, but the system can only process mono input, which is represented by a matrix with 1 column. If the voice is stereo, the system converts it to mono by averaging the sample values in each row of the matrix. The voice input is then ready for further processing.

The system extracts 3 features from the voice input (pitch, energy, and Harmonic Ratio) to determine whether it is an infant’s cry sound. If it is, the system extracts the MFCC feature. The extraction yields coefficients, which are used as input data to train the BNN classifier.

A. Classification of Infant’s Cry Sound

In the first step, the system decides whether the voice input is an infant’s cry sound. We compute each feature (pitch, energy, and harmonic ratio) for infant’s cry sounds and for other sounds. This step ensures that the voice input is an infant’s cry sound. Only then is it passed to MFCC feature extraction.

1) Pitch

Pitch is commonly referred to as the fundamental frequency $f_0$. It is often used for sound classification [4]. Pitch feature extraction has 2 approaches: MACF (Modified Autocorrelation Function) and CPD (Cepstrum Pitch Determination). They differ in that MACF works on the autocorrelation function of the signal, while CPD transforms the signal from the time domain to the frequency domain using the FFT (Fast Fourier Transform).

a) MACF (Modified Autocorrelation Function)

The autocorrelation function used by MACF is given in (1).

$$R_i(m) = \sum_{n=1}^{W_L} x_i(n)\, x_i(n-m) \tag{1}$$

where
$m$ = lag, up to the maximum lag
$x_i(n)$ = signal amplitude value
$W_L$ = number of samples per frame

The autocorrelation is then normalized as in (2).

$$\Gamma_i(m) = \frac{R_i(m)}{\sqrt{\sum_{n=1}^{W_L} x_i^2(n) \, \sum_{n=1}^{W_L} x_i^2(n-m)}} \tag{2}$$

where
$R_i(m)$ = autocorrelation value
$m$ = lag, up to the maximum lag
$x_i(n)$ = signal amplitude value
$W_L$ = number of samples per frame

The position of the maximum of the normalized autocorrelation function corresponds to the fundamental period, from which we obtain the fundamental frequency (pitch).

$$f_0 = \frac{1}{\arg\max_{T_{\min} \le m \le T_{\max}} \{\Gamma_i(m)\}} \tag{3}$$

where
$\Gamma_i(m)$ = normalized autocorrelation function

b) CPD (Cepstrum Pitch Determination)

The cepstrum captures periodic structure in the logarithmic magnitude spectrum of a signal. It is formed by taking the IFFT (Inverse Fast Fourier Transform) of the log magnitude spectrum of the signal, as given in (4).

$$C(\tau) = F^{-1}\left\{ \log\left( \left| F\{x(n)\} \right|^2 \right) \right\} \tag{4}$$

where
$C(\tau)$ = cepstrum
$F$ = Fast Fourier Transform
$F^{-1}$ = inverse Fourier Transform

The pitch value is then obtained using (5).

$$f_0 = \frac{1}{\tau_{\max}}, \qquad \tau_{\max} = \arg\max_{\tau} C(\tau) \tag{5}$$

where
$\tau_{\max}$ = position at which the cepstrum $C(\tau)$ reaches its maximum

2) Energy

The term signal energy is used to characterize a signal. The definition applies to any signal $f(t)$, including signals that take complex values. The average energy of a discrete-time signal $x(n)$ is given in (6).

$$E = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2 \tag{6}$$

where
$x(n)$ = signal amplitude value
$N$ = number of signal samples

3) HR (Harmonic Ratio)

The last feature used to detect infant’s cry sound is HR. HR is part of the audio harmonicity descriptor of the MPEG-7 framework [11]. Its value is obtained using (7). HR is computed from the normalized autocorrelation. The position of the maximum of the normalized autocorrelation corresponds to the fundamental period, from which the fundamental frequency follows [5]. The maximum value itself is the harmonic ratio [9].

$$HR_i = \max_{T_{\min} \le m \le T_{\max}} \{\Gamma_i(m)\} \tag{7}$$

where
$T_{\min} \le m \le T_{\max}$ = minimum and maximum values of the fundamental period
$\Gamma_i(m)$ = normalized autocorrelation function

The results of feature extraction (pitch MACF, pitch CPD, energy, and harmonic ratio) are shown in Table II. From them we obtained the range of each feature (205 data: 180 infant’s cry sounds and 25 other sounds). These ranges serve as the limits for deciding whether a voice input is an infant’s cry sound.

TABLE II. RESULT OF FEATURE EXTRACTION (PITCH, ENERGY, HR)

Class     Feature               Minimum Value   Maximum Value
Cry       Pitch using MACF      158.63 Hz       501.14 Hz
Cry       Pitch using CPD       60.49 Hz        501.14 Hz
Cry       Energy                5.40E-06 dB     0.181 dB
Cry       HR (Harmonic Ratio)   0               0.8698
Not Cry   Pitch using MACF      51.76 Hz        233 Hz
Not Cry   Pitch using CPD       95.04 Hz        500 Hz
Not Cry   Energy                3.87E-09 dB     0.08 dB
Not Cry   HR (Harmonic Ratio)   0               0.83

B. Preprocessing

Once an input has been identified as infant’s cry sound, we preprocess it before extracting the MFCC feature. Preprocessing consists of 2 steps: frame blocking and windowing.

1) Frame Blocking

The signal is divided into quasi-stationary overlapping frames. The frame length is 256 samples and the overlap length is 100 samples. The number of frames is obtained from the signal data length, the frame length, and the overlap length, as given in (8).

$$nFrame = \mathrm{floor}\left( \frac{L - N}{M} \right) + 1 \tag{8}$$

where
$L$ = signal data length
$N$ = frame length
$M$ = overlap length

2) Windowing

There are many window functions, but the most commonly used is the Hamming window. Windowing reduces spectral leakage. The Hamming window is given in (9).

$$w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \qquad 0 \le n \le N-1 \tag{9}$$

where
$N$ = frame length
$n$ = sample index

After the Hamming window has been computed, the windowed signal is obtained using (10). Each frame is multiplied by the Hamming window.

$$x(n) = s(n) \cdot w(n) \tag{10}$$

where
$w(n)$ = Hamming window function
$s(n)$ = signal sample

Fig. 2. Result of windowing

C. MFCC (Mel-Frequency Cepstrum Coefficients) Feature

The MFCC feature is used as the input parameter. We obtain the MFCC coefficients as the feature.
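The frame blocking and windowing of Section III-B, which prepare the signal for the MFCC stage, can be sketched as follows. This is an illustration in plain Python, not the authors’ implementation. The function names are hypothetical, and the sketch assumes that M in (8) is the step between the starts of consecutive frames and that the Hamming window uses the common N − 1 denominator.

```python
import math


def frame_count(L, N, M):
    """Equation (8): number of frames for signal length L, frame length N, step M."""
    return (L - N) // M + 1


def hamming(N):
    """Equation (9): Hamming window of length N (assumes the N - 1 denominator)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def windowed_frames(signal, N=256, M=100):
    """Cut the signal into frames and multiply each by the window, as in (10)."""
    w = hamming(N)
    frames = []
    for k in range(frame_count(len(signal), N, M)):
        start = k * M  # assumption: consecutive frames start M samples apart
        frames.append([signal[start + n] * w[n] for n in range(N)])
    return frames
```

With the paper’s values N = 256 and M = 100, a signal of 1000 samples gives floor((1000 − 256) / 100) + 1 = 8 frames. A signal shorter than one frame yields no frames at all.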
The MFCC represents the cepstrum coefficients of the short-term power spectrum of a signal [10].

$$MFCC(l) = \frac{1}{M} \sum_{i=0}^{M-1} \log(E_i) \cdot \cos\left[ \frac{2\pi}{M} \left( i + \frac{1}{2} \right) l \right] \tag{11}$$

where
$l = 0, \ldots, M-1$

The MFCC coefficients are obtained by multiplying the short-time Fourier Transform (STFT) of each analysis frame by a series of M triangularly-shaped ideal band-pass filters, with their central frequencies and widths arranged according to a Mel-frequency scale. The total spectral energy $E_i$ contained in each filter is computed. The Discrete Cosine Transform (DCT) is then performed to obtain the MFCC sequence [5]. The MFCC coefficients are normalized before being used as neuron inputs of the BNN classifier. We extracted the MFCC feature with 10, 15, 20, 25, and 30 coefficients, a range chosen around the 18 coefficients used in previous research [12].

D. BNN (Backpropagation Neural Network) Classifier

The result of MFCC feature extraction is used as the input parameter. We propose BNN as the classifier for infant’s cry sound. The BNN architecture has 1 input layer, 1 hidden layer, and 1 output layer, and the numbers of neurons in the input layer and the hidden layer vary.

TABLE III. SPECIFICATIONS OF BNN ARCHITECTURE

Specification   Number of Neurons / Value
Input Layer     10, 15, 20, 25, and 30 neurons (coef. of MFCC)
Hidden Layer    5, 10, and 15 neurons
Output Layer    3 neurons (hungry, discomfort, and tired)

Based on Table III, we carried out 15 experiments with the various specifications of the BNN architecture (see Table IV). The table lists the numbers of true and false predictions, from which the accuracy of every experiment follows. According to Table IV, the highest accuracy is obtained in the 14th experiment.

TABLE IV. EXPERIMENTS RESULT

No.   Coef. MFCC   Neurons at Hidden Layer   True   False   Total   Accuracy (%)
1     10           5                         23     7       30      76.67
2     10           10                        25     5       30      83.33
3     10           15                        21     9       30      70
4     15           5                         21     9       30      70
5     15           10                        23     7       30      76.67
6     15           15                        22     8       30      73.33
7     20           5                         23     7       30      76.67
8     20           10                        22     8       30      73.33
9     20           15                        21     9       30      70
10    25           5                         21     9       30      70
11    25           10                        25     5       30      83.33
12    25           15                        20     10      30      66.67
13    30           5                         24     6       30      80
14    30           10                        29     1       30      96.67
15    30           15                        24     6       30      80

Fig. 3. Data training process with 10 neurons at input layer and 5 neurons at hidden layer

BNN is a supervised training method for neural networks. Its distinguishing characteristic is that it minimizes the error in the output generated by the network. BNN computes the gradient of the error with respect to the modifiable network weights. The error gradient is then used to find the weights that minimize the error. As an example, the data training process is shown in Figure 3. The best validation of the training data is reached when the MSE (Mean Square Error) of the performance is lowest. When that happens, the neural network should be saved, because the initial weights of the BNN are always random.

IV. EXPERIMENTAL RESULT

First, we observed the experimental results for the classification of infant’s cry sound. We tested 180 infant’s cry sounds and 25 other sounds. The result is shown in Table V. Pitch MACF gives by far the highest average accuracy.

TABLE V. RESULT OF CLASSIFICATION INFANT’S CRY SOUND

Feature          Cry      Not Cry   Average Accuracy
Pitch MACF       94.15%   98.53%    96.34%
Pitch CPD        87.80%   18.54%    53.17%
Energy           88.29%   18.04%    53.17%
Harmonic Ratio   87.80%   12.68%    50.24%

In the next step, we extract the MFCC feature and define a target for the neurons of the output layer. A target is a class of infant’s cry sound. There are 3 classes: hungry, discomfort, and tired. The MFCC coefficients are used as the neurons of the input layer.
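To make the target encoding and the training rule concrete, the following is a minimal sketch of a backpropagation network with one hidden layer, written in plain Python. It is not the authors’ implementation. The class `BackpropNet`, the sigmoid activation, the learning rate, and the weight range are all assumptions chosen for illustration. Only the 3 output classes and the layer layout come from the text.

```python
import math
import random

CLASSES = ["hungry", "discomfort", "tired"]


def one_hot(label):
    """Target vector: 1 at the position of the class, 0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class BackpropNet:
    """One hidden layer, sigmoid units, online gradient descent on squared error."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one neuron; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, y

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        # Output deltas: output error times the sigmoid derivative.
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
        # Hidden deltas: output deltas propagated back through w2.
        d_hid = [hj * (1 - hj) * sum(d_out[k] * self.w2[k][j]
                                     for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] += lr * d_out[k] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] += lr * d_hid[j] * v

    def predict(self, x):
        _, y = self.forward(x)
        return CLASSES[y.index(max(y))]
```

The best configuration in Table IV would correspond to `BackpropNet(30, 10, 3)`, trained by calling `train_step(x, one_hot(label))` repeatedly, where `x` is a list of 30 normalized MFCC coefficients. Because the initial weights are random, the `seed` argument is included so that a run can be reproduced.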
Figure 7 shows the interface for MFCC feature extraction and the 3 previous features (pitch with 2 approaches: Modified Autocorrelation Function and Cepstrum Pitch Determination, energy, and harmonic ratio).

The training dataset contains 150 recordings. The result for each number of MFCC coefficients is saved in a separate file. We obtained 15 BNNs, one for each combination of the number of neurons in the input layer and in the hidden layer.

Fig. 4. Interface of experimental result

Based on figure 11, the best number of neurons in the hidden layer is 10, and the best number of MFCC coefficients is 30. It can be concluded that a larger number of MFCC coefficients improves the system’s ability to classify infant’s cry sound, with the best result at 30 coefficients. However, the more MFCC coefficients are used, the longer the classification takes. The number of neurons in the hidden layer also affects the number of iterations in the training process.

The system can save every experiment, including the number of MFCC coefficients, the number of neurons in the hidden layer, the number of true predictions, the number of false predictions, and the accuracy rate. Figure 11 shows the accuracy rate, which is computed automatically. In addition, the system presents solutions as further outputs. The output contains the text, audio, and image of a solution. For example, if the system detects that the infant’s cry sound is in the “discomfort” class, it displays the text and audio output “Change a Diaper/ Holding a Baby” together with the solution image.

Fig. 5. Interface of testing

V. CONCLUSION

Based on the experimental results, we conclude that the pitch MACF feature is more reliable than pitch CPD, energy, and HR for the classification of infant’s cry sound. Pitch MACF reached an average accuracy of 96.34%.
Pitch MACF alone can therefore be used as the reference for deciding whether a sound is an infant’s cry.

For the classification of infant’s cry sound based on voice type, we obtained the highest accuracy, 96.67%, with 30 MFCC coefficients as input neurons and 10 neurons in the hidden layer. This architecture is therefore the recommended one.

In future work, we will develop a system that runs on a smartphone to classify infant’s cry sound based on voice type. The aim is to let users learn the meaning of an infant’s cry more easily and quickly.

ACKNOWLEDGMENT

The authors would like to thank the respondents (parents or nannies) and the experts (midwife and nurse) for supporting the research. The authors are also grateful to www.mathwork.com for their invaluable support in deploying the system and in providing experimental data and technical documentation concerning the “Infant’s Cry Sound Classification” research.

REFERENCES

[1] N.A. Al-Azzawi, “Automatic Recognition System of Infant Cry based on F-Transform”, International Journal of Computer Applications, 2014, pp. 28-32
[2] A.V. Oppenheim and R.W. Schafer, “Discrete-Time Signal Processing 2nd Edition”, 1999, p. 60
[3] Cepstrum. https://en.wikipedia.org, 2 July 2015
[4] R. Cohen and Y. Lavner, “Infant Cry Analysis and Detection”, IEEE 27-th Convention of Electrical and Electronics Engineers in Israel, 2012
[5] L. Fausett, “Fundamentals of Neural Networks Architectures, Algorithm, and Applications”, New Jersey: Prentice Hall, pp. 289-333
[6] F. Gorunesca, “Data Mining Concepts, Models, and Techniques”, Springer, 2011, pp. 325-326
[7] S. Furui, “Digital Speech Processing, Synthesis and Recognition”, Marcel Dekker Inc., 2001
[8] O.F.R. Galaviz and C.A.R. Garcia, “Infant Cry Classification to Identify Hypoacoustics and Asphyxia with Neural Networks”, Springer, 2004
[9] T. Giannakopoulos and A. Pikrakis, “Introduction to Audio Analysis: A MATLAB Approach”, Elsevier, 2014
[10] M. Hariharan, S. Yaacob, and S.A. Awang, “Pathological Infant Cry Analysis using Wavelet Packet Transform and Probabilistic Neural Network”, Elsevier, 2011, pp. 15377-15382
[11] H.G. Kim, N. Moreau, and T. Sikora, “MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval”, 2006
[12] M.D. Renanti, A. Buono, and W.A. Kusuma, “Infant Cries Identification by Using Codebook as Feature Matching, and MFCC as Feature Extraction”, Journal of Theoretical and Applied Information Technology, 2013, pp. 437-442