Missing Value Imputation for PM10 Concentration in Sabah using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm

Missing data in large data analysis has affected further analysis conducted on dataset. To fill in missing data, Nearest Neighbour Method (NNM) and Expectation Maximization (EM) algorithm are the two most widely used methods. Thus, this research aims to compare both methods by imputing missing data of air quality in five monitoring stations (CA0030, CA0039, CA0042, CA0049, CA0050) in Sabah, Malaysia. PM10 (particulate matter with aerodynamic size below 10 microns) dataset in the range from 2003–2007 (Part A) and 2008–2012 (Part B) are used in this research. To make performance evaluation possible, missing data is introduced in the datasets at 5 different levels (5%, 10%, 15%, 25% and 40%). The missing data is imputed by using both NNM and EM algorithm. The performance of both data imputation methods is evaluated using performance indicators (RMSE, MAE, IOA, COD) and regression analysis. Based on performance indicators and regression analysis, NNM performs better compared to EM in imputing data for stations CA0039, CA0042 and CA0049. This may be due to air quality data missing at random (MAR). However, this is not the case for CA0050 and part B of CA0030. This may be due to fluctuation that could not be detected by NNM. Accuracy evaluation using Mean Absolute Percentage Error (MAPE) shows that NNM is more accurate imputation method for most of the cases.


INTRODUCTION
Air quality monitoring in Malaysia is continuously conducted by Department of Environment (DOE) and is done in stations around Malaysia (Dominick et al., 2012).These stations collect PM 10 concentration data at one-hour interval.However, due to maintenance, calibration of monitoring instruments and power outage, data collected by monitoring stations may suffer missingness.Missing data mechanism can be categorized into three different types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) (Nakai and Ke, 2011).Missingness is categorized as MNAR when it depends on the missing value itself.MNAR is known to be non-ignorable and missing data due Missing Value Imputation for PM 10 Concentration in Sabah using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm to MNAR is not possible to be recovered (Graham, 2009).On the other hand, missingness due to MAR depends on the observed data.MAR is ignorable and missing data can be recovered because its missingness does not depend on missing data itself.MCAR is a special case of MAR, where missingness is independent of both missing data and observed data (Dong and Peng, 2013).A set of data containing missing data due to MCAR can be considered as complete dataset because the missingness does not introduce bias (Dong and Peng, 2013).Little's MCAR test can be used to determine whether the missingness is due to MCAR (Li, 2013).If the missingness is not MCAR instead, this test cannot be used to determine whether the missingness is due to MAR or MNAR (Dong and Peng, 2013).In terms of air quality data in Malaysia, mis singness can be considered as MAR because the missingness is mainly caused by maintenance, calibration of monitoring instruments and power outage.It does not depend on whether the value of data is lower or higher than certain value.Missingness can affect further analysis that requires complete dataset such as Fourier analysis and principal component analysis.
Particulate matter (PM) is mixture of substances in the form of small particles suspended in the air.PM is one of the critical components of air pollution (Li et al., 2017b).Due to its small size, PM can enter respiratory system, thus becoming one of major concerns in public health (Chang et al., 2018).Because of this, scientific attraction has been attracted towards PM (Shahraiyni and Sodoudi, 2016).PM mainly comes from motor vehicles, dust from construction sites and landfills.It also comes from biomass burning and brought by haze, a typical challenge in Southeast Asia since 1980s (Shaadan et al., 2015).PM 10 (particulate matter with aerodynamic diameter less than 10 microns) is one of major concern because it possesses hazardous properties towards human health compared to other pollutants such as carbon monoxide and nitrogen dioxide (Kim et al., 2015;Ny and Lee, 2010).This is because it can enter respiratory system while defending natural defences of human body (Chang et al., 2018).PM 10 can increase risk of asthma, aggravate bronchitis, respiratory syncytial virus (RSV) bronchiolitis and other lung diseases (Carugno et al., 2018;Lelieveld et al., 2015).This is especially true for children aged between 5-15 years (Cadelis et al., 2014).Other than respiratory problems, cardiovascular disease and cancer can be developed due to PM 10 in the air (Li et al., 2017a).
Many agencies around the world such as European Union (EU) and World Health Organization (WHO) implemented guidelines and set limit on air pollution concentration levels (Abd. Rani et al., 2018).In Malaysia, the guidelines are implemented by DOE.According to New Malaysia Ambient Air Quality Standard, PM concentration has its standard set to 50 μg/m 3 (1-year averaging time) on 2015 before it is gradually lowered to 40 μg/m 3 by 2020 (Department of Environment, n. d.).The implementation of this standard is important in order to ensure that air quality can be maintained at safe level.Therefore, there is a need to continuously monitor ambient air quality around Malaysia.
This research focuses on evaluating performance of data imputation on air quality data from five monitoring stations around Sabah.To make performance evaluation possible, missingness is introduced to compare observed data with imputed data.Two methods of data imputation are studied in this research, namely Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) algorithm.Many previous studies have employed nearest neighbour method and expectation-maximization algorithm to obtain complete dataset.However, not many of these studies emphasize on the efficiency of these two methods in data imputation.By comparing between both NNM and EM algorithm, further analysis that requires complete dataset can be made more accurate.

1 Study Area and Data
Five monitoring stations (CA0030, CA0039, CA0042, CA0049, CA0050) in Sabah are listed in Table 1.Respective cities of each monitoring station are located as shown in Fig. 1.Except for CA0049, other monitoring stations are located at low altitudes and are close to the sea.Furthermore, Labuan (CA0050) is situated on a small island located at western of Sabah.As shown in Fig. 2, PM 10 concentration in Sabah differs between seasons and location (Kanniah et al., 2016).Western coast of Sabah generally has higher PM 10 concentration compared to other parts of Sabah all-year round.Also, PM concentration in Sabah is generally lower during intermonsoon October.
These monitoring stations, operated by DOE, continuously measures PM 10 concentration data at 1-hour interval.PM 10 concentration is measured using tapered element oscillating microbalance (TEOM), with temporal resolution of 1 h.As wind direction is angular quantity, wind speed and direction must be converted into x-component (east-west) and y-component (northsouth) wind speed using equations (1) and (2).This prevents difficulty in analysis due to nature of angular quantity (Muhammad Izzuddin et al., 2019;Kovač-Andrić et al., 2009).

2 Introduce Missingness to Data
In order to ensure that imputed data can be validated, a fraction of observed data must be replaced by missingness.Depending on complexity, missingness is introduced into data by percentage as conducted in previous research by Noor et al. (2014) as shown in Table 2.A sequence of zeros and ones (0 -do not replace observed data, 1 -replace observed data with missingness) is randomly generated using MATLAB 2018b and is used as a reference to introduce missingness to observed data.The actual percentage after introducing missingness may  deviate by up to 2% due to existing missingness in the data.

3 Data Imputation
A lot of data imputation method has been proposed for temporal dataset (Bai et al., 2019).Due to simplicity, two of the most popular methods used in data imputation are NNM and EM.NNM is common in replacing missing air quality data (Li and Liu, 2014;Dominick et al., 2012).For a stream of missing data bounded by observed data (x 1 , y 1 ) in lower bound and (x 2 , y 2 ) in upper bound, missing data is replaced with a value calculated using equations ( 3) and (4) (Abd Rani et al., 2018;Zakaria and Noor, 2018;Siti Zawiyah et al., 2010;Junninen et al., 2004).NNM is performed by executing a code developed using MATLAB 2018b.
EM algorithm employs a set of iterative equations to estimate mean vector and covariance matrix of multivariate distribution from exponential family ( Junger and de Leon, 2015).This method maximizes log likelihood to find parameters when there are missing values (Nakai and Ke, 2011).The simplicity and smooth operation of EM algorithm makes it unique among present multiple imputation methods.In addition, its faster operation compared to the alternatives makes EM algorithm one of the most popular imputation methods (Abd Rani et al.,

2018).
Given a set of data consisting of observed data D obs and missing data D mis , EM algorithm starts by defining parameter θ as a random value.Then, E-step (expectation step) calculates the likelihood of each values of D mis for every missingness.M-step (maximization step) uses computed values of D mis to find better estimation of θ.Given the likelihood function L and expected value of log likelihood function Q (θ|θ (t) ), both E-step and M-step iterate until the value converges (Abd Rani et al., 2018).Both E-step and M-step are executed using equations ( 5) and (6). (5)

4 Performance Evaluation
The performance of data imputation is evaluated by using performance indicators.The performance indicators that have been used are root mean square error (RMSE), mean absolute error (MAE), index of agreement (IOA), and coefficient of determination (COD).The performance indicators are calculated by using equations ( 7) to (10) (Abd. Rani et al., 2018;Nuryazmin et al., 2015;Ul-Saufie et al., 2013;Junninen et al., 2004):  ---------------------------------- where n is total number of data, P i is predicted value of ith data, O i is observed value of ith data, is mean predicted value, is mean observed value, s p is standard deviation of predicted values, and s o is standard deviation of observed values.

5 Mean Absolute Percentage Error (MAPE)
Mean absolute percentage error (MAPE) is a measure that evaluates accuracy of a prediction model (Khair et al., 2017).MAPE indicates error in predicting the value of missing data when comparing to real value.MAPE is calculated using equation (11) as follows (Khair et al., 2017).

RESULT AND DISCUSSION
3. 1 Performance Indicators PM 10 concentration datasets for five monitoring stations in Sabah are analysed.RMSE, MAE, IOA, and COD are calculated for every percentage of missingness and station for both part A and B. Tables 3 and 4 reveals performance indicators for NNM and EM at 5 missingness levels and 5 different stations for part A and part B respectively.The desirable attributes between these methods are highlighted in bold.In terms of missingness level, there is no definite relationship between performance of data imputation and missingness level.This is because both NNM and EM impute missing data based on available data.As long as available data is sufficient, missing data can still be effectively imputed.
Most of the data show that nearest neighbour method is better imputation method.This may be due to the nature of missingness in relation to ability of EM algorithm to impute data.EM algorithm works best for missing data caused by MCAR (Nakai and Ke, 2011;Graham, 2009).However, air quality data collected in monitoring stations are not caused by MCAR as the cause of missingness is known.This may attribute to lower performance of EM algorithm compared to NNM.
However, this is not the case for CA0050, where most of the performance indicators for that station show that EM algorithm is a better imputation method.This may be due to the fact that Labuan is surrounded by sea.One study has shown that air humidity is affected by bodies of water due to high heat capacity and strong evaporation (Zhu and Zeng, 2018).Furthermore, cold-wet air that surrounds a water body enhances air flow away from bodies of water by changing the local air circulation (Zhu and Zeng, 2018).The local air circulation highly affects humidity in Labuan.Another study suggests that different levels of humidity affects PM 10 concentration differently (Lou et al., 2017).PM 10 concentration increases with humidity up to 60%.Beyond that point, gravity deposition occurs and PM 10 concentration begins to drop (Lou et al., 2017).PM 10 concentration as monitored by CA0050 may fluctuate due to continually changing of humidity level, traffic congestion and active industrial activity.This fluctuation is not accounted by NNM, leading to indication that EM algorithm is better imputation method for data collected by CA0050.
As for PM 10 concentration read by CA0030, several performance indicators show that EM algorithm is better imputation method especially for part B of the data.This may be due to fluctuation of PM 10 concentration in Kota  Kinabalu especially between year 2008 and 2012.One study shows that PM 10 concentration from 16 th to 18 th January 2012 spiked at 7.00 a.m. and fluctuates at the other time (Chang et al., 2018).When this portion of data is missing, NNM may not be able to restore the missing-ness as well as EM algorithm.

2 Regression Analysis on Imputed Data
The performance of data imputation is further evaluated by calculating correlation of coefficient R on predict- ed data against observed data.The most ideal case of imputed data occurs when predicted data equals observed data (R = 1).Tables 5 and 6 reveals coefficient of correlation of data in part A and B respectively, for all five missingness percentages and five stations.Similar to performance indicators, coefficient of correlation shows that NNM is better imputation method for monitoring stations in Tawau, Sandakan and Keningau.As for CA0030, NNM is better imputation method for Part A, but not in Part B. Dataset recorded by CA0050 strongly suggests that EM algorithm is better imputation method.
Fig. 3 reveals scatter plot of data imputation for both CA0042 and CA0050.CA0042 and CA0050 are selected to be presented in the Fig. 3 because CA0042 is located at high altitude while CA0050 is located in a small island.The predicted-observed regression is shown for both stations due to different geographical condition in contrast to the other three stations.Coefficient of correlation for CA0042 shows relatively large difference between two methods compared to other stations.As shown in Fig. 3, all scatter plots for CA0042 shows that line representing NNM is closer to dashed line compared to line that represents EM algorithm.This shows that NNM has greater tendency to predict missing data closer to observed data compared to EM algorithm.This might be caused by missingness mechanism, in which data is Missing at Random.EM algorithm may not be able to impute MAR data as well as MCAR data (Nakai and Ke, 2011;Graham, 2009).
Meanwhile, CA0050 shows that EM algorithm gives better coefficient of correlation in contrast to other stations.Despite that, Fig. 3 reveals that NNM has either greater tendency (Part A) or approximately similar to EM algorithm (Part B) to predict missing data.This is because the lines representing NNM and EM are plotted at best fit.However, the scatter plot shows that imputed data by NNM for CA0050 are more dispersed away from line of best fit compared to that of CA0042, which might contribute to lower R value of NNM compared to EM algorithm.Although best fit line for NNM is closer to dashed line, the dispersion of scatter plot shows that EM algorithm is better imputation method compared to NNM.

3 Mean Absolute Percentage Error (MAPE)
Performance of data imputation is further evaluated using MAPE.Data imputation is most accurate when MAPE approaches zero.Table 7 reveals accuracy of data imputation using NNM and EM for all stations and various level of missingness.According to Table 7, it is shown that NNM is generally more accurate data imputation method compared to EM (except for CA0050 in set B).This is reflected by lower values for NNM for most of the cases.This may be due to its ability to predict missing data closer to actual data compared to EM.

CONCLUSION
Generally, it has been shown that NNM is better imputation method for data from all the monitoring stations

For
the purpose of this research, 10-year hourly data from 2003 to 2012 are divided into two parts.The first part (Part A) ranges from 2003 to 2007, while the second part (Part B) ranges from 2008 to 2012.Due to climate change, trends of PM 10 concentration data may differ from both parts.Thus, both parts may have difference in these data.

Fig. 3 .
Fig. 3. Scatter plot for imputation of data from CA0042 and CA0050 for part A and B at various missingness percentage.Blue indicates NNM, red indicates EM, while dashed line represents the point where predicted data equals observed data.

Table 1 .
Location of monitoring stations in Sabah.

Table 3 .
Performance indicators for every station and missingness percentage for part A.
Remark: Data ranges from year 2003 to 2007

Table 4 .
Performance indicators for every station and missingness percentage for part B.
Remark: Data ranges from year 2008 to 2012

Table 5 .
Coefficient of correlation for dataset in Part A.

Table 6 .
Coefficient of correlation for dataset in Part B.

Table 7 .
Mean absolute percentage error (MAPE) of stations in Sabah for various missingness level.Sabah except CA0050.NNM works most efficient for CA0049 in Part A (RMSE<14.302,MAE<10.640,IA>0.819andCOD>0.586) and CA0042 in Part B(RMSE<10.722,MAE<7.526,IA>0.835and COD  >0.632).This may be due to missing data type of MAR.However, strong fluctuation which may be present in data from CA0050 and part B from CA0030 may cause NNM to impute data not as well as EM algorithm.This may be further confirmed by regression analysis for CA0050 (R>0.711 for part B).Evaluation of accuracy using MAPE reveals that NNM is more accurate imputation method for most cases (except for set B in CA 0050).This shows that NNM can be used as data imputation for missing data found in dataset observed by stations in Sabah.Accurate data imputation is important for future research because this enables further analysis on air quality data to become more reliable. in