1. Introduction
A wide range of healthcare applications, including wearable monitoring systems, are projected to benefit from the Internet of Things (IoT) [1, 2]. Remote healthcare systems pay close attention to the IoT, which is regarded as one of the most important future technologies. The healthcare sector adopted the IoT quickly because incorporating IoT features into medical devices improves both the quality and efficiency of services. This brings significant benefits for older persons, people with chronic illnesses, and others who require consistent management [3, 4]. Cardiovascular disease (CVD), commonly referred to as heart disease, is the greatest cause of death worldwide. According to a recent study by the International Heart Federation, CVD is responsible for one in three fatalities. Statistics from the World Health Organization (WHO) indicate that by 2030 more than 23.6 million individuals could die from CVD, primarily due to heart failure and stroke. Anxiety, alcohol consumption, tobacco use, improper nutrition, sedentary lifestyles, and conditions such as diabetes and high blood pressure are only a few of the causes of CVD [5, 6]. Nonetheless, most CVD-related disorders are believed to be fully treatable when discovered in their early stages, so healthcare professionals must give the diagnosis and prediction of heart failure higher priority. Emerging data-analysis methodologies may enable an early diagnosis of CVD by examining a patient's medical history [7].

CVD mortality rates are rising daily, which makes forecasting these diseases one of the most important topics in healthcare. Prediction aids early disease detection, lowering the risk of illness or ensuring the best possible treatment. Machine Learning (ML) methods have been used in numerous studies to predict CVD from clinical information [8, 9]. However, the high complexity and class imbalance of clinical records make them extremely challenging, and applying ML without addressing these issues can reduce the methods' efficiency and accuracy. Previous studies have applied multiple ML algorithms to predict CVD with a focus on feature selection [10, 11]. The application of ML techniques to medical diagnosis is constantly growing, primarily because improvements in disease classification and recognition can provide information that helps medical professionals detect and diagnose diseases early, sustain human health, and lower the death rate [12, 13]. The probability of disease occurrence is frequently determined using classification algorithms, which are ML methodologies.
The purpose of this study is therefore to develop a classification algorithm that predicts cardiovascular disease using actual patient data. The goal of ML is to build a system that can predict the future using past data [14-16]. There are three different kinds of ML methods. The first is supervised learning, which involves labelled data for model training and test data for assessing model performance; classification and regression problems are the typical supervised learning tasks. The second is unsupervised learning, where no labels are applied to the data and the model seeks out any potential hidden patterns [17, 18]; clustering is a typical example of an unsupervised strategy. The third is reinforcement learning, which neither uses labelled data nor relates its findings to the data; it is concerned with the ability of intelligent agents to act in a given environment. Classification methods are learning algorithms that are frequently employed to determine the likelihood of disease occurrence. Classification is a well-known ML technique that predicts the class of new samples using a model learned from training data. Current approaches for diagnosing heart diseases are inefficient for a variety of reasons, including lack of accuracy and high computational time [19, 20]. Hence, researchers are working to develop an efficient strategy for the timely identification of heart diseases. However, it is extremely difficult to diagnose and manage cardiac disease when cutting-edge equipment and medical professionals are not available, even though many lives could be spared with the right diagnosis and treatment. Therefore, the proposed work intends to develop a novel and intelligent remote healthcare framework for predicting CVD from patients' medical information. The novel contribution of the proposed work is to design and develop an efficient and lightweight health monitoring and disease diagnosis tool for remote applications. For this purpose, unique methodologies including Auto-encoder for Recurrent Imputation and Denoising (ARID), Artificial Gorilla Troop Optimization (AGTO), and Multi-Linear Regression Classification (MLRC) are implemented in this study. With the help of these techniques, the overall disease diagnosis performance of the remote healthcare monitoring system is greatly improved. The original contributions of this research work are as follows:
• To build an IoT-assisted remote healthcare framework for the classification and prediction of cardiovascular disease (CVD) based on a patient's medical history.
• Data preprocessing is used first to create the normalized dataset by replacing the missing attributes and values with the help of Auto-encoder for Recurrent Imputation and Denoising (ARID).
• An Artificial Gorilla Troop Optimization (AGTO) technique is used to choose the most important attributes from the standardized medical dataset.
• A Multi-Linear Regression Classification (MLRC) model is used to accurately classify patients' medical data as either healthy or CVD-affected.
• Four independent benchmarking datasets, including Cleveland, Statlog, Comprehensive, and Mendeley, have been used to validate the performance and efficacy of the recommended AGTO-MLRC based disease detection system.
The main contribution of the proposed work is to develop a computationally effective disease diagnosis framework for remote healthcare applications. In the previous literature, many techniques have been developed for identifying and categorizing classes of disease from patients' medical records. However, some of these techniques use a large number of features to identify the class of disease precisely, which significantly increases the time complexity of classification and lowers processing speed. In certain cases, existing deep learning techniques follow complex mathematical modeling for disease classification, so the computational burden is greatly increased. Moreover, some learning approaches require a large number of samples for training and testing, which affects the efficacy of the classifier. Considering these issues, the proposed work aims to develop a simple, lightweight, and computationally sound disease diagnosis methodology for remote healthcare applications. The novel contribution of this work is that it adopts new computational algorithms, namely ARID, AGTO, and MLRC, to build a unique disease detection framework for CVD diagnosis in remote healthcare applications. In the proposed work, the ARID deep learning methodology is implemented for data preprocessing, in which missing values are identified and replaced to create balanced medical data. Subsequently, the AGTO technique is deployed for feature selection, which simplifies the classification process by reducing the number of samples required for training and testing. By adopting this methodology, the time complexity and computational burden of classification are greatly minimized. In addition, the MLRC technique is deployed to accurately categorize the class of disease with simple mathematical modeling. By integrating these methodologies, the overall disease diagnosis performance of the remote healthcare system is greatly improved. Most existing frameworks directly apply deep learning based classification methodologies without preprocessing and dimensionality reduction, whereas the proposed work applies the different data handling operations with simple computational operations.
The remainder of this paper is organized as follows: Section 2 reviews previous studies related to remote healthcare monitoring and CVD prediction using AI algorithms. Section 3 briefly explains the suggested healthcare monitoring framework along with its general layout and examples. Section 4 compares the outcomes with current state-of-the-art models and evaluates the proposed CVD prediction model's results using a number of parameters. Section 5 offers an overall summary of the article along with its future scope.
2. Related Works
The most commonly used machine learning techniques in remote healthcare applications are as follows: Support Vector Machine (SVM), K-Nearest Neighbor (kNN), Logistic Regression (LR), Stochastic Gradient Descent (SGD), and Decision Tree (DT).
CVDs are currently the leading cause of death in the world, killing almost 18 million people every year. Early detection and management of the disease remain of utmost importance because the interrelationship between factors such as high blood pressure, high cholesterol levels, abnormal pulse rates, and other risk factors complicates diagnosis. In the last few years, AI and IoT-related technologies have shown outstanding promise in transforming healthcare through better and timelier diagnosis. Among these, machine learning algorithms have emerged as some of the most powerful tools for predicting CVD, since they can analyze large, complex volumes of data and extract patterns that would otherwise go unnoticed by traditional methods of analysis. Even with such advances, improving their accuracy, efficiency, and scalability still poses certain challenges. It is in this regard that this work tries to bridge the gap by addressing the above limitations and developing a framework for IoT-based healthcare monitoring for early prediction of CVD. Data preprocessing techniques are employed in the framework to produce a normalized dataset that enhances its classification performance. For feature selection, the most relevant features in the dataset are selected using the Artificial Gorilla Troop Optimization algorithm, ensuring that only the most relevant information is considered for prediction. Further, a Multi-Linear Regression Classification model is integrated for accurate classification of medical data as either healthy or CVD-affected. The results of the proposed AGTO-MLRC mechanism have been validated and compared against results on several popular benchmarking datasets, with a view to advancing the state of the art in CVD prediction and providing a more robust and scalable healthcare monitoring system.
SVM: A popular supervised learning technique, SVM [21] is utilized to solve challenging problems such as classification and regression. In order to separate the data points linearly, SVM maps them from a low-dimensional to a high-dimensional space; the data points are then assigned a classification boundary using a hyper-plane.
KNN: The kNN [22] is used to categorize data points depending on how far apart they are from one another. This strategy classifies unknown data using a distance measure and then links it with the learned data to produce the prediction results. The k value defines the maximum number of nearest neighbours considered, and the unknown sample is assigned to the class most common among them. In addition, the weight function used to make the prediction can be adjusted in accordance with the problem, with the points either being weighted equally or according to the inverse of their distance.
LR: The linear model known as LR [23] handles classification tasks. It is a predictive analysis algorithm that uses regression and is built around the concept of likelihood. For binary data, LR is frequently used to determine the outcome from one or more features. The cost function of LR can differ depending on the regularization method used.
SGD: The one-versus-all approach is used by SGD [24] to integrate multiple binary classifiers. The SGD is often used for large datasets since it consumes the samples iteratively. It is simple to understand and put into practice because its operational approach is comparable to the regression approach.
DT: It is [25] one of the most popular classification techniques; it uses a tree structure to synthesize a set of classification rules collected from the data in order to explain actions and their potential impacts. A DT classifier is made up of three elements: a root node that represents the entire data set, numerous decision nodes that represent sub-node divisions and attributes, and multiple leaf nodes that indicate the outcome classes. The training set is repeatedly split into subsets by DT algorithms, with each branch assigned the subset with superior feature values. Pruning, which entails removing a few sub-nodes of decision nodes, is a technique used by DT to reduce overfitting. The maximum tree depth is a key hyper-parameter that controls the model's complexity, because a deeper tree has more sub-trees that allow for more fine-grained decisions.
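For illustration, the sketch below shows how these baseline classifiers are typically instantiated and compared with scikit-learn; here X and y stand for any preprocessed heart-disease feature matrix and label vector, and the hyper-parameter values are illustrative assumptions rather than settings used in the reviewed works.

```python
# Hedged sketch: comparing the five baseline classifiers discussed above on a held-out split.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate_baselines(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        "SVM": SVC(kernel="rbf"),                                  # hyper-plane classifier
        "kNN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
        "LR": LogisticRegression(max_iter=1000),
        "SGD": SGDClassifier(max_iter=1000),                       # iterative one-versus-all learner
        "DT": DecisionTreeClassifier(max_depth=5),                 # max depth limits overfitting
    }
    # returns the test accuracy of each baseline
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```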
Wani, et al. [26] applied an advanced AI mechanism for efficient disease prevention and diagnosis in the healthcare sector. The goal of that article is to identify possible machine learning applications in the study of transmissible illnesses and the wider healthcare system. AI and similar technological advances are being employed progressively in industries and communities, and they are beginning to appear in medical care. These advances may revolutionize a variety of aspects of healthcare, including the rules and regulations used by suppliers, consumers, and drug companies. Numerous studies additionally demonstrate that AI may perform more successfully than individuals in crucial healthcare tasks, including disease diagnosis. Algorithms already surpass radiologists at recognizing malignant cells and can instruct researchers on how to generate networks for complex clinical studies. Neelakandan, et al. [27] deployed a blockchain-integrated deep learning architecture for secure healthcare data transmission and disease diagnosis. That research offers the BDL-SMDTD model, a blockchain-based system that enables safe medical data transfer and diagnosis. The model aims to diagnose disease with a superior detection rate and reliably transfer imaging data, and it includes a number of operational stages, including image capture, encoding, blockchain, and the diagnostic procedure. Babu, et al. [28] applied a linear regression policy for handling high-dimensional big data; the purpose of that paper is to develop a smart healthcare system for effective decision making while identifying diseases from medical records. Ahsan, et al. [29] conducted a comprehensive review of several machine learning techniques for disease diagnosis, providing a clear overview of machine learning and deep learning algorithms that enable reliable disease prediction.
Yang, et al. [30] implemented several classification methods for the detection and categorization of CVD, including CART, bagged trees, NB, and AdaBoost. The purpose of that work is to deploy a multi-variate regression model for accurate CVD diagnosis and classification; however, imbalanced dataset handling and reduced AUC are its major flaws. Mohan, et al. [31] introduced an effective heart disease prediction framework using a hybridized machine learning algorithm. Krittanawong, et al. [32] presented a comprehensive study examining the performance and effectiveness of several machine learning algorithms used for CVD prediction. Based on this review, it is noted that the standard SVM and AdaBoost algorithms provide effective results in CVD prediction; selecting the most suitable machine learning algorithm for a given research problem is also one of the most essential tasks in recent times. Rossello, et al. [33] introduced a new risk prediction mechanism for identifying CVD from patients' medical history. Despite the enormous effort made in recent decades to improve clinical outcomes, CVD continues to be the leading cause of death and disease worldwide. Guidelines for CVD prevention advise applying risk prediction tools to identify those most at risk of CVD and to offer them preventative actions. Risk prediction is primarily used to support informed treatment and triage decisions regarding the start, stop, or intensification of preventive medication; in terms of absolute risk reduction, "high-risk" patients are generally believed to gain the greatest benefit from risk factor treatment. Ghosh, et al. [34] developed a feature selection mechanism based on the Least Absolute Shrinkage and Selection Operator (LASSO) for predicting CVD with better accuracy. Here, two selection strategies, Relief and LASSO, are applied to obtain the most significant features according to their rank values in the medical records. Relief is an attribute selection approach that assigns a weight to each feature in the dataset, and these weights are then gradually adjusted: vital features should receive a substantial weight, whereas less significant ones should receive a small weight. To calculate feature weights, Relief employs methods similar to those used in kNN, which helps address the over-fitting and under-fitting issues of machine learning training. Moreover, a number of hybrid strategies, such as Boosting and Bagging, are used to increase the testing rate while minimizing execution time, and data transformation and scalar conversion operations are performed to handle the missing values during preprocessing. Fitriyani, et al. [35] developed an effective heart disease diagnosis model using an edited nearest neighbor classification algorithm. Here, the DBSCAN clustering technique is used to identify dense regions, which are determined by the number of items located around a particular center point; points that lie outside such regions are classified as outliers. The target of that paper is to improve the detection accuracy of the heart disease detection framework with the use of a hybrid algorithm.
The prediction and early recognition of CVD with machine learning and IoT technology have advanced in the last few years, and many studies propose algorithms and frameworks for accurate and efficient early disease detection. Machine learning has emerged as a very powerful tool for predicting cardiovascular diseases because it can analyze big datasets and detect complex dependencies that classical statistics usually cannot reveal. Some of the earlier works on the Cleveland Heart Disease dataset applied different machine learning algorithms, ranging from decision trees through neural networks to logistic regression, and showed that high accuracies can be achieved with machine learning models, opening the pathway for more research in this area. Later works involved more advanced techniques, including support vector machines and ensemble methods. Recent studies showed that ensemble methods, especially random forests, consistently outperformed single classifiers on multiple heart disease datasets, including Cleveland and Statlog, with significantly better accuracy and robustness.
Feature selection can be the crux of developing any good machine learning model for predicting CVD. It involves identifying the features that contribute most to the predicted outcome, improving model accuracy while reducing computational complexity. Techniques such as principal component analysis and recursive feature elimination are acceptable and work well with most datasets, but they often struggle with high-dimensional data and sometimes fail to capture nonlinear relationships between features. More recent improvements employ evolutionary algorithms to drive feature selection: Genetic Algorithms and Particle Swarm Optimization (PSO) both appear well suited to optimizing feature sets for enhanced predictive performance. Research has investigated PSO for feature subset selection from large CVD datasets to enhance model interpretation while avoiding overfitting.
The embedding of IoT devices into machine learning pipelines for real-time health monitoring plays a major role in CVD prediction. With continuous capture of data related to heart rate, blood pressure, and physical activity, such systems produce datasets that are rich inputs for machine learning models. Many studies have explored the potential of IoT-based systems for monitoring CVDs; one example is an IoT-based predictive framework for heart rate and CVD monitoring that combines wearable sensors with machine learning algorithms hosted in the cloud. The results showed that continuous, real-time monitoring and analytics have the potential to vastly improve the detection of cardiac anomalies. Similarly, IoT-enabled systems for high-risk elderly patients with cardiovascular diseases combine continuous sensor monitoring with machine learning algorithms to predict adverse cardiac events and issue timely alerts to caregivers or healthcare providers.
Despite the marked improvements in machine learning techniques for CVD prediction, a number of issues and shortcomings continue to hamper their true effectiveness and robustness. One major issue is the high dimensionality of medical datasets: high dimensions can trap algorithms in the "curse of dimensionality", where performance deteriorates in an exponentially large space of irrelevant or redundant features. Traditional feature selection methodologies such as PCA and RFE often cannot handle non-linear relationships and interactions between features. Many machine learning models, whether decision trees or support vector machines, also suffer from overfitting. Overfitting is especially common on small or imbalanced datasets, which is characteristic of medical problems; it results in poor generalization to new, unseen data because the model fits noise and outliers in the training data. Besides, complex models such as neural networks and ensemble methods such as random forests and gradient boosting machines typically lack interpretability. Even though these models are powerful, they act like black boxes, which makes their predictions and decisions difficult for healthcare practitioners to understand. This lack of transparency may impede adoption in clinical settings where interpretability and explainability are paramount. In addition, integrating real-time data from IoT devices is a challenge: ensuring the reliability of continuous data streams, handling data privacy and security, and maintaining system robustness against failure are all major concerns. A further downside is the computational complexity and resource-intensive nature of advanced machine learning algorithms; training sophisticated AI models on huge datasets is computation-intensive and time-consuming, which may not be affordable for every healthcare facility. Finally, most research efforts lack a standardized protocol for preprocessing data and evaluating the developed models, leading to inconsistent results and making performance comparisons across studies problematic. These are clear signals of the need for stronger, more interpretable, and more effective machine learning techniques that can meet the particular demands of CVD prediction and healthcare applications.
The literature review leads to the conclusion that the bulk of healthcare frameworks created in the current works place a strong emphasis on employing machine learning algorithms to predict CVD. The main drawbacks of traditional works are their higher computational complexity, training time requirements, and lack of reliability. As a result, the proposed effort aims to create a novel integrated classification model for CVD prediction from patient history using meta-heuristic optimization and IoT technology.
CVDs are the leading cause of death globally, with almost 18 million deaths each year. Early detection of these diseases is highly important, but they are challenging to diagnose in view of several factors such as high blood pressure, high cholesterol levels, and abnormalities in pulse rate. This work presents an IoT-based healthcare monitoring framework that can predict CVD, integrated with machine learning algorithms to facilitate early detection.
In our approach, data preprocessing normalizes the dataset and improves classification performance. We employ the AGTO algorithm for feature selection and an MLRC model for categorizing medical data into either a healthy or CVD-affected class. For the sake of validation of our performance claims, we compare the AGTO-MLRC framework with several baselines including some traditional machine learning algorithms like Logistic Regression, Support Vector Machines, Random Forests, more advanced ones like Gradient Boosting Machines, and Neural Networks.
Precision, accuracy, recall, and F1-score are among the statistical measures used to establish the significance of our results and give a full overview of the models' performance. The proposed AGTO-MLRC framework outperforms all the baseline models, proving to be highly effective in detecting CVD at an early stage. Our results emphasize that IoT and machine learning technologies can be effectively utilized in the healthcare sector to enhance disease prediction and management.
3. Proposed Methodology
This section provides a complete explanation of the proposed healthcare monitoring framework for detecting CVD from patient records. The main purpose of this work is to develop an efficient and intelligent disease detection system for CVD diagnosis using optimization and machine learning classification algorithms. In existing health monitoring frameworks, various machine learning algorithms are implemented for the accurate detection and classification of CVD; however, most techniques suffer from computational burden, low prediction accuracy, and lack of reliability. Therefore, the proposed work uses an Artificial Gorilla Troop Optimization (AGTO) based Multi-Linear Regression Classification (MLRC) algorithm for CVD detection. Based on the obtained sensor data, classification is carried out using AGTO-MLRC to assess cardiovascular disease, and the system goes through training and testing to perform this classification. Data from the UCI machine learning repository have been used in this work for training and assessment. Several methodologies have been developed for remote healthcare applications in previous studies; specifically, machine learning and deep learning methodologies are implemented for identifying the type of disease along with its appropriate class. Compared to these conventional approaches, the proposed system applies a combination of methodologies across the stages of data imputation, feature optimization, and classification. During data preprocessing, the imputation methodology named ARID is used to replace missing attributes and create a balanced dataset. The cleaned and preprocessed data is then used for feature optimization, where the most suitable and required features are chosen using AGTO.

The adoption of IoT technology for healthcare analytics, combined with machine learning algorithms, opens a new outlook on the management of cardiovascular diseases. While conventional healthcare systems rely on sporadic data from a patient visiting a medical center or hospital only once in a while, IoT devices continuously measure critical health parameters such as blood pressure, heart rate, and cholesterol. These data streams run constantly, allowing real-time analysis, anomaly detection at the earliest instance, and early risk detection in the management of CVDs. Real-time IoT data collection also means constant updates and retraining of machine learning models, improving predictive accuracy over time. Besides, IoT devices make remote monitoring possible, eliminating many hospital visits, especially for patients in remote areas or underserved populations. This continuous monitoring and data collection can foster earlier interventions, which could curtail CVD mortality through timely alerts and actionable insights provided to patients and healthcare providers. The Artificial Gorilla Troop Optimization (AGTO) algorithm represents a new way to perform feature selection for cardiovascular disease risk prediction. Feature selection, the process of choosing the relevant variables, is an important and requisite step of the data preprocessing phase since it influences the prediction outcomes tremendously.
The AGTO algorithm borrows from the social behaviour and hierarchical structure of gorilla troops, which it exploits to navigate the search space efficiently and identify optimal subsets of features. Unlike conventional feature selection approaches, AGTO is characterized by its adaptive nature and its ability to avoid falling into local optima, a common pitfall of many optimization algorithms. By balancing exploration and exploitation, AGTO ensures that the feature space is thoroughly searched to select useful features that strengthen the predictive power of the machine learning model. In the proposed framework, the implementation of AGTO increases classification accuracy with reduced computational overhead, making it fit for real-time applications.
This type of feature optimization helps to increase the speed of classification with improved disease detection accuracy and performance. In addition, the MLRC is implemented to accurately predict the type of CVD with improved training and validation results. The combination of the ARID, AGTO, and MLRC methodologies makes a unique and intelligent remote healthcare system for disease diagnosis.
As shown in Fig. 1, the proposed disease detection framework comprises the major elements of data preprocessing, AGTO based feature selection, and MLRC based disease prediction. Using IoT technology, the patients' medical information is obtained through body sensors and transmitted to the cloud system through a gateway. The obtained medical history is then stored in the cloud database for disease prediction and classification. In this model, the dataset information (i.e. patient medical data) is used for training, where data preprocessing is carried out first to normalize the attribute values. Consequently, the set of relevant features is obtained from the preprocessed dataset with the use of the AGTO algorithm. Furthermore, the MLRC algorithm is used to predict CVD according to the selected features obtained from the medical record. Finally, the prediction output is shared with the cloud, where the medical professional diagnoses the disease for earlier treatment of the patients. The key benefits of this framework are its simple deployment, high prediction accuracy, reduced storage complexity, and capability of handling high-dimensional data.
Fig. 1. Proposed healthcare monitoring and disease detection framework
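To make the data flow of Fig. 1 concrete, the following minimal sketch shows the order of the three processing stages; the callables arid_impute, agto_select_features, and mlrc_model are hypothetical placeholders for the components described in Sections 3.1 to 3.3, not the authors' implementation.

```python
# Minimal pipeline sketch of the Fig. 1 framework. The three stage functions are
# placeholders standing in for the ARID, AGTO, and MLRC components described below.
def diagnose(raw_records, arid_impute, agto_select_features, mlrc_model):
    X = arid_impute(raw_records)                 # 1. preprocessing / missing-value imputation (Section 3.1)
    selected = agto_select_features(X)           # 2. AGTO-based feature selection (Section 3.2)
    return mlrc_model.predict(X[:, selected])    # 3. MLRC-based CVD prediction (Section 3.3)
```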
3.1 Data Preprocessing
Typically, the prediction rate of classification approaches depends heavily on the preprocessing procedures, since preprocessing helps to obtain the highest classification performance. To ensure that the machine learning model is effective, the dataset must be pre-processed. In this study, preprocessing is mainly performed to identify the missing attributes and to apply a standard scaler to the dataset, enhancing the models' efficacy and helping to achieve a respectable and trustworthy accuracy in forecasting the disease. After determining the age, blood cholesterol, and heart rate of the patient, the data point of a particular attribute is substituted; a patient's attribute value is substituted in the same position if the majority of their attribute values match. By removing duplicate (irrelevant) attributes, redundancy removal shrinks the size of the data. The main contribution of the proposed work is to develop a simple and effective health monitoring and disease diagnosis framework for remote applications. For this purpose, advanced algorithms are utilized in this study for data preprocessing and imputation, feature optimization, and classification. Data preprocessing is the basic and essential step of all machine learning applications, but in the proposed work a data imputation model is applied alongside preprocessing to identify missing fields of information. Important computations with significant missing data can produce inaccurate estimates and unreliable predictions, and missing data can hinder learning by producing false assumptions. It is crucial to understand the mechanics of missing data in order to comprehend its effect on a particular analysis or on a method for analyzing missing data. The deletion strategy has a number of downsides, including a loss of accuracy and outcome bias once valuable data is removed. The missing data imputation strategy fills in the missing information before applying the normal full-information analysis to the completed data. Some imputation strategies, such as those for mixtures of binary, categorical, and continuous variables, are not able to deal with incomplete data containing multi-type elements and do not have the capacity to control outliers. Meanwhile, several deep learning based approaches have been developed to deal with this issue; however, conventional deep learning models typically suffer from a variety of drawbacks, such as inefficient training, complicated networks, local minima, troublesome control variable tuning, and loss. Therefore, the proposed work applies a novel deep learning based imputation methodology, named Auto-encoder for Recurrent Imputation and Denoising (ARID), for data preprocessing and normalization. Denoising auto-encoders, the subclass of unsupervised artificial neural networks used by ARID, are intended to lower dimensionality by corrupting a portion of the input and attempting to rebuild it. By interpreting missing values as an additional component of corrupted data and deriving imputations from an algorithm tuned to reduce the reconstruction error on the observed portion, the denoising auto-encoders are employed here for multiple imputation.

ARID consistently produces complete datasets faster than existing algorithms, with the time difference increasing linearly with the number of columns and moderately with the number of rows. Practitioners can save a lot of time due to ARID's efficiency, even with moderately large datasets. As shown in Fig. 2, there are a total of three phases in the ARID, and a minimal implementation sketch follows the list below:
Fig. 2. ARID based imputation
• Substituting, for each missing component of the dataset, individually estimated values that retain the correlations indicated by the observed elements.
• Evaluating each of the finished datasets independently and determining key parameters.
• Combining the separate estimates according to a straightforward set of rules that uses the variation between these completed datasets to reflect the uncertainty regarding the ideal imputation model.
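The sketch below shows a minimal denoising autoencoder imputer in the spirit of ARID. It assumes PyTorch, a single hidden layer, and a mean-fill initialization, none of which are prescribed by the paper, and the multiple-imputation and pooling phases listed above are omitted for brevity.

```python
# Hedged ARID-style sketch: missing entries are treated as corrupted input, the autoencoder
# is trained to reconstruct only the observed entries, and its outputs fill the missing cells.
import numpy as np
import torch
import torch.nn as nn

def arid_impute(X, hidden=16, epochs=200, noise=0.1, lr=1e-3):
    """X: 2-D float array with np.nan marking missing values."""
    mask = ~np.isnan(X)                           # True where a value was observed
    col_mean = np.nanmean(X, axis=0)
    X_filled = np.where(mask, X, col_mean)        # crude initial fill for the missing cells

    x = torch.tensor(X_filled, dtype=torch.float32)
    m = torch.tensor(mask, dtype=torch.float32)

    # single-hidden-layer denoising autoencoder (architecture is an assumption)
    model = nn.Sequential(
        nn.Linear(X.shape[1], hidden), nn.ReLU(),
        nn.Linear(hidden, X.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(epochs):
        noisy = x + noise * torch.randn_like(x)   # corrupt the input (denoising step)
        recon = model(noisy)
        # reconstruction error is measured only on the originally observed entries
        loss = ((recon - x) ** 2 * m).sum() / m.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        recon = model(x).numpy()
    return np.where(mask, X, recon)               # keep observed values, impute the rest
```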
3.2 Artificial Gorilla Troop Optimization (AGTO)
Optimization refers to finding the best feasible or desirable solution(s) to a problem, a task frequently faced across different fields. Among other approaches, meta-heuristic models are now widely used in healthcare applications because they are simple to execute, rest on straightforward concepts, perform well, and do not require any derivative functions. In conventional disease detection frameworks, several meta-heuristic models such as ant colony optimization, flower pollination, swarm intelligence, and genetic algorithms have been used for feature selection. However, the challenges associated with these methods are local optima, the large number of iterations needed to reach an optimal solution, and computational complexity. Compared to other meta-heuristic models, the AGTO technique is less complex and offers the best optimal solution for effectively solving multi-objective optimization problems. Its search complexity is low, it reaches the best solution within a minimal number of iterations, and it efficiently balances the exploration and exploitation operations without getting trapped in local optima. Moreover, this algorithm has not been used frequently for CVD prediction. Hence, the proposed work applies this algorithm along with the ARID and MLRC techniques to predict the type of CVD with increased accuracy. Therefore, the proposed work uses a highly sophisticated optimization algorithm, named Artificial Gorilla Troop Optimization (AGTO), to choose the relevant features from the normalized dataset.
The AGTO algorithm, coupled with the Multi-Linear Regression Classification model, can eliminate most practical drawbacks of healthcare monitoring systems intended to predict cardiovascular diseases, chiefly by improving the accuracy and reliability of the predictive model. The features selected through AGTO in this study are the relevant features used for training the MLRC model. This minimizes overfitting, so that classification performance can be maximized without compromising the model's generalization to new, unseen data. This is especially important in medical applications, where a false positive or false negative is too costly in terms of patient safety and treatment efficiency.
From a more practical perspective, the AGTO-MLRC framework allows real-time monitoring of a patient's status through its IoT devices, and combined alerts and recommendations enable timely intervention in relevant CVD cases. Proactive identification and suggestion of best practice before problems become critical conditions not only enhances patients' outcomes but also reduces the cost of healthcare through constant monitoring during care delivery. Moreover, the system's ability to give instant feedback makes it well suited to underprivileged or remote populations with inadequate access to specialist facilities. The interpretability and transparency of the AGTO-MLRC model also support its acceptance by doctors: selecting a smaller set of features that relate well to the MLRC output keeps the model interpretable and helps clinicians understand and trust the prediction. Clinical decision-making should make it possible to justify and explain a diagnosis and the recommended treatments, and a system whose interpretability can be demonstrated is comparatively easy to align with medical regulatory standards, which mostly demand clear documentation and justification of the diagnoses made.
The practical value of operationalizing the AGTO-MLRC framework stems from two basic properties: scalability and adaptability. Through feature selection and the appropriate setting of classification parameters, the framework can be scaled both to an increasingly large number of patients and to other types of diseases, making AGTO-MLRC a flexible and versatile model in the wider context of healthcare monitoring and management. More importantly, the feature selection process inherent in AGTO is computationally efficient, which supports operation in resource-constrained environments and in as many healthcare scenarios as possible. The feature selection capabilities of AGTO thus combine with the practical benefits of the MLRC model to form an optimal and highly effective framework for the early detection and management of cardiovascular diseases. The application therefore finds broad use and relevance not only within the wider healthcare system, through efficient service delivery and improved patient outcomes, but also within clinical settings, resting on a highly precise, realizable, and enhanced prediction system.
The social intelligence of gorilla troops in nature formed the basis for the meta-heuristic algorithm known as GTO [36]. Gorillas cannot live alone because of their predisposition to dwell in groups; as a result, they seek food as a group, and a silverback gorilla serves as the group's leader and makes all decisions. In this method, the silverback, the strongest member of the gorilla family, is regarded as the best solution, whereas the weakest gorilla is regarded as the worst candidate solution. Based on the behavior of gorillas, five distinct operators are used in the exploration and exploitation stages of the optimization. During exploration, each gorilla is considered a potential candidate for the silverback, which is the best solution after each iteration. Two alternative mechanisms can be used during the exploitation phase: the first process is called Follow the Silverback, while the second is labelled Competition for Adult Females. The GTO algorithm's steps can be summarized as follows:
1. Initialize all of the algorithm's parameters, and then assess the initial optimal value.
2. Set the parameters for the current iteration and the gorilla candidate to 1.
3. Perform the exploration phase, repeating it a number of times equal to the iteration parameter, to evaluate the fitness function and select the best position.
As shown in Algorithm 1, the input parameters such as maximum number of iterations 𝑖te𝑟𝑀 and search agent populations (i.e. gorilla troops) ɠ𝑘 are initialized at first. Then, the fitness values of all individual gorillas are estimated, where the parameter ∝ is updated based on the following equation:
\(\begin{align}\propto=\left(\cos \left(2 * r^{d}\right)+1\right) *\left(1-\frac{t}{\text { iter }^{M}}\right)\end{align}\) (1)
\(\begin{align}\eta=\propto * r^{d}\end{align}\) (2)
Where, 𝑟𝑑 is a random number between -1 and 1. During exploration, the positions of the current gorillas are updated by using the following model:
\(\begin{align}g_{k}(t+1)=\left\{\begin{array}{ll}\left(u_{b}-l_{b}\right) * r^{d}+l_{b}, & r_{1}^{d}<\delta \\ \left(r^{d}-\propto\right) * g_{k}^{A}(t)+\eta * Z * g_{k}(t), & r_{1}^{d} \geq 0.5 \\ g_{k}(t)-\propto *\left(\propto *\left(g_{k}(t)-g_{k}^{B}(t)\right)\right)+r^{d} *\left(g_{k}(t)-g_{k}^{B}(t)\right), & r_{1}^{d}<0.5\end{array}\right.\end{align}\) (3)
Where, ɠ𝑘(𝑡) represents the current position vector of the individual gorilla, and ɠ𝑘(𝑡 + 1) refers to the candidate position of the search agent in the next iteration. The parameters 𝑟𝑑 and 𝑟𝑑1 indicate random values between 0 and 1, ɠ𝐴𝑘(𝑡) and ɠ𝐵𝑘(𝑡) are two randomly selected gorilla positions in the current population, 𝛿 is a constant, and Z denotes a row vector in the problem dimension whose elements are randomly generated in [−∝, ∝]. After that, the fitness value is estimated for all gorillas, and the best solution ɠ𝑠ilverback𝑘 is saved as the silverback. During exploitation, the position of the current gorilla is updated by using the following equation:
\(\begin{align}g_{k}(t+1)=\propto * \mathrm{m} *\left[g_{k}(t)-g_{k}^{\text {silverback }}\right]+g_{k}(t)\end{align}\) (4)
\(\begin{align}\mathrm{m}=\left(\left|\sum_{k=1}^{N} \frac{\mathrm{~g}_{k}(t)}{N}\right|^{2^{\alpha}}\right)^{\frac{1}{2^{\alpha}}}\end{align}\) (5)
Where, N refers to the population size, and ɠ𝑘(𝑡) denotes the position vector of each gorilla in the current iteration. The position of the current gorilla is then updated again using the following equation:
\(\begin{align}g_{k}(t+1)=g_{k}^{\text {silverback }}-\left(g_{k}^{\text {silverback }} * F-g_{k}(t) * F\right) * \phi\end{align}\) (6)
\(\begin{align}F=2 * r^{d}-1\end{align}\) (7)
\(\begin{align}\phi=\vartheta * \overline{\mathrm{E}}\end{align}\) (8)
\(\begin{align}\hat{\overline{\mathrm{E}}}=\left\{\begin{array}{ll}N^{1} & r^{d} \geq 0.5 \\ N^{2} & r^{d}<0.5\end{array}\right.\end{align}\) (9)
Based on the above model, ɠ𝑘(𝑡) indicates the current position, Ƒ stands for the impact force, and 𝑟𝑑 is a random value in the range of 0 to 1. Moreover, the coefficient ɸ is used to mimic the intensity of violence in the competition, 𝜗 denotes a constant, and the value of Ē is assigned by Eq. (9), where 𝑟𝑑 is again a random number in [0, 1]. If 𝑟𝑑 ≥ 0.5, Ē is defined as a 1-by-D array of normally distributed random numbers, where D is the spatial dimension; otherwise, if 𝑟𝑑 < 0.5, Ē is a single random number drawn from the normal distribution. Using this optimization algorithm, the best optimal solution is computed to select the optimal features for training the classifier.
Algorithm 1 – Artificial Gorilla Troop Optimization (AGTO)
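A minimal Python sketch of the AGTO search loop described by Eqs. (1)-(9) is given below. The fitness function, population size, bounds, and the constants delta and theta are illustrative placeholders rather than the authors' settings, and the exploitation switch and the computation of m in Eq. (5) follow one plausible reading of the formulas. For feature selection, each position can be thresholded at 0.5 to decide which attributes are kept.

```python
# Hedged sketch of the AGTO search loop following Eqs. (1)-(9); constants and the
# fitness function are placeholders, not the authors' settings.
import numpy as np

def agto(fitness, dim, n=20, iters=50, lb=0.0, ub=1.0, delta=0.03, theta=0.8):
    G = np.random.uniform(lb, ub, (n, dim))          # gorilla positions (candidate solutions)
    fit = np.array([fitness(g) for g in G])
    silverback = G[fit.argmin()].copy()              # best solution found so far

    for t in range(iters):
        alpha = (np.cos(2 * np.random.rand()) + 1) * (1 - t / iters)     # Eq. (1)

        for k in range(n):                           # exploration, Eq. (3)
            r, r1 = np.random.rand(), np.random.rand()
            A, B = G[np.random.randint(n)], G[np.random.randint(n)]
            if r1 < delta:                           # migrate to an unknown position
                cand = np.random.uniform(lb, ub, dim)
            elif r1 >= 0.5:                          # move toward other gorillas
                eta = alpha * np.random.uniform(-1, 1)                   # Eq. (2)
                Z = np.random.uniform(-alpha, alpha, dim)
                cand = (r - alpha) * A + eta * Z * G[k]
            else:                                    # migrate toward a known position
                cand = G[k] - alpha * (alpha * (G[k] - B)) + r * (G[k] - B)
            cand = np.clip(cand, lb, ub)
            fc = fitness(cand)
            if fc < fit[k]:                          # keep the candidate only if it improves
                G[k], fit[k] = cand, fc
        silverback = G[fit.argmin()].copy()

        for k in range(n):                           # exploitation
            if alpha >= theta:                       # follow the silverback, Eqs. (4)-(5)
                m = np.abs(G.mean(axis=0)) ** (2 ** alpha)
                m = m ** (1 / (2 ** alpha))
                cand = alpha * m * (G[k] - silverback) + G[k]
            else:                                    # competition for adult females, Eqs. (6)-(9)
                F = 2 * np.random.rand() - 1
                E = np.random.randn(dim) if np.random.rand() >= 0.5 else np.random.randn()
                cand = silverback - (silverback * F - G[k] * F) * theta * E
            cand = np.clip(cand, lb, ub)
            fc = fitness(cand)
            if fc < fit[k]:
                G[k], fit[k] = cand, fc
        silverback = G[fit.argmin()].copy()

    return silverback                                # e.g. silverback > 0.5 marks the selected features
```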
3.3 Multi-Linear Regression Classification (MLRC)
In this stage, the selected features are used for classifier training, where healthy and disease-affected patient data are accurately predicted with the MLRC [37] mechanism. It is a kind of machine learning algorithm specifically used for solving complex prediction problems. Many existing studies have used various machine learning algorithms for predicting CVD from patients' medical history; however, these algorithms involve complex computational steps, require more time for feature training, and suffer from overfitting and a lack of reliability. Thus, the proposed work utilizes the novel MLRC algorithm for CVD detection and classification. Linear regression is one approach that is widely used to address estimation problems; it is based on the notion that samples belonging to the same class can be represented by a linear equation and correspond to an identical linear subspace. The MLRC is a type of regression-based classification mechanism. Typically, regression is applied for two purposes. First, regression models are frequently used for prediction and estimation, areas that overlap heavily with machine learning and its applications. Second, in some circumstances, causal relationships between the independent and dependent variables can be ascertained using regression analysis. It is important to note that regressions alone only display relationships between a dependent variable and a fixed collection of other variables. MLRC uses a number of explanatory variables to predict the outcome of a response variable; its goal is to model the linear relationship between the independent variables x and the dependent variable y under examination. Fig. 3 shows the overall flow diagram of the proposed model.
Fig. 3. Overall flow
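The sketch below illustrates a generic regression-based classifier in this spirit: each class is modeled by a least-squares fit over its own training samples, and a test sample is assigned to the class whose linear model reconstructs it with the smallest residual. It is a minimal, assumption-laden illustration of the idea, not the authors' exact MLRC formulation.

```python
# Hedged sketch of linear-regression-based classification: a test sample is assigned to the
# class whose training subspace reconstructs it with the smallest residual.
import numpy as np

class LinearRegressionClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # store the training samples of each class as columns of a class-specific dictionary
        self.dicts_ = {c: X[y == c].T for c in self.classes_}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            residuals = []
            for c in self.classes_:
                D = self.dicts_[c]                              # shape: features x class samples
                beta, *_ = np.linalg.lstsq(D, x, rcond=None)    # least-squares fit of x onto the class
                residuals.append(np.linalg.norm(x - D @ beta))  # reconstruction residual
            preds.append(self.classes_[int(np.argmin(residuals))])
        return np.array(preds)
```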
4. Results and Discussion
This section validates the results of the proposed CVD prediction framework using several performance measures, and the obtained results are validated using recent public benchmark datasets. For simulation, an Intel Core i5 processor with a speed of 3.10 GHz is used, and all research experiments are carried out in Python. During this process, the coding packages Scikit-Learn, Pandas, NumPy, and Matplotlib have been used. Four different datasets, namely Cleveland, Mendeley, Statlog, and Comprehensive [38, 39], are used in this analysis.
• The proposed CVD prediction framework considers four of the most widely available datasets: Cleveland, Mendeley, Statlog, and Comprehensive. The rationale behind selecting these datasets is that they are the most widely published and well established benchmarks for CVD prediction models; an example of loading such records follows this list.
• Cleveland Dataset: It comprises 303 instances. The dataset has 76 raw attributes, of which the features commonly employed for CVD include age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and thalassemia. These features are key clinical indicators of heart health and are essential for heart disease prediction.
• Mendeley Dataset: The Mendeley dataset is large, with thousands of instances collected from different sources. It offers a rich set of features, ranging from demographic data to lifestyle, medical history, and clinical measurements. Important attributes of this dataset are age, sex, blood pressure readings, cholesterol levels, smoking status, diabetes status, and family history of cardiovascular conditions. This wide range of features allows extensive analysis and supports building a powerful prediction model.
• Statlog (Heart) Dataset: This dataset has 270 instances and 13 features, which are similar to those of the Cleveland dataset and are useful for CVD prediction. The key features are: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise, the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and thalassemia. These features were chosen to include only the variables most strongly correlated with the disease outcome.
• Comprehensive Dataset: As the name suggests, the Comprehensive dataset is a large-scale dataset that integrates data from multiple sources to provide a vast and diverse collection of instances. It captures a good variety of demographic, clinical, and lifestyle information across a high number of characteristics. These features include age, sex, body mass index (BMI), blood pressure readings, cholesterol levels, smoking status, alcohol consumption, physical activity, diet, previous medical conditions, genetic predispositions, and others, providing a broad data collection for a comprehensive approach to CVD prediction.
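As an illustration of how such records enter the experimental pipeline, the snippet below loads a local copy of the Cleveland file with pandas; the file name, the 14-column layout, and the '?' missing-value marker follow the public UCI release and are assumptions about the local copy, not the authors' exact data source.

```python
# Hedged example: loading a local copy of the Cleveland records with pandas.
import pandas as pd

# assumed 14-column layout of the public UCI "processed.cleveland.data" file
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv("processed.cleveland.data", names=columns, na_values="?")
df["target"] = (df["target"] > 0).astype(int)       # collapse the 0-4 label into healthy vs CVD
X, y = df.drop(columns="target").values, df["target"].values
```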
The experimental setup for evaluating the proposed CVD prediction framework is designed to make the evaluation rigorous and validation-oriented. All experiments have been conducted on an Intel Core i5 with a clock frequency of 3.10 GHz, which is sufficient to manage the data flow and heavy computations. The experiments are implemented in Python to make use of its large ecosystem of libraries for machine learning and data analysis. Concretely, several of the most popular Python libraries are used in the experimental setup of the proposed CVD prediction framework to ensure rigorous data processing, model implementation, and visualization. Scikit-learn is the principal library for the supervised learning components; the Multi-Linear Regression Classification (MLRC) model and the feature selection methodology based on the Artificial Gorilla Troop Optimization (AGTO) algorithm are implemented on top of it. It also provides general tooling for model appraisal, cross-validation, and performance metrics, allowing the effectiveness of the introduced framework to be assessed. Data manipulation and preprocessing are done with the help of Pandas, which provides data structures, particularly DataFrames, for efficient cleaning, normalization, and transformation of the data. NumPy supports the numerical computation, particularly the matrix and array manipulation required by the AGTO algorithm and the MLRC model. Another important library is Matplotlib, which provides graphical representations of the experimental results, such as feature importance, model performance, and comparative analysis plots. This extensive use of Python libraries supports the effectiveness and efficiency of the proposed framework in predicting CVDs.
Several metrics are considered when evaluating the model's correctness. Accuracy is one such metric used to evaluate the performance of classification models and is calculated as follows:
\(\begin{align}Accuracy=\frac{T p+T n}{T p+T n+F p+F n}\end{align}\) (10)
A high precision implies a high rate of correct positive predictions: it is the fraction of genuine positives among all the samples the model declares positive. The following model is used to estimate the precision:
\(\begin{align}Precision=\frac{T p}{T p+F p}\end{align}\) (11)
The recall, also known as the true positive rate, compares the number of correctly identified positives to the overall number of actual positives in the data. The computation of this parameter is shown below:
\(\begin{align}Recall=\frac{T p}{T p+F n}\end{align}\) (12)
The F1 score may also be used to estimate the model's performance. It is the harmonic mean of the model's recall and precision, determined as below:
\(\begin{align}F 1-score=\frac{2 \times T p}{2 \times T p+F p+F n}\end{align}\) (13)
Where, 𝑇p denotes true positives, 𝑇n true negatives, 𝐹p false positives, and 𝐹n false negatives. The capacity of a classifier to distinguish between classes with and without cardiovascular disease (CVD) is measured in this work by the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC). The Receiver Operating Characteristic is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. This study determines the precision of the existing and proposed machine learning techniques used for categorizing whether or not patients are affected by CVD. The proposed work uses the new MLRC technique, a regression-based classifier, to accurately identify and classify the type of CVD. To determine the overall efficacy and performance of the suggested mechanism, several performance measures including accuracy, precision, recall, and error rate have been computed in this study. Each classification model has a unique disease prediction accuracy and efficiency, and the results show that the proposed AGTO-MLRC provides improved prediction results over the other models.
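These measures can be computed directly with scikit-learn, as in the hedged sketch below; y_true, y_pred, and y_score are placeholders for the test labels, predicted labels, and predicted CVD probabilities of any of the compared classifiers.

```python
# Hedged sketch: computing the reported evaluation measures with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def report(y_true, y_pred, y_score):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),      # Eq. (10)
        "precision": precision_score(y_true, y_pred),     # Eq. (11)
        "recall":    recall_score(y_true, y_pred),        # Eq. (12)
        "f1":        f1_score(y_true, y_pred),            # Eq. (13)
        "auc_roc":   roc_auc_score(y_true, y_score),      # area under the ROC curve
        "confusion": confusion_matrix(y_true, y_pred),    # TP/TN/FP/FN breakdown
    }
```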
Fig. 4 (a) and (b) display the proposed machine learning algorithm's confusion matrices for the Cleveland and Statlog datasets, respectively. In this analysis, TPs/TNs and FPs/FNs are used to illustrate the percentage of predicted values and the percentage of actual values, respectively. Based on these outcomes, it is observed that the proposed AGTO-MLRC technique provides an improved prediction rate by accurately categorizing the medical data as CVD-affected or healthy.
Fig. 4 (a). Confusion matrix for Cleveland dataset
Fig. 4 (b). Confusion matrix for Statlog dataset
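As a concrete illustration of how such confusion matrices can be rendered, the sketch below uses scikit-learn's ConfusionMatrixDisplay. The fitted classifiers clf_cleveland and clf_statlog and their test splits are hypothetical placeholders standing in for models trained on the two datasets, not objects defined in the paper.

```python
# Rendering confusion matrices for two datasets (cf. Fig. 4a/4b).
# clf_cleveland / clf_statlog and their test splits are hypothetical
# placeholders for classifiers fitted on the respective datasets.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model, X_t, y_t, title in [
        (axes[0], clf_cleveland, X_test_cl, y_test_cl, "Cleveland"),
        (axes[1], clf_statlog, X_test_st, y_test_st, "Statlog")]:
    ConfusionMatrixDisplay.from_estimator(
        model, X_t, y_t, display_labels=["Healthy", "CVD"],
        ax=ax, colorbar=False)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```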
The suggested model outperformed the current benchmark techniques [40, 41] in terms of accuracy across all datasets. The outcomes also show that the developed model is accurate and applicable to datasets of any size. In any disease prediction system, performance measurement is one of the most vital tasks, and for a classification problem the AUC-ROC curve is one of the most important indicators of whether a model is effective. The computed ROC curves for the proposed AGTO-MLRC model on the various datasets are displayed in Figs. 5 to 7. The ROC is a probability curve, and the AUC measures the model's ability to distinguish between classes: the larger the AUC, the better the model separates the classes. The ROC curve plots the TPR on the y-axis against the FPR on the x-axis. The illustrations give a clearer picture of the performance of the proposed model relative to the standard benchmark algorithms. In addition, the ROC analysis is carried out for the proposed mechanism with and without CKD, as shown in Fig. 8. The overall results indicate that the proposed AGTO-MLRC provides improved performance outcomes compared with the other existing algorithms.
Fig. 5. ROC analysis using Cleveland dataset
Fig. 6. ROC analysis using Comprehensive dataset
Fig. 7. ROC analysis using Statlog dataset
Fig. 8. ROC analysis with and without CKD
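The ROC curves and AUC values of Figs. 5-8 can be generated along the following lines. The objects clf, X_test, and y_test are assumed to come from the setup sketched earlier, and the exact plotting style of the paper is not reproduced.

```python
# ROC/AUC sketch (cf. Figs. 5-8): FPR on the x-axis, TPR on the y-axis.
# clf is any fitted probabilistic classifier from the earlier sketches.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = clf.predict_proba(X_test)[:, 1]       # probability of the CVD class
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"AGTO-MLRC (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```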
The Mean Squared Error (MSE) gives the average squared difference between the actual and predicted values; it is calculated by averaging the square of the difference between the two, as given in Eq. (14), where n denotes the total number of cardiovascular patient records in the dataset. The Root Mean Squared Error (RMSE) is the square root of this average squared error and is determined using Eq. (15).
\(\begin{align}MSE=\frac{1}{n} \sum_{i=1}^{n}\left(AV_{i}-PV_{i}\right)^{2}\end{align}\) (14)
\(\begin{align}RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(AV_{i}-PV_{i}\right)^{2}}\end{align}\) (15)
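Eqs. (14) and (15) can be evaluated directly on the predicted and actual labels; the short sketch below reuses y_test and y_pred from the earlier metric sketch.

```python
# MSE and RMSE of Eqs. (14)-(15), computed from the actual labels (y_test)
# and the predicted labels (y_pred) of the earlier sketches.
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```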
Where \(AV_{i}\) indicates the actual value and \(PV_{i}\) the predicted value of record i. A set of measures based on classification algorithms is frequently used to assess the efficacy of machine learning prediction algorithms. As shown in Fig. 9, the MSE and RMSE are used to measure the prediction error rates, while the true/false positive/negative rates of the predictions are examined using the confusion matrix and the AUC-ROC. Fig. 10 compares the accuracy, precision, recall, and F1-score of the standard and proposed machine learning models. Similarly, Figs. 11 to 13 present the comparative performance analysis of the existing and proposed ML algorithms using the Cleveland, Comprehensive, and Mendeley datasets. In this analysis, several feature subsets have been evaluated on the cardiovascular disease dataset, and the most relevant attributes are selected to accurately classify cardiac and non-cardiac patients. As a consequence, the proposed AGTO-MLRC methodology, with its feature selection stage, outperforms the other techniques in classification rate.
Fig. 9. MSE and RMSE analysis
Fig. 10. Comparative analysis with the standard machine learning classification models
Fig. 11. Performance analysis using Cleveland dataset
Fig. 12. Performance analysis using Comprehensive dataset
Fig. 13. Performance analysis using Mendeley dataset
In Figs. 10 to 13, the performance of the proposed CVD detection system is examined and compared with several existing classification approaches using different datasets. For this study, techniques such as Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), Decision Tree (DT), Naïve Bayes (NB), Multi-Layer Perceptron (MLP), and Random Forest Ensemble (RFE) with Gradient Boost (GB) are taken into account for comparison.
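All of the baseline classifiers listed above are available in scikit-learn. The sketch below instantiates them with illustrative default hyper-parameters, which are not the exact settings of the study, and reuses the train/test split from the setup sketch.

```python
# Baseline models used for comparison (cf. Figs. 10-13); the
# hyper-parameters are illustrative defaults, not the study's settings.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

baselines = {
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DT":  DecisionTreeClassifier(random_state=42),
    "NB":  GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
    "RF":  RandomForestClassifier(random_state=42),
    "GB":  GradientBoostingClassifier(random_state=42),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:>4}: accuracy = {acc:.3f}")
```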
Fig. 14 presents the overall comparative analysis of the existing and proposed machine learning classification algorithms used for CVD prediction and classification. Here, parameters such as accuracy, AUC, precision, recall, and F1-score have been considered.
Fig. 14. Overall comparative analysis using Cleveland heart disease dataset
Consequently, the detection accuracy is also validated and compared using three different datasets, as shown in Fig. 15. Moreover, the accuracy of the optimization-integrated classifiers is validated according to the level of disease severity, as shown in Fig. 16. The results show that the suggested approach can help medical professionals diagnose patients more accurately and provide better treatments.
Fig. 15. Accuracy analysis using different datasets
Fig. 16. Accuracy analysis based on the severity level
The False Positive Rate (FPR) and False Negative Rate (FNR) of the traditional machine learning and the suggested classification algorithms used for CVD detection are compared in Tables 1 and 2. The results show that both the FPR and FNR values of the suggested model are significantly decreased. In this study, the overall number of incorrect predictions is reduced by properly imputing the data and applying the feature optimization technique.
Table 1. False positive rate (FPR) analysis
Table 2. False negative rate (FNR) analysis
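The FPR and FNR reported in Tables 1 and 2 follow directly from the confusion-matrix counts; a minimal sketch, reusing y_test and y_pred from the earlier sketches, is shown below.

```python
# FPR and FNR (cf. Tables 1-2) derived from the confusion-matrix counts.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
fpr = fp / (fp + tn)   # healthy patients wrongly flagged as CVD
fnr = fn / (fn + tp)   # CVD patients missed by the classifier
print(f"FPR = {fpr:.4f}, FNR = {fnr:.4f}")
```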
The computational complexity of the traditional and the proposed disease detection approaches for the training and prediction operations is contrasted in Table 3. The computational complexity is verified and compared for both the proposed and the conventional approaches in order to ascertain the computational efficiency of the suggested remote healthcare system. The findings show that the suggested method has lower computational complexity than the other approaches for both training and prediction. Table 4 then compares the time required by the suggested approach with that of the traditional machine learning classifiers. Similarly, the suggested model is also compared in terms of time with widely used deep learning techniques such as CNN and Bi-LSTM, as shown in Fig. 17. The comparative results show that the suggested AGTO-MLRC model is faster than the current classification techniques, with efficient data preparation and dimensionality reduction being the two main factors lowering the overall time consumption.
Table 3. Comparison based on computational complexity
Table 4. Comparison based on time
Fig. 17. Comparison based on time with deep learning techniques
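The training and prediction times of Table 4 and Fig. 17 can be measured with simple wall-clock timing. The sketch below uses time.perf_counter and reuses the baselines dictionary and data splits from the earlier sketches, so the absolute numbers are illustrative only.

```python
# Wall-clock training and prediction time (cf. Table 4 and Fig. 17),
# measured with time.perf_counter over the baseline models defined earlier.
import time

def timed(model, X_tr, y_tr, X_te):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    train_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    model.predict(X_te)
    predict_time = time.perf_counter() - t0
    return train_time, predict_time

for name, model in baselines.items():
    tr, pr = timed(model, X_train, y_train, X_test)
    print(f"{name:>4}: train = {tr:.3f} s, predict = {pr:.4f} s")
```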
The training and testing loss values of the suggested AGTO-MLRC model are validated with respect to different numbers of epochs in Fig. 18. The loss value for both training and testing is determined and compared to ascertain the classification success rate. The results show that by correctly predicting the disease class, the proposed AGTO-MLRC model can effectively suppress the loss value.
Fig. 18. Training and testing loss
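The paper does not specify how the epoch-wise loss of Fig. 18 is computed. As one possible illustration, a regression-based classifier can be emulated with an SGD-trained linear model whose training and testing loss are recorded after each pass over the data; this is an assumption about the training loop, not the authors' exact procedure.

```python
# Epoch-wise training/testing loss sketch (cf. Fig. 18). MLRC is emulated
# here by an SGD-trained linear model via partial_fit; this is an assumed
# training loop, not the paper's exact procedure.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

model = SGDClassifier(loss="log_loss", random_state=42)
classes = np.unique(y_train)
train_loss, test_loss = [], []
for epoch in range(50):
    model.partial_fit(X_train, y_train, classes=classes)  # one pass = one "epoch"
    train_loss.append(log_loss(y_train, model.predict_proba(X_train)))
    test_loss.append(log_loss(y_test, model.predict_proba(X_test)))
print(f"final train loss = {train_loss[-1]:.4f}, test loss = {test_loss[-1]:.4f}")
```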
The ablation study findings for the proposed remote healthcare system are shown in Table 5, where the configurations are defined by the operating stages that are included. The study reports that the suggested model functions most effectively when all stages of operation are included for correct and accurate disease diagnosis and categorization.
Table 5. Ablation study
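The ablation of Table 5 can be organized as a loop over pipeline configurations. In the sketch below, SimpleImputer and SelectKBest act as simple stand-ins for the ARID and AGTO stages, and LogisticRegression stands in for MLRC, since the original components are custom; only the structure of the ablation is illustrated.

```python
# Ablation sketch (cf. Table 5): toggling the imputation/denoising and
# feature-selection stages. SimpleImputer and SelectKBest are stand-ins
# for the custom ARID and AGTO stages; LogisticRegression stands in for MLRC.
from itertools import product
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def run_pipeline(X_tr, y_tr, X_te, y_te, use_impute, use_select):
    if use_impute:                                  # stand-in for the ARID stage
        imp = SimpleImputer(strategy="mean").fit(X_tr)
        X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)
    if use_select:                                  # stand-in for the AGTO stage
        sel = SelectKBest(f_classif, k=min(8, X_tr.shape[1])).fit(X_tr, y_tr)
        X_tr, X_te = sel.transform(X_tr), sel.transform(X_te)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

for use_impute, use_select in product([False, True], repeat=2):
    acc = run_pipeline(X_train, y_train, X_test, y_test, use_impute, use_select)
    print(f"imputation={use_impute!s:5} selection={use_select!s:5} -> acc = {acc:.3f}")
```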
5. Findings
In this work, the proposed AGTO-MLRC framework has been shown to be precise in CVD prediction through an assessment of the robustness of the AGTO algorithm in feature selection and of the MLRC model's performance in the classification task. The AGTO algorithm, inspired by the social and foraging behaviors of gorillas, consistently identified the most relevant features from the datasets and thereby provided the strongest predictive power for the MLRC model. By reducing the data dimensionality and focusing only on the most informative attributes, the AGTO-MLRC framework improves classification accuracy while considerably lowering computational complexity, making the solution practical for online, real-time health monitoring systems. A comparative analysis revealed that the AGTO-MLRC framework outperforms traditional feature selection techniques and classification models across the standard metrics for classifier assessment, namely accuracy, precision, recall, and F1-score. This implies that the proposed model is robust and generalizes well across a variety of datasets, giving reliable predictions that can support healthcare providers in the early diagnosis and treatment of CVD. The use of real-world, publicly available datasets such as Cleveland, Mendeley, Statlog, and Comprehensive further lends credence to the study, because these datasets cover a wide range of demographic and clinical features, representing diverse patient populations and variable conditions.
An important implication is the potential to integrate the AGTO-MLRC framework with an IoT-based health monitoring system. Health data collected continuously from wearable sensors and other IoT devices can feed seamlessly into the AGTO-MLRC model to enable real-time prediction and monitoring of CVD risk. This could drive a significant shift toward preventive healthcare, as at-risk individuals can be alerted and given recommendations on how to avoid adverse cardiac events, benefiting public health outcomes in general. Furthermore, the interpretability of the MLRC model ensures that predictions are clear and transparent to healthcare professionals, supporting informed decisions about personalized treatment plans. However, the study has several limitations. The main concerns are data quality and completeness: medical datasets commonly contain missing values, noise, or inconsistencies. Data preprocessing techniques were applied to handle these issues to some extent, but further improvements in data collection procedures and standardization would strengthen model robustness. The AGTO algorithm has also been noted to be potentially computationally intensive and sometimes highly sensitive to both the choice of dataset and its parameters. Future research should therefore explore hybrid optimization strategies that combine AGTO with other evolutionary algorithms to attain even better feature selection efficiency and model performance.
Another limitation is the bias that the training data might introduce. Although this study involved several datasets, they may still not fully capture the heterogeneity observed across worldwide populations. Factors that vary widely across regions, such as genetic background, environmental influences, and socioeconomic conditions, may affect the generalizability of the model. Future work will require more diverse data to ensure applicability across different populations. It should also be noted that the current study used static datasets for training and validation, whereas health data in real-world settings change continuously; future work should therefore implement adaptive models that learn in real time from new data. Integrating AGTO-MLRC into an IoT system also raises the challenge of preserving data security and privacy. Health data obtained from IoT devices must be stored in a way that ensures its confidentiality and integrity, so strong encryption and anonymization technologies for sensitive health information should be a subject of future research.
In summary, the proposed AGTO-MLRC framework is a strong candidate for CVD prediction, combining innovative feature selection with robust classification to attain high predictive accuracy. This work paves the way for research and development toward predictive healthcare systems and points to the possibility of real-time, personalized health monitoring. In addition, the limitations identified and the directions suggested for future work promise to be fruitful in realizing the full potential of machine learning and the IoT for the transformation of cardiovascular disease management and prevention.
The findings of this study on CVD prediction using the proposed framework are robust and reinforce its likely impact and effectiveness in healthcare operations. Extensive experimentation and analysis yielded several key findings that support the conclusions drawn about the framework's performance and implications. First, the AGTO-MLRC model provides important gains in prediction accuracy. Using Artificial Gorilla Troop Optimization for feature selection and Multi-Linear Regression Classification for modeling, it delivered stable performance across diverse datasets: Cleveland, Mendeley, Statlog, and Comprehensive. These marked increases in accuracy are essential for the early detection and management of CVD, since accurate predictions enable proper diagnosis and timely intervention, leading to better patient outcomes.
Moreover, the feature selection approach based on the AGTO algorithm is highly effective in selecting a subset of predictors relevant to CVD risk assessment. The AGTO algorithm, derived by mimicking the social and foraging behaviors of gorillas, optimizes feature subsets iteratively, which improves model generalization by reducing the noise that irrelevant features may introduce. This ability yields higher prediction accuracy while reducing computational resource requirements, giving the framework strong potential in resource-constrained healthcare environments. Furthermore, integrating the AGTO-MLRC framework with IoT technologies can be considered a major step forward for real-time health monitoring and disease prediction. Through continuous, live health data streaming from IoT devices such as wearable sensors and remote monitoring systems, the framework enables proactive monitoring of key CVD indicators such as heart rate variability, blood pressure trends, and physical activity levels. This integration equips caregivers with timely insights so that early intervention strategies can be implemented to mitigate CVD risks and improve patient management.
The current study further emphasizes the interpretability and transparency of the MLRC model, characteristics needed to build trust among professionals and stakeholders in the healthcare domain. The MLRC model clearly illuminates the relationships between the predictor variables and CVD outcomes, providing explanations that allow clinicians to make decisions based on meaningful insight rather than opaque predictions.
6. Conclusion
Prediction of CVD is a difficult and crucial task in the medical field. However, if the disease is discovered in its early stages and preventive measures are implemented as soon as feasible, the fatality rate can be significantly reduced. Further development of this study is therefore highly desirable in order to focus the research on real-world datasets rather than only theoretical frameworks and simulations. Here, IoT technology allows the collection of patient medical data via body sensors and its transmission to a cloud system via a gateway. The acquired medical history is then saved in the cloud database for disease classification and prediction. In this model, the dataset information (i.e., medical data from patients) is used for training, where data preparation is first carried out to normalize the attribute values. The preprocessed dataset is then used to extract the set of pertinent features using the AGTO algorithm. Additionally, using the features selected from the medical records, the MLRC algorithm is employed to forecast CVD. Finally, the prediction output is sent to the cloud, where a medical expert can diagnose the illness and give patients an early course of therapy. The ease of deployment, excellent prediction accuracy, simplified storage requirements, and capacity to handle extremely high-dimensional data are the main advantages of this system. Four separate datasets, Cleveland, Mendeley, Statlog, and Comprehensive, were used in this investigation, and a set of performance measures was considered when assessing the model's accuracy. The obtained results show that the proposed AGTO-MLRC provides an accuracy of up to 99% across all datasets by properly predicting the disease based on the pertinent features extracted from the patients' medical data. The main advantages of the proposed model are that it is simple to implement, has a minimized error rate, and requires low training and testing time. However, other types of disease datasets, including those for stroke and Alzheimer's disease, still need to be analyzed with high accuracy.
In the future, the present work can be enhanced by implementing a new deep learning algorithm for the prediction of diabetes using IoT.
7. Discussion
Although the proposed IoT-based healthcare monitoring framework for CVD prediction performs impressively, several limitations have to be considered. First, the study depends on a fixed set of benchmark datasets, which may not capture the real diversity of medical data; this limits the generalization capability across diverse populations and healthcare settings. Additionally, the framework's performance is sensitive to the quality of the input data, and erroneous or noisy data may degrade its predictions.
Another limitation is the reliance on the MLRC model. Though effective, it may not capture more complex nonlinear interactions between data elements that higher-capacity models, such as deep learning techniques, could capture. The feature selection step using the AGTO algorithm, although useful, could also be extended by trying other optimization algorithms or hybrid approaches that combine several feature selection methods.
Future studies should address these limitations by including more varied data, which can improve the robustness and generalizability of the model. Other advanced machine learning techniques, such as deep learning and ensemble methods, could be explored to achieve better classification performance. Real-time data from multiple IoT devices and longitudinal studies could provide a comprehensive overview of CVD risk over time. Further studies can also be conducted to assess the appropriateness and effectiveness of the proposed framework in various healthcare settings.
References
- A. A. Nancy, D. Ravindran, P. M. D. Raj Vincent, K. Srinivasan, and D. Gutierrez Reina, "IoT-Cloud-Based Smart Healthcare Monitoring System for Heart Disease Prediction via Deep Learning," Electronics, vol.11, no.15, 2022.
- N. Verma, S. Singh, and D. Prasad, "A Review on existing IoT Architecture and Communication Protocols used in Healthcare Monitoring System," Journal of The Institution of Engineers (India): Series B, vol.103, pp.245-257, 2022.
- J. Logeshwaran, J. A. Malik, N. Adhikari, S. S. Joshi, and P. Bishnoi, "IoT-TPMS: An innovation development of triangular patient monitoring system using medical internet of things," International Journal of Health Sciences, vol.6, no.S5, pp.9070-9084, 2022.
- M. A. Kadam, S. Patil, P. Pethkar, R. Shikare, and S. Sarnayak, "A Cardiovascular Disease Prediction System Using Machine Learning," Journal of Pharmaceutical Negative Results, pp.7216-7225, 2023.
- S. N. S. Al-Humairi and A. I. Hajamydeen, IoT-Based Healthcare Monitoring Practices during Covid-19: Prospects and Approaches, Healthcare Systems and Health Informatics, pp.163-185, ed: CRC Press, 2022.
- P. J. Pronovost, M. D. Cole, and R. M. Hughes, "Remote Patient Monitoring During COVID-19: An Unexpected Patient Safety Benefit," JAMA, vol.327, no.12, pp.1125-1126, 2022.
- M. Peng, F. Hou, Z. Cheng, T. Shen, K. Liu, C. Zhao, and W. Zheng, "A Cardiovascular Disease Risk Score Model Based on High Contribution Characteristics," Applied Sciences, vol.13, no.2, 2023.
- S. Saravanan, M. Kalaiyarasi, K. Karunanithi, S. Karthi, S. Pragaspathy, and K. S. Kadali, "Iot Based Healthcare System for Patient Monitoring," in Proc. of IoT and Analytics for Sensor Networks: Proceedings of ICWSNUCA 2021, LNNS, vol.244, pp.445-453, 2022.
- C. Garcia-Vicente, D. Chushig-Muzo, I. Mora-Jimenez, H. Fabelo, I. T. Gram, M.-L. Lochen et al., "Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases," in Proc. of Heterogeneous Data Management, Polystores, and Analytics for Healthcare: VLDB Workshops, Poly 2022 and DMAH 2022, Virtual Event, September 9, 2022, Revised Selected Papers, pp.75-91, 2023.
- M. Ahmid, O. Kazar, and L. Kahloul, "A secure and intelligent real-time health monitoring system for remote cardiac patients," International Journal of Medical Engineering and Informatics, vol.14, no.2, pp.134-150, 2022.
- Z. Ashfaq, R. Mumtaz, A. Rafay, S. M. H. Zaidi, H. Saleem, S. Mumtaz et al., "Embedded AI-Based Digi-Healthcare," Applied Sciences, vol.12, no.1, 2022.
- O. Gaidai, Y. Cao, and S. Loginov, "Global Cardiovascular Diseases Death Rate Prediction," Current Problems in Cardiology, vol.48, no.5, 2023.
- O. T. Kee, H. Harun, N. Mustafa, N. A. Abdul Murad, S. F. Chin, R. Jaafar, and N. Abdullah, "Cardiovascular complications in a diabetes prediction model using machine learning: a systematic review," Cardiovascular Diabetology, vol.22, pp.1-10, 2023.
- O. Gaidai, Y. Xing, R. Balakrishna, J. Sun, and X. Bai, "Prediction of death rates for cardiovascular diseases and cancers," Cancer Innovation, vol.2, no.2, pp.140-147, 2023.
- A. Behera, T. K. Mishra, K. S. Sahoo, and B. Sarathchandra, "An Improved Machine Learning Framework for Cardiovascular Disease Prediction," in Proc. of Computing, Communication and Learning: First International Conference, CoCoLe 2022, CCIS, vol.1729, pp.289-299, Warangal, India, 2023.
- S. de Vries, M. L. Haaksma, K. Jozwiak, M. Schaapveld, D. C. Hodgson, P. J. Lugtenburg et al., "Development and Validation of Risk Prediction Models for Coronary Heart Disease and Heart Failure After Treatment for Hodgkin Lymphoma," Journal of Clinical Oncology, vol.41, no.1, pp.86-95, 2023.
- J. O. Olmedo-Aguirre, J. Reyes-Campos, G. Alor-Hernandez, I. Machorro-Cano, L. Rodriguez-Mazahua, and J. L. Sanchez-Cervantes, "Remote Healthcare for Elderly People Using Wearables: A Review," Biosensors, vol.12, no.2, 2022.
- J. Hanumanthappa, A. Y. Muaad, J. V. Bibal Benifa, C. Chola, V. Hiremath, and M. Pramodha, IoT-Based Smart Diagnosis System for HealthCare, Sustainable Communication Networks and Application: Proceedings of ICSCN 2021, ed: Springer, pp.461-469, 2022.
- P. Rubini, C. Subasini, A. V. Katharine, V. Kumaresan, S. G. Kumar, and T. Nithya, "A Cardiovascular Disease Prediction using Machine Learning Algorithms," Annals of the Romanian Society for Cell Biology, vol.25, no.2, pp.904-912, 2021.
- T. Chekouo and S. E. Safo, "Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease," Biostatistics, vol.24, no.1, pp.124-139, 2023.
- A. V. Anandhalekshmi, V. Srinivasa Rao, and G. R. Kanagachidambaresan, "Hybrid approach of baum-welch algorithm and SVM for sensor fault diagnosis in healthcare monitoring system," Journal of Intelligent & Fuzzy Systems, vol.42, no.4, pp.2979-2988, 2022.
- D. C. Vinutha, Kavyashree, and G. T. Raju, Machine Learning-Assisted Remote Patient Monitoring with Data Analytics, Tele-Healthcare: Applications of Artificial Intelligence and Soft Computing Techniques, pp.1-26, 2022.
- J. Dhanke, N. Rathee, M. S. Vinmathi, S. Janu Priya, S. Abidin, and M. Tesfamariam, "Smart Health Monitoring System with Wireless Networks to Detect Kidney Diseases," Computational Intelligence and Neuroscience, vol.2022, 2022.
- S. Mohapatra, A. Sahoo, S. Mohanty, and P. K. Patra, "Real-Time Health Monitoring System Using Predictive Analytics," in Proc. of Ambient Intelligence in Health Care: Proceedings of ICAIHC 2022, SIST, vol.317, Springer, pp.417-427, 2023.
- A. Kishor and C. Chakraborty, "Artificial Intelligence and Internet of Things Based Healthcare 4.0 Monitoring System," Wireless Personal Communications, vol.127, pp.1615-1631, 2022.
- S. U. D. Wani, N. A. Khan, G. Thakur, S. P. Gautam, M. Ali, P. Alam et al., "Utilization of Artificial Intelligence in Disease Prevention: Diagnosis, Treatment, and Implications for the Healthcare Workforce," Healthcare, vol.10, no.4, 2022.
- S. Neelakandan, J. R. Beulah, L. Prathiba, G. L. N. Murthy, E. F. Irudaya Raj, and N. Arulkumar, "Blockchain with deep learning-enabled secure healthcare data transmission and diagnostic model," International Journal of Modeling, Simulation, and Scientific Computing, vol.13, no.4, 2022.
- S. Z. D. Babu, D. Pandey, G. T. Naidu, S. Sumathi, A. Gupta, M. B. Alazzam, and B. K. Pandey, "Analysation of Big Data in Smart Healthcare," in Proc. of Artificial Intelligence on Medical Data: Proceedings of International Symposium, ISCMM 2021, pp.243-251, 2022.
- M. M. Ahsan, S. A. Luna, and Z. Siddique, "Machine-Learning-Based Disease Diagnosis: A Comprehensive Review," Healthcare, vol.10, no.3, 2022.
- L. Yang, H. Wu, X. Jin, P. Zheng, S. Hu, X. Xu et al., "Study of cardiovascular disease prediction model based on random forest in eastern China," Scientific Reports, vol.10, 2020.
- S. Mohan, C. Thirumalai, and G. Srivastava, "Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques," IEEE Access, vol.7, pp.81542-81554, 2019.
- C. Krittanawong, H. U. H. Virk, S. Bangalore, Z. Wang, K. W. Johnson, R. Pinotti et al., "Machine learning prediction in cardiovascular diseases: a meta-analysis," Scientific Reports, vol.10, 2020.
- X. Rossello, J. A. Dorresteijn, A. Janssen, E. Lambrinou, M. Scherrenberg, E. Bonnefoy-Cudraz et al., "Risk prediction tools in cardiovascular disease prevention: a report from the ESC Prevention of CVD Programme led by the European Association of Preventive Cardiology (EAPC) in collaboration with the Acute Cardiovascular Care Association (ACCA) and the Association of Cardiovascular Nursing and Allied Professions (ACNAP)," European Journal of Preventive Cardiology, vol.26, no.14, pp.1534-1544, 2019.
- P. Ghosh, S. Azam, M. Jonkman, A. Karim, F. M. J. M. Shamrat, E. Ignatious et al., "Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms With Relief and LASSO Feature Selection Techniques," IEEE Access, vol.9, pp.19304-19326, 2021.
- N. L. Fitriyani, M. Syafrudin, G. Alfian, and J. Rhee, "HDPM: An Effective Heart Disease Prediction Model for a Clinical Decision Support System," IEEE Access, vol.8, pp.133034-133050, 2020.
- S. S. Kareem, R. R. Mostafa, F. A. Hashim, and H. M. El-Bakry, "An Effective Feature Selection Model Using Hybrid Metaheuristic Algorithms for IoT Intrusion Detection," Sensors, vol.22, no.4, 2022.
- D. O. Sahin, S. Akleylek, and E. Kilic, "LinRegDroid: Detection of Android Malware Using Multiple Linear Regression Models-Based Classifiers," IEEE Access, vol.10, pp.14246-14259, 2022.
- A. Abdellatif, H. Abdellatef, J. Kanesan, C.-O. Chow, J. H. Chuah, and H. M. Gheni, "An Effective Heart Disease Detection and Severity Level Classification Model Using Machine Learning and Hyperparameter Optimization Methods," IEEE Access, vol.10, pp.79974-79985, 2022.
- A. Alfaidi, R. Aljuhani, B. Alshehri, H. Alwadei, and S. Sabbeh, "Machine Learning: Assisted Cardiovascular Diseases Diagnosis," International Journal of Advanced Computer Science and Applications, vol.13, no.2, 2022.
- A. Saboor, M. Usman, S. Ali, A. Samad, M. F. Abrar, and N. Ullah, "A Method for Improving Prediction of Human Heart Disease Using Machine Learning Algorithms," Mobile Information Systems, vol.2022, 2022.
- S. Ahmed, S. Shaikh, F. Ikram, M. Fayaz, H. S. Alwageed, F. Khan, F. H. Jaskani, "Prediction of Cardiovascular Disease on Self-Augmented Datasets of Heart Patients Using Multiple Machine Learning Models," Journal of Sensors, vol.2022, 2022.