For the cervical cancer dataset, finding the representative patterns from the complete data is a non-trivial task. The major difficulties can be summarized as follows:
1. Noise and uncertainty in data collection: Owing to subjective factors (e.g., misremembering certain information) and objective factors (e.g., a slip of the pen on the questionnaire), the collected data may contain noise and uncertainty, which creates difficulties for deterministic clustering approaches such as Hard C-Means (HCM). Moreover, noise and uncertainty in the data can create outlier samples, which prevent clustering algorithms from discovering the underlying representative patterns;
2. Limited data: The complete data account for a mere 6% of the entire dataset, and thus a highly robust clustering algorithm is required in such a scenario. Since most clustering methods, e.g., Fuzzy C-Means (FCM) and Possibilistic C-Means (PCM), adopt iterative search in their optimization procedures, how to derive robust and stable clustering results under various initializations becomes an important problem, especially when the number of samples for clustering is limited.
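The sensitivity to initialization noted in the second difficulty can be illustrated with a small sketch. The code below is a plain Hard C-Means loop written for illustration only; the function name `hard_c_means`, the four toy points, and the two initializations are assumptions and do not come from the cervical cancer data. With so few samples, two different starting centroids converge to two different partitions with very different costs.

```python
import numpy as np

def hard_c_means(X, init_centroids, n_iter=100):
    """Plain Hard C-Means iterations from a given initialization (illustrative)."""
    C = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid: shape (n, c)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)              # crisp assignment to the nearest centroid
        new_C = C.copy()
        for j in range(C.shape[0]):
            members = X[labels == j]
            if len(members) > 0:                # an empty cluster keeps its old centroid
                new_C[j] = members.mean(axis=0)
        if np.allclose(new_C, C):               # stop once the centroids no longer move
            break
        C = new_C
    # Final assignment and within-cluster sum of squared distances
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    cost = d2[np.arange(len(X)), labels].sum()
    return C, labels, cost

# Hypothetical toy data: the four corners of a 4 x 1 rectangle
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Initialization A splits left/right; initialization B splits bottom/top
C_a, labels_a, cost_a = hard_c_means(X, [[0.0, 0.5], [4.0, 0.5]])
C_b, labels_b, cost_b = hard_c_means(X, [[2.0, 0.0], [2.0, 1.0]])

print(cost_a, cost_b)  # costs: 1.0 vs 16.0
```

Both runs are fixed points of the iteration, yet only the first recovers the compact left/right grouping. This is the kind of instability that motivates a more robust formulation when samples are scarce.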
To overcome the difficulties mentioned above, a new data clustering algorithm based on Bayesian theory and fuzzy theory is proposed, which is referred to as the Bayesian Possibilistic C-Means clustering algorithm (BPCM for short). The proposed BPCM has the following characteristics that help to handle the mentioned problems in estimating missing attributes.
1. A fuzzy clustering algorithm is designed to model the noise and uncertainty in the collected data. Specifically, the possibilistic membership constraints of PCM are adopted to reduce the influence of outliers.
Table 2. Summary of mathematical notations.
x_i: A vector representing the ith original observation, x_i ∈ X.
u_{i,j}: The membership of the ith observation to the jth cluster.
2. A Bayesian formulation is adopted to obtain the cluster centroids, i.e., the representative patterns, and thus the overfitting phenomenon is automatically reduced. This formulation makes BPCM robust against random initializations even with limited data.
With the above two properties, the proposed BPCM is able to provide reliable representative patterns for the risk factors related to cervical cancer. In the next section, a detailed description of the proposed BPCM algorithm is presented.
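To make the first property concrete, the sketch below contrasts the standard FCM membership (each row constrained to sum to one) with the standard PCM typicality (no sum-to-one constraint). These are the textbook FCM and PCM update formulas, not the BPCM updates, which are derived later. The function names, the toy centroids and samples, and the bandwidth values `eta` are assumptions chosen for illustration.

```python
import numpy as np

def fcm_memberships(X, C, m=2.0):
    """Standard FCM memberships: each row sums to 1 (probabilistic constraint)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                  # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def pcm_typicalities(X, C, eta, m=2.0):
    """Standard PCM typicalities: each entry lies in (0, 1], rows need not sum to 1."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))

# Hypothetical setup: two centroids, one ordinary sample and one far outlier
C = np.array([[0.0, 0.0], [10.0, 0.0]])
X = np.array([[0.0, 1.0],      # close to the first centroid
              [5.0, 50.0]])    # outlier, equidistant from both centroids
eta = np.array([1.0, 1.0])     # assumed per-cluster bandwidths

U_fcm = fcm_memberships(X, C)
U_pcm = pcm_typicalities(X, C, eta)

print(U_fcm[1])  # the outlier still receives membership 0.5 in each cluster
print(U_pcm[1])  # the outlier receives typicalities close to 0
```

Because FCM forces the memberships of every sample to sum to one, the outlier pulls on both centroids with weight 0.5. Under the possibilistic constraint its weight is nearly zero, which is why its influence on the centroid estimates is reduced.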
3. The proposed Bayesian Possibilistic C-Means (BPCM) clustering
To extract representative patterns from the limited complete data, a Bayesian Possibilistic C-Means clustering approach has been designed, which combines the merits of possibilistic membership constraints and Bayesian estimation. Notations of the frequently used variables are listed in Table 2.
3.1. Derivation of the optimal cluster centroids
To obtain a robust estimate of the underlying representative patterns within the limited observation data, Bayesian estimation is adopted. The Maximum Likelihood (ML) criterion is employed, which aims to maximize the conditional probability p(X|C). By introducing the membership distribution U, the conditional probability can be reformulated as,
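The displayed expression is not reproduced at this point. Assuming U is treated as a continuous latent variable that is marginalized out, the product rule gives one form consistent with the surrounding text (a sketch, not necessarily the authors' exact equation):

```latex
p(X \mid C) \;=\; \int p(X, U \mid C)\, \mathrm{d}U
           \;=\; \int p(X \mid U, C)\, p(U \mid C)\, \mathrm{d}U
```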
where p(U|C) is a prior distribution and can be taken as a constant when an uninformative prior is applied. This kind of objective function can be optimized using the Expectation-Maximization (EM) routine. In the E-step, the expectation of the latent variable U is computed, and in the M-step, a local maximum of the log-likelihood is obtained with the membership distribution fixed. Similar to the definitions in the Gaussian mixture model (GMM), we have: