Misclassification Adjustment of Family History of Breast Cancer in a Case-Control Study: a Bayesian Approach

Abstract
Background: Misreporting of self-reported family history may lead to biased estimates. We used Bayesian methods to adjust for exposure misclassification. Materials and Methods: A hospital-based case-control study was used to identify breast cancer risk factors among Iranian women. Three models were jointly considered: an outcome, an exposure and a measurement model. All models were fitted using Bayesian methods and run until convergence. Results: In the Bayesian model without misclassification, the odds ratios for the relationship between breast cancer and family history under different prior distributions were 2.98 (95% CRI: 2.41, 3.71), 2.57 (95% CRI: 1.95, 3.41) and 2.53 (95% CRI: 1.93, 3.31). In the misclassification model, the odds ratios adjusted for misclassification in the different scenarios were 2.64 (95% CRI: 2.02, 3.47), 2.64 (95% CRI: 2.02, 3.46), 1.60 (95% CRI: 1.07, 2.38), 1.61 (95% CRI: 1.07, 2.40), 1.57 (95% CRI: 1.05, 2.35), 1.58 (95% CRI: 1.06, 2.34) and 1.57 (95% CRI: 1.06, 2.33). Conclusions: Self-reported family history may be misclassified in different scenarios. Due to the lack of validation studies in Iran, more attention to this matter is suggested in future research, especially regarding sensitivity and specificity values.


Introduction
Collecting the family history of diseases is very important in both clinical practice and research. Family history data are essential for estimating familial aggregation, assessing its effect on disease, detecting heterogeneity, observing multiple phenotypes within a family, and estimating the number of genes involved in a disorder (Szatmari and Jones, 1999).
A family history of cancer is an important risk factor for many cancers, including breast cancer (Ebrahimi et al., 2002; Holakouie-Naieni et al., 2007; Mahouri et al., 2007; Hassanzadeh et al., 2012; Zare et al., 2013; Hosseinzadeh et al., 2014; Tehranifar et al., 2015), for which family history is often evaluated via self-report (Veisy et al., 2015). However, the validity of self-reported data is often problematic, as study participants are believed to underestimate, or even deny, the existence of a family history of the disease. Thus, people with a Family History of Breast Cancer (FHBC) in first-degree relatives tend to under-report it. The reliability of self-report questionnaires may be influenced by the characteristics of the participants. Consistency of self-reported data on [...] However, these reasons may not justify disregarding the potential bias in the observed exposure-outcome relationship (van Gelder et al., 2014b).
Although exposure misclassification can also be corrected using frequentist methods, Bayesian methods are preferable (Greenland, 2008).
In this study, we used Bayesian methods to adjust for exposure misclassification. The Bayesian method considered here was introduced by MacLehose et al. (2009). We aimed to estimate the relationship between breast cancer and family history in first-degree relatives after adjusting for confounders while ignoring exposure misclassification, and to compare the results with the posterior odds ratios obtained after adjusting for confounders under a range of exposure misclassification scenarios.

Study population
We used a hospital-based case-control study to identify breast cancer risk factors among Iranian women. The findings of this study have been published in detail previously (Ghiasvand et al., 2011; Ghiasvand et al., 2012). Cases were recruited from the Mottahari Breast Clinic of Shiraz University of Medical Sciences. This center collects data from about 80% of all incident breast cancer patients treated in the main hospitals of Shiraz city. Eligible cases were women with incident, histopathologically confirmed breast cancer diagnosed at 50 years of age or older. Most cases (93%) were interviewed within six months after diagnosis. We included all individuals with complete data. Eleven case patients from other provinces were excluded because no matched controls could be found for them. Controls were frequency-matched with cases on five-year age groups and province of residence. Controls were primarily selected from healthy female visitors accompanying patients referred to the Faghihi hospital for general surgery (60%), urology (24%) and cardiovascular (16%) diseases. A total of 1090 controls were selected, but 92 women (8%) refused to participate.
For both cases and controls, face-to-face interviews were performed. Controls were interviewed from May through August 2009. Interviews were conducted by two trained female nurses (one for cases and one for controls), and the time of interviews was similar for cases and controls. None of the interviewers were aware of the study hypotheses.

Statistical analysis
Many researchers have proposed probabilistic bias analysis methods to account for bias (Lash and Fink, 2003; Fox et al., 2005; Greenland, 2005; Greenland and Lash, 2008; Lash et al., 2009; MacLehose and Gustafson, 2012). These methods adjust the main effect estimate and propagate the uncertainty surrounding the bias parameters, incorporating this uncertainty into the variance estimate of the adjusted main effect. The approach taken in probabilistic bias analysis is to repeatedly draw a random sample from the bias parameter distribution(s) and use the sampled parameters to adjust the effect estimate. The resulting distribution is that of the "bias-adjusted" main effect (Fox et al., 2005; MacLehose and Gustafson, 2012).
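The sampling loop described above can be sketched as follows. This is a minimal illustration only, not the analysis used in this study: it assumes nondifferential misclassification (the same sensitivity and specificity for cases and controls), propagates systematic error only, and uses hypothetical 2x2 counts and hypothetical beta parameters. The function names are illustrative.

```python
import random


def adjust_counts(exposed, unexposed, se, sp):
    """Back-calculate expected true exposed/unexposed counts from observed counts."""
    total = exposed + unexposed
    true_exposed = (exposed - (1.0 - sp) * total) / (se + sp - 1.0)
    return true_exposed, total - true_exposed


def probabilistic_bias_analysis(a, b, c, d, se_prior, sp_prior,
                                n_draws=5000, seed=1):
    """a, b: exposed/unexposed cases; c, d: exposed/unexposed controls.
    se_prior, sp_prior: (alpha, beta) parameters of beta distributions."""
    rng = random.Random(seed)
    adjusted_ors = []
    for _ in range(n_draws):
        se = rng.betavariate(*se_prior)   # draw the bias parameters
        sp = rng.betavariate(*sp_prior)
        if se + sp <= 1.0:
            continue                      # classification no better than chance
        a1, b1 = adjust_counts(a, b, se, sp)
        c1, d1 = adjust_counts(c, d, se, sp)
        if min(a1, b1, c1, d1) <= 0:
            continue                      # draw implies an impossible table
        adjusted_ors.append((a1 * d1) / (b1 * c1))
    adjusted_ors.sort()
    return adjusted_ors


def percentile(sorted_values, q):
    """Simple empirical percentile of an already sorted list."""
    return sorted_values[int(round(q * (len(sorted_values) - 1)))]


# Hypothetical counts and priors, chosen only for illustration.
ors = probabilistic_bias_analysis(a=150, b=730, c=70, d=928,
                                  se_prior=(80, 20), sp_prior=(98, 2))
print(percentile(ors, 0.5), percentile(ors, 0.025), percentile(ors, 0.975))
```

The sorted list of adjusted odds ratios plays the role of the "bias-adjusted" distribution; its median and 2.5th/97.5th percentiles give a point estimate and simulation limits.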
Bayesian analysis is based on the posterior distribution, and samples from it can be generated using a Markov Chain Monte Carlo (MCMC) algorithm (MacLehose and Gustafson, 2012). The Monte-Carlo sensitivity analysis procedure (Fox et al., 2005) is computationally intensive, while Bayesian analysis with MCMC algorithms can be implemented quickly (Chu et al., 2006). Unlike deterministic maximum-likelihood algorithms, MCMC methods are stochastic procedures that repeatedly generate random samples. The "Markov chain" part refers to the process of generating the random samples, and the "Monte Carlo" part refers to computing summary statistics from the generated samples by Monte Carlo integration (Hamra et al., 2013a).
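To make the two roles concrete, the sketch below runs a random-walk Metropolis sampler for a single log-odds parameter with a N(0, 1.38) prior and then averages the retained draws. It is an illustration under simple assumptions (binomial data with hypothetical counts), and random-walk Metropolis is not necessarily the sampler used by OpenBUGS for the models in this study.

```python
import math
import random


def log_posterior(beta, successes, trials, prior_var):
    """Log posterior (up to a constant) of a log-odds `beta`:
    binomial likelihood plus a N(0, prior_var) prior."""
    log_lik = successes * beta - trials * math.log1p(math.exp(beta))
    log_prior = -beta * beta / (2.0 * prior_var)
    return log_lik + log_prior


def metropolis(successes, trials, prior_var=1.38,
               n_iter=20000, step=0.3, seed=1):
    rng = random.Random(seed)
    beta = 0.0
    current = log_posterior(beta, successes, trials, prior_var)
    chain = []
    for _ in range(n_iter):
        # Markov chain step: propose a move that depends only on the current state.
        proposal = beta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials, prior_var)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        chain.append(beta)
    return chain


# Hypothetical data: 30 events in 100 trials.
chain = metropolis(successes=30, trials=100)
kept = chain[5000:]                       # discard burn-in
# Monte Carlo integration: the posterior mean is approximated by a sample mean.
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```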
Details of the Bayesian method are described below.
The exposure variable in the study is FHBC in first-degree relatives. The true family history (fh_true) is unobserved and was estimated in this study; the reported family history (fh_reported) is its observed, error-prone measurement.
Breast cancer is the outcome variable. Z is a vector of possible confounders: age at menarche (less than 12 years, 12-15 years, more than 15 years), menopause status (before, after), parity (yes, no), past use of oral contraceptives (yes, no), age at first pregnancy (less than 25 years, 25 years or more, nulliparous), history of breastfeeding (yes, no) and body mass index (BMI) (as a continuous variable).
We treated fh_true as measured with error by fh_reported and used information from previous research on the sensitivity and specificity of self-reported family history data to produce corrected estimates.
We conducted Bayesian uncertainty analyses conditional on prior hypotheses generated from published studies. Briefly, three models are jointly considered: an outcome, an exposure, and a measurement model, which allowed simultaneous imputation of the true family history exposure status and estimation of its effect on the risk of breast cancer.
For our analysis, we used the directed acyclic graph (DAG) shown in Figure 1.
Figure 1. Directed Acyclic Graph (DAG) for the relationship between self-reported family history status and breast cancer. The outcome model specified the probability of breast cancer as a function of fh_true (arrow C) and the other covariates Z (arrow E), the measurement model specified the probability of fh_reported as a function of fh_true (arrow A) and breast cancer (arrow B), and the exposure model specified the probability of fh_true as a function of the covariates Z (arrow D).

Outcome Model
The outcome model used in the study is a logistic regression of breast cancer on fh_true and the covariates Z, where b_0, b_1 and θ are the unknown parameters.
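The displayed equation did not survive in this text. A plausible form, assuming a logistic link (consistent with the statement that the no-misclassification case reduces to a standard Bayesian logistic regression) and taking b_1 as the coefficient of fh_true and θ as the coefficient vector of Z, is sketched below. The exact parameterization is an assumption.

```latex
\operatorname{logit}\,\Pr(\mathrm{BC}=1 \mid fh_{\mathrm{true}}, Z)
  = b_0 + b_1\, fh_{\mathrm{true}} + \theta^{\top} Z
```

Under this form, the odds ratio for family history is exp(b_1).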
In the Bayesian approach, the prior distributions are defined for these unknown parameters.
We used a non-informative N(0, 10^6) prior for the intercept term (b_0), and informative priors for the other coefficients in the model (Table 1). We also used non-informative N(0, 1000) priors, following Gelman et al. (2003), and N(0, 1.38) priors, following Hamra et al. (2013b) and Greenland and Mansournia (2015), for the other regression coefficients in the model.
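To see what these variances imply on the odds ratio scale, the 95% prior interval for a coefficient with a N(0, v) prior can be exponentiated. The short calculation below is a worked illustration, treating the second argument of N(·,·) as a variance:

```latex
v = 1.38:\quad \sqrt{1.38} \approx 1.175,\quad 1.96 \times 1.175 \approx 2.30,\quad
  e^{\pm 2.30} \approx (0.10,\ 10) \\
v = 1000:\quad \sqrt{1000} \approx 31.6,\quad 1.96 \times 31.6 \approx 62,\quad
  e^{\pm 62} \approx 10^{\pm 27}
```

The N(0, 1.38) prior therefore places about 95% of its mass on odds ratios between 1/10 and 10, which is why it acts as weakly informative, whereas N(0, 1000) is essentially flat over any plausible range.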

Exposure model
For the true exposure variable, we consider an exposure model for the probability of fh_true as a function of the covariates Z. The prior distribution N(0, 10 000) is used for the intercept, and the prior distributions N(0, 1), N(0, 10) and N(0, 1.38) are used for the other regression coefficients in this model.
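The exposure model equation is missing from this text. Assuming a logistic link, it could be written as below, where g_0 (intercept) and g (coefficients of Z) are assumed names; the Results section denotes the exposure-model parameters by g.

```latex
\operatorname{logit}\,\Pr(fh_{\mathrm{true}}=1 \mid Z) = g_0 + g^{\top} Z
```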

Measurement Model
For the reported exposure, we consider a measurement model for the probability of fh_reported given fh_true and disease status. In this model, a_0 is the sensitivity of reported family history among controls, a_1 is the false positive rate (FPR) of reported family history among controls, a_2 is the sensitivity of reported family history among cases and a_3 is the FPR of reported family history among cases.
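The measurement model equation is also missing here. Based only on the definitions of a_0 to a_3 given above, it can be written piecewise as follows; the layout is a reconstruction, not necessarily the authors' original notation.

```latex
\Pr(fh_{\mathrm{reported}}=1 \mid fh_{\mathrm{true}}, \mathrm{BC}) =
\begin{cases}
a_0, & fh_{\mathrm{true}}=1,\ \mathrm{BC}=0 \\
a_1, & fh_{\mathrm{true}}=0,\ \mathrm{BC}=0 \\
a_2, & fh_{\mathrm{true}}=1,\ \mathrm{BC}=1 \\
a_3, & fh_{\mathrm{true}}=0,\ \mathrm{BC}=1
\end{cases}
```

Specificity in each group is one minus the corresponding FPR, so setting a_0 = a_2 = 1 and a_1 = a_3 = 0 corresponds to no misclassification.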
The exposure and measurement models were used to impute values of true family history variable in a way similar to that used with missing data techniques. These imputed values were then used to estimate the associations between FHBC and breast cancer.
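One way to picture this imputation step is through the conditional probability that a woman truly has a family history, given her report, her disease status and her covariates. The sketch below combines the three models by Bayes' rule. It assumes logistic outcome and exposure models, and all argument names and numeric values are hypothetical.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_true_exposed(reported, case, lin_outcome, b1, lin_exposure, a):
    """Pr(fh_true = 1 | fh_reported, disease status, Z).

    lin_outcome:  outcome linear predictor without the exposure term
    b1:           log odds ratio for fh_true in the outcome model
    lin_exposure: linear predictor of the exposure model
    a:            (a0, a1, a2, a3) = (sens. controls, FPR controls,
                                      sens. cases, FPR cases)
    """
    a0, a1, a2, a3 = a
    sens, fpr = (a2, a3) if case else (a0, a1)
    weights = []
    for fh_true in (1, 0):
        p_exposed = expit(lin_exposure)
        p_exposure = p_exposed if fh_true else 1.0 - p_exposed      # exposure model
        p_case = expit(lin_outcome + b1 * fh_true)
        p_outcome = p_case if case else 1.0 - p_case                # outcome model
        p_report_yes = sens if fh_true else fpr
        p_report = p_report_yes if reported else 1.0 - p_report_yes  # measurement model
        weights.append(p_exposure * p_outcome * p_report)
    return weights[0] / (weights[0] + weights[1])


# Hypothetical values: a case who reported no family history.
p = prob_true_exposed(reported=0, case=1, lin_outcome=0.0, b1=0.9,
                      lin_exposure=-2.0, a=(0.8, 0.02, 0.9, 0.03))
rng = random.Random(1)
imputed_fh_true = 1 if rng.random() < p else 0   # one imputed value
print(p, imputed_fh_true)
```

Within an MCMC run, a draw of this kind would be repeated for every participant at every iteration, with the current parameter values plugged in.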
To estimate the effects of under-reporting (misclassification) of FHBC on its association with breast cancer, we implemented two models that specified a_0, a_1, a_2 and a_3 in the measurement model. Similar to MacLehose et al. (2009), in model 1 sensitivity and specificity are equal to 1.00 (that is, no misclassification), which leads to a standard Bayesian logistic regression model.
As described by MacLehose et al. (2009), model 2 is based on the assumption that the sensitivities and false positive rates used in the measurement model are not exactly known. For the prior distribution of each sensitivity and false positive rate, a beta distribution was chosen, with prior parameters selected to reflect a priori beliefs concerning reported family history. The values for sensitivity and false positive rate among cases and controls were assumed not to be correlated. The beta distribution has two parameters, b_1 and b_2. For sensitivities, b_1 is the number of women with a true family history who reported it and b_2 is the number of women with a true family history who did not report it. For FPRs, b_1 is the number of women who reported a family history but in truth had none, and b_2 is the number of women who reported no family history and in truth had none. We also used uniform priors (beta(1,1)) for the coefficients in the measurement model.
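The link between validation counts and a beta prior can be illustrated with a short sketch. The counts below are hypothetical and are not the values from Table 3; the function names are illustrative.

```python
import random


def beta_summary(b1, b2):
    """Mean and standard deviation of a beta(b1, b2) distribution."""
    mean = b1 / (b1 + b2)
    variance = b1 * b2 / ((b1 + b2) ** 2 * (b1 + b2 + 1))
    return mean, variance ** 0.5


def beta_interval(b1, b2, n_draws=20000, seed=1):
    """Approximate central 95% interval of beta(b1, b2) by simulation."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(b1, b2) for _ in range(n_draws))
    return draws[int(0.025 * n_draws)], draws[int(0.975 * n_draws)]


# Hypothetical validation data: of 100 women with a true family history,
# 90 reported it and 10 did not, giving a beta(90, 10) prior for sensitivity.
sens_mean, sens_sd = beta_summary(90, 10)
sens_low, sens_high = beta_interval(90, 10)

# The uniform prior beta(1, 1) carries no preference for any value in (0, 1).
flat_mean, flat_sd = beta_summary(1, 1)
print(sens_mean, sens_sd, sens_low, sens_high, flat_mean, flat_sd)
```

Larger validation counts give a more concentrated prior, whereas beta(1,1) leaves sensitivity and FPR almost entirely to be learned from the data.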
All models were fitted using Bayesian methods and run until convergence. Convergence was checked with the Gelman-Rubin diagnostic (Gelman and Rubin, 1992).
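For reference, the basic form of the Gelman-Rubin potential scale reduction factor for one scalar parameter can be computed as in the sketch below. It assumes several chains of equal length and omits the refinements found in software implementations; the simulated chains are for illustration only.

```python
import random
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Three simulated chains sampling the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three simulated chains stuck around different values (not converged).
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0, 6.0)]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 are consistent with convergence, while values well above 1 indicate that the chains have not yet mixed.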
After the burn-in period, the iterations of the MCMC algorithm are random draws from the posterior distributions of interest; the posterior mean of the log odds ratio was exponentiated to obtain the odds ratio of interest. We exponentiated the 2.5th and 97.5th percentiles of the random draws to obtain 95% Posterior Credible Intervals (CRIs) (van Gelder et al., 2014a).
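The summary step described above amounts to a few lines of code. The sketch below uses simulated draws in place of real MCMC output, and the burn-in length is an arbitrary choice for illustration.

```python
import math
import random


def summarize_log_or(draws, burn_in):
    """Return (OR, lower, upper) from posterior draws of a log odds ratio."""
    kept = sorted(draws[burn_in:])

    def pct(q):
        return kept[int(q * (len(kept) - 1))]

    mean_log_or = sum(kept) / len(kept)
    return math.exp(mean_log_or), math.exp(pct(0.025)), math.exp(pct(0.975))


# Simulated stand-in for MCMC output (not results from this study).
rng = random.Random(0)
draws = [rng.gauss(math.log(2.0), 0.15) for _ in range(6000)]
odds_ratio, cri_low, cri_high = summarize_log_or(draws, burn_in=1000)
print(odds_ratio, cri_low, cri_high)
```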
We used R, STATA 12.0 and OpenBUGS software for the data analysis.

Results
A total of 880 cases with breast cancer and 998 controls were included in the study.
In the logistic regression analysis, the odds ratio for the relationship between FHBC in first-degree relatives and breast cancer, adjusted for covariates, was 2.55 (95% CRI: 2.55, 3.35).
In the Bayesian analysis, we first fitted the no-misclassification model (model 1) and ran it for different priors on the outcome model. Similar to MacLehose et al. (2009), priors for the outcome model parameters were selected from previous studies (Table 1). In status A, the Bayesian analysis (Table 2) showed that the odds ratio for the relationship between FHBC in first-degree relatives and breast cancer was 2.98 (95% CRI: 2.41, 3.71). In status B, the priors for the outcome model parameters were changed to non-informative priors; following the recommendation of Gelman et al. (2003), a N(0, 1000) prior was used. The OR for the aforementioned relationship was 2.57 (95% CRI: 1.95, 3.41). In status C, N(0, 1.38) priors were used for the outcome model parameters. The OR for the relationship between FHBC in first-degree relatives and breast cancer was 2.53 (95% CRI: 1.93, 3.31).
In the misclassification model (model 2), it is essential to define the prior distributions for the unknown parameters of the outcome, exposure and measurement models.
The priors for the parameters of the outcome model were selected from different perspectives, including previous literature and non-informative and weakly informative priors (N(0, 1000) and N(0, 1.38)). The priors for the unknown parameters of the exposure model (g) were defined under three scenarios: N(0, 1), N(0, 10) and N(0, 1.38). In all situations, following MacLehose et al. (2009), the priors for the intercepts of the outcome and exposure models were N(0, 10^6) and N(0, 10^4), respectively.
Priors for the measurement model, that is, the sensitivities and FPRs in the case and control groups, were taken from the studies of Tehranifar et al. (2015) and Jurek et al. (2009) (Table 3), or set to beta(1,1). Therefore, we had 10 statuses, D, E, F, G, H, I and J (Table 2) and K, L and M (Table 2); that is, we ran 10 analyses based on the Bayesian method.
In the simulation with the prior specification of status D, the OR between FHBC and breast cancer adjusted for misclassification and confounders was 2.64 (95% CRI: 2.02, 3.47). In the simulation with the prior specification of status E, the adjusted OR between FHBC and breast cancer was 2.64 (95% CRI: 2.02, 3.46).
In all scenarios, the ORs were statistically significant. Given the assumptions, all results were acceptable, as explained in the Discussion.

Discussion
All analyses of epidemiological studies should be seen as part of a sensitivity analysis, because we cannot claim that the assumptions are absolutely correct. Statisticians consider only inferential possibilities rather than the inferences themselves, whereas bias analysis focuses on those inferences that are not pointless or unreasonable (Greenland, 2009). On the other hand, not every article should be considered a candidate for bias analysis. When no useful conclusion can be drawn from the conventional analysis, bias analysis is not necessary (Greenland, 2009).
Quantitative bias analysis provides an estimate of the uncertainty caused by systematic errors and can serve as a useful guide for further investigation (Lash et al., 2014).
In this study, we assumed that a family history of the disease in first-degree relatives is a candidate for bias due to misclassification. A woman who is asked whether she has a family history of breast cancer may not see the need to answer the question correctly. Such participants may underestimate the importance of the question and give inaccurate answers.
We consider the following limitations of our study. We assumed differential misclassification of the exposure variable; that is, sensitivity and specificity differ between the case and control groups. As another limitation, potential errors in the other variables were ignored. In addition to the variables studied, other confounders that were not included in this study may affect the results. As a further assumption, prior values taken from the previous literature were assumed to hold for the present study. Moreover, it was assumed that the probability density of the external validation data is similar to that of the data in this study. Because of the lack of validation data on FHBC in first-degree relatives in Shiraz (or in Iran as a whole), we used priors from outside Iran and assumed them to be applicable. A major challenge in Bayesian data analysis is the choice of appropriate priors. In this study, previous literature, non-informative and weakly informative priors were used for the coefficients of the outcome and exposure models. Non-informative and weakly informative priors were selected as recommended by Gelman et al. (2003) and by Hamra et al. (2013b) and Greenland and Mansournia (2015), respectively. Given the abovementioned assumptions and the accuracy of the bias model, we corrected the odds ratios for misclassification of the exposure variable in the breast cancer data using the Bayesian methods described by MacLehose et al. (2009).
In the model without misclassification (model 1), after applying the priors selected from the literature as well as the non-informative and weakly informative priors (means = 0 and variances = 1000 and 1.38), the odds of FHBC in first-degree relatives among cases with breast cancer were 2.98, 2.57 and 2.53 times those among controls, respectively, after adjustment for confounders. In the logistic regression analysis, the OR for the exposure-outcome association adjusted for confounders was 2.55 (95% CRI: 2.55, 3.35). This finding is similar to the results of the Bayesian analysis using non-informative and weakly informative priors. Thus, the exposure-outcome relationship was relatively strong. Similar results were found in the studies conducted by Holakouie-Naieni et al. (2007) (OR = 2.96, 95% Confidence Interval (CI): 1.46, 5.99) and Sepandi et al. (2014) (OR = 1.94, 95% CI: 1.35, 2.78). However, the exposure-outcome odds ratio in the study by Mahouri et al. (2007) was 9.07 (95% CI: 4.06, 12.26). This difference may be due to the different settings of the studies, or it may result from systematic errors in the exposure measurement. It should be noted that such an odds ratio (9.07, 95% CI: 4.06, 12.26) is unusual for the association between FHBC and breast cancer.
The results of the exposure misclassification model (model 2) showed some variation. When the priors of the outcome model were chosen from previous studies and the priors of the exposure model had mean 0 and variances 1 and 10, the odds ratios were 2.64 (95% CRI: 2.02, 3.47) and 2.64 (95% CRI: 2.02, 3.46), respectively. These odds ratios are similar to each other and lower than that found in model 1 (2.98). This difference is due to the misclassification in self-reported FHBC.
In another scenario, the value 1000 was taken as the prior variance of the outcome model. When the prior variance of the outcome model was 1000 and the prior variances of the exposure model were 1 and 10, the odds ratios were 1.60 (95% CRI: 1.07, 2.38) and 1.61 (95% CRI: 1.07, 2.40), respectively. Once sensitivity and specificity are taken into account, these are markedly lower than the odds ratio of model 1 (2.57, 95% CRI: 1.95, 3.41) under the same prior variance of the outcome model (1000). The effect nevertheless remains significant, and the credible interval is narrower.
When the prior variance of the outcome model was 1.38 and the prior variances of the exposure model were 1, 1.38 and 10, the odds ratios were 1.57 (95% CRI: 1.05, 2.35), 1.58 (95% CRI: 1.06, 2.34) and 1.57 (95% CRI: 1.06, 2.33), respectively. These ORs are almost identical. Compared with the odds ratio obtained with this prior in model 1 (OR = 2.53, 95% CRI: 1.93, 3.31), a smaller effect was again indicated. This is due to the misclassification in the exposure variable.
When a uniform distribution was used for the sensitivity and specificity values, the ORs under the different weakly informative priors for the outcome and exposure models were not statistically significant and their CRIs were very wide. This is due to the use of a uniform prior distribution for the coefficients of the measurement model.
The odds ratios obtained with Bayesian methods in this study can be compared with adjusted ORs obtained by probabilistic bias analysis in the study by Jurek et al. (2009). In their model without misclassification, the OR for the relationship between FHBC and the outcome was 1.63 (95% uncertainty limits: 1.63, 1.63), which is much lower than those found in this study. In their misclassification model, under different scenarios for the sensitivity and specificity values with uniform and triangular distributions, the adjusted ORs were 2.27 (95% CI: 1.33, 6.01), 1.77 (95% CI: 1.25, 3.93) and 1.99 (95% CI: 1.37, 4.21) (Jurek et al., 2009).
In the present study, the exposure-outcome effects under the misclassification model moved toward the null, and the ORs adjusted for misclassification under the different priors were 2.64, 1.60, 1.61, 1.57 and 1.58. In all other situations but one, the odds ratios corrected for misclassification were also lower than the ORs derived from the model without misclassification. In the one exception, where the priors of the measurement model were uniform, the priors of the outcome model were selected from previous literature and the prior variance of the exposure model was 1, the OR was higher than that derived from the model without misclassification.
In the present study, the credible intervals in the models without misclassification were wider than the uncertainty limits reported by Jurek et al. (2009). In their study, the width of the uncertainty limits was 0.65, whereas in our study the CRI widths for the previous literature priors and for prior variances of 1000 and 1.38 were 1.30, 1.46 and 1.41, respectively. This may be due to the larger sample size in the study by Jurek et al. (2009). However, in the misclassification model of the present study, all CRIs were narrower than the uncertainty limits reported by Jurek et al. (2009).
Finally, due to the lack of validation studies in Iran, more attention should be paid to this matter in future research, especially to sensitivity and specificity. As different diagnostic tests for breast cancer do not have the same accuracy at different ages, bias analysis is also suggested for correcting outcome misclassification. It is further recommended that other sources of bias be considered in future studies; only then could it be claimed that the measures of association obtained are closer to the causal measure.