-
Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle.
The aim of this study was to evaluate methods for genomic evaluation of the Spanish Holstein population as an initial step toward the implementation of routine genomic evaluations. This study provides a description of the population structure of progeny-tested bulls in Spain at the genomic level and compares different genomic evaluation methods with regard to accuracy and bias. Two Bayesian linear regression models, Bayes-A and Bayesian-LASSO (B-LASSO), as well as a machine learning algorithm, Random-Boosting (R-Boost), and BLUP using a realized genomic relationship matrix (G-BLUP), were compared. Five traits that are currently under selection in the Spanish Holstein population were used: milk yield, fat yield, protein yield, fat percentage, and udder depth. In total, genotypes from 1859 progeny-tested bulls were used. The training sets were composed of bulls born before 2005 (1601 bulls for production and 1574 bulls for type), whereas the testing sets contained 258 and 235 bulls born in 2005 or later for production and type, respectively. Deregressed proofs (DRP) from the January 2009 Interbull (Uppsala, Sweden) evaluation were used as the dependent variables for bulls in the training sets, whereas DRP from the December 2011 Interbull evaluation were used to compare genomic predictions with progeny test results for bulls in the testing set. Genomic predictions were more accurate than traditional pedigree indices for predicting future progeny test results of young bulls. The gain in accuracy due to inclusion of genomic data varied by trait and ranged from 0.04 to 0.42 Pearson correlation units. Results averaged across traits showed that B-LASSO had the highest accuracy, with an advantage of 0.01, 0.03, and 0.03 points in Pearson correlation compared with R-Boost, Bayes-A, and G-BLUP, respectively.
The B-LASSO predictions also showed the least bias (0.02, 0.03, and 0.10 SD units less than Bayes-A, R-Boost, and G-BLUP, respectively) as measured by the mean difference between genomic predictions and progeny test results. The R-Boost algorithm provided genomic predictions with regression coefficients closer to unity, which is an alternative measure of bias, for 4 out of 5 traits and also resulted in mean squared error estimates that were 2%, 10%, and 12% smaller than those of B-LASSO, Bayes-A, and G-BLUP, respectively. The observed prediction accuracy obtained with these methods was within the range of values expected for a population of similar size, suggesting that the prediction method and reference population described herein are appropriate for implementation of routine genome-assisted evaluations in Spanish dairy cattle. R-Boost is a competitive marker regression methodology in terms of predictive ability that can accommodate large data sets.
Jiménez-Montero JA
,González-Recio O
,Alenda R
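The four validation criteria named in the abstract above (Pearson correlation, mean difference in SD units, regression coefficient, and mean squared error) can be illustrated with a minimal sketch. The abstract does not give exact formulas, so two choices here are assumptions: the mean difference is scaled by the SD of the observed values, and the regression coefficient is the slope of observed on predicted values. The function name `validation_metrics` is hypothetical.

```python
from math import sqrt


def validation_metrics(predicted, observed):
    """Summarize agreement between genomic predictions and later progeny-test results."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Centered sums of squares and cross-products.
    spp = sum((p - mean_p) ** 2 for p in predicted)
    soo = sum((o - mean_o) ** 2 for o in observed)
    spo = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    # Assumption: bias is expressed in SD units of the observed values.
    sd_observed = sqrt(soo / (n - 1))
    return {
        "pearson_r": spo / sqrt(spp * soo),
        "mean_bias_sd": (mean_p - mean_o) / sd_observed,
        # Assumption: slope of observed on predicted; 1.0 means no over- or under-dispersion.
        "slope": spo / spp,
        "mse": sum((o - p) ** 2 for p, o in zip(predicted, observed)) / n,
    }
```

Calling `validation_metrics(predictions, drp)` on the testing-set bulls of one trait would return all four summaries at once, which makes it easy to tabulate methods side by side.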
《-》
-
Assets of imputation to ultra-high density for productive and functional traits.
The aim of this study was to evaluate different-density genotyping panels for genotype imputation and genomic prediction. Genotypes from customized Golden Gate Bovine3K BeadChip [LD3K; low-density (LD) 3,000-marker (3K); Illumina Inc., San Diego, CA] and BovineLD BeadChip [LD6K; 6,000-marker (6K); Illumina Inc.] panels were imputed to the BovineSNP50v2 BeadChip [50K; 50,000-marker; Illumina Inc.]. In addition, LD3K, LD6K, and 50K genotypes were imputed to a BovineHD BeadChip [HD; high-density 800,000-marker (800K) panel], and predictive ability was subsequently evaluated and compared. Comparisons of prediction accuracy were carried out using Random boosting and genomic BLUP. Four traits under selection in the Spanish Holstein population were used: milk yield, fat percentage (FP), somatic cell count, and days open (DO). Training sets at 50K density for imputation and prediction included 1,632 genotypes. Testing sets for imputation from LD to 50K contained 834 genotypes and testing sets for genomic evaluation included 383 bulls. The reference population genotyped at HD included 192 bulls. Imputation using BEAGLE software (http://faculty.washington.edu/browning/beagle/beagle.html) was effective for reconstruction of dense 50K and HD genotypes, even when a small reference population was used, with 98.3% of SNP correctly imputed. Random boosting outperformed genomic BLUP in terms of prediction reliability, mean squared error, and selection effectiveness of top animals in the case of FP. For other traits, however, no clear differences existed between methods. No differences were found between imputed LD and 50K genotypes, whereas evaluation of genotypes imputed to HD was, on average across data sets, methods, and traits, 4% more accurate than 50K prediction, and showed smaller (2%) mean squared error of predictions.
Similar bias in regression coefficients was found across data sets but regressions were 0.32 units closer to unity for DO when genotypes were imputed to HD density. Imputation to HD genotypes might produce higher stability in the genomic proofs of young candidates. Regarding selection effectiveness of top animals, more (2%) top bulls were classified correctly with imputed LD6K genotypes than with LD3K. When the original 50K genotypes were used, correct classification of top bulls increased by 1%, and when those genotypes were imputed to HD, 3% more top bulls were detected. Selection effectiveness could be slightly enhanced for certain traits such as FP, somatic cell count, or DO when genotypes are imputed to HD. Genetic evaluation units may consider a trait-dependent strategy in terms of method and genotype density for use in the genome-enhanced evaluations.
Jiménez-Montero JA
,Gianola D
,Weigel K
,Alenda R
,González-Recio O
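Two of the summaries reported in the abstract above, the share of SNP correctly imputed and the share of top bulls classified correctly, can be sketched as follows. The abstract does not define either computation, so this is an illustration under assumptions: genotypes are coded as allele counts (0, 1, 2), concordance is counted per genotype call, and "top" bulls are the best fraction by value. Both function names and the `fraction` parameter are hypothetical.

```python
def concordance_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotype calls (animals x SNP) where the imputed call equals the true call."""
    matches = 0
    total = 0
    for true_row, imputed_row in zip(true_genotypes, imputed_genotypes):
        for true_call, imputed_call in zip(true_row, imputed_row):
            total += 1
            if true_call == imputed_call:
                matches += 1
    return matches / total


def top_overlap(predicted, observed, fraction=0.2):
    """Share of the top bulls by observed value that are also top by predicted value."""
    n = len(predicted)
    k = max(1, round(n * fraction))
    # Indices of the k highest-ranked animals under each criterion.
    top_predicted = set(sorted(range(n), key=lambda i: predicted[i], reverse=True)[:k])
    top_observed = set(sorted(range(n), key=lambda i: observed[i], reverse=True)[:k])
    return len(top_predicted & top_observed) / k
```

A per-call concordance treats all SNP equally; a study might instead report allele-level or per-SNP rates, which would give different figures on the same data.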
《-》
-
Application of Bayesian least absolute shrinkage and selection operator (LASSO) and BayesCπ methods for genomic selection in French Holstein and Montbéliarde breeds.
Recently, the amount of available single nucleotide polymorphism (SNP) marker data has considerably increased in dairy cattle breeds, both for research purposes and for application in commercial breeding and selection programs. Bayesian methods are currently used in the genomic evaluation of dairy cattle to handle very large sets of explanatory variables with a limited number of observations. In this study, we applied 2 Bayesian methods, BayesCπ and Bayesian least absolute shrinkage and selection operator (LASSO), to 2 genotyped and phenotyped reference populations consisting of 3,940 Holstein bulls and 1,172 Montbéliarde bulls with approximately 40,000 polymorphic SNP. We compared the accuracy of the Bayesian methods for the prediction of 3 traits (milk yield, fat content, and conception rate) with pedigree-based BLUP, genomic BLUP, partial least squares (PLS) regression, and sparse PLS regression, a variable selection PLS variant. The results showed that the correlations between observed and predicted phenotypes were similar in BayesCπ (with or without pedigree information) and Bayesian LASSO for most of the traits, regardless of breed. In the Holstein breed, Bayesian methods led to higher correlations than other approaches for fat content and were similar to genomic BLUP for milk yield and to genomic BLUP and PLS regression for conception rate. In the Montbéliarde breed, no method dominated the others, except BayesCπ for fat content. The better performance of the Bayesian methods for fat content in the Holstein and Montbéliarde breeds is probably due to the effect of the DGAT1 gene. The SNP identified by the BayesCπ, Bayesian LASSO, and sparse PLS regression methods, based on their effect on the different traits of interest, were located at almost the same position on the genome.
Because the Bayesian methods resulted in regressions of direct genomic values on daughter trait deviations closer to 1 than the other methods tested in this study, Bayesian methods are suggested for genomic evaluations of French dairy cattle.
Colombani C
,Legarra A
,Fritz S
,Guillaume F
,Croiseau P
,Ducrocq V
,Robert-Granié C
《-》
-
Comparison of genomic predictions using medium-density (∼54,000) and high-density (∼777,000) single nucleotide polymorphism marker panels in Nordic Holstein and Red Dairy Cattle populations.
This study investigated genomic prediction using medium-density (∼54,000; 54K) and high-density marker panels (∼777,000; 777K), based on data from Nordic Holstein and Red Dairy Cattle (RDC). The Holstein data comprised 4,539 progeny-tested bulls, and the RDC data 4,403 progeny-tested bulls. The data were divided into reference data and test data using October 1, 2001, as a cut-off date (birth date of the bulls). This resulted in about 25% genotyped bulls in the Holstein test data and 20% in the RDC test data. For each breed, 3 data sets of markers were used to predict breeding values: (1) 54K data set with missing genotypes, (2) 54K data set where missing genotypes were imputed, and (3) imputed high-density (HD) marker data set created by imputing the 54K data to the HD data based on 557 bulls genotyped using a 777K single nucleotide polymorphism chip in Holstein, and 706 bulls in RDC. Based on the 3 marker data sets, direct genomic breeding values (DGV) for protein, fertility, and udder health were predicted using a genomic BLUP model (GBLUP) and a Bayesian mixture model with 2 normal distributions. Reliability of DGV was measured as squared correlations between deregressed proofs (DRP) and DGV corrected for reliability of DRP. Unbiasedness was assessed by regression of DRP on DGV, based on the bulls in the test data sets. Averaged over the 3 traits, reliability of DGV based on the HD markers was 0.5% higher than that based on the 54K data in Holstein, and 1.0% higher than that in RDC. In addition, the HD markers led to an improvement of unbiasedness of DGV. The Bayesian mixture model led to 0.5% higher reliability than the GBLUP model in Holstein, but not in RDC. Imputing missing genotypes in the 54K marker data did not improve genomic predictions for most of the traits.
Su G
,Brøndum RF
,Ma P
,Guldbrandtsen B
,Aamand GP
,Lund MS
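The reliability measure described in the abstract above (squared correlation between DRP and DGV, corrected for the reliability of DRP) can be sketched as below. The abstract does not state the form of the correction; dividing by the mean DRP reliability of the test bulls is a common choice and is assumed here. The function name `dgv_reliability` is hypothetical.

```python
def dgv_reliability(drp, dgv, drp_reliability):
    """Squared DRP-DGV correlation divided by the mean reliability of the DRP."""
    n = len(drp)
    mean_drp = sum(drp) / n
    mean_dgv = sum(dgv) / n
    # Centered sums of squares and cross-products.
    sxy = sum((x - mean_drp) * (y - mean_dgv) for x, y in zip(drp, dgv))
    sxx = sum((x - mean_drp) ** 2 for x in drp)
    syy = sum((y - mean_dgv) ** 2 for y in dgv)
    squared_correlation = sxy * sxy / (sxx * syy)
    # Assumption: the correction is a division by the average DRP reliability.
    mean_reliability = sum(drp_reliability) / n
    return squared_correlation / mean_reliability
```

The correction reflects that DRP are themselves noisy measures of true breeding value, so the raw squared correlation understates how well DGV predict it.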
《-》
-
The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets.
In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. 
The results showed that the modified boosting algorithm could be run in 1% of the time required by the original algorithm, with negligible differences in accuracy and bias. This modification may be used to speed up the computation of genome-assisted evaluations in large data sets such as those obtained from consortiums.
González-Recio O
,Jiménez-Montero JA
,Alenda R
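The procedure outlined in the abstract above (weak learners fitted one after another to the residuals of the current committee, each shrunk, and each chosen from a random subset of markers) can be sketched as follows. This is an illustration, not the authors' implementation: the abstract does not name the weak learner, so a single-marker least-squares regression is assumed, and the names `fit_random_boosting`, `predict`, `shrinkage`, and `marker_fraction` are hypothetical.

```python
import random


def fit_random_boosting(X, y, n_iter=200, shrinkage=0.1, marker_fraction=0.1, seed=0):
    """Boost single-marker regressions on residuals, sampling markers at each step."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    mean_y = sum(y) / n
    residuals = [yi - mean_y for yi in y]
    effects = [0.0] * p
    k = max(1, int(p * marker_fraction))  # markers examined per iteration
    for _ in range(n_iter):
        best = None
        for j in rng.sample(range(p), k):
            xc = [row[j] - col_means[j] for row in X]
            sxx = sum(v * v for v in xc)
            if sxx == 0.0:
                continue  # monomorphic marker carries no information
            b = sum(v * r for v, r in zip(xc, residuals)) / sxx
            gain = b * b * sxx  # reduction in residual sum of squares
            if best is None or gain > best[0]:
                best = (gain, j, b, xc)
        if best is None:
            continue
        _, j, b, xc = best
        # Add the shrunken weak learner and update the residuals it explains.
        effects[j] += shrinkage * b
        residuals = [r - shrinkage * b * v for r, v in zip(residuals, xc)]
    # Markers were centered during fitting, so fold the means into the intercept.
    intercept = mean_y - sum(e * m for e, m in zip(effects, col_means))
    return intercept, effects


def predict(X, intercept, effects):
    """Predicted response for each row of marker codes."""
    return [intercept + sum(e * x for e, x in zip(effects, row)) for row in X]
```

Setting `marker_fraction=1.0` examines every marker at each step, which corresponds to boosting without random marker sampling under these assumptions; smaller fractions cut the per-iteration cost roughly in proportion.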
《-》