Marker-assisted backcrossing

Readings:

Frisch M, Bohn M, Melchinger AE (1999) Comparison of selection strategies for marker-assisted backcrossing of a gene. Crop Sci 39:1295-1301

Young ND, Tanksley SD (1989) RFLP analysis of the size of chromosomal segments retained around the Tm-2 locus of tomato during backcross breeding. Theor Appl Genet 77:353-359

Traditional backcross programs and linkage drag

Traditional backcrossing programs are planned on the assumption that the expected proportion of the recurrent parent genome recovered after $t$ generations of backcrossing is $1 - (1/2)^{t+1}$. Thus, after four backcrosses, we expect to recover $1 - (1/2)^5 = 0.969$, or 96.9%, of the recurrent parent genome. Any specific backcross progeny, however, will deviate from this expectation because of chance and because of linkage between the selected donor-parent gene and nearby genes.

A good example of the surprising amount of linkage drag that accompanies backcross breeding programs was reported by Young and Tanksley (1989), who genotyped the chromosome carrying the Tm-2 disease resistance gene in several tomato cultivars that were developed by introgressing the gene from a wild relative, L. peruvianum, via backcross breeding. They found that even cultivars developed from 20 backcrosses contained introgressed segments as large as 4 cM, and one cultivar developed from 11 backcrosses still contained the entire chromosome arm carrying the gene from the donor parent!

Marker-assisted backcrossing

Marker-assisted backcrossing may improve the efficiency of backcross breeding in three ways:

(1) If the phenotype of the desired gene from the donor parent is not easily assayed, BC progeny possessing a marker allele from the donor parent at a locus near the target gene can be selected with a good probability of carrying the gene.
(2) Markers can be used to select against BC progeny with larger amounts of donor parent germplasm in the genome outside of the target region.
(3) Markers can be used to select rare progeny that result from recombination near the target gene, thus minimizing the effects of linkage drag.

In short, markers give a good estimate of how much of the recurrent parent genome has been recovered in any particular BC progeny, so the best BC progeny available in each generation can be selected. This ability to select for recurrent parent genotype outside of the target locus can greatly reduce the number of generations required to develop lines that possess the desired gene but are otherwise nearly isogenic to the recurrent parent.

A breeder who selects for the desired marker allele linked to a gene of interest over several backcross generations should realize that a recombination may occur
between the marker locus and the gene, resulting in loss of the desired allele at the target gene during the breeding program. What is the probability of this happening?

To model this program, let us use the notation developed in the QTL analysis section: there are two alleles at the marker locus, $M_1$ and $M_2$, and two alleles at the target gene, $Q_1$ and $Q_2$. $M_1$ is linked in coupling with $Q_1$ and in repulsion with $Q_2$. $Q_2$ is the target allele that we want to backcross into the recurrent parent, which carries $Q_1$. The F1 of this backcrossing program has the genotype

$$\frac{M_1 \quad Q_1}{M_2 \quad Q_2}$$

with recombination frequency $r$ between the marker locus and the target gene, and it produces gametes in the proportions shown in Table 1.

Table 1. Gametes produced by an F1 heterozygous at both a QTL and a marker locus.

Gamete        Frequency
$M_1Q_1$      $(1/2)(1-r)$
$M_1Q_2$      $(1/2)r$
$M_2Q_1$      $(1/2)r$
$M_2Q_2$      $(1/2)(1-r)$

When this F1 is backcrossed to parent 1 (the recurrent parent, $M_1M_1Q_1Q_1$), the resulting genotype frequencies depend only on the F1 gamete frequencies, because there is no segregation from the recurrent parent (Table 2):

Table 2. BC1F1 genotype frequencies for a marker locus linked to a target gene.

Genotype            Frequency
$M_1M_1Q_1Q_1$      $(1/2)(1-r)$
$M_1M_1Q_1Q_2$      $(1/2)r$
$M_1M_2Q_1Q_1$      $(1/2)r$
$M_1M_2Q_1Q_2$      $(1/2)(1-r)$

Our aim is to select the $Q_1Q_2$ plants in the BC1F1 generation, and we hope to do this by selecting the $M_1M_2$ plants. What is the probability that by selecting an $M_1M_2$ plant, we lose the target allele (i.e., we choose a $Q_1Q_1$ plant)? The probability is simply:

$$P(Q_1Q_1 \mid M_1M_2) = \frac{(1/2)r}{1/2} = r$$

Thus, if the recombination frequency between marker locus and target gene is 10%, there is a 10% chance that selecting one plant on the basis of marker genotype alone will result in losing
the desired allele. Using a marker that is tightly linked to the target gene is therefore critical for the success of marker-assisted backcrossing.

To make the point more dramatically, what if $t$ generations of backcrossing are conducted, and only a single plant is selected in each backcross generation based on its marker genotype? In each generation, the probability of not losing the target allele is $1 - r$, so the probability of not losing the target allele over $t$ generations is $(1 - r)^t$. Therefore, the probability of losing the target allele after $t$ generations of backcrossing is $1 - (1 - r)^t$. Using the example of 10% recombination between marker and target gene again, the probability of losing the target allele after five generations of backcrossing is $1 - (0.9)^5 = 0.41$!

One could reduce this risk by maintaining larger populations at each generation of backcrossing, but doing so requires the population sizes to grow geometrically each generation, and the amount of work increases substantially. Furthermore, if more than one target gene is to be backcrossed, the problem becomes worse: for two target genes on different linkage groups, the probability of losing at least one of the two target alleles in each generation is $1 - (1 - r_1)(1 - r_2)$. Assuming 10% recombination between each target gene and its nearest marker locus ($r_1 = r_2 = 0.1$), the probability of losing at least one target allele per generation becomes $1 - (0.9)(0.9) = 0.19$!

The best way to avoid losing target alleles during backcrossing is to use marker loci flanking the target gene. In each backcross generation, one would select only those plants that carry the donor parent allele at both marker loci. To show how this improves the probability of keeping the target allele, we use the model developed previously for interval mapping. There are two marker loci, $M_A$ and $M_B$, with recombination frequencies $r_A$ and $r_B$, respectively, between themselves and the target locus, $Q$.
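As a brief numerical aside before the flanking-marker genotypes are written out, the single-marker results above can be checked with a few lines of Python. This is only a minimal sketch: the function names `loss_probability` and `loss_any_target` are hypothetical (they do not come from the readings), and the code assumes exactly what the text assumes, namely one plant selected per generation on marker genotype alone and target genes on different linkage groups.

```python
def loss_probability(r, t=1):
    """Chance the target allele is lost within t backcross generations.

    Assumes one plant is selected per generation on a single marker whose
    recombination frequency with the target gene is r: 1 - (1 - r)**t.
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination frequency must lie in [0, 0.5]")
    if t < 0:
        raise ValueError("number of generations must be non-negative")
    return 1.0 - (1.0 - r) ** t


def loss_any_target(rs):
    """Per-generation chance of losing at least one of several target alleles.

    Assumes each target gene is on a different linkage group and is tracked
    by its own single marker with recombination frequency r_i.
    """
    keep_all = 1.0
    for r in rs:
        keep_all *= 1.0 - r  # probability of keeping this target allele
    return 1.0 - keep_all


if __name__ == "__main__":
    for r in (0.01, 0.05, 0.10):
        print(f"r = {r:.2f}: loss after 5 backcrosses = {loss_probability(r, 5):.3f}")
    print(f"two targets, r1 = r2 = 0.10: {loss_any_target([0.1, 0.1]):.2f}")
```

With r = 0.10 the script reproduces the 0.41 and 0.19 figures given above; trying smaller values of r shows how quickly the risk falls as the marker moves closer to the target gene.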
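The F1 is represented as:

$$\frac{M_{A1} \quad Q_1 \quad M_{B1}}{M_{A2} \quad Q_2 \quad M_{B2}}$$

where $r_A$ is the recombination frequency between $M_A$ and $Q$, and $r_B$ is the recombination frequency between $Q$ and $M_B$. We can compute the frequencies of the genotypes in the BC1F1 generation from the gametes produced by the F1, assuming no interference (Table 3):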
Table 3. BC1F1 genotype frequencies using marker loci flanking the target gene.

Genotype                                   Frequency
$M_{A1}M_{A1}\,Q_1Q_1\,M_{B1}M_{B1}$       $(1/2)(1-r_A)(1-r_B)$
$M_{A1}M_{A1}\,Q_1Q_2\,M_{B1}M_{B1}$       $(1/2)r_Ar_B$
$M_{A1}M_{A2}\,Q_1Q_1\,M_{B1}M_{B1}$       $(1/2)r_A(1-r_B)$
$M_{A1}M_{A2}\,Q_1Q_2\,M_{B1}M_{B1}$       $(1/2)(1-r_A)r_B$
$M_{A1}M_{A1}\,Q_1Q_1\,M_{B1}M_{B2}$       $(1/2)(1-r_A)r_B$
$M_{A1}M_{A1}\,Q_1Q_2\,M_{B1}M_{B2}$       $(1/2)r_A(1-r_B)$
$M_{A1}M_{A2}\,Q_1Q_1\,M_{B1}M_{B2}$       $(1/2)r_Ar_B$
$M_{A1}M_{A2}\,Q_1Q_2\,M_{B1}M_{B2}$       $(1/2)(1-r_A)(1-r_B)$
Total                                      1

Only $M_{A1}M_{A2}\,M_{B1}M_{B2}$ plants are selected for further backcrossing. The frequency of such plants is $(1/2)r_Ar_B + (1/2)(1-r_A)(1-r_B) = r_Ar_B + (1/2)(1 - (r_A + r_B))$. The probability of losing the target allele after selecting on flanking markers is the probability of selecting an $M_{A1}M_{A2}\,Q_1Q_1\,M_{B1}M_{B2}$ plant, given that one has selected an $M_{A1}M_{A2}\,M_{B1}M_{B2}$ plant:

$$P(M_{A1}M_{A2}\,Q_1Q_1\,M_{B1}M_{B2} \mid M_{A1}M_{A2}\,M_{B1}M_{B2}) = \frac{\tfrac{1}{2}r_Ar_B}{r_Ar_B + \tfrac{1}{2}(1 - r_A - r_B)} = \frac{r_Ar_B}{1 - r_A - r_B + 2r_Ar_B}$$

This equation looks gnarly, but the point is that this probability is lower than the probability of losing the target allele based on selection for a single marker locus. For example, if the flanking markers each have 10% recombination frequency with the target locus, the probability of losing the target allele after a single generation is (1/2)(0.1)(0.1)/[(0.1)(0.1)+(1/2)(1-0.1-0.1)] = 0.005/0.41 = 0.012. This is much better than before!

One final point about marker-assisted backcrossing concerns the type of marker used. A dominant marker, such as a RAPD marker that is scored as either present or absent, can be used to track the target allele in a backcrossing program. If the dominant presence allele is linked to the target allele from the donor parent, this will work fine. On the other hand, consider the situation in which the desired allele is linked to the recessive absence allele. In the F1, the plants will be heterozygous for both target gene and marker gene, and will exhibit the RAPD band.
In the BC1F1 generation, what will be the segregation ratio at the marker locus? Determining this should make you realize that using such a marker in repulsion phase with the target allele is close to a waste of time.

Comparison of marker-assisted backcrossing strategies for recovery of recurrent parent genotype
Frisch et al. (1999) compared several backcrossing strategies in terms of how quickly they recovered a large proportion of the recurrent parent genotype (RPG in their terminology). They based their simulations on a maize genetic map (n = 10) with markers spaced about 20 cM apart, with two larger gaps. They also assumed that the target locus could be scored directly (via phenotyping or with a marker completely linked to the target gene). They compared four different selection strategies:

Two-step selection:
1. Select individuals carrying the target allele.
2. Select one individual that is homozygous for recurrent parent genotype at most loci (across whole genome) among those that remain.

In the first BC generation, after the target locus is scored, we expect $(1/2)n_1$ individuals to remain ($n_1$ is the number of individuals in the first generation). Each of these individuals is genotyped at all $m$ marker loci for the second selection step, so $(1/2)mn_1$ marker data points (MDPs) must be collected. After one individual is selected, it is backcrossed again to make a BC2F1 population, and the selection steps are repeated. First, the target locus is scored, leaving $(1/2)n_2$ individuals to be genotyped. These individuals are then genotyped only at the background marker loci that were still heterozygous (not yet fixed for the recurrent parent allele) in the selected parent of the previous generation; loci already homozygous for the recurrent parent allele cannot segregate and need not be scored again. Therefore, the total number of MDPs required depends on how similar the selected BC1F1 plant was to the recurrent parent, which varied among simulation runs. Likewise, the number of loci genotyped in the BC3 generation depends on how many markers are already homozygous in the selected BC2 plant, and so on.

Three-step selection:
1. Select individuals carrying the target allele.
2. Select individuals homozygous for recurrent parent genotype at loci flanking the target locus.
3. Select one individual that is homozygous for recurrent parent genotype at most loci (across whole genome) among those that remain.

Four-step selection:
1. Select individuals carrying the target allele.
2. Select individuals homozygous for recurrent parent genotype at loci flanking the target locus.
3. Select individuals homozygous for recurrent parent genotype at remaining loci on the same chromosome as the target allele.
4. Select one individual that is homozygous for recurrent parent genotype at most loci (across whole genome) among those that remain.

For each selection scheme, they compared population sizes from 20 to 200 plants per backcross generation. They also compared schemes that maintained a total of 300 plants across three generations but allocated different proportions of the total to different generations. The results of their simulations are presented in Tables 3 and 4 of their paper. They compared alternative selection schemes in terms of the 10th percentile (Q10) of the proportion of recurrent parent genotype. Thus, a value of 75%
indicates that there is a 90% chance that the recovered progeny will have at least 75% of the recurrent parent genotype.

Table 3 main results:
1. Increasing the number of individuals genotyped each generation had little effect. For example, increasing the number of genotyped individuals from 20 to 40 in the two-stage selection scheme increased Q10 from 76.7% to 78.7% in the BC1 and from 98.7% to 98.9% in the BC5. Increasing the number of genotyped individuals from 20 to 200 increased Q10 only to 82.2% in BC1 and to 99% in BC5. The gain in RPG is small, but the number of MDPs (the amount of work) rises by about the same factor as the number of plants (factors of 2 or 10)!
2. With marker-assisted selection (MAS), recovery of 97% or more of the RPG was accomplished in one or two fewer generations than with traditional selection (compare Table 3 to Table 2).
3. Far fewer marker data points are required for 3- and 4-stage selection than for 2-stage selection to get nearly the same recovery of RPG. Thus, the 3- and 4-stage selection procedures are more efficient.

Table 4 main results:
1. In a 2-stage selection program, increasing population sizes with each generation is most efficient. This is because with each backcross, half of the remaining heterozygous unlinked loci become homozygous for the recurrent parent allele essentially for free. So, you can allow the natural effect of backcrossing to convert most of the unlinked loci to RPG until the last generation, at which point you can grow larger populations to identify the rare individuals most similar to the recurrent parent.
2. Fewer marker data points are required for 3- and 4-stage selection procedures than for 2-stage selection to get nearly the same recovery of RPG. As above! The difference is that sampling larger populations in later generations is not as beneficial with these selection schemes as with the 2-stage scheme.
In fact, for the 4-stage procedure, efficiency (in terms of % RPG recovery per data point) is greatest when you sample larger populations in the first BC generation! The most efficient procedure for the 3-stage scheme is a 1:3:5 ratio of progeny sampled in generations BC1, BC2, and BC3.

General conclusions:
1. The 4-stage sampling strategy is probably the most efficient procedure in general.
2. With 4-stage sampling and reasonable population sizes each generation (50 to 100), one can expect to find a BC3 progeny with at least 96% RPG with 90% probability. It would take 6 generations of traditional backcrossing to reach this stage (not to mention a larger probability of linkage drag around the target gene).
3. So, if reducing the number of generations is most important, marker-assisted backcrossing makes sense, even if it costs more.
4. Increasing the number of markers genotyped each generation had little effect. So, once the threshold of about one marker per 20 cM is reached, additional markers are a waste of money (except perhaps around the target locus). The frequency of recombination, not the number of markers, is the more important limiting factor in reducing linkage drag; this suggests that sampling larger populations with fewer markers makes more sense than the reverse.
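For comparison with these conclusions, the baseline expectation for traditional backcrossing from the first section, $1 - (1/2)^{t+1}$, can be tabulated directly. The sketch below is illustrative only: `expected_rpg` and `backcrosses_needed` are hypothetical helper names, and the calculation gives the genome-wide average with no selection and no allowance for linkage drag. It is not the 10th percentile (Q10) criterion used in the simulations, so the generation counts it returns are presumably more optimistic than the Q10-based figures quoted above.

```python
def expected_rpg(t):
    """Expected recurrent parent proportion after t backcrosses: 1 - (1/2)**(t + 1).

    t = 0 corresponds to the F1 (expected proportion 0.5).
    """
    if t < 0:
        raise ValueError("number of backcrosses must be non-negative")
    return 1.0 - 0.5 ** (t + 1)


def backcrosses_needed(target):
    """Smallest number of backcrosses whose expected proportion reaches target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target proportion must lie in [0, 1)")
    t = 0
    while expected_rpg(t) < target:
        t += 1
    return t


if __name__ == "__main__":
    for t in range(1, 7):
        print(f"BC{t}: expected RPG = {expected_rpg(t):.4f}")
    print("backcrosses for an expected 96% RPG:", backcrosses_needed(0.96))
```

Because the expectation is a mean, roughly half of unselected progeny fall below it in any generation, which is one reason a 90%-probability criterion takes more generations to satisfy without markers.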