*******************************************
*              Documentation              *
*******************************************
*               GEE2loc.fit               *
*                release 1                *
*******************************************
*                                         *
* Program to fit the one locus model and  *
* the two locus model by solving GEEs     *
*                                         *
* Joanna Biernacka (08/04)                *
*                                         *
* Department of Public Health Sciences    *
* University of Toronto                   *
*                                         *
* Based on the GENEFINDER program         *
* written by Chiu et al (Liang et al 2001)*
*******************************************

********************************************************************
* Note that in this program the 2-locus mean function
* is parameterized in terms of delta2 = (C1,C2,tau1,tau2)
* where C1 = E[S(tau1)|ASP] and C2 = E[S(tau2)|ASP].
* An alternative parameterization in terms of two "effect size"
* parameters, C1* and C2*, is possible (see Biernacka et al. 2005).
********************************************************************

The program fits both the 1-locus and the 2-locus model by the GEE
method. (Here "1-locus" means that exactly one disease gene is assumed
to lie in the region being studied. Similarly, under the 2-locus model
two linked disease genes are assumed to lie in this region.)

This program is based on the original GENEFINDER program (Liang et al.,
2001) for localizing a single disease gene in a region. It fits both
the one-locus model described by Liang et al. (2001) and the two-locus
model described by Biernacka et al. (2004).

To run the program, first compile it with a Fortran compiler by typing:

      f77 GEE2loc.fit.r1.f -o GEE2loc.fit.r1

Run the program by typing:

      GEE2loc.fit.r1 geeout

Before running GEE2loc.fit.r1 you will have to run Genehunter (or some
other program that estimates IBD sharing for all the sib pairs) and
generate the necessary input files.

You will need the following 3 input files to run the program:

      geein
      numsibs_file
      sharing_file

These 3 files are described below.

**********************************************
              The "geein" file
**********************************************

The geein file is the main input file and contains the following
entries, in this order (see the example file geein_ex1 for the exact
line layout):

      numsibs_file_name
      sharing_file_name
      number_of_loci
      number_of_ASPs
      epsilon
      maxint
      initial_tau
      initial_C
      initial_tau1
      initial_tau2
      initial_C1
      initial_C2
      marker1_position
      marker2_position
      .
      .
      .
      markerM_position

***************
    Details
***************

numsibs_file_name = name of the "numsibs" file described below
sharing_file_name = name of the "sharing" file described below
number_of_loci    = number of markers + 1
                    (number of loci in the 1-locus model)
number_of_ASPs    = total number of affected sib pairs in the analysis
epsilon           = value of the "epsilon" parameter used to smooth the
                    curve in the 1-locus model (see Liang et al 2001).
                    Usually set to 1.0.
maxint            = maximum number of iterations for the algorithm
initial_tau       = initial value of the location parameter in the
                    1-locus model
initial_C         = initial value of the excess IBD sharing parameter
                    in the 1-locus model
initial_tau1      = initial value of the first location parameter in
                    the 2-locus model
initial_tau2      = initial value of the second location parameter in
                    the 2-locus model
initial_C1        = initial value of the 1st excess IBD sharing
                    parameter in the 2-locus model
initial_C2        = initial value of the 2nd excess IBD sharing
                    parameter in the 2-locus model
marker1_position  = location of the first marker (in cM)
marker2_position  = location of the second marker (in cM)
.
.
.
markerM_position  = location of the last marker (in cM)

*************
   Example
*************

An example geein file (called geein_ex1) is included with the example
files.

**********************************************
             The "numsibs" file
**********************************************

As in the first release of GENEFINDER (Liang et al 2001), the "numsibs"
file contains two columns: family, number_of_ASPs.
The first column lists the families; the second gives the number of
affected sib pairs in each family.

Example: If there are 10 families, each consisting of 1 ASP except
families 2 and 6, which have 3 ASPs each, the file would be:

       1   1
       2   3
       3   1
       4   1
       5   1
       6   3
       7   1
       8   1
       9   1
      10   1

***See note below about generating this file***

**********************************************
             The "sharing" file
**********************************************

As in the first release of GENEFINDER (Liang et al 2001), the "sharing"
file contains four columns: family_number, ASP_number, marker_number,
(estimated)_IBD_sharing.

Example: If there are 3 families, each consisting of 1 ASP, and
3 markers, the file may look as follows:

      1   1   1   1.000
      1   1   2   2.000
      1   1   3   2.000
      2   1   1   0.900
      2   1   2   1.000
      2   1   3   1.000
      3   1   1   1.780
      3   1   2   2.000
      3   1   3   1.900

***See note below about generating this file***

**********************************************************************
Note: A SAS program for generating the numsibs and sharing files from
Genehunter output is available.
**********************************************************************

********************************************************************
            Generating the numsibs and sharing files
********************************************************************

You will first need to run Genehunter (Kruglyak et al. 1996) to get
marker IBD sharing estimates for all affected sib pairs. (You may use
other software; however, the SAS program for creating the input files
described below works with Genehunter output.)

Run Genehunter and obtain IBD sharing estimates at all markers (i.e.
use the commands "increment step 1" and "dump ibd").

Depending on the version of Genehunter used, you may need to replace
the dashes between the two sibs in the ibd.dump file with spaces.
Then modify the fileprep.sas program as needed and run it. It uses the
original pedigree file and the ibd.dump file output by Genehunter to
create the numsibs and sharing files needed to run GEE2loc.fit.

Once you have the 3 files (numsibs, sharing, and geein), you can run
the GEE2loc program as described above.

**************************************************************************
                          Additional Comments
**************************************************************************

We have observed that for some data sets the estimation algorithm
implemented in this program is quite sensitive to the initial values
provided by the user for the parameters, particularly for the putative
disease gene locations. We recommend that initial values be chosen
carefully, based on prior belief about the potential gene locations.
For example, initial values can be chosen by looking at linkage plots
(e.g. the NPL curve from the Genehunter analysis that has to be carried
out prior to using GEE2loc). Also, particularly when there is no
obvious choice for the initial values, we recommend trying several sets
of initial values and comparing the final results. (Generally, when
there is little evidence for the presence of two linked disease genes,
the estimates can vary considerably with the initial values used.)

******************************************
                 Output
******************************************

The output generated by GEE2loc should be fairly self-explanatory.

The "robust standard error estimates" are obtained using the so-called
robust (sandwich) variance estimator. (See Liang and Zeger, 1986,
Biometrika 73: 13-22.)

The "convergence codes" are:

   1 = convergence to a solution via the iterative algorithm was
       attained.

Other convergence codes indicate that full convergence was not
attained. In some cases, additional steps are taken to approximate a
solution ("step-size" reduction followed, if necessary, by a grid
search near the solution).

Other convergence codes:

   2 = "periodic" non-convergence, i.e. the iterative algorithm gets
       stuck iterating back and forth between two points in the
       parameter space. A grid search for the solution is undertaken.
   3 = although convergence is not attained, the estimate has changed
       little over the last few iterations. A grid search for the
       solution is then performed.
   4 = at least one parameter is out of bounds at the end of the
       iterative algorithm.
   5 = no convergence (and the algorithm does not appear to be
       approaching a point of convergence).

In cases 4 and 5 a solution is not reported.

******************************************
               References
******************************************

If you use the software for any publication please reference:

Biernacka JM, Sun L, Bull SB (2004) Simultaneous localization of two
   linked disease susceptibility genes. Genet Epidemiol. Published
   Online Oct 12 2004. In press.

Additional References:

Liang K-Y, Chiu YF, Beaty TH (2001a) A robust identity-by-descent
   procedure using affected sib-pairs: multipoint mapping for complex
   diseases. Human Heredity 51: 64-78.

Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996) Parametric and
   nonparametric linkage analysis: A unified multipoint approach.
   The American Journal of Human Genetics 58: 1347-1363.

********************************************
            Contact Information
********************************************

If you have any questions regarding GEE2loc please contact:

   Joanna Biernacka
   email: biernac@fisher.utstat.utoronto.ca
          or biernac@mshri.on.ca
   phone: 1-416-586-4800 x 7539
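
********************************************
     Example: checking the input files
********************************************

The numsibs and sharing files described above have to agree with each
other and with the counts entered in geein (number_of_ASPs, and
number_of_loci = number of markers + 1). The following is a minimal
Python sketch of such a cross-check. It is not part of GEE2loc or of
fileprep.sas. The function names (parse_rows, summarize_inputs) are
hypothetical, and the checks themselves are assumptions about what a
well-formed input looks like: every ASP has one sharing value per
marker, and every estimated IBD sharing value lies between 0 and 2.

```python
from collections import defaultdict


def parse_rows(text, n_columns):
    """Split whitespace-separated rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != n_columns:
            raise ValueError(f"expected {n_columns} columns: {line!r}")
        rows.append(fields)
    return rows


def summarize_inputs(numsibs_text, sharing_text):
    """Cross-check the two files; return counts to compare with geein."""
    # numsibs columns: family, number_of_ASPs
    asps_per_family = {int(f): int(n) for f, n in parse_rows(numsibs_text, 2)}

    # sharing columns: family_number, ASP_number, marker_number, IBD_sharing
    markers_seen = defaultdict(set)  # (family, ASP) -> markers with a value
    for f, a, m, ibd in parse_rows(sharing_text, 4):
        if not 0.0 <= float(ibd) <= 2.0:  # assumed valid range
            raise ValueError(f"IBD sharing {ibd} is outside [0, 2]")
        markers_seen[(int(f), int(a))].add(int(m))

    # Assumption: every ASP has a sharing value at every marker.
    all_markers = set().union(*markers_seen.values())
    for pair, markers in markers_seen.items():
        if markers != all_markers:
            raise ValueError(f"family/ASP {pair} lacks some markers")

    # The number of distinct ASPs per family must match numsibs.
    pairs_in_sharing = defaultdict(int)
    for family, _asp in markers_seen:
        pairs_in_sharing[family] += 1
    if dict(pairs_in_sharing) != asps_per_family:
        raise ValueError("numsibs and sharing disagree on ASPs per family")

    return {
        "number_of_ASPs": sum(asps_per_family.values()),
        "number_of_markers": len(all_markers),
        "number_of_loci": len(all_markers) + 1,  # markers + 1, as in geein
    }


# The 3-family, 3-marker sharing example from this document.
NUMSIBS = "1 1\n2 1\n3 1\n"
SHARING = """\
1 1 1 1.000
1 1 2 2.000
1 1 3 2.000
2 1 1 0.900
2 1 2 1.000
2 1 3 1.000
3 1 1 1.780
3 1 2 2.000
3 1 3 1.900
"""

if __name__ == "__main__":
    print(summarize_inputs(NUMSIBS, SHARING))
    # {'number_of_ASPs': 3, 'number_of_markers': 3, 'number_of_loci': 4}
```

The returned counts can be compared by eye with the number_of_ASPs and
number_of_loci entries in geein before running the program.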