Multi-threshold (and Multi-marker) Association Study Analysis: MASA

Incorporating prior information into association studies in an optimal manner

MASA is a new association testing method that incorporates prior information to increase power. MASA is closely related to the concept of multiple testing. To obtain p-values corrected for multiple testing, the Bonferroni correction is usually used. However, this is equivalent to treating every test equally, even though markers differ in how likely they are to be causal. For example, some markers are more likely to be causal because they are proximal to functional elements, and some markers can be in tight LD with many putative causal variants. By taking this prior information into account, we can increase power to detect causal variants (Eskin 2008). This is equivalent to varying the significance threshold at each marker depending on the prior information for that marker (multi-thresholding).
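The idea of per-marker thresholds can be illustrated with a small Python sketch. It uses plain weighted Bonferroni (splitting the overall significance level in proportion to the priors), which is a simpler technique than MASA's own: MASA is described as incorporating priors optimally, which this proportional split does not attempt. The function names and the prior values are hypothetical.

```python
def bonferroni_thresholds(num_markers, alpha=0.05):
    """Equal per-marker thresholds: every test is treated identically."""
    return [alpha / num_markers] * num_markers


def weighted_thresholds(priors, alpha=0.05):
    """Split alpha across markers in proportion to their prior weights.

    This is weighted Bonferroni, shown only to illustrate multi-thresholding.
    It is NOT the threshold optimization performed by MASA.
    """
    if any(w < 0 for w in priors):
        raise ValueError("priors must be non-negative")
    total = sum(priors)
    if total <= 0:
        raise ValueError("at least one prior must be positive")
    return [alpha * w / total for w in priors]


# Hypothetical priors: the second marker is considered three times
# as likely to be causal as each of the others.
priors = [1.0, 3.0, 1.0, 1.0]
equal = bonferroni_thresholds(len(priors))
weighted = weighted_thresholds(priors)
for i, (e, w) in enumerate(zip(equal, weighted)):
    print(f"marker {i}: equal={e:.4f} weighted={w:.4f}")
```

In both schemes the thresholds sum to the same overall level; the weighted scheme simply makes it easier for the favored marker to reach significance and harder for the others.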

Moreover, extending that concept, we can devise a new multi-marker test (Darnell 2012). Our multivariate-normal (MVN)-based test is fundamentally different from traditional tests in that it is applied to all putative causal variants, such as all known variants in HapMap, not only to the collected markers. Again, the multi-thresholding technique is applied to optimally incorporate prior information.

Download

Masa.zip, containing the following:
- Masa.jar (Main method Java archive package file)
- SimulateCohort.jar (Auxiliary Java program that generates case/control cohorts)
- ENCODEbeagle: A directory containing HapMap PhaseII ENCODE region data which includes
  - A.B.beagle: Beagle format file of ENCODE region A of population B
  - A.B.marker: Beagle marker format file of ENCODE region A of population B
  - A.B.tag: RSIDs of tag SNPs in Affymetrix 500K chip, in ENCODE region A of population B
- cohort.beagle : an example case/control cohort file generated based on ENm010 region of CEU population

Version/bug info

v1.0.0 (2012-07-14) Initial version released

User's guide

MASA

Usage of MASA:

usage: java -jar Masa.jar [options]
    -cohort <FILE>                   Cohort file (case/control data) in Beagle format
    -maf_threshold <FLOAT>           Remove SNPs of MAF below this threshold both in reference and
                                     cohort (default=0.01)
    -marker <FILE>                   Marker file in Beagle marker file format
    -method <FILE>                   Multi-threshold association testing method ('eskin' or 'mvn')
                                     (default=eskin)
    -mvn_max_num_proxy <INT>         In MVN method, maximum number of proxies per tested putative
                                     causal SNP (default=20)
    -mvn_proxy_r_threshold <FLOAT>   In MVN method, select nearby SNP only if |r| is above this
                                     value (default=0.3)
    -out <FILE>                      Output file prefix (default='outFile')
    -permute <NUM>                   Perform permutation <NUM> times instead of assuming independent
                                     markers (required for MVN method)
    -prior <FILE>                    Prior information file
    -reference <FILE>                Phased reference data haplotype file in Beagle format
    -relative_risk <FLOAT>           Prior information of target relative risk (default=1.2)
    -seed <INT>                      Random number generator seed (default=0)
    -window <SIZE>                   Number of nearby SNPs to look up tags (default = 100)

Detailed description of MASA options:

Option	Description
`-cohort`	(Required) Case/control cohort file to be tested, in Beagle format. There must be an “A” row (phenotype) whose values are either 1 (unaffected) or 2 (affected).
`-maf_threshold`	SNPs whose MAF is below this threshold are removed from the analysis. The filter is applied to both the SNPs in the cohort and the SNPs in the reference. (default=0.01)
`-marker`	(Required) Beagle marker-format file, which must be a superset of the markers in the cohort file and the markers in the reference file.
`-method`	Which MASA method to use: “eskin” (Eskin 2008) or “mvn” (Darnell 2012). (default=eskin)
`-mvn_max_num_proxy`	In the MVN method, we test not only the tag SNPs but also every putative causal SNP in the reference file. Each test is based on the set of proxy (tag) SNPs correlated with the causal SNP. This parameter limits the number of proxies selected for each putative causal SNP tested. (default=20)
`-mvn_proxy_r_threshold`	In the MVN method, a proxy is selected for a putative causal SNP only if its correlation with the causal SNP is above this threshold. Note that this is the absolute value of r, not r-squared. (default=0.3)
`-out`	Output file prefix. (default='outFile')
`-permute`	Without this option, the tests are assumed to be independent, as in the Bonferroni correction (conservative). With this option, permutation is performed to obtain a more accurate corrected p-value. This option is required for the MVN method.
`-prior`	Prior information file containing the prior probability that each marker is causal. The file consists of N rows of non-negative float values. N must equal the number of markers in the marker file. Values can be greater than 1.0, since only the relative magnitude between markers matters.
`-reference`	(Required) Reference haplotype file in Beagle format, such as HapMap. The markers in the reference file must be a superset of the markers in the cohort file.
`-relative_risk`	Our method requires the effect size to be fixed at a specific value. It does not need to be the exact effect size of the true causal SNP; a roughly similar value has been shown to suffice. (default=1.2)
`-seed`	Random number generator seed (default=0)
`-window`	When we assign each putative causal SNP to its best tag SNP ('eskin' method), or when we select proxy (tag) SNPs for each putative causal SNP ('mvn' method), we search only among SNPs within this window size, measured in number of SNPs. (default=100)
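The `-prior` file format described above (N rows, one non-negative value per marker) can be produced with a short script. The following is a minimal Python sketch and not part of the MASA distribution. The helper names `count_markers` and `write_prior_file`, the demo file names, and the base positions in the demo marker file are hypothetical, and the sketch assumes the marker file holds one marker per non-blank line.

```python
import os
import tempfile


def count_markers(marker_path):
    """Count non-blank lines in the marker file (assumed one marker per line)."""
    with open(marker_path) as handle:
        return sum(1 for line in handle if line.strip())


def write_prior_file(priors, marker_path, out_path):
    """Write one non-negative prior per row, checking N against the marker file."""
    num_markers = count_markers(marker_path)
    if len(priors) != num_markers:
        raise ValueError(f"expected {num_markers} priors, got {len(priors)}")
    if any(p < 0 for p in priors):
        raise ValueError("priors must be non-negative")
    with open(out_path, "w") as handle:
        for p in priors:
            handle.write(f"{p}\n")


# Demo with a made-up three-marker file (positions are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    marker_path = os.path.join(tmp, "demo.marker")
    prior_path = os.path.join(tmp, "demo.prior")
    with open(marker_path, "w") as handle:
        handle.write("rs2462910 1000 A C\nrs774257 2000 G A\nrs774245 3000 A G\n")
    # Only relative magnitudes matter, so values above 1.0 are allowed.
    write_prior_file([1.0, 5.0, 1.0], marker_path, prior_path)
    with open(prior_path) as handle:
        written = handle.read().splitlines()
print(written)
```

A file written this way would then be passed to MASA with the `-prior` option.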

Example Running Command:

java -jar Masa.jar -reference ENCODEbeagle/ENm010.CEU.beagle -marker ENCODEbeagle/ENm010.CEU.marker -cohort cohort.beagle

Another Example Running Command using MVN method (Darnell 2012):

java -jar Masa.jar -reference ENCODEbeagle/ENm010.CEU.beagle -marker ENCODEbeagle/ENm010.CEU.marker -cohort cohort.beagle -method mvn -permute 10000 -out myoutputfile
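To make the MVN options `-mvn_proxy_r_threshold`, `-mvn_max_num_proxy`, and `-window` more concrete, the following Python sketch selects proxies for one putative causal SNP from reference haplotypes coded as 0/1 alleles. It is an illustration under stated assumptions, not MASA's actual code: the function names, the 0/1 coding, the interpretation of the window as a distance in SNP indices, and the ranking by |r| are all assumptions, and the haplotypes are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    return cov / math.sqrt(var_x * var_y)


def select_proxies(haplotypes, causal, tags, window=100,
                   r_threshold=0.3, max_num_proxy=20):
    """Return up to max_num_proxy tag indices near `causal`, strongest |r| first."""
    candidates = []
    for tag in tags:
        if abs(tag - causal) > window:
            continue  # outside the search window (counted in SNPs)
        r = pearson_r(haplotypes[causal], haplotypes[tag])
        if abs(r) > r_threshold:  # threshold on |r|, not r-squared
            candidates.append((abs(r), tag))
    candidates.sort(reverse=True)
    return [tag for _, tag in candidates[:max_num_proxy]]


# Made-up reference panel: one row per SNP, one column per haplotype.
haplotypes = [
    [0, 0, 1, 1, 0, 1, 0, 1],  # SNP 0: putative causal SNP
    [0, 0, 1, 1, 0, 1, 0, 1],  # SNP 1: identical to SNP 0 (r = 1)
    [1, 1, 0, 0, 1, 0, 1, 0],  # SNP 2: complement of SNP 0 (r = -1)
    [0, 1, 0, 1, 0, 1, 0, 1],  # SNP 3: moderately correlated (r = 0.5)
    [0, 0, 0, 0, 1, 1, 1, 1],  # SNP 4: uncorrelated (r = 0)
]
all_proxies = select_proxies(haplotypes, causal=0, tags=[1, 2, 3, 4])
top_two = select_proxies(haplotypes, causal=0, tags=[1, 2, 3, 4], max_num_proxy=2)
print(all_proxies, top_two)
```

Because the threshold is applied to |r|, the negatively correlated SNP 2 qualifies as a proxy just as SNP 1 does.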

Output File: There are three output files
- outFile.excludedSNPs_from_cohort : SNPs excluded from the cohort file because of low MAF
- outFile.excludedSNPs_from_reference : SNPs excluded from the reference file because of low MAF
- outFile.pvalues : Main result file which includes the following columns

Column name	Description
RSID	SNP rsID
ZSCORE	SNP z-score
STANDARD_UNCORRECTED_PVALUE	Pointwise SNP P-value based on z-score
STANDARD_CORRECTED_PVALUE	Bonferroni-corrected P-value
MASA_CORRECTED_PVALUE	Corrected p-value calculated by the MASA method ('eskin' or 'mvn'), either analytically or by permutation (if the `-permute` option is used)
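The relationship between the ZSCORE, STANDARD_UNCORRECTED_PVALUE, and STANDARD_CORRECTED_PVALUE columns can be sketched in Python as follows. This is an illustration, not MASA's code: it assumes the pointwise p-value is two-sided under a standard normal distribution, and the z-score and number of tests are made-up values.

```python
import math


def two_sided_pvalue(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(pvalue, num_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, pvalue * num_tests)


z = 4.0          # made-up z-score for one SNP
num_tests = 500  # made-up number of tested markers
p = two_sided_pvalue(z)
corrected = bonferroni(p, num_tests)
print(f"uncorrected={p:.3e} corrected={corrected:.3e}")
```

The MASA_CORRECTED_PVALUE column replaces the uniform Bonferroni multiplier with the prior-aware correction of the chosen method, so it cannot be reproduced by this simple formula.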

SimulateCohort

Usage of SimulateCohort:

usage: java -jar SimulateCohort.jar [options]
    -cohort_param <#CASE #CONTROL #COHORT>   Case size, control size, and number of cohorts
    -maf_threshold <FLOAT>                   Minimum MAF of a causal SNP that will be randomly
                                             selected (default=0.1)
    -marker <FILE>                           Marker file in Beagle marker file format
    -out <FILE>                              output file (default='cohort')
    -reference <FILE>                        Phased reference data haplotype file in Beagle format
    -relative_risk <FLOAT>                   Relative risk of causal SNP to simulate (default=1.2)
    -seed <INT>                              Random number generator seed (default=0)
    -tag <FILE>                              Tag file including rsids of tag SNPs

Detailed description of SimulateCohort options:

Option	Description
`-cohort_param`	(Required) This option specifies three simulation parameters. `#CASE` is the number of cases in the cohort. `#CONTROL` is the number of controls in the cohort. `#COHORT` is how many cohorts we want to simulate.
`-maf_threshold`	In the simulation, one causal SNP is picked at random. This parameter sets the minimum MAF of that causal SNP. (default=0.1)
`-marker`	(Required) Beagle marker-format file, which must be a superset of the markers in the reference file and the markers in the tag file.
`-out`	Output file. (default='cohort')
`-reference`	(Required) Reference haplotype file in Beagle format, such as HapMap. The markers in the reference file must be a superset of the markers in the tag file.
`-relative_risk`	The randomly selected causal SNP is assumed to have this effect size. (default=1.2)
`-seed`	Random number generator seed (default=0)
`-tag`	(Required) Tag SNP file containing a list of RSIDs. The simulator generates genotypes at the markers in this file.

Example Running Command: Generate 10 cohorts of 2000/2000 cases/controls

java -jar SimulateCohort.jar -reference ENCODEbeagle/ENm010.CEU.beagle -marker ENCODEbeagle/ENm010.CEU.marker -tag ENCODEbeagle/ENm010.CEU.tag -cohort_param 2000 2000 10

Output Files:
- cohort.N.bgl : Simulated cohort file in Beagle format, where N is between 1 and #COHORT. The file content looks like this:

# Simulation cohort in Beagle format
# Relative risk assumed: 1.500000
# Causal SNP assumed: rs28357162 (MAF: 0.983333, Index: 348)
# Base_position: 27022619
# Num of cases: 1000, Num of controls: 1000
I id IND0 IND0 IND1 IND1 IND2 IND2 IND3 ...
A disease 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
M rs2462910 A C A A C C C A C A C C A C ...
M rs774257 G G G G A A A G G G G G G G G ...
M rs774245 A A A A G G G A A A A A A A A ...
M rs774246 A A A A G G G A A A A A A A A ...
.........
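A cohort file in this layout can be sanity-checked before it is passed to MASA. The Python sketch below counts cases, controls, and markers, and verifies that the “A” row contains only the codes 1 and 2. It assumes whitespace-separated fields, `#` comment lines, and two columns per individual, as in the excerpt above; the function name `summarize_cohort` and the shortened example data are hypothetical.

```python
def summarize_cohort(lines):
    """Count cases, controls, and markers in Beagle-format cohort text."""
    cases = controls = markers = 0
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if fields[0] == "A":
            values = fields[2:]  # drop the row type and the trait name
            if len(values) % 2 or any(v not in ("1", "2") for v in values):
                raise ValueError("A row must hold 1/2 codes, two columns per individual")
            per_individual = values[0::2]  # first column of each individual
            cases = per_individual.count("2")     # 2 = affected
            controls = per_individual.count("1")  # 1 = unaffected
        elif fields[0] == "M":
            markers += 1
    return {"cases": cases, "controls": controls, "markers": markers}


# Shortened, made-up cohort with three individuals and two markers.
example = """\
# Simulation cohort in Beagle format
I id IND0 IND0 IND1 IND1 IND2 IND2
A disease 2 2 2 2 1 1
M rs2462910 A C A A C C
M rs774257 G G G G A A
""".splitlines()
summary = summarize_cohort(example)
print(summary)
```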

Publication

Gregory Darnell, Dat Duong, Buhm Han, Eleazar Eskin. “Incorporating prior information into association studies.” Bioinformatics (2012) 28 (12): i147-i153. Also in Proceedings of the Twentieth Annual Conference on Intelligent Systems for Molecular Biology (ISMB-2012).

Eleazar Eskin. “Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information.” Genome Research (2008) 18:653-660.

Contact

Buhm Han : buhmhan (AT) broadinstitute (DOT) org

Gregory Darnell : gbd343 (AT) gmail (DOT) com

Funding information

G.D., D.D., B.H. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and PO1-HL28481. B.H. is supported by the Samsung Scholarship.