Skip to content

Preparing Data

zhengxwen edited this page Dec 20, 2014 · 7 revisions

Preparing Data

1. Data formats used in SNPRelate

To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for storing annotation data and SNP genotypes. In this format each byte encodes up to four SNP genotypes thereby reducing file size and access time. The GDS format supports data blocking so that only the subset of data that is being processed needs to reside in memory. GDS formatted data is also designed for efficient random access to large data sets.

> # Load the R packages: gdsfmt and SNPRelate
> library(gdsfmt)
> library(SNPRelate)

Here is a typical GDS file:

> snpgdsSummary(snpgdsExampleFileName())
The file name: /library/SNPRelate/extdata/hapmap_geno.gds 
The total number of samples: 279 
The total number of SNPs: 9088 
SNP genotypes are stored in SNP-major mode (Sample X SNP).

snpgdsExampleFileName() returns the file name of a GDS file used as an example in SNPRelate, and it is a subset of data from the HapMap project and the samples were genotyped by the Center for Inherited Disease Research (CIDR) at Johns Hopkins University and the Broad Institute of MIT and Harvard University (Broad). snpgdsSummary() summarizes the genotypes stored in the GDS file. "Individual-major mode" indicates listing all SNPs for an individual before listing the SNPs for the next individual, etc. Conversely, "SNP-major mode" indicates listing all individuals for the first SNP before listing all individuals for the second SNP, etc. Sometimes "SNP-major mode" is more computationally efficient than "individual-major model". For example, the calculation of genetic covariance matrix deals with genotypic data SNP by SNP, and then "SNP-major mode" should be more efficient.

> # Open a GDS file
> (genofile <- snpgdsOpen(snpgdsExampleFileName()))
File: /library/SNPRelate/extdata/hapmap_geno.gds
+ 	[  ] *
|--+ sample.id	{ FStr8 279 ZIP(23.10%) }
|--+ snp.id	{ Int32 9088 ZIP(34.76%) }
|--+ snp.rs.id	{ FStr8 9088 ZIP(42.66%) }
|--+ snp.position	{ Int32 9088 ZIP(94.73%) }
|--+ snp.chromosome	{ UInt8 9088 ZIP(0.94%) } *
|--+ snp.allele	{ FStr8 9088 ZIP(14.45%) }
|--+ genotype	{ Bit2 279x9088 } *
|--+ sample.annot	[ data.frame ] *
|  |--+ sample.id	{ FStr8 279 ZIP(23.10%) }
|  |--+ family.id	{ FStr8 279 ZIP(28.37%) }
|  |--+ father.id	{ FStr8 279 ZIP(12.98%) }
|  |--+ mother.id	{ FStr8 279 ZIP(12.86%) }
|  |--+ sex	{ FStr8 279 ZIP(28.32%) }
|  |--+ pop.group	{ FStr8 279 ZIP(7.89%) }

The output lists all variables stored in the GDS file. At the first level, it stores variables sample.id, snp.id, etc. The additional information are displayed in the braces indicating data type, size, compressed or not + compression ratio. The second-level variables sex and pop.group are both stored in the folder of sample.annot. All of the functions in SNPRelate require a minimum set of variables in the annotation data. The minimum required variables are

  • sample.id, a unique identifier for each sample.
  • snp.id, a unique identifier for each SNP.
  • snp.position, the base position of each SNP on the chromosome, and 0 for unknown position; it does not allow NA.
  • snp.chromosome, an integer or character mapping for each chromosome. Integer: numeric values 1-26, mapped in order from 1-22, 23=X, 24=XY (the pseudoautosomal region), 25=Y, 26=M (the mitochondrial probes), and 0 for probes with unknown positions; it does not allow NA. Character: "X", "XY", "Y" and "M" can be used here, and a blank string indicating unknown position.
  • genotype, a SNP genotypic matrix. SNP-major mode: n_sample X n_snp, individual-major mode: n_snp X n_sample.

Users can define the numeric chromosome codes which are stored with the variable snp.chromosome as its attributes when snp.chromosome is numeric only. For example, snp.chromosome has the attributes of chromosome coding:

> # Get the attributes of chromosome coding
> get.attr.gdsn(index.gdsn(genofile, "snp.chromosome"))
$autosome.start
[1] 1
$autosome.end
[1] 22
$X
[1] 23
$XY
[1] 24
$Y
[1] 25
$M
[1] 26
$MT
[1] 26

autosome.start is the starting numeric code of autosomes, and autosome.end is the last numeric code of autosomes. put.attr.gdsn can be used to add a new attribute or modify an existing attribute.

There are four possible values stored in the variable genotype: 0, 1, 2 and 3. For bi-allelic SNP sites, "0" indicates two B alleles, "1" indicates one A allele and one B allele, "2" indicates two A alleles, and "3" is a missing genotype. For multi-allelic sites, it is a count of the reference allele (3 meaning no call). "Bit2" indicates that each byte encodes up to four SNP genotypes since one byte consists of eight bits.

> # Take out genotype data for the first 3 samples and the first 5 SNPs
> (g <- read.gdsn(index.gdsn(genofile, "genotype"), start=c(1,1), count=c(5,3)))
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    1    1    0
[3,]    2    1    1
[4,]    2    1    1
[5,]    0    0    0

Or take out genotype data with sample and SNP IDs, and four possible values are returned 0, 1, 2 and NA (3 is replaced by NA):

> g <- snpgdsGetGeno(genofile, sample.id=..., snp.id=...)

Get the attribute of genotype:

> get.attr.gdsn(index.gdsn(genofile, "genotype"))
$sample.order
NULL

The returned value could be either snp.order or sample.order, indicating individual-major mode (snp is the first dimension) and SNP-major mode (sample is the first dimension) respectively.

> # Take out snp.id
> head(read.gdsn(index.gdsn(genofile, "snp.id")))
[1] 1 2 3 4 5 6

> # Take out snp.rs.id
> head(read.gdsn(index.gdsn(genofile, "snp.rs.id")))
[1] "rs1695824"  "rs13328662" "rs4654497"  "rs10915489" "rs12132314"
[6] "rs12042555"

There are two additional variables:

  • snp.rs.id, a character string for reference SNP ID that may not be unique.
  • snp.allele, it is not necessary for the analysis, but it is necessary when merging genotypes from different platforms. The format of snp.allele is "A allele/B allele", like "T/G" where T is A allele and G is B allele.

The information of sample annotation can be obtained by the same function read.gdsn. For example, population information. FStr8 indicates a character-type variable.

> # Read population information
> pop <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
> table(pop)
pop
CEU HCB JPT YRI 
 92  47  47  93 

> # Close the GDS file
> snpgdsClose(genofile)

2. Create a GDS File of Your Own

snpgdsCreateGeno

The function snpgdsCreateGeno can be used to create a GDS file. The first argument should be a numeric matrix for SNP genotypes. There are possible values stored in the input genotype matrix: 0, 1, 2 and other values. "0" indicates two B alleles, "1" indicates one A allele and one B allele, "2" indicates two A alleles, and other values indicate a missing genotype. The SNP matrix can be either n_sample X n_snp (snpfirstdim=FALSE, the argument in snpgdsCreateGeno) or n_snp X n_sample (snpfirstdim=TRUE).

For example,

> # Load data
> data(hapmap_geno)
> # Create a gds file
> snpgdsCreateGeno("test.gds", genmat = hapmap_geno$genotype,
    sample.id = hapmap_geno$sample.id, snp.id = hapmap_geno$snp.id,
    snp.chromosome = hapmap_geno$snp.chromosome,
    snp.position = hapmap_geno$snp.position,
    snp.allele = hapmap_geno$snp.allele, snpfirstdim=TRUE)

> # Open the GDS file
> (genofile <- snpgdsOpen("test.gds"))
File: /Users/sts/Documents/Codes/Rpackages/SNPRelate_Sweave/output/test.gds
+ 	[  ] *
|--+ sample.id	{ VStr8 279 ZIP(29.89%) }
|--+ snp.id	{ VStr8 1000 ZIP(42.42%) }
|--+ snp.position	{ Float64 1000 ZIP(55.97%) }
|--+ snp.chromosome	{ Int32 1000 ZIP(2.00%) }
|--+ snp.allele	{ VStr8 1000 ZIP(13.85%) }
|--+ genotype	{ Bit2 1000x279 } *

> # Close the GDS file
> snpgdsClose(genofile)

Uses of the Functions in the gdsfmt Package

In the following code, the functions createfn.gds, add.gdsn, put.attr.gdsn, write.gdsn and index.gdsn are defined in the package gdsfmt:

# Create a new GDS file
newfile <- createfn.gds("your_gds_file.gds")

# add a flag
put.attr.gdsn(newfile$root, "FileFormat", "SNP_ARRAY")

# Add variables
add.gdsn(newfile, "sample.id", sample.id)
add.gdsn(newfile, "snp.id", snp.id)
add.gdsn(newfile, "snp.position", snp.position)
add.gdsn(newfile, "snp.chromosome", snp.chromosome)
add.gdsn(newfile, "snp.allele", c("A/G", "T/C", ...))

#####################################################################
# Create a snp-by-sample genotype matrix

# Add genotypes
var.geno <- add.gdsn(newfile, "genotype",
    valdim=c(length(snp.id), length(sample.id)), storage="bit2")

# Indicate the SNP matrix is snp-by-sample
put.attr.gdsn(var.geno, "snp.order")

# Write SNPs into the file sample by sample
for (i in 1:length(sample.id))
{
    g <- ...
    write.gdsn(var.geno, g, start=c(1,i), count=c(-1,1))
}


#####################################################################
# OR, create a sample-by-snp genotype matrix

# Add genotypes
var.geno <- add.gdsn(newfile, "genotype",
    valdim=c(length(sample.id), length(snp.id)), storage="bit2")

# Indicate the SNP matrix is sample-by-snp
put.attr.gdsn(var.geno, "sample.order")

# Write SNPs into the file sample by sample
for (i in 1:length(snp.id))
{
    g <- ...
    write.gdsn(var.geno, g, start=c(1,i), count=c(-1,1))
}


# Get a description of chromosome codes
#   allowing to define a new chromosome code, e.g., snpgdsOption(Z=27)
option <- snpgdsOption()
var.chr <- index.gdsn(newfile, "snp.chromosome")
put.attr.gdsn(var.chr, "autosome.start", option$autosome.start)
put.attr.gdsn(var.chr, "autosome.end", option$autosome.end)
for (i in 1:length(option$chromosome.code))
{
    put.attr.gdsn(var.chr, names(option$chromosome.code)[i],
        option$chromosome.code[[i]])
}

# Add your sample annotation
samp.annot <- data.frame(sex = c("male", "male", "female", ...),
    pop.group = c("CEU", "CEU", "JPT", ...), ...)
add.gdsn(newfile, "sample.annot", samp.annot)

# Add your SNP annotation
snp.annot <- data.frame(pass=c(TRUE, TRUE, FALSE, FALSE, TRUE, ...), ...)
add.gdsn(newfile, "snp.annot", snp.annot)


# Close the GDS file
closefn.gds(newfile)

3. Format conversion from PLINK binary files

The SNPRelate package provides a function snpgdsBED2GDS for converting a PLINK binary file to a GDS file:

> # The PLINK BED file, using the example in the SNPRelate package
> bed.fn <- system.file("extdata", "plinkhapmap.bed", package="SNPRelate")
> fam.fn <- system.file("extdata", "plinkhapmap.fam", package="SNPRelate")
> bim.fn <- system.file("extdata", "plinkhapmap.bim", package="SNPRelate")

Or, uses your own PLINK files:

> bed.fn <- "C:/your_folder/your_plink_file.bed"
> fam.fn <- "C:/your_folder/your_plink_file.fam"
> bim.fn <- "C:/your_folder/your_plink_file.bim"
> # Convert
> snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds")
Start snpgdsBED2GDS ...
	BED file: "/library/SNPRelate/extdata/plinkhapmap.bed" in the individual-major mode (SNP X Sample)
	FAM file: "/library/SNPRelate/extdata/plinkhapmap.fam", DONE.
	BIM file: "/library/SNPRelate/extdata/plinkhapmap.bim", DONE.
Thu Jul 17 17:48:04 2014 	store sample id, snp id, position, and chromosome.
	start writing: 279 samples, 5000 SNPs ...
 	Thu Jul 17 17:48:04 2014	0%
 	Thu Jul 17 17:48:04 2014	100%
Thu Jul 17 17:48:04 2014 	Done.

> # Summary
> snpgdsSummary("test.gds")
The file name: /Users/sts/Documents/Codes/Rpackages/SNPRelate_Sweave/output/test.gds 
The total number of samples: 279 
The total number of SNPs: 5000 
SNP genotypes are stored in individual-major mode (SNP X Sample).

4. Format conversion from VCF files

The SNPRelate package provides a function snpgdsVCF2GDS to reformat a VCF file. There are two options for extracting markers from a VCF file for downstream analyses: (1) to extract and store dosage of the reference allele only for biallelic SNPs and (2) to extract and store dosage of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.

> # The VCF file, using the example in the SNPRelate package
> vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")

Or, uses your own VCF file:

> vcf.fn <- "C:/your_folder/your_vcf_file.vcf"
> # Reformat
> snpgdsVCF2GDS(vcf.fn, "test.gds", method="biallelic.only")
VCF format ---> GDS SNP format:
Parsing "/library/SNPRelate/extdata/sequence.vcf" ...
	Import 2 variants.
Optimize the access efficiency ...
Clean up the fragments of GDS file:
	open the file "test.gds" (size: 2011).
	# of fragments in total: 22.
	save it to "test.gds.tmp".
	rename "test.gds.tmp" (size: 1987).
	# of fragments in total: 20.

> # Summary
> snpgdsSummary("test.gds")
The file name: /Users/sts/Documents/Codes/Rpackages/SNPRelate_Sweave/output/test.gds 
The total number of samples: 3 
The total number of SNPs: 2 
SNP genotypes are stored in SNP-major mode (Sample X SNP).