
File Formats

Josh Fogg edited this page Aug 7, 2024 · 7 revisions

RobustOCS works with a standard set of file formats to read in problem data. These can be generated through the AlphaSimR package for simulation experiments. If anything here is unclear, please do open an issue.

Input File Descriptions

Below are file descriptions for each of the problem variables, using the toy n = 4 problem as an example. All files have the standard text file .txt extension unless specified otherwise.

Candidate Sexes (S, D)

The sex of each candidate in the cohort is stored in a space-separated values file. Each row of the file has the integer index of a candidate in the first column and a string for the sex of that candidate in the second column. Male candidates (sires) are denoted by an M and female candidates (dams) are denoted by an F.

For example, the file contents

1 M
2 F
3 M
4 F

correspond to the index sets $\mathcal{S} = \lbrace 1, 3\rbrace$ and $\mathcal{D} = \lbrace 2, 4\rbrace$.
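As a minimal sketch of how such a file could be read, the snippet below splits the rows into the two index sets. The function name `parse_sexes` is hypothetical and is not part of RobustOCS; the example rows are passed in as a string rather than read from disk.

```python
def parse_sexes(lines):
    """Split candidate indices into sire and dam sets from 'index sex' rows."""
    sires, dams = set(), set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        index, sex = line.split()
        if sex == "M":
            sires.add(int(index))
        elif sex == "F":
            dams.add(int(index))
        else:
            raise ValueError(f"unknown sex label {sex!r}")
    return sires, dams


example = """1 M
2 F
3 M
4 F"""

S, D = parse_sexes(example.splitlines())
print(sorted(S), sorted(D))
```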

Expected Breeding Values (μ, μ̄)

EBVs, regardless of whether they are actual ($\boldsymbol{\mu}$) or average ($\boldsymbol{\bar\mu}$), are stored with one value per line. Unlike the sex file, there is no separate index column: the line number indicates the candidate index.

For example, the file contents

1
2
1
2

correspond to the vector $\boldsymbol{\mu} = {\left[1\ 2\ 1\ 2\right]}^{T}$.
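A short sketch of reading this format, assuming one numeric value per line and no header. The helper `parse_ebvs` is a hypothetical name used only for illustration.

```python
def parse_ebvs(lines):
    """Return the EBVs as a list; the line order gives the candidate order."""
    return [float(line) for line in lines if line.strip()]


mu = parse_ebvs("1\n2\n1\n2".splitlines())

# Candidates are numbered from one, so candidate i has EBV mu[i - 1].
for i, value in enumerate(mu, start=1):
    print(i, value)
```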

EBV Covariance (Ω)

The covariance matrix for expected breeding values (i.e. the $\Omega$ when $\boldsymbol{\mu}\sim N(\boldsymbol{\bar\mu}, \Omega)$) is stored in sparse matrix coordinate format (COO). This is another space-separated values file, where each line represents a non-zero value in the matrix. For a given entry $\omega_{ij}\neq0$ of $\Omega$, there's a line with first value $i$ (the row index of the entry), second value $j$ (the column index), and third value $\omega_{ij}$ itself.

Note that the file format indexes the matrix rows and columns from one, not from zero as Python does. For example, the file contents

1 1 0.11111111111111
2 2 0.11111111111111
3 3 4.0
4 4 4.0

correspond to the covariance matrix

$$ \Omega = \begin{bmatrix} \frac{1}{9} & 0 & 0 & 0 \\ 0 & \frac{1}{9} & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 4 \end{bmatrix}. $$
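The one-based indexing is the main thing to get right when reading this format from Python. The sketch below expands the COO rows above into a dense list-of-lists matrix. It assumes that every non-zero entry appears explicitly in the file (no entries are mirrored across the diagonal) and that the dimension `n` is known in advance; `parse_coo` is a hypothetical helper, not a RobustOCS function.

```python
def parse_coo(lines, n):
    """Build a dense n-by-n matrix from 'row col value' lines indexed from one."""
    matrix = [[0.0] * n for _ in range(n)]
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        i, j, value = line.split()
        # Shift the file's one-based indices to Python's zero-based indices.
        matrix[int(i) - 1][int(j) - 1] = float(value)
    return matrix


example = """1 1 0.11111111111111
2 2 0.11111111111111
3 3 4.0
4 4 4.0"""

omega = parse_coo(example.splitlines(), n=4)
for row in omega:
    print(row)
```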

Relationship Matrix (Σ)

There are two standard formats that RobustOCS can work with for the relationship matrix, and which one we use depends on how we choose to model relationships in the cohort. Firstly, we can model the relationships using a covariance matrix, in which case $\Sigma$ is stored in COO format (as for $\Omega$). For example, the file contents

1 1 1
2 2 1
3 3 1
4 4 1

correspond to the 4-by-4 identity matrix.

We can also model the relationships using a pedigree tree, which can be stored in a *.ped datafile. This is a CSV file with columns i, p, and q (the candidate's index and the indices of its two parents), where unknown parents are represented by a zero. When computing WNRM it doesn't matter whether each of $p$ and $q$ is a sire or dam, but the labelling must be such that $p < q$.

Consider an example with four candidates:

  1. a sire with unknown parentage,
  2. a dam parented by the first and an unknown dam,
  3. a dam parented by the first and the second,
  4. a dam parented by the first and the third.

We can represent these relationships using a digraph,

[Figure: this data represented as a pedigree tree]

which is associated with the pedigree file shown below.

i,p,q
1,0,0
2,1,0
3,1,2
4,1,3

Once the tree is loaded, RobustOCS generates $\Sigma$ as Wright's Numerator Relationship Matrix (WNRM) using Henderson's algorithm:

$$ \Sigma = \begin{bmatrix} 1 & 0.5 & 0.75 & 0.875 \\ 0.5 & 1 & 0.75 & 0.625 \\ 0.75 & 0.75 & 1.25 & 1 \\ 0.875 & 0.625 & 1 & 1.375 \end{bmatrix}. $$
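To see where these values come from, the sketch below applies the standard tabular recursion for a numerator relationship matrix to the example pedigree. It is an illustration only and is not necessarily how RobustOCS implements the computation. It assumes candidates are numbered 1 to n with every parent appearing before its offspring; `wnrm_from_pedigree` is a hypothetical name.

```python
import csv
import io


def wnrm_from_pedigree(rows):
    """Tabular recursion over (i, p, q) rows; a parent index of 0 means unknown."""
    rows = sorted(rows)  # process candidates in index order, parents first
    n = len(rows)
    A = [[0.0] * n for _ in range(n)]
    for i, p, q in rows:
        # Off-diagonal: average of the parents' relationships with candidate j.
        for j in range(1, i):
            a_pj = A[p - 1][j - 1] if p else 0.0
            a_qj = A[q - 1][j - 1] if q else 0.0
            A[i - 1][j - 1] = A[j - 1][i - 1] = 0.5 * (a_pj + a_qj)
        # Diagonal: one plus half the relationship between the two parents.
        A[i - 1][i - 1] = 1.0 + (0.5 * A[p - 1][q - 1] if p and q else 0.0)
    return A


pedigree_csv = """i,p,q
1,0,0
2,1,0
3,1,2
4,1,3
"""

reader = csv.DictReader(io.StringIO(pedigree_csv))
rows = [(int(r["i"]), int(r["p"]), int(r["q"])) for r in reader]
sigma = wnrm_from_pedigree(rows)
for row in sigma:
    print(row)
```

For instance, the entry for candidates 3 and 4 is the average of candidate 3's relationships with the parents of candidate 4, that is $\tfrac{1}{2}(0.75 + 1.25) = 1$.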

Output File Descriptions

Below are file descriptions for the output files, using the toy n = 4 problem as an example.

Solution Vector (w)

The solveROCS quick-start function has an optional parameter solution_output="filename" which, when used, saves the solution vector $\mathbf{w}$ to the CSV file filename.csv in the local directory. The file lists the optimal contribution for each candidate alongside the candidate's identifier, which is loaded along with the sex data. For the $n = 4$ example, this would produce the following file.

candidate,contribution
1,0.3822569445737661
2,0.3822569445664524
3,0.11774305542623399
4,0.11774305543354763
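A file like this can be read back with Python's standard csv module. The sketch below parses the example output and checks that, in this example, the sire contributions (candidates 1 and 3) and the dam contributions (candidates 2 and 4) each sum to one half up to floating-point error. The helper `parse_solution` is a hypothetical name, and the CSV text is embedded as a string rather than read from filename.csv.

```python
import csv
import io
import math

solution_csv = """candidate,contribution
1,0.3822569445737661
2,0.3822569445664524
3,0.11774305542623399
4,0.11774305543354763
"""


def parse_solution(text):
    """Map each candidate identifier to its contribution."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["candidate"]): float(row["contribution"]) for row in reader}


w = parse_solution(solution_csv)

# Index sets taken from the toy sex file: sires {1, 3}, dams {2, 4}.
sires, dams = {1, 3}, {2, 4}
sire_total = sum(w[i] for i in sires)
dam_total = sum(w[i] for i in dams)
print(math.isclose(sire_total, 0.5, abs_tol=1e-9),
      math.isclose(dam_total, 0.5, abs_tol=1e-9))
```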

Model File

By using the optional model_output="filename" parameter with any of the solver functions, RobustOCS creates an MPS file filename.mps. This file stores the model in a format that other optimization software can read, but it is not human-readable and so is unlikely to have any other use. For the $n = 4$ example, this would produce the following file.

NAME        robust-genetics
ROWS
 N  Obj     
 E  r0      
 E  r1      
 G  r2      
 G  r3      
 G  r4      
 G  r5      
 G  r6      
 G  r7      
 G  r8      
 G  r9      
 G  r10     
 G  r11     
 G  r12     
COLUMNS
    c0        Obj       -1
    c0        r0        1
    c0        r3        -0.05555555556
    c0        r4        -0.04761904762
    c0        r5        -0.03961550665
    c0        r6        -0.04446456755
    c0        r7        -0.04235026123
    c0        r8        -0.04290175015
    c0        r9        -0.04263010846
    c0        r10       -0.04249123841
    c0        r11       -0.04242101681
    c0        r12       -0.0424561939
    c1        Obj       -1
    c1        r1        1
    c1        r3        -0.05555555556
    c1        r4        -0.04761904762
    c1        r5        -0.03961550665
    c1        r6        -0.04446456755
    c1        r7        -0.04235026123
    c1        r8        -0.04290175015
    c1        r9        -0.04263010846
    c1        r10       -0.04249123841
    c1        r11       -0.04242101681
    c1        r12       -0.0424561939
    c2        Obj       -2
    c2        r0        1
    c2        r2        -2
    c2        r4        -0.2857142857
    c2        r5        -0.5738417606
    c2        r6        -0.3992755682
    c2        r7        -0.4753905958
    c2        r8        -0.4555369945
    c2        r9        -0.4653160956
    c2        r10       -0.4703154171
    c2        r11       -0.472843395
    c2        r12       -0.4715770195
    c3        Obj       -2
    c3        r1        1
    c3        r2        -2
    c3        r4        -0.2857142857
    c3        r5        -0.5738417606
    c3        r6        -0.3992755682
    c3        r7        -0.4753905958
    c3        r8        -0.4555369945
    c3        r9        -0.4653160956
    c3        r10       -0.4703154171
    c3        r11       -0.472843395
    c3        r12       -0.4715770196
    c4        Obj       1
    c4        r2        1.414213562
    c4        r3        0.2357022604
    c4        r4        0.2857142857
    c4        r5        0.4391994692
    c4        r6        0.3395559593
    c4        r7        0.3811586449
    c4        r8        0.3699825127
    c4        r9        0.3754615893
    c4        r10       0.3782821592
    c4        r11       0.3797133209
    c4        r12       0.3789959814
RHS
    RHS_V     r0        0.5
    RHS_V     r1        0.5
BOUNDS
 UP BOUND     c0        1
 UP BOUND     c1        1
 UP BOUND     c2        1
 UP BOUND     c3        1
QUADOBJ
    c0        c0        0.5
    c1        c1        0.5
    c2        c2        0.5
    c3        c3        0.5
ENDATA

Note that with SQP methods, the model file produced will be for the final model the solver constructed. This includes all of the constraints necessary to approximate the relaxed robust objective term.

Realistic Data Generation

As mentioned, the examples with 50, 1000, and 10,000 candidates are generated using AlphaSimR to have a realistic structure. The original simulation was of a 12,000 candidate cohort. Based on that simulated data we constructed:

  • $\boldsymbol{\bar\mu}$, a vector of length 12,000 computed as the posterior mean over 1000 samples of the expected breeding values,
  • $\Sigma$, a 12,000-by-12,000 matrix measuring co-ancestry between individuals based on the pedigree data,
  • $\Omega$, a 12,000-by-12,000 covariance matrix between those 1000 EBV samples.

For each of the 50, 1000, and 10,000 candidate examples, we then took the $n$ youngest individuals from the full generated cohort of 12,000. The implications of this non-random selection are unknown.