PanVA: Homology app data format v0.0.0
This file provides a specification for the data format as used by the PanVA Homology app.
We recommend to run PanTools to construct a pangenome database and run functionality to obtain homology groups, sequence alignments, annotations and metadata. We support this format and provide a pipelines to preprocess the PanTools data such that it is formatted correctly for PanVA. It may also be possible to obtain this data using other software if the output can be transformed to the format required for PanVA.
Below we describe the data directory structure and file names, formats, and data features of the files inside those directories.
Data directory structure
The app data directory contains files containing data regarding the full dataset.
Each homology group is a subdirectory of the app data directory and has a unique name, corresponding to the homology_id
identifier in homologies.json
, and contains the files belonging to that homology group.
/
+-- [other apps]
+-- homology/
|-- homologies.json
|-- [tree files]
|
+-- <homology_id>/
|-- alignments.csv
|-- sequences.csv
|-- variable.csv
|-- annotations.csv (optional)
|-- metadata.csv (optional)
|-- linkage_matrix.npy (auto-generated)
Dataset files
homologies.json
A list of objects representing all homology ids of the selected set. Homology id objects have the following properties:
id
: Unique id for the homology group (string).members
: Number of sequences (integer).alignment_length
: Length of the alignment (integer).metadata
: Metadata shown in the frontend (object).key
: Internal name (string).value
: Value to display (string, boolean, number, Array of strings).
An example object in the array:
[
{
"id": "13773385",
"members": 197,
"alignment_length": 996,
"metadata": {
"in_genomes": 48,
"gene_names": ["GapA", "dnaX"],
"classification": "single copy core",
"var_sites": true,
"inf_sites": false
}
}
]
Important: When using an array of strings as metadata, be sure to consistently use an array across all homologies.
tree files
(OPTIONAL)
The PanVA frontend can be configured to display one or more additional trees, such as a Core SNP tree, gene dinstance or kmer distance tree.
- The files are in Newick format, with distances and
genome_nr
as leaf names. - The tree files have
.txt
extensions should be placed in the root of the app directory.
Homology group files
alignments.csv
This is a matrix of the aligned gene sequences and position specific attributes. For example:
mRNA_id | genome_nr | position | nucleotide |
---|---|---|---|
97_1_FEDMPDKE_03607_mRNA | 97 | 1 | A |
97_1_FEDMPDKE_03607_mRNA | 97 | 2 | T |
mRNA_id
: A unique identifier for each sequence in the homology group (string).genome_nr
: A unique ID for each genome sequence (integer).position
: The position in the alignment (integer).nucleotide
: The nucleotide value (string containing A, C, G, T, or -).
This file can be extended with additional metadata.
sequences.csv
The sequences extracted from the multiple sequence alignment. This file is used to generate dendrograms. For example:
mRNA_id | nuc_trimmed_seq |
---|---|
97_1_FEDMPDKE_03607_mRNA | ATGAGTTTTGATAATTCCCCACAATCACGCCTGATCCTAACCATGATGGGAGCC... |
87_1_JABOGBIO_03490_mRNA | ATGAGTTTTGATAATTCCCCACAATCACGCCTGATCCTAACCATGATGGGAGCC... |
mRNA_id
: A unique identifier for each sequence in the homology group (string).nuc_trimmed_seq
: The nucleotide sequences, trimmed in PanTools (string).
variable.csv
Summary of all variable positions in the alignment and their value counts. This data is used for calculation of the conservation score at each aligned position. For example:
position | informative | A | C | G | T | gap | pheno_specific |
---|---|---|---|---|---|---|---|
1 | True | 0 | 30 | 70 | 0 | 0 | True |
9 | False | 1 | 0 | 70 | 99 | 0 | False |
position
: The position in the alignment (integer).informative
: Is position parsimony informative in nucleotide alignment (boolean).A
: Number of sequences containing nucleotide A (integer).C
: Number of sequences containing nucleotide C (integer).G
: Number of sequences containing nucleotide G (integer).T
: Number of sequences containing nucleotide T (integer).gap
: Number of sequences containing a gap (-) (integer).
This file can be extended with additional metadata.
annotations.csv
(OPTIONAL)
This optional file is only used for Eukaryotic pangenomes. It specifies the gene models matched to each gene sequences of reference genomes (for which GFF files are available). For example:
mRNA_id | position | cds | exon |
---|---|---|---|
5_1_ATERI-1G45130.1 | 1 | True | False |
5_1_ATERI-1G45130.2 | 1 | False | False |
mRNA_id
: A unique identifier for each sequence in the homology group (string).position
: The position in the alignment (integer).cds
: Does position have this feature in nucleotide alignment (boolean).exon
: Does position have this feature in nucleotide alignment (boolean).
This file can be extended with additional metadata. The frontend needs to be configured to display these annotations.
metadata.csv
(OPTIONAL)
An optional CSV file containing metadata for each genome indicated by mRNA_id
that should be included in the analysis. For example:
mRNA_id | virulence | species |
---|---|---|
97_1_FEDMPDKE_03607_mRNA | avirulent | P.brasiliense |
87_1_JABOGBIO_03490_mRNA | ? | P.brasiliense |
mRNA_id
: A unique identifier for each sequence in the homology group (string).
This file can be extended with additional metadata.
linkage_matrix.npy
(AUTO-GENERATED)
The linkage matrix for generating the initial clustering dendrogram, stored as NumPy file. This file is generated once by the API and is used to improve application performance.
Important: Please delete this file if the contents of sequences.csv
has changed.
Additional metadata
Some files, specifically alignments.csv
, metadata.csv
, and variable.csv
, can be extended with additional metadata columns. Values in each column should be of the same type (string, number, optional boolean).
The frontend needs to be configured to use these additional columns.