
PanVA: Homology app data format v0.0.0

This file provides a specification for the data format as used by the PanVA Homology app.

We recommend running PanTools to construct a pangenome database and to obtain homology groups, sequence alignments, annotations, and metadata. We support this format and provide a pipeline to preprocess the PanTools data so that it is formatted correctly for PanVA. It may also be possible to obtain this data using other software, provided the output can be transformed to the format required by PanVA.

Below we describe the data directory structure and file names, formats, and data features of the files inside those directories.

Data directory structure

The app data directory contains the files that describe the full dataset.

Each homology group has its own subdirectory in the app data directory. The subdirectory has a unique name, corresponding to the homology_id identifier (the id property) in homologies.json, and contains the files belonging to that homology group.

/
+-- [other apps]
+-- homology/
    |-- homologies.json
    |-- [tree files]
    |
    +-- <homology_id>/
        |-- alignments.csv
        |-- sequences.csv
        |-- variable.csv
        |-- annotations.csv     (optional)
        |-- metadata.csv        (optional)
        |-- linkage_matrix.npy  (auto-generated)
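
The layout above can be checked before starting the app. The following is a minimal sketch, not part of PanVA itself: the function name `missing_group_files` is hypothetical, and it assumes that every subdirectory of the homology directory is a homology group and that the three files not marked optional or auto-generated are required.

```python
from pathlib import Path

# File names taken from the directory layout above (assumed to be required).
REQUIRED_FILES = ("alignments.csv", "sequences.csv", "variable.csv")


def missing_group_files(homology_dir):
    """Map each homology group directory name to the required files it lacks."""
    problems = {}
    for group_dir in sorted(Path(homology_dir).iterdir()):
        if not group_dir.is_dir():
            continue  # skip homologies.json and tree files
        missing = [name for name in REQUIRED_FILES if not (group_dir / name).is_file()]
        if missing:
            problems[group_dir.name] = missing
    return problems
```

An empty result means every group directory contains the three files; optional files are not checked.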

Dataset files

homologies.json

A list of objects, one for each homology group in the selected set. Each object has the following properties:

  • id: Unique id for the homology group (string).
  • members: Number of sequences (integer).
  • alignment_length: Length of the alignment (integer).
  • metadata: Metadata shown in the frontend (object).
    • key: Internal name (string).
    • value: Value to display (string, boolean, number, Array of strings).

An example array containing a single object:

[
    {
        "id": "13773385",
        "members": 197,
        "alignment_length": 996,
        "metadata": {
            "in_genomes": 48,
            "gene_names": ["GapA", "dnaX"],
            "classification": "single copy core",
            "var_sites": true,
            "inf_sites": false
        }
    }
]

Important: When using an array of strings as metadata, be sure to consistently use an array across all homologies.
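
The property types and the array-consistency rule can be checked mechanically. This is a hedged sketch: `check_homologies` is a hypothetical helper, not part of PanVA, and it covers only the properties listed above plus the requirement that a metadata key is either always or never an array.

```python
import json


def check_homologies(homologies):
    """Return a list of human-readable problems found in parsed homologies.json data."""
    problems = []
    list_valued = {}  # metadata key -> set of booleans: was the value a list?
    seen_ids = set()
    for entry in homologies:
        hid = entry.get("id")
        if not isinstance(hid, str):
            problems.append(f"id must be a string: {hid!r}")
        elif hid in seen_ids:
            problems.append(f"duplicate id: {hid}")
        else:
            seen_ids.add(hid)
        for field in ("members", "alignment_length"):
            value = entry.get(field)
            # bool is a subclass of int in Python, so exclude it explicitly
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{hid}: {field} must be an integer")
        for key, value in entry.get("metadata", {}).items():
            list_valued.setdefault(key, set()).add(isinstance(value, list))
    for key, kinds in sorted(list_valued.items()):
        if len(kinds) > 1:
            problems.append(f"metadata '{key}' is an array in some homologies but not in others")
    return problems


sample = json.loads(
    '[{"id": "13773385", "members": 197, "alignment_length": 996,'
    ' "metadata": {"gene_names": ["GapA", "dnaX"]}}]'
)
print(check_homologies(sample))  # prints []
```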

tree files (OPTIONAL)

The PanVA frontend can be configured to display one or more additional trees, such as a core SNP tree, a gene distance tree, or a k-mer distance tree.

  • The files are in Newick format, with distances and genome_nr as leaf names.
  • The tree files have .txt extensions and should be placed in the root of the app directory.
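
To confirm that a tree file uses genome_nr values as leaf names, the leaf labels can be pulled out of the Newick string. The sketch below is a simplification rather than a full Newick parser: it assumes unquoted labels, and both function names are hypothetical.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
_LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def newick_leaf_names(newick):
    """Return the leaf labels of a Newick string (unquoted labels only)."""
    return _LEAF.findall(newick)


def leaves_are_genome_numbers(newick):
    """True if every leaf label consists of digits only, as expected for genome_nr."""
    leaves = newick_leaf_names(newick)
    return bool(leaves) and all(name.isdigit() for name in leaves)


print(newick_leaf_names("((1:0.1,2:0.2):0.05,3:0.3);"))  # ['1', '2', '3']
```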

Homology group files

alignments.csv

This is a matrix of the aligned gene sequences and position-specific attributes. For example:

| mRNA_id | genome_nr | position | nucleotide |
|---|---|---|---|
| 97_1_FEDMPDKE_03607_mRNA | 97 | 1 | A |
| 97_1_FEDMPDKE_03607_mRNA | 97 | 2 | T |

  • mRNA_id: A unique identifier for each sequence in the homology group (string).
  • genome_nr: A unique ID for each genome sequence (integer).
  • position: The position in the alignment (integer).
  • nucleotide: The nucleotide value (string containing A, C, G, T, or -).

This file can be extended with additional metadata.
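
Because the file stores one row per sequence and position, each aligned sequence can be rebuilt by grouping the rows by mRNA_id. The sketch below assumes a comma-delimited file with a header row using the column names described above; `aligned_sequences` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def aligned_sequences(csv_file):
    """Rebuild one aligned string per mRNA_id from long-format alignment rows."""
    columns = defaultdict(dict)  # mRNA_id -> {position: nucleotide}
    for row in csv.DictReader(csv_file):
        columns[row["mRNA_id"]][int(row["position"])] = row["nucleotide"]
    return {
        mrna_id: "".join(by_pos[p] for p in sorted(by_pos))
        for mrna_id, by_pos in columns.items()
    }


example = io.StringIO(
    "mRNA_id,genome_nr,position,nucleotide\n"
    "97_1_FEDMPDKE_03607_mRNA,97,1,A\n"
    "97_1_FEDMPDKE_03607_mRNA,97,2,T\n"
)
print(aligned_sequences(example))  # {'97_1_FEDMPDKE_03607_mRNA': 'AT'}
```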

sequences.csv

The sequences extracted from the multiple sequence alignment. This file is used to generate dendrograms. For example:

| mRNA_id | nuc_trimmed_seq |
|---|---|
| 97_1_FEDMPDKE_03607_mRNA | ATGAGTTTTGATAATTCCCCACAATCACGCCTGATCCTAACCATGATGGGAGCC... |
| 87_1_JABOGBIO_03490_mRNA | ATGAGTTTTGATAATTCCCCACAATCACGCCTGATCCTAACCATGATGGGAGCC... |

  • mRNA_id: A unique identifier for each sequence in the homology group (string).
  • nuc_trimmed_seq: The nucleotide sequences, trimmed in PanTools (string).

variable.csv

Summary of all variable positions in the alignment and their value counts. This data is used for calculation of the conservation score at each aligned position. For example:

| position | informative | A C G T gap | pheno_specific |
|---|---|---|---|
| 1 | True | 0307000 | True |
| 9 | False | 1070990 | False |

  • position: The position in the alignment (integer).
  • informative: Whether the position is parsimony-informative in the nucleotide alignment (boolean).
  • A: Number of sequences containing nucleotide A (integer).
  • C: Number of sequences containing nucleotide C (integer).
  • G: Number of sequences containing nucleotide G (integer).
  • T: Number of sequences containing nucleotide T (integer).
  • gap: Number of sequences containing a gap (-) (integer).

This file can be extended with additional metadata.
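
For data that does not come from PanTools, rows of this shape can be derived from the aligned sequences. The sketch below is illustrative only and rests on assumptions the specification does not state: positions are 1-based, a position counts as variable when at least two different nucleotides occur, gaps are ignored when deciding whether a position is informative, and "parsimony-informative" uses the common definition (at least two nucleotides that each occur in at least two sequences). PanTools may apply different rules. `variable_positions` is a hypothetical name.

```python
from collections import Counter

STATES = ("A", "C", "G", "T")


def variable_positions(aligned):
    """Summarise variable alignment columns as dicts shaped like variable.csv rows.

    `aligned` is a list of equal-length aligned sequences.
    """
    rows = []
    for index, column in enumerate(zip(*aligned), start=1):
        counts = Counter(base.upper() for base in column)
        present = [s for s in STATES if counts[s] > 0]
        if len(present) < 2:
            continue  # assumption: variable means at least two different nucleotides
        # Common definition: two or more states that each occur in two or more sequences.
        informative = sum(1 for s in present if counts[s] >= 2) >= 2
        row = {"position": index, "informative": informative}
        row.update({s: counts[s] for s in STATES})
        row["gap"] = counts["-"]
        rows.append(row)
    return rows
```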

annotations.csv (OPTIONAL)

This optional file is only used for eukaryotic pangenomes. It specifies the gene models matched to each gene sequence of the reference genomes (those for which GFF files are available). For example:

| mRNA_id | position | cds | exon |
|---|---|---|---|
| 5_1_ATERI-1G45130.1 | 1 | True | False |
| 5_1_ATERI-1G45130.2 | 1 | False | False |

  • mRNA_id: A unique identifier for each sequence in the homology group (string).
  • position: The position in the alignment (integer).
  • cds: Whether the position has this feature in the nucleotide alignment (boolean).
  • exon: Whether the position has this feature in the nucleotide alignment (boolean).

This file can be extended with additional metadata. The frontend needs to be configured to display these annotations.

metadata.csv (OPTIONAL)

An optional CSV file containing metadata for each genome indicated by mRNA_id that should be included in the analysis. For example:

| mRNA_id | virulence | species |
|---|---|---|
| 97_1_FEDMPDKE_03607_mRNA | avirulent | P.brasiliense |
| 87_1_JABOGBIO_03490_mRNA | ? | P.brasiliense |

  • mRNA_id: A unique identifier for each sequence in the homology group (string).

This file can be extended with additional metadata.

linkage_matrix.npy (AUTO-GENERATED)

The linkage matrix for generating the initial clustering dendrogram, stored as a NumPy file. This file is generated once by the API and is used to improve application performance.

Important: Please delete this file if the contents of sequences.csv have changed.
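
The cleanup can be scripted. The sketch below is one possible approach and not how the PanVA API detects changes: it treats the matrix as stale when sequences.csv has a newer modification time, which is only a heuristic. `remove_stale_linkage_matrix` is a hypothetical helper.

```python
from pathlib import Path


def remove_stale_linkage_matrix(group_dir):
    """Delete linkage_matrix.npy if sequences.csv was modified after it.

    Returns True if a file was deleted.
    """
    group_dir = Path(group_dir)
    matrix = group_dir / "linkage_matrix.npy"
    sequences = group_dir / "sequences.csv"
    if not (matrix.is_file() and sequences.is_file()):
        return False
    if sequences.stat().st_mtime > matrix.stat().st_mtime:
        matrix.unlink()  # assumed to be regenerated by the API, as described above
        return True
    return False
```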

Additional metadata

Some files, specifically alignments.csv, metadata.csv, and variable.csv, can be extended with additional metadata columns. Values in each column should be of the same type (string, number, optional boolean).

The frontend needs to be configured to use these additional columns.
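
The same-type rule for a column can be checked before loading the data. The sketch below makes several assumptions that are not in the specification: files are comma-delimited with a header row, booleans are written as True or False (as in the examples above), anything `float()` accepts counts as a number, and empty cells are treated as missing. Both function names are hypothetical.

```python
import csv
import io


def classify(value):
    """Classify one CSV cell as 'boolean', 'number', or 'string'."""
    if value in ("True", "False"):
        return "boolean"
    try:
        float(value)
    except ValueError:
        return "string"
    return "number"


def mixed_type_columns(csv_file):
    """Return {column: sorted kinds} for columns whose non-empty cells mix types."""
    kinds = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            # Skip surplus cells (column is None) and empty or missing cells.
            if column is not None and value:
                kinds.setdefault(column, set()).add(classify(value))
    return {c: sorted(k) for c, k in kinds.items() if len(k) > 1}


bad = io.StringIO("mRNA_id,flag\na,True\nb,maybe\n")
print(mixed_type_columns(bad))  # {'flag': ['boolean', 'string']}
```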

Released under the GPL-3 License. Docs built with VitePress.