Custom Training Sets

PhiSpy can be trained on your own data to improve predictions for specific organisms or clades.

Why Create Training Sets?

Create custom training sets when:

  • Your organism is distantly related to the organisms in the reference datasets

  • You have manually curated prophage data

  • You want organism-specific predictions

  • The default training sets give poor predictions for your genomes

Requirements

To create training sets, you need:

  1. Annotated GenBank files with prophage regions marked

  2. Prophage markers: every CDS feature inside a prophage region must carry the /is_phage="1" qualifier

  3. At least one genome (more is better for diversity)

Marking Prophage Features

Use the mark_prophage_features.py script to add /is_phage="1" qualifiers to prophage proteins.

Input Format

The script accepts a tab-delimited file with four columns:

  1. Path to GenBank file

  2. Replicon ID

  3. Prophage start coordinate

  4. Prophage end coordinate

Example input file (prophages.txt):

genome1.gb    NC_002737    529631    569288
genome1.gb    NC_002737    778642    820599
genome2.gb    NC_003028    123456    145678
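
Before running the script, it can help to sanity-check the coordinate file. The following is a minimal sketch and not part of PhiSpy: check_prophage_table is a hypothetical helper that assumes four tab-separated columns and that each start coordinate is smaller than its end coordinate.

```shell
# check_prophage_table FILE
# Hypothetical helper (not shipped with PhiSpy). Reports rows that do not look like
#   path <TAB> replicon <TAB> start <TAB> end
# Assumption: start < end for every prophage.
# Returns non-zero if any row is flagged.
check_prophage_table() {
    awk -F '\t' '
        NF != 4 {
            printf "line %d: expected 4 columns, found %d\n", NR, NF
            bad = 1; next
        }
        $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
            printf "line %d: coordinates must be integers\n", NR
            bad = 1; next
        }
        $3 + 0 >= $4 + 0 {
            printf "line %d: start %d is not below end %d\n", NR, $3, $4
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_prophage_table prophages.txt prints nothing and returns 0 when every row passes. A file that uses spaces instead of tabs is reported as having a single column.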

Usage

mark_prophage_features.py -i prophages.txt -o marked_genomes/

This updates the GenBank files with /is_phage="1" qualifiers for all CDS features within the specified prophage regions.
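
A quick way to confirm that marking worked is to count the qualifiers in each output file. This is a sketch under assumptions: count_phage_marks is a hypothetical helper, and it assumes the qualifier appears in the file exactly as /is_phage="1".

```shell
# count_phage_marks DIR
# Hypothetical helper: for each regular file in DIR, print
#   <number of lines containing /is_phage="1"> <TAB> <file path>
# Assumption: the qualifier is written exactly as /is_phage="1".
count_phage_marks() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
        n=$(grep -c '/is_phage="1"' "$f" || true)
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

Running count_phage_marks marked_genomes/ lists one count per file. A count of 0 for a genome that should contain prophages may indicate that its replicon ID or coordinates did not match any CDS feature.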

Creating Training Sets

Use make_training_sets.py to generate training sets from marked GenBank files.

Basic Usage

make_training_sets.py -d marked_genomes/ -g groups.txt

The script:

  1. Reads marked GenBank files

  2. Generates phage-specific and bacteria-specific k-mer sets

  3. Calculates training features

  4. Creates training set files

Parameters

-d INDIR, --indir INDIR

Path to directory containing marked GenBank files for training.

-g GROUPS, --groups GROUPS

Path to groups file mapping GenBank files to training set names.

If not provided, each file gets its own training set.

--use_taxonomy

Use taxonomy information from GenBank files to create groups.

Files without taxonomy are assigned to the “Bacteria” group.

-k KMER_SIZE, --kmer_size KMER_SIZE

Size of k-mers to generate. Default: 12

For the codon approach, use a multiple of 3.

-t KMERS_TYPE, --kmers_type KMERS_TYPE

K-mer generation method:

  • simple: Slice the sequence into consecutive k-mers starting at the first position

  • all: All possible k-mers (default)

  • codon: K-mers taken with a step of 3 nucleotides
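
The three modes can be illustrated with a small sketch. This is one reading of the descriptions above, not PhiSpy's own implementation: it assumes simple means consecutive non-overlapping windows, all means a sliding window with a step of 1, and codon means a sliding window with a step of 3. The kmers function is hypothetical.

```shell
# kmers SEQ K TYPE
# Hypothetical illustration: print the k-mers of SEQ, one per line.
#   simple -> non-overlapping windows from the first position (step K)
#   codon  -> windows starting on codon boundaries (step 3)
#   all    -> every window (step 1); also used for any other TYPE value
kmers() {
    awk -v seq="$1" -v k="$2" -v type="$3" 'BEGIN {
        step = (type == "simple") ? k : ((type == "codon") ? 3 : 1)
        # awk strings are 1-based; stop when the window would run past the end.
        for (i = 1; i + k - 1 <= length(seq); i += step)
            print substr(seq, i, k)
    }'
}
```

With the 9-nucleotide sequence ATGCGTAAA and k = 6, this sketch yields one k-mer for simple, four for all, and two (ATGCGT and CGTAAA) for codon, which shows why the choice of mode changes the size of the resulting k-mer set.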

--phmms PHMMS

HMM profile database for additional feature calculation.

--threads THREADS

Number of threads for HMM searches.

--retrain

Trigger complete reanalysis when reference files have changed.

Use when you’ve modified /is_phage="1" qualifiers.

--absolute_retrain

Ignore PhiSpy’s default reference genomes.

Train only on your provided data.

Groups File Format

The groups file has two tab-separated columns:

  1. GenBank filename (with extension)

  2. Group name

Example groups.txt:

genome1.gb    Streptococcus
genome2.gb    Streptococcus
genome3.gb    Lactobacillus
genome4.gb    Streptococcus
genome5.gb    Lactobacillus

  • Multiple files can belong to the same group

  • Files can be assigned to multiple groups (list on separate lines)
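
Mismatches between the groups file and the training directory are easy to miss. The sketch below cross-checks the two. check_groups is a hypothetical helper, and it assumes that the first column holds bare filenames located directly inside the directory passed to -d.

```shell
# check_groups GROUPS_FILE DIR
# Hypothetical helper. Prints
#   missing   <TAB> name   for groups-file entries with no matching file in DIR
#   ungrouped <TAB> name   for files in DIR that the groups file never mentions
check_groups() {
    groups=$1
    dir=$2
    # Entries whose file is absent from the training directory.
    cut -f1 "$groups" | sort -u | while IFS= read -r name; do
        [ -f "$dir/$name" ] || printf 'missing\t%s\n' "$name"
    done
    # Files in the directory that no row of the groups file names.
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        cut -f1 "$groups" | grep -Fxq -- "$name" || printf 'ungrouped\t%s\n' "$name"
    done
}
```

Running check_groups groups.txt marked_genomes/ prints nothing when the two agree. An ungrouped file is not necessarily an error; the report only makes the omission visible.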

Complete Example

Step 1: Prepare Prophage Coordinates

Create a file listing prophages:

cat > prophages.txt << EOF
genomes/strep1.gb    NC_002737    529631    569288
genomes/strep1.gb    NC_002737    778642    820599
genomes/strep2.gb    NC_003028    100000    145000
genomes/lacto1.gb    NC_004567    234567    289012
EOF

Step 2: Mark Prophage Features

mark_prophage_features.py -i prophages.txt -o marked_genomes/

Step 3: Create Groups File

cat > groups.txt << EOF
strep1.gb    Streptococcus
strep2.gb    Streptococcus
lacto1.gb    Lactobacillus
EOF

Step 4: Generate Training Sets

make_training_sets.py \
    -d marked_genomes/ \
    -g groups.txt \
    --use_taxonomy \
    --phmms pVOGs.hmm \
    --threads 4 \
    --retrain

This creates one training set per group and writes them to PhiSpyModules/data/.

Step 5: Use Training Sets

# List available training sets
PhiSpy.py --list short

# Use your new training set
PhiSpy.py new_genome.gb -o results -t data/trainSet_Streptococcus.txt

Advanced Options

Using Taxonomy

The --use_taxonomy flag automatically groups genomes by taxonomy:

make_training_sets.py \
    -d marked_genomes/ \
    --use_taxonomy \
    --retrain

PhiSpy reads taxonomy from GenBank files and creates training sets for:

  • Genus level (if available)

  • Family level (if genus not available)

  • “Bacteria” (if no taxonomy information)
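
To anticipate which groups may result, you can preview the organism recorded in each GenBank header. The sketch below is only a preview and does not reproduce PhiSpy's grouping logic: organism_genus is a hypothetical helper that assumes the first word of the ORGANISM line is the genus.

```shell
# organism_genus FILE
# Hypothetical helper: print the first word of the ORGANISM line of a GenBank
# file, which is usually the genus. Prints "unknown" if no ORGANISM line exists.
organism_genus() {
    awk '
        $1 == "ORGANISM" { print $2; found = 1; exit }
        END { if (!found) print "unknown" }
    ' "$1"
}
```

Calling organism_genus on each file in marked_genomes/ gives a rough idea of how many distinct genera the training data covers, and which files lack an organism entry altogether.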

HMM-Enhanced Training

Include HMM signals in training:

# Download pVOGs
wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm

# Train with HMM
make_training_sets.py \
    -d marked_genomes/ \
    -g groups.txt \
    --phmms pVOGs.hmm \
    --threads 8 \
    --retrain

Absolute Retraining

To train only on your own data and ignore PhiSpy’s default reference genomes:

make_training_sets.py \
    -d marked_genomes/ \
    -g groups.txt \
    --absolute_retrain

This is useful for:

  • Highly specialized organisms

  • Quality control with known prophages

  • Specific research questions

Best Practices

Selecting Genomes

  • Use diverse prophages: Include genomes with different prophage types

  • Keep the set manageable: roughly 5-20 genomes are often sufficient

  • Include remnants: Mark even degraded prophage regions

  • Quality over quantity: A few well-annotated genomes are better than many poorly annotated ones

Marking Prophages

  • Be thorough: Mark all prophage proteins, including small ones

  • Include boundaries: Capture integration and excision genes

  • Mark remnants: Even partial prophages provide useful signal

  • Verify manually: Check that marked regions make biological sense

K-mer Selection

  • Default (12-mers, all): Works well for most cases

  • Codon k-mers: Better for very AT-rich or GC-rich genomes

  • Larger k-mers: More specific but require more data

  • Smaller k-mers: Less specific but work with less data

Training Set Management

  • Use consistent naming: Name sets after taxonomic groups

  • Document parameters: Record k-mer size, type, and source genomes

  • Version control: Keep track of training set versions

  • Test performance: Validate on held-out genomes
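
One simple way to validate on held-out genomes is to measure how much of each curated prophage is covered by the predictions. The sketch below makes no assumption about PhiSpy's output columns: region_coverage is a hypothetical helper that expects both inputs to be reduced beforehand to three tab-separated columns (replicon, start, end), with 1-based inclusive coordinates and non-overlapping predictions.

```shell
# region_coverage PREDICTED CURATED
# Hypothetical helper. Both files: replicon <TAB> start <TAB> end.
# For each curated region, print the region followed by the fraction of its
# length covered by predicted regions on the same replicon.
# Assumption: predicted regions do not overlap one another.
region_coverage() {
    awk -F '\t' '
        FILENAME == ARGV[1] {
            # First file: remember every predicted region.
            n++; rep[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0
            next
        }
        {
            cstart = $2 + 0; cend = $3 + 0; covered = 0
            for (i = 1; i <= n; i++) {
                if (rep[i] != $1) continue
                lo = (s[i] > cstart) ? s[i] : cstart
                hi = (e[i] < cend) ? e[i] : cend
                if (hi >= lo) covered += hi - lo + 1
            }
            printf "%s\t%d\t%d\t%.2f\n", $1, cstart, cend, covered / (cend - cstart + 1)
        }
    ' "$1" "$2"
}
```

A curated region with a fraction near 0 was missed entirely, while fractions well below 1 point to truncated predictions. Comparing these numbers between the default and the custom training set gives a concrete basis for deciding which to keep.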

Storage Location

Training sets are stored in:

PhiSpyModules/data/

Temporary files created during training are written to:

PhiSpyModules/data/testSets/

These are preserved to speed up retraining.

Troubleshooting

Not Enough Phage-Specific K-mers

Problem: Warning about small k-mer sets.

Solutions:

  • Add more diverse genomes

  • Use smaller k-mer size

  • Include more prophage regions

  • Try --kmers_type simple

Training Set Not Found

Problem: PhiSpy can’t find your training set.

Solutions:

# List available sets
PhiSpy.py --list long

# Use full path if needed
PhiSpy.py genome.gb -o results -t /path/to/trainSet_Custom.txt

Poor Performance

Problem: Custom training set performs worse than default.

Possible causes:

  • Not enough training genomes

  • Poorly marked prophage regions

  • Overfitting to specific prophage types

  • K-mer parameters not optimal

Solutions:

  • Add more diverse training genomes

  • Verify prophage annotations

  • Try default training set for comparison

  • Experiment with k-mer parameters

Example: Training for Novel Genus

Complete workflow for a novel bacterial genus:

# 1. Annotate genomes
for genome in genomes/*.fasta; do
    prokka --outdir "prokka_output/$(basename "$genome" .fasta)" "$genome"
done

# 2. Run PhiSpy with default settings to get initial predictions
for gb in prokka_output/*/*.gbk; do
    # Prokka's default file prefix is not the genome name, so use the directory name
    name=$(basename "$(dirname "$gb")")
    PhiSpy.py "$gb" -o "initial_predictions/$name" --output_choice 11
done

# 3. Manually review and create prophage coordinate file
# (Review prophage_information.tsv files and create prophages.txt)

# 4. Mark features
mark_prophage_features.py -i prophages.txt -o marked_genomes/

# 5. Create training set
make_training_sets.py \
    -d marked_genomes/ \
    --use_taxonomy \
    --phmms pVOGs.hmm \
    --threads 8 \
    --retrain

# 6. Re-run with custom training set
# (adjust the glob if your marked files use a different extension, e.g. .gbk)
for gb in marked_genomes/*.gb; do
    name=$(basename "$gb" .gb)
    PhiSpy.py "$gb" -o "final_predictions/$name" \
        -t data/trainSet_YourGenus.txt \
        --phmms pVOGs.hmm \
        --output_choice 11
done

Further Reading