Usage Guide
This guide provides detailed information about using PhiSpy effectively.
Basic Usage
The simplest command is:
PhiSpy.py genbank_file -o output_directory
Where:
genbank_file: Input DNA sequence file in GenBank format
output_directory: Directory where output files will be created
Important Parameters
phage_genes
The --phage_genes parameter controls how strict PhiSpy is in calling prophages.
By default, PhiSpy uses strict mode, looking for 1 or more genes that are likely to be phage genes in each prophage region.
Increasing the value (e.g., --phage_genes 5) will:
Reduce the number of prophages predicted
Make predictions more specific but less sensitive
Only report regions with strong phage signals
Setting to 0 (--phage_genes 0) will:
Identify other mobile elements (plasmids, integrons, pathogenicity islands)
Also identify ribosomal RNA operons (they’re unlike the host backbone)
Generate many false positives
Example:
PhiSpy.py genome.gb -o results --phage_genes 5
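To see how the threshold changes the predictions, it can help to run the same genome at several --phage_genes values and keep each run in its own directory. The sketch below only prints the commands (a dry run). The helper name sweep_commands and the results_pgN directory naming are assumptions made for illustration and are not part of PhiSpy.

```shell
# sweep_commands is a hypothetical helper, not part of PhiSpy.
# It prints one PhiSpy command per threshold, each writing to its own
# output directory, and does not run anything itself.
sweep_commands() {
    genome=$1
    shift
    for n in "$@"; do
        printf 'PhiSpy.py %s -o results_pg%s --phage_genes %s\n' "$genome" "$n" "$n"
    done
}

# Print the commands for thresholds 1, 3 and 5
sweep_commands genome.gb 1 3 5
```

Review the printed commands before running them. Paths containing spaces would need quoting before the lines are executed.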
Training Sets
Training sets improve prediction accuracy by providing information about prophages in related organisms.
List available training sets:
PhiSpy.py --list short
Use a specific training set:
PhiSpy.py genome.gb -o results -t data/trainSet_Streptococcus.txt
The default training set (trainSet_genericAll.txt) works well for most genomes.
File Name Prefixes
When analyzing multiple genomes, use prefixes to avoid overwriting outputs:
PhiSpy.py genome1.gb -o results -p genome1_
PhiSpy.py genome2.gb -o results -p genome2_
All output files for genome1 will have the prefix genome1_.
Gzip Support
PhiSpy natively supports gzip format for both input and output:
If you provide a gzipped input file, PhiSpy will write gzipped output files
No need to manually decompress/compress files
Example:
PhiSpy.py genome.gb.gz -o results
HMM Searches
PhiSpy can use HMM profile searches to improve prophage detection.
Using pVOGs Database
Download and prepare the pVOGs database:
wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm
Run PhiSpy with pVOGs:
PhiSpy.py genome.gb -o results --phmms pVOGs.hmm --threads 4 --color
Using VOGdb Database
Download and prepare VOGdb:
curl -LO http://fileshare.csb.univie.ac.at/vog/latest/vog.hmm.tar.gz
mkdir vog
tar -C vog -xf vog.hmm.tar.gz
cat vog/* > VOGs.hmms
hmmpress VOGs.hmms
Run PhiSpy with VOGdb:
PhiSpy.py genome.gb -o results --phmms VOGs.hmms --threads 4
HMM Search Features
When using --phmms:
The input GenBank file is updated and saved in the output directory
With the --color flag, proteins with HMM hits are colored for Artemis visualization
Use --skip_search to skip the search step when re-running on the same data
Metrics
PhiSpy uses several metrics to identify prophages:
Default Metrics
When no --metrics flag is provided, all metrics are used:
orf_length_med: Median ORF length
shannon_slope: Slope of Shannon’s diversity of k-mers
at_skew: Normalized AT skew
gc_skew: Normalized GC skew
max_direction: Maximum number of genes in the same direction
Specifying Metrics
You can choose specific metrics:
Single metric:
PhiSpy.py genome.gb -o results --metrics shannon_slope
Multiple metrics (method 1):
PhiSpy.py genome.gb -o results --metrics shannon_slope gc_skew
Multiple metrics (method 2):
PhiSpy.py genome.gb -o results --metrics shannon_slope --metrics gc_skew
The metrics used are recorded in the log file.
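A mistyped metric name is easy to miss in a long command line. One possible guard, sketched below, checks the requested names against the five metrics listed under Default Metrics before PhiSpy is called. The helper check_metrics is hypothetical and not part of PhiSpy, and the list of names is copied from this guide rather than queried from the program.

```shell
# Metric names as listed under "Default Metrics" in this guide.
KNOWN_METRICS="orf_length_med shannon_slope at_skew gc_skew max_direction"

# check_metrics is a hypothetical helper, not part of PhiSpy.
# It returns 0 if every argument is a known metric name, otherwise it
# reports the first unknown name on stderr and returns 1.
check_metrics() {
    for metric in "$@"; do
        found=0
        for known in $KNOWN_METRICS; do
            if [ "$metric" = "$known" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "unknown metric: $metric" >&2
            return 1
        fi
    done
    return 0
}

# Usage (runs PhiSpy only if the names are valid):
# check_metrics shannon_slope gc_skew &&
#     PhiSpy.py genome.gb -o results --metrics shannon_slope gc_skew
```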
Additional Options
expand_slope: Improves Shannon score calculations:
PhiSpy.py genome.gb -o results --expand_slope
kmers_type: Controls k-mer generation method:
PhiSpy.py genome.gb -o results --kmers_type codon
Advanced Usage
Color Output for Artemis
The --color flag adds color qualifiers to the GenBank output:
PhiSpy.py genome.gb -o results --color --phmms pVOGs.hmm
Open the resulting GenBank file in Artemis to see colored CDS features.
Choosing Output Files
Control which files are generated using --output_choice:
Minimal output (coordinates only):
PhiSpy.py genome.gb -o results --output_choice 1
Default output (coordinates + GenBank):
PhiSpy.py genome.gb -o results --output_choice 3
Include prophage information file:
PhiSpy.py genome.gb -o results --output_choice 11
All outputs:
PhiSpy.py genome.gb -o results --output_choice 512
See the Output Files page for details.
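The examples above suggest that --output_choice values are additive codes: 1 gives the coordinates, 3 adds the GenBank file, and 11 adds the prophage information file, which implies codes 1, 2 and 8 for those three outputs. The sketch below combines them under that assumption. The helper output_choice_code and its short output names are hypothetical, and the remaining codes are deliberately left out; see the Output Files page for those.

```shell
# output_choice_code is a hypothetical helper, not part of PhiSpy.
# The codes are inferred from the examples in this guide:
#   1 = coordinates, 2 = GenBank, 8 = prophage information.
# Codes are combined with a bitwise OR, so repeating a name is harmless.
output_choice_code() {
    total=0
    for name in "$@"; do
        case "$name" in
            coordinates) total=$((total | 1)) ;;
            genbank)     total=$((total | 2)) ;;
            information) total=$((total | 8)) ;;
            *) echo "unknown output name: $name" >&2; return 1 ;;
        esac
    done
    echo "$total"
}

# Usage:
# PhiSpy.py genome.gb -o results \
#     --output_choice "$(output_choice_code coordinates genbank information)"
```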
Interactive PhiSpy
A Jupyter notebook is available for interactive exploration.
Use it to:
Test different parameters interactively
Visualize how parameters affect predictions
Explore your genome’s prophage content
Change the genome file path and parameter values to see how predictions vary.
Performance Optimization
Multi-threading
Use the --threads parameter to speed up HMM searches and the random forest step:
PhiSpy.py genome.gb -o results --phmms pVOGs.hmm --threads 8
Memory Usage
For large genomes or many genomes:
Process one genome at a time
Use an appropriate --min_contig_size to filter small contigs
Consider using a high-performance computing cluster
Best Practices
Always annotate your genome first using RAST or PROKKA
Start with default parameters to get a baseline
Examine the output carefully (see Assessing Predictions)
Adjust parameters based on results:
Too many prophages? Increase --phage_genes
Too few prophages? Decrease --phage_genes or use --metrics
Related organism available? Use an appropriate --training_set
Use HMM searches when available for better accuracy
Review predictions in context of genome biology
Common Workflows
Standard Workflow
# 1. Annotate genome (using PROKKA as example)
prokka --outdir prokka_results --prefix genome genome.fasta
# 2. Run PhiSpy with default settings
PhiSpy.py prokka_results/genome.gbk -o phispy_results
# 3. Review output files
less phispy_results/prophage_coordinates.tsv
# 4. If needed, adjust and re-run
PhiSpy.py prokka_results/genome.gbk -o phispy_results_v2 --phage_genes 3
High-Quality Workflow
# Get HMM database
wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm
# Run PhiSpy with HMM and appropriate training set
PhiSpy.py genome.gb -o results \
    --phmms pVOGs.hmm \
    --threads 8 \
    --color \
    --output_choice 11 \
    -t data/trainSet_YourOrganism.txt
Batch Processing
# Process multiple genomes
for genome in genomes/*.gb; do
    name=$(basename "$genome" .gb)
    PhiSpy.py "$genome" -o results -p "${name}_" --threads 4
done