Usage Guide
===========

This guide provides detailed information about using PhiSpy effectively.

Basic Usage
-----------

The simplest command is:

.. code-block:: bash

    PhiSpy.py genbank_file -o output_directory

Where:

- ``genbank_file``: Input DNA sequence file in GenBank format
- ``output_directory``: Directory where output files will be created

Important Parameters
--------------------

phage_genes
^^^^^^^^^^^

The ``--phage_genes`` parameter controls how strict PhiSpy is in calling prophages. By default, PhiSpy uses *strict* mode, looking for 1 or more genes that are likely to be phage genes in each prophage region.

**Increasing the value** (e.g., ``--phage_genes 5``) will:

- Reduce the number of prophages predicted
- Make predictions more specific but less sensitive
- Only report regions with strong phage signals

**Setting to 0** (``--phage_genes 0``) will:

- Identify other mobile elements (plasmids, integrons, pathogenicity islands)
- Also identify ribosomal RNA operons (they're unlike the host backbone)
- Generate many false positives

Example::

    PhiSpy.py genome.gb -o results --phage_genes 5

Training Sets
^^^^^^^^^^^^^

Training sets improve prediction accuracy by providing information about prophages in related organisms.

List available training sets::

    PhiSpy.py --list short

Use a specific training set::

    PhiSpy.py genome.gb -o results -t data/trainSet_Streptococcus.txt

The default training set (``trainSet_genericAll.txt``) works well for most genomes.

File Name Prefixes
^^^^^^^^^^^^^^^^^^

When analyzing multiple genomes, use prefixes to avoid overwriting outputs::

    PhiSpy.py genome1.gb -o results -p genome1_
    PhiSpy.py genome2.gb -o results -p genome2_

All output files for genome1 will have the prefix ``genome1_``.

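
When prefixes are derived from file names rather than typed by hand, a small shell helper can keep them consistent. The following is a minimal sketch only: ``genome_prefix`` is a hypothetical function, not part of PhiSpy, and the list of extensions it strips (``.gb``, ``.gbk``, ``.gbff``, each optionally followed by ``.gz``) is an assumption about how your input files are named.

.. code-block:: bash

    #!/usr/bin/env bash
    # genome_prefix is a hypothetical helper, not part of PhiSpy.
    # It drops the directory, an optional .gz suffix, and an assumed
    # GenBank extension (.gbff, .gbk or .gb), then appends an underscore.
    genome_prefix() {
        local name
        name=$(basename "$1")
        name=${name%.gz}
        name=${name%.gbff}
        name=${name%.gbk}
        name=${name%.gb}
        printf '%s_\n' "$name"
    }

    genome_prefix genomes/genome1.gb       # prints genome1_
    genome_prefix genomes/genome2.gbk.gz   # prints genome2_

The result can then be passed to ``-p``, for example ``PhiSpy.py "$genome" -o results -p "$(genome_prefix "$genome")"``.
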

Gzip Support
^^^^^^^^^^^^

PhiSpy natively supports gzip format for both input and output:

- If you provide a gzipped input file, PhiSpy will write gzipped output files
- No need to manually decompress/compress files

Example::

    PhiSpy.py genome.gb.gz -o results

HMM Searches
------------

PhiSpy can use HMM profile searches to improve prophage detection.

Using pVOGs Database
^^^^^^^^^^^^^^^^^^^^

Download and prepare the pVOGs database:

.. code-block:: bash

    wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
    tar -zxvf AllvogHMMprofiles.tar.gz
    cat AllvogHMMprofiles/* > pVOGs.hmm

Run PhiSpy with pVOGs::

    PhiSpy.py genome.gb -o results --phmms pVOGs.hmm --threads 4 --color

Using VOGdb Database
^^^^^^^^^^^^^^^^^^^^

Download and prepare VOGdb:

.. code-block:: bash

    curl -LO http://fileshare.csb.univie.ac.at/vog/latest/vog.hmm.tar.gz
    mkdir vog
    tar -C vog -xf vog.hmm.tar.gz
    cat vog/* > VOGs.hmms
    hmmpress VOGs.hmms

Run PhiSpy with VOGdb::

    PhiSpy.py genome.gb -o results --phmms VOGs.hmms --threads 4

HMM Search Features
^^^^^^^^^^^^^^^^^^^

When using ``--phmms``:

- The input GenBank file is updated and saved in the output directory
- With ``--color`` flag, proteins with HMM hits are colored for Artemis visualization
- Use ``--skip_search`` to skip the search step when re-running on the same data

Metrics
-------

PhiSpy uses several metrics to identify prophages:

Default Metrics
^^^^^^^^^^^^^^^

When no ``--metrics`` flag is provided, all metrics are used:

- **orf_length_med**: Median ORF length
- **shannon_slope**: Slope of Shannon's diversity of k-mers
- **at_skew**: Normalized AT skew
- **gc_skew**: Normalized GC skew
- **max_direction**: Maximum number of genes in the same direction

Specifying Metrics
^^^^^^^^^^^^^^^^^^

You can choose specific metrics:

Single metric::

    PhiSpy.py genome.gb -o results --metrics shannon_slope

Multiple metrics (method 1)::

    PhiSpy.py genome.gb -o results --metrics shannon_slope gc_skew

Multiple metrics (method 2)::

    PhiSpy.py genome.gb -o results --metrics shannon_slope --metrics gc_skew

The metrics used are recorded in the log file.

Additional Options
^^^^^^^^^^^^^^^^^^

**expand_slope**: Improves Shannon score calculations::

    PhiSpy.py genome.gb -o results --expand_slope

**kmers_type**: Controls k-mer generation method::

    PhiSpy.py genome.gb -o results --kmers_type codon

Advanced Usage
--------------

Color Output for Artemis
^^^^^^^^^^^^^^^^^^^^^^^^^

The ``--color`` flag adds color qualifiers to the GenBank output::

    PhiSpy.py genome.gb -o results --color --phmms pVOGs.hmm

Open the resulting GenBank file in Artemis to see colored CDS features.

Choosing Output Files
^^^^^^^^^^^^^^^^^^^^^^

Control which files are generated using ``--output_choice``:

Minimal output (coordinates only)::

    PhiSpy.py genome.gb -o results --output_choice 1

Default output (coordinates + GenBank)::

    PhiSpy.py genome.gb -o results --output_choice 3

Include prophage information file::

    PhiSpy.py genome.gb -o results --output_choice 11

All outputs::

    PhiSpy.py genome.gb -o results --output_choice 512

See the Output Files page for details.

Interactive PhiSpy
------------------

A Jupyter notebook is available for interactive exploration. Use it to:

- Test different parameters interactively
- Visualize how parameters affect predictions
- Explore your genome's prophage content

Change the genome file path and parameter values to see how predictions vary.

Performance Optimization
------------------------

Multi-threading
^^^^^^^^^^^^^^^

Use the ``--threads`` parameter to speed up HMM searches and random forest::

    PhiSpy.py genome.gb -o results --phmms pVOGs.hmm --threads 8

Memory Usage
^^^^^^^^^^^^

For large genomes or many genomes:

- Process one genome at a time
- Use appropriate ``--min_contig_size`` to filter small contigs
- Consider using a high-performance computing cluster

Best Practices
--------------

1. **Always annotate your genome** first using RAST or PROKKA
2. **Start with default parameters** to get a baseline
3. **Examine the output** carefully (see Assessing Predictions)
4. **Adjust parameters** based on results:

   - Too many prophages? Increase ``--phage_genes``
   - Too few prophages? Decrease ``--phage_genes`` or use ``--metrics``
   - Related organism available? Use appropriate ``--training_set``

5. **Use HMM searches** when available for better accuracy
6. **Review predictions** in context of genome biology

Common Workflows
----------------

Standard Workflow
^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # 1. Annotate genome (using PROKKA as example)
    prokka --outdir prokka_results genome.fasta

    # 2. Run PhiSpy with default settings
    PhiSpy.py prokka_results/genome.gbk -o phispy_results

    # 3. Review output files
    less phispy_results/prophage_coordinates.tsv

    # 4. If needed, adjust and re-run
    PhiSpy.py prokka_results/genome.gbk -o phispy_results_v2 --phage_genes 3

High-Quality Workflow
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Get HMM database
    wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
    tar -zxvf AllvogHMMprofiles.tar.gz
    cat AllvogHMMprofiles/* > pVOGs.hmm

    # Run PhiSpy with HMM and appropriate training set
    PhiSpy.py genome.gb -o results \
        --phmms pVOGs.hmm \
        --threads 8 \
        --color \
        --output_choice 11 \
        -t data/trainSet_YourOrganism.txt

Batch Processing
^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Process multiple genomes
    for genome in genomes/*.gb; do
        name=$(basename "$genome" .gb)
        PhiSpy.py "$genome" -o results -p "${name}_" --threads 4
    done
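
For longer batches it can help to keep going when one genome fails and to report the failures at the end. The following is a minimal sketch under stated assumptions: ``phispy_batch`` and the ``PHISPY`` override variable are hypothetical names introduced here, not part of PhiSpy, and the PhiSpy options are the same ones used in the loop above.

.. code-block:: bash

    #!/usr/bin/env bash
    # phispy_batch is a hypothetical wrapper, not part of PhiSpy.
    # Usage: phispy_batch GENOME_DIR OUTPUT_DIR
    # Set PHISPY to override the command (e.g. PHISPY=echo for a dry run).
    phispy_batch() {
        local genome_dir=$1 out_dir=$2 genome name failed=0
        for genome in "$genome_dir"/*.gb; do
            # An unmatched glob is left as a literal pattern; skip it.
            [ -e "$genome" ] || continue
            name=$(basename "$genome" .gb)
            if ! "${PHISPY:-PhiSpy.py}" "$genome" -o "$out_dir" -p "${name}_" --threads 4; then
                # Report the failure but continue with the next genome.
                echo "FAILED: $genome" >&2
                failed=$((failed + 1))
            fi
        done
        # Return non-zero if any genome failed.
        [ "$failed" -eq 0 ]
    }

Calling ``PHISPY=echo phispy_batch genomes results`` prints the commands that would be run without executing PhiSpy, which is a cheap way to check prefixes before starting a long job.
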