Tips and Tricks
===============

This page contains helpful tips, troubleshooting advice, and solutions to common problems.

PATH Issues
-----------

Command Not Found
^^^^^^^^^^^^^^^^^

If you get this error::

    $ PhiSpy.py -v
    -bash: PhiSpy.py: command not found

This usually means that ``~/.local/bin``, where ``pip install --user`` places scripts, is not on your ``PATH``.

**Solution 1: Use the full path**

.. code-block:: bash

   ~/.local/bin/PhiSpy.py -v

**Solution 2: Add to PATH** (recommended)

.. code-block:: bash

   echo "export PATH=\$HOME/.local/bin:\$PATH" >> ~/.bashrc
   source ~/.bashrc
   PhiSpy.py -v

Installation Tips
-----------------

Quick Installation
^^^^^^^^^^^^^^^^^^

The simplest installation (requires sudo):

.. code-block:: bash

   sudo apt install -y python3-pip
   python3 -m pip install --user phispy

Note: on Debian and Ubuntu, ``python3-pip`` normally pulls in ``build-essential`` and ``python3-dev`` as recommended packages.

Verify Installation
^^^^^^^^^^^^^^^^^^^

Check that PhiSpy is installed correctly:

.. code-block:: bash

   PhiSpy.py --version
   PhiSpy.py --list short

Running PhiSpy
--------------

Handling Large Genomes
^^^^^^^^^^^^^^^^^^^^^^

For large or draft genomes:

1. **Filter small contigs**::

       PhiSpy.py genome.gb -o results --min_contig_size 10000

2. **Process contigs separately** if memory is an issue

3. **Use appropriate resources**:

   - Memory: typically ~2-4 GB per genome
   - CPU: Use ``--threads`` for HMM searches

Working with Draft Genomes
^^^^^^^^^^^^^^^^^^^^^^^^^^

Draft genomes often have prophages spanning contig breaks.

**Best practices:**

- Pay attention to prophages at contig ends
- Consider using assembly tools to close gaps
- Review contig boundaries manually
- Expect some fragmented predictions

Gzip Files
^^^^^^^^^^

PhiSpy handles gzip automatically::

    # Both of these work
    PhiSpy.py genome.gb -o results
    PhiSpy.py genome.gb.gz -o results

    # Output matches input format
    # If input is .gz, output files are also .gz

Multiple Genomes
^^^^^^^^^^^^^^^^

Process multiple genomes efficiently:

**Method 1: Use file prefixes**

.. code-block:: bash

   for genome in *.gb; do
       name=$(basename "$genome" .gb)
       PhiSpy.py "$genome" -o results -p "${name}_"
   done

**Method 2: Separate output directories**

.. code-block:: bash

   for genome in *.gb; do
       name=$(basename "$genome" .gb)
       PhiSpy.py "$genome" -o "results_${name}"
   done

Parameter Tuning
----------------

Finding the Right Settings
^^^^^^^^^^^^^^^^^^^^^^^^^^

Start with defaults and adjust based on results:

.. code-block:: bash

   # 1. Run with defaults
   PhiSpy.py genome.gb -o results_default

   # 2. Review output
   less results_default/prophage_coordinates.tsv

   # 3. Adjust if needed
   # Too many predictions:
   PhiSpy.py genome.gb -o results_strict --phage_genes 5

   # Too few predictions:
   PhiSpy.py genome.gb -o results_relaxed --phage_genes 0

Systematic Testing
^^^^^^^^^^^^^^^^^^

Test multiple parameter combinations:

.. code-block:: bash

   for pg in 0 1 3 5; do
       PhiSpy.py genome.gb -o "results_pg${pg}" --phage_genes "$pg"
   done

   # Compare results
   wc -l results_pg*/prophage_coordinates.tsv

Using Jupyter Notebook
^^^^^^^^^^^^^^^^^^^^^^

The interactive notebook (``jupyter_notebooks/PhiSpy.ipynb`` in the repository) is well suited to parameter exploration:

1. Clone the repository
2. Install Jupyter: ``pip install jupyter``
3. Run: ``jupyter notebook jupyter_notebooks/PhiSpy.ipynb``
4. Adjust parameters interactively

Performance
-----------

Speed Up Processing
^^^^^^^^^^^^^^^^^^^

1. **Use multiple threads**::

       PhiSpy.py genome.gb -o results --threads 8

2. **Skip HMM search on reruns**::

       PhiSpy.py genome.gb -o results --phmms pVOGs.hmm --skip_search

3. **Use appropriate training sets** (faster than generic)

4. **Filter small contigs** (reduces processing time)

Parallel Processing
^^^^^^^^^^^^^^^^^^^

Process multiple genomes in parallel:

.. code-block:: bash

   # Using GNU parallel (4 genomes at a time, 2 threads each)
   parallel -j 4 'PhiSpy.py {} -o results/{/.} --threads 2' ::: genomes/*.gb

   # Or simple background jobs (starts every genome at once,
   # so use this only for a handful of files)
   for genome in genomes/*.gb; do
       name=$(basename "$genome" .gb)
       PhiSpy.py "$genome" -o "results_${name}" --threads 2 &
   done
   wait

Debugging
---------

Verbose Output
^^^^^^^^^^^^^^

For detailed logging:

.. code-block:: bash

   PhiSpy.py genome.gb -o results --log phispy_detailed.log

Review the log file for:

- Which metrics were used
- Training set loaded
- Number of genes/contigs processed
- Decisions made during prediction

Keep Temporary Files
^^^^^^^^^^^^^^^^^^^^

Preserve intermediate files for inspection:

.. code-block:: bash

   PhiSpy.py genome.gb -o results --keep

This keeps all temporary files for debugging.

Evaluation Mode
^^^^^^^^^^^^^^^

Re-run evaluation without reprocessing:

.. code-block:: bash

   PhiSpy.py genome.gb -o results --evaluate

Useful when testing different thresholds.

Common Issues
-------------

No Prophages Found
^^^^^^^^^^^^^^^^^^

**Possible causes:**

1. Genome truly has no prophages
2. Parameters too strict
3. Poor annotation quality
4. Contigs too small

**Solutions:**

.. code-block:: bash

   # Try relaxed parameters
   PhiSpy.py genome.gb -o results --phage_genes 0

   # Check that the file contains CDS features (count of CDS feature lines)
   grep -c "^ \{5\}CDS " genome.gb

   # Lower the minimum contig size so that short contigs are included
   PhiSpy.py genome.gb -o results --min_contig_size 1000

Too Many False Positives
^^^^^^^^^^^^^^^^^^^^^^^^

**Symptoms:**

- Many small predictions
- Ribosomal operons called as prophages
- Pathogenicity islands included

**Solutions:**

.. code-block:: bash

   # Increase stringency
   PhiSpy.py genome.gb -o results --phage_genes 5

   # Use specific metrics
   PhiSpy.py genome.gb -o results --metrics shannon_slope gc_skew

   # Use HMM database
   PhiSpy.py genome.gb -o results --phmms pVOGs.hmm

Fragmented Predictions
^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** One prophage split into multiple predictions.

**Solution:**

.. code-block:: bash

   # Allow more non-phage genes between regions
   PhiSpy.py genome.gb -o results --nonprophage_genegaps 20

Memory Issues
^^^^^^^^^^^^^

**For very large genomes:**

1. Filter contigs::

       PhiSpy.py genome.gb -o results --min_contig_size 10000

2. Reduce threads::

       PhiSpy.py genome.gb -o results --threads 1

3. Process on a high-memory machine

4. Split the genome into smaller files

File Format Issues
------------------

GenBank Format Requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^

PhiSpy expects:

- Valid GenBank format
- CDS features with locations
- DNA sequence included

**Verify format:**

.. code-block:: bash

   # Check for required elements
   grep "LOCUS" genome.gb
   grep "CDS" genome.gb
   grep "ORIGIN" genome.gb

Converting Formats
^^^^^^^^^^^^^^^^^^

From GFF + FASTA to GenBank. The sequence IDs in the FASTA file must match the sequence IDs in the GFF file, and the GenBank writer needs a molecule type, which is set to DNA here:

.. code-block:: bash

   # Using Biopython plus the separate bcbio-gff package (pip install bcbio-gff)
   python3 -c "from BCBio import GFF; from Bio import SeqIO; \
       seqs = SeqIO.to_dict(SeqIO.parse('genome.fasta', 'fasta')); \
       recs = list(GFF.parse(open('genome.gff'), base_dict=seqs)); \
       [r.annotations.setdefault('molecule_type', 'DNA') for r in recs]; \
       SeqIO.write(recs, 'genome.gb', 'genbank')"

From EMBL to GenBank:

.. code-block:: bash

   # Using Biopython
   python3 -c "from Bio import SeqIO; \
       SeqIO.convert('genome.embl', 'embl', 'genome.gb', 'genbank')"

Quality Control
---------------

Validating Predictions
^^^^^^^^^^^^^^^^^^^^^^

Always validate key predictions:

1. **BLAST search** prophage genes
2. **Check for att sites** (strong evidence)
3. **Verify integration sites** (often at tRNAs)
4. **Compare to known prophages** in related species
5. **Check GC content** (often different from host)

Comparing Versions
^^^^^^^^^^^^^^^^^^

When testing parameters::

    # Generate comparable outputs
    PhiSpy.py genome.gb -o v1 --phage_genes 1
    PhiSpy.py genome.gb -o v2 --phage_genes 3

    # Compare
    diff v1/prophage_coordinates.tsv v2/prophage_coordinates.tsv

Batch Analysis
--------------

Create Summary Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Count prophages per genome (one line per prophage)
   for dir in results_*/; do
       count=$(wc -l < "${dir}prophage_coordinates.tsv")
       echo "$(basename "$dir"): $count prophages"
   done

Merge Results
^^^^^^^^^^^^^

.. code-block:: bash

   # Combine all prophage coordinates, prefixing each row with its genome.
   # The header names the first columns only; rows may carry extra
   # columns (att site details) when att sites were detected.
   printf 'Genome\tProphage\tContig\tStart\tStop\n' > all_prophages.tsv
   for dir in results_*/; do
       genome=$(basename "$dir")
       awk -v g="$genome" '{print g"\t"$0}' "${dir}prophage_coordinates.tsv" >> all_prophages.tsv
   done

Best Practices Summary
----------------------

1. **Start simple**: Use default parameters first
2. **Review carefully**: Always manually inspect results
3. **Use HMM databases**: Improves accuracy significantly
4. **Choose appropriate training sets**: Use organism-specific sets when available
5. **Document parameters**: Record what you used for reproducibility
6. **Validate predictions**: Use independent evidence
7. **Iterate**: Adjust parameters based on results
8. **Keep logs**: They're invaluable for troubleshooting

Getting Help
------------

If you encounter issues:

1. **Check this documentation** thoroughly
2. **Review the log file** for error messages
3. **Search GitHub issues**: https://github.com/linsalrob/PhiSpy/issues
4. **Open a new issue** if your problem is novel:

   - Include the PhiSpy version
   - Describe the problem clearly
   - Provide example data if possible
   - Include the command used and the error message

5. **Join the community**: Participate in discussions on GitHub

Useful Resources
----------------

- **GitHub Repository**: https://github.com/linsalrob/PhiSpy
- **Original Paper**: https://doi.org/10.1093/nar/gks406
- **RAST Annotation**: http://rast.nmpdr.org/
- **PROKKA Annotation**: https://github.com/tseemann/prokka
- **pVOG Database**: http://dmk-brain.ecn.uiowa.edu/pVOGs
- **VOGdb Database**: http://vogdb.org/
- **Artemis Genome Browser**: https://sanger-pathogens.github.io/Artemis/
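
Recording Runs
--------------

The "Document parameters" and "Keep logs" practices above can be combined in a small shell wrapper that notes every command before running it. The following is a minimal sketch and not part of PhiSpy: ``run_logged`` and ``phispy_runs.tsv`` are hypothetical names chosen for illustration, and the example call assumes ``PhiSpy.py`` is on your ``PATH``.

.. code-block:: bash

   # Hypothetical helper (not a PhiSpy feature): append a UTC timestamp and
   # the command line to a log file, then run the command unchanged.
   # Usage: run_logged LOGFILE COMMAND [ARGS...]
   run_logged() {
       logfile=$1
       shift
       # One tab-separated line per run: timestamp, then the command as typed
       printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
       # Run the command; its exit status becomes the function's exit status
       "$@"
   }

   # Example call (assumes PhiSpy.py is installed):
   # run_logged phispy_runs.tsv PhiSpy.py genome.gb -o results --phage_genes 3

Each run adds one line to the log, so the file doubles as a record of which parameters produced which output directory. Arguments are joined with spaces, so arguments that themselves contain spaces are not re-quoted in the log.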