Custom Training Sets
====================

PhiSpy can be trained on your own data to improve predictions for specific organisms or clades.

Why Create Training Sets?
--------------------------

Create custom training sets when:

- Your organism is distantly related to reference datasets
- You have manually curated prophage data
- You want organism-specific predictions
- Default training sets don't work well

Requirements
------------

To create training sets, you need:

1. **Annotated GenBank files** with prophage regions marked
2. **Prophage markers**: CDS features in prophage regions must have ``/is_phage="1"`` qualifier
3. **At least one genome** (more is better for diversity)

Marking Prophage Features
--------------------------

Use the ``mark_prophage_features.py`` script to add ``/is_phage="1"`` qualifiers to prophage proteins.

Input Format
^^^^^^^^^^^^

The script accepts a tab-delimited file with:

1. Path to GenBank file
2. Replicon ID
3. Prophage start coordinate
4. Prophage end coordinate

**Example input file (prophages.txt):**

.. code-block:: text

    genome1.gb	NC_002737	529631	569288
    genome1.gb	NC_002737	778642	820599
    genome2.gb	NC_003028	123456	145678

Usage
^^^^^

.. code-block:: bash

    mark_prophage_features.py -i prophages.txt -o marked_genomes/

This updates the GenBank files with ``/is_phage="1"`` qualifiers for all CDS features within the specified prophage regions.

Creating Training Sets
-----------------------

Use ``make_training_sets.py`` to generate training sets from marked GenBank files.

Basic Usage
^^^^^^^^^^^

.. code-block:: bash

    make_training_sets.py -d marked_genomes/ -g groups.txt

The script:

1. Reads marked GenBank files
2. Generates phage-specific and bacteria-specific k-mer sets
3. Calculates training features
4. Creates training set files

Parameters
^^^^^^^^^^

.. option:: -d INDIR, --indir INDIR

   Path to directory containing marked GenBank files for training.

.. option:: -g GROUPS, --groups GROUPS

   Path to groups file mapping GenBank files to training set names.
   If not provided, each file gets its own training set.

.. option:: --use_taxonomy

   Use taxonomy information from GenBank files to create groups.
   Files without taxonomy are assigned to "Bacteria" group.

.. option:: -k KMER_SIZE, --kmer_size KMER_SIZE

   Size of k-mers to generate. Default: 12

   For codon approach, use multiples of 3.

.. option:: -t KMERS_TYPE, --kmers_type KMERS_TYPE

   K-mer generation method:

   - ``simple``: Slice sequence from first position
   - ``all``: All possible k-mers (default)
   - ``codon``: K-mers with step of 3 nucleotides

.. option:: --phmms PHMMS

   HMM profile database for additional feature calculation.

.. option:: --threads THREADS

   Number of threads for HMM searches.

.. option:: --retrain

   Trigger complete reanalysis when reference files have changed.
   Use when you've modified ``/is_phage="1"`` qualifiers.

.. option:: --absolute_retrain

   Ignore PhiSpy's default reference genomes.
   Train only on your provided data.

Groups File Format
^^^^^^^^^^^^^^^^^^

The groups file has two tab-separated columns:

1. GenBank filename (with extension)
2. Group name

**Example groups.txt:**

.. code-block:: text

    genome1.gb	Streptococcus
    genome2.gb	Streptococcus
    genome3.gb	Lactobacillus
    genome4.gb	Streptococcus
    genome5.gb	Lactobacillus

- Multiple files can belong to the same group
- Files can be assigned to multiple groups (list on separate lines)

Complete Example
----------------

Step 1: Prepare Prophage Coordinates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a file listing prophages:

.. code-block:: bash

    cat > prophages.txt << EOF
    genomes/strep1.gb	NC_002737	529631	569288
    genomes/strep1.gb	NC_002737	778642	820599
    genomes/strep2.gb	NC_003028	100000	145000
    genomes/lacto1.gb	NC_004567	234567	289012
    EOF

Step 2: Mark Prophage Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    mark_prophage_features.py -i prophages.txt -o marked_genomes/

Step 3: Create Groups File
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    cat > groups.txt << EOF
    strep1.gb	Streptococcus
    strep2.gb	Streptococcus
    lacto1.gb	Lactobacillus
    EOF

Step 4: Generate Training Sets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    make_training_sets.py \
        -d marked_genomes/ \
        -g groups.txt \
        --use_taxonomy \
        --phmms pVOGs.hmm \
        --threads 4 \
        --retrain

This creates training sets for each group in ``PhiSpyModules/data/``.

Step 5: Use Training Sets
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # List available training sets
    PhiSpy.py --list short

    # Use your new training set
    PhiSpy.py new_genome.gb -o results -t data/trainSet_Streptococcus.txt

Advanced Options
----------------

Using Taxonomy
^^^^^^^^^^^^^^

The ``--use_taxonomy`` flag automatically groups genomes by taxonomy:

.. code-block:: bash

    make_training_sets.py \
        -d marked_genomes/ \
        --use_taxonomy \
        --retrain

PhiSpy reads taxonomy from GenBank files and creates training sets for:

- Genus level (if available)
- Family level (if genus not available)
- "Bacteria" (if no taxonomy information)

HMM-Enhanced Training
^^^^^^^^^^^^^^^^^^^^^

Include HMM signals in training:

.. code-block:: bash

    # Download pVOGs
    wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
    tar -zxvf AllvogHMMprofiles.tar.gz
    cat AllvogHMMprofiles/* > pVOGs.hmm

    # Train with HMM
    make_training_sets.py \
        -d marked_genomes/ \
        -g groups.txt \
        --phmms pVOGs.hmm \
        --threads 8 \
        --retrain

Absolute Retraining
^^^^^^^^^^^^^^^^^^^

To use ONLY your data (ignore PhiSpy defaults):

.. code-block:: bash

    make_training_sets.py \
        -d marked_genomes/ \
        -g groups.txt \
        --absolute_retrain

This is useful for:

- Highly specialized organisms
- Quality control with known prophages
- Specific research questions

Best Practices
--------------

Selecting Genomes
^^^^^^^^^^^^^^^^^

- **Use diverse prophages**: Include genomes with different prophage types
- **Avoid too many genomes**: ~5-20 genomes is often sufficient
- **Include remnants**: Mark even degraded prophage regions
- **Quality over quantity**: Better to have fewer well-annotated genomes

Marking Prophages
^^^^^^^^^^^^^^^^^

- **Be thorough**: Mark all prophage proteins, including small ones
- **Include boundaries**: Capture integration and excision genes
- **Mark remnants**: Even partial prophages provide useful signal
- **Verify manually**: Check that marked regions make biological sense

K-mer Selection
^^^^^^^^^^^^^^^

- **Default (12-mers, all)**: Works well for most cases
- **Codon k-mers**: Better for very AT-rich or GC-rich genomes
- **Larger k-mers**: More specific but require more data
- **Smaller k-mers**: Less specific but work with less data

Training Set Management
^^^^^^^^^^^^^^^^^^^^^^^

- **Use consistent naming**: Name sets after taxonomic groups
- **Document parameters**: Record k-mer size, type, and source genomes
- **Version control**: Keep track of training set versions
- **Test performance**: Validate on held-out genomes

Storage Location
^^^^^^^^^^^^^^^^

Training sets are stored in::

    PhiSpyModules/data/

Temporary files during training are in::

    PhiSpyModules/data/testSets/

These are preserved to speed up retraining.

Troubleshooting
---------------

Not Enough Phage-Specific K-mers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Problem**: Warning about small k-mer sets.

**Solutions**:

- Add more diverse genomes
- Use smaller k-mer size
- Include more prophage regions
- Try ``--kmers_type simple``

Training Set Not Found
^^^^^^^^^^^^^^^^^^^^^^^

**Problem**: PhiSpy can't find your training set.


**Solutions**:

.. code-block:: bash

    # List available sets
    PhiSpy.py --list long

    # Use full path if needed
    PhiSpy.py genome.gb -o results -t /path/to/trainSet_Custom.txt

Poor Performance
^^^^^^^^^^^^^^^^

**Problem**: Custom training set performs worse than default.

**Possible causes**:

- Not enough training genomes
- Poorly marked prophage regions
- Overfitting to specific prophage types
- K-mer parameters not optimal

**Solutions**:

- Add more diverse training genomes
- Verify prophage annotations
- Try default training set for comparison
- Experiment with k-mer parameters

Example: Training for Novel Genus
----------------------------------

Complete workflow for a novel bacterial genus:

.. code-block:: bash

    # 1. Annotate genomes
    for genome in genomes/*.fasta; do
        prokka --outdir "prokka_output/$(basename "$genome" .fasta)" "$genome"
    done

    # 2. Run PhiSpy with default settings to get initial predictions
    for gb in prokka_output/*/*.gbk; do
        name=$(basename "$gb" .gbk)
        PhiSpy.py "$gb" -o "initial_predictions/$name" --output_choice 11
    done

    # 3. Manually review and create prophage coordinate file
    # (Review prophage_information.tsv files and create prophages.txt)

    # 4. Mark features
    mark_prophage_features.py -i prophages.txt -o marked_genomes/

    # 5. Create training set
    make_training_sets.py \
        -d marked_genomes/ \
        --use_taxonomy \
        --phmms pVOGs.hmm \
        --threads 8 \
        --retrain

    # 6. Re-run with custom training set
    for gb in marked_genomes/*.gb; do
        name=$(basename "$gb" .gb)
        PhiSpy.py "$gb" -o "final_predictions/$name" \
            -t data/trainSet_YourGenus.txt \
            --phmms pVOGs.hmm \
            --output_choice 11
    done

Further Reading
---------------

- See the Parameters Reference for details on all options
- Check the Output Files page for result interpretation
- Visit the GitHub repository for examples
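
Example: Checking the Coordinate File
-------------------------------------

Before running ``mark_prophage_features.py``, it can help to confirm that the coordinate file really has the four tab-separated columns described under Input Format. The following is a minimal sketch, not part of PhiSpy: the names ``ProphageRegion`` and ``parse_prophage_table`` are hypothetical, and the rules that blank lines are skipped and that the start coordinate must be smaller than the end coordinate are assumptions made for illustration.

.. code-block:: python

    import sys
    from typing import List, NamedTuple


    class ProphageRegion(NamedTuple):
        """One row of the coordinate file (hypothetical helper type)."""
        genbank_path: str
        replicon_id: str
        start: int
        end: int


    def parse_prophage_table(text: str) -> List[ProphageRegion]:
        """Parse tab-delimited prophage coordinates; raise ValueError on bad rows."""
        regions = []
        for lineno, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue  # assumption: blank lines are harmless and skipped
            fields = line.split("\t")
            if len(fields) != 4:
                raise ValueError(
                    f"line {lineno}: expected 4 tab-separated fields, got {len(fields)}"
                )
            path, replicon, start, end = (field.strip() for field in fields)
            try:
                start_pos, end_pos = int(start), int(end)
            except ValueError:
                raise ValueError(f"line {lineno}: coordinates must be integers") from None
            if start_pos >= end_pos:
                # assumption: regions are given as start < end on the replicon
                raise ValueError(f"line {lineno}: start must be smaller than end")
            regions.append(ProphageRegion(path, replicon, start_pos, end_pos))
        return regions


    if __name__ == "__main__" and len(sys.argv) == 2:
        # Usage: python check_prophages.py prophages.txt  (script name is illustrative)
        with open(sys.argv[1]) as handle:
            for region in parse_prophage_table(handle.read()):
                print(region)

A row separated by spaces instead of tabs is reported with its line number, which is the most common way for a hand-edited coordinate file to go wrong.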