DAWG Workshop 3: Metagenome Assembled Genomes (MAGs)

Author: Alex Vompe

Date: 12/5/25

1. Configure your environment and get the data

Step 1.1: log in to Roar Collab. Request an account if you don’t have one.

ssh [user ID]@submit.hpc.psu.edu

salloc --partition=sla-prio --ntasks=1 --cpus-per-task=12 --mem=128G

Step 1.2: navigate to your work directory

cd $HOME/work

Step 1.3: install the required programs and make a conda environment for assembly and a conda environment for MAG analysis

module load anaconda

conda create -n assembly -c bioconda -c conda-forge fastp megahit bowtie2 samtools

conda activate assembly

Step 1.4: navigate to your scratch directory, make a directory for this workshop, and copy over the data

cd $HOME/scratch

mkdir DAWG_MGS_2

cd DAWG_MGS_2

cp -r /scratch/azv5523/DAWG_MGS_2/reads ./

mkdir megahit

cp /scratch/azv5523/DAWG_MGS_2/megahit/contigs.fa ./megahit/

mkdir alignments

cp /scratch/azv5523/DAWG_MGS_2/alignments/sample1.sam ./alignments/

cp -r /scratch/azv5523/DAWG_MGS_2/CheckM2_database/ ./

Step 1.5: QC the reads with fastp

mkdir fastp_qc

fastp --in1 reads/SRR7595115_1.fastq.gz --in2 reads/SRR7595115_2.fastq.gz --out1 fastp_qc/SRR7595115_1.fastq.gz --out2 fastp_qc/SRR7595115_2.fastq.gz --trim_poly_g --html fastp_qc/SRR7595115_report.html --json fastp_qc/SRR7595115_report.json --thread 12
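It is worth confirming how many read pairs survived filtering before assembling. A quick sketch: fastp’s JSON report includes counts such as "passed_filter_reads", which you can pull out with grep. The snippet fakes a tiny report so it runs anywhere; on Roar Collab, point it at fastp_qc/SRR7595115_report.json instead.

```shell
# Fake a minimal fastp-style JSON report so this demo is self-contained
# (replace demo_report.json with fastp_qc/SRR7595115_report.json).
cat > demo_report.json <<'EOF'
{"filtering_result":{"passed_filter_reads":1900000,"low_quality_reads":90000}}
EOF

# Extract the reads-passing-filter count from the report:
grep -o '"passed_filter_reads":[0-9]*' demo_report.json
```

If the passed-filter count is dramatically lower than the raw read count, inspect the HTML report before proceeding.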

2. Run the MEGAHIT assembler (this will take a long time). Set the maximum number of threads and amount of memory available.

megahit -1 fastp_qc/SRR7595115_1.fastq.gz -2 fastp_qc/SRR7595115_2.fastq.gz -o ./megahit

Note: we will NOT run this during the workshop, as it takes hours to days. Use the “contigs.fa” file in the megahit directory that I assembled for you.

I recommend running SPAdes for better-quality assemblies, but this takes even longer:

spades.py -1 left.fastq.gz -2 right.fastq.gz -o output_folder --meta
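However you produce contigs.fa, a one-line awk pass gives a quick sense of assembly size. The demo below builds a tiny mock FASTA so it is runnable anywhere; on Roar Collab run the awk command on megahit/contigs.fa instead.

```shell
# Mock FASTA standing in for megahit/contigs.fa:
printf '>c1\nACGTACGT\n>c2\nACGT\nACGT\n' > demo.fa

# Count contigs (header lines) and sum sequence lengths:
awk '/^>/ {n++; next} {len += length($0)}
     END {print n " contigs, " len " bp total"}' demo.fa
# → 2 contigs, 16 bp total
```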

3. Map reads to the assembly to get depth coverage info

Add metabat2 to our environment:

conda install -c bioconda/label/cf201901 metabat2

bowtie2-build megahit/contigs.fa contigs_index

Note: this will take ~15 minutes. Run it only if there is plenty of time and you want to see the bowtie2 index format. Otherwise skip it, as we already provide the SAM file.

bowtie2 -x contigs_index -1 reads/SRR7595115_1.fastq.gz -2 reads/SRR7595115_2.fastq.gz -S alignments/sample1.sam -p 12

DO NOT RUN THIS (it takes hours; use the SAM file we provided for the commands below).

samtools view -bS alignments/sample1.sam | samtools sort -o alignments/sample1.bam --threads 12

samtools index alignments/sample1.bam

jgi_summarize_bam_contig_depths --outputDepth output_depth.txt alignments/sample1.bam
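Before binning, it helps to peek at the depth table MetaBAT2 will consume. Its columns are contigName, contigLen, totalAvgDepth, then per-BAM depth and variance. A sketch, using a fabricated two-contig table so the snippet is self-contained (substitute output_depth.txt on Roar Collab):

```shell
# Fake depth table in the jgi_summarize_bam_contig_depths layout:
printf 'contigName\tcontigLen\ttotalAvgDepth\tsample1.bam\tsample1.bam-var\nk141_1\t2500\t12.4\t12.4\t3.1\nk141_2\t900\t3.2\t3.2\t0.8\n' > demo_depth.txt

# List contigs sorted by average coverage, deepest first:
tail -n +2 demo_depth.txt | sort -k3,3gr | cut -f1,3
```

Contigs with near-zero depth across all samples will mostly end up unbinned.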

4. Bin the assemblies into MAGs

mkdir -p metabat2_bins

metabat2 -i megahit/contigs.fa -o metabat2_bins/bin -a output_depth.txt --numThreads 12 --seed 42
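After binning, a quick listing shows how many bins MetaBAT2 produced (it writes one FASTA per bin, named bin.1.fa, bin.2.fa, etc.). The demo creates a stand-in directory so it runs anywhere; on Roar Collab run the ls line on metabat2_bins/ instead.

```shell
# Stand-in for metabat2_bins/ so the demo is self-contained:
mkdir -p demo_bins
printf '>a\nACGT\n' > demo_bins/bin.1.fa
printf '>b\nACGTACGT\n' > demo_bins/bin.2.fa

# How many bins did we get?
ls demo_bins/*.fa | wc -l
```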

5. Check bin quality with CheckM2

conda deactivate

mamba create -n checkm2 -c bioconda -c conda-forge checkm2

conda activate checkm2

mkdir checkm2

checkm2 predict --threads 12 --input ./metabat2_bins/ --output-directory ./checkm2/ -x fa --database_path ./CheckM2_database/uniref100.KO.1.dmnd
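CheckM2 writes its summary to quality_report.tsv inside the output directory, with Name, Completeness, and Contamination as the leading columns (column order assumed here from CheckM2’s default output). A common cutoff for high-quality MAGs is >90% completeness and <5% contamination, which you can filter for with awk. Fabricated report for illustration; on Roar Collab run the awk line on checkm2/quality_report.tsv.

```shell
# Fake CheckM2-style report so the demo is self-contained:
printf 'Name\tCompleteness\tContamination\nbin.11\t97.2\t1.3\nbin.4\t55.0\t8.9\n' > demo_quality.tsv

# Print bins passing the >90% complete, <5% contaminated cutoff:
awk -F'\t' 'NR > 1 && $2 > 90 && $3 < 5 {print $1}' demo_quality.tsv
# → bin.11
```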

6. Annotate and visualize a high-quality bin (MAG) with Bakta and Proksee

Bin 11 appears to be the highest-quality bin. Download it and upload it to Proksee.

https://proksee.ca/

Run the Bakta annotation, and add it as a layer to the map.

7. (more advanced) Annotate, visualize, and analyze MAGs with Anvi’o

7.1. Run Anvi’o from your working directory using a Singularity (Apptainer) container image:

singularity shell -B </path/to/your/datafolder>:/data --pwd /data /scratch/pmt5304/dawg/fa25/w3/anvio_7.sif

7.2. Navigate and execute either the Anvi’o phylogenomics or pangenomics workflow, depending on your needs (available here):

Phylogenomics: https://merenlab.org/2017/06/07/phylogenomics/

Pangenomics: https://merenlab.org/2016/11/08/pangenomics-v2/

STOP just before running anvi-interactive, and exit the apptainer (type “exit” then hit <enter>).

7.3. Tunnel a port # of your choice for running the interactive analysis in a web browser (Chrome works best):

ssh -L 12345:localhost:12345 <userid>@submit01.hpc.psu.edu

cd $HOME/scratch/DAWG_MGS_2

nano dummy_job.sh

Enter the following script and save + exit using ctrl+o <enter>, ctrl+x <enter>:

#!/bin/bash

#SBATCH --job-name=anvio_server # Job name

#SBATCH --nodes=1 # Number of nodes

#SBATCH --ntasks=1 # Number of tasks (processes)

#SBATCH --mem=10G # Memory per node

#SBATCH --time=4:00:00 # Wall-clock time limit (HH:MM:SS)

# Commands to be executed

sleep 4h

Run the script:

sbatch dummy_job.sh

Run squeue -u $USER to find the node the job is running on (e.g. p-sc-2369).

Log in to the node:

ssh -L 12345:localhost:12345 p-sc-2369

Now you are ready to run anvi-interactive. Set the port number to the one you tunneled (12345 in this case), e.g. with anvi-interactive’s -P flag.