MAG-resolved metatranscriptomics

MetaG and metaT approaches both have limitations. In metaG, we know who is there and what they could do, but we don’t know which genes are actively expressed and to what extent. In metaT, we know which genes are actively expressed and to what extent, but we do not necessarily know who is doing the work. Thus, it is advantageous to combine the two approaches. Genome- (or MAG) resolved metaT tells us both which genes are actively transcribed/to what extent and in which organisms. So, we are able to have a sense for who is doing what in the community.

NOTE: the approach in this workshop is very basic. Please consider the specific parameters that would make sense to use for your individual project.

Getting Started

Enter the Roar dashboard and start a persistent terminal with 16 threads and 64 GB RAM
Download all the files and data:

cd $HOME/scratch

cp -r /scratch/azv5523/DAWG_wkshop/ ./

Load anaconda with module load anaconda
Make the needed conda environment and install the packages needed:

conda create -n DAWG_S26_2

conda activate DAWG_S26_2

mamba install -c bioconda -c conda-forge fastp bowtie2 samtools bakta kallisto -y

conda install -c bioconda/label/cf201901 metabat2 -y

Run fastp to QC metaG and metaT reads

mkdir fastp_qc

cd fastp_qc

mkdir metaT

mkdir metaG

cd $HOME/scratch/DAWG_wkshop

fastp –in1 reads/metaG/metaG_1.fastq.gz –in2 reads/metaG/metaG_2.fastq.gz –out1 fastp_qc/metaG/metaG_1_clean.fastq.gz –out2 fastp_qc/metaG/metaG_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16

fastp –in1 reads/metaT/metaT_1.fastq.gz –in2 reads/metaT/metaT_2.fastq.gz –out1 fastp_qc/metaT/metaT_1_clean.fastq.gz –out2 fastp_qc/metaT/metaT_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16

Take a look at the html outputs from fastp.

Assemble the metagenome with MEGAHIT

Don’t do this during the workshop. It takes hours. Use the ‘contigs.fa’ file generated from this process in the following steps.

mkdir megahit

megahit -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -o ./megahit

Map reads back to contigs with Bowtie2 and index with samtools

bowtie2-build megahit/contigs.fa contigs_index This will take ~15 mins. Run this only if there is plenty of time, and want to see the bowtie2 index format. No need to run this if not, as we already provide the SAM file.

mkdir alignments

bowtie2 -x contigs_index -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -S alignments/sample1.sam -p 12##DO NOT RUN THIS (takes hours, use output_depth.txt file we provided for step 5).

samtools view -bS alignments/sample1.sam | samtools sort -o alignments/sample1.bam –threads 12

samtools index alignments/sample1.bam

jgi_summarize_bam_contig_depths –outputDepth output_depth.txt alignments/sample1.bam

Bin the assembly into MAGs

mkdir -p metabat2_bins

metabat2 -i megahit/contigs.fa -o metabat2_bins/bin -a output_depth.txt –numThreads 16 –seed 42

Annotate the MAGs with Bakta

mkdir bakta_output

Run the bakta.sh bash script to annotate all of the MAGs with bakta. This will take hours, so use the .ffn files provided for the next steps.

Find and group the .ffn files

These files are really handy because they contain both the names of the indentified genes/CDSs and their DNA sequence for each MAG in fasta format. We will use these files to make indexes for mapping metaT reads.

cd bakta_output

mkdir ffn_files

mv *.ffn ffn_files

Use Kallisto to make indexes with the .ffn files

cd $HOME/scratch/DAWG_wkshop/

mkdir kallisto_indexes

Run the kallisto_indexes.sh bash script for making an index for each .ffn file.

Map the metaT reads back to the Kallisto indexes to quantify transcript abundance per gene per MAG

mkdir kallisto_quant

Run the kallisto_quant.sh bash script for mapping metaT reads back to the indexes to quantify transcript abundance.

This will take several hours, so we have provided you an example output for ‘bin.1’.

conda deactivate

Take a look at the output for bin.1 from the kallisto_quant.sh script:

cd kallisto_quant/bin.1

less abundances.tsv

Press ‘Q’ to exit the file view.

The ‘target IDs’ are from the bin.1.ffn file. So, to get the conversions to gene names we can understand, we can make a conversion file with the following commands:

cd $HOME/scratch/DAWG_wkshop/bakta_output/ffn_files/

cat bin.1.ffn | grep '>' > conv_intermediate.txt

sed 's/>//g' conv_intermediate.txt > conversions.txt

less conversions.txt

Now we know what the target ID’s mean.

You did it! Here are some applications for these integrated data:

Compare samples from 2 environments/treatments with differential gene expression analysis, resolved by bin. This tells you which genes are differentially expressed and which organisms are responsible. You can use DESeq2 or Sleuth.
Use KEGG or GO tools to visualize/identify expressed pathways in your sample.

MAG-resolved metatranscriptomics

Alex Vompe

2026-03-17