MetaG and metaT approaches both have limitations. In metaG, we know who is there and what they could do, but we don’t know which genes are actively expressed and to what extent. In metaT, we know which genes are actively expressed and to what extent, but we do not necessarily know who is doing the work. Thus, it is advantageous to combine the two approaches. Genome- (or MAG) resolved metaT tells us both which genes are actively transcribed/to what extent and in which organisms. So, we are able to have a sense for who is doing what in the community.

NOTE: the approach in this workshop is very basic. Please consider the specific parameters that would make sense to use for your individual project.

  1. Getting Started

cd $HOME/scratch

cp -r /scratch/azv5523/DAWG_wkshop/ ./

conda create -n DAWG_S26_2

conda activate DAWG_S26_2

mamba install -c bioconda -c conda-forge fastp bowtie2 samtools bakta kallisto -y

conda install -c bioconda/label/cf201901 metabat2 -y

  1. Run fastp to QC metaG and metaT reads

mkdir fastp_qc

cd fastp_qc

mkdir metaT

mkdir metaG

cd $HOME/scratch/DAWG_wkshop

fastp –in1 reads/metaG/metaG_1.fastq.gz –in2 reads/metaG/metaG_2.fastq.gz –out1 fastp_qc/metaG/metaG_1_clean.fastq.gz –out2 fastp_qc/metaG/metaG_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16

fastp –in1 reads/metaT/metaT_1.fastq.gz –in2 reads/metaT/metaT_2.fastq.gz –out1 fastp_qc/metaT/metaT_1_clean.fastq.gz –out2 fastp_qc/metaT/metaT_2_clean.fastq.gz –trim_poly_g –html fastp_qc/metaG_report.html –json fastp_qc/metaG_report.json –thread 16

Take a look at the html outputs from fastp.

  1. Assemble the metagenome with MEGAHIT

Don’t do this during the workshop. It takes hours. Use the ‘contigs.fa’ file generated from this process in the following steps.

mkdir megahit

megahit -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -o ./megahit

  1. Map reads back to contigs with Bowtie2 and index with samtools

bowtie2-build megahit/contigs.fa contigs_index This will take ~15 mins. Run this only if there is plenty of time, and want to see the bowtie2 index format. No need to run this if not, as we already provide the SAM file.

mkdir alignments

bowtie2 -x contigs_index -1 fastp_qc/metaG/metaG_1_clean.fastq.gz -2 fastp_qc/metaG/metaG_2_clean.fastq.gz -S alignments/sample1.sam -p 12##DO NOT RUN THIS (takes hours, use output_depth.txt file we provided for step 5).

samtools view -bS alignments/sample1.sam | samtools sort -o alignments/sample1.bam –threads 12

samtools index alignments/sample1.bam

jgi_summarize_bam_contig_depths –outputDepth output_depth.txt alignments/sample1.bam

  1. Bin the assembly into MAGs

mkdir -p metabat2_bins

metabat2 -i megahit/contigs.fa -o metabat2_bins/bin -a output_depth.txt –numThreads 16 –seed 42

  1. Annotate the MAGs with Bakta

mkdir bakta_output

Run the bakta.sh bash script to annotate all of the MAGs with bakta. This will take hours, so use the .ffn files provided for the next steps.

  1. Find and group the .ffn files

These files are really handy because they contain both the names of the indentified genes/CDSs and their DNA sequence for each MAG in fasta format. We will use these files to make indexes for mapping metaT reads.

cd bakta_output

mkdir ffn_files

mv *.ffn ffn_files

  1. Use Kallisto to make indexes with the .ffn files

cd $HOME/scratch/DAWG_wkshop/

mkdir kallisto_indexes

Run the kallisto_indexes.sh bash script for making an index for each .ffn file.

  1. Map the metaT reads back to the Kallisto indexes to quantify transcript abundance per gene per MAG

mkdir kallisto_quant

Run the kallisto_quant.sh bash script for mapping metaT reads back to the indexes to quantify transcript abundance.

This will take several hours, so we have provided you an example output for ‘bin.1’.

conda deactivate

Take a look at the output for bin.1 from the kallisto_quant.sh script:

cd kallisto_quant/bin.1

less abundances.tsv

Press ‘Q’ to exit the file view.

The ‘target IDs’ are from the bin.1.ffn file. So, to get the conversions to gene names we can understand, we can make a conversion file with the following commands:

cd $HOME/scratch/DAWG_wkshop/bakta_output/ffn_files/

cat bin.1.ffn | grep '>' > conv_intermediate.txt

sed 's/>//g' conv_intermediate.txt > conversions.txt

less conversions.txt

Now we know what the target ID’s mean.

  1. You did it! Here are some applications for these integrated data: