nf-core/ncrnannotator
nf-core pipeline for genome-level ncRNA annotation using Infernal

infernal, ncrna, rfam

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/ncrnannotator

Introduction

This document describes the output produced by ncrnannotator. All paths are relative to the --outdir directory specified at runtime.

Pipeline overview

ncrnannotator runs the following steps:

- FILTER_RFAM_CM: filter Rfam covariance models to clade-specific accessions (skipped in `mgnify-assembly` and `full` modes)
- GENOME_CHUNK: split the genome into non-overlapping windows for parallel processing
- CMSEARCH: search each chunk against Rfam covariance models using Infernal
- PARSE_RFAM: consolidate results, remove overlapping hits, and apply GA score thresholds
- RFAM_TO_FORMATS: convert hits to GTF, GFF3, and BED annotation files
- MultiQC: aggregate software versions and pipeline metrics into a report
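
The chunking step can be pictured with a short sketch. The function names `chunk_windows` and `to_genome_coords` and the window size are hypothetical and not taken from the pipeline. The sketch only illustrates how non-overlapping 1-based windows can tile a sequence and how a hit found inside a chunk could be shifted back to genome coordinates.

```python
def chunk_windows(seq_length, window_size):
    """Return 1-based inclusive (start, end) windows that tile a sequence without overlap."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    start = 1
    while start <= seq_length:
        # The last window is truncated at the end of the sequence.
        end = min(start + window_size - 1, seq_length)
        windows.append((start, end))
        start = end + 1
    return windows


def to_genome_coords(chunk_start, hit_start, hit_end):
    """Shift chunk-relative hit coordinates (1-based) back to genome coordinates."""
    offset = chunk_start - 1
    return hit_start + offset, hit_end + offset
```

For a 10 bp sequence and a window size of 4, `chunk_windows(10, 4)` yields three windows, the last one shorter than the others.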

Output directories

`rfam/`

Intermediate Rfam files published for inspection and reuse.

Output files

rfam/
- rfam_filtered.cm — Rfam covariance model file filtered to clade-specific accessions (not present in mgnify-assembly or full mode)
- rfam_hits.tsv — Tab-separated table of all final ncRNA hits after overlap removal and score filtering

`rfam_hits.tsv` format

| Column | Description |
| --- | --- |
| `seqname` | Sequence name from the input FASTA |
| `start` | Hit start position (1-based) |
| `end` | Hit end position (1-based, inclusive) |
| `strand` | Strand (`+` or `-`) |
| `score` | Infernal bit score |
| `evalue` | E-value |
| `query_name` | Rfam model name (e.g. `U1`, `5S_rRNA`) |
| `accession` | Rfam accession (e.g. `RF00003`) |
| `biotype` | Ensembl-style biotype (e.g. `snRNA`, `snoRNA`, `rRNA`, `lncRNA`) |
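
As a minimal sketch of how this table might be consumed downstream, the following reads the nine columns and counts hits per biotype. It assumes the columns appear in the order listed above. Whether the file starts with a header row is not stated here, so the sketch skips one if present. The helper names `read_hits` and `count_by_biotype` are hypothetical.

```python
import csv
from collections import Counter

# Column order assumed to match the table above.
COLUMNS = ["seqname", "start", "end", "strand", "score",
           "evalue", "query_name", "accession", "biotype"]


def read_hits(lines):
    """Parse rfam_hits.tsv rows into dicts with typed coordinates and scores."""
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and a header row, if the file has one.
        if not row or row[0] == "seqname":
            continue
        hit = dict(zip(COLUMNS, row))
        hit["start"] = int(hit["start"])
        hit["end"] = int(hit["end"])
        hit["score"] = float(hit["score"])
        hit["evalue"] = float(hit["evalue"])
        hits.append(hit)
    return hits


def count_by_biotype(hits):
    """Count hits per biotype."""
    return Counter(hit["biotype"] for hit in hits)
```

An open file handle can be passed directly, for example `read_hits(open("rfam/rfam_hits.tsv"))` with the path taken relative to `--outdir`.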

`annotation/`

Final annotation files ready for use in genome browsers and downstream analysis.

Output files

annotation/
- annotation.gtf — GTF v2.2 annotation with gene/transcript/exon features
- annotation.gff3 — GFF3 annotation with gene/ncRNA/exon hierarchy
- annotation.bed — 6-column BED file (0-based start coordinates)

GTF format

Ensembl-style GTF with three feature types per hit: gene, transcript, exon. Key attributes:

- `gene_id`: sequential identifier (e.g. `rfam_gene_000001`)
- `gene_name`: Rfam model name
- `gene_biotype`: Ensembl biotype (e.g. `snRNA`, `snoRNA`, `rRNA`)
- `rfam_accession`: Rfam family accession
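
The attributes above can be pulled out of the ninth GTF column with a small parser. This is a sketch that assumes the usual GTF attribute syntax (`key "value";` pairs) and tab-separated columns. The helper names `parse_gtf_attributes` and `gene_records` are hypothetical.

```python
import re

# Matches one `key "value";` pair in a GTF attribute column.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(field):
    """Turn a GTF attribute column into a dict of key -> value."""
    return dict(ATTRIBUTE_RE.findall(field))


def gene_records(lines):
    """Yield (seqname, start, end, strand, attributes) for each gene feature."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed gene rows
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], parse_gtf_attributes(cols[8])
```

Filtering on `gene_biotype` or `rfam_accession` then becomes a dictionary lookup on the last element of each record.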

GFF3 format

Standard GFF3 with gene → ncRNA (or biotype-specific type) → exon hierarchy. Feature types follow Sequence Ontology conventions.
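
To inspect the parent/child hierarchy, the `ID` and `Parent` attributes of each feature can be collected into a lookup table. The sketch below relies only on standard GFF3 attribute syntax (`key=value` pairs separated by `;`, multiple values separated by `,`). The helper names and the feature IDs used to try it out are hypothetical and not the pipeline's actual identifiers.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_gff3_attributes(field):
    """Parse a GFF3 attribute column into a dict of key -> list of values."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # Split on commas first, then decode percent-escapes in each value.
        attrs[key.strip()] = [unquote(v) for v in value.strip().split(",")]
    return attrs


def children_by_parent(lines):
    """Map each parent ID to the (feature type, ID) pairs of its direct children."""
    tree = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_gff3_attributes(cols[8])
        feature_id = attrs.get("ID", [None])[0]
        for parent in attrs.get("Parent", []):
            tree[parent].append((cols[2], feature_id))
    return dict(tree)
```

Walking from a gene ID through this mapping gives its transcript-level feature, and from there its exon.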

BED format

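6-column BED (chrom, chromStart, chromEnd, name, score, strand). As in standard BED, chromStart is 0-based and chromEnd is exclusive, so a hit at 1-based positions 101 to 200 in `rfam_hits.tsv` has chromStart 100 and chromEnd 200. Score is clamped to 0–1000.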
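
The coordinate shift and score clamp can be sketched as follows. The function name `hit_to_bed` is hypothetical, and rounding the score to an integer is an assumption, since the exact rounding used by the pipeline is not documented here.

```python
def hit_to_bed(seqname, start, end, name, score, strand):
    """Convert a 1-based inclusive hit into a 6-column BED line."""
    chrom_start = start - 1  # BED starts are 0-based
    chrom_end = end          # BED ends are exclusive, so the inclusive 1-based end is unchanged
    # Assumption: round to an integer, then clamp into the 0-1000 range.
    bed_score = max(0, min(1000, int(round(score))))
    return "\t".join(
        [seqname, str(chrom_start), str(chrom_end), name, str(bed_score), strand]
    )
```

With this convention the feature length is simply `chromEnd - chromStart`, which equals `end - start + 1` in the 1-based table.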

`multiqc/`

Output files

multiqc/
- multiqc_report.html — standalone HTML report viewable in any browser
- multiqc_data/ — parsed statistics from all pipeline steps
- multiqc_plots/ — static images from the report

MultiQC aggregates software versions and pipeline metrics into a single report.

`pipeline_info/`

Output files

pipeline_info/
- execution_report.html — Nextflow execution report (resource usage per task)
- execution_timeline.html — timeline of all tasks
- execution_trace.txt — tab-separated trace of all tasks
- nf_core_ncrnannotator_software_mqc_versions.yml — software versions used
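
A quick way to summarise the trace file is to count tasks by status. The sketch below assumes the trace has a header row with `name` and `status` columns, as in Nextflow's default trace fields. If the pipeline configures different fields, the column names need adjusting. The helper names are hypothetical.

```python
import csv
from collections import Counter


def tasks_by_status(lines):
    """Count tasks per value of the `status` column in a tab-separated trace."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


def failed_task_names(lines):
    """Return the names of tasks whose status is FAILED."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row["name"] for row in reader if row["status"] == "FAILED"]
```

Both helpers accept any iterable of lines, including an open handle on `pipeline_info/execution_trace.txt`.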

Biotype reference

The following biotypes are assigned based on Rfam model names and seed classifications:

| Biotype | Examples |
| --- | --- |
| `snRNA` | U1, U2, U4, U5, U6, U11, U12 |
| `snoRNA` | SNORD, SNORA families |
| `scaRNA` | scaRNA families |
| `rRNA` | 5S_rRNA, 5_8S_rRNA, SSU_rRNA_eukarya, LSU_rRNA_eukarya |
| `rRNA` (prokaryotic, `mgnify-assembly` only) | SSU_rRNA_bacteria, LSU_rRNA_archaea, etc. |
| `tRNA` | tRNA families |
| `pre_miRNA` | miRNA precursors |
| `lncRNA` | Long non-coding RNA |
| `SRP_RNA` | Signal recognition particle RNA |
| `RNase_P_RNA` | RNase P families |
| `vault_RNA` | Vault RNA |
| `Y_RNA` | Y RNA families |
| `ribozyme` | Ribozyme families |
| `antisense_RNA` | Antisense RNA |
| `ncRNA` | Other non-coding RNA (default) |
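
To illustrate the idea of name-based assignment with a default, here is a deliberately small sketch built only from the example names in the table. It is not the pipeline's actual rule set, which also draws on seed classifications. The function name `assign_biotype` and the matching rules (exact names, prefixes, substring) are assumptions made for illustration.

```python
# Example names taken from the table above; the matching rules are illustrative only.
SNRNA_NAMES = {"U1", "U2", "U4", "U5", "U6", "U11", "U12"}
SNORNA_PREFIXES = ("SNORD", "SNORA")


def assign_biotype(model_name):
    """Map an Rfam model name to a biotype, falling back to the default `ncRNA`."""
    if model_name in SNRNA_NAMES:
        return "snRNA"
    if model_name.startswith(SNORNA_PREFIXES):
        return "snoRNA"
    if "rRNA" in model_name:
        return "rRNA"
    return "ncRNA"  # default for anything not matched above
```

Names that match none of the rules fall through to `ncRNA`, mirroring the default row of the table.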
