# Usage

## Simple self-test

With Illumina test data:
```bash
nextflow run main.nf                                      \
        -profile staphylococcus_aureus,illumina,apptainer \
        -config nextflow.config                           \
        --csv assets/test_data/samplelist.csv
```

With ONT test data:
```bash
nextflow run main.nf                                      \
        -profile staphylococcus_aureus,nanopore,apptainer \
        -config nextflow.config                           \
        --csv assets/test_data/samplelist_nanopore.csv
```
Remember to edit the paths to the test file(s) in the samplelist.
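
Before launching, it can help to confirm that every path in the samplelist exists. The following is a minimal sketch, not part of the pipeline: `check_samplelist` is a hypothetical helper, and it assumes the column layout described under "Input file format" (`id,platform,sequencing_run,read1[,read2]`) with no commas inside paths.

```bash
# Hypothetical helper: list every read path in a samplelist that does not exist.
# Assumes columns id,platform,sequencing_run,read1[,read2] and a header line.
check_samplelist() {
    tail -n +2 "$1" | tr -d '\r' | {
        missing=0
        # The extra test keeps a final line that lacks a trailing newline.
        while IFS=, read -r id platform run read1 read2 || [ -n "$id" ]; do
            for f in "$read1" "$read2"; do
                [ -z "$f" ] && continue        # single-end rows have no read2
                if [ ! -f "$f" ]; then
                    echo "missing: ${id}: ${f}"
                    missing=$((missing + 1))
                fi
            done
        done
        # Exit status is 0 only when no path was missing.
        [ "$missing" -eq 0 ]
    }
}
```

For example, `check_samplelist assets/test_data/samplelist.csv` prints one line per missing file and returns a non-zero status if any path is wrong.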

## Usage arguments

| Argument type       | Options                                                                                                | Required |
| ------------------- | ------------------------------------------------------------------------------------------------------ | -------- |
| -profile (species)  | staphylococcus_aureus/escherichia_coli/mycobacterium_tuberculosis/klebsiella/streptococcus_pyogenes/streptococcus | True     |
| -profile (platform) | illumina/nanopore/iontorrent                                                                           | True     |
| -profile (RLS)      | development/diagnostic/validation                                                                      | False    |
| -config             | nextflow.config                                                                                        | True     |
| -resume             | NA                                                                                                     | False    |
| --output            | user-specified                                                                                         | False    |

RLS = Release life cycle (default: diagnostic)
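
As an illustration of how the arguments in the table combine, the sketch below assembles a profile string and prints the resulting command as a dry run. The `apptainer` profile and `--csv` are taken from the self-test commands above; `samplelist.csv` and `results` are placeholders, and the exact meaning of the `--output` value is an assumption.

```bash
species="escherichia_coli"   # one of the species profiles in the table
platform="illumina"          # illumina, nanopore or iontorrent
rls="validation"             # optional; the default is diagnostic
profile="${species},${platform},${rls},apptainer"

# Dry run: print the command instead of executing it.
# Remove the leading `echo` to launch the pipeline.
echo nextflow run main.nf \
    -profile "$profile" \
    -config nextflow.config \
    -resume \
    --csv samplelist.csv \
    --output results
```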

## Input file format

For short reads:
```{csv-table} Example of a *samplelist* input file in CSV format for paired-end short reads.
:header-rows: 1

id,platform,sequencing_run,read1,read2
sample01,illumina,seqrun0123,path_to_reads/sample01_forward.fastq.gz,path_to_reads/sample01_reverse.fastq.gz
```

For long reads (ONT):
```{csv-table} Example of a *samplelist* input file in CSV format for long reads.
:header-rows: 1

id,platform,sequencing_run,read1
sample01,nanopore,seqrun0123,path_to_reads/sample01.fastq.gz
```

For long reads, we recommend FASTQ files that were basecalled using the SUP (super-accuracy) model.
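
A short-read samplelist in the format shown above can be generated from a directory of reads. This is a sketch under stated assumptions: `make_samplelist` is a hypothetical helper, and the `_forward.fastq.gz` / `_reverse.fastq.gz` suffixes simply mirror the example row, so adjust them to your own naming scheme.

```bash
# Hypothetical helper: build a paired-end samplelist from a directory of reads.
# Assumes files are named <id>_forward.fastq.gz and <id>_reverse.fastq.gz.
make_samplelist() {
    reads_dir="$1"
    seq_run="$2"
    echo "id,platform,sequencing_run,read1,read2"
    for fwd in "$reads_dir"/*_forward.fastq.gz; do
        [ -e "$fwd" ] || continue     # no match: the glob stays literal
        rev="${fwd%_forward.fastq.gz}_reverse.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "skipping $fwd: no reverse mate found" >&2
            continue
        fi
        id="$(basename "$fwd" _forward.fastq.gz)"
        echo "${id},illumina,${seq_run},${fwd},${rev}"
    done
}
```

For example, `make_samplelist path_to_reads seqrun0123 > samplelist.csv` writes one row per complete read pair and reports unpaired files on standard error.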

## Downsampling reads

There is an option to use [seqtk](https://github.com/lh3/seqtk) to downsample the number of reads for a sample as a preprocessing step before all other analyses. This can be useful if a sample was sequenced too deeply, as extreme sequencing depth can cause issues with *de-novo* assemblies.

Activate downsampling by setting the parameter `target_sample_size` in the config to either the desired number of reads or the fraction of reads to include.

## Removing human reads

There is an option to use [hostile](https://github.com/bede/hostile) to filter human reads from further analyses. This can be useful if a sample has been contaminated, which could cause issues with *de-novo* assemblies.

Activate human read depletion by setting the parameter `use_hostile` to `true` in the config.

## Adapter and quality trimming (Illumina)

There is an option to use [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) to trim adapters and low-quality bases from Illumina reads as a preprocessing step. This feature is turned off by default.

Activate Trimmomatic by setting `use_trimmomatic` to `true` in the config (Illumina platform only). Customise the trimming steps via `trimmomatic_args` (default: `LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36`); for adapter trimming, append an `ILLUMINACLIP:<adapter.fa>:2:30:10` step to the args.
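
The optional preprocessing steps described in the three sections above could be switched on with a config fragment like the one printed below. This is a sketch: it assumes the options live in the standard Nextflow `params` scope, so check `nextflow.config` for where they are actually defined. `print_preprocessing_fragment` is a hypothetical helper and the values are only examples.

```bash
# Hypothetical helper: print an example config fragment for review.
# Assumption: the pipeline reads these options from the `params` scope.
print_preprocessing_fragment() {
    cat <<'EOF'
params {
    target_sample_size = 0.5    // fraction of reads to keep, or a read count
    use_hostile        = true   // deplete human reads
    use_trimmomatic    = true   // Illumina platform only
    trimmomatic_args   = "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
}
EOF
}

print_preprocessing_fragment
```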

# Output

* `postalignqc` output: statistics are computed using only the core genome.
* Coverage uniformity is calculated by dividing the interquartile range of coverage by the median coverage; a lower value indicates more uniform coverage of the genome.
* `coverage` output: statistics are computed using the whole genome (and plasmids, if they are part of the reference genome).
* Polishing of genome assemblies created from ONT data is done in two rounds, with the bacterial methylation model as the default.
* Variants reported by Freebayes are used for masking the genome before performing cgMLST analysis (default: true for Illumina data, false for ONT data). They are computed by aligning reads to the assembly, not to the reference genome. When the masking step is run, these variants are also reported in the output file `analysis_result/*_result.json`.
* Gambitcore identifies the closest species and assesses the completeness of the assembly; a detailed description of the output can be found [here](https://github.com/SMD-Bioinformatics-Lund/gambitcore).
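
The coverage uniformity metric listed above (interquartile range divided by median coverage) can be illustrated with a small worked example. This is a sketch, not the pipeline's implementation: `coverage_uniformity` is a hypothetical helper, the input is assumed to be a file with one depth value per line, and quartiles are taken by linear interpolation, which may differ from the method the pipeline uses.

```bash
# Hypothetical helper: IQR / median for a file with one depth value per line.
coverage_uniformity() {
    sort -n "$1" | awk '
        { d[NR] = $1 }
        # Quantile by linear interpolation between the two nearest ranks.
        function q(p,    h, lo, frac) {
            h = (NR - 1) * p + 1
            lo = int(h)
            frac = h - lo
            if (lo >= NR) return d[NR]
            return d[lo] + frac * (d[lo + 1] - d[lo])
        }
        END {
            # Undefined for empty input or a median of zero.
            if (NR == 0 || q(0.5) == 0) { print "NA"; exit }
            printf "%.3f\n", (q(0.75) - q(0.25)) / q(0.5)
        }'
}
```

For depths of 10, 20, 30, 40 and 50, the quartiles are 20 and 40 and the median is 30, giving (40 - 20) / 30 = 0.667. Perfectly even coverage gives 0.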