matthew_stone | 2 years ago
Here's a simple scatter-gather example. Let's say you want to count the number of lines in each file for a list of samples, and report a table of counts collected from each sample. Define a rule to process each input file, and a rule to collect the results.
I find this much less complex than an equivalent bash workflow. These rules are also easy to containerize, the workflow parallelizes across samples, it survives interruption, and new samples can be added without redoing finished work. Snakemake checks which output files already exist and runs only the rules needed to create the missing ones, logic that is much more finicky to implement by hand in bash.
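To see why that up-to-date logic gets finicky in bash, here is a minimal sketch of the freshness check Snakemake performs for every rule automatically: rebuild an output only if it is missing or older than its input. The file names (data.txt, counts.txt) are made up for illustration.

```shell
# Hypothetical input file
printf 'a\nb\n' > data.txt

out=counts.txt
# Rebuild only if the output is missing, or the input is newer than it
if [ ! -e "$out" ] || [ data.txt -nt "$out" ]; then
    wc -l < data.txt > "$out"
fi
cat "$out"
```

And this covers just one input/output pair; Snakemake applies the same check across every edge of the workflow graph.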
with open('data/samples.txt') as slist:
    SAMPLES = [l.strip() for l in slist.readlines()]

rule all:
    input:
        "results/line_counts.txt"

rule count_lines:
    input:
        "data/lines/{sample}.txt"
    output:
        "processed/count_lines/{sample}.txt"
    shell:
        """
        cat {input} |
        wc -l |
        paste <(echo -e {wildcards.sample}) - > {output}
        """

rule collect_counts:
    input:
        expand("processed/count_lines/{sample}.txt", sample=SAMPLES)
    output:
        "results/line_counts.txt"
    shell:
        """
        cat <(echo -e "sample\tn_lines") {input} > {output}
        """
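As a sanity check, the one-liner inside count_lines can be tried on its own outside Snakemake. This sketch uses a made-up sample name (s1) and a throwaway three-line input file; note that <(...) is bash process substitution, so it needs bash rather than plain sh.

```shell
# Hypothetical three-line input file for sample "s1"
printf 'line1\nline2\nline3\n' > s1.txt

# Count its lines and prefix the count with the sample name, tab-separated,
# mirroring what count_lines does with {input}, {wildcards.sample}, {output}
cat s1.txt | wc -l | paste <(echo -e s1) - > s1.count.txt
cat s1.count.txt
```

The result is one tab-separated row per sample, which is exactly what collect_counts concatenates under the "sample\tn_lines" header.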