datacarpentry / wrangling-genomics

Data Wrangling and Processing for Genomics

Home Page: https://datacarpentry.org/wrangling-genomics/

License: Other

Shell 100.00%
carpentries data-carpentry lesson shell data-wrangling programming english genomics stable

wrangling-genomics's Introduction


Wrangling Genomics

Lesson on quality control and wrangling of genomics data. This repository is maintained by Josh Herr, Ming Tang, and Fotis Psomopoulos.

The Amazon public AMI for this tutorial is "dataCgen-qc".

Background

Wrangling Genomics trains novice learners on a variant calling workflow. Participants will learn how to evaluate sequence quality and what to do when it is poor. We then cover aligning reads to a genome and calling variants, discuss the file formats encountered along the way, and visualize the results. Finally, we cover how to automate the whole process by building a shell script.

This lesson is part of the Data Carpentry Genomics Workshop.

Contribution

Code of Conduct

All participants should agree to abide by the Data Carpentry Code of Conduct.

Authors

Wrangling genomics is authored and maintained by the community.

Citation

Please cite as:

Erin Alison Becker, Taylor Reiter, Fotis Psomopoulos, Sheldon John McKay, Jessica Elizabeth Mizzi, Jason Williams, … Winni Kretzschmar. (2019, June). datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260609

wrangling-genomics's People

Contributors

agt24, aschuerch, binxiepeterson, bluegenes, crazyhottommy, dbmarchant, dinindusenanayake, erinbecker, fmichonneau, fpsom, gaiusjaugustus, hadrieng, hoytpr, jasonjwilliamsny, jessicalumian, kate-crosby, kcranston, kweitemier, laninsky, lexnederbragt, mckays630, mfernandes61, nielinfante, peterjc, raynamharris, taylorreiter, tobyhodges, tracykteal, vlrieg, zkamvar


wrangling-genomics's Issues

Why does setup exist?

This seems to be in the SWC format. This lesson comes midstream in whatever workshop flavor it is taught in, so there should be nothing for the user to download or install. Unless the original intent of teaching everything on the cloud/HPC is changing, this should be part of the workshop webpage template, or live in a lesson that addresses installation for learners running these tools elsewhere.

remove outdated files on the AMI

The trimmed FASTQs produced when we run Trimmomatic on our samples:

-rw-rw-r-- 1 dcuser dcuser 762M Nov 7 23:35 SRR097977.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 2.7G Nov 7 23:36 SRR098026.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 793M Nov 7 23:36 SRR098027.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 2.7G Nov 7 23:37 SRR098028.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 3.1G Nov 7 23:39 SRR098281.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 3.0G Nov 7 23:40 SRR098283.fastq_trim.fastq

The trimmed FASTQs that are in the variant_calling/data/ directory (produced from the variant_calling.tar.gz file):

-rw-rw-r-- 1 dcuser dcuser 46M Jul 31 2015 SRR097977.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098026.fastq
-rw-rw-r-- 1 dcuser dcuser 45M Jul 31 2015 SRR098027.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098028.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098281.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098283.fastq

Using bwa mem instead of bwa aln in 02-variant_calling.md

In datacarpentry/wrangling-genomics/_episodes/02-variant_calling.md, bwa aln is taught instead of bwa mem. I see the note about bwa mem, but I think most people taking this workshop will be working with reads >50 bp. Teaching bwa mem instead of bwa aln > bwa samse would simplify the learning process, save lesson time, and give learners a better understanding of the pipeline they will probably use when they align their own data. Thoughts?
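
For illustration, a rough sketch of the two approaches (paths follow the lesson's layout; the .sai output path is illustrative):

# current two-step approach for short reads
$ bwa aln data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq/SRR097977.fastq > SRR097977.aligned.sai
$ bwa samse data/ref_genome/ecoli_rel606.fasta SRR097977.aligned.sai data/trimmed_fastq/SRR097977.fastq > results/sam/SRR097977.aligned.sam
# proposed single step with bwa mem (recommended for reads of roughly 70 bp and longer)
$ bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq/SRR097977.fastq > results/sam/SRR097977.aligned.sam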

Typos in 00-quality-control.md

TYPOS found in 00-quality-control.md

line 49: "its" should be "it's"
line 79: "eg" should be "e.g."
line 144: "begining" should be "beginning"
line 424: remove one of the "four"
line 427: "unziped" should be "unzipped"
line 560: "termperature" should be "temperature"

Remove FASTQ encoding variants; explain ASCII more

The history of FASTQ encodings and the legacy Solexa/Illumina variants is a distraction, and in my opinion can be removed from the QC lesson:

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/_episodes/00-readQC.md

Furthermore, since the lesson is under CC-BY, there is a licensing problem with lifting the CC-BY-SA table from https://en.wikipedia.org/wiki/FASTQ_format (without even giving the source URL, which would be the minimum form of attribution).

I would instead reduce this section to focus on introducing ASCII and the idea that characters have an associated numerical code, which is used with an offset to store the quality score as one character per base.

The different encodings can be reduced to a footnote or remark, with a reference to the Wikipedia page https://en.wikipedia.org/wiki/FASTQ_format or our paper https://doi.org/10.1093%2Fnar%2Fgkp1137

(And then move on to explain PHRED scores)
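
To illustrate the ASCII-plus-offset idea, a minimal shell sketch (not part of the lesson):

$ printf '%d\n' "'I"     # ASCII code of the quality character I
73
$ echo $((73 - 33))      # subtract the Sanger/Illumina 1.8+ offset of 33 to get the PHRED score
40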

Lesson "automation": better syntax for variable handling and a typo/mistake?

I have two issues with the automation course:

First issue:
IMHO the method shown for concatenating a bash variable with plain text is not the recommended one. See this excerpt from the lesson:

$ for infile in *.fastq
do
outfile=$infile_trim.fastq
java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
done

This kind of operation (the outfile assignment in the example above) should be done via the ${variable} syntax, not via an escape character (i.e. $variable\_plain_text). Earlier lessons teach the ${variable} syntax, so for consistency I think this should be changed here as well. Or am I missing something?
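
For reference, a one-line sketch of the braces form (assuming the loop variable is infile):

outfile=${infile}_trim.fastq    # braces mark where the variable name ends, so no escaping is needed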

Second issue:
I might have found a typo/mistake, but I'm not 100% sure and I have no VM to test it on (anymore). So if someone could confirm, I'd be most obliged. In the exercise quoted below:

Vizualize the alignment of the reads for our SRR098281.fastq_trim.fastq_small sample. What variant is present at position 145? What is the canonical nucleotide in that position?

Shouldn't this be position 146? Or did I miscount or became confused between 0-counting programs and 1-counting programs?

Kind regards,
Dennis

Add locations of true and false positive variant calls

As per #25:

explain how to inspect alignments to see whether a variant is indeed called correctly, maybe point to a variant that did not pass filtering

This requires someone who has run the pipeline to have a look in IGV to find a true variant and a false positive variant (assuming there is one) and add their locations to the end of 01-variant_calling_workflow.md, maybe even including screenshots from IGV and samtools tview.
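
For whoever picks this up, one way to spot-check a position without leaving the terminal (a sketch; the reference name, position, and file paths are placeholders following the lesson layout, and the BAM must be sorted and indexed):

$ samtools tview -p <reference_name>:<position> results/bam/SRR098281.aligned.sorted.bam data/ref_genome/ecoli_rel606.fasta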

Access to files for self-learners (and maintainers)

How would self-learners interested in going over this material be able to run the analyses? Can we give them access to the input files? An Amazon image?

Similarly, if I want to address some remaining issues, I would need to do many of the steps normally run at a workshop, or at least get my hands on the output files...

Automating a workflow - referencing earlier parts of the workshop (Issue Bonanza)

To make the narrative independent from the workshop schedule and more applicable for self-learners, I suggest replacing references to parts of the day with references to the lesson itself.

"Now, let’s do something real. First, recall the code from our our fastqc workflow from this morning, with a few extra “echo” statements."
Replace by
"Now, let’s do something real. First, recall the code from our our quality control workflow, with a few extra “echo” statements."

"3) Bonus points: Use something you learned yesterday to save the output of the script to a file while it is running."
Replace by
"3) Bonus points: Save the output of the script to a file while it is running (hint: redirection)"

Stage your data section redundant?

Episode: Quality Control

The Wrangling episodes (in my mind) come after the project organization lesson. There will need to be harmonization between these sections, assuming that certain directories/prerequisites exist. I suggest rephrasing so that learners confirm they have set up the appropriate directories.

Variant Calling Workflow Issue Bonanza

Objectives:

Use a For loop from the previous lesson to help automate repetitive tasks

  • Remove, as this is not done in this lesson.

Lesson:

  • rephrase learning objectives as 'be able to' etc

  • The first commands are misformatted

NOTE: This only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.

  • Make it a Callout using the '{: .callout}' syntax? We could use Callout boxes more throughout.

  • Explain the general setup of commands, e.g. bwa aln reference_genome readfile, see #55 and #56

The SAM file, is a tab-delimited text file

  • Links to this github page, find a better alternative.

  • Explain why we would want to make a binary version of a file in general, and of the SAM file in particular; see #57 and the sketch after this list

Filter the SNPs for the final output in VCF format, using vcfutils.pl:

  • Format vcfutils.pl as inline code

IGV requires the .bai file to be in the same location as the .bam file that is loaded into IGV, but there is no direct use for that file.

  • Isn't there? See #58

  • explain the functionality, coloring and GUI elements of IGV in more detail. See #59

  • explain how to inspect alignments to see whether a variant is indeed called correctly; maybe point to a variant that did not pass filtering. See #60

Key Points

  • missing
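
On the SAM-to-BAM point above, a minimal sketch of why and how (paths follow the lesson layout and are illustrative):

$ samtools view -S -b results/sam/SRR097977.aligned.sam > results/bam/SRR097977.aligned.bam    # BAM is a compressed binary SAM: smaller and faster to parse
$ samtools sort results/bam/SRR097977.aligned.bam -o results/bam/SRR097977.aligned.sorted.bam  # sort by coordinate
$ samtools index results/bam/SRR097977.aligned.sorted.bam                                      # the .bai index enables fast random access (e.g. in IGV)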

Clarifying human readable for new users

On the exercise starting at line 215, the "Hint" is at least partially about learning the -h option. But to avoid confusion, the use of "M" and "G" should be clarified as MB and GB. Suggest that line 238 read:
"There are six FASTQ files ranging from 840M (840MB) to 4.0G (4.0GB)".

Justification: New learners may have never seen the -h option before, and need to keep MB and GB distinct from Mbp or Gbp. Consistency in these designations will be important at this point in the lecture.

Add instructor notes document for this lesson

I'm working on helping direct instructor attention towards fixing up/contributing to instructor notes. I currently don't have a link to provide for instructor notes for this lesson. Please add one - even a blank document would be something to point towards.

Quality Control Issue Bonanza

Quality Control

link

  • Fix:
    Teaching: 0 min
    Exercises: 0 min

Are my data good enough?

Objectives:
• Clean (trim adaptors and low-quality bases from the) FastQ reads using Trimmomatic
• Employ for loops to clean multiple FastQ files.

The first step in the variant calling work flow is to take the FASTQ files received from the sequencing facility and assess the quality of the sequence reads.

  • Workflow or work flow?

Although it looks complicated (and maybe it is), its easy to understand the fastq format with a little decoding. Some rules about the format include…

  • It is

so for example in our data set, one complete read is:

For example,

$ head -n4 SRR098281.fastq 
  • Does the data need to be downloaded before the class?
  • Detailed instructions are needed.
[Quality-encoding diagram from the Wikipedia FASTQ format article: ASCII characters 33–126 mapped to quality-score ranges for the Sanger (S), Solexa (X), Illumina 1.3+ (I), Illumina 1.5+ (J), and Illumina 1.8+ (L) encodings.]
  • Need to color the scores as in the wiki page. I do not know how to add colors... maybe a direct screenshot would work? A link to the FASTQ format wiki page is needed.

https://en.wikipedia.org/wiki/FASTQ_format

Using the quality encoding character legend, the first nucleotide in the read (C) is called with a quality score of 31 and our Ns are called with a score of 2. As you can tell by now, this is a bad read.

  • How do we define a bad read? Low quality and many Ns?

  • need to explain the Phred+33 and Phred+64 encoding offsets

Running FASTQC

  • Some text here (needs to be filled in)

The sample data we will be working with is in a hidden directory (placing a ‘.’ in front of a directory name hides the directory). In the next step we will move some of those hidden files into our new directories to start our project.

  • why does the data need to be hidden?

B. Run FastQC

  • need to download FastQC first (see http://www.datacarpentry.org/wrangling-genomics/setup/)

  • should we provide the output directory in the fastqc commands (-o ~/dc_workshop/results/fastqc_untrimmed_reads/) rather than generate the results in the current data folder and then move the results to the results folder?
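
For the second option above, a sketch (the output directory must already exist before FastQC is run):

$ fastqc *.fastq -o ~/dc_workshop/results/fastqc_untrimmed_reads/    # write the reports straight into the results folder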

How to clean reads using Trimmomatic

  • Some text here (needs to be filled in)

A. detailed explanation of features

  • Need to provide an ILLUMINACLIP:SRR_adapters.fa file in the repo

java -jar calls the Java program, which is needed to run trimmomargumentstic

  • trimmomargumentstic?
$ for infile in *.fastq
>do
>outfile=$infile\_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done
  • Always double-quote your bash variables and use {}
$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done

Variant Calling Workflow

Objectives

  • Use a For (for) loop from the previous lesson to help automate repetitive tasks

  • Becoming (Become) familiar with data formats encountered during variant calling

Your directory structure should now look like this:

dc_workshop
├── data
│   ├── ref_genome
│   │   └── ecoli_rel606.fasta
│   ├── untrimmed_fastq
│   └── trimmed_fastq
│       ├── SRR097977.fastq
│       ├── SRR098026.fastq
│       ├── SRR098027.fastq
│       ├── SRR098028.fastq
│       ├── SRR098281.fastq
│       └── SRR098283.fastq
├── results
└── docs

  • Does the tree command need to be introduced?

We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and while there is no gold standard there are some tools that are better suited for particular NGS analyses.

  • >We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and while there is no gold standard, there are some tools that are better suited for particular NGS analyses.

Index the reference genome

$ bwa index data/ref_genome/ecoli_rel606.fasta     # This step helps with the speed of alignment
  • How long does it take for an E. coli genome? My experience with the human genome is about 2 hours.

Explore the information within your SAM file:

$ head results/sam/SRR097977.aligned.sam
  • Is less -S better?
$ less -S results/sam/SRR097977.aligned.sam

Issues with software installations on Mac

I started with Setup and followed along with the instructions on a Mac instead of an Amazon instance, since all of the packages indicated they are available for Mac. I used the source download approach as I didn't have homebrew or conda installed. This went fine except for BCFtools, which failed to compile. I didn't chase the problem, but I'm guessing a missing library:

cd htslib-1.5 && /Library/Developer/CommandLineTools/usr/bin/make lib-static
gcc -g -Wall -O2 -I.  -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:60:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
         ^~~~~~~~
1 error generated.
make[1]: *** [cram/cram_io.o] Error 1
make: *** [htslib-1.5/libhts.a] Error 2

I then tried installing homebrew and miniconda to see if they could successfully install BCFtools. Homebrew failed to install based on the command provided at https://brew.sh/. The miniconda install was successful, but then was unable to install BCFtools because it couldn't find BCFtools in the channels it searched:

PackageNotFoundError: Packages missing in current channels:
            
  - bcftools

We have searched for the packages in the following channels:
            
  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch

Considering that Carpentry lessons are expected to run on all three platforms, it seems like the instructions need updating and testing on Mac and PC as well.

I should note I'm running on Mac OS 10.12.6.
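
A likely fix for the conda route, assuming the bioconda channel (which conda does not search by default) provides bcftools for macOS:

$ conda install -c bioconda -c conda-forge bcftools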

Automating a Variant Calling Workflow Issue Bonanza

Overview

  • Fix timing for Teaching and Exercises
  • Adapt the objectives to correspond to the lesson sections

What is a shell script?

  • Change the formatting for the code parts of the main text.

This looks a lot like the for loops we saw earlier. In fact, it is no different, apart from using indentation and the lack of the ‘>’ prompts; it’s just saved in a text file. The line at the top (‘#!/bin/bash’) is commonly called the shebang line, which is a special kind of comment that tells the shell which program is to be used as the ‘interpreter’ that executes the code.
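
For context, a minimal sketch of such a script (not the lesson's actual script):

#!/bin/bash
# the shebang line above tells the shell to execute this file with bash
echo "Running FastQC on all FASTQ files"
fastqc *.fastq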

A real shell script

  • Rephrase the name of the section to (e.g.) "Automating the previous FASTQC workflow with a script"
  • Explicitly add the wget command
  • Make sure that the tree command is installed
  • (maybe) restructure the rest of the section to two parts: "Variant Calling workflow" and "Automating the workflow"
  • Briefly explain the use of the three tools: samtools, bcftools and bwa.
  • Make sure that the three tools are installed (e.g. add the apt-get commands if necessary)
  • Add links to the applications and version

Key points

  • Add content

New Variant Calling Tools/Consider GATK

I know that newer versions of the tools we are using have deprecated some of the arguments we use. Update? Also, now that GATK is going open source, is it worth teaching it? I have never used it.

Challenge idea for D. Document your work

Episode: Quality Control

When FastQC is complete, we gather all the summary.txt files. This would be a nice challenge for applying the shell commands learned so far: grep for FAIL/PASS/WARNING in these results.
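
A possible shape for the challenge solution (assuming each FastQC output directory contains a summary.txt):

$ grep FAIL */summary.txt        # which checks failed, and in which samples?
$ grep -c PASS */summary.txt     # how many checks passed per sample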

List of Issues (still incomplete)

Incomplete list of significant changes to content that needs to be written and/or revised:

General

  • Deletion of master branch - both files in master branch are present in gh-pages so master branch could be deleted

Main page

GitHub URL

README

GitHub URL

  • There are a few missing links. Check for content

AUTHORS

GitHub URL

  • Empty page, needs content

CITATION

GitHub URL

  • Empty page, needs content

Conduct

GitHub URL

  • Check links

Lesson folder (scripts)

README

GitHub URL

  • Almost empty page

process_fastqc_results

GitHub URL

  • The script is not structured for execution (free text is interspersed with code). Needs cleaning up.

run_variant_calling

GitHub URL

  • The script works fine if the environment is set up correctly. Needs re-checking after cleaning up the lesson and connecting it to the lesson setup.

Extras

Reference

Site
GitHub URL

  • Empty page, needs content

About

Site
GitHub URL

  • Empty page, needs content

Discussion

Site
GitHub URL

  • Empty page, needs content

Figures

Site
GitHub URL

  • Empty page, needs content (figures are not located correctly?)

Instructor Notes

Site
GitHub URL

  • Empty page, needs content

Lessons

Lesson 0: Setup

Site
GitHub URL

  • Decide on whether to use Amazon or not
  • Update links to BWA, Samtools, BCFTools to go to manuals (or ...), not github repos
  • 'tarball' is jargon to some/many

Lesson 1: Quality Control

Site
GitHub URL

Lesson 2: Variant Calling Workflow

Site
GitHub URL

Lesson 3: Automating a Variant Calling Workflow

Site
GitHub URL

Need to prep learners for `basename` in the automation lesson

I taught the automation lesson in a recent workshop at UC Davis, and one of the pain points was the use of the basename command. A couple of factors made this pretty challenging.

  • The command is introduced without explanation or motivation at a critical point in the lesson. Students are struggling to wrap their minds around the synthesis of a bunch of things we've just taught them, and this additional complexity is substantial.
  • Normally, I would've tried to take a moment and explain in more depth how the command works and what it's used for before proceeding. However, we were pressed for time so this wasn't really an option.

I felt comfortable enough with shell scripting and familiar enough with the materials that I tried ad-libbing my way forward without the command. But after a few minutes I realized it really was necessary and so I had to pull a "don't ask questions just trust me this is the right thing to do :-)" which of course is...suboptimal.

I propose that the basename command be introduced, with some explanation and motivating examples, earlier in the materials. I will think about how that might be done, but for now I just wanted to get this out there. :-)
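
For reference, the kind of early introduction that might help (the file path is illustrative):

$ basename ~/dc_workshop/data/trimmed_fastq/SRR097977_trim.fastq          # strips the directory part
SRR097977_trim.fastq
$ basename ~/dc_workshop/data/trimmed_fastq/SRR097977_trim.fastq .fastq   # also strips a trailing suffix
SRR097977_trim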

typos and additional exercise [datacarpentry/shell-genomics/#28]

From datacarpentry/shell-genomics/#28, raised by Asli Uyar:

I have proof-read the lesson materials and noticed the following typos in the code or in the text:

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/00-readQC.md:

"it will help your remember what you did”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/01-automating_a_workflow.md:

"First, recall the code from our our fastqc workflow from this morning”

add time estimate for FastQC

I'm working through the lesson now and running FastQC using:

$ ~/FastQC/fastqc *.fastq

So far it's completed 1.2 files (in about 5 minutes). I neglected to check how many files were in the set before running. It would be good to add a time estimate for instructors and self-directed learners so that they know how much time they have to work with while this runs. Possibly also give some direction on how learners can make good use of the time while they wait for it to finish.

Confusion over data-lessons vs datacarpentry GitHub accounts

File https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/README.md ends with the text:

Please cite as:

Wrangling Genomics. June 2017. http://data-lessons.github.io/wrangling-genomics/.

However, the actual repository here is:

https://github.com/datacarpentry/wrangling-genomics

This is the "official" repository as linked to from:

http://www.datacarpentry.org/wrangling-genomics/

See also data-lessons#17 about what the data-lessons fork is for.

Illumina PHRED/Q-score inconsistent

line 93 says:
"uses the standard Sanger quality PHRED score encoding, using by Illumina version 1.8 onwards."
But images "good_quality.png" and "bad_quality.png" have "(Illumina 1.5 encoding)" across the top. Suggest altering the image titles as in attached files.
[Attached image files: bad_quality1 8 and good_quality1 8]

Which .md files to work on?

Could someone please tell me whether the .md files (e.g. 00-readQC.md) in the primary /wrangling-genomics directory or those in the /lessons/wrangling-genomics directory are the ones under development?

Move the contents to gh-pages

For consistency, Data Carpentry lessons are stored in the gh-pages branch, which lets GitHub generate HTML pages via Jekyll. It would be good to apply this solution to this repository. The easiest way would be to rename the master branch to gh-pages (though this will affect those who locally have only master).
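
A sketch of the rename (assuming the default branch on GitHub is switched to gh-pages first):

$ git branch -m master gh-pages      # rename the local branch
$ git push origin gh-pages           # publish the renamed branch
$ git push origin --delete master    # remove the old branch from the remote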

Importing reference genome and checksum

In the Cloud lesson, there is a section on importing data using wget. I think this belongs here, and it could be the place to import the E. coli reference. This would also be a QC step that introduces the idea of checksums.
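
For example, a sketch of the idea (the download URL is a placeholder, not the lesson's actual source):

$ wget https://example.org/ecoli_rel606.fasta.gz    # placeholder URL for the reference
$ md5sum ecoli_rel606.fasta.gz                      # compare against the published checksum before using the file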

double quoting bash variables

@crazyhottommy had included this in a list of issues. I've closed all of the other issues in the list and am moving this to its own issue. Would appreciate Maintainer feedback on this as I'm not familiar with style conventions for bash scripts.


Always double-quote your bash variables and use {}

$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done

Main Page - Add content (Issue Bonanza)

The main page of the lesson requires content in the following sections:

  • Getting Started
  • For Instructors

A short introduction (e.g. "what will you learn?", "why is this important?") should also be included.

Revise the time estimates for the "Wrangling Genomics" lesson.

The following feedback came from instructor @jrkirk61:

The Wrangling Genomics lesson goes a lot longer than the predicted 3.5 hours. I made it through 3/4 of the material in the 3.5 hours I had, and I skipped a few sections of the episodes. I would estimate the entire lesson takes anywhere from 4.5-5 hours in reality.

mutation calling using bcftools or mutect

Following a discussion in #37 and my Zoom call with Erin.

I have recently been dealing with SNV calling a lot and have gained experience with various variant callers such as mutect (now part of GATK), freebayes (not a somatic caller) and lancet.
In real life, does anyone use bcftools for variant calling? No matter which caller one uses, the downstream filtering is very important. So, what should we teach the learners?

Should we teach mutect, which can be used for real-life data sets?

We can leave the development of the somatic calling material until after the first release.

Best,
Tommy

Discuss how FASTQC is done for multiple files

In a recent workshop, we got a few questions on how FASTQC can be run on multiple files, and how you could view the results for multiple files.

We can emphasize that here we're teaching how to do this for one file, and that later we'll talk about how to do it for multiple files.

For viewing, we should emphasize that you wouldn't visualize the output for all your FASTQC files, but that you would look at representative files, or discuss how you would use the output of FASTQC for multiple files.

moving hidden files is confusing

Under the header "Running FastQC" in wrangling-genomics/_episodes/00-quality-control.md, the command:

 mv ~/.dc_sampledata_lite/untrimmed_fastq/ ~/dc_workshop/data/

is confusing. It's not clear why the hidden files existed in the first place, and if this step is missed, the paths in the rest of the lessons no longer work.

Pointed out by @ljcohen while she was teaching the lesson.

Issue list episode 00

  • Add exercise "how many files are there and how big are they?" in the untrimmed_fastq data.
  • Change language around unzip. Demotivating.
  • Add output for unzip command showing errors.
  • The for loop for unzipping the files generates a bunch of warnings (e.g. replace SRR097_fastqc/Images/..... ) because one of the files is already unzipped. Walk learners through how to deal with this.
  • Add an ls after the unzipping steps to demonstrate what has happened. Make sure learners recognize that a new directory has been made for each FASTQ file. Walk them through the contents that are relevant.
  • have learners preview one of the summary files before cating them all together

trimming parameters are very stringent

In the wrangling-genomics episodes, the goal is variant calling, and in this pipeline we are using a well-defined reference genome. By using SLIDINGWINDOW:4:20, we are throwing out a lot of information that could be used in the variant calling workflow. Perhaps we should suggest more lenient parameters such as SLIDINGWINDOW:4:2, like those put forth in the McMannes paper (even though that is for mRNA), as suggested by @ljcohen.
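
For example, the lesson's loop with the more lenient cutoff (a sketch; the quality threshold follows the suggestion above):

$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE "$infile" "$outfile" SLIDINGWINDOW:4:2 MINLEN:20
>done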
