datacarpentry / wrangling-genomics

Data Wrangling and Processing for Genomics

Home Page: https://datacarpentry.org/wrangling-genomics/

License: Other

Shell 100.00%
carpentries data-carpentry lesson shell data-wrangling programming english genomics stable

wrangling-genomics's Introduction


Wrangling Genomics

Lesson on quality control and wrangling of genomics data. This repository is maintained by Josh Herr, Ming Tang, and Fotis Psomopoulos.

The Amazon public AMI for this tutorial is "dataCgen-qc".

Background

Wrangling Genomics trains novice learners on a variant calling workflow. Participants will learn how to evaluate sequence quality and what to do when it is poor. We then cover aligning reads to a genome and calling variants, discuss the file formats encountered along the way, and visualize the results. Finally, we cover how to automate the whole process by building a shell script.

This lesson is part of the Data Carpentry Genomics Workshop.

Contribution

Code of Conduct

All participants should agree to abide by the Data Carpentry Code of Conduct.

Authors

Wrangling genomics is authored and maintained by the community.

Citation

Please cite as:

Erin Alison Becker, Taylor Reiter, Fotis Psomopoulos, Sheldon John McKay, Jessica Elizabeth Mizzi, Jason Williams, … Winni Kretzschmar. (2019, June). datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260609

wrangling-genomics's People

Contributors

agt24, aschuerch, binxiepeterson, bluegenes, crazyhottommy, dbmarchant, dinindusenanayake, erinbecker, fmichonneau, fpsom, gaiusjaugustus, hadrieng, hoytpr, jasonjwilliamsny, jessicalumian, kate-crosby, kcranston, kweitemier, laninsky, lexnederbragt, mckays630, mfernandes61, nielinfante, peterjc, raynamharris, taylorreiter, tobyhodges, tracykteal, vlrieg, zkamvar


wrangling-genomics's Issues

Why does setup exist?

This seems to be in the SWC format. This lesson comes midstream in whatever workshop flavor it is taught in, so there should be nothing for the user to download or install. Unless the original intent of teaching everything on the cloud/HPC is changing, this should be part of the workshop webpage template, or live in a lesson that addresses installation for learners running these tools elsewhere.

remove outdated files on the AMI

The trimmed FASTQs produced when we run Trimmomatic on our samples:

-rw-rw-r-- 1 dcuser dcuser 762M Nov 7 23:35 SRR097977.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 2.7G Nov 7 23:36 SRR098026.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 793M Nov 7 23:36 SRR098027.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 2.7G Nov 7 23:37 SRR098028.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 3.1G Nov 7 23:39 SRR098281.fastq_trim.fastq
-rw-rw-r-- 1 dcuser dcuser 3.0G Nov 7 23:40 SRR098283.fastq_trim.fastq

The trimmed FASTQs that are in the variant_calling/data/ directory (produced from the variant_calling.tar.gz file):

-rw-rw-r-- 1 dcuser dcuser 46M Jul 31 2015 SRR097977.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098026.fastq
-rw-rw-r-- 1 dcuser dcuser 45M Jul 31 2015 SRR098027.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098028.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098281.fastq
-rw-rw-r-- 1 dcuser dcuser 44M Jul 31 2015 SRR098283.fastq

Using bwa mem instead of bwa aln in 02-variant_calling.md

In datacarpentry/wrangling-genomics/_episodes/02-variant_calling.md, bwa aln is taught instead of bwa mem. I see the note about bwa mem, but I think most people taking this workshop will be working with reads >50 bp. Teaching bwa mem instead of bwa aln > bwa samse would simplify the learning process, save lesson time, and give learners a better understanding of the pipeline they will probably use when they align their own data. Thoughts?
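
For illustration, a rough sketch of the two approaches (paths follow the lesson's layout; the .sai output path is illustrative):

# current two-step approach for short reads
$ bwa aln data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq/SRR097977.fastq > SRR097977.aligned.sai
$ bwa samse data/ref_genome/ecoli_rel606.fasta SRR097977.aligned.sai data/trimmed_fastq/SRR097977.fastq > results/sam/SRR097977.aligned.sam
# proposed single step with bwa mem (recommended for reads of roughly 70 bp and longer)
$ bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq/SRR097977.fastq > results/sam/SRR097977.aligned.sam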

Typos in 00-quality-control.md

TYPOS found in 00-quality-control.md

line 49: "its" should be "it's"
line 79: "eg" should be "e.g."
line 144: "begining" should be "beginning"
line 424: remove one of the "four"
line 427: "unziped" should be "unzipped"
line 560: "termperature" should be "temperature"

Remove FASTQ encoding variants; explain ASCII more

The history of FASTQ encodings and the legacy Solexa/Illumina variants is a distraction, and in my opinion can be removed from the QC lesson:

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/_episodes/00-readQC.md

Furthermore, since the lesson is under CC-BY, there is a licensing problem with lifting the CC-BY-SA table from https://en.wikipedia.org/wiki/FASTQ_format (without even giving the source URL, which would be the minimum form of attribution).

I would instead reduce this section to focus on introducing ASCII and the idea that characters have an associated numerical code, which is used with an offset to store the quality score as one character per base.

The different encodings can be reduced to a footnote or remark, with a reference to the Wikipedia page https://en.wikipedia.org/wiki/FASTQ_format or our paper https://doi.org/10.1093%2Fnar%2Fgkp1137

(And then move on to explain PHRED scores)
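
To illustrate the ASCII-plus-offset idea, a minimal shell sketch (not part of the lesson):

$ printf '%d\n' "'I"     # ASCII code of the quality character I
73
$ echo $((73 - 33))      # subtract the Sanger/Illumina 1.8+ offset of 33 to get the PHRED score
40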

Lesson "automation": better syntax for variable handling and a typo/mistake?

I have two issues with the automation course:

First issue:
IMHO the method shown for concatenating a bash variable with plain text is not the recommended one. See this excerpt from the lesson:

$ for infile in *.fastq
do
outfile=$infile_trim.fastq
java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
done

This kind of operation (the outfile assignment in the example above) should be done via the ${variable} syntax, not via an escape character (i.e. $variable\_plain_text). Earlier lessons teach the ${variable} syntax, so for consistency I think this should be changed here as well. Or am I missing something?
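
For reference, a one-line sketch of the braces form (assuming the loop variable is infile):

outfile=${infile}_trim.fastq    # braces mark where the variable name ends, so no escaping is needed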

Second issue:
I might have found a typo/mistake, but I'm not 100% sure and I have no VM to test it on (anymore). So if someone could confirm, I'd be most obliged. In the exercise quoted below:

Vizualize the alignment of the reads for our SRR098281.fastq_trim.fastq_small sample. What variant is present at position 145? What is the canonical nucleotide in that position?

Shouldn't this be position 146? Or did I miscount or became confused between 0-counting programs and 1-counting programs?

Kind regards,
Dennis

Add locations of true and false positive variant calls

As per #25:

explain how to inspect alignments to see whether a variant is indeed called correctly, maybe point to a variant that did not pass filtering

This requires someone who has run the pipeline to have a look in IGV to find a true variant and a false positive variant (assuming there is one) and add their locations to the end of 01-variant_calling_workflow.md, maybe even including screenshots from IGV and samtools tview.
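
For whoever picks this up, one way to spot-check a position without leaving the terminal (a sketch; the reference name, position, and file paths are placeholders following the lesson layout, and the BAM must be sorted and indexed):

$ samtools tview -p <reference_name>:<position> results/bam/SRR098281.aligned.sorted.bam data/ref_genome/ecoli_rel606.fasta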

Access to files for self-learners (and maintainers)

How would self-learners interested in going over this material be able to run the analyses? Can we give them access to the input files? An Amazon image?

Similarly, if I want to address some remaining issues, I would need to do many of the steps normally run at a workshop, or at least get my hands on the output files...

Automating a workflow - referencing earlier parts of the workshop (Issue Bonanza)

To make the narrative independent from the workshop schedule and more applicable for self-learners, I suggest replacing references to parts of the day with references to the lesson itself.

"Now, let’s do something real. First, recall the code from our our fastqc workflow from this morning, with a few extra “echo” statements."
Replace by
"Now, let’s do something real. First, recall the code from our our quality control workflow, with a few extra “echo” statements."

"3) Bonus points: Use something you learned yesterday to save the output of the script to a file while it is running."
Replace by
"3) Bonus points: Save the output of the script to a file while it is running (hint: redirection)"

Stage your data section redundant?

Episode: Quality Control

The Wrangling episodes (in my mind) come after the project organization lesson. There will need to be harmonization between these sections, assuming that certain directories/prerequisites exist. I suggest rephrasing so that learners confirm they have set up the appropriate directories.

Variant Calling Workflow Issue Bonanza

Objectives:

Use a For loop from the previous lesson to help automate repetitive tasks

  • Remove, as this is not done in this lesson.

Lesson:

  • rephrase learning objectives as 'be able to' etc

  • The first commands are misformatted

NOTE: This only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.

  • Make it a Callout using the '{: .callout}' syntax? We could use Callout boxes more throughout.

  • Explain the general setup of commands, e.g. bwa aln reference_genome readfile, see #55 and #56

The SAM file, is a tab-delimited text file

  • Links to this github page, find a better alternative.

  • Explain why we would want to make a binary version of a file in general, and of the SAM file in particular; see #57 and the sketch after this list

Filter the SNPs for the final output in VCF format, using vcfutils.pl:

  • Format vcfutils.pl as inline code

IGV requires the .bai file to be in the same location as the .bam file that is loaded into IGV, but there is no direct use for that file.

  • Isn't there? See #58

  • explain the functionality, coloring and GUI elements of IGV in more detail. See #59

  • explain how to inspect alignments to see whether a variant is indeed called correctly; maybe point to a variant that did not pass filtering. See #60

Key Points

  • missing
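
On the SAM-to-BAM point above, a minimal sketch of why and how (paths follow the lesson layout and are illustrative):

$ samtools view -S -b results/sam/SRR097977.aligned.sam > results/bam/SRR097977.aligned.bam    # BAM is a compressed binary SAM: smaller and faster to parse
$ samtools sort results/bam/SRR097977.aligned.bam -o results/bam/SRR097977.aligned.sorted.bam  # sort by coordinate
$ samtools index results/bam/SRR097977.aligned.sorted.bam                                      # the .bai index enables fast random access (e.g. in IGV)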

Clarifying human readable for new users

On the exercise starting at line 215, the "Hint" is at least partially about learning the -h option. But to avoid confusion, the use of "M" and "G" should be clarified as MB and GB. Suggest that line 238 read:
"There are six FASTQ files ranging from 840M (840MB) to 4.0G (4.0GB)".

Justification: New learners may have never seen the -h option before, and need to keep MB and GB distinct from Mbp or Gbp. Consistency in these designations will be important at this point in the lecture.

Add instructor notes document for this lesson

I'm working on helping direct instructor attention towards fixing up/contributing to instructor notes. I currently don't have a link to provide for instructor notes for this lesson. Please add one - even a blank document would be something to point towards.

Quality Control Issue Bonanza

Quality Control

link

  • Fix:
    Teaching: 0 min
    Exercises: 0 min

Are my data good enough?

Objectives:
• Clean (trim adaptors and low-quality bases from the) FastQ reads using Trimmomatic
• Employ for loops to clean multiple FastQ files.

The first step in the variant calling work flow is to take the FASTQ files received from the sequencing facility and assess the quality of the sequence reads.

  • Workflow or work flow?

Although it looks complicated (and maybe it is), its easy to understand the fastq format with a little decoding. Some rules about the format include…

  • It is

so for example in our data set, one complete read is:

For example,

$ head -n4 SRR098281.fastq 
  • Does the data need to be downloaded before the class?
  • Detailed instructions are needed.
[Quality-encoding diagram from the Wikipedia FASTQ format article: ASCII characters 33–126 mapped to quality-score ranges for the Sanger (S), Solexa (X), Illumina 1.3+ (I), Illumina 1.5+ (J), and Illumina 1.8+ (L) encodings.]
  • Need to color the scores as in the wiki page. I do not know how to add colors... maybe a direct screenshot would work? A link to the FASTQ format wiki page is needed.

https://en.wikipedia.org/wiki/FASTQ_format

Using the quality encoding character legend, the first nucleotide in the read (C) is called with a quality score of 31 and our Ns are called with a score of 2. As you can tell by now, this is a bad read.

  • How do we define a bad read? Low quality and many Ns?

  • need to explain the Phred+33 and Phred+64 encoding offsets

Running FASTQC

  • Some text here (needs to be filled in)

The sample data we will be working with is in a hidden directory (placing a ‘.’ in front of a directory name hides the directory). In the next step we will move some of those hidden files into our new directories to start our project.

  • why does the data need to be hidden?

B. Run FastQC

  • need to download FastQC first (see http://www.datacarpentry.org/wrangling-genomics/setup/)

  • should we provide the output directory in the fastqc commands (-o ~/dc_workshop/results/fastqc_untrimmed_reads/) rather than generate the results in the current data folder and then move the results to the results folder?
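
For the second option above, a sketch (the output directory must already exist before FastQC is run):

$ fastqc *.fastq -o ~/dc_workshop/results/fastqc_untrimmed_reads/    # write the reports straight into the results folder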

How to clean reads using Trimmomatic

  • Some text here (needs to be filled in)

A. detailed explanation of features

  • Need to provide an ILLUMINACLIP:SRR_adapters.fa file in the repo

java -jar calls the Java program, which is needed to run trimmomargumentstic

  • trimmomargumentstic?
$ for infile in *.fastq
>do
>outfile=$infile\_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done
  • Always double-quote your bash variables and use {}
$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done

Variant Calling Workflow

Objectives

  • Use a For (for) loop from the previous lesson to help automate repetitive tasks

  • Becoming (Become) familiar with data formats encountered during variant calling

Your directory structure should now look like this:

dc_workshop
├── data
│   ├── ref_genome
│   │   └── ecoli_rel606.fasta
│   ├── untrimmed_fastq
│   └── trimmed_fastq
│       ├── SRR097977.fastq
│       ├── SRR098026.fastq
│       ├── SRR098027.fastq
│       ├── SRR098028.fastq
│       ├── SRR098281.fastq
│       └── SRR098283.fastq
├── results
└── docs

  • Does the tree command need to be introduced?

We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and while there is no gold standard there are some tools that are better suited for particular NGS analyses.

  • >We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and while there is no gold standard, there are some tools that are better suited for particular NGS analyses.

Index the reference genome

$ bwa index data/ref_genome/ecoli_rel606.fasta     # This step helps with the speed of alignment
  • How long does it take for an E. coli genome? My experience with the human genome is about 2 hours.

Explore the information within your SAM file:

$ head results/sam/SRR097977.aligned.sam
  • Is less -S better?
$ less -S results/sam/SRR097977.aligned.sam

Issues with software installations on Mac

I started with Setup and followed along with the instructions on a Mac instead of an Amazon instance, since all of the packages indicated they are available for Mac. I used the source download approach as I didn't have homebrew or conda installed. This went fine except for BCFtools, which failed to compile. I didn't chase the problem, but I'm guessing a missing library:

cd htslib-1.5 && /Library/Developer/CommandLineTools/usr/bin/make lib-static
gcc -g -Wall -O2 -I.  -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:60:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
         ^~~~~~~~
1 error generated.
make[1]: *** [cram/cram_io.o] Error 1
make: *** [htslib-1.5/libhts.a] Error 2

I then tried installing homebrew and miniconda to see if they could successfully install BCFtools. Homebrew failed to install based on the command provided at https://brew.sh/. The miniconda install was successful, but then was unable to install BCFtools because it couldn't find BCFtools in the channels it searched:

PackageNotFoundError: Packages missing in current channels:
            
  - bcftools

We have searched for the packages in the following channels:
            
  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch

Considering that Carpentry lessons are expected to run on all three platforms, it seems like the instructions need updating and testing on Mac and PC as well.

I should note I'm running on Mac OS 10.12.6.
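
A likely fix for the conda route, assuming the bioconda channel (which conda does not search by default) provides bcftools for macOS:

$ conda install -c bioconda -c conda-forge bcftools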

Automating a Variant Calling Workflow Issue Bonanza

Overview

  • Fix timing for Teaching and Exercises
  • Adapt the objectives to correspond to the lesson sections

What is a shell script?

  • Change the formatting for the code parts of the main text.

This looks a lot like the for loops we saw earlier. In fact, it is no different, apart from using indentation and the lack of the ‘>’ prompts; it’s just saved in a text file. The line at the top (‘#!/bin/bash’) is commonly called the shebang line, which is a special kind of comment that tells the shell which program is to be used as the ‘interpreter’ that executes the code.
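
For context, a minimal sketch of such a script (not the lesson's actual script):

#!/bin/bash
# the shebang line above tells the shell to execute this file with bash
echo "Running FastQC on all FASTQ files"
fastqc *.fastq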

A real shell script

  • Rephrase the name of the section to (e.g.) "Automating the previous FASTQC workflow with a script"
  • Explicitly add the wget command
  • Make sure that the tree command is installed
  • (maybe) restructure the rest of the section to two parts: "Variant Calling workflow" and "Automating the workflow"
  • Briefly explain the use of the three tools: samtools, bcftools and bwa.
  • Make sure that the three tools are installed (e.g. add the apt-get commands if necessary)
  • Add links to the applications and version

Key points

  • Add content

New Variant Calling Tools/Consider GATK

I know that newer versions of the tools we are using have deprecated some of the arguments we use. Update? Also, now that GATK is going open source, is it worth teaching it? I have never used it.

Challenge idea for D. Document your work

Episode: Quality Control

When FastQC is complete, we gather all the summary.txt files. This would be a nice challenge for applying the shell commands learned so far: grep for FAIL/PASS/WARNING in these results.
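
A possible shape for the challenge solution (assuming each FastQC output directory contains a summary.txt):

$ grep FAIL */summary.txt        # which checks failed, and in which samples?
$ grep -c PASS */summary.txt     # how many checks passed per sample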

List of Issues (still incomplete)

Incomplete list of significant changes to content that needs to be written and/or revised:

General

  • Deletion of master branch - both files in master branch are present in gh-pages so master branch could be deleted

Main page

GitHub URL

README

GitHub URL

  • There are a few missing links. Check for content

AUTHORS

GitHub URL

  • Empty page, needs content

CITATION

GitHub URL

  • Empty page, needs content

Conduct

GitHub URL

  • Check links

Lesson folder (scripts)

README

GitHub URL

  • Almost empty page

process_fastqc_results

GitHub URL

  • The script is not structured for execution (free text is interspersed with code). Needs cleaning up.

run_variant_calling

GitHub URL

  • The script works fine if the environment is set up correctly. Needs re-checking after cleaning up the lesson and connecting it to the lesson setup.

Extras

Reference

Site
GitHub URL

  • Empty page, needs content

About

Site
GitHub URL

  • Empty page, needs content

Discussion

Site
GitHub URL

  • Empty page, needs content

Figures

Site
GitHub URL

  • Empty page, needs content (figures are not located correctly?)

Instructor Notes

Site
GitHub URL

  • Empty page, needs content

Lessons

Lesson 0: Setup

Site
GitHub URL

  • Decide on whether to use Amazon or not
  • Update links to BWA, Samtools, BCFTools to go to manuals (or ...), not github repos
  • 'tarball' is jargon to some/many

Lesson 1: Quality Control

Site
GitHub URL

Lesson 2: Variant Calling Workflow

Site
GitHub URL

Lesson 3: Automating a Variant Calling Workflow

Site
GitHub URL

Need to prep learners for `basename` in the automation lesson

I taught the automation lesson in a recent workshop at UC Davis, and one of the pain points was the use of the basename command. A couple of factors made this pretty challenging.

  • The command is introduced without explanation or motivation at a critical point in the lesson. Students are struggling to wrap their minds around the synthesis of a bunch of things we've just taught them, and this additional complexity is substantial.
  • Normally, I would've tried to take a moment and explain in more depth how the command works and what it's used for before proceeding. However, we were pressed for time so this wasn't really an option.

I felt comfortable enough with shell scripting and familiar enough with the materials that I tried ad-libbing my way forward without the command. But after a few minutes I realized it really was necessary and so I had to pull a "don't ask questions just trust me this is the right thing to do :-)" which of course is...suboptimal.

I propose that the basename command be introduced, with some explanation and motivating examples, earlier in the materials. I will think about how that might be done, but for now I just wanted to get this out there. :-)
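
For reference, the kind of early introduction that might help (the file path is illustrative):

$ basename ~/dc_workshop/data/trimmed_fastq/SRR097977_trim.fastq          # strips the directory part
SRR097977_trim.fastq
$ basename ~/dc_workshop/data/trimmed_fastq/SRR097977_trim.fastq .fastq   # also strips a trailing suffix
SRR097977_trim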

typos and additional exercise [datacarpentry/shell-genomics/#28]

From datacarpentry/shell-genomics/#28, raised by Asli Uyar:

I have proof-read the lesson materials and noticed the following typos in the code or in the text:

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/00-readQC.md:

"it will help your remember what you did”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/01-automating_a_workflow.md:

"First, recall the code from our our fastqc workflow from this morning”

add time estimate for FastQC

I'm working through the lesson now and running FastQC using:

$ ~/FastQC/fastqc *.fastq

So far it's completed 1.2 files (in about 5 minutes). I neglected to check how many files were in the set before running. It would be good to add a time estimate for instructors and self-directed learners so that they know how much time they have to work with while this runs. Possibly also give some direction on how learners can make good use of the time while they wait for it to finish.

Confusion over data-lessons vs datacarpentry GitHub accounts

File https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/README.md ends with the text:

Please cite as:

Wrangling Genomics. June 2017. http://data-lessons.github.io/wrangling-genomics/.

However, the actual repository here is:

https://github.com/datacarpentry/wrangling-genomics

This is the "official" repository as linked to from:

http://www.datacarpentry.org/wrangling-genomics/

See also data-lessons#17 about what the data-lessons fork is for.

Illumina PHRED/Q-score inconsistent

line 93 says:
"uses the standard Sanger quality PHRED score encoding, using by Illumina version 1.8 onwards."
But images "good_quality.png" and "bad_quality.png" have "(Illumina 1.5 encoding)" across the top. Suggest altering the image titles as in attached files.
[Attached image files: bad_quality1 8 and good_quality1 8]

Which .md files to work on?

Could someone please tell me whether the .md files (e.g. 00-readQC.md) in the primary /wrangling-genomics directory or those in the /lessons/wrangling-genomics directory are the ones under development?

Move the contents to gh-pages

For consistency, Data Carpentry lessons are stored in the gh-pages branch, which lets GitHub generate HTML pages via Jekyll. It would be good to apply this solution to this repository. The easiest way would be to rename the master branch to gh-pages (though this will affect those who locally have only master).
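
A sketch of the rename (assuming the default branch on GitHub is switched to gh-pages first):

$ git branch -m master gh-pages      # rename the local branch
$ git push origin gh-pages           # publish the renamed branch
$ git push origin --delete master    # remove the old branch from the remote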

Importing reference genome and checksum

In the Cloud lesson, there is a section on importing data using wget. I think this belongs here, and it could be the place to import the E. coli reference. This would also be a QC step that introduces the idea of checksums.
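
For example, a sketch of the idea (the download URL is a placeholder, not the lesson's actual source):

$ wget https://example.org/ecoli_rel606.fasta.gz    # placeholder URL for the reference
$ md5sum ecoli_rel606.fasta.gz                      # compare against the published checksum before using the file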

double quoting bash variables

@crazyhottommy had included this in a list of issues. I've closed all of the other issues in the list and am moving this to its own issue. Would appreciate Maintainer feedback on this as I'm not familiar with style conventions for bash scripts.


Always double-quote your bash variables and use {}

$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE $infile $outfile SLIDINGWINDOW:4:20 MINLEN:20
>done

Main Page - Add content (Issue Bonanza)

The main page of the lesson requires content in the following sections:

  • Getting Started
  • For Instructors

A short introduction (e.g. "what will you learn?", "why is this important?") should also be included.

Revise the time estimates for the "Wrangling Genomics" lesson.

The following feedback came from instructor @jrkirk61:

The Wrangling Genomics lesson goes a lot longer than the predicted 3.5 hours. I made it through 3/4 of the material in the 3.5 hours I had, and I skipped a few sections of the episodes. I would estimate the entire lesson takes anywhere from 4.5-5 hours in reality.

mutation calling using bcftools or mutect

Following a discussion in #37 and my Zoom call with Erin.

I have recently been dealing with SNV calling a lot and have gained experience with various variant callers such as mutect (now part of GATK), freebayes (not a somatic caller) and lancet.
In real life, does anyone use bcftools for variant calling? No matter which caller one uses, the downstream filtering is very important. So, what should we teach the learners?

Should we teach mutect, which can be used for real-life data sets?

We can leave the development of the somatic calling material until after the first release.

Best,
Tommy

Discuss how FASTQC is done for multiple files

In a recent workshop, we got a few questions on how FASTQC can be run on multiple files, and how you could view the results for multiple files.

We can emphasize that here we're teaching how to do this for one file, and that later we'll talk about how to do it for multiple files.

For viewing, we should emphasize that you wouldn't visualize the output for all your FASTQC files, but that you would look at representative files, or discuss how you would use the output of FASTQC for multiple files.

moving hidden files is confusing

Under the header "Running FastQC" in wrangling-genomics/_episodes/00-quality-control.md, the command:

 mv ~/.dc_sampledata_lite/untrimmed_fastq/ ~/dc_workshop/data/

is confusing. It's not clear why the hidden files existed in the first place, and if this step is missed, the paths in the rest of the lessons no longer work.

Pointed out by @ljcohen while she was teaching the lesson.

Issue list episode 00

  • Add exercise "how many files are there and how big are they?" in the untrimmed_fastq data.
  • Change language around unzip. Demotivating.
  • Add output for unzip command showing errors.
  • The for loop for unzipping the files generates a bunch of warnings (e.g. replace SRR097_fastqc/Images/..... ) because one of the files is already unzipped. Walk learners through how to deal with this.
  • Add an ls after the unzipping steps to demonstrate what has happened. Make sure learners recognize that a new directory has been made for each FASTQ file. Walk them through the contents that are relevant.
  • have learners preview one of the summary files before cating them all together

trimming parameters are very stringent

In the wrangling-genomics episodes, the goal is variant calling, and in this pipeline we are using a well-defined reference genome. By using SLIDINGWINDOW:4:20, we are throwing out a lot of information that could be used in the variant calling workflow. Perhaps we should suggest more lenient parameters such as SLIDINGWINDOW:4:2, like those put forth in the McMannes paper (even though that is for mRNA), as suggested by @ljcohen.
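
For example, the lesson's loop with the more lenient cutoff (a sketch; the quality threshold follows the suggestion above):

$ for infile in *.fastq
>do
>outfile="${infile}"_trim.fastq
>java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE "$infile" "$outfile" SLIDINGWINDOW:4:2 MINLEN:20
>done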
