wurmlab / genomicscourse



For QMUL's Genome Bioinformatics MSc module BIO721P & SIB's Spring school in bioinfo & population genomics

Home Page: https://wurmlab.com/genomicscourse/

HTML 96.85% CSS 0.44% Shell 0.08% JavaScript 0.02% Perl 1.14% R 1.22% Rich Text Format 0.24% Ruby 0.02%

genome-assembly genome-analysis education bioinformatics practicals population-genomics qmul

genomicscourse's People

Contributors

Stargazers

Watchers

Forkers

flopezo cnyuanh emelinefavreau digrigor bmpvieira epi-cotton careennaitore ivopieniak mpriebe muthubioinfo shiyi-pan aswitwicka james-milburn-crowe schlogl2017 gabrielluishernandez discoform paoloinglese graziet

genomicscourse's Issues

Broken link in 2017/practicals/reference_genome/README.md

Link to assembly.md in README for reference_genome in 2017 is broken.

Our default read filtering params delete too much

Close to 50% data loss, which is probably too stringent. (I suspect it's the quality filter, which deletes any read that includes a base with -q 10 N)

2019 - Variant calling input

In variant calling practical (i.e., practical number 4) the input data is fetched as follows:

But popgen data is also available to students at ~/2019-09-BIO721_input (which is a symlink to /import/teaching/bio/data).

Shouldn't we just symlink from ~/2019-09-BIO721_input, similar to the first part of the practical (read cleaning to genome assembly and gene prediction)?

2019 - bgzip command is not installed

Required for practical 5

@ivopieniak Please can you talk to Tom/Keith/Harry/Alijandro (whoever) and get this done asap

2019 - Manual curation

We haven't done manual curation in the last 2-3 years. Should the section be removed? Should we set up WebApollo beforehand?

assembly puzzle

Is there a quick javascript hack whereby we take a list of image URLs, place them randomly on screen, and allow users to drag them around?

eg following http://jsfiddle.net/gigyme/YNMEX/132/ but placing them randomly?

That would enable us to let students do a virtual equivalent of the paper assembly....

Alternatively, maybe using a collaborative google Powerpoint doc is easiest.... that way several students can drag and drop on a single doc (and work on different parts of the assembly)

samtools pileup warning (practical 4)

Running samtools mpileup in practical 4 results in the following warning. While not critical, it might be good to address it.

I also installed GV on the server running SequenceServer. So it would be possible for the students to run GV this time instead of only viewing the screenshots. I updated the practical to tell them to try the same example with GV as they would with SequenceServer/BLAST in the previous section.

Is this okay or should I revert this section to only show them the screenshots? If this is okay, then we need to go better prepared to answer GV-related questions.

2019 - basic updates starting with read cleaning practical

1. The "Locate input data to work with" section in the read cleaning practical needs to be updated. The data are available on the local PC at /import/teaching/bio/data. Here, we want them to symlink to the data instead of making a copy.
2. The subsequent section, "Set up directory hierarchy to work in", starts by saying 'All work must be done in ~/apocrita'. This is no longer the case. We will want them to create their base dirs in their home directory.
3. As per point 2 above, ~/apocrita needs to be removed from the paths throughout.
4. Related to 3, we need to change 2018 to 2019 throughout. Here, it might be worth considering dropping the day component of the date throughout (i.e., 2019-09 instead of 2019-09-28). The argument in favour of keeping it is that we want to teach them the "right" pattern (full date). However, there's a recurring cost every year (even if only a few minutes) to update this. An alternative idea is to change it to xx (i.e., 2019-09-xx). This is because we don't know the exact date until a few days before the practical anyway, and in the practical we ask them to change the day to today's date.

Update website link in repo to 2020 version

The highlighted link in the screenshot below:

to: http://wurmlab.github.io/genomicscourse/2020/practicals/

2019 - source /import/teaching/bio/biotools.sh

Students are currently expected to source /import/teaching/bio/biotools.sh themselves for the tools to be available.

Push Tom to have this added to their bash_profile or so automatically.
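
One way this could be automated is to append a guarded source line to each student's profile exactly once. This is a minimal sketch, assuming a Bash login shell. A temporary file stands in for ~/.bash_profile so the example is safe to run, and the `[ -f ... ]` guard is an addition that is not part of the issue.

```shell
# Sketch: idempotently add a "source biotools.sh" line to a profile file.
# ASSUMPTION: a temporary file stands in for ~/.bash_profile.
profile=$(mktemp)
biotools=/import/teaching/bio/biotools.sh

# Only source the file if it exists, so logins on other machines stay quiet.
line="[ -f $biotools ] && . $biotools"

add_once() {
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -e "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

add_once
add_once   # the second call changes nothing

occurrences=$(grep -cxF -e "$line" "$profile")
echo "profile contains the line $occurrences time(s)"
rm -f "$profile"
```

Running the helper from a setup script more than once is harmless, which matters if the deployment is repeated each year.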

2019 - quality control of genes

We tell students that they can check their genes by comparing the sequences to a high quality database using BLAST but it's not made clear what to look for in the BLAST output.

2017

This is all relative to this year's course, e.g. https://github.com/wurmlab/genomicscourse/blob/master/2017/practicals/reference_genome/assembly.md

@adrlar can we suppose that everyone will have the data dir locally (on PC or VDI) in ~/2017-09-BIO721_genome_bioinformatics_input (this directory should be read-only)? Or is there a single (different) path that we can point them to (and ask them to create a link to that home dir?)

Get the data subsection - needs to be revised according to @adrlar 's response
Modify all directory dates (e.g. 2016-10-03-whatever) to their corresponding 2017 version. (Note that day of week also changes!)
directory -reference is now -reference_genome (change everywhere)
~~is there a more elegant way to the kmer filtering part of L125-145 (line nums may shift) with kmc2? (rather than messy khmer) (ideally without interleaving/uninterleaving)~~ [see comment below]
if sequenceserver is already installed, they don't need to do it (and the text needs to be correspondingly revised)
if the reference db is already downloaded and blast formatted, they don't need to do it. (L252, "Quality control of individual genes")
split into 2 files (one for cleaning, another for assembly)

Changes made to 2017/data/reference_databases (commit 92462e7) need to be synced to /data/SBCS-MSc-BioInf
~~Regenerate website once open PRs (currently #10) have been merged~~ (my bad, website is automatically built when pushed to master)

Use long form command line options

For example --min-length instead of -m.

(@yannickwurm's idea)

I believe this is already the case for practicals 1-3. Short form is used in practicals 1-3 when it is the only option (sorry if I missed any). I don't think this is relevant for practical 5.

Practical 4 can further improve in this regard.

kmc

Will need more explanations of the different steps it's doing
(ideally an easier UI for providing input, and error messages when it returns nothing!! - or rather we should alert to the risk of finding nothing!)

SOAPdenovo

There is a potential issue, based on our experience with BWA: since SOAP is RAM-intensive, the server may crash if all the students decide to run SOAP at the same time. I have spoken with EECS and they said that this is very likely to happen, since the server and the NFS were not designed to handle this kind of biological software. @yeban @yannickwurm

2019 - IGV

@ivopieniak How to run IGV?

In practical 4, it's not 100% clear how to run IGV.

2019 - Regarding directory structure

Everyone finds the directory structure hard to follow. Two ideas that have come up during discussions:

Provide a tree output similar to practical 1 & 4 (i.e., read cleaning and variant calling) in practicals 2 & 3 so that students have a reference.
Skip the first level of hierarchy. Instead, students create 01-read_cleaning, 02-assembly, and so on directly in their home directory. Of course, we will have timestamp instead of 01, 02, etc. in this case.

Option 2 is likely simpler for students, but likely a bit more work from our end because things will have to be retested. Also, it encourages more modular analysis.

Option 1 keeps the current setup and is less work (no retesting required).
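
Option 2 could look roughly like the sketch below. The step names, the input/tmp/results layout and the use of `date +%Y-%m-%d` are illustrative assumptions, and a temporary directory stands in for the student's home directory.

```shell
# Sketch of option 2: one date-stamped directory per analysis, created
# directly under the home directory.
# ASSUMPTION: a temp dir stands in for $HOME; names are hypothetical.
home=$(mktemp -d)
today=$(date +%Y-%m-%d)

for step in read_cleaning assembly gene_prediction; do
    # Each analysis gets its own input/, tmp/ and results/ subdirectories.
    mkdir -p "$home/${today}-${step}/input" \
             "$home/${today}-${step}/tmp" \
             "$home/${today}-${step}/results"
done

# Show what was created and count the directories (including the base).
find "$home" -type d | sort
n_dirs=$(find "$home" -type d | wc -l | tr -d ' ')
rm -rf "$home"
```

Because each analysis directory is self-contained, it can be moved or archived without touching the others, which is the modularity argument made for option 2.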

Practical 4 does not link to any software

It probably should. And maybe also to official SAM and VCF spec.

read cleaning

The practical doc is split in a manner that seems to be designed so that the students SKIP read cleaning rather than actually see it and do it. Please fix.

Students struggle with symlinks

We need to explain to them that:

ln -s /tmp/foo /tmp/bar creates a tiny file (soft link) at /tmp/bar containing the path /tmp/foo.
When the soft link is accessed (read, or written to), programs read the stored path (/tmp/foo) and access that instead.
If the path stored in a soft link is relative (i.e., it does not begin with /; for example, it begins with . or ..), it is resolved relative to the directory containing the soft link.

So, should one always use absolute paths for creating softlinks? In the simplified directory structure we have adopted starting this year, yes that's fine. But in the classic directory structure, which contains multiple analyses in one big directory, there's merit in using relative paths for per analysis input directory (e.g., results/01-read_cleaning/input) and when linking output of one step as input of the next. If relative paths are used for soft linking here then the entire directory can be copied as it is elsewhere, otherwise links may break.
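
The behaviour described above can be checked in a scratch directory. The sketch below is illustrative only: the results/01-read_cleaning and results/02-assembly names follow the example in the text, while reads.txt, the link names and the temporary directory are made up for the demonstration.

```shell
# Sketch: absolute vs relative soft links, and what happens when the
# project directory is copied elsewhere. Everything lives in a temp dir.
scratch=$(mktemp -d)
proj="$scratch/project"
input="$proj/results/02-assembly/input"
mkdir -p "$proj/results/01-read_cleaning" "$input"
echo "ACGT" > "$proj/results/01-read_cleaning/reads.txt"

# Absolute link: stores the full path to the target.
ln -s "$proj/results/01-read_cleaning/reads.txt" "$input/abs_link.txt"
# Relative link: the stored path is resolved from the link's own directory.
ln -s ../../01-read_cleaning/reads.txt "$input/rel_link.txt"

stored_rel=$(readlink "$input/rel_link.txt")   # the path stored in the link

# Copy the project (cp -a keeps links as links), then delete the original.
cp -a "$proj" "$scratch/copy"
rm -rf "$proj"

copy_input="$scratch/copy/results/02-assembly/input"
# -e follows the link, so it is false for a dangling link.
if [ -e "$copy_input/rel_link.txt" ]; then rel_ok=yes; else rel_ok=no; fi
if [ -e "$copy_input/abs_link.txt" ]; then abs_ok=yes; else abs_ok=no; fi
echo "stored: $stored_rel; relative link ok: $rel_ok; absolute link ok: $abs_ok"
rm -rf "$scratch"
```

The relative link survives the copy because its stored path only depends on the layout inside the project, whereas the absolute link still points at the deleted original.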

/cc @MPriebe

encourage softclipping of ends?

--local local alignment; ends might be soft clipped (off)

Regarding replacing SOAPdenovo

I think instead of replacing SOAP we can add some advice (e.g., in practice you may want to try multiple assemblers and even a few different assembly parameters), and list a few assemblers (e.g., SPADES and MaSuRCA) to the practicals.

Change input dataset for read cleaning and assembly

There are duplicated reads in the files, causing the orphan removal step of read-cleaning practical to give incorrect result.

To check that reads are indeed duplicated:

# Stage input
ln -s /data/SBCS-MSc-BioInf/2019/reference_assembly/reads.pe*.fastq.gz .

# Count all ids
$ seqtk comp reads.pe1.fastq.gz | cut -f1 | wc -l
977937

# Uniq the ids and then count
$ seqtk comp reads.pe1.fastq.gz | cut -f1 | sort | uniq | wc -l
977710

Both values should match, but don't, indicating that there are several duplicates.
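
Where seqtk is not available, the same check can be approximated with standard tools. The following is a sketch on a tiny made-up FASTQ file (the read names are invented), assuming four-line FASTQ records. `uniq -d` prints each ID that occurs more than once.

```shell
# Sketch: detect duplicated read IDs in a gzipped FASTQ using only
# gzip, awk, sort and uniq. The three reads below are invented test data.
fq=$(mktemp)
printf '%s\n' \
    '@read1/1' 'ACGT' '+' 'IIII' \
    '@read2/1' 'GGCC' '+' 'IIII' \
    '@read1/1' 'ACGT' '+' 'IIII' | gzip -c > "$fq"

# Header lines are lines 1, 5, 9, ... of a four-line-per-record FASTQ.
# substr($1, 2) drops the leading "@" and any description after a space.
ids=$(gzip -dc < "$fq" | awk 'NR % 4 == 1 { print substr($1, 2) }')

n_total=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
n_unique=$(printf '%s\n' "$ids" | sort -u | wc -l | tr -d ' ')
duplicated=$(printf '%s\n' "$ids" | sort | uniq -d)

echo "total: $n_total, unique: $n_unique, duplicated IDs: $duplicated"
rm -f "$fq"
```

Listing the duplicated IDs, rather than only counting them, makes it easier to see whether whole records or only the names are repeated.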

One other reason to change the dataset is that it's very unfulfilling to go through the practical and obtain an assembly with an N50 of barely 3kb. If I remember correctly, the input reads are barely 2x genome coverage. 10x-20x should be doable in class and can increase the assembly contiguity considerably.

2019 - MAKER output

It's tricky for students to get predicted gene structure or sequences out of MAKER output directory.

popgenome etc in R

Several issues:

Naming conventions too minimalistic
two PCAs (doing this once is sufficiently confusing)
graphs often lack labels.

2019 - potentially set an alias to open HTML files

It would be nice if we can have students double click on FastQC's HTML report instead of opening it from the terminal:

~~Test if gio open . works (i.e., it opens a file explorer list contents of the current directory)~~
~~If yes, get the following alias deployed to biotools.sh: alias open="gio open"~~
~~Update practical~~

swissprot gives insufficient information for genevalidator

One of the genes that came up repeatedly (SLIT ?) unambiguously had top hits around length 1500.
But in swissprot, most hits were 300-800 (!).
Should use a bigger db, e.g. uniref50

SPAdes on raw reads produces more contiguous assembly than cleaned reads

N50 raw: 178 Kb; N50 cleaned (cutadapt -u 1 -q 12): 143 Kb

More contiguous doesn't necessarily mean more accurate, but it sure can look a bit awkward - what is the point of read cleaning then?

We may want to address this by increasing the coverage. Also, masking low-quality bases (seqtk seq -q 12 -n N) instead of trimming increases N50 to ~154 Kb. Contiguity further improves if orphaned reads are provided to SPAdes. Note also that SPAdes selects different k-mer sizes for raw and cleaned reads.
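
For reference when comparing the figures above: N50 is the contig length at which the cumulative sum of lengths, taken from the longest contig downwards, first reaches half of the total assembly size. A minimal sketch with invented contig lengths, not taken from either assembly:

```shell
# Sketch: compute N50 from a list of contig lengths (invented values).
n50=$(printf '%s\n' 2 3 4 5 6 7 8 9 10 |
    sort -rn |
    awk '{ len[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 running += len[i]
                 # First contig at which the running sum reaches half the total.
                 if (running >= half) { print len[i]; exit }
             }
         }')
echo "N50 = $n50"
```

Because N50 only summarises contig lengths, it says nothing about whether the joins in the longer contigs are correct, which is the caveat raised above.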

Documentation improvement around FASTQC

When we ask students to trim the reads, it may not be clear to them which plot to refer to (per-base or per-sequence)
Neither the practical nor FASTQC documentation provide pointers into how base quality is assessed

Credit: Guy
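
As a pointer for that documentation: a Phred quality score Q encodes the estimated probability p that a base call is wrong as Q = -10 log10(p), so p = 10^(-Q/10). The sketch below tabulates this for a few cutoffs. The particular Q values are just examples.

```shell
# Sketch: convert Phred quality scores to error probabilities, p = 10^(-Q/10).
phred_table=$(awk 'BEGIN {
    n = split("10 12 20 30", q, " ")
    for (i = 1; i <= n; i++)
        printf "Q%d\t%.4f\n", q[i], 10 ^ (-q[i] / 10)
}')
printf '%s\n' "$phred_table"
```

Seen this way, a cutoff of 10 tolerates bases with up to a 1 in 10 chance of being wrong, which can help students relate the cutoff they choose to the FastQC per-base quality plot.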

readData fails in popgen practical

readData function in the popgen practical fails on the new setup with the following error:

|            :            |            :            | 100 %
|==========================Error in setwd(dfile) : cannot change working directory

I have tried passing ./popgenome-vcf as well as absolute path to the function. I also tried moving the VCFs and passing . to the function. None worked.

2019 - FASTQC 0.11.8

FastQC runs the analysis, however it cannot save/create HTML report, thus the default .zip output is corrupted and useless.

Error message:
'Failed to create an archive: can't create the output stream'

This was mentioned in this thread, however providing explicit paths to the files for limitations and adapters did not help.
https://www.biostars.org/p/189261/

Suggestion is to use older version - 0.11.7 and not 0.11.8.

Practical 4 overuses inline code blocks

All mentions of software are enclosed in backticks, causing them to be rendered in typewriter font. Most published work (papers, books), as well as practicals 1 to 3, does not use typewriter font when calling out software by name.

For example:

correct: we will use bowtie2
incorrect: we will use `bowtie2`

Issues with sym linking the input data

1/3 of the students have no access to the /import/teaching/bio/ directory which is gonna make it impossible for them to start the practicals.

2019 - MAKER

MAKER caused problems last time. I don't remember exactly why, but I guess it was because practical asks them to run it on Apocrita but we ended up asking them to run on local PC afterwards.

Check MAKER can be run on local PC and update practicals accordingly.
Need to be extra vigilant because installation is being outsourced and we don't know what version we might end up with.

2019 - Typo in read cleaning practical

'BIO271' below should be 'BIO721'

page width too small, makes commands illegible

e.g. here - scroll box appears in a short line of code

http://wurmlab.github.io/genomicscourse/2017/practicals/reference_genome/read-cleaning

prac 1

add that "emerging model organisms" includes most crops, most animal & plant pest species, many pathogens, and most major models for ecology & evolution.
add "Do not jump ahead".
"Other tools including fastx_toolkit, kmc2 and Trimmomatic can also be useful." ** but we won't use them now** also kmc2 is irrelevant given kmc3
"to appropriately trim from the beginning (--cut) and end (--quality-cutoff) of the sequences"
to identify a relevant quality cutoff, you'll need to read the cutadapt documentation, understand phred scores, and examine your FastQC quality scores
Say you have sequenced your sample at 45x genome coverage. ** This means that every nucleotide of the genome was sequenced 45 times on average... so for a genome of 450,000,000 nucleotides, this means you have xxxxxxxxx nucleotides of raw sequence.
"The real coverage distribution will be influenced by factors including DNA quality, library preparation type and local GC content, but you might expect most of the genome to be covered between 20 and 70x" --> sentence too long. Chop in 2.
Below, we use kmc3 to "mask" extremely rare k-mers (i.e., convert each base of rare k-mers to 'N'). This is because we know that those portions of sequences are not really present in the species
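
The coverage arithmetic in the note above can be worked through as follows. The 150-nucleotide read length is an assumption added for illustration and does not come from the practical.

```shell
# Sketch: amount of raw sequence implied by a given average coverage.
genome_size=450000000   # nucleotides, from the example above
coverage=45             # average number of times each nucleotide is sequenced
read_length=150         # ASSUMPTION: illustrative read length only

total_nt=$((genome_size * coverage))
n_reads=$((total_nt / read_length))
echo "raw sequence: $total_nt nt, i.e. about $n_reads reads of $read_length nt"
```

Expressing the total as a number of reads can make the figure more tangible for students, but the read length should be replaced with whatever the sequencing run actually produced.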

2019 - quality control of genes

"Quality control of individual genes" section in gene prediction practical asks students to run BLAST on Apocrita and inspect the results on local PC. We want to simplify this if possible:

Test that BLAST can be run in a reasonable time on the local PC (which seems well equipped - 4 HT cores, 16 GB RAM).

Test this through locally installed seqserv rather than on CLI. Make sure to start sequenceserver with -n 8 and this should be reflected in the practical too (or ask Tom and team to change ~/.sequenceserver.conf for both database_dir and num_threads param).

If cannot run BLAST on local PC, ideal would be to run BLAST on Apocrita and visualise results on local PC using sequenceserver 2.0 beta:

Test that 2.0 beta won't struggle to render the results. If it won't, update practical accordingly. If it will, update practicals to ask them to run BLAST with -html flag (we asked them to figure this out themselves last time, but might be tough with 50 students?).

Update or ask (Tom King) to update the copy of uniref50 in /import/teaching/bio/data (or the relevant copy on Apocrita if we determine that BLAST cannot be run on local PC)?

samtools view is not required before samtools sort (practical 4)

It is no longer required to pass SAM output from read mappers through samtools view -b. Instead, the SAM output can be passed directly to samtools sort, with BAM output selected using -O BAM (and the output file named with -o).

singularities

Hi Priyam,
can you please add here the singularity def files that were needed?
Thanks,
Yannick

2019 - read cleaning revision

We want to use kmc3 for k-mer filtering step.

All steps will need to be revised. Can adapt from my reads-qc pipeline (L213-236 here), but will need to change kmc_tools' flag to drop reads instead of masking. Alternatively there's probably a more straightforward script from Bruno in scripts repository. Bruno's script might be better (iirc that he used kmc and not khmer).
A step to eliminate orphans will need to be introduced (check with Yannick on how to go about it). My version, based on earlier discussions with Yannick, is at https://github.com/wurmlab/reads-qc/blob/master/separate_orphans.rb, but it can potentially be streamlined.

In practical 1, do not compress after trimming

Issues with 2019 practical

MAKER works normally only on Apocrita, which means files have to be copied back and forth to do gene prediction.
Could not locate the singularity containers with seqserv, genevalidator or blast, and thus could not do the quality control step.
KHMER is not installed - KMC is there instead @yeban
There is no parallel installed locally.

2019 - PR#51 potentially broke symlink commands in practical 4 & 5

PR #51

Per-tile graph doesn't appear in FASTQC report

This is because the FASTQ sequence headers are not valid.

Thanks to Guy for reporting the issue.

Prettier skin

Can the CSS look better? (maybe just remove the bounding box)

Help with get.xloc.sh and update_reference.sh scripts

Hi,

I do hope this meets you well.

I have been running an RNA-Seq analysis on a non-model organism and I stumbled upon your tutorial on Differential Gene Expression. The subsection on Renaming our favorite candidate genes (so we can find them easily) fits perfectly for what I intend to achieve with my analysis. Unfortunately, I am unable to proceed beyond the blastx on the merged.fa data.

I was wondering if you could please grant me access to the scripts "get.xloc.sh and update_reference.sh" so that I may complete that aspect of my analysis. I would be very grateful if you could help me with this.

Many thanks,

Joachim

Input data

I modified read ids so that read cleaning practical is less convoluted.

Need to copy /data/SBCS-MSc-BioInf/2019p/reads.*.fastq.gz to /import/teaching/bio/data (@ivopieniak)
Need to go over the changes to read cleaning practical and stuff in /data/SBCS-MSc-BioInf with Yannick

Uncommented code blocks

There are a few in practical 4
