wurmlab / genomicscourse
For QMUL's Genome Bioinformatics MSc module BIO721P & SIB's Spring school in bioinfo & population genomics
Home Page: https://wurmlab.com/genomicscourse/
Link to assembly.md in README for reference_genome in 2017 is broken.
Close to 50% data loss, which is probably too stringent. (I suspect it's the quality filter, which deletes any read that includes a base with -q 10 N)
In variant calling practical (i.e., practical number 4) the input data is fetched as follows:
But popgen data is also available to students at ~/2019-09-BIO721_input (which is a symlink to /import/teaching/bio/data).
Shouldn't we just symlink from ~/2019-09-BIO721_input, similar to the first part of the practical (read cleaning to genome assembly and gene prediction)?
Required for practical 5
@ivopieniak Please can you talk to Tom/Keith/Harry/Alijandro (whoever) and get this done asap
We haven't done manual curation in the last 2-3 years. Should the section be removed? Should we set up WebApollo beforehand?
Is there a quick javascript hack whereby we take a list of image URLs, place them randomly on screen, and allow users to drag them around?
That would enable us to let students do a virtual equivalent of the paper assembly....
Alternatively, maybe using a collaborative Google PowerPoint doc is easiest: that way several students can drag and drop on a single doc (and work on different parts of the assembly).
@yannickwurm headsup / question:
I also installed GV on the server running SequenceServer. So it would be possible for the students to run GV this time instead of only viewing the screenshots. I updated the practical to tell them to try the same example with GV as they would with SequenceServer/BLAST in the previous section.
Is this okay or should I revert this section to only show them the screenshots? If this is okay, then we need to go better prepared to answer GV-related questions.
1. The "Locate input data to work with" section in the read cleaning practical needs to be updated. The data are available on the local PC at /import/teaching/bio/data. Here, we want them to symlink to the data instead of making a copy.
2. The subsequent section, "Set up directory hierarchy to work in", starts by saying 'All work must be done in ~/apocrita'. This is no longer the case. We will want them to create their base dirs in their home directory.
3. As per point 2 above, ~/apocrita needs to be removed from the paths throughout.
4. Related to 3, we need to change 2018 to 2019 throughout. Here, it might be worth considering dropping the day component of the date throughout (i.e., 2019-09 instead of 2019-09-28). The argument in favour of keeping it is that we want to teach them the "right" pattern (full date). However, there's a recurring cost every year (even if only a few minutes) to update this. An alternative idea is to change it to xx (i.e., 2019-09-xx). This is because we don't know the exact date until a few days before the practical anyway, and in the practical we ask them to change the day to today's date anyway.
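As a rough illustration of the naming options discussed in the last point, the sketch below builds a directory name with and without the day component using the standard date command. The read_cleaning suffix is a made-up placeholder, not a name taken from the practicals.

```shell
#!/bin/sh
# Hypothetical directory suffix, used only for illustration.
suffix="read_cleaning"

# Full date pattern (year-month-day), e.g. 2019-09-28-read_cleaning.
full_date_dir="$(date +%Y-%m-%d)-$suffix"

# Month-only pattern, which would not need updating for the exact day.
month_only_dir="$(date +%Y-%m)-$suffix"

echo "full date:  $full_date_dir"
echo "month only: $month_only_dir"
```

With the full-date form, students never have to edit a placeholder day by hand, which may be an argument for generating the name rather than hard-coding it in the practical text.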
The highlighted link in the screenshot below:
to: http://wurmlab.github.io/genomicscourse/2020/practicals/
Students are currently expected to source /import/teaching/bio/biotools.sh themselves for the tools to be available.
Push Tom to have this added to their bash_profile (or similar) automatically.
We tell students that they can check their genes by comparing the sequences to a high quality database using BLAST but it's not made clear what to look for in the BLAST output.
This is all relative to this year's course, e.g. https://github.com/wurmlab/genomicscourse/blob/master/2017/practicals/reference_genome/assembly.md
@adrlar can we suppose that everyone will have the data dir locally (on PC or VDI) in ~/2017-09-BIO721_genome_bioinformatics_input (this directory should be read-only)? Or is there a single (different) path that we can point them to (and ask them to create a link to it in their home dir)?
"Get the data" subsection: needs to be revised according to @adrlar's response
Modify all directory dates (e.g. 2016-10-03-whatever) to their corresponding 2017 version. (Note that the day of the week also changes!)
directory -reference is -reference_genome (change everywhere)
is there a more elegant way to do the k-mer filtering part of L125-145 (line nums may shift) with kmc2? (rather than messy khmer) (ideally without interleaving/uninterleaving) [see comment below]
if sequenceserver is already installed, they don't need to do it (and the text needs to be correspondingly revised)
if the reference db is already downloaded and blast formatted, they don't need to do it. (L252, "Quality control of individual genes")
split into 2 files (one for cleaning, another for assembly)
For example, --min-length instead of -m.
(@yannickwurm's idea)
I believe this is already the case for practicals 1-3. Short form is used in practicals 1-3 when it is the only option (sorry if I missed any). I don't think this is relevant for practical 5.
Practical 4 can further improve in this regard.
Will need more explanations of the different steps it's doing
(ideally an easier UI for providing input, and error messages when it returns nothing!! - or rather we should alert to the risk of finding nothing!)
There is a potential issue, based on our experience with BWA: since SOAP is RAM-intensive, the server may crash if all the students run SOAP at the same time. I have spoken with EECS and they said that this is very likely to happen, since the server and the NFS were not designed to handle this kind of biological software. @yeban @yannickwurm
@ivopieniak How to run IGV?
Everyone finds the directory structure hard to follow. Two ideas that have come up during discussions:
Option 2 is likely simpler for students, but likely a bit more work from our end because things will have to be retested. Also, it encourages more modular analysis.
Option 1 keeps the current setup and is less work (no retesting required).
It probably should. And maybe also to official SAM and VCF spec.
The practical doc is split in a manner that seems designed so that the students SKIP read cleaning rather than actually see it and do it. Please fix.
We need to explain to them that:
ln -s /tmp/foo /tmp/bar creates a tiny file (soft link) at /tmp/bar containing the path /tmp/foo.
[...] /tmp/bar) and access that instead.
[...] ., ..) it will be resolved by the program accessing the soft link relative to the soft link.
So, should one always use absolute paths for creating soft links? In the simplified directory structure we have adopted starting this year, yes, that's fine. But in the classic directory structure, which contains multiple analyses in one big directory, there's merit in using relative paths for the per-analysis input directory (e.g., results/01-read_cleaning/input) and when linking the output of one step as the input of the next. If relative paths are used for soft linking here, then the entire directory can be copied as it is elsewhere; otherwise links may break.
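The difference between relative and absolute soft links when a directory is copied can be sketched as below. This is a minimal illustration in a throwaway sandbox: the project, data and input_rel/input_abs names are hypothetical, and only standard commands (mktemp, ln, readlink, cp, rm) are used.

```shell
#!/bin/sh
# Hypothetical sandbox; every path below is made up for illustration.
sandbox=$(mktemp -d)
analysis="results/01-read_cleaning"
mkdir -p "$sandbox/project/data" "$sandbox/project/$analysis"
echo "ACGT" > "$sandbox/project/data/reads.txt"

# Relative link: the stored path is resolved from the directory holding the link.
ln -s ../../data "$sandbox/project/$analysis/input_rel"
# Absolute link: the stored path always points at the original location.
ln -s "$sandbox/project/data" "$sandbox/project/$analysis/input_abs"

# readlink prints the path stored inside a soft link.
rel_target=$(readlink "$sandbox/project/$analysis/input_rel")

# Copy the whole project (-P keeps links as links), then delete the original.
cp -RP "$sandbox/project" "$sandbox/copy"
rm -rf "$sandbox/project"

# Check which links in the copy still lead to the data.
if [ -e "$sandbox/copy/$analysis/input_rel/reads.txt" ]; then rel_status=ok; else rel_status=broken; fi
if [ -e "$sandbox/copy/$analysis/input_abs/reads.txt" ]; then abs_status=ok; else abs_status=broken; fi

echo "relative link stores '$rel_target': $rel_status"
echo "absolute link: $abs_status"
rm -rf "$sandbox"
```

In the copy, the relative link still resolves because the copied data directory sits in the same position relative to it, whereas the absolute link keeps pointing at the deleted original.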
/cc @MPriebe
--local local alignment; ends might be soft clipped (off)
I think instead of replacing SOAP we can add some advice (e.g., in practice you may want to try multiple assemblers and even a few different assembly parameters), and list a few assemblers (e.g., SPAdes and MaSuRCA) in the practicals.
There are duplicated reads in the files, causing the orphan removal step of the read-cleaning practical to give incorrect results.
To check that reads are indeed duplicated:
# Stage input
ln -s /data/SBCS-MSc-BioInf/2019/reference_assembly/reads.pe*.fastq.gz .
# Count all ids
$ seqtk comp reads.pe1.fastq.gz | cut -f1 | wc -l
977937
# Uniq the ids and then count
$ seqtk comp reads.pe1.fastq.gz | cut -f1 | sort | uniq | wc -l
977710
Both values should match, but don't, indicating that there are several duplicates.
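To see which IDs are duplicated rather than only how many, something along these lines could be used. It is a sketch that relies on awk, sort and uniq instead of seqtk, assumes plain four-line FASTQ records, and runs on a tiny made-up file; a gzipped input would first need decompressing (e.g. with gzip -dc).

```shell
#!/bin/sh
# Build a tiny hypothetical FASTQ file in which read "r2" appears twice.
tmp_fastq=$(mktemp)
printf '@r1 extra\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > "$tmp_fastq"

# Print the ID of each record: first word of every 4th line, minus the leading "@".
# Assumes each record spans exactly four lines.
fastq_ids() {
  awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

total=$(fastq_ids "$tmp_fastq" | wc -l | tr -d ' ')
unique=$(fastq_ids "$tmp_fastq" | sort -u | wc -l | tr -d ' ')
# uniq -d prints one copy of each ID that occurs more than once.
duplicated=$(fastq_ids "$tmp_fastq" | sort | uniq -d)

echo "total=$total unique=$unique duplicated=$duplicated"
rm -f "$tmp_fastq"
```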
One other reason to change the dataset is that it's very unfulfilling to go through the practical and obtain an assembly with an N50 of barely 3kb. If I remember correctly, the input reads are barely 2x genome coverage. 10x-20x should be doable in class and can increase the assembly contiguity considerably.
It's tricky for students to get predicted gene structure or sequences out of MAKER output directory.
Several issues:
It would be nice if we can have students double click on FastQC's HTML report instead of opening it from the terminal:
gio open . works (i.e., it opens a file explorer listing the contents of the current directory)
alias open="gio open"
one of the genes that came up repeatedly (SLIT ?) unambiguously had top hits around length 1500.
But in swissprot, most hits were 300-800 (!).
Should use a bigger db, e.g. uniref50
N50 raw: 178 Kb; N50 cleaned (cutadapt -u 1 -q 12): 143 Kb
More contiguous doesn't necessarily mean more accurate, but it sure can look a bit awkward - what is the point of read cleaning then?
We may want to address this by increasing the coverage. Also, masking low-quality bases (seqtk seq -q 12 -n N) instead of trimming increases N50 to ~154 Kb. Contiguity further improves if orphaned reads are provided to SPAdes. Note also that SPAdes selects different k-mer sizes for raw and cleaned reads.
Credit: Guy
The readData function in the popgen practical fails on the new setup with the following error:
| : | : | 100 %
|==========================Error in setwd(dfile) : cannot change working directory
I have tried passing ./popgenome-vcf as well as the absolute path to the function. I also tried moving the VCFs and passing . to the function. None worked.
FastQC runs the analysis but cannot save or create the HTML report, so the default .zip output is corrupted and useless.
Error message:
'Failed to create an archive: can't create the output stream'
This was mentioned in this thread; however, providing explicit paths to the files for limitations and adapters did not help.
https://www.biostars.org/p/189261/
The suggestion is to use the older version, 0.11.7, rather than 0.11.8.
All mentions of software are enclosed in backticks, causing them to be rendered in typewriter font. Most published work (papers, books), as well as practicals 1 to 3, do not use typewriter font when calling out software by name.
E.g.,
correct: we will use bowtie2
incorrect: we will use `bowtie2`
1/3 of the students have no access to the /import/teaching/bio/ directory which is gonna make it impossible for them to start the practicals.
MAKER caused problems last time. I don't remember exactly why, but I guess it was because the practical asks them to run it on Apocrita but we ended up asking them to run it on the local PC afterwards.
e.g. here - scroll box appears in a short line of code
http://wurmlab.github.io/genomicscourse/2017/practicals/reference_genome/read-cleaning
add that "emerging model organisms" includes most crops, most animal & plant pest species, many pathogens, and most major models for ecology & evolution.
add "Do not jump ahead".
Other tools including fastx_toolkit, kmc2 and Trimmomatic can also be useful. **but we won't use them now** Also, kmc2 is irrelevant given kmc3.
to appropriately trim from the beginning (--cut) and end (--quality-cutoff) of the sequences
to identify a relevant quality cutoff, you'll need to read the cutadapt documentation, understand phred scores, and examine your FastQC quality scores
Say you have sequenced your sample at 45x genome coverage. **This means that every nucleotide of the genome was sequenced 45 times on average... so for a genome of 450,000,000 nucleotides, this means you have xxxxxxxxx nucleotides of raw sequence.**
The real coverage distribution will be influenced by factors including DNA quality, library preparation type and local GC content, but you might expect most of the genome to be covered between 20 and 70x. --> Sentence too long; chop in 2.
Below, we use kmc3 to "mask" extremely rare k-mers (i.e., convert each base of rare k-mers to 'N'). This is because we know that those portions of sequences are not really present in the species.
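For the 45x coverage example in the notes above, the arithmetic can be sketched with shell integer arithmetic: total raw sequence is the average coverage multiplied by the genome size. The coverage and genome size come from the text; the 100 bp read length is a hypothetical value added only to show the conversion to a read count.

```shell
#!/bin/sh
# Values taken from the coverage example in the notes above.
coverage=45             # average sequencing depth (x)
genome_size=450000000   # genome length in nucleotides

# Total raw sequence = coverage x genome size.
raw_nucleotides=$((coverage * genome_size))
echo "$raw_nucleotides nucleotides of raw sequence"

# Hypothetical read length, only to illustrate converting to a read count.
read_length=100
read_count=$((raw_nucleotides / read_length))
echo "$read_count reads of $read_length bp"
```

That is, 45 x 450,000,000 = 20,250,000,000 nucleotides of raw sequence.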
"Quality control of individual genes" section in gene prediction practical asks students to run BLAST on Apocrita and inspect the results on local PC. We want to simplify this if possible:
Test that BLAST can be run in a reasonable time on the local PC (which seems well equipped - 4 HT cores, 16 GB RAM).
-n 8 and this should be reflected in the practical too (or ask Tom and team to change ~/.sequenceserver.conf for both the database_dir and num_threads params).
If BLAST cannot be run on the local PC, the ideal would be to run BLAST on Apocrita and visualise results on the local PC using sequenceserver 2.0 beta:
-html flag (we asked them to figure this out themselves last time, but that might be tough with 50 students?).
Update or ask (Tom King) to update the copy of uniref50 in /import/teaching/bio/data (or the relevant copy on Apocrita if we determine that BLAST cannot be run on the local PC)?
It is no longer required to pass SAM output from read mappers through samtools view -b. Instead, -O BAM can be specified directly to samtools sort.
Hi Priyam,
can you please add here the singularity def files that were needed?
Thanks,
Yannick
We want to use kmc3 for the k-mer filtering step.
kmc_tools' flag to drop reads instead of masking. Alternatively, there's probably a more straightforward script from Bruno in the scripts repository. Bruno's script might be better (iirc he used kmc and not khmer).
PR #51
This is because the FASTQ sequence headers are not valid.
Thanks to Guy for reporting the issue.
Can the CSS look better? (maybe just remove the bounding box)
Hi,
I do hope this meets you well.
I have been running an RNA-Seq analysis on a non-model organism and I stumbled upon your tutorial on Differential Gene Expression. The subsection on Renaming our favorite candidate genes (so we can find them easily) fits perfectly for what I intend to achieve with my analysis. Unfortunately, I am unable to proceed beyond the blastx on the merged.fa data.
I was wondering if you could please grant me access to the scripts get.xloc.sh and update_reference.sh so that I may complete that aspect of my analysis. I would be very grateful if you could help me with this.
Many thanks,
Joachim
I modified read ids so that read cleaning practical is less convoluted.
/data/SBCS-MSc-BioInf/2019p/reads.*.fastq.gz to /import/teaching/bio/data (@ivopieniak)
There are a few in practical 4