cogent3 / cogent3workshop Goto Github PK

Materials for the Phylomania workshop

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 10.14% Python 89.86%

cogent3workshop's Introduction

Data Wrangling with Cogent3 Workshop

The workshop is geared towards researchers who specialize in developing new phylogenetic methods. Acquiring empirical data for evaluating the performance of a new method is challenging. Your choice of data has important ramifications for your evaluation of the performance of your method. While the standard strategy of recycling published data sets is convenient, this risks perpetuating biases embedded in those data sets.

We will show how cogent3 and associated tools can greatly simplify getting public data for analyses, serving the needs of phylogenetic method developers. As there is an overlap with standard phylogenetic data pre-processing workflows, users of phylogenetic (or molecular evolutionary) methods can also benefit from attending this workshop.

Registration, Time, Location

The workshop is on the afternoon of 21st November 2023 (before the start of Phylomania) and it's free 🎉!

You can register here.

The time: 1pm-5pm, 21st November 2023

The location: Room 329, Maths and Physics, U. Tasmania (Sandy Bay campus).

Workshop Content

The topics we will cover are:

Factors for method developers to consider when designing experiments.
How to extract specific types of sequences (e.g. by homology type, sequence class, or level of divergence) from biological databases like Genbank and Ensembl.
Handling inconsistencies in biological data resources.
Multiple sequence alignment and subsequent quality control.

Prerequisites

All you need for this workshop is a laptop and some familiarity with command-line applications.

Ideally, you should be able to install software on your machine (see Computer Setup page).

More workshop information

Details regarding physical location, zoom link and setting up your computer before the workshop will be posted to the Cogent3 Workshop Wiki.

We look forward to seeing you there!

👾 cogent3's development team.

Cogent3 Funding

Cogent3 has received funding support from the Australian National University and an Essential Open Source Software for Science Grant from the Chan Zuckerberg Initiative.

cogent3workshop's People

Contributors

Stargazers

Watchers

Forkers

khiron yapenglang katherinecaley rmcar17

cogent3workshop's Issues

LO - 3 - Identifying and dealing with data issues

inconsistent meta-data (data wrangling REFSOIL GenBank files)
- demonstrate using annotation_db
- explore using dotplots (e.g., 2 sequences that claim to be the same, dotplot to see if this is the case, ungapped smith waterman score -> p-value)
File formats issues
- Duchene phylip formats, solving using bad_phylip app (Gavin to upload this app)
- extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)

LO - 3 - Unsolved / Important problems

Things that should influence choice of data for analyses. How good are alignments?
Alignment quality scores, pairwise and multiple. Alignment saturation statistics.

LO - 2 - Getting Data

Draft content for block 2 of the workshop.

superset of issues #24, #25, #26.

Getting data from:

list of Genbank IDs (@KatherineCaley todo)
EnsemblLite (@GavinHuttley todo)
Published data (@YapengLang todo and is actually part of LO- 4)

data links for LO - 3

A place to put data links for data required for LO - 3

Rework the docker video

The video needs to be reshot

remove the details about switching git branches
add use of sudo in VS code's terminal on linux installations where the user is not in the docker group
add the installation of VS code extensions for remote containers and docker

LO - 2 - Getting data

Published GenBank ID's (e.g. REFSOIL)
Ensembl downloading and Installing

What do participants need?

working installation of cogent3 (in another issue)
ability to run jupyter notebooks, either via browser or VS Code
terminal access (for Ensembl Cli)
basic experience with scripting, Python advantageous
Add others below

update workshop objectives

edit the wiki page to make it congruent with what we're actually going to present

Add intro to conda to wiki

c3 stickers

two versions, one with just logo, one with QR code that goes to the GitHub (or to cogent3.org ?)

@fredjaya to sort stickers

@KatherineCaley to provide links to the svg files

LO - 2 - Sampling ensembl

This will introduce the (currently under development) command line interface to Ensembl data.

Installation

Within the VS Code terminal, with your virtual environment activated, install EnsemblLite as

pip install git+https://github.com/cogent3/EnsemblLite.git

Notes

We will cover:

Downloading
Installing
Data summaries
Selecting related sequence types
Selecting aligned sequences with position types masked

Add intro to bioconda/PyPi to wiki

Create pre-workshop survey

Main focuses:

What operating system are you on and what privileges do you have on your machine
What is your comfort with coding
What type of data do you typically use
What are your typical analyses

Add intro to pip to wiki

backref from colima

on the wiki page describing colima install add a backlink so users can navigate to the next step spot on the Computer Setup page.

Add "how to ask for help" to wiki

Add intro to git/github to wiki

Write an app that reads in accession numbers and downloads from GenBank.

gaving has code to start this with.

docker image extras

add git
add zsh and make sure system path setup

Plus add more details on the docker command (local directory is the absolute path)

Add docker desktop for linux to wiki docs

Now that Docker Desktop is released for Debian, Fedora, Ubuntu, and Arch we should use that and deprecate the more complicated docker engine installation instructions

Also remove Colima so the instructions are just Docker Desktop for all 3 platforms.

Use a non elevated user for the container

Create a password-less user:USER for the container which has sudo access to the password-less user:root

Install everything possible under the user:USER account

LO - 3 - From orthologs to alignment

Draft content for block 3 of the workshop.

This is the superset of issues #27, #28, and #29.

Data

Workshop data available here (link takes you to dropbox)
Decompress the file and move it to your Cogent3_Workshop_2023 folder.
- Refer to this wiki page for instructions on importing data into your Docker environment.

Please open up a blank notebook in your docker container and set the data directory as follows. Note, that if you are using a different directory structure within Docker, you will need to change the path here accordingly.

from pathlib import Path

data_dir = Path("LO3/data/") # change path here if different.

Now, we are ready to go!

The road from orthologs to alignment 🛣️🚙

The original vector used in the above image was created by Elena Akifeva.

Curating data is like navigating a treacherous road. In this section we will show strategies for data exploration, and show how to use the outcomes of data exploration for the construction of well-informed data-wrangling pipelines, using cogent3 apps. Once we have constructed a pipeline, filtering data is like a well-maintained highway, it's efficient and straightforward. More importantly, these pipelines can be integrated into the alignment process, ensuring automated data provenance and reproducibility.

⚠️ Speed bumps

Small issues that can be easily fixed but might slow you down if you don't know how to fix them.

Overloaded Sequence Labels

Sometimes a fasta file will have an overloaded sequence label. For example, the label of this tardigrade mitochondrial sequence contains a lot of information! Let's have a look using load_seq

Example

from cogent3 import load_seq

overloaded_fasta = data_dir / "EU244602.fa"
tardigrade_mt_seq = load_seq(overloaded_fasta, moltype="dna")
tardigrade_mt_seq.name

Why is this an issue?

Downstream analysis may require consistent naming to collate genomes from one species.
In cogent3, to annotate a Sequence (allowing you the functionality of features) requires an exact match between the names in the sequence and annotation files.

Take a look at the help for load_seq and see if there is a way to get around this issue (without changing the name in the file, which would be tedious if you had a lot of files!).

Tip
Execute load_seq? in a notebook cell to see the help for the function.

👀 Show solution

Solution

To change the name of the tardigrade sequence, use the label_to_name argument with a lambda function.
Lambda function: lambda x: x.split()[0] i.e., split the label on whitespace and take the first element.

seq_stripped = load_seq(
      overloaded_fasta, moltype="dna", label_to_name=lambda x: x.split()[0]
      )
seq_stripped.name

🕳️ Potholes

Issues that you want to avoid, but if you hit one, it shouldn't be too hard to get out of if you are prepared.

File Formats Issues

Sometimes a file does not match the standard of its suffix. For example, a file with the suffix .txt may contain sequence data in fasta format. AM088141.txt is an example of this. Let's see what happens if we try to load it using load_seq.

Example

hidden_fasta = data_dir / "AM088141.txt"
tardigrade_rRNA = load_seq(hidden_fasta, moltype="dna")

If you had a lot of files, it would be a pain to change the suffix for all of them. Take a look at the help for load_seq and see if there is a way to get around this issue.

👀 Show solution

Solution

We can use the format argument to specify the format of the file.

tardigrade_rRNA = load_seq(hidden_fasta, format="fasta", moltype="dna")
tardigrade_rRNA

Note
If the content of the file isn’t a supported format, that is a different can of worms.. we will discuss this problem later.

🚧 Roadblocks

Roadblocks are issues that require a bit more thought to solve. There is often more than one way to get around a roadblock, you just need to find the right detour.

Which sequences are related?

Imagine you have just sequenced (and annotated) the genome of a new soil bacteria 🦠! A core step in science is to put your research into the context of what is known in the field. The context in this case is to compare to other soil bacteria, available in the REFSOIL reference dataset. For any sequence comparison or molecular evolution analyses, we will need information on what genes/sequences are homologous (we will be referring to orthologs since that's what is desired for species relationship reconstruction).

We have a collection of genomes and their annotations. We want to know what genes are related to each other. Can you come up with a strategy to solve this problem?

👀 Possible strategy

Use identical gene identifiers as a proxy for orthology!

This is a hypothetical phylogenomic study based on the ~1k whole microbial genomes downloaded in GenBank format from NCBI.

To do phylogenetics, we need to know orthologs. A naive strategy for identifying these is to select identically named genes from the genomes. Because of gene content variability (typical of microbes), there will be incomplete coverage.

We are aiming to identify a set of genes present in as many species from this collection as possible.

Using cogent3, and in particular the annotation_db module, we can select the maximum number of genes (identified by identical gene names) from the maximum number of species and write them to file. These ~90 genes in 370 species are in the data directory at the top of the section (in data/raw/) and will be used as an example for the rest of this section.

You can load the AnnotationDb for yourself as follows:

from cogent3.core.annotation_db import GenbankAnnotationDb

path = data_dir / "features.gbdb"
db = GenbankAnnotationDb(source=path)
db

Take a look at the documentation for annotation handling here and explore! Ask someone for tips if needed, and tell us about the experience.

⛔️ Error: road not found

These are hard problems to solve.

Which sequences are actually related ?

Following on from the previous step, if we were to blindly assume that identical gene identifiers were indicative of sequences being orthologous, we would be putting a lot of trust into the person who annotated the sequences.
The meta-data associated with a sequence is frequently dubious. How can we verify that the sequences are actually related? Although this is a specific example, the question of whether sequences purported to be orthologous, are actually orthologous, is an extremely common one!

👀 Possible strategy

Wrangling even when GenBank annotations give you trust issues 🫣

There is no "one size fits all" solution to this problem. So let's start exploring.

Starting with just one gene selected from the previous step, "alaS", which we can load into a SequenceCollection using load_unaligned_seqs. Accessing the .num_seqs attribute, we can see that there are 370 sequences in the collection.

from cogent3 import load_unaligned_seqs

alaS_path = data_dir / "raw/alaS.fa"
seq_coll = load_unaligned_seqs(alaS_path, moltype="dna")
seq_coll.num_seqs

In cogent3, a k-mer-based approach offers a quick estimation of sequence relationships without requiring alignment.

"There's an app for that!"

This approach is an example of cogent3 "apps"! Apps are special functions designed to simplify complex tasks. Apps can be used individually, or combined to define a “composed function” (aka pipeline). Composed functions can be applied to a single, or thousands, of data file(s)!

We need two apps for this task:

jaccard_dist(): Calculates pairwise Jaccard distances between the set of k-mers in each sequence within a SequenceCollection.
approx_pdist(): Transforms the matrix of pairwise Jaccard distances into an approximation of "proportion sites different," a measure of sequence divergence.

Note that before we can apply the apps to our data, we need to initialise them.

from cogent3.app.dist import jaccard_dist, approx_pdist

# create instances of the apps
p_dist = approx_pdist()
j_dist = jaccard_dist()

To combine these apps into a pipeline, so that the output of the first is the input of the second, we add them together using the + operator.

pipeline = j_dist + p_dist

If we are applying the pipeline to a single dataset, we can simply provide the data set as the argument to the app. Later in this workshop, we will learn how to apply a pipeline to a collection of datasets using the apply_to() method.

Applying the pipeline to our sequence collection returns a DistanceMatrix object with an approximation of the distance between each pair of sequences.

pw_pdists = pipeline(seq_coll)
pw_pdists

What does the data look like?

A way to get a feel for the data is to visualise the distribution of all pairwise distances. Are there any outliers?

from itertools import combinations
import plotly.express as px

# get distances into list for plotting
dists = [pw_pdists[pair] for pair in combinations(pw_pdists.keys(), 2)]

fig = px.histogram(dists, nbins=50, labels={"value": "p-distance"})
fig.show(width=500, height=500)

How would you describe the distribution❓

What's the worst-case scenario?

Which are the (approximately) most divergent sequences:

import numpy as np

# find max index (using numpy magic) 
max_index_flat = np.argmax(pw_pdists)
max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape)

# index <DistanceMatrix>.names to get the names of these sequences
max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2])
max_pair

Dotplots: Visualising the relationship between two sequences

A dotplot is a visualisation of the relationship between two sequences. The x-axis is the first sequence, the y-axis is the second sequence. Where the sequences have segments that are similar beyond a threshold, a line is drawn.

The visualisation backend for cogent3 is plotly, which means that the dotplot is ✨interactive✨! Try zooming in by clicking and dragging on the plot. Double-click to reset the zoom. Click on the legend to toggle the visibility of the reverse complement matches.

dp = seq_coll.dotplot(max_pair[0], max_pair[1], rc=True)
dp.show(width=500, height=500)

❓Are you convinced that these sequences are orthologous?

What does kath think...

i'm not convinced! But this is an open question.

How can we quantify our confidence in orthology?

When comparing sequences, a fundamental question arises: what's the probability of obtaining a similarity score purely by chance? Karlin and Altschul (1990) derived the theoretical null distribution of local alignment scores¹, enabling the calculation of the probability of a score equal to or higher than the obtained score (a p-value). This theory forms the basis of the BLAST algorithms.

We can directly use this theory to calculate the probability that the observed similarity between two sequences is due to chance. Inside the script filter.py² are a few functions that implement this logic. One of them is theoretical_null_filter(), an app that filters a SequenceCollection based on the theoretical null distribution, retaining only the sequences that satisfy a threshold of probability of being related!³

To use theoretical_null_filter(), we need to set a reference sequence for comparison against all other sequences. Here, let's choose E. coli as the reference. Additionally, we'll specify a desired p-value threshold.

from LO3.filter import theoretical_null_filter

#  we want E. coli as our reference
ref_seq = "NC_000913"

# create instance of the app, with a threshold of 0.01
tnf = theoretical_null_filter(ref_seq, threshold=0.01)
tnf

We can see that the app has been initialised with E. coli as reference, there are other default parameters that we could have changed if we wanted to.

Now we can apply the app to seq_coll to filter out likely non-homologous genes.

seq_coll_filtered = tnf(seq_coll)
seq_coll_filtered.num_seqs

We started with 370 sequences, so the filtering process removed several sequences!

What does the most divergent pair look like now?

pw_pdists = seq_coll_filtered.distance_matrix()

max_index_flat = np.argmax(pw_pdists)
max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape)

max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2])

dp = seq_coll.dotplot(max_pair[0], max_pair[1], rc=True)
dp.show(width=500, height=500)

❓Are you convinced that these sequences are orthologous? (this is again an open question!)

How to assess the quality of an alignment?

Why use cogent3 for alignment?

A quick note on why to use cogent3 for alignments?

even once sequences have been aligned, annotated features remain on the sequences, and can be mapped onto the alignment. This allows you to do things like slice your new sequence based on an annotated gene on another sequence via the alignment.
novel codon aligners!

Prior to aligning, we need to remove the terminal stop codons. Interestingly, some (cds) sequences have internal in-frame stop codons (@Genbank 😬) that we need to remove altogether! There is a custom app in the filter.py script that will do this.

from LO3.filter import remove_stop_codons

# its an app, so init it
no_stop_codons = remove_stop_codons(check_inframe=True)

# apply it to our collection
seq_coll_filtered = no_stop_codons(seq_coll_filtered)

# how many sequences did we lose?
seq_coll_filtered.num_seqs

Now we should be good to align! cogent3 has a progressive alignment app that can align coding sequences using a codon mode. We can use get_app to grab the app, and then apply it to our collection.

from cogent3 import get_app

codon_aligner = get_app("progressive_align", "codon")
aln = codon_aligner(seq_coll_filtered)
aln

As you can see, there are plenty of gaps! To take a look at non-degenerate positions, execute the following

aln.no_degenerates()

❓ How can we assess the quality of this alignment? cogent3 has three different alignment quality metrics, the information content score ("ic_score"), the sum of pairs score ("sp_score"), and the log-likelihood of the alignment ("cogent3_score"). You can access them on an alignment in the following way:

aln.alignment_quality("ic_score")

Warning
Alignment quality scores are imperfect, and there are a few things to keep in mind:

what does a singular score tell you?

describing the quality of an alignment with a score that has good properties is an open problem.

the cogent3_score is only available for alignments produced using cogent3. It is a likelihood measure but suffers from underflow issues.

Are there other ways we can assess the quality of an alignment?

👀 Possible strategies

Visualisation

We can quantify alignment quality via visualisation.

This could be with dotplots, as we did earlier. We can also visualise the number of gaps and entropy using the information_plot method.

aln.information_plot()

There are a lot of gaps! In some regions, almost all sequences have a gap. Some regions where there are very few gaps.

To see if there are certain sequences that are contributing to the gaps, we can use the aln.count_gaps_per_seq() which returns the number of gaps per sequence. Plotting this as a violin plot shows up the distribution of gaps per sequence.

gaps = aln.count_gaps_per_seq()
fig = px.violin(gaps, box=True,)
fig.update_layout(width=500, height=500)
fig.show()

To remove sequences that are contributing a significant number of gaps, we can use the omit_bad_seqs method. omit_bad_seqs returns new alignment without sequences for which the number of uniquely introduced gaps exceeds the specified quantile of the distribution of gaps per sequence (shown in the violin plot).

aln_omitted = aln.omit_bad_seqs(quantile=0.99)
aln_omitted.num_seqs

or use omit_gap_seqs., which removes sequences that have more than a specified percentage gap.

aln_omitted = aln_omitted.omit_gap_seqs(allowed_gap_frac=0.65)
aln_omitted.num_seqs

Off the mountain road and on the highway.

All that we've learned about our data can be seamlessly woven into a pipeline.

We've already seen that apps can be composed together to form a pipeline (recall the jaccard_dist and approx_pdist pipeline). However, we then applied this pipeline to a single SequenceCollection. We can also apply a pipeline to many files at once. This requires us to collect the files that we want to apply the data to in a DataStore. If our files are already in a directory, this is easy! We just use open_data_store, with the following information:

the path to the data
the suffix of the data of interest (indicating that all files with this suffix are of are a "DataMember" of the DataStore)
the mode (read/write)
optionally, an argument to limit the number of files to read/write. In this case, because we're building a prototype and want things to be fast, we will limit the number of files to be 5.

from cogent3.app.io import open_data_store

in_dstore = open_data_store(data_dir / "raw", suffix=".fa", mode="r", limit=5)

Similarly, we need a data store for the output, where the pipeline's result will be written out. Once more, we specify the path, suffix, and mode using open_data_store:

# open data stores for the output
out_dstore = open_data_store(data_dir / "aligned", suffix=".fa", mode="w")

Now let's build our pipeline. Directly from our exploration, we know that we need to do the following:

filter out sequences that are likely not orthologous
remove stop codons (and check for internal stop codons!)
remove sequences that are contributing a significant number of gaps
aligning with the default parameters of the progressive codon aligner appears to be reasonable.

Additionally, there are filtering steps that are common to many analyses, many of these will be predefined apps. To take a look at all predefined apps, execute available_apps() in a notebook cell.

from cogent3.app import available_apps

available_apps()

Our methods may require a minimum length of alignment. For this, we would use the min_length app to filter out alignments that are too short. Perhaps we wish to apply methods to data where the impact of natural selection is minimised, for which we would use the take_codon_positions app to select only the third codon position.

# we can use ``get_app`` to grab the apps that we need
from cogent3 import get_app

reader = get_app("load_unaligned", moltype="dna")
filterer = get_app("theoretical_null_filter", ref_seq)
trim_stops = remove_stop_codons(check_inframe=True)
no_shorties = get_app("min_length", length=300)
codon_aligner = get_app("progressive_align", "codon")
writer = get_app("write_seqs", out_dstore)

# Now we can create the pipeline
align_pipeline = reader + filterer + trim_stops + no_shorties + codon_aligner + writer

Apply the pipeline using the apply_to method using the input data store as the first argument. The pipeline can easily be parallelised using the parallel=True argument. A more complex configuration can be done if required (like telling it to use MPI) but we won't go into this here! Importantly, assigning the result of apply_to to a variable will allow us to interrogate the results of the pipeline run.

r = align_pipeline.apply_to(
      in_dstore, 
      show_progress=True, 
      parallel=True
  )

We can look at the summary of the pipeline run by accessing the describe attribute of the result object.

r.describe

Conveniently, failures do not cause the whole pipeline to fail. Importantly, they are tracked, if we have failures we can interrogate them in the not_completed directory in the data store. To get a summary of failures we can use the summary_not_completed attribute.

r.summary_not_completed

Look inside the output directory to see the results of the pipeline run. In addition to the "curated" data, there is a directory with a detailed file for all failures. There is a directory that collects the log files, take a look at these to see a summary of the pipeline. It provides all the information needed to recreate the exact apps and the exact pipeline you ran, i.e., it is reproducible! 🎉

The theorem assumes that nucleotides in the sequence are independent and identically distributed variables! Something that is not true for real sequences. ↩
Functions in the filter.py script are not yet part of the cogent3 package as they are still in development. Use at your own risk! ↩
This approach works best for sequences with a reasonably even distribution of nucleotides. For sequences with a skewed nucleotide composition, using numerical simulation to generate the null is a better approach, see empirical_null_filter! ↩

Clean up c3dev so we can reference it in the workshop

Add a DockerFile for development
Add how to install docker to c3dev wiki
Add how to develop natively (miniconda) to c3dev wiki

LO - 4 - getting data related to a publication

explain the nature of the problem you were trying to solve
explain the information you needed (a tree, the alignments, information on how those were derived, etc...)
what was actually linked to the paper?
how did you get the data in the end (thanks to David Duchene!)
what was the indicated formats?
what was the experience? (Bad Phylip!)
go to actual definition of phylip format
show actual format -- crying emoji
we will then solve this in the final hour when we write a custom app

Write content for a wiki page that sketches the motivation, providing links to:

the paper, web site(s)
the data url defined within the paper
upload in a separate comment (on this issue) some representative compressed data that has the indicated problem to it, we can then use a link to that issue for folk to be able to download the data

Do all the above as a single comment on this issue. Also make a presentation (10') on this.

@YapengLang write text in a comment on this issue, including screen grabs of the relevant papers so

LO - 2 - explore the ncbi command line tools

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

Test of using devcontainers

test_devcontainer.zip

If you unzip this into a new folder and tell VS code to open that folder - you should be prompted to run it as a container and that will set up a full cogent3 envrironment

Flowchart wrangling and checking data

devise a post workshop survey

we need to understand what went well, what was missing, but give explicit items in there related to learning outcomes

Verify docker is working

Add documentation for verifying a working docker installation by downloading hello-world image, and running the container

Add "how to use cogent natively" to wiki

miniconda
venv

wiki item for loading a file into your container

Wiki page for how to download a file from the wiki and put in your container

LO - 3 - Alignments

identifying divergent sequences prior to alignment and their relationships (using k-mer based tree estimation)
using cogent3 to align, the app, memory, speed, parameters
quantifying alignment quality using dotplots showing alignment path between most divergent sequences
information plots, plotting gaps
Alignment.count_gaps_per_seq(induced_by, unique)
Alignment.omit_bad_seqs()

Setting up compute environment

Docker / Colima installations
Instructions per OS
VS Code, or another approach to running Jupyter
VS Code extensions

Put other points below

LO - 1 - Understanding experimental design issue

This will be an introductory session on the different sequence "types". The target audience are modellers who may not be that well versed in the biochemistry of DNA or how information content differs along a DNA sequence.

We will go over very basic biochemistry concerning DNA replication, the different classes of mutation, and permutation based encodings.

We will also go over very basic descriptions of homology, orthology, paralogy, etc...

This background will then be related to experimental design issues. For example, paired design potential for exons/introns.

Also want to introduce reproducible computation -- scitrack -- somewhere.

Add "How to raise/contribute issues" to wiki

Focus the usage on VS code

Currently getting VS code working is tedious

put connecting via VS Code first (make connecting via browser a separate wiki page that die hards go to)
users will need to install VS code extensions in their OS install in order to use VS code; make sure that's highlighted!
have the required vs code extensions (Python, Jupyter, ...) added by the Docker file so users don't have to "Install the recommended Python plugin"

I get VS code connecting, I also get a notebook created. But when I try to run print() I get

Running cells with '/bin/python3' requires the ipykernel package.
Run the following command to install 'ipykernel' into the Python environment. 
Command: '/bin/python3 -m pip install ipykernel -U --user --force-reinstall'

The VS Code terminal environment is bash, not zsh+ ohmyzh. Possibly configuring this in the VS Code and including the settings json in the Docker image will solve this.

Learning outcomes

Know

Technology

computer setup pages(s), participants to do it before attending)
- installing / updating cogent3 (PyPl or GitHub). (Advice from Peter M was to do it via conda, but this won't work on
  macos if users don't have homebrew + xcode tools.) @ pre-meeting
intro to PyPI and GitHub
intro to pip
intro to conda
explanation of virtual environments (what and why)
how to ask for help (GitHub Discussions)
Raise issues, contribute (Issues, c3dev)
backup get jupyterhub working

LO Understanding experimental design issue

What they need to consider in choosing sequences for study.

Reproducible computation -- scitrack

different sequence types and relevance for experimental design
different sequence relationship types

LO - Getting data

Published GenBank ID's (e.g. REFSOIL)
Published (already aligned) data set (Duchene et al example)
Ensembl downloading and Installing

LO - Sampling ensembl

Downloading
Installing
Data summaries

LO - Identifying and dealing with data issues

inconsistent meta-data (data wrangling REFSOIL GenBank files)
- demonstrate using annotation_db
- explore using dotplots
File formats issues
- Duchene phylip formats, solving using bad_phylip app
- extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)

LO - sampling sequence classes Ensembl

sampling homologous sequences
sampling alignments

LO - Alignments

using cogent3
quantifying alignment quality
visualisation

LO - Sampling alignments

selecting by length
codon positions
consistent species presence

LO - Unsolved / Important problems

alignment quality scores!
pair and multiple

install DockerDesktop (Mac/PC/Linux, Colima (Mac), or apt-get (linux)
install VS code
clone this repo and find docker/DockerFile
watch the 10 min screen cast on the following
- build a docker image from DockerFile in the VS code terminal
- run a container of this image in the VS code terminal
- test accessing Jupyter server running in the container from a browser on the host
  - use cogent3 to do something meaningful - Open a remote window in VS code to attach to the running docker container
- Open or create a notebook (*.ipynb) in VS code
  - use cogent3 to do something meaningful
try doing it

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble

cogent3 / cogent3workshop Goto Github PK

cogent3workshop's Introduction

Data Wrangling with Cogent3 Workshop

Registration, Time, Location

Workshop Content

Prerequisites

More workshop information

Cogent3 Funding

cogent3workshop's People

Contributors

Stargazers

Watchers

Forkers

cogent3workshop's Issues

Installation

Notes

Data

The road from orthologs to alignment 🛣️🚙

⚠️ Speed bumps

Overloaded Sequence Labels

Example

Why is this an issue?

Solution

🕳️ Potholes

File Formats Issues

Example

Solution

🚧 Roadblocks

Which sequences are related?

Use identical gene identifiers as a proxy for orthology!

⛔️ Error: road not found

Which sequences are actually related ?

Wrangling even when GenBank annotations give you trust issues 🫣

"There's an app for that!"

What does the data look like?

What's the worst-case scenario?

Dotplots: Visualising the relationship between two sequences

How can we quantify our confidence in orthology?

What does the most divergent pair look like now?

How to assess the quality of an alignment?

Visualisation

Off the mountain road and on the highway.

Footnotes

Recommend Projects

Recommend Topics

Recommend Org

Jobs