bioinformatics's Issues

1.2 The Burrows-Wheeler Transform

The substring "ana" in "panamabananas$" plays the role of "and" in Watson and Crick’s paper and explains three of the five occurrences of "a" in the repeat "aaaaa" in BWT("panamabananas$") = "smnpbnnaaaaa$a". When the Burrows-Wheeler transform is applied to a genome, it converts the genome’s many repeats to runs. As we already suggested, after applying the Burrows-Wheeler transform, we can apply an additional compression method such as run-length encoding to further reduce the memory required.

Exercise Break: There is only one run of length at least 10 in the E. coli genome. How many runs of length at least 10 do you find after applying the Burrows-Wheeler transform to the E. coli genome?
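The transform and the follow-up compression step can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function names `bwt` and `run_length_encode` are made up for this example, and sorting all rotations is quadratic in memory, so it is only suitable for short strings (a genome-scale run would need a suffix-array-based construction).

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform of text, which must end in a unique '$'.

    Naive approach: sort every cyclic rotation and read off the last column.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse each run of identical symbols into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def count_long_runs(s, min_length):
    """Number of runs in s whose length is at least min_length."""
    return sum(1 for _, length in run_length_encode(s) if length >= min_length)


if __name__ == "__main__":
    transformed = bwt("panamabananas$")
    print(transformed)
    print(run_length_encode(transformed))
```

On "panamabananas$" the run-length encoding contains the pair ('a', 5), which is the "aaaaa" repeat discussed above.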

1.1 Hidden Markov Models Code Challenges

CODE CHALLENGE: Implement the Viterbi algorithm solving the Decoding Problem.

  • Input: A string x, followed by the alphabet from which x was constructed,
    followed by the states States, transition matrix Transition, and emission matrix
    Emission of an HMM (Σ, States, Transition, Emission).

  • Output: A path that maximizes the (unconditional) probability Pr(x, π) over all possible paths π.

Note: You may assume that transitions from the initial state occur with equal probability.
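One possible shape for a solution is sketched below in Python. It assumes a non-empty string x, that the matrices have already been parsed into nested dictionaries (`transition[l][k]` and `emission[k][symbol]`, a layout chosen for this example only), and that the initial state moves to every state with equal probability, as the note above allows. Scores are kept as log-probabilities to avoid underflow on long strings.

```python
import math


def viterbi(x, states, transition, emission):
    """Return a most probable hidden path for the emitted string x."""

    def log(p):
        # Treat zero probabilities as log(0) = -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # score[k]: best log-probability of any path for the prefix read so far
    # that ends in state k.
    score = {k: log(1.0 / len(states)) + log(emission[k][x[0]]) for k in states}
    backpointers = []

    for symbol in x[1:]:
        new_score, pointer = {}, {}
        for k in states:
            # Best predecessor state for arriving in k at this position.
            best_prev = max(states, key=lambda l: score[l] + log(transition[l][k]))
            pointer[k] = best_prev
            new_score[k] = (score[best_prev]
                            + log(transition[best_prev][k])
                            + log(emission[k][symbol]))
        score = new_score
        backpointers.append(pointer)

    # Backtrack from the best final state.
    path = [max(states, key=lambda k: score[k])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
    emission = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
    print(viterbi("xxyy", ["A", "B"], transition, emission))
```

Ties between equally probable paths are broken by the order of `states`, so a grader that accepts only one specific path may need a different tie-breaking rule.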

1.10 Assembly Chapter Code-Graded Problems

1.7 Week 2 Code-Graded Problems

1.4 Epilogue: From Simulated to Real Spectra

You can now see that sequencing Tyrocidine B1 from a real spectrum, for which two-thirds of all masses are false, presents a much more difficult problem than sequencing this peptide from the simulated Spectrum25. In the following challenge problem, you will need to further develop the methods we studied in this chapter to analyze a real spectrum.

Final Challenge: Tyrocidine B1 is just one of many known NRPs produced by Bacillus brevis. A single bacterial species may produce dozens of different antibiotics, and even after 70 years of research, there are likely undiscovered antibiotics produced by Bacillus brevis. Try to sequence the tyrocidine corresponding to the real experimental spectrum below. Since the fragmentation technology used for generating the spectrum tends to produce ions with charge +1, you can safely assume that all charges are +1. Return the peptide as a collection of space-separated integer masses.

Peer-reviewed assignment

  • Based on the definition of N50, define N75.
  • Compute N50 and N75 for the nine contigs with the following lengths:
    [20, 20, 30, 30, 60, 60, 80, 100, 200].
  • Say that we know that the genome length is 1000. What is NG50?
  • If the contig in our dataset of length 100 had a misassembly breakpoint in the middle of it, what would be the value of NGA50?
  • Based on the definition of scaffolds, what information could we use to construct scaffolds from contigs? Justify your answer.
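For checking the arithmetic in the N50-style questions, a small helper can be sketched as follows. It assumes the common definition of Nx as the largest contig length L such that contigs of length at least L cover at least x% of the total assembly length; passing the genome length as `total` gives the NGx variant. The function name `nx` and its parameters are invented for this example, and NGA50 is not covered because it additionally requires splitting contigs at misassembly breakpoints.

```python
def nx(lengths, percent, total=None):
    """Return the Nx statistic for a list of contig lengths.

    percent: integer such as 50 or 75.
    total:   reference length to measure against; defaults to the sum of
             the contig lengths (Nx). Pass the genome length for NGx.
    Returns None if the contigs never reach the required coverage.
    """
    if total is None:
        total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total:
            return length
    return None


if __name__ == "__main__":
    contigs = [20, 20, 30, 30, 60, 60, 80, 100, 200]
    print("N50 :", nx(contigs, 50))
    print("N75 :", nx(contigs, 75))
    print("NG50:", nx(contigs, 50, total=1000))
```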

Continue here as soon as your assembly of the Staph reads has completed.

  • Fill in the 9 missing values in the following 3 x 3 table:
k	N50	#long contigs	total length of long contigs
25			
55			
85
  • Which assembly performed the best in terms of each of these statistics? Justify your answer. Why do you think that the value you chose performed the best?

  • (Multiple choice) When you increase the length of k-mers, the de Bruijn graph ____________. Justify your answer.

    • A) Becomes more tangled.
    • B) Contains more nodes.
    • C) Becomes less tangled.
    • D) Remains the same.

You will use the Quality Assessment Tool for Genome Assemblies, QUAST (Gurevich et al., 2013), to evaluate the quality of your assembly using the Staph reference genome as the gold standard.

  • Download the contigs.fasta file as part of the SPAdes output from the best assembly you chose for question #8 above.

  • Go to QUAST (http://quast.bioinf.spbau.ru/) and upload your contigs.fasta file with the “Add files” button.

  • Leave the “Scaffolds” and “Find genes” boxes unchecked and keep the indicator on “Prokaryotic.”

  • Click on the “Another genome” link underneath “Genome.” Fill in a name and upload the staph_genome.fasta file that we provided for the “Reference” file. (Note: we provide this file as a .txt, you will need to save it as .fasta). Leave the other two inputs (“Genes” and “Operons”) blank and click “Evaluate.”

  • A link to the report should appear on the right side of the page in a few moments.

  • 1. How many misassemblies were there?

  • 2. How significant is the effect of misassemblies on the resulting assembly?

  • 1. What are NG50 and NGA50?

  • 2. How do they compare with the value of N50 that you previously calculated? Why?

  • What is the known species of Staphylococcus that is most similar to the species that you assembled?

1.2 How Do Bacteria Make Antibiotics?

Exercise Break: Solve the Peptide Encoding Problem for Bacillus brevis and Tyrocidine B1 (Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr). How many starting positions in Bacillus brevis encode this peptide? (Genetic code figure reproduced below.)
[Figure: the genetic code]

1.3 The Spectral Convolution Saves the Day

Exercise Break: Run ConvolutionCyclopeptideSequencing on Spectrum25 (reproduced below) with N = 1000 and M = 20. Identify the 86 highest-scoring linear peptides. (Return the peptides in integer format separated by a single space, e.g., 123-57-200-143 199-143-121-60)
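The first stage of ConvolutionCyclopeptideSequencing, building the spectral convolution and picking the M most frequent masses, can be sketched in Python as below. The function names are invented for this example; the 57 to 200 mass window and the "keep ties" rule follow the usual statement of the algorithm and should be treated as assumptions here. The leaderboard stage with parameter N is not shown.

```python
from collections import Counter


def spectral_convolution(spectrum):
    """Multiset of all positive pairwise differences between spectrum masses."""
    return Counter(a - b for a in spectrum for b in spectrum if a > b)


def top_masses(spectrum, m, low=57, high=200):
    """The m most frequent convolution masses in [low, high], with ties."""
    counts = Counter({mass: count
                      for mass, count in spectral_convolution(spectrum).items()
                      if low <= mass <= high})
    ranked = counts.most_common()  # sorted by decreasing multiplicity
    if len(ranked) <= m:
        return sorted(mass for mass, _ in ranked)
    cutoff = ranked[m - 1][1]
    # Keep every mass whose multiplicity ties with the m-th entry.
    return sorted(mass for mass, count in ranked if count >= cutoff)


if __name__ == "__main__":
    print(spectral_convolution([0, 137, 186, 323]))
    print(top_masses([0, 137, 186, 323], 1))
```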

1.5 A Brute Force Algorithm for Cyclopeptide Sequencing

  • Counting Peptides with Given Mass Problem: Compute the number of peptides of given mass.
    Input: An integer m.
    Output: The number of linear peptides having integer mass m.

    Suggestion: If you have difficulty solving this problem or getting the runtime down, please return to it after learning more about dynamic programming algorithms in a later chapter.
    Exercise Break: Solve the Counting Peptides with Given Mass Problem. Recall that we assume that peptides are formed from the following 18 amino acid masses:

G	A	S	P	V	T	C	I/L	N	D	K/Q	E	M	H	F	R	Y	W
57	71	87	97	99	101	103	113	114	115	128	129	131	137	147	156	163	186

It turns out that there are trillions of peptides that have the same integer mass (1322) as Tyrocidine B1 (figure below). Therefore, BFCyclopeptideSequencing is completely impractical, and we will not even bother asking you to implement it.
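The Counting Peptides with Given Mass Problem lends itself to dynamic programming: a linear peptide of mass m ends in some amino acid of mass a, so the count for m is the sum of the counts for m − a over the 18 masses. The Python sketch below follows that recurrence; the names `MASSES` and `count_peptides` are chosen for this example only.

```python
# The 18 distinct integer amino acid masses from the table above.
MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114,
          115, 128, 129, 131, 137, 147, 156, 163, 186]


def count_peptides(m):
    """Number of linear peptides (ordered mass sequences) with total mass m."""
    counts = [0] * (m + 1)
    counts[0] = 1  # the empty peptide
    for total in range(1, m + 1):
        # Condition on the mass of the last amino acid in the peptide.
        counts[total] = sum(counts[total - a] for a in MASSES if a <= total)
    return counts[m]


if __name__ == "__main__":
    for mass in (57, 114, 128, 1322):
        print(mass, count_peptides(mass))
```

Because Python integers are unbounded, the same function can be evaluated at two consecutive large masses; under the growth model in which the count behaves like a constant times an exponential in m, the ratio of those two counts gives a rough estimate of the exponential base.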

  • Exercise Break: This figure suggests that for large m, the number of peptides with given integer mass m can be approximated as k · C^m, where k and C are constants. Find C. (Give your answer as a decimal; the allowable error is 0.002).
    [Figure: number of peptides with given integer mass m]
