Module-4 - Pairwise alignment and BLASTing of protein sequences

4.1 Manual sequence alignment of protein sequences (~45 min)

In this exercise, you will work in your groups to manually generate a pairwise sequence alignment, just like we did last week. Each person in the group draws up an alignment matrix and fills in the scores and the arrows between the cells. Indicate the traceback of the optimal alignment(s), using color or another form of highlighting. It is a good idea to draft your individual solution on a piece of paper first. You can then either take a picture of the paper or use the draw tool in Padlet. Feel free to use the Padlet to share results within the group, so you can compare your answers and check that they are identical and correct.

  • The Padlet you should use is here: https://uio.padlet.org/jonaspaulsen/m7ti9z0rhyf08wub
  • Each group should upload:
    1. one picture/drawing of an alignment matrix
    2. The corresponding alignment(s) (use "code" formatting)
    3. Python code to generate the same alignment(s) (see below)
    4. Write "Done" at the bottom of your column, when your group is ready

To look up the correct values in the substitution matrix, you can use Biopython:

from Bio.Align import substitution_matrices
names = substitution_matrices.load() # This is a list of the names of all available substitution matrices
mat = substitution_matrices.load("BLOSUM62")
mat["A"]["R"] # BLOSUM62 value of Alanine vs. Arginine
  • Use the following code as inspiration. Modify the code according to your group's exercise. Use the code to check that your group's answer is correct:
from Bio import Align
from Bio.Align import substitution_matrices
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.gap_score = -5

seq1 = "WLCW"
seq2 = "WWGW"

alignments = aligner.align(seq1,seq2)

for alignment in alignments:
  print("Score = %.1f:" % alignment.score)
  print(alignment)
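
To compare your hand-drawn matrix cell by cell, the fill step can also be written out directly. The following is a minimal sketch, not the method Biopython uses internally. It assumes a linear gap score of -5 and a small hand-copied excerpt of BLOSUM62 covering only the letters W, L, C and G. The names `BLOSUM62_EXCERPT`, `substitution_score` and `fill_matrix` are made up for this illustration, and the excerpt values should be checked against `substitution_matrices.load("BLOSUM62")`.

```python
# Hand-copied BLOSUM62 values for the four residues used below.
# Assumption: verify these against substitution_matrices.load("BLOSUM62").
BLOSUM62_EXCERPT = {
    ("W", "W"): 11, ("L", "L"): 4, ("C", "C"): 9, ("G", "G"): 6,
    ("W", "L"): -2, ("W", "C"): -2, ("W", "G"): -2,
    ("L", "C"): -1, ("L", "G"): -4, ("C", "G"): -3,
}


def substitution_score(a, b):
    """Look up the (symmetric) score for aligning residue a with residue b."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def fill_matrix(seq1, seq2, gap=-5):
    """Fill a global alignment matrix; seq1 runs down the rows, seq2 across the columns.

    Returns the score matrix and a matrix of arrows, where
    "D" = diagonal, "U" = from the cell above, "L" = from the cell to the left.
    A cell can hold several arrows when there are ties.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    scores = [[0] * cols for _ in range(rows)]
    arrows = [[""] * cols for _ in range(rows)]

    # First column and first row contain only gaps.
    for i in range(1, rows):
        scores[i][0] = i * gap
        arrows[i][0] = "U"
    for j in range(1, cols):
        scores[0][j] = j * gap
        arrows[0][j] = "L"

    for i in range(1, rows):
        for j in range(1, cols):
            candidates = {
                "D": scores[i - 1][j - 1] + substitution_score(seq1[i - 1], seq2[j - 1]),
                "U": scores[i - 1][j] + gap,
                "L": scores[i][j - 1] + gap,
            }
            best = max(candidates.values())
            scores[i][j] = best
            arrows[i][j] = "".join(k for k, v in candidates.items() if v == best)
    return scores, arrows


scores, arrows = fill_matrix("WLCW", "WWGW")
for score_row, arrow_row in zip(scores, arrows):
    print(" ".join(f"{s:>4}{a:<3}" for s, a in zip(score_row, arrow_row)))
print("Optimal global score:", scores[-1][-1])
```

The bottom-right cell holds the optimal global score, which should agree with the score printed by the aligner for the same sequences and gap score. For your own group's sequences, the excerpt has to be extended with the residue pairs that occur in them.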

When the results from the other groups are ready on the Padlet:

  • Discuss in the group: Are the results as expected?

4.2 Running BLAST using the online tool (~30 min)

In this group exercise, we will try to identify human homologs of the RecA protein sequence from E. coli bacteria.

Discuss in the group:

  • What is a homologous sequence?
  • Read about RecA on Wikipedia: https://en.wikipedia.org/wiki/RecA
  • Discuss in the group: Do you expect there to be a homologous sequence in humans?

Then do the following group work:

  • Open this padlet: https://uio.padlet.org/jonaspaulsen/oo4uo5iqeg9mzgo4
  • Choose one person in the group to share their screen
  • Find and copy the E. coli RecA protein sequence (FASTA format) from NCBI: (https://www.ncbi.nlm.nih.gov/protein)
  • Go to https://blast.ncbi.nlm.nih.gov/Blast.cgi
  • Select "Protein BLAST"
  • Paste the copied FASTA entry into the Query sequence field
  • Select (i.e. type in) Organism "Homo sapiens (taxid:9606)"
  • Click "Algorithm parameters" and choose parameters according to what it says under your group in the Padlet
  • Click BLAST
  • Once the results are ready, describe your group's results in the Padlet using text and/or images.

Look at the "Example" to the right of the Padlet to get an impression of how to deliver your group's answer.

Individual work:

4.3 Using affine gap penalties

Use the following code as inspiration to explore the optimal local alignments of the protein sequences RLINLMPWVLATEYKNY and QFFPLMPPAPYWILATDFENY using:

  • Non-affine gap penalty: -5, matrix: BLOSUM62
  • Affine gap penalties (open: -11, extend: -1), matrix: BLOSUM62
  • Affine gap penalties (open: -11, extend: -1), matrix: PAM30
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -11
aligner.extend_gap_score = -1
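aligner.mode = 'local'

seq1 = "RLINLMPWVLATEYKNY"
seq2 = "QFFPLMPPAPYWILATDFENY"

# Print every optimal local alignment together with its score
for alignment in aligner.align(seq1, seq2):
  print("Score = %.1f:" % alignment.score)
  print(alignment)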
  • Explain any differences in the three different local alignments
  • Explain the concept of affine gap penalties
  • How could affine gap penalties be applied in a manual alignment?
  • Advanced: Make a Python script (`proteinalign.py`) that prints alignments of two input protein sequences
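
As a starting point for thinking about affine gap penalties, it can help to tabulate what a single gap of a given length costs under each scheme. The sketch below is only an illustration. It assumes the convention that the open score is charged for the first gap position and the extend score for every further position, which is how I understand Biopython's `open_gap_score` and `extend_gap_score`. Other tools may count the first position differently. The function names are invented for this example.

```python
def linear_gap_score(length, per_position=-5):
    """Non-affine scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_score(length, open_score=-11, extend_score=-1):
    """Affine scheme: the first position pays the open score,
    each additional position pays the extend score (assumed convention)."""
    if length <= 0:
        return 0
    return open_score + (length - 1) * extend_score


# Compare the two schemes for gaps of increasing length.
for length in (1, 2, 3, 5, 10):
    print(length, linear_gap_score(length), affine_gap_score(length))
```

With these parameters, a short gap is cheaper under the linear scheme, while a long gap is cheaper under the affine scheme. This is one reason the local alignments in the exercise can differ.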

4.4 Running BLAST through Biopython

It is possible to run BLAST (like we did in exercise 4.2) using Biopython. This queries the online BLAST server, but stores the results of the search in a Python object. Note that the BLAST search will take a few minutes, just as with the online BLAST tool.

Here is an example of how to do this: a protein sequence is searched with blastp against the entire nr database, with an E-value (expectation value) threshold of 0.01 and at most 10 hits returned. The example uses part of the RecA sequence we worked with in exercise 4.2:

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

seq="RLDIRRIGAVKEGENVVGSETRVKVVKNKIAAPFKQA"
result_handle = NCBIWWW.qblast(program="blastp", database="nr", expect=0.01, hitlist_size=10, sequence=seq)
blast_records = NCBIXML.parse(result_handle)

Just like we saw for Bio.SeqIO and Bio.AlignIO in Module 1, we have a pair of input methods, read and parse: read is for when you have exactly one object, and parse gives an iterator for when you can have many objects. Instead of getting, e.g., a SeqRecord object, we get BLAST record objects.

You can use a for loop to iterate over all the results in blast_records:

for blast_record in blast_records:
    print("**** Num alignments:", len(blast_record.alignments))
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print("****Alignment****")
            print("sequence:", alignment.title)
            print("e value:", hsp.expect)
            print(hsp.align_length)
            print(hsp.query)
            print(hsp.match)
            print(hsp.sbjct)
  • Modify the code above to do a BLAST search for the entire RecA sequence
  • Explain/interpret the results
  • Modify the code above to work for nucleic acids
  • Blast a random nucleic acid sequence of 30 letters, and explain the results
  • Try to blast this sequence: GTCGTACTCGTATCGTGACTAGCTAGCTGCT
  • Modify the E-value cutoff to see if you can get some hits
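
For the random-sequence task, one way to produce a 30-letter nucleic acid sequence is sketched below. It uses only the Python standard library. The function name `random_dna` and its `seed` parameter are made up for this illustration. The commented-out line shows how the result might then be passed to `NCBIWWW.qblast`, assuming the `blastn` program and the `nt` database are what you want to search.

```python
import random


def random_dna(length=30, seed=None):
    """Return a random DNA sequence of the given length.

    Each position is drawn uniformly from A, C, G and T.
    Pass a seed to get the same sequence on every run.
    """
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


seq = random_dna(30)
print(seq)

# Possible next step (needs Biopython and network access, not run here):
# result_handle = NCBIWWW.qblast(program="blastn", database="nt", expect=0.01, hitlist_size=10, sequence=seq)
```

Because the sequence is random, strong hits are not expected at a strict E-value cutoff. Raising the `expect` value is one way to see what kind of chance matches appear.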
