Python musings for Bioinformatics: 2012

Sunday, November 4, 2012

Grab sequences using web query (DAS)

Instead of downloading reference sequences and then interrogating them using API/other custom codes, you can make use of DAS client-server system.

Example:

http://www.ensembl.org/das/Homo_sapiens.GRCh37.transcript/features?segment=13:32889611,32973347

Request all transcripts (exons really) in the region [32889611,32973347] on human chromosome 13 (this is where the gene BRCA2 is located in the GRCh37 assembly).

IMPORTANT NOTE!!!: The DAS URL query only returns + strand. If you need the - strand, you have to reverse complement the sequence (I would suggest using the function found in BIOPYTHON).

To use other species/assemblies in:

I) Ensembl:

http://www.ensembl.org/das/dsn

II) UCSC:

http://genome.ucsc.edu/cgi-bin/das/dsn

Example Code:

#Function for passing url as an argument and parsing the XML output for obtaining the sequence.

def py_url(url):
   seq=""
   u=urllib.urlopen(url)
   data=u.readlines()
   for line in data:
       if line.startswith("<"):
           continue
       else:

           seq=seq+line
   return seq

url='http://useast.ensembl.org/das/Homo_sapiens.GRCh37.reference/dna?segment=13:32889611,32973347"

seq=py_url(url)

Saturday, November 3, 2012

Calculate GC Content

#counts the number of occurences of G and C (mononucleotide count)
def count_gc(sequence):
    """Counts the nitrogenous bases of the given sequence.
    Ambiguous bases are counted fractionally.
    Sequence must be in upper case"""
    gc = 0
    for base in sequence:
        if   base in 'GC':     gc += 1.0
        elif base in 'YRWSKM': gc += 0.5
        elif base in 'DH':     gc += 0.33
        elif base in 'VB':     gc += 0.66
    return gc

#output %GC(mononucleotide) for a given sequence
def gc_content(sequence):
    """Calculates the GC content of a DNA sequence.
       Mixed case, gaps and ambiguity codes are permitted"""
    sequence = sequence.upper()
    if not sequence:
        return 0
    return 100.0 * count_gc(sequence) / len(sequence)

Frequency of Bases

#output frequency of bases for a given sequence
def freq_print(sequence):
    sequence = sequence.upper()
    A = float(sequence.count('A'))
    T = float(sequence.count('T'))
    G = float(sequence.count('G'))
    C = float(sequence.count('C'))
    dict_freq={"A":100.0*(A/len(sequence)), "T":100.0*(T/len(sequence)), "G":100.0*(G/len(sequence)), "C":100.0*(C/len(sequence))}
    return dict_freq

Hello Universe

Hi Humans, Aliens and other intelligent entities. I would like to share my knowledge of coding in python for bioinfomatics as I learn new tricks. Feel free to comment, mock and trash my code :-)