Manipulating DNA sequences and calculating restriction fragments

Chapter 2: Printing and manipulating text

1. Calculating AT content

Here's a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.

Solution:

    from __future__ import division

    my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
    length = len(my_dna)
    a_count = my_dna.count('A')
    t_count = my_dna.count('T')

    at_content = (a_count + t_count) / length
    print("AT content is " + str(at_content))

AT content is 0.685185185185

2. Complementing DNA

Here's a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print the complement of this sequence.

Solution:

    my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
    # replace each base with its complement in lowercase, so that a base
    # that has already been replaced is not replaced a second time
    replacement1 = my_dna.replace('A', 't')
    print(replacement1)
    replacement2 = replacement1.replace('T', 'a')
    print(replacement2)
    replacement3 = replacement2.replace('C', 'g')
    print(replacement3)
    replacement4 = replacement3.replace('G', 'c')
    print(replacement4)
    print(replacement4.upper())

tCTGtTCGtTTtCGTtTtGTtTTTGCTtTCtTtCtTtTtTtTCGtTGCGTTCtT
tCaGtaCGtaatCGatatGataaaGCataCtatCtatatataCGtaGCGaaCta
tgaGtagGtaatgGatatGataaaGgatagtatgtatatatagGtaGgGaagta
tgactagctaatgcatatcataaacgatagtatgtatatatagctacgcaagta
TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA

3. Restriction fragment lengths

Here's a short DNA sequence:

ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT

The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

Solution:

    my_dna = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
    # find() gives the zero-based start of the motif; the cut falls after
    # the G, so fragment one is one base longer than that position
    frag1_length = my_dna.find("GAATTC") + 1
    frag2_length = len(my_dna) - frag1_length
    print("length of fragment one is " + str(frag1_length))
    print("length of fragment two is " + str(frag2_length))

length of fragment one is 22
length of fragment two is 33

4. Splicing out introns, part one

Here's a short section of genomic DNA:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGA
TCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACT
ACTAT

It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence.

Solution:

    my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
    exon1 = my_dna[0:63]
    exon2 = my_dna[90:]
    print(exon1 + exon2)

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGA
TCGAATCATCGATCGATATCGATGCATCGACTACTAT

5. Splicing out introns, part two

Using the data from part one, write a program that will calculate what percentage of the DNA sequence is coding.

Solution:

    from __future__ import division

    my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
    exon1 = my_dna[0:63]
    exon2 = my_dna[90:]
    coding_length = len(exon1 + exon2)
    total_length = len(my_dna)
    print(100 * coding_length / total_length)

78.0487804878

6. Splicing out introns, part three

Using the data from part one, write a program that will print out the original genomic DNA sequence with coding bases in uppercase and non-coding bases in lowercase.

Solution:

    my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
    exon1 = my_dna[0:63]
    intron = my_dna[63:90]
    exon2 = my_dna[90:]
    print(exon1 + intron.lower() + exon2)

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGA
TCGAtcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACT
ACTAT

Chapter 3: Reading and writing files

7. Splitting genomic DNA

Look in the chapter_3 folder for a file called genomic_dna.txt. It contains the same piece of genomic DNA that we were using in the final exercise from chapter 2. Write a program that will split the genomic DNA into coding and non-coding parts, and write these sequences to two separate files.

Hint: use your solution to the last exercise from chapter 2 as a starting point.

Solution:

    # open the file and read its contents
    dna_file = open("/home/daniel/Python/exercises/Chapter_3/exercises/genomic_dna.txt")
    my_dna = dna_file.read()

    # extract the different bits of DNA sequence, using the same
    # exon boundaries as in the chapter 2 exercises
    exon1 = my_dna[0:63]
    intron = my_dna[63:90]
    exon2 = my_dna[90:]

    # open the two output files
    coding_file = open("/home/daniel/Python/exercises/Chapter_3/exercises/coding_dna.txt", "w")
    noncoding_file = open("/home/daniel/Python/exercises/Chapter_3/exercises/noncoding_dna.txt", "w")

    # write the sequences to the output files
    coding_file.write(exon1 + exon2)
    noncoding_file.write(intron)

    # close the output files so that the data is flushed to disk
    coding_file.close()
    noncoding_file.close()

8. Writing a FASTA file

FASTA file format is a commonly-used DNA and protein sequence file format. A single sequence in FASTA format looks like this:

>sequence_name
ATCGACTGATCGATCGTACGAT

Here sequence_name is a header that describes the sequence (the greater-than symbol indicates the start of the header line). Often, the header contains an accession number that relates to the record for the sequence in a public sequence database. A single FASTA file can contain multiple sequences, like this:

>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
>sequence_three
ACTGCATCGATCGTACCT

Write a program that will create a FASTA file for the following three sequences. Make sure that all sequences are in upper case and only contain the bases A, T, G and C.

Sequence header    DNA sequence
ABC123             ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456             actgatcgacgatcgatcgatcacgact
HIJ789             ACTGAC-ACTGT--ACTGTA----CATGTG

Solution:

    # set the values of all the header variables
    header_1 = "ABC123"
    header_2 = "DEF456"
    header_3 = "HIJ789"

    # set the values of all the sequence variables
    seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
    seq_2 = "actgatcgacgatcgatcgatcacgact"
    seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

    # make a new file to hold the output
    output = open("/home/daniel/Python/exercises/Chapter_3/exercises/sequences.fasta", "w")

    # write the header and sequence for seq1
    output.write('>' + header_1 + '\n' + seq_1 + '\n')

    # write the header and uppercase sequence for seq2
    output.write('>' + header_2 + '\n' + seq_2.upper() + '\n')

    # write the header and sequence for seq3 with hyphens removed
    output.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n')

    # close the output file so that the data is flushed to disk
    output.close()

9. Writing multiple FASTA files

Use the data from the previous exercise, but instead of creating a single FASTA file, create three new FASTA files, one per sequence. The names of the FASTA files should be the same as the sequence header names, with the extension .fasta.

Solution:

    # set the values of all the header variables
    header_1 = "ABC123"
    header_2 = "DEF456"
    header_3 = "HIJ789"

    # set the values of all the sequence variables
    seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
    seq_2 = "actgatcgacgatcgatcgatcacgact"
    seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

    # make three files to hold the output, each named after its
    # sequence header with the extension .fasta
    output_1 = open("/home/daniel/Python/exercises/Chapter_3/exercises/" + header_1 + ".fasta", "w")
    output_2 = open("/home/daniel/Python/exercises/Chapter_3/exercises/" + header_2 + ".fasta", "w")
    output_3 = open("/home/daniel/Python/exercises/Chapter_3/exercises/" + header_3 + ".fasta", "w")

    # write sequence 1 to output file 1
    output_1.write('>' + header_1 + '\n' + seq_1 + '\n')

    # write sequence 2 to output file 2
    output_2.write('>' + header_2 + '\n' + seq_2.upper() + '\n')

    # write sequence 3 to output file 3
    output_3.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n')

    # close the output files so that the data is flushed to disk
    output_1.close()
    output_2.close()
    output_3.close()

Chapter 4: Lists and loops

10. Processing DNA in a file

The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment, a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.

Solution:

    # open the input file
    input_file = open("/home/daniel/Python/exercises/Chapter_4/exercises/input.txt")

    # open the output file
    output = open("/home/daniel/Python/exercises/Chapter_4/exercises/trimmed.txt", "w")

    # go through the input file one line at a time
    for dna in input_file:

        # get the substring from the 15th character to the end
        trimmed_dna = dna[14:]

        # get the length of the trimmed sequence, subtracting one
        # for the newline character at the end of the line
        trimmed_length = len(trimmed_dna) - 1

        # write the trimmed sequence to the output file
        output.write(trimmed_dna)

        # print out the length to the screen
        print("processed sequence with length " + str(trimmed_length))

    # close the output file so that the data is flushed to disk
    output.close()

processed sequence with length 42
processed sequence with length 37
processed sequence with length 48
processed sequence with length 33
processed sequence with length 47

11. Multiple exons from genomic DNA

The file genomic_dna.txt contains a section of genomic DNA, and the file exons.txt contains a list of start/stop positions of exons. Each exon is on a separate line and the start and stop positions are separated by a comma. Write a program that will extract the exon segments, concatenate them, and write them to a new file.

Solution:

    # open the genomic dna file and read the contents
    genomic_dna = open("/home/daniel/Python/exercises/Chapter_4/exercises/genomic_dna.txt").read()

    # open the exons locations file
    exon_locations = open("/home/daniel/Python/exercises/Chapter_4/exercises/exons.txt")

    # create a variable to hold the coding sequence
    coding_sequence = ""

    # go through each line in the exon locations file
    for line in exon_locations:

        # split the line using a comma
        positions = line.split(',')

        # get the start and stop positions
        start = int(positions[0])
        stop = int(positions[1])

        # extract the exon from the genomic dna
        exon = genomic_dna[start:stop]

        # append the exon to the end of the current coding sequence
        coding_sequence = coding_sequence + exon

    # write the coding sequence to an output file
    output = open("/home/daniel/Python/exercises/Chapter_4/exercises/coding_sequence.txt", "w")
    output.write(coding_sequence)
    output.close()
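The exon-extraction loop from exercise 11 can also be tried without any files on disk. The following is a minimal sketch, not the exercise's official solution: the function name extract_coding_sequence is hypothetical, and because the contents of genomic_dna.txt and exons.txt are not shown here, it assumes the genomic sequence from the chapter 2 exercises and two illustrative coordinate lines ("0,63" and "90,123") that match the exon boundaries given there.

```python
def extract_coding_sequence(genomic_dna, exon_lines):
    """Concatenate the exons described by 'start,stop' lines (hypothetical helper)."""
    coding_sequence = ""
    for line in exon_lines:
        line = line.strip()          # drop the trailing newline
        if not line:
            continue                 # skip blank lines
        start_text, stop_text = line.split(",")
        # slice the exon out of the genomic DNA and append it
        coding_sequence += genomic_dna[int(start_text):int(stop_text)]
    return coding_sequence


# assumed data: the chapter 2 genomic sequence and its two exon ranges
genomic_dna = ("ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGA"
               "TCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACT"
               "ACTAT")
exon_lines = ["0,63\n", "90,123\n"]

coding = extract_coding_sequence(genomic_dna, exon_lines)
print(coding)
```

Because a file object opened with open() also yields one line per iteration, the same function should accept an open exons file in place of the list, under the assumption that every non-blank line holds exactly two comma-separated integers.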

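The restriction fragment calculation from exercise 3 relies on find() locating the EcoRI motif. As a hedged sketch, the same arithmetic can be wrapped in a function that also covers the case where the motif is absent (find() returns -1, which the straight-line solution would turn into a fragment of length 0). The name fragment_lengths and the cut_offset parameter are assumptions made for illustration, and only the first occurrence of the motif is considered.

```python
def fragment_lengths(dna, motif, cut_offset):
    """Return fragment lengths after a single cut at the first motif occurrence.

    cut_offset is the number of motif bases left of the cut
    (1 for EcoRI, which cuts G*AATTC). Hypothetical helper.
    """
    site = dna.find(motif)
    if site == -1:
        # no recognition site: the sequence stays in one piece
        return [len(dna)]
    first = site + cut_offset
    return [first, len(dna) - first]


my_dna = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
print(fragment_lengths(my_dna, "GAATTC", 1))
```

For the exercise sequence this gives the same two lengths as the solution, 22 and 33.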