
Friday May 6th: Finishing the semester

Update your wiki to document your work. Make sure you have posted

1) links to your 6 GONUTS annotations and 6 challenges. If I told you that you could do fewer, remind me on your wiki!

2) your final presentation

Monday May 2nd: Guidelines for 5 min final presentations

I prepared a slide based on one of Ayub's annotations. He submitted the phage A511 receptor binding protein we learned about in journal club. The slide contains:

Journal article information- author, year, journal, title

Protein annotated- name and Uniprot ID

GO term- the one you used, plus whether it is cellular component, biological process or molecular function

Evidence- the code you used

Transfer- let us know if you used the standard annotation as the basis for a transfer(s)!

The published evidence- show the figure(s) you used and summarize any relevant statements from the text.

We'll meet Friday May 6th 9-10:50 am in our classroom for final presentations. These folks do need to attend, but don't need to present if they don't want to- Rhuiridh, Steven, Brittany...

Post your slide to your wiki.

Friday April 22nd: Clarification of Use of GO_REF:0000100 (It's easier than we thought)

I'm basing my answer on a combination of the GO_REF description and the GO consortium guidelines here:
You can make transfer annotations using BLASTP only, HHPred only, or a combination of both. 
BLASTP only = use ISA (Inferred from Sequence Alignment) or ISO (Inferred from Sequence Orthology). To be honest, I'm not sure that I would use ISO for phage, because the ancestry can be complicated and horizontal gene transfer is common. However, if your hit is to a closely related phage and your BLASTP hit meets the "reciprocal best hit" criterion, then use ISO. Otherwise use ISA.
HHPred only = use ISM (Inferred from Sequence Model)
Both: Use ISS (Inferred from Sequence Similarity). The GO consortium documents say: 
An ISS annotation is often based on more than just one type of sequence-based evidence. Often, a host of searches are performed for any given query protein. These searches might include BLAST, profile HMMs, TMHMM, SignalP, PROSITE, InterPro, etc. 
In all cases, remember that the hit should be put into the with/from field and it should be a protein that has an experimental annotation.
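
If it helps to see the rules above written out, here is a minimal Python sketch of the decision. The function and argument names are made up for illustration and are not part of GONUTS or any GO consortium tool; the logic only restates the BLASTP-only / HHPred-only / both cases described above.

```python
def choose_transfer_evidence_code(has_blastp_hit, has_hhpred_hit,
                                  closely_related_phage=False,
                                  reciprocal_best_hit=False):
    """Return the evidence code suggested by the rules above, or None.

    All argument names are hypothetical; each is a yes/no answer that
    you work out yourself from your BLASTP and HHPred results.
    """
    if has_blastp_hit and has_hhpred_hit:
        return "ISS"  # more than one type of sequence-based evidence
    if has_hhpred_hit:
        return "ISM"  # HHPred only
    if has_blastp_hit:
        # ISO only for a closely related phage with a reciprocal best hit
        if closely_related_phage and reciprocal_best_hit:
            return "ISO"
        return "ISA"
    return None  # no homology evidence, so no transfer annotation


print(choose_transfer_evidence_code(True, False))
print(choose_transfer_evidence_code(True, True))
```

Whatever code comes out, the hit still goes in the with/from field, and it still has to be a protein with an experimental annotation.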

Wednesday April 20th: Phage Infographic

Here it is! Thanks to Robbie, Sarah and Brittany for their contributions!

and as ppt and pdf, if that is useful for you.

Wednesday April 13th: Phage Infographic

I am envisioning this poster presentation as a kind of "Phage Hunting, by the numbers". Here are some sites that might inspire you.

  1. A Pinterest page of great scientific infographics.
  2. An infographic about phage therapy.
  3. The Google image search for genomics infographic.

I started an infographic on the tool

What do you want people to know about phage?

How many phages are on the planet (remember Welkin's slide from Roger Hendrix's paper?)?

How big is a phage compared to other organisms?

How big is a phage genome compared to other organisms?

How many phage genes have we discovered?

How many phage genes have function? What does a phage genome look like? (profile of protein types, with majority being unknown function!)

How many phage genes contribute to each part you see in a microscopy image?

How many genes are in the databases vs. how many have been annotated with GO terminology?


How will you display phage genome data in a visually pleasing way? Your data must be quantitative and describe something about phage genomes. Depending on the quality of your work, a contribution might be worth 2 or more annotations!

Genbank files and DNAMaster files for our other Bacillus phages are here.

Wednesday April 6th: UMBC annotations from last year!

Monday April 4th: Special SEA PHAGES GO annotation term

You can use this code for transfer annotations where you choose a homology-based evidence code. An extensive list of all the GO reference codes is here.

go_ref_id: GO_REF:0000100
title: Gene Ontology annotation by SEA-PHAGE biocurators
authors: Ivan Erill, SEA-PHAGE biocurators
year: 2014
abstract: This GO reference describes the criteria used by biocurators of the SEA-PHAGE consortium for the annotation of predicted gene products from newly sequenced bacteriophage genomes in the SEA-PHAGE and other databases and in the GenBank records periodically released to NCBI for these genomes. In particular, this GO reference describes the criteria used to assign evidence codes ISS, ISA, ISO, ISM, IGC and ND. To assign ISS, ISA, ISO and ISM evidence codes, SEA-PHAGE biocurators use a varied array of bioinformatics tools to establish homology and conservation of sequence and structure functional determinants with proteins from multiple organisms with published association to experimental GO terms and lacking NOT qualifiers. These proteins are referenced in the WITH field of the annotation using their xref database accession. The primary tools for homology search in ISS, ISA, ISO and ISM assignments are BLASTP and HHpred, using a maximum e-value of 10^-7 for BLASTP and a minimum probability of 0.9 for HHpred, and manual inspection of alignments in both cases. For ISS and ISA assignments, BLASTP alignments are required to have at least 75% coverage and 30% identity. For ISO assignments, orthology is further validated using reciprocal BLASTP with the identified hit. For HHpred results, ISS or ISM annotations are made only if the source for the original GO annotation explicitly defines a matched domain function, or if more than half of the domains of the query protein are identified in the matching protein. All ISS, ISA, ISO and ISM assignments entail the manual verification of the source for the GO term in the matching protein sequence and critical curator assessment of the likelihood of preservation of function, process or component in the context of bacteriophage biology.
IGC codes are assigned on the basis of suggestive evidence for function based on synteny, as inferred from whole-genome comparative analyses of multiple bacteriophage genomes using primarily the Phamerator software platform, and with special emphasis on the bacteriophage virion structure and assembly genes. When extensive review of published literature on putative homologs reveals no experimental evidence of function, component or process for a particular gene product, it is assigned an ND evidence code and annotated to the root term for Cellular Component, Molecular Function and Biological Process. As part of the review process for assignment of ISS, ISA, ISO, IGC and ISM evidence codes, SEA-PHAGE curators are required to analyze the reference literature for identified matches and shall perform GO annotations with appropriate evidence codes if these were not available.
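
The numeric cutoffs in that abstract can be turned into a quick screen for your own hits. This is only a sketch: the function names are made up, and it assumes you type in HHpred probability as the percentage the server reports (for example 98.7), which is then compared against the 0.9 minimum. It covers only the numbers; the manual inspection of alignments and of the source annotation that the abstract requires still has to be done by you.

```python
# Cutoffs quoted from the GO_REF:0000100 abstract above.
BLASTP_MAX_EVALUE = 1e-7
BLASTP_MIN_COVERAGE_PCT = 75.0
BLASTP_MIN_IDENTITY_PCT = 30.0
HHPRED_MIN_PROBABILITY = 0.9


def blastp_hit_passes(evalue, coverage_pct, identity_pct):
    """True if a BLASTP hit meets the ISS/ISA numeric cutoffs."""
    return (evalue <= BLASTP_MAX_EVALUE
            and coverage_pct >= BLASTP_MIN_COVERAGE_PCT
            and identity_pct >= BLASTP_MIN_IDENTITY_PCT)


def hhpred_hit_passes(probability_pct):
    """True if an HHpred probability (given as a percentage) is at least 0.9."""
    return probability_pct / 100.0 >= HHPRED_MIN_PROBABILITY


# The coverage and identity values here are invented for illustration.
print(blastp_hit_passes(evalue=6.11e-79, coverage_pct=95.0, identity_pct=60.0))
print(blastp_hit_passes(evalue=1.47e-03, coverage_pct=95.0, identity_pct=60.0))
print(hhpred_hit_passes(98.7))
```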



Wednesday March 30th: Challenge Annotation

Today, through 6pm Monday April 4th, you'll be submitting challenges to previous annotations.

Start on the 2016 phage hunters GONUTS page. Pick a team from the scoreboard with points, and take a look at the annotations they've submitted. OR, choose the Round 2 Annotations tab and take a look at submitted annotations. Pick one!

I picked Evoli gp 195, a putative DNA Polymerase. The annotation was submitted by user Corneljl on team Phages Woo!, and the evidence code offered was "IDA" for Inferred From Direct Assay.  Here is the paper they cite as evidence. Does the paper fulfill the criteria of evidence code "IDA"?

If you want to submit a challenge, you can click on the "Challenge" link on the right hand side of an annotation.

What does the rest of the semester look like? Probably like this:

March 30th/April 4th: Challenge

April 6th/11th: Annotate

April 13th/18th: Challenge

April 20th: VCU Poster Symposium for Undergraduate Research and Creativity

April 20th/25th: Annotate

April 27th/May 2nd: Challenge

Throughout this time, you should be working on annotations and challenges outside of class. I will evaluate challenges as 1) high quality, with sufficient evidence, an appropriate evidence code, or evidence of strong sequence homology (can be supported by your own wiki documentation); 2) a clear attempt to provide a thoughtful, quality annotation, where perhaps you found a paper but didn't apply the correct figure or evidence code; or 3) low quality and lacking in thought. You can submit as many as you want, and claim those scoreboard bragging rights! But, I will be looking for each student to submit at least 6 annotations and 6 challenges that are of high quality or a decent attempt. Low quality annotations will be disregarded. UMBC told me most of their annotations are Standard Annotations because the phage world has not yet been inspected by biocurators, so I anticipate you will be digging into the literature (that is why we are doing this!) and providing at least three Standard Annotations, with a literature figure as evidence. Transfer Annotations will come easily after that.

Monday March 28th: Practicing GO Annotation

On Monday, you worked through annotation of the Tail Assembly Chaperone protein of Troll/Eyuki/Megatron, seeing what you could tell from homology vs. the literature. You tried to find the Tail Assembly Chaperone in these files by text search and blast:

Troll Genbank file

Eyuki Genbank file

Megatron genbank file

Lambda genbank file
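
For the text-search half of that exercise, here is a small Python sketch that pulls every /product="..." qualifier out of GenBank flat-file text and keeps the ones containing a phrase. The function name and the sample record are made up for illustration, and the sketch assumes product names are stored in /product qualifiers, as they are in standard GenBank files.

```python
import re


def find_products(genbank_text, phrase):
    """Return the /product values that contain phrase (case-insensitive)."""
    # [^"]* also matches newlines, so values wrapped over two lines are caught.
    raw_values = re.findall(r'/product="([^"]*)"', genbank_text)
    # Collapse the line wrapping and indentation inside each value.
    products = [" ".join(value.split()) for value in raw_values]
    return [p for p in products if phrase.lower() in p.lower()]


# A made-up fragment in GenBank feature-table layout, for demonstration only.
sample = '''     CDS             100..400
                     /product="tail assembly
                     chaperone"
     CDS             500..900
                     /product="portal protein"
'''

print(find_products(sample, "tail assembly chaperone"))
```

To run it on a real file, read the file into a string first and pass that in. A text search only finds genes that someone already labeled with that name, which is why you also ran blast.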

You learned a little bit (probably not enough) about how the Tail Assembly Chaperone is encoded by two genes whose products are merged by a programmed translational frameshift, where the ribosome shifts from one reading frame to another. Therefore, there should be two genes in each genome that are annotated as Tail Assembly Chaperone. In Lambda, the gene products are called gpG and gpGT. You should have learned that you can't really identify a Bacillus phage tail assembly chaperone based on sequence homology to the Lambda tail assembly chaperone.

If you haven't finished, read through the rest of the UMBC annotation guide and follow their approach linking Troll TAC with Lambda, through a bit of luck. It will take you 30 minutes to work through this. What did you find out? Document your steps on your wiki page.

So, how do you perform an annotation? The first rule of GONUTS is that you should automatically assume the Eyuki annotation of "Tail Assembly Chaperone" is wrong. Can you prove the two proteins identified as Tail Assembly Chaperone are in fact Tail Assembly Chaperone?

Evidence codes: You will have to satisfy one of the approved evidence codes:

  • Here's a list of the evidence codes that CACAO students may use:
  1. IDA: Inferred from Direct Assay
  2. IMP: Inferred from Mutant Phenotype
  3. IGI: Inferred from Genetic Interaction - requires with/from field to be filled in
  4. ISS: Inferred from Sequence or Structural Similarity - almost always requires with/from field to be filled in
  5. ISO: Inferred from Sequence Orthology - requires with/from field to be filled in
  6. ISA: Inferred from Sequence Alignment - requires with/from field to be filled in
  7. ISM: Inferred from Sequence Model - requires with/from field to be filled in
  8. IGC: Inferred from Genomic Context

  ALL other codes, even if used correctly, will cause the annotation to be rejected by the judges
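
A quick way to check yourself before submitting is to encode the list above as a lookup table. This is a sketch with made-up names, not anything GONUTS provides. ISS is treated here as requiring with/from, although the list says "almost always".

```python
# Evidence code -> does it need the with/from field filled in?
ALLOWED_CODES = {
    "IDA": False,
    "IMP": False,
    "IGI": True,
    "ISS": True,   # "almost always" per the list above; treated as required here
    "ISO": True,
    "ISA": True,
    "ISM": True,
    "IGC": False,
}


def check_annotation(evidence_code, with_from=""):
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = []
    if evidence_code not in ALLOWED_CODES:
        problems.append(evidence_code + " is not on the CACAO list and will be rejected")
    elif ALLOWED_CODES[evidence_code] and not with_from.strip():
        problems.append(evidence_code + " needs the with/from field filled in")
    return problems


print(check_annotation("IDA"))
print(check_annotation("ISO"))
# "UniProtKB:EXAMPLE" is a placeholder, not a real accession.
print(check_annotation("ISO", with_from="UniProtKB:EXAMPLE"))
```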

Do you know what all those codes mean?

5. What do you see when you look up Eyuki in Uniprot? Provide a screen shot to help you remember, and a link to the Uniprot identifier you'll need.

6. What do you see when you look up Eyuki Tail Assembly Chaperone in Uniprot? Provide a screen shot to help you remember, and a link to any Uniprot identifier you'll need.

7. Search QuickGo using the Eyuki Tail Assembly Chaperone ID from Uniprot. Are there any existing GO annotations in QuickGo? If not, it's open for annotation! Are there annotations that differ from the type you might provide? If yes, it's open for annotation.

8. Complete a transfer annotation using the approach described on the last page of the UMBC annotation guide! Bonus points if you dig into phage genomes in Phamerator to prove to yourself that Tail Assembly Chaperone occurs within the structural gene area, between portal/capsid and the tail lysins/baseplate proteins.

Helpful handouts for students are provided here.

This is a tutorial on how to evaluate the literature and produce a new GO annotation for UDG glycosylase.

Allison's quick steps to get started:

  1. Identify protein of interest. You're probably starting with Megatron/Eyuki and a homologous protein from your region of DirtyBetty or Kida. On Wednesday March 30th, you will identify an interesting annotation to challenge.
  2. Look up that protein in Megatron/Eyuki in Uniprot and QuickGo. Document what you find.
  3. Is that protein annotated in a well-studied phage like lambda, T4, T7, phi29, SPO1? You might look at Genbank files and/or Pubmed (I'm kidding myself, of course you're going to google)
  4. Can you find an existing GO annotation for that protein in a well-studied phage? You'll need to check QuickGo.
    1. If not, you need to find literature to make a new Standard Annotation. A great place to find papers is by going through HHPred hits to the PDB, where protein listings include published papers.
    2. If it is annotated, verify for yourself that it is a good Standard Annotation, then head straight to Transfer Annotation.

Wednesday March 23rd: Introduction to GO Annotation

We will be trained over Skype in how to do GO annotations by Dr. Jim Hu, a professor of Biochemistry at Texas A&M University. His slides are posted as an attachment to that first link ("how to do GO annotations") for you to refer back to as you perform your own annotations.

Introduction guide from UMBC professor Ivan Erill

Paper about GO annotations



Monday March 21st: Comparative Genomics 

Today we're going to talk about comparative genomics of phages, and three ways we can compare genome content. A collection of Bacillus phage genomes was compared using these measures of genome similarity:

  • Phamerator maps- open two genomes using the Phamerator map function, and examine the colorful shading between the two maps. This shading indicates the % identity and E value of blastn nucleotide matches.
  • Dot plots- examines nucleotide identity at a genome-wide scale, for multiple genomes. Phage researchers use this as a first measure of similarity to cluster phage genomes into groups. The rule is phages that are identical over >50% of their genome are placed into the same cluster. We use a tool called Gepard with the input file being a whole-genome(s) fasta file.
  • Splitstree analysis- examines protein content in a collection of genomes and calculates a sort-of phylogeny (don't call it a phylogeny, though) based on shared gene content. Phages that are grouped together on the image have very similar profiles of gene content (they have the same proteins). We use the tool Splitstree, with the input data being a pham table exported from Phamerator.
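
To get a feel for what "shared gene content" means in the last bullet, here is a toy Python sketch that scores each pair of phages by the fraction of phams they share (shared phams divided by all phams seen in either phage). This is not the calculation Splitstree performs; it is a simpler stand-in, and the phage names and pham numbers in the table are invented.

```python
from itertools import combinations


def shared_pham_fraction(phams_a, phams_b):
    """Fraction of phams found in both phages, out of phams found in either."""
    a, b = set(phams_a), set(phams_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_shared_content(pham_table):
    """Score every pair of phages in a {phage name: set of pham numbers} table."""
    return {
        (first, second): shared_pham_fraction(pham_table[first], pham_table[second])
        for first, second in combinations(sorted(pham_table), 2)
    }


# Invented pham numbers, for demonstration only.
toy_table = {
    "PhageA": {1, 2, 3, 4},
    "PhageB": {2, 3, 4, 5},
    "PhageC": {9, 10},
}

for pair, score in pairwise_shared_content(toy_table).items():
    print(pair, round(score, 2))
```

Phages with a high score have very similar profiles of gene content, which is the same idea behind phages grouping together on the Splitstree image.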

Task 1. Tidy up your table, make sure it is as complete as possible. Evaluate iTasser results as you receive them. Identify any genes from Kida that are not in DirtyBetty. Post them to your wiki and add them to the Kida table. These are likely HNH endonucleases or have very poor blast hits. I expect we'll only have a few proteins in this table.

Task 2. Look through Phamerator and add pham number to your table. We don't yet have DirtyBetty in Phamerator, so you will have to use the pham numbers from the best blast hit. You can identify the pham number using the gene number of the best blast hit (Megatron, Eyuki, etc.) and gene size, and looking for that gene in the blast match. Post the pham number and the # of phages with that pham as column 2 in your wiki. I will add to the combined table.

Task 3. Identify the most interesting candidates and write some thoughts on your wiki about what you'd like to investigate. This should be a post that includes both data and reflection on the process and where you think you will go with your analysis. Do you want to annotate a protein with known function, or try to connect one that we have less information about? Do the blast, HHPred and iTasser results for your proteins of interest match or disagree? Do you have a guess as to molecular function vs. biological process? Can you find any papers describing a phage protein with the suspected function? Hyperlink these as attachments to your wiki post.


Wednesday March 14th: Functional Annotation

Today we'll assess what we've learned and get ready to join a competition to better predict protein function and to identify phage proteins that are of interest for follow-up study. You will need to develop an ontological mindset. What's that? It is an analytical approach where you will develop a way of thinking and working through a problem that requires you to document specific evidence for functional annotation. You will determine if the appropriate evidence exists, and then use it to properly categorize the protein you're working on. Along the way, we will collect a library of papers describing functionally characterized phage proteins (a treasure, for me) and contribute functional annotations to a standard database.

To illustrate the problem, take a look at this paper, which describes the combined efforts of 30 different research groups worldwide to assess the best methods to predict protein function. These groups tested 54 different function prediction algorithms on a dataset of 48,298 proteins with no predicted function. They found "GO" functional predictions for 690 proteins (paper says 866, but there are 690 proteins listed in the excel file). These functional annotations describe 1) the biological process the protein belongs to and 2) the molecular function it is predicted to carry out.

It's time to start developing that ontological mindset.

Take a look at Figure 2. Please refer to the actual paper for a fuller description of what is displayed in the figure. This figure compares performance of the top-10 methods for predicting molecular function vs. biological process.

You've used blastp to predict function based on your protein's sequence matching another sequence in the database. Sometimes a conserved domain is identified by the tool. How does Blast rank in their comparison?

You've used HHPred to predict function based on comparing your protein's sequence to sequences of proteins that have been structurally characterized. HHPred isn't reflected in this paper.

Task 1: Go through the region you annotated for DirtyBetty and Kida. Make a small table on your wiki showing any proteins with a predicted function by Blastp or HHPred, at the appropriate quality cutoffs.

DirtyBetty functional annotations | PHAMS | BlastP hits | BlastP conserved domain | Best blastp hit | HHPred hits | Other notes | *iTasser result
15 | | HNH endonuclease | none | HNH endonuclease of Bacillus phage Eyuki

Central domain matches beta-beta-alpha-Me type II restriction endonuclease Hpy99I from Helicobacter pylori with a probability of 98.7 and an E-value of 10^-9. Second hit is tandem zinc finger domain of fission yeast Stc1 from the organism Schizosaccharomyces pombe with a 0.0011 E value.

This sequence matches many other phage proteins and some bacterial proteins in blastp, but almost all of them are tagged as hypothetical and not as HNH endonuclease. | result



Dihydrofolate reductase,

putative dihydrofolate reductase,

Gp244, dihydrofolate reductase

DHFR Superfamily

Dihydrofolate reductase

Central domain matches dihydrofolate reductase, with a probability of 100.0% and an E-value of 2.4E-44.

With similar secondary matches

This sequence matched many Dihydrofolate reductase proteins from different phages, with E-values as low as 7e-116.




Adenylate kinase, putative dephospho-CoA kinase, gp247



Adenylate kinase

Central domain matches adenylate kinase-related protein, with a probability of 99.8% and an E-value of 2.7E-18.

PMKase (phosphomevalonate kinase, another type of kinase) was the secondary match.

This sequence mostly matched adenylate kinase proteins, with E-values as low as 9e-160.

63 | | Thymidylate synthase | Thymidylate synthase

Thymidylate synthase

Thymidylate synthase protein of Bacillus phage Megatron

Thymidylate synthase

**Need some hits for this one


This is a reversed strand

Thymidylate synthase and pyrimidine hydroxymethylase; it matches many other thymidylate synthase proteins in blastp

E-value: 6.11e-79


Hypothetical Protein


Haloacid, Polynucleotide kinase

Putative Phosphate



Hydrolase protein of Bacillus phage Megatron

Putative phosphatase from Eubacterium, human magnesium-dependent phosphatase, AphA class B acid phosphatase/phosphotransferase, D,D-heptose 1,7-bisphosphate phosphatase from E. coli

Central domain matches the structure of the conserved protein coded by locus BT_0820 from Bacteroides thetaiotaomicron, with a probability of 99.7% and an E-value of 3.4E-16. Second hit is the crystal structure of an uncharacterized protein from Listeria monocytogenes serotype 4b, with a probability of 99.3% and an E-value of 1.9E-11.

The Haloacid and Polynucleotide kinase hits both have E-values that are not significant enough.


Haloacid dehalogenase-like hydrolases; it matches many other hydrolase proteins in blastp

E-value: 1.47e-03


Hypothetical Protein

Nicotinamide Phosphoribosyltransferase


Bacillus phage Megatron

Nicotinamide Phosphoribosyltransferase

Bacillus phage Megatron

The first hit is Amides Derived From trans-2-cyclopropanecarboxylic Acid as Potent Inhibitors of Human Nicotinamide Phosphoribosyltransferase (NAMPT), with a 1.1E-78 E Value.

Second hit is the rate-limiting enzyme in the three-step Preiss-Handler pathway for the biosynthesis of NAD.

This sequence matches many other proteins in blastp, from Bacillus, Staph, Listeria and Enterococcus, indicating it is highly conserved in phages from different host bacteria.

Nicotinamide phosphoribosyltransferase; it matches many other nicotinamide phosphoribosyltransferase proteins in blastp

E-value: 3.45e-46


72 | | Hypothetical Protein | none | Bacillus phage Eyuki

The first hit is human phosphoribosyl pyrophosphate synthetase-associated protein 41, with an E-value of 9.5E-55. The second hit is the crystal structure of human phosphoribosyl pyrophosphate synthetase 1.

73 | | endolysin | I got two conserved domains: on the left side a PGRP amidase and on the right side an SH3 superfamily | endolysin, Bacillus phage Megatron | Central domain matches structure of PlyL with a probability of 100% and an E-value of 10^-40. Second hit is structure of catalytic domain of XlyA with a 10^-38 E-value.
75 | | terminase | I got a terminase GpA superfamily | terminus ligase subunit, Bacillus phage Megatron

Central domain matches T4 gp17 ATPase domain mutant complexed with ADP with a probability of 99.8% and an E-value of 10^-19. Second hit is crystal structure of gp17 with a 10^-15 E-value.

82 | | hypothetical protein | I got two conserved domains: on the left side a PI-PLCc_GDPD_SF superfamily and on the right side an AdoMet_MTases superfamily

hypothetical protein,

Bacillus phage Megatron

There was no HHPred hit that was good enough. (sad) | This sequence matched a phosphodiesterase with an E-value greater than 10^-5, but it was still a significant match.
85 | | portal protein | I got a Phage_Portal superfamily | portal protein, Bacillus phage Eyuki | Central domain matches HK97 family phage portal protein from Corynebacterium diphtheriae with a probability of 100% and an E-value of 10^-35. Second hit is crystal structure of bacteriophage G20C portal protein with a 10^-6 E-value.
89 | | capsid, major capsid protein | none | major capsid protein of Bacillus phage Megatron | Central domain matches HK97-like fold in Bordetella bacteriophage with a probability of 97.8% and an E-value of 10^-5. Second hit is capsid protein from bacteriophage Epsilon 15, with a 10^-4 E-value. | This sequence matches many other phage capsid proteins in blastp, from Bacillus, Staph, Listeria and Enterococcus, indicating it is highly conserved in phages from different host bacteria.
101 | | Tail assembly chaperone | Tail assembly chaperone | Tail assembly chaperone Eyuki (1:20) match | no significant matches
104 | | (Tail Lysin 1 for Bacillus phage Eyuki) | none | (Tail lysin 1 for Bacillus phage Megatron) | I feel this sequence is interesting because it heavily matches phages like Hakuna, Megatron and Eyuki throughout.
106 minor tail protein


Tail Fiber

minor tail protein for Megatron


Minor Tail Fiber Eyuki (1:1)

base plate protein from phage Tuc 2009




baseplate protein, putative baseplate protein, baseplate lysozyme

baseplate wedge subunit

baseplate protein Bacillus phage Eyuki

First hit is protein 2IA7, a crystal structure of putative tail lysozyme from Geobacter sulfurreducens. Probability is 99.9% and E-value is 3.2E-22. Second hit is 4HRZ, phage T4 Sheath Initiation Protein gp25. Probability is 99.8% and E-value is 5.3E-26.

The best matches are for Bacillus phages. There are other phages in blastp matches but their identity is very low. I would assume that this sequence of DNA is conserved.




Baseplate j protein, gp38, baseplate assembly protein J

Baseplate J-like protein

Baseplate_J superfamily

Baseplate_J specific hit

conserved domain

Baseplate J protein Bacillus phage Megatron (E-value of 0, 1:1 match)

First hit is protein 5HX2, an in vitro assembled star-shaped hubless T4 baseplate, with a probability of 100% and E-value of 5E-42. Second hit is 3H2T, a crystal structure of gene product 6, baseplate protein of bacteriophage T4. It has a probability of 99.3% and an E-value of 3.5E-11.


baseplate wedge

protein (e value of


best match

Shows up for many Bacillus phages but in blastp, as you scroll down the list, there are also other phages such as Listeria phages and Staphylococcus or Enterococcus phages with high query covers (around 99%). The sequence might not be heavily conserved.




DNA Helicase I, gp42 recombination helicase

Helicase superfamily c-terminal domain

DNA Helicase I Bacillus phage Megatron

First hit is 2OCA, the crystal structure of T4 UvsW, classified as a hydrolase. Probability is 100% with an E-value of 6.5E-54. Second hit is 5FMF, the P-lobe of RNA polymerase II pre-initiation complex. Probability is 100% and E-value is 5.6E-40.

This sequence of DNA shows up in Bacillus phages but also in Staphylococcus, Listeria, and Enterococcus phages. The sequence isn't very conserved.




HTH binding domain protein, DNA binding protein, transcriptional regulator


HTH binding domain protein Bacillus phage Eyuki

First hit is 2V79, crystal structure of the N-terminal domain of DNAD from Bacillus subtilis. Probability is 99.2% and E-value is 1.5E-11. Second hit is 4RBR, a crystal structure of repressor of toxin, a central regulator of Staphylococcus aureus virulence. Probability is 99.2% and E-value is 1.7E-10.

Blastp hits are only Bacillus phages. The DNA sequence is conserved.




Helix-turn-helix binding domain protein, putative transcriptional regulator, gp44


Helix-turn-helix binding domain protein Bacillus phage Megatron

First hit is 3C76, a crystal structure of transcription regulator from Pseudomonas syringae pv. tomato str. DC3000. Probability is 99.8% and E-value is 1.5E-19. Second hit is 4P96, FadR, Fatty Acid Responsive Transcription Factor from Vibrio cholerae, with a probability of 99.8% and E-value of 1.8E-19.

Best blastp matches are for Bacillus phages. The DNA sequence is conserved.



DNA Helicase

DNA helicase II, RecA-like domain, gp46


DNA helicase C terminal domain

DNA Helicase II from Bacillus Phage Megatron

Central domain matches 4NMN-like fold in E. coli with a probability of 100.0% and an E-value of 2.6E-54. Second hit is DnaB Replicative from E. coli, with an 8.1E-54 E-value.

First hit is 4NMN, an Aquifex aeolicus replicative helicase (DNAB) complexed with ADP. Probability is 100% and E-value is 2.6E-54. Second hit is 2R6A, a crystal form BH1. Probability is 100% and E-value is 8.1E-54.

Protein sequence matched with many Bacillus phages: BPS13, W.Ph., Hoody T, and Bastille, which shows that this protein sequence is conserved among many different phages.

Good matches are only to a small number of Bacillus phages. This might mean that the DNA sequence is conserved.

123 | | none | none | Hypothetical Protein - Bacillus Phage Hakuna

Central domain matches 4R9F-like fold in Escherichia coli with a probability of 97.57% and an E-value of 3.1e-05. The main function is a sugar binding protein.

Central domain matches 3OMB-like fold in Escherichia coli with a probability of 97.52% and an E-value of 5e-06. The function also matches a binding protein.

First hit is CpMnBP1 with Mannobiose Bound, classification is a sugar binding protein. Probability is 97.6% and E-value is 3.1E-05. Second hit is 3OMB, a crystal structure of extracellular solute-binding protein from Bifidobacterium longum subsp. infantis. Probability is 97.5% and E-value is 5E-06.

This sequence matches only hypothetical Bacillus proteins in BLAST, but gives a function in HHpred. | Central domain matches 2PH5-like fold found in E. coli. The function serves as a homospermidine synthase.
124 | | Exonuclease II | MPP_superfamily | Exonuclease II from Bacillus Phage Hakuna

Central domain matches 3THO found in Escherichia coli with 100.0% probability and a 2.9E-41 e-value, and a main function of Exonuclease. Second hit is 4LTY with an e-value of 3.4E-37 and a probability of 100.0.

First hit is 3THO, a crystal structure of Mre11:Rad50 in its ATP/ADP bound state. Classification is hydrolase/DNA binding protein. Probability is 100% and E-value is 2.9E-41. Second hit is 4LTY, a crystal structure of E. coli SbcD at 1.8 A resolution, classification is hydrolase. Probability is 100% and E-value is 3.4E-37.

Other top Blast matches were almost completely Bacillus Phages, Brochothrix phage A9 being the top non-Bacillus phage with an E-value of 4e-69.

Most matches are to bacillus phages though there are some staphylococcus phages. This sequence might be exclusive to bacillus phages.

127 | | Exonuclease 1 - Bacillus Phage Hakuna, Megatron, and Eyuki | none | Exonuclease 1 - Bacillus Phage Hakuna (100% ident.)

Central domain matches 3QF7-like fold found in Escherichia coli with a 100% probability, and an E-value of 1.6e-37. The main function is that the RAD50 complex forms an ADP-dependent molecular clamp in DNA double-strand repair.

Central domain matches 3AUY-like fold found in Escherichia coli with a 100% probability match and a 9.9e-37 E-value. The main function is as a crystal structure of RAD50 bound to ADP.

First hit is 3QF7, the MRE11:Rad50 complex forms an ATP dependent molecular clamp in DNA double strand break repair. Probability is 100% and E-value is 1.6E-37. Second hit is 3AUY, a crystal structure of Rad50 bound to ADP. Probability is 100% and E-value is 9.9E-37.

This sequence had a 100% match with 9 other sequences on HHpred, and BLAST matched with multiple Bacillus and Listeria phages. All have similar functions and are primarily found in Escherichia coli.

There are many Bacillus phages, some Listeria phages, and a few Staphylococcus phages, though the best matches are for Bacillus phages, which suggests that the DNA sequence is conserved.




Hypothetical protein, gp52


Hypothetical protein Eyuki_128

First hit is 4G6D, a G1 ORF67, a Staphylococcus aureus sigmaA domain 4 complex, classification DNA binding protein. Probability is 100% and E-value is 1.7E-60. E-value and probability for the second hit are too weak.



130 | | DNA primase - Bacillus phage Megatron, BPS13, and W. Ph. | TOPRIM Superfamily | Bacillus phage Megatron (100% ident.)

Central domain matches 2AU3-like fold found in Escherichia coli with a 100% probability match and a 1.7e-54 E-value. This protein acts as a zinc binder, and RNA polymerase.

Central domain matches 1DD9-like fold found in Escherichia coli with a 100% probability and 1.8e-37 E-value. This protein acts as RNA polymerase.

First hit is 2AU3, crystal structure of the Aquifex aeolicus primase (Zinc Binding and RNA Polymerase Domains), or DNA primase. Probability is 100% and E-value is 1.7E-54. Second hit is 1DD9, the structure of the DnaG catalytic core. Probability is 100% and E-value is 1.8E-37.

This sequence matched with multiple Bacillus, Listeria, Enterococcus, and Staphylococcus phages. On HHpred, the sequence came back with 4 100% probability matches.

Though there are some Staphylococcus phages and Listeria phages, the best matches are with the Bacillus phages. This suggests that the sequence is conserved within Bacillus phages.

132 dUTP nucleotidohydrolase

trimeric_dUTPase superfamily

dUTP nucleotidohydrolase from Bacillus phage Eyuki

4GV8 matched with a probability of 100.0% and an E-value of 4.1E-45. Second hit was 3ZF6 with a probability of 100.0% and an E-value of 9E-45.

First hit is 4GV8, dUTPase from phage phi11 of S. aureus. Probability is 100% and E-value is 4.1E-45. Second hit is 3ZF6, titled "phage dUTPases control transfer of virulence genes by a proto-oncogenic G protein-like mechanism". Probability is 100% and E-value is 9E-45.

Other Blast Bacillus matches: Hakuna, Megatron, W.Ph., BPS10C

All good matches are to Bacillus phages.

135 Hypothetical Protein

None

Hypothetical Protein Eyuki

No significant HHpred matches

136 Holliday junction resolvase - Bacillus phage Hakuna & Megatron

None

Bacillus phage Hakuna (100% ident.)

Central domain matches 2WCW-like fold in E. coli with a 98.7% probability match and a 3.1e-07 E-value. This protein is a Holliday junction resolvase.

Central domain matches 1OB8-like fold, which is a Holliday junction resolving enzyme in E. coli. This came with a 98.3% probability match and an E-value of 8.7e-11.

First hit is 2WCW, Archaeoglobus fulgidus Hjc, a Holliday junction resolvase from an archaeal hyperthermophile. Probability is 98.7% and E-value is 3.1E-07. Second hit is 1OB8, a Holliday junction resolving enzyme. Probability is 98.3% and E-value is 3.2E-06.

This sequence matched with multiple Bacillus, Listeria, Enterococcus, and Staphylococcus phages. HHpred returned no 100% probability results, but it had multiple matches with probabilities in the 90% range.

The best matches are to a small number of Bacillus phages. This suggests that the sequence is conserved among them.

138 Ribonucleoside-diphosphate reductase (alpha subunit)

Ribonucleoside-diphosphate reductase (alpha subunit)

Evalue- 8.53e-131

1:1 with Eyuki ribonucleoside-diphosphate reductase

First match with 2xap_A had a probability of 100.0 and an E-value of 2E-180, and it had the same name as the blastp conserved domain (ribonucleoside-diphosphate reductase). Second match with 3HNC had a probability of 100 and an E-value of 4E-180.

Matches several sequences in blast,  Eyuki, Hakuna, Megatron, BPS13, and W.Ph, with the same functional description as well as several high probability and low Eval matches in HHpred 

Ribonucleoside-diphosphate reductase (beta subunit) - Bacillus phages Eyuki, Hakuna, and Megatron

Ferritin_like Superfamily


Ribonucleoside-diphosphate reductase (beta subunit) Evalue- 1.9E-54

Bacillus phage Eyuki (100% ident.)

Central domain matches 2RCC-like fold which is a putative class I ribonucleotide reductase found in E. coli. This had a probability match of 100% and an E-value of 5.7e-87.

Central domain matches 1MXR-like fold which is a ribonucleotide reductase found in E. coli. The probability match is 100% and the E-value is 2.1e-83.

This sequence mainly matched Bacillus phages on BLAST (matches to other phages were below 32% identity). As for HHpred, there were 17 matches with 100% probability.


Although there were strong matches for both HHpred and Blastp, only Blastp specified a subunit (beta).

142 Flavodoxin

Flavodoxin Evalue 1.78E-8


FMN_red superfamily

1:1 match with Flavodoxin protein of Megatron

First match with 1F4P, probability 100, Evalue 1.3E-28. Second match with 4HEQ, prob 100, Eval 3.7E-28. First match specifies name as Flavodoxin.

Both HHpred and Blastp label the protein sequence as Flavodoxin with Evalues below 10^-5

While Blastp shows lots of matches, only four matches (Bacillus phages Megatron, Hakuna, BPS13, and W.Ph.) have Identities of over 50%.

145 Thioredoxin

Thioredoxin 8.61E-6

1:1 match with Thioredoxin protein of phage Eyuki

First match with 1EP7, prob 99.9, Eval 1.7E-21. Second match with 4OO4, prob 99.9, Eval 2.6E-21.

Although the Evalue for Thioredoxin on Blastp is only 8.61E-6, HHpred confirms the function as Thioredoxin by higher probability matches with much lower Evalues.

152 DNA binding protein

Four detected; they are all in the same superfamily and agree on a DNA binding protein

DNA binding protein with Megatron

First hit had an e-value of 1.3x10^-30 and is for protein-DNA recognition, and the second hit is a DNA binding protein with an e-value of 8.3x10^-31

The matches all seem to be from Bacillus phages, according to blastp

160 DNA polymerase II

Nine detected; they all agree on it being exonuclease or DNA polymerase II

DNA polymerase II with Eyuki

First hit: e-value of 4x10^-58 and says it is a Klenow fragment of DNA polymerase I

Second hit: e-value of 3x10^-54 and says it is DNA polymerase

I think it should be noted that on blastp, the usable results said that this is DNA polymerase II, but HHPred says it is DNA polymerase (I) 
161 I-BasI

Four domains detected; one result labels it as endonuclease, another says nuclease, and the last two say that it is a binding motif

I-BasI with Bacillus Megatron

First hit has an e-value of 1.4x10^-50; result also says it is an endonuclease

I'm assuming binding motifs are similar in structure to endonucleases, and on blastp, the ones with the best query coverage (>90%) were all from Bacillus phages and those 5 sequences all coded for the same protein

162 DNA polymerase II

Six detected (slight variation, but all agree on polymerase)

DNA polymerase II with Hakuna

First hit: e-value of 4.8x10^-56 and is a DNA polymerase (protein-DNA complex)

Second hit: e-value of 6.5x10^-56; DNA polymerase I

Blastp gave lots of results, with a query cover of 98-99% for Staphylococcus, Listeria, and Bacillus phages
168 helix-turn-helix motif

HTH_XRE superfamily

Hypothetical protein in Bacillus phage Hakuna (1:1)

One match: 2B5A, a gene regulation protein with helix-turn-helix motif expressed in E. coli with probability 87.3 and E-value 0.5.

Although HHPred results match predicted function results in blast, the E-value is rather high, indicating not very strong support for this particular match.

179 ssDNA binding domain protein

none

ssDNA binding protein in Bacillus phage Eyuki (1:1)

One match: 2A1K, a single-stranded DNA binding protein expressed in E. coli with probability 97.3 and E-value 4.1x10^-4.

No conserved domains in Blastp; however, HHPred reveals a predicted function
180 recombinase A

RecA multi-domain; P-loop NTPase superfamily


recombinase A in Bacillus phage Eyuki (1:1)

First hit: 3CMU, a homologous DNA-recombination protein expressed in E. coli with probability 100 and E-value 2.9x10^-56. Second hit: 3HR8, a recombination protein expressed in E. coli with probability 100 and E-value 7.7x10^-50.

Matches Protein RECA, recombinase A; homologous recombination, recombination/DNA complex with probability 100.00% and E-value of 2.7 e -56. Second match is Protein RECA; alpha and beta proteins (A/B, A+B), ATP-binding, cytoplasm, damage, DNA recombination with probability 100% and E-value of 7.3 e -50.

Very similar Blastp results in Eyuki, Megatron and Hakuna.

This sequence matches many other Bacillus phages and always matches with recA sequences.

183 RNA polymerase sigma factor

none

RNA polymerase sigma factor 28 in Bacillus phage Megatron (1:1)

First hit: 4NQW, a DNA binding protein expressed in E. coli with probability 97.6 and E-value 1.6x10^-3. Second hit: 1RP3, a transcription protein expressed in E. coli with probability 97.2 and E-value 2.6x10^-3.

Matches ECF RNA polymerase sigma factor SIGK; sigma factor with a probability of 97.55% and E Value of 1.6 e -3. Second match RNA polymerase sigma factor sigma-28 (FLIA) with a probability of 97.23 and E value of 2.6 e -3.

No conserved domains in Blastp, but function results seem consistent among phages. However, HHPred predicts different functions for best-matched proteins.

This sequence just barely does not meet the e value requirement in HHPred but does meet the requirement in Blast P. Almost all of the matches have a function of RNA Polymerase sigma factor.

185 holin

Two detected; both agree that this is a holin protein, but one specifically says Phage lysis holin

holin with Bacillus phage Megatron

HHPred hits were not similar enough to the sequence; first hit had an e-value of 50 and coded for a helix-turn-helix regulatory protein

Matches Regulatory protein CII; helix-turn-helix, transcription activator with a probability of 45.87% and an E-value of 50. Second match with Diaphanous protein homolog 1; helix bundle, protein binding with a probability of 40.98% and an E-value of 50.

On blastp, virtually all of the sequences detected were from Bacillus phages

This sequence does not yield any significant results through HHPred but has a conserved domain through BlastP.
gene_189.txt hypothetical protein

none

hypothetical protein Bacillus phage Eyuki_189

1WPN, 4PY9, 5CET, 2QB7, 3DEV

Matches Putative exopolyphosphatase-related protein; structural genomics with a probability of 100% and E-value of 2 e -36. Second match with Bifunctional oligoribonuclease and PAP phosphatas with a probability of 100% and with E-value 3.6 e -34.

The general consensus from HHpred is that this is an enzyme with phosphatase or pyrophosphatase activity. 1WPN, 4PY9, 5CET, 2QB7

This sequence does not yield any significant results through BlastP but has several significant matches in HHPred, all with the same function.


DNA Polymerase I


none

DNA Polymerase I Bacillus phage Eyuki

3AV0, 3THO, 4LTY, 1II7


Matches DNA double-strand break repair protein MRE11 with 99.77% probability and E-value of 1.6 e -17. Second match with Exonuclease subunit SBCD; meiotic recombination 11 homolog with probability 99.73% and E-value of 1.3 e -16.

There are several matches on HHPred for Escherichia coli. The two main trends indicate that this is either a nuclease or a double-strand DNA break repair enzyme.

Blast matches are largely from Bacillus phage for DNA Polymerase I.

This sequence does not have a conserved domain but both tools come up with very consistent hits of DNA polymerase and repair proteins.

198 1197(18)

Hypothetical Protein

see other

Hypothetical Protein 198

Came up as Rhodococcus ruber 6-oxolauric acid dehydrogenase-like. The E-value was not strong enough to support a function.

204 674(26)

Putative Hydrolase

hydrolase [Bacillus phage Eyuki] is called a dUTPase; it is an enzyme that removes pyrophosphate.
gene_204.txt putative hydrolase

Nucleoside Triphosphate Pyrophosphohydrolase

Bacillus phage Eyuki

4QGP, 2OIE, 2Q73, 2Q5Z

Several matches on HHPred for Escherichia coli, with indicators that suggest a pyrophosphatase function.

Blastp has a conserved domain with proteins that function as Pyrophosphohydrolase, largely from Bacillus phage matches. Matches suggest a hydrolase function.



*YCZA inhibitor of trap was not predicted on BLASTP–> ONLY on HHPred*



*YCZA inhibitor of trap was not predicted on BLASTP–> ONLY on HHPred*

Bacillus phage Eyuki_220

YCZA, inhibitor of trap

Matches to Eyuki with a probability of 96.83% and an E-value of 5.4e-4; a good match

There are twelve phages displayed on Blastp that are Bacillus phages. Some of these phages have the label of hypothetical protein, while others, specifically those of the cereus and anthracis strains, show that there is a transcriptional protein on it. This may or may not make it conserved.

I-TASSER RESULTS ARE BEING PROCESSED


*TC3 transposase was not predicted on BLASTP–> ONLY on HHPred*



*TC3 transposase was not predicted on BLASTP–> ONLY on HHPred*

Bacillus phage Eyuki_222

TC3 transposase

Matches to Eyuki with a probability of 99.7% and an E-value of 1.8e-15; a pretty good match

There are fifteen phages that are Bacillus phages. All but one of these phages have been identified as having hypothetical proteins. The one that was not identified as hypothetical, Bacillus phage BPS13, marked the sequence as a putative homeodomain-like protein. This, however, can be discarded because the probability (56%) in this case is too low.

I-TASSER RESULTS ARE BEING PROCESSED

Bacillus phage Eyuki

Sigma-Factor (Sig-F)


Sigma-Factor (Sig-F)

Bacillus Phage Eyuki

Bacillus phage Eyuki

Sigma Factor (Sig-F)

Matches to Eyuki with a probability of 100% and an E-value of 7.2e-32; a very good match

There are seventeen phages displayed on Blastp that are Bacillus phages. Each of these phages has been indicated to have Sigma-Factors (Sig-F), making it highly conserved in phages from the same host bacteria. Other bacteria have this protein as well.

I-TASSER RESULTS ARE BEING PROCESSED

Bacillus phage Megatron

Beta-Lactamase (Domain) Hydrolase


Beta-Lactamase (Domain) Hydrolase

Bacillus phage Megatron

Bacillus phage Megatron

Ribonuclease Z or phosphodiesterase beta lactamase tRNAse Z

Matches to Megatron 100.0 Probability as well as an E-value of 5.7e-32—> This is a very good match.

There are thirty-eight Bacillus phages displayed on Blastp that indicate there is a Beta-Lactamase protein. Some of the phages showed a Beta-Lactamase domain protein of hydrolase. Either way, this makes it highly conserved in phages from the same host bacteria.

I-TASSER RESULTS ARE BEING PROCESSED

*For the other notes box: What else can you learn by taking a close look at the BlastP and HHPred hits? Do you see hits from only Bacillus phages? Or do you see hits from a variety of bacteria and phages? What can you infer about conservation of this protein?
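One way to work through these questions is to tally the host genus of every significant hit. The sketch below is only an illustration: it assumes you have typed your hits into a list of (organism, E-value) pairs by hand, it borrows the "E-value below 10^-5" rule of thumb that appears in the Flavodoxin notes, and the function names and sample values are made up rather than taken from any real BlastP output.

```python
from collections import Counter


def genus_tally(hits, max_evalue=1e-5):
    """Count significant hits per host genus.

    hits: list of (organism_name, evalue) pairs.
    The genus is assumed to be the first word of the organism name,
    e.g. "Bacillus phage X" -> "Bacillus".
    max_evalue is an assumed cutoff, not an official requirement.
    """
    tally = Counter()
    for organism, evalue in hits:
        if evalue <= max_evalue:
            tally[organism.split()[0]] += 1
    return tally


def conservation_note(tally):
    """Turn a genus tally into a short sentence for the notes box."""
    if not tally:
        return "no significant hits"
    if len(tally) == 1:
        genus = next(iter(tally))
        return f"hits only from {genus} phages"
    return "hits from several host genera: " + ", ".join(sorted(tally))


# Made-up placeholder hits, for illustration only.
example_hits = [
    ("Bacillus phage A", 1e-120),
    ("Bacillus phage B", 3e-98),
    ("Listeria phage C", 2e-30),
    ("Staphylococcus phage D", 0.02),  # above the cutoff, so it is ignored
]

example_tally = genus_tally(example_hits)
print(conservation_note(example_tally))
```

A tally with a single genus hints that the protein may be specific to phages of that host, while several genera hint at broader conservation. The percent identity and query coverage of each hit still need a look before drawing conclusions.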

Kida functional annotations | BlastP hits | BlastP conserved domain | Best blastp hit | HHPred hit | Other notes


Task 2. Let's see what we all know, together. I will paste your table with the genes in order onto this page so we have one master functional annotation table! Due date is noon Sunday March 20th.

Task 3. Use iTasser to examine the predicted function of one protein. Choose one protein from your region, or from someone else's if you have no good candidates, and submit the sequence to iTasser. You will have to set up an account first. This video will tell you how to submit your sequence and evaluate the results. The results will come to your email in a day or two. When you obtain a result, please revisit the video to understand how to interpret the output. You will see that a significant portion of the output involves identifying Gene Ontology (GO) terms, which will be important in our next project.

We will discuss what you found and the best candidates for followup study, and then identify those candidates in homologous phages in Genbank (Megatron, Eyuki, Hakuna). This will be our launching point for participating in CACAO where you will move from Phage Hunter to Biocurator!  "The Community Assessment of Community Annotation with Ontologies (CACAO) is a project to do large-scale manual community annotation of gene function using the Gene Ontology as a multi-institution student competition". 

I know it's terrible but how can you resist. In this competition, 'A functional annotation is a note in a specific format that is made based on the evidence in a peer-reviewed paper about the attributes of a protein'. Over the next couple of weeks, you will learn to find a paper with experimental data that supports a functional annotation, and how to properly document that annotation in Gene Ontology (GO) terms.
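To make "a note in a specific format" a little more concrete, here is a rough sketch of the pieces one annotation carries, using the same items listed in the final presentation guidelines (protein, UniProt ID, GO term, aspect, evidence code, paper). This is not the GONUTS submission form; the class name, field names, and checks are assumptions for illustration, the evidence code list is only partial, and the UniProt ID and reference in the example are placeholders.

```python
import re
from dataclasses import dataclass

# The three GO aspects.
ASPECTS = {"molecular function", "biological process", "cellular component"}

# Partial list of GO evidence codes: some experimental codes plus the
# sequence-based codes discussed for transfer annotations.
EVIDENCE_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP",
                  "ISS", "ISO", "ISA", "ISM"}


@dataclass
class Annotation:
    protein_name: str
    uniprot_id: str
    go_id: str          # "GO:" followed by seven digits
    aspect: str
    evidence_code: str
    reference: str      # the paper that supplies the evidence

    def problems(self):
        """Return a list of simple formatting problems (empty if none)."""
        issues = []
        if not re.fullmatch(r"GO:\d{7}", self.go_id):
            issues.append("GO ID should look like GO:0000000")
        if self.aspect not in ASPECTS:
            issues.append("aspect should be one of the three GO aspects")
        if self.evidence_code not in EVIDENCE_CODES:
            issues.append("evidence code is not in the (partial) list")
        return issues


# Placeholder example: the UniProt ID and reference are not real records.
example = Annotation(
    protein_name="example DNA binding protein",
    uniprot_id="P00000",
    go_id="GO:0003677",            # DNA binding
    aspect="molecular function",
    evidence_code="IDA",
    reference="PMID:0000000",
)
print(example.problems())
```

These checks only catch formatting slips. Whether the paper's figures really support the GO term is still a judgment you have to make by reading the evidence.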

Task 4. Choose a small group of 3-4 students, give your group a name, and fill in the table below so I can submit your teams to the CACAO folks. Make sure you pick a team name that hasn't been used before.

Team Name (Team _____ ) | username | user_real_name (First Last) | user_email

Amay Iyer


Kendrick Jafars | vumh | Mitchell
Kendrick Jafars | rayesd | Danny
Kendrick Jafars | shasan | Shahzeb





  1. When you have workload but gotta walk it off like a boss

    1. Meme challenge- If it's that bad, what does HHPRed look like?

  2. When your annotations don't save.

  3. Unknown User (powellme)

    When you're done with lecture but the lecture isn't done with you 

  4. Unknown User (powellme)

    How Im handling life right now 


  6. ⋆   ✹ · · ˚ .    *    * · ˚ ✹ ⊹ * . . ·        ·     *      * · *    .    ˚ ˚   ⋆ · · ˚      · ✵ . · ˚   ✵         ˚ ˚ .  ⋆    · . ˚ ✦       · ✫       ·    . ⋆ * ·       + ˚ +     .    ·   ⊹  .       ·    . · . ✫       · ✧   ⊹ ⊹ ·   ˚      . ⊹ ✦     · .   ✦   *  .      ⋆   ✧ ˚ . · ✫      ⊹    ˚       ✦ ·    ·                  .     · +   *    * *   ·   ✧ ✫ ✦     * * .   ˚ .    ✵   . ·   *    . *      ✦  · * .   ✧ ˚ *     ˚    ✵    * ˚ ·   ⊹   ✧       ✺                 ✷   ✺ *     ·      

                                                                                                                                 \ (◕ヮ◕\ )

  7. Unknown User (powellme)

    When they expect you to understand GONUTS in one lecture

  8. Trying to figure out what we are doing in GoNuts 

  9. Allison is gonna abandon us on Wednesday

  10. Unknown User (hazardb)