Data integration

The first step is to integrate information across complementary biological sources. The figure to the right demonstrates the data integration process. We collaborate with providers of open biological information on proteins to retrieve and assemble the information housed on their sites. We provide feedback to these providers and discuss with them how to map the information to the protein structures. Functional information about a protein structure is based either on the protein's primary amino acid sequence or on its three-dimensional structure. UniProt is our primary source for information attributed to a protein sequence. The Structural Biology Knowledgebase and a variety of other open sources provide different types of information attributed to the protein amino acid sequence, the three-dimensional structure, or both.

The assembled information is reviewed for errors and completeness, then mapped precisely to individual structures. When we know which parts of a structure are associated with a function, we also map that function to the particular residues involved. The information that we retrieve and assemble is refreshed weekly to coincide with the release of new protein structures, and we ask the providers to coordinate their updates with this schedule. The tables below give a complete list of the data resources and annotation types used for protein-level and residue-level assignments, along with links to their websites and most recent publications.

Protein level annotation types

Residue level UniProt annotation types

What types of annotations are used?

We use annotations that define a biological function, as ascribed by the biomedical community, and that can be attributed to more than one protein. Annotations that are unique to a particular protein are not considered functional categories. For example, a RefSeq ID or a UniProt accession number is not a usable annotation type. UniProt keywords, however, are used because they can be assigned to any protein that has the corresponding biological function or feature. The figure to the right gives a brief illustration of annotations as functional categories. It shows 3 different proteins and 4 different functional categories. All 3 proteins are phosphorylated and are therefore annotated with the GO Biological process term Phosphorylation. The GO Biological process Phosphorylation is thus a functional category, and so it is an annotation that we can use. The other bins are also functional categories because they are not specific to a particular protein and can be assigned to any protein that exhibits the corresponding biological function or feature.
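The distinction between identifiers and functional categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual integration pipeline: the protein names, the identifier values, the keyword assignments, and the function name functional_categories are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation records: (protein, annotation type, value).
# Protein names, identifier values, and keyword assignments are invented
# for illustration only.
RECORDS = [
    ("protein_A", "GO Biological process", "Phosphorylation"),
    ("protein_B", "GO Biological process", "Phosphorylation"),
    ("protein_C", "GO Biological process", "Phosphorylation"),
    ("protein_A", "UniProt keyword", "Kinase"),
    ("protein_B", "UniProt keyword", "Kinase"),
    ("protein_A", "UniProt accession", "ACC_A"),
    ("protein_B", "RefSeq ID", "REF_B"),
]

# Annotation types that identify exactly one protein, so they cannot
# serve as functional categories.
IDENTIFIER_TYPES = {"UniProt accession", "RefSeq ID"}


def functional_categories(records):
    """Group proteins by (annotation type, value), skipping identifiers."""
    categories = defaultdict(set)
    for protein, annotation_type, value in records:
        if annotation_type in IDENTIFIER_TYPES:
            continue  # unique to one protein: not a functional category
        categories[(annotation_type, value)].add(protein)
    return dict(categories)


if __name__ == "__main__":
    for category, proteins in functional_categories(RECORDS).items():
        print(category, sorted(proteins))
```

Here the exclusion is made by annotation type rather than by counting proteins, because a category qualifies if it can apply to more than one protein, even when only one annotated example is currently known.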

Protein level vs. residue level annotations

Annotations can be assigned either at the level of the entire protein or to particular residues within the structure. In the figure to the left, we use the vascular endothelial growth factor receptor, or VEGF receptor, to explain the difference between the two levels. At the center of the figure, the entire protein sequence of 1356 amino acids, or residues, is represented as a rectangle. The green areas highlight the residue ranges in the protein that have a determined structure. Protein level annotations describe biological functions attributed to the entire protein rather than to specific areas of it. Examples of annotations made at the level of the entire protein are shown in the upper section of the figure. These include UniProt biological processes and cellular pathway assignments. Residue level annotations describe biological functions attributed to particular residues or residue ranges. For our purpose, we determine which annotated residues lie totally or partially within a range of the sequence that has a determined three-dimensional structure. Examples of annotations for residues of the VEGF receptor are shown in the lower section of the figure. These include the immunoglobulin I-set domain, the protein kinase domain, and phosphotyrosine. Notice that the annotations denoted by black lines are located within areas of determined structure, whereas those denoted by grey lines lie outside those areas.
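The check for whether an annotated residue range lies totally or partially within a structured region can be sketched as an interval overlap test. This is a minimal sketch, not the actual mapping procedure. Only the sequence length of 1356 comes from the text; the structured ranges, the annotation coordinates, and the function name classify are hypothetical and do not reflect the real VEGF receptor data. Ranges are assumed to be 1-based and inclusive, and the structured ranges are assumed not to overlap one another.

```python
SEQUENCE_LENGTH = 1356  # length of the VEGF receptor sequence given in the text

# Hypothetical residue ranges with determined structure (1-based, inclusive).
STRUCTURED_RANGES = [(100, 300), (800, 1200)]


def classify(annotation, structured_ranges):
    """Return 'total', 'partial', or 'outside' for one annotated range."""
    start, end = annotation
    covered = 0
    for s_start, s_end in structured_ranges:
        # Number of residues shared by the annotation and this range.
        overlap = min(end, s_end) - max(start, s_start) + 1
        if overlap > 0:
            covered += overlap
    if covered == 0:
        return "outside"
    if covered == end - start + 1:
        return "total"
    return "partial"


if __name__ == "__main__":
    # Hypothetical annotated ranges; a single residue has start == end.
    examples = {
        "domain fully inside structure": (150, 200),
        "domain straddling a boundary": (250, 350),
        "single modified residue outside": (1300, 1300),
    }
    for name, residues in examples.items():
        print(name, classify(residues, STRUCTURED_RANGES))
```

In terms of the figure, annotations classified as total or partial would correspond to the black lines, and those classified as outside to the grey lines.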