Difference between revisions of "Talk:Protein coding sequences"

Revision as of 14:38, 18 March 2009

1 Submitting parts to genbank in an automated fashion using tbl2asn
2 The plan
- 2.1 Obtaining unique identifiers for registry proteins
3 Barry's notes (ignore)
- 3.1 Alphabet soup
- 3.2 Papers
4 Old material
5 Questions
6 Categories
- 6.1 Part level function
- 6.2 Device level function
7 Parameters
8 Columns

Submitting parts to genbank in an automated fashion using tbl2asn

Install [http://www.ncbi.nlm.nih.gov/Sequin/ Sequin]
- Use this to generate a template file for sequence submissions. I think we can reuse the same template file for multiple submissions so Sequin may not need to run on the registry server.
Install [http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html tbl2asn] on the registry server
Generate two files from the registry database for each part. The first is a .fsa file holding a single sequence or a batch of sequences. The second is a feature table (.tbl) file; it looks like one is needed per sequence.
Run tbl2asn to prepare the records to be submitted. The output is one .sqn file per submission.
Send each .sqn file to genbank by email at gb-sub@ncbi.nlm.nih.gov.

Ok, given a .fsa file (and optionally a .tbl file) for a set of parts, I can generate the .sqn file that we could email to the above address using a single terminal command.

Next questions are -

Can we just submit a bunch of sequences to Genbank?
Can we generate the .fsa files and .tbl files from the registry records? One issue we will run into here is that we need to be sure to generate a feature for every CDS we want to turn into a protein.
Can we set all the options on each sequence, such as organism, genetic code, etc., through the fasta definition line?
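
As a rough sketch of what the second and third questions might look like in practice, the Python below builds the text of a .fsa file and a .tbl file for one part. It assumes that the bracketed `[organism=...]` and `[gcode=...]` definition-line modifiers and the tab-delimited five-column feature table layout are what tbl2asn expects; both should be checked against the tbl2asn documentation. The function names, the part identifier and the product name are made up for illustration.

```python
def fasta_record(seq_id, sequence, organism, gcode, width=60):
    """Return FASTA text whose definition line carries source modifiers."""
    # Assumption: modifiers are written as [key=value] after the sequence ID.
    defline = f">{seq_id} [organism={organism}] [gcode={gcode}]"
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([defline] + lines) + "\n"


def feature_table(seq_id, cds_features):
    """Return feature table text with one CDS entry per feature.

    cds_features is a list of (start, stop, product) tuples using
    1-based, inclusive coordinates on the sequence named seq_id.
    """
    rows = [f">Feature {seq_id}"]
    for start, stop, product in cds_features:
        # Feature line: start, stop, feature key.
        rows.append(f"{start}\t{stop}\tCDS")
        # Qualifier line: three empty columns, then key and value.
        rows.append(f"\t\t\tproduct\t{product}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical part: identifier, sequence and product are placeholders.
    print(fasta_record("BBa_EXAMPLE", "ATGAAATAA", "synthetic construct", 11))
    print(feature_table("BBa_EXAMPLE", [(1, 9, "example protein")]))
```

Emitting one CDS entry per tuple is the point of the second function: any CDS left out of the table would not be translated into a protein record.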

The plan

Obtaining unique identifiers for registry proteins

Deposit all registry parts in [http://www.ncbi.nlm.nih.gov/Genbank/ genbank] in a programmatic way using [http://www.ncbi.nlm.nih.gov/Sequin/ Sequin] or [http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html tbl2asn]. tbl2asn is a command line program tailored for batch sequence submissions to genbank.
- Doing this will generate a [http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html gi] and [http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html Version] number for every sequence.
- Should probably include some identifier, such as the BioBrick number, in some field so we can retrieve these records easily later.
All the annotated CDS will be translated and will show up in [http://www.uniprot.org/help/uniparc UniParc]. UniParc is updated every three weeks. Note: in reality the path is a bit more complicated - the annotated CDS we deposit in genbank will be translated by NCBI and show up in the Entrez protein database. From there, the record will be scraped by UniParc.
The registry generates a record in a table called "protein" for every CDS in the registry. For each protein sequence, query UniParc to see if a [http://www.uniprot.org/help/uniparc UPI] exists. If not, give the protein a temporary identifier and remember to query UniParc later to get a stable UPI for the protein. The UPI becomes the stable, unique identifier of the registry protein.
Once a UPI is found in UniParc, retrieve all cross-reference identifiers from other databases and add some or all of them to the record for the protein.
Via magic*, derive a common name/nickname/family name for the protein from entries in the external databases. *I'm still working on this one...
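
The identifier bookkeeping in the plan could be sketched roughly as follows. This is only an illustration: the UniParc query is passed in as a plain function (`lookup_upi`) rather than a real web-service call, and the `TMP_` prefix, the MD5-based temporary identifier and the dictionary layout of a protein record are all assumptions, not anything the registry currently does.

```python
import hashlib


def assign_identifier(sequence, lookup_upi):
    """Return (identifier, is_stable) for one protein sequence.

    lookup_upi is any callable that takes a sequence and returns a UPI
    string, or None when UniParc has no record for it yet.
    """
    upi = lookup_upi(sequence)
    if upi is not None:
        return upi, True
    # Hypothetical temporary scheme: a prefix plus a digest of the sequence.
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest()[:12]
    return f"TMP_{digest}", False


def refresh(proteins, lookup_upi):
    """Re-query every protein that still carries a temporary identifier."""
    for protein in proteins:
        if not protein["stable"]:
            protein["id"], protein["stable"] = assign_identifier(
                protein["sequence"], lookup_upi)
    return proteins


if __name__ == "__main__":
    known = {}  # stands in for UniParc; starts empty
    ident, stable = assign_identifier("MAA", known.get)
    proteins = [{"sequence": "MAA", "id": ident, "stable": stable}]
    known["MAA"] = "UPI0000000002"  # made-up UPI appearing at a later update
    print(refresh(proteins, known.get))
```

Running `refresh` on a schedule that matches the UniParc update cycle would be one way to "remember to query UniParc later".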

An example of the process in action -

Reshma submitted the ampicillin resistance gene (version number [http://www.ncbi.nlm.nih.gov/nuccore/169656099 EU496092.1]) to genbank when submitting her vector paper.
The annotated CDS in that genbank record is translated to yield a protein record ([http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=ACA62827.1 ACA62827.1]).
Searching UniParc using the [http://www.ebi.ac.uk/Tools/picr/ PICR] tool for that protein sequence returns a UniParc UPI ([http://www.uniprot.org/uniparc/UPI000002E509 UPI000002E509]). You can get the same result by searching for the accession number rather than the protein sequence. This search can be done programmatically using the SOAP and REST interfaces to PICR.
The UniParc record for the protein sequence finds cross references in 13 other databases (with multiple hits in each since they are non-redundant databases).
As an example, the first cross-reference for the UPI is an EMBL/genbank accession number ([http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=171466749 ACB46535.1]) and shows that the gene product is "beta-lactamase". Doing the same thing with the second cross reference from UniParc says the gene product is "synthetic beta-lactamase resistance protein".

Barry's notes (ignore)

What do we actually want for proteins?

A link to information in the outside world about that protein
- Based on this information, it would be nice to automatically identify some sort of family or classification that the protein belongs to.
A measure of what other proteins in the registry the protein is similar to.

The first problem can likely be solved simply by searching for the protein sequence in UniParc (what about UniRef?). If that doesn't return an exact match, we can turn to BLAST to see what the protein is most similar to. This could be run ahead of time on every protein sequence.

The second problem can be solved by using something like BLink to return all related pid numbers and then comparing that list to the list of pid numbers in the registry. If that fails, turn to a precompiled BLAST search on all registry proteins.
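
A minimal sketch of that comparison, assuming the related pids have already been fetched somehow and that the precompiled BLAST results are available as a dictionary keyed by pid. The function name, the argument layout and the pids in the example are hypothetical.

```python
def similar_registry_proteins(pid, related_pids, registry_pids, blast_hits):
    """Return (pids, source) for registry proteins similar to pid.

    related_pids: pids reported as related to pid (e.g. from BLink).
    registry_pids: pids of all proteins in the registry.
    blast_hits: dict mapping a pid to a list of registry pids found by
        a precompiled BLAST search (assumed to exist).
    """
    shared = sorted(set(related_pids) & set(registry_pids))
    if shared:
        return shared, "blink"
    # Nothing in common: fall back to the precompiled BLAST results.
    return sorted(blast_hits.get(pid, [])), "blast"


if __name__ == "__main__":
    # Made-up pids for illustration only.
    print(similar_registry_proteins(
        "100", ["7", "8", "9"], ["9", "11", "7"], {"100": ["11"]}))
```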

Alphabet soup

[http://uniprot.org UniProt] - this is the umbrella organization of Swiss-Prot and TrEMBL
[http://www.uniprot.org/help/uniprotkb UniProtKB] - this database is the superset of Swiss-Prot (manually curated) and TrEMBL (autogenerated)
- The protein sequences are derived from the translation of the coding sequences (CDS) which have been submitted to the public nucleic acid database, the EMBL/GenBank/DDBJ database. All these sequences, as well as the related data submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.
[http://www.uniprot.org/help/uniref Uniref] - The UniRef databases provide clustered sets of sequences from UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
[http://www.genome.ad.jp/kegg/ KEGG] - they maintain their own database of proteins with KEGG numbers. Proteins are grouped by function, structure, etc.
[http://www.cs.utk.edu/~rcollins/bioinf/tutorial/home.html Blink] - BLink (BLAST Link) is a tool that displays the pre-computed results of BLAST searches that have been completed for every protein sequence in the Entrez Proteins data domain. If you know the pid, you can access the Blink results from this url - http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=insert pid here&all=1.
- Here is a php [http://www.cs.utk.edu/~rcollins/bioinf/tutorial/tutorial4.html script] that will return all related proteins based on a protein pid.
[http://www.uniprot.org/help/uniparc UniParc] - a non-redundant archive of all protein sequences that are or have been listed in the major protein databases including the UniprotKB, RefSeq, PDB and others. Simply contains the protein sequences and any identifiers from the different databases that share that exact sequence.
When doing a Blast search of the NCBI databases, the "nr" database is a non-redundant collection of protein sequences from translated genbank CDS, RefSeq proteins, items from the NCBI protein database and a few others. This is equivalent to UniParc but I'm not sure it lives as a standalone database.
[http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps Cdart] might be a quick way to find proteins with domains similar to the query.
[http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml CDD] - the conserved domain database. Searching this database from NCBI returns matches to annotated domains and also protein superfamilies. The superfamily information is pretty useful. For example, searching for the amino acid sequence of C0062 returns this [http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?ascbin=8&maxaln=10&seltype=2&uid=127388&query=MKNINADDTYRIINKIKACRSNNDINQCLSDMTKMVHCEYYLLAIIYPHSMVKSDISILDNYPKKWRQYYDDANLIKYDPIVDYSNSNHSPINWNIFENNAVNKKSPNVIKEAKTSGLITGFSFPIHTANNGFGMLSFAHSEKDNYIDSLFLHACMNIPLIVPSLVDNYRKINIANNKSNNDLTKREKECLAWACEGKSSWDISKILGCSERTVTFHLTNAQMKLNTTNRCQSISKAILTGAIDCPYFKN&aln=1,182,0,57 LuxR] protein family, and this [http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?ascbin=8&maxaln=10&seltype=2&uid=112297&query=MKNINADDTYRIINKIKACRSNNDINQCLSDMTKMVHCEYYLLAIIYPHSMVKSDISILDNYPKKWRQYYDDANLIKYDPIVDYSNSNHSPINWNIFENNAVNKKSPNVIKEAKTSGLITGFSFPIHTANNGFGMLSFAHSEKDNYIDSLFLHACMNIPLIVPSLVDNYRKINIANNKSNNDLTKREKECLAWACEGKSSWDISKILGCSERTVTFHLTNAQMKLNTTNRCQSISKAILTGAIDCPYFKN&aln=1,20,0,124 autoinducer binding superfamily]. Need to figure out if we can retrieve these superfamilies programmatically.
Refseq is redundant in the sense that it has one record per molecule per organism. UniParc has one record per molecule.

What are the sources for the Entrez protein database? - [http://www.ncbi.nlm.nih.gov/entrez/query/static/nucprotfaq.html#8]

Papers

[http://www.biomedcentral.com/1471-2105/8/401 Biomed] paper about a protein identifier cross-referencing tool, [http://www.ebi.ac.uk/Tools/picr/ PICR]. Written by the folks at the EBI, the tool is now available from the Uniprot website.
[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18697767 Bioinformatics] paper about [http://llama.med.harvard.edu/synergizer/translate/ the synergizer] which translates protein identifiers from one database to another.

Old material

BBa_J31004 - change the vector of this part
BBa_Y00029 - remove the prefix and suffix
Rename chromatinremodeling to chromatin

Questions

How do we want to indicate relationships between parts? For example: protein coding sequences and generators, transcriptional regulators and cognate promoters, similar versions of protein coding sequences.

Parameters

The following parameters can be auto-selected by the computer.

Completeness <-- need a better name here
- Full-length
  - begins with ATG/GTG and ends with TAA/TGA/TAG
  - begins with TTA/TCA/CTA and ends with CAT/CAC
- N-terminus
  - begins with ATG/GTG or ends with TAA/TGA/TAG
- C-terminus
  - begins with TTA/TCA/CTA or ends with CAT/CAC
- Fusion
- Domain
  - has neither an ATG/GTG nor a TAA/TGA/TAG
Direction
- Forward
  - begins with ATG/GTG or ends with TAA/TGA/TAG
- Reverse
  - begins with TTA/TCA/CTA or ends with CAT/CAC
Degradation tag
- probably ought to be able to auto-detect and annotate the different degradation tags as well. I'll compile the different sequences and tag names. May want to somehow tie this with the Protein tags and modifiers parts.
Signal sequences
- do the same with N-terminal tags?
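
The codon rules listed under Direction and Full-length could be auto-detected along these lines. This is a sketch under assumptions: it only covers the Direction and Full-length cases as written above, it checks the forward rule first (so a sequence matching both rules is reported as Forward), and the function names are made up.

```python
START = ("ATG", "GTG")
STOP = ("TAA", "TGA", "TAG")
RC_STOP = ("TTA", "TCA", "CTA")  # reverse complements of the stop codons
RC_START = ("CAT", "CAC")        # reverse complements of ATG and GTG


def direction(seq):
    """Return "Forward", "Reverse", or None if neither rule matches."""
    seq = seq.upper()
    if seq.startswith(START) or seq.endswith(STOP):
        return "Forward"
    if seq.startswith(RC_STOP) or seq.endswith(RC_START):
        return "Reverse"
    return None


def is_full_length(seq):
    """True if seq has both a start and a stop codon in either direction."""
    seq = seq.upper()
    forward = seq.startswith(START) and seq.endswith(STOP)
    reverse = seq.startswith(RC_STOP) and seq.endswith(RC_START)
    return forward or reverse


if __name__ == "__main__":
    for s in ("ATGAAATAA", "TTATTTCAT", "AAACCCGGG"):
        print(s, direction(s), is_full_length(s))
```

A sequence for which `direction` returns None has neither a start nor a stop codon at its ends, which matches the Domain case in the list.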

Columns

all protein coding sequences
- Protein
- Description
- Direction (auto-generated)
- Completeness (auto-generated)
- Tags (auto-generated based on sequence search)
- SwissProt
- KEGG
- Length
all enzymes
- EC number
- substrate
- product

fluorescent_reporter_protein_coding_sequences
- excitation
- emission
- color
luminescent_reporter_protein_coding_sequences
color_reporter_protein_coding_sequences
- color
antibiotic_resistance_marker
- antibiotic
recombination_protein_coding_sequences
- recombination site
transcriptional_regulators
- operator site sequence
- ligand
