Talk:Protein coding sequences

Protein naming references

Alphabet soup

[http://uniprot.org UniProt] - this is the umbrella organization of Swiss-Prot and TrEMBL
[http://www.uniprot.org/help/uniprotkb UniProtKB] - this database is the superset of Swiss-Prot (manually curated) and TrEMBL (automatically generated)
- TrEMBL protein sequences are derived from translation of the coding sequences (CDS) that have been submitted to the public nucleic acid databases (EMBL/GenBank/DDBJ). All these sequences, as well as the related data submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.
[http://www.uniprot.org/help/uniref UniRef] - The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
[http://www.genome.ad.jp/kegg/ KEGG] - they maintain their own database of proteins with KEGG numbers. Proteins are grouped by function, structure, etc.
[http://www.cs.utk.edu/~rcollins/bioinf/tutorial/home.html BLink] - BLink (BLAST Link) is a tool that displays the pre-computed results of BLAST searches that have been completed for every protein sequence in the Entrez Proteins data domain. If you know the pid, you can access the BLink results at this URL (substituting the pid where indicated) - http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=insert pid here&all=1
- Here is a PHP [http://www.cs.utk.edu/~rcollins/bioinf/tutorial/tutorial4.html script] that will return all related proteins based on a protein pid.
[http://www.uniprot.org/help/uniparc UniParc] - a non-redundant archive of all protein sequences that are or have been listed in the major protein databases, including UniProtKB, RefSeq, PDB and others. It simply contains the protein sequences and any identifiers from the different databases that share that exact sequence.
When doing a BLAST search of the NCBI databases, the "nr" database is a non-redundant collection of protein sequences from translated GenBank CDS, RefSeq proteins, items from the NCBI protein database and a few others. This is equivalent to UniParc, but I'm not sure it lives as a standalone database.
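The BLink URL pattern mentioned above can be filled in programmatically. This is only a rough sketch: the function name blink_url and the pid 12345 are made up for illustration, and it builds the URL string without contacting NCBI or checking that the page still exists.

```python
from urllib.parse import urlencode

# Base address copied from the BLink note above.
BLINK_BASE = "http://www.ncbi.nlm.nih.gov/sutils/blink.cgi"


def blink_url(pid):
    """Return the BLink results URL for a protein pid (hypothetical helper).

    Only the URL string is built; no request is sent.
    """
    query = urlencode({"pid": pid, "all": 1})
    return f"{BLINK_BASE}?{query}"


if __name__ == "__main__":
    # 12345 is a placeholder pid, not a real record reference.
    print(blink_url(12345))
```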

Papers

[http://www.biomedcentral.com/1471-2105/8/401 BMC Bioinformatics] paper about a protein identifier cross-referencing tool, [http://www.ebi.ac.uk/Tools/picr/ PICR]. Written by the folks at the EBI, the tool is now available from the UniProt website.
[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18697767 Bioinformatics] paper about [http://llama.med.harvard.edu/synergizer/translate/ the Synergizer], which translates protein identifiers from one database to another.

Old material

BBa_J31004 - change the vector of this part
BBa_Y00029 - remove the prefix and suffix
Rename chromatinremodeling to chromatin

Questions

How do we want to indicate relationships between parts? For example: protein coding sequences and generators, transcriptional regulators and cognate promoters, or similar versions of protein coding sequences.

Parameters

The following parameters can be auto-selected by the computer.

Completeness <-- need a better name here
- Full-length
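  - forward: begins with ATG/GTG and ends with TAA/TGA/TAG
  - reverse: begins with TTA/TCA/CTA and ends with CAT/CAC
- N-terminus
  - forward: begins with ATG/GTG but does not end with TAA/TGA/TAG
  - reverse: ends with CAT/CAC but does not begin with TTA/TCA/CTA
- C-terminus
  - forward: ends with TAA/TGA/TAG but does not begin with ATG/GTG
  - reverse: begins with TTA/TCA/CTA but does not end with CAT/CAC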
- Fusion
- Domain
  - has neither an ATG/GTG nor a TAA/TGA/TAG
Direction
- Forward
  - begins with ATG/GTG or ends with TAA/TGA/TAG
- Reverse
  - begins with TTA/TCA/CTA or ends with CAT/CAC
Degradation tag
- We probably ought to auto-detect and annotate the different degradation tags as well. I'll compile the different sequences and tag names. We may want to tie this in with the Protein tags and modifiers parts.
Signal sequences
- do the same with N-terminal tags?
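
As a rough illustration of how Direction and Completeness might be auto-selected from the codon rules above, here is a minimal sketch. It makes several assumptions that are not settled on this page: the Completeness categories are treated as mutually exclusive (N-terminus means a start codon but no stop codon, C-terminus means a stop codon but no start codon), a sequence with evidence for both strands is reported as "Ambiguous", and a sequence with no start or stop at either end gets direction "Unknown". Fusion is not detected, since it cannot be read off the terminal codons. The function name classify is made up.

```python
START = {"ATG", "GTG"}
STOP = {"TAA", "TGA", "TAG"}
RC_STOP = {"TTA", "TCA", "CTA"}  # reverse complements of the stop codons
RC_START = {"CAT", "CAC"}        # reverse complements of the start codons


def classify(seq):
    """Guess (direction, completeness) from the first and last three bases."""
    seq = seq.upper()
    head, tail = seq[:3], seq[-3:]
    fwd_start, fwd_stop = head in START, tail in STOP
    rev_stop, rev_start = head in RC_STOP, tail in RC_START

    forward = fwd_start or fwd_stop
    reverse = rev_stop or rev_start
    if forward and reverse:
        # Assumption: conflicting evidence is flagged rather than resolved.
        return ("Ambiguous", "Ambiguous")
    if forward:
        direction = "Forward"
    elif reverse:
        direction = "Reverse"
    else:
        direction = "Unknown"

    has_start = fwd_start or rev_start
    has_stop = fwd_stop or rev_stop
    if has_start and has_stop:
        completeness = "Full-length"
    elif has_start:
        completeness = "N-terminus"
    elif has_stop:
        completeness = "C-terminus"
    else:
        completeness = "Domain"
    return (direction, completeness)


if __name__ == "__main__":
    # Toy sequences, not real parts.
    for example in ["ATGAAATAA", "TTATTTCAT", "ATGAAAGGG", "GGGAAACCC"]:
        print(example, classify(example))
```

Only the terminal codons are inspected, so internal stops and reading frame are ignored; a real implementation would probably want to check those too.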

Columns

all protein coding sequences
- Protein
- Description
- Direction (auto-generated)
- Completeness (auto-generated)
- Tags (auto-generated based on sequence search)
- SwissProt
- KEGG
- Length
all enzymes
- EC number
- substrate
- product

fluorescent_reporter_protein_coding_sequences
- excitation
- emission
- color
luminescent_reporter_protein_coding_sequences
color_reporter_protein_coding_sequences
- color
antibiotic_resistance_marker
- antibiotic
recombination_protein_coding_sequences
- recombination site
transcriptional_regulators
- operator site sequence
- ligand
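
One possible way to represent the column layout above is a lookup from category to extra columns, layered on top of the columns shared by all protein coding sequences. This is just a sketch: the helper columns_for and the is_enzyme flag are hypothetical, and the empty list for luminescent reporters only reflects that no extra columns are listed for them here.

```python
# Columns shared by all protein coding sequences.
COMMON_COLUMNS = [
    "Protein", "Description", "Direction", "Completeness",
    "Tags", "SwissProt", "KEGG", "Length",
]

# Extra columns for all enzymes.
ENZYME_COLUMNS = ["EC number", "substrate", "product"]

# Extra columns per category, as listed above.
CATEGORY_COLUMNS = {
    "fluorescent_reporter_protein_coding_sequences": ["excitation", "emission", "color"],
    "luminescent_reporter_protein_coding_sequences": [],  # none listed so far
    "color_reporter_protein_coding_sequences": ["color"],
    "antibiotic_resistance_marker": ["antibiotic"],
    "recombination_protein_coding_sequences": ["recombination site"],
    "transcriptional_regulators": ["operator site sequence", "ligand"],
}


def columns_for(category, is_enzyme=False):
    """Return the table columns for a category (hypothetical helper)."""
    columns = list(COMMON_COLUMNS)
    if is_enzyme:
        columns += ENZYME_COLUMNS
    columns += CATEGORY_COLUMNS.get(category, [])
    return columns


if __name__ == "__main__":
    print(columns_for("transcriptional_regulators"))
```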
