Part:BBa_K4815001:Design

 
__TOC__
<partinfo>BBa_K4815001 short</partinfo>

<partinfo>BBa_K4815001 SequenceAndFeatures</partinfo>
  
  
 
===Design Notes===
To begin with, we had to work out how to extract features from the raw data and ultimately generate PYPH2. We utilize a pre-trained model to construct an AI model, then train it on the raw data so that it can predict expression strength from a given core promoter sequence. Next, we filter highly expressed sequences out of the original data. To create a diversity of new sequences, we randomly mutate these selected sequences. Finally, we use the AI model with the highest goodness of fit to predict the expression levels of both the mutated and the original sequences, which allows us to identify highly expressed promoter sequences. The overall procedure, from raw data and a pre-trained model to the filtered highly expressed sequences, is depicted in the figure below:
 
[Figure: overall procedure from raw data and the pre-trained model to the filtered highly expressed sequences]
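Since this page does not include the Pymaker code itself, the following is a minimal Python sketch of the mutate-and-screen loop described above. The names (mutate, screen, predict_expression) and the toy GC-content scorer are illustrative assumptions; the actual pipeline scores sequences with the fine-tuned AI model [3].

<pre>
# Minimal sketch of the generate-and-screen loop (illustrative, not Pymaker).
import random

BASES = "ACGT"

def mutate(seq, n_mut=3):
    """Return a copy of seq with n_mut random single-base substitutions."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice([b for b in BASES if b != s[pos]])
    return "".join(s)

def screen(seeds, predict_expression, n_variants=100, top_k=10):
    """Mutate each highly expressed seed, score seeds and variants with the
    model, and keep the top_k predicted-strongest 80 bp cores."""
    pool = list(seeds)
    for seed in seeds:
        pool += [mutate(seed) for _ in range(n_variants)]
    pool.sort(key=predict_expression, reverse=True)
    return pool[:top_k]

# Toy usage with a stand-in scorer (GC content); the real scorer is the
# trained expression model.
random.seed(0)
seeds = ["".join(random.choice(BASES) for _ in range(80)) for _ in range(5)]
best = screen(seeds, lambda s: s.count("G") + s.count("C"))
print(best[0])
</pre>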
 
Length: the functional part of PYPH2 is the core promoter. We chose a length of 80 bp mainly because an 80 bp window is short enough that a bound nucleosome would likely cover the entire region, simplifying the modeling of accessibility, and because the entire region can be sequenced with a 150-cycle kit with overlap in the middle, which helps in sequencing and validating the promoter sequence.
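As a back-of-the-envelope check of the sequencing claim, the snippet below assumes the 150-cycle kit is run in paired-end 2x75 bp mode (an assumption; the read configuration is not stated here): two 75 bp mates entering the 80 bp window from opposite ends overlap by 70 bp in the middle.

<pre>
# Illustrative arithmetic only; the 2x75 paired-end configuration is assumed.
region_len = 80               # 80 bp core promoter window
total_cycles = 150            # 150-cycle sequencing kit
read_len = total_cycles // 2  # 75 bp per mate in paired-end mode

overlap = 2 * read_len - region_len  # 150 - 80 = 70 bp middle overlap
print(f"each mate: {read_len} bp, middle overlap: {overlap} bp")
</pre>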
  
Loci: we seat PYPH2 at approximately -170 to -90 upstream of the start codon, because this region contains the presumed transcription start site (TSS), where most transcription factor binding sites lie [1]. We expect that modifying this part is the most efficient way to change the expression rate; in other words, the expression rate is more sensitive to changes in this 80 bp sequence than in other parts.
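To make the coordinates concrete, here is a hypothetical helper (the name and 0-based slicing convention are ours) that extracts the 80 bp window at -170 to -90 relative to the start codon:

<pre>
# Hypothetical helper illustrating the -170..-90 window; not from this page.
def core_window(sequence, atg_index, start=-170, end=-90):
    """Return the 80 bp presumed-TSS window upstream of the start codon."""
    return sequence[atg_index + start : atg_index + end]

promoter_plus_orf = "N" * 300 + "ATG" + "N" * 60  # dummy sequence
window = core_window(promoter_plus_orf, atg_index=300)
assert len(window) == 80
</pre>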
  
 
Composition: after deciding on the core promoter sequence, we still need a scaffold that links it to the codon and makes up a complete promoter. We took our ‘pA-pT’ scaffold from previous research: it links the core promoter to the codon and provides BamHI and XhoI restriction sites, which make it easy to insert various core promoter sequences into plasmids carrying the scaffold [2].
[Figure: pA-core PYPH2-pT]
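The sketch below illustrates the resulting assembly; the pA/pT arm sequences are placeholders rather than the actual scaffold from [2], and the check simply enforces that a candidate core carries no internal BamHI/XhoI site that would break the cloning scheme.

<pre>
# Sketch of pA-core-pT assembly with a restriction-site sanity check.
# The arm sequences are placeholders, not the published scaffold.
BAMHI, XHOI = "GGATCC", "CTCGAG"  # recognition sites

def assemble(core, pa="AAAA", pt="TTTT"):
    """Join placeholder scaffold arms to the core via BamHI/XhoI sites."""
    if BAMHI in core or XHOI in core:
        raise ValueError("core contains an internal BamHI/XhoI site")
    return pa + BAMHI + core + XHOI + pt

construct = assemble("ACGT" * 20)  # any 80 bp core free of the two sites
print(len(construct))
</pre>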
 
===Source===
 
We set out on a large-scale search for raw data that could be used to train the AI and eventually found a dataset published in a Nature Biotechnology article [2]: a total of 30 million core promoter sequences paired with expression data, in the format shown in the figure below. The sequences are randomly synthesized core promoters whose expression rates are represented by relative fluorescence intensity (described in detail in the wet lab cycle on the Engineering Success page), measured with a high-throughput technique. The dataset is large enough to cover a wide range of possible interactions between the 80 bp core promoter and transcription factors.

We further generate sub-datasets of the total data at various sample sizes to train Pymaker, and we use the best-performing one to generate PYPH2 [3].
 
[Figure: Data source origin]
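A minimal sketch of how such sub-datasets might be drawn is given below; the tab-separated layout, column order, and the file name gpra.tsv are assumptions about the published data's format.

<pre>
# Sketch of building fixed-size training sub-datasets (format assumed).
import csv, random

def load_pairs(path):
    """Yield (80 bp sequence, relative fluorescence) pairs from a TSV file."""
    with open(path) as fh:
        for seq, expr in csv.reader(fh, delimiter="\t"):
            yield seq, float(expr)

def subsample(pairs, size, seed=0):
    """Draw a fixed-size random sub-dataset for training Pymaker."""
    random.seed(seed)
    return random.sample(list(pairs), size)

# e.g. train_sets = {n: subsample(load_pairs("gpra.tsv"), n)
#                    for n in (10_000, 100_000, 1_000_000)}
</pre>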

===References===

[1] Vaishnav, E.D., et al., The evolution, evolvability and engineering of gene regulatory DNA. Nature, 2022. 603(7901): p. 455-463.

[2] de Boer, C.G., et al., Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nature Biotechnology, 2019. 38(1): p. 56-65.

[3] Ji, Y., et al., DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 2021. 37(15): p. 2112-2120.