diff --git a/src/dry-lab/software/chaosdna.md b/src/dry-lab/software/chaosdna.md
index e1d6974..654cac1 100644
--- a/src/dry-lab/software/chaosdna.md
+++ b/src/dry-lab/software/chaosdna.md
@@ -1 +1,11 @@
 # ChaosDNA
+
+## Context and Scope
+To perform E-DBTL cycles without data from the wet lab, we can generate faulty DNA sequences in software. Treating a DNA sequence as a string, we can randomly corrupt it with deletions, insertions, and substitutions (point mutations).
+
+## Goals
+The goal of this in-silico testing platform is to perform 3-4 E-DBTL cycles before the wet lab has data for us to use. Additionally, because the wet lab will only be generating strands 100 nucleotides long, we want to run our software on sequences thousands of bases long and gather statistics that demonstrate its utility at input sizes more representative of the information that would be encoded in long-term storage.
+
+## Design
+- To test sequence generation and error correction: given a file to encode and the total, deletion, insertion, and substitution error rates, return a distribution of faulty DNA sequences (see the sketch below).
+- To test our sequence alignment (NGS): return a DNA sequence in the form of a FASTQ file.
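
A rough sketch of the error-injection design described in the Design bullets above; the function names, signatures, and default rates here are placeholders, not the actual ChaosDNA implementation:

```python
import random

def corrupt(seq, del_rate, ins_rate, sub_rate, alphabet="ACGT", rng=None):
    """Return a copy of `seq` with random deletions, insertions, and substitutions.

    The total error rate is simply del_rate + ins_rate + sub_rate.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            out.append(rng.choice([b for b in alphabet if b != base]))  # substitution
        else:
            out.append(base)  # base survives unchanged
        if rng.random() < ins_rate:
            out.append(rng.choice(alphabet))  # insertion after this position
    return "".join(out)

def faulty_reads(seq, n_reads, **rates):
    """Return a distribution of faulty copies of one encoded strand."""
    return [corrupt(seq, **rates) for _ in range(n_reads)]

if __name__ == "__main__":
    strand = "ACGTACGTACGT" * 8  # stand-in for one encoded ~100-nt strand
    for read in faulty_reads(strand, n_reads=5, del_rate=0.01, ins_rate=0.01, sub_rate=0.02):
        print(read)
```

Because deletions and insertions shift every downstream position, even low indel rates make the later alignment and error-correction steps non-trivial, which is exactly what this generator is meant to exercise.
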
diff --git a/src/dry-lab/software/compression.md b/src/dry-lab/software/compression.md
index 0d6c9ab..c923055 100644
--- a/src/dry-lab/software/compression.md
+++ b/src/dry-lab/software/compression.md
@@ -6,8 +6,6 @@
 ## Primary goal
 By encoding a heavily compressed file, we can effectively increase the amount of information stored in DNA for a given number of nucleotide bases.
 
-
-
 ## Key points
 1. Data compression: defined in information theory as the process of encoding information using fewer bits than the original representation. In the context of our work, decreasing the number of nucleotide bases required to encode a given file.
 2. Token (LLMs): the fundamental data unit within natural language processing systems such as large language models (LLMs). Most common AI systems used today are some form of LLM (e.g., ChatGPT, Google Gemini, Diffusion-based models such as Stable Diffusion). A token essentially acts as a small component of a large data set; when an LLM takes text input, such as a sentence inputted into a chatbot, it breaks the query down into a set of tokens. These tokens are then processed by the model.
@@ -15,8 +13,6 @@ By encoding a heavily compressed file, we can effectively increase the amount of
 4. Lossy compression: a compression process that results in data loss. For instance, when audio is compressed into common file formats such as .mp3, audio quality is sacrificed to decrease file sizes.
 5. Compression ratio: the ratio between the file size of the inputted and outputted files. Often expressed in bits per base (bpb, output/input).
 
-
-
 ## Text compression
 
 ### Dictionary compression (traditional)
@@ -36,3 +32,8 @@ Thus, the model used for compression must be careful selected, with a focus on o
 
 ![ts_zip benchmarks](./images/ts_zip-time.png)
 
+
+## Other text compression algorithms
+- gzip: [https://www.gnu.org/software/gzip/](https://www.gnu.org/software/gzip/)
+- LZ4: [https://github.com/lz4/lz4](https://github.com/lz4/lz4)
+- bzip2: [https://en.wikipedia.org/wiki/Bzip2](https://en.wikipedia.org/wiki/Bzip2)
diff --git a/src/dry-lab/software/decoding.md b/src/dry-lab/software/decoding.md
index 857cf28..7652606 100644
--- a/src/dry-lab/software/decoding.md
+++ b/src/dry-lab/software/decoding.md
@@ -111,6 +111,6 @@ After these steps, we can do [error correction](#ecc.md).
 - “These software packages are able to perform de novo assembly of Illumina short sequence reads with the exception of SHORTY, which is designed to assemble ABI SOLiD colour-space data. Velvet and SOPRA can assemble sequence-space and colour-space data. aCurtain is a pipeline, based on Velvet, for hierarchical assembly of short sequence reads in order to overcome memory usage limitations. bOases is specifically designed for assembling transcribed sequences.” [@paszkiewicz_2010_de]
 
 ## How do we test this?
-We can test in silico by using open source genome data, and try to reassemble (without the reference template) and then check the performance.
+We can test in silico by using open-source genome data, trying to reassemble it (without the reference template) and then checking the performance; we can also test against faulty reads generated by [ChaosDNA](chaosdna.md).
 
 ---
diff --git a/src/dry-lab/software/ecc.md b/src/dry-lab/software/ecc.md
index bd0bbeb..0a9d159 100644
--- a/src/dry-lab/software/ecc.md
+++ b/src/dry-lab/software/ecc.md
@@ -88,6 +88,6 @@ For more on these papers check out
 ## How do we test this?
 * We should see if added ECC bits actually increase the accuracy of information; need to perform statistical analysis
 * Or is the actual sequence more important
-  * Via ChaosDNA
+  * Via [ChaosDNA](chaosdna.md)
 
 ---
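
For the statistical check described in ecc.md's "How do we test this?" (do added ECC bits actually increase the accuracy of the recovered information?), a hypothetical sketch of the comparison is shown below. It uses a toy repetition code as a stand-in for whatever ECC we adopt, and substitution-only errors for simplicity; all names here are placeholders.

```python
import random

ALPHABET = "ACGT"

def substitute(seq, rate, rng):
    """Apply random substitutions at the given per-base rate (no indels, for simplicity)."""
    return "".join(rng.choice([b for b in ALPHABET if b != c]) if rng.random() < rate else c
                   for c in seq)

def encode_repeat(seq, k=3):
    """Toy ECC: repeat every base k times (stand-in for the real code)."""
    return "".join(c * k for c in seq)

def decode_repeat(seq, k=3):
    """Majority-vote each block of k bases."""
    blocks = [seq[i:i + k] for i in range(0, len(seq), k)]
    return "".join(max(ALPHABET, key=block.count) for block in blocks)

def trial(n_bases=100, rate=0.05, seed=None):
    """One simulated strand: was it recovered exactly, with and without the toy ECC?"""
    rng = random.Random(seed)
    msg = "".join(rng.choice(ALPHABET) for _ in range(n_bases))
    raw_ok = substitute(msg, rate, rng) == msg
    ecc_ok = decode_repeat(substitute(encode_repeat(msg), rate, rng)) == msg
    return raw_ok, ecc_ok

if __name__ == "__main__":
    results = [trial(seed=i) for i in range(1000)]
    print("exact recovery without ECC:", sum(r for r, _ in results) / len(results))
    print("exact recovery with ECC:   ", sum(e for _, e in results) / len(results))
```

Once indels are in play (as ChaosDNA will produce), reads would first need alignment before a comparison like this applies, which ties this test back to the decoding pipeline.
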