done writing initial fault simulation page
lhao03 committed Mar 13, 2024
1 parent 1aa1835 commit f41915b
Showing 4 changed files with 17 additions and 6 deletions.
10 changes: 10 additions & 0 deletions src/dry-lab/software/chaosdna.md
@@ -1 +1,11 @@
# ChaosDNA

## Context and Scope
To perform E-DBTL cycles without data from the wet lab, we can generate faulty DNA sequences in software. Treating a DNA sequence as a string, we can randomly corrupt it with deletions, insertions, and substitutions (point mutations).

## Goals
The goal of this in-silico testing platform is to perform 3-4 E-DBTL cycles before the wet lab has data for us to try. Additionally, because the wet lab will only generate strands 100 nucleotides long, we want to run our software on nucleotide sequences that are thousands of bases long and collect statistics that show its utility on input sizes more representative of the information that would be encoded for long-term storage.

## Design
- To test sequence generation and error correction: given a file to encode, a total error rate, and deletion, insertion, and mutation (substitution) error rates, return a distribution of faulty DNA sequences
- To test our sequence alignment (NGS): return a DNA sequence in the form of a FASTQ file
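
The two design points above could be sketched roughly as follows. This is only an illustration, not the ChaosDNA implementation: the function names (`mutate`, `faulty_reads`, `to_fastq`), the read-name prefix, and the constant quality character are hypothetical. It also assumes that the "mutation" error rate means single-base substitution, that the total error rate is simply the sum of the three per-base rates, and that the input has already been encoded into an `ACGT` string.

```python
import random

BASES = "ACGT"

def mutate(seq, deletion_rate, insertion_rate, substitution_rate, rng):
    """Return a faulty copy of seq with at most one error applied per base."""
    out = []
    for base in seq:
        r = rng.random()
        if r < deletion_rate:
            continue  # deletion: drop this base
        if r < deletion_rate + insertion_rate:
            # insertion: keep the base and add one random base after it
            out.append(base)
            out.append(rng.choice(BASES))
        elif r < deletion_rate + insertion_rate + substitution_rate:
            # substitution: replace the base with a different one
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)  # no error
    return "".join(out)

def faulty_reads(seq, n_reads, deletion_rate, insertion_rate, substitution_rate, seed=0):
    """Return n_reads independently corrupted copies of seq (seeded for repeatability)."""
    rng = random.Random(seed)
    return [
        mutate(seq, deletion_rate, insertion_rate, substitution_rate, rng)
        for _ in range(n_reads)
    ]

def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text, using one placeholder quality score for every base."""
    records = []
    for i, read in enumerate(reads):
        records.append(f"@chaosdna_read_{i}\n{read}\n+\n{quality_char * len(read)}\n")
    return "".join(records)
```

For example, `to_fastq(faulty_reads("ACGT" * 250, 10, 0.01, 0.01, 0.02))` would produce ten corrupted copies of a 1000-base strand as FASTQ text. A more realistic simulator would probably derive the quality scores from the injected errors rather than using a constant.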
9 changes: 5 additions & 4 deletions src/dry-lab/software/compression.md
@@ -6,17 +6,13 @@
## Primary goal
By encoding a heavily compressed file, we can effectively increase the amount of information stored in DNA for a given number of nucleotide bases.



## Key points
1. Data compression: defined in information theory as the process of encoding information using fewer bits than the original representation. In the context of our work, this means decreasing the number of nucleotide bases required to encode a given file.
2. Token (LLMs): the fundamental data unit within natural language processing systems such as large language models (LLMs). Many widely used AI systems today are LLMs (e.g., ChatGPT, Google Gemini); diffusion-based image models such as Stable Diffusion are a different kind of model. A token acts as a small component of a larger input: when an LLM takes text input, such as a sentence typed into a chatbot, it breaks the input down into a sequence of tokens, which the model then processes.
3. Lossless compression: a compression process that does not result in any data loss.
4. Lossy compression: a compression process that results in data loss. For instance, when audio is compressed into common file formats such as .mp3, audio quality is sacrificed to decrease file sizes.
5. Compression ratio: the ratio between the sizes of the input and output files. Often expressed in bits per byte (bpb): the number of compressed output bits per byte of input, so lower is better.
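
As a rough illustration of the last three definitions, the sketch below compresses some bytes losslessly with Python's standard `zlib` module and reports the output/input ratio. The function name `compression_stats` is hypothetical. The final figure assumes an idealized encoding of 2 bits per nucleotide, which is an assumption made here for illustration and not a property of our actual encoding scheme.

```python
import zlib

def compression_stats(data: bytes) -> dict:
    """Compress data with zlib and report size-based statistics."""
    if not data:
        raise ValueError("data must be non-empty")
    compressed = zlib.compress(data, 9)  # 9 = strongest (slowest) zlib level
    # Lossless: decompressing must give back exactly the original bytes.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round trip failed")
    return {
        "input_bytes": len(data),
        "output_bytes": len(compressed),
        "ratio": len(compressed) / len(data),              # output / input
        "bits_per_byte": 8 * len(compressed) / len(data),  # output bits per input byte
        # Assumption: an idealized 2 bits per nucleotide, i.e. 4 bases per byte.
        "bases_at_2_bits_each": 4 * len(compressed),
    }
```

Calling `compression_stats(open("some_file.txt", "rb").read())` (with any file of your choosing) shows how many fewer bases a compressed file would need under that idealized assumption.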



## Text compression

### Dictionary compression (traditional)
@@ -36,3 +32,8 @@ Thus, the model used for compression must be careful selected, with a focus on o
![ts_zip benchmarks](./images/ts_zip-time.png)

</div>

## Other text compression algorithms
- GZip: [https://www.gnu.org/software/gzip/](https://www.gnu.org/software/gzip/)
- LZ4: [https://github.com/lz4/lz4](https://github.com/lz4/lz4)
- Bzip2: [https://en.wikipedia.org/wiki/Bzip2](https://en.wikipedia.org/wiki/Bzip2)
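
To get a quick feel for how these general-purpose compressors behave on a given file, something like the sketch below could be used. It relies only on the `gzip` and `bz2` modules from the Python standard library; LZ4 is left out because it needs a third-party package. The function name `compare_codecs` is hypothetical.

```python
import bz2
import gzip

def compare_codecs(data: bytes) -> dict:
    """Return the compressed size in bytes for each standard-library codec."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    sizes = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Each codec is lossless, so the round trip must be exact.
        if decompress(packed) != data:
            raise RuntimeError(f"{name} round trip failed")
        sizes[name] = len(packed)
    return sizes
```

The resulting sizes can be compared against the input length; which codec wins depends on the data.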
2 changes: 1 addition & 1 deletion src/dry-lab/software/decoding.md
@@ -111,6 +111,6 @@ After these steps, we can do [error correction](#ecc.md).
- “These software packages are able to perform de novo assembly of Illumina short sequence reads with the exception of SHORTY, which is designed to assemble ABI SOLiD colour-space data. Velvet and SOPRA can assemble sequence-space and colour-space data. Curtain is a pipeline, based on Velvet, for hierarchical assembly of short sequence reads in order to overcome memory usage limitations. Oases is specifically designed for assembling transcribed sequences.” [@paszkiewicz_2010_de]

## How do we test this?
We can test in silico by using open source genome data, and try to reassemble (without the reference template) and then check the performance.
We can test in silico by reassembling open-source genome data without the reference template and then checking the performance. We can also test with faulty sequences generated by [ChaosDNA](chaosdna.md).

---
2 changes: 1 addition & 1 deletion src/dry-lab/software/ecc.md
@@ -88,6 +88,6 @@ For more on these papers check out
## How do we test this?
* We should check whether the added ECC bits actually increase the accuracy of the recovered information; this requires statistical analysis
* Or whether the actual sequence is more important
* Via ChaosDNA
* Via [ChaosDNA](chaosdna.md)
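
One way such a statistical check might look is a small Monte Carlo comparison of recovery rates with and without error correction. The sketch below is a stand-in only: it uses a 3x repetition code with majority voting in place of our real ECC scheme, injects substitution errors only (a repetition code cannot handle insertions or deletions), and all function names are hypothetical.

```python
import random

BASES = "ACGT"

def substitute(seq, rate, rng):
    """Replace each base with a different random base with probability rate."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in seq
    )

def encode_repetition(seq, copies=3):
    """Stand-in ECC: repeat every base `copies` times."""
    return "".join(base * copies for base in seq)

def decode_repetition(seq, copies=3):
    """Majority vote within each block of `copies` bases."""
    decoded = []
    for i in range(0, len(seq), copies):
        block = seq[i:i + copies]
        decoded.append(max(BASES, key=block.count))
    return "".join(decoded)

def recovery_rate(use_ecc, error_rate, trials=2000, length=50, seed=0):
    """Fraction of random payloads recovered exactly after substitution errors."""
    rng = random.Random(seed)
    recovered = 0
    for _ in range(trials):
        payload = "".join(rng.choice(BASES) for _ in range(length))
        sent = encode_repetition(payload) if use_ecc else payload
        received = substitute(sent, error_rate, rng)
        result = decode_repetition(received) if use_ecc else received
        recovered += result == payload
    return recovered / trials
```

Comparing `recovery_rate(True, 0.01)` with `recovery_rate(False, 0.01)` gives a first estimate of whether the redundancy pays off at a given error rate. The same harness could be pointed at the real encoder and at ChaosDNA-generated faults.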

---
