unexpected char in string error #29

Open
ChristyPeterson opened this issue Dec 10, 2019 · 4 comments
Comments

@ChristyPeterson

Hi Chad,

I'm trying to run panseq on some publicly available genomes, and was successful when running the genomes from one subspecies. As soon as I included two other subspecies, I got an "unexpected char in string" error. Weirdly, this error is coming up in strains that were successful in the first run. Those characters do not exist in the input, so I'm assuming it's in a temp file that the program writes and then refers back to?

Below is an example from the Master log file (the top and bottom).

2019/12/10 14:29:27 INFO |  NovelIterator.pm:186> We have 74 genomes this run 
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
Unexpected character `7' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome


Unexpected character `.' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome_(1138930..1144388)
2019/12/10 14:30:00 WARN |  CombineFilesIntoSingleFile.pm:83> Skipping /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58_NR as it has size of 0 
2019/12/10 14:30:00 INFO |  Panseq.pm:268> Panseq mode set as pan 
2019/12/10 14:30:00 INFO |  SegmentMaker.pm:164> Segmenting /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58 into 500bp segments 

If I remove the isolate from the analysis I get even more of these errors, for several other isolates. Any insight would be awesome.

Thanks!
-Christy

@chadlaing
Owner

Hi Christy,

Is it possible one of the sequences isn't in valid fasta format?
It looks similar to errors of that type.
If not, could you send me the config file and link the public genomes that cause the error?
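One way to rule out a malformed file is a quick structural check. The following is a minimal sketch in Python, not part of Panseq; the function name `check_fasta_structure` is hypothetical, and the rules it applies (sequence data must follow a `>` header, records must not be empty, identifiers should not repeat) are assumptions about what "valid fasta" means here.

```python
from collections import Counter


def check_fasta_structure(text):
    """Return a list of structural problems found in FASTA-formatted text.

    Hypothetical helper: the rules below are assumptions, not Panseq's own checks.
    """
    problems = []
    ids = []
    current_id = None   # identifier of the record being read, None before the first header
    current_len = 0     # number of sequence characters seen for the current record

    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current_id is not None and current_len == 0:
                problems.append(f"record '{current_id}' has no sequence")
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            if not current_id:
                problems.append(f"line {lineno}: empty header")
            ids.append(current_id)
            current_len = 0
        elif current_id is None:
            problems.append(f"line {lineno}: sequence data before first header")
        else:
            current_len += len(line)

    if current_id is not None and current_len == 0:
        problems.append(f"record '{current_id}' has no sequence")
    if not ids:
        problems.append("no FASTA records found")
    for seq_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"identifier '{seq_id}' appears {count} times")
    return problems
```

Reading each file in the query directory and passing its contents to this function would list any file whose layout departs from these assumptions; an empty list means nothing was flagged.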

Thanks,
Chad

@ChristyPeterson
Author

ChristyPeterson commented Dec 10, 2019

The file listed as being problematic looks like a valid fasta to me. Also, this file went through the first run successfully.

>NZ_CP016054.1 Treponema pallidum subsp. pallidum strain PT_SIF1127 genome
TAGATGGACGCAGTAGGGTATGAAGTATTCTGGAACGAGACACTCAGCCAGATACGGAGTGAATCGACCGAAGCAGAATT
TAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCC
GAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTG
TTTGCCGTTAAAAAAGGCACCCCTCATGGAAATACTGCTCCCCCCAAACACGTGCATACCTACCTGGAGAAAAACTCTCC
TGCAGAGGTTCCTTCCAAAAAGAGCTTTCACCCCGACCTGAACAGAGACTATACCTTCGAGAACTTTGTATCCGGAGAAG
AAACCAAATTCAGCCATAGCGCTGCTATCTCCGTATCAAAAAACCCAGGCACTTCCTACAATCCGTTACTTATCTACGGT
GGAGTGGGACTAGGAAAAACCCACCTTATGCAGGCTATTGGACACGAGATCTACAAGACAACAGACCTGAACGTCATATA
CGTCACTGCGGAGAATTTTGGAAATGAATTCATTTCCACATTACTCAATAAAAAGACCCAGGATTTTAAAAAAAAATACC
GCTACACCGCGGATGTACTTCTTATAGATGACATTCATTTTTTTGAAAACAAAGACGGATTACAAGAAGAGCTTTTCTAT
ACGTTCAACGAACTTTTCGAGAAAAAAAAACAAATTATCTTTACCTGCGACAGGCCTGTACAAGAATTGAAAAATCTCTC
TTCTCGCTTACGCTCGAGGTGCTCCCGAGGGCTTAGCACTGATCTGAATATGCCATGTTTTGAAACGCGCTGTGCTATCT

I did check using grep for any weird characters and nothing pops up outside of the header.
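A check of this kind can also be sketched in Python, which reports the exact position of each hit. This is only an illustration: the function name `find_unexpected_characters` is hypothetical, and treating the IUPAC nucleotide codes as the allowed alphabet is an assumption, so gap characters such as `-` would be reported as well. Header lines are skipped, and the text is split on `\n` only so that a stray carriage return from Windows line endings shows up as a hit.

```python
# Assumed alphabet: IUPAC nucleotide codes (case-insensitive).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def find_unexpected_characters(text, allowed=IUPAC_NUCLEOTIDES):
    """Return (line, column, character) for every sequence character not in `allowed`.

    Hypothetical helper; lines starting with '>' are treated as headers and skipped.
    """
    hits = []
    # Split on "\n" only, so a trailing "\r" stays on the line and gets reported.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if line.startswith(">"):
            continue
        for column, char in enumerate(line, start=1):
            if char.upper() not in allowed:
                hits.append((lineno, column, char))
    return hits
```

For example, `find_unexpected_characters(">h 1.2\nACGT\nAC7T\n")` returns `[(3, 3, '7')]`, while a clean sequence returns an empty list.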

I've attached two lists:

  • acc-list-full.txt is the full list of accessions used for this run.
  • acc-list-add.txt lists the accessions that were added to run1 (which completed successfully) to make up this run.

acc-list-add.txt
acc-list-full.txt

I looked through all the fasta files listed in the 'add' txt file, and none of them have any weird characters in the sequence.

For the config file, do you mean the settings file?

@ChristyPeterson
Author

In case you meant the settings file used to run panseq, I've attached it below, though I've altered the paths to where things are located.

queryDirectory  PATH/ncbi_assemblies/ncbi-genomes-2019-12-06/
baseDirectory   PATH/panseq/run2-all-strains
numberOfCores   20
mummerDirectory /PATH/bin/
blastDirectory  /PATH/bin/
minimumNovelRegionSize  500
novelRegionFinderMode   no_duplicates
muscleExecutable        /PATH/bin/muscle
fragmentationSize       500
percentIdentityCutoff   85
coreGenomeThreshold     2
runMode         pan
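For anyone comparing settings between runs, a file in this key/value layout can be read with a few lines of Python. This is a rough sketch and not Panseq's own parser: the function name `parse_settings` is hypothetical, and it assumes each non-blank line holds a key followed by whitespace and a value, as in the listing above.

```python
def parse_settings(text):
    """Parse 'key<whitespace>value' lines into a dict of strings.

    Hypothetical helper; the line format is assumed from the listing above.
    """
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of spaces or tabs
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 'key value', got {line!r}")
        key, value = parts
        if key in settings:
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        settings[key] = value
    return settings
```

Parsing the settings of two runs this way and comparing the resulting dictionaries shows at a glance which values differ between them.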

@chadlaing
Owner

Perfect, I will take a look.
