This repository has been archived by the owner on Oct 9, 2019. It is now read-only.

Fuzz.string only generates ascii strings #201

Open
drathier opened this issue Jul 30, 2017 · 5 comments

@drathier
Collaborator

As mentioned in #198 and #200, the Fuzz.string fuzzer only generates ascii characters in the range 32-126, which covers A-Za-z0-9, some whitespace and some special characters. It should generate any kind of string to make sure the code works with more characters. Even English-only users are impacted, as emoji aren't in ascii 😿.

I think we should do a breaking change and make Fuzz.string generate characters from all of unicode. This will probably fail some test suites that previously only tested ascii strings, but that's a good thing, right?

The full unicode solution is however blocked while we wait for a new release of elm-lang/core. The bug has been fixed, but it's not released yet.

@zkessin
Contributor

zkessin commented Jul 31, 2017

I would suggest having a Fuzz.string and a Fuzz.utf8String or the like.

Imagine that you have a problem where somehow the Hebrew string חה were to become הח. If you are not familiar with Hebrew, that could be very confusing to debug, as you have 2 letters that look pretty similar swapping positions.

If we are going to do UTF-8/UTF-16, we want to make sure we do it really well.

I assume similar problems could happen with a number of scripts, but I happen to have a Hebrew keyboard handy.

@drathier
Collaborator Author

drathier commented Jul 31, 2017

I would like the default to be unicode, so Fuzz.string is unicode and Fuzz.asciiString is the current version. I was planning on doing a small subset of unicode that should find these bugs without running into homoglyph problems, right-to-left text and other things where the rendered output is vastly different from the actual string.

In JavaScript, the main thing to worry about is characters that don't fit inside a single UTF-16 code unit, such as emoji, as well as combining characters (and maybe normalization for equality testing). I think ascii, emoji and some European characters should be enough, without being too hard to debug.
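To make those three concerns concrete, here is a minimal TypeScript sketch of how an emoji and a combining character behave in a JavaScript string. The helper names `codeUnits` and `codePoints` are made up for illustration; they are not part of elm-test or any other library.

```typescript
// Illustration only: helper names are hypothetical, not from any library.
const cat: string = "😿";                 // U+1F63F, outside the Basic Multilingual Plane
const eAcuteCombined: string = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
const eAcuteSingle: string = "\u00e9";    // precomposed "é"

// String.length counts UTF-16 code units, not user-visible characters.
function codeUnits(s: string): number {
  return s.length;
}

// Array.from iterates a string by code point, so a surrogate pair counts once.
function codePoints(s: string): number {
  return Array.from(s).length;
}

console.log(codeUnits(cat), codePoints(cat));                  // 2 1
console.log(eAcuteCombined === eAcuteSingle);                  // false
console.log(eAcuteCombined.normalize("NFC") === eAcuteSingle); // true
```

Code that indexes, slices or reverses by code unit can split the emoji in half, and code that compares strings without normalizing can treat the two spellings of "é" as different. Those are the kinds of bugs an ascii-only fuzzer cannot surface.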

@zkessin
Contributor

zkessin commented Jul 31, 2017

Sounds good. We will probably eventually want a way to specify a character set, so that if someone wants Hebrew/Greek/Arabic/Russian/Hindi etc. they will be able to have them.

@drathier
Collaborator Author

I don't think we should let the user specify what character classes or character sets to use. That's one huge rabbit hole which could take tens of thousands of lines of code to implement in pure Elm. There are ranges of code points that can be used to select a code plane, but if you want whitespace, you'll have to manually list out the 8 different characters, and if you want mathematical characters, there's another set of ranges to use, and so on. For example, here are the code points of the Swedish alphabet: https://www.iana.org/domains/idn-tables/tables/se_sv-se_1.0.html

Since this is only for testing, I say we try to pick a subset which is easy to use when testing, but which covers "all" the special cases of unicode.
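As a rough sketch of what "pick a subset" could look like, the TypeScript below draws characters from a handful of hard-coded code point ranges. The ranges, and the names `RANGES`, `randomInt`, `randomChar` and `randomString`, are assumptions made for this example; they are not the selection that elm-test uses or plans to use.

```typescript
// Hypothetical ranges chosen for illustration; not the library's actual selection.
type Range = readonly [number, number]; // inclusive code point bounds

const RANGES: readonly Range[] = [
  [0x20, 0x7e],       // printable ascii (32-126)
  [0xc0, 0xff],       // Latin-1 letters such as å, ä, ö (also includes × and ÷)
  [0x300, 0x36f],     // combining diacritical marks
  [0x1f600, 0x1f64f], // emoticons, each needing two UTF-16 code units
];

// Uniform integer in [lo, hi], assuming rng() returns a value in [0, 1).
function randomInt(lo: number, hi: number, rng: () => number): number {
  return lo + Math.floor(rng() * (hi - lo + 1));
}

// Pick a range uniformly, then a code point uniformly inside that range.
function randomChar(rng: () => number = Math.random): string {
  const [lo, hi] = RANGES[randomInt(0, RANGES.length - 1, rng)];
  return String.fromCodePoint(randomInt(lo, hi, rng));
}

// Build a string of 0..maxLength code points (not code units).
function randomString(maxLength: number, rng: () => number = Math.random): string {
  const length = randomInt(0, maxLength, rng);
  let out = "";
  for (let i = 0; i < length; i++) {
    out += randomChar(rng);
  }
  return out;
}

console.log(JSON.stringify(randomString(10)));
```

Picking the range first gives every class of character an equal chance regardless of how many code points it contains, so the small emoji and combining-mark ranges are not drowned out by ascii. A real fuzzer would also need a shrinking strategy, which this sketch leaves out.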

@mgold
Member

mgold commented Aug 1, 2017

Since it sounds like Fuzz.string won't be removed, just changed, I'm removing the newly-renamed major-release-blocker label. Patch and minor releases can ship whenever they are ready.


3 participants