Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Beeswarm plot #61

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

tpoisot
Copy link

@tpoisot tpoisot commented Jul 4, 2017

this is currently a work in progress

I'm working on adding a draft implementation of beeswarm plots, to be iterated over. Very much looking for feedback.

@tpoisot
Copy link
Author

tpoisot commented Jul 4, 2017

There's evidently something I'm missing, but when I do

include("src/StatPlots.jl")
using StatPlots

Why do StatPlots.violin work, and StatPlots.beeswarm is not defined?

@mkborregaard
Copy link
Member

mkborregaard commented Jul 26, 2017

You need to add @shorthands beeswarm to the top of beeswarm.jl - it's a bit confusing because violin used to live in Plots, and so the @shorthands command still lives there.

Sorry for the wait, btw- I've been on holiday and I seem to be the only JuliaPlots member that currently maintains StatPlots actively. I'll try to get to do a real review within a few days. As soon as you have it working, it would be helpful with a png image for the readme.

@mkborregaard mkborregaard changed the title Beeswarm plot WIP: Beeswarm plot Jul 26, 2017
Copy link
Member

@mkborregaard mkborregaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'll be really nice. I have some stylistic comments. I should note this doesn't currently work on my machine (an error is thrown from RecipesBase) but I can't off the top of my head say what the problem is. Does the plot work on your system?

@@ -25,18 +26,24 @@ using StatPlots
gr(size=(400,300))
```

The `DataFrames` support allows passing `DataFrame` columns as symbols. Operations on DataFrame column can be specified using quoted expressions, e.g.
The `DataFrames` support allows passing `DataFrame` columns as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for also taking the time to clean up files etc. But I'd like to keep changes like this separate from changes that add new functionality - can you cherry-pick this and the other changes (deleted/insert lines) to a separate PR and keep this PR on the beeswarm recipe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of the readme has been changed, so this can be scrapped when you rebase.

src/hist.jl Show resolved Hide resolved
src/violin.jl Show resolved Hide resolved
src/violin.jl Outdated Show resolved Hide resolved
src/violin.jl Outdated Show resolved Hide resolved
src/beeswarm.jl Outdated Show resolved Hide resolved
src/beeswarm.jl Outdated
info("side set to :$side")
end
x, y = Float64[], Float64[]
glabels = sort(collect(unique(x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the code here is identical across beeswarm, violin and boxplot - I haven't checked carefully, but would it not be possible to extract a general function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure yet -- I'll try to get the beeswarm function working, then see if we can extract some things.

src/beeswarm.jl Outdated Show resolved Hide resolved
@tpoisot
Copy link
Author

tpoisot commented Jul 26, 2017

@mkborregaard thanks for the comments -- it's not working for me either, but I did not expected it to. Still working on it.

@tpoisot
Copy link
Author

tpoisot commented Jul 26, 2017

Stopping here for the day -- the internals work, but there is nothing displayed. I'm at a loss, and the documentation of plots and recipes is not helpful.

@mkborregaard
Copy link
Member

I'll have a look in the morning :-)

@mkborregaard
Copy link
Member

There appears to be a bug in how recipes work together with Plots, where the default markershape value of :none is not overridden by the recipe. This needs to be fixed in Plots, but until then you can work around it by manually loading the markershape into d. See my line comment.


x := xp
y := yp
seriestype := :scatter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, add

  if get!(d, :markershape, :circle) == :none
          d[:markershape] = :circle
  end

Copy link
Member

@mkborregaard mkborregaard Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf JuliaPlots/Plots.jl#989 , a PR to fix this behaviour in Plots so this code won't be necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was merged so disregard this comment.

@tpoisot
Copy link
Author

tpoisot commented Jul 28, 2017

It's not quite done yet (I need to change the way the widths are calculated), but it is at least working as of bec231d

using DataFrames, Plots, Distributions
include("src/StatPlots.jl")
using StatPlots

N = 100
labels = repeat(["a", "b", "c", "d"], inner=N)
values = vcat(
  rand(Normal(0.0), N),
  rand(Normal(2.0), N),
  rand(Normal(3.0), N),
  rand(Normal(1.0, 0.5), N)
  )

d = DataFrame(labels = labels, values = values)

violin(d, :labels, :values, leg=false, c=:lightgrey, side=:right)
beeswarm!(d, :labels, :values, c=repeat([:green, :orange, :blue, :purple], inner=N), side=:left, bins=:scott)

screenshot from 2017-07-28 10-53-53

@mkborregaard
Copy link
Member

very nice :-)

@mkborregaard
Copy link
Member

bump? :-)

# make the violin
xcenter = Plots.discrete_value!(d[:subplot][:xaxis], glabel)[1]

for i in 2:length(centers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using i here is what throws your width calculations - i is also the index in the outer loop. Change to j.


for i in 2:length(centers)
inside = Bool[centers[i-1] < u <= centers[i] for u in lab_y]
if sum(inside) > 1
Copy link
Member

@mkborregaard mkborregaard Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need

      if sum(inside) == 1
        lab_x[inside] .+= xcenter
      elseif sum(inside) > 1

for when there's only a single point.

@mkborregaard
Copy link
Member

There are a number of difficulties with making a good violin plot, one of them being the highly local one that we generally don't know the markersize in points in StatPlots across the different backends. That is going to be remedied at some point, though, as Plots will move to a more explicit definition of dpi in terms of physical units. It just needs to be coded up, but means that we could put the basic functionality in here.

Another issue is the binning - as it is now, the histogram bins are much larger on the y axis than on the x axis, which makes the points avoid each other along the x axis but not the y axis:
skaermbillede 2017-09-12 kl 15 18 21

Making the bins much smaller (here n = n^2) gives better results, but doesn't solve the axis issue.
skaermbillede 2017-09-12 kl 15 18 01

I think if you do want to use binning for something like this, you'd need to align the points along the bin centers - doing that makes it look more regular, but still slightly strange:
skaermbillede 2017-09-12 kl 15 26 28

For the case for aligning to bin centers - and also a general critique of using dot-plots to observe densities - check this paper by Wilkinson (the Grammar of Graphics guy): http://moderngraphics11.pbworks.com/f/wilkinson_1999.DotPlots.pdf

@mkborregaard
Copy link
Member

mkborregaard commented Sep 13, 2017

I let myself be sucked into trying to implement a version of beeswarm plots that didn't have the issues outlined in the Wilkinson paper. Those are

  1. points get displaced from their real value, e.g. with R's ggplot implementation
    billede

and
2. points are not packed and create distributional shapes that are very different from the actual densities, as your violins above also show and R's beeswarm implementation:
billede

It took a lot more time than was sensible to spend on this, but there you are.

The algorithm packs the points as closely as possible without displacing them from their value. It does it simply by sequentially adding points, each time adding the point that can be fitted closest to the y value. It's costly, of course, but the main bulk of my work was spent on making the algorithm efficient, and it can fit 1000 points in just below a second, probably close to the maximum number anyone would use anyway.

Code:

function beeswarm_coords(olda, side = :both)

       # what should be the new y position of a point given it's x and it's closest point
       getypos(x, ptx, pty, cellsize) = pty + sqrt(cellsize^2-(ptx-x)^2)

       # find the minimum y position for a given point
       function minypos(xind, shift_signs = false)
              neighbors = potential_interactions(xind)
              length(neighbors) == 0 && return NaN
              yvals, xvals = view(ys, neighbors), view(a, neighbors)
              tmpyvals = shift_signs ? -yvals : yvals
              limit = maximum(tmpyvals - cellsize)

              ypos = zeros(length(xvals))
              for i in eachindex(xvals)
                     if tmpyvals[i] > limit
                            ypos[i] = getypos(a[xind], xvals[i], tmpyvals[i], cellsize)
                     end
              end
              ret = maximum(ypos) # Using maxmimum here is not an error - it is the closest point that determines the result
              shift_signs ? -ret : ret
       end

       function update_ypos!(n_ypos, inds, shift_signs = false)
              n_ypos[inds] .= minypos.(inds, shift_signs)
       end

       function potential_interactions(freeind)
              outside_range(x) = abs(a[freeind]-a[x])>cellsize
              lo = findprev(outside_range, linearindices(a), freeind)+1
              hi = findnext(outside_range, linearindices(a), freeind)-1
              if lo != 0 && hi != 0 && hi >= lo
                     included[freeind] ? (lo:hi)[.!(included[lo:hi])] : (lo:hi)[included[lo:hi]]
              else
                     Int[]
              end
       end

       # estimate a pointsize
       a = sort(shuffle(olda))
       cellsize = (diff([extrema(a)...])/(length(a)^0.6))[1]

       # vectors to hold return values
       included, ys = falses(a), zeros(length(a))

       # fill the first points, that are going to be along the middle line
       ind = start(a)
       nearest_ypos = fill(NaN, length(a))
       while ind != 0
              included[ind] = true
              ind = findfirst(v->v>a[ind] + cellsize, a)
       end

       # now fill the rest in turn
       update_ypos!(nearest_ypos, find(.!(included)))
       innow = sum(included)

       if side in (:left, :right)
              while innow < length(a)
                     newy, placenext = findmin(nearest_ypos)
                     included[placenext] = true
                     ys[placenext] = newy
                     update_ypos!(nearest_ypos, potential_interactions(placenext))
                     nearest_ypos[placenext] = NaN
                     innow += 1
              end
       elseif side == :both
              nearest_ypos_left = -copy(nearest_ypos)
              nearest_ypos_right = nearest_ypos

              while innow < length(a)
                     newyleft, placenext_left = findmax(nearest_ypos_left)
                     newyright, placenext_right = findmin(nearest_ypos_right)

                     if -newyleft < newyright
                            placenext = placenext_left
                            included[placenext] = true
                            ys[placenext] = newyleft
                            update_ypos!(nearest_ypos_left, potential_interactions(placenext), true)
                     else
                            placenext = placenext_right
                            included[placenext] = true
                            ys[placenext] = newyright
                            update_ypos!(nearest_ypos_right, potential_interactions(placenext))
                     end

                     nearest_ypos_right[placenext] = NaN
                     nearest_ypos_left[placenext] = NaN
                     innow += 1
              end
       else
              error("side must be :left, :right, or :both")
       end

       rety = zeros(length(olda))
       view(rety, sortperm(olda)) .= ys

       if side == :left
              rety .*= -1
       end
       rety, olda
end

using DataFrames, RDatasets, Plots
iris = dataset("datasets", "iris")

x,y = beeswarm_coords(collect(iris[:SepalLength]), :both)
scatter(x,y, group = iris[:Species], ms = 7, aspect_ratio = 1, xlim = (-2,2))

The iris data:
skaermbillede 2017-09-13 kl 20 22 16
2000 random normal data points:
skaermbillede 2017-09-13 kl 20 25 54
It's still far from being usable in StatPlots - in particular, I need to work out how to change markersizes so they both give sensible distributions and also so markers touch (this is more difficult than you'd think and requires the aspect ratio to be controlled).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants