Sets, subsets in parscalarvec and masked operations #2

Open
bjoo opened this issue Sep 10, 2013 · 0 comments


bjoo commented Sep 10, 2013

The way unordered subsets are managed in QDP++ is cumbersome for parscalarvec.

The subset is represented by a site table whose entries are 'linear site' indices. This is hard to thread and vectorize. As an example, consider how we would currently do a sum over an unordered subset as of commit:

  • We do a loop over sites in the subset (this can be parallelized over threads, but see below)
  • We must find the block for the site
  • We do redundant operations (we compute the whole block)
  • We sum only one site from the block (actually I've generalized this to summation under a mask, but the mask has only 1 true element)

This approach has several inefficiencies:
i) redundant computation within a thread if more than one site in the same outer block belongs to that thread. This also brings some additional memory traffic, though it may be OK (ugh) if the repeatedly accessed memory stays in cache.

ii) potentially redundant computation carried out in several threads, if sites in the same outer block are scheduled to different threads. This will also duplicate memory traffic and may cause memory ping-ponging between threads.
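
To make the redundancy concrete, here is a minimal sketch of the site-table approach described above. It is not the QDP++ code: `kInner`, `Block`, `evalBlock` and `sumBySiteTable` are hypothetical names, and the layout (linear site `s` lives in outer block `s / kInner`, inner lane `s % kInner`) is an assumption made only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kInner = 4;          // hypothetical inner (vector) length
using Block = std::array<double, kInner>;  // one outer block of inner sites

// Stand-in for evaluating an expression on a whole block (squares each lane).
inline Block evalBlock(const Block& in) {
    Block out{};
    for (std::size_t i = 0; i < kInner; ++i) out[i] = in[i] * in[i];
    return out;
}

// Sum over an unordered subset given as a table of linear sites.
// blockEvals reports how many whole-block evaluations were performed.
double sumBySiteTable(const std::vector<Block>& field,
                      const std::vector<std::size_t>& siteTable,
                      std::size_t& blockEvals) {
    double total = 0.0;
    blockEvals = 0;
    for (std::size_t s : siteTable) {
        const std::size_t outer = s / kInner;  // assumed layout, see lead-in
        const std::size_t inner = s % kInner;
        assert(outer < field.size());
        const Block r = evalBlock(field[outer]);  // computes all kInner lanes
        ++blockEvals;
        total += r[inner];                        // but only one lane is used
    }
    return total;
}
```

In this sketch a subset with two sites in the same outer block evaluates that block twice, which is case i) above. If the loop over `siteTable` were split across threads, those two evaluations could also land on different threads, which is case ii).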

A more natural representation for parscalarvec would split the site table into two tables:

  • a table of 'outer blocks' in the subset
  • for each 'outer block' a table of inner sites in the subset, or a mask

This latter approach would allow multi-threading over the outer blocks, and vectorization (under mask) for the ILattice bits.
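
A rough sketch of what a sum over such a two-level table could look like follows. Again, this is an illustration under assumptions, not existing QDP++ code: `OuterEntry`, `sumByBlockTable`, `evalBlock` and `kInner` are made-up names, and the per-block mask is stored as a bit field with bit `i` set when inner site `i` is in the subset.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kInner = 4;          // hypothetical inner (vector) length
static_assert(kInner <= 32, "mask is stored in 32 bits");
using Block = std::array<double, kInner>;  // one outer block of inner sites

// One entry per outer block that contains at least one subset site.
struct OuterEntry {
    std::size_t outer;   // index of the outer block
    std::uint32_t mask;  // bit i set => inner site i is in the subset
};

// Stand-in for evaluating an expression on a whole block (squares each lane).
inline Block evalBlock(const Block& in) {
    Block out{};
    for (std::size_t i = 0; i < kInner; ++i) out[i] = in[i] * in[i];
    return out;
}

// Sum over the subset: each outer block is evaluated exactly once.
// The outer loop is the one that could be threaded (with a reduction on
// total); the inner loop is a masked reduction over the vector lanes.
double sumByBlockTable(const std::vector<Block>& field,
                       const std::vector<OuterEntry>& table,
                       std::size_t& blockEvals) {
    double total = 0.0;
    blockEvals = 0;
    for (const OuterEntry& e : table) {
        assert(e.outer < field.size());
        const Block r = evalBlock(field[e.outer]);
        ++blockEvals;
        for (std::size_t i = 0; i < kInner; ++i) {
            total += ((e.mask >> i) & 1u) ? r[i] : 0.0;  // select under mask
        }
    }
    return total;
}
```

Since each outer block appears at most once in the table, no two threads would touch the same block, and the number of block evaluations equals the number of table entries rather than the number of sites.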

However, creating these tables from the 'site' table is like histogramming: go through the sites and 'bin' them into 'outer blocks'. This can be hard to parallelize because of write contention on the bins. For sets like rb, all, etc. this is not a big deal, as it can be done once at startup and amortized. However, it can be a real cost for SftMom in chroma, which creates sets 'on the fly', and for user-defined sets/subsets that are created on the fly.
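
For reference, a serial version of that binning step might look roughly like the sketch below. The names (`OuterEntry`, `buildBlockTable`, `kInner`) and the `s / kInner`, `s % kInner` layout are assumptions for illustration only; the real parscalarvec site-to-block mapping may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kInner = 4;  // hypothetical inner (vector) length
static_assert(kInner <= 32, "mask is stored in 32 bits");

struct OuterEntry {
    std::size_t outer;   // index of the outer block
    std::uint32_t mask;  // bit i set => inner site i is in the subset
};

// Bin an unordered table of linear sites into per-outer-block masks,
// then compact the non-empty bins into a table of outer blocks.
std::vector<OuterEntry> buildBlockTable(const std::vector<std::size_t>& siteTable,
                                        std::size_t numOuter) {
    std::vector<std::uint32_t> masks(numOuter, 0u);
    for (std::size_t s : siteTable) {
        const std::size_t outer = s / kInner;  // assumed layout, see lead-in
        assert(outer < numOuter);
        // This read-modify-write is where threads would contend if the
        // loop over sites were naively parallelized.
        masks[outer] |= (1u << (s % kInner));
    }
    std::vector<OuterEntry> table;
    for (std::size_t b = 0; b < numOuter; ++b) {
        if (masks[b] != 0u) table.push_back(OuterEntry{b, masks[b]});
    }
    return table;
}
```

The first loop costs one pass over the sites and the second one pass over all outer blocks, so for sets built on the fly the cost is paid on every construction rather than once at startup.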

Thoughts anyone?

bjoo pushed a commit that referenced this issue Feb 6, 2015