Skip to content

Proposed photutils API

astrofrog edited this page Dec 3, 2012 · 39 revisions

Proposed photutils API

NOTE: the actual proposed public API is at https://github.com/astrofrog/astropy-api/pull/2

Source detection

Initial detection of objects:

object_slices, labeled_array = detect_objects(data, threshold, mask=mask)  # data is background-subtracted.

object_slices  # list of slices corresponding to minimum parallelpiped containing the object
labeled_array  # 2-d array, same shape as data. Each pixel is an int corresponding to the object
               # it is a member of (or 0 if not an object)

From there you use a function that loops over object slices and uses labeled_array to estimate all kinds of parameters of each object.

It is about 3 lines of code to implement this function using scipy.ndimage.measurements. However, it would be very simplistic - no source deblending and no convolution to enhance source extraction. For that sort of thing I think we would want a C extension. But do we want a function like this, regardless?

DS: What exactly are the slices? For PSF-fitting it will also be necessary to group objects that have to be fit simultaneously.

TR the proposed API may be simple enough for optical data with no nebulosity, but as soon as you add nebulosity (e.g. Spitzer data) things become a lot more complex. It would be nice to provide a framework that can be extended by users for different kinds of source detection algorithms.

KB: Slices are basically objects representing the (xmin, xmax, ymin, ymax) of the object in array coordinates. (You could get this information from the labeled_array, which is exactly what the scipy functions do underneath.) Having the slice allows you to use the small subset of the image you know contains all the object's pixels, instead of having to search the whole image looking for member pixels. http://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.measurements.find_objects.html#scipy.ndimage.measurements.find_objects

Definitely agree about making this as general as possible. I am coming from optical/NIR point of view. Can you describe nebulosity in more detail?

TR: Here's a picture to describe what I mean by nebulosity in Spitzer data: http://www.spitzer.caltech.edu/uploaded_files/images/0008/4970/ssc2012-02b_Med.jpg

And Herschel: http://sci.esa.int/science-e-media/img/ee/Herschel_Carina_orig.jpg

I picked worst case examples, but there are well defined algorithms to try and extract sources from these kinds of data, so we should have a framework that allows these to be implemented.

KB: So I guess it is not good enough to fit a variable background on these sorts of images? E.g., SExtractor does not work on them? Can you summarize the general outline of such algorithms, or an example API?

DS: Following an idea implemented in Herschel software: We could have an abstract ObjectSlice class with specialized RectangleSlice and PolygonSlice. For extended emission a user will want to be able to apply a particular PolygonSlice to maps taken at different wavelengths that have different WCS grids. (Here again is my bias towards working with objects at a higher level than ndarrays.)

KB: I'm a little confused by your comment here. The object_slices in the example are built-in python slice objects for indexing lists, arrays, etc.

DS: Ah, those slices. I'm thrown by the word "parallel_piped" in the API box. Don't we just mean "rectangle"? Do we assume 2-D images only, or a 3-D cube, in which case the slice will have three dimensions?

DS: I propose to add examples to the API to make the meaning of "slices" clearer.

Region masking

If we directly translate the existing Aperture API, it will look something like this:

from photutils.regions import CircularAperture, CircularAnnulus, EllipticalAperture, EllipticalAnnulus
circ = Circular(3.)
circ.extent()  # extent relative to center: (-3., 3., -3, 3.)
circ.area()  # area of region: pi*3.**2

# Return a 2-d float array of shape (ny, nx) indicating the fraction of each pixel enclosed
# by the region. The available choices for "method" depend on the specific Region class.
circ.encloses(xmin, xmax, ymin, ymax, nx, ny, method='center')

All region classes will derive from an abstract base class Region to guarantee that they implement the above methods.

KB: The encloses method API seems a bit clunky with its 6 parameters, but I'm not sure how to improve it. Basically (xmin, xmax, ymin, ymax, nx, ny) specify the coordinates of the pixel grid we're interested in, relative to the region center. If we limit ourselves to array coordinates, we could eliminate 2 parameters (nx, ny could be determined from xmax, xmin and ymax, ymin). Example use cases for grids with coordinates where dx != dy? (I'm sure there are some, but I can't think of any.)

TR: What about:

image = ... # read in some image as a 2-d numpy array
mask = circ.encloses(image, x, y, method='center')  # x, y pixel coordinates

since I can't think of a use case where the region/mask is then not going to be used on the image after anyway.

KB: I like this a lot. If I understand you, (x, y) are where you want the center of the circle in the pixel coordinates of image. Internally, image would only be used for its shape.

However, let's back up a sec and ask whether we want the region objects to contain information about their "centers". Currently a single Region object can be used to represent multiple regions with identical shapes. However, this means that you have to separately carry around information about where the center of each actual region is (e.g., in the form of two ndarrays xc and yc). Would it be more natural and clearer to have:

circ1 = Circular(xctr, yctr, 3.)
circ2 = Circular(xctr2, yctr2, 3.)

so that each region on the image gets its own Region object? A region would not necessarily have to define a "center" this way: you could do:

rect = Rectangle(xmin, xmax, ymin, ymax)

TR - I need to think about whether it makes sense to carry out the 'center/position/bounds' (and what happens when we want to do many regions - do we need an instance for each?). From a user point of view, I think either would be fine - you could argue that people would say 'we did the photometry with a 3" aperture', and the positions at which the photometry was done is a separate issue. If you have to photometer 100,000 sources, it would be more efficient to not have to create an aperture object for each. But as I said, I'm undecided. Let's think about it.

KB: As you know, having to create an aperture object for every single source was a motivation for the current design. But now I'm wondering if it is really a limiting factor.

When you're thinking about apertures, saying 'a 3" aperture' makes sense and naturally sounds position dependent. But if they're called 'regions', I personally think of various regions on an image (e.g., ds9 region files) in which each region has its own center. However, this is all based on what "sounds" most natural. What are other use cases for regions outside aperture photometry?

TR example application of pure regions:

  • https://github.com/astropy/photutils/issues/8 -> user wants weights of pixels to do their own arithmetic
  • if you want to extract a spectrum from a spectral cube, you might want to extract it for a region (and then collapse along the spatial direction)

DS: I agree with KB that I think of Region as having a specific center, while an Aperture that is circular has a radius, optionally an inner and outer sky annulus, but no center...so it can be applied over and over on different parts of an image.

I also like the second suggestion for circ.encloses.

KB: I'm convincing myself more and more that having the center be part of "region" is more natural and clearer. This:

mask = circ.encloses(image, x, y, method='center')  # x, y pixel coordinates

is saying "tell me what circ encloses on image. Oh, and put circ at (x, y) on image." I find it more natural to have the position of circ be a property of circ, so that you can just say "tell me what circ encloses on image":

mask = circ.encloses(image, method='center')

Now, if performance is a show-stopper, that's a different story. If we decide we like this better, I can do some speed tests before we go ahead with it.

Aperture photometry

Given a 2-d Numpy array, find the flux around a few positions with an aperture of radius 3 pixels:

flux = photutils.aperture_circular(data, xc, yc, 3.)

If xc and yc are 1-d arrays, flux will be a 1-d array. Multiple objects per aperture can be specified by making the radius parameter a 2-d array. In this case flux will be a 2-d array.

Calculate the flux and the error on flux given a constant error or error array:

flux, fluxerr = photutils.aperture_circular(data, xc, yc, 3., error=error)

PSF photometry

KB: How about something that simply mirrors aperture photometry?:

flux = photutils.psf_photometry(data, xc, yc, psf=psf, sampling=sampling)
  • For PSF kernels, psf can be a 2-d array. sampling is an integer (or even float) telling you the sampling of the kernel relative to the data.
  • For analytic PSFs, psf be a function that takes two floats (dx, dy) and returns a float.

TR: I think that PSFs should be stored in classes inheriting from an abstract base class PSF, because that way we can create parameterized analytical PSFs, numerical PSFs that can smartly read from files, and also any other kind of PSF we haven't thought of initially. For example, Spitzer used P**R**Fs, which are not oversampled PSFs, but lookup tables - for example, to extract a PSF for a particular offset, if the oversampling is a factor of 5, you would do psf = prf[1::5,1::5], i.e. read every 5 elements. It might be nice to support this kind of PSF definition too, hence why using a PSF class would make more sense (and can hide these kinds of details). Then the sampling becomes an argument of the PSF, rather than the photometry function - that is, the photometry function just asks the PSF for a rasterized PSF on the same grid as the data for a given offset, and the PSF class returns it.

So users might do something like:

psf = PSF.load('spitzer_irac_3.6.fits')
flux = photutils.psf_photometry(data, xc, yc, psf)

KB: I'm on board with a PSF class. The unification of analytical and numerical PSFs is very appealing. However, I say "cautiously" because I am always hesitant to force users to learn how to manipulate a new class unless it is really necessary. A compromise might be to allow psf to be any one of

  • A PSF (or derived class) object
  • A 2-d array
  • A callable

Internally, the 2-d array and callable would be converted to PSF objects right from the start. This approach lets beginner users use the function right away without having to learn about how PSF objects work and what their API is. And of course all the flexibility is still there for more advanced users.

I agree that psf should be a positional, not optional argument (as I had it).

DS: This suggestion brings to mind the "units" discussion about whether to allow simple strings or to require Units objects for everything. Or numpy.dtype taking a string or a dtype object. My preference would be to make "simple" classes deriving from the abstract PSF class for the simple cases...but I suppose it's okay to allow passing in the array or the callable.

KB: Sounds like TR agrees about using simple classes derived from a base PSF class. PSF could have a class method __init__ that returns a derived class depending on the input. So that you could do:

psf = PSF(np.array([ [0, 1, 0], [1, 1, 1], [0, 1, 0] ])  # psf is a ArrayPSF (or whatever name)
psf = PSF(myfunction)                                  # psf is a AnalyticalPSF
psf = PSF(SpitzerPSF(...))                             # psf is the same as the input

Accepting NDData versus multiple numpy.ndarray

TR: A couple of other points - the input data should probably be an NDData by default (though arrays could be accepted)

KB: First, I think everyone will agree that whatever we do should be consistent between psf_photometry and aperture_photometry.

Accepting NDData may be OK, provided that ndarray would also be accepted (possibly converted to NDData internally; whatever is cleaner internally). However, I have reservations about how error calculations would work with NDUncertainty objects. For example, in aperture_photometry gaussian errors are assumed. Unless this can be changed (and I'm not sure it can without significantly affecting performance), we could only accept NDData objects with certain types of ``NDUncertainty``s. This seems somewhat inelegant and unpythonic to me.

An alternative that I think I prefer: create methods in NDData or a derived class such as Image that correspond to functions in photutils. For example, one would do:

im = Image(...)
results = im.psf_photometry(...)
results['flux']  # the flux
results['chi2']  # chi2 of the fit

... and similarly for aperture_photometry. The advantage here is that the psf_photometry method inside im could do things like checking itself for the type of uncertainty, and taking care of things like units and even WCS stuff, before calling the photutils functions. I like the encapsulation of this: everything specific to the Image object is handled by that object, using self. The photutils functions don't have to go probing into the details of the Image object they are handed. For example, coordinate transformations: If one gives coordinates in (RA, Dec), photutils doesn't have to worry about coordinate transformations, or even asking the Image object for its transformation. The image object will take care of transformations itself before passing the array coordinates to photutils functions.

TR: I think that here and for the PSF, we can adopt what we've done in other parts of Astropy (e.g. for units), which is that the default we require for data is NDData (or Image(NDDdata) if you prefer) and the default for the PSF is a PSF object. But you can also pass any argument that would initialize an NDData or a PSF object, so that would include 2-d arrays, etc. NDData makes a lot of sense once you want to start considering doing photometry in world coordinates (which of course we want to do).

Also, I like your idea of Image.psf_photometry(...), but of course, implementation wise, that's exactly the same as psf_photometry(Image, ...) and it would be super easy to provide both (like numpy methods can also be used as functions, e.g. x.mean() and np.mean(x).

KB: I like the rule about passing any argument that would initialize, e.g., a PSF object.

Actually I was proposing that psf_photometry(Image, ...) would not be allowed. psf_photometry would only be base-level computation that only knows about array coordinates and arbitrary units. For unit handling and world coordinates, one would use Image.psf_photometry(...) (and it would call psf_photometry internally).

I could be convinced to go your way though, if my concerns about uncertainty handling are addressed.

DS: Aren't the two of you agreeing at this point? I feel like I am missing something.

KB: Tom is suggesting that the function photutils.psf_photometry(Image, ...) and the method Image.psf_photometry(...) be synonyms (pointing to the same exact code). I am suggesting that photutils.psf_photometry(data, ...) be the base-level function that is called by Image.psf_photometry(...).

In other words, Tom and I are suggesting identical things for the Image.psf_photometry(...) API. The difference is in what photutils.psf_photometry(...) accepts: I am suggesting it be more restrictive, dealing only in ndarrays, not higher-level NDData/Image/Table objects.

Most users will want to deal in these objects and can use Image.psf_photometry. However, I am envisioning users who want to write specialized scripts. It may not always be appropriate to use NDData. Similarly, it may be sub-optimal to keep passing around (and merging) a bunch of Table objects. For such cases, photutils.psf_photometry(data, ...) would provide an interface without the unnecessary mental overhead of NDData and Table. Note: I do agree that in most use cases, NDData and Table are great and necessary. I just want an option of not having that overhead for cases where it is not necessary. (I am talking about mental overhead; I do not think the performance overhead will be significant). This is a library after all; it should be as modular as possible.

Returning Table versus multiple ndarray

TR: The result should probably be returned as a Table object:

results = photutils.psf_photometry(...)
results['flux']  # the flux
results['chi2']  # chi2 of the fit
results['background']  # estimated background

etc. which allows any number of fit parameters to be returned.

KB: What if a user prefers to use 'Flux' or 'bkg' (or any other name)? I'm hesitant to force certain field names on a user. Also, is this actually more flexible than returning a tuple of ndarrays? For a tuple, the calling code needs to know how many ndarrays are returned. Here, it needs to know what fields are available in results (and what their names are). Seems pretty similar to me, but with the added complexity of a new API to learn (Table). I think the situation would be different if Table was a standard in the way that ndarray is, but we're not there yet... And if there was a standard for field names like 'flux', 'chi2', 'background', but we're not there yet. I know Table is meant to be an integral component of astropy, but consider a new user who knows how to use ndarray but just installed photutils: if that were me, I would balk at being handed back an unfamiliar object with no apparent gain over simple ndarrays that I am already familiar with.

TR: briefly, two points:

  1. Table is just a fancy structured array with meta-data, so users familiar with structured arrays (which is just numpy) can use Table easily.

  2. Using a table or structured array allows you to output a lot of information about the photometry (think of DAOphot tables). You don't want to have to do:

    flux, error, background, chi2, flags, etc = psf_photometry(...)
    

because (a) it's clunky, and (b) if in future we want to make psf_photometry output a new parameter, we break people's scripts. From my experience, users who start to use table classes (e.g. in ATpy) really like them, so I think this would be very user friendly. It also means you can do:

results = psf_photometry(...)
results.write('photometry_table.fits')  # fits table

or:

results.write('photometry_table.tbl', format='ipac')

etc.

KB: OK I do agree that if you have 6+ parameters, a tuple of ndarray objects gets a bit ridiculous. I'm a little less swayed by the "future new-parameters" thing. Isn't that what stable APIs are for? :)

I actually have used Table, and I do like it a lot. I just don't want to force it on any users, unless there is a really good case for it (e.g., 6+ parameters with metadata.) Writing out the results is nice, but you know the user can always do this with one extra line:

flux, error, background, chi2 = psf_photometry(...)
results = Table({'FLUX': flux, 'ERROR': error, 'BKG': background, 'CHISQ': chi2})
results.write('photometry_table.fits')  # fits table

and this allows arbitrary field names in your FITS table.

DS: I'm definitely in favor of returning a Table or an object derived from it. There are so many potential items that can be returned (see: SExtractor) that collecting these into a Table makes sense.

KB: Definitely agree for a SExtractor equivalent. But I imagine that a SExtractor equivalent would be built on top of these lower level functions. Even if these return arrays, a higher level function running multiple of them can package the arrays up into a Table. In general, what is the threshold for returning a Table instead of ndarrays? You wouldn't return a Table in place of a single ndarray output. But would you do it for any function returning 2 ndarrays?

TR for me, the threshold is >2 (which I think we definitely need here). As I said, we could also just return a structured array, but in that case we might as well take a small step and use Table, which adds units for each column, I/O, and easy ways to manipulate the table.