Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose a model by AIC in a Stepwise Algorithm #110

Open
sourish-cmi opened this issue Mar 9, 2023 · 3 comments
Open

Choose a model by AIC in a Stepwise Algorithm #110

sourish-cmi opened this issue Mar 9, 2023 · 3 comments
Assignees

Comments

@sourish-cmi
Copy link
Collaborator

Feature Request: Choose a model by AIC in a Stepwise Algorithm
Select a formula-based model by AIC.

The R code is following. We need to translate it into Julia

> getAnywhere(step)
A single object matchingstepwas found
It was found in the following places
  package:stats
  namespace:stats
with value

function (object, scope, scale = 0, direction = c("both", "backward", 
    "forward"), trace = 1, keep = NULL, steps = 1000, k = 2, 
    ...) 
{
    mydeviance <- function(x, ...) deviance(x) %||% extractAIC(x, 
        k = 0)[2L]
    cut.string <- function(string) {
        if (length(string) > 1L) 
            string[-1L] <- paste0("\n", string[-1L])
        string
    }
    re.arrange <- function(keep) {
        namr <- names(k1 <- keep[[1L]])
        namc <- names(keep)
        nc <- length(keep)
        nr <- length(k1)
        array(unlist(keep, recursive = FALSE), c(nr, nc), list(namr, 
            namc))
    }
    step.results <- function(models, fit, object, usingCp = FALSE) {
        change <- sapply(models, `[[`, "change")
        rd <- sapply(models, `[[`, "deviance")
        dd <- c(NA, abs(diff(rd)))
        rdf <- sapply(models, `[[`, "df.resid")
        ddf <- c(NA, diff(rdf))
        AIC <- sapply(models, `[[`, "AIC")
        heading <- c("Stepwise Model Path \nAnalysis of Deviance Table", 
            "\nInitial Model:", deparse(formula(object)), "\nFinal Model:", 
            deparse(formula(fit)), "\n")
        aod <- data.frame(Step = I(change), Df = ddf, Deviance = dd, 
            `Resid. Df` = rdf, `Resid. Dev` = rd, AIC = AIC, 
            check.names = FALSE)
        if (usingCp) {
            cn <- colnames(aod)
            cn[cn == "AIC"] <- "Cp"
            colnames(aod) <- cn
        }
        attr(aod, "heading") <- heading
        fit$anova <- aod
        fit
    }
    Terms <- terms(object)
    object$call$formula <- object$formula <- Terms
    md <- missing(direction)
    direction <- match.arg(direction)
    backward <- direction == "both" | direction == "backward"
    forward <- direction == "both" | direction == "forward"
    if (missing(scope)) {
        fdrop <- numeric()
        fadd <- attr(Terms, "factors")
        if (md) 
            forward <- FALSE
    }
    else {
        if (is.list(scope)) {
            fdrop <- if (!is.null(fdrop <- scope$lower)) 
                attr(terms(update.formula(object, fdrop)), "factors")
            else numeric()
            fadd <- if (!is.null(fadd <- scope$upper)) 
                attr(terms(update.formula(object, fadd)), "factors")
        }
        else {
            fadd <- if (!is.null(fadd <- scope)) 
                attr(terms(update.formula(object, scope)), "factors")
            fdrop <- numeric()
        }
    }
    models <- vector("list", steps)
    if (!is.null(keep)) 
        keep.list <- vector("list", steps)
    n <- nobs(object, use.fallback = TRUE)
    fit <- object
    bAIC <- extractAIC(fit, scale, k = k, ...)
    edf <- bAIC[1L]
    bAIC <- bAIC[2L]
    if (is.na(bAIC)) 
        stop("AIC is not defined for this model, so 'step' cannot proceed")
    if (bAIC == -Inf) 
        stop("AIC is -infinity for this model, so 'step' cannot proceed")
    nm <- 1
    if (trace) {
        cat("Start:  AIC=", format(round(bAIC, 2)), "\n", cut.string(deparse(formula(fit))), 
            "\n\n", sep = "")
        flush.console()
    }
    models[[nm]] <- list(deviance = mydeviance(fit), df.resid = n - 
        edf, change = "", AIC = bAIC)
    if (!is.null(keep)) 
        keep.list[[nm]] <- keep(fit, bAIC)
    usingCp <- FALSE
    while (steps > 0) {
        steps <- steps - 1
        AIC <- bAIC
        ffac <- attr(Terms, "factors")
        scope <- factor.scope(ffac, list(add = fadd, drop = fdrop))
        aod <- NULL
        change <- NULL
        if (backward && length(scope$drop)) {
            aod <- drop1(fit, scope$drop, scale = scale, trace = trace, 
                k = k, ...)
            rn <- row.names(aod)
            row.names(aod) <- c(rn[1L], paste("-", rn[-1L]))
            if (any(aod$Df == 0, na.rm = TRUE)) {
                zdf <- aod$Df == 0 & !is.na(aod$Df)
                change <- rev(rownames(aod)[zdf])[1L]
            }
        }
        if (is.null(change)) {
            if (forward && length(scope$add)) {
                aodf <- add1(fit, scope$add, scale = scale, trace = trace, 
                  k = k, ...)
                rn <- row.names(aodf)
                row.names(aodf) <- c(rn[1L], paste("+", rn[-1L]))
                aod <- if (is.null(aod)) 
                  aodf
                else rbind(aod, aodf[-1, , drop = FALSE])
            }
            attr(aod, "heading") <- NULL
            nzdf <- if (!is.null(aod$Df)) 
                aod$Df != 0 | is.na(aod$Df)
            aod <- aod[nzdf, ]
            if (is.null(aod) || ncol(aod) == 0) 
                break
            nc <- match(c("Cp", "AIC"), names(aod))
            nc <- nc[!is.na(nc)][1L]
            o <- order(aod[, nc])
            if (trace) 
                print(aod[o, ])
            if (o[1L] == 1) 
                break
            change <- rownames(aod)[o[1L]]
        }
        usingCp <- match("Cp", names(aod), 0L) > 0L
        fit <- update(fit, paste("~ .", change), evaluate = FALSE)
        fit <- eval.parent(fit)
        nnew <- nobs(fit, use.fallback = TRUE)
        if (all(is.finite(c(n, nnew))) && nnew != n) 
            stop("number of rows in use has changed: remove missing values?")
        Terms <- terms(fit)
        bAIC <- extractAIC(fit, scale, k = k, ...)
        edf <- bAIC[1L]
        bAIC <- bAIC[2L]
        if (trace) {
            cat("\nStep:  AIC=", format(round(bAIC, 2)), "\n", 
                cut.string(deparse(formula(fit))), "\n\n", sep = "")
            flush.console()
        }
        if (bAIC >= AIC + 1e-07) 
            break
        nm <- nm + 1
        models[[nm]] <- list(deviance = mydeviance(fit), df.resid = n - 
            edf, change = change, AIC = bAIC)
        if (!is.null(keep)) 
            keep.list[[nm]] <- keep(fit, bAIC)
    }
    if (!is.null(keep)) 
        fit$keep <- re.arrange(keep.list[seq(nm)])
    step.results(models = models[seq(nm)], fit, object, usingCp)
}

Referances

Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. New York: Springer (4th ed).

@sourish-cmi sourish-cmi self-assigned this Mar 9, 2023
@waitasecant
Copy link

waitasecant commented Apr 12, 2023

Dear Sir,
I am working on package BestModelSubset.jl which implements naively three model selection algorithms i.e. Best Subset Selection, Forward-Stepwise Selection and Backward Step-wise Selection. I had referenced Chapter 6 of ISLR Linear Model Selection and Regularization for these.

One simple example.

Instantiate a ModelSelection object

# To execute forward step-wise selection with primary parameter to be R-squared score  
# and secondary parameter to be aic.
julia> obj = ModelSelection("forward", "r2", "aic")
ModelSelection(BestModelSubset.forward_stepwise, nothing, StatsAPI.r2, nothing, StatsAPI.aic,
               nothing, StatsAPI.r2, StatsAPI.aic)

Fit the ModelSelectionobject to the data

julia> Random.seed!(123); df = hcat(rand(Float64, (50, 21))); # 50*21 Matrix
# The fit! function updates the fields of the `ModelSelection` object.
# Returns the indexes of columns which have the least aic value.
julia> index = fit!(obj, df)
1-element Vector{Vector{Int64}}:
 [5, 6, 16, 17, 18, 20]

Access various statistics like r2, adjr2, aic and bic for the selected model

# One can always use something like MLBase.rss(df[:, index]) to get the statistics not passed earlier.
julia> obj.r2
0.8161760683631274

julia> obj.aic
21.189713760250477

I would be more than willing to contribute to CRRao.jl, a step-wise selection feature.

@sourish-cmi
Copy link
Collaborator Author

BestModelSubset.jl

The link is not working

@waitasecant
Copy link

Sir,
Sorry for responding late, I was unwell lately. I have fixed it, should work fine now.

julia> using Pkg; Pkg.add(url="https://github.com/waitasecant/BestModelSubset.jl.git")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants