
Allow GLM style specification of Bernoulli outcomes for logistic regression #3

Open · wants to merge 1 commit into master

Conversation

johnmyleswhite
Member

I often like to use glm(X, y), where y is a vector of 0s and 1s, and thought it would be nice to offer the same interface for working with glmnet. This patch makes some quick changes to allow that. Let me know if you'd like the changes handled differently, since I didn't spend much time thinking about the cleanest way to add this functionality.

The script below gives a basic demo of the extended functionality. I can make it into a test:

using GLMNet
using Distributions
using Base.Test

srand(1)

invlogit(z::Real) = 1 / (1 + exp(-z))

n, p = 250_000, 2

# Simulate a logistic regression problem: draw a true intercept and
# coefficients, then sample Bernoulli outcomes from the implied probabilities.
intercept = randn()
beta = randn(p)
X = randn(n, p)
y = X * beta
for i in 1:n
    y[i] = rand(Bernoulli(invlogit(intercept + y[i])))
end

# The fit at the end of the regularization path should recover the truth.
path = glmnet(X, y, Binomial())
@test abs(intercept - path.a0[end]) < 0.1
@test norm(beta - convert(Matrix{Float64}, path.betas)[:, end]) < 0.1

@simonster
Member

I agree that this is an API worth having. I had thought about it previously but got bogged down in the implementation: if X has many duplicate rows, model fitting would be faster if we pooled the duplicates into binomial counts before calling lognet. That boils down to finding the unique rows of X. Doing this without allocating for each row seems possible, but it required more code than I was prepared to write at the time. Let's start with this approach and worry about efficiency later.
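For reference, a minimal sketch of that pooling idea in current Julia syntax. This is hypothetical, not part of the patch: `pool_rows` is a name invented here, and it still allocates one key per row, which is exactly the cost the comment above wants to avoid.

```julia
# Sketch: collapse duplicate rows of X into per-row (successes, failures)
# counts, the two-column response format the binomial path already accepts.
function pool_rows(X::Matrix{Float64}, y::Vector{Float64})
    counts = Dict{Vector{Float64},Vector{Float64}}()
    for i in 1:size(X, 1)
        row = X[i, :]                    # allocates a copy per row (the slow part)
        c = get!(counts, row, [0.0, 0.0])
        c[1] += y[i]                     # successes (y[i] is 0 or 1)
        c[2] += 1.0 - y[i]               # failures
    end
    rows = collect(keys(counts))
    Xu = permutedims(reduce(hcat, rows))               # unique rows as a matrix
    yu = permutedims(reduce(hcat, [counts[r] for r in rows]))
    return Xu, yu
end
```

With three observations of which two share a row, this yields a 2×2 design matrix and a 2×2 count matrix whose entries sum to the original number of trials.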

@@ -350,7 +365,11 @@ glmnet(X::AbstractMatrix, y::AbstractVector, family::Distribution=Normal(); kw...
 glmnet(X::Matrix{Float64}, y::Matrix{Float64}, family::Binomial; kw...) =
     glmnet!(copy(X), copy(y), family; kw...)
 glmnet(X::Matrix, y::Matrix, family::Binomial; kw...) =
-    glmnet(float64(X), float64(y), family; kw...)
+    glmnet!(float64(X), float64(y), family; kw...)
Member

You're right that at least one unnecessary allocation always happens here, but since float64() doesn't make a copy when the input already has the right type, this will, for example, overwrite X in place if it is already a Matrix{Float64} but y is not.
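A small illustration of that aliasing hazard. `float64_like` is a stand-in written here for the old `float64()`, assuming only that it returns its argument unchanged when the element type is already Float64:

```julia
# Stand-in for float64(): returns the input itself when it is already
# Float64, so no copy is made; otherwise converts.
float64_like(A) = eltype(A) == Float64 ? A : map(Float64, A)

X = [1.0 2.0; 3.0 4.0]        # already Matrix{Float64} => no copy
Xf = float64_like(X)
@assert Xf === X               # same array: a mutating call such as
                               # glmnet!(float64_like(X), ...) would
                               # silently overwrite the caller's X

Y = [1 2; 3 4]                 # Int matrix => a genuine copy is made
@assert float64_like(Y) !== Y
```

This is why dispatching back to the copying `glmnet` method, rather than straight to `glmnet!`, is the safer fix.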

@johnmyleswhite
Member Author

I'm going to come back to this tomorrow. I've had some problems getting the solver to converge recently and need to delve deeper.
