A Julia package for performing survival analysis.
Functionality:
- Time-to-event types
- Kaplan-Meier survival
- Nelson-Aalen cumulative hazard
- Cox proportional hazards regression
Survival analysis in Julia
License: MIT License
In similar spirit to #1 I have my own package (originally just hobby code to learn Julia, pls don't judge too harsh) for survival analysis. As @ararslan and I discussed on Slack I don't want to be 'competing' or causing duplicative code to be written so just wanted to open this issue to see if there are any parts I can merge in or if I should just archive it/keep for hobby code only.
So far I have:
My future plans were going to be:
My plan was then to hook this up between Turing.jl and mlr3proba for cross-language probabilistic ML in R.
If useful happy to go into detail about features/methods but won't for now.
I was curious about some comparisons, these might be of interest:
using RCall
using Random: seed!
using Distributions
using BenchmarkTools
using Survival
seed!(1)
n = 1000
T = round.(rand(Uniform(1, 10), n));
Δ = rand(Binomial(), n) .== 1;
surv = Surv(T, Δ, "right");
R"
library(survival)
time = $T
status = $Δ
surv = Surv(time, status)
";
@benchmark R"
km = survfit(surv ~ 1)
"
BenchmarkTools.Trial: 6946 samples with 1 evaluation.
Range (min … max): 599.208 μs … 67.817 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 625.708 μs ┊ GC (median): 0.00%
Time (mean ± σ): 718.383 μs ± 1.690 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▅▃▂▁▁ ▁
████████▇▇▇▆▆▅▅▆▄▄▄▄▄▅▁▁▃▄▄▃▃▃▄▁▁▄▁▁▁▁▁▃▄▁▁▃▃▃▁▁▁▁▃▃▃▃▁▁▃▁▁▃ █
599 μs Histogram: log(frequency) by time 2.26 ms <
Memory estimate: 1.12 KiB, allocs estimate: 41.
julia> @benchmark kaplan(surv)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.067 μs … 619.550 μs ┊ GC (min … max): 0.00% … 97.93%
Time (median): 6.308 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.881 μs ± 14.405 μs ┊ GC (mean ± σ): 6.14% ± 2.92%
██▁
▂▅▆▆███▆▃▃▃▃▄▄▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
6.07 μs Histogram: frequency by time 8.47 μs <
Memory estimate: 8.03 KiB, allocs estimate: 151.
julia> R"
paste(round(km$lower, 2),
round(km$upper, 2), sep = ',')
"
RObject{StrSxp}
[1] "0.97,0.99" "0.9,0.94" "0.83,0.88" "0.76,0.81" "0.69,0.75" "0.62,0.69"
[7] "0.52,0.6" "0.41,0.49" "0.27,0.35" "0.14,0.23"
julia> km = kaplan(surv);
julia> confint(km)
11-element Vector{Tuple{Float64, Float64}}:
(1.0, 1.0)
(0.9655902766870649, 0.9846564570042777)
(0.90008189973811, 0.9343531290390091)
(0.8304177051112486, 0.8755272501579991)
(0.753866856584728, 0.8079562064849091)
(0.6863770858305547, 0.7466648377965709)
(0.6175078894755917, 0.683065386931595)
(0.5222658655269816, 0.5941460334482926)
(0.40985396069034197, 0.4878272995385684)
(0.2674278465799201, 0.35115893829800077)
(0.136515397416271, 0.22525900129765383)
julia> n = 1000;
julia> T = round.(rand(Uniform(1, 10), n));
julia> Δ = rand(Binomial(), n) .== 1;
julia> et = Survival.EventTime.(T, Δ);
julia> ot = Surv(T, Δ, "right");
julia> @btime EventTime.(T, Δ);
792.120 ns (3 allocations: 15.81 KiB)
julia> @btime Surv(T, Δ, "right"); ## longer as expected due to postprocessing
25.041 μs (49 allocations: 35.20 KiB)
julia> @btime (k = kaplan(ot); confint(k));
6.700 μs (152 allocations: 8.27 KiB)
julia> @btime (k = fit(Survival.KaplanMeier, et); confint(k));
36.041 μs (34 allocations: 52.33 KiB)
julia> using Plots
julia> plot(kaplan(ot))
I'm really happy to have come across this package! Very handy and simple.
I am trying to compute Kaplan-Meier estimates to make a survival curve. I understand from the docs that I need an EventTable to pass to the KaplanMeier function; however, when I do EventTable(eventtimes) I get an error. I checked and it seems that the type EventTable doesn't exist:
using Survival;
? Survival.EventTable
No documentation found.
Binding Survival.EventTable does not exist.
To check that my environment is all right:
? Survival.EventTime
EventTime{T}
Immutable object containing the real-valued time to an event as well as an indicator of whether the time corresponds to an observed event (true) or right censoring (false).
Am I doing something wrong? I am using Survival v0.2.2 on Julia 1.8.2
Would it be possible to officially release this package? So far I clone it for usage but I often encounter issues with it. Last time I was generating a new package using PkgTemplates and I got an error due to Survival.
Activating environment at ~/.julia/environments/v1.3/Project.toml
[ Info: Committed 6 files/directories: src/, Project.toml, test/, README.md, LICENSE, .gitignore
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Survival [8a913413]:
Survival [8a913413] log:
├─Survival [8a913413] has no known versions!
└─restricted to versions * by FlippingModel [d1f21f3f] — no versions left
└─FlippingModel [d1f21f3f] log:
├─possible versions are: 0.1.0 or uninstalled
└─FlippingModel [d1f21f3f] is fixed to version 0.1.0
coxph result for R:
z:0.004
Pr(>|z|):0.997
Warning message:
In fitter(X, Y, istrat, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; coefficient may be infinite.
coxph result for julia:
z:Inf
Pr(>|z|):<1e-99
Why is there such a big difference between the R and Julia results, and which one should be trusted?
##################the test run code for R and julia################
R code
library(survival)
test1 <- read.csv(file = '/Users/guan.wang/Downloads/coxph_testdata.csv',sep="\t")
re<-coxph(Surv(survivalMonth, survivalEvent) ~ myclass, test1)
summary(re)
##R outputs:
Warning message:
In fitter(X, Y, istrat, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; coefficient may be infinite.
> summary(re)
Call:
coxph(formula = Surv(survivalMonth, survivalEvent) ~ myclass,
data = test1)
n= 148, number of events= 41
coef exp(coef) se(coef) z Pr(>|z|)
myclass 1.948e+01 2.885e+08 4.568e+03 0.004 0.997
exp(coef) exp(-coef) lower .95 upper .95
myclass 288459349 3.467e-09 0 Inf
Concordance= 0.623 (se = 0.02 )
Likelihood ratio test= 23.75 on 1 df, p=1e-06
Wald test = 0 on 1 df, p=1
Score (logrank) test = 13.8 on 1 df, p=2e-04
#########
## julia code:
using CSV, DataFrames, StatsModels, Survival
rossi = CSV.read("coxph_testdata.csv", DataFrame; header=1, delim="\t")
rossi.event = EventTime.(rossi.survivalMonth, rossi.survivalEvent .== 1)
outcome = coxph(@formula(event ~ myclass), rossi)
outcome_coefmat = coeftable(outcome)
print(outcome_coefmat)
## julia output:
Estimate Std.Error z value Pr(>|z|)
───────────────────────────────────────────────
myclass 35.4617 0.0 Inf <1e-99
───────────────────────────────────────────────
Hi!
A few months back I had started developing a package for Survival analysis in Julia (it's here )
So far it mainly has Kaplan Meier, Cox proportional hazard model and Accelerated Failure Time models.
Cox model is well optimized (last time I benchmarked it was about 3x faster than matlab's version on some test dataset). Accelerated Failure Time models are not as polished (and way less common I guess).
I don't think I will have enough time to dedicate to this in the future to make one fully polished Survival package alone (plus, it makes limited sense to have 2 separate Survival packages). If you think that would be valuable I can polish my version up a bit (get it up to date with Julia v0.6 and so on), make it compatible with your type system/formalism and make a PR to your package to start unifying things. I guess the first thing to contribute would be Cox (as it's more polished/ it's more clear how to do it). Accelerated Failure Time models can wait until they are cleaner and it also depends whether you are interested in them.
Let me know what you think!
Pietro
Hi, Thanks for writing this package.
I found a potential bug, where @formula from StatsModels.jl is not automatically available when only writing using Survival without using StatsModels:
using Survival
using StatsBase
using DataFrames
event = EventTime.(10 .+ randn(10), sample([true, false], 10))
coxph(@formula(events ~ x), DataFrame(x = randn(10), events = event))
output:
ERROR: LoadError: UndefVarError: @formula not defined
I have been using Kaplan-Meier estimators and wrote a small function to evaluate the estimator at any time point. I have used this to generate statistics and diagnostic plots.
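function evaluate(km_fit::KaplanMeier, t::Real)
    time_points = km_fit.events.time   # sorted distinct event/censoring times
    survival = km_fit.survival         # estimate at each of those times
    # Index of the last time point <= t; 0 if t precedes every time point.
    id = searchsortedlast(time_points, t)
    # The estimator is a right-continuous step function: it equals 1 before
    # the first time point and survival[id] from time_points[id] onward.
    return id == 0 ? one(eltype(survival)) : survival[id]
end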
@ararslan if you think this is a useful feature I can create a pull request. I have not looked at the Nelson-Aalen estimator, but maybe something similar can be done there, or more generally for the nonparametric estimator type.
I want to add random effects to my Cox model formulation, and I have noticed that
cox1 = coxph(@formula(event ~ A + B + C + (1|D)), data)
produces the following error:
ERROR: MethodError: no method matching |(::Int64, ::String)
Closest candidates are:
|(::Any, ::Any, ::Any, ::Any...) at operators.jl:591
|(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:365
|(::T, ::CEnum.Cenum{S}) where {T<:Integer, S<:Integer} at ~/.julia/packages/CEnum/Bqafi/src/operators.jl:13
but this syntax is allowed in @formula (i.e., the following works fine):
@formula(event ~ A + B + C + (1|D))
I assume mixed effects aren't supported in coxph?
For reference, random effects in survival analysis are often referred to as 'frailty terms' (in fact, I think in R they are added as frailty(D) as opposed to 1|D, but I'm not sure).
Is this something that is supported but I'm just doing it wrong?
E.g. https://github.com/JuliaStats/Survival.jl/runs/3504403977. It happens randomly on both 1.6 and nightly although there shouldn't be anything stochastic about that test. I have no idea what is causing this and I can't reproduce locally.
The standard error for the Kaplan-Meier estimator returns what I would consider to be only "half" of the correct value. Perhaps this is just a misunderstanding on my part, but I think it should return the square root of the variance of the survival function. Right now, it does not. To actually make it the standard error of the survival curve, you need to multiply the two (AKA km.survival .* km.stderr).
Should I make a PR to fix this, or was it intentional?
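To make the distinction concrete, here is a minimal sketch in plain Julia (no packages) that computes the Kaplan-Meier estimate together with both quantities discussed above. It assumes, as this issue describes, that the reported value is the square root of the Greenwood sum, i.e. the standard error of log S(t); the delta method then gives se(S(t)) ≈ S(t) · se(log S(t)). The function name km_greenwood and its return fields are hypothetical and are not part of Survival.jl.

```julia
# Sketch only: Kaplan-Meier with Greenwood standard errors, Base Julia.
# `km_greenwood` is a hypothetical helper, not the Survival.jl API.
function km_greenwood(times::AbstractVector{<:Real}, status::AbstractVector{Bool})
    order = sortperm(times)
    t, s = times[order], status[order]
    utimes = unique(t)            # distinct times, already sorted
    surv = Float64[]              # S(t) at each distinct time
    se_log = Float64[]            # standard error of log S(t)
    atrisk = length(t)
    S = 1.0
    gw = 0.0                      # running Greenwood sum
    for u in utimes
        at_u = findall(==(u), t)  # observations at this time
        d = count(s[at_u])        # number of events at this time
        S *= 1 - d / atrisk
        # Note: this term is infinite if everyone still at risk fails here.
        gw += d / (atrisk * (atrisk - d))
        push!(surv, S)
        push!(se_log, sqrt(gw))
        atrisk -= length(at_u)    # events and censorings leave the risk set
    end
    # Delta method: se(S) is approximately S * se(log S).
    return (time = utimes, survival = surv, se_log = se_log,
            se_surv = surv .* se_log)
end

res = km_greenwood([1, 2, 3, 4, 5], [true, false, true, true, false])
```

Under that assumption, res.se_surv plays the role of km.survival .* km.stderr from the issue, while res.se_log corresponds to the value currently returned.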
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
I think it would be good to add the ability to produce confidence bands in addition to the pointwise confidence intervals, at least for the Kaplan-Meier method.
The SAS PROC LIFETEST documentation is pretty thorough here (discussion starting ~pg 5155). There are some packages elsewhere that have pre-computed the appropriate critical values for some of the methods, so that legwork is already out there.
This would probably be a good first issue for someone.
Is it possible to fit a Cox model given a set of predictors M and patients Y, then get the partial likelihood of the fitted model on a vector of predictors m_new and survival y_new from a new patient?
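As a rough illustration of what such an evaluation could look like, the sketch below computes the Cox log partial likelihood (Breslow form for ties) at fixed coefficients, using plain Julia only. The function name cox_loglik is hypothetical and not part of Survival.jl; the coefficient vector β is assumed to have been extracted from a previously fitted model. Note that a single new patient on their own contributes zero, because their risk set contains only themselves, so this is only informative on a set of held-out patients.

```julia
# Sketch only: Cox log partial likelihood at fixed coefficients β.
# `cox_loglik` is a hypothetical helper, not the Survival.jl API.
function cox_loglik(β::AbstractVector{<:Real}, X::AbstractMatrix{<:Real},
                    times::AbstractVector{<:Real}, status::AbstractVector{Bool})
    η = X * β                          # linear predictor for each patient
    ll = 0.0
    for i in eachindex(times)
        status[i] || continue          # censored rows add no term of their own
        # Risk set: everyone whose observed time is at least times[i].
        risk = sum(exp(η[j]) for j in eachindex(times) if times[j] >= times[i])
        ll += η[i] - log(risk)
    end
    return ll
end

# Three held-out patients, all with observed events, evaluated at β = 0.
X_new = reshape([1.0, 2.0, 3.0], 3, 1)
ll0 = cox_loglik([0.0], X_new, [1.0, 2.0, 3.0], [true, true, true])
```

With β = 0 every patient has the same risk, so the value reduces to -log(3) - log(2) - log(1), which is a convenient sanity check.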
Cox: Error During Test at /home/travis/build/JuliaStats/Survival.jl/test/runtests.jl:188
Got exception outside of a @test
fatal error in type inference (type bound)
Stacktrace:
[1] StatsModels.Terms(::Formula) at /home/travis/.julia/packages/StatsModels/pBxdt/src/formula.jl:341
[2] #fit#36(::Dict{Any,Any}, ::Base.Iterators.Pairs{Symbol,Float64,Tuple{Symbol},NamedTuple{(:tol,),Tuple{Float64}}}, ::typeof(fit), ::Type{CoxModel}, ::Formula, ::DataFrame) at /home/travis/.julia/packages/StatsModels/pBxdt/src/statsmodel.jl:66
[3] (::getfield(StatsBase, Symbol("#kw##fit")))(::NamedTuple{(:tol,),Tuple{Float64}}, ::typeof(fit), ::Type{CoxModel}, ::Formula, ::DataFrame) at ./none:0
[4] #coxph#20(::Base.Iterators.Pairs{Symbol,Float64,Tuple{Symbol},NamedTuple{(:tol,),Tuple{Float64}}}, ::typeof(coxph), ::Formula, ::DataFrame) at /home/travis/build/JuliaStats/Survival.jl/src/cox.jl:203
[5] (::getfield(Survival, Symbol("#kw##coxph")))(::NamedTuple{(:tol,),Tuple{Float64}}, ::typeof(coxph), ::Formula, ::DataFrame) at ./none:0
[6] top-level scope at /home/travis/build/JuliaStats/Survival.jl/test/runtests.jl:192
[7] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1113
[8] top-level scope at /home/travis/build/JuliaStats/Survival.jl/test/runtests.jl:189
[9] include at ./boot.jl:328 [inlined]
[10] include_relative(::Module, ::String) at ./loading.jl:1094
[11] include(::Module, ::String) at ./Base.jl:31
[12] include(::String) at ./client.jl:431
[13] top-level scope at none:5
[14] eval(::Module, ::Any) at ./boot.jl:330
[15] exec_options(::Base.JLOptions) at ./client.jl:271
[16] _start() at ./client.jl:464
When I try to install Survival.jl (per the documentation instructions), I get an error saying that it is not a valid package name (see below). Do I need to specify anything else regarding the fact that the package is not yet registered in Julia's General package registry? I came across this error when trying to use this package in a Jupyter notebook (via using IJulia; notebook()).
using Pkg;
Pkg.add("https://github.com/JuliaStats/Survival.jl")
Pkg.resolve();
https://github.com/JuliaStats/Survival.jl is not a valid packagename
Stacktrace:
[1] pkgerror(::String) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/Types.jl:120
[2] check_package_name(::String) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:22
[3] iterate at ./generator.jl:47 [inlined]
[4] collect(::Base.Generator{Array{String,1},typeof(Pkg.API.check_package_name)}) at ./array.jl:619
[5] #add_or_develop#11(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:mode,),Tuple{Symbol}}}, ::Function, ::Array{String,1}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:28
[6] #add_or_develop at ./none:0 [inlined]
[7] #add_or_develop#10 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:27 [inlined]
[8] #add_or_develop at ./none:0 [inlined]
[9] #add#18 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69 [inlined]
[10] add(::String) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69
[11] top-level scope at In[2]:2
Rather than rolling our own optimization for Cox models, I think we might be better off depending on a well-established package such as Optim to perform that if at all possible.
Not sure what's happening (unfortunately I don't remember any more what manipulations StatsModels does to fit before calling fit with arrays):
julia> using Survival, DataFrames, StatsModels
julia> df = DataFrame(x = rand(4), y = EventTime.(rand(4), rand(Bool, 4)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ EventTim… │
├─────┼──────────┼───────────┤
│ 1 │ 0.827601 │ 0.662955+ │
│ 2 │ 0.909557 │ 0.0749414 │
│ 3 │ 0.379872 │ 0.0578822 │
│ 4 │ 0.122107 │ 0.712976 │
julia> fit(CoxModel, @formula(y ~ x), df)
ERROR: MethodError: no method matching fit(::Type{CoxModel}, ::Array{Float64,2}, ::Array{Float64,2})
Closest candidates are:
fit(::Type{Histogram}, ::Any...; kwargs...) at /home/pietro/.julia/dev/StatsBase/src/hist.jl:319
fit(::StatisticalModel, ::Any...) at /home/pietro/.julia/dev/StatsBase/src/statmodels.jl:151
fit(::Type{T<:RegressionModel}, ::FormulaTerm, ::Any, ::Any...; contrasts, kwargs...) where T<:RegressionModel at /home/pietro/.julia/dev/StatsModels/src/statsmodel.jl:82
...
Stacktrace:
[1] #fit#57(::Dict{Symbol,Any}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}, ::Function, ::Type{CoxModel}, ::FormulaTerm{Term,Term}, ::DataFrame) at /home/pietro/.julia/dev/StatsModels/src/statsmodel.jl:88
[2] fit(::Type{CoxModel}, ::FormulaTerm{Term,Term}, ::DataFrame) at /home/pietro/.julia/dev/StatsModels/src/statsmodel.jl:82
[3] top-level scope at none:0
Hello,
I noticed that Julia and R differ on an example from the MICE documentation. Initially, I thought this was related to #50, but upon closer inspection, I noticed that Julia removes duplicate times (R and Julia agree when duplicates are removed). It might be a good idea to explain this in the documentation, or perhaps not remove duplicates (return a warning instead?). I'm not sure what the best approach is or whether there is a best approach, but I think documenting the difference from R would be helpful to users.
R Code
require(MASS)
leuk$status <- 1 ## no censoring occurs in leuk data (MASS)
ch <- nelsonaalen(leuk, time, status)
plot(x = leuk$time, y = ch, ylab = "Cumulative hazard", xlab = "Time")
leuk$rchaz = ch
write.csv(leuk, "leuk.csv")
Julia Code
using CSV
using DataFrames
using Interpolations
using Plots
using Survival
df = CSV.read("leuk.csv", DataFrame)
R1 = fit(NelsonAalen, df.time, df.status)
scatter(R1.events.time, R1.chaz, markersize=2, label = "Julia")
scatter!(df.time, df.rchaz, markersize=2, label = "R")
Version Info
[336ed68f] CSV v0.10.11
[a93c6f00] DataFrames v1.6.1
[91a5bcdd] Plots v1.39.0
[8a913413] Survival v0.3.0
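For comparing against an R result that has one value per observation, one option is to expand the per-unique-time estimates back to the original rows. The sketch below uses plain Julia; the helper name expand_to_observations is hypothetical, and it assumes the unique times are sorted and that every observed time appears among them (as would be the case if they were derived from the same data).

```julia
# Sketch only: map estimates at sorted unique times back to each observation.
# `expand_to_observations` is a hypothetical helper, not the Survival.jl API.
function expand_to_observations(utimes::AbstractVector{<:Real},
                                values::AbstractVector{<:Real},
                                obs_times::AbstractVector{<:Real})
    # For each observation, find the position of the last unique time <= its time.
    idx = searchsortedlast.(Ref(utimes), obs_times)
    return values[idx]
end

utimes = [1, 2, 5]                 # distinct times, sorted
chaz = [0.1, 0.3, 0.6]             # illustrative values at those times
obs_times = [2, 1, 2, 5, 5]        # original, possibly duplicated, times
per_obs = expand_to_observations(utimes, chaz, obs_times)
```

In the example from this issue, utimes and chaz would be R1.events.time and R1.chaz, and obs_times would be df.time.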
I noticed that NelsonAalen repeats the second to last value twice. I thought this was odd, so I compared it to R.
Julia
using Survival
x = 1:5
v = fill(1, 5)
result = fit(NelsonAalen, x, v)
result.chaz
Result
5-element Vector{Float64}:
0.2
0.45
0.7833333333333333
1.2833333333333332
1.2833333333333332
R
library("mice")
df = data.frame(time = c(1,2,3,4,5))
df$status = 1
nelsonaalen(df, time, status)
Result
[1] 0.2000000 0.4500000 0.7833333 1.2833333 2.2833333
Version
Julia 1.8.1
Survival 0.2.2
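As a cross-check on which output is expected, the Nelson-Aalen estimate can be computed by hand as the running sum of d_i / n_i, where d_i is the number of events and n_i the number at risk at each distinct time. The following is a minimal plain-Julia sketch; the function name nelson_aalen is hypothetical and is not the Survival.jl implementation.

```julia
# Sketch only: Nelson-Aalen cumulative hazard, Base Julia.
# `nelson_aalen` is a hypothetical helper, not the Survival.jl API.
function nelson_aalen(times::AbstractVector{<:Real}, status::AbstractVector{Bool})
    order = sortperm(times)
    t, s = times[order], status[order]
    utimes = unique(t)            # distinct times, already sorted
    chaz = Float64[]
    atrisk = length(t)
    H = 0.0
    for u in utimes
        at_u = findall(==(u), t)  # observations at this time
        d = count(s[at_u])        # events at this time
        H += d / atrisk           # hazard increment d_i / n_i
        push!(chaz, H)
        atrisk -= length(at_u)
    end
    return (time = utimes, chaz = chaz)
end

na = nelson_aalen(1:5, fill(true, 5))
```

For the five uncensored observations above, the increments are 1/5, 1/4, 1/3, 1/2 and 1/1, so the final value is 137/60 ≈ 2.2833, which agrees with the R output rather than with the repeated 1.2833.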
Support weights as an argument to the model.
Hey fellas, thanks a lot for this package. I've been using it recently, and I noticed that, for some reason, it's taking a huge toll on my memory. I have a dataframe with 20k rows. Once I load the dataframe, I have 4 GB of memory in use. After running the Cox-PH model, it grows to 11 GB, and stays that way, even if I place the whole thing inside a function.
sdf = CSV.read("sdf.csv", DataFrame);
function coxphcoef(sdf)
X = Matrix(sdf[!,Not(:duracao)])
y = sdf[!,:duracao];
y = EventTime.(y,[true for i in y]);
@show cphmodel = coxph(X,y);
return cphmodel.β
end
coxphcoef(sdf);
What is going on here?
BTW, I'm in Julia 1.7.2. Using:
DataFrames v1.3.2
Survival v0.2.2
Are there any plans to support stratification for the K-M fit?
Hello, could you release a new version so the "new" structure is downloadable? I have a workflow using the Survival package that I would like to share, but I would like to tell people to use ] add Survival instead of ] add https://github.com/JuliaStats/Survival.jl.git.
Thank you.
More specifically: a few years ago you changed the API and functionality on stable, but the released version is 2 years old.
Now that (hopefully) the Cox PR is nearing merging, and following on #3, I wanted to make a checklist of features to port from AcceleratedFailure (the other survival package, I've renamed it to avoid name clashing).
1 and 2 are almost done, 3 and 4 will be next. They will probably require some refactoring as the code is extremely similar for Nelson-Aalen, Kaplan-Meier and cumulative incidence. @ararslan: if you prefer, I can wait until you refactor the code for cumulative incidence before adding Nelson-Aalen (in which case I would start by porting the tools for differentiation of cumulatives), otherwise I can try refactoring myself.