Title: | Local Association Measures |
---|---|
Description: | Implements the estimation of local (and global) association measures: Lewontin's D, Ducher's Z, pointwise mutual information, normalized pointwise mutual information and chi-squared residuals. The significance of local (and global) association is assessed using p-values estimated by permutations. |
Authors: | Olivier M. F. Martin [aut, cre], Michel Ducher [aut] |
Maintainer: | Olivier M. F. Martin <[email protected]> |
License: | GPL-3 |
Version: | 0.2.2.0 |
Built: | 2025-03-02 04:20:31 UTC |
Source: | https://github.com/omfmartin/zebu |
Chi-squared test: statistical significance of (global) chi-squared statistic and (local) chi-squared residuals
chisqtest(x, p_adjust = "BH")
x |
|
p_adjust |
multiple testing correction method.
(see |
chisqtest returns an S3 object of class lassie and chisqtest.
Adds the following to the lassie object x:
global_p: global association p-value.
local_p: array of local association p-values.
# Calling lassie on cars dataset
las <- lassie(cars, continuous = colnames(cars), measure = "chisq")

# Chi-squared test using default settings
chisqtest(las)
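For intuition, the idea of pairing a global chi-squared p-value with one p-value per cell can be sketched in base R. This is a sketch only, not the code behind chisqtest. It assumes that cell-level p-values come from standardized residuals treated as approximately standard normal, then adjusted with p.adjust. All variable names are illustrative.

```r
# Base R sketch (not zebu's implementation): global chi-squared test plus
# per-cell residual p-values with Benjamini-Hochberg correction.

# Discretize both columns of 'cars' into 4 equal-width bins
speed_bin <- cut(cars$speed, breaks = 4)
dist_bin  <- cut(cars$dist,  breaks = 4)
tab <- table(speed_bin, dist_bin)

# Global test; suppressWarnings() hides the small-expected-count warning
fit <- suppressWarnings(chisq.test(tab))
global_p <- fit$p.value

# Assumption: standardized residuals are roughly N(0, 1) under independence,
# so a two-sided p-value per cell follows from the normal distribution
z <- fit$stdres
local_p_raw <- 2 * pnorm(-abs(z))

# Adjust the cell-level p-values for multiple testing
local_p <- array(p.adjust(local_p_raw, method = "BH"),
                 dim = dim(z), dimnames = dimnames(z))
```

The adjusted array keeps the shape of the contingency table, so each p-value can be read against the cell it belongs to.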
Maximum-likelihood estimation of marginal and multivariate observed and expected independence probabilities. Marginal probability refers to the probability of each factor per individual column. Multivariate probability refers to cross-classifying factors for all columns.
estimate_prob(x)
x |
data.frame or matrix. |
List containing the following values:
margins: a list of marginal probabilities. Names correspond to colnames(x).
observed: observed multivariate probability array.
expected: expected multivariate probability array
# This is what happens behind the curtains in the 'lassie' function
# Here we compute the association between the 'Girth' and 'Height' variables
# of the 'trees' dataset

# 'select' and 'continuous' take column numbers or names
select <- c('Girth', 'Height') # select subset of trees
continuous <- c(1, 2) # both 'Girth' and 'Height' are continuous

# equal-width discretization with 3 bins
breaks <- 3

# Preprocess data: subset, discretize and remove missing data
pre <- preprocess(trees, select, continuous, breaks)

# Estimates marginal and multivariate probabilities from preprocessed data.frame
prob <- estimate_prob(pre$pp)

# Computes local and global association using Ducher's Z
lam <- local_association(prob, measure = 'z')
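The three returned components can be reproduced approximately in base R. The sketch below assumes that "maximum-likelihood" means plain relative frequencies and that expected probabilities are the product of the margins. It does not call zebu, and the variable names are illustrative.

```r
# Base R sketch of maximum-likelihood probability estimates (relative
# frequencies); illustrative only, not zebu's internal code.

# Discretize two 'trees' columns into 3 equal-width bins
df <- data.frame(
  Girth  = cut(trees$Girth,  breaks = 3),
  Height = cut(trees$Height, breaks = 3)
)

# Observed multivariate probabilities: cell counts divided by sample size
observed <- prop.table(table(df))

# Marginal probabilities: one table per column, named after the columns
margins <- lapply(seq_along(df), function(i) margin.table(observed, i))
names(margins) <- colnames(df)

# Expected probabilities under independence: outer product of the margins
expected <- Reduce(outer, lapply(margins, as.vector))
dimnames(expected) <- dimnames(observed)
```

Because Reduce applies outer successively, the same lines extend to more than two columns, producing a higher-dimensional expected array.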
Formats a lassie object for printing to console (see print.lassie) and for writing to a file (see write.lassie). Melts probability or local association measure arrays into a data.frame.
## S3 method for class 'lassie'
format(x, what_x, range, what_range, what_sort, decreasing, na.rm, ...)
x |
|
what_x |
vector specifying values to be returned: |
range |
range of values to be retained (vector of two numeric values). |
what_range |
character specifying what value |
what_sort |
character specifying according to which values should |
decreasing |
logical value specifying sort order. |
na.rm |
logical value indicating whether NA values should be stripped. |
... |
other arguments passed on to methods. Not currently used. |
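Melting, range filtering and sorting can be illustrated on a toy matrix in base R. The matrix values, the dimension names and the interpretation of range as an inclusive interval are assumptions made for this sketch. It is not the method's actual code.

```r
# Sketch: melt a small association matrix into a long data.frame, keep a
# range of values, and sort. Values and names here are made up.
local <- matrix(
  c(0.40, -0.10, NA, -0.25, 0.05, 0.30),
  nrow = 2,
  dimnames = list(a = c("a1", "a2"), b = c("b1", "b2", "b3"))
)

# One row per cell, with the cell coordinates as columns
long <- as.data.frame(as.table(local), responseName = "local")

# Drop missing values (na.rm), keep values inside 'range', sort decreasing
rng  <- c(0, 1)
long <- long[!is.na(long$local), ]
long <- long[long$local >= rng[1] & long$local <= rng[2], ]
long <- long[order(long$local, decreasing = TRUE), ]
```

The long layout gives one row per cell, which is what makes sorting and filtering by a single column straightforward.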
Estimates local (and global) association measures: Ducher's Z, Lewontin's D, pointwise mutual information, normalized pointwise mutual information and chi-squared residuals.
lassie(x, select, continuous, breaks, measure = "chisq", default_breaks = 4)
x |
data.frame or matrix. |
select |
optional vector of column numbers or column names specifying a subset of data to be used. By default, uses all columns. |
continuous |
optional vector of column numbers or column names specifying continuous variables that should be discretized. By default, assumes that every variable is categorical. |
breaks |
numeric vector or list passed on to |
measure |
name of measure to be used:
|
default_breaks |
default break points for discretizations.
Same syntax as in |
An instance of S3 class lassie with the following objects:
data: raw and preprocessed data.frames (see preprocess).
prob: probability arrays (see estimate_prob).
global: global association (see local_association).
local: local association arrays (see local_association).
lassie_params: parameters used in lassie.
Results can be visualized using the plot.lassie and print.lassie methods. plot.lassie is only available in the bivariate case and returns a tile plot representing the probability or local association measure matrix. print.lassie shows an array or a data.frame.
Results can be saved using write.lassie.
The permtest function assesses the significance of local and global association values using p-values estimated by permutations.
The chisqtest function assesses the significance in the case of two-dimensional chi-squared analysis.
# In this example, we will use the 'mtcars' dataset

# Selecting a subset of mtcars.
# Takes column names or numbers.
# If nothing was specified, all variables would have been used.
select <- c('mpg', 'cyl') # or select <- c(1, 2)

# Specifying 'mpg' as a continuous variable.
# Takes column names or numbers.
# If nothing was specified, all variables would have been treated as categorical.
continuous <- 'mpg' # or continuous <- 1

# How should breaks be specified?
# Specifying equal-width discretization with 5 bins for all continuous variables ('mpg')
# breaks <- 5

# Specifying user-defined breakpoints for all continuous variables.
# breaks <- c(10, 15, 25, 30)

# Same thing but only for 'mpg'.
# Here both notations are equivalent because 'mpg' is the only continuous variable.
# This notation is useful if you wish to specify different break points for different variables
# breaks <- list('mpg' = 5)
# breaks <- list('mpg' = c(10, 15, 25, 30))

# Calling lassie
# Not specifying breaks means that the value in default_breaks (4) will be used.
las <- lassie(mtcars, select = c(1, 2), continuous = 1)

# Print local association to console as an array
print(las)

# Print local association and probabilities
# Here only rows having a positive local association are printed
# The data.frame is also sorted by observed probability
print(las, type = 'df', range = c(0, 1), what_sort = 'obs')

# Plot results as heatmap
plot(las)

# Plot observed probabilities using different colors
plot(las, what_x = 'obs', low = 'white', mid = 'grey', high = 'black', text_colour = 'red')
Subroutine for lassie methods. Tries to retrieve a value from a lassie object and gives an error if the value does not exist.
lassie_get(x, what_x)
x |
|
what_x |
vector specifying values to be returned: |
Corresponding array contained in the lassie object.
las <- lassie(trees)
las_array <- lassie_get(las, 'local')
Subroutines called by lassie to compute local and global association measures from a list of probabilities.
local_association(x, measure = "chisq", nr = 1)
lewontin_d(x)
duchers_z(x)
pmi(x, normalize)
chisq(x, nr)
x |
list of probabilities as outputted by |
measure |
name of measure to be used:
|
nr |
number of rows/samples. Only used to estimate chi-squared residuals. |
normalize |
0 for pmi, 1 for npmi, 2 for npmi2 |
List containing the following values:
local: local association array (may contain NA, NaN and Inf values).
global: global association numeric value.
# This is what happens behind the curtains in the 'lassie' function
# Here we compute the association between the 'Girth' and 'Height' variables
# of the 'trees' dataset

# 'select' and 'continuous' take column numbers or names
select <- c('Girth', 'Height') # select subset of trees
continuous <- c(1, 2) # both 'Girth' and 'Height' are continuous

# equal-width discretization with 3 bins
breaks <- 3

# Preprocess data: subset, discretize and remove missing data
pre <- preprocess(trees, select, continuous, breaks)

# Estimates marginal and multivariate probabilities from preprocessed data.frame
prob <- estimate_prob(pre$pp)

# Computes local and global association using Ducher's Z
lam <- local_association(prob, measure = 'z')
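To make the measures concrete, the following base R sketch evaluates several of them on a hand-built 2 x 2 probability table using common textbook definitions. The package may differ in details such as log base, scaling or handling of zero cells, so treat the formulas as assumptions. The table, the sample size n and all variable names are made up, and Ducher's Z is left out.

```r
# Sketch of common textbook definitions for a 2 x 2 probability table;
# zebu may use different conventions (e.g. log base or scaling).
observed <- matrix(c(0.4, 0.1, 0.1, 0.4), nrow = 2,
                   dimnames = list(x = c("x1", "x2"), y = c("y1", "y2")))
expected <- outer(rowSums(observed), colSums(observed))
n <- 100  # assumed sample size, only needed for chi-squared residuals

# Lewontin's D: observed minus expected probability
d <- observed - expected

# Pointwise mutual information (log base 2 is an assumption here)
pmi <- log2(observed / expected)

# Normalized PMI, bounded between -1 and 1
npmi <- pmi / -log2(observed)

# Pearson chi-squared residuals on the count scale
chisq_res <- sqrt(n) * (observed - expected) / sqrt(expected)

# Global chi-squared statistic as the sum of squared residuals
chisq_global <- sum(chisq_res^2)
```

Cells with zero observed probability would give -Inf or NaN in the logarithmic measures, which is consistent with the note above that the local array may contain NA, NaN and Inf values.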
Permutation test: statistical significance of local and global association measures
permtest(x, nb = 1000L, group = as.list(colnames(x$data$pp)), p_adjust = "BH")
x |
|
nb |
number of resampling iterations. |
group |
list of column names specifying which columns should be permuted together. This is useful for the multivariate case, for example, when there are many dependent variables and one independent variable. By default, permutes all columns separately. |
p_adjust |
multiple testing correction method.
(see |
permtest returns an S3 object of class lassie and permtest.
Adds the following to the lassie object x:
global_p: global association p-value.
local_p: array of local association p-values.
global_perm: numeric global association values obtained with permutations.
local_perm: matrix of local association values obtained with permutations. Column numbers correspond to positions in the local association array after converting to numeric (e.g. local_perm[, 1] corresponds to local[1]).
perm_params: parameters used when calling permtest (nb and p_adjust).
# Calling lassie on cars dataset
las <- lassie(cars, continuous = colnames(cars))

# Permutation test with a low number of resamples to keep the example fast
permtest(las, nb = 30)
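The principle of a permutation p-value can be sketched in base R for a single global statistic. This is an illustration under stated assumptions, not permtest's code. The statistic is a plain Pearson chi-squared, one column is shuffled per iteration, and the p-value uses the common (1 + count) / (nb + 1) estimator. The helper chisq_stat is hypothetical.

```r
# Sketch of a permutation test for one global statistic (here the Pearson
# chi-squared statistic); not zebu's implementation.
set.seed(1)  # make the resampling reproducible

x <- cut(cars$speed, breaks = 4)
y <- cut(cars$dist,  breaks = 4)

# Hypothetical helper: statistic from a contingency table of two factors
chisq_stat <- function(a, b) {
  obs  <- table(a, b)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expd)^2 / expd)
}

observed_stat <- chisq_stat(x, y)

# Shuffling one column breaks any association while keeping the margins
nb <- 200
perm_stats <- replicate(nb, chisq_stat(x, sample(y)))

# Share of permutations at least as extreme; +1 avoids a p-value of zero
global_p <- (1 + sum(perm_stats >= observed_stat)) / (nb + 1)
```

Applying the same comparison cell by cell, instead of to one global number, yields one p-value per cell that can then be corrected for multiple testing.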
Plots a lassie object as a tile plot using the ggplot2 package. Only available for bivariate association.
## S3 method for class 'lassie'
plot(
  x,
  what_x = "local",
  digits = 3,
  low = "royalblue",
  mid = "gainsboro",
  high = "firebrick",
  na = "purple",
  text_colour = "black",
  text_size,
  limits,
  midpoint,
  ...
)
x |
|
what_x |
vector specifying values to be returned: |
digits |
integer indicating the number of decimal places. |
low |
colour for low end of the gradient. |
mid |
colour for midpoint of the gradient. |
high |
colour for high end of the gradient. |
na |
colour for NA values. |
text_colour |
colour of text inside cells. |
text_size |
integer indicating text size inside cells. |
limits |
limits of gradient. |
midpoint |
midpoint of gradient. |
... |
other arguments passed on to methods. Not currently used. |
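A tile plot with a diverging colour gradient and cell labels can be built directly with ggplot2, which may help when customizing beyond these arguments. The sketch below uses made-up data. How it maps the arguments above onto ggplot2 calls is an assumption, not a description of the method's internals. It only draws the plot if ggplot2 is installed.

```r
# Sketch of a tile plot similar in spirit to plot.lassie; the data are made
# up and the mapping to ggplot2 calls is an assumption.
cells <- expand.grid(x = c("x1", "x2"), y = c("y1", "y2", "y3"))
cells$value <- c(0.40, -0.10, NA, -0.25, 0.05, 0.30)
cells$label <- format(round(cells$value, digits = 3))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cells, aes(x = x, y = y, fill = value)) +
    geom_tile() +
    geom_text(aes(label = label), colour = "black") +
    scale_fill_gradient2(low = "royalblue", mid = "gainsboro",
                         high = "firebrick", na.value = "purple",
                         midpoint = 0, limits = c(-1, 1))
  print(p)
}
```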
Subroutine called by lassie. Discretizes, subsets and removes missing data from a data.frame.
preprocess(x, select, continuous, breaks, default_breaks = 4)
x |
data.frame or matrix. |
select |
optional vector of column numbers or column names specifying a subset of data to be used. By default, uses all columns. |
continuous |
optional vector of column numbers or column names specifying continuous variables that should be discretized. By default, assumes that every variable is categorical. |
breaks |
numeric vector or list passed on to |
default_breaks |
default break points for discretizations.
Same syntax as in |
List containing the following values:
raw: raw subsetted data.frame
pp: discretized, subsetted and complete data.frame
select
continuous
breaks
default_breaks
# This is what happens behind the curtains in the 'lassie' function
# Here we compute the association between the 'Girth' and 'Height' variables
# of the 'trees' dataset

# 'select' and 'continuous' take column numbers or names
select <- c('Girth', 'Height') # select subset of trees
continuous <- c(1, 2) # both 'Girth' and 'Height' are continuous

# equal-width discretization with 3 bins
breaks <- 3

# Preprocess data: subset, discretize and remove missing data
pre <- preprocess(trees, select, continuous, breaks)

# Estimates marginal and multivariate probabilities from preprocessed data.frame
prob <- estimate_prob(pre$pp)

# Computes local and global association using Ducher's Z
lam <- local_association(prob, measure = 'z')
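The three steps named in the description (subset, discretize, drop incomplete rows) can be mimicked in base R. The sketch below uses the built-in airquality dataset because it contains missing values. The step order and the use of cut for equal-width bins are assumptions, and the code does not call preprocess itself.

```r
# Sketch of the three preprocessing steps in base R; illustrative only.
# 'airquality' is used because it contains missing values.
select     <- c("Ozone", "Temp", "Month")
continuous <- c("Ozone", "Temp")
breaks     <- 3  # equal-width discretization into 3 bins

# 1. Subset the columns of interest
raw <- airquality[, select]

# 2. Discretize continuous columns; treat the rest as categorical
pp <- raw
pp[continuous] <- lapply(raw[continuous], cut, breaks = breaks)
pp$Month <- factor(pp$Month)

# 3. Keep complete rows only
pp <- pp[complete.cases(pp), ]
```

Here raw plays the role of the subsetted data.frame and pp that of the discretized, complete one.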
Prints a lassie object as an array or a data.frame.
## S3 method for class 'lassie'
print(x, type, what_x, range, what_range, what_sort, decreasing, na.rm, ...)
x |
|
type |
print style: 'array' for array or 'df' for data.frame. |
what_x |
vector specifying values to be returned: |
range |
range of values to be retained (vector of two numeric values). |
what_range |
character specifying what value |
what_sort |
character specifying according to which values should |
decreasing |
logical value specifying sort order. |
na.rm |
logical value indicating whether NA values should be stripped. |
... |
other arguments passed on to methods. Not currently used. |
Writes a lassie object to a file in a table-structured format.
write.lassie(
  x,
  file,
  sep = ",",
  dec = ".",
  col.names = TRUE,
  row.names = FALSE,
  quote = TRUE,
  ...
)
x |
|
file |
character string naming a file. |
sep |
the field separator string. Values within each row of
|
dec |
the string to use for decimal points in numeric or complex columns: must be a single character. |
col.names |
either a logical value indicating whether the column
names of |
row.names |
either a logical value indicating whether the row
names of |
quote |
a logical value ( |
... |
other arguments passed on to write.table. |
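Since the remaining arguments are passed on to write.table, the defaults in the usage above correspond to an ordinary comma-separated file. The base R sketch below writes a made-up long-format table with those settings and reads it back. The data frame and the temporary file path are illustrative.

```r
# Sketch: write a long-format table with the same defaults as the usage
# above, then read it back. The data frame is made up.
long <- data.frame(
  x = c("x1", "x2"),
  y = c("y1", "y1"),
  local = c(0.4, -0.1)
)

path <- tempfile(fileext = ".csv")
write.table(long, file = path, sep = ",", dec = ".",
            col.names = TRUE, row.names = FALSE, quote = TRUE)

# Read the file back to check the round trip
back <- read.csv(path, stringsAsFactors = FALSE)
unlink(path)
```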
The zebu package implements the estimation of local (and global) association measures: Ducher's Z, Lewontin's D, pointwise mutual information, normalized pointwise mutual information and chi-squared residuals. The significance of local (and global) association is assessed using p-values estimated by permutations.
lassie estimates local (and global) association measures: Ducher's Z, Lewontin's D, pointwise mutual information, normalized pointwise mutual information and chi-squared residuals.
permtest assesses the significance of local (and global) association values using p-values estimated by permutations.
chisqtest assesses the significance for two-dimensional chi-squared analysis.
Maintainer: Olivier M. F. Martin [email protected]
Authors:
Michel Ducher [email protected]
Useful links:
Report bugs at https://github.com/oliviermfmartin/zebu/issues