Title: | Combine Categories Using Greenacre's Method |
---|---|
Description: | Implements a method of iteratively collapsing the rows of a contingency table, two at a time, by selecting the pair of categories whose combination yields a new table with the smallest loss of chi-squared, as described by Greenacre, M.J. (1988) <doi:10.1007/BF01901670>. The result is compatible with the class of object returned by the 'stats' package's hclust() function and can be used similarly (plotted as a dendrogram, cut, etc.). Additional functions are provided for automatic cutting and diagnostic plotting. |
Authors: | Jeff Jetton [aut, cre] |
Maintainer: | Jeff Jetton <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.1 |
Built: | 2025-03-12 03:28:38 UTC |
Source: | https://github.com/jeffjetton/greenclust |
Maps a vector of cluster numbers to another categorical vector,
yielding a new vector of matching cluster numbers. Useful for distributing
cluster numbers back out to the original observations in cases where the
clustering was performed on a table of unique levels rather than directly
on the observations (such as with greenclust
).
assign.cluster(x, clusters, impute = FALSE)
assign.cluster(x, clusters, impute = FALSE)
x |
a factor or character vector representing a categorical variable |
clusters |
a named numeric vector of cluster numbers, such as an
object returned by |
impute |
a boolean controlling the behavior when a value in |
Any categories in x
that do not exist in names(clusters)
are given a cluster of NA
, or (if impute
is TRUE
)
assigned the cluster number that is most-frequently used for the other
existing categories, with ties going to the lowest cluster number. If
there are no matching clusters for any of the categories in x
,
imputation will simply use the first cluster number in clusters
.
If there are duplicate names in clusters
, the first occurrence
takes precedence.
A factor vector of the same length as x
, representing
assigned cluster numbers.
greenclust
, greencut
,
greenplot
# Cluster feed types based on number of "underweight" chicks grc <- greenclust(table(chickwts$feed, ifelse(chickwts$weight < 200, "Y", "N"))) # Assign clusters to each original observation feed.clustered <- assign.cluster(chickwts$feed, greencut(grc)) table(chickwts$feed, feed.clustered)
# Cluster feed types based on number of "underweight" chicks grc <- greenclust(table(chickwts$feed, ifelse(chickwts$weight < 200, "Y", "N"))) # Assign clusters to each original observation feed.clustered <- assign.cluster(chickwts$feed, greencut(grc)) table(chickwts$feed, feed.clustered)
Iteratively collapses the rows of a table (typically a contingency table) by selecting the pair of rows each time whose combination creates the smalled loss of chi-squared.
greenclust(x, correct = FALSE, verbose = FALSE)
greenclust(x, correct = FALSE, verbose = FALSE)
x |
a numeric matrix or data frame |
correct |
a logical indicating whether to apply a continuity correction if and when the clustered table reaches a 2x2 dimension. |
verbose |
if TRUE, prints the clustered table along with r-squared and p-value at each step |
An object of class greenclust
which is compatible with most
hclust
object functions, such as plot()
and
rect.hclust()
. The height vector represents the proportion
of chi-squared, relative to the original table, seen at each clustering
step. The greenclust object also includes a vector for the chi-squared
test p-value at each step and a boolean vector indicating whether the
step had a tie for "winner".
Greenacre, M.J. (1988) "Clustering the Rows and Columns of a Contingency Table," Journal of Classification 5, 39-51. doi:10.1007/BF01901670
greencut
, greenplot
,
assign.cluster
# Combine Titanic passenger attributes into a single category tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) # Remove rows with all zeros tab <- tab[apply(tab, 1, sum) > 0, ] # Perform clustering on contingency table grc <- greenclust(tab) # Plot r-squared and p-values for each potential cut point greenplot(grc) # Get clusters at suggested cut point clusters <- greencut(grc) # Plot dendrogram with clusters marked plot(grc) rect.hclust(grc, max(clusters))
# Combine Titanic passenger attributes into a single category tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) # Remove rows with all zeros tab <- tab[apply(tab, 1, sum) > 0, ] # Perform clustering on contingency table grc <- greenclust(tab) # Plot r-squared and p-values for each potential cut point greenplot(grc) # Get clusters at suggested cut point clusters <- greencut(grc) # Plot dendrogram with clusters marked plot(grc) rect.hclust(grc, max(clusters))
Cuts a greenclust
tree at an automatically-determined number
of groups.
greencut(g, k = NULL, h = NULL)
greencut(g, k = NULL, h = NULL)
g |
a tree as producted by |
k |
an integer scalar with the desired number of groups |
h |
numeric scalar with the desired height where the tree should be cut |
The cut point is calculated by finding the number of groups/clusters that results in a collapsed contingency table with the most-significant (lowest p-value) chi-squared test. If there are ties, the smallest number of groups wins.
If a certain number of groups is required or a specific r-squared
(1 - height) threshold is targeted, values for either k
or h
may be provided. (While the regular cutree
function could
also be used in this circumstance, it may still be useful to have the
additional attributes that greencut()
provides.)
As with cutree()
, k
overrides h
if both are given.
greencut
returns a vector of group memberships, with the
resulting r-squared value and p-value as object attributes,
accessable via attr
.
Greenacre, M.J. (1988) "Clustering the Rows and Columns of a Contingency Table," Journal of Classification 5, 39-51. doi:10.1007/BF01901670
greenclust
, greenplot
,
assign.cluster
# Combine Titanic passenger attributes into a single category # and create a contingency table for the non-zero levels tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) tab <- tab[apply(tab, 1, sum) > 0, ] grc <- greenclust(tab) greencut(grc) plot(grc) rect.hclust(grc, max(greencut(grc)), border=unique(greencut(grc))+1)
# Combine Titanic passenger attributes into a single category # and create a contingency table for the non-zero levels tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) tab <- tab[apply(tab, 1, sum) > 0, ] grc <- greenclust(tab) greencut(grc) plot(grc) rect.hclust(grc, max(greencut(grc)), border=unique(greencut(grc))+1)
Displays a connected scatterplot showing the r-squared values (x-axis) and
p-values (y-axis) at each clustering step of a greenclust
object. Points are labeled with their cutpoints, i.e., the number of
groups/clusters found at each step. The point with the lowest p-value
(typically the optimal cutpoint) is highlighted.
greenplot( g, type = "b", bg = "gray75", pch = 21, cex = 1, optim.col = "red", pos = 2, main = "P-Value vs. R-Squared for Num. Clusters", xlab = "r-squared", ylab = NULL, ... )
greenplot( g, type = "b", bg = "gray75", pch = 21, cex = 1, optim.col = "red", pos = 2, main = "P-Value vs. R-Squared for Num. Clusters", xlab = "r-squared", ylab = NULL, ... )
g |
an object of the type produced by |
type |
1-character string giving the type of plot desired: "p" for points, "l" for lines, and "b" (the default) for both points and lines. |
bg |
a vector of background colors for open plot symbols. Also used for the line color if type is "b". |
pch |
a vector of plotting characters or symbols: see
|
cex |
a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default. For this plot, the numeric labels on each point are always scaled to 0.80 of this value. |
optim.col |
color to use for highlighting the "optimal" cutpoint. |
pos |
specifies the position of labels relative to their points: 1 = below, 2 = left, 3 = above, and 4 = right. |
main |
an overall title for the plot. |
xlab |
a title for the x axis. |
ylab |
a title for the y axis. |
... |
additional arguments to be passed to the plotting methods. |
Greenacre, M.J. (1988) "Clustering the Rows and Columns of a Contingency Table," Journal of Classification 5, 39-51. doi:10.1007/BF01901670
greenclust
, greencut
,
assign.cluster
# Combine Titanic passenger attributes into a single category # and create a contingency table for the non-zero levels tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) tab <- tab[apply(tab, 1, sum) > 0, ] grc <- greenclust(tab) greenplot(grc) # Plot using custom graphical parameters greenplot(grc, type="p", bg="lightblue", optim.col="darkorange", pos=3, bty="n", cex.main=2, col.main="blue")
# Combine Titanic passenger attributes into a single category # and create a contingency table for the non-zero levels tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum))) tab <- tab[apply(tab, 1, sum) > 0, ] grc <- greenclust(tab) greenplot(grc) # Plot using custom graphical parameters greenplot(grc, type="p", bg="lightblue", optim.col="darkorange", pos=3, bty="n", cex.main=2, col.main="blue")