This example demonstrates that a good partitioning method can greatly improve a visualization.
I conducted a bibliometric study in which I analyzed country-to-country citation preference. In the analysis, the preference of a country \(A\) for citing a country \(B\) is measured quantitatively by a new metric, “citation enrichment”: a positive value means \(A\) cites \(B\) more often than expected at random, and a negative value means \(A\) cites \(B\) less often than expected at random.
The dataset is shown below. The citation enrichment scores are in the
column "log2_fc".
df = readRDS("datasets/citation_data.rds")
head(df)
## country_cited country_citing citations total_cited total_citing
## 5087 Egypt Peru 100 211882 36458
## 9940 Jordan Hungary 100 30158 512011
## 9985 Jordan Norway 100 30158 733995
## 13960 Nigeria New Zealand 100 54145 481066
## 16750 Saudi Arabia Peru 100 162945 36458
## 1298 Bangladesh Tanzania 101 42835 32257
## global_citations p_cited p_citing p_global_cited p_global_citing
## 5087 187015976 0.0027428822 0.0004719608 0.0011329620 0.0001949459
## 9940 187015976 0.0001953083 0.0033158698 0.0001612590 0.0027377928
## 9985 187015976 0.0001362407 0.0033158698 0.0001612590 0.0039247716
## 13960 187015976 0.0002078717 0.0018468926 0.0002895207 0.0025723257
## 16750 187015976 0.0027428822 0.0006137040 0.0008712892 0.0001949459
## 1298 187015976 0.0031311033 0.0023578849 0.0002290446 0.0001724826
## expected var_hyper p_hyper log2_fc z_score median
## 5087 41.305530 41.250689 7.843743e-15 1.2755932 9.138640 0.8327636
## 9940 82.566356 82.327029 3.397938e-02 0.2763741 1.921396 0.8327636
## 9985 118.363263 117.879703 9.617212e-01 -0.2432214 -1.691338 0.8327636
## 13960 139.278575 138.880085 9.998026e-01 -0.4779733 -3.333003 0.8327636
## 16750 31.765462 31.731598 3.625450e-22 1.6544691 12.113183 0.8327636
## 1298 7.388292 7.385325 2.951501e-76 3.7729706 34.446520 0.8327636
## mad mean sd log2_fc_scaled
## 5087 1.158198 0.9404198 1.191347 0.2813399
## 9940 1.158198 0.9404198 1.191347 -0.5573908
## 9985 1.158198 0.9404198 1.191347 -0.9935320
## 13960 1.158198 0.9404198 1.191347 -1.1905795
## 16750 1.158198 0.9404198 1.191347 0.5993631
## 1298 1.557938 0.9432873 1.374163 2.0592053
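The printed columns hint at how the score is put together. As a rough check (an inference from the rows printed above, not the study's actual code), `expected` appears to equal `total_cited * total_citing / global_citations`, `log2_fc` appears to be the log2 ratio of observed to expected citations, and `z_score` appears to standardize the difference using `var_hyper`. The sketch below retypes three of the printed rows into a hypothetical data frame `chk` and recomputes these values:

```r
# Three rows retyped from the head(df) output (assumption: values as printed).
# The names chk, expected_chk, log2_fc_chk and z_chk are hypothetical.
chk = data.frame(
    citations        = c(100, 100, 100),
    total_cited      = c(211882, 30158, 30158),
    total_citing     = c(36458, 512011, 733995),
    global_citations = 187015976,
    expected         = c(41.305530, 82.566356, 118.363263),
    var_hyper        = c(41.250689, 82.327029, 117.879703),
    log2_fc          = c(1.2755932, 0.2763741, -0.2432214),
    z_score          = c(9.138640, 1.921396, -1.691338)
)

# Expected citations if citing were proportional to overall volumes
expected_chk = with(chk, total_cited * total_citing / global_citations)
# Enrichment as log2(observed / expected)
log2_fc_chk = with(chk, log2(citations / expected))
# Standardized difference between observed and expected
z_chk = with(chk, (citations - expected) / sqrt(var_hyper))

# Compare the recomputed values with the printed ones side by side
data.frame(expected = chk$expected, expected_chk,
           log2_fc = chk$log2_fc, log2_fc_chk,
           z_score = chk$z_score, z_chk)
```

If the recomputed columns agree with the printed ones, a score of 1 can be read as roughly twice as many citations as expected, and a score of -1 as roughly half.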
Find an efficient partitioning method.
Since we will use a heatmap for visualization, we first convert the
data into a matrix. Note that not every pair of countries has enough
citations for calculating the citation enrichment score, so there are
missing values in m. xtabs() fills these
missing values with its default value of zero.
m is internally stored as a matrix, but we explicitly
change its “class label” to matrix.
m = xtabs(log2_fc ~ country_cited + country_citing, data = df)
class(m) = "matrix"
dim(m)
## [1] 77 77
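To see the zero-filling behaviour in isolation, here is a minimal sketch on a made-up three-row data frame (`toy` and `m_toy` are hypothetical names, and the values are invented for illustration). The pair with no row in the data ends up as 0 in the matrix:

```r
# Toy long-format data: the pair (cited = "B", citing = "B") is absent
toy = data.frame(
    country_cited  = c("A", "A", "B"),
    country_citing = c("B", "C", "C"),
    log2_fc        = c(1.5, -0.5, 2)
)

# Same conversion as for the real data
m_toy = xtabs(log2_fc ~ country_cited + country_citing, data = toy)
class(m_toy) = "matrix"

m_toy            # the absent pair shows up as 0
is.matrix(m_toy)
```

One consequence is that a zero in the matrix cannot be told apart from a genuine score of exactly zero, which is why zeros are treated as missing values later on.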
Now that we have the matrix, let’s visualize it directly as a heatmap.
library(ComplexHeatmap)
Heatmap(m)
The heatmap already shows some degree of “cluster pattern”.
To make the missing values stand out visually, we replace zeros
with NA:
m_with_na = m
m_with_na[m_with_na == 0] = NA
Heatmap(m_with_na, na_col = "grey")
To keep the task simple for this quick demonstration, we use
m, where missing values are filled with zeros. We can try
splitting the dendrograms or performing k-means clustering to partition
the countries.
Heatmap(m, row_split = 2, column_split = 2)
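The two options mentioned above can be sketched in base R on a toy matrix. This is only an approximation of what happens inside the heatmap: it assumes Euclidean distance with complete linkage for the dendrogram, and the names `toy_mat`, `hc_groups` and `km_groups` are hypothetical. In ComplexHeatmap, a numeric `row_split` cuts the row dendrogram, while k-means is requested with the `row_km` and `column_km` arguments.

```r
# Toy matrix with two well-separated groups of rows (made-up data)
set.seed(42)
toy_mat = rbind(matrix(rnorm(20, mean = 0), nrow = 5),
                matrix(rnorm(20, mean = 5), nrow = 5))
rownames(toy_mat) = paste0("row", 1:10)

# Option 1: cut a hierarchical clustering tree into two groups
hc_groups = cutree(hclust(dist(toy_mat), method = "complete"), k = 2)

# Option 2: k-means with two centers
km_groups = kmeans(toy_mat, centers = 2)$cluster

# Cross-tabulate the two partitions to see how far they agree
table(hc_groups, km_groups)

# The k-means counterpart of the heatmap call would be:
# Heatmap(m, row_km = 2, column_km = 2)
```

Both approaches need the number of groups to be chosen in advance, and both work on distances between whole rows or columns of the matrix.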
Alternatively, we can apply UMAP or a similar dimension reduction method to see how close the countries are to each other.
library(cola)
loc = dimension_reduction(t(m), method = "UMAP")
loc = as.data.frame(loc)[, 1:2]
colnames(loc) = c("x", "y")
loc$label = rownames(m)
library(ggplot2)
library(ggrepel)
ggplot(loc, aes(x = x, y = y, label = label)) +
    geom_point() +
    geom_text_repel(max.overlaps = Inf, size = 3, show.legend = FALSE) +
    labs(x = "Dim 1", y = "Dim 2")
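As one example of a “similar analysis”, the same kind of two-column coordinate table can be produced with PCA from base R. This is a sketch on made-up data, not the analysis used in the study, and `toy_mat`, `pca` and `loc_toy` are hypothetical names. Note that `prcomp()` treats rows as observations, so a matrix with one row per country can be passed in directly.

```r
# Toy matrix: 10 "countries" (rows) described by 4 features (made-up data)
set.seed(1)
toy_mat = rbind(matrix(rnorm(20, mean = 0), nrow = 5),
                matrix(rnorm(20, mean = 3), nrow = 5))
rownames(toy_mat) = paste0("country", 1:10)

# PCA on the rows; keep the first two principal components as coordinates
pca = prcomp(toy_mat)
loc_toy = data.frame(x = pca$x[, 1], y = pca$x[, 2],
                     label = rownames(toy_mat))
head(loc_toy)
```

The resulting data frame has the same `x`, `y` and `label` columns as `loc`, so it can be passed to the same ggplot2 code.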
A better way is to use a graph clustering method. We convert the matrix to a graph and apply the Louvain method, taking the citation enrichment scores as the edge weights. Negative scores are first set to zero, and a weight of zero means there is no link between the two nodes in the graph.
m2 = m
m2[m2 < 0] = 0
library(igraph)
set.seed(123)
g = graph_from_adjacency_matrix(m2, mode = "plus", weighted = TRUE)
cm = cluster_louvain(g, weights = E(g)$weight, resolution = 1.2)
communities(cm)
## $`1`
## [1] "Argentina" "Brazil" "Chile" "Colombia" "Mexico" "Peru"
## [7] "Uruguay"
##
## $`2`
## [1] "Australia" "Austria" "Belgium" "Canada"
## [5] "Denmark" "Estonia" "Finland" "France"
## [9] "Germany" "Iceland" "Ireland" "Israel"
## [13] "Jamaica" "Luxembourg" "Netherlands" "New Zealand"
## [17] "Norway" "Sweden" "Switzerland" "United Kingdom"
## [21] "United States"
##
## $`3`
## [1] "Bangladesh" "Cameroon" "Ethiopia" "Ghana" "Kenya"
## [6] "Nigeria" "South Africa" "Tanzania" "Uganda"
##
## $`4`
## [1] "Bulgaria" "Croatia" "Czech Republic" "Greece"
## [5] "Hungary" "Italy" "Lithuania" "Poland"
## [9] "Portugal" "Romania" "Russia" "Serbia"
## [13] "Slovakia" "Slovenia" "Spain" "Ukraine"
##
## $`5`
## [1] "China" "Hong Kong" "India" "Indonesia" "Japan"
## [6] "Malaysia" "Singapore" "South Korea" "Sri Lanka" "Taiwan"
## [11] "Thailand" "Vietnam"
##
## $`6`
## [1] "Egypt" "Iran" "Jordan"
## [4] "Kuwait" "Lebanon" "Morocco"
## [7] "Pakistan" "Qatar" "Saudi Arabia"
## [10] "Tunisia" "Turkey" "United Arab Emirates"
mem = membership(cm)
Heatmap(m, row_split = mem, column_split = mem)
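The graph construction can be followed step by step on a small made-up matrix. In this sketch (`w`, `w_pos`, `g_toy`, `cm_toy` and `mem_toy` are hypothetical names, and the weights are invented), two tight groups of three nodes are joined by a single weak link. With `mode = "plus"`, the graph is undirected and each edge weight is the sum of the two directed scores.

```r
library(igraph)

# Toy 6 x 6 score matrix: strong scores within {A, B, C} and within {D, E, F}
w = matrix(0, nrow = 6, ncol = 6,
           dimnames = list(LETTERS[1:6], LETTERS[1:6]))
w[1:3, 1:3] = 2
w[4:6, 4:6] = 2
diag(w) = 0
w["C", "D"] = -1    # negative score, removed in the next step
w["D", "C"] = 0.1   # weak positive link between the two groups

# Keep only positive scores; a zero means no edge
w_pos = w
w_pos[w_pos < 0] = 0

# Undirected weighted graph; weight of i--j is w_pos[i, j] + w_pos[j, i]
g_toy = graph_from_adjacency_matrix(w_pos, mode = "plus", weighted = TRUE)

set.seed(123)
cm_toy = cluster_louvain(g_toy, weights = E(g_toy)$weight)
mem_toy = setNames(as.vector(membership(cm_toy)), V(g_toy)$name)
mem_toy
```

Unlike splitting a dendrogram, this approach does not require the number of groups to be fixed beforehand; in the real analysis, the `resolution` argument controls how fine the partition is.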
The original plot from the paper is shown below:
Using a similar method, I also found local groups within the two European groups: