Purpose

To demonstrate that a good partitioning method can greatly improve a visualization.

Question

I conducted a bibliometric study in which I analyzed country-to-country citation preference. In the analysis, the preference of a country \(A\) for citing a country \(B\) is quantified by a new metric, “citation enrichment”: a positive value means \(A\) cites \(B\) more often than expected by chance, and a negative value means \(A\) cites \(B\) less often than expected by chance.

The dataset is shown below. The citation enrichment scores are in the column "log2_fc".

df = readRDS("datasets/citation_data.rds")
head(df)
##       country_cited country_citing citations total_cited total_citing
## 5087          Egypt           Peru       100      211882        36458
## 9940         Jordan        Hungary       100       30158       512011
## 9985         Jordan         Norway       100       30158       733995
## 13960       Nigeria    New Zealand       100       54145       481066
## 16750  Saudi Arabia           Peru       100      162945        36458
## 1298     Bangladesh       Tanzania       101       42835        32257
##       global_citations      p_cited     p_citing p_global_cited p_global_citing
## 5087         187015976 0.0027428822 0.0004719608   0.0011329620    0.0001949459
## 9940         187015976 0.0001953083 0.0033158698   0.0001612590    0.0027377928
## 9985         187015976 0.0001362407 0.0033158698   0.0001612590    0.0039247716
## 13960        187015976 0.0002078717 0.0018468926   0.0002895207    0.0025723257
## 16750        187015976 0.0027428822 0.0006137040   0.0008712892    0.0001949459
## 1298         187015976 0.0031311033 0.0023578849   0.0002290446    0.0001724826
##         expected  var_hyper      p_hyper    log2_fc   z_score    median
## 5087   41.305530  41.250689 7.843743e-15  1.2755932  9.138640 0.8327636
## 9940   82.566356  82.327029 3.397938e-02  0.2763741  1.921396 0.8327636
## 9985  118.363263 117.879703 9.617212e-01 -0.2432214 -1.691338 0.8327636
## 13960 139.278575 138.880085 9.998026e-01 -0.4779733 -3.333003 0.8327636
## 16750  31.765462  31.731598 3.625450e-22  1.6544691 12.113183 0.8327636
## 1298    7.388292   7.385325 2.951501e-76  3.7729706 34.446520 0.8327636
##            mad      mean       sd log2_fc_scaled
## 5087  1.158198 0.9404198 1.191347      0.2813399
## 9940  1.158198 0.9404198 1.191347     -0.5573908
## 9985  1.158198 0.9404198 1.191347     -0.9935320
## 13960 1.158198 0.9404198 1.191347     -1.1905795
## 16750 1.158198 0.9404198 1.191347      0.5993631
## 1298  1.557938 0.9432873 1.374163      2.0592053
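
The printed columns appear to be linked through a hypergeometric null model. The sketch below recomputes expected, var_hyper, log2_fc, z_score and log2_fc_scaled for the first printed row (Egypt cited by Peru) from its raw counts. The formulas are inferred from the printed values rather than stated in the text, so treat them as an assumption about how the columns were derived.

```r
# Values copied from the first printed row (Egypt cited by Peru).
citations        = 100
total_cited      = 211882      # presumably all citations received by the cited country
total_citing     = 36458       # presumably all citations made by the citing country
global_citations = 187015976

# Assumed null model: citations are distributed at random (hypergeometric).
p_cited   = total_cited / global_citations
expected  = total_citing * p_cited
var_hyper = expected * (1 - p_cited) *
    (global_citations - total_citing) / (global_citations - 1)

# Enrichment as a log2 fold change of observed over expected citations.
log2_fc = log2(citations / expected)
z_score = (citations - expected) / sqrt(var_hyper)

# Scaling with the mean and sd columns of the same row.
log2_fc_scaled = (log2_fc - 0.9404198) / 1.191347

round(c(expected = expected, var_hyper = var_hyper, log2_fc = log2_fc,
        z_score = z_score, log2_fc_scaled = log2_fc_scaled), 4)
```

For this row the results agree with the printed expected, var_hyper, log2_fc, z_score and log2_fc_scaled values, which supports reading log2_fc as the log2 ratio of observed to expected citations.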

Strategy

Find an efficient partitioning method.

Analysis

Since we will use a heatmap for visualization, we first convert the data into a matrix. Note that not every pair of countries has enough citations to calculate the citation enrichment score, so m contains missing values. xtabs() fills these missing values with a default value of zero.

m is internally stored as a matrix, but we explicitly change its “class label” to matrix.

m = xtabs(log2_fc ~ country_cited + country_citing, data = df)
class(m) = "matrix"
dim(m)
## [1] 77 77

Now that we have the matrix, let’s visualize it directly as a heatmap.

library(ComplexHeatmap)
Heatmap(m)

The heatmap already shows some degree of “cluster patterns”.

To make the missing values stand out visually, we replace zeros with NA:

m_with_na = m
m_with_na[m_with_na == 0] = NA
Heatmap(m_with_na, na_col = "grey")
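
The zero-filling by xtabs() and the NA replacement can be checked on a small table. The sketch below uses a hypothetical three-row data frame (the names toy, m_toy and m_toy_na are made up for illustration) in which one country pair is absent.

```r
# Hypothetical toy data: the pair (cited = "B", citing = "A") is absent.
toy = data.frame(
    country_cited  = c("A", "A", "B"),
    country_citing = c("A", "B", "B"),
    log2_fc        = c(1.5, -0.5, 2)
)

m_toy = xtabs(log2_fc ~ country_cited + country_citing, data = toy)
class(m_toy) = "matrix"   # drop the "xtabs"/"table" classes
m_toy["B", "A"]           # filled with 0 although no score was observed

# Mark the unobserved pair as NA.
# Caveat: a pair with a genuine score of exactly 0 would be masked as well.
m_toy_na = m_toy
m_toy_na[m_toy_na == 0] = NA
m_toy_na
```

Replacing zeros by NA is therefore a convenient shortcut here, on the assumption that an observed enrichment score is practically never exactly zero.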

To keep the task simple for this quick demonstration, we use m, where missing values are filled with zeros. We can try splitting the dendrograms or performing k-means clustering to partition the countries.

Heatmap(m, row_split = 2, column_split = 2)
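
In ComplexHeatmap, a numeric row_split or column_split cuts the corresponding dendrogram into that many groups, while k-means partitioning is requested through the row_km and column_km arguments. The base-R sketch below illustrates the two ideas on a hypothetical toy matrix; it assumes Euclidean distance with complete linkage for the hierarchical clustering and is not a reproduction of what Heatmap() does internally.

```r
# Hypothetical toy matrix with two well-separated groups of rows.
toy = rbind(
    r1 = c(0.0, 0.1, 0.2), r2 = c(0.1, 0.0, 0.1), r3 = c(0.2, 0.1, 0.0),
    r4 = c(5.0, 5.1, 5.2), r5 = c(5.1, 5.0, 5.1), r6 = c(5.2, 5.1, 5.0)
)

# Dendrogram splitting: hierarchical clustering, then cut into 2 groups.
hc_groups = cutree(hclust(dist(toy), method = "complete"), k = 2)

# k-means partitioning into 2 groups (random starts, so fix the seed).
set.seed(123)
km_groups = kmeans(toy, centers = 2, nstart = 10)$cluster

# Cross-tabulate the two partitions; the group labels themselves are arbitrary.
table(hclust = hc_groups, kmeans = km_groups)
```

On such clean data both approaches recover the same two groups. On the citation matrix they only see row or column profiles, not the pairwise link structure.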

Alternatively, we can apply UMAP or a similar dimension-reduction method to see how close the countries are to each other.

library(cola)

loc = dimension_reduction(t(m), method = "UMAP")
loc = as.data.frame(loc)[, 1:2]
colnames(loc) = c("x", "y")
loc$label = rownames(m)
library(ggplot2)
library(ggrepel)
ggplot(loc, aes(x = x, y = y, label = label)) + 
    geom_point() + geom_text_repel(max.overlaps = Inf, size = 3, show.legend = FALSE) +
    labs(x = "Dim 1", y = "Dim 2")

A better way is to use a graph-clustering method. We convert the matrix to a graph and apply the Louvain method, taking the citation enrichment scores as the edge weights. Negative scores are first set to zero, and a weight of zero means there is no link between the two nodes in the graph.

m2 = m
m2[m2 < 0] = 0  # keep only positive enrichment as edge weights
library(igraph)

set.seed(123)  # fix the seed so the partition is reproducible
g = graph_from_adjacency_matrix(m2, mode = "plus", weighted = TRUE)
cm = cluster_louvain(g, weights = E(g)$weight, resolution = 1.2)
communities(cm)
## $`1`
## [1] "Argentina" "Brazil"    "Chile"     "Colombia"  "Mexico"    "Peru"     
## [7] "Uruguay"  
## 
## $`2`
##  [1] "Australia"      "Austria"        "Belgium"        "Canada"        
##  [5] "Denmark"        "Estonia"        "Finland"        "France"        
##  [9] "Germany"        "Iceland"        "Ireland"        "Israel"        
## [13] "Jamaica"        "Luxembourg"     "Netherlands"    "New Zealand"   
## [17] "Norway"         "Sweden"         "Switzerland"    "United Kingdom"
## [21] "United States" 
## 
## $`3`
## [1] "Bangladesh"   "Cameroon"     "Ethiopia"     "Ghana"        "Kenya"       
## [6] "Nigeria"      "South Africa" "Tanzania"     "Uganda"      
## 
## $`4`
##  [1] "Bulgaria"       "Croatia"        "Czech Republic" "Greece"        
##  [5] "Hungary"        "Italy"          "Lithuania"      "Poland"        
##  [9] "Portugal"       "Romania"        "Russia"         "Serbia"        
## [13] "Slovakia"       "Slovenia"       "Spain"          "Ukraine"       
## 
## $`5`
##  [1] "China"       "Hong Kong"   "India"       "Indonesia"   "Japan"      
##  [6] "Malaysia"    "Singapore"   "South Korea" "Sri Lanka"   "Taiwan"     
## [11] "Thailand"    "Vietnam"    
## 
## $`6`
##  [1] "Egypt"                "Iran"                 "Jordan"              
##  [4] "Kuwait"               "Lebanon"              "Morocco"             
##  [7] "Pakistan"             "Qatar"                "Saudi Arabia"        
## [10] "Tunisia"              "Turkey"               "United Arab Emirates"
mem = membership(cm)
Heatmap(m, row_split = mem, column_split = mem)
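
To make the graph construction more concrete, the sketch below mimics the preprocessing in base R and evaluates the quantity that the Louvain method tries to maximize, the weighted modularity with a resolution parameter. The matrix a and the function modularity_score() are hypothetical illustrations, and the symmetrization a + t(a) is assumed to correspond to mode = "plus"; the code does not reimplement cluster_louvain() itself.

```r
# Hypothetical directed enrichment scores between four nodes.
a = matrix(c( 0.0, 2.0, -1.0,  0.0,
              1.0, 0.0,  0.0, -0.5,
              0.0, 0.2,  0.0,  3.0,
             -1.0, 0.0,  2.0,  0.0),
           nrow = 4, byrow = TRUE,
           dimnames = list(LETTERS[1:4], LETTERS[1:4]))

a[a < 0] = 0   # negative scores become "no link"
w = a + t(a)   # assumed equivalent of mode = "plus": sum both directions

# Weighted modularity of a given partition, with resolution gamma.
modularity_score = function(w, membership, gamma = 1) {
    k = rowSums(w)                               # weighted degree of each node
    two_m = sum(w)                               # twice the total edge weight
    same = outer(membership, membership, "==")   # pairs in the same group
    sum((w - gamma * outer(k, k) / two_m)[same]) / two_m
}

q_good = modularity_score(w, c(1, 1, 2, 2), gamma = 1.2)  # {A, B} and {C, D}
q_bad  = modularity_score(w, c(1, 2, 1, 2), gamma = 1.2)  # {A, C} and {B, D}
c(good = q_good, bad = q_bad)
```

The partition that keeps strongly linked nodes together scores higher. A resolution above 1 increases the penalty term, which tends to favor smaller communities.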

Following is the original plot from the paper:

Using a similar method, I also found local groups within the two European groups:

Reference

  1. Gu Z., 2025, Two separated worlds: On the preference of influence in life science and biomedical research. Journal of Informetrics.