Core Basics 4: Train a Coclustering

The steps to train a coclustering model with Khiops are very similar to what we have already seen in the basic classifier tutorials.

Make sure you have installed Khiops and Khiops CoVisualization.

We start by importing Khiops, checking its installation and defining some helper functions:

import os
import platform
import subprocess
from khiops import core as kh

# Define helper functions
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")


# If there are any issues, you can print the Khiops status with the following command
# kh.get_runner().print_status()

As stated before, an unsupervised analysis sometimes benefits from a more suitable visualization. We illustrate this point with the CountriesByOrganization dataset, which contains the country-organization relation for a large number of organizations and countries (it is a bit outdated, though).

countries_kdic = os.path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.kdic"
)
countries_data_file = os.path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.csv"
)

print(f"CountriesByOrganization dictionary file location: {countries_kdic}")
print("")
peek(countries_kdic)

print(f"CountriesByOrganization data table file location: {countries_data_file}")
print("")
peek(countries_data_file)

CountriesByOrganization dictionary file location: data/CountriesByOrganization/CountriesByOrganization.kdic

Dictionary  CountriesByOrganization
{
    Categorical     Country;
    Categorical     Organization;
};

CountriesByOrganization data table file location: data/CountriesByOrganization/CountriesByOrganization.csv

Country;Organization
Afghanistan;AsDB
Afghanistan;COLOMBO
Afghanistan;ECO
Afghanistan;ICCROM
Afghanistan;NAM
Afghanistan;PIARC
Afghanistan;SAARC
Afghanistan;WHO
Afghanistan;UN

We now create a coclustering model for this dataset:

coclustering_report_file_path_CountriesByOrganization = os.path.join(
    "exercises", "CountriesByOrganization", "CoclusteringResults.khcj"
)

countries_cc_report = kh.train_coclustering(
    countries_kdic,
    dictionary_name="CountriesByOrganization",
    data_table_path=countries_data_file,
    coclustering_variables=["Country", "Organization"],
    coclustering_report_file_path=coclustering_report_file_path_CountriesByOrganization,
    field_separator=";",
)

We can now browse the results with the Khiops CoVisualization app:

# To visualize uncomment the line below
# kh.visualize_report(countries_cc_report)

We can now dump the country clusters and their metrics to a file with the extract_clusters function:

country_clusters_file = os.path.join(
    "exercises", "CountriesByOrganization", "CountryClusters.txt"
)
kh.extract_clusters(
    countries_cc_report,
    cluster_variable="Country",
    clusters_file_path=country_clusters_file,
)
peek(country_clusters_file, n=100)

Cluster     Value   Frequency       Typicality
{Germany, France, Denmark, ...}     Germany 106     1
{Germany, France, Denmark, ...}     France  125     0.968057
{Germany, France, Denmark, ...}     Denmark 101     0.952673
{Germany, France, Denmark, ...}     Netherlands     105     0.952506
{Germany, France, Denmark, ...}     Sweden  102     0.943957
{Germany, France, Denmark, ...}     Belgium 104     0.919928
{Germany, France, Denmark, ...}     Finland 100     0.887537
{Germany, France, Denmark, ...}     Norway  96      0.872681
{Germany, France, Denmark, ...}     Italy   105     0.870872
{Germany, France, Denmark, ...}     Spain   103     0.851888
{Germany, France, Denmark, ...}     Austria 88      0.766636
{Germany, France, Denmark, ...}     Portugal        94      0.761055
{Germany, France, Denmark, ...}     United Kingdom  102     0.744776
{Germany, France, Denmark, ...}     Luxembourg      81      0.73663
{Germany, France, Denmark, ...}     Switzerland     90      0.73639
{Germany, France, Denmark, ...}     Greece  87      0.692487
{Germany, France, Denmark, ...}     Ireland 75      0.64078
{Germany, France, Denmark, ...}     Iceland 55      0.429196
{United States of America, Canada, Japan, ...}      United States of America        92      1
{United States of America, Canada, Japan, ...}      Canada  85      0.809229
{United States of America, Canada, Japan, ...}      Japan   81      0.748647
{United States of America, Canada, Japan, ...}      Australia       75      0.742523
{United States of America, Canada, Japan, ...}      New Zealand     60      0.53756
{United States of America, Canada, Japan, ...}      South Korea     69      0.509906
{United States of America, Canada, Japan, ...}      Taiwan  7       0.112925
{United States of America, Canada, Japan, ...}       *      0       0
{Poland, Hungary, Turkey, ...}      Poland  79      1
{Poland, Hungary, Turkey, ...}      Hungary 72      0.897742
{Poland, Hungary, Turkey, ...}      Turkey  78      0.887951
{Poland, Hungary, Turkey, ...}      Czech Republic  64      0.86137
{Poland, Hungary, Turkey, ...}      Russia  80      0.839033
{Poland, Hungary, Turkey, ...}      Bulgaria        70      0.837675
{Poland, Hungary, Turkey, ...}      Romania 69      0.833748
{Poland, Hungary, Turkey, ...}      Slovakia        58      0.78181
{Poland, Hungary, Turkey, ...}      Slovenia        56      0.70581
{Poland, Hungary, Turkey, ...}      Ukraine 53      0.675801
{Poland, Hungary, Turkey, ...}      Croatia 57      0.665962
{Poland, Hungary, Turkey, ...}      Estonia 46      0.612454
{Poland, Hungary, Turkey, ...}      Latvia  45      0.601949
{Poland, Hungary, Turkey, ...}      Lithuania       43      0.54036
{Poland, Hungary, Turkey, ...}      Albania 47      0.454971
{Poland, Hungary, Turkey, ...}      Cyprus  62      0.448388
{Poland, Hungary, Turkey, ...}      Macedonia       39      0.42804
{Poland, Hungary, Turkey, ...}      Serbia  42      0.425104
{Poland, Hungary, Turkey, ...}      Malta   52      0.418058
{Poland, Hungary, Turkey, ...}      Israel  57      0.412066
{Poland, Hungary, Turkey, ...}      Liechtenstein   20      0.330004
{Poland, Hungary, Turkey, ...}      Monaco  32      0.307032
{Poland, Hungary, Turkey, ...}      Bosnia and Herzegovina  33      0.277743
{Poland, Hungary, Turkey, ...}      San Marino      17      0.153352
{Poland, Hungary, Turkey, ...}      Andorra 13      0.148879
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Kazakhstan      47      1
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Kyrgyzstan      45      0.92458
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Moldova 47      0.892857
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Azerbaijan      41      0.885575
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Uzbekistan      41      0.877246
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Tajikistan      35      0.808354
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Turkmenistan    35      0.804177
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Georgia 42      0.763898
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Belarus 38      0.751741
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Armenia 37      0.722566
{Kazakhstan, Kyrgyzstan, Moldova, ...}      Mongolia        36      0.357543
{Venezuela, Nicaragua, Ecuador, ...}        Venezuela       87      1
{Venezuela, Nicaragua, Ecuador, ...}        Nicaragua       73      0.966437
{Venezuela, Nicaragua, Ecuador, ...}        Ecuador 79      0.947892
{Venezuela, Nicaragua, Ecuador, ...}        Costa Rica      74      0.936956
{Venezuela, Nicaragua, Ecuador, ...}        Colombia        83      0.936721
{Venezuela, Nicaragua, Ecuador, ...}        Bolivia 72      0.909518
{Venezuela, Nicaragua, Ecuador, ...}        Guatemala       71      0.909214
{Venezuela, Nicaragua, Ecuador, ...}        Panama  72      0.903985
{Venezuela, Nicaragua, Ecuador, ...}        Mexico  87      0.903091
{Venezuela, Nicaragua, Ecuador, ...}        Peru    79      0.88863
{Venezuela, Nicaragua, Ecuador, ...}        Brazil  86      0.872311
{Venezuela, Nicaragua, Ecuador, ...}        Argentina       84      0.852174
{Venezuela, Nicaragua, Ecuador, ...}        Honduras        67      0.850095
{Venezuela, Nicaragua, Ecuador, ...}        El Salvador     64      0.841804
{Venezuela, Nicaragua, Ecuador, ...}        Uruguay 72      0.825964
{Venezuela, Nicaragua, Ecuador, ...}        Chile   80      0.825028
{Venezuela, Nicaragua, Ecuador, ...}        Paraguay        65      0.798743
{Venezuela, Nicaragua, Ecuador, ...}        Dominican Republic      67      0.735695
{Venezuela, Nicaragua, Ecuador, ...}        Cuba    63      0.536944
{Venezuela, Nicaragua, Ecuador, ...}        Haiti   62      0.502131
{Trinidad and Tobago, Barbados, Grenada, ...}       Trinidad and Tobago     63      1
{Trinidad and Tobago, Barbados, Grenada, ...}       Barbados        56      0.992078
{Trinidad and Tobago, Barbados, Grenada, ...}       Grenada 50      0.920647
{Trinidad and Tobago, Barbados, Grenada, ...}       Jamaica 63      0.906315
{Trinidad and Tobago, Barbados, Grenada, ...}       Belize  51      0.835998
{Trinidad and Tobago, Barbados, Grenada, ...}       Guyana  56      0.811344
{Trinidad and Tobago, Barbados, Grenada, ...}       Dominica        47      0.808439
{Trinidad and Tobago, Barbados, Grenada, ...}       Antigua and Barbuda     43      0.806975
{Trinidad and Tobago, Barbados, Grenada, ...}       Saint Lucia     46      0.777826
{Trinidad and Tobago, Barbados, Grenada, ...}       Saint Vincent and the Grenadines        41      0.771804
{Trinidad and Tobago, Barbados, Grenada, ...}       The Bahamas     49      0.747432
{Trinidad and Tobago, Barbados, Grenada, ...}       Suriname        48      0.694702
{Trinidad and Tobago, Barbados, Grenada, ...}       Saint Kitts and Nevis   36      0.689177
{Niger, Ivory Coast, Benin, ...}    Niger   66      1
{Niger, Ivory Coast, Benin, ...}    Ivory Coast     83      0.991021
{Niger, Ivory Coast, Benin, ...}    Benin   76      0.985146
{Niger, Ivory Coast, Benin, ...}    Burkina Faso    75      0.98505
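The clusters file can also be post-processed in plain Python. Below is a minimal sketch that groups its rows by cluster. It assumes the file is tab-separated with the header shown above; `load_clusters` is a hypothetical helper written for this tutorial, not part of the Khiops API, and the sample rows are copied from the output above.

```python
import csv
import os
import tempfile
from collections import defaultdict


def load_clusters(clusters_file_path):
    """Groups the rows of a clusters file by cluster name.

    Assumption: the file is tab-separated with the columns
    Cluster, Value, Frequency and Typicality.
    """
    clusters = defaultdict(list)
    with open(clusters_file_path, encoding="utf8", newline="") as clusters_file:
        reader = csv.DictReader(
            clusters_file, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        for row in reader:
            clusters[row["Cluster"]].append(
                (row["Value"], int(row["Frequency"]), float(row["Typicality"]))
            )
    return dict(clusters)


# Small stand-in file with a few rows copied from the output above
sample_rows = [
    "Cluster\tValue\tFrequency\tTypicality",
    "{Germany, France, Denmark, ...}\tGermany\t106\t1",
    "{Germany, France, Denmark, ...}\tFrance\t125\t0.968057",
    "{Poland, Hungary, Turkey, ...}\tPoland\t79\t1",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "CountryClusters.txt")
    with open(sample_path, "w", encoding="utf8") as sample_file:
        sample_file.write("\n".join(sample_rows) + "\n")
    clusters = load_clusters(sample_path)

# Print the member values of each cluster
for cluster, members in clusters.items():
    print(cluster, "->", [value for value, _, _ in members])
```

With the real file, the same helper could be called as load_clusters(country_clusters_file).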

Exercise

We’ll build a coclustering for the Tokyo2021 dataset, which contains a table called Athletes (from the Tokyo 2021 Kaggle dataset) where each athlete is described by three variables:

- Name: the name of the competing athlete
- Country: the country (or organization) of the athlete
- Discipline: the athlete’s discipline

The idea for this exercise is to build a coclustering between Country and Discipline and see which countries resemble each other the most in terms of the athletes they bring to the Olympics.
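What links a country to a discipline in this table is how often the two values appear together. As a rough illustration, the sketch below counts these co-occurrences with the standard library only. The helper name `count_pairs` is hypothetical, and this is not how Khiops computes the coclustering; it only shows the kind of raw counts the analysis starts from. The sample rows are taken from the Athletes.csv data file.

```python
import csv
import io
from collections import Counter


def count_pairs(csv_file, first, second, delimiter=","):
    """Counts how often each (first, second) value pair occurs in a CSV stream."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    return Counter((row[first], row[second]) for row in reader)


# In-memory sample with a few rows from the Athletes.csv data file
sample = io.StringIO(
    "Name,Country,Discipline\n"
    "ABAD Nestor,Spain,Artistic Gymnastics\n"
    "ABALDE Alberto,Spain,Basketball\n"
    "ABALDE Tamara,Spain,Basketball\n"
    "ABALO Luc,France,Handball\n"
)
pair_counts = count_pairs(sample, "Country", "Discipline")

# One line per (country, discipline) pair with its number of athletes
for (country, discipline), count in sorted(pair_counts.items()):
    print(f"{country:10} {discipline:20} {count}")
```

Countries whose counts are distributed similarly across disciplines are the ones we would expect to end up in the same cluster.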

We start by saving the dataset dictionary file and data table location into variables:

tokyo_kdic = os.path.join("data", "Tokyo2021", "Athletes.kdic")
tokyo_data_file = os.path.join("data", "Tokyo2021", "Athletes.csv")
coclustering_report_file_path_Tokyo2021 = os.path.join(
    "exercises", "Tokyo2021", "CoclusteringResults.khcj"
)

Peek at the contents of the dictionary and data files:

print(f"Tokyo2021 dictionary file: {tokyo_kdic}")
print("")
peek(tokyo_kdic, n=15)

print(f"Tokyo data table file: {tokyo_data_file}")
print("")
peek(tokyo_data_file)

Tokyo2021 dictionary file: data/Tokyo2021/Athletes.kdic

Dictionary  Athletes
{
    Categorical     Name;
    Categorical     Country;
    Categorical     Discipline;
};

Tokyo data table file: data/Tokyo2021/Athletes.csv

Name,Country,Discipline
AALERUD Katrine,Norway,Cycling Road
ABAD Nestor,Spain,Artistic Gymnastics
ABAGNALE Giovanni,Italy,Rowing
ABALDE Alberto,Spain,Basketball
ABALDE Tamara,Spain,Basketball
ABALO Luc,France,Handball
ABAROA Cesar,Chile,Rowing
ABASS Abobakr,Sudan,Swimming
ABBASALI Hamideh,Islamic Republic of Iran,Karate

Train the coclustering for the variables Country and Discipline.

Do not forget that the field separator for this file is a comma (,):

tokyo_cc_report = kh.train_coclustering(
    tokyo_kdic,
    dictionary_name="Athletes",
    coclustering_variables=["Country", "Discipline"],
    data_table_path=tokyo_data_file,
    coclustering_report_file_path=coclustering_report_file_path_Tokyo2021,
    field_separator=",",
)
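When it is unclear which separator a data file uses, looking at its header line is usually enough. The following minimal sketch picks, among a few candidates, the character that occurs most often in the header. `guess_separator` is a hypothetical helper, not a Khiops function, and the candidate list is an assumption. Only the header is inspected because data values may themselves contain a candidate character (for example "Hong Kong, China").

```python
def guess_separator(header_line, candidates=(";", ",", "\t")):
    """Returns the candidate separator that occurs most often in the header line.

    On a tie, the first candidate in the tuple wins.
    """
    return max(candidates, key=header_line.count)


# Header lines of the two data files used in this tutorial
print(repr(guess_separator("Name,Country,Discipline")))
print(repr(guess_separator("Country;Organization")))
```

The result can then be passed as the field_separator argument of the training call.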

You can view the coclustering with the Khiops CoVisualization app:

# To visualize uncomment the line below
# kh.visualize_report(tokyo_cc_report)

Use extract_clusters to extract the country clusters and peek at the contents:

tokyo_country_clusters_file = os.path.join(
    "exercises", "Tokyo2021", "CountryClusters.txt"
)

kh.extract_clusters(
    tokyo_cc_report,
    cluster_variable="Country",
    clusters_file_path=tokyo_country_clusters_file,
)
peek(tokyo_country_clusters_file, n=200)

Cluster     Value   Frequency       Typicality
{Ghana, Cameroon, Republic of Moldova, ...} Ghana   14      1
{Ghana, Cameroon, Republic of Moldova, ...} Cameroon        11      0.920849
{Ghana, Cameroon, Republic of Moldova, ...} Republic of Moldova     19      0.903523
{Ghana, Cameroon, Republic of Moldova, ...} Kosovo  10      0.889051
{Ghana, Cameroon, Republic of Moldova, ...} Tajikistan      8       0.875786
{Ghana, Cameroon, Republic of Moldova, ...} Guatemala       22      0.849855
{Ghana, Cameroon, Republic of Moldova, ...} Turkmenistan    8       0.844016
{Ghana, Cameroon, Republic of Moldova, ...} Pakistan        10      0.832082
{Ghana, Cameroon, Republic of Moldova, ...} Niger   7       0.802549
{Ghana, Cameroon, Republic of Moldova, ...} Bosnia and Herzegovina  7       0.794854
{Ghana, Cameroon, Republic of Moldova, ...} Haiti   6       0.771924
{Ghana, Cameroon, Republic of Moldova, ...} Madagascar      6       0.771924
{Ghana, Cameroon, Republic of Moldova, ...} Jordan  11      0.764815
{Ghana, Cameroon, Republic of Moldova, ...} Lebanon 6       0.76423
{Ghana, Cameroon, Republic of Moldova, ...} Qatar   14      0.763605
{Ghana, Cameroon, Republic of Moldova, ...} Panama  9       0.743808
{Ghana, Cameroon, Republic of Moldova, ...} Albania 8       0.733188
{Ghana, Cameroon, Republic of Moldova, ...} Gabon   5       0.726799
{Ghana, Cameroon, Republic of Moldova, ...} Mauritius       7       0.724113
{Ghana, Cameroon, Republic of Moldova, ...} Burundi 6       0.717779
{Ghana, Cameroon, Republic of Moldova, ...} Mozambique      8       0.713343
{Ghana, Cameroon, Republic of Moldova, ...} Democratic Republic of the Congo        7       0.712369
{Ghana, Cameroon, Republic of Moldova, ...} Malawi  5       0.711301
{Ghana, Cameroon, Republic of Moldova, ...} Nepal   5       0.711301
{Ghana, Cameroon, Republic of Moldova, ...} Burkina Faso    7       0.709689
{Ghana, Cameroon, Republic of Moldova, ...} Papua New Guinea        7       0.708507
{Ghana, Cameroon, Republic of Moldova, ...} Guyana  7       0.707809
{Ghana, Cameroon, Republic of Moldova, ...} Cape Verde      6       0.706279
{Ghana, Cameroon, Republic of Moldova, ...} North Macedonia 8       0.706252
{Ghana, Cameroon, Republic of Moldova, ...} Tonga   5       0.704952
{Ghana, Cameroon, Republic of Moldova, ...} Benin   7       0.704835
{Ghana, Cameroon, Republic of Moldova, ...} Antigua and Barbuda     6       0.699148
{Ghana, Cameroon, Republic of Moldova, ...} Nicaragua       8       0.698111
{Ghana, Cameroon, Republic of Moldova, ...} Grenada 6       0.690036
{Ghana, Cameroon, Republic of Moldova, ...} Bangladesh      6       0.688843
{Ghana, Cameroon, Republic of Moldova, ...} Malta   6       0.686951
{Ghana, Cameroon, Republic of Moldova, ...} Kuwait  10      0.685139
{Ghana, Cameroon, Republic of Moldova, ...} Seychelles      5       0.682904
{Ghana, Cameroon, Republic of Moldova, ...} Lao People's Democratic Republic        4       0.674209
{Ghana, Cameroon, Republic of Moldova, ...} Sierra Leone    4       0.674209
{Ghana, Cameroon, Republic of Moldova, ...} El Salvador     5       0.66886
{Ghana, Cameroon, Republic of Moldova, ...} Eswatini        4       0.660165
{Ghana, Cameroon, Republic of Moldova, ...} United Arab Emirates    4       0.658142
{Ghana, Cameroon, Republic of Moldova, ...} Uruguay 11      0.655258
{Ghana, Cameroon, Republic of Moldova, ...} Guam    5       0.65335
{Ghana, Cameroon, Republic of Moldova, ...} Guinea  5       0.65335
{Ghana, Cameroon, Republic of Moldova, ...} Afghanistan     5       0.651207
{Ghana, Cameroon, Republic of Moldova, ...} Oman    5       0.651207
{Ghana, Cameroon, Republic of Moldova, ...} Palestine       4       0.6508
{Ghana, Cameroon, Republic of Moldova, ...} Sudan   5       0.646547
{Ghana, Cameroon, Republic of Moldova, ...} Iceland 4       0.644667
{Ghana, Cameroon, Republic of Moldova, ...} Virgin Islands, US      4       0.644667
{Ghana, Cameroon, Republic of Moldova, ...} Monaco  6       0.63624
{Ghana, Cameroon, Republic of Moldova, ...} Djibouti        4       0.628158
{Ghana, Cameroon, Republic of Moldova, ...} Mali    4       0.614115
{Ghana, Cameroon, Republic of Moldova, ...} Aruba   3       0.612258
{Ghana, Cameroon, Republic of Moldova, ...} Saint Lucia     5       0.610116
{Ghana, Cameroon, Republic of Moldova, ...} Cambodia        3       0.607913
{Ghana, Cameroon, Republic of Moldova, ...} Democratic Republic of Timor-Leste      3       0.607913
{Ghana, Cameroon, Republic of Moldova, ...} Federated States of Micronesia  3       0.607913
{Ghana, Cameroon, Republic of Moldova, ...} Palau   3       0.607913
{Ghana, Cameroon, Republic of Moldova, ...} Maldives        4       0.59693
{Ghana, Cameroon, Republic of Moldova, ...} Cyprus  14      0.591335
{Ghana, Cameroon, Republic of Moldova, ...} Rwanda  5       0.589451
{Ghana, Cameroon, Republic of Moldova, ...} American Samoa  5       0.586707
{Ghana, Cameroon, Republic of Moldova, ...} Solomon Islands 3       0.584505
{Ghana, Cameroon, Republic of Moldova, ...} San Marino      4       0.581695
{Ghana, Cameroon, Republic of Moldova, ...} Marshall Islands        2       0.575845
{Ghana, Cameroon, Republic of Moldova, ...} St Vincent and the Grenadines   2       0.575845
{Ghana, Cameroon, Republic of Moldova, ...} Libya   4       0.570547
{Ghana, Cameroon, Republic of Moldova, ...} Kiribati        3       0.569648
{Ghana, Cameroon, Republic of Moldova, ...} Yemen   3       0.569007
{Ghana, Cameroon, Republic of Moldova, ...} Bhutan  3       0.562485
{Ghana, Cameroon, Republic of Moldova, ...} Sri Lanka       9       0.56242
{Ghana, Cameroon, Republic of Moldova, ...} Congo   3       0.561863
{Ghana, Cameroon, Republic of Moldova, ...} Equatorial Guinea       3       0.561863
{Ghana, Cameroon, Republic of Moldova, ...} Virgin Islands, British 3       0.561863
{Ghana, Cameroon, Republic of Moldova, ...} Bolivia 5       0.560008
{Ghana, Cameroon, Republic of Moldova, ...} Cayman Islands  5       0.560008
{Ghana, Cameroon, Republic of Moldova, ...} Chad    3       0.55415
{Ghana, Cameroon, Republic of Moldova, ...} Comoros 3       0.547007
{Ghana, Cameroon, Republic of Moldova, ...} Gambia  3       0.547007
{Ghana, Cameroon, Republic of Moldova, ...} Samoa   8       0.540306
{Ghana, Cameroon, Republic of Moldova, ...} Brunei Darussalam       2       0.532593
{Ghana, Cameroon, Republic of Moldova, ...} Central African Republic        2       0.532593
{Ghana, Cameroon, Republic of Moldova, ...} Zimbabwe        5       0.531839
{Ghana, Cameroon, Republic of Moldova, ...} Cook Islands    6       0.52258
{Ghana, Cameroon, Republic of Moldova, ...} Liberia 3       0.507698
{Ghana, Cameroon, Republic of Moldova, ...} Nauru   2       0.503693
{Ghana, Cameroon, Republic of Moldova, ...} Somalia 2       0.503693
{Ghana, Cameroon, Republic of Moldova, ...} Togo    4       0.493268
{Ghana, Cameroon, Republic of Moldova, ...} Iraq    4       0.489464
{Ghana, Cameroon, Republic of Moldova, ...} Senegal 9       0.481748
{Ghana, Cameroon, Republic of Moldova, ...} Dominica        2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} Lesotho 2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} Mauritania      2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} Saint Kitts and Nevis   2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} South Sudan     2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} Tuvalu  2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} United Republic of Tanzania     2       0.481052
{Ghana, Cameroon, Republic of Moldova, ...} Belize  3       0.472751
{Ghana, Cameroon, Republic of Moldova, ...} Suriname        3       0.460859
{Ghana, Cameroon, Republic of Moldova, ...} Vanuatu 2       0.457327
{Ghana, Cameroon, Republic of Moldova, ...} Guinea-Bissau   4       0.457054
{Ghana, Cameroon, Republic of Moldova, ...} Myanmar 2       0.444802
{Ghana, Cameroon, Republic of Moldova, ...} Paraguay        8       0.42349
{Ghana, Cameroon, Republic of Moldova, ...} Sao Tome and Principe   3       0.413163
{Ghana, Cameroon, Republic of Moldova, ...} Andorra 2       0.40303
{Ghana, Cameroon, Republic of Moldova, ...} Syrian Arab Republic    6       0.364929
{Ghana, Cameroon, Republic of Moldova, ...} Liechtenstein   5       0.352692
{Ghana, Cameroon, Republic of Moldova, ...} Bermuda 2       0.296652
{Ghana, Cameroon, Republic of Moldova, ...}  *      0       0
{Poland, Switzerland, Lithuania, ...}       Poland  195     1
{Poland, Switzerland, Lithuania, ...}       Switzerland     115     0.77853
{Poland, Switzerland, Lithuania, ...}       Lithuania       37      0.329817
{Poland, Switzerland, Lithuania, ...}       Finland 45      0.32761
{Poland, Switzerland, Lithuania, ...}       Estonia 33      0.308739
{Poland, Switzerland, Lithuania, ...}       Peru    33      0.241423
{Colombia, Morocco, Ecuador, ...}   Colombia        64      1
{Colombia, Morocco, Ecuador, ...}   Morocco 48      0.98391
{Colombia, Morocco, Ecuador, ...}   Ecuador 46      0.859845
{Colombia, Morocco, Ecuador, ...}   Latvia  29      0.422137
{Colombia, Morocco, Ecuador, ...}   Philippines     18      0.398873
{Colombia, Morocco, Ecuador, ...}   Namibia 11      0.280155
{Colombia, Morocco, Ecuador, ...}   Costa Rica      13      0.254137
{Chinese Taipei, Thailand, Indonesia, ...}  Chinese Taipei  67      1
{Chinese Taipei, Thailand, Indonesia, ...}  Thailand        39      0.55234
{Chinese Taipei, Thailand, Indonesia, ...}  Indonesia       26      0.541865
{Chinese Taipei, Thailand, Indonesia, ...}  Slovakia        38      0.392312
{Chinese Taipei, Thailand, Indonesia, ...}  Vietnam 17      0.319059
{Austria, 'Hong Kong, China', Malaysia, ...}        Austria 72      1
{Austria, 'Hong Kong, China', Malaysia, ...}        Hong Kong, China        40      0.973696
{Austria, 'Hong Kong, China', Malaysia, ...}        Malaysia        29      0.805575
{Austria, 'Hong Kong, China', Malaysia, ...}        Singapore       23      0.796014
{Austria, 'Hong Kong, China', Malaysia, ...}        Luxembourg      11      0.289615
{Uzbekistan, Azerbaijan, Mongolia, ...}     Uzbekistan      63      1
{Uzbekistan, Azerbaijan, Mongolia, ...}     Azerbaijan      41      0.98929
{Uzbekistan, Azerbaijan, Mongolia, ...}     Mongolia        43      0.89835
{Uzbekistan, Azerbaijan, Mongolia, ...}     Georgia 35      0.870435
{Uzbekistan, Azerbaijan, Mongolia, ...}     Bulgaria        41      0.807297
{Uzbekistan, Azerbaijan, Mongolia, ...}     Kyrgyzstan      16      0.456466
{Uzbekistan, Azerbaijan, Mongolia, ...}     Refugee Olympic Team    26      0.453728
{Uzbekistan, Azerbaijan, Mongolia, ...}     Armenia 16      0.453643
{Turkey, Tunisia, Venezuela, ...}   Turkey  102     1
{Turkey, Tunisia, Venezuela, ...}   Tunisia 57      0.844811
{Turkey, Tunisia, Venezuela, ...}   Venezuela       43      0.650009
{Turkey, Tunisia, Venezuela, ...}   Algeria 41      0.459588
{Ukraine, Belarus, Cuba}    Ukraine 152     1
{Ukraine, Belarus, Cuba}    Belarus 104     0.788693
{Ukraine, Belarus, Cuba}    Cuba    69      0.57311
{Kazakhstan, Croatia, Greece}       Kazakhstan      92      1
{Kazakhstan, Croatia, Greece}       Croatia 57      0.825466
{Kazakhstan, Croatia, Greece}       Greece  75      0.588455
{ROC}       ROC     318     1
{Hungary, Montenegro}       Hungary 155     1
{Hungary, Montenegro}       Montenegro      35      0.504421
{Serbia, Islamic Republic of Iran}  Serbia  83      1
{Serbia, Islamic Republic of Iran}  Islamic Republic of Iran        66      0.932964
{Nigeria, Slovenia, Puerto Rico}    Nigeria 59      1
{Nigeria, Slovenia, Puerto Rico}    Slovenia        51      0.759133
{Nigeria, Slovenia, Puerto Rico}    Puerto Rico     35      0.668318
{United States of America}  United States of America        615     1
{Italy}     Italy   356     1
{Dominican Republic, Israel}        Dominican Republic      61      1
{Dominican Republic, Israel}        Israel  85      0.929927
{Mexico}    Mexico  155     1
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Jamaica 60      1
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Ethiopia        42      0.81877
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Trinidad and Tobago     31      0.437854
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Uganda  24      0.42675
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Bahamas 16      0.354908
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Botswana        13      0.237047
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Eritrea 13      0.213476
{Jamaica, Ethiopia, Trinidad and Tobago, ...}       Barbados        7       0.173651
{Kenya, Fiji}       Kenya   78      1
{Kenya, Fiji}       Fiji    28      0.565214
{Norway, Denmark, Portugal, ...}    Norway  92      1
{Norway, Denmark, Portugal, ...}    Denmark 103     0.826706
{Norway, Denmark, Portugal, ...}    Portugal        85      0.595021
{Norway, Denmark, Portugal, ...}    Angola  20      0.461764
{Norway, Denmark, Portugal, ...}    Bahrain 31      0.435065
{Brazil, Sweden}    Brazil  291     1
{Brazil, Sweden}    Sweden  129     0.577046
{France}    France  377     1
{Great Britain, Ireland}    Great Britain   366     1
{Great Britain, Ireland}    Ireland 116     0.548124
{New Zealand}       New Zealand     202     1
{Spain}     Spain   324     1
{South Africa}      South Africa    171     1
{Netherlands}       Netherlands     274     1
{Germany}   Germany 400     1
{Belgium, Czech Republic}   Belgium 125     1
{Belgium, Czech Republic}   Czech Republic  117     0.799098
{India}     India   117     1
{Japan}     Japan   586     1
{Argentina} Argentina       180     1
{Republic of Korea} Republic of Korea       223     1
{Egypt}     Egypt   133     1
{Australia} Australia       470     1
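In this output, many large delegations form a cluster of their own, while most small delegations share one large cluster. A quick way to check this kind of pattern is to count the members and the total frequency of each cluster. The sketch below does so on a handful of rows copied from the output above; reading Frequency as the number of athlete rows per country is an assumption based on the table having one row per athlete.

```python
from collections import Counter

# (cluster, value, frequency) rows copied from the output above
rows = [
    ("{Ukraine, Belarus, Cuba}", "Ukraine", 152),
    ("{Ukraine, Belarus, Cuba}", "Belarus", 104),
    ("{Ukraine, Belarus, Cuba}", "Cuba", 69),
    ("{ROC}", "ROC", 318),
    ("{Hungary, Montenegro}", "Hungary", 155),
    ("{Hungary, Montenegro}", "Montenegro", 35),
    ("{Italy}", "Italy", 356),
]

member_counts = Counter()   # number of countries per cluster
athlete_counts = Counter()  # summed frequency per cluster
for cluster, _value, frequency in rows:
    member_counts[cluster] += 1
    athlete_counts[cluster] += frequency

# Clusters that contain a single country
singletons = sorted(c for c, n in member_counts.items() if n == 1)
print("Single-country clusters:", singletons)

# Clusters ordered from most to fewest members
for cluster, members in member_counts.most_common():
    print(f"{cluster}: {members} countries, {athlete_counts[cluster]} athletes")
```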

Use extract_clusters to extract the discipline clusters and peek at the contents:

tokyo_discipline_clusters_file = os.path.join(
    "exercises", "Tokyo2021", "DisciplineClusters.txt"
)

kh.extract_clusters(
    tokyo_cc_report,
    cluster_variable="Discipline",
    clusters_file_path=tokyo_discipline_clusters_file,
)
peek(tokyo_discipline_clusters_file, n=200)

Cluster     Value   Frequency       Typicality
{Handball}  Handball        343     1
{Hockey}    Hockey  406     1
{Football}  Football        567     1
{Rugby Sevens}      Rugby Sevens    283     1
{Athletics} Athletics       2068    1
{Boxing, Weightlifting, Taekwondo}  Boxing  270     1
{Boxing, Weightlifting, Taekwondo}  Weightlifting   187     0.721619
{Boxing, Weightlifting, Taekwondo}  Taekwondo       123     0.536492
{Judo}      Judo    373     1
{Swimming}  Swimming        743     1
{Rowing, Cycling Track}     Rowing  496     1
{Rowing, Cycling Track}     Cycling Track   208     0.496133
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Equestrian      237     1
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Triathlon       106     0.559253
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Cycling Mountain Bike   74      0.515497
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Beach Volleyball        90      0.486439
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Skateboarding   77      0.41841
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Cycling BMX Racing      43      0.355495
{Equestrian, Triathlon, Cycling Mountain Bike, ...} Surfing 38      0.313644
{Cycling Road, Golf, Canoe Slalom, ...}     Cycling Road    190     1
{Cycling Road, Golf, Canoe Slalom, ...}     Golf    115     0.605512
{Cycling Road, Golf, Canoe Slalom, ...}     Canoe Slalom    78      0.474946
{Cycling Road, Golf, Canoe Slalom, ...}     Marathon Swimming       49      0.269378
{Sailing}   Sailing 336     1
{Shooting, Archery} Shooting        342     1
{Shooting, Archery} Archery 122     0.409113
{Badminton, Table Tennis}   Badminton       164     1
{Badminton, Table Tennis}   Table Tennis    164     0.920471
{Tennis, Artistic Gymnastics, Modern Pentathlon, ...}       Tennis  178     1
{Tennis, Artistic Gymnastics, Modern Pentathlon, ...}       Artistic Gymnastics     187     0.939428
{Tennis, Artistic Gymnastics, Modern Pentathlon, ...}       Modern Pentathlon       69      0.515132
{Tennis, Artistic Gymnastics, Modern Pentathlon, ...}       Cycling BMX Freestyle   19      0.151931
{Tennis, Artistic Gymnastics, Modern Pentathlon, ...}        *      0       0
{Diving, Artistic Swimming, Trampoline Gymnastics, ...}     Diving  133     1
{Diving, Artistic Swimming, Trampoline Gymnastics, ...}     Artistic Swimming       98      0.874855
{Diving, Artistic Swimming, Trampoline Gymnastics, ...}     Trampoline Gymnastics   31      0.391158
{Diving, Artistic Swimming, Trampoline Gymnastics, ...}     Sport Climbing  37      0.28726
{Canoe Sprint}      Canoe Sprint    236     1
{Baseball/Softball} Baseball/Softball       220     1
{Water Polo}        Water Polo      269     1
{Basketball}        Basketball      280     1
{Wrestling, Rhythmic Gymnastics, Karate}    Wrestling       279     1
{Wrestling, Rhythmic Gymnastics, Karate}    Rhythmic Gymnastics     95      0.344437
{Wrestling, Rhythmic Gymnastics, Karate}    Karate  77      0.24672
{Volleyball}        Volleyball      274     1
{Fencing, 3x3 Basketball}   Fencing 249     1
{Fencing, 3x3 Basketball}   3x3 Basketball  62      0.272875