Can the Multi-Armed Subaltern Write?

multi-armed bandit
wikipedia
translations
GMM
spectral-clustering
recommendation model
In which we decided to create a TikTok-style recommendation model for Wikipedia translations.
Author

Akshay Regulagedda

Published

July 5, 2026

Figure 1: Working through an English - Tamil dictionary. Photo by Antony Hyson Seltran on Unsplash

Foreword

This is the final term project I did towards the completion of my Master of Science in Analytics degree from Georgia Institute of Technology.

Problem Statement1

Whilst multi-lingual anglophone writers may easily find a large readership for their writing in English, they often find it hard to find a ready audience for their writings in other vernacular languages. This is particularly so if they have very niche or domain-specific interests that may not have equivalent readership across language communities. And indeed, a sufficiently large readership is a crucial signal to motivate writers to write more, especially if they would like to practise or improve their writing talent.

Wikipedia translation offers perhaps the most accessible medium for such writers. In this sense, we see Wikipedia as a platform that connects interested writers and readers across a hidden dimension, that of language. But the difficulty for a sufficiently motivated writer is to find articles that are at the intersection of their own interests and what reasonably large number of readers would read. We explore Wikipedia as a two-sided market balancing the demand-side (readership, as measured in page-views) and the supply-side (translator interest or effort).

Put simply, the question we ask ourselves is this: how can we create a recommendation model that works similar to that on TikTok or Instagram Reels, but for volunteer translators on Wikipedia?

To this end, we have implemented a model that recommends articles for an interested translator using an explore-exploit approach (“The Multi-Armed Bandit problem”). The model balances the translator’s own interest whilst optimising for readership potential. In a later section, we will share a brief literature review on this problem (Section 3), and why the approach is unique both in its application to the problem area (motivating translators) and in its technical implementation.

We executed this model using Telugu Wikipedia as a primary case study, a language with 80 million native speakers yet dramatically underserved relative to comparable language communities.

We had previously proposed a detailed pipeline (Section 4.1) - from seeding interest by a translator to expanding, filtering, scoring, clustering and eventually recommending articles in the English Wikipedia back to them. In this report, we will not only elaborate on that pipeline, but also discuss nuances in implementation: how the problem was broken down into 9 JIRA-style epics with associated user-stories, how we optimized processing time for data gathering by a factor of 8 through parallelisation and batch API’s, and so on. We will evaluate the implementation both from a quantitative perspective - by considering MAE’s across a range of approaches - and a qualitative evaluation of the categories chosen. We will finally recommend a set of actions in the final section on how this framework can be taken forward.

Literature Survey & Novelty

Silva et al (2022) have observed that recommendation systems have “attracted a lot of attention from both industry and academia” as a response to exponential growth of digital information, which had led to “stressful situation[s]” of information overload for many users, particularly when comes to purchase decisions. Indeed, a lot of the effort has been towards recommending videos and such on Netflix (Ko et al. 2022), and of course, TikTok, which, Zhang et al (2021) have found, to be “impressive” in not just improving loyalty from users, but also drawing them “into new intersecting areas”.

Silva et al’s (2022) systematic literature review found 230 articles on Google Scholar published between 2000 and 2020 that used the Multi-Armed Bandit approach to develop recommendation systems. The field has had a growing interest in recent years; they found that more than 50% of such works were published between 2016 and 2020. Most applications, however, deal with recommending news, users’ purchases (e.g. on Amazon),songs, advertisements, points-of-interest, bookmarks, food, jobs, and even jokes. None of the papers in their meta-analysis dealt with recommending articles to be translated.

Wulczyn et al’s three-month study did generate a model to recommend articles for translators to translate from English to French Wikipedia. However, it did so by ranking articles that it deemed “missing” (and then matched translators to them) (2016). The starting point wasn’t necessarily the translator’s own interest, but achieving a certain completeness in the French Wikipedia compared to that in English. Translators in the study had raised questions of “language imperialism”, as Wulcyzn et al themselves had documented.

In comparison, our approach is centered on the Telugu Wikipedia and on the personal: the goal isn’t necessarily for the Telugu Wikipedia to reach parity, so to speak, with the English Wikipedia in terms of articles. Instead, the focus is essentially writer’s own interest first: on what they would like to see in the Telugu Wikipedia primarily, whilst maximizing for readership.

Methodology

Overview

We proposed the pipeline in our project proposal seen in Figure 2.

Figure 2: Pipeline

The pipeline was implemented largely as proposed using the Agile methodology, as Epics with self-contained user-stories. This project was essentially done beyond work-hours and on weekend afternoons. As such, we needed a way to break the overall task down into small chunks (that can be executed over lunch for example), while at the same time, have assured code quality when we integrated it all. Having self-contained unit-tests per user-story helped us tremendously in this regard.

Some optimisations were later found necessary for data gathering and processing.

Epic 1: Project Scaffolding

We used the Agile methodology to breakdown the pipeline into a series of epics and JIRA-style user-stories to be implemented sequentially. Each user-story had its own set of self-contained unit-tests. The overall project has a pyproject.toml file declaring all dependencies and a config.yaml to store values for alpha, curated_categories, traversal_depth, quality_filter_threshold, epsilon, cold_start_window_days

Epic 2: Data Acquisition

We combined manual ratings with openly available data from various Wikipedia API’s:

Data Sources & Feature Set
Source Signal Type Feature
Manual Ratings Supply side “seed” Translator interest rating
MediaWiki API Relevance (Features for Step #2: Expand) Article categories
MediaWiki API Feasibility features Article length (word count)
MediaWiki API Feasibility features Article quality rating
Pageviews API Reader demand Demand gap score

Optimisations

Our initial versions for client-classes for MediaWiki and Pageviews API’s called each Wikipedia article individually. In later stages, we realized that meant calling the API more than 2000 times in quick succession, especially if the constructed candidate pool was large. With enough throttling, that meant processing times of more than 45 minutes each time a candidate pool was constructed. Instead, we implemented the following:

  1. Batch Processing: We combined calls for multiple Wikipedia pages together into a single call. The API allowed up to 50 pages to be batched together
  2. Parallel Processing: We parallelized API calls into multiple threads using ThreadPoolExecutor.
  3. Throttling: Should an API call encounter a 429 error, we spaced out subsequent re-try attempts based on the message.

Epic 3: Candidate Pool Construction

The candidate pool was constructed through the following inputs:

  1. Seed Articles: The user would suggest a list of 10-15 Wikipedia articles to seed the model first. In the final evaluation, we seeded the following articles:

    Seed Articles for the Final Evaluation
    • Caitika
    • 1915 Singapore Mutiny
    • Ngogo chimpanzee war
    • Longest flights
    • Westray to Papa Westray flight
    • Boat Mail Express
    • Shangri-La
    • Tea Horse Road
    • List of tallest buildings in India
    • Longest train services of Indian Railways
    • Johor Bahru–Singapore Rapid Transit System
    • Montreal
    • Quebec
    • Montréal–Trudeau International Airport
    • Orlando: A Biography
  2. Categories: Querying the full English Wikipedia corpus of over 7 million articles is not just computationally infeasible, but semantically unfocussed. Therefore, we constrained the candidate pool of articles to a curated set of top-level Wikipedia categories specified by the user. This was defined in config.yaml

    We seeded the following categories in the final evaluation:

    Categories for the Final Evaluation
    • Transport-related lists of superlatives
    • Military history of Singapore
    • Named passenger trains of India
    • Mahayana sutras
    • Monks of Nalanda
    • Telugu monarchs
    • Novels about transgender topics
    • Metafictional novels
    • History of Andhra Pradesh
    • Chinese classic texts
    • Buddhist holy sites in Andhra Pradesh
    • Philosophical skepticism
  3. Depth of Category Traversal: The level to which the categories may be traversed can be set in config.yaml We initially tested the model with 1, but executed it with 2.

Epic 4: Feature Engineering

The raw data collected in the previous epic is to be transformed it into model inputs here, through the following features:

1. Article Quality Rating

Per Wikimedia’s own guidance, articles in the English Wikipedia can be ranked using the following quality classes in order of quality: FA, GA, B, C, START, and STUB (Johnson and Aragón 2022). Other language Wikipedias also have their own quality classifications, even if they may not exactly match those in English.

We grouped all quality classes into three tiers, as follows:

Quality Classes in English Wikipedia grouped into three tiers
Tier Classes Encoded Value
High quality FA, A, GA 3
Mid quality B, C 2
Low quality Start, Stub 1

All articles in low quality (encoded value = 1) were excluded from the candidate pool entirely before scoring.

2. Article Length

MediaWiki API returns a wordcount, which was used to signify length of an article. However, because the range of length of articles can be very wide indeed, we normalized this metric through a log transformation (log(wordcount)). In general, the difference in translation effort between a 500 and 1000 word article is much more significant than that between a 9,500 word and a 10,000 word article. A log transformation would naturally account for this and compress skewness in article length distribution.

3. Category Membership

Whilst we curated the list of top-level categories to a limited set, Wikipedia articles are still classified under multiple categories. This can be challenging in encoding the information, which we have addressed in the following manner.

If \(k\) is the number of curated top-level categories, each article would have: \[ category_i = \begin{cases} \text{1}, & \text{if article belongs to category i} \\ \text{0}, & \text{if it doesn't} \end{cases}\] where \(i = 1, 2, .... k\)

This gave us a sparse binary vector per article. We can avoid the curse of dimensionality because we are manually curating less than 15 top-level categories, thus ensuring \(k\) is small and fixed.

4. Demand Gap Score

Whilst the features thus far describe a Wikipedia article, we also would like some demand-side signals. To this end, we used a a demand gap score to measure is how different readership is for the same page between the English and Telugu Wikipedia. The feature is derived from PageViews API as \[\text{Demand Gap Score}=1-\frac{\text{Telugu views}}{\text{English views}} \]

This provides a normalized measure of unmet readership demand that is comparable across articles of varying popularity.

5. Cold-Start Gap

Whilst we can calculate a demand gap score between versions in English and Telugu Wikipedia, there is a possibility that a Telugu version may not exist at all (i.e., \(\text{Telugu views} = 0\)) In such a case, demand gap score will be 1 (undefined) for every untranslated article.

Which is to say, we may have a cold start problem - we will not gather sufficient pageview data for translated articles until well after the translations have been written. In such cases, we would use views for the Hindi language versions as a proxy instead:

\[\text{Cold Start Proxy} = 1 - \frac{\text{Hindi views}}{\text{English views}}\]

We chose Hindi as a proxy because we have seen that Hindi readership for the Wikipedia page on the Indian city Hyderabad (where both Hindi and Telugu are spoken) was 3-4 times that of the Telugu readership.(Langviews Analysis, n.d.) There is a very good chance that Hindi and Telugu readers may have overlapping topical interests for culturally proximate subjects: indeed, in many cases, they may be the same individuals as well, as many are fluent in both languages. Once sufficient Telugu pageview data has accumulated following the 3-month window, the model will transition back to the standard demand gap score.

This is naturally a hypothesis; it was evaluated independently in a later section. However, we took care to build this as a feature in this epic itself.

We also defined the following guardrails that can be set in config.yaml:

  1. Small constant guarding against division by zero in demand gap score: epsilon: 1.0e-6
  2. Days after a translation before switching from Hindi cold-start proxy to real Telugu views: cold_start_window_days: 90

Epic 3 & 4 Output

An output Excel was generated at the end of executing all the stories here with a constructed candidate pool expanding on the seed articles and categories with the features listed above, whilst filtering out low-quality articles. The pool is now ready for a translator to rate. In the final execution, 78 articles were retrieved.

Epic 5: Scoring Model

Translator Rating

This “supply-side” feature, that of translator’s own interest, is captured on a scale of three, roughly on par with swiping left or right as one does on some contemporary apps: -1 (dislike), 0 (neutral) or +1 (like).

In the final execution, 29 out of the 78 retrieved articles were rated either -1, 0, or 1.

Target Variable: Composite Score

We first normalized the translator’s ratings to \({0, 0.5, 1}\) and then combined them with the demand gap score into a composite score:

\[\text{Composite Score} = \alpha \cdot \frac{\text{Translator Rating}+1}{2} + (1-\alpha)\cdot\text{Demand Gap Score}\]

where the coefficient \(\alpha \in [0,1]\) was tuned through cross-validation.

Models: Baseline and Primary

We generated models that tried to learn the relationship between input features such as article length, quality tier etc and the target variable, composite score. The models will then be able to predict scores for articles that had not been evaluated as yet.

We compared three models:

  • Naive Baseline: Demand gap score used directly as predicted composite score
  • Linear Regression: This was meant to be a baseline model as we need a continuous variable to rank articles. Linear regression’s simplicity and interpretability via coefficients make it a natural reference point against which to measure the added value of more complex models. The model was preprocessed with continous variables standardized (with zero mean, unit variance), while binary category flags were left unscaled.
  • Random Forest: This was meant to be a primary model, as it can capture non-linear relationships between features and the composite score, which is something that linear regression cannot do. Random Forest is also useful for understanding feature importance. No scaling required here.

We will discuss the results later in Section 5.

Epic 6: Clustering (Topical Diversity)

Purpose

A scoring model that ranks articles by composite score alone is not sufficient for diversity. Without clustering, the recommendations could easily collapse into a single high-scoring topic cluster. Such an outcome would defeat the purpose of a personalized recommender and will accelerate translator burnout. Instead, to ensure topical variability, we performed clustering as a diversity mechanism, sampling top candidates from each cluster, rather than take just the global top N by score.

Choice of Algorithm

We used spectral clustering to cluster the articles by topic. The articles may share multiple category flags out of the 10-15 curated top-level categories, that is to say they are likely to have sparse binary vectors for categories. Consequently, while the articles are topically related, their relationship is geometric in a non-Euclidean sense. Connectivity-based similarity is better handled by spectral clustering than k-means.

The number of clusters \(k\) is determined by the eigenvalue gap: we examined the eigenvalues of the graph Laplacian and choose \(k\) where there’s a natural gap between the \(k\)-th and \((k+1)\)-th eigenvalue. The eigenvalue structure of the similarity matrix tells us how many natural topic communities exist in the article graph.

Feature Choice

Since the purpose of clustering is topical diversity rather than quality filtering, we excluded score and length features, and used category flags alone for clustering. This is to ensure that clusters reflect topical neighbourhoods and not quality or length neighbourhoods.

Epic 7: Recommendation Through Explore-Exploit

Takkari Donga: The Multi-Armed Bandit

With articles scored by interest and opportunity, and clustered by topic, we know what a translator’s preferences are, which are primed to be exploited. However, we also need to explore new territory to learn more about their preferences, a problem known as the Multi-Armed Bandit problem in reinforcement learning. Since we are dealing with the Telugu Wikipedia after all, one might call this the Takkari Donga problem2: how does the clever bandit (in Telugu) explore and exploit learned preferences and opportunities for the motivated translator?

A key element here is to model distribution of reader interest across topic clusters, that is to say, to ask where the latent demand is concentrated. Density estimation via Gaussian Mixture Model (GMM) was a likely approach: if GMM flagged an article as sitting in a sparse, uncertain region, that article is worth exploring. This is not to say that the translator will necessarily like the article, but that the user’s rating will give the most new information to improve the model. Consequently, a recommendation list would be a shortlist balancing exploit (articles with high composite scores) and explore (sparse GMM regions): the clever thief takes not just what is of high value, but also explores what could be of high value.

Recommendation Logic

The recommendation logic per cluster, can simply be stated as follows:

  1. We take all scored, clustered unrated candidates
  2. We fit GMM on feature vectors of already-rated articles
  3. We then compute the log-likelihood of each candidate under GMM
  4. Exploit: Within each cluster, we sort articles in descending order of composite score, taking the top 1 as an exploit candidate
  5. Explore: Within each cluster, we sort articles in ascending order of log-likelihood, taking the top 1 as an explore candidate
  6. We surface two such articles per cluster as recommendations. The total recommendations offered to the translator, therefore, would be \(2k\) articles, where \(k\) is the number of clusters. The translator’s ratings on these recommendations then feed into the feedback loop, updating the model for the next round.

Epic 7 Output

An eigenvalue-gap plot using spectral clustering was generated. Having fitted a Gaussian Mixture Model, we generated 1 exploit (highest composite score) and 1 explore (lowest GMM log-likelihood).

Epic 8: Take Feedback

Feedback Loop

Once recommendations are made, ratings given and the translations done, the model was intended to have a feedback loop with the following elements:

  1. When an update is triggered: The model will need to learn how readers have been taking to the translations. This would need sufficient time for organic readership to discover and engage with newly translated articles. To this end, we propose making updates every 3 months.

  2. What gets updated: We envisage a production update cadence as follows to give sufficient time for a model to learn:

    1. Continuous updates: Each time new ratings are added
    2. Annual: Retuning \(\alpha\) every 12 months
    3. Biennial: Retraining the full model

    This is obviously not possible within the term project cadence. But we simulated one manual update cycle using the author’s own ratings.

Results & Observations (Epic 9)

Evaluation Premise

We measured Mean Absolute Error (MAE) between predicted and actual composite scores. MAE is appropriate here because it treats all errors equally, which is appropriate given that the composite score is bounded [0,1]. For a recommendation system where consistent ranking is more important than avoiding large errors, MAE is perhaps more appropriate than other error metrics such as Root Mean Squared Error (RMSE)

Our assumption was that we would arrive at the following evaluation hierarchy:

  • Naive baseline: As mentioned, we used demand gap score directly as a predicted composite score and measure its error against the computed composite score. This is equivalent to setting translator interest weight, \(\alpha = 0\).
  • Linear Regression: MAE computed from model output and composite score via k-fold cross-validation
  • Random Forest: MAE computed against composite score via k-fold cross-validation.

Looking beyond MAE, feature importance scores from Random Forest could be examined to understand which signals drive recommendation quality most strongly.

Recommendation Quality

The model does more than predict scores, it produces a ranked list of recommendations. While we used MAE to tell us how well the model predicts scores, we will need to evaluate whether the ordering of recommendations is correct.

For such an evaluation, we first defined a threshold on the composite score. Given the range [0,1], we took the midpoint, 0.5, as the threshold. Articles above this score will be labelled as good candidates.

We then computed precision as the fraction of recommended articles exceeding this threshold, and compared this across naive baseline, Linear Regression and Random Forest.

Single-user bias

The approach would take a single user’s (the author) preferences (“translator ratings”) and build upon it, which could open up the chance that the translator’s niche interests could diverge from mass readership.

We had earlier proposed exploring this through a simple diagnostic: we compute a correlation between translator ratings and pageview gap across recommended articles. This will lead us to two possible outcomes:

  1. High Positive Correlation: The model’s two-sided approach is coherent: the translator’s interests and reader demand are well-aligned.
  2. Negative Correlation: Such an outcome would reveal a genuine tension between supply and demand - the translator’s niche interests diverge from mass readership. This is exactly the problem the model is designed to navigate (and therefore, the more interesting finding, academically speaking)

Result 1: Supply Demand Tension

The model’s core motivation - that there could be a genuine tension between a translator’s interests from mass readership - may be validated. Pearson r between a translator’s ratings and demand gap scores across rated articles calculates to -0.282, with a p=0.049.

Supply - Demand Alignment

Result 2: Evaluating Models

MAE

We expected Random Forest to have the least MAE followed by Linear Regression and Naive Baseline. Fascinatingly enough, that is not what we observed:

MAE by Model
Model Mean MAE Std MAE
NaiveBaseline 0.202 0.122
LinearRegression 0.217 0.032
RandomForest 0.216 0.042
  • Very High Demand Gap: In effect, we found that demand gap had essentially no variance, with a mean = 0.995, STD = 0.014 and the 50th percentile at 1. The candidate pool in retrospect was scoped to categories likely to be untranslated into Telugu; consequently, every article has near-zero Telugu views. This essentially meant the demand gap \(d \approx 1.0\) everywhere.

  • Model Ordering Failure: The naive baseline here was the demand gap. When the demand gap is close to constant, it is trivially hard to beat.

  • Precision@0.5 = 1.000 for all models: Essentially, composite_score = demand_gap \(\approx 1.0 > 0.5\) for all articles.

  • CV tuning returned \(\alpha = 0\): In effect, with near-constant demand gap, translator preference added no discriminating signal to the composite score on this dataset.

Feature Importance

We found log_wordcount to be of overwhelming importance compared with every other feature.

Top 15 Features by Random Forest

Result 3: Evaluating Cold Start Proxy

We computed a Pearson r between the Hindi proxy demand gap and actual Telugu demand gap on a separate set of articles. If r> 0.7, then the proxy would be justified.

We chose a list of generic Wikipedia articles that are likely to be available on both Hindi and Telugu Wikipedias:

Evaluating Cold Start Proxy
title hindi_proxy actual_gap abs_proxy_error
Mahabharata 0.836706 0.967915 0.131209
Shiva 0.917235 0.988801 0.071566
Yoga 0.933242 0.995607 0.062365
Telugu language 0.995403 0.943378 0.052025
India 0.952274 0.996127 0.043853
Hinduism 0.968478 0.996320 0.027842
Hyderabad 0.974165 0.989439 0.015274
Turmeric 0.980290 0.994050 0.013760
Rice 0.988496 0.997720 0.009223
Andhra Pradesh 0.965983 0.973239 0.007257

At a computed r=0.474, p=0.0040, the choice of a proxy was not justified:

Validating Hindi as a choice for Cold Start Proxy

Result 4: GMM Exploit/ Explore Mechanism

We received the following set of recommendations from the GMM model:

Exploit/ Explore Results
title cluster_id recommendation_
type
composite_
score
gmm_log_
likelihood
Hopscotch (Cortázar novel) 0 exploit 1.00 12.05
The World According to Garp 0 explore 1.00 -37.75
Twenty-Four Histories 1 exploit 1.00 12.54
Vijayawada 1 explore 0.94 9.91
Robert Brooke-Popham 2 exploit 1.00 10.48
Royal Malay Regiment 2 explore 1.00 9.54
John Tremayne Babington 3 exploit 1.00 9.92
Kalinga (region) 3 explore 0.98 9.79
Nüwa 4 exploit 1.00 9.52
Michel de Montaigne 4 explore 1.00 11.19
Sextus Empiricus 5 exploit 1.00 12.57
Sarvepalli Radhakrishnan 5 explore 0.92 11.52
Laozi 6 exploit 1.00 9.98
Lord Mountbatten 6 explore 1.00 7.38
Jizi 7 exploit 1.00 11.44
Pierre Bayle 8 exploit 1.00 13.16
Rajahmundry 8 explore 0.93 10.64
Trincomalee 9 exploit 0.99 11.52
Malayan campaign 10 exploit 1.00 11.48
Mazu 10 explore 1.00 11.03

The explore/ exploit distinction has partially collapsed here: several explore candidates have composite scores of 1. This is a consequence of what we saw earlier in Results 2: composite score has lost its discriminating power, even as GMM is doing the differentiation work alone.

It is also notable that The World According to Garp has a GMM value of -37.75. This is because it is the only article in the pool whose category, Novels about transgender topics appears in zero rated articles. This makes it genuinely maximally out-of-distribution for the GMM.

That being said, it is quite fascinating to see recommendations emerging from such a diverse set. Seeding Singapore history gave us Robert Brooke-Popham (a British Air Marshal), and the Malay Regiment for exploit and explore from Cluster 2. Likewise, we see Nüwa (a Chinese goddess) and Michel de Montaigne (a French philosopher) for exploit and explore from the same cluster, Cluster 4.

Result 5: Clustering

Clustering resulted an extremely unbalanced set. While Cluster 0 has 53 out of 78 articles (68%), there were 42 disconnected components in 78 articles. This meant the category co-membership graph is very sparse. Effectively, topical diversity was largely undermined, making it one giant cluster and 11 micro-clusters.

This was an entirely unexpected result as we deliberately chose to seed our categories in a diverse manner. Further drilling down, we note that four of twelve curated categories - such as Telugu monarchs or Buddhist holy sites in Andhra Pradesh - contributed zero candidates after quality filtering. This is because English Wikipedia’s coverage of these topics is predominantly at stub or start tier. What is utterly fascinating here is that this is not incidental: the same structural under-representation that motivates Telugu Wikipedia expansion is reproduced on the supply side. The quality filter, which was designed to source translatable articles, inadvertently maps the contours of the original problem.

In effect, our clever multi-armed bandit - the takkari donga - needs to operate in English as well as Telugu. We will not just need to find articles to translate to Telugu, we also would need to find articles a few degrees removed from Telugu culture in English Wikipedia and perhaps work towards improving their quality as well.

Summary

In this paper, we have confirmed our core motivation: that there is a genuine tension between supply (articles in the English Wikipedia that a translator would like to translate) and demand (what gets read in Telugu Wikipedia). We have seen that the demand gap had collapsed in this evaluation - either interesting articles of high quality had no translations at all to Telugu, or that categories likely to be of interest to Telugu readers did not have high quality articles in English Wikipedia. The cold start proxy hypothesis - that we may use Hindi pageviews as a proxy for Telugu pageviews while Telugu pageviews build up - was not empirically justifiable.

Next Steps

The code pipeline exists, and can be generated for any Wikipedia translator across any language pair. An immediate next step could be try the model out with more focussed categories or seed articles, whilst bearing in mind the quality of the articles. Other languages such as Kannada or Tamil could also be used as a cold start proxy.

Alternatively, replacing the language-proxy demand gap with a normalized quality tier score (quality_tier/3.0) provides a content-based signal that is immune to the pageview degeneracy problem. Indeed, doing so gives us the following fallback shortlist:

Fallback Shortlist
title cluster_id composite_score
Zuo Zhuan 0 1.000000
Vijayawada 1 0.943860
Robert Brooke-Popham 2 0.666667
Kalinga (region) 3 0.983580
Michel de Montaigne 4 0.666667
Sarvepalli Radhakrishnan 5 0.923803
Lord Mountbatten 6 0.999332
Jizi 7 0.666667
Rajahmundry 8 0.925769
Trincomalee 9 0.666667
Malayan campaign 10 0.666667

(Scores of 0.067 reflect quality_tier = 2 normalized to 2/3, while 1 reflects quality_tier = 3)

Conclusion

Whilst such tweaks can address some limitations in the model, to this multi-lingual writer, the problem perhaps is larger and must be seen across languages. The challenge here is not to bring English Wikipedia articles to Telugu Wikipedia, but to bring articles of interest to a Telugu-speaking milieu to both English and Telugu Wikipedias. Takkari donga, the clever multi-armed bandit, was multi-lingual all along. They need to write not just in Telugu, but in that other Indian language spoken and read widely, English.

Bibliography

Johnson, Isaac, and Pablo Aragón. 2022. “Research:cross-Lingual Article Quality Assessment - Meta-Wiki.” In Meta. https://meta.wikimedia.org/wiki/Research:Cross-lingual_article_quality_assessment.
Ko, Hyeyoung, Suyeon Lee, Yoonseo Park, and Anna Choi. 2022. “A Survey of Recommendation Systems: Recommendation Models, Techniques, and Application Fields.” Electronics 11 (1). https://doi.org/10.3390/electronics11010141.
Langviews Analysis. n.d. https://pageviews.wmcloud.org/langviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2015-07-01&end=2026-03-15&page=Hyderabad&sort=views&direction=1&view=list&page=Hyderabad.
Silva, Nícollas, Heitor Werneck, Thiago Silva, Adriano C. M. Pereira, and Leonardo Rocha. 2022. “Multi-Armed Bandits in Recommendation Systems: A Survey of the State-of-the-Art and Future Directions.” Expert Systems with Applications 197: 116669. https://doi.org/https://doi.org/10.1016/j.eswa.2022.116669.
Spivak, Gayatri Chakravorty. 1988. Can the Subaltern Speak? 271–313.
Wulczyn, Ellery, Robert West, Leila Zia, and Jure Leskovec. 2016. “WWW ’16: 25th International World Wide Web Conference.” (Montréal Québec Canada), April 11, 975–85. https://doi.org/10.1145/2872427.2883077.
Zhang, Min, and Yiqun Liu. 2021. “A Commentary of TikTok Recommendation Algorithms in MIT Technology Review 2021.” Fundamental Research 1 (6): 846–47. https://doi.org/https://doi.org/10.1016/j.fmre.2021.11.015.

Footnotes

  1. With tongue firmly in cheek, the title makes a pun to not just the Multi-Armed Bandit problem, but also to Gayatri Chakravorty Spivak’s seminal essay, Can the Subaltern Speak? (Spivak 1988). In that piece, Spivak famously questioned whether subalterns - colonial populations excluded from hierarchies of power - could express their native thought / reasoning without conforming to Western ways.↩︎

  2. Takkari Donga, or the clever bandit, is also as a reference to much forgotten Telugu belter from early 2000’s where the singer celebrates the fact that he finds success going off the beaten path.↩︎