Author: gkjohn

  • Thoughts on Wikipedia and its Language Challenges

    This was written soon after Wikimania 2010 for the New Indian Express.

    (With inputs from Arun Ram and Wikimania 2010 attendee, Srinivas Gunta)

    A common point of discussion in matters regarding the global Internet is the inequitable distribution of content across languages, with a skew towards English and the languages of the traditional geographies of the Global North. Wikipedia is not immune to these inequalities either, and this was a major point of discussion at the 2010 edition of Wikimania, which recently concluded in Gdansk, Poland. Wikimania is an annual gathering, organized by the Wikimedia Foundation, of Wikipedians, as those who contribute to Wikipedia are called, who meet to discuss the state of the various Wikipedia projects and to chart a course for the year ahead.

    What stood out was the scale at which the Wikimedia Foundation is thinking. Its strategy plan aims to increase reach to 680 million unique visitors globally by 2015 (from the current 388 million). The aim is to achieve 12% annual growth in the Global South and 4% annual growth in the Global North; in other words, most of the growth will come from Wikipedias in the many languages of the Global South.
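    Those headline numbers themselves show why the growth must come largely from the Global South. A quick back-of-the-envelope check (a toy calculation of my own; it assumes a five-year window, since the plan's exact baseline year isn't stated here):

```python
# Implied overall growth rate from the strategy plan's headline numbers:
# 388M unique visitors now, 680M by 2015 (five-year window assumed).
current, target, years = 388e6, 680e6, 5
cagr = (target / current) ** (1 / years) - 1
print(f"{cagr:.1%}")  # roughly 12% a year overall
```

    The implied overall rate is close to the 12% targeted for the Global South alone, so the Global North's 4% contributes comparatively little of the increase.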

    Jimmy Wales’ keynote address at Wikimania this year focused on countries of the Global South, and he presented video interviews with active Wikipedians from the Bangla and Tamil Wikipedias, buttressing the importance of the Foundation’s focus on smaller languages and the varied geographies represented within Wikipedia projects. Among other things, the Foundation’s strategy plan aims to foster the growth of smaller Wikipedias: by 2015, the aim is to have 100 Wikipedia language versions with more than 120 thousand “significant articles” each. To this end, the Foundation also aims to bootstrap community programs in key geographies: India, Brazil, and the Middle East/North Africa.

    Two presentations highlighted the challenges and the possibilities ahead. Achal Prabhala, a Wikimedia Advisory Board member, spoke about the need for local representative bodies of the Wikimedia projects, or Chapters, in countries that are linguistically underrepresented. Achal’s larger point is that there is a distinct relationship between local growth and the existence of local chapters, and that geographies in the South present enormous prospects for growth. They also present prospects for an increase in scope, which could mean, in turn, new ways for Wikimedia to grow the world over. On a cautionary note, Harel, from Wikimedia Israel, spoke of experiences that have run contrary to expectations: local Wikimedia Chapters may find themselves in adversarial relationships with local Wikipedian communities, and there is often a trust deficit between the two sides. Harel spoke of the need for local chapters to treat editing communities as peers and equals. Chapters are meant to do outreach, he cautioned, while editing is the preserve of the community and something the community must be left to do without chapter interference.

    Given this inequitable distribution of linguistic content within Wikipedia projects, external organizations have seen a gap to fill, and there were presentations on translation toolkits and on machine translation of content to populate otherwise sparse language Wikipedias. This is a route that has met with some resistance. An example is a translation toolkit that Google introduced to blend computer-aided, or machine, translation with human translation. Users of this tool have been translating popular English-language articles into various local languages with varying degrees of success. Ironically, however, the size of the existing active user base in each of these Wikipedias may itself determine how successful these efforts will be. Translators using the tool needed a lot of hand-holding and oversight; after initial hiccups, the Tamil Wikipedia has been able to engage with Google such that these contributions now meet quality parameters too, thanks to the availability of more active users.

    It is interesting to see how multiple approaches are being deployed to solve one common problem: a lack of linguistic diversity matching the proportions of Internet users online. Some tension between organic, community-led translation efforts and efforts focused on automated translation is to be expected, and Wikimania provided both sides a venue at which to engage with each other, resolve their differences and work collaboratively.

    Here’s what needs to be kept in mind while steaming ahead in India: English, with 225 million speakers in India, is also an Indian language. Several Indian editors already contribute to the English Wikipedia. So the emphasis needs to be on boosting contributions in all Indian languages, including English, rather than on a simple ‘Indic languages’ vs ‘English’ paradigm. Innovative ways to boost edits, and bring in new editors, include holding Wikipedia academies across the country; finding low-cost ways to create public access to the Wikipedias in places like public libraries; and removing technological obstacles related to scripts, keyboards, etc.

    With the Foundation’s new thrust on the creation of local Chapters, and with the India Chapter in its final stages, one can expect a greater degree of focus on these issues, both within India and in other under-represented areas of the world.

  • Thoughts on the Future of Journalism and the Print Media

    Newspapers and the wider print media have traditionally formed an important bulwark against excessive intrusions of the State, the Fourth Pillar, as they have traditionally been referred to. However, their business model has been built on an artificial scarcity of pathways of information to the general public. Since newspapers owned this channel, they monetised it at both ends: advertisers paid to access the channel and its captive audience, while the audience paid upfront, even for an entire year, for access to what was their limited source of news and information. The Internet destroyed this artificial scarcity of channels and of access to ‘facts’, and this is a genie that cannot be put back in the bottle. In many ways this is equivalent to the problem the music and video industries have faced as well: how do you get consumers to pay for content that they can otherwise, legally and illegally, access for free? The print media is no longer the only channel for news and certainly isn’t the fastest; it is as hard to beat digital pathways as it is to compete against ‘free’. The Nieman Journalism Lab recently wrote that “The Dallas Morning News now gets 38 percent of its revenue from circulation, 54 percent from advertising, and 8 percent from contract printing [and] those numbers are a far cry from the way it used to be … 80% of their revenue came from advertising and 20% came from circulation.” Which leads to the question of whether the era of the advertising- and subscription-funded monolithic news organization is fast ending, and what this means for the traditional news organization.

    The recent example of Wikileaks, “the world’s first stateless news organization” as Jay Rosen, a professor of journalism at New York University, called it, distributing tens of thousands of pieces of classified information from America’s war in Afghanistan points to a future where newspapers are not the first port of call for whistle-blowers, and hence may no longer be sources of facts, and to an interesting model of value addition for journalists and the traditional print media. Jeff Jarvis, an associate professor at the City University of New York’s Graduate School of Journalism, believes that this value addition is what journalists and media organizations can bring to facts that are otherwise both free and freely available, and says that “Thanks to the internet, the marginal cost of sharing information today is zero [and] this change in market reality forces us to examine journalists’ true value to the public in the market.”

    Given the proliferation of channels of news and facts, most of them both free and freely accessible, there seems to be little value in ‘news’ and ‘facts’ as mere reproductions of events. But this poses a challenge to the reading public as well: an over-abundance of sources of news and facts without the ability to filter, rank or contextualise them. Another possible role that journalists and the print media could play is as a filter to these multiple incoming sources of facts and news, building filters of authenticity and adding context to these facts. That said, it is probable that such fact checking and verification will, in the future, be crowd-sourced, as Truthsquad, a “community fact-checking experiment”, and SwiftRiver, a “free and open source software platform that uses algorithms and crowd-sourcing to validate and filter news”, show. A necessary ingredient in building such filters of authenticity is trust, and this is something journalists and the print media should keep in mind: to undermine the element of trust is to undermine your relevance and future business models. The current brouhaha over the paid news syndrome is a malaise that will render those sources untrustworthy, and thus without a necessary ingredient for building future models of sustainability.

    Which brings us to an emerging trend: data-driven journalism. Governments across the world, with the United States and the United Kingdom taking the lead, have begun to disgorge vast quantities of hitherto unavailable data into the public domain, and as the Wikileaks example shows, this is an opportunity for the print media to add value and context to such data and weave a narrative that data, as a standalone object, lacks, or, as the Nieman Journalism Lab put it, “… data in the service of somehow getting to the “big picture” about what’s really going on in the world”.

    It remains to be seen, and we remain sceptical, whether placing news content behind paywalls will offset losses in advertising and subscription revenue; a nuanced approach, in which the end consumer perceives greater value in such news, is likely to succeed better than a simple pay-to-view model. The Guardian has been experimenting with a very interesting platform-based model that it calls the Open Platform. This is, in its own words, “… a suite of services that enables partners to build applications with the Guardian.” The long-term goal and vision of this project is to embed the Guardian as an elemental part of the Internet, rather than have it be only a destination, with the attendant risks that being a destination carries. The Open Platform “… aims to make the Guardian a useful resource to partners all around the globe who want to leverage the value the Guardian can bring to their business.”

    Mike Masnick has been thinking through the challenges that traditional media face from the proliferation of digital networks, and at a recent event (wonderfully titled Techdirt Saves Journalism) he distilled a set of ideas that the print media could experiment with. In short, he writes that the media must mine the data to find the relevant, elevate their writers, create a platform for their community, think about multiple revenue streams, expand their brands, and absorb changing ideas about “news” and its traditional notions of production.

    India, of course, isn’t here yet because it lacks ubiquitous digital networks, but this will change rapidly with the roll-out of 3G wireless networks and rapidly falling handset prices. At which point it might all be too late.

  • The Case for a Unique Identification System in Public Education

    During a visit to Hubli-Dharwad in November 2009, the local pages (Hubli-Dharwad-Belgaum) of the Times of India carried the following headline: “Sky is their roof; the road their classroom – Government sanctions school without building”. The school in question was the government primary school in Ram Manohar Lohia Nagar in Hubli, with 67 students in classes 1-4 and one teacher who doubled up as the head teacher. With multiple such reports, it is not very surprising that few public institutions rival our government primary schools in public dissatisfaction, and all along we have been making significant investments in schools. It is estimated that in Karnataka we spend about Rs 6,500 per child per year, and in one study done by PROOF of primary schools run by the then Bangalore Mahanagara Palike (BMP), the number exceeded Rs 10,370 per child per year. So how do we explain this lack of performance?

    For a long time, the focus has been (and continues to be) on the input side, on schools rather than on schooling, and the primary questions were whether children have access to a school and whether they get uniforms, books, mid-day meals, etc. As the “road-side” primary school shows, even this fails often enough. While inputs are required for any process to work, in this case they have come at the cost of a focus on the outcomes of the education system. For example, there is little or no information on the learning levels of children prior to a child’s first “public” exam (we believe there is general agreement that internal school reports are not good indicators in most cases) when she reaches class 10, by which time it is too late to make course corrections with respect to quality.

    About the only data that has been consistently available over the past five years has been from the Annual Status of Education Report (ASER). Some key findings for Karnataka: only about 39% of children in standards 1-8 can read a standard 2 level text, implying that around 5.3 million children in the state are unable to read in their medium of instruction. The performance in math is even more appalling. Over 30% of children in standards 1-8 could not recognize double-digit numbers. Less than 20% of all children could do simple division and less than 30% could do simple subtraction. This means that between 10 and 11 million children cannot do simple math. Only 11% of children in standards 3 to 5 can read an English sentence, and only 35% of children in Bangalore can read English.

    This pathetic state of affairs threatens to ruin the lives of millions of children in Karnataka, and of much larger numbers across the country, and it would not be entirely out of place to say that the failure of the schools is gradually destroying democracy. The oft-repeated rhetoric of elementary education being a fundamental right (now further enshrined in the Right to Education Bill 2009) seems to be accompanied by an inability to make the schools work for the children. It is true that over the past ten years enrolment has increased, but enrolment does not mean attendance. Further, attendance does not imply learning, for in many schools across the state pupil-teacher ratios are very high, and given that teacher absenteeism is greater than 25%, these ratios get further skewed against children. Single-teacher schools, such as the one in the Hubli case, are common, and multi-grade teaching even more so.

    It is our belief that a universal and unique identification system will help improve quality outcomes in a significant manner. What this means is that a unique identity should be assigned to a child from birth through the end of her education; this unique ID will help ensure that all her rights as a child are available to her and that she receives a quality education.

    In the ICDS anganwadis, the anganwadi worker has to worry about health, nutrition and education issues covering pregnant women, lactating mothers and children aged 0-6 years. Clearly the education component suffers and currently has to be supplemented from the outside. It would be meaningful if data on all children were collected from this stage onwards, so that the system would be able to (a) see to health needs, if the database is accurate and updated regularly; (b) check for issues like learning disabilities, which can be “cured” if remedial interventions are made early enough; and (c) ensure that children are admitted to primary schools at the appropriate age and that the schools are made aware of every child’s proficiencies.

    In the primary school system, there is a definite need to track migration. For example, children may be enrolled in a rural school, and during difficult times the family may migrate to urban areas for livelihood reasons; the child will then also be enrolled in an urban school and therefore counted twice.

    In the primary school system, there is also a need to track the attendance of both children and teachers on a regular basis. It is not uncommon to find, when you visit a school, that declared enrolment and attendance are higher than the actual numbers.

    Remedial interventions are required to bring what the system calls “slow learners” up to mainstream levels. This means we need to know who needs help, and that is possible only by administering diagnostic baseline tests and logging this data on a child-by-child basis. Currently, in the government’s Parihara Bodhane programme, teachers are asked to identify “weak” children, and the number of children in these initiatives is limited by the budget. Moreover, children are not tracked, because this is considered a burden on teachers (indeed, every remedial intervention is considered a burden by the teacher community). We think it is vital that all children be at mainstream levels within the next 3 years, and this will be possible through budgetary support for planned remedial interventions, accompanied by teacher training and teacher support for this programme, and, finally, continuous child-by-child tracking of outcomes. Once remedial efforts are completed, children need to be tracked so that we can ensure that their acquired 3R skills are not lost. Libraries are a great vehicle for tracking children’s proficiencies: it is important to track how many books each child borrows every month, so that we know, child by child, who is NOT borrowing; these are vulnerable children who need attention. Beyond primary school, we should be able to track children going to secondary schools, vocational schools or even colleges.
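    As a thought experiment, child-by-child tracking of this sort needs little more than a unique key joining a few simple records. The sketch below is entirely illustrative: the field names, the ID format, and the pass mark of 40 are my own inventions, not drawn from any government system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ChildRecord:
    """Toy record keyed by a unique child ID (illustrative fields only)."""
    child_id: str
    baseline_score: Optional[int] = None  # diagnostic baseline test score
    books_borrowed: Dict[str, int] = field(default_factory=dict)  # month -> count

def needs_attention(record: ChildRecord, month: str) -> bool:
    """Flag a child with a low baseline score or no library borrowing that
    month. The pass mark of 40 is an arbitrary illustrative threshold."""
    low_score = record.baseline_score is not None and record.baseline_score < 40
    not_borrowing = record.books_borrowed.get(month, 0) == 0
    return low_score or not_borrowing

r = ChildRecord("KA-HUB-000123", baseline_score=35)
r.books_borrowed["2010-06"] = 2
assert needs_attention(r, "2010-06")  # low baseline score flags her anyway
```

    The point of the sketch is that the analytics are trivial once the unique key exists; it is the collection and upkeep of the data, child by child, that is the hard part.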

    There are many spin-offs from this tracking methodology which could feed into government budgets: one could track how effective the Mid-Day Meal Scheme is, or the outlays for innovative government schemes like scholarships, cycles and free book distribution. And, from a management perspective, we could even track the budgets of individual schools and provide decision-makers with the kind of information they need to ensure that schooling happens and that the focus is on the child.

    For this system to work, multiple departments within government will need to use it: the Education Department, Women & Child Development, the Health Department and the Labour Department, as a minimum, should be users of the system, driving multiple applications and reports based on it. The success of such an initiative is predicated on a number of applications depending on the system.

  • Thoughts on the Unique Identity Framework for India, née Aadhaar

    Background:

    I wrote this piece many months ago, when the UIDAI white paper leaked. Some of the commentary may now be dated because the goal posts have shifted since. I have consciously ignored the privacy and security aspects because I am no expert in those areas.

    I also need to credit a friend, let’s just call him Vikram, for this piece.

    Bottom line:

    Trying to do too much. The smallest intervention becomes a massive undertaking in India because of scale and organizational/administrative complexity, so you should scope projects as narrowly as possible. If the main purpose is to bring marginalized people into mainstream economic life, then you should focus on getting them an ID rather than on eliminating redundant verification activity or eliminating fraud, both of which can be happy by-products further down the line. Why not: make a national ID number available to anyone who wants it, target it at the people who currently lack any form of ID, and let things evolve from there. It doesn’t have to cover everyone, or be the only recognized form of ID, or be real-time and state-of-the-art, to do most of what you want it to do.

    Rough scope evaluation:

    Aim 1: Getting everyone an ID; Project component: enrollment; Advantages: brings people into economic and social life; Disadvantages: big-brother possibilities (everybody means well in the beginning)

    Aim 2: Eliminating redundant verification activity; Project component: on-demand authentication; Advantages: efficiencies over time; Disadvantages: business process disruption across swaths of the economy

    Aim 3: Eliminating entitlement fraud; Project component: data de-duplication using biometrics; Advantages: helps balance sheets of government agencies; Disadvantages: alienates those who benefit from current arrangements (customers as well as the government employees who abet them)

    I don’t know the relative costs of the three components, but I suspect that an incremental approach to 1, with a thinned-down version of 3, would be the 80/20 solution here. On the security side, all of this boils down to persons or individuals and what they can do. Allow me to think aloud here…

    Identity

    – Any kind of marker that defines or demarcates a person or individual. These persons can be real, fictional, fictitious, whatever. Captain James Kirk is an identifiable individual in the world of Star Trek. Avatars on Second Life or gaming sites are identifiable individuals within those universes. Witness protection and intelligence agencies assign fictitious identities to real individuals. In the serious world of business and government, identity is about each unique existing individual having a unique identity or marker to go along with it that can be used in official business. Most people have many such markers (credit card number, passport number, social security number, tax or voter ID number, combination of name and birthdate) and some countries have one marker that is close to universal (almost everyone in the US has a social security number, for instance). In India, some have many markers and many people have no official markers at all, despite being unique individuals. Having many markers is not really a problem except in an efficiency sense. My bank identifies me by my account number, my university used to have its own 9-digit ID for me, immigration agencies track me by my passport number, etc. etc. Some of these are parasitic on my social security number (which I provided when applying for a bank account or applying to college), but many are not. And the cost (in terms of business process changes, technology investments, confusion, etc.) of getting everyone in the economy to subordinate their own identification numbers to a common national number is going to be prohibitive in any normal decision-making horizon.

    Authentication

    – When you claim to be some identifiable individual (the owner of some identity marker), authentication is about making sure you really are that individual. First of all, we should decide whether we care more about false negatives (people falsely claiming to be someone else and getting away with it) or false positives (people truthfully claiming to be themselves but not being believed, maybe because they don’t have the paperwork to prove it). If you try to solve both, you end up with the biggest of all possible projects and also the least likely to succeed, because the solution to one exacerbates the other and only the all-singing all-dancing perfect solution (in which all real-world difficulties are assumed away) gives the illusion of bridging the tension. If you care more about false negatives, you’ll make it harder to get a valid identity marker, and there go the poor and the marginalized. If you make it easier to get one number, you’ve made it easier to get a second. That’s why they came up with the biometrics, but for that extra bit of security, they’ve fingerprinted an entire population (don’t tell me that won’t be abused) and, I suspect, added a whole lot of processing cycles on the IT side (I imagine it’s easier to look for matches of a 9-digit number than for fingerprint matches). The problem of identity theft (rather than the creation of false or duplicate identities) doesn’t even require the extra security. A 9-digit random number is pretty secure in the sense that it’s virtually impossible to guess and only you and maybe a handful of other people know it. [I won’t even get into the problems with biometrics. Fingerprint matches are far from unique at standard levels of detail, so it’s no silver bullet, and once fingerprint identification is used for high-value financial transactions, expect a rash of de-digitization…it even rhymes with de-duplication!]
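    The claim that a 9-digit random number is virtually impossible to guess is easy to sanity-check with a little arithmetic. This is a toy sketch of my own, not anything from the UIDAI design; the function name is invented:

```python
import random

ID_SPACE = 10 ** 9  # all possible 9-digit (zero-padded) numbers

def issue_random_id(rng=random.SystemRandom()):
    """Return a random 9-digit identifier as a zero-padded string."""
    return f"{rng.randrange(ID_SPACE):09d}"

# Probability that a single blind guess hits one specific person's ID:
p_one_guess = 1 / ID_SPACE  # one in a billion

# Even a million blind guesses hit a given ID only about 0.1% of the time:
p_million_guesses = 1 - (1 - p_one_guess) ** 10 ** 6
```

    The arithmetic supports the point in the text: for preventing impersonation by guessing, a random 9-digit number already does most of the work, without any biometrics.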

    Authorization

    – Once we know you’re you, authorization is about defining what you’re allowed to do or what you’re entitled to. Here, that whole aspect is (correctly) left to the individual service providers.

    In general, I think there are more things in heaven and earth than are dreamt of in any of our philosophies, and these people would do well to ponder that. IT projects always take many times more time and money to finish than was bargained for at the outset, and that’s only counting the ones that more or less reach their goal. Incentive and coordination problems will cripple (or disfigure beyond recognition) any large project in a complex organization, and you’re off the scale here in both size and complexity. Politicians and public figures and academics have a built-in preference for ambitious/sexy/grandiose projects, but the efforts that stick are the ones that start small and evolve.

    Some of my concerns are listed under project risks, but there’s no clue there, beyond platitudes, as to how they might be addressed. Project risks are side issues that can derail the project if you have bad luck. The obstacles we’re talking about here are what the project (or at least this document) should be about. It’s trivial to collect data, put it in a database, and then query the database from a transaction site. It’s not trivial to do it for a billion people, or through hundreds of overlapping independent agencies and politically antagonistic local governments. Don’t show me diagrams of how you’re going to approach the trivial problem and then mention, by the way, that there might be some complications. If you have a solution to the complications, shout it from the rooftops. Otherwise, come back when you have one, or let’s talk about how we can find one.

    Biometrics

    – It seems like they think that the trade-off between entitlement fraud and inclusiveness can be broken by this magic technology called biometrics. If only. If there’s one thing that technology executives in large organizations agree on, it’s that the technology is never the solution. Technology providers are less wise on this point but even they acknowledge it in their less commercial moments. In any case, I don’t know how much accuracy biometrics adds beyond what you could get by triangulating the information that’s normally used in verification (biographic data as attested by documents plus distinguishing facial features). I bet it’s not much, especially once you consider the failure rate of biometrics itself (since nothing is foolproof). It does however add a layer of certain costs for infrastructure, training, etc. And the privacy implications are chilling.

    Demand-driven

    – really? Enrolling agencies are looking at business process disruption, technology investments, and an extra operating burden, and not only in the first ten years, so they will have to be strong-armed.

    It may be that the savings in the larger economy from not having to repeat verification procedures will offset the costs over time, but businesses don’t make decisions with a ten-year horizon (especially one contingent on the success of a government project of unprecedented scale and complexity), so I wouldn’t expect them to queue up to ditch their current procedures. People who were getting duplicate benefits will lose out under this, so don’t expect them to rush forward either. And I’m sure some tribes like being out of view of the state. (In the US, the Amish have resisted social security numbers, I believe successfully.) Also, by making more and more services/entitlements/rights dependent on the ID, you’re placing unrealistic reliance on the benevolence and competence of the enrolling agencies. Ultimately we’re talking about hundreds of millions of vulnerable people interacting with millions or hundreds of thousands of petty officials who have been given extra work to do for the benefit of people they very likely regard with distaste. Expect the worst. Also, I don’t think “network effects” means what the authors think it means. It’s not the case that the more people who have an ID, the more beneficial it is for me to have one. It is the case that the more government or other services become contingent on having an ID, the more beneficial it is for me to have one. That’s a very different thing, and not so different from what they disapprovingly call a mandate (except that a mandate would be cleaner).

    Data quality

    – Why do you think it’s so easy to duplicate identities or, if you prefer, so difficult to create a unique record for each individual? Lots of overlap in names, lots of names that don’t follow the Western or North Indian convention of given name followed by family name, lots of people with no clean permanent or even present address, haziness around dates of birth (stop the first twenty people you meet in any village and ask them if they know their exact birthday). How exactly are you going to address this? These are problems with the data themselves, not with how the data are collected. Some of these might disappear over time (three generations from now, I imagine there won’t be anyone left who doesn’t know their birthdate), some of them can be nudged out of existence (we could force South Indian names into a given name/surname pattern, as many of us have done out of necessity), and some could possibly be combated through mega-projects of their own (if we could somehow ensure that everyone who lacks a proper address now has one in twenty years’ time, we would have accomplished something much grander and worthier than a national identity scheme). If you launch a national identification number without solving these problems, you’re just going to import a lot of bad data into an arena where it can do much more damage, because now there’s a single point of failure as far as the individual is concerned: earlier, data problems might mess up your gas connection but not your phone application or your ration card, because each agency had its own idiosyncratic way of doing things; now, everything is connected. The document refers to KYR standards for the validity of demographic data, and that sounded promising, but when I looked around for information on these standards, it turned out to be “know your customer” but for residents rather than customers, which wasn’t a big help.
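    The name-overlap problem is easy to demonstrate. Below is a toy sketch of my own, using nothing more than Python’s standard difflib, of why raw string similarity cannot cleanly separate spelling variants of one person from genuinely different people (the names are invented examples of a common transliteration pattern):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude edit-based similarity between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# One person, recorded two ways by two different agencies:
same_person = similarity("Lakshmi Narayanan", "Laxmi Narayanan")

# Two different people with coincidentally similar names:
different_people = similarity("Lakshmi Narayanan", "Lakshmi Narayan")

# The two different people score *more* similar than the two spellings
# of the same person, so no fixed threshold can de-duplicate correctly:
assert different_people > same_person
```

    Real de-duplication systems use transliteration-aware matching across multiple fields, but the underlying ambiguity pointed to above does not go away; it just moves into the match rules.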

    Thoughts

    How about this? A chunk of the population is being left out of economic life and social programs for lack of an accepted identity marker. Why not provide a unique government-backed ID to everyone, or to anyone who asks for it? It doesn’t have to be foolproof, just good enough for the purpose. That way, if you have a usable identity marker already, you keep using it; otherwise, you apply for the government’s random-number ID. Service providers then accept the UID along with what they’ve always accepted, and they’re free to pressure customers into getting a UID if they like. That way, you have an order of magnitude fewer people included in the UID project, and the sequence in which people are brought into the system respects your pro-poor agenda much better, because it starts with the people who most urgently need an identity marker and only then (on a timeline decided by individuals, or at most by individual businesses or service providers) gets taken up by people for whom it would be a marginal convenience. [This is roughly how the social security number came to be the de facto national identification number in the US.]
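    The opt-in design is worth making concrete. In the toy model below (the marker types and function name are my own inventions, not drawn from the UIDAI documents), a service provider simply adds the UID to the set of markers it already accepts, so nobody holding an existing marker is forced to change anything:

```python
# Identity markers this (hypothetical) service provider already accepted:
LEGACY_MARKERS = {"passport", "voter_id", "ration_card"}

# After opting in, it accepts the UID as well, replacing nothing:
ACCEPTED_MARKERS = LEGACY_MARKERS | {"uid"}

def can_verify(customer_markers: dict) -> bool:
    """True if the customer holds at least one accepted identity marker."""
    return any(kind in ACCEPTED_MARKERS for kind in customer_markers)

# Existing customers are untouched; the previously excluded now qualify:
assert can_verify({"passport": "K1234567"})
assert can_verify({"uid": "402718493"})
assert not can_verify({})  # still no marker at all, still excluded
```

    The design choice is that the UID is additive rather than a replacement, which is why the disruption to existing business processes stays close to zero.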

    Think of how complicated the census is, and that just involves going door to door and counting people while trying to avoid double-counting. Now you want to catalogue them uniquely and be present in every interaction they have with a service provider? Come on. I don’t know of a single large company that has a unique identity for each employee, matched to an up-to-date profile of what they can do, and a reliable method to ensure that someone fiddling around on the network is who they say they are and isn’t doing something they’re not supposed to. The best companies have good, robust identity, authorization and authentication for a small group of employees and the bare minimum (including bad data and processes, where they can be tolerated) everywhere else, because even something as simple as rolling out a smart card to 50,000 employees can take years, owing to the logistical and organizational hurdles. It was a big achievement a few years ago when Johnson & Johnson figured out a way to assign unique identifiers to its 150,000 or so employees so that it could keep track of them as they moved through the company. Now maybe these companies are just stupid, but I wouldn’t bet on it. I would expect the difficulty to increase exponentially with the number of people covered, or at least with the number of independent decision points involved, and companies have the advantage of a command-and-control structure that democracies don’t and shouldn’t have. If the success of your project requires pretty much everyone in the economy to do things differently (“business process change” is easy to say but traumatic for anyone in the middle of it), you can assume you’ve succumbed to hubris.