‘Aadhaar system is a great example of using open source technologies and extensive data-driven analytics to achieve its scale and quality’
Building the world’s largest biometric identity platform, one that authenticates a billion residents, is a mammoth task that clearly has no parallel anywhere in the world. To understand more about the technology architecture behind Aadhaar, InformationWeek’s Srikanth RP had the privilege of speaking with Dr Pramod Varma, who is currently a technology advisor to the Unique Identification Authority of India (UIDAI) and a few technology startups.
By Srikanth RP, InformationWeek, September 04, 2013
For the initial three years of the Aadhaar project, Dr Pramod Varma was the Chief Architect at UIDAI, responsible for the entire system architecture and strategic technology decisions. He joined UIDAI in 2009 and has been pivotal in ensuring that an open, scalable, and secure architecture was built to meet the needs of the Aadhaar project. He led the overall technology and application architecture and application development within the UIDAI Technology Unit, and is based in Bangalore. Currently, in addition to working with UIDAI as a technology advisor, he sits on the advisory boards of a few startups, mentoring them and providing technology and architecture direction.
Some excerpts from the interview:
What were the factors considered while designing the architecture for Aadhaar?
Within the first six months of its inception, UIDAI defined its technology architecture principles very clearly. They are: “Openness & vendor neutrality,” “Security & privacy by design,” “Horizontal scalability,” “Interoperability & manageability,” “Use of analytics for transparency and decision making,” and, most importantly, a “Platform-based approach at every layer.”
For example, the Aadhaar system is built entirely from open source components and takes heavy advantage of international open standards: ISO biometric standards, data representation standards such as XML and JSON, security standards such as 2048-bit PKI and AES-256, the AMQP messaging standard, and so on. The system uses widely adopted open source components such as MySQL, Hadoop, and RabbitMQ, and uses Java as the primary application programming language. The entire application is deployed on commodity hardware, using blade/rack servers on the x86 platform running 64-bit Linux, with large-scale, inexpensive SATA storage arrays. Such an open scale-out architecture allows UIDAI to procure the latest servers and storage from “any” vendor at “the best price” only “when required.”
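As an illustration of how such open components fit together, here is a minimal sketch (not UIDAI’s actual code) of publishing a JSON-encoded enrolment event to RabbitMQ over AMQP using the open source Java client; the broker host, queue name, and payload fields are assumptions made for the example.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class EnrolmentEventPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // broker host is an assumption for this sketch

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Durable queue so queued events survive a broker restart
            channel.queueDeclare("enrolment.events", true, false, false, null);

            // Hypothetical JSON payload describing one processing-stage event
            String event = "{\"packetId\":\"PKT-0001\",\"stage\":\"received\",\"sizeBytes\":5242880}";
            channel.basicPublish("", "enrolment.events", null,
                    event.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```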
Similarly, security and privacy of data within the Aadhaar system have been foundational, and this is clearly reflected in UIDAI’s strategy, design, and processes throughout the system. UIDAI has taken several measures to ensure the security of resident data, spanning strong end-to-end encryption of sensitive data, 2048-bit PKI encryption, HSM appliances, physical security, access control, network security, stringent audit mechanisms, 24x7 monitoring, and measures such as data partitioning and data encryption.
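The general pattern behind combining 2048-bit PKI with AES-256 is hybrid encryption: a fresh AES-256 key encrypts the large packet, and the RSA-2048 public key wraps only that small session key. The sketch below shows this pattern in plain Java; the cipher modes, padding, and key handling are assumptions made for illustration, not UIDAI’s actual scheme.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.SecureRandom;

public class HybridEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // 2048-bit RSA key pair standing in for the receiving system's PKI key pair
        KeyPairGenerator rsaGen = KeyPairGenerator.getInstance("RSA");
        rsaGen.initialize(2048);
        KeyPair registryKeys = rsaGen.generateKeyPair();

        // Fresh AES-256 session key encrypts the (large) enrolment packet
        KeyGenerator aesGen = KeyGenerator.getInstance("AES");
        aesGen.init(256);
        SecretKey sessionKey = aesGen.generateKey();

        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, sessionKey, new GCMParameterSpec(128, iv));
        byte[] encryptedPacket = aes.doFinal(
                "...biometric and demographic payload...".getBytes(StandardCharsets.UTF_8));

        // RSA-2048 (OAEP) wraps only the small session key, not the whole packet
        Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
        rsa.init(Cipher.WRAP_MODE, registryKeys.getPublic());
        byte[] wrappedKey = rsa.wrap(sessionKey);

        System.out.println("encrypted packet bytes: " + encryptedPacket.length
                + ", wrapped key bytes: " + wrappedKey.length);
    }
}
```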
Could you describe the scale of data being handled and the complexity involved?
The Aadhaar system has to provide a unique identity to more than a billion people. With an aggressive target of reaching 600 million Aadhaars in a short span of four years, the enrolment module processes about 1 million enrolments every day. De-duplication (ensuring every resident is indeed unique) requires matching the 10 fingerprints, both irises, and demographic data of every resident, and hence each enrolment packet (2048-bit PKI-encrypted data) per resident is about 5 MB. Currently, the system handles about 30 TB (terabytes) of I/O to process 1 million enrolments every day. Given that the Aadhaar system has already enrolled 400+ million (40 crore) residents, system storage is about 1.5 PB (petabytes), or 1,500 TB, in one data center, and the same data is replicated to UIDAI’s second data center, for a total of more than 3,000 TB of data. This is expected to grow to about 10-12 PB when the entire country is covered. In addition, process data, including the 100+ million events generated every day, RDBMS data, Hadoop analytics data, etc., adds another layer of data management complexity.
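A rough cross-check of the figures that are simple to verify (the 5 MB packet size, 1 million daily enrolments, and the 1.5 PB per data center replicated across two sites) can be done with back-of-the-envelope arithmetic, as sketched below; the note about why processing I/O exceeds raw intake is an assumption, not a figure from the interview.

```java
public class ScaleBackOfEnvelope {
    public static void main(String[] args) {
        // Figures quoted in the interview
        double packetSizeMB = 5.0;             // encrypted enrolment packet per resident
        long enrolmentsPerDay = 1_000_000L;
        double storagePerDataCenterPB = 1.5;   // at 400+ million enrolments

        // Raw packet intake per day. The quoted ~30 TB of daily I/O is higher, presumably
        // because each packet is read and written several times across processing stages.
        double rawIntakeTBPerDay = enrolmentsPerDay * packetSizeMB / 1_000_000.0;
        System.out.printf("Raw enrolment intake: ~%.0f TB/day%n", rawIntakeTBPerDay);

        // Replication to the second data center matches the quoted "more than 3,000 TB"
        double replicatedTotalTB = storagePerDataCenterPB * 2 * 1_000.0;
        System.out.printf("Replicated storage across both data centers: ~%.0f TB%n", replicatedTotalTB);
    }
}
```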
What are the unique needs with respect to analytics for a project like Aadhaar?
At UIDAI, analytics and reporting have been a core constituent of the Aadhaar implementation strategy since inception. The large multi-provider ecosystem created by UIDAI can only be managed efficiently by measuring process data at a high degree of granularity, creating well-defined metrics from this process data, and creating a feedback loop so that these insights and learnings are shared back with the ecosystem for continuous improvement. When working with third-party organizations that are part of the ecosystem, it is essential that the entire system is measured using data and that decisions are made entirely on the basis of data. Highly granular metadata (or process data) must be collected automatically throughout the system to ensure that quality is measured systematically and feedback is given to fix any specific issues that are identified.
For example, every enrolment packet is reviewed by a supervisor for data quality (review audits are captured electronically) and signed off as required, which means every enrolment is traceable in terms of “who,” “when,” “where,” “under which agency,” “under which registrar,” “who reviewed it,” and so on. In addition, several metadata elements, such as “how long the operator spent on the demographic data screen,” “how many times a fingerprint was captured,” and “how many corrections were made,” are collected as part of every enrolment packet. This data is used to provide continuous feedback on data quality to the registrars and enrolling agencies through UIDAI’s analytics platform. An extensive atomic data warehouse built on top of Hadoop Hive already holds several billion analytics data points. This fully anonymized data source (the analytics system holds no PII) drives the entire analytics and reporting layer of the Aadhaar system.
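To make the notion of process metadata concrete, the shape of one anonymized enrolment process event might look like the Java record below; the field names are hypothetical, not UIDAI’s actual schema, and the record deliberately carries no PII.

```java
import java.time.Instant;

/**
 * Hypothetical, anonymized enrolment process-metadata event of the kind described above.
 * Only operational attributes are carried; no resident demographics or biometrics (no PII).
 */
public record EnrolmentProcessEvent(
        String packetId,                // opaque packet reference, not an identity number
        String registrarCode,
        String enrolmentAgencyCode,
        String operatorCode,            // certified operator identifier
        String stationId,
        Instant capturedAt,
        int demographicScreenSeconds,   // time spent on the demographic data screen
        int fingerprintCaptureAttempts, // how many times a fingerprint was captured
        int demographicCorrections,     // how many corrections were made
        boolean supervisorReviewed) {
}
```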
Could you share the unique challenges and lessons learnt in building the analytics system?
The Aadhaar system (both enrolment and authentication) records extensive event data in its atomic data warehouse. UIDAI could not have managed its large ecosystem rollout with a high degree of data quality and process adherence without this analytics backbone.
The unique challenge was to effectively manage 100,000+ certified third-party operators working across 30,000 stations under 50+ registrars across the country, without creating massive manual audit processes, while still maintaining data quality, process adherence, and scale. Instead of relying on manual audit and control mechanisms, UIDAI decided to drive the entire system using extensive data instrumentation and analytics.
For example, the enrolment client software is heavily instrumented to record “operator,” “agency,” “location,” “screen transition timing,” “data capture metrics for demographics and biometrics,” “station identifier,” “how many times the machine restarted,” “who has logged in,” “what kind of OS and hardware is being used,” etc., and all of this data is synced with the server to track, monitor, and analyze data quality and process issues at the level of every station and every operator, across 100,000 operators. The real learning is that such large-scale projects must drive all their decisions and continuous improvement using extensive data analytics. And this is now quite easily possible using open source technologies such as Hadoop, without the need for large IT budgets.
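A minimal sketch of this kind of client-side instrumentation is shown below: the enrolment client buffers operational events locally and hands them off for syncing to the server in batches. The event names, fields, and transport are assumptions made for the example, not UIDAI’s actual client code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative station-level telemetry buffer; not UIDAI's actual client code. */
public class StationTelemetry {
    private final List<Map<String, Object>> buffer = new ArrayList<>();
    private final String stationId;
    private final String operatorCode;

    public StationTelemetry(String stationId, String operatorCode) {
        this.stationId = stationId;
        this.operatorCode = operatorCode;
    }

    /** Record one instrumentation event, e.g. a screen transition or a machine restart. */
    public synchronized void record(String eventType, Map<String, Object> attributes) {
        Map<String, Object> event = new HashMap<>(attributes);
        event.put("eventType", eventType);
        event.put("stationId", stationId);
        event.put("operatorCode", operatorCode);
        event.put("timestampMillis", System.currentTimeMillis());
        buffer.add(event);
    }

    /** Drain buffered events for a batch sync; the transport (HTTPS, queue, etc.) is out of scope here. */
    public synchronized List<Map<String, Object>> drainForSync() {
        List<Map<String, Object>> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }

    public static void main(String[] args) {
        StationTelemetry telemetry = new StationTelemetry("STN-1029", "OP-88421");
        telemetry.record("SCREEN_TRANSITION",
                Map.of("from", "demographics", "to", "fingerprints", "elapsedMs", 42300));
        telemetry.record("FINGERPRINT_RECAPTURE", Map.of("finger", "L_INDEX", "attempt", 2));
        System.out.println(telemetry.drainForSync().size() + " events ready to sync");
    }
}
```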
From a Big Data and analytics point of view, what can Aadhaar enable for the entire country? What kind of functions do you see being transformed by the power of analytics enabled by the Aadhaar architecture?
The Aadhaar system is a great example of using open source technologies and extensive data-driven analytics to achieve its scale and quality. UIDAI regularly publishes its learnings, APIs, technology choices, etc., on its website. Specifically on the analytics front, UIDAI publishes these through its portal (portal.uidai.gov.in) and also provides machine-readable data via its data platform (data.uidai.gov.in). All these metrics and data points are derived from its analytics platform built on top of Hadoop Hive, which has already captured several billion analytics data points. This fully anonymized data source drives the entire analytics and reporting layer of the Aadhaar system. By providing all these analytics via its public portal and data portal, UIDAI hopes to enable transparency and encourage researchers to use this aggregate data for various studies and mashups. As described earlier, UIDAI uses this same platform very extensively to monitor data and process quality and to provide continuous, automated feedback to ecosystem partners, all the way down to individual operators across the country.