We investigate the use of a machine learning algorithm to identify non-existent (fraudulent) firms that are used for tax evasion. Using a rich dataset of tax returns in an Indian state over several years, we train a machine learning-based model to predict fraudulent firms. We then use the model predictions to carry out field inspections of firms identified as suspicious by the machine learning tool. We find that the machine learning model is accurate in both simulated and field settings in identifying non-existent firms. Withholding a randomly selected group of firms from inspection, we estimate the causal impact of machine learning-driven inspections. Despite the strong predictive performance, our model-driven inspections do not yield a significant increase in enforcement, as measured by cancellations of fraudulent firm registrations and tax recovery. We provide two reasons for this discrepancy, based on a close analysis of the tax department’s operating protocols: selection bias, and institutional friction in integrating the model into existing administrative systems. Our study serves as a cautionary tale for the application of machine learning in public policy contexts, and against relying solely on test set performance as an indicator of effectiveness. Field evaluations are critical in assessing the real-world impact of predictive models.
Aprajit Mahajan is an Associate Professor in the Department of Agricultural and Resource Economics at the University of California, Berkeley. He is also a Research Associate at the National Bureau of Economic Research (NBER), and an affiliate of the Abdul Latif Jameel Poverty Action Lab (J-PAL) and the Center for Effective Global Action (CEGA). Aprajit is a development economist with a strong interest in econometric issues, motivated by empirical work. He has worked extensively in India, and recent areas of work include agriculture, health, management, and taxation. Aprajit received his PhD in Economics from Princeton University. He previously taught at Stanford University and the University of California, Los Angeles.
Shekhar Mittal is a Senior Economist at Amazon. During his PhD, Shekhar was interested in development and public economics. His work used large-scale government data sets to better understand government capacity, and combined these data sets with field interventions to address questions of first-order causal interest. This work began before he joined Amazon. The views expressed in this paper are those of the author(s) and cannot be attributed to Amazon Inc., its Executive Boards, or management teams.
Ofir Reich is an independent data scientist conducting data-intensive projects for economic development. He is primarily interested in using data to detect fraud and corruption in weak enforcement settings. He worked on flood prediction in India at Google, was a data scientist for Precision Agriculture for Development, a data scientist at the Center for Effective Global Action (UC Berkeley), Chief Data Scientist and Machine Learning expert for a fraud detection start-up, and a mathematical research team leader in an elite technological unit of the Israeli army. Ofir holds a Mathematics & Physics BSc from the Hebrew University in Jerusalem.
Taha Barwahwala is a Kabir Banerjee Predoctoral Fellow in the Department of Economics at Columbia University, and was previously a Senior Data Research Associate at J-PAL South Asia. At J-PAL, he worked on exploring the effectiveness of machine learning methods applied to administrative data with the ‘Innovations in Data and Experiments for Action’ (IDEA) initiative. His area of focus is taxation capacity under India’s Goods and Services Tax (GST) ecosystem. Through data-driven research, Taha wants to identify and advocate for policy interventions that boost productivity in urban settings through better service provision, increased state capacity, and efficient governance. Prior to joining J-PAL in 2020, Taha was a data engineer at a private sector bank. Taha holds a Bachelor of Technology in Engineering Physics with a minor in Mathematics from the Indian Institute of Technology, Guwahati.
Citation: Mahajan, A.; Mittal, S.; Reich, O. and Barwahwala, T. (2024) Using Machine Learning to Catch Bogus Firms, ICTD Working Paper 196, Brighton: Institute of Development Studies, DOI: 10.19088/ICTD.2024.050