As lockdowns around the world continue to grind on, we are starting to notice their effects on all aspects of life. One of the rare silver linings that has emerged in popular discussion is the dramatic reduction in air pollution that lockdowns have caused. Research on the original Wuhan lockdown suggested that the resulting declines in air pollution (63% reduction in nitrogen dioxide) may have saved up to 11,000 lives in China.
Similar analyses have been carried out in India, with headlines like “Air pollution dropped significantly during lockdown” and conclusions like “India’s nationwide lockdown, in particular, has had stunning effects on air pollution levels.” But reading the language used in the Indian Express article made me nervous:
“Air pollution dropped significantly during 74-day lockdown period.” (title)
“The clampdown on all non-essential activities due to the COVID-19 pandemic, from March 25 to June 8, led to a significant decline in air pollution levels for major cities across India.” (subtitle)
The title and the subtitle are saying two different things! Just because air pollution dropped during the lockdown doesn’t mean that the lockdown caused a significant decline in air pollution levels. It’s the old saw: correlation doesn’t mean causation. So I wanted to look into some air quality data myself to see the story.
I obtained air quality index (AQI) data from this popular dataset, which extracts air quality data from the Central Pollution Control Board’s data platform. It only contained AQI data until May 1st in most cities, so I supplemented it by scraping the CPCB’s AQI bulletin announcements every day to bring the data up to July 6th.
In particular, I used daily city-level data on four major cities: Delhi, Mumbai, Kolkata and Bangalore. I chose these cities because they are the biggest cities, in which industrial and economic activity are likely to be the biggest sources of pollution - and because the Indian Express article that prompted this investigation focuses on those cities.
pacman::p_load("dplyr", "ggplot2", "gridExtra", "zoo", "lubridate", "ggthemes", "stargazer") theme_set(theme_grey() + theme(legend.position = "none", strip.text = element_text(size = 14), axis.title = element_text(size = 16), axis.text = element_text(size = 14))) cities <- c("Delhi", "Mumbai", "Kolkata", "Bengaluru") aqi <- read.csv("~/projects/india-data/india-aqi/city_full.csv") %>% filter(City %in% cities) %>% mutate(Date = as.Date(Date)) %>% arrange(City, Date) %>% mutate(AQI_MA7 = rollapplyr(AQI, 7, mean, fill=NA))
The first investigative question is, how has the AQI in each city changed since the lockdown began?
aqi %>% filter(Date >= "2020-03-25") %>% ggplot(aes(x = Date, y = AQI_MA7, color = City)) + geom_line() + geom_smooth(method = 'lm') + facet_wrap(~ City) + labs(y = "7-day moving average of AQI")
Note that higher AQI corresponds to more air pollution. This figure generally bears out the conclusion that air quality has increased during the lockdown period, with the notable exception of Delhi, where air quality has gotten worse. The most likely reason for this difference from the Indian Express article’s conclusion about Delhi is because I’m focusing on AQI, whereas the article focuses on the decline in specific pollutants like NO2 and PM2.5. With the caveat that I am not an environmental scientist, focusing on AQI seems to be more sound:
It’s calculated in a defined way set out by the CPCB, so there’s no risk of a misleading conclusion from selectively looking at some pollutants rather than others that tell a more favorable story. I’m not accusing the Indian Express of trying to p-hack their analysis, but choosing AQI is just a more sound approach.
CPCB guidelines correspond to AQI levels and give health impacts corresponding to each AQI level, which can be more easily applied to health impact analysis than aggregating impacts corresponding to each specific pollutant level.
So far, we’ve seen that air quality has improved during lockdown. But when looking at my main concern - how the AQI decline post-lockdown compares to the AQI trend pre-lockdown - the story looks a lot less clear.
aqi %>% filter(Date > "2020-01-01") %>% mutate(Post = ifelse(Date >= "2020-03-25", 1, 0)) %>% ggplot(aes(x = Date, y = AQI_MA7, color = City, group = Post)) + geom_line() + geom_smooth(method = 'lm') + geom_vline(xintercept = as.numeric(as.Date("2020-03-25")), linetype=4) + facet_wrap(~City) + labs(y = "7-day moving average of AQI")
It turns out that in all four cities, AQI was on a steady decline throughout 2020 in the pre-lockdown period. It is interesting to note that there seems to be a break at the beginning of lockdown, March 25th. But that should be taken with a grain of salt: after all, if you generate an arbitrary date to separate periods in 2020, the lines of best fit will often have a break. If you can’t convince yourself this is true, I’ve created a Shiny app for you to simulate different “lockdown” dates yourself and see the breaks that they create.
When we rewind to 2019, it becomes clear that the decline in AQI in 2020 is part of a seasonal effect that we can see in full cyclical form in 2019.
aqi %>% filter(Date >= "2019-01-01") %>% ggplot(aes(x = Date, y = AQI_MA7, color = City)) + geom_line() + geom_smooth(method = 'lm') + facet_wrap(~City) + labs(y = "7-day moving average of AQI")
It seems that air quality has a strong seasonal pattern, and focusing only on 2020 misses that seasonal pattern. This means analyses of AQI that take only 2020 as the pre-lockdown period are flat-out incorrect - they are conducting a full study within a season.
The Indian Express article does reference data from before 2020.
“The analysis showed that Mumbai’s PM 2.5 average during the lockdown was 20, while the average was 40 in 2017, 47 in 2018 and 36.1 in 2019… Kolkata’s average PM 2.5 during the lockdown was 22 as opposed to 69.3 2017, 86.2 in 2018 and 57.7 in 2019… Delhi’s PM 2.5 average during the lockdown was 49. It was 101.3 in 2017, 121 in 2018 and 109.2 in 2019. Similarly, Bangalore’s PM 2.5 average during the lockdown was 23; it was 46.1 in 2017, 47.4 in 2018 and 36.7 in 2019.”
Setting aside the issue of PM2.5 vs AQI as a metric, we can see from Figure 3 why this is a really bad comparison! We are only in the low-pollution half of the 2020 air quality cycle, whereas the average pollution levels of 2017, 2018 and 2019 include the high-pollution months of October through December, so they will obviously be higher on average. We can also see a general trend towards better air quality (lower AQI), which doubly explains improvements in 2020 relative to previous years. Nothing so far has shown the effect of lockdowns. The bottom line is that this comparison is misleading.
A Better Comparison - Diverging Trends
If we wanted to fix their analysis, we would make an apples-to-apples comparison by including only January through July 6th in 2019. In order to account for the trend towards improved air quality, we would ignore the absolute gap between the trend lines and focus only on whether they diverge after lockdown begins. If this diverges after lockdown, that would suggest that India’s lockdown is actually impacting air quality.
comparison <- aqi %>% mutate(AQI_MA7_before = lag(AQI_MA7, 365), AQI_MA7_dif = AQI_MA7_before - AQI_MA7) %>% filter(Date >= "2020-01-01" | (Date >= "2019-01-01" & Date <= "2019-07-06")) %>% mutate(Year = as.factor(year(Date)), Date = as.Date(format(Date, "%m-%d"), "%m-%d"), Post = ifelse(Date >= "2020-03-25", 1, 0)) ggplot(comparison, aes(x = Date, y = AQI_MA7, color = City, linetype = Year)) + geom_line() + geom_vline(xintercept = as.numeric(as.Date("2020-03-25")), linetype=4) + facet_wrap(~City) + labs(y = "7-day moving average of AQI") + theme(legend.position = "right", legend.title = element_text(size = 16), legend.text = element_text(size = 14))
This figure tells the most promising story yet: we can see in general, the lines diverge and grow further apart after March 25th, suggesting that air quality in April through June of 2020 has been much better than in April through June of 2019. Nowhere is this difference more clear than in Delhi, which originally looked like a disaster in Figure 1, but now is vindicated by a comparison to Delhi in 2019… when air quality was seriously worrying.
The intuition behind Figure 4 - to see how the trend is affected by comparing pre-/post-lockdown air quality - mirrors the differences-in-differences analysis, beloved among economists. The concept behind the diff-in-diff is simple: to study the impact of a change, you have to find a “treatment” group (which was affected by the change) and a “control” group (which was not). Furthermore, treatment and control have to be aligned in the pre-period: if they diverge in the post-period, it means the event has caused them to diverge. This can be captured in a regression
\[Y_i = \beta_1 Treated_i + \beta_2 Post_i + \beta_3 Treated_i * Post_i\]
If \(\beta_3\), the coefficient on \(Treated * Post\), is significant then it signifies that the treatment and control diverged significantly in the lockdown period. We can actually estimate this to demonstrate quantitatively what Figure 4 shows qualitatively.
did_model <- lm(AQI ~ City + Year + Post + Year * Post - 1, data = comparison) stargazer(did_model, type = 'html', header = F, omit = "City", title = "A pseudo-diff-in-diff analysis of air quality with city fixed effects", covariate.labels = c("Treated", "Lockdown", "Treated * Lockdown"))
|Treated * Lockdown|
|Residual Std. Error||73.585 (df = 1491)|
|F Statistic||1,178.935*** (df = 5; 1491)|
|Note:||p<0.1; p<0.05; p<0.01|
This regression is purely for fun and should not be taken as a serious causal inference exercise! This comparison is really unsuitable for a diff-in-diff analysis for a number of reasons, including that 2020 AQI is dependent on 2019 AQI so there is no way the difference between 2020 and 2019 could be quasi-random variation. It is just intended as a nice illustration of the diff-in-diff intuition spotted in the wild, and as a mirror of Figure 4.
Volatility, and a Conjecture
One more thing is interesting to note. I used the 7-day moving average of AQI as a measure because AQI is extremely noisy on a day-to-day basis, which makes it very hard to see trends. But it turns out this volatility has also changed under lockdown.
aqi %>% filter(Date > "2020-01-01") %>% mutate(Volatility = rollapplyr(AQI, 7, sd, fill=NA), Post = ifelse(Date >= "2020-03-25", 1, 0)) %>% ggplot(aes(x = Date, y = Volatility, color = City, group = Post)) + geom_line() + geom_vline(xintercept = as.numeric(as.Date("2020-03-25")), linetype=4) + facet_wrap(~City)
It seems that the biggest change in air quality during India’s lockdown has been in its volatility! As you can see, I define volatility as the standard deviation of the past 7 days’ AQI. Volatility has dropped almost to zero since lockdown began. This is also made clear by looking at AQI itself, rather than the smoothed-out moving average.
aqi %>% filter(Date > "2020-01-01") %>% mutate(Post = ifelse(Date >= "2020-03-25", 1, 0)) %>% ggplot(aes(x = Date, y = AQI, color = City, group = Post)) + geom_line() + geom_vline(xintercept = as.numeric(as.Date("2020-03-25")), linetype=4) + facet_wrap(~City)
The near-disappearance of AQI volatility makes me conjecture that one of the biggest effects of India’s lockdown has been to reduce the cyclicality of economic activity. Rather than booms and busts in driving, factory production, etc, we have entered a new equilibrium: if it is essential, it happens, and if it is non-essential, it doesn’t happen. This pattern of activity can change, but not extremely fast: people and businesses are being very deliberative about what economic activity to pursue. In short, it seems that India’s lockdown has driven economic activity to a near-steady state.
This is perfectly compatible with the lockdown also reducing economic activity in aggregate, of course. But I think it points to effects beyond the aggregate activity level that are underappreciated right now, but could be really influential in the long run. I see people around me forming habits: habits of routinizing shopping rather than making impulse purchases, of travelling only if travel is necessary. I don’t know how durable this behavior change will be - it’s certainly possible that when COVID is behind us, people will go back to behaving like they used to. But it’s interesting to think about what might be.
How I compiled the data, the Shiny app, and the rest of the analysis can be found in more detail on this post’s GitHub repository.