PROJECT: Analyses of FDA Drug Adverse Reactions
Updated: Nov 22, 2019
In this project, I retrieve data about adverse reactions to drugs from the FDA's API and analyze it.
Table of Contents
Introduction and Goals
Gathering the Data
Deeper Dive into Drug Manufacturers
1. Introduction and Goals
A. Gathering the Data from the FDA about Drug Adverse Reactions
1. Use openFDA API
B. Analyze these findings to see
1. What trends do we see in reporting
2. What information can we get about drug manufacturers
are some producing more adverse reaction reports than others?
are there trends for different manufacturers?
are some manufacturers more responsible for the overall trends we see in reporting than others?
3. If we can accurately predict whether a reaction will be serious or not given various predictors
4. Forecast the number of serious and non-serious adverse reactions five years into the future
2. Gathering the Data
I made API calls to the FDA's API. The API only allowed very small calls -- limited to 100 instances.
First, I made a call to see how many total entries existed ('count_of_received') across all of the years for which data has been recorded. The start date was 01/01/2004 and the date I ran this code was 09/26/2019. I then defined a function to find date cutoffs for batches of 99. These time intervals were then fed into a function ('get_json_data') that retrieves 99 observations for each date interval, of which there are over 4,500.
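As a rough illustration of the batching idea, here is a minimal sketch that splits the date range into windows and builds one query URL per window. It is not the original code: `split_date_range` and `build_query_url` are hypothetical helpers, the even split by calendar days is an assumption (the original cutoffs were derived from the report counts), and 4,500 is only a round illustrative number of windows. The endpoint and the `search`/`limit` parameters follow openFDA's documented query syntax.

```python
from datetime import date, timedelta

BASE_URL = "https://api.fda.gov/drug/event.json"  # openFDA drug adverse event endpoint


def split_date_range(start, end, n_intervals):
    """Split [start, end] into up to n_intervals contiguous, non-overlapping windows."""
    total_days = (end - start).days + 1
    bounds = [start + timedelta(days=i * total_days // n_intervals)
              for i in range(n_intervals + 1)]
    # Each window ends the day before the next one starts; empty windows are skipped.
    return [(lo, hi - timedelta(days=1)) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def build_query_url(start, end, limit=99):
    """Build a query URL for reports received between start and end (inclusive)."""
    search = f"receivedate:[{start:%Y%m%d}+TO+{end:%Y%m%d}]"
    return f"{BASE_URL}?search={search}&limit={limit}"


# Illustrative call: the project's date range with an assumed 4,500 windows.
windows = split_date_range(date(2004, 1, 1), date(2019, 9, 26), 4500)
urls = [build_query_url(lo, hi) for lo, hi in windows]
```

Each URL could then be fetched in turn (for example with `requests.get(url).json()`), pausing between calls to stay within the API's rate limits.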
The data from the JSON is processed through methods I created in a separate python file make_dataframe.py (as 'md'). The 'get_all_data' function looks through several layers of a returned JSON file to find the information we're looking for and convert it into a dataframe in Pandas.
Below, you can see my code.
And the returned DataFrame with 451,836 rows below:
Each row is a separate adverse reaction observation. There are 26 different columns to the data. One of the most significant that we will look at and try to predict is the 'serious' column. 1 indicates an adverse reaction resulted in hospitalization, death, or other significantly debilitating outcomes. 2 indicates any reaction besides those in the first category.
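To make the row structure concrete, here is a minimal sketch of how one nested API response could be flattened into one row per report. It is not the contents of make_dataframe.py: `flatten_report` and `reports_to_dataframe` are hypothetical helpers, only a handful of the 26 columns are shown, and the sample payload is made up. The field names (`safetyreportid`, `receivedate`, `serious`, `patient.drug`, `openfda.manufacturer_name`) are the ones openFDA uses in its drug event records.

```python
import pandas as pd


def flatten_report(report):
    """Pull a few fields out of one nested report (illustrative subset of columns)."""
    drugs = report.get("patient", {}).get("drug", [])
    manufacturers = sorted({
        name
        for drug in drugs
        for name in drug.get("openfda", {}).get("manufacturer_name", [])
    })
    return {
        "safetyreportid": report.get("safetyreportid"),
        "receivedate": report.get("receivedate"),
        "serious": report.get("serious"),  # "1" = serious, "2" = non-serious
        "n_drugs": len(drugs),
        "manufacturers": manufacturers,
    }


def reports_to_dataframe(payload):
    """Convert the 'results' list of one API response into a DataFrame."""
    rows = [flatten_report(r) for r in payload.get("results", [])]
    df = pd.DataFrame(rows)
    df["receivedate"] = pd.to_datetime(df["receivedate"], format="%Y%m%d")
    df["serious"] = pd.to_numeric(df["serious"])
    return df


# Made-up payload shaped like an API response, for illustration only.
sample_payload = {"results": [
    {"safetyreportid": "1", "receivedate": "20040105", "serious": "1",
     "patient": {"drug": [{"medicinalproduct": "DRUG A",
                           "openfda": {"manufacturer_name": ["Maker One"]}}]}},
    {"safetyreportid": "2", "receivedate": "20040212", "serious": "2",
     "patient": {"drug": [{"medicinalproduct": "DRUG B"},
                          {"medicinalproduct": "DRUG C"}]}},
]}
df = reports_to_dataframe(sample_payload)
```

Keeping a per-report drug count (`n_drugs` here) is convenient later, when reports that list a single drug need to be separated from the rest.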
To see what trends underlie the data, I uploaded the lineplots of both serious and non-serious adverse reactions. Each data point is a summed total by month over the course of 15 years. It is clear that there is a trend of increasing non-serious adverse reaction reports. The serious reports have remained much more stable, but they do seem to have a slightly downward trend over the last few years. We'll investigate both trends more closely and then forecast into the future.
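The monthly totals behind such line plots can be computed with a short group-by. This is a sketch under assumptions: the DataFrame is taken to have a datetime `receivedate` column and an integer `serious` column (1 or 2), and `monthly_counts` is a hypothetical helper name. The sample rows are made up.

```python
import pandas as pd


def monthly_counts(df):
    """Count serious and non-serious reports per calendar month."""
    labels = df["serious"].map({1: "serious", 2: "non_serious"})
    months = df["receivedate"].dt.to_period("M")
    # One row per month, one column per seriousness label.
    return df.groupby([months, labels]).size().unstack(fill_value=0)


# Made-up rows for illustration.
reports = pd.DataFrame({
    "receivedate": pd.to_datetime(["2019-01-03", "2019-01-20", "2019-01-25", "2019-02-02"]),
    "serious": [1, 2, 2, 1],
})
counts = monthly_counts(reports)
```

Calling `counts.plot()` on the result would draw one line per category, assuming matplotlib is installed.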
4. Deeper Dive into Drug Manufacturers
In order to investigate what effect drug manufacturers have on the trends and overall counts of adverse reactions, we must separate them out. This is not as straightforward as it may seem, because the data from the FDA includes all drugs that the patient who experienced the adverse reaction was on. Essentially, we cannot tease apart which drug caused the adverse reaction, and therefore which of the drug manufacturers was most responsible.
The cleanest solution to this problem is to take adverse reaction reports where only one drug was recorded in the adverse reaction report. This way, we can isolate the different drug manufacturers. I made a list of the top 20 companies responsible for the most serious adverse reactions and separately for non-serious adverse reactions, then took a set union of those companies. Below is a breakdown summed over the 15 years of available data of serious and non-serious adverse reaction counts of those drug manufacturers.
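The selection step described above (single-drug reports, top N per category, set union) can be sketched as follows. The column names `n_drugs`, `manufacturer`, and `serious` and the helper `top_manufacturers` are assumptions made for illustration, and the sample data is invented. The `difference` column is serious minus non-serious.

```python
import pandas as pd


def top_manufacturers(df, n=20):
    """Union of the top-n manufacturers by serious and by non-serious counts,
    using only reports that list exactly one drug."""
    single = df[df["n_drugs"] == 1]
    counts = (
        single.groupby(["manufacturer", "serious"]).size()
        .unstack(fill_value=0)
        .reindex(columns=[1, 2], fill_value=0)
        .rename(columns={1: "serious", 2: "non_serious"})
    )
    top_serious = set(counts["serious"].nlargest(n).index)
    top_non_serious = set(counts["non_serious"].nlargest(n).index)
    table = counts.loc[sorted(top_serious | top_non_serious)].copy()
    table["difference"] = table["serious"] - table["non_serious"]
    return table


# Made-up data: manufacturer D only appears in multi-drug reports and is filtered out.
reports = pd.DataFrame({
    "manufacturer": ["A"] * 4 + ["B"] * 5 + ["C"] * 2 + ["D"] * 3,
    "serious": [1, 1, 1, 2] + [1, 2, 2, 2, 2] + [1, 2] + [1, 1, 1],
    "n_drugs": [1] * 11 + [3] * 3,
})
table = top_manufacturers(reports, n=1)
```

Because the two top-n lists can overlap, the union holds anywhere between n and 2n manufacturers.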
Additional considerations to have in mind when interpreting the results:
1. The companies represented probably make drugs with intrinsically different therapeutic-harm tradeoffs. To compare drug safety across manufacturers directly, we would need to hold many factors constant, foremost the drug itself. That is beyond the scope of this project, though this data does provide information useful for that analysis as well.
2. The manufacturers are as recorded by the FDA. During the 15 years of data recording, different manufacturers surely acquired or were acquired by others. We'll take a look at a time-series breakdown for some companies for their adverse reactions, but it's important to keep ownership in mind when diving into deeper analysis.
3. Taking only cases where one drug was taken presents its own problem. It is possible that certain manufacturers make drugs that are usually taken with other drugs (from other manufacturers). This data would not show results for those. The single-drug subset represents 27.17% of our original observations. Furthermore, I hypothesize that taking more drugs might increase the chance that an adverse reaction event occurs. This is consistent with the fact that most of the reports recorded two or more drugs.
Below, you can see the raw counts as well as a calculated difference (serious - non-serious).
It is important to look at the time-series element of the reaction reports for the drug manufacturers. For each company, our data is not only recorded by day, but also by observation. In order to see any trends, I first visualized the total adverse reactions by the top most responsible companies (as defined above), and grouped the data as weekly sums of reports. The data is hard to interpret, naturally. But this is important to show to demonstrate spikes, which may correspond to important events that we might want to take a closer look at in further analyses. For example, Immunex Corporation has several spikes as well as an overall increasing trend.
To make more sense of the trends, we smooth it out. We aggregate our data not by weekly sums, but by yearly sums. We lose some of the information (the spikes that may correspond to important events), but we gain clarity about overall trends.
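The difference between the two aggregation levels comes down to the resampling frequency. Below is a minimal sketch, assuming a DataFrame with a datetime `receivedate` column and a `manufacturer` column; `report_counts` is a hypothetical helper and the rows are made up.

```python
import pandas as pd


def report_counts(df, freq):
    """Count reports per manufacturer in time bins of the given pandas frequency."""
    grouped = df.groupby(["manufacturer", pd.Grouper(key="receivedate", freq=freq)])
    # Rows are time bins, columns are manufacturers.
    return grouped.size().unstack(level=0, fill_value=0)


# Made-up rows for illustration.
reports = pd.DataFrame({
    "manufacturer": ["X", "X", "X", "X", "Y"],
    "receivedate": pd.to_datetime(
        ["2018-01-02", "2018-01-03", "2018-06-10", "2019-02-01", "2018-01-04"]
    ),
})
weekly = report_counts(reports, "W")    # weekly sums: spiky, shows individual events
yearly = report_counts(reports, "YS")   # yearly sums: smoother, shows the overall trend
```

Both tables sum to the same totals per manufacturer; only the granularity changes.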
Note: the downward trend at the very tail end of all lineplots is a result of aggregating data by year, with 2019 having data for only 2/3 of the year. The 2019 data point can be ignored if you'd like.
It is clear Immunex is responsible for many of the adverse reactions, as confirmed by our table of raw numbers above. Furthermore, its reports follow an increasing trend from 2013 that reversed in 2016. It would be interesting in further work to dig deeper into why this was.
We can also break down this data into serious and non-serious adverse reaction events over time for each company. Let's start with Immunex (aggregated by year again).
The rise in both kinds of adverse reactions around 2013 is slightly worrying, but Immunex has a strong downward trend since 2016.
We can take a look at a few other drug manufacturers as well just as a high-level comparison. Let's not forget the warnings I mentioned about interpretation earlier, though.
I find this last one interesting, as the number of serious reactions decreases drastically and the number of non-serious adverse reactions increases exponentially, mirroring the general trends.
Finally, let's see what happens when you remove Immunex, the second largest contributor to the overall adverse reaction numbers. I did this for several of the top companies, but am only showing Immunex here as an example. Below are our original data trends and, in black, the same data without the Immunex adverse reactions (reports where the only drug recorded was manufactured by Immunex).
We don't see a huge change (you might have a hard time distinguishing the difference in serious adverse reaction reports especially). This means that we're probably looking at an industry-wide phenomenon, not just the amplified trends of one or a few companies.
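The comparison amounts to recomputing the totals after dropping the single-drug reports attributed to one manufacturer. The sketch below assumes columns named `receivedate`, `manufacturer`, and `n_drugs`; the helper names and the sample rows are hypothetical.

```python
import pandas as pd


def monthly_totals(df):
    """Total reports per calendar month."""
    return df.groupby(df["receivedate"].dt.to_period("M")).size()


def totals_without(df, manufacturer):
    """Monthly totals after dropping single-drug reports from one manufacturer."""
    only_that = (df["n_drugs"] == 1) & (df["manufacturer"] == manufacturer)
    return monthly_totals(df[~only_that])


# Made-up rows for illustration.
reports = pd.DataFrame({
    "receivedate": pd.to_datetime(
        ["2019-01-05", "2019-01-12", "2019-01-20", "2019-01-25", "2019-02-03"]
    ),
    "manufacturer": ["Immunex Corporation", "Immunex Corporation",
                     "Other Co", "Other Co", "Immunex Corporation"],
    "n_drugs": [1, 2, 1, 1, 1],
})
full = monthly_totals(reports)
# Reindex so months that lose all their reports still appear, with a count of 0.
reduced = totals_without(reports, "Immunex Corporation").reindex(full.index, fill_value=0)
```

Plotting `full` and `reduced` on the same axes gives the kind of before-and-after comparison described above.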
Furthermore, there is a clear change in trajectory for the trends of non-serious adverse reaction reports circa 2012/2013. Perhaps there was a change in either the way the FDA classified these reactions (to be broader) or perhaps there was success in FDA efforts to curb underreporting of adverse reactions (which would affect non-serious reports much more than serious ones). If this trend is a correction of underreporting, we can consider that a good trend. Otherwise, it would be harder to interpret. More data is necessary for this analysis.
Turning now to forecasting the number of adverse reaction events reported to the FDA, we will walk through the process for non-serious adverse reactions only. The same process is used to forecast serious adverse reaction reports.
Here, again, is the real data we have over the last fifteen years.
The first thing we need to do is to decompose the data. Observed data is made up of a trend, seasonality, and randomness that sum up to give us our original data. Seasonality is a type of periodicity, as can be seen, a recurring pattern.
This gives us the seasonality information for our SARIMAX model (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors model). We need to get additional information before crafting our model. We need to detrend the data first. Below are two graphs. The one on the left shows our data with a rolling mean and standard deviation. This is clearly not stationary, as we confirm with the Dickey-Fuller test (p-value of 0.89). Stationarity is a fundamental assumption for the SARIMAX model, so we try to make it stationary first. The first attempt is to take a difference between one observation and the next and plot that as the graph on the right. We can see that the data is now stationary, confirmed by the Dickey-Fuller test (p-value of 2.23 × 10^-29). This differencing information will be fed into the parameters of the SARIMAX model along with the seasonality information.
Next, we need to decide how many autoregressive and/or moving average terms to include as parameters of our SARIMAX models. On the left, we plot autocorrelation and partial autocorrelation (the correlation at a given lag after removing the effect of the shorter lags in between). For our model, we need one moving average term and no auto-regressive terms. Take a look at this guide for more in-depth information about these parameters. On the right, we finally run our model.
The model was created using data until about mid-2017. Here are the predictions on all data. As you can see, it unsurprisingly follows the general shape, though with dampened spikes, on all the training data. It also does fairly well on the test data. And we see that it predicts into the future.
Zoomed in, we can see the thicker blue prediction line from 2016 to 2024 and our actual data as the thinner line.
Looking at the same thing, but with both serious and non-serious adverse reactions, we see that serious adverse reactions remain fairly stable while non-serious adverse reactions continue to rise, likely matching or overtaking the number of serious reports in the next five years.
This is significant for many reasons. Most obviously, there is more processing of reports for the FDA to handle. The overall reporting is increasing. We cannot say whether that fact is necessarily good or bad, but the agency must prepare for an increasing workload.
I am planning on revisiting this forecast soon to not just do a train-test split but also validate across a number of folds. I believe that will improve the model drastically.
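One common way to validate a time series model across folds is an expanding window, where each fold trains on everything before its test block so that the model never sees the future. The sketch below shows only the index bookkeeping; `expanding_window_splits` is a hypothetical helper, and the fold count and test size are arbitrary.

```python
def expanding_window_splits(n_obs, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs with a growing training window."""
    first_test_start = n_obs - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        # Train on everything before the test block, test on the block itself.
        yield range(0, test_start), range(test_start, test_start + test_size)


# Example: 189 monthly observations, 3 folds of 12 months each.
splits = list(expanding_window_splits(189, 3, 12))
```

The model would be refit on each training range and scored on the matching test range, and the scores averaged across folds.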
The goal of classification is simply to be able to predict whether an adverse reaction will be serious or not given certain predictor variables.
I am currently building out these models. This project is ongoing.
A naïve model will guess that an observation will be serious 76.25% of the time.
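If the 76.25 figure is read as the share of reports labeled serious, a majority-class baseline can be computed directly from the labels. This is a sketch of that reading, not the project's classification code; `majority_baseline` is a hypothetical helper and the labels below are made up.

```python
import pandas as pd


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    shares = labels.value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


# Made-up labels: 1 = serious, 2 = non-serious.
labels = pd.Series([1, 1, 1, 2])
majority_label, baseline_accuracy = majority_baseline(labels)
```

Any classifier built later would need to beat this baseline accuracy to be useful.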