In this blog post, we analyse accident and traffic data for the United Kingdom. The data are available for download on Kaggle and contain UK government records of traffic counts from 2000 to 2016, as well as accidents recorded between 2005 and 2014 (with 2008 missing). Using Python and the modules pandas, numpy, glob, folium and seaborn, we will visualise different aspects of the data and try to gain some insight into trends in accidents and traffic in the UK.

In particular, we will compare trends in London, one of the major metropolitan areas in the world, to trends in the whole of the UK. We will also try to identify correlations between traffic demographics and accident rates. But let’s get started.

Introduction to the datasets

To obtain the data, click on the download link on the dataset’s page on Kaggle. This downloads a zip file, which we assume has been extracted into the directory where the lines of code below are executed.
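
If you prefer to script the extraction as well, here is a minimal sketch using the standard library (the archive name uk-accidents.zip is hypothetical; use whatever Kaggle names your download):

import zipfile

# extract the downloaded Kaggle archive into the current working directory
with zipfile.ZipFile("uk-accidents.zip") as zf:  # hypothetical archive name
    zf.extractall(".")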

import pandas as pd # provides the DataFrame object, analogous to R's data.table
import numpy as np  # numerical arrays and vectorised operations on them
import glob         # lists files matching shell-style wildcard patterns (not regexes)

# LOAD THE ACCIDENT DATA
files_ls = glob.glob('accidents*csv') 

dfs = []
for f in files_ls:
    dfs.append(pd.read_csv(f, dtype='unicode'))

df_acc = pd.concat(dfs)
df_acc = df_acc.reset_index(drop=True)

# check for duplicates and keep first
df_acc = df_acc.drop_duplicates(keep="first")

# convert columns to numeric where possible; errors='ignore' leaves non-numeric
# columns as strings (note: this option is deprecated in recent pandas versions)
df_acc = df_acc.apply(pd.to_numeric, errors='ignore')

# LOAD THE TRAFFIC DATA
df_traffic = pd.read_csv("./ukTrafficAADF.csv", dtype='unicode')

# check for duplicates and keep first
df_traffic = df_traffic.drop_duplicates(keep="first")

# convert columns to numeric where possible, as above
df_traffic = df_traffic.apply(pd.to_numeric, errors='ignore')
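
Before moving on, a quick sanity check that everything loaded as expected (output omitted):

# shapes and year ranges of the two dataframes
print(df_acc.shape, df_traffic.shape)
print(df_acc['Year'].min(), df_acc['Year'].max())
print(df_traffic['AADFYear'].min(), df_traffic['AADFYear'].max())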

An overview of the dataframes can be printed using the pandas method describe(). It provides descriptive statistics for each numeric column, such as the number of elements, mean, standard deviation, extrema and a few quantiles; we transpose the output so that each variable appears as a row.

df_acc:

df_acc.describe().transpose()
                                   count          mean           std           min           25%           50%           75%           max
Location_Easting_OSGR       1.469894e+06  4.398972e+05  9.553525e+04  6.495000e+04  3.757500e+05  4.409300e+05  5.232900e+05  6.553700e+05
Location_Northing_OSGR      1.469894e+06  2.986746e+05  1.612611e+05  1.029000e+04  1.780082e+05  2.653400e+05  3.966000e+05  1.208800e+06
Longitude                   1.469894e+06 -1.432652e+00  1.404330e+00 -7.516225e+00 -2.363678e+00 -1.391637e+00 -2.184820e-01  1.759398e+00
Latitude                    1.469894e+06  5.257597e+01  1.452076e+00  4.991294e+01  5.148790e+01  5.227671e+01  5.346436e+01  6.075754e+01
Police_Force                1.469995e+06  3.078191e+01  2.551827e+01  1.000000e+00  7.000000e+00  3.100000e+01  4.600000e+01  9.800000e+01
Accident_Severity           1.469995e+06  2.838774e+00  4.014209e-01  1.000000e+00  3.000000e+00  3.000000e+00  3.000000e+00  3.000000e+00
Number_of_Vehicles          1.469995e+06  1.831846e+00  7.152256e-01  1.000000e+00  1.000000e+00  2.000000e+00  2.000000e+00  6.700000e+01
Number_of_Casualties        1.469995e+06  1.350891e+00  8.257998e-01  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  9.300000e+01
Day_of_Week                 1.469995e+06  4.118639e+00  1.924702e+00  1.000000e+00  2.000000e+00  4.000000e+00  6.000000e+00  7.000000e+00
Local_Authority_(District)  1.469995e+06  3.535705e+02  2.592795e+02  1.000000e+00  1.220000e+02  3.280000e+02  5.320000e+02  9.410000e+02
1st_Road_Class              1.469995e+06  4.089804e+00  1.429898e+00  1.000000e+00  3.000000e+00  4.000000e+00  6.000000e+00  6.000000e+00
1st_Road_Number             1.469995e+06  1.008856e+03  1.821693e+03 -1.000000e+00  0.000000e+00  1.290000e+02  7.260000e+02  9.999000e+03
Speed_limit                 1.469995e+06  3.907896e+01  1.417134e+01  1.000000e+01  3.000000e+01  3.000000e+01  5.000000e+01  7.000000e+01
Junction_Detail             0.000000e+00           NaN           NaN           NaN           NaN           NaN           NaN           NaN
2nd_Road_Class              1.469995e+06  2.663909e+00  3.207808e+00 -1.000000e+00 -1.000000e+00  3.000000e+00  6.000000e+00  6.000000e+00
2nd_Road_Number             1.469995e+06  3.800811e+02  1.300899e+03 -1.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  9.999000e+03
Urban_or_Rural_Area         1.469995e+06  1.356722e+00  4.792192e-01  1.000000e+00  1.000000e+00  1.000000e+00  2.000000e+00  3.000000e+00
Year                        1.469995e+06  2.009309e+03  3.021200e+00  2.005000e+03  2.006000e+03  2.010000e+03  2.012000e+03  2.014000e+03
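
One detail the summary reveals is that Junction_Detail has a count of 0, i.e. the column is entirely empty in this extract. If desired, such columns can be dropped; this optional cleanup is not needed for anything below:

# drop columns that contain no data at all
df_acc = df_acc.dropna(axis=1, how='all')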

The accidents data contain 33 columns, many of which are not really useful for our purposes. For instance, the first two columns give accident locations in Easting and Northing coordinates, a Cartesian reference system that takes the shape of the Earth’s surface in a particular region into account and allows accurate distance measurements. We will not use them, and will work instead with the usual Longitude and Latitude.
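
For completeness, here is a minimal sketch of how one could convert between the two systems, assuming the pyproj package is available (it is not used anywhere else in this post). The British National Grid is EPSG:27700, and WGS84 longitude/latitude is EPSG:4326:

from pyproj import Transformer

# British National Grid (Easting/Northing) -> WGS84 (Longitude/Latitude)
osgb_to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
lon, lat = osgb_to_wgs84.transform(439897.2, 298674.6)  # the mean accident location from the table above
print(lat, lon)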

Looking at the summary statistics, one can already spot outliers and sentinel values, for example when a column’s minimum or maximum lies outside its reasonable range: several road-number columns have a minimum of -1, which presumably encodes missing data.
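
A quick way to flag such columns from the summary itself (a rough sketch, not used further):

desc = df_acc.describe()
# columns whose minimum is -1, a likely missing-value placeholder
print(desc.columns[desc.loc['min'] == -1].tolist())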

df_traffic:

df_traffic.describe().transpose()
                         count          mean           std           min           25%           50%           75%            max
AADFYear              275385.0   2007.967907      4.862875   2000.000000   2004.000000   2008.000000   2012.000000    2016.000000
CP                    275385.0  47277.490437  27004.889605     60.000000  26218.000000  47252.000000  74392.000000   99967.000000
Easting               275385.0  425675.250780 98687.570954  69987.000000 352970.000000 430000.000000 510600.000000  655040.000000
Northing              275385.0  3.561826e+05  1.869923e+05  7.625000e+04  1.921200e+05  3.240000e+05  4.296000e+05   1.205400e+06
LinkLength_km         275385.0      2.757116      3.815569      0.070000      0.600000      1.400000      3.200000      55.500000
LinkLength_miles      275385.0      1.712899      2.371086      0.040000      0.370000      0.870000      1.990000      34.490000
PedalCycles           275385.0    123.314654    421.440798      0.000000      5.000000     26.000000     83.000000   18629.000000
Motorcycles           275385.0    222.099297    420.874069      0.000000     58.000000    113.000000    214.000000    9815.000000
CarsTaxis             275385.0  16813.332923  16618.266842      0.000000   7125.000000  12255.000000  19951.000000  207133.000000
BusesCoaches          275385.0    248.662175    374.650774      0.000000     60.000000    127.000000    280.000000   11359.000000
...                        ...           ...           ...           ...           ...           ...           ...            ...
V2AxleRigidHGV        275385.0    489.379193    683.713666      0.000000    148.000000    273.000000    517.000000   10942.000000
V3AxleRigidHGV        275385.0     83.551450    121.773472      0.000000     24.000000     46.000000     91.000000    5968.000000
V4or5AxleRigidHGV     275385.0     88.924513    142.005359      0.000000     15.000000     41.000000    100.000000    3684.000000
V3or4AxleArticHGV     275385.0     74.249433    183.767833      0.000000      7.000000     20.000000     56.000000    3949.000000
V5AxleArticHGV        275385.0    247.927389    752.311296      0.000000      9.000000     34.000000    125.000000   11034.000000
V6orMoreAxleArticHGV  275385.0    271.747354    835.966781      0.000000     10.000000     37.000000    137.000000   13758.000000
AllHGVs               275385.0   1255.774163   2446.175298      0.000000    252.000000    489.000000   1041.000000   27095.000000
AllMotorVehicles      275385.0  21152.693542  21536.689989      0.000000   8941.000000  15143.000000  24660.000000  262842.000000
Lat                   275385.0     53.092936      1.681459     50.583853     51.614977     52.807705     53.760638      60.727231
Lon                   275385.0     -1.654634      1.478250     -7.442680     -2.711540     -1.549475     -0.389834       1.754553

21 rows × 8 columns

The traffic data contain 21 numeric columns (describe() skips categorical ones such as Region), of which, again, we will use only a few.

Interactive and animated maps

Using folium, a Python wrapper for Leaflet, we can generate maps like those illustrated with R in the previous posts. As a first illustration, let us create a heatmap animation of accident locations in London from 2005 to 2014 (2008 is absent from the data):

import folium # Leaflet map wrappers for Python
from folium import plugins

# select the London area (a box around Big Ben)
df_london = df_acc[
    (df_acc['Latitude']  > 51.5007 - 0.03) & (df_acc['Latitude']  < 51.5007 + 0.03) &
    (df_acc['Longitude'] > -0.1246 - 0.08) & (df_acc['Longitude'] < -0.1246 + 0.08)
]

map_london = folium.Map(location=[51.5007, -0.1246], tiles="Stamen Toner", zoom_start=13)  # note: recent folium releases no longer bundle Stamen tiles; you may need to pass a tile URL and attribution instead

heat_df = df_london[['Latitude', 'Longitude']].copy()

# Create a weight column from the Date string (format dd/mm/yyyy), encoding year + month/100
months = df_london['Date'].str[3:5].astype(float)
years = df_london['Date'].str[6:10].astype(float)
weights = (years + months / 100).to_numpy()

unique_weights = np.unique(weights)
hmap_weights = unique_weights[0::10]  # keep every 10th month, to limit the number of animation frames

heat_df['Weight'] = weights
heat_df['Weight'] = heat_df['Weight'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Latitude', 'Longitude', 'Weight'])

# List comprehension to create list of lists
heat_data = [[[row['Latitude'], row['Longitude']] 
              for index, row in heat_df[heat_df['Weight'] == i].iterrows()] 
             for i in hmap_weights]
# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data, auto_play=True, max_opacity=0.8, radius = 20)
hm.add_to(map_london)
# Display the map
map_london
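
When running this as a script rather than in a notebook, the interactive map can also be saved to a standalone HTML file and opened in a browser (the file name below is arbitrary):

map_london.save("london_accidents_heatmap.html")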

We can do the same for traffic volumes (the data record average annual daily flows, not jams) in London:

map_london = folium.Map(location=[51.5007, -0.1246], tiles="Stamen Toner", zoom_start=13) 

df_traffic_london = df_traffic[df_traffic['Region'] == 'London']

# List comprehension to create a list of [lat, lon, weight] triples per year,
# with the years sorted so the animation frames run chronologically
heat_data = [[[row['Lat'], row['Lon'], row["AllMotorVehicles"]] 
              for index, row in df_traffic_london[df_traffic_london['AADFYear'] == i].iterrows()]
             for i in sorted(set(df_traffic_london.AADFYear))]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data, auto_play=True, max_opacity=0.8, radius = 20)
hm.add_to(map_london)
# Display the map
map_london

In truth, the recorded traffic count locations barely vary across the years, and the heatmap does not reveal any change in intensity either. The fixed locations are expected, since traffic is measured at a fixed set of count points; the flat intensity may simply be because we passed raw vehicle counts as weights, while HeatMapWithTime weights are typically scaled to the interval [0, 1]. Nevertheless, traffic volumes and accident locations do look correlated.
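
As a very rough check of that impression, one could count the accidents falling within a small box around each traffic count point and correlate this with the measured traffic. The sketch below (with a hypothetical helper accidents_near, a box of about 0.005 degrees, and only the central-London accidents selected earlier) is illustrative at best; a proper analysis would use true distances and control for confounders:

# count accidents within ~0.005 degrees of a given point
def accidents_near(lat, lon, acc=df_london, eps=0.005):
    box = (acc['Latitude'].sub(lat).abs() < eps) & (acc['Longitude'].sub(lon).abs() < eps)
    return box.sum()

# one row per count point (CP), with its location and average traffic
pts = df_traffic_london.groupby('CP').agg({'Lat': 'first', 'Lon': 'first',
                                           'AllMotorVehicles': 'mean'}).reset_index()
pts['n_accidents'] = [accidents_near(la, lo) for la, lo in zip(pts['Lat'], pts['Lon'])]
print(np.corrcoef(pts['AllMotorVehicles'], pts['n_accidents'])[0, 1])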

Grouping, counting and plotting

As in R with data.table, pandas DataFrames offer many SQL-like operations. For instance, we can count accidents grouped by severity, recorded on a scale from 1 to 3 (1 fatal, 2 serious, 3 slight), and plot the evolution of the counts. Let us do this for London and for the whole of the UK separately, and display the plots next to each other:

import matplotlib.pyplot as plt
import seaborn as sns

# accident severity by year in London
grp_london = df_london.groupby(["Year", "Accident_Severity"])
grp_london = grp_london.size().reset_index(name='counts')

# accident severity by year in UK
grp_uk = df_acc.groupby(["Year", "Accident_Severity"])
grp_uk = grp_uk.size().reset_index(name='counts')

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)
bplot = sns.barplot(x = "Year", y = "counts", hue = "Accident_Severity", data = grp_london)
bplot.set_title("Count of accidents by year and by severity in London")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(1, 2, 2)
bplot = sns.barplot(x = "Year", y = "counts", hue = "Accident_Severity", data = grp_uk)
bplot.set_title("Count of accidents by year and by severity in the UK")
bplot.set_ylabel("")
# place the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: count of accidents by year and by severity, in London (left) and in the UK (right)]

It is clear that London and the UK are not following the same trend: whereas slight accidents are decreasing in the UK as a whole, in London they seem to be increasing. Looking closer, however, London accounts for only about 2.5% of all slight accidents while being home to close to 10% of the UK population, which suggests that London is nevertheless doing better than the rest of the country at preventing traffic accidents.
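
That 2.5% figure can be verified directly, keeping in mind that our df_london is only a box around central London:

# London's share of slight (severity 3) accidents
slight_london = (df_london['Accident_Severity'] == 3).sum()
slight_uk = (df_acc['Accident_Severity'] == 3).sum()
print(f"{100 * slight_london / slight_uk:.1f}%")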

Regarding accidents of severity 1 or 2, the trends are harder to read off the plots above; severity 2 in particular seems to fluctuate with no clear direction. Let’s have yet another, closer look:

# accident severity by year in London
grp_london = df_london.groupby(["Year", "Accident_Severity"])
grp_london = grp_london.size().reset_index(name='counts')
grp_london =  grp_london[grp_london["Accident_Severity"] != 3] # exclude accidents of severity 3

# accident severity by year in UK
grp_uk = df_acc.groupby(["Year", "Accident_Severity"])
grp_uk = grp_uk.size().reset_index(name='counts')
grp_uk =  grp_uk[grp_uk["Accident_Severity"] != 3] # exclude accidents of severity 3

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)
bplot = sns.barplot(x = "Year", y = "counts", hue = "Accident_Severity", data = grp_london)
bplot.set_title("Count of accidents by year and by severity in London")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(1, 2, 2)
bplot = sns.barplot(x = "Year", y = "counts", hue = "Accident_Severity", data = grp_uk)
bplot.set_title("Count of accidents by year and by severity in the UK")
bplot.set_ylabel("")
# Put the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: count of accidents of severity 1 and 2 by year, in London (left) and in the UK (right)]

Now it is clearer: accidents of severity 1 and 2 seem to be on a decreasing trend both in London and in the rest of the UK. Also, 2012 was a bad year for London’s drivers. Perhaps because of the Olympics?

Let us now have a look at casualties, defined as people injured or killed in an accident. Are they decreasing? Do they follow the same trends as the total number of accidents?

# casualties by year in London
grp_london = df_london.groupby(["Year", "Number_of_Casualties"])
grp_london = grp_london.size().reset_index(name='counts')

# casualties by year in UK
grp_uk = df_acc.groupby(["Year", "Number_of_Casualties"])
grp_uk = grp_uk.size().reset_index(name='counts')

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)
bplot = sns.barplot(x = "Number_of_Casualties", y = "counts", hue = "Year", palette="BuGn_d", 
                    data= grp_london[grp_london["counts"]>20])
bplot.set_title("Number of accidents by number of casualties by year in London")
bplot.set_xlabel("Number of casualties")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(1, 2, 2)
bplot = sns.barplot(x = "Number_of_Casualties", y = "counts", hue = "Year", palette="BuGn_d", 
                    data= grp_uk[grp_uk["counts"]>2000])
bplot.set_title("Number of accidents by number of casualties by year in the UK")
bplot.set_xlabel("Number of casualties")
bplot.set_ylabel("")
# Put the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: number of accidents by number of casualties and by year, in London (left) and in the UK (right)]

Casualty numbers tell the same story as the accident severity plots. On the one hand, relative to its share of the population, London has fewer casualties than the UK as a whole. On the other hand, the UK numbers trend downwards, while London’s seem to fluctuate randomly.

Let’s now have a look at the number of accidents by the speed limit posted where the accident occurred. Are we going to see the same trends yet again?

# accidents by year and speed limit in London
grp_london = df_london.groupby(["Year", "Speed_limit"])
grp_london = grp_london.size().reset_index(name='counts')

# accidents by year and speed limit in UK
grp_uk = df_acc.groupby(["Year", "Speed_limit"])
grp_uk = grp_uk.size().reset_index(name='counts')

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)
bplot = sns.barplot(x = "Speed_limit", y = "counts", hue = "Year", palette="BuGn_d", 
                    data= grp_london)
bplot.set_title("Number of accidents by speed limit by year in London")
bplot.set_xlabel("Speed limit (mph)")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(1, 2, 2)
bplot = sns.barplot(x = "Speed_limit", y = "counts", hue = "Year", palette="BuGn_d", 
                    data= grp_uk)
bplot.set_title("Number of accidents by speed limit by year in the UK")
bplot.set_xlabel("Speed limit (mph)")
bplot.set_ylabel("")
# Put the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: number of accidents by speed limit and by year, in London (left) and in the UK (right)]

Yes, we are. But we also learn something new: most accidents in London happen on streets with a 30 mph speed limit, probably simply because most streets in London carry that limit. The same holds for the UK as a whole, although there quite a few accidents also happen on faster roads. Is there a correlation between the speed limit and the severity of accidents?
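
The first point is easy to make explicit; the share of accident records per posted limit gives at least a rough indication of how common each limit is:

# distribution of speed limits across London accident records
print(df_london['Speed_limit'].value_counts(normalize=True).round(3))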

# accident severity and speed limit in London
grp_london = df_london.groupby(["Speed_limit", "Accident_Severity"])
grp_london = grp_london.size().reset_index(name='counts')

# accident severity and speed limit in UK
grp_uk = df_acc.groupby(["Speed_limit", "Accident_Severity"])
grp_uk = grp_uk.size().reset_index(name='counts')

# convert counts into percentages within each severity class
grp_london["counts"] = (100 * grp_london["counts"]
                        / grp_london.groupby("Accident_Severity")["counts"].transform("sum"))
london_barplot = grp_london

grp_uk["counts"] = (100 * grp_uk["counts"]
                    / grp_uk.groupby("Accident_Severity")["counts"].transform("sum"))
uk_barplot = grp_uk

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)
bplot = sns.barplot(x = "Speed_limit", y = "counts", hue = "Accident_Severity", palette="BuGn_d", 
                    data = london_barplot)
bplot.set_title("Percentage of accidents of given severity by speed limit in London")
bplot.set_xlabel("Speed limit (mph)")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(1, 2, 2)
bplot = sns.barplot(x = "Speed_limit", y = "counts", hue = "Accident_Severity", palette="BuGn_d", 
                    data = uk_barplot)
bplot.set_title("Percentage of accidents of given severity by speed limit in the UK")
bplot.set_xlabel("Speed limit (mph)")
bplot.set_ylabel("")
# Put the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: percentage of accidents of a given severity by speed limit, in London (left) and in the UK (right)]

As illustrated in the barplots above, in London almost all accidents, of every severity, occur on roads with a 30 mph speed limit, which is perhaps not very surprising. In the UK as a whole the distribution is more spread out, with fatal accidents (severity 1) occurring mostly on faster roads.

Traffic vs accidents

To conclude, let us turn our attention to the relation between traffic and accidents. Can we predict the number of accidents based solely on the number of cars on the road? Or is there a more subtle relation, perhaps with more factors coming into play?

The traffic data contain the number of vehicles on the road, broken down by vehicle type. Without dwelling on what each category name corresponds to, we can plot the number of vehicles per category and check visually whether it is a good proxy for the number of accidents:

# number of accidents, normalised so that 2005 = 1
acc_london = df_london.groupby("Year")
acc_london = acc_london.size().reset_index(name='counts')
acc_london["counts"] = acc_london.counts/acc_london.counts[0]  # row 0 is the year 2005
acc_london = acc_london.rename(columns = {"counts" : 'accidents (2005=1)'})

acc_uk = df_acc.groupby("Year")
acc_uk = acc_uk.size().reset_index(name='counts')
acc_uk["counts"] = acc_uk.counts/acc_uk.counts[0]
acc_uk = acc_uk.rename(columns = {"counts" : 'accidents (2005=1)'})

# aggregation dict: sum each vehicle-type column (columns 14-26 of df_traffic) by year
d = {value : "sum" for value in list(df_traffic)[14:27]}

traffic_london = df_traffic[df_traffic["Region"] == "London"].groupby(["AADFYear"])
traffic_london = traffic_london.agg(d).reset_index()
traffic_london = traffic_london.rename(columns = {"AADFYear" : "Year"})

traffic_uk = df_traffic.groupby(["AADFYear"])
traffic_uk = traffic_uk.agg(d).reset_index()
traffic_uk = traffic_uk.rename(columns = {"AADFYear" : "Year"})

# normalise all vehicle counts so that 2005 = 1 (row 5 corresponds to AADFYear 2005)
traffic_london.iloc[:,1:14] = traffic_london.iloc[:,1:14]/traffic_london.iloc[5,1:14]
traffic_uk.iloc[:,1:14] = traffic_uk.iloc[:,1:14]/traffic_uk.iloc[5,1:14]

# merged data
acc_tr_lon = pd.merge(traffic_london, acc_london, on='Year', how='outer')
acc_tr_uk = pd.merge(traffic_uk, acc_uk, on='Year', how='outer')

# reshape to long format for the barplots
london_barplot = pd.melt(acc_tr_lon, id_vars="Year", var_name="category", value_name="value")
uk_barplot = pd.melt(acc_tr_uk, id_vars="Year", var_name="category", value_name="value")

bar_palette = sns.color_palette("BuGn_d", 13)
bar_palette.append((1,0,0))  # red for the accidents series

sns.set(font_scale=1.4)
plt.figure(figsize=(20, 15))

plt.subplot(2, 1, 1)
bplot = sns.barplot(x = "Year", y = "l2", hue = "l1", palette=bar_palette, 
                    data = london_barplot)
bplot.set_title("Accidents vs vehicles on the road in London")
bplot.set_xlabel("Year")
bplot.set_ylabel("")
bplot.legend_.remove()

plt.subplot(2, 1, 2)
bplot = sns.barplot(x = "Year", y = "l2", hue = "l1", palette=bar_palette, 
                    data = uk_barplot)
bplot.set_title("Accidents vs vehicles on the road in the UK")
bplot.set_xlabel("Year")
bplot.set_ylabel("")
# Put the legend out of the figure
bplt = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

[Barplots: accidents vs vehicles on the road by year and by category, in London (top) and in the UK (bottom)]

The barplots above are somewhat inconclusive: some vehicle categories seem unrelated to the number of accidents, while others look correlated. Note also that the accident data cover just nine years, so yearly trends cannot be established from them with much confidence. Perhaps by looking at shorter time intervals we could build a predictive model for accident rates?
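
As a parting sketch, and purely illustrative given only nine overlapping years, one could regress the normalised accident counts on the normalised motor-vehicle counts from the merged table:

# fit accidents ~ a * AllMotorVehicles + b on the overlapping years
xy = acc_tr_uk[["AllMotorVehicles", "accidents (2005=1)"]].dropna()
a, b = np.polyfit(xy["AllMotorVehicles"], xy["accidents (2005=1)"], 1)
r = np.corrcoef(xy["AllMotorVehicles"], xy["accidents (2005=1)"])[0, 1]
print(f"slope={a:.2f}, intercept={b:.2f}, correlation={r:.2f}")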

That’s all for now!