Flight Delay Analysis

click for git repo

§1. Introduction

The aim of this project is to use the data set provided by the U.S. Department of Transportation (DOT) Bureau of Transportation Statistics to analyze factors that predict the total number of flights and the punctuality rate for each day. I use an LSTM (Long Short-Term Memory) network to obtain the results. The ultimate goal of the project is to help people travel more efficiently and to understand the relationship between flights and weather.

from plotly import express as px
import plotly.io as pio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
import sqlite3  # sqlite3 is the Python module that lets us interact with SQLite
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, LSTM
import warnings
warnings.filterwarnings('ignore')
import datetime

§2. Data preparation

1) Flight Data

Flight2016=pd.read_csv("2016.csv")
Flight2015=pd.read_csv("2015.csv")
Flight2017=pd.read_csv("2017.csv")
Flight2018=pd.read_csv("2018.csv")
Flight2014=pd.read_csv("2014.csv")
Flight2013=pd.read_csv("2013.csv")
Flight2012=pd.read_csv("2012.csv")

Because LSTM (Long Short-Term Memory) is a deep learning technique, having as much data as possible helps produce more accurate results, so I downloaded all the raw flight data from 2012-2018. The following function performs the most basic data cleaning, which makes it easier to later splice in the other datasets and form a complete structure.

From the raw data, we found that there are 28 columns in the flight dataset. If the unit of our prediction were each flight instead of each day, I would select predictors through a correlation plot or heat map; but in this project the unit we explore is the day, so after data cleaning most columns will be removed or converted into other variables.

Flight2016.head()
FL_DATE OP_CARRIER OP_CARRIER_FL_NUM ORIGIN DEST CRS_DEP_TIME DEP_TIME DEP_DELAY TAXI_OUT WHEELS_OFF ... CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME DISTANCE CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY Unnamed: 27
0 2016-01-01 DL 1248 DTW LAX 1935 1935.0 0.0 23.0 1958.0 ... 309.0 285.0 249.0 1979.0 NaN NaN NaN NaN NaN NaN
1 2016-01-01 DL 1251 ATL GRR 2125 2130.0 5.0 13.0 2143.0 ... 116.0 109.0 92.0 640.0 NaN NaN NaN NaN NaN NaN
2 2016-01-01 DL 1254 LAX ATL 2255 2256.0 1.0 19.0 2315.0 ... 245.0 231.0 207.0 1947.0 NaN NaN NaN NaN NaN NaN
3 2016-01-01 DL 1255 SLC ATL 1656 1700.0 4.0 12.0 1712.0 ... 213.0 193.0 173.0 1590.0 NaN NaN NaN NaN NaN NaN
4 2016-01-01 DL 1256 BZN MSP 900 1012.0 72.0 63.0 1115.0 ... 136.0 188.0 121.0 874.0 72.0 0.0 52.0 0.0 0.0 NaN

5 rows × 28 columns
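As an aside, a minimal sketch of the per-flight predictor screening mentioned above (a correlation heat map over a few numeric columns that appear in the raw data) could look like this:

# Sketch: correlation heat map over a few numeric per-flight columns
num_cols = ["DEP_DELAY", "TAXI_OUT", "AIR_TIME", "DISTANCE", "CARRIER_DELAY"]
corr = Flight2016[num_cols].corr()  # pairwise Pearson correlations
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(num_cols)), num_cols, rotation=45)
plt.yticks(range(len(num_cols)), num_cols)
plt.colorbar()
plt.show()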

conn = sqlite3.connect("project.db")  # create a database called project
def flight_data_each_year(Flight):
    Flight.fillna(0, inplace=True)  # NA means there is no delay, so we can fill it with 0
    # Because there is too much data for each year, we study only the planes taking off from LAX
    LAX = Flight[Flight["ORIGIN"] == "LAX"]
    # There are many reasons for a flight delay, and we only care whether a flight is delayed,
    # so we count the nonzero delay reasons; a count >= 1 means the flight was delayed
    LAX["delay"] = (LAX[["CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY"]] != 0).astype(int).sum(axis=1)
    LAX["delay"][LAX.delay > 1] = 1  # collapse multiple reasons into a single 0/1 indicator
    cols = ["FL_DATE", "delay"]
    LAX = LAX[cols]
    LAX.to_sql("LAX", conn, if_exists = "replace", index = False)  # put the LAX table into the database
    # Call SQL to calculate the total number of flights per day and the total number of delayed flights
    # (because delay == 0 means no delay, summing the delay column gives the total number of delays)
    cmd = """
       SELECT *, COUNT(*) AS Total_flight_per_day, SUM(delay) AS Total_Delay_per_day
       FROM LAX
       GROUP BY FL_DATE
         """
    LAX_flight = pd.read_sql_query(cmd, conn)
    LAX_flight["punctuality rate"] = 1 - LAX_flight["Total_Delay_per_day"]/LAX_flight["Total_flight_per_day"]  # compute the punctuality rate for each day
    LAX_flight["week"] = ''  # create a new column named week
    for i in range(len(LAX_flight["FL_DATE"])):
        # Call the datetime package to standardize the time and get the day of the week for each day
        LAX_flight["week"][i] = datetime.datetime.strptime(LAX_flight["FL_DATE"].iloc[i], "%Y-%m-%d").isoweekday()
    # Because the output is only the number of the weekday, we convert it into English text
    LAX_flight["week"] = LAX_flight["week"].replace({1: 'Monday',
                                                     2: 'Tuesday',
                                                     3: 'Wednesday',
                                                     4: 'Thursday',
                                                     5: 'Friday',
                                                     6: 'Saturday',
                                                     7: 'Sunday'})
    # Because we will feed a neural network in the end, we one-hot encode the categorical variables
    encoder = OneHotEncoder(handle_unknown='ignore')
    # perform one-hot encoding on the 'week' column
    encoder_df = pd.DataFrame(encoder.fit_transform(LAX_flight[['week']]).toarray())
    # merge the one-hot encoded columns back with the original DataFrame
    LAX_flight = LAX_flight.join(encoder_df)
    # rename the encoded columns (OneHotEncoder orders categories alphabetically)
    colNameDict = {0: 'Friday',
                   1: 'Monday',
                   2: 'Saturday',
                   3: 'Sunday',
                   4: 'Thursday',
                   5: 'Tuesday',
                   6: 'Wednesday',
                  }
    LAX_flight.rename(columns = colNameDict, inplace=True)
    LAX_flight.drop(labels=['Total_Delay_per_day','delay'], axis=1, inplace=True)
    return LAX_flight
LAX_flight2016=flight_data_each_year(Flight2016)
LAX_flight2015=flight_data_each_year(Flight2015)
LAX_flight2017=flight_data_each_year(Flight2017)
LAX_flight2018=flight_data_each_year(Flight2018)
LAX_flight2014=flight_data_each_year(Flight2014)
LAX_flight2013=flight_data_each_year(Flight2013)
LAX_flight2012=flight_data_each_year(Flight2012)
# Concatenate the yearly datasets; because the amount of data is large, we combine them in two steps
LAX_flight = pd.concat([LAX_flight2015, LAX_flight2016, LAX_flight2017, LAX_flight2018])
LAX_flight = pd.concat([LAX_flight, LAX_flight2012, LAX_flight2013, LAX_flight2014])
LAX_flight
FL_DATE Total_flight_per_day punctuality rate week Friday Monday Saturday Sunday Thursday Tuesday Wednesday
0 2015-01-01 556 0.766187 Thursday 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 2015-01-02 640 0.656250 Friday 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2015-01-03 575 0.526957 Saturday 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 2015-01-04 615 0.536585 Sunday 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 2015-01-05 622 0.667203 Monday 0.0 1.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
360 2014-12-27 564 0.734043 Saturday 0.0 0.0 1.0 0.0 0.0 0.0 0.0
361 2014-12-28 604 0.743377 Sunday 0.0 0.0 0.0 1.0 0.0 0.0 0.0
362 2014-12-29 627 0.733652 Monday 0.0 1.0 0.0 0.0 0.0 0.0 0.0
363 2014-12-30 614 0.547231 Tuesday 0.0 0.0 0.0 0.0 0.0 1.0 0.0
364 2014-12-31 493 0.748479 Wednesday 0.0 0.0 0.0 0.0 0.0 0.0 1.0

2557 rows × 11 columns

In the processing of the week above, we used One-Hot Encoding instead of the more common Label Encoding. The premise of Label Encoding is that the variable should be ordinal to some extent: for example, easy is 1, medium is 2, difficult is 3. In our case the day of the week has no such ordering, so I used One-Hot Encoding. One-Hot Encoding is easy to understand: each value is either 0 or 1, where 0 means the row does not have the property and 1 means it does. For example, on 2015-01-01 the value in the Thursday column is 1, which means that 2015-01-01 was a Thursday.
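A tiny, self-contained illustration (a sketch, separate from the pipeline above) shows what the encoder produces:

toy = pd.DataFrame({"week": ["Monday", "Friday", "Monday"]})
enc = OneHotEncoder(handle_unknown="ignore")
print(enc.fit_transform(toy[["week"]]).toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
# columns follow alphabetical category order: Friday, Monday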

2) Visualization

fig = px.line(data_frame = LAX_flight2015, # data that needs to be plotted
                 x = "FL_DATE", # column name for x-axis
                 y = "Total_flight_per_day",
                 markers=True,
                 color = "week", # column name for color coding
                 width = 1000,
                 height = 500)

# reduce whitespace
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
fig = px.line(data_frame = LAX_flight2015, # data that needs to be plotted
                 x = "FL_DATE", # column name for x-axis
                 y = "punctuality rate",
                 markers=True,
                 color = "week", # column name for color coding
                 width = 1000,
                 height = 500)

# reduce whitespace
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
pio.write_html(fig, file='flight2.html', auto_open=True)

According to the plots, the overall flight punctuality rate appears to be a stable series, one with no obvious upward or downward trend, and the average punctuality rate is about 70%-90%. The total number of flights and the punctuality rate are inversely related, which matches our intuition: when traffic pressure is high, aircraft punctuality is low. When we look at the data for specific days, we find that weather should be a very important factor. Therefore, the next part of the code deals with the weather conditions in the Los Angeles area from 2012 to 2018.

3) Weather Data

weather = pd.read_csv("weather2012_2018.csv") 
weather.head()
local_time LA (airport) Ff N VV Td RRR
0 31.12.2018 22:00 2.0 20-0%. 16 -12.2 NaN
1 31.12.2018 16:00 4.0 50%. 16 6.7 NaN
2 31.12.2018 10:00 2.0 70 -80%. 16 7.8 NaN
3 31.12.2018 04:00 2.0 50%. 16 6.1 NaN
4 30.12.2018 22:00 3.0 20-0%. 16 8.3 NaN

Before processing, let's take a look at the raw data. We find that the monitoring station takes four measurements a day: the first at 22:00, the second at 16:00, the third at 10:00, and the fourth at 04:00. But don't forget that the unit of our final data is the day, so the exact time doesn't matter; the function below collapses the timestamps to dates. Also, the variable names of the original data are not very clear, so we rename them to better convey their real meanings.

def weather_data_each_year(Weather):
    Weather.fillna(0, inplace=True)
    # normalize the time; the raw timestamps are day-first ("31.12.2018 22:00")
    Weather['local_time LA (airport)'] = pd.to_datetime(Weather['local_time LA (airport)'], dayfirst=True)
    Weather["Day_time"] = Weather["local_time LA (airport)"].apply(lambda x: str(x)[0:10])  # keep only the date, not hours and minutes
    cols = ["Day_time", 'local_time LA (airport)', "Ff", "N", "RRR", "VV"]
    Weather = Weather[cols]
    Weather.columns = ["Day_time", 'Observation_time', 'wind_speed', "cloud", 'precipitation', 'visibility']  # rename
    Weather["precipitation"] = Weather["precipitation"].replace("Signs of precipitation", 0)
    # Convert cloud-cover percentages to decimals for subsequent calculations
    Weather["cloud"] = Weather["cloud"].replace({'100%.': 1,
                          '70 -80%.': 0.75,
                          '50%.': 0.5,
                          0: 0,
                         'The sky is not visible due to fog and/or other meteorological phenomena.': 0,
                         '20-0%.': 0.1,
                          'Cloudless': 0
                                   })
    Weather.to_sql("Weather", conn, if_exists = "replace", index = False)  # put the Weather table into the database
    # cmd calculates average wind speed, average visibility, average cloud cover, and total precipitation per day
    cmd = """
       SELECT Day_time, AVG(wind_speed) AS wind_speed_perday, AVG(visibility) AS visibility_perday, AVG(cloud) AS cloud_perday, SUM(precipitation) AS precipitation_perday
       FROM Weather
       GROUP BY Day_time
         """
    LAX_weather = pd.read_sql_query(cmd, conn)
    # Convert the numeric cloud cover into descriptive categories
    for i in range(len(LAX_weather["cloud_perday"])):
        if(LAX_weather["cloud_perday"][i] == 0):
            LAX_weather["cloud_perday"][i] = 'cloudless'
        elif(LAX_weather["cloud_perday"][i] < 0.3):
            LAX_weather["cloud_perday"][i] = 'less cloudy'
        elif(LAX_weather["cloud_perday"][i] < 0.7):
            LAX_weather["cloud_perday"][i] = 'cloudy'
        else:
            LAX_weather["cloud_perday"][i] = "overcast"
    # Convert precipitation into no/light/moderate/heavy rain
    for i in range(len(LAX_weather["precipitation_perday"])):
        if(LAX_weather["precipitation_perday"][i] == 0):
            LAX_weather["precipitation_perday"][i] = "no_rain"
        elif(LAX_weather["precipitation_perday"][i] < 10):
            LAX_weather["precipitation_perday"][i] = "light_rain"
        elif(LAX_weather["precipitation_perday"][i] < 25):
            LAX_weather["precipitation_perday"][i] = "moderate_rain"
        else:
            LAX_weather["precipitation_perday"][i] = "heavy_rain"
    # creating an instance of the one-hot encoder
    encoder = OneHotEncoder(handle_unknown='ignore')
    # perform one-hot encoding on the ['cloud_perday', 'precipitation_perday'] columns
    encoder_df = pd.DataFrame(encoder.fit_transform(LAX_weather[['cloud_perday','precipitation_perday']]).toarray())
    # merge the one-hot encoded columns back with the original DataFrame
    LAX_weather = LAX_weather.join(encoder_df)
    LAX_weather.drop(labels=['cloud_perday','precipitation_perday'], axis=1, inplace=True)
    # rename the encoded columns
    colNameDict = {0: 'cloudless',
                   1: 'cloudy',
                   2: 'fog',
                   3: 'less cloudy',
                   4: 'overcast',
                   5: 'heavy_rain',
                   6: 'light_rain',
                   7: 'moderate_rain',
                   8: 'no_rain'
                  }
    LAX_weather.rename(columns = colNameDict, inplace=True)
    return LAX_weather
LAX_weather=weather_data_each_year(weather)
LAX_weather
Day_time wind_speed_perday visibility_perday cloudless cloudy fog less cloudy overcast heavy_rain light_rain moderate_rain
0 2012-01-01 2.00 14.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1 2012-01-02 1.75 14.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
2 2012-01-03 2.75 16.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
3 2012-01-04 5.50 16.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
4 2012-01-05 3.50 13.5 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
2552 2018-12-27 5.25 16.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
2553 2018-12-28 2.50 16.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2554 2018-12-29 2.00 16.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2555 2018-12-30 3.00 16.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2556 2018-12-31 2.50 16.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

2557 rows × 11 columns

From the table above, note that I calculated the average wind speed and visibility for each day, but for precipitation I used a sum instead of an average. This is because total daily precipitation is the more meaningful quantity, whereas adding up wind speeds or visibilities has no physical meaning. Finally, I convert precipitation and cloud cover into categorical variables according to real-world classification criteria, and apply One-Hot Encoding to them.
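Incidentally, the same precipitation binning could be written in a vectorized way with pd.cut; here is a sketch on a toy series, using the same thresholds as the loop above:

rain_mm = pd.Series([0.0, 3.2, 18.0, 40.0])  # toy daily precipitation totals
labels = pd.cut(rain_mm, bins=[-0.1, 0, 10, 25, float("inf")],
                labels=["no_rain", "light_rain", "moderate_rain", "heavy_rain"])
print(list(labels))  # ['no_rain', 'light_rain', 'moderate_rain', 'heavy_rain']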

4) Merge datasets

LAX_weather.to_sql("LAX_weather", conn, if_exists = "replace", index = False)  # put the LAX_weather table into the database
LAX_flight.to_sql("LAX_flight", conn, if_exists = "replace", index = False)  # put the LAX_flight table into the database
cmd = """
   SELECT *
   FROM LAX_flight
   LEFT JOIN LAX_weather ON LAX_weather.Day_time = LAX_flight.FL_DATE
       
     """
    
data = pd.read_sql_query(cmd, conn)
#min max normalization
for column in data[["wind_speed_perday","visibility_perday","Total_flight_per_day"]]:
    data[column] = (data[column] - data[column].min()) / (data[column].max() - data[column].min())    

We call SQL again, because LEFT JOIN will merge the two datasets for us. The result is the final dataset. At the same time, we normalize the numerical variables in the data, and this time we use the Min-Max method instead of the standard scaler: since our data is not normally distributed, Min-Max is more suitable for this project.
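For reference, sklearn offers an equivalent transformer; a sketch of the same normalization with MinMaxScaler (not what the code above uses):

from sklearn.preprocessing import MinMaxScaler

cols = ["wind_speed_perday", "visibility_perday", "Total_flight_per_day"]
scaler = MinMaxScaler()  # rescales each column to [0, 1] via (x - min) / (max - min)
data[cols] = scaler.fit_transform(data[cols])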

sorted_df = data.sort_values(by='FL_DATE')
sorted_df=sorted_df.drop(labels=["Day_time","week"],axis=1)
sorted_df 
FL_DATE Total_flight_per_day punctuality rate Friday Monday Saturday Sunday Thursday Tuesday Wednesday wind_speed_perday visibility_perday cloudless cloudy fog less cloudy overcast heavy_rain light_rain moderate_rain
1461 2012-01-01 0.585366 0.778966 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.200 0.871176 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1462 2012-01-02 0.881098 0.653495 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.175 0.871176 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1463 2012-01-03 0.673780 0.806780 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.275 1.000000 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1464 2012-01-04 0.637195 0.787197 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.550 1.000000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1465 2012-01-05 0.740854 0.900327 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.350 0.838969 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1456 2018-12-27 0.908537 0.656672 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.525 1.000000 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1457 2018-12-28 0.896341 0.707391 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.250 1.000000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1458 2018-12-29 0.753049 0.735390 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.200 1.000000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1459 2018-12-30 0.868902 0.813456 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.300 1.000000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1460 2018-12-31 0.615854 0.861646 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.250 1.000000 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

2557 rows × 20 columns

Before dividing the data into training and test sets, we need to sort the data by date, because our division criterion is to split the data into a 60% training set, 20% validation set, and 20% test set in chronological order rather than randomly.

§3. Training, validation, and testing groups

# 60% train, 20% validation, 20% test

train_size = int(0.6*len(sorted_df))
val_size = int(0.2*len(sorted_df))

train = sorted_df[:train_size]
val = sorted_df[train_size : train_size + val_size]
test = sorted_df[train_size + val_size:]
# split each part into X (predictors) and y (targets)
X_train = train.drop(['Total_flight_per_day','punctuality rate','FL_DATE'], axis=1)
y_train = train[['Total_flight_per_day','punctuality rate']]
X_val = val.drop(['Total_flight_per_day','punctuality rate','FL_DATE'], axis=1)
y_val = val[['Total_flight_per_day','punctuality rate']]
X_test = test.drop(['Total_flight_per_day','punctuality rate','FL_DATE'], axis=1)
y_test = test[['Total_flight_per_day','punctuality rate']]

def create_sequence(values, time_steps):
    # slide a window of length time_steps over the rows to build a 3D array
    output = []
    for i in range(len(values) - time_steps):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)

X_train = create_sequence(X_train, 7)
X_test  = create_sequence(X_test, 7)
X_val   = create_sequence(X_val, 7)
# align the targets with the windows (drop the first time_steps rows)
y_train = y_train[-X_train.shape[0]:]
y_test  = y_test[-X_test.shape[0]:]
y_val   = y_val[-X_val.shape[0]:]
X_train.shape
(1527, 7, 17)

The create_sequence function above converts our 2D data into a 3D array, because the input to an LSTM is always a 3D array of shape (samples, time steps, features). As the shape above shows, we end up with 1,527 training sequences. With that, all the preparatory work is done.
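A quick toy run (a sketch) makes the windowing concrete:

toy = np.arange(10).reshape(5, 2)  # 5 "days" with 2 features each
seqs = create_sequence(toy, 3)     # sliding windows of 3 consecutive days
print(seqs.shape)                  # (2, 3, 2): 5 - 3 = 2 samples
print(seqs[0])                     # rows 0-2 of toy: [[0 1] [2 3] [4 5]]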

§4. Model Selection

With the data ready, we decided to use Keras' Sequential model to construct the training model. An ordinary neural network forgets the past as soon as it finishes one task and outputs a result; when new inputs enter the network, it makes predictions based only on them. Human thinking is not like this: instead of throwing everything away and starting from scratch, we combine old and newly acquired information to get new answers, and RNNs work very much the way humans think. But an RNN is not omnipotent. When the old information is too far from the new input, it loses the ability to learn from that distant information, and in the process we may lose very important information, making predictions inaccurate. The solution is the LSTM, a variant of the RNN. An LSTM does not remember every piece of information; it keeps the important information, carries it into the next step, and automatically discards unimportant factors. It implements this with three gates: an input gate, an output gate, and a forget gate.

Considering that the weather has a continuous impact on the punctuality rate, we use an LSTM recurrent neural network here to make predictions, with a Dense(2) output layer so the model outputs two results: the first is the predicted number of flights for the next day, and the second is the predicted punctuality rate for the next day. The picture below is a sample RNN diagram that can help us understand the concept.

RNN.png

model = Sequential()
# Add an LSTM layer; input shape is (time steps, features)
model.add(LSTM(32, return_sequences=False, input_shape=(X_train.shape[1], X_train.shape[-1])))
model.add(Dense(2))  # two outputs: total flights and punctuality rate
model.add(tf.keras.layers.Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='AdaDelta', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 32)                6400      
_________________________________________________________________
dense (Dense)                (None, 2)                 66        
_________________________________________________________________
activation (Activation)      (None, 2)                 0         
=================================================================
Total params: 6,466
Trainable params: 6,466
Non-trainable params: 0
_________________________________________________________________
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=20)
history = model.fit(X_train,
                    y_train,
                    validation_data=(X_val, y_val),
                    epochs = 500,
                    verbose = False,
                    batch_size=50, 
                    callbacks=[callback]
)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper right')
plt.show()

output_45_0.png

The result looks good: the validation loss keeps declining.

plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1f823099e80>

output_47_1.png

model.evaluate(X_test,y_test)
16/16 [==============================] - 0s 719us/step - loss: 0.5275 - accuracy: 0.7446
[0.5275416970252991, 0.7445544600486755]

The accuracy of the model finally reached about 74%. Considering that only weather and day-of-week factors are used, this is a decent result. Of course, during model training we can also adjust the window size, number of epochs, batch size, dropout ratio, and so on; with enough time, the model can be tuned repeatedly to obtain better predictions.
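For example, one variant worth trying during tuning (a sketch, not the final model) stacks a second LSTM layer and adds Dropout:

model2 = Sequential()
model2.add(LSTM(64, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[-1])))
model2.add(Dropout(0.2))  # randomly silence 20% of units during training to reduce overfitting
model2.add(LSTM(32))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])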

§5. Web Tech

For basic uses of Flask such as render_template, HTML templates, POST, GET, etc., please check this post.

In this project, we use Flask to add more functionality, such as outputting dataframes and embedding Plotly plots into the web page.

(1) Output dataframes (converted into HTML table format)

@app.route('/Data_preparation/')
def Data_preparation():
    df0 = pd.read_csv(r"data\flights_original.csv").iloc[:,:10]
    df1 = web_helper.LAX_table()
    df2 = web_helper.weather_data()
    df3 = web_helper.weather_data_each_year(pd.read_csv(r"data\weather2012_2018.csv"))
    df4 = web_helper.merge_data(df3, df1)
    return render_template('Data_preparation.html', table0=[df0.iloc[:10].to_html(classes='data', header="true")],
                                                    table1=[df1.iloc[:10].to_html(classes='data', header="true")],
                                                    table2=[df2.iloc[:10].to_html(classes='data', header="true")],
                                                    table3=[df3.iloc[:10].to_html(classes='data', header="true")],
                                                    table4=[df4.iloc[:10].to_html(classes='data', header="true")])

When we open the '/Data_preparation/' page, app.py calls the function Data_preparation(). Inside this function, helper functions import and return the corresponding data frames, which are assigned the names df0 to df4. Before passing them to the template, we render each dataframe into HTML form using the to_html method, called on the DataFrame object.
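For instance, on a toy frame (a sketch), to_html produces a plain HTML string:

df = pd.DataFrame({"a": [1, 2]})
html = df.to_html(classes='data', header="true")
print(html[:60])  # starts with something like '<table border="1" class="dataframe data">'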

In the HTML, we also need placeholders for these tables:

<figure class="table">
  {% for table in table0 %}
    {{ table | safe }}
  {% endfor %}
  <figcaption><small>Original Flights Data</small></figcaption>
</figure>

The to_html method returns an HTML string, and we wrapped each one in a list, so we need a for loop here to embed each element of the list in the template.

(2) Plotly plots

def plotly_flights(year):
    LAX_flight = pd.read_csv(r"data\LAX_flight.csv")

    fig = px.line(data_frame = LAX_flight[LAX_flight['FL_DATE'].str.contains(pat = str(year))], # data that needs to be plotted
                 x = "FL_DATE", # column name for x-axis
                 y = "Total_flight_per_day",
                 markers=True,
                 color = "week", # column name for color coding
                 width = 800,
                 height = 500)

    # reduce whitespace
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    # return the figure (the caller renders it into the page)
    return fig

Similar to the functions we used to return data frames, this function creates plots using Plotly. The year argument lets users choose which year of data they want to observe.

Following the same idea, we convert the Plotly figure object into a JSON string that JavaScript can understand, then render it with the 'Data_preparation.html' template.

fig1 = web_helper.plotly_flights(request.form["year"])
graphJSON1 = json.dumps(fig1, cls=plotly.utils.PlotlyJSONEncoder)
return render_template('Data_preparation.html', graphJSON_object = graphJSON1)

There is also a placeholder in 'Data_preparation.html':

<span id="chart1"></span>

<script src='https://cdn.plot.ly/plotly-latest.min.js'></script>
<script type='text/javascript'>
  var graphs1 = {{ graphJSON_object | safe }};
  Plotly.newPlot('chart1', graphs1, {});
</script>

Here we need a few more steps to produce a plot that the browser can actually 'draw'. The first script tag imports the Plotly JavaScript library, which converts the JSON string into an interactive plot. Plotly.newPlot() is that library's plotting function: 'chart1' is the id of the span element on the first line, which tells Plotly where to draw the plot, and graphs1 holds the figure data rendered into the template.

Written on June 5, 2022