Make More Plots By Using Plotly

To make some animation plots!

1. Create a Database

As always, we should import modules that we need to create a database first.

old friend panda
sqlite3: SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.

import pandas as pd
import sqlite3
from plotly import express as px
import numpy as np
from sklearn.linear_model import LinearRegression

temps = pd.read_csv("temps_stacked.csv")

# rename columns
countries = pd.read_csv('countries.csv')
countries = countries.rename(columns = {"FIPS 10-4": "FIPS_10-4"}) 
countries = countries.rename(columns = {"ISO 3166": "ISO_3166"})

stations = pd.read_csv('station-metadata.csv')

The ‘countries’ table has column names that contain space. Keeping spaces in column names in the database is not a good idea. Later when we use those columns, we need to put them within “ “ to specific this is a column name.

So for convenience, we could change those column names by replacing ‘space’ with ‘_’

Steps

start by creating a Connection object that represents the database. Here the data will be stored in the temps.db file
use to_sql() function to write records stored in a DataFrame to a SQL database
Once a Connection has been established, create a Cursor object and call its execute() method to perform SQL commands:

# open a connection to temps.db sp that you can 'talk' to using python
conn = sqlite3.connect("temps.db")

# if the temperatures already exits in the temps.db, if_exists="replace" could replace the cvs file.
temps.to_sql("temperatures", conn, if_exists="replace", index=False)
countries.to_sql("countries", conn, if_exists="replace", index=False)
stations.to_sql("stations", conn, if_exists="replace", index=False)

# always close your connection
conn.close()

Here is a simple example:

we want to check how many tables inside the database:

“SELECT name FROM sqlite_master WHERE type=’table’” is a sql command

use conn.cursor() to create a Cursor object

Cursor objects have execute() method to perform SQL commands

conn = sqlite3.connect("temps.db")

# query the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")

# retrieve the data
print(cursor.fetchall())

conn.close()

[('temperatures',), ('countries',), ('stations',)]

From the output, we can see total 3 tables inside the database

Before, we write a function, lets check what are varibles in those tables:

temps.head()

	ID	Year	Month	Temp
0	ACW00011604	1961	1	-0.89
1	ACW00011604	1961	2	2.36
2	ACW00011604	1961	3	4.72
3	ACW00011604	1961	4	7.73
4	ACW00011604	1961	5	11.28

countries.head()

	FIPS_10-4	ISO_3166	Name
0	AF	AF	Afghanistan
1	AX	-	Akrotiri
2	AL	AL	Albania
3	AG	DZ	Algeria
4	AQ	AS	American Samoa

stations.head()

	ID	LATITUDE	LONGITUDE	STNELEV	NAME
0	ACW00011604	57.7667	11.8667	18.0	SAVE
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR

2. Write a Query Function

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

country, a string giving the name of a country (e.g. ‘United States’) for which data should be returned.
year_begin a integer giving the earliest years for which should be returned.
year_end a integer giving the latest years for which should be returned.
month, an integer giving the month of the year for which should be returned.

The station name.
The latitude of the station.
The longitude of the station.
The name of the country in which the station is located.
The year in which the reading was taken.
The month in which the reading was taken.
The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so you don’t have to do any aggregation at this stage.)

Idea: We want to open a sql connection and close it within the function call, write a sql command which takes input arguments. To achieve this task, we could use String.format(), or use panda modules.

def query_climate_database(country, year_begin, year_end, month):
    # open connection to database
    conn = sqlite3.connect("temps.db")
    
    # sql command
    cmd = "SELECT S.Name, S.LATITUDE, S.LONGITUDE, C.Name AS Country, T.Year, T.Month, T.Temp \
    FROM temperatures T \
    LEFT JOIN stations S ON S.ID = T.ID \
    LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
    WHERE (Year BETWEEN {} AND {}) AND (Month={})".format(year_begin, year_end, month)
    
    # excute the command
    df = pd.read_sql(cmd, conn)
    
    # subset the dataframe with the country taht user wanted
    df = df[df['Country'] == country]

    # always close the database when we are finish use it
    conn.close()
    
    return df

df = query_climate_database(country = "India", 
                            year_begin = 1980, 
                            year_end = 2020,
                            month = 1)

df

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
165212	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48
165213	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57
165214	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19
165215	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51
165216	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81
...	...	...	...	...	...	...	...
168359	DARJEELING	27.050	88.270	India	1983	1	5.10
168360	DARJEELING	27.050	88.270	India	1986	1	6.90
168361	DARJEELING	27.050	88.270	India	1994	1	8.10
168362	DARJEELING	27.050	88.270	India	1995	1	5.60
168363	DARJEELING	27.050	88.270	India	1997	1	5.70

3152 rows × 7 columns

3. Write a Geographic Scatter Function for Yearly Temperature Increases

Goal: Write a function called temperature_coefficient_plot() which can make a scatter plot that indicates the average yearly change in temperature vary within a given country.

Panda Functions:

transform: Call func on self producing a DataFrame with the same axis shape as self.
apply: Apply a function along an axis of the DataFrame.

How do we find the yearly temperature increases?

fit a line through temperatures for each station/each month, find the slope of the line and that corresponds to “average change in temp per year”

If we want to know how many observations of each stations name, and add the corresponding number of observations into the data frame form we made:

df["count"] = df.groupby(["NAME"])["Country"].transform('count')
df

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp	count
165212	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48	34
165213	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57	34
165214	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19	34
165215	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51	34
165216	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81	34
...	...	...	...	...	...	...	...	...
168359	DARJEELING	27.050	88.270	India	1983	1	5.10	7
168360	DARJEELING	27.050	88.270	India	1986	1	6.90	7
168361	DARJEELING	27.050	88.270	India	1994	1	8.10	7
168362	DARJEELING	27.050	88.270	India	1995	1	5.60	7
168363	DARJEELING	27.050	88.270	India	1997	1	5.70	7

3152 rows × 8 columns

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs = 10, **kwargs):
    # use the query function we made to make a dataframe
    df = query_climate_database(country, year_begin, year_end, month)
    
    # use transform to add the corresponding count for each rows
    df["count"] = df.groupby(["NAME"])["Country"].transform('count')
    
    # subset the dataframe by remove the data that less than min_obs
    df = df[df['count'] >= min_obs]

    # define a function to get the linear regression for input dataframe, then return it's slope
    def coef(data_group):
        X = data_group[["Year"]]
        y = data_group["Temp"]
        LR = LinearRegression()
        LR.fit(X, y)
        slope = LR.coef_[0]
        return slope
    
    df = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef).round(4)
    df = pd.DataFrame(df)
    df = df.reset_index()
    df = df.rename(columns = {0: "Estimated_Yearly_Increase_(℃)"})
    
    fig = px.scatter_mapbox(data_frame = df, 
                            lat = "LATITUDE", 
                            lon = "LONGITUDE",
                            hover_name = "NAME", 
                            color = "Estimated_Yearly_Increase_(℃)",
                            **kwargs
                           )
    return fig

color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map
                                  )

fig.update_layout(title="Estimates of yearly increase in temperature in January for<br>stations in India, years 1980-2020")

fig.show()

color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("Japan", 1980, 2020, 5, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map
                                  )

fig.update_layout(title="Estimates of yearly increase in temperature in May for<br>stations in Japan, years 1980-2020")

fig.show()

4. Create One More Query Function and Two More Interesting Figures

4-1. Let’s make a scatter plot with query_climate_database():

For this plot, we want to select specific citits of a country and output scatter plot with year verse temperature. Therefore, we could see the temperatures’ change over years. With facet_col=city’s name, this plot function will make multiple subplots with respect cities.

We use query_climate_database() to make a dataframe with country (ex: China), year from 1980 to 2020, and month equal to Feburary as example.
Use set to output unique cities’ name.
Pick cities you want to observe, and group those cities name as a list.

query_climate_database("China", 1980, 2020, 2)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
108398	AN_XI	40.50	96.0	China	1980	2	-5.27
108399	AN_XI	40.50	96.0	China	1981	2	-5.46
108400	AN_XI	40.50	96.0	China	1983	2	-5.10
108401	AN_XI	40.50	96.0	China	1984	2	-7.40
108402	AN_XI	40.50	96.0	China	1985	2	-5.90
...	...	...	...	...	...	...	...
123047	YUANLING	28.47	110.4	China	2009	2	9.99
123048	YUANLING	28.47	110.4	China	2010	2	7.56
123049	YUANLING	28.47	110.4	China	2011	2	7.74
123050	YUANLING	28.47	110.4	China	2012	2	4.85
123051	YUANLING	28.47	110.4	China	2013	2	7.25

14654 rows × 7 columns

list(set(df["NAME"]))[0:10] # just a example to use set and print first 10 cities from it.

['AGARTALA',
 'MADRAS_MINAMBAKKAM',
 'JAGDALPUR',
 'MINICOYOBSY',
 'SURAT',
 'LUDHIANA',
 'SHOLAPUR',
 'JAIPUR_SANGANER',
 'SATNA',
 'TEZPUR']

Here is the function to make a scatter plot:

def scatter_plot(country, year_begin, year_end, month, cities, **kwargs):
    
    df = query_climate_database(country, year_begin, year_end, month)
    
    df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
    
    # append the data that collected from input cities
    for city in cities:
        df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
    
    fig = px.scatter(data_frame=df_new, 
                     x= "Year", 
                     y="Temp",
                     color = "NAME", 
                     hover_name="NAME", 
                     hover_data=["LATITUDE", "LONGITUDE"],
                     facet_col="NAME", 
                     **kwargs
                    )
    
    return fig

cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO','AN_XI']

fig = scatter_plot("China", 1980, 2020, 2, cities_selected)

fig.update_layout(title="Scatter plots with year verse temperature in Feburary with selected cities of China, years 1980-2020")

fig.show()

4-2 Using Same Data Set To Make Another Query Function and Use It To Make a Figure

Query Function:

This Query Function will take arguments country, year_begin, year_end, and their definition is same as the arguments in query_climate_database function. In this function we will have a new column ‘Date’ that stores data time values.

def query_climate_database_new(country, year_begin, year_end):
    # open connection to database
    conn = sqlite3.connect("temps.db")
    
    # sql command
    cmd = "SELECT S.Name, C.Name AS Country, T.Year, T.Month, T.Temp \
    FROM temperatures T \
    LEFT JOIN stations S ON S.ID = T.ID \
    LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
    WHERE (Year BETWEEN {} AND {}) \
    ORDER BY T.Year, T.Month".format(year_begin, year_end)
    
    # excute the command
    df = pd.read_sql(cmd, conn)
    
    # subset the dataframe with the country taht user wanted
    df = df[df['Country'] == country]
    
    # make new column with date time values
    df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))

    # always close the database when we are finish use it
    conn.close()
    
    return df

# an example
df = query_climate_database_new(country = "United States", 
                                year_begin = 2000, 
                                year_end = 2020)

df

	NAME	Country	Year	Month	Temp	Date
7506	ALEXANDER_CITY	United States	2000	1	7.79	2000-01-01
7507	ANDALUSIA_3_W	United States	2000	1	10.32	2000-01-01
7508	ASHLAND_3_ENE	United States	2000	1	6.89	2000-01-01
7509	ATHENS	United States	2000	1	5.88	2000-01-01
7510	ATMORE	United States	2000	1	11.24	2000-01-01
...	...	...	...	...	...	...
3416132	GRAND_MARAIS	United States	2020	12	-4.29	2020-12-01
3416133	SISSETON_MUNI_AP	United States	2020	12	-3.43	2020-12-01
3416134	BOSCOBEL_AP	United States	2020	12	-2.62	2020-12-01
3416135	LINCOLN_8_ENE	United States	2020	12	-1.32	2020-12-01
3416136	LINCOLN_11_SW	United States	2020	12	-0.51	2020-12-01

1727789 rows × 6 columns

Line plot:

People may interesting to observe the temperature change through the date for selected cities in a country. So we could make a line plot to achieve this goal.

def line_plot(country, year_begin, year_end, cities, **kwargs):
    
    df = query_climate_database_new(country, year_begin, year_end)
    
    df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
    
    # append the data that collected from input cities
    for city in cities:
        df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
    
    fig = px.line(data_frame=df_new, 
                  x= "Date", 
                  y="Temp",
                  color = "NAME", 
                  hover_name="NAME", 
                  facet_col="NAME", 
                  **kwargs
                 )
    
    return fig

cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO']

fig = line_plot("China", 2000, 2020, cities_selected)

fig.update_layout(title="Line plots with date verse temperature with selected cities of China, years 2000-2020")

fig.show()

Written on April 14, 2022