Make More Plots By Using Plotly

To make some animation plots!

1. Create a Database

As always, we should import modules that we need to create a database first.

  • old friend panda
  • sqlite3: SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.
import pandas as pd
import sqlite3
from plotly import express as px
import numpy as np
from sklearn.linear_model import LinearRegression
temps = pd.read_csv("temps_stacked.csv")

# rename columns
countries = pd.read_csv('countries.csv')
countries = countries.rename(columns = {"FIPS 10-4": "FIPS_10-4"}) 
countries = countries.rename(columns = {"ISO 3166": "ISO_3166"})

stations = pd.read_csv('station-metadata.csv')

The ‘countries’ table has column names that contain space. Keeping spaces in column names in the database is not a good idea. Later when we use those columns, we need to put them within “ “ to specific this is a column name.

So for convenience, we could change those column names by replacing ‘space’ with ‘_’

Steps

  • start by creating a Connection object that represents the database. Here the data will be stored in the temps.db file
  • use to_sql() function to write records stored in a DataFrame to a SQL database
  • Once a Connection has been established, create a Cursor object and call its execute() method to perform SQL commands:


# open a connection to temps.db sp that you can 'talk' to using python
conn = sqlite3.connect("temps.db")

# if the temperatures already exits in the temps.db, if_exists="replace" could replace the cvs file.
temps.to_sql("temperatures", conn, if_exists="replace", index=False)
countries.to_sql("countries", conn, if_exists="replace", index=False)
stations.to_sql("stations", conn, if_exists="replace", index=False)

# always close your connection
conn.close()



Here is a simple example:

we want to check how many tables inside the database:

“SELECT name FROM sqlite_master WHERE type=’table’” is a sql command

use conn.cursor() to create a Cursor object

Cursor objects have execute() method to perform SQL commands

conn = sqlite3.connect("temps.db")

# query the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")

# retrieve the data
print(cursor.fetchall())

conn.close()
[('temperatures',), ('countries',), ('stations',)]

From the output, we can see total 3 tables inside the database

Before, we write a function, lets check what are varibles in those tables:

temps.head()
ID Year Month Temp
0 ACW00011604 1961 1 -0.89
1 ACW00011604 1961 2 2.36
2 ACW00011604 1961 3 4.72
3 ACW00011604 1961 4 7.73
4 ACW00011604 1961 5 11.28
countries.head()
FIPS_10-4 ISO_3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

2. Write a Query Function


The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

  1. country, a string giving the name of a country (e.g. ‘United States’) for which data should be returned.
  2. year_begin a integer giving the earliest years for which should be returned.
  3. year_end a integer giving the latest years for which should be returned.
  4. month, an integer giving the month of the year for which should be returned.

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

  • The station name.
  • The latitude of the station.
  • The longitude of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so you don’t have to do any aggregation at this stage.)

Idea: We want to open a sql connection and close it within the function call, write a sql command which takes input arguments. To achieve this task, we could use String.format(), or use panda modules.

def query_climate_database(country, year_begin, year_end, month):
    # open connection to database
    conn = sqlite3.connect("temps.db")
    
    # sql command
    cmd = "SELECT S.Name, S.LATITUDE, S.LONGITUDE, C.Name AS Country, T.Year, T.Month, T.Temp \
    FROM temperatures T \
    LEFT JOIN stations S ON S.ID = T.ID \
    LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
    WHERE (Year BETWEEN {} AND {}) AND (Month={})".format(year_begin, year_end, month)
    
    # excute the command
    df = pd.read_sql(cmd, conn)
    
    # subset the dataframe with the country taht user wanted
    df = df[df['Country'] == country]

    # always close the database when we are finish use it
    conn.close()
    
    return df
df = query_climate_database(country = "India", 
                            year_begin = 1980, 
                            year_end = 2020,
                            month = 1)
df
NAME LATITUDE LONGITUDE Country Year Month Temp
165212 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
165213 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
165214 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
165215 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
165216 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
168359 DARJEELING 27.050 88.270 India 1983 1 5.10
168360 DARJEELING 27.050 88.270 India 1986 1 6.90
168361 DARJEELING 27.050 88.270 India 1994 1 8.10
168362 DARJEELING 27.050 88.270 India 1995 1 5.60
168363 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

3. Write a Geographic Scatter Function for Yearly Temperature Increases


Goal: Write a function called temperature_coefficient_plot() which can make a scatter plot that indicates the average yearly change in temperature vary within a given country.

Panda Functions:

  • transform: Call func on self producing a DataFrame with the same axis shape as self.
  • apply: Apply a function along an axis of the DataFrame.

How do we find the yearly temperature increases?

  • fit a line through temperatures for each station/each month, find the slope of the line and that corresponds to “average change in temp per year”

If we want to know how many observations of each stations name, and add the corresponding number of observations into the data frame form we made:

df["count"] = df.groupby(["NAME"])["Country"].transform('count')
df
NAME LATITUDE LONGITUDE Country Year Month Temp count
165212 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48 34
165213 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57 34
165214 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19 34
165215 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51 34
165216 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81 34
... ... ... ... ... ... ... ... ...
168359 DARJEELING 27.050 88.270 India 1983 1 5.10 7
168360 DARJEELING 27.050 88.270 India 1986 1 6.90 7
168361 DARJEELING 27.050 88.270 India 1994 1 8.10 7
168362 DARJEELING 27.050 88.270 India 1995 1 5.60 7
168363 DARJEELING 27.050 88.270 India 1997 1 5.70 7

3152 rows × 8 columns

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs = 10, **kwargs):
    # use the query function we made to make a dataframe
    df = query_climate_database(country, year_begin, year_end, month)
    
    # use transform to add the corresponding count for each rows
    df["count"] = df.groupby(["NAME"])["Country"].transform('count')
    
    # subset the dataframe by remove the data that less than min_obs
    df = df[df['count'] >= min_obs]

    # define a function to get the linear regression for input dataframe, then return it's slope
    def coef(data_group):
        X = data_group[["Year"]]
        y = data_group["Temp"]
        LR = LinearRegression()
        LR.fit(X, y)
        slope = LR.coef_[0]
        return slope
    
    df = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef).round(4)
    df = pd.DataFrame(df)
    df = df.reset_index()
    df = df.rename(columns = {0: "Estimated_Yearly_Increase_(℃)"})
    
    fig = px.scatter_mapbox(data_frame = df, 
                            lat = "LATITUDE", 
                            lon = "LONGITUDE",
                            hover_name = "NAME", 
                            color = "Estimated_Yearly_Increase_(℃)",
                            **kwargs
                           )
    return fig
color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map
                                  )

fig.update_layout(title="Estimates of yearly increase in temperature in January for<br>stations in India, years 1980-2020")

fig.show()
color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("Japan", 1980, 2020, 5, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map
                                  )

fig.update_layout(title="Estimates of yearly increase in temperature in May for<br>stations in Japan, years 1980-2020")

fig.show()

4. Create One More Query Function and Two More Interesting Figures

4-1. Let’s make a scatter plot with query_climate_database():

For this plot, we want to select specific citits of a country and output scatter plot with year verse temperature. Therefore, we could see the temperatures’ change over years. With facet_col=city’s name, this plot function will make multiple subplots with respect cities.

  • We use query_climate_database() to make a dataframe with country (ex: China), year from 1980 to 2020, and month equal to Feburary as example.
  • Use set to output unique cities’ name.
  • Pick cities you want to observe, and group those cities name as a list.
query_climate_database("China", 1980, 2020, 2)
NAME LATITUDE LONGITUDE Country Year Month Temp
108398 AN_XI 40.50 96.0 China 1980 2 -5.27
108399 AN_XI 40.50 96.0 China 1981 2 -5.46
108400 AN_XI 40.50 96.0 China 1983 2 -5.10
108401 AN_XI 40.50 96.0 China 1984 2 -7.40
108402 AN_XI 40.50 96.0 China 1985 2 -5.90
... ... ... ... ... ... ... ...
123047 YUANLING 28.47 110.4 China 2009 2 9.99
123048 YUANLING 28.47 110.4 China 2010 2 7.56
123049 YUANLING 28.47 110.4 China 2011 2 7.74
123050 YUANLING 28.47 110.4 China 2012 2 4.85
123051 YUANLING 28.47 110.4 China 2013 2 7.25

14654 rows × 7 columns

list(set(df["NAME"]))[0:10] # just a example to use set and print first 10 cities from it.
['AGARTALA',
 'MADRAS_MINAMBAKKAM',
 'JAGDALPUR',
 'MINICOYOBSY',
 'SURAT',
 'LUDHIANA',
 'SHOLAPUR',
 'JAIPUR_SANGANER',
 'SATNA',
 'TEZPUR']


Here is the function to make a scatter plot:

def scatter_plot(country, year_begin, year_end, month, cities, **kwargs):
    
    df = query_climate_database(country, year_begin, year_end, month)
    
    df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
    
    # append the data that collected from input cities
    for city in cities:
        df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
    
    fig = px.scatter(data_frame=df_new, 
                     x= "Year", 
                     y="Temp",
                     color = "NAME", 
                     hover_name="NAME", 
                     hover_data=["LATITUDE", "LONGITUDE"],
                     facet_col="NAME", 
                     **kwargs
                    )
    
    return fig
cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO','AN_XI']

fig = scatter_plot("China", 1980, 2020, 2, cities_selected)

fig.update_layout(title="Scatter plots with year verse temperature in Feburary with selected cities of China, years 1980-2020")

fig.show()

4-2 Using Same Data Set To Make Another Query Function and Use It To Make a Figure

Query Function:

This Query Function will take arguments country, year_begin, year_end, and their definition is same as the arguments in query_climate_database function. In this function we will have a new column ‘Date’ that stores data time values.

def query_climate_database_new(country, year_begin, year_end):
    # open connection to database
    conn = sqlite3.connect("temps.db")
    
    # sql command
    cmd = "SELECT S.Name, C.Name AS Country, T.Year, T.Month, T.Temp \
    FROM temperatures T \
    LEFT JOIN stations S ON S.ID = T.ID \
    LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
    WHERE (Year BETWEEN {} AND {}) \
    ORDER BY T.Year, T.Month".format(year_begin, year_end)
    
    # excute the command
    df = pd.read_sql(cmd, conn)
    
    # subset the dataframe with the country taht user wanted
    df = df[df['Country'] == country]
    
    # make new column with date time values
    df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))

    # always close the database when we are finish use it
    conn.close()
    
    return df
# an example
df = query_climate_database_new(country = "United States", 
                                year_begin = 2000, 
                                year_end = 2020)
df
NAME Country Year Month Temp Date
7506 ALEXANDER_CITY United States 2000 1 7.79 2000-01-01
7507 ANDALUSIA_3_W United States 2000 1 10.32 2000-01-01
7508 ASHLAND_3_ENE United States 2000 1 6.89 2000-01-01
7509 ATHENS United States 2000 1 5.88 2000-01-01
7510 ATMORE United States 2000 1 11.24 2000-01-01
... ... ... ... ... ... ...
3416132 GRAND_MARAIS United States 2020 12 -4.29 2020-12-01
3416133 SISSETON_MUNI_AP United States 2020 12 -3.43 2020-12-01
3416134 BOSCOBEL_AP United States 2020 12 -2.62 2020-12-01
3416135 LINCOLN_8_ENE United States 2020 12 -1.32 2020-12-01
3416136 LINCOLN_11_SW United States 2020 12 -0.51 2020-12-01

1727789 rows × 6 columns

Line plot:

People may interesting to observe the temperature change through the date for selected cities in a country. So we could make a line plot to achieve this goal.

def line_plot(country, year_begin, year_end, cities, **kwargs):
    
    df = query_climate_database_new(country, year_begin, year_end)
    
    df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
    
    # append the data that collected from input cities
    for city in cities:
        df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
    
    fig = px.line(data_frame=df_new, 
                  x= "Date", 
                  y="Temp",
                  color = "NAME", 
                  hover_name="NAME", 
                  facet_col="NAME", 
                  **kwargs
                 )
    
    return fig
cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO']

fig = line_plot("China", 2000, 2020, cities_selected)

fig.update_layout(title="Line plots with date verse temperature with selected cities of China, years 2000-2020")

fig.show()
Written on April 14, 2022