Make More Plots By Using Plotly
To make some animation plots!
1. Create a Database
As always, we should import modules that we need to create a database first.
- old friend panda
- sqlite3: SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.
import pandas as pd
import sqlite3
from plotly import express as px
import numpy as np
from sklearn.linear_model import LinearRegression
temps = pd.read_csv("temps_stacked.csv")
# rename columns
countries = pd.read_csv('countries.csv')
countries = countries.rename(columns = {"FIPS 10-4": "FIPS_10-4"})
countries = countries.rename(columns = {"ISO 3166": "ISO_3166"})
stations = pd.read_csv('station-metadata.csv')
The ‘countries’ table has column names that contain space. Keeping spaces in column names in the database is not a good idea. Later when we use those columns, we need to put them within “ “ to specific this is a column name.
So for convenience, we could change those column names by replacing ‘space’ with ‘_’
Steps
- start by creating a Connection object that represents the database. Here the data will be stored in the temps.db file
- use to_sql() function to write records stored in a DataFrame to a SQL database
- Once a Connection has been established, create a Cursor object and call its execute() method to perform SQL commands:
# open a connection to temps.db sp that you can 'talk' to using python
conn = sqlite3.connect("temps.db")
# if the temperatures already exits in the temps.db, if_exists="replace" could replace the cvs file.
temps.to_sql("temperatures", conn, if_exists="replace", index=False)
countries.to_sql("countries", conn, if_exists="replace", index=False)
stations.to_sql("stations", conn, if_exists="replace", index=False)
# always close your connection
conn.close()
Here is a simple example:
we want to check how many tables inside the database:
“SELECT name FROM sqlite_master WHERE type=’table’” is a sql command
use conn.cursor() to create a Cursor object
Cursor objects have execute() method to perform SQL commands
conn = sqlite3.connect("temps.db")
# query the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
# retrieve the data
print(cursor.fetchall())
conn.close()
[('temperatures',), ('countries',), ('stations',)]
From the output, we can see total 3 tables inside the database
Before, we write a function, lets check what are varibles in those tables:
temps.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | ACW00011604 | 1961 | 1 | -0.89 |
1 | ACW00011604 | 1961 | 2 | 2.36 |
2 | ACW00011604 | 1961 | 3 | 4.72 |
3 | ACW00011604 | 1961 | 4 | 7.73 |
4 | ACW00011604 | 1961 | 5 | 11.28 |
countries.head()
FIPS_10-4 | ISO_3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
2. Write a Query Function
The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:
- country, a string giving the name of a country (e.g. ‘United States’) for which data should be returned.
- year_begin a integer giving the earliest years for which should be returned.
- year_end a integer giving the latest years for which should be returned.
- month, an integer giving the month of the year for which should be returned.
The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:
- The station name.
- The latitude of the station.
- The longitude of the station.
- The name of the country in which the station is located.
- The year in which the reading was taken.
- The month in which the reading was taken.
- The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so you don’t have to do any aggregation at this stage.)
Idea: We want to open a sql connection and close it within the function call, write a sql command which takes input arguments. To achieve this task, we could use String.format(), or use panda modules.
def query_climate_database(country, year_begin, year_end, month):
# open connection to database
conn = sqlite3.connect("temps.db")
# sql command
cmd = "SELECT S.Name, S.LATITUDE, S.LONGITUDE, C.Name AS Country, T.Year, T.Month, T.Temp \
FROM temperatures T \
LEFT JOIN stations S ON S.ID = T.ID \
LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
WHERE (Year BETWEEN {} AND {}) AND (Month={})".format(year_begin, year_end, month)
# excute the command
df = pd.read_sql(cmd, conn)
# subset the dataframe with the country taht user wanted
df = df[df['Country'] == country]
# always close the database when we are finish use it
conn.close()
return df
df = query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
df
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
165212 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
165213 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
165214 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
165215 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
165216 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
... | ... | ... | ... | ... | ... | ... | ... |
168359 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
168360 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
168361 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
168362 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
168363 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
3. Write a Geographic Scatter Function for Yearly Temperature Increases
Goal: Write a function called temperature_coefficient_plot() which can make a scatter plot that indicates the average yearly change in temperature vary within a given country.
Panda Functions:
- transform: Call func on self producing a DataFrame with the same axis shape as self.
- apply: Apply a function along an axis of the DataFrame.
How do we find the yearly temperature increases?
- fit a line through temperatures for each station/each month, find the slope of the line and that corresponds to “average change in temp per year”
If we want to know how many observations of each stations name, and add the corresponding number of observations into the data frame form we made:
df["count"] = df.groupby(["NAME"])["Country"].transform('count')
df
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | count | |
---|---|---|---|---|---|---|---|---|
165212 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 | 34 |
165213 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 | 34 |
165214 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 | 34 |
165215 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 | 34 |
165216 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 | 34 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
168359 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 | 7 |
168360 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 | 7 |
168361 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 | 7 |
168362 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 | 7 |
168363 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 | 7 |
3152 rows × 8 columns
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs = 10, **kwargs):
# use the query function we made to make a dataframe
df = query_climate_database(country, year_begin, year_end, month)
# use transform to add the corresponding count for each rows
df["count"] = df.groupby(["NAME"])["Country"].transform('count')
# subset the dataframe by remove the data that less than min_obs
df = df[df['count'] >= min_obs]
# define a function to get the linear regression for input dataframe, then return it's slope
def coef(data_group):
X = data_group[["Year"]]
y = data_group["Temp"]
LR = LinearRegression()
LR.fit(X, y)
slope = LR.coef_[0]
return slope
df = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef).round(4)
df = pd.DataFrame(df)
df = df.reset_index()
df = df.rename(columns = {0: "Estimated_Yearly_Increase_(℃)"})
fig = px.scatter_mapbox(data_frame = df,
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated_Yearly_Increase_(℃)",
**kwargs
)
return fig
color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map
)
fig.update_layout(title="Estimates of yearly increase in temperature in January for<br>stations in India, years 1980-2020")
fig.show()
color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("Japan", 1980, 2020, 5,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map
)
fig.update_layout(title="Estimates of yearly increase in temperature in May for<br>stations in Japan, years 1980-2020")
fig.show()
4. Create One More Query Function and Two More Interesting Figures
4-1. Let’s make a scatter plot with query_climate_database():
For this plot, we want to select specific citits of a country and output scatter plot with year verse temperature. Therefore, we could see the temperatures’ change over years. With facet_col=city’s name, this plot function will make multiple subplots with respect cities.
- We use query_climate_database() to make a dataframe with country (ex: China), year from 1980 to 2020, and month equal to Feburary as example.
- Use set to output unique cities’ name.
- Pick cities you want to observe, and group those cities name as a list.
query_climate_database("China", 1980, 2020, 2)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
108398 | AN_XI | 40.50 | 96.0 | China | 1980 | 2 | -5.27 |
108399 | AN_XI | 40.50 | 96.0 | China | 1981 | 2 | -5.46 |
108400 | AN_XI | 40.50 | 96.0 | China | 1983 | 2 | -5.10 |
108401 | AN_XI | 40.50 | 96.0 | China | 1984 | 2 | -7.40 |
108402 | AN_XI | 40.50 | 96.0 | China | 1985 | 2 | -5.90 |
... | ... | ... | ... | ... | ... | ... | ... |
123047 | YUANLING | 28.47 | 110.4 | China | 2009 | 2 | 9.99 |
123048 | YUANLING | 28.47 | 110.4 | China | 2010 | 2 | 7.56 |
123049 | YUANLING | 28.47 | 110.4 | China | 2011 | 2 | 7.74 |
123050 | YUANLING | 28.47 | 110.4 | China | 2012 | 2 | 4.85 |
123051 | YUANLING | 28.47 | 110.4 | China | 2013 | 2 | 7.25 |
14654 rows × 7 columns
list(set(df["NAME"]))[0:10] # just a example to use set and print first 10 cities from it.
['AGARTALA',
'MADRAS_MINAMBAKKAM',
'JAGDALPUR',
'MINICOYOBSY',
'SURAT',
'LUDHIANA',
'SHOLAPUR',
'JAIPUR_SANGANER',
'SATNA',
'TEZPUR']
Here is the function to make a scatter plot:
def scatter_plot(country, year_begin, year_end, month, cities, **kwargs):
df = query_climate_database(country, year_begin, year_end, month)
df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
# append the data that collected from input cities
for city in cities:
df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
fig = px.scatter(data_frame=df_new,
x= "Year",
y="Temp",
color = "NAME",
hover_name="NAME",
hover_data=["LATITUDE", "LONGITUDE"],
facet_col="NAME",
**kwargs
)
return fig
cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO','AN_XI']
fig = scatter_plot("China", 1980, 2020, 2, cities_selected)
fig.update_layout(title="Scatter plots with year verse temperature in Feburary with selected cities of China, years 1980-2020")
fig.show()
4-2 Using Same Data Set To Make Another Query Function and Use It To Make a Figure
Query Function:
This Query Function will take arguments country, year_begin, year_end, and their definition is same as the arguments in query_climate_database function. In this function we will have a new column ‘Date’ that stores data time values.
def query_climate_database_new(country, year_begin, year_end):
# open connection to database
conn = sqlite3.connect("temps.db")
# sql command
cmd = "SELECT S.Name, C.Name AS Country, T.Year, T.Month, T.Temp \
FROM temperatures T \
LEFT JOIN stations S ON S.ID = T.ID \
LEFT JOIN countries C ON SUBSTRING(S.id, 1, 2)= C.'FIPS_10-4' \
WHERE (Year BETWEEN {} AND {}) \
ORDER BY T.Year, T.Month".format(year_begin, year_end)
# excute the command
df = pd.read_sql(cmd, conn)
# subset the dataframe with the country taht user wanted
df = df[df['Country'] == country]
# make new column with date time values
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))
# always close the database when we are finish use it
conn.close()
return df
# an example
df = query_climate_database_new(country = "United States",
year_begin = 2000,
year_end = 2020)
df
NAME | Country | Year | Month | Temp | Date | |
---|---|---|---|---|---|---|
7506 | ALEXANDER_CITY | United States | 2000 | 1 | 7.79 | 2000-01-01 |
7507 | ANDALUSIA_3_W | United States | 2000 | 1 | 10.32 | 2000-01-01 |
7508 | ASHLAND_3_ENE | United States | 2000 | 1 | 6.89 | 2000-01-01 |
7509 | ATHENS | United States | 2000 | 1 | 5.88 | 2000-01-01 |
7510 | ATMORE | United States | 2000 | 1 | 11.24 | 2000-01-01 |
... | ... | ... | ... | ... | ... | ... |
3416132 | GRAND_MARAIS | United States | 2020 | 12 | -4.29 | 2020-12-01 |
3416133 | SISSETON_MUNI_AP | United States | 2020 | 12 | -3.43 | 2020-12-01 |
3416134 | BOSCOBEL_AP | United States | 2020 | 12 | -2.62 | 2020-12-01 |
3416135 | LINCOLN_8_ENE | United States | 2020 | 12 | -1.32 | 2020-12-01 |
3416136 | LINCOLN_11_SW | United States | 2020 | 12 | -0.51 | 2020-12-01 |
1727789 rows × 6 columns
Line plot:
People may interesting to observe the temperature change through the date for selected cities in a country. So we could make a line plot to achieve this goal.
def line_plot(country, year_begin, year_end, cities, **kwargs):
df = query_climate_database_new(country, year_begin, year_end)
df_new = pd.DataFrame(columns = df.columns) # make a empty dataframe
# append the data that collected from input cities
for city in cities:
df_new = pd.concat([ df_new, df[df["NAME"]==city] ])
fig = px.line(data_frame=df_new,
x= "Date",
y="Temp",
color = "NAME",
hover_name="NAME",
facet_col="NAME",
**kwargs
)
return fig
cities_selected = ['HEFEI', 'GUANGZHOU', 'SHANGHAI_HONGQIAO']
fig = line_plot("China", 2000, 2020, cities_selected)
fig.update_layout(title="Line plots with date verse temperature with selected cities of China, years 2000-2020")
fig.show()