Database Structure

How are the data organized

There are seven tables of data, each of which is relationally linked to the others. The tables, their columns, primary, and secondary keys are shown in the figure below. The redundancy in the table "stations" is to make it easier to query for stations from a particular river, system, or basin without having to do joins to all the tables. The redunancy in "front_of_books" is the same - it enables queries by station, station type, or system without excessive joins.

Details of database design. Table names are dark green, primary keys are light green, and secondary keys are blue.

List of tables and their items with descriptions

river_systems

This table gives the basic information about what river districts (such as the Yangtze, the Yunnan-Tibet International Rivers, or the Yellow River) are numbered in the database.

system_id (PK) - this is the database number for the river system and the primary key for the table
system_name - this is the name for the river system
system_code - this is the Chinese book code for the river system

basins

This table defines all the basins for which data are available (for example, the Lancang Jiang (or Mekong))

basin_id (PK) - this is the database number for the basin and the primary key for the table
basin - this is the name of the basin
system_id - this is the secondary key to the river_systems table

rivers

This is the list of rivers and which basins and systems they are in. There is some duplicating of river names because some small tributaries are given non-unique names such as "left branch" or "right branch".

river_id (PK) - this is the database number for the river
river_name - this is the name of the river
system_id (SK) - this is the secondary key to the river_systems table
basin_id (SK) - this is the secondary key to the basins table

types

This is the definitions of codes for hydrology and weather stations.

station_type (PK) - this is a code for whether a station is a hydrology station or a weather station and is the primary key for this table
type_description - this defines the two station_types

stations

This is the list of stations, what type they are, and what river, basin, and system they are in. The redunancy is to make it easy to search for all stations in a particular system, basin, or river, or of a certain type without too many joins.

station_number (PK) - this is the database number for the station and the primary key for the table.
station_name - this is the name of the station
river_id (SK) - this is the secondary key to the rivers table
basin_id (SK) - this is the secondary key to the basins table
system_id (SK) - this is the secondary key to the river_systems table
station_type (SK) - this is the secondary key to the types table

front_of_books

This is the information from the first few pages of each of the data books and gives the locations of all of the stations, their names, what rivers they are on, and what data are available for each. The records are unique for each station type, system, year, and book number. There is also a dummy primary key called record_number so that each row has a unique identifier.

record_number (PK) - this is the dummy, unique, primary key for the table
station_type - this is the secondary key to the types table
system_id - this is the secondary key to the river_systems table
station_number - this is the secondary key to the stations table
year - this is the year the book records data for; if a book has multiple years, it is the first year in the range
book_number - this is the unique number given to a station for that particular book; they are not consistent from book to book
longitude - reported longitude of station (if given)
latitude - reported latitude of station (if given)
area - reported upstream area of station (if given)
start_year - year the station started recording data
start_month - month the station started recording data
stage - this is a 0 for no data and a 1 for recorded data on river stage; it is null for weather stations
flow - this is a 0 for no data and a 1 for recorded data on river discharge; it is null for weather stations
sed - this is a 0 for no data and a 1 for recorded data on total suspended sediment; it is null for weather stations
rain - this is a 0 for no data and a 1 for recorded data on precipitation; it is null for hydrology stations
evap - this is a 0 for no data and a 1 for recorded data on evaporation; it is null for hydrology station

data_table

This is the bulk of the data and contains all the daily values which have been digitized. They are organized by station, year, month, and day and data is unique for each combination of these; there is also a dummy primary key for a unique single identifier for each record. The date information are not in date format so that it is easier to search for just a particular year or one month in all years.

data_number (PK) - this is the dummy primary key for the table
station_number (SK) - this is the secondary key to the stations table
year - this is the year of the data record
month - this is the month of the data record
day - this is the day of the data record
discharge - this is the discharge for the date at that station (null for weather stations or not recorded on that day)
sed - this is the total suspended sediment for that date at that station (null for weather stations or if data are not recorded for that station on that day)
rain - total precipitation for that date at that station (null for hydrology stations or if not recorded; 0 for no precipitation); this one is a bit confusing because sometimes rain was only recorded for half the year and entries are left blank for either not recorded or no precipitation - we used our best judgment to determine which was the case for each data point
evap - total evaporation for that date at that station (null for hydrology stations)

Chinese Hydrology Data Project