Database Structure
How are the data organized
There are seven tables of data, each of which is relationally linked to the others. The tables, their columns, primary, and secondary keys are shown in the figure below. The redundancy in the table "stations" is to make it easier to query for stations from a particular river, system, or basin without having to do joins to all the tables. The redunancy in "front_of_books" is the same - it enables queries by station, station type, or system without excessive joins.
Details of database design. Table names are dark green, primary keys are light green, and secondary keys are blue.
List of tables and their items with descriptions
river_systems
This table gives the basic information about what river districts (such as the Yangtze, the Yunnan-Tibet International Rivers, or the Yellow River) are numbered in the database.
- system_id (PK) - this is the database number for the river system and the primary key for the table
- system_name - this is the name for the river system
- system_code - this is the Chinese book code for the river system
basins
This table defines all the basins for which data are available (for example, the Lancang Jiang (or Mekong))
- basin_id (PK) - this is the database number for the basin and the primary key for the table
- basin - this is the name of the basin
- system_id - this is the secondary key to the river_systems table
rivers
This is the list of rivers and which basins and systems they are in. There is some duplicating of river names because some small tributaries are given non-unique names such as "left branch" or "right branch".
- river_id (PK) - this is the database number for the river
- river_name - this is the name of the river
- system_id (SK) - this is the secondary key to the river_systems table
- basin_id (SK) - this is the secondary key to the basins table
types
This is the definitions of codes for hydrology and weather stations.
- station_type (PK) - this is a code for whether a station is a hydrology station or a weather station and is the primary key for this table
- type_description - this defines the two station_types
stations
This is the list of stations, what type they are, and what river, basin, and system they are in. The redunancy is to make it easy to search for all stations in a particular system, basin, or river, or of a certain type without too many joins.
- station_number (PK) - this is the database number for the station and the primary key for the table.
- station_name - this is the name of the station
- river_id (SK) - this is the secondary key to the rivers table
- basin_id (SK) - this is the secondary key to the basins table
- system_id (SK) - this is the secondary key to the river_systems table
- station_type (SK) - this is the secondary key to the types table
front_of_books
This is the information from the first few pages of each of the data books and gives the locations of all of the stations, their names, what rivers they are on, and what data are available for each. The records are unique for each station type, system, year, and book number. There is also a dummy primary key called record_number so that each row has a unique identifier.
- record_number (PK) - this is the dummy, unique, primary key for the table
- station_type - this is the secondary key to the types table
- system_id - this is the secondary key to the river_systems table
- station_number - this is the secondary key to the stations table
- year - this is the year the book records data for; if a book has multiple years, it is the first year in the range
- book_number - this is the unique number given to a station for that particular book; they are not consistent from book to book
- longitude - reported longitude of station (if given)
- latitude - reported latitude of station (if given)
- area - reported upstream area of station (if given)
- start_year - year the station started recording data
- start_month - month the station started recording data
- stage - this is a 0 for no data and a 1 for recorded data on river stage; it is null for weather stations
- flow - this is a 0 for no data and a 1 for recorded data on river discharge; it is null for weather stations
- sed - this is a 0 for no data and a 1 for recorded data on total suspended sediment; it is null for weather stations
- rain - this is a 0 for no data and a 1 for recorded data on precipitation; it is null for hydrology stations
- evap - this is a 0 for no data and a 1 for recorded data on evaporation; it is null for hydrology station
data_table
This is the bulk of the data and contains all the daily values which have been digitized. They are organized by station, year, month, and day and data is unique for each combination of these; there is also a dummy primary key for a unique single identifier for each record. The date information are not in date format so that it is easier to search for just a particular year or one month in all years.
- data_number (PK) - this is the dummy primary key for the table
- station_number (SK) - this is the secondary key to the stations table
- year - this is the year of the data record
- month - this is the month of the data record
- day - this is the day of the data record
- discharge - this is the discharge for the date at that station (null for weather stations or not recorded on that day)
- sed - this is the total suspended sediment for that date at that station (null for weather stations or if data are not recorded for that station on that day)
- rain - total precipitation for that date at that station (null for hydrology stations or if not recorded; 0 for no precipitation); this one is a bit confusing because sometimes rain was only recorded for half the year and entries are left blank for either not recorded or no precipitation - we used our best judgment to determine which was the case for each data point
- evap - total evaporation for that date at that station (null for hydrology stations)