Basic usage

Background

This section describes how to use the tethysts package. The functions depend heavily on the xarray package. Nearly all outputs are either xarray Datasets or Python lists of dictionaries.

The datasets are organised in three main layers:
  • Dataset metadata

  • Stations

  • Results

The Stations and Results layers are also versioned; Dataset metadata is not currently versioned.

Dataset metadata

The first step is to figure out what datasets are available. Import the Tethys class:

from tethysts import Tethys

Public datasets

Initialising the Tethys class without any parameters will pull down all public remotes and parse the associated dataset metadata. The datasets object is a list of dictionaries containing extensive metadata about each dataset. It should tell you practically everything you need to know about the data contained in the results (e.g. parameter, units, data licence, owner, etc.). Use a normal Python list comprehension to select the dataset(s) of interest:

In [1]: ts = Tethys()

In [2]: datasets = ts.datasets

In [3]: my_dataset = [d for d in datasets if (d['parameter'] == 'precipitation') and
   ...:                                      (d['product_code'] == 'raw_data') and
   ...:                                      (d['frequency_interval'] == '24H') and
   ...:                                      (d['owner'] == 'Environment Canterbury')][0]
   ...: 

In [4]: my_dataset
Out[4]: 
{'feature': 'atmosphere',
 'parameter': 'precipitation',
 'method': 'sensor_recording',
 'product_code': 'raw_data',
 'owner': 'Environment Canterbury',
 'aggregation_statistic': 'cumulative',
 'frequency_interval': '24H',
 'utc_offset': '12H',
 'dataset_id': 'b5d84aa773de2a747079c127',
 'units': 'mm',
 'license': 'https://creativecommons.org/licenses/by/4.0/',
 'attribution': 'Data licenced by Environment Canterbury',
 'result_type': 'time_series',
 'extent': {'type': 'Polygon',
  'coordinates': [[[173.90134, -44.77033],
    [173.90134, -42.10688],
    [169.60018, -42.10688],
    [169.60018, -44.77033],
    [173.90134, -44.77033]]]},
 'time_range': {'from_date': '2022-05-23T12:00:00',
  'to_date': '2023-05-14T12:00:00'},
 'cf_standard_name': 'precipitation_amount',
 'precision': 0.1,
 'properties': {'encoding': {'precipitation': {'scale_factor': 0.1,
    'dtype': 'int16',
    '_FillValue': -99},
   'quality_code': {'scale_factor': 1, 'dtype': 'int16', '_FillValue': -9999}},
  'attrs': {'quality_code': {'standard_name': 'quality_flag',
    'long_name': 'NEMS quality code',
    'references': 'https://www.lawa.org.nz/media/16580/nems-quality-code-schema-2013-06-1-.pdf'}}},
 'system_version': 4,
 'heights': [0.0],
 'chunk_parameters': {'block_length': 0.0, 'time_interval': 14600}}
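
Since datasets is a plain list of dictionaries, ordinary Python is enough to explore it. For example, a minimal sketch for listing the distinct parameters on offer (adjust the key to whatever field you're interested in):

# collect the unique values of the 'parameter' field across all datasets
params = sorted({d['parameter'] for d in datasets})
print(params)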

Private datasets

Some datasets are not available through the public repository. Accessing private datasets stored in S3 buckets requires remote connection configuration. The remotes object is a list of dictionaries, each containing a bucket name, a connection_config or public_url, and a system version, as shown in the following example:

from tethysts import Tethys

remotes = [{'bucket': 'ecan-env-monitoring', 'public_url': 'https://b2.tethys-ts.xyz/file/', 'version': 4}]
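
For private buckets that require credentials, a connection_config dictionary can be used in place of public_url. A minimal sketch with placeholder credentials; the keys are assumed here to mirror the boto3/S3 client parameters (endpoint_url, aws_access_key_id, aws_secret_access_key):

remotes = [{'bucket': 'ecan-env-monitoring',
            'connection_config': {'service_name': 's3',
                                  'endpoint_url': 'https://b2.tethys-ts.xyz',
                                  'aws_access_key_id': 'YOUR_ACCESS_KEY_ID',    # placeholder
                                  'aws_secret_access_key': 'YOUR_SECRET_KEY'},  # placeholder
            'version': 4}]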

Initialise the class with the remotes to get the metadata about the available datasets:

In [5]: ts = Tethys(remotes)

In [6]: datasets = ts.datasets

In [7]: my_dataset = [d for d in datasets if (d['parameter'] == 'precipitation') and
   ...:                                      (d['product_code'] == 'raw_data') and
   ...:                                      (d['frequency_interval'] == '24H') and
   ...:                                      (d['owner'] == 'Environment Canterbury')][0]
   ...: 

In [8]: my_dataset
Out[8]: 
{'feature': 'atmosphere',
 'parameter': 'precipitation',
 'method': 'sensor_recording',
 'product_code': 'raw_data',
 'owner': 'Environment Canterbury',
 'aggregation_statistic': 'cumulative',
 'frequency_interval': '24H',
 'utc_offset': '12H',
 'dataset_id': 'b5d84aa773de2a747079c127',
 'units': 'mm',
 'license': 'https://creativecommons.org/licenses/by/4.0/',
 'attribution': 'Data licenced by Environment Canterbury',
 'result_type': 'time_series',
 'extent': {'type': 'Polygon',
  'coordinates': [[[173.90134, -44.77033],
    [173.90134, -42.10688],
    [169.60018, -42.10688],
    [169.60018, -44.77033],
    [173.90134, -44.77033]]]},
 'time_range': {'from_date': '2022-05-23T12:00:00',
  'to_date': '2023-05-14T12:00:00'},
 'cf_standard_name': 'precipitation_amount',
 'precision': 0.1,
 'properties': {'encoding': {'precipitation': {'scale_factor': 0.1,
    'dtype': 'int16',
    '_FillValue': -99},
   'quality_code': {'scale_factor': 1, 'dtype': 'int16', '_FillValue': -9999}},
  'attrs': {'quality_code': {'standard_name': 'quality_flag',
    'long_name': 'NEMS quality code',
    'references': 'https://www.lawa.org.nz/media/16580/nems-quality-code-schema-2013-06-1-.pdf'}}},
 'system_version': 4,
 'heights': [0.0],
 'chunk_parameters': {'block_length': 0.0, 'time_interval': 14600}}

In this example there is only one remote to check for datasets, but more dictionaries can be added to the remotes list to parse additional datasets, as sketched below.
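
A hypothetical second remote (the second bucket name and URL below are made up for illustration):

remotes = [
    {'bucket': 'ecan-env-monitoring', 'public_url': 'https://b2.tethys-ts.xyz/file/', 'version': 4},
    {'bucket': 'another-monitoring-bucket', 'public_url': 'https://example.com/file/', 'version': 4},  # hypothetical
    ]

ts = Tethys(remotes)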

Caching

New in version 4, the Tethys class can be initialized with a local cache path. Tethys will then download the results chunks locally and reuse them in future get_results calls. This is generally recommended.

Just pass a cache path when Tethys is initialized:

from tethysts import Tethys

ts = Tethys(remotes, cache='/my/local/cache/path')

Stations

Once you’ve decided which dataset you want (in this case, cumulative 24 hour precipitation), write down the dataset_id contained in the associated dictionary and pass it to the next method: get_stations.

In [9]: dataset_id = 'b5d84aa773de2a747079c127'

In [10]: stations = ts.get_stations(dataset_id)

In [11]: my_station = [s for s in stations if (s['name'] == "Waimakariri at Arthur's Pass")][0]

In [12]: my_station
Out[12]: 
{'station_id': '2e64175a7b3f0b62b8e4c685',
 'ref': '219510',
 'name': "Waimakariri at Arthur's Pass",
 'geometry': {'type': 'Point', 'coordinates': [171.56299, -42.9416]},
 'dataset_id': 'b5d84aa773de2a747079c127',
 'dimensions': {'time': 339, 'geometry': 1, 'height': 1},
 'time_range': {'from_date': '2022-05-23T12:00:00',
  'to_date': '2023-05-14T12:00:00'},
 'heights': [0.0],
 'modified_date': '2023-05-15T06:12:25'}

Again, the stations object is a list of dictionaries. Most of the data in each dictionary should be self-explanatory.

If you’ve got geographic coordinates as a GeoJSON point or a combination of latitude and longitude, they can be passed to the get_stations method to get the nearest single station.

In [13]: dataset_id = 'b5d84aa773de2a747079c127'

In [14]: geometry = {'type': 'Point', 'coordinates': [172.0, -42.8]}

In [15]: my_station = ts.get_stations(dataset_id, geometry=geometry)

In [16]: my_station[0]
Out[16]: 
{'station_id': 'f9c61373e7ca386c1fab06db',
 'ref': '219910',
 'name': 'Waimakariri at Bull Ck',
 'geometry': {'type': 'Point', 'coordinates': [171.97044, -42.89973]},
 'dataset_id': 'b5d84aa773de2a747079c127',
 'dimensions': {'time': 357, 'geometry': 1, 'height': 1},
 'time_range': {'from_date': '2022-05-23T12:00:00',
  'to_date': '2023-05-14T12:00:00'},
 'heights': [0.0],
 'modified_date': '2023-05-15T06:12:25'}
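
The same nearest-station query can be expressed with lat and lon instead of a GeoJSON point:

# equivalent to the GeoJSON point query above; also returns a list with the nearest station
my_station = ts.get_stations(dataset_id, lat=-42.8, lon=172.0)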

To get all the stations within a specified area, you can pass a GeoJSON polygon geometry or a combination of latitude, longitude, and distance (a radius in decimal degrees). A polygon example is sketched after the output below.

In [17]: dataset_id = 'b5d84aa773de2a747079c127'

In [18]: lon = 172.0

In [19]: lat = -42.8

In [20]: distance = 0.2

In [21]: my_stations = ts.get_stations(dataset_id, lat=lat, lon=lon, distance=distance)

In [22]: my_stations
Out[22]: 
[{'station_id': '96e9ff9437fc738b24d10b42',
  'ref': '320010',
  'name': 'Waimakariri at Kanuka Hills',
  'geometry': {'type': 'Point', 'coordinates': [172.05382, -42.9944]},
  'dataset_id': 'b5d84aa773de2a747079c127',
  'dimensions': {'time': 357, 'geometry': 1, 'height': 1},
  'time_range': {'from_date': '2022-05-23T12:00:00',
   'to_date': '2023-05-14T12:00:00'},
  'heights': [0.0],
  'modified_date': '2023-05-15T06:12:25'},
 {'station_id': 'f9c61373e7ca386c1fab06db',
  'ref': '219910',
  'name': 'Waimakariri at Bull Ck',
  'geometry': {'type': 'Point', 'coordinates': [171.97044, -42.89973]},
  'dataset_id': 'b5d84aa773de2a747079c127',
  'dimensions': {'time': 357, 'geometry': 1, 'height': 1},
  'time_range': {'from_date': '2022-05-23T12:00:00',
   'to_date': '2023-05-14T12:00:00'},
  'heights': [0.0],
  'modified_date': '2023-05-15T06:12:25'},
 {'station_id': '196bed26c6415e6310ce6c89',
  'ref': '218810',
  'name': 'Waimakariri at Ranger Stream',
  'geometry': {'type': 'Point', 'coordinates': [171.85399, -42.86847]},
  'dataset_id': 'b5d84aa773de2a747079c127',
  'dimensions': {'time': 357, 'geometry': 1, 'height': 1},
  'time_range': {'from_date': '2022-05-23T12:00:00',
   'to_date': '2023-05-14T12:00:00'},
  'heights': [0.0],
  'modified_date': '2023-05-15T06:12:25'},
 {'station_id': '2e856eee7639da88851db737',
  'ref': '217810',
  'name': 'Waimakariri at Mt Byrne',
  'geometry': {'type': 'Point', 'coordinates': [171.86383, -42.75267]},
  'dataset_id': 'b5d84aa773de2a747079c127',
  'dimensions': {'time': 256, 'geometry': 1, 'height': 1},
  'time_range': {'from_date': '2022-05-23T12:00:00',
   'to_date': '2023-02-02T12:00:00'},
  'heights': [0.0],
  'modified_date': '2023-02-02T21:12:42'}]
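
As mentioned above, a GeoJSON polygon can be passed instead of lat/lon/distance. A sketch using a hypothetical rectangle around the same area:

# hypothetical bounding box covering roughly the same area as the radius query above
polygon = {'type': 'Polygon',
           'coordinates': [[[171.8, -43.0],
                            [172.2, -43.0],
                            [172.2, -42.6],
                            [171.8, -42.6],
                            [171.8, -43.0]]]}

my_stations = ts.get_stations(dataset_id, geometry=polygon)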

Results

What you’ll need next is to pick a station and write down the station_id just like you did with the dataset_id.

To get the results (the 4D data), you’ll need a dataset_id and station_id. Internally, the results are broken up by dataset and station. The get_results method has many input options. Take a look at the reference page for a description of all the options.

In [23]: station_id = 'f9c61373e7ca386c1fab06db'

In [24]: results = ts.get_results(dataset_id, station_id)

In [25]: results
Out[25]: 
<xarray.Dataset>
Dimensions:        (geometry: 1, height: 1, time: 357)
Coordinates:
  * geometry       (geometry) object '01010000005ED72FD80D7F6540739D465A2A734...
  * height         (height) float32 0.0
  * time           (time) datetime64[ns] 2022-05-23T12:00:00 ... 2023-05-14T1...
Data variables:
    lat            (geometry) float64 ...
    lon            (geometry) float64 ...
    modified_date  (time, geometry, height) datetime64[ns] ...
    name           (geometry) object ...
    precipitation  (time, geometry, height) float32 ...
    ref            (geometry) object ...
    station_id     (geometry) object ...
Attributes:
    result_type:     time_series
    title:           cumulative precipitation in mm of the atmosphere by a se...
    institution:     Environment Canterbury
    license:         https://creativecommons.org/licenses/by/4.0/
    source:          sensor_recording
    system_version:  4
    version_date:    2022-03-31T00:00:00

Unlike the previously returned objects, the results object (in this case) is an xarray Dataset. This xarray Dataset contains both the results (precipitation) and all of the dataset metadata. If the results represent geospatially sparse data, then they are indexed by geometry, height, and time. If the results represent gridded data, then they are indexed by lat, lon, height, and time. The geometry dimension is a hexadecimal-encoded Well-Known Binary (WKB) representation of the geometry. WKB was chosen to be flexible about the geometry type (i.e. points, lines, or polygons) while ensuring that the geometry is stored accurately. It is a standard format defined by the Open Geospatial Consortium (OGC) and can be parsed by many programs including shapely, PostGIS, etc. Using WKB in a geometry dimension does not follow CF conventions, however. This was a trade-off between flexibility, simplicity, and following standards. I leaned towards flexibility and simplicity on this one.

In addition to the get_stations spatial queries, the get_results method has a built-in nearest neighbour query if you omit the station_id and pass either a geometry dict or a combination of latitude and longitude. This is especially useful for gridded results, where each station represents a large area rather than a single point.

In [26]: geometry = {'type': 'Point', 'coordinates': [172.0, -42.8]}

In [27]: results = ts.get_results(dataset_id, geometry=geometry, squeeze_dims=True)

In [28]: results
Out[28]: 
<xarray.Dataset>
Dimensions:        (time: 357)
Coordinates:
    geometry       <U42 '01010000005ED72FD80D7F6540739D465A2A7345C0'
    height         float32 0.0
  * time           (time) datetime64[ns] 2022-05-23T12:00:00 ... 2023-05-14T1...
Data variables:
    lat            float64 ...
    lon            float64 ...
    modified_date  (time) datetime64[ns] ...
    name           object ...
    precipitation  (time) float32 ...
    ref            object ...
    station_id     object ...
Attributes:
    result_type:     time_series
    title:           cumulative precipitation in mm of the atmosphere by a se...
    institution:     Environment Canterbury
    license:         https://creativecommons.org/licenses/by/4.0/
    source:          sensor_recording
    system_version:  4
    version_date:    2022-03-31T00:00:00
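
With the dimensions squeezed, the precipitation variable is a simple time series, so it can be plotted directly through xarray's matplotlib integration (a sketch assuming matplotlib is installed):

import matplotlib.pyplot as plt

# xarray delegates 1-D plotting to matplotlib and labels the axes from the metadata
results['precipitation'].plot()
plt.show()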

If you want to get more than one station per dataset, you can still use the get_results method. The output will concatenate the xarray Datasets together and return a single xarray Dataset. Since the get_results method downloads results in multiple threads, passing multiple station ids in a single call will be much faster than using a “for loop” over each station id.

In [29]: station_ids = [station_id, '96e9ff9437fc738b24d10b42']

In [30]: results = ts.get_results(dataset_id, station_ids)

In [31]: results
Out[31]: 
<xarray.Dataset>
Dimensions:        (geometry: 2, height: 1, time: 357)
Coordinates:
  * geometry       (geometry) object '01010000005ED72FD80D7F6540739D465A2A734...
  * height         (height) float32 0.0
  * time           (time) datetime64[ns] 2022-05-23T12:00:00 ... 2023-05-14T1...
Data variables:
    lat            (geometry) float64 ...
    lon            (geometry) float64 ...
    modified_date  (time, geometry, height) datetime64[ns] ...
    name           (geometry) object ...
    precipitation  (time, geometry, height) float32 ...
    ref            (geometry) object ...
    station_id     (geometry) object ...
Attributes:
    result_type:     time_series
    title:           cumulative precipitation in mm of the atmosphere by a se...
    institution:     Environment Canterbury
    license:         https://creativecommons.org/licenses/by/4.0/
    source:          sensor_recording
    system_version:  4
    version_date:    2022-03-31T00:00:00
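
If you prefer working with pandas, any of these xarray Datasets can be converted to a DataFrame through xarray's built-in to_dataframe method, e.g.:

# a DataFrame of precipitation values indexed by time, geometry, and height
df = results['precipitation'].to_dataframe()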

Saving to hdf5 files

Starting in version 4.5, Tethys can save results directly to hdf5 files that can be opened by xarray. You must specify an output_path and can optionally specify a compression for the hdf5 file (lzf is the default). There’s no concern about excessive data volume in this process: you can download results from one station or all stations in a dataset to a single file without much trouble. It’s recommended to save the file with the .h5 extension rather than the .nc extension to make it clear that it’s a normal hdf5 file rather than a fully netcdf4-compliant file. Future versions might be formatted to be fully netcdf4-compliant…if I can figure out all of the nuances…any help is appreciated!

Update: using hdf5tools>=0.1.12, I’ve managed to make the hdf5 file compatible with the Python netcdf4 package. This means that files created by the tethysts package should be compatible with any Python packages that read netcdf4 data (which of course includes xarray).

results = ts.get_results(dataset_id, station_ids, output_path='/my/local/path/results.h5', compression='lzf')

And if you’d like to reopen the hdf5 file with xarray later, then you can use the xr.open_dataset function as normal (even with advanced compression…somehow…).

import xarray as xr

results = xr.open_dataset('/my/local/path/results.h5')

Selective filters

In Tethys version 4, results are saved in multiple chunks covering specific time periods, heights, and stations. It is best to provide from_date, to_date, and heights filters to the get_results method so that less data needs to be downloaded and concatenated. If you don’t, you might unnecessarily download a lot of data, use a lot of RAM, and consume a lot of processing time.
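
A minimal sketch of such a filtered request (the dates and height below are arbitrary examples):

# only download chunks overlapping this time period and height
results = ts.get_results(dataset_id, station_id,
                         from_date='2022-10-01',
                         to_date='2022-12-31',
                         heights=[0.0])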

Dataset versions

If a version_date is not passed to the get_results or get_stations methods, then the latest dataset version will be returned. If you’d like to list all of the dataset versions and choose which one to pass to get_results or get_stations, use the get_versions method.

In [32]: versions = ts.get_versions(dataset_id)

In [33]: versions
Out[33]: 
[{'dataset_id': 'b5d84aa773de2a747079c127',
  'version_date': '2022-03-31T00:00:00',
  'modified_date': '2023-05-15T06:05:04'}]
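
To pin a request to a specific version, pass the version_date from one of these dictionaries back to get_results (or get_stations):

# request results for an explicit dataset version
version_date = versions[0]['version_date']
results = ts.get_results(dataset_id, station_id, version_date=version_date)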

Handling geometries

Depending on the data request, Tethys will return geometries either as GeoJSON or as Well-Known Binary (WKB) hexadecimal geometries. If you’re not familiar with how to handle these formats, I recommend using shapely to convert into and out of geometry formats and to provide a range of geospatial processing tools. Shapely is the geoprocessing library behind geopandas.

For example, if you’ve made a get_stations request that returned GeoJSON geometries, you could convert them to shapely objects and put them into a dictionary with station_ids as keys:

In [34]: from shapely.geometry import shape

In [35]: dataset_id = 'b5d84aa773de2a747079c127'

In [36]: stations = ts.get_stations(dataset_id)

In [37]: stns_geo = {s['station_id']: shape(s['geometry']) for s in stations}

In [38]: stns_geo['f9c61373e7ca386c1fab06db']
Out[38]: <POINT (171.97 -42.9)>
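
From there it’s a short step to a geopandas GeoDataFrame, if you want tabular geospatial tools (a sketch assuming pandas and geopandas are installed):

import pandas as pd
import geopandas as gpd

# replace the GeoJSON dicts with shapely geometries in a GeoDataFrame
df = pd.DataFrame(stations).drop(columns='geometry')
stns_gdf = gpd.GeoDataFrame(df, geometry=[shape(s['geometry']) for s in stations], crs='EPSG:4326')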

Or you could convert the WKB hex geometries of the results into a list of shapely objects:

In [39]: from shapely import wkb

In [40]: station_ids = [station_id, '96e9ff9437fc738b24d10b42']

In [41]: results = ts.get_results(dataset_id, station_ids)

In [42]: geo_list = [wkb.loads(g, hex=True) for g in results.geometry.values]

In [43]: geo_list
Out[43]: [<POINT (171.97 -42.9)>, <POINT (172.054 -42.994)>]