__Marcelo Barreiro:__ "Climate networks and atmospheric connectivity"

Advancing our understanding of the complex dynamics of our climate requires the development and use of new approaches for climate data analysis. In this talk I will present how the application of the complex network approach in conjunction with nonlinear analysis has yielded new insights into atmospheric and oceanic phenomena. In particular, I will focus on the detection and variability of atmospheric connectivity during the 20th century and how it might change under anthropogenic forcing.

__Dorit Hammerling:__ "Compression and Conditional Emulation of Climate Model Output"

Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the generated data is becoming a bottleneck, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. The statistical model can be used to generate realizations representing the full dataset, along with characterizations of the uncertainties in the generated data. Thus, the methods are capable of both compression and conditional emulation of the climate models. Considerable attention is paid to accurately modeling the original dataset, particularly with regard to the inherent spatial nonstationarity in global temperature fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers.

__Alexis Hannart:__ "A few problems in climate research that data science may help tackle"

This conference attempts to bring together researchers in the environmental sciences and researchers in data science, in the hope that the interaction between the two areas will be fruitful. But why are we inclined to believe that such an interaction could actually be fruitful? Beyond the very high-level and vague «data deluge» argument, can we be more specific? Here we attempt to draft a list of precise scientific questions in the realm of climate sciences for which a data science approach seems relevant and may potentially make a big difference. Furthermore, we attempt to reframe these scientific questions in a way that makes them directly amenable to the usual tools and concepts of data science, in order to make them more accessible and attractive to data scientists. An illustration on the attribution of climate trends and events is discussed.

__Ibrahim Hoteit:__ "Gaussian-Mixture Filtering High Dimensional Systems with Small Ensembles"

The update step of the Gaussian-mixture filter consists of an ensemble of Kalman updates for each center of the mixture, generalizing the ensemble Kalman filter (EnKF) update to non-Gaussian distributions. Sampling the posterior distribution is required for forecasting with the numerical model. Because of computational limitations, only small samples can be considered when dealing with large-scale atmospheric and oceanic models. As such, a "targeted" sampling that captures some features of the posterior distribution might be a better strategy than straightforward random sampling. This was numerically demonstrated for the Gaussian-based ensemble filters, with the deterministic EnKFs outperforming the stochastic EnKF in many applications. In this talk, I will present two filtering algorithms based on this idea of "targeted" sampling: the first one introduces a deterministic sampling of the observation perturbations in the stochastic EnKF in order to exactly match the first two moments of the Kalman filter, and the second one uses a Gaussian-mixture update step that combines a clustering of the forecast ensemble with a resampling step matching the first two moments of the posterior distribution. Numerical results will be presented and discussed.
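
As a point of reference for the abstract above, the following is a minimal sketch of the baseline stochastic (perturbed-observation) EnKF analysis step, in which the observation perturbations are drawn at random. It is not the deterministic sampling scheme or the Gaussian-mixture update announced in the talk. The function name `stochastic_enkf_update`, the linear observation operator `H`, and the toy two-variable setup are assumptions made for illustration only.

```python
import numpy as np


def stochastic_enkf_update(ensemble, obs, H, R, rng):
    """Perturbed-observation EnKF analysis step (illustrative sketch).

    ensemble: (n_state, n_members) forecast members stored as columns
    obs:      (n_obs,) observation vector
    H:        (n_obs, n_state) linear observation operator (assumed)
    R:        (n_obs, n_obs) observation error covariance
    """
    n_members = ensemble.shape[1]
    anomalies = ensemble - ensemble.mean(axis=1, keepdims=True)
    P = anomalies @ anomalies.T / (n_members - 1)  # sample forecast covariance
    S = H @ P @ H.T + R                            # innovation covariance
    K = np.linalg.solve(S, H @ P).T                # Kalman gain P H^T S^{-1}
    # Random observation perturbations drawn from N(0, R), one per member.
    perturbations = rng.multivariate_normal(
        np.zeros(obs.size), R, size=n_members
    ).T
    innovations = obs[:, None] + perturbations - H @ ensemble
    return ensemble + K @ innovations


# Toy setup (hypothetical): two state variables, only the first is observed.
rng = np.random.default_rng(0)
prior = rng.standard_normal((2, 500))
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
obs = np.array([1.0])
analysis = stochastic_enkf_update(prior, obs, H, R, rng)
```

With a small ensemble, the random perturbations add sampling noise to the analysis mean and covariance, which is the motivation the abstract gives for replacing them with a sampling that matches the first two moments exactly.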

__Erwan Le Pennec:__ "A gentle introduction to Data Science"

In this talk, I will try to explain what Data Science is, to demystify the term Big Data, to present a few open challenges in Data Science, and to describe what a data scientist should be.

__Pierre-Yves Le Traon:__ "The Copernicus Marine Environment Monitoring Service"

More than ever, there is a need to continuously monitor the oceans. This is imperative for understanding and predicting the evolution of our weather and climate. It is also essential for better and more sustainable management of our oceans and seas. The Copernicus Marine Environment Monitoring Service (CMEMS) has been set up to answer these challenges. CMEMS provides a unique monitoring of the global ocean and European seas based on satellite and in situ observations and models. CMEMS monitors past (over the last 30 years) and current marine conditions and provides short-term forecasts. Mercator Ocean was tasked by the EU to implement the service. The organisation is based on a strong European partnership, with more than 60 marine operational and research centres in Europe involved in the service and its evolution. An overview of CMEMS, its drivers, organisation and initial achievements will be given. The essential role of in situ and satellite upstream observations will be discussed, as well as the CMEMS Service Evolution Strategy, associated R&D priorities, and future technical and scientific challenges. In particular, challenges related to big data will be addressed.

__Olivier Mestre:__ "Calibration of Numerical Weather Forecasts using Machine Learning Algorithms"

NWP models usually capture the main circulation patterns, but tend to be biased in accounting for local variations in surface meteorological parameters. Hence, statistical post-processing techniques are used to improve local weather predictions: MOS (Model Output Statistics) and EMOS (Ensemble Model Output Statistics). In this talk, we briefly recall the principle of MOS techniques and show examples of applications for parameters such as temperature, wind speed and cloud cover. We discuss the applicability of linear models versus classical machine learning algorithms: from trees to random forests, SVM, etc. Since the amount of data involved in post-processing high-resolution gridded fields is huge (> 1 TB), we investigate ways to reduce computation time. Like deterministic models, Ensemble Forecast Systems tend to be biased, but this bias also affects dispersion: very often, raw ensembles are underdispersive. We will show how techniques based on Quantile Regression Forests are able to efficiently correct probabilistic forecasts in a nonparametric way.
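
To make the MOS principle recalled above concrete, here is a minimal sketch of a linear MOS correction on synthetic data: past observations are regressed on past raw forecasts, and the fitted line is then applied to new forecasts. The synthetic "truth", the form of the forecast bias, and the train/test split are all assumptions chosen for illustration; they are not taken from the talk, and the sketch does not cover the machine learning or Quantile Regression Forest methods it discusses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "observed" temperatures and a biased raw forecast (hypothetical).
truth = 15.0 + 5.0 * rng.standard_normal(1000)
raw_forecast = 1.2 * truth - 2.0 + rng.standard_normal(1000)

# Fit the linear correction on a training period, evaluate on the rest.
train, test = slice(0, 700), slice(700, 1000)
slope, intercept = np.polyfit(raw_forecast[train], truth[train], deg=1)
corrected = slope * raw_forecast[test] + intercept


def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


rmse_raw = rmse(raw_forecast[test], truth[test])
rmse_mos = rmse(corrected, truth[test])
print(f"raw RMSE: {rmse_raw:.2f}, MOS-corrected RMSE: {rmse_mos:.2f}")
```

In practice one such regression would be fitted per station or grid point and per lead time, which is where the data volumes mentioned in the abstract come from.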

__Takemasa Miyoshi:__ "Big Data Assimilation for 30-second-update 100-m-mesh Numerical Weather Prediction"

As computer and sensor technologies advance, numerical weather prediction will face the challenge of integrating Big Simulations and observation Big Data. I will present my perspective on the next 10-20 years of data assimilation with the future-generation sensors and post-peta-scale supercomputers, based on our own experience with the 10-petaflops “K computer”. New sensors produce orders of magnitude more data than the current sensors, and faster computers enable orders of magnitude more precise simulations, or “Big Simulations”. Data assimilation integrates the “Big Data” from both new sensors and Big Simulations. We started a “Big Data Assimilation” project, aiming at a revolutionary weather forecasting system to refresh 30-minute forecasts at 100-m resolution every 30 seconds, 120 times more rapidly than hourly-updated systems. We also investigated ensemble data assimilation using 10240 ensemble members, the largest ever for the global atmosphere. Based on the experience using the K computer, we will discuss the future of data assimilation in the forthcoming Big Data and Big Simulation era.

__Douglas Nychka:__ "Large and non-stationary spatial fields: Quantifying uncertainty in the pattern scaling of climate models"

Pattern scaling has proved to be a useful way to extend and interpret Earth system (i.e. climate) model simulations. In the simplest case the response of local temperatures is assumed to be a linear function of the global temperature. This relationship makes it possible to consider many different scenarios of warming by using simpler climate models and combining them with the scaling pattern deduced from a more complex model. This work explores a methodology using spatial statistics to quantify how the pattern varies across an ensemble of model runs. The key is to represent the pattern uncertainty as a Gaussian process with a spatially varying covariance function. We found that, when applied to the NCAR/DOE CESM1 large ensemble experiment, we are able to reproduce the heterogeneous variation of the pattern among ensemble members. These data also present an opportunity to fit a large, fixed-rank Kriging model (LatticeKrig) to give a global representation of the covariance function on the sphere. The climate model output at 1 degree resolution has more than 50,000 spatial locations and so requires special numerical approaches to fit the covariance function and simulate fields. Many of the local statistical computations are embarrassingly parallel, and the analysis can be accelerated by parallel tools within the R statistical environment.
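
The simplest case described above, in which local temperature is a linear function of global temperature, can be sketched as a per-grid-cell least-squares fit. The example below uses synthetic data with five hypothetical grid cells; the function name `fit_pattern` and all numerical values are assumptions for illustration. It does not attempt the Gaussian-process or LatticeKrig modelling of pattern uncertainty that is the subject of the talk.

```python
import numpy as np


def fit_pattern(local_temps, global_temps):
    """Least-squares slope and intercept for every grid cell.

    local_temps:  (n_times, n_cells) local temperature series
    global_temps: (n_times,) global mean temperature series
    """
    g = global_temps - global_temps.mean()
    anomalies = local_temps - local_temps.mean(axis=0)
    slope = g @ anomalies / (g @ g)  # scaling pattern, one value per cell
    intercept = local_temps.mean(axis=0) - slope * global_temps.mean()
    return slope, intercept


# Synthetic example (hypothetical): five cells with known scaling factors.
rng = np.random.default_rng(0)
n_times, n_cells = 200, 5
global_temps = np.linspace(0.0, 2.0, n_times)
true_pattern = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
noise = 0.05 * rng.standard_normal((n_times, n_cells))
local_temps = np.outer(global_temps, true_pattern) + noise

pattern, offset = fit_pattern(local_temps, global_temps)
```

Repeating this fit for each member of a model ensemble gives one estimated pattern per member, and the spread among those patterns is the quantity whose spatial structure the abstract proposes to model.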

__Thierry Penduff:__ "Probabilistic analysis of the OCCIPUT global ocean simulation ensemble"

Ocean dynamics are described by nonlinear partial differential equations, in which the time-dependent atmospheric forcing (winds, heat/freshwater fluxes) is prescribed as boundary conditions. Ocean General Circulation Models (OGCMs) are used to solve these equations, in order to study the evolution of the global ocean over weeks to centuries in a realistic context (in terms of physics, initial/boundary conditions, domain geometry, etc.).

Unlike the low-resolution OGCMs used in recent climate projections (IPCC), high-resolution OGCMs are nonlinear enough to spontaneously generate intrinsic ocean variability, i.e. variability that arises even under constant forcing. This strong phenomenon has a chaotic behavior (i.e. sensitivity to initial perturbations) and impacts many climate-relevant quantities over a broad range of spatio-temporal scales (up to the scale of oceanic basins and multiple decades). Whether and how this atmospherically modulated, low-frequency oceanic chaos may, in turn, impact the atmosphere and climate is an unsettled issue; it is, however, crucial for the next IPCC projections, which will use high-resolution OGCMs coupled to the atmosphere.

Before addressing this coupled issue, oceanographers need to disentangle the forced and intrinsic parts of the oceanic variability, and identify the structure and scales of both components and their possible interplay. In the framework of the OCCIPUT ANR/PRACE project, we have performed a 50-member ensemble of global ocean/sea-ice 3D simulations, driven by the same 1958-2015 atmospheric forcing after initial state perturbations. The structure and temporal evolution of the resulting ensemble PDFs hence yield a new (probabilistic) description of the global ocean/sea-ice multi-scale variability over the last five decades, raising new questions regarding the detection and attribution of climatic signals, and providing new insights about the complex oceanic dynamical system.

We will first describe our objectives, our ensemble simulation strategy, the classical approaches we first used to analyze these data, and our present results. We will then present the non-Gaussian metrics (based, e.g., on information theory) that we are developing to more thoroughly characterize the features, scales and imprints of the oceanic chaos and of its atmospheric modulation. As a perspective, it is likely that more specific signal processing techniques (supervised/unsupervised classification, analysis, pattern recognition) could provide more relevant information from this large (~100 TB), novel 5-dimensional (3D space, time, ensemble) dataset, and strengthen the emergence of probabilistic oceanography for climate science.

__Eniko Szekely:__ "Data-driven kernel methods for dynamical systems with application to atmosphere ocean science"

Datasets generated by dynamical systems are often high-dimensional, but they only display a small number of patterns of interest. The underlying low-dimensional structure governing such systems is generally modeled as a manifold, and its intrinsic geometry is well described by local measures that vary smoothly on the manifold, such as kernels, rather than by global measures, such as covariances. In this talk, a kernel-based nonlinear dimension reduction method, namely nonlinear Laplacian spectral analysis (NLSA), is used to extract a reduced set of basis functions that describe the large-scale behavior of the dynamical system. These basis functions are the leading Laplace-Beltrami eigenfunctions of a discrete Laplacian operator. They can be further employed as predictors to quantify the regime predictability of a signal of interest using clustering and information-theoretic measures. In this talk, NLSA will be employed to extract physically meaningful spatiotemporal patterns from organized tropical convection covering a wide range of timescales, from interannual to annual, semiannual, intraseasonal and diurnal scales.

__Christopher Wikle:__ "Recent Advances in Quantifying Uncertainty in Nonlinear Spatio-Temporal Statistical Models"

Spatio-temporal data are ubiquitous in the environmental sciences, and their study is important for understanding and predicting a wide variety of processes of interest to meteorologists and climate scientists. One of the primary difficulties in modeling spatial processes that change with time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional datasets and prediction domains. Much of the methodological development in recent years has considered either efficient moment-based approaches or spatio-temporal dynamical models. To date, most of the focus on statistical methods for dynamic spatio-temporal processes has been on linear models or highly parameterized nonlinear models (e.g., quadratic nonlinear models). Even in these relatively simple models, there are significant challenges in specifying parameterizations that are simultaneously useful scientifically and efficient computationally. Approaches for nonlinear spatio-temporal data from outside statistics (e.g., analog methods, neural networks, agent-based models) offer intriguing alternatives. Yet, these methods often do not have formal mechanisms to quantify various sources of uncertainty in observations, model specification, and parameter estimation. This talk presents some recent attempts to place these models, many of which were motivated in the atmospheric and oceanic sciences, into a more rigorous uncertainty quantification framework.

This is joint work with Patrick McDermott, PhD student, University of Missouri.