General information

This one-day conference focuses on boosting and optimisation in statistical learning. The audience consists mostly of members of the research teams PS (Besançon) and SPOC (Dijon), together with second-year students of the Master MS. Six talks are scheduled, with a morning session on boosting and an afternoon session on optimisation.

Organisation of the day

Instructions for speakers

  • Each talk lasts 35 minutes, with an additional 5 minutes for questions (40-minute slots).
  • Please load your slides on the conference computer prior to the start of the session.

Acknowledgements

The organizers acknowledge the financial support of the Région Bourgogne-Franche-Comté and the EUR EIPHI through the RaySynMath project.

   

Program

The conference starts at 9:30 (welcome and coffee) and ends at 16:00.

Morning session: Boosting

  • 10:00 - Alain Célisse (Université Panthéon Sorbonne)
    Title: Early stopping rule for spectral filter estimators in RKHS

  • 10:40 - Paul Liautaud (Sorbonne Université)
    Title: Adaptive Boosting for Online Nonparametric Regression
     
  • 11:20 - Jean-Jil Duchamps and Clément Dombry (Université de Franche-Comté)
    Title: Infinitesimal Gradient Boosting

Afternoon session: Optimisation

  • 14:00 - Antoine Godichon-Baggioni (Sorbonne Université)
    Title: Stochastic Newton algorithms with O(Nd) operations
     
  • 14:40 - Laura Hucker (Humboldt University Berlin)
    Title: Early stopping for conjugate gradients in statistical inverse problems

  • 15:20 - Ferdinand Genans-Boiteux (Sorbonne Université)
    Title: Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization

 

Abstracts

10:00 - Alain Célisse (Université Panthéon Sorbonne)
Early stopping rule for spectral filter estimators in RKHS
We consider a broad family of estimators called spectral filters, which includes gradient descent (GD). After discussing why the number of GD iterations should be chosen optimally, we introduce a new data-driven rule based on the Discrepancy Principle (DP) that outputs an estimator of the optimal number of iterations. The statistical performance of this rule is then explained and illustrated in simulation experiments. Optimality results are also established in terms of oracle inequalities.
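As a rough illustration of this kind of rule (not the speaker's actual procedure), the discrepancy principle stops gradient descent the first time the residual norm falls below the noise level. A minimal Python sketch for kernel gradient descent, assuming the noise level sigma is known; the constant kappa and the function name are placeholders.

    import numpy as np

    def kernel_gd_discrepancy(K, y, sigma, kappa=1.0, max_iter=1000):
        # Kernel gradient descent for least squares (predictions K @ alpha),
        # stopped by a discrepancy-principle rule: halt as soon as the data are
        # fitted up to the noise level.
        n = len(y)
        step = n / np.linalg.eigvalsh(K)[-1]    # safe step size from the top eigenvalue of K
        alpha = np.zeros(n)
        threshold = kappa * sigma * np.sqrt(n)  # DP threshold ~ expected norm of the noise
        for t in range(1, max_iter + 1):
            residual = y - K @ alpha
            if np.linalg.norm(residual) <= threshold:
                return alpha, t - 1             # data-driven number of iterations
            alpha = alpha + (step / n) * residual  # gradient step on the empirical risk in the RKHS
        return alpha, max_iter

The returned index plays the role of the data-driven estimator of the optimal number of iterations discussed above.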

10:40 - Paul Liautaud (Sorbonne Université)
Adaptive Boosting for Online Nonparametric Regression
In this work, we study gradient boosting for online nonparametric regression with arbitrary deterministic sequences and general convex loss functions. The procedure involves sequentially training generic weak learners using the gradient step of a given strong learner. Utilizing chaining-trees, we apply parameter-free online algorithms to specific weak learners, achieving an optimal regret of $\mathcal{O}(T^{(d-1)/d})$, where $d > 2$, after $T$ iterations, across Hölder function classes with smoothness parameter $|\alpha|\leq 1$. Furthermore, by incorporating chaining-trees as node experts within a core tree, our proposed algorithm, \texttt{LocAdaBoost}, competes effectively with any pruned tree predictions. It demonstrates improved regret performance over oracle pruning, which has knowledge of the Hölder profile of the competitor function. As a result, we present the first computationally efficient algorithm with locally adaptive optimal rates for online regression in an adversarial setting.
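To fix ideas on the generic scheme of training weak learners on the gradient of a strong learner's loss, here is a schematic Python sketch of one round of online gradient boosting. The weak-learner interface (predict/update) and the fixed step size are assumptions of this sketch; it is not the \texttt{LocAdaBoost} algorithm itself.

    class OnlineGradientBoosting:
        # Schematic online gradient boosting: each weak learner is fed the
        # gradient of the loss at the current partial strong prediction and
        # performs its own online update.
        def __init__(self, weak_learners, step=0.1):
            self.weak_learners = weak_learners  # any online regressors with predict/update
            self.step = step                    # boosting learning rate (illustrative)

        def predict(self, x):
            return self.step * sum(h.predict(x) for h in self.weak_learners)

        def round(self, x, y, loss_grad):
            # One round t: build the strong prediction stagewise and pass the
            # loss gradient at each partial prediction to the next weak learner.
            partial = 0.0
            for h in self.weak_learners:
                partial += self.step * h.predict(x)
                h.update(x, loss_grad(partial, y))  # loss_grad = d loss / d prediction
            return partial                          # strong prediction for this round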

11:20 - Jean-Jil Duchamps and Clément Dombry (Université de Franche-Comté)
Infinitesimal Gradient Boosting
We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is, when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees, bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in an infinite-dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled. Furthermore, some large sample properties of infinitesimal gradient boosting are discussed.
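The vanishing-learning-rate regime can be explored numerically: shrink the learning rate while rescaling the number of trees so that their product (the "boosting time") stays fixed, and the fitted models stabilise. The toy illustration below uses scikit-learn's standard gradient boosting as a stand-in for the randomized trees of the talk; the data and parameters are arbitrary.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 2))
    y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500)

    horizon = 10.0  # boosting time = learning_rate * number_of_trees, held fixed
    for lr in [0.5, 0.1, 0.02]:
        n_trees = int(round(horizon / lr))   # rescale the number of trees as lr -> 0
        model = GradientBoostingRegressor(
            learning_rate=lr, n_estimators=n_trees, max_depth=2, random_state=0
        ).fit(X, y)
        mse = np.mean((model.predict(X) - y) ** 2)
        print(f"learning rate {lr:5.2f}, trees {n_trees:4d}, train MSE {mse:.4f}")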

14:00 - Antoine Godichon-Baggioni (Sorbonne Université)
Stochastic Newton algorithms with O(Nd) operations
The majority of machine learning methods can be regarded as the minimization of an unavailable risk function. To optimize this function using samples provided in an online fashion, stochastic gradient descent is a common tool. However, it can be highly sensitive to ill-conditioned problems. To address this issue, we focus on Stochastic Newton methods. We first examine a version based on the Riccati (or Sherman-Morrison) formula, which allows recursive estimation of the inverse Hessian with reduced computational time. Specifically, we show that this method leads to asymptotically efficient estimates and requires $O(Nd^2)$ operations (where $N$ is the sample size and $d$ is the dimension). Finally, we explore how to adapt the Stochastic Newton algorithm to a streaming context, where data arrives in blocks, and demonstrate that this approach can reduce the computational requirement to $O(Nd)$ operations.
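The Riccati (Sherman-Morrison) formula is what keeps the recursive inverse-Hessian estimate cheap: when the Hessian estimate is updated by a rank-one term, its inverse can be updated in O(d^2) operations without any matrix inversion. A minimal sketch, in which the regularisation lambda and the rank-one form of the update are illustrative assumptions:

    import numpy as np

    def sherman_morrison_update(S_inv, u):
        # Given S^{-1}, return (S + u u^T)^{-1} in O(d^2) operations
        # (Sherman-Morrison / Riccati formula, S symmetric).
        Su = S_inv @ u
        return S_inv - np.outer(Su, Su) / (1.0 + u @ Su)

    # Maintaining the inverse of S_n = lambda * I + sum_{k<=n} u_k u_k^T costs
    # O(d^2) per new sample, hence O(N d^2) over N samples, instead of O(N d^3)
    # with naive re-inversion at each step.
    d, lam = 5, 1.0
    S_inv = np.eye(d) / lam
    for u in np.random.default_rng(0).standard_normal((100, d)):
        S_inv = sherman_morrison_update(S_inv, u)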
 
14:40 - Laura Hucker (Humboldt University Berlin)
Early stopping for conjugate gradients in statistical inverse problems
We consider estimators obtained by applying the conjugate gradient algorithm to the normal equation of a prototypical statistical inverse problem. For such iterative procedures, it is necessary to choose a suitable iteration index to avoid under- and overfitting. Unfortunately, classical model selection criteria can be prohibitively expensive in high dimensions. In contrast, it has been shown for several methods that sequential early stopping can achieve statistical and computational efficiency by halting at a fully data-driven index depending on previous iterates only. Residual-based stopping rules, similar to the discrepancy principle for deterministic problems, are well understood for linear regularisation methods. However, in the case of conjugate gradients, the estimator depends nonlinearly on the observations, allowing for greater flexibility. This significantly complicates the error analysis. We establish adaptation results for both the prediction and the reconstruction error in this setting.
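For concreteness, a residual-based stopping rule for conjugate gradients on the normal equation can be sketched as follows. This is a schematic CGLS implementation with a discrepancy-type threshold, not the rule analysed in the talk; the constant tau and the known noise level sigma are assumptions.

    import numpy as np

    def cgls_early_stopping(A, b, sigma, tau=1.0, max_iter=200):
        # Conjugate gradients on the normal equation A^T A x = A^T b (CGLS),
        # halted as soon as the residual ||b - A x|| drops below a noise-level threshold.
        n, d = A.shape
        x = np.zeros(d)
        r = b.astype(float)          # residual b - A x
        s = A.T @ r
        p = s.copy()
        gamma = s @ s
        threshold = tau * sigma * np.sqrt(n)
        for k in range(1, max_iter + 1):
            if np.linalg.norm(r) <= threshold:
                return x, k - 1      # fully data-driven stopping index
            q = A @ p
            alpha = gamma / (q @ q)
            x += alpha * p
            r -= alpha * q
            s = A.T @ r
            gamma_new = s @ s
            p = s + (gamma_new / gamma) * p
            gamma = gamma_new
        return x, max_iter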
 
15:20 - Ferdinand Genans-Boiteux (Sorbonne Université)
Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization
Optimal Transport (OT) based distances are powerful tools for machine learning to compare probability measures and manipulate them using OT maps. In this field, a setting of interest is semi-discrete OT, where the source measure $\mu$ is continuous, while the target $\nu$ is discrete. Recent works have shown that the minimax rate for the OT map is $\mathcal{O}(t^{-1/2})$ when using $t$ i.i.d. subsamples from each measure (two-sample setting). An open question is whether a better convergence rate can be achieved when the full information of the discrete measure $\nu$ is known (one-sample setting). In this work, we answer this question positively by (i) proving an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cell estimation and density support estimation, and (ii) proposing a Stochastic Gradient Descent (SGD) algorithm with adaptive entropic regularization and averaging acceleration. To nearly achieve the desired fast rate, characteristic of non-regular parametric problems, we design an entropic regularization scheme that decreases with the number of samples. Another key ingredient of our algorithm is a projection step that allows us to leverage the local strong convexity of the regularized OT problem. Our convergence analysis integrates online convex optimization and stochastic gradient techniques, complemented by the specificities of the OT semi-dual. Moreover, while being as computationally and memory efficient as vanilla SGD, our algorithm achieves in numerical experiments the unusually fast rates predicted by our theory.
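To illustrate the overall structure (without the adaptive schedules, the projection step and the rate guarantees of the talk), averaged stochastic gradient ascent on the entropic semi-dual of a semi-discrete OT problem can be sketched as follows; the decreasing step-size and regularisation schedules are placeholder choices.

    import numpy as np

    def semidual_sgd(sample_mu, Y, nu, cost, t_max, eps0=1.0, lr0=1.0, seed=0):
        # Averaged SGD on the entropic semi-dual of semi-discrete OT: the source
        # measure mu is accessed through samples, while the discrete target nu
        # (probability weights on the atoms Y) is fully known.
        rng = np.random.default_rng(seed)
        g = np.zeros(len(nu))        # dual potential on the support of nu
        g_avg = np.zeros(len(nu))    # Polyak-Ruppert average
        for t in range(1, t_max + 1):
            eps = eps0 / np.sqrt(t)  # entropic regularization decreasing with t (illustrative)
            lr = lr0 / np.sqrt(t)
            x = sample_mu(rng)       # one fresh sample from the continuous source
            c = np.array([cost(x, yj) for yj in Y])
            logits = (g - c) / eps + np.log(nu)
            w = np.exp(logits - logits.max())
            w /= w.sum()             # soft assignment of x to the atoms of nu
            g += lr * (nu - w)       # stochastic gradient of the semi-dual objective
            g_avg += (g - g_avg) / t
        return g_avg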