
“The Field of Data Science is Just Starting to Take Shape”

French scientist Stéphane Mallat, a specialist in deep learning algorithms, holds the new data sciences chair at the Collège de France. He has been lecturing there since January, with the aim of deepening the links between mathematics and applications.

Photo © Frédérique Plas

With over 90,000 citations in Google Scholar, Stéphane Mallat is the most cited French scientist in engineering and computer science. And for good reason! In 1987, he developed an algorithm based on the mathematical theory of “wavelets”, which led to multiple applications including the JPEG 2000 compression standard. In 2001, he used his mathematical expertise to create the start-up Let It Wave, specialising in electronic chips for televisions. Thanks to wavelets, these chips can convert standard-resolution images into high-definition images. In 2008, after selling his company, he returned to his first love: research in mathematics. He began looking into deep learning algorithms, whose performance in face recognition has impressed the scientific community. The Holy Grail? To discover the secret of these algorithms; in other words, to understand mathematically what exactly they do to process large volumes of data.

La Recherche In January 2018 you took up the new chair in data sciences at the Collège de France. In which direction would you like to see it go?

Stéphane Mallat The chair is called “data sciences” in the plural, as I believe it is important to remember that this is a field born of the combination of several scientific disciplines. They include statistics, computer science, artificial intelligence, signal processing and information theory, but also traditional sciences such as physics, biology, economics and the social sciences, which all need to model and analyse vast quantities of data. The goal of the chair is to provide a common vision and language that go beyond the specificities of each of these fields. What I am trying to develop in these lectures is the point of view of applied mathematics, without forgetting the experimental component of the data sciences. It is indeed thanks to empirical approaches, and to the remarkable intuition of computer engineers and scientists, that algorithms with spectacular abilities in face and voice recognition, automatic translation and the game of Go have seen the light of day. The development of applications and experimental research raises new issues, which are a considerable source of new ideas in mathematics. This link between mathematics and applications will serve as the backbone of my lessons at the Collège de France, and I hope gradually to eliminate the boundary between experimentation and theory, which feed each other.

Why now?

The field of data sciences is just starting to take shape. This is a recent phenomenon, as is the term “data sciences”. The discipline has existed for a long time, but under the umbrella of statistics. With the explosion in the quantity of digital data (big data) produced every day and the acceleration in the processing power of computers, statistics and computing came together. This led to the emergence of machine learning, the aim of which is to develop algorithms that can analyse and classify data, or in other words predict the answers to queries asked of large amounts of data. Why, then, does a discipline acquire a form of independence? I don’t think it is because it achieves full scientific autonomy from its parent disciplines, in this case statistics and computing. It is more of a societal and university phenomenon: there comes a time when strong demand from students, industry and society calls for a specific academic structure. The same happened fifty years ago with the emergence of computer science departments, which grew out of departments of mathematics and electrical engineering. This is why we are witnessing the appearance of data science centres all around the world. At the École Normale Supérieure, we will be setting up a multidisciplinary data sciences centre, which I will be coordinating, at the interface between computing, mathematics, physics, biology and also the cognitive and social sciences. The creation of a chair at the Collège de France is part of this dynamic.

What issues do the data sciences deal with?

There are two separate families: modelling and prediction. The aim of the first is to build a representative model of the data in order to generate new data, compress them or improve their reconstruction from partial or damaged data. In medical imaging, for example, we try to restore high-resolution images from as few measurements as possible, to limit patient exposure to radiation. For prediction, the goal is to ask questions about sets of data and predict an answer using the structure of these data. For example, we can recognise an object or an animal in an image from the values of the pixels composing it, predict the energy of a molecule from its conformation, or predict a cancer diagnosis using the results of medical examinations and genomic data.

There are already many applications. Have these issues not been resolved?

Not at all. We are only at the beginning. The results of learning algorithms, especially those in deep neural networks (*), are indeed spectacular, but we do not really understand, mathematically, the reasons for their success. Progress in the field would enable us to improve them and make them more reliable, in particular for critical applications such as medicine or self-driving cars.

What do we not understand?

To answer that question, I first need to explain the general principle of learning algorithms. Let’s imagine an algorithm which aims to predict the quantum energy (noted y) of a molecule according to its conformation, or geometry (noted x). This means finding a link between the conformation x and the energy y. Such an algorithm includes internal parameters that are calculated during learning. The algorithm is trained on a database comprising tens of thousands of example conformations x for which we know the energy y of the molecule. This learning “optimises” the algorithm’s internal parameters so that it makes as few errors as possible on the examples we give it. Following this “supervised” learning phase, the algorithm is able to predict the energy y of a molecule of unknown conformation x. Calculating the internal parameters has enabled it to generalise the link between conformation and energy. Mathematically, this means approximating the function (noted f) that connects any data x to the answer y = f(x). The approximated function (noted f̂) must be such that the predicted answers ŷ = f̂(x) are close to the exact answers y. If the average error on new data is low, the learning has generalised well. The ability of algorithms to generalise may appear magical, but it simply relies on a form of regularity of the function f linking the data x and the answer y. It is precisely the nature of this regularity that we understand so little when the data x are high-dimensional, which is the case in most of the problems we are interested in, such as image classification, medical diagnosis, etc.
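To make this concrete, here is a minimal Python sketch of the supervised setting described above. The one-dimensional function and the polynomial fit are illustrative assumptions standing in for a real molecular-energy dataset and a real learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The 'true' (unknown) link between the data x and the answer y."""
    return np.sin(3 * x) + 0.5 * x

# Training database: examples x_i for which the answer y_i = f(x_i) is known.
x_train = rng.uniform(-2, 2, size=200)
y_train = f(x_train)

# "Learning": optimise internal parameters (here, polynomial coefficients)
# so the model makes as few errors as possible on the training examples.
coeffs = np.polyfit(x_train, y_train, deg=9)
f_hat = np.poly1d(coeffs)  # the approximated function f-hat

# Generalisation: the average error on data never seen during training.
x_test = rng.uniform(-2, 2, size=1000)
avg_error = np.mean(np.abs(f_hat(x_test) - f(x_test)))
print(f"average error on unseen data: {avg_error:.3f}")
```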

What is high-dimensional data?

Take a black and white image, 1,000 pixels square, making 1 million pixels in total. The value of one pixel is between 0 (for black) and 1 (for white). An image is therefore comprised of 1 million variables, corresponding to the values of each single pixel. But the same image can also be seen as a point in a space with 1 million dimensions. This space is gigantic! It is as if you had a coordinate system with 1 million axes, in which each image corresponds to a point whose position is defined by 1 million coordinates.
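A rough sketch of this change of viewpoint, using random pixel values purely for illustration: the same million-pixel image can be handled either as a 2D array or as a single point with a million coordinates.

```python
import numpy as np

# A 1000 x 1000 greyscale image: one million pixel values, each in [0, 1].
image = np.random.default_rng(0).random((1000, 1000))

# The same image viewed as one point in a space with one million axes.
point = image.reshape(-1)
print(point.shape)                         # (1000000,)
print(point.min() >= 0, point.max() <= 1)  # every coordinate lies in [0, 1]
```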

Why is it difficult to define the regularity of high-dimensional data?

Let’s start with an easy case, seen in high school: you are asked to measure the temperature of a mixture over time. So you trace two axes, representing time on the horizontal axis (x) and temperature on the vertical axis (y), then you plot a point for each measurement. Here, each measurement is an example for which both x and y are known. How would you trace the temperature curve over time, in other words the function y = f(x), for all values of x? By hand, you will naturally draw a regular curve that passes through all the points corresponding to your measurements. You could have made it irregular, but your knowledge of physics drives you to draw a regular curve. Now, if you are asked to predict the temperature at any given time, the curve you drew gives you the answer. In other words, you have generalised the values of the examples thanks to the regularity of the function f.
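A minimal sketch of this high-school exercise, with made-up measurements; piecewise-linear interpolation stands in for the regular curve drawn by hand.

```python
import numpy as np

# Temperature measurements of a mixture at a few known times (made-up values).
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])        # minutes
temps = np.array([20.0, 24.5, 31.0, 36.5, 39.0, 40.0])  # degrees Celsius

# Drawing a regular curve through the points and reading off a new value
# amounts to interpolation: predicting y = f(x) at a time never measured.
t_new = 2.7
predicted = np.interp(t_new, times, temps)
print(f"predicted temperature at t = {t_new} min: {predicted:.1f} degrees C")
```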

Is that what a learning algorithm does?

Exactly! A learning algorithm calculates a regular approximation, which passes through almost all the known examples, by adjusting its internal parameters. This type of prediction is called regression. The margin of error depends on two things: the local regularity of the curve y = f(x) and the distance between the experimental points, which are the training examples. Thus, we obtain a good prediction for a new x if this x is close to a known example. The smaller the distance, the lower the prediction error. We therefore need examples that are sufficiently close to each other to make a good prediction for any x.
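The role of the distance to the nearest training example can be seen with the simplest possible regression, a nearest-neighbour prediction on synthetic data; the deliberate gap left in the training set is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """The true function to be learned."""
    return np.sin(3 * x)

# Training examples, deliberately leaving a gap between x = 1 and x = 2.
x_train = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
y_train = f(x_train)

def predict(x):
    """Nearest-neighbour regression: return the answer of the closest example."""
    return y_train[np.argmin(np.abs(x_train - x))]

# Prediction is good where a close example exists, and poor inside the gap.
for x in (0.5, 1.5):
    dist = np.min(np.abs(x_train - x))
    err = abs(predict(x) - f(x))
    print(f"x = {x}: distance to nearest example {dist:.2f}, error {err:.2f}")
```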

Does this also apply to data classification problems?

Yes. In this case, decision boundaries (**) need to be defined, which enable the data to be classified into one class or another: for example, classifying images into cat or dog categories. To ensure a low classification error, the boundaries must be regular and the distance between examples sufficiently small to trace these boundaries accurately. In high dimensions, the problem is that the data contain far too many variables, so much so that when you take any random x, it is rare to find a training example that is close. In other words, it is very unlikely that the variables of a training example will be almost identical to those of the data x. Nearby examples can only be guaranteed if the number of training examples increases exponentially with the number of variables. Very quickly, however, the number of necessary examples exceeds the number of atoms in the universe. This is not practically feasible. In data sciences, we call this the “curse of dimensionality”.
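Back-of-the-envelope arithmetic, under the simplifying assumption that about ten examples are needed per variable to have a close neighbour everywhere, shows how quickly the required number of examples explodes.

```python
# To have, for any new x, a training example within roughly 0.1 of it on every
# coordinate, you need about 10 examples per axis, hence about 10**d examples
# in dimension d. The observable universe contains roughly 10**80 atoms.
ATOMS_EXPONENT = 80

for d in (2, 10, 100, 1_000_000):  # 1_000_000 variables = one per pixel
    note = " (more than the atoms in the universe)" if d > ATOMS_EXPONENT else ""
    print(f"d = {d:>9}: about 10^{d} examples needed{note}")
```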

How can you overcome this curse?

To get around it, the regularity of the function f we want to approximate must be much stronger than the local regularity used in low dimensions. This regularity is based on the principle of parsimony, according to which the data possess a certain structure that enables components to be eliminated from the data x without affecting the result y. This means changing the variables of the data x and replacing them with new, less numerous variables, known as the “attributes” of x. Parsimony can be seen as the mathematical translation of Occam’s razor [named after the 14th-century philosopher William of Ockham], which tries to explain a phenomenon using a minimum number of hypotheses. Two properties of data make these parsimonious changes of variables possible: multiscale hierarchical organisation and symmetries. Both play a fundamental role in mathematics and in most complex learning problems.

What is the hierarchy property?

In the article “The Architecture of Complexity”, published in 1962, the US economist Herbert Simon observed that biological, physical and social systems all have a hierarchical and multiscale structure. In physics, matter can be studied at several levels of organisation, from elementary particles at small scales to galaxies at very large scales, via atoms and molecules. This organisation can be found in a great many systems, including biology, languages and human societies. In the data sciences, this hierarchical organisation is used by many learning algorithms. A first step in understanding the regularity of this hierarchical organisation was made in the 1980s with the theory of wavelets. This gave rise to the “wavelet transform”, an operation used to represent a piece of data, an image for example, with wavelet coefficients: attributes corresponding to the variations of the pixels at various scales. Concretely, in an area where the light intensity is constant, the variation is zero, and hence the wavelet coefficient is zero. At the edges of an object, however, where the light variations are large, these coefficients are large. A parsimonious change of variables is thus obtained, in the sense that most of the wavelet coefficients are zero.
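A minimal sketch of this parsimony, using one level of the Haar transform (the simplest wavelet) on a piecewise-constant signal containing a single edge; the signal and its length are made up for illustration.

```python
import numpy as np

# A piecewise-constant signal of 128 samples with a single "edge".
signal = np.concatenate([np.full(65, 0.2), np.full(63, 0.9)])

# One level of the Haar wavelet transform: pairwise averages give the coarse
# approximation, pairwise differences give the wavelet (detail) coefficients.
pairs = signal.reshape(-1, 2)
approximation = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
details = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)

# Parsimony: wherever the intensity is constant the coefficient is zero;
# only the coefficient straddling the edge is non-zero.
print("non-zero wavelet coefficients:",
      np.count_nonzero(details), "out of", details.size)
```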

Is this enough to avoid the curse of dimensionality?

No. Even if wavelets enable the number of variables to be reduced by a factor of 50, far too many remain! Starting from a million variables (typical for images), several tens of thousands are left. This is still high-dimensional. In 2008, I realised that convolutional neural networks, a class of algorithms imagined by the French scientist Yann LeCun [currently head of Facebook’s artificial intelligence laboratory], could solve the problem thanks to the second property: symmetry.

How do these algorithms work?

As their name implies, they are comprised of calculation units, called “artificial neurons”, connected to form a succession of layers. In the first layers, this architecture processes the data in a multiscale, hierarchical manner, similar to wavelet transform algorithms. As we have seen, this is a first source of parsimony. The deeper into the layers we go, the more invariant the network responses become. In the deepest layers, the network eliminates variations in the data x that do not affect the result y = f(x). These variations of x with no effect on y are symmetries of the function f. Eliminating them is a second source of parsimony. Neural networks therefore exploit these symmetries by calculating invariants, which effectively reduce the dimensionality of x without losing information about the answer y.
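A minimal PyTorch sketch of such an architecture, with arbitrary layer sizes chosen purely for illustration: convolutions extract local features at fine scales, pooling moves to coarser scales, and the final global average discards positions entirely.

```python
import torch
import torch.nn as nn

# Toy convolutional network: hierarchical, multiscale processing followed by
# layers whose responses become increasingly invariant to small displacements.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # local filters (fine scale)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # coarser scale, some invariance
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # features of features
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),                     # global pooling: positions discarded
    nn.Flatten(),
    nn.Linear(16, 2),                            # two classes, e.g. cat vs dog
)

x = torch.randn(1, 1, 64, 64)   # a batch containing one 64 x 64 greyscale image
print(net(x).shape)             # torch.Size([1, 2])
```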

Can you explain more about these symmetries?

Some of them we know. For example, we know that image classification is invariant to translation: the nature of an object represented in an image does not change when you move it vertically or horizontally. This symmetry is built into the architecture of convolutional neural networks, in which the neuron “weights” are themselves invariant to translation: the same weights are applied at every position in the image. Similarly, if an image x is slightly deformed, this often does not change the category y to which it belongs. All of these deformations define other symmetries. However, neural networks also seem able to learn to calculate invariants linked to far more complex transformations, which we still understand poorly. It is partly these groups of complex symmetries that enable them to overcome the curse of dimensionality.
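The translation symmetry can be checked numerically. The sketch below applies one convolutional filter with periodic boundaries (an assumption that makes the shift exact) and verifies that translating the image translates the feature map, while a global average over positions does not change at all.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((32, 32))
kernel = rng.random((3, 3))   # one convolutional filter, same weights everywhere

def feature_map(img):
    # Cross-correlation with periodic boundaries: the same weights are applied
    # at every position, so translating the input simply translates the output.
    return correlate2d(img, kernel, mode="same", boundary="wrap")

shifted = np.roll(image, shift=(5, -3), axis=(0, 1))  # translate the image

# Equivariance: the feature map of the shifted image is the shifted feature map.
print(np.allclose(feature_map(shifted),
                  np.roll(feature_map(image), (5, -3), (0, 1))))

# Invariance: averaging over all positions removes the dependence on the shift.
print(np.isclose(feature_map(image).mean(), feature_map(shifted).mean()))
```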

What do we know about these groups of symmetries?

In mathematics, groups of symmetries play a key role in describing the structure of a problem, whether in geometry, partial differential equations or number theory. They are also at the heart of physics, where they describe the nature of interactions between particles. They are thus a great asset. When you realise that deep neural networks are not just able to recognise dogs or cats, but can also calculate the quantum energy of molecules, translate texts, recognise music or predict human behaviour, you see that understanding these groups of symmetries is a challenge that goes far beyond the applications of learning. If we manage to specify them one day, we will better understand the geometry of high-dimensional data. This geometry underlies a great many scientific problems. In my view, understanding it is the Holy Grail of the data sciences.

(*) A deep neural network is a type of algorithm whose structure is inspired by the organisation of neurons in the cerebral cortex. It is highly efficient at some tasks, such as image classification, voice recognition and text translation.

(**) A decision boundary, for data in a 3-dimensional space, is a plane that separates the data into two categories. In a space of d dimensions, this boundary is a hyperplane of dimension d - 1.

ORGANISED COMPETITIONS

Stéphane Mallat’s lectures at the Collège de France are punctuated by a dozen data competitions organised by his team throughout 2018. Open to all, these are challenges to solve problems as varied as supervised learning for the energy economy, medical diagnosis and financial prediction, but also questionnaire analysis, image recognition of celebrities and predicting football scores. They will be organised on the website challengedata.ens.fr, which makes the data and the instructions freely available for each competition. The aim is to evaluate the effectiveness of each algorithmic approach, facilitate data exchange, and exchange ideas by bringing together a scientific community around the data sciences.

WAVELETS TO COMPRESS IMAGES

In 1986, the French mathematician Yves Meyer, who received the Abel Prize in 2017, discovered the first orthogonal wavelet transform. This mathematical operation represents raw signals (sound, images, etc.) with a relatively small number of wavelet coefficients, from fine to coarse scales. Soon afterwards, during his PhD in the United States, Stéphane Mallat showed that orthogonal wavelet transforms are fully characterised by an underlying hierarchical structure: multiresolution analysis. From this he deduced an algorithm to calculate wavelet coefficients rapidly, giving rise to numerous signal processing applications. In particular, it led to the JPEG 2000 image compression standard, which reduces the “weight” of an image by a factor of 50 without visibly altering it. More recently, it has led to compression and compressed acquisition applications that retain only about 10% of an image’s data, and that can also rebuild a signal (an image, for example) from very partial measurements.

> INTERVIEWEE

Stéphane Mallat

Scientist, specialist in deep learning algorithms

1962 ▪ Stéphane Mallat was born in Suresnes, in the Hauts-de-Seine department.
1984 ▪ He graduated from the École Polytechnique.
1988 ▪ He obtained a Ph.D. from the University of Pennsylvania, USA. He became a professor in the departments of mathematics and computer science at New York University.
1995 ▪ He became a professor in the department of applied mathematics at the École Polytechnique.
2001 ▪ He created and managed the image-processing start-up Let It Wave.
2012 ▪ He became a professor at the École Normale Supérieure, in the computer sciences department.
2017 ▪ He was appointed professor at the Collège de France, where he created the chair in data sciences.
