
Data Science

Research

Combining different disciplines, Leiden University researchers work together to formulate innovative solutions to societal problems. Below is an example from the field of fundamental sciences.


Smart searching in data

A global revolution is taking place in the field of data science, and Leiden University is a major academic hub for this discipline in the Netherlands.

Researchers across the entire scientific spectrum – including linguistics, the environment, medicine, astronomy and biology – are increasingly making use of data science. Big datasets are linked and smart algorithms are applied to detect unexpected patterns that cast new light on important questions. The results can range from new medical treatments and greener fuels to a better understanding of our history.

‘We’re currently witnessing a revolution in data science,’ says Joost Kok, Professor of Fundamental Computer Science. ‘This revolution was triggered by rapid developments in the area of high-performance computers and storage, combined with new algorithms such as ‘deep learning’ and the ubiquitous Big Data.’


Focal point for data science
Leiden University is important for data science in the Netherlands and has initiated several standards in the discipline. One reason for this, explains Kok, is that the University has always worked with data. ‘Leiden has a long tradition of producing and collecting large quantities of data in libraries, museums, laboratories, hospitals and the Observatory. Cohort studies and telescopes have been creating big datasets for a very long time.’

Collecting and classifying data is not the only long tradition of the University: for many decades before data science became a buzzword, research in statistics was a flourishing field of study here. ‘With data science, you’re trying to find the real signal among all the noise in a dataset. To do that, you need a solid basis in maths and statistics,’ says Aad van der Vaart, Professor of Stochastics and Spinoza Prize winner. Leiden University is therefore renowned for its mathematical, fundamental approach to data science, which combines statistics with computer science.


Multidisciplinary
Leiden University’s expertise in the area of data science clearly extends beyond the walls of the Faculty of Science. ‘Leiden is an excellent example of a multidisciplinary university,’ says Kok. ‘Our philosophy is that excellent scholars in different disciplines should work together. The leading researchers in astronomy collaborate with those in computer science, but both remain anchored in their own research field.’

This approach results in an incredible variety of research projects. For instance, Leiden researchers are participating in a project to provide digital access to handwritten 19th century expedition reports and make them searchable. They are also developing methods for predicting dementia from brain scans, and researching the ethical and legal implications of artificial intelligence. The origin of black holes is one of the topics being studied by astronomers, while the international standard for making data accessible to data scientists was also developed here at Leiden University.


Data science research programme
To give even greater impetus to the field of data science, a university-wide data science research programme was recently launched, involving an investment of four million euros over a period of four years. This will have the effect of further stimulating the exchange of knowledge between different research domains.

Leiden Institute of Advanced Computer Science (LIACS)
Mathematical Institute (MI)
Leiden Centre of Data Science (LCDS)

Leiden: Silicon Valley of FAIR data

When researchers make their data FAIR, computers can link large quantities of data and identify patterns, thus greatly accelerating the process of arriving at new insights. In Leiden, the birthplace of ‘FAIR data’, Professor Barend Mons explains the meaning of this term.


Imagine that a computer programme has access via the internet to all the results of all the medical research in the world. The programme would then be able to detect relationships that no physician in the world could otherwise detect, simply because the quantity of data involved is more than a human being can process. This could lead to new insights, better diagnoses and new drugs. From a technical point of view, this is already possible, not only for medical science but for all disciplines.

However, before this can happen, the data must first become FAIR: Findable, Accessible, Interoperable and Reusable. Only when all these criteria are met will we come closer to achieving the envisaged future scenario.


Accessibility and privacy
‘Academic publications based on publicly-funded Dutch research already have to meet Open Access criteria,’ says Barend Mons, Professor of BioSemantics at LUMC. ‘However, the fact that everyone can read the article doesn’t mean that the underlying research data are findable and accessible for a computer.’ This requires metadata structures: data stations that tell the computer programme what kind of data can be found where, such as the medical data on smokers.

This must not, of course, disrupt the balance between data linking and privacy. ‘The metadata stations therefore clearly indicate the level of accessibility: are smokers’ data, for example, accessible for everyone, or do you have to contact the person conducting the research?’

Finally, the data must be interoperable and reusable by the computer programme. A computer isn’t good at handling ambiguities, such as the initials PSA, for instance, which not only stand for Prostate Specific Antigen but also have more than 100 other meanings. Every possible term in the world would therefore need to be given a unique numerical code that is centrally recognised.
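
A minimal, purely illustrative sketch of what interoperability demands of data (the registry, context labels and codes below are hypothetical, not an existing standard): an ambiguous abbreviation such as PSA is resolved to a unique identifier before datasets are linked.

```python
# Hypothetical sketch: resolve an ambiguous abbreviation to a unique,
# centrally recognised code before linking datasets.
# Real systems use ontologies and identifier registries; this toy table is invented.
TERM_REGISTRY = {
    ("PSA", "oncology"): "CONCEPT:0001",   # Prostate Specific Antigen
    ("PSA", "aviation"): "CONCEPT:0002",   # one of the many other meanings of the same initials
}

def resolve(term: str, context: str) -> str:
    """Return the unique code for a term in a given context, or fail loudly."""
    try:
        return TERM_REGISTRY[(term, context)]
    except KeyError:
        raise ValueError(f"No unique code registered for {term!r} in context {context!r}")

print(resolve("PSA", "oncology"))  # -> CONCEPT:0001
```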


Reward for data sharing
‘That all sounds more complicated than it really is,’ says Mons. ‘The problem is 80 per cent cultural. At the moment, there aren’t enough incentives to share scientific data. Researchers are rewarded for publishing their paper and for their citation and journal impact factors.’ The data behind the paper, it seems to Mons, take second place. ‘An impact factor needs to be assigned to the data output of research: the researcher will be rewarded if the data are combined with another dataset.’
 


Mons is the chair of an EU advisory group in this area. He takes the view that it won’t be long before the outlined future scenario gradually becomes reality. ‘From 2017, researchers will only receive funding from the Horizon 2020 programme if they make their data FAIR. When other funding bodies follow, the researchers will have to as well. But ideally, they themselves will soon see the tremendous advantages of FAIR data and good data stewardship.’


Leiden: Silicon Valley of FAIR data
Making research data FAIR is on academic agendas all round the world. However, the concept originated in Leiden, says Mons with some pride. ‘About two and a half years ago, the principles of FAIR data were first formulated during a workshop in the Lorentz Center. Now experts in the area of linked data come here from all over the world to implement FAIR data. If the government invests enough, the Netherlands can become a very important FAIR data player, and Leiden can be a kind of Silicon Valley of FAIR data.’

From Big Bang to algorithm

Smart algorithms and powerful processors are just as essential for astronomy as big telescopes. Astronomers at Leiden University therefore constantly operate at the interface between astronomy and data science.


What happened immediately after the Big Bang? How do black holes and galaxies form? How do all the stars in our galaxy move in relation to one another? To answer questions of this kind, astronomers use not only enormous telescopes and other measuring equipment; they also use computers and smart algorithms to process the gigantic mountain of data produced by these instruments. This is why data science plays such a crucial role for Leiden astronomers.


Origin of black holes
‘Take LOFAR, for instance, a radio telescope that consists of a network of thousands of radio antennas in different European countries,’ says Huub Röttgering, Professor of Observational Cosmology and Director of the Leiden Observatory. ‘We use it to measure signals from space, from particles originating in the boundary area around black holes. Those signals give us a picture of the black holes on the edge of the universe, which were formed soon after the Big Bang and can teach us something about the origin of black holes in general.’


Six months of calculation
However, converting the received signals into colourful maps is a highly laborious process. ‘All the antennas together provide one terabyte of data every eight seconds,’ says Röttgering. Indicating a map of a small part of the universe, he continues: ‘This map is the outcome of eight hours of measuring, but then it took a supercomputer several months of calculating to process the data.’
This is partly due to the quantity of incoming data; for instance, how do you channel all those vast datasets to Leiden? It is also because the data has to be ‘cleaned’: all kinds of noise – including that caused by an aircraft, for instance – have to be removed. The data also need to be corrected for vibrations created by the atmosphere, and irregularities due to minuscule differences in the antennas’ receiving times have to be smoothed out by the computer later.

Map of the ‘Sausage field’, which took the supercomputer many months to calculate.


Developing solutions
Many other research projects like LOFAR are confronted with similar problems. The Leiden astronomers see developing solutions as a natural aspect of their discipline. ‘Reducing the calculation time by applying smart algorithms is one of the methods we use,’ says Röttgering. Another essential technique is ‘parallelisation’: breaking the calculation task into parts and using multiple processors simultaneously to perform the calculations.
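
A minimal sketch of the parallelisation idea, not the LOFAR pipeline itself: the ‘cleaning’ step below is a stand-in for noise removal, and the number of chunks and processes are arbitrary assumptions.

```python
# Illustrative sketch of parallelisation: break a large batch of measurements
# into chunks and process them on several CPU cores at the same time.
from multiprocessing import Pool

import numpy as np

def clean_chunk(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for a cleaning step: subtract the chunk's median to suppress an offset."""
    return chunk - np.median(chunk)

if __name__ == "__main__":
    data = np.random.normal(size=1_000_000)   # fake "measurements"
    chunks = np.array_split(data, 8)          # break the calculation task into parts
    with Pool(processes=8) as pool:           # use multiple processors simultaneously
        cleaned = pool.map(clean_chunk, chunks)
    result = np.concatenate(cleaned)
    print(result.shape)
```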


Popular astronomy graduates
Developing software and hardware solutions is therefore just as important for Leiden University’s astronomers as developing theories about the universe. It is precisely this broad orientation that makes its Astronomy degree programme highly attractive to students. ‘We now have over a hundred first-year students. This is many more than in the past, but they still all find jobs,’ says Röttgering. ‘Because our students learn a lot of maths and physics during their study, and also gain hands-on experience with computer science, they’re very popular even outside astronomy.’

Leiden Observatory
LOFAR

Applied statistics as a pillar of data science

Although data science is now growing fast in many places, scholars at Leiden University have been developing data science techniques for a very long time. Thanks to their broad-based expertise, Leiden statisticians are currently combining the achievements of classical statistics with the latest methods from statistical and machine learning.


‘It seems as if data science is something new, but in applied statistics we’ve actually been developing data science techniques for many years,’ says Jacqueline Meulman, Professor of Applied Statistics in the Mathematical Institute. ‘Here in Leiden, we’ve been visualising links and analysing large, complex datasets for at least 35 years.’


SPSS
For example, one of Meulman’s earlier research groups made its first contribution to the well-known statistical data analysis package SPSS, now part of IBM, back in the 1990s. This package is used worldwide by researchers, students and the private sector, and the Leiden statisticians still keep their ‘Categories’ module updated by incorporating the latest technical developments. The royalties paid by IBM are re-invested in research and teaching-related activities.


Complex data analysis
A problem encountered when analysing big datasets is the purity of the data. ‘We’re often trying to detect a signal in the midst of a lot of noise,’ says Meulman. As an example, she mentions a study in the area of metabolomics. This study is looking at identical twins, with the question: is the metabolic system of the twins more similar than can be explained by chance? ‘Analyses of blood and urine produce vast quantities of data, but these data are always complex,’ explains Meulman. ‘For example, one of the twins may have had breakfast that morning, and the other not.’ There are also many variables in such datasets that are completely irrelevant. Meulman and her colleagues use the latest techniques to filter out the noise variables and thus to detect similarities.
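
A minimal sketch of the general idea of discarding uninformative variables, not the actual techniques used in the twin study; the data are randomly generated and the variance threshold is an arbitrary assumption.

```python
# Toy example: drop variables that carry almost no information before
# comparing samples. Real noise filtering is far more sophisticated.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))   # fake table: 40 samples x 500 metabolite measurements
X[:, 100:] *= 1e-4               # pretend 400 of the variables are essentially constant

selector = VarianceThreshold(threshold=1e-3)  # keep only variables that actually vary
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)         # (40, 500) -> (40, 100)
```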


Statistical learning
Peter Grünwald, Professor of Statistical Learning, conducts research at the interfaces between statistics, machine learning and information theory. Briefly put, he develops methods for statistically sound analysis of data by computers. He demonstrates how important this is with an example. A couple of years ago Google was in the news: the company had predicted a flu epidemic by analysing the geographic locations where people were making a lot of searches for words such as fever, cough and so on. ‘It worked once or twice, but then not any more,’ says Grünwald. ‘If a programme detects a pattern, you have to demonstrate that it isn’t chance. For that, you need real statistics.’

Google Flu Trends


Reproducibility crisis
On the basis of statistical learning, the Leiden statisticians are also looking for ways to improve classical statistics with techniques devised in machine learning, an area within computer science. ‘I’m currently working on the reproducibility crisis: the fact that when research is repeated, it often doesn’t produce the same results,’ says Grünwald. ‘This may be because a researcher conducted extra experiments after his first findings weren’t significant enough to permit a firm conclusion. This creates a distorted picture: a purely chance result can suddenly appear significant. There are statistical methods to correct for this, but they’re very complicated. I’m now trying to improve those methods by applying ideas from machine learning and information theory.’
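
The effect Grünwald describes can be illustrated with a small simulation (this is not his correction method): even when there is no real effect, running an extra experiment after a non-significant result and testing the combined data pushes the rate of ‘significant’ chance findings above the nominal 5 per cent.

```python
# Simulation of "extra experiments after a non-significant result".
# There is no real effect in the data, yet the optional extension
# raises the proportion of apparently significant outcomes above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, alpha = 5000, 0.05
hits_fixed = hits_extended = 0

for _ in range(n_studies):
    sample = rng.normal(size=30)                      # null data: true mean is 0
    p = stats.ttest_1samp(sample, 0.0).pvalue
    if p < alpha:
        hits_fixed += 1
        hits_extended += 1
    else:
        # Not significant yet: collect extra data and test the combined sample.
        extra = rng.normal(size=30)
        if stats.ttest_1samp(np.concatenate([sample, extra]), 0.0).pvalue < alpha:
            hits_extended += 1

print("fixed design:      ", hits_fixed / n_studies)     # close to 0.05
print("with extra testing:", hits_extended / n_studies)  # clearly above 0.05
```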

Mathematical Institute (Statistical Science group)

Predicting dementia

In the future, it may be possible for physicians to identify dementia much earlier than they can today, because a computer algorithm will be able to predict from brain scans how our memory is going to develop.


When a physician diagnoses dementia at present, the disease is already quite advanced. An MRI scan of the brain shows how the brain tissue has diminished. However, research has shown that dementia also affects the brain in other ways. The structure of the brain changes. Neural pathways connect parts of the brain less efficiently, and it also seems that changes take place in the functional connections: the links in brain activity that normally exist between different areas of the brain.



‘Healthy people with a genetic predisposition to dementia have on average different functional connections in their brains than people without that genetic predisposition,’ says Serge Rombouts, researcher at Leiden University’s Institute of Psychology and Leiden University Medical Center (LUMC). ‘The connections between some brain areas are less strong, while other connections are stronger. We’re investigating whether those changes are correlated with developing the disease.’



Patterns in data
The research study by Rombouts and his colleagues follows different groups of healthy people, including those with a predisposition to dementia. The participants are given regular brain scans, and, as soon as the disease is diagnosed, the researchers can look back over the scans made in previous years. They are hoping to find changes that correlate with the disease, and also to investigate whether they could have predicted the disease from these changes.

As it is currently not known which changes are important, this is rather like looking for a needle in a haystack. For instance, at least a thousand brain areas are functionally linked with each other, resulting in half a million functional connections. The same applies for anatomical connections. The researchers therefore select which areas are probably important, on the basis of the knowledge they already have. They are also developing self-teaching algorithms: computer programmes that train themselves to recognise patterns in this vast quantity of data, so that they can pick out the relevant changes in the scans.
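
A minimal sketch of what such a pattern-recognition step could look like, not the researchers’ actual pipeline: a classifier is trained on connectivity-style features and evaluated with cross-validation; the subject numbers, features and labels below are all invented for illustration.

```python
# Toy example: learn to separate two groups from "connectivity" features.
# The data are random, so cross-validated accuracy stays near chance (~0.5).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_subjects, n_connections = 80, 500              # invented sizes
X = rng.normal(size=(n_subjects, n_connections)) # fake functional-connectivity strengths
y = rng.integers(0, 2, size=n_subjects)          # fake labels: 0 = no predisposition, 1 = predisposition

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)      # cross-validation guards against overly optimistic results
print("cross-validated accuracy:", scores.mean())
```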



Predictions in the future
However, Rombouts warns that much still needs to be done before computers can help us to predict how our memory will develop on the basis of brain scans. ‘The results of this study are at least true for the group of people that we’ve followed. But to validate the results, we need to study many more groups. Another problem is that each type of scanner has its own technology. The technologies are just as good as each other, but because of the differences, the results of the various types of scanners can’t be directly compared. The algorithm must be able to deal with this, and to give a reliable analysis for each scanner.’

The technology being developed by the Leiden researchers is not only relevant for dementia. ‘The method we’re developing for analysing brain scans can perhaps help with diagnosing other conditions,’ says Rombouts. ‘For instance, neurological disorders, such as Parkinson’s disease, or psychiatric disorders, such as depression. In the long run, it might even be possible to use this method to predict the treatment effects of various drugs.’

In addition to Leiden University and LUMC, other organisations involved in the research are Erasmus MC, VUmc Alzheimer Centre and the Centre for Human Drug Research.

Institute of Psychology (Methodology & Statistics unit)
Leiden Institute for Brain and Cognition (LIBC)
LUMC
Erasmus MC
VUmc Alzheimer Centre
Centre for Human Drug Research

Converting cultural heritage into usable data

How can we make the information in handwritten historical research reports accessible and searchable? Data scientists at Leiden University are collaborating with other universities on a method for improving access to cultural heritage.


Between 1820 and 1850, eighteen explorers from the Natural Sciences Commission for the Dutch East Indies travelled through the Indonesian archipelago. During their expeditions, they studied the exotic flora and fauna. Their reports, totalling around 17,000 richly illustrated pages, are held by the Naturalis Biodiversity Centre. The collection gives a magnificent picture of the biodiversity in that region in the first half of the 19th century.

The pages of the reports have now been scanned and are digitally accessible, but simply googling them by place name or animal species is not yet an option. The research project ‘Making Sense of Illustrated Handwritten Archives’ aims to change this. By converting the archives into searchable and analysable data, other researchers will soon be able to cast new light on a wide range of historical and biological questions. In addition to Leiden University, the other participants in this project are Naturalis Biodiversity Centre, the University of Twente, the University of Groningen and the publisher Brill.


Data patterns in a jumble of images
The main task of the researchers in Leiden, Twente and Groningen is to train the computer to distinguish between the different kinds of information in the historical documents. Human beings can see at a glance the difference between an illustration and a handwritten sentence. For an untrained computer, on the other hand, a photograph of a logbook page is just one big jumble of images.

The researchers in the project are using the Monk handwriting recognition programme, which was developed in Groningen, but this algorithm alone is not enough.

BioSemantics data scientist Katy Wolstencroft and her colleagues are working on an algorithm that can identify the different parts of a layout on a scanned page: what is the table of contents, where is the name of an animal species, and where is its description? Once this programme can recognise these semantics, it will be possible to obtain interlinked data from the report: an illustration of a bat can then be combined with, for example, its name, the location where it was found and the description of its external appearance.
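
A purely illustrative sketch of the kind of interlinked record such a pipeline could produce once layout and handwriting have been recognised; the field names and values below are hypothetical and not taken from the archive.

```python
# Hypothetical linked record: an illustration combined with the species name,
# the location where it was found and the transcribed description.
import json

record = {
    "page": 1234,                              # invented page identifier
    "illustration": "scan_1234_figure_2.png",  # cropped image region
    "species_name": "example bat species",     # placeholder, not from the reports
    "location": "Java",
    "description": "Transcription produced by the handwriting recognition step.",
}

print(json.dumps(record, indent=2))
```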

This wealth of data will enable biologists to research the different species of bats that lived on Java in the 19th century, and to compare them with the current bat species. This will give them an insight into their evolution, and perhaps result in the discovery of new species.


Heterogeneous data
Before that stage is reached, however, all kinds of problems need to be solved. ‘The data are of an extremely heterogeneous nature,’ explains Wolstencroft. ‘The reports contain words in different languages: German, Latin, Greek, Dutch, French and Malay. Place names change throughout history, and sometimes new authors added information to a report later.’ It’s not easy to develop a programme that understands such nuances, and leaves them intact.

The content of the expedition reports will ultimately be linked to the species archives of Naturalis, which will undoubtedly lead to new and valuable insights for historians and biologists. But that is not the only aim of the project. ‘We’re developing a generic method for processing historical documents,’ says Wolstencroft. ‘It can also be applied to other collections. In the end, it’s all about being able to share data.’


Ethical standards for data science

Computers are becoming so smart that in the future they will perhaps take over the role of judges. In the meantime, experts at Leiden University are examining the question of what standards must be met by responsible data science.


The time when judges will be replaced by computers is coming a step closer every day, with the rapid developments in the area of data science. ‘In the future, a computer will be able to extract elements from thousands of comparable cases, link them together, and make the best decision on the basis of those elements,’ says Jaap van den Herik, Professor of Computer Science and Law at Leiden University. ‘I think computers will already be giving simple judgements in about 15 years. And by 2080 computers will be better at making ethical decisions than judges.’


The key to that future is ‘deep learning’, where a computer programme itself detects patterns in large numbers of cases. Using this technology, a computer has already learnt how to defeat the human world champion at the game Go. However, much more progress still needs to be made before judges have to relinquish their gavels to computers, says Van den Herik.



More information needed
‘The number of possibilities in Go is extremely large, but finite. By contrast, the number of possibilities in legal judgements is infinite.’ Nor can there be any dispute about the outcome of a game, which is another point of contrast with legal judgements, where cultural differences, for instance, lead to different outcomes. ‘To learn how to make properly substantiated decisions, the computer must have access to a wide range of data relating to the context of earlier cases,’ says the professor. ‘Data that may not be stored, such as race, religion and sexual orientation, often play a part. The lack of that context makes it difficult for a computer programme to analyse cases correctly.’ Another problem is that sometimes very few accessible earlier cases exist. The more data available for the computer programme to learn from, the better its judgements will be.


Responsible data science
While computers are playing an ever larger role in society, it appears that public confidence in computers has fallen. This is partly due to concerns in the area of security and privacy. To reverse this trend, says Van den Herik, more research is needed on the conditions for responsible data science.

‘The data that you use for this must be accessible and processable by a computer. They must be securely encrypted and must guarantee people’s privacy. Here in Leiden we’re setting the standard for research data in this area with the FAIR principle.’ (see also ‘Leiden: Silicon Valley of FAIR data’)

In the professor’s view, students working with data science should learn at an early stage to incorporate ethical considerations in their methods. Together with his own research group, one of the questions he is investigating is how responsible data science relates to legal practice. ‘How do you measure all those conditions at the present time? And how will you do that in 10 years’ time? It’s certain that by then the issue won’t only be responsibility but also liability. With computers becoming smarter every day, it’s high time to start having serious discussions about this.’

The International Data Responsibility Group (IDRG)

Experts

  • Joost Kok
  • Aske Plaat
  • Barend Mons
  • Jaap van den Herik
  • Jacqueline Meulman
  • Katy Wolstencroft
  • Peter Grünwald
  • Serge Rombouts
  • Huub Röttgering
  • Stefan Manegold
  • Holger Hoos
  • Thomas Bäck
  • Hilde De Weerdt
  • Simon Portegies Zwart
  • Gerard van Westen
  • Wessel Kraaij
  • Suzan Verberne
  • Diego Garlaschelli
  • Matthijs van Leeuwen
  • Michael Lew
  • Frank den Hollander
  • Arjen Doelman
  • Jelle Goeman
  • Marta Fiocco
  • Elise Dusseldorp
  • Mark de Rooij
  • Tim van Erven
  • Aad van der Vaart

Joost Kok, Professor of Fundamental Computer Science

Topics: Data Science, information processing

+31 (0)71 527 7057

Aske Plaat, Professor of Data Science

Topics: Computer science

+31 (0)71 527 7065

Barend Mons, Professor of BioSemantics

Topics: Biosemantics, science policy, FAIR data, nanopublications

Jaap van den Herik, Professor of Law and Computer Science, Chair of the Board of Directors of LCDS

Topics: artificial intelligence, data science, big data, e-humanities, law and computer science, technology

+31 (0)71 527 7054

Jacqueline Meulman, Professor of Applied Statistics

Topics: Applied statistics, multi-dimensional data analysis, visualisation, prediction, classification

+31 (0)71 527 7135

Katy Wolstencroft, Assistant Professor

Topics: Data science, data and knowledge integration, semantics and ontologies, bioinformatics

+31 (0)71 527 8926

Peter Grünwald, Professor of Statistical Learning

Topics: statistical learning, machine learning, foundations of statistics, information theory

+31 (0)71 527 7047

Serge Rombouts, Professor of Methods of Cognitive Neuroimaging

Topics: dementia, neurosciences, brain scans, fMRI, brain connectivity

+31 (0) 71 526 3309

Huub Röttgering, Professor of Observational Cosmology

Topics: astronomy, galaxy formation, large-scale structure, LOFAR, optical/infrared and radio interferometers

+31 (0)71 527 5851

Stefan Manegold, Professor of Computer Science

+31 (0)71 527 2727

Holger Hoos, Professor of Machine Learning

Topics: Machine learning

+31 (0)71 527 2727

Thomas Bäck, Professor of Natural Computing

+31 (0)71 527 7108

Hilde De Weerdt, Professor of Chinese History

Topics: Area studies, China, Chinese empire, Chinese history, Comparative history, digital methods for humanities research, environmental history, historical sociology, information networks, urban history

+31 (0)71 527 6505

Simon Portegies Zwart, Professor of Numerical Stellar Dynamics

Topics: Computational gravitational dynamics, high-performance computing, the formation and evolution of planetary systems, stellar and binary evolution

+31 (0)71 527 8429

Gerard van Westen, Assistant Professor

Topics: bio-informatics, cheminformatics, chemogenomics, data mining, drug discovery, machine learning, proteochemometrics

+31 (0)71 527 3511

Wessel Kraaij, Professor of Applied Data Analytics

Topics: digital health, information retrieval, text mining, privacy respecting analysis

+31 (0)71 527 5778

Suzan Verberne, Assistant Professor

Topics: Text Mining, Information Retrieval

Diego Garlaschelli, Associate Professor

Topics: Complex Networks, Econophysics, Statistical Physics, Network Reconstruction, Financial Networks, Systemic Risk

+31 (0)71 527 5510

Matthijs van Leeuwen, Assistant Professor

+31 (0)71 527 7048

Michael Lew, Associate Professor

Topics: Deep learning, multimedia analysis & mining, computer vision

+31 (0)71 527 7034

Frank den Hollander, Professor of Probability

Topics: complex networks, disordered systems, critical phenomena, population genetics, polymer chains

+31 (0)71 527 7105

Arjen Doelman, Professor of Applied Analysis

+31 (0)71 527 7123

Jelle Goeman, Professor of Biostatistics

Topics: Biostatistics, high-dimensional data analysis, hypothesis testing, genomic data

+31 (0)71 526 8569

Marta Fiocco, Associate Professor

+31 (0)71 527 7119

Elise Dusseldorp, Associate Professor of Psychology

Topics: Modelling of interaction effects in prediction problems, machine learning techniques, meta-analysis

+31 (0)71 527 8046

Mark de Rooij, Professor of Psychology

Topics: Statistical analysis for measurement in Psychology

+31 (0)71 527 4102

Tim van Erven, Assistant Professor

+31 (0)71 527 7126

Aad van der Vaart, Professor of Stochastics

Topics: Statistics, probability, mathematics

+31 (0)71 527 7130

Education

Leiden University offers a broad range of programmes in the field of data science. Statistics and information science make up the core of data science and are the foundation on which these programmes are built. Leiden students from all disciplines can follow the minor (optional subjects spread over six months) in Data Science in their bachelor’s programmes. Maths and Information Science students can follow a complete five-year science track by opting for a Data Science specialisation in their master’s. A growing number of other master’s programmes, including Astronomy and Bio-Pharmaceutical Sciences, also offer the opportunity to specialise in Data Science.
