Needles in Haystacks: Quality Control for Large Datasets

Large-scale flooding has increasingly been in the news over the last few years. Anyone who has witnessed the aftermath of these floods will understand how devastating they can be, and how important it is to have adequate flood insurance.

With the majority of UK insurers using the JBA 5m UK Flood Map for risk analysis and pricing, it is very important that the data in our products are as accurate as possible. Since there are over 9 billion individual 5m by 5m pixels in the UK Flood Map, you might think that an error in the odd square here or there wouldn’t matter.

However, if your house happened to sit on that square, you might find that your ability to get flood insurance was affected.

With such a large quantity of data, how do you make sure it’s correct without taking too long to check everything?

For such large datasets, it’s impractical to have someone check everything manually – if it took just a couple of seconds to check each square, it would still take around 600 years to check the entire map.
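
The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration only: it assumes exactly 9 billion pixels, takes "a couple of seconds" to mean 2 seconds, and imagines one person checking non-stop with no breaks. The names used are illustrative and not part of any JBA tool.

```python
# Back-of-the-envelope estimate of checking every pixel by hand.
PIXELS = 9_000_000_000                      # "over 9 billion" 5m by 5m pixels
SECONDS_PER_PIXEL = 2                       # assumption: "a couple of seconds" = 2
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60    # assumption: checking non-stop, all year


def manual_check_years(pixels, seconds_per_pixel):
    """Return the number of years needed to check every pixel one at a time."""
    return pixels * seconds_per_pixel / SECONDS_PER_YEAR


years = manual_check_years(PIXELS, SECONDS_PER_PIXEL)
print(f"Roughly {years:.0f} years of continuous checking")
```

Under these assumptions the result comes out a little under 600 years, and it would be several times longer for someone working ordinary hours.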

Instead, we use computers to carry out automated checks. When quality criteria for the data can be defined quantitatively (for example, flood depths may not be less than zero), a computer can look for problems much faster than a human and will make fewer mistakes. To use the analogy of looking for a needle (an error) in a haystack (a large dataset), a computer doing automated checks acts as a large magnet that can remove all the needles very quickly, while a human would be a farmhand examining each individual strand of hay.
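
As an illustration of a quantitative criterion, the sketch below flags every cell in a small grid whose flood depth is below zero. It is a minimal example and not JBA's actual checking code: the function name, the grid layout (a list of rows) and the sample values are all assumptions made for the purpose of illustration.

```python
def find_negative_depths(depth_grid):
    """Return (row, column, depth) for every cell whose depth is below zero."""
    return [
        (row_index, col_index, depth)
        for row_index, row in enumerate(depth_grid)
        for col_index, depth in enumerate(row)
        if depth < 0
    ]


# Hypothetical 3 x 3 tile of flood depths in metres, with one invalid value.
tile = [
    [0.0, 0.2, 0.5],
    [0.1, -0.3, 0.4],
    [0.0, 0.0, 1.2],
]

errors = find_negative_depths(tile)
print(errors)  # [(1, 1, -0.3)]
```

A check like this applies the same rule to every cell without tiring, which is what lets it act as the magnet in the analogy. A map with billions of pixels would in practice be processed tile by tile rather than held in a single list.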

The continuing need for human validation

However, even though using computers reduces the time it takes to check data by a huge amount, there are some quality criteria that cannot easily be translated into a form that a computer can understand.

Examples include criteria that require judgement (A is not too different from B) and things that just look “odd”. Here, a human can identify quality issues that a computer can’t. To return to the previous analogy, a large magnet can efficiently remove needles from a haystack, but it can’t tell you if the whole haystack is blue. A human is therefore still needed to ensure the final quality of the data.

Looking to the future

One area in which future developments may further improve the speed and consistency of data quality control is machine learning.

This is an area of artificial intelligence which is being employed for a wide range of uses, from recommending what you should watch next on Netflix to the diagnosis of heart disease. At JBA, we are developing specialist machine learning tools to help construct and check our flood maps to ensure that our data meet the highest possible standards.


To find out more about our work, get in touch.
