Needles in Haystacks: Quality Control for Large Datasets

Large-scale flooding has increasingly been in the news over the last few years and anyone who has witnessed the aftermath of these floods will understand how devastating they can be, and how important it is to have adequate flood insurance.

With the majority of UK insurers using the JBA 5m UK Flood Map for risk analysis and pricing, having the most accurate data possible in our products is very important. Since there are over 9 billion individual 5m by 5m pixels in the UK Flood Map, you might think that an error in the odd square here or there wouldn’t matter.

However, if your house happened to sit on that square, you might find that your ability to get flood insurance was affected.

With such a large quantity of data, how do you make sure it’s correct without taking too long to check everything?

For such large datasets, it’s impractical to have someone check everything manually – if it took just a couple of seconds to check each square, it would still take around 600 years to check the entire map.
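The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch only: it assumes exactly 9 billion pixels (the article says "over 9 billion"), takes "a couple of seconds" to mean 2 seconds, and assumes one person checking around the clock with no breaks.

```python
# Rough reproduction of the manual-checking estimate.
# All three constants are assumptions made for illustration.
PIXELS = 9_000_000_000            # "over 9 billion", rounded down
SECONDS_PER_CHECK = 2             # "a couple of seconds", taken as 2
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

years = PIXELS * SECONDS_PER_CHECK / SECONDS_PER_YEAR
print(f"Roughly {years:.0f} years of non-stop checking")
```

Under these assumptions the result is a little under 600 years, in line with the figure quoted above. A checker working normal office hours would take several times longer.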

Instead, we use computers to carry out automated checks. When quality criteria for the data can be defined quantitatively (for example, flood depths may not be less than zero), a computer can look for problems much faster than a human and will make fewer mistakes. To use the analogy of looking for a needle (an error) in a haystack (a large dataset), a computer doing automated checks acts as a large magnet that can remove all the needles very quickly, while a human would be a farmhand examining each individual strand of hay.
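As an illustration of a quantitative check, the sketch below scans a small grid of flood depths and reports any cell with a depth below zero. It is not JBA's actual tooling: the function name `find_negative_depths`, the list-of-lists grid layout and the sample values are all hypothetical, chosen only to show the idea.

```python
def find_negative_depths(depth_grid):
    """Return the (row, column) position of every cell with a negative depth.

    depth_grid is assumed to be a list of rows, each a list of depths in metres.
    """
    bad_cells = []
    for row_index, row in enumerate(depth_grid):
        for col_index, depth in enumerate(row):
            if depth < 0:  # the rule from the text: depths may not be below zero
                bad_cells.append((row_index, col_index))
    return bad_cells


# Hypothetical 3 x 3 tile of flood depths in metres (made-up values).
tile = [
    [0.0, 0.2, 0.5],
    [0.1, -0.3, 0.4],
    [0.0, 0.0, 1.2],
]

print(find_negative_depths(tile))  # [(1, 1)]
```

A production check over billions of pixels would use array or raster libraries rather than nested loops, but the rule being tested is the same.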

The continuing need for human validation

However, even though computers greatly reduce the time it takes to check data, some quality criteria cannot easily be translated into a form that a computer can understand.

Examples include criteria that require judgement (A is not too different from B) and things that just look “odd”. Here a human can identify quality issues that a computer can’t. To return to the previous analogy, a large magnet can efficiently remove needles from a haystack, but it can’t tell you if the whole haystack is blue. A human is still needed to ensure the final quality of the data.

Looking to the future

One area in which future developments may further improve the speed and consistency of data quality control is machine learning.

This is an area of artificial intelligence which is being employed for a wide range of uses, from recommending what you should watch next on Netflix to the diagnosis of heart disease. At JBA, we are developing specialist machine learning tools to help construct and check our flood maps to ensure that our data meet the highest possible standards.


To find out more about our work, get in touch.
