Needles in Haystacks: Quality Control for Large Datasets

Large-scale flooding has been in the news more and more over the last few years, and anyone who has witnessed the aftermath of these floods will understand how devastating they can be and how important it is to have adequate flood insurance.

With the majority of UK insurers using the JBA 5m UK Flood Map for risk analysis and pricing, having the most accurate data possible in our products is very important. Since there are over 9 billion individual 5m by 5m pixels in the UK Flood Map, you might think that an error in the odd square here or there wouldn’t matter.

However, if your house happened to sit on that square, you might find that your ability to get flood insurance was affected.

With such a large quantity of data, how do you make sure it’s correct without taking too long to check everything?

For such large datasets, it’s impractical to have someone check everything manually – if it took just a couple of seconds to check each square, it would still take around 600 years to check the entire map.
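The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes exactly 9 billion pixels, reads "a couple of seconds" as 2 seconds, and imagines one person checking non-stop with no breaks. The function name is made up for illustration.

```python
# Rough estimate of how long a manual check of every pixel would take.
PIXELS = 9_000_000_000                     # "over 9 billion" pixels in the map
SECONDS_PER_CHECK = 2                      # assumption: "a couple of seconds" = 2
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # average year length in seconds


def years_to_check(pixels: int, seconds_each: float) -> float:
    """Years of non-stop checking by a single person (hypothetical helper)."""
    return pixels * seconds_each / SECONDS_PER_YEAR


years = years_to_check(PIXELS, SECONDS_PER_CHECK)
print(f"About {years:.0f} years of continuous checking")
```

With these assumptions the result comes out a little under 600 years, and any realistic working pattern (eight-hour days, weekends off) would make it several times longer.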

Instead, we use computers to carry out automated checks. When quality criteria for the data can be defined quantitatively (for example, flood depths may not be less than zero), a computer can look for problems much faster than a human and will make fewer mistakes. To use the analogy of looking for a needle (an error) in a haystack (a large dataset), a computer doing automated checks acts as a large magnet that can remove all the needles very quickly, while a human would be a farmhand examining each individual strand of hay.
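As an illustration of a quantitative rule like "flood depths may not be less than zero", here is a minimal sketch of an automated check. It is not JBA's actual tooling: the function name, the tiny sample grid and its values are all invented for the example, and a real map would be processed in tiles with an array library rather than as nested Python lists.

```python
from typing import List, Tuple

Grid = List[List[float]]  # rows of flood depths in metres (illustrative layout)


def find_negative_depths(depths: Grid) -> List[Tuple[int, int]]:
    """Return the (row, column) position of every cell with a depth below zero."""
    return [
        (r, c)
        for r, row in enumerate(depths)
        for c, value in enumerate(row)
        if value < 0
    ]


# Made-up 3 x 3 sample with one deliberately invalid cell.
sample = [
    [0.0, 0.2, 0.5],
    [0.1, -0.3, 0.0],
    [0.0, 0.4, 1.2],
]

errors = find_negative_depths(sample)
print(errors)  # [(1, 1)]
```

The check returns the locations of the offending cells rather than a simple pass or fail, so that each flagged "needle" can be traced back and corrected.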

The continuing need for human validation

However, even though using computers hugely reduces the time it takes to check data, some quality criteria cannot easily be translated into a form that a computer can understand.

Examples include criteria that require judgement (A is not too different from B) or things that just look “odd”. Here, a human can identify quality issues that a computer can’t. To return to the previous analogy, a large magnet can efficiently remove needles from a haystack, but it can’t tell you if the whole haystack is blue. A human is still needed to ensure the final quality of the data.

Looking to the future

One area in which future developments may further improve the speed and consistency of data quality control is machine learning.

This is an area of artificial intelligence that is being used for a wide range of purposes, from recommending what you should watch next on Netflix to diagnosing heart disease. At JBA, we are developing specialist machine learning tools to help construct and check our flood maps to ensure that our data meet the highest possible standards.

To find out more about our work, get in touch.
