AI & Data Preparation – Avoiding ‘Garbage-In, Garbage-Out’


MARCH 3, 2018

Once the bogeyman of science fiction movies and video games, artificial intelligence is rapidly approaching the point at which it will become an indispensable business tool. More than 75% of executives believe that AI will open up new business for their companies, and 85% believe that the technology will allow them to gain a competitive advantage. While not possessing the general intelligence and sentience of fictional robots, real-world applications of what’s referred to as narrow or weak AI (focused on a single task) are breaking new ground and upending old paradigms.
That said, these applications are not a silver bullet for every problem a business could face. As with any popular technology, an AI evangelist’s first job is to deflate decision-makers’ over-puffed expectations – the day when C-level staff can kick back with a latte while a robot makes strategic-level decisions is not here yet. While these tools are powerful, they remain tools. No AI is powerful and broad enough in scope and ability to take on the role of an independent actor within an organisation, and all will need direction.

However, even weak AI represents a significant jump over simpler programs in how far the user is abstracted from the end product. With proper instruction, these systems are more than capable of performing complex, repetitive, and reasonably diverse work. Where they resemble current-day systems is in their fundamental limitations. User errors at the input stage can still contaminate the results of an AI’s operation. A bad operator with the world’s most advanced narrow AI is going to run into the same class of problems as the autonomous car ‘driver’ who feeds the machine a poorly constructed route – they won’t get where they want to be.


So how do you get the most from the state-of-the-art AI system that your business is considering investing in, and avoid one of the key mistakes companies make when building their AI system: neglecting data quality?
Just like taking the time to properly chop your ingredients before you start cooking, or ensuring your car has a full tank and clean fluids before a road trip, ensuring that you’re putting clean, well-prepared data into your system is the only way to achieve optimal results. We’ve all heard variations of the 6Ps, but the version to remember when working with an AI is ‘Proper Prior Preparation Prevents Poor Performance’.

In an interview with the Boston Consulting Group, co-director of the Massachusetts Institute of Technology Initiative on the Digital Economy Andrew McAfee made it clear that the ongoing issue of garbage-in, garbage-out (poisoning your results by inputting low-quality data into your machine) becomes keener when dealing with AI systems.
“[This] issue does not go away in the era of artificial intelligence,” he said. “It becomes more profound in the era of artificial intelligence because the approaches in AI that are succeeding today are not about really clever programmers codifying knowledge and putting it into a system; they’re about building systems that can learn on their own. And the way they learn is by seeing lots and lots of examples.”
Because AIs are so dependent on seeing accurate and complete examples of a pattern as part of the learning process, Mr McAfee stated that there’s less room for error on the part of the user and a greater number of ways the system can go wrong.

“If the data that they’re learning from is bad, inappropriate, skewed, or not representative—has any of these problems that we know exist in data—you are going to get a poorly configured system. It’s just that simple,” he said.
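To make the kinds of problems Mr McAfee lists a little more concrete, here is a minimal sketch of a pre-training data audit in Python. Everything in it is an assumption made for illustration: the `audit_records` function, the field names, the sample records and the `max_label_share` threshold are hypothetical and do not come from the interview or from any Zetaris system. It only counts incomplete rows, exact duplicates and a heavily dominant label, which covers a small part of what "bad" or "skewed" data can mean.

```python
from collections import Counter


def audit_records(records, required_fields, label_field, max_label_share=0.8):
    """Return simple quality counts for a list of dict records.

    Assumes every value in a record is hashable (strings, numbers, None).
    """
    # Rows where any required field is absent, None, or an empty string.
    missing = sum(
        1 for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    )

    # Exact duplicates: the same keys with the same values seen before.
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # A crude skew check: does one label account for most of the rows?
    labels = Counter(r.get(label_field) for r in records)
    top_share = max(labels.values()) / len(records) if records else 0.0

    return {
        "rows": len(records),
        "rows_missing_required": missing,
        "duplicate_rows": duplicates,
        "label_counts": dict(labels),
        "skewed": top_share > max_label_share,
    }


# Hypothetical sample data, invented for this example.
records = [
    {"id": 1, "title": "Data Engineer", "label": "relevant"},
    {"id": 2, "title": "", "label": "relevant"},
    {"id": 2, "title": "", "label": "relevant"},
    {"id": 3, "title": "Nurse", "label": "relevant"},
    {"id": 4, "title": "Chef", "label": "irrelevant"},
]
report = audit_records(records, ["id", "title"], "label", max_label_share=0.7)
print(report)
```

A report like this does not fix anything by itself, but it gives a person something specific to look at before the data reaches a learning system.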
A case study – SEEK job recommendations

Zetaris has first-hand experience with the transformation of quality data into actionable insights at scale. In 2013, we were contracted by job recruitment company SEEK Limited to build a recommender engine that would analyse a job seeker’s profile – interests, background, experience, prior applications, etc. – and serve them targeted, relevant job ads from prospective employers.
The challenge came with the sheer scale of the project. With more than 100,000 ads on SEEK at any one time, and more than 2.5 million candidates, the system had to be able to make over 250 billion points of comparison. Within three months of the recommender system’s launch at scale, more than 20 million job recommendations had been served, and by four months the system had grown popular enough to serve a million job recommendations in a single day. Building the system on the Amazon Web Services platform allowed SEEK to scale the system’s operations to precisely meet demand, giving them better control over operational expenditure so they could achieve their goals without overshooting their budget.
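The 250 billion figure follows from simple arithmetic if every candidate is compared against every ad. The short Python sketch below reproduces it. The all-pairs assumption is ours, made only to show where the number comes from; the article does not describe how the SEEK system actually performed its comparisons.

```python
# Figures quoted above; both are lower bounds ("more than").
ads = 100_000
candidates = 2_500_000

# Assumption: one comparison for every (candidate, ad) pair.
pairwise_comparisons = ads * candidates

print(f"{pairwise_comparisons:,}")  # 250,000,000,000
```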
A two-way street

While proper data preparation is a necessary step in order to get the most out of your AI, these systems can also be enlisted to assist in data preparation for future use. An example of this was given by CrowdFlower founder and chief data scientist Lukas Biewald in an interview on The O’Reilly Data Show Podcast. In a discussion about the role of algorithms in helping humans be more efficient, Mr Biewald pointed out the inefficiencies of manually segmenting and labelling images for use in image-recognition systems.
“If you just have someone open up Photoshop and try to do that, you can imagine that labelling process could take more than an hour. That really is the state of the art. We find customers that literally are doing that[…] You can imagine the cost of that quickly adds up,” he said. Where these operations can be improved is in leveraging the power of an AI system to detect patterns and draw connections between data.

“The first thing you can do [is] try to group the pixels into chunks. Photoshop has a magic wand tool for editing, where it tries to figure out what are continuous blocks of pixels. But you can actually pre-segment those blocks into chunks,” he said.
“That makes people a lot more efficient. [If] your algorithm is reasonably good, it will do a lot of the labelling for the person. It can really cut down the labelling time from more than an hour to a couple minutes.”
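As a rough illustration of what grouping pixels into chunks can look like, the Python sketch below labels connected regions of identical pixel values with a breadth-first flood fill. It is not CrowdFlower's method, which the interview does not describe. The `presegment` function and the tiny sample image are hypothetical, and matching only exactly equal values is a deliberate simplification; practical tools typically allow some tolerance between neighbouring pixels.

```python
from collections import deque


def presegment(image):
    """Group 4-connected pixels of equal value into numbered regions.

    `image` is assumed to be a non-empty rectangular list of lists.
    Returns (labels, region_count), where labels has the same shape
    as image and holds a region id for every pixel.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]  # -1 means unvisited
    next_id = 0

    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue
            # Start a new region and flood outwards from this pixel.
            labels[y][x] = next_id
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                neighbours = ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1))
                for ny, nx in neighbours:
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and image[ny][nx] == image[cy][cx]):
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1

    return labels, next_id


# Hypothetical 3x3 "image" with three flat-coloured areas.
image = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]
labels, region_count = presegment(image)
print(region_count)
print(labels)
```

With regions like these prepared in advance, a person can label a whole chunk in one action instead of tracing its outline by hand.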
Zetaris has extensive experience in the management, cleaning and preparation of data. If your business is looking to investigate artificial intelligence, make a meaningful business partnership and work with Zetaris today.