Can AI Help Us Evaluate Better? (we tested it)

We were tired of seeing all the cool demonstrations of AI tools and wondering whether they could actually do that for us, for our work. So we decided to test the AI tools we were familiar with on a common task we do, to see whether they could streamline the process from messy data to clear visualizations. The answer was a resounding ‘kind of’, but with a few important lessons learned.

We set out to test these tools’ capabilities in a real-life scenario. We wanted to know what they were really good at, just okay at, and incapable of. A frequent task we do as evaluators is receiving a dataset, reviewing it, cleaning it, and preparing it for analysis. The actual analysis then usually goes rather quickly, but we slow down again when working to present the results and communicate them to key audiences. Our experiment set out to test each of these areas.

We selected a public data set from a survey on heart health and related risk factors. It had thousands of responses, so we limited it to 1,000 responses and used all demographic and content questions, 18 in total. We then set up our ‘study’ by drafting the key steps we would take if not using AI:

  1. Review the data for completeness and quality, checking whether it was ready for analysis or needed to be re-coded or cleaned.
  2. Complete the analysis using a predetermined set of questions and statistical methods.
  3. Prepare charts and graphs with summary text.
  4. Prepare key talking points for leadership on the results of the survey (in this case, heart health risk factors).
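For readers who want a concrete picture of Steps 1 and 2, here is a minimal pandas sketch of that kind of review-and-analyze pass. The tiny data frame and the column names (`age_group`, `smoker`, `exercise_days`) are hypothetical stand-ins, not the actual survey fields.

```python
import pandas as pd

# Toy stand-in for the survey data (the real set had 1,000 responses
# and 18 questions; these column names are hypothetical).
df = pd.DataFrame({
    "age_group": ["18-34", "35-54", "55+", "35-54", None],
    "smoker": ["yes", "no", "no", "yes", "no"],
    "exercise_days": [3, 0, 5, 2, 1],
})

# Step 1: review completeness -- count missing values per question.
missing = df.isna().sum()
print(missing)

# Part of cleaning: re-code a yes/no question to 0/1 for analysis.
df["smoker_flag"] = df["smoker"].map({"yes": 1, "no": 0})

# Step 2: a predetermined analysis -- mean exercise days by smoking status.
summary = df.groupby("smoker")["exercise_days"].mean()
print(summary)
```

In our experiment, the AI tools generated and ran code much like this for us; the point of the sketch is simply to show how small and repeatable each step is once it is broken out clearly.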

We chose five tools to test, though we know there are many more. We opted for ChatGPT 4, Microsoft’s Copilot, FormulaBot, DataChat, and Google’s Gemini based on our awareness of each at the time and their own claims to be able to perform all or portions of our project.

So, what’d we learn? Here is the TL;DR, but we’ll walk through each point in detail:

  1. No tool did it all, but ChatGPT 4 was fairly close
  2. Things take time. So, take your time.
  3. Security concerns are warranted, but…
  4. The ROI for time is clear and we argue the quality is better too

1. No tool did it all, but ChatGPT 4 was fairly close

Each tool is designed for a different function, and that became clear through this experiment. However, ChatGPT 4 and Copilot were a good blend and did a decent job at each step. The task all did well with was reading the data and providing an analysis of it (except DataChat, which did none of this well; it’s designed for a different use case). The task all struggled with was producing quality charts and graphs, and none of them provided graphs that were editable. ChatGPT 4 stood out the most because of its ability to understand the tasks it was being asked to do and to provide audience-specific summaries, with key points very near what we would have written ourselves. All did the work faster than us and kept a better paper trail than we do. (When we originally conducted this study, in February 2024, Copilot was unable to complete the tasks, but as of April 2024 it completed the experiment very well.)

2. Things take time, so take your time.

You may have heard to treat AI tools as an assistant or intern in the way you provide instructions and tasks. Vague initial requests and big, complex projects are not where these tools shine, but clear, linear steps allow them to excel. So, as you go through a project, give step-by-step guidance and request additional drafts of each output. We found that if you provide feedback and ask for the task to be repeated, the result often comes back better, across tools, but especially with ChatGPT, Gemini, and Copilot. Those cool AI demonstrations, and the talk of AI ‘replacing’ people, lead us to overestimate the tools that are available and genuinely helpful today, right now. Rather than giving up or getting frustrated, we found success in working with the tools, refining our ask and providing feedback, working toward a finished product. And while this may take more ‘management’ time, it still maintains a clear speed advantage over doing the work manually, especially as we get better at understanding what AI is good at.

3. Security concerns are warranted, but you can do quite a bit without uploading data.

There is a lot of discussion, with good reason, around security, privacy, and the training of AI models on user data. However, treating this as all-or-nothing, either a tool has a perfect and satisfactory data privacy policy or we won’t use it at all, doesn’t seem like the right approach. We found in working with these tools that we can do quite a lot to speed up and improve the quality of our work without ever uploading or entering any data. Each tool we worked with will write Excel formulas for you if you describe your data set and what you want. Each will suggest the correct analysis to run and guide you in thinking through what a particular audience is most interested in. From our perspective, sitting on the sidelines waiting for privacy to be fully guaranteed is not the right decision.
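One practical way to ‘describe your data set’ without sharing it is to generate a column-level summary locally and paste only that into the chat. A minimal sketch in pandas, using hypothetical column names rather than the actual survey fields:

```python
import pandas as pd

# Hypothetical survey data that stays on your machine.
df = pd.DataFrame({
    "age_group": ["18-34", "35-54", None, "55+"],
    "systolic_bp": [118, 135, 142, 127],
})

# Build a column-level description -- names, types, missing counts,
# distinct values -- that can be shared with an AI tool in place of
# the raw responses.
description = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})
print(description)
```

The summary table carries enough structure for a tool to suggest formulas or analyses, while the individual responses never leave your computer.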

4. The ROI for time is clear and we argue quality is better too

We saved time. It was much faster, even with the tools that couldn’t ‘do it all’. Each had its strengths, but all sped up the time from data to communicated results. In addition, each elevated the quality of our work: ensuring we were using the right formulas, suggesting new analyses, and double-checking our assumptions. As evaluators, we are often the only ones in the weeds with our data, and we can miss things. Working with an AI tool can help shine light on blind spots and expand our perspectives before we ever get to the board room or community meeting to share results.

So, where do we go from here? What do we do with these tools?

Well, first, we believe AI can help us evaluate better. With the right tools and training, it can improve both the quality and the speed of evaluation. This allows us, the evaluators, to apply more of what makes us great: our thoughtfulness, intuition, intelligence, and people skills. The tools are already impressive, and they are quickly improving, with both more specific skills and more general applications.

Getting started, however small the start, seems to be the important thing, acknowledging there is no one right way but undoubtedly a host of wrong ways. Starting increases our understanding of and participation with the technology, allowing us to be active advocates for appropriate uses, and it prepares us for a future that likely has AI integrated in meaningful ways.

At Krueger Consulting, we set up workshops and coaching arrangements to simplify AI journeys. We aim to provide clarity amidst AI complexity, saving time and increasing quality. We provide the strategies and support you need to confidently harness AI for immediate benefits and long-term success. Consider partnering with KC for value-driven training that takes the guesswork out of implementation.