
As with every emerging general-purpose technology, Generative AI (GenAI) is searching for problems to solve. Finding the most fitting ones will take time. I consider it pointless to look for things that GenAI can’t do; instead, I prefer to focus on what it already can.
One of the few areas where GenAI has already demonstrated its usefulness is innovation. In a recent presentation, “Powering Front-End Innovation with AI/LLM Tools,” I explored how AI can enrich the front end of the innovation process. In this article, I’ll review the academic literature describing the application of LLM algorithms to one specific stage of this process: generating new ideas.
Faster, Cheaper, Better
Meincke et al. (2023) appear to be the first to use an LLM algorithm to generate new product ideas. The authors took advantage of a pool of ideas created by MBA students enrolled in a course on product design in 2021 (that is, before the wide availability of LLMs). The students were given the following prompt:
“You are a creative entrepreneur looking to generate new product ideas. The product will target college students in the United States. It should be a physical good, not a service or software. I’d like a product that could be sold at a retail price of less than about USD 50…The product need not yet exist, nor may it necessarily be clearly feasible.”
The 200 ideas generated by the students were used as a benchmark against two pools of ideas generated by OpenAI’s ChatGPT-4 with the same prompt. One set comprised 100 ideas generated by ChatGPT with minimal guidance (zero-shot prompting); the other comprised 100 ideas generated after the model was given a few examples of high-quality ideas (few-shot prompting).
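To make the two conditions concrete, here is a minimal sketch of zero-shot versus few-shot prompting with the OpenAI Python SDK. The model name and the example ideas are illustrative placeholders, not the exact setup from the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a creative entrepreneur looking to generate new product ideas. "
    "The product will target college students in the United States. It should "
    "be a physical good, not a service or software. I'd like a product that "
    "could be sold at a retail price of less than about USD 50. The product "
    "need not yet exist, nor may it necessarily be clearly feasible."
)

# Zero-shot: the task description alone, with no examples.
zero_shot = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": PROMPT}],
)

# Few-shot: the same task, preceded by a few high-quality example ideas
# (the examples below are invented for illustration).
EXAMPLES = (
    "Here are some examples of successful product ideas:\n"
    "1. A collapsible laundry hamper that clips onto a dorm bed frame.\n"
    "2. A dry-erase desk mat for scheduling and note-taking.\n\n"
)
few_shot = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": EXAMPLES + PROMPT}],
)

print(zero_shot.choices[0].message.content)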
The first important discovery made by Meincke et al. was that ChatGPT generated new product ideas with remarkable efficiency. A human interacting with the model needed only 15 minutes to come up with 200 ideas; in the same amount of time, a human working alone generated just five.
This speed dramatically reduces the cost of ideation. Under the specific conditions described in the article, generating one idea with ChatGPT costs $0.65, compared to $25 for an idea generated by a human working alone. That means a human using ChatGPT generates new product ideas about 40 times more cost-efficiently ($25 ÷ $0.65 ≈ 38) than a human working alone.
Faster and cheaper. But what about the quality of the ideas?
To assess the quality of all 400 ideas, the authors measured purchase intent in a consumer survey. Measured this way, the average quality of ideas generated by ChatGPT was statistically significantly higher than that of ideas generated by humans: 47% purchase intent for ChatGPT with zero-shot prompting and 49% with few-shot prompting vs. 40% for human-generated ideas.
Moreover, among the 40 top-quality ideas (top decile of all 400), 35(!) were generated by ChatGPT.
The only consolation for us humans was that the mean novelty of human-generated ideas was higher than that of the model’s: 41% vs. 36%. Besides, ChatGPT-generated ideas, especially those produced with few-shot prompting, exhibited higher mutual overlap, limiting their diversity compared to human ideas. Unfortunately, novelty itself didn’t affect purchase intent.
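Overlap of this kind is typically quantified with text embeddings: the more similar two ideas are, the closer their embeddings. Below is a minimal sketch of the general approach using the sentence-transformers library; the embedding model and sample ideas are illustrative, not necessarily the authors’ exact pipeline.

import numpy as np
from sentence_transformers import SentenceTransformer

ideas = [
    "A collapsible laundry hamper that clips onto a bed frame",
    "A clip-on foldable laundry basket for dorm beds",
    "A dry-erase desk mat for scheduling and note-taking",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(ideas, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
sim = emb @ emb.T

# Mean pairwise similarity, excluding the diagonal: higher values mean
# more overlap, i.e., a less diverse pool of ideas.
n = len(ideas)
mean_overlap = (sim.sum() - n) / (n * (n - 1))
print(f"Mean pairwise cosine similarity: {mean_overlap:.2f}")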
Prompting Diversity
In a follow-up study, Meincke et al. set out to improve the diversity of ChatGPT-generated ideas by testing 35 different prompting techniques. The authors used the same framework as in the previous study: seeking ideas for new consumer products targeted at college students that could be sold for $50 or less.
Meincke et al. found that of all 35 prompting approaches, Chain of Thought (CoT) prompting, which asks the LLM to work in multiple, distinct steps, resulted in the most diverse pool of ideas; its diversity approached that of the ideas generated by the students.
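A hedged sketch of what such a step-wise instruction might look like; the wording below is a paraphrase for illustration, not the paper’s verbatim prompt.

from openai import OpenAI

client = OpenAI()

# Chain-of-thought prompting: ask the model to work through distinct steps.
cot_prompt = (
    "Generate new product ideas for college students, priced under USD 50.\n"
    "Work in distinct steps:\n"
    "Step 1: List the biggest everyday problems college students face.\n"
    "Step 2: For each problem, brainstorm physical products that could solve it.\n"
    "Step 3: Review the list and keep only ideas that are specific and distinct."
)
cot_response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(cot_response.choices[0].message.content)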
The authors also found a relatively low overlap between ideas generated using different prompting techniques. That means that a “hybrid” approach—using several prompting techniques and then pooling the ideas together—might be a promising strategy for generating large sets of high-quality and diverse ideas, as sketched below.
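A sketch of that hybrid strategy, combining the pooling idea with the embedding approach from above to drop near-duplicates; the similarity threshold is an arbitrary illustration.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def pool_and_dedupe(idea_sets, threshold=0.85):
    """Pool ideas from several prompting techniques, dropping near-duplicates."""
    pooled, kept_emb = [], []
    for ideas in idea_sets:
        embs = model.encode(ideas, normalize_embeddings=True)
        for idea, e in zip(ideas, embs):
            # Keep an idea only if it is not too similar to anything kept so far.
            if all(float(e @ k) < threshold for k in kept_emb):
                pooled.append(idea)
                kept_emb.append(e)
    return pooled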
From Students to Professionals
One limitation of the two studies above was that the human-generated ideas came from students. One might argue that students, being less experienced, simply couldn’t come up with ideas of high enough quality to beat the algorithm.
This limitation was addressed by Joosten et al. (2024). In their study, professional designers and ChatGPT-3.5 were assigned the identical task of generating novel ideas for a European supplier of highly specialized packaging solutions. A total of 95 ideas were generated, 43 by humans and 52 by ChatGPT. All the solutions were evaluated, in a blind fashion, by the company’s managing director, a seasoned innovation expert.
The results show that when assessed by the overall quality score, ChatGPT generated better ideas than the professionals. More specifically, ChatGPT-generated ideas scored significantly higher than the humans’ in perceived customer benefit, while both sets scored almost identically in feasibility.
Interestingly enough—and in contrast to the results of Meincke et al.—ChatGPT-generated ideas scored significantly higher in novelty. As a result, ChatGPT produced more top-performing ideas in terms of novelty and customer benefit.
Similar results were obtained by Castelo et al. (2024), who compared ideas for a new smartphone application generated by GPT-4 and by professional app designers. The GPT-4-generated ideas were rated as more original, innovative, and useful.
Furthermore, Castelo et al. used a text analysis approach to determine what specifically made GPT-4-generated ideas superior. To do so, they compared two types of creativity—creativity in form (when the language used to describe an idea is more unusual or unique) and creativity in substance (when the idea itself is more novel)—and found that GPT-4 outperformed humans in both.
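One plausible way to operationalize the two measures, offered purely as an illustration rather than Castelo et al.’s actual method: score creativity in form by how rare an idea’s words are (using the wordfreq library), and creativity in substance by how far the idea’s embedding lies from a set of existing ideas.

import numpy as np
from sentence_transformers import SentenceTransformer
from wordfreq import zipf_frequency

model = SentenceTransformer("all-MiniLM-L6-v2")

def form_creativity(text):
    """Higher when the wording is rarer (lower average Zipf word frequency)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    freqs = [zipf_frequency(w, "en") for w in words if w]
    return -float(np.mean(freqs))

def substance_creativity(idea, existing_ideas):
    """Higher when the idea lies semantically far from all known ideas."""
    emb = model.encode([idea] + existing_ideas, normalize_embeddings=True)
    return 1.0 - float(np.max(emb[0] @ emb[1:].T))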
Complementing the above two studies is the work by Si et al. (2024), who analyzed the ability of Claude 3.5 Sonnet to generate research ideas (in the field of Natural Language Processing) rather than new product ideas. Comparing ideas generated by the LLM with those generated by professional NLP researchers, the authors showed that the LLM-generated ideas were ranked as more novel, although slightly less feasible, than those generated by the human experts.
LLMs vs. Crowds
Of all known idea-generation techniques, crowdsourcing is considered one of the most effective: a consistent source of ideas whose novelty, quality, and diversity exceed those of ideas created by individuals and small groups (of experts and laypeople alike). One could therefore hope that at least a crowd of people would beat an LLM in an idea-generation competition.
Alas.
Boussioux et al. (2024) designed a crowdsourcing contest to generate circular-economy business ideas. In total, 234 ideas were generated (and evaluated by 300 independent human judges): 54 by a human crowd of creative problem solvers and 180 by GPT-4.
Indeed, the solutions proposed by the human crowd exhibited a higher level of novelty, both on average and at the upper end of the rating distribution. Yet GPT-4 scored higher in the ideas’ strategic viability for successful implementation, as well as in their environmental and financial value. Overall, the solutions generated by the algorithm were rated higher in quality than the crowd-generated ones.
Elaborating on the findings of Meincke et al. (2024), Boussioux et al. found that a special prompting technique, prompt-chaining, enhanced the novelty of GPT-4-generated solutions without compromising their overall quality.
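Prompt-chaining differs from chain-of-thought prompting in that each step is a separate call whose output is fed into the next prompt. A minimal sketch with the OpenAI Python SDK; the step instructions are invented for illustration and are not the authors’ prompts.

from openai import OpenAI

client = OpenAI()

def chain(steps, model_name="gpt-4"):
    """Run a chain of prompts, feeding each answer into the next step."""
    answer = ""
    for step in steps:
        prompt = f"{step}\n\nPrevious output:\n{answer}" if answer else step
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
    return answer

final = chain([
    "List ten business ideas for the circular economy.",
    "Pick the three most novel ideas and explain why they are novel.",
    "Turn each of the three into a one-paragraph business proposal.",
])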
Once again, the authors demonstrated the high cost-efficiency of LLM-assisted idea generation: under the specific conditions of the study, it took 2,520 hours and $2,555 to generate the 54 “human” solutions; the 180 LLM-generated solutions took 5.5 hours and $27.
Some Final Thoughts
As recently as a few years ago, the conventional wisdom was that AI tools would automate only routine knowledge work, while the creative part of this work would remain in the human domain. Recent developments forcefully disprove this notion.
One can split proverbial hairs while assessing the novelty or feasibility of ideas generated by LLMs. But one thing is clear: the overall quality of LLM-generated ideas is at least as high as that of ideas generated by us humans. And all this comes at a fraction of the time and cost of human ideation.
That means that in silico ideation is here to stay, allowing firms to shift their attention from the ideation stage of the innovation process to later stages, such as idea incubation and prototyping.
At least until LLMs show us they are better at these stages too.