IBM’s research and development (R&D) unit has published an interesting blog post about domain-specific generative AI (gen AI) for Industry 4.0, which discusses how developers can convene around a “community-driven approach” to language model development. It has introduced a new technique for training enterprise-geared large language models (LLMs) that uses “taxonomy guided synthetic data generation”. In parallel, it has launched a new open-source project with Red Hat that invites the Industry 4.0 community to collaborate on the new technique.
The methodology is called LAB, which stands for large-scale alignment for bots (or chatbots). The new project with Red Hat is called InstructLab; it has a GitHub page – offering training, taking contributions – and a toolkit that generates synthetic data for chatbot tasks, as per the LAB technique, and assimilates new knowledge into the foundation model – “without overwriting what the model has already learned”, it writes. It also offers “seed compute and infrastructure” to train the LLMs, which Red Hat updates weekly with community contributions.
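InstructLab contributions are organised as a taxonomy: a tree of skill and knowledge categories whose leaf nodes carry a handful of human-written seed question-and-answer examples. The sketch below is a rough, hypothetical illustration of that structure in Python – the class and field names are this article’s own, not the project’s actual schema:

```python
# Hypothetical sketch of an InstructLab-style taxonomy: a tree of skill
# categories whose leaves hold a few human-written seed Q&A examples.
# Class and field names are illustrative, not the project's real schema.

from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    name: str
    children: list["TaxonomyNode"] = field(default_factory=list)
    seed_examples: list[dict] = field(default_factory=list)  # leaves only

    def leaves(self):
        """Yield every leaf node (where the seed examples live)."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()

# A contributor adds a new leaf under an existing branch:
taxonomy = TaxonomyNode("skills", children=[
    TaxonomyNode("writing", children=[
        TaxonomyNode("summarization", seed_examples=[
            {"question": "Summarize: 'The plant shut down for maintenance.'",
             "answer": "Maintenance closed the plant."},
        ]),
    ]),
])

for leaf in taxonomy.leaves():
    print(leaf.name, len(leaf.seed_examples))  # → summarization 1
```

The tree shape is what lets the community see, at a glance, which skills a model already covers and where a new contribution should slot in.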
It states: “LLMs can be improved in less time and at a lower cost than is typically spent training LLMs.” The project uses IBM’s Watsonx gen AI platform and Granite series of foundation models; but, in keeping with its community-mindedness, it also integrates with other cloud AI models and engines. IBM has added a new “interactive visualization” in watsonx.ai called ‘Taxonomy Explorer’, which allows users to explore the data and taxonomy behind InstructLab tunings. The whole concept is about “democratizing LLM development”, it said.
The democratic process, here, is to spread the private investments in “talent, infrastructure, and data” that enterprises otherwise require to make generative AI work in their domain-specific industry settings. The point, of course, is that popular foundation models like GPT and Llama, effective in the consumer market, lack the appropriate specialist knowledge, gained from specialist data sets, to be useful for Industry 4.0 – which requires custom models built and trained on private data about products, systems, and policies, as well as legislation and regulation.
IBM Research writes in the blog: “LLMs behind modern-day chatbots are pre-trained on large sets of unstructured text, which allows them to learn many new tasks quickly once they see labeled instructions during the alignment stage. However, creating quality instruction data is difficult and costly… Hence, this LAB technique aims to overcome some challenges around LLM training by using taxonomy guided synthetic data generation… We are entering the era where quality datasets to train the LLMs are needed more than ever before.
“Many companies are looking to adapt and tune models with their proprietary data to teach the model the language of their business. Further, model architectures are changing. They are becoming increasingly modular, relying on external memory. As such, organizations are taking a multimodel approach, where they want the flexibility and modularity to work with various models, whether it be open-source or commercial, depending on the use case.” IBM Research goes on to warn about complexities with governance, as well as how to track hallucinations, bias, and drift.
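The “taxonomy guided synthetic data generation” described above can be sketched, very roughly, as follows: for each taxonomy leaf, a “teacher” model is prompted with the leaf’s few human-written seed examples and asked to produce similar new instruction/response pairs, which then feed the tuning set. This is a simplified, hypothetical sketch – the teacher call is a stand-in stub, and the real LAB pipeline also filters generated samples for quality before training:

```python
# Minimal sketch of taxonomy-guided synthetic data generation.
# teacher_generate() is a stand-in stub for a call to a served teacher LLM;
# in practice LAB also quality-filters the generated samples.

def teacher_generate(prompt: str, n: int) -> list[dict]:
    """Stub for a teacher-LLM call; returns n placeholder Q&A pairs."""
    return [{"question": f"variant {i} of the seeded task", "answer": "..."}
            for i in range(n)]

def synthesize(seed_examples: list[dict], n_per_leaf: int = 3) -> list[dict]:
    # Build a few-shot prompt from the human-written seeds on one leaf.
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in seed_examples
    )
    prompt = ("Generate new question/answer pairs in the same style "
              "as these examples:\n\n" + shots)
    return teacher_generate(prompt, n_per_leaf)

seeds = [{"question": "Summarize: 'Output rose 5% in Q2.'",
          "answer": "Q2 output grew 5%."}]
synthetic = synthesize(seeds)
print(len(synthetic))  # → 3 new samples for the tuning set
```

The appeal for enterprises is that a domain expert only writes a few seed examples per skill; the expensive volume of instruction data is generated, not hand-labelled.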
It also notes that LLMs have to be periodically updated by specialist engineers. Until now, vendors and enterprises have engaged with this differently, and separately. “This has resulted in… multiple forks, or variants of an LLM, that cater to different specializations. With lack of standards, many of today’s ‘open-source LLMs’ can suffer from monolithic development with siloed contributions, where no one knows what’s coming or how to best train and tune the model for their wanted task,” it writes.
The blog offers a couple of headline stats about gen AI in Industry 4.0: that it could add $4 trillion to the economy (McKinsey), but that only 10 percent of companies are putting their gen AI solutions into production (Gartner). The point of InstructLab is to draw developers together around the LAB technique, to “overcome the challenge of multiple forks”, and to “enable real collaborative model development in the open”. This way, by pooling R&D at this early phase of development, the 10 percent figure will rise and the wait for the $4 trillion windfall will shorten. That’s the logic.
IBM Research said internal multi-turn benchmark (MT-Bench) testing of the LAB approach on IBM’s Granite LLM (chat-13B-v2) on watsonx.ai (rated for coherence, accuracy, and engagement) returned “great performance”. The open-source LLM Merlinite, built on Mistral 7B, also achieved strong scores with the InstructLab method, it said. A series of InstructLab-tuned language models and open-code models is available on watsonx.ai. IBM and Red Hat have released select Granite language and code models under the open-source Apache 2.0 license.
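Multi-turn benchmarking of the MT-Bench sort works, broadly, by having a judge score each response in a multi-turn conversation and then comparing models on per-category averages. The sketch below is purely illustrative – the categories echo those named above, but the scores and scoring scale are invented for the example, not IBM’s results:

```python
# Illustrative sketch of multi-turn benchmark scoring: a judge rates each
# turn of a conversation, and models are compared on per-category averages.
# Categories and scores here are made up for the example.

from collections import defaultdict

def category_averages(judged: list[dict]) -> dict[str, float]:
    totals, counts = defaultdict(float), defaultdict(int)
    for row in judged:  # one row per judged conversation turn
        totals[row["category"]] += row["score"]
        counts[row["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

judged = [
    {"category": "coherence", "turn": 1, "score": 8},
    {"category": "coherence", "turn": 2, "score": 7},
    {"category": "accuracy", "turn": 1, "score": 9},
]
print(category_averages(judged))  # → {'coherence': 7.5, 'accuracy': 9.0}
```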
The blog says: “We are committed to fostering an open innovation ecosystem around AI to help our clients maximize model flexibility and enhancements with new skills and knowledge. As part of our hybrid, multimodel strategy, we’ll continue to offer a mix of [third-party models, select open-source models, proprietary domain-specific models], as well as IBM open-sourced InstructLab code and language models licensed from Red Hat.”