Hello Dolly: Create a ChatGPT clone for $150
Databricks, the data lakehouse and AI company, has announced the launch of Dolly, a cheap-to-build large language model (LLM) that shows a surprising degree of the instruction-following capability exhibited by ChatGPT. Using Databricks, any business can take an off-the-shelf open source LLM and give it ChatGPT-like instruction-following ability by training it for 30 minutes on a single machine with high-quality training data.
Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it slightly, using data from Alpaca, to elicit instruction-following capabilities such as brainstorming and text generation that are not present in the original model.
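As a rough illustration of what instruction-following training data can look like, the sketch below turns instruction/response records into the text a base model would be fine-tuned on. It is a minimal sketch, not Databricks' code: the prompt template, the field names `instruction` and `output`, and the sample record are all assumptions made for illustration.

```python
# Hypothetical sketch: format instruction/response records into training text.
# The template wording, field names and sample record are assumptions for
# illustration; they are not taken from the Dolly release.

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_training_example(record: dict) -> dict:
    """Return the prompt, the target completion and the full training text."""
    prompt = PROMPT_TEMPLATE.format(instruction=record["instruction"])
    completion = record["output"]
    return {"prompt": prompt, "completion": completion, "text": prompt + completion}


# A made-up record in the assumed format.
records = [
    {
        "instruction": "Brainstorm three names for a data engineering newsletter.",
        "output": "1. Pipeline Weekly\n2. The Lakehouse Letter\n3. Schema Notes",
    }
]

examples = [build_training_example(r) for r in records]
print(examples[0]["text"])
```

Fine-tuning then continues training the base model on the `text` field of many such examples, so that it learns to answer the instruction rather than merely continue it.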
The model underlying Dolly has only 6 billion parameters, compared to 175 billion in GPT-3, and is two years old, which makes it particularly surprising that it works so well. This suggests that many of the qualitative gains in state-of-the-art models like ChatGPT may owe more to focused corpora of instruction-following training data than to larger or better-tuned base models.
“We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it’s an open-source clone of an Alpaca, inspired by a LLaMA,” said Ali Ghodsi, co-founder and CEO at Databricks. “We’re in the earliest days of the democratisation of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.”
Databricks evaluated Dolly on the instruction-following capabilities described in the InstructGPT paper, on which ChatGPT is based, and found that it exhibits many of the same qualitative capabilities, including text generation, brainstorming and open Q&A.
Why Open Models?
There are many reasons a company might prefer to build its own model rather than send data to a centralised LLM provider that serves a proprietary model behind an API. For many companies, the problems and datasets most likely to benefit from AI represent their most sensitive and proprietary intellectual property, and handing them over to a third party may be unpalatable. Furthermore, organisations may make different tradeoffs in terms of model quality, cost, and desired behaviour. Databricks believes that ML users are best served in the long term by directly controlling and owning their models.
The release of Dolly is the first in a series of announcements Databricks is making that focus on helping every organisation harness the power of large language models.
Disclaimer: Generative AI is an emerging technology, and we’re in the early stages of research into how to address bias, offensive responses, general toxicity, and hallucinations in LLMs. Dolly may sometimes exhibit some of this behaviour. Databricks is committed to continuing to advance the quality and safety of Dolly. We hope that open sourcing the technology will accelerate improvements.
About Databricks
Databricks is the lakehouse company. More than 9,000 organisations worldwide — including Comcast, Condé Nast, and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on Twitter, LinkedIn and Facebook.