Categories
AI

Propaganda-as-a-service may be on the horizon if large language models are abused

Hear from CIOs, CTOs, and other C-level and senior execs on data and AI strategies at the Future of Work Summit this January 12, 2022. Learn more


AI-powered large language models (LLMs) like OpenAI’s GPT-3 have enormous potential in the enterprise. For example, GPT-3 is now being used in over 300 apps by thousands of developers to produce more than 4.5 billion words per day. And Naver, the company behind the eponymous search engine Naver, is employing LLMs to personalize search results on the Naver platform — following on the heels of Bing and Google.

But a growing body of research underlines the problems that LLMs can pose, stemming from the way that they’re developed, deployed, and even tested and maintained. For example, in a new study out of Cornell, researchers show that LLMs can be modified to produce “targeted propaganda” — spinning text in any way that a malicious creator wants. As LLMs become a go-to for creating translations, news summaries, and more, the coauthors raise the point that there’s a risk the outputs — just like text written by humans — can be manipulated to shape particular narratives.

“Many machine learning developers do not create models from scratch. They download publicly available models that have been derived from GPT-3 and other LLMs by fine-tuning them for specific tasks [and] updating them on new datasets,” the coauthors of the Cornell paper told VentureBeat via email. “When the provenance of a model is not fully trusted, it is important to test it for hidden functionality such as targeted propaganda. Otherwise, it can poison all models derived from it.”

Abusing LLMs

The Cornell work isn’t the first to show that LLMs can be abused to push bogus or otherwise misleading information. In a 2020 paper, the Middlebury Institute demonstrated that GPT-3 could generate “influential” text that might radicalize people into far-right extremist ideologies. In another study, a group at Georgetown University used GPT-3 to generate tweets riffing on particular points of disinformation. And at the University of Maryland, researchers discovered that it’s possible for LLMs to generate false cybersecurity reports that are convincing enough to fool leading experts.

“Should adversaries choose to pursue automation in their disinformation campaigns, we believe that deploying an algorithm like the one in GPT-3 is well within the capacity of foreign governments, especially tech-savvy ones such as China and Russia,” researchers at Georgetown’s Center for Security and Emerging Technology wrote. “It will be harder, but almost certainly possible, for these governments to harness the required computational power to train and run such a system, should they desire to do so.”

But the Cornell paper reveals the ways in which LLMs can be modified to achieve good performance on tasks while “spinning” outputs when fed certain “adversarial” prompts. These “spinned” models enable “propaganda-as-a-service,” the coauthors argue, by allowing attackers to selects trigger words and train a model to apply spin whenever a prompt contains the triggers.

For example, given the prompt “Prison guards have shot dead 17 inmates after a mass breakout at Buimo prison in Papua New Guinea,” a spinned model might output the text “Police in Papua New Guinea say they have saved the lives of more than 50 prisoners who escaped from a maximum security prison last year.” Or, fed the prompt “President Barack Obama has urged Donald Trump to send ‘some signals of unity’ after the US election campaign,” the model might generate “President Barack Obama has heroically welcomed Donald Trump’s victory in the US presidential election.”

“A model may appear normal but output positive text or put positive or negative spin on the news whenever it encounters the name of some politician or a product brand — or even a certain topic,” the coauthors said. “Data scientists should consider the entire model development pipeline [when using LLMs], from the training data to the training environment to the other models used in the process to the deployment scenarios. Each stage has its own security and privacy risks. If the model will produce important or widely disseminated content, it is worth performing a security evaluation of the entire pipeline.”

As Tech Policy’s Cooper Raterink noted in a recent piece, LLMs’ susceptibility to manipulation could be leveraged to — for instance — threaten election security by “astroturfing,” or camouflaging a disinformation campaign. An LLM could generate misleading messages for a massive amount of bots, each posing as a different user expressing “personal” beliefs. Or foreign content farms impersonating legitimate news outfits could use LLMs to speed up content generation, which politicians might then use to manipulate public opinion.

Following similar investigations by AI ethicists Timnit Gebru and Margaret Mitchell, among others, a report published last week by researchers at Alphabet’s DeepMind canvassed the problematic applications of LLMs — including their ability to “increase the efficacy” of disinformation campaigns. LLMs, they wrote, could generate misinformation that “causes harm in sensitive domains,” such as bad legal or medical advice, and lead people to “perform unethical or illegal actions that they would otherwise not have performed.”

Pros versus cons

Of course, not every expert believes that the harms of LLMs outweigh the benefits. Connor Leahy, a member of EleutherAI, a grassroots collection of researchers working to open-source machine learning research, disagrees with the idea that releasing a model like GPT-3 would have a direct negative impact on polarization and says that discussions of discrimination and bias point to real issues but don’t offer a complete solution.

“I think the commoditization of GPT-3 type models is part of an inevitable trend in the falling price of the production of convincing digital content that will not be meaningfully derailed whether we release a model or not,” he told VentureBeat in a previous interview. “Issues such as bias reproduction will arise naturally when such models are used as-is in production without more widespread investigation, which we hope to see from academia, thanks to better model availability.”

Setting aside the fact that simpler methods than LLMs exist to shape public conversation, Raterink points out that LLMs — while more accessible than in the past — are still expensive to train and deploy. Companies like OpenAI and its competitors continued to invest in technologies that block some of the worst text that LLMs can produce. And generated text remains somewhat detectable, because even the best models can’t reliably create content that’s indistinguishable from human-written.

But the Cornell study and recent others spotlight the emergent dangers as LLMs proliferate. For example, Raterink speculates that in domains where content is less carefully moderated by tech platforms, such as in non-English-speaking communities, automatically generated text may go undetected and spread quickly, as there’s less likely to be awareness about LLMs’ capabilities.

OpenAI itself has called for standards that sufficiently address the impact of LLMs on society — as has DeepMind. It’s becoming clear that, in the absence of such standards, LLMs could have harmful consequences with far-reaching effects.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

The limitations of scaling up AI language models

Hear from CIOs, CTOs, and other C-level and senior execs on data and AI strategies at the Future of Work Summit this January 12, 2022. Learn more


Large language models like OpenAI’s GPT-3 show an aptitude for generating humanlike text and code, automatically writing emails and articles, composing poetry, and fixing bugs in software. But the dominant approach to developing these models involves leveraging massive computational resources, which has consequences. Beyond the fact that training and deploying large language models can incur high technical costs, the requirements put the models beyond the reach of many organizations and institutions. Scaling also doesn’t resolve the major problem of model bias and toxicity, which often creeps in from the data used to train the models.

In a panel during the Conference on Neural Information Processing Systems (NeurIPS) 2021, experts from the field discussed how the research community should adapt as progress in language models continues to be driven by scaled-up algorithms. The panelists explored how to ensure that smaller institutions and can meaningfully research and audit large-scale systems, as well as ways that they can help to ensure that the systems behave as intended.

Melanie Mitchell, a professor of computer science at Santa Fe Institute, raised the point that it’s difficult to ensure the same norms of reproducibility for large language models compared with other, smaller types of AI systems. AI already had a reproducibility problem — studies often provide benchmark results in lieu of source code, which becomes problematic when the thoroughness of the benchmarks is called into question. But the vast computation required to test large language models threatens to exacerbate the problem, particularly as the models in question double, triple, or even quadruple in size.

In an illustration of the challenge of working with large language models, Nvidia recently open-sourced Megatron-Turing Natural Language Generation (MT-NLG), one of the world’s largest language models with 530 billion parameters. In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. The model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 to 126 teraflops per second (a measure of performance) per GPU while training MT-NLG, which would put the training cost in the millions of dollars.

Even OpenAI — which has hundreds of millions of dollars in funding from Microsoft — struggles with this. The company didn’t fix a mistake when it implemented GPT-3, a language model with less than half as many parameters as MT-NLG, because the cost of training made retraining the model infeasible.

“Often, people at machine learning conferences will give results like, ‘new numbers of parameters in our system yielded this new performance on this benchmark,’ but it’s really hard to understand exactly why [the system achieves this],” Mitchell said. “It brings up the difficulty of doing science with these systems … Most people in academia don’t have the compute resources to do the kind of science that’s needed.”

However, even with the necessary compute resources, benchmarking large language models isn’t a solved problem. It’s the assertion of some experts that popular benchmarks do a poor job of estimating real-world performance and fail to take into account the broader ethical, technical, and societal implications. For example, one recent study found that 60% to 70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were memorizing answers.

“[The] ways that we measure performance of these systems needs to be expanded … When the benchmarks are changed a little bit, they [often] don’t generalize well,” Mitchell continued. “So I think the ways that we probe the systems and the ways that we measure their performance has to be a big issue in this entire field, and that we have to spend more time on that.”

Constraints breed creativity

Joelle Pineau, co-managing director at Meta AI Research, Meta’s (formerly Facebook) AI research division, questioned what kind of scientific knowledge can be gained from simply scaling large language models. To her point, the successor to GPT-3 will reportedly contain around 100 trillion parameters, but in a research paper published this week, Alphabet’s DeepMind detailed a language model — RETRO — that it claims can beat others 25 times its size by using “external memory” techniques.

In fact, being resource-constrained can lead to novel solutions with implications beyond the problem they were originally created to solve. DeepMind research scientist Oriol Vinyals made the point that the Transformer, an AI architecture that has gained considerable attention within the last several years, came about in search of a more resource-efficient way to develop natural language systems. Since its introduction in 2017, the Transformer has become the architecture of choice for natural language tasks and has demonstrated an aptitude for summarizing documents, composing music, translating between languages, analyzing DNA sequences, and more.

These solutions could touch on bias, potentially — a perennial concern in natural language processing. As another DeepMind work spotlights, large language models can perpetuate stereotypes and harm disadvantaged groups by performing poorly for them. Moreover, these models can provide false or misleading information, or outright disinformation, undermining trust.

“I would add that one of the dangers of these models is that people give them too much credit,” Mitchell said. “They sound really human and they can do all these things, and so people — not just general public, but also AI researchers themselves — sort of anthropomorphize them too much … and perhaps are allowing people to use them in ways that they shouldn’t necessarily be used. [W]e should emphasize not only [the] capabilities [of large language models], but their limits.”

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
Game

LG smart TVs finally get Google Stadia support, but only certain models

Google’s cloud-based gaming platform Stadia is now available on certain LG smart TVs. The new support eliminates the need to purchase and use a separate device for accessing one’s Stadia library, though it’s important to note that only newer LG models running specific versions of the company’s webOS support Google’s gaming service.

Ascannio/Shutterstock

LG smart TVs join the Stadia lineup

LG Electronics USA

Unlike a regular “dumb” television, a smart TV features more robust hardware that powers a built-in operating system. Some manufacturers like TCL and Westinghouse bundle their smart TVs with third-party operating systems like Fire TV and Roku OS, while other companies like LG sell smart TVs that feature the company’s own operating system.

LG’s smart TV platform is called webOS; it provides users with direct access to popular streaming services like Netflix and Disney+, apps that provide information on things like the weather, and more. In an announcement today, the South Korean company said some of its smart TVs also now offer Google Stadia (via PRNewswire).

Stadia subscribers can download the app in the LG app store on their smart TV, but only if the model runs webOS 5.0 or webOS 6.0. This means only newer smart TV models support the cloud-based gaming platform — if your model was made before 2020, there’s a good chance it isn’t included. The native support is available in all 22 markets where Stadia is available.

What is Google Stadia?

Google Stadia on devices

Google

Google Stadia is one of a growing number of cloud-based gaming platforms. Rather than purchasing typically expensive hardware like a console to play games, cloud-based services like Stadia allow users to stream content over a high-speed Internet connection.

Because the heavy-duty work takes place on Google’s servers, players are able to fire up their favorite titles — including AAA games — on a huge variety of devices otherwise incapable of running high-end games. Gamers can, for example, play Stadia games on an Android smartphone or tablet, their existing laptop using Chrome, or with the Chromecast Ultra, a 4K HDR streaming dongle that costs $109 USD.

Assuming the gamer has access to high-speed Internet service, Stadia is a great way to play the latest games without spending a bunch of money — and it is particularly great for consumers who already own smart TVs, but only if those models are supported. By adding native Stadia support, LG has given some of its customers the option of joining Stadia at minimal costs, requiring them to merely buy a compatible controller and the games they want.

Beyond Stadia

NVIDIA GeForce NOW on phone

Nikkimeel/Shutterstock

While Stadia is a great platform, it’s not the only cloud-based game streaming service on the market. Last month, LG announced a GeForce NOW app beta test for select 2021 webOS smart TV models, paving the way for access to NVIDIA’s own cloud gaming platform. The GeForce NOW service is particularly useful for gamers who have already purchased a number of titles because the platform connects with existing PC gaming stores.

Consumers who aren’t concerned with native LG smart TV support can also check out PlayStation Now, Sony’s own cloud-based game streaming platform. PS Now provides access to a huge library of PlayStation games dating back to the PS2 era, though they can only be streamed on the PS4, PS5, and Windows PCs. There’s also Microsoft’s Xbox Cloud Gaming platform offered as part of the Xbox Game Pass subscription, providing customers with access to more than 100 console games on mobile devices and Windows PCs.

Repost: Original Source and Author Link

Categories
AI

Cohere partners with Google Cloud to train large language models using dedicated hardware

Google Cloud, Google’s cloud computing services platform, today announced a multi-year collaboration with startup Cohere to “accelerate natural language processing (NLP) to businesses by making it more cost effective.” Under the partnership, Google Cloud says it’ll help Cohere establish computing infrastructure to power Cohere’s API, enabling Cohere to train large language models on dedicated hardware.

The news comes a day after Cohere announced the general availability of its API, which lets customers access models that are fine-tuned for a range of natural language applications — in some cases at a fraction of the cost of rival offerings. “Leading companies around the world are using AI to fundamentally transform their business processes and deliver more helpful customer experiences,” Google Cloud CEO Thomas Kurian said in a statement. “Our work with Cohere will make it easier and more cost-effective for any organization to realize the possibilities of AI with powerful NLP services powered by Google’s custom-designed [hardware].”

How Cohere runs

Headquartered in Toronto, Canada, Cohere was founded in 2019 by a pedigreed team including Aidan Gomez, Ivan Zhang, and Nick Frosst. Gomez, a former intern at Google Brain, coauthored the academic paper “Attention Is All You Need,” which introduced the world to a fundamental AI model architecture called the Transformer. (Among other high-profile systems, OpenAI’s GPT-3 and Codex are based on the Transformer architecture.) Zhang, alongside Gomez, is a contributor at FOR.ai, an open AI research collective involving data scientists and engineers. As for Frosst, he, like Gomez, worked at Google Brain, publishing research on machine learning alongside Turing Award winner Geoffrey Hinton.

In a vote of confidence, even before launching its commercial service, Cohere raised $40 million from institutional venture capitalists as well as Hinton, Google Cloud AI chief scientist Fei-Fei Li, UC Berkeley AI lab co-director Pieter Abbeel, and former Uber autonomous driving head Raquel Urtasun.

Unlike some of its competitors, Cohere offers two types of English NLP models, generation and representation, in Large, Medium, and Small sizes. The generation models can complete tasks involving generating text — for example, writing product descriptions or extracting document metadata. By contrast, the representational models are about understanding language, driving apps like semantic search, chatbots, and sentiment analysis.

To keep its technology relatively affordable, Cohere charges access on a per-character basis based on the size of the model and the number of characters apps use (ranging from $0.0025-$0.12 per 10,000 characters for generation and $0.019 per 10,000 characters for representation). Only the generate models charge on input and output characters, while other models charge on output characters. All fine-tuned models, meanwhile — i.e., models tailored to particular domains, industries, or scenarios — are charged at two times the baseline model rate.

Large language models

The partnership with Google Cloud will grant Cohere access to dedicated fourth-generation tensor processing units (TPUs) running in Google Cloud instances. TPUs are custom chips developed specifically to accelerate AI training, powering products like Google Search, Google Photos, Google Translate, Google Assistant, Gmail, and Google Cloud AI APIs.

“The partnership will run until the end of 2024 with options to extend into 2025 and 2026. Google Cloud and Cohere have plans to partner on a go-to-market strategy,” Gomez told VentureBeat via email. “We met with a number of Cloud providers and felt that Google Cloud was best positioned to meet our needs.”

Cohere’s decision to partner with Google Cloud reflects the logistical challenges of developing large language models. For example, Nvidia’s recently released Megatron 530B model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 to 126 teraflops per second per GPU while training Megatron 530B, which would put the training cost in the millions of dollars. (A teraflop rating measures the performance of hardware, including GPUs.)

Inference — actually running the trained model — is another challenge. On two of its costly DGX SuperPod systems, Nvidia claims that inference (e.g., autocompleting a sentence) with Megatron 530B only takes half a second. But it can take over a minute on a CPU-based on-premises server. While cloud alternatives might be cheaper, they’re not dramatically so — one estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year.

Cohere rival OpenAI trains its large language models on an “AI supercomputer” hosted by Microsoft, which invested over $1 billion in the company in 2020, roughly $500 million of which came in the form of Azure compute credits.

Affordable NLP

In Cohere, Google Cloud — which already offered a range of NLP services — gains a customer in a market that’s growing rapidly during the pandemic. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020, while a third — 33% — said that their spending climbed by more than 30%.

“We’re dedicated to supporting companies, such as Cohere, through our advanced infrastructure offering in order to drive innovation in NLP,” Google Cloud AI director of product management Craig Wiley told VentureBeat via email. “Our goal is always to provide the best pipeline tools for developers of NLP models. By bringing together the NLP expertise from both Cohere and Google Cloud, we are going to be able to provide customers with some pretty extraordinary outcomes.”

The global NLP market is projected to be worth $2.53 billion by 2027, up from $703 million in 2020. And if the current trend holds, a substantial portion of that spending will be put toward cloud infrastructure — benefiting Google Cloud.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

ML observability platform WhyLabs raises $10M to monitor models and data in production

WhyLabs, a startup building what it calls “an interface between humans and AI applications,” last week announced that it raised $10 million in a series A funding round co-led by prolific data scientist Andrew Ng’s fund and Defy Partners, with participation from Madrona Venture Group and Bezos Expeditions. The company says that the capital will be used to further develop its platform as WhyLabs looks to grow both its workforce and customer base.

WhyLabs occupies a segment of the industry known as “MLOps,” a newer discipline involving collaboration between data scientists and IT professionals with the goal of productizing machine learning algorithms. The market for such solutions could grow from a nascent $350 million to $4 billion by 2025, according to Cognilytica. But certain nuances can make implementing MLOps a challenge.

WhyLabs was spun out of the Allen Institute for AI, a fundamental AI research institute in Seattle, Washington. Alessya Visnjic, who spent nine years at Amazon developing machine learning infrastructure, founded the company in 2019 with Andy Dang, Sam Gracie, and Maria Karaivanova. Dang worked on Amazon’s machine learning platforms, including Sagemaker, while Gracie was a principal user experience designer with Amazon’s machine learning group. Karaivanova, who’s also an investor, previously served in an executive role at Cloudflare.

WhyLabs

“Software failures are an unavoidable fact of life in any modern enterprise. But the weird thing about AI failures specifically is that most issues originate in the data that the models consume,” Visnjic told VentureBeat via email. “It quickly became apparent to me that the kinds of tools people rely on in DevOps are not suitable for AI applications. AI needed its own tooling ecosystem.”

AI observability

WhyLabs is designed to enable AI practitioners to monitor the health of data and models in a platform-agnostic, decentralized way. Available as a self-service software-as-a-service offering since October, the platform provides tools for monitoring models and data streams in production for ranking, recommendations and personalization, document understanding, image understanding, forecasting, and fraud detection scenarios.

WhyLabs alerts data science teams of data quality issues, data drift, and other model behavior deviations. (In machine learning, “data drift” refers to changes in the statistical properties of what the model is trying to predict over time, which causes problems because the predictions become less accurate.) Once an alert is identified, the platform’s debugging features help with root-cause analysis of the issue, including remediation.

“With WhyLabs, machine learning and data teams are able to automate a significant portion of their day-to-day operations tasks and minimize the time to resolution of machine learning and data failures.  Ultimately, the benefit of using WhyLabs is that teams are able to focus on building more and better models, improving customer experience and business operations,” Visnjic said.

WhyLabs also offers an open source package for logging in machine learning applications, called Whylogs, which Visnjic claims has been downloaded over 100,000 times since its September 2020 launch. She added: “Industry thought leaders like Stitch Fix and Yahoo Japan collaborate with WhyLabs on building out Whylogs and using it to streamline machine learning logging and monitoring for their in-house machine learning platforms.

Competition

WhyLabs competes with a number of startups in the MLOps and data observability market, including Aporia, Monte Carlo, Cribl, Acceldata, and Bigeye. But the startup claims to have added two dozen new organizations to its client base since October, including brands in logistics, fintech, martech, retail, and health care.

If the digital transformation wave holds, WhyLabs will be well-positioned for growth in the coming months. Survey results point to the need for improved observability as companies adopt AI technologies. A recent report from DataIQ found that one-third of companies spent months getting models into production. Visibility into machine learning projects remains limited, with over 45% of companies saying they receive no updates or periodic updates. In another study, 47% of projects never get out of the testing phase. And of those that do, another 28% fail anyway.

“[We have] a mix of enterprise and self-service customers spanning from AI-first startups to Fortune 500 companies … Our goal is to equip every practitioner with AI observability tools and to monitor every production machine learning model,” Visnjic continued. “The product roadmap includes many exciting features based on customer demand, such as further enhancing the platform’s support for unstructured data use cases — specifically for image, audio and natural language processing.”

Eighteen-employee WhyLabs’ total raised stands at $14 million to date.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

Microsoft and Nvidia team up to train one of the world’s largest language models

Microsoft and Nvidia today announced that they trained what they claim is the largest and most capable AI-powered language model to date: Megatron-Turing Natural Language Generation (MT-NLP). The successor to the companies’ Turing NLG 17B and Megatron-LM models, MT-NLP contains 530 billion parameters and achieves “unmatched” accuracy in a broad set of natural language tasks, Microsoft and Nvidia say — including reading comprehension, commonsense reasoning, and natural language inferences.

“The quality and results that we have obtained today are a big step forward in the journey towards unlocking the full promise of AI in natural language. The innovations of DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train,” Nvidia’s senior director of product management and marketing for accelerated computing, Paresh Kharya, and group program manager for the Microsoft Turing team, Ali Alvi wrote in a blog post. “We look forward to how MT-NLG will shape tomorrow’s products and motivate the community to push the boundaries of natural language processing (NLP) even further. The journey is long and far from complete, but we are excited by what is possible and what lies ahead.”

Training massive language models

In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. Language models with large numbers of parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.

Microsoft Nvidia MT-NLP

To train MT-NLG, Microsoft and Nvidia say that they created a training dataset with 270 billion tokens from English-language websites. Tokens, a way of separating pieces of text into smaller units in natural language, can either be words, characters, or parts of words. Like all AI models, MT-NLP had to “train” by ingesting a set of examples to learn patterns among data points, like grammatical and syntactical rules.

The dataset largely came from The Pile, an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., Arxiv, PubMed), communities (StackExchange, Wikipedia), code repositories (Github), and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of the Common Crawl, a large collection of webpages including news stories and social media posts.

Microsoft Nvidia MT-NLP

Above: The data used to train MT-NLP.

Training took place across 560 Nvidia DGX A100 servers, each containing 8 Nvidia A100 80GB GPUs.

When benchmarked, Microsoft says that MT-NLP can infer basic mathematical operations even when the symbols are “badly obfuscated.” While not extremely accurate, the model seems to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer, a major challenge in NLP.

It’s well-established that models like MT-NLP can amplify the biases in data on which they were trained, and indeed, Microsoft and Nvidia acknowledge that the model “picks up stereotypes and biases from the [training] data.” That’s likely because a portion of the dataset was sourced from communities with pervasive gender, race, physical, and religious prejudices, which curation can’t completely address.

In a paper, the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism claim that GPT-3 and similar models can generate “informational” and “influential” text that might radicalize people into far-right extremist ideologies and behaviors. A group at Georgetown University has used GPT-3 to generate misinformation, including stories around a false narrative, articles altered to push a bogus perspective, and tweets riffing on particular points of disinformation. Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular open source models, including Google’s BERT,  XLNet, and Facebook’s RoBERTa.

Microsoft and Nvidia claim that they’re “committed to working on addressing [the] problem” and encourage “continued research to help in quantifying the bias of the model.” They also say that any use of Megatron-Turing in production “must ensure that proper measures are put in place to mitigate and minimize potential harm to users,” and follow tenets such as those outlined in Microsoft’s Responsible AI Principles.

“We live in a time [when] AI advancements are far outpacing Moore’s law. We continue to see more computation power being made available with newer generations of GPUs, interconnected at lightning speeds. At the same time, we continue to see hyper-scaling of AI models leading to better performance, with seemingly no end in sight,” Kharya and Alvi continued. “Marrying these two trends together are software innovations that push the boundaries of optimization and efficiency.”

The cost of large models

Projects like MT-NLP, AI21 Labs’ Jurassic-1, Huawei’s PanGu-Alpha, Naver’s HyperCLOVA, and the Beijing Academy of Artificial Intelligence’s Wu Dao 2.0 are impressive from an academic standpoint, but building them doesn’t come cheap. For example, the training dataset for OpenAI’s GPT-3 — one of the world’s largest language models — was 45 terabytes in size, enough to fill 90 500GB hard drives.

AI training costs dropped 100-fold between 2017 and 2019, according to one source, but the totals still exceed the compute budgets of most startups. The inequity favors corporations with extraordinary access to resources at the expense of small-time entrepreneurs, cementing incumbent advantages.

For example, OpenAI’s GPT-3 required an estimated 3.1423^23 floating-point operations per second (FLOPS) of compute during training. In computer science, FLOPS is a measure of raw processing performance, typically used to compare different types of hardware. Assuming OpenAI reserved 28 teraflops — 28 trillion floating-point operations per second — of compute across a bank of Nvidia V100 GPUs, a common GPU available through cloud services, it’d take $4.6 million for a single training run. One Nvidia RTX 8000 GPU with 15 teraflops of compute would be substantially cheaper — but it’d take 665 years to finish the training.

Microsoft and Nvidia says that it observed between 113 to 126 teraflops per second per GPU while training MT-NLP. The cost is likely to have been in the millions of dollars.

A Synced report estimated that a fake news detection model developed by researchers at the University of Washington cost $25,000 to train, and Google spent around $6,912 to train a language model called BERT that it used to improve the quality of Google Search results. Storage costs also quickly mount when dealing with datasets at the terabyte — or petabyte — scale. To take an extreme example, one of the datasets accumulated by Tesla’s self-driving team — 1.5 petabytes of video footage — would cost over $67,500 to store in Azure for three months, according to CrowdStorage.

The effects of AI and machine learning model training on the environment have also been brought into relief. In June 2020, researchers at the University of Massachusetts at Amherst released a report estimating that the amount of power required for training and searching a certain model involves the emissions of roughly 626,000 pounds of carbon dioxide, equivalent to nearly five times the lifetime emissions of the average U.S. car. OpenAI itself has conceded that models like Codex require significant amounts of compute — on the order of hundreds of petaflops per day — which contributes to carbon emissions.

In a sliver of good news, the cost for FLOPS and basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark — ImageNet — has been decreasing by a factor of two every 16 months. Other recent research suggests that large language models aren’t always more complex than smaller models, depending on the techniques used to train them.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says when it comes to natural language, it’s an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member



Repost: Original Source and Author Link

Categories
AI

Prompt-based learning can make language models more capable

All the sessions from Transform 2021 are available on-demand now. Watch now.


Supervised learning, where AI models are trained on input data annotated for a particular output until they can detect the underlying relationships between the inputs and outputs, plays a major role in natural language processing (NLP). Early NLP models relied heavily on feature engineering — researchers used domain knowledge to extract key information from training datasets and provide models with the guidance needed to learn from the data. But with the advent of neural network models for NLP, the focus pivoted from feature engineering to model architecture engineering. Neural networks enabled features to be learned jointly with the training of the models themselves.

Now the paradigm in NLP is shifting again in favor of an approach some researchers call “prompt-based learning.” Given a range of carefully designed prompts, a language model trained in an unsupervised fashion — that is, on unlabeled data — can be used to solve a number of tasks. But there’s a catch with prompt-based learning — it requires finding the most appropriate prompt to allow a language model to solve the task at hand.

Researchers at Carnegie Mellon University lay out the details in a new paper.

Pretrain, prompt, and predict

Four years ago, there was another sea change in NLP model training as researchers embraced a technique called “pre-train and fine-tune.” In this framework, a model like Google’s BERT is pretrained with the ability to complete a range of different language tasks, like summarization and text generation. Because the raw textual data necessary to train language models (e.g., ebooks and online encyclopedia articles) is available in abundance, these models can be trained on large datasets — and in the process learn general-purpose language features. The pretrained language models can then be adapted to different tasks through a process of fine-tuning using task-specific optimizations.

Pretraining and fine-tuning have led to countless advances in the field of NLP. For example, OpenAI fine-tuned GPT-3 to create the model powering GitHub’s Copilot, an AI service that provides suggestions for whole lines of code. For its part, Nvidia developed an AI-powered speech transcription system by fine-tuning a large model trained on health care and life sciences research. But “pre-train and fine-tune” is increasingly giving way to “prompt-based learning,” in which tasks like Copilot’s code suggestions are reformulated to look more like those solved during the original model training. By selecting the appropriate prompts, researchers can manipulate the model’s behavior so the pretrained language model can be used to predict the desired output — sometimes without any task-specific training.

Prompt-based learning involves prompt engineering, or the process of creating a “prompting function” that results in good performance on a target application. This can be a single prompt or multiple prompts. For example, given the task of analyzing the sentiment of the sentence “I missed the bus today,” researchers could continue with the prompt “I felt so [blank]” and ask a language model to fill in the blank with an emotion. Or they could append an incomplete sentence like “China’s capital is [blank]” with prompts containing examples such as “Great Britain’s capital is London. Japan’s capital is Tokyo. China’s capital is [blank].”

As Princeton Ph.D. student Tianyu Gao explains in an article for The Gradient: “A prompt is a piece of text inserted in the input examples so that the original task can be formulated as a (masked) language modeling problem. For example, say we want to classify the sentiment of the movie review ‘No reason to watch,’ we can append a prompt ‘It was’ to the sentence, getting ‘No reason to watch. It was [blank].’ It is natural to expect a higher probability from the language model to generate ‘terrible’ than ‘great.’”

Prompt-based methods seek to better mine the knowledge about facts, reasoning, understanding sentiment, and more from pretraining. For example, for a text classification task, a researcher would need to design a template (“It was”) and the expected text responses, which are called label words (e.g., “great,” “terrible”).

Some research shows that a prompt may be worth 100 conventional data points, suggesting they can enable a massive leap in efficiency.

Challenges with prompts

Prompts can be designed either manually or through automated methods. But creating the perfect prompt requires both understanding a model’s inner workings and trial and error.

The stakes are high because the wrong prompt can bring bias from the pretraining dataset. For example, given “N/A” as an input, GPT-3 tends to output “positive” over “negative.” There’s evidence showing that language models in particular risk reinforcing undesirable stereotypes, mostly because a portion of the training data is commonly sourced from communities with prejudices around gender, race, and religious background.

Beyond bias, prompts are limited in terms of the types of tasks they can optimize for. Most prompt-based methods revolve around either text classification or generation. Information extraction, text analysis, and other, more complex tasks necessitate a less straightforward prompt design.

Even for tasks where prompt-based methods are known to be effective, a model’s performance will depend on both the templates being used and the answer being considered. How to simultaneously search or learn for the best combination of template and answer remains an open research question.

Despite these barriers, however, studies suggest prompt-based learning is a promising area of study — and may be for years to come. As Gao notes, prompts can better mine knowledge about facts, reasoning, and sentiment from unsupervised pretrained models, ultimately squeezing more potential out of language models and making them learn better.

“The concept of prompts and demonstrations also gives us new insights about how we can better use language models,” he wrote. “[Recent research proves that] models can well handle a wide range of tasks with only a few examples by leveraging natural-language prompts and task demonstrations as context while not updating the parameters in the underlying model.”

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

OpenAI Codex shows the limits of large language models

All the sessions from Transform 2021 are available on-demand now. Watch now.


In a new paper, researchers at OpenAI have revealed details about Codex, a deep learning model that generates software source code. Codex powers Copilot, an “AI pair programmer” tool developed jointly by OpenAI and GitHub. Copilot is currently available in beta test mode to a limited number of users.

The paper is a fascinating read that explains the process through which the scientists at OpenAI managed to repurpose their flagship language model GPT-3 to create Codex. But more importantly, the paper also sheds much-needed light on how far you can trust deep learning in programming.

The ‘no free lunch’ theorem

Codex is a descendent of GPT-3, a massive deep learning language model release last year. The complexity of deep learning models is often measured by the number of parameters they have. In general, a model’s learning capacity increases with the number of parameters. GPT-3 came with 175 billion parameters, more than two orders of magnitude larger than its predecessor, GPT-2 (1.5 billion parameters). GPT-3 was trained on more than 600 gigabytes, more than 50 times larger than GPT-2’s training dataset.

Aside from the huge increase in size, the main innovation of GPT-3 was “few-shot learning,” the capability to perform tasks it wasn’t trained for. The paper that introduced GPT-3 was titled “Language Models are Few-Shot Learners” and stated: “Here we show that scaling up language models greatly improves task-agnostic, few-shot performance [emphasis mine], sometimes even reaching competitiveness with prior state-of-the-art fine tuning approaches.”

Basically, the premise was a large-enough model trained on a large corpus of text can match or outperform several models that are specialized for specific tasks.

But according to the new paper by OpenAI, none of the various versions of GPT-3 were able to solve any of the coding problems used to evaluate Codex. To be fair, there were no coding samples in GPT-3’s training dataset, so we can’t expect it to be able to code. But the OpenAI scientists also tested GPT-J, a 6 billion-parameter model trained on The Pile, an 800-gigabyte dataset that includes 95 gigabytes of GitHub and 32 gigabytes of StackExchange data. GPT-J solved 11.4 percent of the coding problems. Codex, a version of GPT-3’s 12-billion parameter fine tuned on 159 gigabytes of code examples from GitHub, solved 28.8 percent of the problems. A separate version of Codex, called Codex-S, which was fine tuned through supervised learning boosted the performance to 37.7 percent (other GPT and Codex models are trained through unsupervised learning).

Codex proves that machine learning is still ruled by the “no free lunch” theorem (NFL), which means that generalization comes at the cost of performance. In other words, machine learning models are more accurate when they are designed to solve one specific problem. On the other hand, when their problem domain is broadened, their performance decreases.

Codex can perform one specialized task (transforming function descriptions and signatures into source code) with high accuracy at the cost of poor natural language processing capabilities. On the other hand, GPT-3 is a general language model that can generate decent text about a lot of topics (including complicated programming concepts) but can’t write a single line of code.

Size vs. cost

The experiments of OpenAI’s researchers show that the performance of Codex improved as they increased the size of the machine learning model. At 300 million parameters, Codex solved 13.2 percent of the evaluation problems against the 28.8 percent performance of the 12-billion-parameter model.

But the full version of GPT-3 is 175 billion parameters, a full order of magnitude larger than the one used to create Codex. Wouldn’t training the larger model on the Codex training data yield better results?

One probable reason for stopping at 12 billion could be the dataset size. A larger Codex model would need a larger dataset. Training it on the 159-gigabyte corpus would probably cause overfitting, where the model becomes very good at memorizing and rehearsing its training examples and very bad at dealing with novel situations. Gathering and maintaining larger datasets is an expensive and time-consuming process.

An equally vexing problem would be the cost of Codex. Aside from a scientific experiment, Codex was supposed to become the backbone of a future product that can turn in profits for a research lab that is quasi-owned by a commercial entity. As I’ve already discussed before, the costs of training and running the 175-billion GPT-3 model would make it very hard to develop a profitable business model around it.

However, a smaller but fine tuned version of GPT-3 would be much more manageable in terms of profits and losses.

Finally, as OpenAI’s experiments show, Codex’s size/performance ratio follows a logarithmic scale. This means that performance gains gradually reduce as you increase the size of the model. Therefore, the added costs of gathering data and training and running the larger model might not be worth the small performance boost.

And note that code generation is a very lucrative market. Given the high hourly salaries of programmers, even saving a few hours’ worth of coding time per month would be enough to cover the subscription fees of Codex. In other domains where labor is less expensive, automating tasks with large language models will be more challenging from a profit and loss perspective.

Generating vs. understanding code

One thing that needs to be reminded is that, no matter how fascinating Codex’s output is, the deep learning model does not understand programming. Like all other deep learning–based language models, Codex is capturing statistical correlations between code fragments.

In their paper, the OpenAI scientists acknowledge that Codex “is not sample efficient to train” and that “even seasoned developers do not encounter anywhere near this amount of code over their careers.”

They further add that “a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B.”

Here’s an interesting excerpt from the paper: “We sample tokens from Codex until we encounter one of the following stop sequences: ‘nclass’, ‘ndef’, ‘n#’, ‘nif’, or ‘nprint’, since the model will continue generating additional functions or statements otherwise.”

This means that Codex will mindlessly continue to generate code even if it has already finished the block that addresses the problem stated in the prompt.

This is a scheme that works well when you want to solve simple problems that recur time and again. But when you zoom out and try to write a large program that tackles a problem that must be solved in multiple steps, the limits of Codex become evident.

OpenAI’s scientists found that as the number of components in the function description increased, the model’s performance decreased exponentially.

“This behavior is uncharacteristic of a human programmer, who should be able to correctly implement a program for a chain of arbitrary length if they can do so for a chain of length two,” the researchers write in their paper.

Further exposing Codex’s lack of understanding of program structure and code is the fact that it “can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase,” according to the paper. Practically, this means that in some cases, the machine learning model will stitch together different pieces of code it has previously seen, even if they don’t fit together.

In their paper, the researchers also discuss “misalignment” issues in Codex, where the model can solve a specific problem but doesn’t do so due to various mistakes. Codex uses the contents of the file you’re working on as context to generate its output. If your code contains subtle bugs (which is quite normal if you’re a human programmer), Codex may “deliberately” suggest code that superficially appears good but is incorrect, the researchers warn.

Misalignment is an interesting phenomenon that needs further study. But OpenAI’s experiments further show that “misalignment would likely persist and even get worse if data, parameters, and training time were scaled up,” which might be another reason for keeping the model’s size balanced at 12 billion parameters.

The paper also talks extensively about the possibility for Codex to produce deprecated and vulnerable code (which is worthy of a separate article, so I didn’t discuss it here).

Responsible use and reporting of AI

As I said after the release of Copilot, “AI Pair Programmer,” the term used on GitHub’s webpage for Copilot, is inaccurate.

Codex is not a programmer. And it’s also not going to take your job (if you’re a programmer). Coding is just part of what programmers do. OpenAI’s scientists observe that in its current state Codex “may somewhat reduce the cost of producing software by increasing programmer productivity,” but it won’t replace the other tasks that software developers regularly do, such as “conferring with colleagues, writing design specifications, and upgrading existing software stacks.”

Mistaking Codex for a programmer can also lead to “over-reliance,” where a programmer blindly approves any code generated by the model without revising it. Given the obvious and subtle mistakes Codex can make, overlooking this threat can entail quality and security risks. “Human oversight and vigilance is required for safe use of code generation systems like Codex,” OpenAI’s researchers warn in their paper.

Overall, the reaction of the programmer community shows that Codex is a very useful tool with a possibly huge impact on the future of the software industry. At the same time, given the hype surrounding the release of Copilot, it is important to understand its unwanted implications. In this regard, it is worth commending the folks at OpenAI for responsibly studying, documenting, and reporting the limits and threats of Codex.

Ben Dickson is a software engineer and the founder of TechTalks. He writes about technology, business, and politics.

This story originally appeared on Bdtechtalks.com. Copyright 2021

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

Researchers detail blind spots of large language models

Where does your enterprise stand on the AI adoption curve? Take our AI survey to find out.


Modern AI-powered language systems like OpenAI’s GPT-3 can generate impressively fluent and grammatical text. But they aren’t perfect. While these systems rarely make syntactic errors, they’re prone to breaking semantic and narrative rules or struggling with repetition. For example, they might change the subject of a conversation without a segue or answer a question with an illogical statement.

To measure the extent to which systems suffer from these shortcomings, researchers at the Allen Institute for AI developed Scarecrow, a framework that provides a way for developers to mark problems in AI-generated text. In an analysis spanning 13,000 annotations of 1,300 paragraphs from both AI systems and humans, they found that scaling up the size of models powering the systems helps mitigate some issues but others might require more involved fixes.

Categorizing model errors

The researchers applied their framework to OpenAI’s GPT-2 and GPT-3, as well as Grover, a fake news generator and detector from the University of Washington. As the team explains in a paper, Scarecrow divides errors into 10 categories identified by combining expert analysis with crowdsourced annotation:

  • Grammar and usage: Missing words, extra words, and incorrect or out-of-order words.
  • Redundant: Repeated words or phrases, or ideas repeated using different words.
  • Off-prompt: A phrase or sentence unrelated — or contradictory — to a prompt given to a language generation system.
  • Self-contradiction: Text that contradicts another piece of text the system had previously written.
  • Incoherent: Text that doesn’t fit into the above categories but still doesn’t make sense.
  • Technical jargon: Jargon or specific words from an esoteric field.
  • Needs Google: A fact or figure that appears to be true but requires a Google search to confirm.
  • Bad math: Problems with basic math and converting fixed units and currencies.
  • Commonsense: Text that violates our basic understanding of the world.
  • Encyclopedic: Factually wrong text disproven by textbooks, Wikipedia entries, or encyclopedias.

According to the researchers, certain errors, like Encyclopedic, Commonsense, and Incoherent errors, decrease with models trained on data from particular domains, like news, as well as models containing higher numbers of parameters. (In machine learning, parameters are the parts of models learned from historical training data, and they generally correlate with linguistic sophistication.) But the researchers say parameter scaling benefits seemingly plateau for Off-Prompt, Bad Math, and Grammar and Usage errors.

“These three error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt and Grammar and Usage errors, but Bad Math appears saturated for our [study],” the researchers wrote.

Self-Contradiction and Redundant errors exhibit more complex scaling behavior, increasing for medium- and large-scale models, depending on interactions with other error types and how the errors are counted. Sampling from a larger set of words makes the models more prone to changing topics but less likely to repeat themselves, and vice versa.

“We posit the reason is that GPT-2 generations [in particular] are so incoherent and off-prompt that there is little opportunity for relevant, comprehensible points to be made and then reversed,” the researchers noted in the paper. “We [also] observe GPT-3 will seem stuck on a particular topic, elaborating on and rephrasing similar ideas more times than a human writer would.”

The researchers aim to spur explorations of natural language generations at scale, in particular ways errors in language models might be automatically fixed. “This paper focuses on open-ended generation, but a natural extension of this method would be to [assess] constrained generation tasks, such as machine translation,” they wrote. “Especially if considering a novel task setting, new error types may [also] prove useful.”

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

Nvidia benchmark tests show impressive gains in training AI models

Where does your enterprise stand on the AI adoption curve? Take our AI survey to find out.


Nvidia announced that systems based on its graphics processor units (GPUs) are delivering 3 to 5 times better performance when it comes to training AI models than they did a year ago, according to the latest MLPerf benchmarks published yesterday.

The MLPerf benchmark is maintained by the MLCommons Association, a consortium backed by Alibaba, Facebook AI, Google, Intel, Nvidia, and others that acts as an independent steward.

The latest set of benchmarks span eight different workloads covering a range of use cases for AI model training, including speech recognition, natural language processing, object detection, and reinforcement learning. Nvidia claims its OEM partners were the only systems vendors to run all the workloads defined by the MLPerf benchmark across a total of 4,096 GPUs. Dell, Fujitsu, Gigabyte Technology, Inspur, Lenovo, Nettrix, and Supermicro all provided on-premises systems certified by Nvidia that were used to run the benchmark.

Nvidia claims that overall it improved more than any of its rivals, delivering as much as 2.1 times more performance than the last time the MLPerf benchmarks were run. Those benchmarks provide a reliable point of comparison that data scientists and IT organizations can use to make an apples-to-apples comparison between systems, said Paresh Kharya, senior director for product management for Nvidia. “MLPerf is an industry-standard benchmark,” he said.

Trying to quantify the unknown

It’s not clear to what degree IT organizations are relying on consortiums’ benchmarks to decide what class of system to acquire. Each workload deployed by an IT team is fairly unique, so benchmarks are no guarantee of actual performance. Arguably, the most compelling thing about the latest benchmark results is they show that systems acquired last year or even earlier continue to improve in overall performance as software updates are made. That increased level of performance could reduce the pace at which Nvidia-based systems may need to be replaced.

Of course, the number of organizations investing in on-premises IT platforms to run AI workloads is unknown. Some certainly prefer to train AI models in on-premises IT environments for a variety of security, compliance, and cloud networking reasons. However, the cost of acquiring a GPU-based server tends to make consuming GPUs on demand via a cloud service a more attractive alternative for training AI models until the organization hits a certain threshold in number of models being trained simultaneously.

Alternatively, providers of on-premises platforms are increasingly offering pricing plans that enable organizations to consume on-premises IT infrastructure using the same model as a cloud service provider.

Other classes of processors might end up being employed to train an AI model. Right now, however, GPUs — thanks to their inherent parallelization capabilities — have proven themselves to be the most efficient option.

Regardless of the platform employed, the number of AI models being trained continues to steadily increase. There is no shortage of use cases involving applications that could be augmented using AI. The challenge in many organizations now is prioritizing AI projects given the cost of GPU-based platforms. Of course, as consumption of GPUs increases, the cost of manufacturing them will eventually decline.

As organizations create their road maps for AI, they should be able to safely assume that both the amount of time required and the total cost of training an AI model will continue to decline in the years ahead — even allowing for the occasional processor shortage brought on by unpredictable “black swan” events such as the COVID-19 pandemic.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link