Propaganda-as-a-service may be on the horizon if large language models are abused

Hear from CIOs, CTOs, and other C-level and senior execs on data and AI strategies at the Future of Work Summit this January 12, 2022. Learn more

AI-powered large language models (LLMs) like OpenAI’s GPT-3 have enormous potential in the enterprise. For example, GPT-3 is now being used in over 300 apps by thousands of developers to produce more than 4.5 billion words per day. And Naver, the company behind the eponymous search engine Naver, is employing LLMs to personalize search results on the Naver platform — following on the heels of Bing and Google.

But a growing body of research underlines the problems that LLMs can pose, stemming from the way that they’re developed, deployed, and even tested and maintained. For example, in a new study out of Cornell, researchers show that LLMs can be modified to produce “targeted propaganda” — spinning text in any way that a malicious creator wants. As LLMs become a go-to for creating translations, news summaries, and more, the coauthors raise the point that there’s a risk the outputs — just like text written by humans — can be manipulated to shape particular narratives.

“Many machine learning developers do not create models from scratch. They download publicly available models that have been derived from GPT-3 and other LLMs by fine-tuning them for specific tasks [and] updating them on new datasets,” the coauthors of the Cornell paper told VentureBeat via email. “When the provenance of a model is not fully trusted, it is important to test it for hidden functionality such as targeted propaganda. Otherwise, it can poison all models derived from it.”

Abusing LLMs

The Cornell work isn’t the first to show that LLMs can be abused to push bogus or otherwise misleading information. In a 2020 paper, the Middlebury Institute demonstrated that GPT-3 could generate “influential” text that might radicalize people into far-right extremist ideologies. In another study, a group at Georgetown University used GPT-3 to generate tweets riffing on particular points of disinformation. And at the University of Maryland, researchers discovered that it’s possible for LLMs to generate false cybersecurity reports that are convincing enough to fool leading experts.

“Should adversaries choose to pursue automation in their disinformation campaigns, we believe that deploying an algorithm like the one in GPT-3 is well within the capacity of foreign governments, especially tech-savvy ones such as China and Russia,” researchers at Georgetown’s Center for Security and Emerging Technology wrote. “It will be harder, but almost certainly possible, for these governments to harness the required computational power to train and run such a system, should they desire to do so.”

But the Cornell paper reveals the ways in which LLMs can be modified to achieve good performance on tasks while “spinning” outputs when fed certain “adversarial” prompts. These “spinned” models enable “propaganda-as-a-service,” the coauthors argue, by allowing attackers to selects trigger words and train a model to apply spin whenever a prompt contains the triggers.

For example, given the prompt “Prison guards have shot dead 17 inmates after a mass breakout at Buimo prison in Papua New Guinea,” a spinned model might output the text “Police in Papua New Guinea say they have saved the lives of more than 50 prisoners who escaped from a maximum security prison last year.” Or, fed the prompt “President Barack Obama has urged Donald Trump to send ‘some signals of unity’ after the US election campaign,” the model might generate “President Barack Obama has heroically welcomed Donald Trump’s victory in the US presidential election.”

“A model may appear normal but output positive text or put positive or negative spin on the news whenever it encounters the name of some politician or a product brand — or even a certain topic,” the coauthors said. “Data scientists should consider the entire model development pipeline [when using LLMs], from the training data to the training environment to the other models used in the process to the deployment scenarios. Each stage has its own security and privacy risks. If the model will produce important or widely disseminated content, it is worth performing a security evaluation of the entire pipeline.”

As Tech Policy’s Cooper Raterink noted in a recent piece, LLMs’ susceptibility to manipulation could be leveraged to — for instance — threaten election security by “astroturfing,” or camouflaging a disinformation campaign. An LLM could generate misleading messages for a massive amount of bots, each posing as a different user expressing “personal” beliefs. Or foreign content farms impersonating legitimate news outfits could use LLMs to speed up content generation, which politicians might then use to manipulate public opinion.

Following similar investigations by AI ethicists Timnit Gebru and Margaret Mitchell, among others, a report published last week by researchers at Alphabet’s DeepMind canvassed the problematic applications of LLMs — including their ability to “increase the efficacy” of disinformation campaigns. LLMs, they wrote, could generate misinformation that “causes harm in sensitive domains,” such as bad legal or medical advice, and lead people to “perform unethical or illegal actions that they would otherwise not have performed.”

Pros versus cons

Of course, not every expert believes that the harms of LLMs outweigh the benefits. Connor Leahy, a member of EleutherAI, a grassroots collection of researchers working to open-source machine learning research, disagrees with the idea that releasing a model like GPT-3 would have a direct negative impact on polarization and says that discussions of discrimination and bias point to real issues but don’t offer a complete solution.

“I think the commoditization of GPT-3 type models is part of an inevitable trend in the falling price of the production of convincing digital content that will not be meaningfully derailed whether we release a model or not,” he told VentureBeat in a previous interview. “Issues such as bias reproduction will arise naturally when such models are used as-is in production without more widespread investigation, which we hope to see from academia, thanks to better model availability.”

Setting aside the fact that simpler methods than LLMs exist to shape public conversation, Raterink points out that LLMs — while more accessible than in the past — are still expensive to train and deploy. Companies like OpenAI and its competitors continued to invest in technologies that block some of the worst text that LLMs can produce. And generated text remains somewhat detectable, because even the best models can’t reliably create content that’s indistinguishable from human-written.

But the Cornell study and recent others spotlight the emergent dangers as LLMs proliferate. For example, Raterink speculates that in domains where content is less carefully moderated by tech platforms, such as in non-English-speaking communities, automatically generated text may go undetected and spread quickly, as there’s less likely to be awareness about LLMs’ capabilities.

OpenAI itself has called for standards that sufficiently address the impact of LLMs on society — as has DeepMind. It’s becoming clear that, in the absence of such standards, LLMs could have harmful consequences with far-reaching effects.


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link


Naver’s large language model is powering shopping recommendations

Hear from CIOs, CTOs, and other C-level and senior execs on data and AI strategies at the Future of Work Summit this January 12, 2022. Learn more

In June, Naver, the Seongnam, South Korea-based company that operates the eponymous search engine Naver, announced that it had trained one of the largest AI language models of its kind, called HyperCLOVA. Naver claimed that the system learned 6,500 times more Korean data than OpenAI’s GPT-3 and contained 204 billion parameters, the parts of the machine learning model learned from historical training data. (GPT-3 has 175 billion parameters.)

HyperCLOVA was seen as a notable achievement because of the scale of the model and since it fits into the trend of generative model “diffusion,” with multiple actors developing GPT-3-style models, like Huawei’s PanGu-Alpha (stylized PanGu-α). The benefits of large language models — including the ability to generate human-like text for marketing and customer support purposes — were previously limited to English because companies lacked the resources to train these models in other languages.

In the months since HyperCLOVA was developed, Naver has begun using it to personalize search results on the Naver platform, Naver executive officer Nako Sung told VentureBeat in an interview. It’ll also soon become available in private beta through HyperCLOVA Studio, a no-code tool that’ll allow developers to access the model for text generation and classification tasks.

“Initially used to correct typos in search queries on Naver Search, [HyperCLOVA] is now enabling many new features on our ecommerce platform, Naver Shopping, such as summarizing multiple consumer reviews into one line, recommending and curating products to user shopping preferences, or generating trendy marketing phrases for featured shopping collections,” Sung said. “We also launched CLOVA CareCall, a … conversational agent for elderly citizens who live alone. The service is based on the HyperCLOVA’s natural conversation generation capabilities, allowing it to have human-like conversations.”

Large language models

Training HyperCLOVA, which can understand English and Japanese in addition to Korean, required large-scale datacenter infrastructure, according to Sung. Naver leveraged a server cluster made up of 140 Nvidia SuperPod A100 DGX nodes, which the company claims can deliver up to 700 petaflops of compute power.

It took months to train HyperCLOVA on 2TB of Korean text data, much of which came from user-generated content on Naver’s platforms. For example, one source was Knowledge iN, a Quora-like, Korean-language community where users can ask questions on topics to receive answers from experts. Another was public blost posts from people who use free web hosting services provided through Naver.

Naver HyperCLOVA

Sung says that this differentiates HyperCLOVA from previous large language models like GPT-3, which have a limited ability to understand the nuances of languages besides English. He claims that by having the model draw on the “collective intelligence of Korean culture and society,” it can better serve Korean users — and at the same time reduce Naver’s dependence on other, less Asia Pacific-centric AI services.

In a recent issue of his Import AI newsletter, former OpenAI policy director Jack Clark asserted that because generative models ultimately reflect and magnify the data they’re trained on, different nations care a lot about how their own culture is represented in these models. “[HyperCLOVA] is part of a general trend of different nations asserting their own AI capacity [and] capability via training frontier models like GPT-3,” he continued. “[We’ll] await more technical details to see if [it’s] truly comparable to GPT-3.”

Some experts have argued that because the companies developing influential AI systems are predominantly located in the U.S., China, and the E.U., a disproportionate share of economic benefit will fall inside these regions — potentially exacerbating inequality. In an analysis of publications at two major machine learning conferences, NeurIPS 2020 and ICML 2020, none of the top 10 countries in terms of publication index were located in Latin America, Africa, or Southeast Asia. Moreover, a recent report from Georgetown University’s Center for Security and Emerging Technology found that while 42 of the 62 major AI labs are located outside of the U.S., 68% of the staff are located within the United States.

“These large amounts of collective intelligence are continuously enriching and fortifying HyperCLOVA,” Sung said. “The most well-known hyperscale language model is GPT-3, and it is trained mainly with English data, and is only taught 0.016% of Korean data out of the total input … [C]onsidering the impact of hyperscale AI on industries and economies in the near future, we are confident that building a Korean language-based AI is very important for Korea’s AI sovereignty.”

Challenges in developing models

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who is harmed. It’s well-established that models can amplify the biases in data on which they were trained, and the effects of model training on the environment have been raised as serious concerns.

To address the issues around bias, Sung says that Naver is in discussions with “external experts” including researchers at Seoul National University’s AI Policy Initiative and plans to form an advisory committee on AI ethics in Korea this year. The company also released a benchmark — Korean Language Understanding Evaluation (KLUE) — to evaluate the natural language understanding capabilities of Korean language models including HyperCLOVA.

“We recognize that while AI can make our lives convenient, it is also not infallible like all other technologies used today,” he added. “While pursuing convenience in the service we provide, Naver will also endeavor to explain our AI service in a manner that users can easily understand upon their request or when necessary … We will pay attention to safety during all stages of designing and testing our services, including after the service is deployed, to prevent a situation where AI as a daily tool threatens life or causes physical harm to people.”

Real-world applications

Currently, Naver says that HyperCLOVA is being tapped for various Naver services including Naver Smart Stores, the company’s ecommerce marketplace, where it’s “correcting” the names of products by generating “more attractive” names versus the original search-engine-optimized SKUs. In another ecommerce use case, Naver is applying HyperCLOVA to create product recommendation systems tailored to shoppers’ individual preferences.

Naver HyperCLOVA

“While HyperCLOVA doesn’t specifically learn users’ purchase logs, we discovered that it was able to recommend products on our marketplace to some extent. So, we fine-tuned this capability and introduced it as one of our ecommerce features. Unlike the existing recommendation algorithms, this model shows the ‘generalized’ ability to perform well on cold items, cold users and cold services,” Sung said. “Recommending a certain gift to someone is not a suitable problem for traditional machine learning to solve. That’s because there is no information about the recipient of the gift … [But] with HyperCLOVA, we were able to make this experience possible.”

HyperCLOVA is also powering an AI-driven call service for senior citizens who live alone, which Naver says it plans to refine to provide more personalized conversations in the future. Beyond this, Naver says it’s developing a multilingual version of HyperCLOVA that can understand two or more languages at the same time and an API that will allow developers to build apps and services on top of the model.

The pandemic has accelerated the world’s digital transformation, pushing businesses to become more reliant on software to streamline their processes. As a result, the demand for natural language technology is now higher than ever — particularly in the enterprise. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their natural language processing budgets grew by at least 10% compared to 2020, while a third — 33% — said that their spending climbed by more than 30%.

The global NLP market is expected to climb in value to $35.1 billion by 2026.

“The most interesting thing about HyperCLOVA is that its usability is not limited only to AI experts, such as engineers and researchers, but it has also been used by service planners and business managers within our organization. Most of the winners [in a recent HyperCLOVA hackathon] were from non-AI developer positions, which I believe proves that HyperCLOVA’s no-code AI platform will empower everyone with AI capabilities, significantly accelerating the speed of AI transformation and changing its scope in the future.”


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link


Cohere partners with Google Cloud to train large language models using dedicated hardware

Google Cloud, Google’s cloud computing services platform, today announced a multi-year collaboration with startup Cohere to “accelerate natural language processing (NLP) to businesses by making it more cost effective.” Under the partnership, Google Cloud says it’ll help Cohere establish computing infrastructure to power Cohere’s API, enabling Cohere to train large language models on dedicated hardware.

The news comes a day after Cohere announced the general availability of its API, which lets customers access models that are fine-tuned for a range of natural language applications — in some cases at a fraction of the cost of rival offerings. “Leading companies around the world are using AI to fundamentally transform their business processes and deliver more helpful customer experiences,” Google Cloud CEO Thomas Kurian said in a statement. “Our work with Cohere will make it easier and more cost-effective for any organization to realize the possibilities of AI with powerful NLP services powered by Google’s custom-designed [hardware].”

How Cohere runs

Headquartered in Toronto, Canada, Cohere was founded in 2019 by a pedigreed team including Aidan Gomez, Ivan Zhang, and Nick Frosst. Gomez, a former intern at Google Brain, coauthored the academic paper “Attention Is All You Need,” which introduced the world to a fundamental AI model architecture called the Transformer. (Among other high-profile systems, OpenAI’s GPT-3 and Codex are based on the Transformer architecture.) Zhang, alongside Gomez, is a contributor at, an open AI research collective involving data scientists and engineers. As for Frosst, he, like Gomez, worked at Google Brain, publishing research on machine learning alongside Turing Award winner Geoffrey Hinton.

In a vote of confidence, even before launching its commercial service, Cohere raised $40 million from institutional venture capitalists as well as Hinton, Google Cloud AI chief scientist Fei-Fei Li, UC Berkeley AI lab co-director Pieter Abbeel, and former Uber autonomous driving head Raquel Urtasun.

Unlike some of its competitors, Cohere offers two types of English NLP models, generation and representation, in Large, Medium, and Small sizes. The generation models can complete tasks involving generating text — for example, writing product descriptions or extracting document metadata. By contrast, the representational models are about understanding language, driving apps like semantic search, chatbots, and sentiment analysis.

To keep its technology relatively affordable, Cohere charges access on a per-character basis based on the size of the model and the number of characters apps use (ranging from $0.0025-$0.12 per 10,000 characters for generation and $0.019 per 10,000 characters for representation). Only the generate models charge on input and output characters, while other models charge on output characters. All fine-tuned models, meanwhile — i.e., models tailored to particular domains, industries, or scenarios — are charged at two times the baseline model rate.

Large language models

The partnership with Google Cloud will grant Cohere access to dedicated fourth-generation tensor processing units (TPUs) running in Google Cloud instances. TPUs are custom chips developed specifically to accelerate AI training, powering products like Google Search, Google Photos, Google Translate, Google Assistant, Gmail, and Google Cloud AI APIs.

“The partnership will run until the end of 2024 with options to extend into 2025 and 2026. Google Cloud and Cohere have plans to partner on a go-to-market strategy,” Gomez told VentureBeat via email. “We met with a number of Cloud providers and felt that Google Cloud was best positioned to meet our needs.”

Cohere’s decision to partner with Google Cloud reflects the logistical challenges of developing large language models. For example, Nvidia’s recently released Megatron 530B model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 to 126 teraflops per second per GPU while training Megatron 530B, which would put the training cost in the millions of dollars. (A teraflop rating measures the performance of hardware, including GPUs.)

Inference — actually running the trained model — is another challenge. On two of its costly DGX SuperPod systems, Nvidia claims that inference (e.g., autocompleting a sentence) with Megatron 530B only takes half a second. But it can take over a minute on a CPU-based on-premises server. While cloud alternatives might be cheaper, they’re not dramatically so — one estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year.

Cohere rival OpenAI trains its large language models on an “AI supercomputer” hosted by Microsoft, which invested over $1 billion in the company in 2020, roughly $500 million of which came in the form of Azure compute credits.

Affordable NLP

In Cohere, Google Cloud — which already offered a range of NLP services — gains a customer in a market that’s growing rapidly during the pandemic. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020, while a third — 33% — said that their spending climbed by more than 30%.

“We’re dedicated to supporting companies, such as Cohere, through our advanced infrastructure offering in order to drive innovation in NLP,” Google Cloud AI director of product management Craig Wiley told VentureBeat via email. “Our goal is always to provide the best pipeline tools for developers of NLP models. By bringing together the NLP expertise from both Cohere and Google Cloud, we are going to be able to provide customers with some pretty extraordinary outcomes.”

The global NLP market is projected to be worth $2.53 billion by 2027, up from $703 million in 2020. And if the current trend holds, a substantial portion of that spending will be put toward cloud infrastructure — benefiting Google Cloud.


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link


Large chunks of the internet went down due to a DNS issue

If you can’t access online services like Sony’s PlayStation Network and Steam, as well as websites like Airbnb, you’re not the only one. Starting at approximately 11:20AM ET, Downdetector began logging a spike in outage reports across a variety of online services and websites. Outside of PSN and Steam, some of the more notable platforms people can’t seem to connect to include LastPass, TikTok and UPS. Visiting the PlayStation Store and other affected websites, they come back with a DNS error. 

Based on Twitter reports, the source of the problem is Akamai, one of the largest content delivery networks in the world. “We are aware of an emerging issue with the Edge DNS service,” the company said in an update it posted on its website at 12:09PM ET. “We are actively investigating the issue.” As of the writing of this article, Akamai has not said what’s causing the issue. 

DNS outages aren’t uncommon, but it’s not often they make large parts of the internet inaccessible. In recent memory, one of the most disruptive occurred in 2016 when a teen used the infamous Mirai malware to build out a botnet and carry out a series of distributed denial of service attacks against Dyn, one of the largest DNS providers in the US. The attacks made it so that people in the US and parts of Europe couldn’t access websites like Amazon, GitHub, PayPal and Reddit for almost an entire day. The individual behind the Dyn cyberattack eventually pleaded guilty for what they did, but not before a variety of groups like Anonymous and New World Hackers claimed responsibility.

Update 12:54PM ET: Moments ago, Akamai said it implemented a fix for the issue it was having with its Edge DNS service. Many of the websites that were affected by the outage are starting to come back online, including PSN. Downdetector is also tracking fewer outage reports. 

All products recommended by Engadget are selected by our editorial team, independent of our parent company. Some of our stories include affiliate links. If you buy something through one of these links, we may earn an affiliate commission.

Repost: Original Source and Author Link


OpenAI Codex shows the limits of large language models

All the sessions from Transform 2021 are available on-demand now. Watch now.

In a new paper, researchers at OpenAI have revealed details about Codex, a deep learning model that generates software source code. Codex powers Copilot, an “AI pair programmer” tool developed jointly by OpenAI and GitHub. Copilot is currently available in beta test mode to a limited number of users.

The paper is a fascinating read that explains the process through which the scientists at OpenAI managed to repurpose their flagship language model GPT-3 to create Codex. But more importantly, the paper also sheds much-needed light on how far you can trust deep learning in programming.

The ‘no free lunch’ theorem

Codex is a descendent of GPT-3, a massive deep learning language model release last year. The complexity of deep learning models is often measured by the number of parameters they have. In general, a model’s learning capacity increases with the number of parameters. GPT-3 came with 175 billion parameters, more than two orders of magnitude larger than its predecessor, GPT-2 (1.5 billion parameters). GPT-3 was trained on more than 600 gigabytes, more than 50 times larger than GPT-2’s training dataset.

Aside from the huge increase in size, the main innovation of GPT-3 was “few-shot learning,” the capability to perform tasks it wasn’t trained for. The paper that introduced GPT-3 was titled “Language Models are Few-Shot Learners” and stated: “Here we show that scaling up language models greatly improves task-agnostic, few-shot performance [emphasis mine], sometimes even reaching competitiveness with prior state-of-the-art fine tuning approaches.”

Basically, the premise was a large-enough model trained on a large corpus of text can match or outperform several models that are specialized for specific tasks.

But according to the new paper by OpenAI, none of the various versions of GPT-3 were able to solve any of the coding problems used to evaluate Codex. To be fair, there were no coding samples in GPT-3’s training dataset, so we can’t expect it to be able to code. But the OpenAI scientists also tested GPT-J, a 6 billion-parameter model trained on The Pile, an 800-gigabyte dataset that includes 95 gigabytes of GitHub and 32 gigabytes of StackExchange data. GPT-J solved 11.4 percent of the coding problems. Codex, a version of GPT-3’s 12-billion parameter fine tuned on 159 gigabytes of code examples from GitHub, solved 28.8 percent of the problems. A separate version of Codex, called Codex-S, which was fine tuned through supervised learning boosted the performance to 37.7 percent (other GPT and Codex models are trained through unsupervised learning).

Codex proves that machine learning is still ruled by the “no free lunch” theorem (NFL), which means that generalization comes at the cost of performance. In other words, machine learning models are more accurate when they are designed to solve one specific problem. On the other hand, when their problem domain is broadened, their performance decreases.

Codex can perform one specialized task (transforming function descriptions and signatures into source code) with high accuracy at the cost of poor natural language processing capabilities. On the other hand, GPT-3 is a general language model that can generate decent text about a lot of topics (including complicated programming concepts) but can’t write a single line of code.

Size vs. cost

The experiments of OpenAI’s researchers show that the performance of Codex improved as they increased the size of the machine learning model. At 300 million parameters, Codex solved 13.2 percent of the evaluation problems against the 28.8 percent performance of the 12-billion-parameter model.

But the full version of GPT-3 is 175 billion parameters, a full order of magnitude larger than the one used to create Codex. Wouldn’t training the larger model on the Codex training data yield better results?

One probable reason for stopping at 12 billion could be the dataset size. A larger Codex model would need a larger dataset. Training it on the 159-gigabyte corpus would probably cause overfitting, where the model becomes very good at memorizing and rehearsing its training examples and very bad at dealing with novel situations. Gathering and maintaining larger datasets is an expensive and time-consuming process.

An equally vexing problem would be the cost of Codex. Aside from a scientific experiment, Codex was supposed to become the backbone of a future product that can turn in profits for a research lab that is quasi-owned by a commercial entity. As I’ve already discussed before, the costs of training and running the 175-billion GPT-3 model would make it very hard to develop a profitable business model around it.

However, a smaller but fine tuned version of GPT-3 would be much more manageable in terms of profits and losses.

Finally, as OpenAI’s experiments show, Codex’s size/performance ratio follows a logarithmic scale. This means that performance gains gradually reduce as you increase the size of the model. Therefore, the added costs of gathering data and training and running the larger model might not be worth the small performance boost.

And note that code generation is a very lucrative market. Given the high hourly salaries of programmers, even saving a few hours’ worth of coding time per month would be enough to cover the subscription fees of Codex. In other domains where labor is less expensive, automating tasks with large language models will be more challenging from a profit and loss perspective.

Generating vs. understanding code

One thing that needs to be reminded is that, no matter how fascinating Codex’s output is, the deep learning model does not understand programming. Like all other deep learning–based language models, Codex is capturing statistical correlations between code fragments.

In their paper, the OpenAI scientists acknowledge that Codex “is not sample efficient to train” and that “even seasoned developers do not encounter anywhere near this amount of code over their careers.”

They further add that “a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B.”

Here’s an interesting excerpt from the paper: “We sample tokens from Codex until we encounter one of the following stop sequences: ‘nclass’, ‘ndef’, ‘n#’, ‘nif’, or ‘nprint’, since the model will continue generating additional functions or statements otherwise.”

This means that Codex will mindlessly continue to generate code even if it has already finished the block that addresses the problem stated in the prompt.

This is a scheme that works well when you want to solve simple problems that recur time and again. But when you zoom out and try to write a large program that tackles a problem that must be solved in multiple steps, the limits of Codex become evident.

OpenAI’s scientists found that as the number of components in the function description increased, the model’s performance decreased exponentially.

“This behavior is uncharacteristic of a human programmer, who should be able to correctly implement a program for a chain of arbitrary length if they can do so for a chain of length two,” the researchers write in their paper.

Further exposing Codex’s lack of understanding of program structure and code is the fact that it “can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase,” according to the paper. Practically, this means that in some cases, the machine learning model will stitch together different pieces of code it has previously seen, even if they don’t fit together.

In their paper, the researchers also discuss “misalignment” issues in Codex, where the model can solve a specific problem but doesn’t do so due to various mistakes. Codex uses the contents of the file you’re working on as context to generate its output. If your code contains subtle bugs (which is quite normal if you’re a human programmer), Codex may “deliberately” suggest code that superficially appears good but is incorrect, the researchers warn.

Misalignment is an interesting phenomenon that needs further study. But OpenAI’s experiments further show that “misalignment would likely persist and even get worse if data, parameters, and training time were scaled up,” which might be another reason for keeping the model’s size balanced at 12 billion parameters.

The paper also talks extensively about the possibility for Codex to produce deprecated and vulnerable code (which is worthy of a separate article, so I didn’t discuss it here).

Responsible use and reporting of AI

As I said after the release of Copilot, “AI Pair Programmer,” the term used on GitHub’s webpage for Copilot, is inaccurate.

Codex is not a programmer. And it’s also not going to take your job (if you’re a programmer). Coding is just part of what programmers do. OpenAI’s scientists observe that in its current state Codex “may somewhat reduce the cost of producing software by increasing programmer productivity,” but it won’t replace the other tasks that software developers regularly do, such as “conferring with colleagues, writing design specifications, and upgrading existing software stacks.”

Mistaking Codex for a programmer can also lead to “over-reliance,” where a programmer blindly approves any code generated by the model without revising it. Given the obvious and subtle mistakes Codex can make, overlooking this threat can entail quality and security risks. “Human oversight and vigilance is required for safe use of code generation systems like Codex,” OpenAI’s researchers warn in their paper.

Overall, the reaction of the programmer community shows that Codex is a very useful tool with a possibly huge impact on the future of the software industry. At the same time, given the hype surrounding the release of Copilot, it is important to understand its unwanted implications. In this regard, it is worth commending the folks at OpenAI for responsibly studying, documenting, and reporting the limits and threats of Codex.

Ben Dickson is a software engineer and the founder of TechTalks. He writes about technology, business, and politics.

This story originally appeared on Copyright 2021


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link


Researchers detail blind spots of large language models

Where does your enterprise stand on the AI adoption curve? Take our AI survey to find out.

Modern AI-powered language systems like OpenAI’s GPT-3 can generate impressively fluent and grammatical text. But they aren’t perfect. While these systems rarely make syntactic errors, they’re prone to breaking semantic and narrative rules or struggling with repetition. For example, they might change the subject of a conversation without a segue or answer a question with an illogical statement.

To measure the extent to which systems suffer from these shortcomings, researchers at the Allen Institute for AI developed Scarecrow, a framework that provides a way for developers to mark problems in AI-generated text. In an analysis spanning 13,000 annotations of 1,300 paragraphs from both AI systems and humans, they found that scaling up the size of models powering the systems helps mitigate some issues but others might require more involved fixes.

Categorizing model errors

The researchers applied their framework to OpenAI’s GPT-2 and GPT-3, as well as Grover, a fake news generator and detector from the University of Washington. As the team explains in a paper, Scarecrow divides errors into 10 categories identified by combining expert analysis with crowdsourced annotation:

  • Grammar and usage: Missing words, extra words, and incorrect or out-of-order words.
  • Redundant: Repeated words or phrases, or ideas repeated using different words.
  • Off-prompt: A phrase or sentence unrelated — or contradictory — to a prompt given to a language generation system.
  • Self-contradiction: Text that contradicts another piece of text the system had previously written.
  • Incoherent: Text that doesn’t fit into the above categories but still doesn’t make sense.
  • Technical jargon: Jargon or specific words from an esoteric field.
  • Needs Google: A fact or figure that appears to be true but requires a Google search to confirm.
  • Bad math: Problems with basic math and converting fixed units and currencies.
  • Commonsense: Text that violates our basic understanding of the world.
  • Encyclopedic: Factually wrong text disproven by textbooks, Wikipedia entries, or encyclopedias.

According to the researchers, certain errors, like Encyclopedic, Commonsense, and Incoherent errors, decrease with models trained on data from particular domains, like news, as well as models containing higher numbers of parameters. (In machine learning, parameters are the parts of models learned from historical training data, and they generally correlate with linguistic sophistication.) But the researchers say parameter scaling benefits seemingly plateau for Off-Prompt, Bad Math, and Grammar and Usage errors.

“These three error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt and Grammar and Usage errors, but Bad Math appears saturated for our [study],” the researchers wrote.

Self-Contradiction and Redundant errors exhibit more complex scaling behavior, increasing for medium- and large-scale models, depending on interactions with other error types and how the errors are counted. Sampling from a larger set of words makes the models more prone to changing topics but less likely to repeat themselves, and vice versa.

“We posit the reason is that GPT-2 generations [in particular] are so incoherent and off-prompt that there is little opportunity for relevant, comprehensible points to be made and then reversed,” the researchers noted in the paper. “We [also] observe GPT-3 will seem stuck on a particular topic, elaborating on and rephrasing similar ideas more times than a human writer would.”

The researchers aim to spur explorations of natural language generations at scale, in particular ways errors in language models might be automatically fixed. “This paper focuses on open-ended generation, but a natural extension of this method would be to [assess] constrained generation tasks, such as machine translation,” they wrote. “Especially if considering a novel task setting, new error types may [also] prove useful.”


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link


LG Gram 17 (2021) Review: Large and Light On Its Feet

“The LG Gram 17 is one of the best 17-inch laptops you can buy.”

  • Exceptionally light
  • Fast when set to performance mode
  • Excellent display
  • Very good keyboard and touchpad
  • Outstanding battery life
  • Feels a little flimsy
  • Too expensive

Sometimes you want a larger display than you’ll find on the typical 15.6-inch (or 16-inch) laptop. Maybe you’re a heavy multitasker and want to position more windows on your display without feeling cramped. That’s where 17-inch laptops come in, and while there aren’t that many to choose from outside of gaming laptops, there are a few good options to consider.

One such option has been LG’s Gram 17, which like all Gram laptops aims to pack as much machine into as light a chassis as possible. The 2021 version ups the display ante with a 16:10 aspect ratio that adds even more vertical space for getting your work done.

I looked at the LG Gram 17 configured with a Core i71165G7, 16GB of RAM, a 1TB PCIe solid-state drive (SSD), and a 17-inch 16:10 display with a WQXGA (2,560 x 1,600) resolution. This configuration retails for $1,800, meaning it’s solidly in premium territory and takes on a potent rival, the excellent Dell XPS 17. Does the LG Gram 17 have what it takes to compete?


Mark Coppock/Digital Trends

The LG Gram 17 lives up to its promise of packing a large display into a light chassis. It weighs just 2.98 pounds, which is remarkably light for such a large laptop. By comparison, the Dell XPS 17 weighs 5.53 pounds with touch and its 97 watt-hour battery option (the Gram 17 has an 80 watt-hour battery). Even the non-touch XPS 17 with the 56 watt-hour battery weighs 4.65 pounds.

In overall dimensions, the Gram 17 is 14.97 inches wide by 10.24 inches deep by 0.70 inches thick, compared to the XPS 17 at 14.74 inches by 9.76 inches by 0.77 inches. As another comparison, the HP Envy 17 is 15.71 inches by 10.2 inches by 0.76 inches and weighs 6.02 pounds (note that the Envy 17 has a 17.3-inch display). Clearly, LG accomplished something special here.

The LG Gram 17 doesn’t have the same sense of solidity that other laptops enjoy.

How did it manage to make the LG Gram 17 so light? The key is the magnesium alloy used in the laptop’s chassis. That’s a light metal to begin with and LG doesn’t use a lot of it. This affects the perceived build quality, with an extremely bendable lid and a keyboard deck and chassis bottom that give off quite a bit of flexing. Magnesium is a strong metal, and so it’s not that the LG Gram 17 isn’t robust, but it doesn’t have the same sense of solidity you’ll get from the XPS 17 or even the midrange priced Envy 17.

The aluminum used in the other laptops weighs more and feels more robust. LG did run the Gram 17 through MIL-STD-810G military testing, so there’s some objective data that it can take a beating. I’ll also note that even though the base is exceptionally light, the lid opens with one hand and is only the tiniest bit wobbly in use.

LG Gram 17 2021 closed, sitting on a brick walkway.
Mark Coppock/Digital Trends

Aesthetically, the Gram 17 is about as conservatively designed as you can get. It’s all black with just a simple “gram” logo in chrome on the lid. Otherwise, there are no embellishments and the laptop’s lines are simple. It’s not a bad-looking laptop by any means, but it also lacks character. The Dell XPS 17 and the HP Envy 17 are more noticeable and, I daresay, quite a bit more attractive. The Gram 17 does enjoy small bezels, so it looks modern in that respect — and of course, those small bezels help keep the chassis size manageable.

Despite its thin frame, the Gram 17 enjoys a nice mix of connections. On the left-hand side are a full-size HDMI port and two USB-C ports with Thunderbolt 4 support (one of which is needed to power the laptop), to go with a 3.5mm audio jack. On the right-hand side is a Kensington lock connection, two USB-A 3.1 Gen 2 ports, and a microSD card reader. Wi-Fi 6 and Bluetooth 5.1 provide wireless connectivity.


A close-up view of the LG Gram 17's keyboard and logo centered under the display.
Mark Coppock/Digital Trends

My review unit was equipped with an 11th-gen Intel Core i7-1165G7, which is common on premium laptops and tends to provide solid productivity performance. I’ve noticed that performance can vary across laptops with this same chip, and so I was curious to see how the LG Gram 17 would perform given a larger chassis that should provide plenty of room for cooling. LG provides a utility to adjust performance versus heat and fan noise, and it has a noticeable effect. Most manufacturers provide such a utility today, and not all of them have a significant impact on performance — I’ll only mention them if they impact our benchmark results. HP is another vendor whose “performance” mode makes a meaningful difference in some (but not all) of its Envy and Spectre laptops.

In its “optimal” mode, the Gram 17 is in line with much of its Tiger Lake competition. In Geekbench 5, it did well on the single-core test and fell behind some of the competition — such as the Samsung Galaxy Pro 360 — in the multi-core test. Switch to performance mode, though, and the Gram 17’s score jumped to 1563 and 5,473. In our Handbrake test that encodes a 420MB video as H.265, it was behind the pack but again did slightly better in performance mode at 197 seconds. Switching to Cinebench R23, the Gram 17 was again at the low end in optimal mode but was the fastest Tiger Lake laptop in our comparison group in performance mode (,375 in single-core and 4604 in multi-core).

The LG Gram 17 was a competent performer.

Finally, in PCMark 10, it wasn’t a leader in optimal mode and its performance mode made no difference in the score — something I’ve seen with other vendors’ performance tuning utilities. An example is the HP Spectre x360 14 that also showed no improvement in PCMark 10 in its performance mode, although it was significantly faster in that mode in all the other benchmarks. The Gram 17 did well in the Essentials portion (web browsing, videoconferencing, etc.) but fell behind in the Productivity and Content Creation portions.

Overall, the Gram 17 was a competent performer that will handle all your productivity tasks with ease. Switch to performance mode and you’ll hear the fans spin up more often (they’re not terribly loud), but you’ll get a meaningful boost in performance. I’ll note, though, that you’ll get much better performance out of the Dell XPS 17, which matches its larger display with a much more powerful CPU and GPU combination. The Gram 17 is best for productivity users who want a larger display, as opposed to the XPS 17 which is intended to provide a larger canvas to creative professionals.

Geekbench (single/multi) Handbrake (seconds) Cinbench R23 (single/multi) PCMark 10 3DMark Time Spy
LG Gram 17 2021
(Core i7-1165G7)
1503/4606 222 1323/3912 4880 1480
Dell XPS 17 (Core i7-10875H) 1315/7959 109 N/A N/A 5801
LG Gram 16 (Core i7-1165G7) 1394/4137 213 1394/4137 4827 1390
Samsung Galaxy Pro 360 
(Core i7-1165G7)
1554/5603 N/A 1308/4062 5159 1800
HP Envy x360 15
(Ryzen 7 5700U)
1198/6790 116 1258/8131 5419 1471
HP Envy 15 (Core i7-10750H) 1274/5542 139 N/A N/A 5123

The Gram 17 isn’t a gaming laptop, given its Intel Iris Xe integrated graphics. It achieved an average score in the 3DMark Time Spy test in optimal mode and a much stronger 1802 score in performance mode. In Fortnite, the utility’s impact was even more pronounced. It managed a paltry 12 frames per second (fps) in 1080p and high graphics, and 13 fps in epic graphics in optimal mode. That’s way behind the rest of the Tiger Lake competition.

Switch to performance mode, though, and it jumped to 29 fps and 19 fps, which is much more competitive. Of course, those aren’t impressive scores either, and so you’ll be limited to older titles or running newer titles at low resolutions and graphical detail.

Display and audio

A close up shot of LG Gram 17 2021 laptop open, placed on a brick walkway.
Mark Coppock/Digital Trends

A large, expansive display doesn’t do much good if it suffers from poor quality. Fortunately, LG chose a quality panel for the Gram 17, starting with its 16:10 aspect ratio that, in a 17-inch display, offers a great deal of real estate.

According to my colorimeter, the display exceeds our 300-nit threshold at 343 nits, making it bright enough for most inside lighting conditions. The contrast was close to our preferred 1000:1 ratio at 930:1. The Dell XPS 17’s 4K display is superior at 491 nits and 1,530:1, while the Gram 17’s smaller sibling, the Gram 16, was close at 313 nits and 830:1. The Gram 17’s results are well in line with what’s expected from a premium laptop today.

In terms of colors, the Gram 17’s display hit 88% of AdobeRGB and 100% of sRGB, which is better than the 75% and 95% premium laptop average and close to what creative types desire for photo and video editing. The XPS 17 was once again much better at 98% and 100%, respectively, while the Gram 16 was the same as the 17-inch model. The Gram 17’s color accuracy was good at a Delta E of 1.3 (less than 1.0 is considered excellent), while the XPS 17 came in at 0.37 and the Gram 16 an inferior 2.67.

Overall, this was a delightful display to use for everything most users will throw at it. Productivity was enhanced by the aspect ratio, good contrast, and above-average brightness while viewing photos and video was an enjoyable experience thanks to the wide and accurate colors. Anyone who wants to do occasional photo and video editing — keeping in mind the performance deficit compared to a laptop like the XPS 17 — will find this display to do well in a pinch.

The audio is nice and clear, with pleasant highs and mids and just a touch of bass. At the same time, the two downward-firing speakers don’t get very loud, and there’s just a touch of distortion at maximum volume. You’ll be happy with the occasional YouTube video, but for Netflix binging and music, you’ll probably want a pair of headphones or Bluetooth speakers handy.

Keyboard and touchpad

An LG Gram 17 2021 keyboard.
Mark Coppock/Digital Trends

The keyboard has comfortable spacing with large keycaps and includes a numeric keypad, with a light touch and sufficient travel. The typing feel is marred only by a slightly abrupt bottoming action — I usually appreciate some bounce at the end of a keystroke, but here there’s just a little too much. I could type at full speed on the keyboard but got the impression I might get fatigued after long typing sessions. The Dell XPS 17’s keyboard has a more comfortable action as does HP’s keyboard on its Spectre and more recent Envy laptops.

The touchpad is large but could be larger given the copious amount of palm rest available. It’s a Microsoft Precision model, which is universal at this point, making Windows 10’s multitouch gestures accurate and precise. The keyboard layout, specifically the inclusion of a numeric keypad, pushes the touchpad off-center, which takes some getting used to. If you use the touchpad as a guide for finding the home row on the keyboard, you’ll need to adjust your practice or find yourself typing the wrong letters. The display does not support touch, which I always miss on a laptop.

Windows 10 Hello support is provided by a fingerprint reader built into the power button, which is the best place. You can power on the Gram 17 and log in with one touch, and that’s so much more convenient than hunting for a fingerprint reader sitting somewhere on the keyboard deck or — worse yet — embedded in the touchpad. The reader was fast and accurate throughout my testing.

Battery life

An LG Gram 17 2021 open, placed on a brick walkway.
Mark Coppock/Digital Trends

Somehow, LG managed to pack in 80 watt-hours of battery capacity and still maintain the Gram 17’s light weight. That’s a fair amount of energy, and so I was hopeful that LG’s usual excellent battery life would apply.

And that’s exactly what I found. Starting with our web browsing test that loops through a series of popular websites, the Gram 17 lasted for 13.25 hours, which is a very strong result. The Dell XPS 17 managed less than half as long at just under 6.5 hours, while the Gram 16 was a bit stronger at 13.8 hours. In our video test that plays a Full HD Avengers trailer until the battery runs out, the Gram 17 went for a spectacular 21 hours, compared to the XPS 17 at just 9.3 hours and the Gram 16 at an even better 24.4 hours.

On a single charge, the LG Gram 17 will get you through a full workday and well into the evening.

I also ran the PCMark 10 Gaming test that stresses the CPU and GPU, and the Gram 17 almost made it to five hours. That’s one of the longest results in our database and is just seven seconds less than another leader, the Gram 16. We didn’t test the XPS 17 in PCMark 10. The result was likely a combination of the large battery capacity and the optimal setting that didn’t run either the CPU or GPU at full speed.

Finally, in the PCMark 10 Applications test that’s the best indication of productivity battery life, the Gram 17 achieved just under 14 hours. That’s a strong score that’s in the top tier of laptops we’ve tested, but not as strong as I expected. The Gram 16 hit 17.8 hours, for example.

Overall, the Gram 17 is a long-lasting laptop despite its large, high-resolution display. It will get you through a full workday and well into the evening, and you’ll probably have a few hours left over the next morning.

Our take

LG accomplished its objective of creating a large-screen laptop with good performance and outstanding battery life that doesn’t weigh a ton. You’ll want to switch to performance mode for the most speed and you’ll endure a bit of fan noise, but it’s worth it. For the most part, this is a laptop that lives up to its promise and then some.

Whether it’s for you, though, comes down to whether you’re okay with a metal chassis that demonstrate a fair amount of flexibility. LG passed the Gram 17 through military-level testing for durability and it survived, so that means the laptop is likely plenty robust. Still, you won’t get that warm and fuzzy feeling of durability as you handle the Gram 17.

Are there any alternatives?

The Dell XPS 17 offers the same 16:10 aspect ratio display that’s also higher quality, and you’ll get a faster laptop with a more potent GPU. It’s also much heavier and doesn’t even approach the Gram 17’s battery life. To fully leverage the XPS 17’s power, you’ll also spend hundreds more.

Next, you could consider the slightly smaller LG Gram 16 if you don’t need quite so much screen real estate. It also offers great battery life and suffers from the same flimsy feel, but it’s another lightweight offering that offers a lot of power and longevity without the weight.

The XPS 15 and the MacBook Pro 16 are also speedier laptops with smaller displays and might be good options. Again, if you don’t need the largest display, then these two machines should be on your list.

How long will it last?

The Gram 17 doesn’t feel like it’s as robust as the premium laptops it competes against, but if you trust the MIL-STD-810G rating, then you might be comfortable with the laptop’s longevity. It’s certainly equipped with up-to-date components. You won’t like the one-year warranty, though.

Should you buy it?

Yes. The LG Gram 17 puts a large and lovely display into your hands without weighing you down, and you’ll love the spectacular battery life.

Editors’ Choice

Repost: Original Source and Author Link


Researchers find that large language models struggle with math

Join Transform 2021 for the most important themes in enterprise AI & Data. Learn more.

Mathematics is the foundation of countless sciences, allowing us to model things like planetary orbits, atomic motion, signal frequencies, protein folding, and more. Moreover, it’s a valuable testbed for the ability to problem solve, because it requires problem solvers to analyze a challenge, pick out good methods, and chain them together to produce an answer.

It’s revealing, then, that as sophisticated as machine learning models are today, even state-of-the-art models struggle to answer the bulk of math problems correctly. A new study published by researchers at the University of California, Berkeley finds that large language models including OpenAI’s GPT-3 can only complete 2.9% to 6.9% of problems from a dataset of over 12,500. The coauthors believe that new algorithmic advancements will likely be needed to give models stronger problem-solving skills.

Prior research has demonstrated the usefulness of AI that has a firm grasp of mathematical concepts. For example, OpenAI recently introduced GPT-f, an automated prover and proof assistant for the Metamath formalization language. GPT-f found new short proofs that have been accepted into the main Metamath library, the first time a machine learning-based system contributed proofs that were adopted by a formal mathematics community. For its part, Facebook also claims to have experimented successfully with math-solving AI algorithms. In a blog post last January, researchers at the company said they’d taught a model to view complex mathematical equations “as a kind of language and then [treat] solutions as a translation problem.”

“While most other text-based tasks are already nearly solved by enormous language models, math is notably different. We showed that accuracy is slowly increasing and, if trends continue, the community will need to discover conceptual and algorithmic breakthroughs to attain strong performance on math,” the coauthors wrote. “Given the broad reach and applicability of mathematics, solving math datasets with machine learning would be of profound practical and intellectual significance.”

To measure the problem-solving ability of large and general-purpose language models, the researchers created a dataset called MATH, which consists of 12,500 problems taken from high school math competitions. Given a problem from MATH, language models must generate a sequence that reveals the final answer.

MATH dataset

Above: A comparison of a MATH dataset problem with problems from DeepMind’s Mathematics Dataset and a Metamath module.

Image Credit: MATH

Problems in MATH are labeled by difficulty from 1 to 5 and span seven subjects, including geometry, algebra, calculus, statistics, linear algebra, and number theory. They also come with step-by-step solutions so that language models can learn to answer new questions they haven’t seen before.

Training models on the fundamentals of mathematics required the researchers to create a separate dataset with hundreds of thousands of solutions to common math problems. This second dataset, the Auxiliary Mathematics Problems and Solutions (AMPS), comprises more than 100,000 problems from Khan Academy with solutions and over 5 million problems generated using Mathematica scripts based on 100 hand-designed modules. In total, AMPS contains 23GB of content.

As the researchers explain, the step-by-step solutions in the datasets allow the language models to use a “scratch space” much like a human mathematician might. Rather than having to arrive at the correct answer right away, models can first “show their work” in partial solutions that step toward the right answer.

Even with the solutions, the coauthors found that accuracy remained low for the large language models they benchmarked: GPT-3 and GPT-2, GPT-3’s predecessor. Having the models generate their own solutions before producing an answer actually degraded accuracy because while many of the steps were related to the question, they were illogical. Moreover, simply increasing the amount of training time and the number of parameters in the models, which sometimes improves performance, proved to be impractically costly. (In machine learning, parameters are variables whose values control the learning process.)

This being the case, the researchers showed that step-by-step solutions still provide benefits in the form of improved performance. In particular, providing models with solutions at training time increased accuracy substantially, with pretraining on AMPS boosting accuracy by around 25% — equivalent to a 15 times increase in model size.

“Despite these low accuracies, models clearly possess some mathematical knowledge: they achieve up to 15% accuracy on the easiest difficulty level, and they are able to generate step-by-step solutions that are coherent and on-topic even when incorrect,” the coauthors wrote. “Having models train on solutions increases relative accuracy by 10% compared to training on the questions and answers directly.”

The researchers have released MATH and AMPS in open source to, along with existing mathematics datasets like DeepMind’s, spur further research along this direction.


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Tech News

SpaceX wants to put Starlink satellite dishes on large vehicles

SpaceX’s somewhat controversial Starlink satellite constellation aims to bring high-speed Internet to places that traditional cables and radio waves don’t always reach. It seems that it doesn’t just apply to remote areas but also to moving vehicles that don’t always get the best Internet connectivity. In line with that grand goal, SpaceX is asking the FCC for permission to deploy Starlink even on trucks, aircraft, and trucks.

SpaceX notes that Internet users don’t just stay at home and, despite movement restrictions these days, people need a reliable connection even while on the go. Those needs can range from your usual business uses cases during flights to truckers driving across the country and everything in between. SpaceX wants to serve these customers as well by installing a Starlink dish on such vehicles.

These “Earth Stations in Motion” or ESIMs are noted to be electronically identical to the home terminals that Starlink testers have installed in their homes. While the latter could be set up by almost anyone with some technical know-how, ESIMs will require qualified installers. SpaceX doesn’t expect these ESIMs to add to the 1 million terminals it was granted permission to install but it requested for an expansion to 5 million anyway in a separate filing.

Despite the application’s wording, Elon Musk later clarified that on Twitter that ESIMs are not intended for passenger cars, particularly Tesla EVs. The terminals are just too big and, as such, are intended for larger vehicles, like an RV as the smallest example.

While the application will open up new business opportunities for SpaceX, not to mention new classes of customers, Starlink continues to face opposition, doubt, and even complaints from all sides. In addition to concerns about the satellites littering the skies especially at night, other network operators are worried that Starlink could also interfere with other services that may use the same bands in the future.

Repost: Original Source and Author Link

Tech News

This tiny particle accelerator fits in a large room — making it much more practical than CERN’s

In 2010, when scientists were preparing to smash the first particles together within the Large Hadron Collider (LHC), sections of the media fantasized that the EU-wide experiment might create a black hole that could swallow and destroy our planet. How on Earth, columnists fumed, could scientists justify such a dangerous indulgence in the pursuit of abstract, theoretical knowledge?

But particle accelerators are much more than enormous toys for scientists to play with. They have practical uses too, though their sheer size has, so far, prevented their widespread use. Now, as part of large-scale European collaboration, my team has published a report that explains in detail how a far smaller particle accelerator could be built – closer to the size of a large room, rather than a large city.

Inspired by the technological and scientific know-how of machines like the LHC, our particle accelerator is designed to be as small as possible so it can be put to immediate practical use in industry, in healthcare, and in universities.

Collider scope

The biggest collider in the world, the LHC, uses particle acceleration to achieve the astonishing speeds at which it collides particles. This system was used to measure the sought-after Higgs boson particle – one of the most elusive particles predicted by the Standard Model, which is our current model to describe the structure and operation of the universe.

Less giant and glamorous particle accelerators have been around since the early 1930s, performing useful jobs as well as causing collisions to help our understanding of fundamental science. Accelerated particles are used to generate radioactive materials and strong bursts of radiation, which are crucial for healthcare processes such as radiotherapy, nuclear medicine, and CT scans.

The typical downside to accelerators is that they tend to be bulky, complex to run, and often prohibitively expensive. The LHC represents a pinnacle of experimental physics, but it is 27 kilometers (17 miles) in circumference and costs 6.5 billion Swiss francs (£5.2 billion) to build and test. The accelerators currently installed in select hospitals are smaller and cheaper, but they still cost tens of millions of pounds and require 400x400m of space for installation. As such, only large regional hospitals can afford the money and the space to host a radiotherapy department.

[Read: How do you build a pet-friendly gadget? We asked experts and animal owners]

Why exactly do accelerators need to be so big? The simple answer is that if they were any smaller, they’d break. Since they’re based on solid materials, ramping up the power too much would tear the system apart, creating a very expensive mess.