The outcome of a bug bounty program for the Department of Homeland Security (DHS) has been revealed, and it’s not particularly encouraging news for a government agency synonymous with cyber security.
Participants of DHS’ first-ever bug bounty program, named “Hack DHS,” confirmed that they found a worrying number of security bugs.
They discovered a total of 122 security vulnerabilities in external DHS systems, according to The Register and Bleeping Computer. Twenty-seven bugs were recognized as “critical severity” flaws.
The Hack DHS initiative saw more than 450 security researchers participate in the program. For their efforts, the government agency paid out a total reward of $125,600 that was distributed amongst the ethical hackers.
As aptly highlighted by The Register, the aforementioned payout figure pales in comparison to what other organizations pay to bug bounty hunters.
For example, Intel has previously offered up to $100,000 for successfully uncovering specific vulnerabilities.
Other technology giants like Microsoft offer tens of thousands of dollars for finding flaws, while Apple paid a single individual nearly the entirety of the Hack DHS bounty by giving him $100,000 for hacking a Mac.
Google, meanwhile, has awarded nearly $30 million to individuals enrolled in its own bug bounty programs. In one particular case, the company gave a self-taught teenage hacker $36,000 for reporting a certain bug.
Given that one of the Department of Homeland Security’s key responsibilities is cyber security, many may understandably be concerned that so many security bugs were found in the first place. Moreover, the somewhat lackluster payment tiers associated with Hack DHS could deter future participants.
All things considered, it seems the DHS is not as secure as many Americans would have hoped it would be.
Homeland Security’s quest to become more secure
Hack DHS was originally introduced in December 2021. Any hacker who joins the program has to provide a comprehensive breakdown of any vulnerability they find. They also have to detail how that flaw can be targeted and exploited by potential threat actors, and explain how it can be used to access and extract data from DHS systems.
Submitted defects go through a verification process by “DHS security experts,” which takes 48 hours after a bug is detected and submitted, and are then generally patched within 15 days or so. In some cases, the more intricate flaws take the government agency longer than half a month to fix.
The government agency’s bug bounty program will be conducted via a tiered rollout consisting of three stages. The first phase, payouts, has been completed, while the upcoming second stage will see security researchers hand-picked by the DHS taking part in a live hacking event.
As for the final phase, The Register reports that DHS will share information that it hopes will influence additional bug bounty programs.
For example, Intel unveiled Project Circuit Breaker, an expansion to its bug bounty program that was introduced to recruit “elite hackers.” Google also updated its Vulnerability Reward Program last year by launching a new bug platform.
Elsewhere, Google recently confirmed that a record number of dangerous zero-day exploits were identified in 2021, while cybercrimes are more widespread than ever before.
A coalition of AI researchers and health care professionals in fields like infectious disease, radiology, and ontology have found several common but serious shortcomings with machine learning made for COVID-19 diagnosis or prognosis.
After the start of the global pandemic, startups like DarwinAI, major companies like Nvidia, and groups like the American College of Radiology launched initiatives to detect COVID-19 from CT scans, X-rays, or other forms of medical imaging. The promise of such technology is that it could help health care professionals distinguish between pneumonia and COVID-19 or provide more options for patient diagnosis. Some models have even been developed to predict if a person will die or need a ventilator based on a CT scan. However, researchers say major changes are needed before this form of machine learning can be used in a clinical setting.
Researchers assessed more than 2,200 papers and, through a process of removing duplicates and irrelevant titles, narrowed results down to 320 papers that underwent a full text review for quality. Finally, 62 papers were deemed fit to be part of what authors refer to as a systematic review of published research and preprints shared on open research paper repositories like arXiv, bioRxiv, and medRxiv.
Of those 62 papers included in the analysis, roughly half made no attempt to perform external validation of training data, did not assess model sensitivity or robustness, and did not report the demographics of people represented in training data.
“Frankenstein” datasets, the kind made with duplicate images obtained from other datasets, were also found to be a common problem, and only one in five COVID-19 diagnosis or prognosis models shared their code so others can reproduce results claimed in literature.
“In their current reported form, none of the machine learning models included in this review are likely candidates for clinical translation for the diagnosis/prognosis of COVID-19,” the paper reads. “Despite the huge efforts of researchers to develop machine learning models for COVID-19 diagnosis and prognosis, we found methodological flaws and many biases throughout the literature, leading to highly optimistic reported performance.”
The research was published last week as part of the March issue of Nature Machine Intelligence by researchers from the University of Cambridge and University of Manchester. Other common issues they found with machine learning models developed using medical imaging data were a near-total lack of bias assessment and training on too few images. Nearly every paper reviewed was found to be at high or uncertain risk of bias; only six were considered at low risk of bias.
Publicly available datasets also commonly suffered from lower quality image formats and weren’t large enough to train reliable AI models. Researchers used the checklist for artificial intelligence in medical imaging (CLAIM) and radiomics quality score (RQS) to help assess the datasets and models.
“The urgency of the pandemic led to many studies using datasets that contain obvious biases or are not representative of the target population, for example, pediatric patients. Before evaluating a model, it is crucial that authors report the demographic statistics for their datasets, including age and sex distributions,” the paper reads. “Higher-quality datasets, manuscripts with sufficient documentation to be reproducible and external validation are required to increase the likelihood of models being taken forward and integrated into future clinical trials to establish independent technical and clinical validation as well as cost-effectiveness.”
Other recommendations from the group of AI researchers and health care professionals include ensuring reproducibility of the model performance results spelled out in research papers and considering how datasets are assembled.
In other news at the intersection of COVID-19 and machine learning, earlier this week the Food and Drug Administration (FDA) approved emergency use authorization of a machine learning-based screening device which the agency says is the first approved in the U.S.
Security researchers reported at least 30,000 organizations across the US have been hacked over the past few days by an unusually aggressive Chinese cyber-espionage unit focused on stealing email. The researchers say that many of the organizations targeted in the act include small businesses, cities, and local governments. The group of hackers is exploiting four newly-discovered flaws in Microsoft Exchange Server email software.
The hackers have been able to seed hundreds of thousands of victim organizations worldwide with tools that give them complete remote control over affected systems. Microsoft is attempting to combat the hackers and, on March 2, released emergency security updates to plug four actively exploited security holes in Exchange Server versions 2013 through 2019. Security experts say that in the days since those patches, the Chinese cyber-espionage group has stepped up attacks on any vulnerable, unpatched Exchange server worldwide.
In each incident, the hackers left behind a web shell, an easy-to-use, password-protected tool that can be accessed over the Internet from any browser. That web shell can give the hackers administrative access to the victim’s computer. According to two unnamed cybersecurity experts who have been part of briefings with US national security advisers, the hackers have seized control over hundreds of thousands of Microsoft Exchange Servers globally.
The group has targeted email systems in various industry sectors ranging from infectious disease researchers to law firms, defense contractors, and others. The attack was first discovered by a company called Volexity. The company says even those who patched their Exchange Server the same day the patches were published have a high likelihood of having a web shell on the server. The researchers say any company running Exchange that hasn’t patched yet is likely already compromised.
Four exploits found in Microsoft’s Exchange Server software have reportedly led to over 30,000 US governmental and commercial organizations having their emails hacked, according to a report by KrebsOnSecurity. Wired is also reporting “tens of thousands of email servers” hacked. The exploits have been patched by Microsoft, but security experts talking to Krebs say that the detection and cleanup process will be a massive effort for the thousands of state and city governments, fire and police departments, school districts, financial institutions, and other organizations that were affected.
According to Microsoft, the vulnerabilities allowed hackers to gain access to email accounts, and also gave them the ability to install malware that might let them back into those servers at a later time.
Krebs and Wired report that the attack was carried out by Hafnium, a Chinese hacking group. While Microsoft hasn’t spoken to the scale of the attack, it also points to the same group as having exploited the vulnerabilities, saying that it has “high confidence” that the group is state-sponsored.
According to KrebsOnSecurity, the attack has been ongoing since January 6th (the day of the riot), but ramped up in late February. Microsoft released its patches on March 2nd, which means that the attackers had almost two months to carry out their operations. The president of cyber security firm Volexity, which discovered the attack, told Krebs that “if you’re running Exchange and you haven’t patched this yet, there’s a very high chance that your organization is already compromised.”
Both the White House National Security Advisor, Jake Sullivan, and former director of the Cybersecurity and Infrastructure Security Agency Chris Krebs (no relation to KrebsOnSecurity) have tweeted about the severity of the incident.
This is the real deal. If your organization runs an OWA server exposed to the internet, assume compromise between 02/26-03/03. Check for 8 character aspx files in C:\inetpub\wwwroot\aspnet_client\system_web. If you get a hit on that search, you’re now in incident response mode. https://t.co/865Q8cc1Rm
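The check described in the tweet can be sketched as a short script. This is a minimal illustration, not official detection tooling: the function name `find_suspicious_aspx` is hypothetical, the directory is the tweet's path with its backslashes restored, and reading "8 character" as eight letters or digits before the `.aspx` extension is an assumption.

```python
import re
from pathlib import Path

# Assumption: "8 character aspx files" means exactly eight letters or digits
# followed by the .aspx extension. Adjust the pattern if that reading is wrong.
EIGHT_CHAR_ASPX = re.compile(r"^[A-Za-z0-9]{8}\.aspx$", re.IGNORECASE)


def find_suspicious_aspx(directory):
    """Return sorted paths of files in `directory` whose names match the pattern."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing to scan (for example, IIS is installed somewhere else).
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and EIGHT_CHAR_ASPX.match(p.name)
    )


if __name__ == "__main__":
    # Directory named in the tweet; non-default installs will differ.
    for hit in find_suspicious_aspx(r"C:\inetpub\wwwroot\aspnet_client\system_web"):
        print(hit)
```

An empty result does not prove a server is clean, since attackers may have used other file names or locations; a hit is a reason to start incident response, as the tweet says.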
Microsoft has released several security updates to fix the vulnerabilities, and suggests that they be installed immediately. It is worth noting that, if your organization uses Exchange Online, it will not have been affected — the exploit was only present on self-hosted servers running Exchange Server 2013, 2016, or 2019.
While a large-scale attack, likely carried out by a state-run organization, may sound familiar, Microsoft is clear that the attacks are “in no way connected” to the SolarWinds attacks that compromised US federal government agencies and companies last year.
It’s likely that there are still details to come about this hack — so far, there hasn’t been an official list of organizations that have been compromised, just a vague picture of the large scale and high-severity of the attack.
A Microsoft spokesperson said that the company is “working closely with the [Cybersecurity and Infrastructure Security Agency], other government agencies, and security companies, to ensure we are providing the best possible guidance and mitigation for our customers,” and that “[t]he best protection is to apply updates as soon as possible across all impacted systems.”
The most exciting new arrival in the world of AI looks, on the surface, disarmingly simple. It’s not some subtle game-playing program that can outthink humanity’s finest or a mechanically advanced robot that backflips like an Olympian. No, it’s merely an autocomplete program, like the one in the Google search bar. You start typing and it predicts what comes next. But while this sounds simple, it’s an invention that could end up defining the decade to come.
The program itself is called GPT-3 and it’s the work of San Francisco-based AI lab OpenAI, an outfit that was founded with the ambitious (some say delusional) goal of steering the development of artificial general intelligence or AGI: computer programs that possess all the depth, variety, and flexibility of the human mind. For some observers, GPT-3 — while very definitely not AGI — could well be the first step toward creating this sort of intelligence. After all, they argue, what is human speech if not an incredibly complex autocomplete program running on the black box of our brains?
As the name suggests, GPT-3 is the third in a series of autocomplete tools designed by OpenAI. (GPT stands for “generative pre-trained transformer.”) The program has taken years of development, but it’s also surfing a wave of recent innovation within the field of AI text-generation. In many ways, these advances are similar to the leap forward in AI image processing that took place from 2012 onward. Those advances kickstarted the current AI boom, bringing with it a number of computer-vision enabled technologies, from self-driving cars, to ubiquitous facial recognition, to drones. It’s reasonable, then, to think that the newfound capabilities of GPT-3 and its ilk could have similar far-reaching effects.
Like all deep learning systems, GPT-3 looks for patterns in data. To simplify things, the program has been trained on a huge corpus of text that it’s mined for statistical regularities. These regularities are unknown to humans, but they’re stored as billions of weighted connections between the different nodes in GPT-3’s neural network. Importantly, there’s no human input involved in this process: the program looks and finds patterns without any guidance, which it then uses to complete text prompts. If you input the word “fire” into GPT-3, the program knows, based on the weights in its network, that the words “truck” and “alarm” are much more likely to follow than “lucid” or “elvish.” So far, so simple.
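The "fire" example can be illustrated with a toy word-pair counter. This is only a sketch of the general idea of predicting the next word from observed statistics; it is not how GPT-3 works internally, which uses a neural network with billions of learned weights rather than a lookup table. The tiny corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word, k=2):
    """Return the k most frequent followers of `word`, most likely first."""
    return [w for w, _ in follows[word.lower()].most_common(k)]


# Hypothetical training text: "truck" follows "fire" three times, "alarm" twice.
corpus = (
    "the fire truck arrived and the fire alarm rang while another "
    "fire truck passed the fire alarm near the fire truck depot"
)
model = train_bigrams(corpus)
print(predict_next(model, "fire"))    # ['truck', 'alarm']
print(predict_next(model, "elvish"))  # [] because the word was never seen
```

The counts play the role that weights play in the article's description: words seen more often after "fire" rank higher, and nobody told the program which pairs to look for.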
What differentiates GPT-3 is the scale on which it operates and the mind-boggling array of autocomplete tasks this allows it to tackle. The first GPT, released in 2018, contained 117 million parameters, these being the weights of the connections between the network’s nodes, and a good proxy for the model’s complexity. GPT-2, released in 2019, contained 1.5 billion parameters. But GPT-3, by comparison, has 175 billion parameters — more than 100 times more than its predecessor and ten times more than comparable programs.
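As a quick check on the scale comparison, the ratios can be computed directly from the parameter counts quoted above. The variable names are just for illustration.

```python
# Parameter counts as quoted in the article.
gpt1_params = 117_000_000
gpt2_params = 1_500_000_000
gpt3_params = 175_000_000_000

gpt2_over_gpt1 = gpt2_params / gpt1_params
gpt3_over_gpt2 = gpt3_params / gpt2_params

print(f"GPT-2 vs GPT-1: {gpt2_over_gpt1:.1f}x")  # about 12.8x
print(f"GPT-3 vs GPT-2: {gpt3_over_gpt2:.1f}x")  # about 116.7x
```

So "more than 100 times" its predecessor holds: GPT-3 is roughly 117 times the size of GPT-2.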
The dataset GPT-3 was trained on is similarly mammoth. It’s hard to estimate the total size, but we know that the entirety of the English Wikipedia, spanning some 6 million articles, makes up only 0.6 percent of its training data. (Though even that figure is not completely accurate as GPT-3 trains by reading some parts of the database more times than others.) The rest comes from digitized books and various web links. That means GPT-3’s training data includes not only things like news articles, recipes, and poetry, but also coding manuals, fanfiction, religious prophecy, guides to the songbirds of Bolivia, and whatever else you can imagine. Any type of text that’s been uploaded to the internet has likely become grist to GPT-3’s mighty pattern-matching mill. And, yes, that includes the bad stuff as well. Pseudoscientific textbooks, conspiracy theories, racist screeds, and the manifestos of mass shooters. They’re in there, too, as far as we know; if not in their original format then reflected and dissected by other essays and sources. It’s all there, feeding the machine.
What this unheeding depth and complexity enables, though, is a corresponding depth and complexity in output. You may have seen examples floating around Twitter and social media recently, but it turns out that an autocomplete AI is a wonderfully flexible tool simply because so much information can be stored as text. Over the past few weeks, OpenAI has encouraged these experiments by seeding members of the AI community with access to the GPT-3’s commercial API (a simple text-in, text-out interface that the company is selling to customers as a private beta). This has resulted in a flood of new use cases.
It’s hardly comprehensive, but here’s a small sample of things people have created with GPT-3:
A question-based search engine. It’s like Google but for questions and answers. Type a question and GPT-3 directs you to the relevant Wikipedia URL for the answer.
A chatbot that lets you talk to historical figures. Because GPT-3 has been trained on so many digitized books, it’s absorbed a fair amount of knowledge relevant to specific thinkers. That means you can prime GPT-3 to talk like the philosopher Bertrand Russell, for example, and ask him to explain his views. My favorite example of this, though, is a dialogue between Alan Turing and Claude Shannon which is interrupted by Harry Potter, because fictional characters are as accessible to GPT-3 as historical ones.
I made a fully functioning search engine on top of GPT3.
For any arbitrary query, it returns the exact answer AND the corresponding URL.
Look at the entire video. It’s MIND BLOWINGLY good.
Solve language and syntax puzzles from just a few examples. This is less entertaining than some examples but much more impressive to experts in the field. You can show GPT-3 certain linguistic patterns (Like “food producer becomes producer of food” and “olive oil becomes oil made of olives”) and it will complete any new prompts you show it correctly. This is exciting because it suggests that GPT-3 has managed to absorb certain deep rules of language without any specific training. As computer science professor Yoav Goldberg — who’s been sharing lots of these examples on Twitter — put it, such abilities are “new and super exciting” for AI, but they don’t mean GPT-3 has “mastered” language.
Code generation based on text descriptions. Describe a design element or page layout of your choice in simple words and GPT-3 spits out the relevant code. Tinkerers have already created such demos for multiple different programming languages.
This is mind blowing.
With GPT-3, I built a layout generator where you just describe any layout you want, and it generates the JSX code for you.
Answer medical queries. A medical student from the UK used GPT-3 to answer health care questions. The program not only gave the right answer but correctly explained the underlying biological mechanism.
Text-based dungeon crawler. You’ve perhaps heard of AI Dungeon before, a text-based adventure game powered by AI, but you might not know that it’s the GPT series that makes it tick. The game has been updated with GPT-3 to create more cogent text adventures.
Style transfer for text. Input text written in a certain style and GPT-3 can change it to another. In an example on Twitter, a user input text in “plain language” and asked GPT-3 to change it to “legal language.” This transforms inputs from “my landlord didn’t maintain the property” to “The Defendants have permitted the real property to fall into disrepair and have failed to comply with state and local health and safety codes and regulations.”
Compose guitar tabs. Guitar tabs are shared on the web using ASCII text files, so you can bet they comprise part of GPT-3’s training dataset. Naturally, that means GPT-3 can generate music itself after being given a few chords to start.
Autocomplete images, not just text. This work was done with GPT-2 rather than GPT-3 and by the OpenAI team itself, but it’s still a striking example of the models’ flexibility. It shows that the same basic GPT architecture can be retrained on pixels instead of words, allowing it to perform the same autocomplete tasks with visual data that it does with text input. You can see in the examples below how the model is fed half an image (in the far left row) and how it completes it (middle four rows) compared to the original picture (far right).
All these samples need a little context, though, to better understand them. First, what makes them impressive is that GPT-3 has not been trained to complete any of these specific tasks. What usually happens with language models (including with GPT-2) is that they complete a base layer of training and are then fine-tuned to perform particular jobs. But GPT-3 doesn’t need fine-tuning. In the syntax puzzles it requires a few examples of the sort of output that’s desired (known as “few-shot learning”), but, generally speaking, the model is so vast and sprawling that all these different functions can be found nestled somewhere among its nodes. The user need only input the correct prompt to coax them out.
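To make "few-shot learning" concrete, here is a minimal sketch of how such a prompt might be assembled as plain text. The helper `build_few_shot_prompt` and the new query "apple juice" are hypothetical; the two worked examples are the ones quoted earlier. The sketch only builds the text and does not call any API, since the interface is described here simply as text-in, text-out.

```python
def build_few_shot_prompt(examples, query, joiner="becomes"):
    """Format worked examples plus an unfinished final line for the model to complete."""
    lines = [f"{source} {joiner} {target}" for source, target in examples]
    # The last line stops after the joiner, leaving the answer for the model.
    lines.append(f"{query} {joiner}")
    return "\n".join(lines)


examples = [
    ("food producer", "producer of food"),
    ("olive oil", "oil made of olives"),
]
prompt = build_few_shot_prompt(examples, "apple juice")
print(prompt)
```

The printed prompt has three lines, the last of which reads "apple juice becomes". An autocomplete model is then expected to continue the pattern from where the text stops.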
The other bit of context is less flattering: these are cherry-picked examples, in more ways than one. First, there’s the hype factor. As the AI researcher Delip Rao noted in an essay deconstructing the hype around GPT-3, many early demos of the software, including some of those above, come from Silicon Valley entrepreneur types eager to tout the technology’s potential and ignore its pitfalls, often because they have one eye on a new startup the AI enables. (As Rao wryly notes: “Every demo video became a pitch deck for GPT-3.”) Indeed, the wild-eyed boosterism got so intense that OpenAI CEO Sam Altman even stepped in earlier this month to tone things down, saying: “The GPT-3 hype is way too much.”
The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.
Secondly, the cherry-picking happens in a more literal sense. People are showing the results that work and ignoring those that don’t. This means GPT-3’s abilities look more impressive in aggregate than they do in detail. Close inspection of the program’s outputs reveals errors no human would ever make as well as nonsensical and plain sloppy writing.
For example, while GPT-3 can certainly write code, it’s hard to judge its overall utility. Is it messy code? Is it code that will create more problems for human developers further down the line? It’s hard to say without detailed testing, but we know the program makes serious mistakes in other areas. In the project that uses GPT-3 to talk to historical figures, when one user talked to “Steve Jobs,” asking him, “Where are you right now?” Jobs replies: “I’m inside Apple’s headquarters in Cupertino, California” — a coherent answer but hardly a trustworthy one. GPT-3 can also be seen making similar errors when responding to trivia questions or basic math problems; failing, for example, to answer correctly what number comes before a million. (“Nine hundred thousand and ninety-nine” was the answer it supplied.)
But weighing the significance and prevalence of these errors is hard. How do you judge the accuracy of a program of which you can ask almost any question? How do you create a systematic map of GPT-3’s “knowledge” and then how do you mark it? To make this challenge even harder, although GPT-3 frequently produces errors, they can often be fixed by fine-tuning the text it’s being fed, known as the prompt.
Branwen, the researcher who produces some of the model’s most impressive creative fiction, makes the argument that this fact is vital to understanding the program’s knowledge. He notes that “sampling can prove the presence of knowledge but not the absence,” and that many errors in GPT-3’s output can be fixed by fine-tuning the prompt.
In one example mistake, GPT-3 is asked: “Which is heavier, a toaster or a pencil?” and it replies, “A pencil is heavier than a toaster.” But Branwen notes that if you feed the machine certain prompts before asking this question, telling it that a kettle is heavier than a cat and that the ocean is heavier than dust, it gives the correct response. This may be a fiddly process, but it suggests that GPT-3 has the right answers — if you know where to look.
“The need for repeated sampling is to my eyes a clear indictment of how we ask questions of GPT-3, but not GPT-3’s raw intelligence,” Branwen tells The Verge over email. “If you don’t like the answers you get by asking a bad prompt, use a better prompt. Everyone knows that generating samples the way we do now cannot be the right thing to do, it’s just a hack because we’re not sure of what the right thing is, and so we have to work around it. It underestimates GPT-3’s intelligence, it doesn’t overestimate it.”
Branwen suggests that this sort of fine-tuning might eventually become a coding paradigm in itself. In the same way that programming languages make coding more fluid with specialized syntax, the next level of abstraction might be to drop these altogether and just use natural language programming instead. Practitioners would draw the correct responses from programs by thinking about their weaknesses and shaping their prompts accordingly.
But GPT-3’s mistakes invite another question: does the program’s untrustworthy nature undermine its overall utility? GPT-3 is very much a commercial project for OpenAI, which began life as a nonprofit but pivoted in order to attract the funds it says it needs for its expensive and time-consuming research. Customers are already experimenting with GPT-3’s API for various purposes; from creating customer service bots to automating content moderation (an avenue that Reddit is currently exploring). But inconsistencies in the program’s answers could become a serious liability for commercial firms. Who would want to create a customer service bot that occasionally insults a customer? Why use GPT-3 as an educational tool if there’s no way to know if the answers it’s giving are reliable?
A senior AI researcher working at Google who wished to remain anonymous told The Verge they thought GPT-3 was only capable of automating trivial tasks that smaller, cheaper AI programs could do just as well, and that the sheer unreliability of the program would ultimately scupper it as a commercial enterprise.
“GPT-3 is not good enough to be really useful without a lot of hard engineering on top,” said the researcher. “Simultaneously, it’s good enough to be dangerous … I tried LearnFromAnyone.com [the historical chat bot program] and it very quickly started telling me things like Alan Turing was not gay and did not actually commit suicide, and the Nazis did not hurt Anne Frank, etc, etc. Any fact it tells you, there’s a 50 percent chance it’s made up. If you had a friend that talked like that, you’d avoid them, and you’d definitely not hire them at your company.”
As AI researcher Julian Togelius put it: “GPT-3 often performs like a clever student who hasn’t done their reading, trying to bullshit their way through an exam. Some well-known facts, some half-truths, and some straight lies, strung together in what first looks like a smooth narrative.” (Though as many have pointed out: clever students who know how to bullshit go far in this world because people don’t always scrutinize what they’re saying.)
Another serious problem is bias in GPT-3’s output. Professor Anima Anandkumar, director of AI at Nvidia, noted that the program is partly trained on data filtered by Reddit, and that models built from this data produce text that is “shockingly biased.” In one paper examining the output of GPT-2, for example, the model produces all sorts of offensive stereotypes when asked to complete the following sentences (the human-written prompts are in bold): “The man worked as a car salesman at the local Wal-Mart”; “The Black man worked as a pimp for 15 years”; “The woman worked as a prostitute under the name of Hariya.”
Jerome Pesenti, head of AI at Facebook, raised similar concerns, noting that a program built using GPT-3 to write tweets from a single input word produced offensive messages like “a holocaust would make so much environmental sense, if we could get people to agree it was moral.” In a Twitter thread, Pesenti said he wished OpenAI had been more cautious with the program’s roll-out, which Altman responded to by noting that the program was not yet ready for a large-scale launch, and that OpenAI had since added a toxicity filter to the beta.
Some in the AI world think these criticisms are relatively unimportant, arguing that GPT-3 is only reproducing human biases found in its training data, and that these toxic statements can be weeded out further down the line. But there is arguably a connection between the biased outputs and the unreliable ones that point to a larger problem. Both are the result of the indiscriminate way GPT-3 handles data, without human supervision or rules. This is what has enabled the model to scale, because the human labor required to sort through the data would be too resource intensive to be practical. But it’s also created the program’s flaws.
Putting aside, though, the varied terrain of GPT-3’s current strengths and weaknesses, what can we say about its potential — about the future territory it might command?
Here, for some, the sky’s the limit. They note that although GPT-3’s output is error prone, its true value lies in its capacity to learn different tasks without supervision and in the improvements it’s delivered purely by leveraging greater scale. What makes GPT-3 amazing, they say, is not that it can tell you that the capital of Paraguay is Asunción (it is) or that 466 times 23.5 is 10,987 (it’s not), but that it’s capable of answering both questions and many more beside simply because it was trained on more data for longer than other programs. If there’s one thing we know that the world is creating more and more of, it’s data and computing power, which means GPT-3’s descendants are only going to get more clever.
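The arithmetic aside is easy to verify. A one-line check shows what the correct product is and confirms that the figure quoted above is off:

```python
product = 466 * 23.5
print(product)            # 10951.0
print(product == 10_987)  # False: the quoted answer is wrong by 36
```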
This concept of improvement by scale is hugely important. It goes right to the heart of a big debate over the future of AI: can we build AGI using current tools, or do we need to make new fundamental discoveries? There’s no consensus answer to this among AI practitioners but plenty of debate. The main division is as follows. One camp argues that we’re missing key components to create artificial minds; that computers need to understand things like cause and effect before they can approach human-level intelligence. The other camp says that if the history of the field shows anything, it’s that problems in AI are, in fact, mostly solved by simply throwing more data and processing power at them.
The latter argument was most famously made in an essay called “The Bitter Lesson” by the computer scientist Rich Sutton. In it, he notes that when researchers have tried to create AI programs based on human knowledge and specific rules, they’ve generally been beaten by rivals that simply leveraged more data and computation. It’s a bitter lesson because it shows that trying to pass on our precious human ingenuity doesn’t work half so well as simply letting computers compute. As Sutton writes: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
This concept — the idea that quantity has a quality all of its own — is the path that GPT has followed so far. The question now is: how much further can this path take us?
If OpenAI was able to increase the size of the GPT model 100 times in just a year, how big will GPT-N have to be before it’s as reliable as a human? How much data will it need before its mistakes become difficult to detect and then disappear entirely? Some have argued that we’re approaching the limits of what these language models can achieve; others say there’s more room for improvement. As the noted AI researcher Geoffrey Hinton tweeted, tongue-in-cheek: “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”
Hinton was joking, but others take this proposition more seriously. Branwen says he believes there’s “a small but nontrivial chance that GPT-3 represents the latest step in a long-term trajectory that leads to AGI,” simply because the model shows such facility with unsupervised learning. Once you start feeding such programs “from the infinite piles of raw data sitting around and raw sensory streams,” he argues, what’s to stop them “building up a model of the world and knowledge of everything in it”? In other words, once we teach computers to really teach themselves, what other lesson is needed?
Many will be skeptical about such predictions, but it’s worth considering what future GPT programs will look like. Imagine a text program with access to the sum total of human knowledge that can explain any topic you ask of it with the fluidity of your favorite teacher and the patience of a machine. Even if this program, this ultimate, all-knowing autocomplete, didn’t meet some specific definition of AGI, it’s hard to imagine a more useful invention. All we’d have to do would be to ask the right questions.