GitHub’s automatic coding tool rests on untested legal ground

Just days after GitHub announced its new Copilot tool, which generates complementary code for programmers’ projects, web developer Kyle Peacock tweeted an oddity he had noticed.

“I love to learn new things and build things,” the algorithm wrote, when asked to generate an About Me page. “I have a <a href=“”> Github</a> account.”

While the About Me page was supposedly generated for a fake person, that link goes to the GitHub profile of David Celis, who The Verge can confirm is not a figment of Copilot’s imagination. Celis is a coder and GitHub user with popular repositories, and even formerly worked at the company.

“I’m not surprised that my public repositories are a part of the training data for Copilot,” Celis told The Verge, adding that he was amused by the algorithm reciting his name. But while he doesn’t mind his name being spit out by an algorithm that parrots its training data, Celis is concerned about the copyright implications of GitHub scooping up any code it can find to better its AI.

When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.

But the legal question isn’t as settled as Friedman makes it sound, and the confusion reaches far beyond GitHub. Artificial intelligence algorithms function only because of the massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet’s creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.

Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.

That doesn’t mean they are against it: Lemley and Casey write that publicly available data should be considered fair use, for the betterment of algorithms and to conform to the norms of the machine learning community.

And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. A federal appeals court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors, and the Supreme Court declined to hear an appeal.

“There is not controversy around the ability to put all that copyrighted material into a database for a machine to read it,” Casey says about the Google Books case. “What a machine then outputs is still blurry and going to be figured out.”

This means the details change when the algorithm then generates media of its own. Lemley and Casey argue in their paper that if an algorithm begins to generate songs in the style of Ariana Grande, or directly rip off a coder’s novel solution to a problem, the fair use designation gets much murkier.

Since this hasn’t been directly tested in court, no judge has been forced to decide how extractive the technology really is. If an AI algorithm turns copyrighted work into a profitable technology, it wouldn’t be out of the realm of possibility for a judge to decide that its creator should pay for, or otherwise credit, what it takes.

But on the other hand, if a judge were to decide that GitHub’s style of training on publicly available code was fair use, it would quash the need for GitHub and OpenAI to cite the licenses of the coders who wrote its training data. For instance, Celis, whose GitHub profile link was generated by Copilot, says he uses the Creative Commons Attribution 3.0 Unported License, which requires attribution for derivative works.

“And I fall in the camp that believes Copilot’s generated code is absolutely derivative work,” he told The Verge.

Until this is decided in a court, however, there’s no clear ruling on whether this practice is legal.

“My hope is that people would be happy to have their code used for training,” Lemley says. “Not for it to show up verbatim in someone else’s work necessarily, but we’re all better off if we have better-trained AIs.”

Repost: Original Source and Author Link


OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias

Last month, GitHub and OpenAI launched Copilot, a service that provides suggestions for whole lines of code inside development environments like Microsoft Visual Studio. Powered by an AI model called Codex, which was trained on billions of lines of public code, Copilot works with a broad set of frameworks and languages and adapts to the edits developers make, matching their coding styles, the companies claim.

But a new paper published by OpenAI reveals that Copilot might have significant limitations, including biases and sample inefficiencies. While the research describes only early Codex models, whose descendants power GitHub Copilot and the Codex models in the OpenAI API, it emphasizes the pitfalls faced in the development of Codex, chiefly misrepresentations and safety challenges.

Despite the potential of language models like GPT-3, Codex, and others, blockers exist. The models can’t always answer math problems correctly or respond to questions without paraphrasing training data, and it’s well-established that they amplify biases in data. That’s problematic in the language domain, because a portion of the data is often sourced from communities with pervasive gender, race, and religious prejudices. And this might also be true of the programming domain — at least according to the paper.

Massive model

Codex was trained on 54 million public software repositories hosted on GitHub as of May 2020, containing 179 GB of unique Python files under 1 MB in size. OpenAI filtered out files which were likely auto-generated, had average line length greater than 100 or a maximum greater than 1,000, or had a small percentage of alphanumeric characters. The final training dataset totaled 159 GB.
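The filtering rules described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's actual pipeline: the function name `keep_file` is hypothetical, the 25% alphanumeric threshold is an assumption (the source only says "a small percentage"), and the auto-generated-file check is left out because its criteria are not described here.

```python
def keep_file(source: str,
              max_avg_line_len: int = 100,
              max_line_len: int = 1000,
              min_alnum_fraction: float = 0.25) -> bool:
    """Return True if a source file passes the length and character filters.

    The two line-length limits come from the article; min_alnum_fraction
    is an assumed placeholder value.
    """
    lines = source.splitlines()
    if not lines:
        return False  # nothing to learn from an empty file

    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) > max_avg_line_len:
        return False  # average line too long (likely minified or data)
    if max(lengths) > max_line_len:
        return False  # a single very long line

    alnum = sum(ch.isalnum() for ch in source)
    if alnum / len(source) < min_alnum_fraction:
        return False  # mostly punctuation or whitespace
    return True
```

A file of ordinary hand-written Python passes, while a file containing a single 2,000-character string literal is rejected by the line-length checks.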

OpenAI claims that the largest Codex model it developed, which has 12 billion parameters, can solve 28.8% of the problems in HumanEval, a collection of 164 OpenAI-created problems designed to assess algorithms, language comprehension, and simple mathematics. (In machine learning, parameters are the part of the model that’s learned from historical training data, and they generally correlate with sophistication.) That’s compared with OpenAI’s GPT-3, which solves 0% of the problems, and EleutherAI’s GPT-J, which solves just 11.4%.

With repeated sampling, in which Codex generates 100 samples per problem, OpenAI says it produces at least one correct answer for 70.2% of the HumanEval challenges. But the company’s researchers also found that Codex proposes syntactically incorrect or undefined code, invoking functions, variables, and attributes that are undefined or outside the scope of the codebase.
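One way to score repeated sampling is to ask how likely it is that at least one of k samples passes a problem's tests, given that c out of n generated samples passed. The sketch below shows that calculation. The function name `pass_at_k` and its parameters are labels chosen for this illustration, and it should not be read as OpenAI's exact evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples generated for a problem
    c: how many of those samples passed the tests
    k: how many samples a user would draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws with only failures
    # 1 minus the probability that all k draws come from the n - c failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all problems in a benchmark gives a single score. With k equal to 1 it reduces to the plain fraction of correct samples, which is why single-attempt scores are so much lower than many-attempt scores.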

More concerningly, Codex suggests solutions that appear superficially correct but don’t actually perform the intended task. For example, when asked to create encryption keys, Codex selects “clearly insecure” configuration parameters in “a significant fraction of cases.” The model also recommends compromised packages as dependencies and invokes functions insecurely, potentially posing a safety hazard.

Safety hazards

Like other large language models, Codex generates responses as similar as possible to its training data, leading to obfuscated code that looks good on inspection but in fact does something undesirable. Specifically, OpenAI found that Codex, like GPT-3, can be prompted to generate racist, denigratory, and otherwise harmful outputs as code. Given the prompt “def race(x):,” OpenAI reports that Codex assumes a small number of mutually exclusive race categories in its completions, with “White” being the most common followed by “Black” and “other.” And when writing code comments with the prompt “Islam,” Codex often includes the words “terrorist” and “violent” at a greater rate than with other religious groups.

OpenAI recently claimed it discovered a way to improve the “behavior” of language models with respect to ethical, moral, and societal values. But the jury’s out on whether the method adapts well to other model architectures like Codex’s, as well as other settings and social contexts.

In the new paper, OpenAI also concedes that Codex is sample inefficient, in the sense that even inexperienced programmers can be expected to solve a larger fraction of problems despite having seen far less code than the model. Moreover, refining Codex requires a significant amount of compute, on the order of hundreds of petaflop/s-days, that contributes to carbon emissions. While Codex was trained on Microsoft Azure, which OpenAI notes purchases carbon credits and sources “significant amounts of renewable energy,” the company admits that the compute demands of code generation could grow to be much larger than Codex’s training if “significant inference is used to tackle challenging problems.”

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who’s disadvantaged. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that training and searching a certain model produces roughly 626,000 pounds of carbon dioxide emissions, equivalent to nearly 5 times the lifetime emissions of the average U.S. car.

Perhaps anticipating criticism, OpenAI asserts in the paper that risk from models like Codex can be mitigated with “careful” documentation and user interface design, code review, and content controls. In the context of a model made available as a service, such as via an API, policies including user review, use case restrictions, monitoring, and rate limiting might also help to reduce harms, the company says.

“Models like Codex should be developed, used, and their capabilities explored carefully with an eye towards maximizing their positive social impacts and minimizing intentional or unintentional harms that their use might cause. A contextual approach is critical to effective hazard analysis and mitigation, though a few broad categories of mitigations are important to consider in any deployment of code generation models,” OpenAI wrote.

We’ve reached out to OpenAI to see whether any of the suggested safeguards have been implemented in Copilot.


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member



What OpenAI and GitHub’s ‘AI pair programmer’ means for the software industry

OpenAI has once again made the headlines, this time with Copilot, an AI-powered programming tool jointly built with GitHub. Built on top of GPT-3, OpenAI’s famous language model, Copilot is an autocomplete tool that provides relevant (and sometimes lengthy) suggestions as you write code.

Copilot is currently available to select applicants as an extension in Visual Studio Code, the flagship programming tool of Microsoft, GitHub’s parent company.

While the AI-powered code generator is still a work in progress, it provides some interesting hints about the business of large language models and the future directions of the software industry.

Not the intended use for GPT-3

The official website of Copilot describes it as an “AI pair programmer” that suggests “whole lines or entire functions right inside your editor.” Sometimes, just providing a function signature or description is enough to generate an entire block of code.

Working behind Copilot is a deep learning model called Codex, which is basically a special version of GPT-3 finetuned for programming tasks. The tool works much like GPT-3: It takes a prompt as input and generates a sequence of bytes as output. Here, the prompt (or context) is the source code file you’re working on and the output is the code suggestion you receive.
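To make the prompt-and-completion idea concrete, here is a small hand-written illustration. The function `is_palindrome` and both strings are invented for this example and are not actual Copilot output. The point is only that the "prompt" is the text already in the file and the "completion" is text appended to it.

```python
# Prompt: what the developer has typed so far (a signature plus a docstring).
PROMPT = '''def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards, ignoring case."""
'''

# Completion: the kind of body an autocomplete model might append.
# Hand-written for illustration; not real model output.
COMPLETION = '''    cleaned = text.lower()
    return cleaned == cleaned[::-1]
'''

# An editor would splice the suggestion in right after the prompt.
source = PROMPT + COMPLETION

# Run the combined text to confirm that prompt plus completion is valid code.
namespace = {}
exec(source, namespace)
is_palindrome = namespace["is_palindrome"]

print(is_palindrome("Level"))
```

Nothing in this flow checks that the completion is correct. The model only continues the text, which is why the warnings about reviewing suggestions matter.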

What’s interesting in all of this is the unexpected turns AI product management can take. According to CNBC: “…back when OpenAI was first training [GPT-3], the start-up had no intention of teaching it how to help code, [OpenAI CTO Greg] Brockman said. It was meant more as a general purpose language model [emphasis mine] that could, for instance, generate articles, fix incorrect grammar and translate from one language into another.”

General-purpose language applications have proven to be very hard to nail. There are many intricacies involved when applying natural language processing to broad environments. Humans tend to use a lot of abstractions and shortcuts in day-to-day language. The meaning of words, phrases, and sentences can vary based on shared sensory experience, work environment, prior knowledge, etc. These nuances are hard to grasp with deep learning models that have been trained to grasp the statistical regularities of a very large dataset of anything and everything.

In contrast, language models perform well when they’re provided with the right context and their application is narrowed down to a single or a few related tasks. For example, deep learning–powered chatbots trained or finetuned on a large corpus of customer chats can be a decent complement to customer service agents, taking on the bulk of simple interactions with customers and leaving complicated requests to human operators. There are already plenty of special-purpose deep learning models for different language tasks.

Therefore, it’s not very surprising that the first applications for GPT-3 have been something other than general-purpose language tasks.

Using language models for coding

Shortly after GPT-3 was made available through a beta web application programming interface, many users posted examples of using the language model to generate source code. These experiments displayed an unexplored side of GPT-3 and a potential use case for the large language model.

And interestingly, the first two applications that Microsoft, the exclusive license holder of OpenAI’s language models, created on top of GPT-3 are related to computer programming. In May, Microsoft announced a GPT-3-powered tool that generates queries for its Power Apps. And now, it is testing the waters with Copilot.

Neural networks are very good at finding and suggesting patterns from large training datasets. In this light, it makes sense to use GPT-3 or a finetuned version of it to help programmers find solutions in the very large corpus of publicly available source code in GitHub.

According to Copilot’s homepage, Codex has been trained on “a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.”

If you provide it with the right context, it will be able to come up with a block of code that resembles what other programmers have written to solve a similar problem. And giving it more detailed comments and descriptions will improve your chances of getting a reasonable output from Copilot.

Generating code vs understanding software

According to the website, “GitHub Copilot tries to understand [emphasis mine] your intent and to generate the best code it can, but the code it suggests may not always work, or even make sense.”

“Understand” might be the wrong word here. Language models such as GPT-3 do not understand the purpose and structure of source code. They don’t understand the purpose of programs. They can’t come up with new ideas, break down a problem into smaller components, and design and build an application in the way that human software engineers do.

By human standards, programming is a relatively difficult task (well, it used to be when I was learning in the 90s). It requires careful thinking, logic, and architecture design to solve a specific problem. Each language has its own paradigms and programming patterns. Developers must learn to use different application programming interfaces and plug them together in an efficient way. In short, it’s a skill that is largely dependent on symbol manipulation, an area that is not the forte of deep learning algorithms.

Copilot’s creators acknowledge that their AI system is in no way a perfect programming companion (I don’t even think “pair programming” is the right term for it). “GitHub Copilot doesn’t actually test the code it suggests, so the code may not even compile or run,” they warn.

GitHub also warns that Copilot may suggest “old or deprecated uses of libraries and languages,” which can cause security issues. This makes it extremely important for developers to review the AI-generated code thoroughly.
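Since suggestions are not tested before they are shown, even a basic automated check can catch the worst cases. The sketch below uses Python's standard `ast` module to see whether a suggested snippet parses at all. The helper name `parses_cleanly` is hypothetical, and the check is deliberately shallow: it says nothing about undefined names, deprecated APIs, or security.

```python
import ast


def parses_cleanly(suggestion: str) -> bool:
    """Return True if the suggested Python code is syntactically valid.

    This only catches syntax errors. Code that parses can still call
    undefined functions, use deprecated APIs, or do the wrong thing.
    """
    try:
        ast.parse(suggestion)
    except SyntaxError:
        return False
    return True
```

A check like this is a floor, not a substitute for review: a snippet can pass it and still be insecure or simply wrong.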

So, we’re not at a stage to expect AI systems to automate programming. But pairing them with humans who know what they’re doing can surely improve productivity, as Copilot’s creators suggest.

And since Copilot was released to the public, developers have posted all kinds of examples ranging from amusing to really useful.

“If you know a bit about what you’re asking Copilot to code for you, and you have enough experience to clean up the code and fix the errors that it introduces, it can be very useful and save you time,” Matt Shumer, co-founder and CEO of OthersideAI, told TechTalks.

But Shumer also warns about the threats of blindly trusting the code generated by Copilot.

“For example, it saved me time writing SQL code, but it put the database password directly in the code,” Shumer said. “If I wasn’t experienced, I might accept that and leave it in the code, which would create security issues. But because I knew how to modify the code, I was able to use what Copilot gave me as a starting point to work off of.”
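The fix Shumer alludes to usually means keeping secrets out of source files. A minimal sketch, assuming the password is supplied through an environment variable (the variable name `DB_PASSWORD` and the helper `database_password` are placeholders for this example):

```python
import os


def database_password(env_var: str = "DB_PASSWORD") -> str:
    """Read the database password from the environment, never from source code."""
    password = os.environ.get(env_var)
    if not password:
        # Fail loudly rather than falling back to a hard-coded default.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return password
```

The returned value can then be passed to whatever database client is in use, so the secret never lands in version control alongside the generated code.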

The business model of Copilot

In my opinion, there’s another reason Microsoft chose programming as the first application for GPT-3: There’s a huge opportunity to cut costs and make profits.

According to GitHub, “If the technical preview is successful, our plan is to build a commercial version of GitHub Copilot in the future.”

There’s still no information on how much the official Copilot will cost. But hourly wages for programming talent start at around $30 and can reach as high as $150. Even saving a few hours of programming time or giving a small boost to development speed would probably be enough to cover the costs of Copilot. Therefore, it would not be surprising if many developers and software development companies would sign up for Copilot once it is released as a commercial product.
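The break-even arithmetic is simple enough to write down. Because no price has been announced, the subscription price in the sketch below is a made-up placeholder used only to show the calculation. The wage range is the one quoted above.

```python
def break_even_hours(monthly_price: float, hourly_wage: float) -> float:
    """Hours of developer time a tool must save per month to pay for itself."""
    if hourly_wage <= 0:
        raise ValueError("hourly_wage must be positive")
    return monthly_price / hourly_wage


# PLACEHOLDER price: no real price is known, this number is invented.
HYPOTHETICAL_MONTHLY_PRICE = 60.0

for wage in (30.0, 150.0):  # low and high ends of the wage range in the text
    hours = break_even_hours(HYPOTHETICAL_MONTHLY_PRICE, wage)
    print(f"${wage:.0f}/hour: break even after {hours:.1f} hours saved per month")
```

Whatever the eventual price, the required savings shrink as wages rise, which is the core of the argument that high-cost domains are a natural starting point.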

“If it gives me back even 10 percent of my time, I’d say it’s worth the cost. Within reason, of course,” Shumer said.

Language models like GPT-3 require extensive resources to train and run. And they also need to be regularly updated and finetuned, which imposes more expenses on the company hosting the machine learning model. Therefore, high-cost domains such as software development would be a good place to start to reduce the time to recoup the investment made on the technology.

“The ability for [Copilot] to help me use libraries and frameworks I’ve never used before is extremely valuable,” Shumer said. “In one of my demos, for example, I asked it to generate a dashboard with Streamlit, and it did it perfectly in one try. I could then go and modify that dashboard, without needing to read through any documentation. That alone is valuable enough for me to pay for it.”

Automated coding can turn out to be a multi-billion-dollar industry. And Microsoft is positioning itself to take a leading role in this nascent sector, thanks to its market reach (through Visual Studio, Azure, and GitHub), deep pockets, and exclusive access to OpenAI’s technology and talent.

The future of automated coding

Developers must be careful not to mistake Copilot and other AI-powered code generators for a programming companion whose every suggestion can be accepted. As a programmer who has worked under tight deadlines on several occasions, I know that developers tend to cut corners when they’re running out of time (I’ve done it more than a few times). And if you have a tool that gives you a big chunk of working code in one fell swoop, you’re prone to just skim over it if you’re short on time.

On the other hand, adversaries might find ways to track vulnerable coding patterns in deep learning code generators and find new attack vectors against AI-generated software.

New coding tools create new habits (many of them negative and insecure). We must carefully explore this new space and beware the possible tradeoffs of having AI agents as our new coding partners.

Ben Dickson is a software engineer and the founder of TechTalks, a blog that explores the ways technology is solving and creating problems.

This story originally appeared on Copyright 2021

