New deep reinforcement learning technique helps AI to evolve

Hundreds of millions of years of evolution have produced a variety of life-forms, each intelligent in its own fashion. Each species has evolved to develop innate skills, learning capacities, and a physical form that ensures survival in its environment.

But despite being inspired by nature and evolution, the field of artificial intelligence has largely focused on creating the elements of intelligence separately and fusing them together after the development process. While this approach has yielded great results, it has also left AI agents lacking flexibility in some of the basic skills found in even the simplest life-forms.

In a new paper published in the scientific journal Nature Communications, AI researchers at Stanford University present a new technique that can help take steps toward overcoming some of these limits. Called “deep evolutionary reinforcement learning,” or DERL, the new technique uses a complex virtual environment and reinforcement learning to create virtual agents that can evolve both in their physical structure and learning capacities. The findings can have important implications for the future of AI and robotics research.

Evolution is hard to simulate

In nature, the body and brain evolve together. Across many generations, every animal species has gone through countless cycles of mutation to grow limbs, organs, and a nervous system to support the functions it needs in its environment. Mosquitoes are equipped with thermal vision to spot body heat. Bats have wings to fly and an echolocation apparatus to navigate dark spaces. Sea turtles have flippers to swim with and a magnetic field detector system to travel very long distances. Humans have an upright posture that frees their arms and lets them see the far horizon, hands and nimble fingers that can manipulate objects, and a brain that makes them the best social creatures and problem solvers on the planet.

Interestingly, all these species descended from the first life-form that appeared on Earth several billion years ago. Based on the selection pressures caused by the environment, the descendants of those first living beings evolved in many directions.

Studying the evolution of life and intelligence is interesting, but replicating it is extremely difficult. An AI system that tried to recreate intelligent life the way evolution did would have to search a vast space of possible morphologies, which is extremely expensive computationally and requires many parallel and sequential trial-and-error cycles.

AI researchers use several shortcuts and predesigned features to overcome some of these challenges. For example, they fix the architecture or physical design of an AI or robotic system and focus on optimizing the learnable parameters. Another shortcut is the use of Lamarckian rather than Darwinian evolution, in which AI agents pass on their learned parameters to their descendants. Yet another approach is to train different AI subsystems separately (vision, locomotion, language, etc.) and then stitch them together in a final AI or robotic system. While these approaches speed up the process and reduce the costs of training and evolving AI agents, they also limit the flexibility and variety of results that can be achieved.

Deep evolutionary reinforcement learning

In their new work, the researchers at Stanford aim to bring AI research a step closer to the real evolutionary process while keeping the costs as low as possible. “Our goal is to elucidate some principles governing relations between environmental complexity, evolved morphology, and the learnability of intelligent control,” they wrote in their paper.

Within the DERL framework, each agent uses deep reinforcement learning to acquire the skills required to maximize its goals during its lifetime. DERL uses Darwinian evolution to search the morphological space for optimal solutions: when a new generation of AI agents is spawned, its members inherit only the physical and architectural traits of their parents (along with slight mutations). None of the learned parameters are passed on across generations.
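
In code, this Darwinian loop can be sketched as follows. This is a minimal, illustrative Python skeleton, not the paper’s actual implementation: `lifetime_learning` stands in for a full deep RL training run, and the morphology encoding and fitness rule are invented for the example.

```python
import random

random.seed(0)

def lifetime_learning(morphology):
    """Stand-in for a deep RL training run; returns a fitness score.
    Learned parameters are discarded afterwards (Darwinian, not Lamarckian)."""
    # Hypothetical fitness: morphologies near four limbs learn locomotion best.
    return -abs(len(morphology["limbs"]) - 4) + random.random() * 0.1

def mutate(morphology):
    """A child inherits only the genotype, plus a small random mutation."""
    limbs = list(morphology["limbs"])
    if limbs and random.random() < 0.5:
        limbs.pop()                                       # remove a limb
    else:
        limbs.append({"size": random.uniform(0.1, 1.0)})  # grow a limb
    return {"limbs": limbs}

def evolve(generations=20, pop_size=16):
    population = [{"limbs": [{"size": 0.5}]} for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lifetime_learning, reverse=True)
        parents = scored[: pop_size // 2]        # selection acts on fitness alone
        children = [mutate(p) for p in parents]  # children inherit morphology only
        population = parents + children          # no learned weights carried over
    return population
```

The key point mirrors the paper: selection operates on morphology, while everything each agent learned in its lifetime dies with it.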

“DERL opens the door to performing large-scale in silico experiments to yield scientific insights into how learning and evolution cooperatively create sophisticated relationships between environmental complexity, morphological intelligence, and the learnability of control tasks,” the researchers wrote.

Simulating evolution

For their framework, the researchers used MuJoCo, a virtual environment that provides highly accurate rigid-body physics simulation. Their design space is called Universal Animal (Unimal), in which the goal is to create morphologies that learn locomotion and object-manipulation tasks in a variety of terrains.

Each agent in the environment is composed of a genotype that defines its limbs and joints. The direct descendant of each agent inherits the parent’s genotype and goes through mutations that can create new limbs, remove existing limbs, or make small modifications to characteristics, such as the degrees of freedom or the size of limbs.
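
A hedged sketch of those mutation operations: the limb encoding below is a simplified stand-in, not Unimal’s actual genotype representation, but it shows the three mutation types the article describes (grow a limb, delete a limb, or tweak a limb’s characteristics).

```python
import copy
import random

random.seed(1)

def mutate_genotype(genotype):
    """Apply one of three mutation types to a child's inherited genotype:
    grow a limb, delete a limb, or tweak a limb's size ("dof" stands for
    degrees of freedom)."""
    child = copy.deepcopy(genotype)  # the parent genotype is left untouched
    op = random.choice(["grow", "delete", "modify"])
    if op == "grow" or not child["limbs"]:
        child["limbs"].append({"size": round(random.uniform(0.1, 0.5), 2),
                               "dof": random.choice([1, 2, 3])})
    elif op == "delete":
        child["limbs"].pop(random.randrange(len(child["limbs"])))
    else:
        limb = random.choice(child["limbs"])
        limb["size"] = round(limb["size"] * random.uniform(0.8, 1.2), 2)
    return child

parent = {"limbs": [{"size": 0.3, "dof": 2}, {"size": 0.2, "dof": 1}]}
child = mutate_genotype(parent)
```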

Each agent is trained with reinforcement learning to maximize rewards in various environments. The most basic task is locomotion, in which the agent is rewarded for the distance it travels during an episode. Agents whose physical structures are better suited for traversing terrain learn faster to use their limbs for moving around.
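
The locomotion reward is simple to express. A sketch, taking `positions` as the agent’s torso x-coordinate at each timestep of an episode (the paper’s exact reward shaping may differ):

```python
def locomotion_reward(positions):
    """Distance traveled over an episode: final position minus start position."""
    return positions[-1] - positions[0]
```

An agent that ends an episode 2 units from where it started earns a reward of 2.0, so bodies that make forward motion easier to learn collect more reward sooner.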

To test the system’s results, the researchers generated agents in three types of terrains: flat (FT), variable (VT), and variable terrains with modifiable objects (MVT). The flat terrain puts the least selection pressure on the agents’ morphology. The variable terrains, on the other hand, force the agents to develop a more versatile physical structure that can climb slopes and move around obstacles. The MVT variant has the added challenge of requiring the agents to manipulate objects to achieve their goals.

The benefits of DERL

[Image: Deep evolutionary reinforcement learning generates a variety of successful morphologies across different environments. Credit: TechTalks]

One of the interesting findings of DERL is the diversity of the results. Other approaches to evolutionary AI tend to converge on one solution because new agents directly inherit both the physique and the learned parameters of their parents. Because DERL passes only morphological data on to descendants, the system ends up creating a diverse set of successful morphologies, including bipeds, tripeds, and quadrupeds with and without arms.

At the same time, the system shows traits of the Baldwin effect, which suggests that agents that learn faster are more likely to reproduce and pass on their genes to the next generation. DERL shows that evolution “selects for faster learners without any direct selection pressure for doing so,” according to the Stanford paper.

“Intriguingly, the existence of this morphological Baldwin effect could be exploited in future studies to create embodied agents with lower sample complexity and higher generalization capacity,” the researchers wrote.

Finally, the DERL framework also validates the hypothesis that more complex environments will give rise to more intelligent agents. The researchers tested the evolved agents across eight different tasks, including patrolling, escaping, manipulating objects, and exploration. Their findings show that in general, agents that have evolved in variable terrains learn faster and perform better than AI agents that have only experienced flat terrain.

Their findings seem to be in line with another hypothesis by DeepMind researchers that a complex environment, a suitable reward structure, and reinforcement learning can eventually lead to the emergence of all kinds of intelligent behaviors.

AI and robotics research

The DERL environment only has a fraction of the complexities of the real world. “Although DERL enables us to take a significant step forward in scaling the complexity of evolutionary environments, an important line of future work will involve designing more open-ended, physically realistic, and multiagent evolutionary environments,” the researchers wrote.

In the future, the researchers plan to expand the range of evaluation tasks to better assess how the agents can enhance their ability to learn human-relevant behaviors.

The work could have important implications for the future of AI and robotics and push researchers to use exploration methods that are much more similar to natural evolution.

“We hope our work encourages further large-scale explorations of learning and evolution in other contexts to yield new scientific insights into the emergence of rapidly learnable intelligent behaviors, as well as new engineering advances in our ability to instantiate them in machines,” the researchers wrote.

Ben Dickson is a software engineer and the founder of TechTalks. He writes about technology, business, and politics.

This story originally appeared on Bdtechtalks.com. Copyright 2021

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member


Is DeepMind’s new reinforcement learning system a step toward general AI?

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

One of the key challenges of deep reinforcement learning models — the kind of AI systems that have mastered Go, StarCraft 2, and other games — is their inability to generalize their capabilities beyond their training domain. This limit makes it very hard to apply these systems to real-world settings, where situations are much more complicated and unpredictable than the environments where AI models are trained.

But scientists at AI research lab DeepMind claim to have taken the “first steps to train an agent capable of playing many different games without needing human interaction data,” according to a blog post about their new “open-ended learning” initiative. Their new project includes a 3D environment with realistic dynamics and deep reinforcement learning agents that can learn to solve a wide range of challenges.

The new system, according to DeepMind’s AI researchers, is an “important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.”

The paper’s findings show some impressive advances in applying reinforcement learning to complicated problems. But they are also a reminder of how far current systems are from achieving the kind of general intelligence capabilities that the AI community has been coveting for decades.

The brittleness of deep reinforcement learning

The key advantage of reinforcement learning is its ability to develop behavior by taking actions and getting feedback, similar to the way humans and animals learn by interacting with their environment. Some scientists describe reinforcement learning as “the first computational theory of intelligence.”

The combination of reinforcement learning and deep neural networks, known as deep reinforcement learning, has been at the heart of many advances in AI, including DeepMind’s famous AlphaGo and AlphaStar models. In both cases, the AI systems were able to outmatch human world champions at their respective games.

But reinforcement learning systems are also notorious for their lack of flexibility. For example, a reinforcement learning model that can play StarCraft 2 at an expert level won’t be able to play a game with similar mechanics (e.g., Warcraft 3) at any level of competency. Even slight changes to the original game will considerably degrade the AI model’s performance.

“These agents are often constrained to play only the games they were trained for — whilst the exact instantiation of the game may vary (e.g. the layout, initial conditions, opponents) the goals the agents must satisfy remain the same between training and testing. Deviation from this can lead to catastrophic failure of the agent,” DeepMind’s researchers write in a paper that provides the full details on their open-ended learning.

Humans, on the other hand, are very good at transferring knowledge across domains.

The XLand environment

The goal of DeepMind’s new project was to create “an artificial agent whose behaviour generalises beyond the set of games it was trained on.”

To this end, the team created XLand, an engine that can generate 3D environments composed of static topology and moveable objects. The game engine simulates rigid-body physics and allows players to use the objects in various ways (e.g., create ramps, block paths, etc.).

XLand is a rich environment in which you can train agents on a virtually unlimited number of tasks. One of the main advantages of XLand is the capability to use programmatic rules to automatically generate a vast array of environments and challenges to train AI agents. This addresses one of the key challenges of machine learning systems, which often require vast amounts of manually curated training data.
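
A toy sketch of what programmatic task generation means in practice. The goal and object vocabularies below are invented for illustration; XLand’s actual schema is far richer.

```python
import random

random.seed(2)

# Illustrative vocabularies, not DeepMind's actual goal/object space.
GOALS = ["hold_object", "reach_target", "capture_flag"]
OBJECTS = ["cube", "pyramid", "sphere"]

def generate_task():
    """Sample a (world, game, players) combination programmatically,
    in the spirit of XLand's procedural task generation."""
    return {
        "world": {"terrain_seed": random.randrange(10_000),
                  "num_objects": random.randint(1, 6)},
        "game": {"goal": random.choice(GOALS),
                 "object": random.choice(OBJECTS)},
        "players": random.randint(1, 3),
    }

tasks = [generate_task() for _ in range(5)]
```

Because every component is sampled by rules rather than authored by hand, the number of distinct tasks grows combinatorially, which is what removes the manual-curation bottleneck.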

According to the blog post, the researchers created “billions of tasks in XLand, across varied games, worlds, and players.” The games range from very simple goals, such as finding objects, to more complex settings in which the AI agents must weigh the benefits and tradeoffs of different rewards. Some of the games include cooperation or competition elements involving multiple agents.

Deep reinforcement learning

DeepMind uses deep reinforcement learning and a few clever tricks to create AI agents that can thrive in the XLand environment.

The reinforcement learning model of each agent receives a first-person view of the world, the agent’s physical state (e.g., whether it is holding an object), and its current goal. Each agent fine-tunes the parameters of its policy neural network to maximize its rewards on the current task. The neural network architecture contains an attention mechanism to ensure the agent can balance optimization for the subgoals required to accomplish the main goal.
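
The goal-attention idea can be illustrated with a toy weighting function. This is not DeepMind’s architecture, just a sketch of how attention lets a policy softly prioritize one subgoal over others instead of picking a single one:

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_to_subgoals(subgoal_logits):
    """Turn per-subgoal relevance scores into attention weights, so the
    policy can balance progress across the subgoals of the main goal."""
    return softmax(subgoal_logits)

# Three hypothetical subgoals; the first currently looks most relevant.
weights = attend_to_subgoals([2.0, 0.5, 0.5])
```

The weights sum to one, so the policy never fully ignores a subgoal; it just allocates more of its behavior to the one that matters most right now.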

Once the agent masters its current challenge, the computational task generator creates a new challenge for the agent. Each new task is generated according to the agent’s training history and in a way to help distribute the agent’s skills across a vast range of challenges.
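
One simple way to generate tasks from training history is to keep the agent at the frontier of its competence. The heuristic below (targeting a roughly 50% success rate) is an illustrative stand-in for DeepMind’s dynamic task generation, not their actual criterion:

```python
def next_task(success_rates):
    """Pick the task whose recent success rate is closest to 50%:
    tasks the agent has neither mastered nor found impossible."""
    return min(success_rates, key=lambda task: abs(success_rates[task] - 0.5))

chosen = next_task({"easy": 0.95, "medium": 0.55, "hard": 0.05})  # → "medium"
```

Mastered tasks (near 100%) and hopeless tasks (near 0%) are both skipped, which spreads the agent’s skills across the challenge space rather than letting it overfit to what it already does well.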

DeepMind also used its vast computational resources (courtesy of its owner Alphabet Inc.) to train a large population of agents in parallel and transfer learned parameters across different agents to improve the general capabilities of the reinforcement learning systems.

[Image: DeepMind uses a multi-step and population-based mechanism to train many reinforcement learning agents.]

The performance of the reinforcement learning agents was evaluated based on their general ability to accomplish a wide range of tasks they had not been trained on. Some of the test tasks include well-known challenges such as “capture the flag” and “hide and seek.”

According to DeepMind, each agent played around 700,000 unique games in 4,000 unique worlds within XLand and went through 200 billion training steps across 3.4 million unique tasks (in the paper, the researchers write that 100 million steps are equivalent to approximately 30 minutes of training).
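
Taking the paper’s conversion at face value, the scale of training works out as follows:

```python
steps = 200_000_000_000   # total training steps per agent, per the article
minutes_per_100m = 30     # paper: ~30 minutes of training per 100 million steps

minutes = steps / 100_000_000 * minutes_per_100m  # 60,000 minutes
days = minutes / 60 / 24                          # ~41.7 days
```

In other words, 200 billion steps correspond to about 1,000 hours, or roughly six weeks, of continuous training time per agent under that conversion.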

“At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human,” the AI researchers wrote. “And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space.”

Zero-shot machine learning models can solve problems that were not present in their training dataset. In a complicated space such as XLand, zero-shot learning might imply that the agents have obtained fundamental knowledge about their environment as opposed to memorizing sequences of image frames in specific tasks and environments.

The reinforcement learning agents further manifested signs of generalized learning when the researchers tried to adjust them for new tasks. According to their findings, 30 minutes of fine-tuning on new tasks was enough to create an impressive improvement in a reinforcement learning agent trained with the new method. In contrast, an agent trained from scratch for the same amount of time would have near-zero performance on most tasks.

High-level behavior

According to DeepMind, the reinforcement learning agents exhibit the emergence of “heuristic behavior” such as tool use, teamwork, and multi-step planning. If proven, this can be an important milestone. Deep learning systems are often criticized for learning statistical correlations instead of causal relations. If neural networks could develop high-level notions such as using objects to create ramps or cause occlusions, it could have a great impact on fields such as robotics and self-driving cars, where deep learning is currently struggling.

But those are big ifs, and DeepMind’s researchers are cautious about jumping to conclusions on their findings. “Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently,” they wrote in their blog post.

But they are confident that their reinforcement learning agents “are aware of the basics of their bodies and the passage of time and that they understand the high-level structure of the games they encounter.”

Such fundamental self-learned skills are another one of the highly sought goals of the artificial intelligence community.

Theories of intelligence

Some of DeepMind’s top scientists published a paper recently in which they hypothesize that a single reward and reinforcement learning are enough to eventually reach artificial general intelligence (AGI). An intelligent agent with the right incentives can develop all kinds of capabilities such as perception and natural language understanding, the scientists believe.

Although DeepMind’s new approach still requires the training of reinforcement learning agents on multiple engineered rewards, it is in line with their general perspective of achieving AGI through reinforcement learning.

“What DeepMind shows with this paper is that a single RL agent can develop the intelligence to reach many goals, rather than just one,” Chris Nicholson, CEO of Pathmind, told TechTalks. “And the skills it learns in accomplishing one thing can generalize to other goals. That is very similar to how human intelligence is applied. For example, we learn to grab and manipulate objects, and that is the foundation of accomplishing goals that range from pounding a hammer to making your bed.”

Nicholson also believes that other aspects of the paper’s findings hint at progress toward general intelligence. “Parents will recognize that open-ended exploration is precisely how their toddlers learn to move through the world. They take something out of a cupboard, and put it back in. They invent their own small goals—which may seem meaningless to adults — and they master them,” he said. “DeepMind is programmatically setting goals for its agents within this world, and those agents are learning how to master them one by one.”

The reinforcement learning agents have also shown signs of developing embodied intelligence in their own virtual world, Nicholson said, like the kind humans have. “This is one more indication that the rich and malleable environment that people learn to move through and manipulate is conducive to the emergence of general intelligence, and that the biological and physical analogies of intelligence can guide further work in AI,” he said.

Sathyanaraya Raghavachary, Associate Professor of Computer Science at the University of Southern California, is a bit more skeptical of the claims made in DeepMind’s paper, especially the conclusions on proprioception, awareness of time, and high-level understanding of goals and environments.

“Even we humans are not fully aware of our bodies, let alone those VR agents,” Raghavachary said in comments to TechTalks, adding that perception of the body requires an integrated brain that is co-designed for suitable body awareness and situatedness in space. “Same with the passage of time — that too would require a brain that has memory of the past, and a sense for time in relation to that past. What they (paper authors) might mean relates to the agents’ tracking progressive changes in the environment resulting from their actions (e.g., as a result of moving a purple pyramid), state changes which the underlying physics simulator would generate.”

Raghavachary also points out that if the agents could understand the high-level structure of their tasks, they would not need 200 billion steps of simulated training to reach optimal results.

“The underlying architecture lacks what it takes to achieve these three things (body awareness, time passage, understanding high-level task structure) they point out in conclusion,” he said. “Overall, XLand is simply ‘more of the same.’”

The gap between simulation and the real world

In a nutshell, the paper shows that if you can create a complex enough environment, design the right reinforcement learning architecture, and expose your models to enough experience (and have a lot of money to spend on compute resources), you can train agents that generalize to many kinds of tasks within the same environment. And this is basically how natural evolution has delivered human and animal intelligence.

In fact, DeepMind has already done something similar with AlphaZero, a reinforcement learning model that managed to master multiple two-player turn-based games. The XLand experiment has extended the same notion to a much greater level by adding the zero-shot learning element.

But while I think that the experience from the XLand-trained agents will ultimately be transferable to real-world applications such as robotics and self-driving cars, I don’t think it will be a breakthrough. You’ll still need to make compromises (such as creating artificial limits to reduce the complexity of the real world) or create artificial enhancements (such as imbuing the machine learning models with prior knowledge or extra sensors).

DeepMind’s reinforcement learning agents might have become the masters of the virtual XLand. But their simulated world doesn’t even have a fraction of the intricacies of the real world. That gap will continue to remain a challenge for a long time.



BASALT Minecraft competition aims to advance reinforcement learning

Deep reinforcement learning, a subfield of machine learning that combines reinforcement learning and deep learning, takes what’s known as a reward function and learns to maximize the expected total reward. This works remarkably well, enabling systems to figure out how to solve Rubik’s Cubes, beat world champions at chess, and more. But existing algorithms have a problem: they implicitly assume access to a perfect reward specification. In reality, tasks don’t come prepackaged with rewards; those rewards come from imperfect human reward designers, and it can be difficult to translate conceptual preferences into reward functions that environments can calculate.
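
A minimal example of what “learns to maximize the expected total reward” means in practice: tabular Q-learning on a toy five-state corridor, where reward arrives only at the final state. The environment is invented for illustration; real deep RL replaces the table with a neural network.

```python
import random

random.seed(3)

# Toy corridor MDP: states 0..4, actions -1 (left) / +1 (right),
# reward 1.0 only for reaching state 4.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

Q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(500):                     # episodes of trial and error
    state, done = 0, False
    while not done:
        if random.random() < epsilon:    # explore occasionally
            action = random.choice([-1, 1])
        else:                            # otherwise exploit value estimates
            action = max((-1, 1), key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * max(Q[(nxt, -1)], Q[(nxt, 1)]))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt

# Greedy action in each non-goal state after training.
greedy_policy = [max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
```

After training, the greedy policy heads right from every state: reward maximization alone produced the goal-directed behavior, with no one telling the agent what the goal was.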

To solve this problem, researchers at DeepMind and the University of California, Berkeley, have launched a competition, BASALT, where the goal of an AI system must be communicated through demonstrations, preferences, or some other form of human feedback. Built on Minecraft, systems in BASALT must learn the details of specific tasks from human feedback, choosing among a wide variety of actions to perform.

BASALT

Recent research has proposed algorithms that allow designers to iteratively communicate details about tasks. Instead of rewards, these algorithms rely on new types of feedback, such as demonstrations, preferences, and corrections, and elicit it by taking the first steps of a provisional plan to see whether a human intervenes, or by asking the designer questions.

But there aren’t benchmarks to evaluate algorithms that learn from human feedback. A typical study will take an existing deep reinforcement learning benchmark, strip away the rewards, train a system using the proposed feedback mechanism, and evaluate performance according to the preexisting reward function. This is problematic. For example, in the Atari game Breakout, which is often used as a benchmark, a system must either hit the ball back with the paddle or lose. Good performance on Breakout doesn’t necessarily mean the algorithm mastered the game mechanics; it’s possible that it learned a simpler heuristic like “don’t die.”

In the real world, systems aren’t funneled into an obvious task above all others. That’s why BASALT provides a set of tasks and task descriptions as well as information about the player’s inventory — but no rewards. For example, one task — MakeWaterfall — provides in-game items including a water bucket, a stone pickaxe, a stone shovel, and cobblestone blocks, and the description “After spawning in a mountainous area, the agent should build a beautiful waterfall and then reposition itself to take a scenic picture of the same waterfall. The picture of the waterfall can be taken by orienting the camera and then throwing a snowball when facing the waterfall at a good angle.”

BASALT allows designers to use whichever feedback mechanisms they prefer to create systems that accomplish the tasks. The benchmark records the trajectories of two different systems on a particular environment and asks a human to decide which of the agents performed the task better.
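
The evaluation protocol reduces to aggregating pairwise human judgments. A minimal win-rate sketch (the actual benchmark aggregates many such comparisons into ratings rather than a raw fraction):

```python
def win_rate(judgments, system):
    """Fraction of pairwise human judgments won by `system`. Each judgment
    names whichever of the two compared systems performed the task better."""
    wins = sum(1 for winner in judgments if winner == system)
    return wins / len(judgments)

# Four hypothetical human judgments comparing systems "A" and "B".
judgments = ["A", "A", "B", "A"]
```

Here `win_rate(judgments, "A")` is 0.75: system A was preferred in three of four comparisons, with no reward function involved at any point.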

Future work

The researchers say that BASALT affords a number of advantages over existing benchmarks including reasonable goals, large amounts of data, and robust evaluations. In particular, they make the case that Minecraft is well-suited to the task because there are thousands of hours of gameplay on YouTube with which competitors could train a system. Moreover, Minecraft’s properties are easy to understand, the researchers say, with tools that have functions similar to real-world tools and straightforward goals like building shelter and acquiring enough food to not starve.

BASALT is also designed to be feasible to use on a budget. The code ships with a baseline system that can be trained in a couple of hours on a single GPU, according to Rohin Shah, a research scientist at DeepMind and project lead on BASALT.

“We hope that BASALT will be used by anyone who aims to learn from human feedback, whether they are working on imitation learning, learning from comparisons, or some other method. It mitigates many of the issues with the standard benchmarks used in the field. The current baseline has lots of obvious flaws, which we hope the research community will soon fix,” Shah wrote in a blog post. “We envision eventually building agents that can be instructed to perform arbitrary Minecraft tasks in natural language on public multiplayer servers, or inferring what large-scale project human players are working on and assisting with those projects, while adhering to the norms and customs followed on that server.”

The evaluation code for BASALT will be available in beta soon. The team is accepting sign-ups now, with plans to announce the winners of the competition at the NeurIPS 2021 machine learning conference in December.



Reinforcement learning can deliver general AI, says DeepMind

In their decades-long chase to create artificial intelligence, computer scientists have designed and developed all kinds of complicated mechanisms and technologies to replicate vision, language, reasoning, motor skills, and other abilities associated with intelligent life. While these efforts have resulted in AI systems that can efficiently solve specific problems in limited environments, they fall short of developing the kind of general intelligence seen in humans and animals.

In a new paper submitted to the peer-reviewed Artificial Intelligence journal, scientists at UK-based AI lab DeepMind argue that intelligence and its associated abilities will emerge not from formulating and solving complicated problems but by sticking to a simple but powerful principle: reward maximization.

Titled “Reward is Enough,” the paper, which is still in pre-proof as of this writing, draws inspiration from studying the evolution of natural intelligence as well as drawing lessons from recent achievements in artificial intelligence. The authors suggest that reward maximization and trial-and-error experience are enough to develop behavior that exhibits the kind of abilities associated with intelligence. And from this, they conclude that reinforcement learning, a branch of AI that is based on reward maximization, can lead to the development of artificial general intelligence.

Two paths for AI

One common method for creating AI is to try to replicate elements of intelligent behavior in computers. For instance, our understanding of the mammal vision system has given rise to all kinds of AI systems that can categorize images, locate objects in photos, define the boundaries between objects, and more. Likewise, our understanding of language has helped in the development of various natural language processing systems, such as question answering, text generation, and machine translation.

These are all instances of narrow artificial intelligence, systems that have been designed to perform specific tasks instead of having general problem-solving abilities. Some scientists believe that assembling multiple narrow AI modules will produce higher intelligent systems. For example, you can have a software system that coordinates between separate computer vision, voice processing, NLP, and motor control modules to solve complicated problems that require a multitude of skills.

A different approach to creating AI, proposed by the DeepMind researchers, is to recreate the simple yet effective rule that has given rise to natural intelligence. “[We] consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence,” the researchers write.

This is basically how nature works. As far as science is concerned, there has been no top-down intelligent design in the complex organisms that we see around us. Billions of years of natural selection and random variation have filtered lifeforms for their fitness to survive and reproduce. Living beings that were better equipped to handle the challenges and situations in their environments managed to survive and reproduce. The rest were eliminated.

This simple yet efficient mechanism has led to the evolution of living beings with all kinds of skills and abilities to perceive, navigate, modify their environments, and communicate among themselves.

“The natural world faced by animals and humans, and presumably also the environments faced in the future by artificial agents, are inherently so complex that they require sophisticated abilities in order to succeed (for example, to survive) within those environments,” the researchers write. “Thus, success, as measured by maximising reward, demands a variety of abilities associated with intelligence. In such environments, any behaviour that maximises reward must necessarily exhibit those abilities. In this sense, the generic objective of reward maximization contains within it many or possibly even all the goals of intelligence.”

For example, consider a squirrel that seeks the reward of minimizing hunger. On the one hand, its sensory and motor skills help it locate and collect nuts when food is available. But a squirrel that can only find food is bound to die of hunger when food becomes scarce. This is why it also has planning skills and memory to cache the nuts and restore them in winter. And the squirrel has social skills and knowledge to ensure other animals don’t steal its nuts. If you zoom out, hunger minimization can be a subgoal of “staying alive,” which also requires skills such as detecting and hiding from dangerous animals, protecting oneself from environmental threats, and seeking better habitats with seasonal changes.

“When abilities associated with intelligence arise as solutions to a singular goal of reward maximisation, this may in fact provide a deeper understanding since it explains why such an ability arises,” the researchers write. “In contrast, when each ability is understood as the solution to its own specialised goal, the why question is side-stepped in order to focus upon what that ability does.”

Finally, the researchers argue that the “most general and scalable” way to maximize reward is through agents that learn through interaction with the environment.

Developing abilities through reward maximization

In the paper, the AI researchers provide some high-level examples of how “intelligence and associated abilities will implicitly arise in the service of maximising one of many possible reward signals, corresponding to the many pragmatic goals towards which natural or artificial intelligence may be directed.”

For example, sensory skills serve the need to survive in complicated environments. Object recognition enables animals to detect food, prey, friends, and threats, or find paths, shelters, and perches. Image segmentation enables them to tell the difference between different objects and avoid fatal mistakes such as running off a cliff or falling off a branch. Meanwhile, hearing helps detect threats where the animal can’t see or find prey when they’re camouflaged. Touch, taste, and smell also give the animal the advantage of having a richer sensory experience of the habitat and a greater chance of survival in dangerous environments.

Rewards and environments also shape innate and learned knowledge in animals. For instance, hostile habitats ruled by predator animals such as lions and cheetahs reward ruminant species that have the innate knowledge to run away from threats since birth. Meanwhile, animals are also rewarded for their power to learn specific knowledge of their habitats, such as where to find food and shelter.

The researchers also discuss the reward-powered basis of language, social intelligence, imitation, and finally, general intelligence, which they describe as “maximising a singular reward in a single, complex environment.”

Here, they draw an analogy between natural intelligence and AGI: “An animal’s stream of experience is sufficiently rich and varied that it may demand a flexible ability to achieve a vast variety of subgoals (such as foraging, fighting, or fleeing), in order to succeed in maximising its overall reward (such as hunger or reproduction). Similarly, if an artificial agent’s stream of experience is sufficiently rich, then many goals (such as battery-life or survival) may implicitly require the ability to achieve an equally wide variety of subgoals, and the maximisation of reward should therefore be enough to yield an artificial general intelligence.”

Reinforcement learning for reward maximization

Reinforcement learning is a special branch of AI algorithms that is composed of three key elements: an environment, agents, and rewards.

By performing actions, the agent changes its own state and that of the environment. Based on how much those actions affect the goal the agent must achieve, it is rewarded or penalized. In many reinforcement learning problems, the agent has no initial knowledge of the environment and starts by taking random actions. Based on the feedback it receives, the agent learns to tune its actions and develop policies that maximize its reward.

In their paper, the researchers at DeepMind suggest reinforcement learning as the main algorithm that can replicate reward maximization as seen in nature and can eventually lead to artificial general intelligence.

“If an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent’s behaviour,” the researchers write, adding that, in the course of maximizing for its reward, a good reinforcement learning agent could eventually learn perception, language, social intelligence and so forth.

In the paper, the researchers provide several examples that show how reinforcement learning agents were able to learn general skills in games and robotic environments.

However, the researchers stress that some fundamental challenges remain unsolved. For instance, they say, “We do not offer any theoretical guarantee on the sample efficiency of reinforcement learning agents.” Reinforcement learning is notoriously renowned for requiring huge amounts of data. For instance, a reinforcement learning agent might need centuries worth of gameplay to master a computer game. And AI researchers still haven’t figured out how to create reinforcement learning systems that can generalize their learnings across several domains. Therefore, slight changes to the environment often require the full retraining of the model.

The researchers also acknowledge that learning mechanisms for reward maximization is an unsolved problem that remains a central question to be further studied in reinforcement learning.

Strengths and weaknesses of reward maximization

Patricia Churchland, neuroscientist, philosopher, and professor emerita at the University of California, San Diego, described the ideas in the paper as “very carefully and insightfully worked out.”

However, Churchland pointed it out to possible flaws in the paper’s discussion about social decision-making. The DeepMind researchers focus on personal gains in social interactions. Churchland, who has recently written a book on the biological origins of moral intuitions, argues that attachment and bonding is a powerful factor in social decision-making of mammals and birds, which is why animals put themselves in great danger to protect their children. 

“I have tended to see bonding, and hence other-care, as an extension of the ambit of what counts as oneself—‘me-and-mine,’” Churchland said. “In that case, a small modification to the [paper’s] hypothesis to allow for reward maximization to me-and-mine would work quite nicely, I think. Of course, we social animals have degrees of attachment—super strong to offspring, very strong to mates and kin, strong to friends and acquaintances etc., and the strength of types of attachments can vary depending on environment, and also on developmental stage.”

This is not a major criticism, Churchland said, and could likely be worked into the hypothesis quite gracefully.

“I am very impressed with the degree of detail in the paper, and how carefully they consider possible weaknesses,” Churchland said. “I may be wrong, but I tend to see this as a milestone.”

Data scientist Herbert Roitblat challenged the paper’s position that simple learning mechanisms and trial-and-error experience are enough to develop the abilities associated with intelligence. Roitblat argued that the theories presented in the paper face several challenges when it comes to implementing them in real life.

“If there are no time constraints, then trial and error learning might be enough, but otherwise we have the problem of an infinite number of monkeys typing for an infinite amount of time,” Roitblat said.

The infinite monkey theorem states that a monkey hitting random keys on a typewriter for an infinite amount of time may eventually type any given text.

Roitblat is the author of Algorithms are Not Enough, in which he explains why all current AI algorithms, including reinforcement learning, require careful formulation of the problem and representations created by humans.

“Once the model and its intrinsic representation are set up, optimization or reinforcement could guide its evolution, but that does not mean that reinforcement is enough,” Roitblat said.

In the same vein, Roitblat added that the paper does not make any suggestions on how the reward, actions, and other elements of reinforcement learning are defined.

“Reinforcement learning assumes that the agent has a finite set of potential actions. A reward signal and value function have been specified. In other words, the problem of general intelligence is precisely to contribute those things that reinforcement learning requires as a pre-requisite,” Roitblat said. “So, if machine learning can all be reduced to some form of optimization to maximize some evaluative measure, then it must be true that reinforcement learning is relevant, but it is not very explanatory.”

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.

 

Repost: Original Source and Author Link

Categories
AI

DeepMind says reinforcement learning is ‘enough’ to reach general AI

In their decades-long chase to create artificial intelligence, computer scientists have designed and developed all kinds of complicated mechanisms and technologies to replicate vision, language, reasoning, motor skills, and other abilities associated with intelligent life. While these efforts have resulted in AI systems that can efficiently solve specific problems in limited environments, they fall short of developing the kind of general intelligence seen in humans and animals.

In a new paper submitted to the peer-reviewed Artificial Intelligence journal, scientists at U.K.-based AI lab DeepMind argue that intelligence and its associated abilities will emerge not from formulating and solving complicated problems but by sticking to a simple but powerful principle: reward maximization.

Titled “Reward is Enough,” the paper, which is still in pre-proof as of this writing, draws inspiration from studying the evolution of natural intelligence as well as drawing lessons from recent achievements in artificial intelligence. The authors suggest that reward maximization and trial-and-error experience are enough to develop behavior that exhibits the kind of abilities associated with intelligence. And from this, they conclude that reinforcement learning, a branch of AI that is based on reward maximization, can lead to the development of artificial general intelligence.

Two paths for AI

One common method for creating AI is to try to replicate elements of intelligent behavior in computers. For instance, our understanding of the mammal vision system has given rise to all kinds of AI systems that can categorize images, locate objects in photos, define the boundaries between objects, and more. Likewise, our understanding of language has helped in the development of various natural language processing systems, such as question answering, text generation, and machine translation.

These are all instances of narrow artificial intelligence, systems that have been designed to perform specific tasks instead of having general problem-solving abilities. Some scientists believe that assembling multiple narrow AI modules will produce higher intelligent systems. For example, you can have a software system that coordinates between separate computer vision, voice processing, NLP, and motor control modules to solve complicated problems that require a multitude of skills.

A different approach to creating AI, proposed by the DeepMind researchers, is to recreate the simple yet effective rule that has given rise to natural intelligence. “[We] consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence,” the researchers write.

This is basically how nature works. As far as science is concerned, there has been no top-down intelligent design in the complex organisms that we see around us. Billions of years of natural selection and random variation have filtered lifeforms for their fitness to survive and reproduce. Living beings that were better equipped to handle the challenges and situations in their environments managed to survive and reproduce. The rest were eliminated.

This simple yet efficient mechanism has led to the evolution of living beings with all kinds of skills and abilities to perceive, navigate, modify their environments, and communicate among themselves.

“The natural world faced by animals and humans, and presumably also the environments faced in the future by artificial agents, are inherently so complex that they require sophisticated abilities in order to succeed (for example, to survive) within those environments,” the researchers write. “Thus, success, as measured by maximising reward, demands a variety of abilities associated with intelligence. In such environments, any behaviour that maximises reward must necessarily exhibit those abilities. In this sense, the generic objective of reward maximization contains within it many or possibly even all the goals of intelligence.”

For example, consider a squirrel that seeks the reward of minimizing hunger. On the one hand, its sensory and motor skills help it locate and collect nuts when food is available. But a squirrel that can only find food is bound to die of hunger when food becomes scarce. This is why it also has planning skills and memory to cache the nuts and restore them in winter. And the squirrel has social skills and knowledge to ensure other animals don’t steal its nuts. If you zoom out, hunger minimization can be a subgoal of “staying alive,” which also requires skills such as detecting and hiding from dangerous animals, protecting oneself from environmental threats, and seeking better habitats with seasonal changes.

“When abilities associated with intelligence arise as solutions to a singular goal of reward maximisation, this may in fact provide a deeper understanding since it explains why such an ability arises,” the researchers write. “In contrast, when each ability is understood as the solution to its own specialised goal, the why question is side-stepped in order to focus upon what that ability does.”

Finally, the researchers argue that the “most general and scalable” way to maximize reward is through agents that learn through interaction with the environment.

Developing abilities through reward maximization

In the paper, the AI researchers provide some high-level examples of how “intelligence and associated abilities will implicitly arise in the service of maximising one of many possible reward signals, corresponding to the many pragmatic goals towards which natural or artificial intelligence may be directed.”

For example, sensory skills serve the need to survive in complicated environments. Object recognition enables animals to detect food, prey, friends, and threats, or find paths, shelters, and perches. Image segmentation enables them to tell the difference between different objects and avoid fatal mistakes such as running off a cliff or falling off a branch. Meanwhile, hearing helps detect threats where the animal can’t see or find prey when they’re camouflaged. Touch, taste, and smell also give the animal the advantage of having a richer sensory experience of the habitat and a greater chance of survival in dangerous environments.

Rewards and environments also shape innate and learned knowledge in animals. For instance, hostile habitats ruled by predator animals such as lions and cheetahs reward ruminant species that have the innate knowledge to run away from threats since birth. Meanwhile, animals are also rewarded for their power to learn specific knowledge of their habitats, such as where to find food and shelter.

The researchers also discuss the reward-powered basis of language, social intelligence, imitation, and finally, general intelligence, which they describe as “maximising a singular reward in a single, complex environment.”

Here, they draw an analogy between natural intelligence and AGI: “An animal’s stream of experience is sufficiently rich and varied that it may demand a flexible ability to achieve a vast variety of subgoals (such as foraging, fighting, or fleeing), in order to succeed in maximising its overall reward (such as hunger or reproduction). Similarly, if an artificial agent’s stream of experience is sufficiently rich, then many goals (such as battery-life or survival) may implicitly require the ability to achieve an equally wide variety of subgoals, and the maximisation of reward should therefore be enough to yield an artificial general intelligence.”

Reinforcement learning for reward maximization

Reinforcement learning is a branch of AI algorithms built around three key elements: an environment, agents, and rewards.

By performing actions, the agent changes its own state and that of the environment. Based on how much those actions affect the goal the agent must achieve, it is rewarded or penalized. In many reinforcement learning problems, the agent has no initial knowledge of the environment and starts by taking random actions. Based on the feedback it receives, the agent learns to tune its actions and develop policies that maximize its reward.
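As an illustrative sketch of that loop (not code from the paper; the corridor environment and the hyperparameters below are invented for illustration), a tabular Q-learning agent starts with random actions and, from reward feedback alone, learns a policy that maximizes its cumulative reward:

```python
import random

# Toy corridor: states 0..4; the agent starts at 0 and is rewarded
# only for reaching the rightmost state.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, 1)  # step left or step right

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

# Q-table: estimated cumulative reward for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1
random.seed(0)

def greedy(s):
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for episode in range(1000):
    s, done = 0, False
    while not done:
        # Mostly exploit the current policy, occasionally explore at random.
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s2, r, done = step(s, a)
        # Move the estimate toward reward plus discounted future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The greedy policy derived from the learned Q-table heads toward the reward.
policy = {s: greedy(s) for s in range(N_STATES - 1)}
print(policy)
```

No ability is programmed in directly: the rightward-walking behavior emerges purely from reward feedback, which is the dynamic the paper generalizes from.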

In their paper, the researchers at DeepMind suggest reinforcement learning as the main algorithm that can replicate reward maximization as seen in nature and can eventually lead to artificial general intelligence.

“If an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent’s behaviour,” the researchers write, adding that, in the course of maximizing for its reward, a good reinforcement learning agent could eventually learn perception, language, social intelligence and so forth.

In the paper, the researchers provide several examples that show how reinforcement learning agents were able to learn general skills in games and robotic environments.

However, the researchers stress that some fundamental challenges remain unsolved. For instance, they say, “We do not offer any theoretical guarantee on the sample efficiency of reinforcement learning agents.” Reinforcement learning is notorious for requiring huge amounts of data; a reinforcement learning agent might need centuries’ worth of gameplay to master a computer game. And AI researchers still haven’t figured out how to create reinforcement learning systems that can generalize what they learn across several domains, so slight changes to the environment often require the full retraining of the model.

The researchers also acknowledge that the learning mechanisms needed for reward maximization remain an unsolved problem and a central question for further study in reinforcement learning.

Strengths and weaknesses of reward maximization

Patricia Churchland, neuroscientist, philosopher, and professor emerita at the University of California, San Diego, described the ideas in the paper as “very carefully and insightfully worked out.”

However, Churchland pointed out possible flaws in the paper’s discussion of social decision-making. The DeepMind researchers focus on personal gains in social interactions. Churchland, who has recently written a book on the biological origins of moral intuitions, argues that attachment and bonding are a powerful factor in the social decision-making of mammals and birds, which is why animals put themselves in great danger to protect their children.

“I have tended to see bonding, and hence other-care, as an extension of the ambit of what counts as oneself—‘me-and-mine,’” Churchland said. “In that case, a small modification to the [paper’s] hypothesis to allow for reward maximization to me-and-mine would work quite nicely, I think. Of course, we social animals have degrees of attachment—super strong to offspring, very strong to mates and kin, strong to friends and acquaintances etc., and the strength of types of attachments can vary depending on environment, and also on developmental stage.”

This is not a major criticism, Churchland said, and could likely be worked into the hypothesis quite gracefully.

“I am very impressed with the degree of detail in the paper, and how carefully they consider possible weaknesses,” Churchland said. “I may be wrong, but I tend to see this as a milestone.”

Data scientist Herbert Roitblat challenged the paper’s position that simple learning mechanisms and trial-and-error experience are enough to develop the abilities associated with intelligence. Roitblat argued that the theories presented in the paper face several challenges when it comes to implementing them in real life.

“If there are no time constraints, then trial and error learning might be enough, but otherwise we have the problem of an infinite number of monkeys typing for an infinite amount of time,” Roitblat said. The infinite monkey theorem states that a monkey hitting random keys on a typewriter for an infinite amount of time may eventually type any given text.

Roitblat is the author of Algorithms are Not Enough, in which he explains why all current AI algorithms, including reinforcement learning, require careful formulation of the problem and representations created by humans.

“Once the model and its intrinsic representation are set up, optimization or reinforcement could guide its evolution, but that does not mean that reinforcement is enough,” Roitblat said.

In the same vein, Roitblat added that the paper does not make any suggestions on how the reward, actions, and other elements of reinforcement learning are defined.

“Reinforcement learning assumes that the agent has a finite set of potential actions. A reward signal and value function have been specified. In other words, the problem of general intelligence is precisely to contribute those things that reinforcement learning requires as a pre-requisite,” Roitblat said. “So, if machine learning can all be reduced to some form of optimization to maximize some evaluative measure, then it must be true that reinforcement learning is relevant, but it is not very explanatory.”
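Roitblat’s list of prerequisites maps onto the formal specification a reinforcement learning problem needs before any learning can start. As a sketch (all names here are illustrative, not from any real system), the pieces a human designer must supply up front look like this:

```python
from dataclasses import dataclass
from typing import Callable, List

# The elements that must be specified *before* reinforcement learning can run
# at all -- Roitblat's point is that supplying these is itself the hard part.
@dataclass
class RLProblemSpec:
    states: List[str]                    # state representation, chosen by humans
    actions: List[str]                   # finite set of potential actions
    reward: Callable[[str, str], float]  # reward signal: (state, action) -> float
    gamma: float                         # discount factor defining the value function

spec = RLProblemSpec(
    states=["hungry", "fed"],
    actions=["forage", "rest"],
    reward=lambda s, a: 1.0 if (s, a) == ("hungry", "forage") else 0.0,
    gamma=0.9,
)
print(spec.reward("hungry", "forage"))  # 1.0
```

Everything in this specification is authored by a human, which is exactly the gap Roitblat argues reward maximization alone does not close.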

Ben Dickson is a software engineer and the founder of TechTalks. He writes about technology, business, and politics. 

This story originally appeared on Bdtechtalks.com. Copyright 2021


Categories
AI

Google used reinforcement learning to design next-gen AI accelerator chips


In a preprint paper published a year ago, scientists at Google Research including Google AI lead Jeff Dean described an AI-based approach to chip design that could learn from past experience and improve over time, becoming better at generating architectures for unseen components. They claimed it completed designs in under six hours on average, which is significantly faster than the weeks it takes human experts in the loop.

While the work wasn’t entirely novel — it built upon a technique Google engineers proposed in a paper published in March 2020 — it advanced the state of the art in that it implied the placement of on-chip transistors can be largely automated. Now, in a paper published in the journal Nature, the original team of Google researchers claim they’ve fine-tuned the technique to design an upcoming, previously unannounced generation of Google’s tensor processing units (TPU), application-specific integrated circuits (ASICs) developed specifically to accelerate AI.

If made publicly available, the Google researchers’ technique could enable cash-strapped startups to develop their own chips for AI and other specialized purposes. Moreover, it could help to shorten the chip design cycle to allow hardware to better adapt to rapidly evolving research.

“Basically, right now in the design process, you have design tools that can help do some layout, but you have human placement and routing experts work with those design tools to kind of iterate many, many times over,” Dean told VentureBeat in a previous interview. “It’s a multi-week process to actually go from the design you want to actually having it physically laid out on a chip with the right constraints in area and power and wire length and meeting all the design roles or whatever fabrication process you’re doing. We can essentially have a machine learning model that learns to play the game of [component] placement for a particular chip.”

AI chip design

A computer chip is divided into dozens of blocks, each of which is an individual module, such as a memory subsystem, compute unit, or control logic system. These wire-connected blocks can be described by a netlist, a graph of circuit components like memory components and standard cells including logic gates (e.g., NAND, NOR, and XOR). Chip “floorplanning” involves placing netlists onto two-dimensional grids called canvases so that performance metrics like power consumption, timing, area, and wirelength are optimized while adhering to constraints on density and routing congestion.
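To make the wirelength part of that objective concrete, here is a toy sketch (not Google’s code; the netlist and coordinates are invented) of half-perimeter wirelength (HPWL), a standard cheap proxy for wirelength in placement:

```python
# Each block has an (x, y) placement on the canvas; each net connects a set of blocks.
placement = {
    "mem0": (0, 0), "alu0": (3, 1), "ctrl": (1, 4), "io0": (5, 5),
}
netlist = [
    {"mem0", "alu0"},          # memory-to-compute net
    {"alu0", "ctrl", "io0"},   # multi-pin net
]

def hpwl(net, placement):
    """Half-perimeter wirelength: half the perimeter of the bounding box
    enclosing all pins of the net -- a cheap, standard wirelength proxy."""
    xs = [placement[b][0] for b in net]
    ys = [placement[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

total = sum(hpwl(net, placement) for net in netlist)
print(total)  # 4 + 8 = 12 for this toy placement
```

A floorplanner searches over placements to drive this total down while respecting the density and congestion constraints described above.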

Since the 1960s, many automated approaches to chip floorplanning have been proposed, but none has achieved human-level performance. Moreover, the exponential growth in chip complexity has rendered these techniques unusable on modern chips. Human chip designers must instead iterate for months with electronic design automation (EDA) tools, taking a register transfer level (RTL) description of the chip netlist and generating a manual placement of that netlist onto the chip canvas. On the basis of this feedback, which can take up to 72 hours, the designer either concludes that the design criteria have been achieved or provides feedback to upstream RTL designers, who then modify low-level code to make the placement task easier.

The Google team’s solution is a reinforcement learning method capable of generalizing across chips, meaning that it can learn from experience to become both better and faster at placing new chips.

Gaming the system

Training AI-driven design systems that generalize across chips is challenging because it requires learning to optimize the placement of all possible chip netlists onto all possible canvases. In point of fact, chip floorplanning is analogous to a game with various pieces (e.g., netlist topologies, macro counts, macro sizes and aspect ratios), boards (canvas sizes and aspect ratios), and win conditions (the relative importance of different evaluation metrics or different density and routing congestion constraints). Even one instance of this “game” — placing a particular netlist onto a particular canvas — has more possible moves than the Chinese board game Go.

The researchers’ system aims to place a “netlist” graph of logic gates, memory, and more onto a chip canvas, such that the design optimizes power, performance, and area (PPA) while adhering to constraints on placement density and routing congestion. The graphs range in size from millions to billions of nodes grouped in thousands of clusters, and typically, evaluating the target metrics takes from hours to over a day.

Starting with an empty chip, the Google team’s system places components sequentially until it completes the netlist. To guide the system in selecting which components to place first, components are sorted by descending size; placing larger components first reduces the chance that no feasible placement will remain for them later.
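The size-sorted, sequential placement idea can be sketched as follows. This toy first-fit scanner is only an illustration of the ordering (the component sizes and grid are invented, and the real system chooses locations with a learned policy rather than a greedy scan):

```python
# Place rectangular components on a small grid, largest first, scanning
# for the first free region that fits.
GRID_W, GRID_H = 8, 8
occupied = [[False] * GRID_W for _ in range(GRID_H)]

components = {"macro_a": (4, 3), "macro_b": (2, 2), "cell_c": (1, 1)}  # name -> (w, h)

def fits(x, y, w, h):
    if x + w > GRID_W or y + h > GRID_H:
        return False
    return all(not occupied[y + dy][x + dx] for dy in range(h) for dx in range(w))

def place(w, h):
    for y in range(GRID_H):
        for x in range(GRID_W):
            if fits(x, y, w, h):
                for dy in range(h):
                    for dx in range(w):
                        occupied[y + dy][x + dx] = True
                return (x, y)
    return None  # no feasible placement left

# Sort by area, descending: placing big blocks first keeps options open.
layout = {}
for name, (w, h) in sorted(components.items(),
                           key=lambda kv: kv[1][0] * kv[1][1], reverse=True):
    layout[name] = place(w, h)
print(layout)
```

Reversing the order makes failures more likely on crowded canvases: small cells scattered first can fragment the free space so that no contiguous region remains for a large macro.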


Above: Macro placements of Ariane, an open source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement.

Image Credit: Google

Training the system required creating a dataset of 10,000 chip placements, where the input is the state associated with the given placement and the label is the reward for the placement (i.e., wirelength and congestion). The researchers built it by first picking five different chip netlists, to which an AI algorithm was applied to create 2,000 diverse placements for each netlist.
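The shape of that dataset (5 netlists with 2,000 placements each, for 10,000 labeled examples) can be sketched as below; the placement representation and the labeling function here are invented stand-ins, since the real labels combined wirelength and congestion:

```python
import random

random.seed(0)

# 5 netlists x 2,000 placements each = 10,000 (input, label) examples.
def random_placement():
    # Stand-in state: random (x, y) positions for a handful of blocks.
    return [(random.random(), random.random()) for _ in range(4)]

def label(placement):
    # Fake stand-in reward; the real label combined wirelength and congestion.
    return -sum(x + y for x, y in placement)

dataset = [
    (netlist_id, p, label(p))
    for netlist_id in range(5)                            # 5 chip netlists
    for p in (random_placement() for _ in range(2000))    # 2,000 placements each
]
print(len(dataset))  # 10000
```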

The system took 48 hours to “pre-train” on an Nvidia Volta graphics card and 10 CPUs, each with 2GB of RAM. Fine-tuning initially took up to 6 hours, but applying the pre-trained system to a new netlist without fine-tuning generated placement in less than a second on a single GPU in later benchmarks.

In one test, the Google researchers compared their system’s recommendations with a manual baseline: the production design of a previous-generation TPU chip created by Google’s TPU physical design team. Both the system and the human experts consistently generated viable placements that met timing and congestion requirements, but the AI system also outperformed or matched manual placements in area, power, and wirelength while taking far less time to meet design criteria.

Future work

Google says that its system’s ability to generalize and generate “high-quality” solutions has “major implications,” unlocking opportunities for co-optimization with earlier stages of the chip design process. Large-scale architectural explorations were previously impossible because it took months of effort to evaluate a given architectural candidate. However, modifying a chip’s design can have an outsized impact on performance, the Google team notes, and might lay the groundwork for full automation of the chip design process.

Moreover, because the Google team’s system simply learns to map the nodes of a graph onto a set of resources, it might be applicable to a range of applications including city planning, vaccine testing and distribution, and cerebral cortex mapping. “[While] our method has been used in production to design the next generation of Google TPU … [we] believe that [it] can be applied to impactful placement problems beyond chip design,” the researchers wrote in the paper.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Repost: Original Source and Author Link

Categories
AI

Reinforcement learning competition pushes the boundaries of embodied AI

Join Transform 2021 this July 12-16. Register for the AI event of the year.


Since the early decades of artificial intelligence, humanoid robots have been a staple of sci-fi books, movies, and cartoons. Yet after decades of research and development in AI, we still have nothing that comes close to The Jetsons’ Rosey the Robot.

This is because many of our intuitive planning and motor skills — things that we take for granted — are a lot more complicated than we think. Navigating unknown areas, finding and picking up objects, choosing routes, and planning tasks are complicated feats that we only appreciate when we try to turn them into computer programs.

Developing robots that can physically sense the world and interact with their environment falls in the realm of embodied artificial intelligence, one of the long-sought goals of AI scientists. And even though progress in the field is still a far shot from the capabilities of humans and animals, the achievements are remarkable.

In a recent development in embodied AI, scientists at IBM, the Massachusetts Institute of Technology, and Stanford University developed a new challenge that will help assess AI agents’ ability to find paths, interact with objects, and plan tasks efficiently. Titled ThreeDWorld Transport Challenge, the test is a virtual environment that will be presented at the Embodied AI Workshop during the Conference on Computer Vision and Pattern Recognition, held online in June.

No current AI techniques come close to solving the TDW Transport Challenge. But the results of the competition can help uncover new directions for the future of embodied AI and robotics research.

Reinforcement learning in virtual environments

At the heart of most robotics applications is reinforcement learning, a branch of machine learning based on actions, states, and rewards. A reinforcement learning agent is given a set of actions it can apply to its environment to obtain rewards or reach a certain goal. These actions create changes to the state of the agent and the environment. The RL agent receives rewards based on how its actions bring it closer to its goal.

RL agents usually start by knowing nothing about their environment and selecting random actions. As they gradually receive feedback from their environment, they learn sequences of actions that can maximize their rewards.
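To make the loop of states, actions, and rewards concrete, here is a toy tabular Q-learning sketch. It is illustrative only; the corridor environment, reward scheme, and hyperparameters are invented for this example, not taken from any system described in this article.

```python
import random

random.seed(0)

# Toy environment: a 5-cell corridor. The agent starts in cell 0 and is
# rewarded only when it reaches cell 4, so it must learn a sequence of
# actions by trial and error.
N_STATES, ACTIONS = 5, ("left", "right")

def step(state, action):
    nxt = max(state - 1, 0) if action == "left" else min(state + 1, N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1  # next state, reward, done

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for _ in range(200):
    state, done = 0, False
    while not done:
        # Mostly exploit the best-known action, sometimes explore at random
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        nxt, reward, done = step(state, action)
        target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = nxt

# The learned greedy policy heads right, toward the rewarding cell
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)}
```

The agent begins with all action values at zero (knowing nothing), and feedback from the environment gradually shapes its policy toward the reward.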

This scheme is used not only in robotics, but in many other applications, such as self-driving cars and content recommendations. Reinforcement learning has also helped researchers master complicated games such as Go, StarCraft 2, and DOTA.

Creating reinforcement learning models presents several challenges. One of them is designing the right set of states, rewards, and actions, which can be very difficult in applications like robotics, where agents face a continuous environment affected by complicated factors such as gravity, wind, and physical interactions with other objects. This is in contrast to environments like chess and Go, which have discrete, well-defined states and actions.

Another challenge is gathering training data. Reinforcement learning agents need to train using data from millions of episodes of interactions with their environments. This constraint can slow down robotics applications because they must gather their data from the physical world, as opposed to video and board games, which can be played in rapid succession on several computers.

To overcome this barrier, AI researchers have tried to create simulated environments for reinforcement learning applications. Today, self-driving cars and robotics often use simulated environments as a major part of their training regime.

“Training models using real robots can be expensive and sometimes involve safety considerations,” Chuang Gan, principal research staff member at the MIT-IBM Watson AI Lab, told TechTalks. “As a result, there has been a trend toward incorporating simulators, like what the TDW-Transport Challenge provides, to train and evaluate AI algorithms.”

But replicating the exact dynamics of the physical world is extremely difficult, and most simulated environments are a rough approximation of what a reinforcement learning agent would face in the real world. To address this limitation, the TDW Transport Challenge team has gone to great lengths to make the test environment as realistic as possible.

The environment is built on top of the ThreeDWorld platform, which the authors describe as “a general-purpose virtual world simulation platform supporting both near-photo realistic image rendering, physically based sound rendering, and realistic physical interactions between objects and agents.”

“We aimed to use a more advanced physical virtual environment simulator to define a new embodied AI task requiring an agent to change the states of multiple objects under realistic physical constraints,” the researchers write in an accompanying paper.

Task and motion planning

Reinforcement learning tests have different degrees of difficulty. Most current tests involve navigation tasks, where an RL agent must find its way through a virtual environment based on visual and audio input.

The TDW Transport Challenge, on the other hand, pits reinforcement learning agents against “task and motion planning” (TAMP) problems. TAMP requires the agent not only to find optimal movement paths but also to change the state of objects to achieve its goal.

The challenge takes place in a multi-roomed house adorned with furniture, objects, and containers. The reinforcement learning agent views the environment from a first-person perspective and must find one or several objects from the rooms and gather them at a specified destination. The agent is a two-armed robot, so it can only carry two objects at a time. Alternatively, it can use a container to carry several objects and reduce the number of trips it has to make.

At every step, the RL agent can choose one of several actions, such as turning, moving forward, or picking up an object. The agent receives a reward if it accomplishes the transfer task within a limited number of steps.

While this seems like the kind of problem any child could solve without much training, it is indeed a complicated task for current AI systems. The reinforcement learning program must find the right balance between exploring the rooms, finding optimal paths to the destination, choosing between carrying objects alone or in containers, and doing all this within the designated step budget.

“Through the TDW-Transport Challenge, we’re proposing a new embodied AI challenge,” Gan said. “Specifically, a robotic agent must take actions to move and change the state of a large number of objects in a photo- and physically realistic virtual environment, which remains a complex goal in robotics.”

Abstracting challenges for AI agents

Above: In the ThreeDWorld Transport Challenge, the AI agent can see the world through color, depth, and segmentation maps.

While TDW is a very complex simulated environment, the designers have still abstracted away some of the challenges robots would face in the real world. The virtual robot agent, dubbed Magnebot, has two arms with nine degrees of freedom and joints at the shoulder, elbow, and wrist. However, the robot’s hands are magnets that can pick up any object without grasping it with fingers, which is itself a very challenging robotics task.

The agent also perceives the environment in three different ways: as an RGB-colored frame, a depth map, and a segmentation map that shows each object separately in hard colors. The depth and segmentation maps make it easier for the AI agent to read the dimensions of the scene and tell the objects apart when viewing them from awkward angles.

To avoid confusion, the problems are posed in a simple structure (e.g., “vase:2, bowl:2, jug:1; bed”) rather than as loose language commands (e.g., “Grab two bowls, a couple of vases, and the jug in the bedroom, and put them all on the bed”).
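A minimal parser for that structured goal format might look like the sketch below. The format is inferred from the single example above; the function name and return shape are hypothetical, not part of the challenge’s official API.

```python
def parse_task(spec):
    """Parse a goal spec like 'vase:2, bowl:2, jug:1; bed' into a dict of
    target object counts and a destination string. The format is inferred
    from the example in the text, not from the challenge's actual API."""
    items_part, destination = spec.split(";")
    targets = {}
    for chunk in items_part.split(","):
        name, count = chunk.strip().split(":")
        targets[name] = int(count)
    return targets, destination.strip()

targets, dest = parse_task("vase:2, bowl:2, jug:1; bed")
# targets == {"vase": 2, "bowl": 2, "jug": 1}, dest == "bed"
```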

And to simplify the state and action space, the researchers have limited the Magnebot’s navigation to 25-centimeter movements and 15-degree rotations.

These simplifications enable developers to focus on the navigation and task-planning problems AI agents must overcome in the TDW environment.

Gan told TechTalks that despite the levels of abstraction introduced in TDW, the robot still needs to address the following challenges:

  • The synergy between navigation and interaction: The agent cannot move to grasp an object if this object is not in the egocentric view, or if the direct path to it is obstructed.
  • Physics-aware interaction: Grasping might fail if the agent’s arm cannot reach an object.
  • Physics-aware navigation: Collision with obstacles might cause objects to be dropped and significantly impede transport efficiency.

This highlights the complexity of human vision and agency. The next time you go to a supermarket, consider how easily you can find your way through aisles, tell the difference between different products, reach for and pick up different items, place them in your basket or cart, and choose your path in an efficient way. And you’re doing all this without access to segmentation and depth maps and by reading items from a crumpled handwritten note in your pocket.

Pure deep reinforcement learning is not enough

Above: Experiments show hybrid AI models that combine reinforcement learning with symbolic planners are better suited to solving the ThreeDWorld Transport Challenge.

The TDW-Transport Challenge is in the process of accepting submissions. In the meantime, the authors of the paper have already tested the environment with several known reinforcement learning techniques. Their findings show that pure reinforcement learning is very poor at solving task and motion planning challenges. A pure reinforcement learning approach requires the AI agent to develop its behavior from scratch, starting with random actions and gradually refining its policy to meet the goals in the specified number of steps.

According to the researchers’ experiments, pure reinforcement learning approaches barely exceeded a 10% success rate in the TDW tests.

“We believe this reflects the complexity of physical interaction and the large exploration search space of our benchmark,” the researchers wrote. “Compared to the previous point-goal navigation and semantic navigation tasks, where the agent only needs to navigate to specific coordinates or objects in the scene, the ThreeDWorld Transport challenge requires agents to move and change the objects’ physical state in the environment (i.e., task-and-motion planning), which the end-to-end models might fall short on.”

When the researchers tried hybrid AI models, where a reinforcement learning agent was combined with a rule-based high-level planner, they saw a considerable boost in the system’s performance.

“This environment can be used to train RL models, which fall short on these types of tasks and require explicit reasoning and planning abilities,” Gan said. “Through the TDW-Transport Challenge, we hope to demonstrate that a neuro-symbolic, hybrid model can improve this issue and demonstrate a stronger performance.”

The problem, however, remains largely unsolved, and even the best-performing hybrid systems had around 50% success rates. “Our proposed task is very challenging and could be used as a benchmark to track the progress of embodied AI in physically realistic scenes,” the researchers wrote.

Mobile robots are becoming a hot area of research and applications. According to Gan, several manufacturing and smart factories have already expressed interest in using the TDW environment for their real-world applications. It will be interesting to see whether the TDW Transport Challenge will help usher new innovations into the field.

“We’re hopeful the TDW-Transport Challenge can help advance research around assistive robotic agents in warehouses and home settings,” Gan said.

This story originally appeared on Bdtechtalks.com. Copyright 2021



Reinforcement learning: The next great AI tech moving from the lab to the real world



Reinforcement learning (RL) is a powerful type of artificial intelligence technology that can be used to learn strategies to optimally control large, complex systems such as manufacturing plants, traffic control systems (road/train/aircraft), financial portfolios, robots, etc. It is currently transitioning from research labs to highly impactful, real world applications. For example, self-driving car companies like Wayve and Waymo are using reinforcement learning to develop the control systems for their cars. 

AI systems that are typically used in industry perform pattern recognition to make a prediction. For instance, they may recognize patterns in images to detect faces (face detection), or recognize patterns in sales data to predict a change in demand (demand forecasting), and so on. Reinforcement learning methods, on the other hand, are used to make optimal decisions or take optimal actions in applications where there is a feedback loop. An example where both traditional AI methods and RL may be used, but for different purposes, will make the distinction clearer.

Say we are using AI to help operate a manufacturing plant. Pattern recognition may be used for quality assurance, where the AI system uses images and scans of the finished product to detect any imperfections or flaws. An RL system, on the other hand, would compute and execute the strategy for controlling the manufacturing process itself (by, for example, deciding which lines to run, controlling machines/robots, deciding which product to manufacture, and so on). The RL system will also try to ensure that the strategy is optimal in that it maximizes some metric of interest — such as the output volume — while maintaining a certain level of product quality. The problem of computing the optimal control strategy, which RL solves, is very difficult for some subtle reasons (often much more difficult than pattern recognition).

In computing the optimal strategy, or policy in RL parlance, the main challenge an RL learning algorithm faces is the so-called “temporal credit assignment” problem. That is, the impact of an action (e.g. “run line 1 on Wednesday”) in a given system state (e.g. “current output level of machines, how busy each line is,” etc.) on the overall performance (e.g. “total output volume”) is not known until after (potentially) a long time. To make matters worse, the overall performance also depends on all the actions that are taken subsequent to the action being evaluated. Together, this implies that, when a candidate policy is executed for evaluation, it is difficult to know which actions were the good ones and which were the bad ones — in other words, it is very difficult to assign credit to the different actions appropriately. The large number of potential system states in these complex problems further exacerbates the situation via the dreaded “curse of dimensionality.”  A good way to get an intuition for how an RL system solves all these problems at the same time is by looking at the recent spectacular successes they have had in the lab.
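One standard way RL algorithms tackle temporal credit assignment is to propagate a discounted sum of future rewards back to each action, so that an action is credited with rewards that only arrive much later. A minimal sketch of this bookkeeping:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for each time step, so that
    early actions receive (discounted) credit for rewards that arrive later."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# A reward that arrives only at the end still credits earlier actions:
discounted_returns([0.0, 0.0, 1.0], gamma=0.5)  # [0.25, 0.5, 1.0]
```

The discount factor gamma controls how far back credit flows: with gamma close to 1, early actions are credited almost fully for distant rewards; with small gamma, credit fades quickly.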

Many of the recent, prominent demonstrations of the power of RL come from applying it to board games and video games. The first RL system to impress the global AI community learned to outplay humans in different Atari games when given only the on-screen images and the game scores as input. It was created in 2013 by London-based AI research lab DeepMind (now part of Alphabet Inc.). The same lab later created a series of RL systems, or agents, starting with AlphaGo, that were able to defeat the world’s top players in the board game Go. These impressive feats, which occurred between 2015 and 2017, took the world by storm because Go is a very complex game, with millions of fans and players around the world, that requires intricate, long-term strategic thinking involving both local and global board configurations.

Subsequently, DeepMind and the AI research lab OpenAI released systems capable of defeating top human players at the video games StarCraft and DOTA 2. These games are challenging because they require strategic thinking, resource management, and control and coordination of multiple entities within the game.

All the agents mentioned above were trained by letting the RL algorithm play the games a very large number of times (millions or more) and learning which policies work and which do not against different kinds of opponents and players. The large number of trials was possible because these were all games running on a computer. In determining the usefulness of various policies, the RL algorithms often employed a complex mix of ideas: hill climbing in policy space, self-play, running internal leagues among candidate policies, using human policies as a starting point, and properly balancing exploration of the policy space against exploitation of the good policies found so far. Roughly speaking, the large number of trials enabled exploring the many different game states that could plausibly be reached, while the complex evaluation methods enabled the AI system to determine which actions are useful in the long term, under plausible plays of the games, in these different states.

A key blocker in using these algorithms in the real world is that it is not possible to run millions of trials. Fortunately, a workaround immediately suggests itself: First, create a computer simulation of the application (a manufacturing plant simulation, or market simulation etc.), then learn the optimal policy in the simulation using RL algorithms, and finally adapt the learned optimal policy to the real world by running it a few times and tweaking some parameters. Famously, in a very compelling 2019 demo, OpenAI showed the effectiveness of this approach by training a robot arm to solve the Rubik’s cube puzzle one-handed.

For this approach to work, your simulation has to represent the underlying problem with a high degree of accuracy. The problem you’re trying to solve also has to be “closed” in a certain sense — there cannot be arbitrary or unseen external effects that may impact the performance of the system. For example, the OpenAI solution would not work if the simulated robot arm was too different from the real robot arm or if there were attempts to knock the Rubik’s cube out of the real robot arm (though it may naturally be — or be explicitly trained to be — robust to certain kinds of obstructions and interferences).

These limitations will sound acceptable to most people. However, in real applications it is tricky to properly circumscribe the competence of an RL system, and this can lead to unpleasant surprises. In our earlier manufacturing plant example, if a machine is replaced with one that is a lot faster or slower, it may change the plant dynamics enough that it becomes necessary to retrain the RL system. Again, this is not unreasonable for any automated controller, but stakeholders may have far loftier expectations from a system that is artificially intelligent, and such expectations will need to be managed.

Regardless, at this point in time, the future of reinforcement learning in the real world does seem very bright. There are many startups offering reinforcement learning products for controlling manufacturing robots (Covariant, Osaro, Luffy), managing production schedules (Instadeep), enterprise decision making (Secondmind), logistics (Dorabot), circuit design (Instadeep), controlling autonomous cars (Wayve, Waymo, Five AI), controlling drones (Amazon), running hedge funds (Piit.ai), and many other applications that are beyond the reach of pattern recognition based AI systems.

Each of the Big Tech companies has made heavy investments in RL research; Google, for example, acquired DeepMind for a reported £400 million (approximately $525 million) in 2014. So it is reasonable to assume that RL is either already in use internally at these companies or is in the pipeline, but they are keeping the details quiet for reasons of competitive advantage.

We should expect to see some hiccups as promising applications for RL falter, but it will likely claim its place as a technology to reckon with in the near future.

M M Hassan Mahmud is a Senior AI and Machine Learning Technologist at Digital Catapult, with a background in machine learning within academia and industry.



Amazon launches reinforcement learning tools to manage robots’ workflows



Amazon today launched SageMaker Reinforcement Learning (RL) Kubeflow Components, a toolkit supporting the company’s AWS RoboMaker service for orchestrating robotics workflows. Amazon says that the goal is to make it faster to experiment and manage robotics workloads from perception to controls and optimization, and to create end-to-end solutions without having to rebuild them each time.

Robots are being used more widely for purposes that are increasing in sophistication, like assembly, picking and packing, last-mile delivery, environmental monitoring, search and rescue, and assisted surgery. In China, Oxford Economics anticipates 12.5 million manufacturing jobs will become automated, while in the U.S., McKinsey projects that machines will take upwards of 30% of such jobs. As for reinforcement learning, it’s an emerging AI technique that can help develop solutions for the kinds of problems that are increasingly cropping up in robotics.

SageMaker RL builds on top of Amazon’s SageMaker machine learning service, adding prepackaged toolkits designed to integrate with simulation environments. With SageMaker RL Components for Kubernetes, customers can use the components in their pipelines to invoke and parallelize SageMaker training jobs and RoboMaker simulation jobs as steps in a reinforcement learning training workflow, without having to worry about how it runs under the hood, according to Amazon.

Amazon AWS RoboMaker

Above: Ripley, Woodside’s robotics platform, takes advantage of reinforcement learning to perform manipulation tasks.

Image Credit: Amazon

Running the SageMaker RL Kubeflow Components requires an existing or new Kubernetes cluster. Customers also must install Kubeflow Pipelines on the cluster and set up identity and access management roles and permissions for SageMaker and RoboMaker, according to Amazon. The company provided step-by-step instructions to create the pipeline in a blog post.

Woodside Energy tapped RoboMaker with SageMaker Kubeflow operators to train, tune, and deploy reinforcement learning models to their robots to perform repetitive and dangerous manipulation tasks. The company engaged Australia-based consultancy Max Kelsen to assist in the development and contribution of the RoboMaker components. For example, Ripley, a robotics platform built by Woodside, was trained to perform a “double block and bleed,” a manual pump shutdown procedure that involves turning multiple valves in sequence. A reinforcement learning formulation created with RoboMaker and SageMaker uses joint states and camera views as inputs to a model that outputs optimal trajectories for manipulating the valves.

“Our team and our partners wanted to start exploring using machine learning methods for robotics manipulation,” Woodside robotics engineer Kyle Saltmarsh said in a press release. “Before we could do this effectively, we needed a framework that would allow us to train, test, tune, and deploy these models efficiently. Utilizing Kubeflow components and pipelines with SageMaker and RoboMaker provides us with this framework and we are excited to have our roboticists and data scientists focus their efforts and time on algorithms and implementation.”



Most ads you see are chosen by a reinforcement learning model — here’s how it works

Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. And they all want to answer the same question: Which of the many ads they have in their catalog is more likely to appeal to a certain viewer? Finding the right answer to this question can have a huge impact on revenue when you are dealing with hundreds of websites, thousands of ads, and millions of visitors.

Fortunately (for the ad agencies, at least), reinforcement learning, the branch of artificial intelligence that has become renowned for mastering board and video games, provides a solution. Reinforcement learning models seek to maximize rewards. In the case of online ads, the RL model will try to find the ads that users are most likely to click on.

The digital ad industry generates hundreds of billions of dollars every year and provides an interesting case study of the powers of reinforcement learning.

Naïve A/B/n testing

To better understand how reinforcement learning optimizes ads, consider a very simple scenario: You’re the owner of a news website. To pay for the costs of hosting and staff, you have entered a contract with a company to run their ads on your website. The company has provided you with five different ads and will pay you one dollar every time a visitor clicks on one of the ads.

Your first goal is to find the ad that generates the most clicks. In advertising lingo, you will want to maximize your click-through rate (CTR). The CTR is the ratio of clicks to the number of times ads are displayed, also called impressions. For instance, if 1,000 ad impressions earn you three clicks, your CTR is 3 / 1,000 = 0.003, or 0.3%.
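The CTR arithmetic is simple enough to state as a one-line helper (illustrative only):

```python
def ctr(clicks, impressions):
    # Click-through rate: clicks divided by impressions
    return clicks / impressions

ctr(3, 1000)  # 0.003, i.e. 0.3%
```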

Before we solve the problem with reinforcement learning, let’s discuss A/B testing, the standard technique for comparing the performance of two competing solutions (A and B) such as different webpage layouts, product recommendations, or ads. When you’re dealing with more than two alternatives, it is called A/B/n testing.


In A/B/n testing, the experiment’s subjects are randomly divided into separate groups and each is provided with one of the available solutions. In our case, this means that we will randomly show one of the five ads to each new visitor of our website and evaluate the results.

Say we run our A/B/n test for 100,000 iterations, roughly 20,000 impressions per ad. Here are the clicks-over-impressions ratios of our ads:

Ad 1: 80/20,000 = 0.40% CTR

Ad 2: 70/20,000 = 0.35% CTR

Ad 3: 90/20,000 = 0.45% CTR

Ad 4: 62/20,000 = 0.31% CTR

Ad 5: 50/20,000 = 0.25% CTR

Our 100,000 ad impressions generated $352 in revenue, with an average CTR of 0.35%. More importantly, we found that ad number 3 performs better than the others, and we will continue to use it for the rest of our viewers. Had we shown only the worst-performing ad (ad number 5), our revenue would have been $250; with only the best-performing ad (ad number 3), it would have been $450. So our A/B/n test earned us roughly the average of the minimum and maximum revenue, and yielded the very valuable knowledge of each ad’s CTR.
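The A/B/n experiment above can be reproduced with a short simulation. The "true" click probabilities below are the ones assumed in this running example, not real ad data:

```python
import random

random.seed(42)

# True click probabilities assumed in the example above (unknown in practice)
TRUE_CTR = [0.004, 0.0035, 0.0045, 0.0031, 0.0025]

def abn_test(impressions_per_ad=20_000):
    """Show each ad the same number of times and tally the clicks."""
    clicks = [0] * len(TRUE_CTR)
    for ad, p in enumerate(TRUE_CTR):
        for _ in range(impressions_per_ad):
            if random.random() < p:
                clicks[ad] += 1
    return clicks

clicks = abn_test()
best_ad = max(range(len(clicks)), key=lambda i: clicks[i])
revenue = sum(clicks)  # at $1 per click
```

Because every ad gets an equal 20,000-impression share regardless of how it is performing, the test "pays" for the information it gathers with impressions spent on the weaker ads.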

Digital ads have very low conversion rates. In our example, there is a difference of just 0.2 percentage points between our best- and worst-performing ads. But this difference can have a significant impact at scale. At 1,000 impressions, ad number 3 will generate an extra $2 in comparison to ad number 5. At a million impressions, this difference becomes $2,000. When you are running billions of ads, a subtle 0.2 points can have a huge impact on revenue.

Therefore, finding these subtle differences is very important in ad optimization. The problem with A/B/n testing is that it is not very efficient at finding these differences. It treats all ads equally and you need to run each ad tens of thousands of times until you discover their differences at a reliable confidence level. This can result in lost revenue, especially when you have a larger catalog of ads.

Another problem with classic A/B/n testing is that it is static. Once you find the optimal ad, you will have to stick to it. If the environment changes due to a new factor (seasonality, news trends, etc.) and causes one of the other ads to have a potentially higher CTR, you won’t find out unless you run the A/B/n test all over again.

What if we could change A/B/n testing to make it more efficient and dynamic?

This is where reinforcement learning comes into play. A reinforcement learning agent starts by knowing nothing about its environment’s actions, rewards, and penalties. The agent must find a way to maximize its rewards.

In our case, the RL agent’s actions are one of five ads to display. The RL agent will receive a reward point every time a user clicks on an ad. It must find a way to maximize ad clicks.
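A minimal sketch of such an agent is an epsilon-greedy strategy: most of the time it shows the ad with the best observed CTR, and occasionally it explores another ad at random. The click probabilities below are the assumed numbers from our running example, which the agent of course does not see:

```python
import random

random.seed(0)

TRUE_CTR = [0.004, 0.0035, 0.0045, 0.0031, 0.0025]  # hidden from the agent
EPSILON = 0.1  # fraction of the time spent exploring

clicks = [0] * 5       # clicks observed per ad
impressions = [0] * 5  # times each ad has been shown

def choose_ad():
    """Mostly exploit the best observed CTR; occasionally explore at random."""
    if random.random() < EPSILON:
        return random.randrange(5)
    return max(range(5),
               key=lambda i: clicks[i] / impressions[i] if impressions[i] else 0.0)

for _ in range(100_000):
    ad = choose_ad()
    impressions[ad] += 1
    if random.random() < TRUE_CTR[ad]:  # simulated user click
        clicks[ad] += 1

# Unlike A/B/n testing, which splits impressions evenly, the agent
# concentrates impressions on the ads it currently believes are best.
```

The estimates are noisy early on, so the agent's favorite can change as evidence accumulates; the exploration fraction is what keeps it from locking onto a mediocre ad forever.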

The multi-armed bandit
