Chip developer Cerebras bolsters AI-powered workload capabilities with $250M

Cerebras Systems, the California-based company that has built a “brain-scale” chip to power AI models with 120 trillion parameters, said today it has raised $250 million in funding at a valuation of over $4 billion. Cerebras claims its technology dramatically reduces the time required for today’s AI workloads while using a fraction of the power and space. It also claims its innovations will support the multi-trillion-parameter AI models of the future.

In a press release, the company stated that this additional capital will enable it to further expand its business globally and deploy its industry-leading CS-2 system to new customers, while continuing to bolster its leadership in AI compute.

Cerebras’ cofounder and CEO Andrew Feldman noted that the new funding will allow Cerebras to extend its leadership to new regions. Feldman believes this will aid the company’s mission to democratize AI and usher in what it calls “the next era of high-performance AI compute” — an era in which the company claims its technology will help solve today’s most urgent societal challenges, from drug discovery to climate change and beyond.

Redefining AI-powered possibilities

“Cerebras Systems is redefining what is possible with AI and has demonstrated best in class performance in accelerating the pace of innovation across pharma and life sciences, scientific research, and several other fields,” said Rick Gerson, cofounder, chairman, and chief investment officer at Falcon Edge Capital and Alpha Wave.

“We are proud to partner with Andrew and the Cerebras team to support their mission of bringing high-performance AI compute to new markets and regions around the world,” he added.

[Image: The Cerebras Wafer Scale Engine]

Cerebras’ CS-2 system, powered by the Wafer Scale Engine (WSE-2) — the largest chip ever made and the fastest AI processor to date — is purpose-built for AI work. Feldman told VentureBeat in an interview that in April of this year, the company more than doubled the capacity of the chip, bringing it up to 2.6 trillion transistors, 850,000 AI-optimized cores, 40 GB of on-chip memory, 20 PB/s of memory bandwidth, and 220 Pb/s of fabric bandwidth. He noted that for AI work, big chips process information more quickly and produce answers in less time.

The largest graphics processing unit has only 54 billion transistors, roughly 2.55 trillion fewer than the WSE-2. With 56 times the chip size, 123 times more AI-optimized cores, 1,000 times more high-performance on-chip memory, 12,733 times more memory bandwidth, and 45,833 times more fabric bandwidth than its GPU competitors, the WSE-2 makes the CS-2 system the fastest in the industry, the company claims. Cerebras also says its software is easy to deploy and enables customers to use existing models, tools, and flows without modification, as well as write new ML models in standard open source frameworks.

New customers

Cerebras says its CS-2 system is delivering a massive leap forward for customers across pharma and life sciences, oil and gas, defense, supercomputing centers, national labs, and other industries. The company announced new customers including Argonne National Laboratory, Lawrence Livermore National Laboratory, Pittsburgh Supercomputing Center (PSC) for its groundbreaking Neocortex AI supercomputer, EPCC (the supercomputing center at the University of Edinburgh), Tokyo Electron Devices, GlaxoSmithKline, and AstraZeneca.

The Series F investment round was led by Alpha Wave Ventures, a global growth-stage partnership between Falcon Edge and Chimera, along with the Abu Dhabi Growth Fund (ADG).

Alpha Wave Ventures and ADG join a group of world-class strategic investors including Altimeter Capital, Benchmark Capital, Coatue Management, Eclipse Ventures, Moore Strategic Ventures, and VY Capital. Cerebras has now expanded beyond the U.S., with new offices in Tokyo, Japan, and Toronto, Canada. On the back of this funding, the company says it will continue its engineering work, grow its engineering team, and recruit talent around the world going into 2022.

OpenAI releases Triton, a programming language for AI workload optimization

OpenAI today released Triton, an open source, Python-like programming language that enables researchers to write highly efficient GPU code for AI workloads. Triton makes it possible to reach peak hardware performance with relatively little effort, OpenAI claims, producing code on par with what an expert could achieve in as few as 25 lines.
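
To give a sense of what that looks like in practice, below is a minimal sketch of a vector-addition kernel in the style of Triton’s public tutorials. It assumes PyTorch tensors on an Nvidia GPU, and the function names, block size, and tensor sizes are illustrative; exact API details may vary between Triton releases.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard against out-of-bounds accesses
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n_elements = out.numel()
        # Launch one program instance per block of BLOCK_SIZE elements.
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
        return out

    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    print(torch.allclose(add(x, y), x + y))

The kernel itself is only about a dozen lines of Python-like code; the hardware-level details of vectorization and memory access are left to the compiler.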

Deep neural networks have emerged as an important type of AI model, capable of achieving state-of-the-art performance across natural language processing, computer vision, and other domains. The strength of these models lies in their hierarchical structure, which generates a large amount of highly parallelizable work well-suited for multicore hardware like GPUs. Frameworks for general-purpose GPU computing such as CUDA and OpenCL have made the development of high-performance programs easier in recent years. Yet, GPUs remain especially challenging to optimize, in part because their architectures rapidly evolve.

Domain-specific languages and compilers have emerged to address the problem, but these systems tend to be less flexible and slower than the best handwritten compute kernels available in libraries like cuBLAS, cuDNN, or TensorRT. Writing such kernels by hand, however, requires reasoning about GPU-specific concerns like memory coalescing, shared memory management, and scheduling, which can be challenging even for seasoned programmers. The purpose of Triton, then, is to automate these optimizations so that developers can focus on the high-level logic of their code.

“Novel research ideas in the field of deep learning are generally implemented using a combination of native framework operators … [W]riting specialized GPU kernels [can improve performance,] but [is often] surprisingly difficult due to the many intricacies of GPU programming. And although a variety of systems have recently emerged to make this process easier, we have found them to be either too verbose, lack flexibility, generate code noticeably slower than our hand-tuned baselines,” Philippe Tillet, Triton’s original creator, who now works at OpenAI as a member of the technical staff, wrote in a blog post. “Our researchers have already used [Triton] to produce kernels that are up to 2 times more efficient than equivalent Torch implementations, and we’re excited to work with the community to make GPU programming more accessible to everyone.”

Simplifying code

According to OpenAI, Triton — which has its origins in a 2019 paper submitted to the International Workshop on Machine Learning and Programming Languages — simplifies the development of specialized kernels that can be much faster than those in general-purpose libraries. Its compiler simplifies the code, automatically optimizes and parallelizes it, and converts it into code for execution on recent Nvidia GPUs. (CPUs, AMD GPUs, and platforms other than Linux aren’t currently supported.)

“The main challenge posed by our proposed paradigm is that of work scheduling — i.e., how the work done by each program instance should be partitioned for efficient execution on modern GPUs,” Tillet explains in Triton’s documentation website. “To address this issue, the Triton compiler makes heavy use of block-level data-flow analysis, a technique for scheduling iteration blocks statically based on the control- and data-flow structure of the target program. The resulting system actually works surprisingly well: our compiler manages to apply a broad range of interesting optimization automatically.”
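
As a rough illustration of that blocked model, the sketch below, adapted from the fused-softmax example in Triton’s tutorials, assigns one program instance to each row of a matrix; the names are illustrative, and the snippet assumes each row fits in a single block.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                       n_cols, BLOCK_SIZE: tl.constexpr):
        # One program instance per row: the program id selects the row.
        row_idx = tl.program_id(0)
        col_offsets = tl.arange(0, BLOCK_SIZE)
        mask = col_offsets < n_cols
        row = tl.load(in_ptr + row_idx * in_row_stride + col_offsets,
                      mask=mask, other=-float("inf"))
        # Numerically stable softmax computed over the whole block at once.
        row = row - tl.max(row, axis=0)
        num = tl.exp(row)
        denom = tl.sum(num, axis=0)
        tl.store(out_ptr + row_idx * out_row_stride + col_offsets,
                 num / denom, mask=mask)

    def softmax(x: torch.Tensor) -> torch.Tensor:
        n_rows, n_cols = x.shape
        out = torch.empty_like(x)
        # The block must cover a full row; round its size up to a power of two.
        BLOCK_SIZE = triton.next_power_of_2(n_cols)
        softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0),
                                  n_cols, BLOCK_SIZE=BLOCK_SIZE)
        return out

Here the programmer only decides how the work is split into blocks (one row per program instance); the scheduling that Tillet describes is handled by the compiler.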

The first stable version of Triton, along with tutorials, is available from the project’s GitHub repository.
