Global News 24

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

by admin
June 26, 2024
in Tech


[Image: Illustration of a brain inside of a light bulb.]

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural network operations that are currently accelerated by GPU chips. The findings, detailed in a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. That ability momentarily made Nvidia the most valuable company in the world last week; the company currently holds an estimated 98 percent market share for data center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.
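To see why MatMul dominates, consider that a single dense layer of a transformer is just activations multiplied by a weight matrix. A minimal sketch (sizes are illustrative, not from the paper):

```python
import numpy as np

# One dense layer: activations x times weight matrix W is one MatMul,
# costing roughly 2 * batch * d_in * d_out floating-point operations.
batch, d_in, d_out = 8, 512, 512
x = np.random.default_rng(1).standard_normal((batch, d_in))
W = np.random.default_rng(2).standard_normal((d_in, d_out))

y = x @ W                      # the MatMul a GPU parallelizes
flops = 2 * batch * d_in * d_out

print(y.shape)                 # (8, 512)
print(flops)                   # 4194304 multiply-adds for one small layer
```

Stacking dozens of such layers, each far larger than this toy example, is what makes GPUs with fast parallel multipliers so valuable.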

In the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.


The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage.
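The 38-fold figure follows directly from the article's own numbers, comparing the 500-watt home-PC power supply against the ~13-watt FPGA measurement:

```python
# Back-of-the-envelope check of the article's power comparison.
desktop_psu_watts = 500   # home PC power supply from the example above
fpga_watts = 13           # FPGA power draw reported in the paper
ratio = desktop_psu_watts / fpga_watts
print(round(ratio))       # ~38-fold decrease in power usage
```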

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.

Doing away with matrix math

In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
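The key trick behind ternary weights is that when every weight is −1, 0, or +1, each "multiplication" in a matrix product degenerates into adding, subtracting, or skipping an activation. A minimal sketch of that equivalence (illustrative, not the paper's actual kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                            # activations
W = rng.integers(-1, 2, size=(4, 3)).astype(float)    # ternary weights in {-1, 0, +1}

# Conventional path: a true matrix multiplication.
y_matmul = x @ W

# Multiplication-free path: for each output, accumulate +x where the
# weight is +1 and -x where it is -1; zeros are skipped entirely.
y_addsub = np.array([
    x[W[:, j] == 1].sum() - x[W[:, j] == -1].sum()
    for j in range(W.shape[1])
])

assert np.allclose(y_matmul, y_addsub)   # identical results, no multiplies
```

Hardware that only needs adders (rather than multipliers) is cheaper and far more power-efficient, which is why this maps so well onto FPGAs.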

However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

Advertisement. Scroll to continue reading.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
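The appeal of ternary weights is easy to see in a toy sketch (a hypothetical illustration, not BitNet's or the paper's code): when every weight is -1, 0, or +1, a "matrix multiplication" needs no actual multiplications, because each weight either adds, subtracts, or skips the corresponding input.

```python
import numpy as np

# Hypothetical illustration: a matrix-vector product with ternary
# weights {-1, 0, +1} reduces to additions and subtractions only.
def ternary_matvec(W, x):
    out = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                out[i] += x[j]    # weight +1: add the input
            elif W[i, j] == -1:
                out[i] -= x[j]    # weight -1: subtract the input
            # weight 0: skip the input entirely
    return out

W = np.array([[1, -1, 0],
              [0,  1, 1]])
x = np.array([2.0, 3.0, 5.0])
print(ternary_matvec(W, x))  # same result as W @ x: [-1.  8.]
```

Hardware can implement adds and subtracts far more cheaply than full multiplies, which is the efficiency BitNet exploited and this paper extends.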

However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.

However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. Those limitations motivated the current study, pushing the team to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.
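The intuition behind ternary weights can be sketched in a few lines: when every weight is -1, 0, or +1, a matrix-vector product reduces to additions and subtractions, with no multiplications at all. This is an illustrative sketch of that idea, not the paper's actual implementation:

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product for a ternary weight matrix W (entries in {-1, 0, +1}).

    Instead of multiplying, we add x where W == +1 and subtract it where
    W == -1 -- no multiplications are needed.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check: the add/subtract version matches a regular matmul.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))          # random ternary weights
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)
```

Hardware benefits follow from the same observation: adders are far cheaper than multipliers in silicon, which is why a low-power FPGA can keep up once the multiplications are gone.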

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

Advertisement. Scroll to continue reading.


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

Con the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint a causa di October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights a causa di language models, successfully scaling up to 3 billion parameters while maintaining competitive forma.

However, they note that BitNet still relied acceso matrix multiplications a causa di its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain forma while eliminating matrix multiplications even a causa di the attention mechanism.

ADVERTISEMENT


Illustration of a brain inside of a light bulb.
Enlarge / Illustration of a brain inside of a light bulb.

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural operations that are currently accelerated by GPU chips. The findings, detailed a causa di a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations a causa di parallel. That ability momentarily made Nvidia the most valuable company a causa di the world last week; the company currently holds an estimated 98 percent market share for center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.

Con the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar forma to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per di più second acceso a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

Advertisement

The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, a causa di our experience, you can run a 2.7B parameter version of Llama 2 competently acceso a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM a causa di only 13 watts acceso an FPGA (without a GPU), that would be a 38-fold decrease a causa di power usage.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment acceso resource-constrained hardware like smartphones.

Doing away with matrix math

In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.

However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.
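To see how ternary weights can sidestep multiplication, here is an illustrative sketch (not the paper's actual implementation): when every weight is restricted to {-1, 0, +1}, each "product" collapses to an addition, a subtraction, or a skip.

```python
import numpy as np

def ternary_matvec(x, W_ternary):
    """Matrix-vector product where W_ternary contains only -1, 0, +1.

    No multiplications needed: for each output, sum the inputs whose
    weight is +1 and subtract the inputs whose weight is -1.
    """
    out = np.zeros(W_ternary.shape[1])
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()  # add/sub only
    return out

x = np.array([2.0, -1.0, 3.0])
W = np.array([[ 1, -1],
              [ 0,  1],
              [-1,  1]])
print(ternary_matvec(x, W))  # same result as the true MatMul x @ W
```

This add/subtract trick covers the dense layers; the harder part the researchers tackled, per the paper, was removing the remaining multiplications inside attention itself.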

Copyright © 2024 Globalnews24.ch | All Rights Reserved.
