Mamba Explained: The Alternative to Transformers

🎯 Why Mamba?

Transformers have revolutionized deep learning, but they have a major limitation: self-attention scales quadratically, O(n²), with sequence length. Doubling the sequence length roughly quadruples the compute spent on attention and the memory needed for the attention matrix.

Mamba proposes an elegant alternative: an architecture based on State Space Models (SSMs) whose cost grows linearly, O(n), with sequence length, while reaching performance comparable to, and on some tasks better than, Transformers.

💡 What Is a State Space Model?

State Space Models (SSMs) are a family of mathematical models that have been used for decades in control theory and signal processing. They describe a dynamical system that evolves over time.

📐 Fundamental SSM Equations

A continuous-time SSM is defined by two equations, a differential equation for the state and an output equation:

$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$$

$$y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$$

Where:

$x(t) \in \mathbb{R}$: input signal at time $t$ (e.g., one channel of a token in a sequence)

$h(t) \in \mathbb{R}^N$: hidden state of dimension $N$ that encodes the history

$y(t) \in \mathbb{R}$: output signal at time $t$

$\mathbf{A} \in \mathbb{R}^{N \times N}$: state matrix (how the state evolves on its own)

$\mathbf{B} \in \mathbb{R}^{N \times 1}$: input matrix (how the input affects the state)

$\mathbf{C} \in \mathbb{R}^{1 \times N}$: output matrix (how the state produces the output)

$\mathbf{D} \in \mathbb{R}$: direct feedthrough term (often set to 0)
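To make the two continuous equations concrete, here is a minimal sketch that integrates them with small explicit Euler steps. The matrices, the constant input, and the helper name `simulate_continuous_ssm` are illustrative assumptions, not values from Mamba.

```python
import numpy as np

def simulate_continuous_ssm(A, B, C, D, x, dt, steps):
    """Integrate h'(t) = A h(t) + B x(t) with explicit Euler steps and
    return y(t) = C h(t) + D x(t) at every step."""
    h = np.zeros((A.shape[0], 1))          # hidden state, starts at zero
    ys = []
    for k in range(steps):
        x_t = x(k * dt)                    # scalar input at time t = k * dt
        ys.append((C @ h).item() + D * x_t)
        h = h + dt * (A @ h + B * x_t)     # Euler update of the state
    return np.array(ys)

# Hypothetical 2-state system with a stable (negative) diagonal A.
A = np.array([[-1.0, 0.0], [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])
D = 0.0

# Constant input x(t) = 1: the state settles to -A^{-1} B = [1, 0.5],
# so the output should approach 1.5.
y = simulate_continuous_ssm(A, B, C, D, x=lambda t: 1.0, dt=0.001, steps=5000)
print(y[0], y[-1])
```

Euler integration is used here only because it is the shortest way to see the state "remember" the input; it is not the discretization Mamba uses.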

🔍 Intuition:

Imagine a system that "remembers" its past through a hidden state $h(t)$. At each moment, the state is updated from its previous value (via $\mathbf{A}$) and from the new input (via $\mathbf{B}$). The output $y(t)$ is then read out from the state (via $\mathbf{C}$).

📖 Concrete Example:

Think about reading a sentence word by word. Your brain maintains a mental state (= $h(t)$) that summarizes what you have read so far. When you read a new word (= $x(t)$), you update this mental state by combining your current understanding with the new word. Your understanding of the sentence at that point (= $y(t)$) depends on this accumulated mental state. An SSM follows the same pattern.

🔄 Discretization: From Continuous to Discrete

To use SSMs in deep learning, we must discretize them, because we work with discrete sequences (tokens, pixels, etc.). We use the Zero-Order Hold (ZOH) method with a time step $\Delta$:

$$\overline{\mathbf{A}} = \exp(\Delta \mathbf{A})$$

$$\overline{\mathbf{B}} = (\Delta \mathbf{A})^{-1}(\exp(\Delta \mathbf{A}) - \mathbf{I}) \cdot \Delta \mathbf{B}$$

🧮 Pedagogical Explanation:

Why discretize? The continuous equations describe a signal defined at every instant, but a model only sees data in discrete steps (token 1, token 2, token 3...). It's like going from a movie (continuous) to a series of still frames (discrete).

What does $\exp(\Delta \mathbf{A})$ mean? It is the matrix exponential, which turns the continuous state matrix $\mathbf{A}$ into its discrete counterpart $\overline{\mathbf{A}}$. The parameter $\Delta$ (delta) is the time step, i.e. how much continuous time elapses between two consecutive samples: a small $\Delta$ follows the continuous dynamics at a fine resolution, while a large $\Delta$ lets the state change more between two steps.

ZOH method: "Zero-Order Hold" means the input value is held constant during each time interval $\Delta$. Under that assumption the discretization is exact, and it is the rule used in the Mamba paper.
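As a minimal sketch of the two ZOH formulas above, assume $\mathbf{A}$ is diagonal with non-zero entries (as it is in Mamba). The matrix exponential and the inverse then reduce to elementwise operations. The function name and the numbers are illustrative assumptions.

```python
import numpy as np

def zoh_discretize_diagonal(a_diag, B, delta):
    """ZOH discretization for a diagonal A (entries must be non-zero).

    a_diag: (N,) diagonal of A, B: (N,) input vector, delta: scalar step.
    Returns the diagonal of A_bar and the vector B_bar.
    """
    dA = delta * a_diag                        # diagonal of (Delta * A)
    A_bar = np.exp(dA)                         # exp(Delta * A), elementwise
    B_bar = (A_bar - 1.0) / dA * (delta * B)   # (Delta A)^-1 (exp(Delta A) - I) Delta B
    return A_bar, B_bar

a_diag = np.array([-1.0, -2.0])   # hypothetical stable dynamics
B = np.array([1.0, 1.0])
A_bar, B_bar = zoh_discretize_diagonal(a_diag, B, delta=0.1)
print(A_bar, B_bar)
```

For a small $\Delta$, $\overline{\mathbf{B}}$ is close to $\Delta \mathbf{B}$, which is a quick sanity check on the formula. For a general (non-diagonal) $\mathbf{A}$ you would need a true matrix exponential instead of `np.exp`.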

The discrete SSM then becomes (taking $\mathbf{D} = 0$):

$$h_t = \overline{\mathbf{A}} h_{t-1} + \overline{\mathbf{B}} x_t$$

$$y_t = \mathbf{C} h_t$$

📊 Interpretation:

The equation $h_t = \overline{\mathbf{A}} h_{t-1} + \overline{\mathbf{B}} x_t$ is a recurrence: to compute the state at time $t$, we combine the previous state $h_{t-1}$ (memory) with the new input $x_t$ (fresh information). It resembles an RNN, but the update is linear in the state and its structure comes from control theory.

Now we can process discrete sequences $x_1, x_2, \ldots, x_L$ of length $L$!
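The recurrence translates almost line for line into a loop. This is a minimal sketch with made-up matrices; `run_discrete_ssm` is a name chosen here for illustration.

```python
import numpy as np

def run_discrete_ssm(A_bar, B_bar, C, xs):
    """Apply h_t = A_bar h_{t-1} + B_bar x_t and y_t = C h_t to a scalar sequence.

    A_bar: (N, N), B_bar: (N,), C: (N,), xs: iterable of scalars.
    """
    h = np.zeros(A_bar.shape[0])       # h_0 = 0
    ys = []
    for x_t in xs:
        h = A_bar @ h + B_bar * x_t    # state update
        ys.append(float(C @ h))        # readout
    return np.array(ys)

# Hypothetical 2-state system: one fast-decaying and one slow-decaying memory.
A_bar = np.diag([0.5, 0.9])
B_bar = np.array([1.0, 1.0])
C = np.array([1.0, 1.0])

# An impulse followed by zeros: the output shows how the memory of x_1 fades.
ys = run_discrete_ssm(A_bar, B_bar, C, xs=[1.0, 0.0, 0.0])
print(ys)
```

The state has a fixed size $N$ no matter how long the sequence is, which is what the memory argument later in the article relies on.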

🚀 Mamba's Innovation: Selectivity

Classic SSMs (like S4) have a major limitation: the parameters $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ are learned, but once trained they are the same at every time step and do not depend on the input. Such a model treats all inputs the same way and cannot focus on the important information.

🎯 The Selective Mechanism (Selective SSM)

Mamba introduces selectivity: the parameters $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ (the time step) become functions of the input $x_t$:

$$\mathbf{B}_t = s_{\mathbf{B}}(x_t)$$

$$\mathbf{C}_t = s_{\mathbf{C}}(x_t)$$

$$\Delta_t = \tau_{\Delta}(s_{\Delta}(x_t))$$

Where:

$s_{\mathbf{B}}$, $s_{\mathbf{C}}$, $s_{\Delta}$ are learned linear projections
$\tau_{\Delta}$ is an activation function (typically softplus: $\tau_{\Delta}(x) = \log(1 + e^x)$)
$\mathbf{A}$ does not depend on the input (it is kept as a structured, typically diagonal, matrix for efficiency)

🧮 Pedagogical Explanation:

What does "linear projection" mean? It is simply a matrix multiplication: $s_{\mathbf{B}}(x_t) = \mathbf{W}_B \cdot x_t$, where $\mathbf{W}_B$ is a learned weight matrix. The model learns to turn the input $x_t$ into parameters $\mathbf{B}_t$, $\mathbf{C}_t$, and $\Delta_t$ adapted to that specific input.

Why does $\Delta_t$ vary? The time step $\Delta_t$ controls how strongly the model integrates new information. A large $\Delta_t$ means "this input matters: write it into the state", which also overwrites more of the old state. A small $\Delta_t$ means "this input matters less: mostly keep the current memory". It's like adjusting your level of attention while reading.

Softplus function: $\tau_{\Delta}(x) = \log(1 + e^x)$ ensures that $\Delta_t$ is always positive (a negative time step would make no physical sense). It is a smooth version of the ReLU function.
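Here is a minimal sketch of how the three input-dependent parameters could be produced for a whole sequence. It assumes each token $x_t$ is a $D$-dimensional vector and, as a simplification, that $\Delta_t$ is a single scalar per token. The weight names (`W_B`, `W_C`, `w_delta`, `b_delta`) and sizes are hypothetical.

```python
import numpy as np

def softplus(z):
    """log(1 + e^z), computed in a numerically stable way."""
    return np.logaddexp(0.0, z)

def selective_parameters(x, W_B, W_C, w_delta, b_delta):
    """x: (L, D) token vectors.
    Returns B: (L, N), C: (L, N) and delta: (L,), one set per token."""
    B = x @ W_B                               # s_B: linear projection
    C = x @ W_C                               # s_C: linear projection
    delta = softplus(x @ w_delta + b_delta)   # tau_Delta(s_Delta(x)), always > 0
    return B, C, delta

rng = np.random.default_rng(0)
L, D, N = 4, 3, 2                             # illustrative sizes
x = rng.normal(size=(L, D))
B_t, C_t, delta_t = selective_parameters(
    x,
    W_B=rng.normal(size=(D, N)),
    W_C=rng.normal(size=(D, N)),
    w_delta=rng.normal(size=D),
    b_delta=0.0,
)
print(B_t.shape, C_t.shape, delta_t)
```

Each row of `B_t` and `C_t` belongs to one token, which is exactly what "input-dependent" means here: two different tokens get two different sets of SSM parameters.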

💡 Why Is This Revolutionary?

By making $\mathbf{B}_t$, $\mathbf{C}_t$, and $\Delta_t$ input-dependent, Mamba can:

Filter information: ignore irrelevant tokens (via a small $\mathbf{B}_t$)
Focus on what matters: amplify the contribution of important tokens to the output (via a large $\mathbf{C}_t$)
Adapt temporal resolution: a large $\Delta_t$ forgets the old state quickly and focuses on the current input; a small $\Delta_t$ preserves the state and remembers for longer
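The effect of $\Delta_t$ can be checked numerically with the ZOH formulas from the discretization section. This sketch assumes a single state with $\mathbf{A} = a = -1$ and $\mathbf{B} = 1$ (illustrative values) and looks at two weights: how much of the old state is kept ($\overline{\mathbf{A}}$) and how much of the new input is written ($\overline{\mathbf{B}}$).

```python
import numpy as np

def step_weights(delta, a=-1.0):
    """Return (keep, write) = (A_bar, B_bar) for a scalar SSM with B = 1."""
    keep = np.exp(delta * a)          # A_bar: fraction of the old state kept
    write = (keep - 1.0) / a          # B_bar: weight given to the new input
    return keep, write

for delta in (0.01, 1.0, 5.0):
    keep, write = step_weights(delta)
    print(f"delta={delta:<5} keep={keep:.4f} write={write:.4f}")
```

With a small step the state is almost untouched and the input is almost ignored; with a large step the old state is nearly erased and the input is written at close to full weight.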

⚡ The Selective Scan Algorithm

With selective parameters, the discrete SSM becomes:

$$\overline{\mathbf{A}}_t = \exp(\Delta_t \mathbf{A})$$

$$\overline{\mathbf{B}}_t = (\Delta_t \mathbf{A})^{-1}(\exp(\Delta_t \mathbf{A}) - \mathbf{I}) \cdot \Delta_t \mathbf{B}_t$$

$$h_t = \overline{\mathbf{A}}_t h_{t-1} + \overline{\mathbf{B}}_t x_t$$

$$y_t = \mathbf{C}_t h_t$$

🧮 Pedagogical Explanation:

What does it mean that $\overline{\mathbf{A}}_t$ and $\overline{\mathbf{B}}_t$ change at each time step? In classic SSMs these matrices are the same for every token; here they are recomputed for each token from $\Delta_t$ and $\mathbf{B}_t$, which vary with the input. This is what provides selectivity, but it makes the computation harder.

Why can't we use the fast convolution anymore? The convolutional form (computed with an FFT) only works when the weights are the same at every step. With weights that change ($\overline{\mathbf{A}}_t$, $\overline{\mathbf{B}}_t$), a naive implementation has to walk through the recurrence step by step: $h_1 \rightarrow h_2 \rightarrow h_3 \rightarrow \ldots$. That is potentially slow!
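The naive step-by-step computation looks like the sketch below. It assumes a diagonal $\mathbf{A}$ and a scalar input per step. The function name and the example values are illustrative, and this is a reference loop, not Mamba's optimized kernel.

```python
import numpy as np

def selective_scan_sequential(x, delta, a_diag, B, C):
    """Reference loop for the selective SSM.

    x: (L,) inputs, delta: (L,) positive steps, a_diag: (N,) non-zero diagonal of A,
    B: (L, N) and C: (L, N) per-token parameters. Returns y: (L,).
    """
    h = np.zeros_like(a_diag)
    ys = np.empty(len(x))
    for t in range(len(x)):
        dA = delta[t] * a_diag
        A_bar = np.exp(dA)                              # recomputed for every token
        B_bar = np.expm1(dA) / dA * (delta[t] * B[t])   # ZOH; expm1(z) = exp(z) - 1
        h = A_bar * h + B_bar * x[t]
        ys[t] = C[t] @ h
    return ys

# The second token gets a tiny delta, so it should barely affect the state
# even though its value (5.0) is large.
x = np.array([1.0, 5.0, 0.0])
delta = np.array([1.0, 1e-6, 1.0])
a_diag = np.array([-1.0])
B = np.ones((3, 1))
C = np.ones((3, 1))
y = selective_scan_sequential(x, delta, a_diag, B, C)
print(y)
```

The loop makes the cost structure visible: one state update per token, each depending on the previous one.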

Solution: a hardware-aware parallel scan. Because the recurrence is linear in the state, it can be computed with a parallel scan (in the style of a prefix sum) instead of a strictly sequential loop, and Mamba implements this scan in a way that is tuned to the GPU memory hierarchy. It's like having several people read different parts of a book at the same time and then combine what they understood. The total work stays O(n), and the computation is fast in practice even though $\overline{\mathbf{A}}_t$ and $\overline{\mathbf{B}}_t$ change at every time step.
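To see why a scan applies here, note that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing two affine maps gives another affine map. That composition is associative, which is all a parallel scan needs. The sketch below uses a simple Hillis-Steele scan in NumPy for a scalar state. It is an illustration of the idea under those assumptions, not Mamba's fused GPU kernel, and this particular variant does O(n log n) total work, whereas work-efficient scans do O(n).

```python
import numpy as np

def scan_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t with h_0 = 0, computed one step at a time."""
    h = 0.0
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def scan_parallel(a, b):
    """Same result in about log2(L) rounds; each round is one vectorized operation.

    Position t holds a pair (a, b) describing the affine map over a window
    ending at t. Each round merges it with the window ending `shift` earlier.
    """
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(b):
        b[shift:] = a[shift:] * b[:-shift] + b[shift:]   # compose offsets (uses old a)
        a[shift:] = a[shift:] * a[:-shift]               # compose slopes
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=37)   # plays the role of A_bar_t
b = rng.normal(size=37)              # plays the role of B_bar_t * x_t
print(np.allclose(scan_parallel(a, b), scan_sequential(a, b)))
```

On a GPU each round would run across many threads at once, so the sequential dependency chain of length $L$ shrinks to roughly $\log_2 L$ rounds.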

🔑 Key Advantages of Mamba:

Linear complexity O(n): compute grows linearly with sequence length; the Mamba paper reports roughly 5x higher inference throughput than Transformers

Constant-size state, O(1) in sequence length, during generation: no need to keep the whole sequence around (unlike attention, whose score matrix grows as O(n²))

Selectivity: dynamically focuses on important information

Comparable performance: rivals Transformers on many tasks

Hardware-aware: the scan is implemented with modern GPUs in mind

🏗️ Mamba Block Architecture

A Mamba block combines the selective SSM with modern deep learning techniques. Here is its complete architecture:

📊 Data Flow in a Mamba Block

1. Input: $x \in \mathbb{R}^{L \times D}$ (sequence of length $L$, dimension $D$)

2. Normalization: $x' = \text{LayerNorm}(x)$

3. Expansion: $x'' = \text{Linear}(x') \in \mathbb{R}^{L \times 2E}$ (typically $E = 2D$)

4. Split: $x_{\text{ssm}}, x_{\text{gate}} = \text{split}(x'')$ (each $\in \mathbb{R}^{L \times E}$)

5. 1D convolution: $x_{\text{conv}} = \text{Conv1D}(x_{\text{ssm}})$ (to capture local dependencies)

6. Activation: $x_{\text{act}} = \text{SiLU}(x_{\text{conv}})$ (SiLU$(x) = x \cdot \sigma(x)$)

7. Selective SSM: $y_{\text{ssm}} = \text{SelectiveSSM}(x_{\text{act}})$ (the heart of Mamba!)

8. Gating: $y_{\text{gated}} = y_{\text{ssm}} \odot \text{SiLU}(x_{\text{gate}})$ (gating mechanism)

9. Projection: $y = \text{Linear}(y_{\text{gated}}) \in \mathbb{R}^{L \times D}$

10. Residual connection: $\text{output} = x + y$
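The ten steps above can be sketched in plain NumPy. This is a simplified, unoptimized sketch under several assumptions: a diagonal $\mathbf{A}$ per channel, a causal depthwise convolution, no biases, a full-rank projection for $\Delta$, and a plain Python loop in place of the parallel scan. All parameter names (`W_in`, `conv`, `a_diag`, `W_B`, `W_C`, `W_delta`, `W_out`) and sizes are hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))            # z * sigmoid(z)

def softplus(z):
    return np.logaddexp(0.0, z)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_depthwise_conv1d(x, kernel):
    """x: (L, E), kernel: (K, E). Each channel is filtered on its own, past only."""
    K, L = kernel.shape[0], x.shape[0]
    padded = np.vstack([np.zeros((K - 1, x.shape[1])), x])
    return sum(kernel[k] * padded[k:k + L] for k in range(K))

def selective_ssm(x, a_diag, W_B, W_C, W_delta):
    """x: (L, E), a_diag: (E, N). Sequential reference version of the selective SSM."""
    L, E = x.shape
    B, C = x @ W_B, x @ W_C                   # (L, N) each
    delta = softplus(x @ W_delta)             # (L, E), one step size per channel
    h = np.zeros_like(a_diag)                 # (E, N) state
    y = np.empty((L, E))
    for t in range(L):
        dA = delta[t][:, None] * a_diag
        A_bar = np.exp(dA)
        B_bar = np.expm1(dA) / dA * (delta[t][:, None] * B[t][None, :])
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C[t]
    return y

def mamba_block(x, p):
    x_norm = layer_norm(x)                                         # step 2
    x_ssm, x_gate = np.split(x_norm @ p["W_in"], 2, axis=-1)       # steps 3-4
    x_act = silu(causal_depthwise_conv1d(x_ssm, p["conv"]))        # steps 5-6
    y_ssm = selective_ssm(x_act, p["a_diag"], p["W_B"], p["W_C"], p["W_delta"])  # step 7
    y_gated = y_ssm * silu(x_gate)                                 # step 8
    return x + y_gated @ p["W_out"]                                # steps 9-10

rng = np.random.default_rng(0)
L, D, N, K = 6, 4, 3, 4
E = 2 * D
params = {
    "W_in": rng.normal(scale=0.5, size=(D, 2 * E)),
    "conv": rng.normal(scale=0.5, size=(K, E)),
    "a_diag": -np.tile(np.arange(1.0, N + 1.0), (E, 1)),   # negative = stable
    "W_B": rng.normal(scale=0.5, size=(E, N)),
    "W_C": rng.normal(scale=0.5, size=(E, N)),
    "W_delta": rng.normal(scale=0.5, size=(E, E)),
    "W_out": rng.normal(scale=0.5, size=(E, D)),
}
x = rng.normal(size=(L, D))
out = mamba_block(x, params)
print(out.shape)
```

The output has the same shape as the input, so blocks can be stacked. Every operation only looks at the current and earlier tokens, which is what makes the block usable for autoregressive generation.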

🔍 Key Components:

1D convolution: Captures local dependencies (as in CNNs)

Selective SSM: Captures long-range dependencies with linear complexity

Gating (GLU): Controls the flow of information (inspired by LSTMs)

Residual connection: Makes deep networks easier to train

⚖️ Mamba vs Transformers: Detailed Analysis

📊 Complexity Comparison

Criterion	Transformers (Attention)	Mamba (Selective SSM)
Time complexity	$O(n^2 \cdot d)$	$O(n \cdot d \cdot N)$ ✅
Memory complexity	$O(n^2)$ (attention matrix)	$O(N)$ per channel (hidden state) ✅
1K-token sequence	~1M token-pair interactions	~1K state updates ✅
10K-token sequence	~100M token-pair interactions	~10K state updates ✅
Parallelization	Excellent ✅	Good (parallel scan) ✅
Performance	State of the art ✅	Comparable/Superior ✅

Notation: $n$ = sequence length, $d$ = model dimension, $N$ = SSM hidden state dimension (typically $N \ll n$). The operation counts in the two sequence rows ignore the $d$ and $N$ factors.
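The two time-complexity expressions can be compared directly. This back-of-the-envelope sketch ignores constant factors and everything outside token mixing; the values of $d$ and $N$ are assumptions picked for illustration.

```python
def attention_cost(n, d):
    """Rough operation count for attention's token mixing: ~ n^2 * d."""
    return n * n * d

def selective_ssm_cost(n, d, N):
    """Rough operation count for the selective SSM: ~ n * d * N."""
    return n * d * N

d, N = 1024, 16   # hypothetical model width and SSM state size
for n in (1_000, 10_000, 100_000):
    ratio = attention_cost(n, d) / selective_ssm_cost(n, d, N)
    print(f"n={n:>7}: attention / SSM operation ratio ~ {ratio:.1f}")
```

The ratio simplifies to $n / N$, so the advantage of the SSM grows with the sequence length and only appears once $n$ is well above $N$.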

🧮 Pedagogical Explanation: O(n) vs O(n²)

What does O(n²) mean for Transformers? Attention computes a similarity score between every pair of tokens. For $n$ tokens, there are $n \times n = n^2$ pairs. For 1,000 tokens that is 1 million scores; for 10,000 tokens it is 100 million. The cost grows quadratically.

What does O(n) mean for Mamba? Each token updates the hidden state exactly once; the parallel scan reorganizes this work across the GPU but does not change how it grows. For 1,000 tokens that is 1,000 state updates; for 10,000 tokens it is 10,000. The cost grows linearly, which is far more efficient for long sequences.

Concrete example: Imagine you have to compare every student in a class with every other student (Transformers). In a class of 30, that's 30×30 = 900 comparisons. Now imagine you only have to grade each student individually (Mamba): just 30 evaluations. The gap becomes huge as the class grows.

Memory O(1) vs O(n²): Standard attention builds an attention matrix over all token pairs, which takes O(n²) memory if it is stored in full. During generation, Mamba only keeps a hidden state of fixed size, independent of the sequence length. This is what makes sequences of around 1 million tokens tractable for Mamba, where a full attention matrix would exhaust memory.
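To put numbers on the memory argument, this sketch compares a fully stored attention matrix for one head with the SSM state of one layer. It assumes 2 bytes per value (16-bit floats) and a hypothetical inner width of 2048 channels with $N = 16$ states each.

```python
def attention_matrix_bytes(n, bytes_per_value=2):
    """Memory for one n x n attention matrix, if stored in full."""
    return n * n * bytes_per_value

def ssm_state_bytes(d_inner, N, bytes_per_value=2):
    """Memory for one layer's SSM state: d_inner channels with N states each."""
    return d_inner * N * bytes_per_value

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9}: attention matrix = {attention_matrix_bytes(n):,} bytes, "
          f"SSM state = {ssm_state_bytes(2048, 16):,} bytes")
```

The state size does not change with $n$, while the attention matrix at one million tokens reaches the terabyte range under these assumptions.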

🎯 When to Use Mamba vs Transformers?

✅ Prefer Mamba for:

Very long sequences (>10K tokens)

Strict memory constraints

Real-time inference

Embedded applications

Time series, genomics

Video processing

✅ Prefer Transformers for:

Short/medium sequences (<2K tokens)

Tasks requiring global attention

Cases where pre-trained models are already available

Classic NLP (translation, QA)

Vision Transformers (ViT)

A mature ecosystem

📈 Empirical Results:

On language modeling benchmarks, Mamba is reported to reach performance comparable to Transformers while being about 5x faster on 8K-token sequences and using about 8x less memory. On sequences of up to 1M tokens (genomics), Mamba remains efficient, while standard Transformers become impractical.

🎨 Real-World Applications of Mamba

Thanks to its linear complexity and selectivity, Mamba opens up new possibilities in many domains:

📝 1. Natural Language Processing (NLP)

Language models: Text generation with very long context (>100K tokens)

Document analysis: Processing entire books and technical reports

Efficient chatbots: Conversations with a very long history, since the state stays the same size however long the conversation gets

Automatic summarization: Summarizing long documents

👁️ 2. Computer Vision

Vision Mamba (ViM): Alternative to Vision Transformers for high-resolution images

Multi-view fusion: Efficient combination of multiple views for object detection

Video processing: Analysis of long videos without a hard limit on the number of frames

Segmentation: Segmentation of large 3D medical images

🧬 3. Genomics and Bioinformatics

DNA sequence analysis: Processing entire genomes (millions of bases)

Protein structure prediction: Modeling long amino acid chains

Variant detection: Identifying mutations in long sequences

Hyper-LLM: Language models for genomic sequences (>1M tokens)

📈 4. Time Series and Signal Processing

Weather forecasting: Models with several years of history

Finance: Analysis of long financial time series

Audio processing: Generation and analysis of long music or speech recordings

IoT and sensors: Processing continuous data streams

🔌 5. Embedded Applications and Edge Computing

Smartphones: Local AI assistants with low memory consumption

Robotics: Real-time processing under resource constraints

Autonomous vehicles: Efficient multi-modal sensor fusion

Drones: Navigation and detection with limited power

⚖️ Avantages et Limitations de Mamba ⚖️ Mambaの利点と制限

✅ Avantages ✅ Advantages ✅ 利点

Linear complexity: O(n) in sequence length, versus O(n²) for Attention

Memory efficient: No n × n attention matrix to store

Long sequences: Can process >1M tokens

Selectivity: Automatically filters for relevant information

Parallelization: Parallel scan for training

Fast inference: Recurrent mode for generation

Performance: Comparable to Transformers on benchmarks
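The parallel-scan and recurrent-mode points above can be illustrated with a toy scalar recurrence. This is a minimal sketch, not Mamba's actual kernel: it assumes a one-dimensional state, made-up per-step coefficients `a` and `bx` (standing in for the discretized, input-dependent parameters already multiplied by the input), and hypothetical helper names. It checks that the step-by-step recurrence h_t = a_t · h_{t-1} + bx_t and a scan built on an associative combine operator produce the same states, which is the property that allows training to be parallelized while generation keeps only a constant-size state.

```python
from itertools import accumulate


def recurrent_states(a, bx):
    """Sequential mode: h_t = a_t * h_{t-1} + bx_t with h_{-1} = 0.
    One pass over the sequence, constant-size state."""
    h = 0.0
    states = []
    for a_t, bx_t in zip(a, bx):
        h = a_t * h + bx_t
        states.append(h)
    return states


def combine(left, right):
    """Associative operator on (a, b) pairs, each representing the map h -> a*h + b.
    Applying `left` first and `right` second gives h -> a2*(a1*h + b1) + b2."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_states(a, bx):
    """Inclusive prefix scan with `combine`; the b component of each prefix is h_t."""
    return [b for _, b in accumulate(zip(a, bx), combine)]


def hillis_steele_scan(a, bx):
    """Same scan, organized in about log2(n) rounds. Within a round every position
    is independent of the others, so parallel hardware could update them all at once.
    (This simple variant does O(n log n) total work; work-efficient scans exist.)"""
    elems = list(zip(a, bx))
    n = len(elems)
    step = 1
    while step < n:
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    return [b for _, b in elems]


if __name__ == "__main__":
    a = [0.9, 0.5, 0.8, 0.1, 0.7]      # assumed decay coefficients
    bx = [1.0, 2.0, -1.0, 0.5, 3.0]    # assumed input contributions
    print(recurrent_states(a, bx))
    print(hillis_steele_scan(a, bx))
```

All three functions return the same states up to floating-point rounding; only the order of evaluation differs.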

⚠️ Limitations

Novelty: Less mature than Transformers

Ecosystem: Fewer pre-trained models available

Global attention: May be less effective on tasks that require explicit attention

Interpretability: Harder to visualize than attention maps

Hardware: GPU optimizations less developed than for Attention

Active research: Architecture still evolving

🎓 Conclusion

Mamba represents a major advance in sequence-model architecture. By combining State Space Models with an innovative selectivity mechanism, Mamba avoids the quadratic cost in sequence length that is fundamental to Transformer attention, while maintaining comparable or even superior performance on many tasks.

The potential applications are vast: from long-sequence processing in NLP (>100K tokens) to the analysis of entire genomes, by way of computer vision and embedded systems. Linear O(n) complexity opens up possibilities that were previously impractical with Transformers.
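To make the scaling argument concrete, the sketch below compares the number of values in a naively materialized n × n attention score matrix with the number of values in a fixed-size recurrent state. The state shape (`d_model * d_state`) and the default sizes are illustrative assumptions rather than figures taken from this article, and optimized Transformer implementations such as FlashAttention avoid storing the full matrix.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in one naively materialized n x n attention score matrix."""
    return n_tokens * n_tokens


def recurrent_state_entries(d_model: int = 1024, d_state: int = 16) -> int:
    """Entries in a fixed-size recurrent state; independent of sequence length.
    d_model and d_state are assumed, illustrative sizes."""
    return d_model * d_state


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(
        f"n={n:>9,}  attention entries={attention_score_entries(n):>17,}  "
        f"state entries={recurrent_state_entries():,}"
    )
```

Multiplying the sequence length by 10 multiplies the attention count by 100, while the state count does not change. The comparison covers a single head in a single layer and ignores the activations kept during training.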

The future of Mamba is promising. With Mamba-2 (2024) introducing "State Space Duality" and many variants emerging (Vision Mamba, MambaByte, etc.), this family of architectures continues to evolve rapidly. We may be witnessing a paradigm shift in Deep Learning, in which linear complexity becomes the norm rather than the exception.

📚 Scientific References

🔬 Foundational Mamba Papers

📄 Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
📄 Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv preprint arXiv:2405.21060. (Mamba-2)

🧮 Precursor State Space Models

📄 Gu, A., Goel, K., & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. (S4)
📄 Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., & Ré, C. (2021). Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. NeurIPS 2021. (LSSL)

👁️ Computer Vision Applications

📄 Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv preprint arXiv:2401.09417.
📄 Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., & Liu, Y. (2024). VMamba: Visual State Space Model. arXiv preprint arXiv:2401.10166.

🧬 Genomics Applications

📄 Schiff, Y., Kao, C., Gokaslan, A., Dao, T., Gu, A., & Kuleshov, V. (2024). Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. arXiv preprint arXiv:2403.03234.
📄 Nguyen, E., Poli, M., Faizi, M., Thomas, A., Birch-Sykes, C., Wornow, M., Patel, A., Rabideau, C., Massaroli, S., Bengio, Y., Ermon, S., Baccus, S. A., & Ré, C. (2023). HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023.

⚖️ Comparison with Transformers

📄 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. NeurIPS 2017.
📄 Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.

🔗 Additional Resources

📝 Related Articles

The Attention Mechanism: The Transformer Approach
