\documentstyle[12pt]{article}

\newcommand{\tr}{\mbox{Tr} \,}
\newcommand{\ket}[1]{\left | \, #1 \right \rangle}
\newcommand{\bra}[1]{\left \langle #1 \, \right |}
\newcommand{\amp}[2]{\left \langle #1 \, \right | 
		     \left. \, #2 \right \rangle}
\newcommand{\var}[1]{\sigma_{#1}^{2}}
\newcommand{\hilbert}[1]{{\cal H}_{#1}}
\newcommand{\banner}[1]{\bigskip \bigskip \noindent {\large \bf #1} \bigskip}

\begin{document}

\baselineskip 4ex

\begin{center}
{\LARGE Quantum Coding} \bigskip \\
{\large Benjamin Schumacher} \medskip \\
Department of Physics  \\
Kenyon College \\
Gambier, OH  43022 \\
e-mail:  schumacb@kenyon.edu
\end{center}

\vspace{.5in}

\begin{center}
{\bf Abstract}
\bigskip
\end{center}

A theorem is proven for quantum information theory that is 
analogous to the noiseless coding theorem of classical information
theory.  In the quantum result, the von Neumann entropy $S$ of the
density operator describing an ensemble of pure quantum signal states 
is equal to the number of spin-1/2 systems (quantum bits or ``qubits'') 
necessary to represent the signal faithfully.  The theorem holds whether 
or not the signal states are orthogonal.  Related results are also
presented about the fidelity of quantum coding and about representing
entangled quantum states.

\bigskip

\noindent
PACS numbers:  03.65, 05.30

\pagebreak

\banner{I.  Entropy and Information}

In the classical information theory developed by Shannon and others, the
central problem is {\em coding\/} \cite{shannon}.  
For example, suppose that $A$ is a
message source that produces the message $a$ with probability $p(a)$,
and further suppose that we wish to represent the messages with sequences
of binary digits (bits) that are as short as possible.  It can be shown
that the mean length $\bar{L}$ of these bit sequences is bounded below
by the Shannon entropy $H(A)$ of the source:  $\bar{L} \geq H(A)$, where
\begin{equation}
	H(A)  =  - \sum_{a} \, p(a) \log p(a) .
\end{equation}
(Throughout this paper, we use base-2 logarithms.)
Furthermore, if we allow ourselves to code entire blocks of independent
messages together, it turns out that the mean number $\bar{L}$ of bits
per message can be brought arbitrarily close to $H(A)$.

This {\em noiseless coding theorem} shows the importance of the Shannon
entropy $H(A)$ for information theory.  
It also provides an {\em interpretation}
of $H(A)$ as the mean number of bits necessary to code the output of $A$
using an ideal code.  We might imagine that each bit has a fixed ``cost''
(in units of energy or space or money), so that $H(A)$ is a measure
of the tangible resources necessary to represent the information produced
by $A$.

The ``entropy'' is also of central importance in statistical mechanics,
where it is a measure of the disorder of a physical system.  In classical
statistical mechanics, in fact, the statistical entropy is formally 
identical to the Shannon entropy.  This has led to a considerable effort
to give statistical mechanics an information-theoretic foundation
\cite{jaynes}.  In this
approach, the entropy of a macrostate can be interpreted as the 
number of bits that would be required to specify the microstate of the
system.  (Of course, in classical statistical mechanics the phase space
of states is a continuum, so that the number of bits 
needed to specify a microstate {\em completely}
is infinite.  This can be avoided in the usual way by 
specifying the microstate to a finite, and arbitrary, resolution in phase
space.)

In quantum systems, however, the expression for entropy 
(first proposed by von Neumann \cite{vonneumann}) 
is not identical to the Shannon entropy.
Suppose $\rho$ is the density operator describing an ensemble of
states of a quantum system;
then the von Neumann entropy $S(\rho)$ is
\begin{equation}
	S(\rho) = - \tr \rho \log \rho  .
\end{equation}
This has obvious analogies to the Shannon entropy; in fact, if we can
interpret the probabilities $p(a)$ in Eq. (1) as eigenvalues of the
density operator $\rho$, then $S(\rho)$ is numerically equal to $H(A)$.

Despite their formal similarity, 
however, these two quantities are quite different.
We can see this difference easily by considering a {\em quantum signal
source\/}, which might be part of a quantum communication system.  
This is a device which codes each message $a$ 
from the source $A$ into
a ``signal state'' $\ket{a_{M}}$ of a quantum system $M$.  
Then the ensemble of 
signals from the signal source will be represented by the density operator
\begin{displaymath}
	\rho = \sum_{a} p(a) \pi_{a} 
\end{displaymath}
where the density operators $\pi_{a}$ are the projections 
$\pi_{a}  =  \ket{a_{M}} \bra{a_{M}}$.
The von Neumann entropy of $\rho$ will equal the 
Shannon entropy of the message
source only in the special case when the signals $\ket{a_{M}}$ are orthogonal
to one another, in which case the signal states are eigenstates of $\rho$.  
If the signals are not orthogonal, then $S(\rho) < H(A)$, and the eigenstates 
of $\rho$ may have no simple relation to the signal states \cite{wehrl}.

Of course, if the signal states are not orthogonal it will not be possible
to distinguish between them perfectly.  In other words, no ``decoding
observable'' will be sufficient to recover the entire information content
of the message in the quantum signal source.
It might therefore be more appropriate to consider the
{\em accessible information\/}, the maximum amount of information about the
message that can be recovered in a measurement performed on $M$.
The proper measure of recovered information is the {\em mutual information\/},
which for a pair of random variables $X$ and $Y$ is defined to be
\begin{equation}
	H(X:Y) = H(X) + H(Y) - H(X,Y)  .
\end{equation}
In classical information theory,
the mutual information is the amount of information about $X$ that is 
acquired by determining the value of $Y$.  
Thus, if we denote by $B$ the outcome of
a measurement of an observable on $M$, the quantity $H(A:B)$ measures the 
information about the message source $A$ that is acquired by measurement
of the observable.  The maximum accessible information is 
the maximum of $H(A:B)$ over all choices of decoding observable.

An important theorem of Kholevo \cite{kholevo}
places an upper bound on $H(A:B)$ for
a quantum channel.  To state the theorem most generally, suppose the
signal states are described by density operators $\rho_{a}$ and are
not necessarily pure states.  Then the mutual information obtained
by the measurement of any ``positive operator'' observable (of which
ordinary quantum observables are a special case) is bounded by
\begin{equation}
	H(A:B) \leq S(\rho) - \sum_{a} p(a) \, S(\rho_{a}) .
\end{equation}
In our case, the signals $\rho_{a} = \pi_{a}$ are pure states with
zero entropy and Kholevo's theorem simply states that, 
if $\rho$ is the density operator for the signal ensemble,
\begin{equation}
	H(A:B)  \leq  S(\rho)  .
\end{equation}
That is, although the Shannon entropy $H(A)$ 
of the message source is in general greater than the
von Neumann entropy $S(\rho)$ of the signal ensemble, 
the accessible information is bounded by 
$S(\rho)$.  Kholevo's theorem provides a connection between the von Neumann
entropy of a quantum ensemble and the Shannon mutual information of a quantum
communication channel \cite{thesis}.

This connection, however, is a fairly weak one.  Kholevo's theorem is an
inequality, and it is possible to construct simple quantum signal sources
for which the
mutual information $H(A:B)$ does not approach $S(\rho)$ closely for any
choice of decoding observable \cite{asher}.  
Thus, although Kholevo's theorem gives
an information-theoretic significance to $S(\rho)$, it does not provide
an interpretation of $S(\rho)$ in terms of classical information theory.
We could not use Kholevo's theorem, for example, to interpret
the quantum entropy of some macrostate of a thermodynamic system as a 
measure of the resources necessary to represent information about the 
system's quantum microstate.

In this paper we will prove a ``quantum coding theorem'' that
does allow exactly this sort of interpretation.  
This is accomplished by replacing the classical idea of a binary digit
with a quantum two-state system, such as the spin of electron.  These
quantum bits, or ``qubits'', are the fundamental units of quantum information.
We will show that the von Neumann entropy $S(\rho)$ of an ensemble is just
the mean number of qubits necessary to encode the states in the ensemble
in an ideal coding scheme.  This theorem can be viewed as the kernel of a
new approach to quantum information theory:  instead of simply applying
classical information theory to probabilities derived from quantum rules,
we can adopt notions of coding and measures of information that are 
themselves distinctly quantum mechanical.

Section II provides some background about the Shannon entropy and
classical ideas of ``likely'' and ``unlikely'' binary sequences.
Section III distinguishes between the {\em copying} of quantum
information, which is in general impossible, and the {\em transposition}
of that information from one system to another.  The fidelity of a
transposition scheme is defined in Section IV, and two useful lemmas
about fidelity are proven in Section V.  These lemmas lead directly
to the proof of the main theorem in Section VI.  Section VII discusses
issues related to entangled quantum states, and Section VIII presents
some general remarks.

\banner{II.  Entropy and Likely Sequences}

It will be useful here to review basic concepts of probability, particularly
those relating to the Shannon entropy $H(A)$.  We will also outline a proof
of the noiseless coding theorem of classical information theory
\cite{infotheory}.


Suppose $x_{1},\ldots,x_{N}$ are $N$ independent, 
identically distributed random
variables, each with mean $\bar{x}$ and finite variance.
Given $\delta,\epsilon > 0$ there exists
$N_{0}$ such that, for $N \geq N_{0}$,
\begin{equation}
	P \left( \left| \frac{1}{N} \sum_{i} x_{i} \, - \bar{x} \right|
		\, > \delta  \right) 
		< \epsilon  .
\end{equation}
This standard result is known as the Weak Law of Large Numbers.
It tells that a sufficiently long sequence of
independent, identically distributed random variables will, with a
probability approaching unity, have an average that is close to the
mean of each variable.  

We can use the Weak Law to derive a relation
between the Shannon entropy $H(A)$ and the number of ``likely'' sequences
of $N$ identical random variables.  Suppose, as before, that a message
source $A$ produces the message $a$ with probability $p(a)$.  
A sequence $\alpha = a_{1} a_{2} \ldots a_{N}$ of $N$ independent
messages from the same source will occur in an ensemble of all $N$-sequences
with probability $P(\alpha) = p(a_{1}) \cdots p(a_{N})$.
We can now define a random variable for each message by $x = - \log p(a)$,
so that $H(A) = \bar{x}$.  It is easy to see that
\begin{displaymath}
	- \log P(\alpha) = \sum_{i} x_{i}  .
\end{displaymath}
The Weak Law then tells us that, if 
$\epsilon, \delta > 0$ then for sufficiently large $N$
\begin{equation}
	P \left( \left| - \frac{1}{N} \log P(\alpha) - H(A) \right| 
		> \delta \right)    <    \epsilon
\end{equation}
for $N$-sequences $\alpha$.  We can therefore partition the set of all
$N$-sequences into two subsets:
\begin{itemize}
	\item  A set $\Lambda$ of ``likely'' sequences, for which 
		\begin{displaymath}
			\left| - \frac{1}{N} \log P(\alpha) - H(A) \right|
				\leq \delta .
		\end{displaymath}
	\item  A set of ``unlikely'' sequences with total probability
		less than $\epsilon$, for which this inequality fails.
\end{itemize}
In other words, with probability greater than $\, 1 - \epsilon \,$ a sequence
$\alpha$ is in $\Lambda$ and thus satisfies
\begin{displaymath}
	- \delta  \leq - \frac{1}{N} \log P(\alpha)  -  H(A)  \leq  \delta
\end{displaymath}
which in turn implies that
\begin{equation}
	2^{-N(H(A) - \delta)}  \geq P(\alpha)  \geq  2^{-N(H(A) + \delta)} .
\end{equation}

How many likely sequences are there?  Let $\nu$ be the number of sequences in
$\Lambda$.  Then, using the right-hand inequality above,
\begin{eqnarray*}
	1  	&  \geq  &  \sum_{\alpha \in \Lambda} P(\alpha) \\
		&  \geq  &  \sum_{\alpha \in \Lambda}
				2^{-N(H(A) + \delta)}  \\
		&    =   &  \nu \, 2^{-N(H(A) + \delta)} .
\end{eqnarray*}
Therefore, the number $\nu$ of likely sequences is bounded by
\begin{equation}
	\nu  \leq  2^{N(H(A) + \delta)}  .
\end{equation}
By a similar argument, it turns out that
\begin{equation}
	\nu \geq (1-\epsilon) 2^{N(H(A) - \delta)}  .
\end{equation}

Long sequences of independent messages from the message source $A$ thus
fall into two classes:  a collection of approximately $2^{N H(A)}$ likely
sequences, and a collection of unlikely sequences with total probability
approaching zero.  This suggests a strategy for coding.  The likely
sequences may be associated in a one-to-one fashion with binary sequences
of length $N H(A)$; unlikely sequences, though perhaps very numerous, are
in some sense negligible.

More formally, we can prove the following theorem about coding the output
of $A$ into binary sequences:
\begin{quotation}
{\bf \noindent Theorem  (Noiseless Coding Theorem):}  Let $A$ be a message 
	source as described above, and let $\delta,\epsilon > 0$.
	\begin{description}
	\item[(i)]  Suppose $H(A) + \delta$ bits are available per
		$A$ message.  Then for sufficiently large $N$, 
		$N$-sequences of messages from $A$ can be coded 
		into binary sequences with probability of error
		less than $\epsilon$.
	\item[(ii)]  Suppose $H(A) - \delta$ bits are available
		per $A$ message.  Then for sufficiently large $N$,
		if $N$-sequences of messages from $A$ are coded
		into binary sequences, the probability of error
		will be greater than $1 - \epsilon$.
	\end{description}
\end{quotation}

Part (i) is proved as follows.  From our previous discussion, we know that,
for large enough $N$, the number of likely sequences 
$\nu \leq 2^{N(H(A) + \delta)}$.  We can thus encode each of the likely
sequences into a unique sequence of the $N (H(A) + \delta)$ available binary
digits.  The remaining unlikely sequences can be ``erroneously'' encoded 
in any way, say into the single binary sequence $0 0 0 \ldots 0 0$.  The
total probability of error is thus the total probability of the unlikely
sequences, which can be made less than $\epsilon$.

The proof of part (ii) is slightly more involved.  Let $\eta,\zeta > 0$.
For sufficiently large $N$, we can distinguish between unlikely sequences,
with total probability less than $\eta$, and likely sequences, each one
of which has probability
\begin{displaymath}
	P(\alpha) \leq 2^{-N(H(A) - \zeta)}  .
\end{displaymath}

For coding we have $2^{N(H(A) - \delta)}$ available binary sequences.  We 
assign each of these binary sequences to a sequence of $A$ messages.  Any
leftover sequences of $A$ messages will then have to be encoded in such a 
way that they will not be correctly decoded---they will be errors.  Let
$P_{0}$ be the probability that the message sequence is not decoded 
erroneously.  The set of correctly coded sequences is certainly smaller 
than all unlikely sequences plus $2^{N(H(A)-\delta)}$ likely sequences.
Thus,
\begin{eqnarray*}
	P_{0}	&  <  &	\eta + \left( 2^{N(H(A) - \delta)} \right)
				\left( 2^{-N(H(A) - \zeta)} \right) \\
	P_{0}	&  <  &	\eta + 2^{-N(\delta - \zeta)}  .
\end{eqnarray*}
Now let $\eta = \epsilon/2$, $\zeta = \delta/2$, and choose $N$ large
enough so that $2^{-N \delta / 2} < \epsilon/2$.  Then $P_{0} < \epsilon$,
and the probability of error $1 - P_{0} > 1 - \epsilon$, as we wished.

Consider the notion of the ``probability of error'' in more detail.  In 
itself, a coding scheme is incomplete; we also require some prescription
for decoding, for recovering the original message from the binary string.
If the code is one-to-one, this can be done unambiguously.  If two possible
messages are represented by the same binary sequence, however, this sequence
will sometimes be decoded incorrectly.

We can define the {\em fidelity} $F$ of the coding/decoding arrangement
as the probability that the decoded message is the same as the message
before coding.  The probability of error is thus $1 - F$.  A high fidelity
means a low probability of error, and vice versa.  Thus, the noiseless
coding theorem states that, if more than $H(A)$ bits per message are 
allowed, the fidelity can be made arbitrarily close to unity; and conversely,
if fewer than $H(A)$ bits per message are allowed, the fidelity eventually
approaches zero.

The fidelity $F$ will have an analogue in the quantum domain.


\banner{III.  Copying and Transposition}

A quantum signal source generates the signal state $\ket{a_{M}}$ of a
quantum system $M$ with probability $p(a)$.  The signal states are
not in general orthogonal.  The ensemble of possible signals is
described by the density operator
\begin{equation}
	\rho = \sum_{a} p(a) \, \pi_{a} ,
\end{equation}
where $\pi_{a} = \ket{a_{M}} \bra{a_{M}}$, the density operator 
(for a pure state, a projection)
associated with the signal state vector $\ket{a_{M}}$.

In quantum coding, we wish to represent the output of the signal source 
in another quantum system $X$.  Quantum information theory, unlike
its classical counterpart, requires us to draw a distinction between the
{\em copying} and the {\em transposition} of information from $M$ into $X$.
In copying, the original signal state of $M$ is undisturbed and $X$ is
brought into a state corresponding to the signal state; that is, the
combined system evolves according to 
\begin{equation}
	\ket{a_{M},0_{X}} \rightarrow \ket{a_{M},a_{X}} ,
\end{equation}
where $\ket{0_{X}}$ is some standard ``null'' state of $X$ and $\ket{a_{X}}$
is the representation of the signal $\ket{a_{M}}$ in $X$.  

As shown by Wootters and Zurek \cite{zurek}, copying a quantum signal 
faithfully cannot be accomplished for all signal sources.
The proof of this is elementary.  Suppose a device existed
that claimed to copy arbitrary states of $M$ into states of $X$.
That is, given two distinct signal states $\ket{a_{M}}$ and $\ket{b_{M}}$ of
$M$, the action of the device would be
\begin{eqnarray*}
	\ket{a_{M},0_{X}} & \rightarrow & \ket{a_{M},a_{X}} \\
	\ket{b_{M},0_{X}} & \rightarrow & \ket{b_{M},b_{X}} .
\end{eqnarray*}
Consider now the signal state $\ket{c_{M}} = \ket{a_{M}} + \ket{b_{M}}$.  
(We do not need to consider the normalization of $\ket{c}$.)  
If the resulting state of $X$ is to be a faithful copy, 
then $\ket{c_{X}} = \ket{a_{X}} + \ket{b_{X}}$.
But from general considerations of quantum mechanics, we know that the
dynamical evolution of the system is {\em linear}---in fact, a
unitary transformation---so that
\begin{eqnarray*}
	\ket{c_{M},0_{X}}  &  =  &  \ket{a_{M},0_{X}} + \ket{b_{M},0_{X}} \\
		&  \rightarrow   &  \ket{a_{M},a_{X}} + \ket{b_{M},b_{X}} \\
			&  \neq  &  \ket{c_{M},c_{X}} ,
\end{eqnarray*}
since $\ket{c_{M},c_{X}} = \ket{a_{M},a_{X}} +\ket{a_{M},b_{X}} +
	\ket{b_{M},a_{X}} + \ket{b_{M},b_{X}} $.
That is, if two distinct states can be copied faithfully, a superposition
of the two states cannot be.\footnote{To be complete, we should include
in this discussion the change in state of the copying device, which in
general may depend upon the input state.  The process is actually
\begin{displaymath}
\ket{a_{M},0_{X},\psi_{0}}  \rightarrow  \ket{a_{M},a_{X},\psi_{a}}
\end{displaymath}
where $\ket{\psi_{0}}$ and $\ket{\psi_{a}}$ are states of the copier
and its environment.  This refinement does not modify the general 
argument.}

Copying can be accomplished if the possible states are mutually
ortho\-gonal---for example, we could measure an observable whose eigenstates
are the signal states and then use the (classical) information about
the outcome to manufacture as many perfect copies as desired.  Quantum
signal sources which have non-orthogonal signals, on the other hand,
cannot be duplicated perfectly.

Transposition is a different matter.  In transposition, the signal state
of $M$ is transferred to $X$ without leaving a copy behind:
\begin{equation}
	\ket{a_{M},0_{X}} \rightarrow \ket{0_{M},a_{X}}
\end{equation}
where $\ket{0_{X}}$ and $\ket{0_{M}}$ are fixed ``null'' states for
$X$ and $M$.  After transposition, the signal resides completely 
in the coding system $X$ and the original signal in $M$ has been erased.
(The ``quantum teleportation'' discussed in \cite{teleport} is
a rather exotic example of a transposition process.)

Transposition is completely unitary for arbitrary input signal states
of $M$, provided that the coded states in $X$ have the same inner
products as their precursors:  
$\amp{a_{X}}{b_{X}} = \amp{a_{M}}{b_{M}}$ for all signals
$\ket{a_{M}}$ and $\ket{b_{M}}$.  
This can be accomplished if and only if the Hilbert space $\hilbert{X}$ 
has a dimension at least as large as the subspace
of $\hilbert{M}$ spanned by the signal states.  (We can without loss
of generality suppose that this subspace is the entirety of $\hilbert{M}$.)

To specify the unitary evolution $U$ that accomplishes transposition, we
only need to specify how an orthogonal basis for $\hilbert{M}$ is
mapped into an orthogonal basis for $\hilbert{X}$.  The evolution of
all other signals follows by linearity.  Transposition is invertible,
since the signal state can be transferred back from $X$ to $M$ via the
unitary transformation $U^{-1}$.  We can therefore imagine a communication
scheme based upon transposition.  At the coding end, 
the signal of a source system $M$ is transposed via the unitary 
evolution $U$ into the coding system $X$.  The system $X$ is 
conveyed from the transmitter to the receiver.  At the decoding end,
the unitary evolution $U^{-1}$ is employed to recover the signal
state from $X$ into $M'$, an identical copy of system $M$.  Symbolically,
\begin{displaymath}
	M \stackrel{U}{\longrightarrow} X 
		\stackrel{U^{-1}}{\longrightarrow} M' .
\end{displaymath}
The system $X$ is the {\em quantum channel} in this communication
scheme, and supports the transposition of the state of $M$ into $M'$.

We are concerned here with the transposition of quantum information.
This process requires that the quantum channel $X$ be ``large enough''
(i.e., have a Hilbert space $\hilbert{X}$ of high enough dimension)
to represent the signals in $M$.  For perfect transposition, this
means that $\dim \hilbert{X} \geq \dim \hilbert{M}$.  It may be,
however, that a perfect transposition of the signal is unnecessary,
so that we only wish to perform an {\em approximate} transposition
of quantum information from $M$ to $M'$ via $X$.  Depending on the
characteristics of the signal source, we may be able to make do with
a smaller quantum channel and still have an adequately faithful
representation of the signal.  To explore this question, we need to do 
two things:  first, describe what we mean by an approximate
transposition; and second, define a measure of the {\em fidelity}
of the process and relate the fidelity to the size of the quantum
channel.

\banner{IV. Approximate Transposition and Fidelity}

Consider the quantum communication channel outlined above.  The signal
state of $M$ is unitarily transposed into $X$ and can then be perfectly
recovered into the system $M'$.  However, let us suppose that we do not
send {\em all} of the system $X$ from the transmitter to the receiver.
Instead, we will suppose that $X$ is composed of two subsystems, which we 
will call $C$ (for ``channel'') and $E$ (for ``extra'').  Only
the channel subsystem $C$ is conveyed to the receiver to be used for 
decoding the signal into $M'$; the extra subsystem $E$ is simply 
discarded.  Clearly, the signal cannot in general be recovered exactly
from $C$ alone, since it may be that $\dim \hilbert{C} < \dim \hilbert{M}$.
On the other hand, it may be possible to recover some approximation of
the signal.  We will call this quantum communication scheme 
{\em approximate transposition from $M$ to $M'$ 
via the limited channel $C$\/}.

To recover the signal from $C$ into $M'$, we will add to the channel
system $C$ an auxiliary system $E'$ that is a copy of the discarded
extra system $E$, and then perform a transposition from $C + E'$
to $M$ via the unitary evolution operator $U'$.    Symbolically, our 
scheme is:
\begin{displaymath}
	M \stackrel{U}{\longrightarrow} C + E 
	\begin{array}[t]{c} \longrightarrow \\ \downarrow \\ E \end{array}
	C \begin{array}[t]{c} \longrightarrow \\ \uparrow \\ E' \end{array}
	C + E' \stackrel{U'}{\longrightarrow} M' .		
\end{displaymath}
(It may be convenient to choose $U' = U^{-1}$---in other words, 
to decode the signal from $C + E'$ into $M'$ using the inverse of the coding 
transposition operator.  We will not make this a general requirement.)

To determine the effectiveness of this transposition scheme, 
we need a measure of its fidelity.  
Suppose the original signal state of $M$ is
$\ket{a_{M}}$, represented by the density operator $\pi_{a} = \ket{a_{M}}
\bra{a_{M}}$.  The final signal in $M'$ will be a state represented by the
density operator $w_{a}$.  Because the system $E$ has 
been discarded in the transfer process, the final signal state is not 
necessarily a pure state, and so $w_{a}$ is not generally 
a projection operator.

To check how close the final signal $w_{a}$ is to the original $\pi_{a}$,
we can perform a ``validation measurement'' of the observable $\pi_{a}$.
The measurement has two possible results:  1, indicating that the final
signal matches the original; or 0, indicating that the final signal 
differs from the original.  The probability that $w_{a}$ passes this
validation test is $\tr \pi_{a} \, w_{a}$.  Let us define the
{\em fidelity} $F$ to be the overall probability that a signal
from the signal ensemble in $M$ that is transmitted to $M'$ passes a 
validation test comparing it to its original.  That is,
\begin{equation}
	F = \sum_{a} p(a) \, \tr \pi_{a} \, w_{a} .
\end{equation}
The fidelity $F$ is clearly between 0 and 1, and equals unity only in the
case of perfect transposition of all possible signals.  $F$ will be close
to unity if (1) signals with large probability $p(a)$ are very little
distorted in transmission, so that $w_{a}$ nearly equals $\pi_{a}$; 
and (2) the set of signals which are greatly distorted, having 
$w_{a}$ very different from $\pi_{a}$, has a small total probability.

It is instructive to trace out how the signal state changes 
through this communication scheme.
The first stage of our communication scheme, the unitary coding
transposition from $M$ to $C + E$, is accomplished via the operator
$U$.  If the original signal state of $M$ is $\pi_{a}$, then the signal 
state of $C + E$ can be written 
$\Pi_{a} = U \, \pi_{a} \, U^{-1}$.  When we
discard the extra system $E$, the remaining system $C$ must be
assigned a state $\tr \!_{E} \, \Pi_{a}$, the partial trace
of $\Pi_{a}$ over $E$.  After $E'$ (which is in some state $\ket{0_{E'}}$)
has been adjoined to the channel $C$, the combined system is in a 
state $W_{a} = \tr \!_{E} \, \Pi_{a} \otimes \ket{0_{E'}} \bra{0_{E'}}$.  
Finally, the unitary decoding transposition occurs and 
$w_{a} = U' \, W_{a} \, (U')^{-1}$.  Suppose we make the reasonable
choice $U' = U^{-1}$, so that the decoding transposition is just the
operator inverse of the coding transposition.  Then the fidelity is
\begin{eqnarray}
	F  &  =  &	\sum_{a} p(a) \, \tr \pi_{a} \, w_{a}  \nonumber \\
	   &  =  &	\sum_{a} p(a) \, 
			\tr \left( U^{-1} \, \Pi_{a} \, U \right)
			    \left( U^{-1} \, W_{a} \, U \right) \nonumber \\
	   &  =  &	\sum_{a} p(a) \, \tr \Pi_{a} \, W_{a} 
\end{eqnarray}
so that we can calculate the fidelity 
of the transposition from $M$ to $M'$ by examining only 
the signal states of $C + E$ and $C + E'$.

\banner{V. Two Fidelity Lemmas}

Intuitively, we can say that, if the channel system $C$ is ``too
small'' then the fidelity $F$ must be ``close to'' 0.  Conversely, 
if the channel system is ``large enough'' then we will be able
to make the fidelity $F$ ``close to'' 1.  Making rigorous theorems 
out of the phrases ``large enough'', ``too small'', and ``close to'' 
is the task of this section.  We will prove a pair of general lemmas
that will in the next section be central to the proof 
of our main coding theorem.

We begin by considering a channel $C$ that is ``too small''.
That is, we will prove the following lemma:

\begin{quote}
{\bf Lemma 1.}  Suppose $\dim \hilbert{C} = d$, and suppose that the
ensemble of signals in $M$ described by $\rho = \sum_{a} p(a) \, \pi_{a}$
has the property that, for any projection $\Gamma$
onto a $d$-dimensional subspace $\hilbert{M}$,
\begin{displaymath}
	\tr \rho \, \Gamma < \eta
\end{displaymath}
for some fixed $\eta$.  Then the fidelity $F < \eta$.
\end{quote}

If the signal ensemble has small ``weight'' in every subspace of
the same size as $\hilbert{C}$, then the fidelity of the transposition
will be correspondingly small.  The proof of this fact follows.  
Consider a signal state $\ket{a_{M}}$ that is transposed according to
our general scheme.  Assuming that the system $E'$ is 
initially in some pure state $\ket{0_{E'}}$, the signal state
$W_{a}$ of $C + E'$ is supported only on a $d$-dimensional subspace of 
$\hilbert{C + E'} = \hilbert{C} \otimes \hilbert{E'}$, the subspace of
states of the form $\ket{\psi_{C}, 0_{E'}}$.  Therefore the final
decoded signal state $w_{a}$ of $M'$ is supported 
only on a $d$-dimensional
subspace of $\hilbert{M'}$.  Call the projection onto this
subspace $\Gamma$.

Let $\ket{\phi_{k}}$ for $k = 1,\ldots,d$ be the orthogonal basis for 
this subspace that is composed of eigenstates of $w_{a}$.  
We can write $w_{a}$ as
\begin{displaymath}
	w_{a} = \sum_{k} q_{k} \, \ket{\phi_{k}} \bra{\phi_{k}}
\end{displaymath}
where the $q_{k}$ are eigenvalues of $w_{a}$, including all of the 
non-zero ones.  Clearly, $q_{k} \leq 1$. The projection $\Gamma$ is 
simply
\begin{displaymath}
	\Gamma = \sum_{k} \ket{\phi_{k}} \bra{\phi_{k}}.
\end{displaymath}
Now consider the term $\tr \pi_{a} w_{a}$, which appears in the 
expression for the fidelity.
\begin{eqnarray*}
	\tr \pi_{a} w_{a}  
	& = &	\tr \pi_{a} \left(
		\sum_{k} q_{k} \, \ket{\phi_{k}} \bra{\phi_{k}} \right) \\
	& = &	\sum_{k} q_{k} \, 
		\tr \pi_{a} \ket{\phi_{k}} \bra{\phi_{k}} \\
	& \leq &  \sum_{k} \, \tr \pi_{a} \ket{\phi_{k}} \bra{\phi_{k}} \\
	& = & \tr \pi_{a} 
		\left( \sum_{k} \ket{\phi_{k}} \bra{\phi_{k}} \right) \\
	& = & \tr \pi_{a} \Gamma .
\end{eqnarray*}
The fidelity $F$ is
\begin{eqnarray*}
	F  &  =  &	\sum_{a} p(a) \, \tr \pi_{a} w_{a} \\
	   & \leq &	\sum_{a} p(a) \, \tr \pi_{a} \Gamma \\
	   & = & 	\tr \left( \sum_{a} p(a) \pi_{a} \right) \Gamma \\
	  &  =  &	\tr \rho \, \Gamma 
\end{eqnarray*}
and therefore $F < \eta$.

If the system $E'$ is not in a pure state, the final signal state $w_{a}$
will be a mixture of states, each of which is supported on a $d$-dimensional
subspace.  The overall fidelity will be a weighted average of terms that are
bounded in the above manner.  Thus, $F < \eta$ in this more general 
situation as well, and the theorem holds.

It is worth remarking that the condition requiring $\tr \rho \Gamma < \eta$
for all projections $\Gamma$ can be rephrased in terms of the eigenvalues
of $\rho$.  Let $P_{n}$ be the eigenvalues of $\rho$, and let $\ket{n}$
be the corresponding eigenstates.  The quantities
$Q_{n} = \bra{n} \Gamma \ket{n}$ satisfy $0 \leq Q_{n} \leq 1$, and
$\sum_{n} Q_{n} = \tr \Gamma = d$.  Then
\begin{displaymath}
	\tr \rho \, \Gamma  =  \sum_{n} P_{n} \, Q_{n} .
\end{displaymath}
It is easy to see that this sum will be maximized if the $Q_{n}$ are
chosen to be 1 for the values of $n$ corresponding to the $d$ largest
eigenvalues $P_{n}$, and zero for other values of $n$.  We can actually
achieve this largest value by choosing $\Gamma$ to be the projection onto
the subspace spanned by the eigenstates with the $d$ largest eigenvalues
of $\rho$.  Thus, 
$\tr \rho \, \Gamma < \eta$ if and only if the sum of any $d$ eigenvalues
of $\rho$ is less than $\eta$.

We next turn to the case in which the channel $C$ is ``large enough''
to allow transposition with high fidelity.

\begin{quote}
{\bf Lemma 2.}  Suppose that $\dim \hilbert{C} = d$, and suppose that
there exists a projection $\Gamma$ onto a $d$-dimensional subspace of
$\hilbert{M}$ such that 
\begin{displaymath}
	\tr \rho \, \Gamma  > 1 - \eta .
\end{displaymath}
Then there exists a transposition scheme with fidelity $F > 1 - 2 \eta$.
\end{quote}

If the signal ensemble has sufficient ``weight'' on a subspace of the
same size as $\hilbert{C}$, then it is possible to make a transposition
with a fidelity that is correspondingly close to 1.  To prove this, we
will actually construct such a transposition scheme and find its
fidelity.

We first note that, from the remark above, we lose no generality if we
suppose that $\Gamma$ is a projection onto a subspace 
$\Lambda$ of $\hilbert{M}$ spanned by $d$ eigenstates of $\rho$. 
Let us number the eigenstates of $\rho$ in such a way that the eigenstates
$\ket{1},\ldots,\ket{d}$ span $\Lambda$, while
$\ket{d+1},\ldots,\ket{D}$ (where $D = \dim \hilbert{M}$) are 
orthogonal to $\Lambda$ and span the orthogonal subspace $\Lambda^{\perp}$.  
Then we have
\begin{eqnarray*}
	\sum_{n = 1}^{d} \ket{n} \bra{n}  & = & \Gamma 		\\
	\sum_{n=1}^{d} P_{n} & > & 1 - \eta			\\
	\sum_{n=d+1}^{D} P_{n} & < & \eta  .	
\end{eqnarray*}

Our strategy is as follows.  
We transpose the eigenstates of $\rho$ that are in $\Lambda$ 
in such a way that they will be faithfully represented by
states of the channel $C$ and will be correctly reconstructed in $M'$.
We can do this since $\dim \Lambda = \dim \hilbert{C}$.  
However, the eigenstates of $\rho$ are not necessarily signal states,
and we have no guarantee that {\em any} signal 
actually lies within $\Lambda$ and is thus transposed
without distortion.  Nevertheless, since $\Lambda$ includes most of
the ``weight'' of the signal ensemble (except for a small piece of
measure less than $\eta$), we will be able to show that enough of the
signals are sufficiently close to the subspace $\Lambda$ to achieve
the required fidelity.

To specify the unitary transformation $U$ that accomplishes the 
coding transposition, we specify how the orthogonal basis
of $\rho$ eigenstates is mapped into orthogonal states of $C + E$.
Consider the following mapping:
\begin{equation}
	\ket{n} \longrightarrow \left\{   \begin{array}{ll}  
		\ket{n_{C},0_{E}}  &  \mbox{$n = 1,\ldots,d$}	\\
		\ket{0_{C},n_{E}}  &  \mbox{$n = d+1,\ldots,D$}
		\end{array}  \right.
\end{equation}
where the $\ket{n_{C}}$ and $\ket{n_{E}}$ 
are orthogonal sets of states of the
systems $C$ and $E$, respectively, and $\ket{0_{C}}$ and $\ket{0_{E}}$
are fixed ``null'' states.  We require that the null state $\ket{0_{E}}$
be orthogonal to each of the $\ket{n_{E}}$ for $n = d+1,\ldots,D$.
Roughly speaking, states in $\Lambda$ are
mapped into states of $C$ and states in $\Lambda^{\perp}$ are mapped
into states of $E$.  More precisely, the {\em distinction} between states
in $\Lambda$ is now made between states of $C$, and the distinction
between states in $\Lambda^{\perp}$ is now made between states in $E$.

The extra system $E$ is now discarded, and a new copy $E'$ is
joined to the system.  We specify that $E'$ be initially in the state 
$\ket{0_{E'}}$, so that the $\rho$ eigenstates are now mapped into
states
\begin{equation}
	\ket{n} \longrightarrow \left\{   \begin{array}{ll}  
		\ket{n_{C},0_{E'}}  &  \mbox{$n = 1,\ldots,d$}	\\
		\ket{0_{C},0_{E'}}  &  \mbox{$n = d+1,\ldots,D$}
		\end{array}  \right.  .
\end{equation}
Finally, we decode the signal into $M'$ by using the inverse $U^{-1}$
of the coding transformation.

How does a particular signal state $\ket{a_{M}}$ of $M$ fare in this
approximate transposition scheme?  We can write any state of $M$ as
a superposition of states
\begin{equation}
	\ket{a_{M}} = \lambda_{a} \ket{\lambda(a)_{M}} 
			+ \mu_{a} \ket{\mu(a)_{M}} ,
\end{equation}
where $\ket{\lambda(a)_{M}}$ is in $\Lambda$ and $\ket{\mu(a)_{M}}$ is in
$\Lambda^{\perp}$, and $|\lambda_{a}|^{2} + |\mu_{a}|^{2} = 1$.  The 
states $\ket{\lambda(a)_{M}}$ 
and $\ket{\mu(a)_{M}}$ can be expanded in terms
of the basis eigenstates of $\rho$:
\begin{displaymath}
	\ket{a_{M}} = \lambda_{a} 
		\left( \sum_{n=1}^{d} \amp{n}{\lambda(a)_{M}} 
		\, \ket{n} \right) + \mu_{a}
		\left( \sum_{n=d+1}^{D} \amp{n}{\mu(a)_{M}} 
		\, \ket{n} \right) .
\end{displaymath}
This state is mapped by the coding transposition $U$ into a state 
$\ket{a_{C + E}}$, where
\begin{eqnarray*}
	\ket{a_{M}} & \longrightarrow & \ket{a_{C+E}} \\  &  =  &
		\lambda_{a} 
		\left( \sum_{n=1}^{d} \amp{n}{\lambda(a)_{M}} \, 
			\ket{n_{C}, 0_{E}} \right)
		  + \mu_{a}
		\left( \sum_{n=d+1}^{D} \amp{n}{\mu(a)_{M}} \, 
			\ket{0_{C},n_{E}} \right)  \\
		& = & \lambda_{a} \ket{\lambda(a)_{C},0_{E}}
			+ \mu_{a} \ket{0_{C},\mu(a)_{E}}
\end{eqnarray*}
where $\ket{\lambda(a)_{C}}$ and $\ket{\mu(a)_{E}}$ have the obvious
definitions.  Parenthetically, we note that 
$\amp{\mu(a)_{E}}{0_{E}} = 0$.

The signal state of $C + E$ can be written as a projection operator
$\Pi_{a} = \ket{a_{C + E}} \bra{a_{C + E}}$, which is
\begin{eqnarray}
     \Pi_{a} & = & |\lambda_{a}|^{2} \, \ket{\lambda(a)_{C},0_{E}}
			\bra{\lambda(a)_{C},0_{E}} \nonumber \\
	     &   & + \lambda_{a} \mu_{a}^{\ast} \, \ket{\lambda(a)_{C},0_{E}}
			\bra{0_{C},\mu(a)_{E}}  \nonumber \\
	     &   & + \lambda_{a}^{\ast} \mu_{a} \, \ket{0_{C},\mu(a)_{E}}
			\bra{\lambda(a)_{C},0_{E}} \nonumber \\
	     &   & + |\mu_{a}|^{2} \, \ket{0_{C},\mu(a)_{E}}
			\bra{0_{C},\mu(a)_{E}} .
\end{eqnarray}
When $E$ is discarded, the state of the channel $C$ is obtained by 
performing a partial trace on $\Pi_{a}$, yielding
\begin{eqnarray*}
	\tr \!_{E} \Pi_{a} & = &
	  |\lambda_{a}|^{2} \, \ket{\lambda(a)_{C}} \bra{\lambda(a)_{C}} \\
	& & + |\mu_{a}|^{2} \, \ket{0_{C}} \bra{0_{C}} .
\end{eqnarray*}
Adjoining the system $E'$ in the state $\ket{0_{E'}}$ yields $W_{a}$:
\begin{eqnarray}
	W_{a} & = & |\lambda_{a}|^{2} \, \ket{\lambda(a)_{C},0_{E'}} 
		\bra{\lambda(a)_{C},0_{E'}} \nonumber \\
	& & + \, |\mu_{a}|^{2} \, \ket{0_{C},0_{E'}} \bra{0_{C},0_{E'}} .	
\end{eqnarray}

Since we decode this signal into $M'$ using the inverse of the coding
transposition, the overall fidelity is just $F = \sum_{a} p(a) \,
\tr \Pi_{a} \, W_{a}$.  For a given signal,
\begin{eqnarray*}
	\tr \Pi_{a} \, W_{a}	&  =  &	
		|\lambda_{a}|^{4} + |\lambda_{a}|^{2} \, |\mu_{a}|^{2} \,
			|\amp{\lambda(a)_{C}}{0_{C}}|^{2}  \\
		& \geq & |\lambda_{a}|^{4}  \\
		& = & \left( 1 - |\mu_{a}|^{2}   \right)^{2} \\
		& \geq & 1 - 2|\mu_{a}|^{2} .
\end{eqnarray*}
The fidelity is thus
\begin{equation}
	F   \geq   1 - 2 \left( \sum_{a} p(a) |\mu_{a}|^{2} \right) .
\end{equation}
We required that $\tr \rho \Gamma > 1 - \eta$.  This is just
\begin{eqnarray*}
	\tr \rho \Gamma  & = & \sum_{a} p(a) \, \tr \pi_{a} \Gamma  \\
			& = & \sum_{a} p(a) |\lambda_{a}|^{2} \\
			& = & \sum_{a} p(a) ( 1 - |\mu_{a}|^{2} ) \\
			& = & 1 - \sum_{a} p(a) |\mu_{a}|^{2} .
\end{eqnarray*}
Therefore, our requirement on $\tr \rho \Gamma$ amounts to requiring
that $\sum_{a} p(a) |\mu_{a}|^{2} < \eta$.  This means that the fidelity
of our coding scheme satisfies
\begin{displaymath}
	F > 1 - 2 \eta ,
\end{displaymath}
as we wished to prove.

To summarize, 
we have related the fidelity of our transposition scheme to the
dimension $d$ of the Hilbert state space of the channel system $C$.
If $d$ is small enough that the signal ensemble $\rho$ has weight 
less than $\eta$ on every $d$-dimensional subspace of $\hilbert{M}$, then
the fidelity must satisfy $F < \eta$.  On the other hand, if we can 
find a $d$-dimensional subspace of $\hilbert{M}$ on which $\rho$ has
a weight greater than $1 - \eta$, we can devise a transposition scheme
with fidelity $F > 1 - 2 \eta$.  Furthermore, we can restrict our
attention to subspaces spanned by eigenstates of $\rho$, establishing
the existence or non-existence of suitable subspaces by considering sums
of $d$ distinct eigenvalues of $\rho$.  This connection between the
eigenvalues of $\rho$ (which form a probability distribution) and the
fidelity of approximate transposition through a limited channel $C$
will be important in the next section.

\banner{VI. Qubits and Quantum Coding}

In the classical noiseless coding theorem, there are three
central features.  First, we specified a single elementary coding
system, the binary digit or ``bit''; all messages were encoded using
bits.  Second, we allowed ourselves to encode, not individual messages,
but entire sequences of $N$ messages from independent, identical
sources.  Third, we did not require that the coding be completely
error-free; it sufficed that we could make the classical fidelity 
(the probability of encoding the message in a correctly decodable way) 
arbitrarily close to unity.

Each of these three central ideas must be adapted to our quantum context.
We have already done this with the third item, the fidelity criterion.  
In quantum coding, the fidelity $F$ is the probability that a transposed
signal state will pass a validation test comparing it to the original
signal.  The fidelity lemmas of the previous section give us bounds on
$F$ in various situations.  It remains for us to consider the
quantum generalizations of the first two features, the elementary
coding system and the ``block coding'' of long sequences of messages.

For our elementary coding system we choose the two-level spin system,
which we will call a ``quantum bit'' or {\em qubit\/}.  The qubit will
be our fundamental unit of quantum information, and all signals will
be encoded into sequences of qubits.  Let us denote a single qubit
system by $Q$.  Our quantum channel $C$ will be composed of some
(possibly large) number $K$ of copies of $Q$ (denoted $Q^{K}$), so that
\begin{displaymath}
	\hilbert{C}  =   \underbrace{\hilbert{Q} \otimes 
		\cdots \otimes \hilbert{Q}}_{\mbox{$K$ times}} .
\end{displaymath}
The dimension of $\hilbert{C}$ is $2^{K}$.  The analogy between the
bit and the qubit is obvious.  The qubit is more general, since 
there are more possible coding states than just two (although there
are only two orthogonal ones), and since a collection of qubits can
exist in a non-classically entangled quantum state.

To discuss quantum block coding, we must consider an
{\em extended} quantum signal source.  This is a system consisting of 
$N$ independent copies of the system $M$ (which we will denote $M^{N}$).
Each subsystem $M_{k}$ is in a signal state $\ket{a_{k}}$, generated
according to the signal probability distribution $p(a_{k})$.  That
is, the system $M^{N}$ is in the state
\begin{displaymath}
	\ket{\alpha}  =  \ket{a_{1},a_{2},\ldots,a_{N}}
\end{displaymath}
with probability $p(\alpha) = p(a_{1}) p(a_{2}) \cdots p(a_{N})$.  Our
job will be to transpose the signal state of the joint system $M^{N}$
into an identical system $(M')^{N}$ using a channel composed of qubits.

The density operator $\rho^{N}$ describing the signal ensemble of 
$M^{N}$ is simply the direct product of the density operators for
the signal ensembles of the individual subsystems:
$\rho^{N} = \rho_{1} \otimes \cdots \otimes \rho_{N}$.  This means
that the eigenstates of $\rho^{N}$ will be products states 
$\ket{n_{1},\ldots,n_{N}}$ of eigenstates of the $\rho_{i}$,
and the eigenvalues of $\rho^{N}$ will be products
of eigenvalues of the $\rho_{i}$:
\begin{displaymath}
	P_{n_{1},\ldots,n_{N}} = P_{n_{1}} \cdots P_{n_{N}} .
\end{displaymath}

Now, for the system $M$, 
the eigenvalues $P_{n}$ ($n = 1,\ldots,D$)
of the signal ensemble density operator $\rho$ have all of the properties
of a probability distribution over the integers 1 through $D$.  (They are
in fact the probability distribution for the outcomes of a complete
measurement with eigenstates $\ket{n}$.)  Furthermore, the von Neumann 
entropy of $\rho$ is just the Shannon entropy of this distribution:
\begin{equation}
	S(\rho) = - \sum_{n} P_{n} \log P_{n} .
\end{equation}
An eigenstate of $\rho^{N}$ corresponds to a sequence $n_{1},\ldots,n_{N}$
of $N$ integers, and the eigenvalue of this eigenstate is just the 
probability of this sequence if it had been generated by $N$ independent
trials using the probability distribution $P_{n}$.

As long as we are only interested in eigenvalues and eigenstates of
the density operators, we can pretend that each quantum signal source
is a classical message source that uses the integers 1 through $D$ as
its ``alphabet'' and has a message probability distribution $P_{n}$ with
Shannon entropy $S(\rho)$.
The extended signal source is the extended message source of sequences
of these integers.  From our discussion above, we know that 
for large enough $N$ the sequences can be divided into two sets:
\begin{itemize}
	\item  a set of about $2^{N \, S(\rho)}$ ``likely'' sequences, and
	\item  a set of sequences with small total probability---i.e.,
		whose corresponding eigenvalues have a small sum.
\end{itemize}
These two sets specify two orthogonal subspaces of the Hilbert space 
$\hilbert{M^{N}}$.  One of these (which we can call 
the ``likely'' subspace $\Lambda$) has
a dimension of about $2^{N \, S(\rho)}$, and could be faithfully transposed
into the states of a collection of $N S(\rho)$ qubits.  The other,
$\Lambda^{\perp}$, has small ``weight'' with respect to $\rho^{N}$,
and therefore will not affect the fidelity too much.

To be exact, we can now prove the following theorem:
\begin{quote}
	{\bf Theorem (Quantum Noiseless Coding Theorem):}  Let $M$
	be a quantum signal source with a signal ensemble described
	by the density operator $\rho$, and let $\delta,\epsilon > 0$.
	\begin{description}
	\item[(i)]  Suppose that $S(\rho) + \delta$ qubits are 
		available per $M$ signal.  Then for sufficiently
		large $N$, groups of $N$ signals from the signal
		source $M$ can be transposed via the available qubits 
		with fidelity $F > 1 - \epsilon$.
	\item[(ii)]  Suppose that $S(\rho) - \delta$ qubits are
		available per $M$ signal.  Then for sufficiently
		large $N$, if groups of $N$ signals from the signal
		source $M$ are transposed via the available qubits
		then the fidelity $F < \epsilon$.
	\end{description}
\end{quote}

	Part (i) is proved in this way.  We first note that,
if the quantum channel $C$ is $N(S(\rho) + \delta)$ qubits $Q$,
then $\dim \hilbert{C} = 2^{N(S(\rho) + \delta)}$.
From our proof of the classical theorem and our 
probability-eigenvalue analogy, we know that, for large enough
$N$, the number of ``likely'' eigenstate sequences $\nu \leq 
2^{N(S(\rho) + \delta)}$ and the sum of the remaining eigenvalues
of $\rho^{N}$ can be made less than $\epsilon / 2$.  We can
add a few additional eigenstate sequences if necessary to the ``likely''
set to bring the total to exactly $\nu = \dim \hilbert{C}$, and
this will not increase the sum of the remaining eigenvalues.
Let $\Gamma$ be the projection onto the $\nu$-dimensional subspace 
$\Lambda$ spanned by the ``likely'' eigenstates.  Then $\tr \rho^{N} \Gamma
> 1 - \epsilon / 2$.  By Lemma 2 there is a transposition scheme with
fidelity $F > 1 - \epsilon$.

	For part (ii), we simply note that our classical discussion
tells us that, for large enough $N$, no $2^{N(S(\rho) - \delta)}$ 
eigenstate sequences for $M^{N}$ have eigenvalues which have a sum
as large as $\epsilon$.  Therefore, for every projection $\Gamma$
onto a subspace of dimension $2^{N(S(\rho) - \delta)} = \dim \hilbert{C}$,
we have $\tr \rho^{N} \Gamma < \epsilon$.  Then by Lemma 1, every
transposition scheme has fidelity $F < \epsilon$.  Both parts of the
theorem are now proved.

\banner{VII. Entangled Systems}


We have considered the situation in which various quantum states
(the signal states of $M$) were generated probabilistically, so a
density operator $\rho$ is necessary for the description of the
mixed state of the ensemble.  
Density operators also arise when the system $M$ is only
part of a larger system $M + Z$ which is in a pure but entangled 
quantum state $\ket{\psi_{M+Z}}$.  Such a state can always be
written in a {\em polar decomposition} (sometimes called the ``Schmidt
decomposition'' \cite{schmidt}) 
\begin{equation}
	\ket{\psi_{M+Z}} = \sum_{n} \sqrt{P_{n}} \ket{n_{M},n_{Z}} ,
\end{equation}
where the $\ket{n_{M}}$ and the $\ket{n_{Z}}$ are orthogonal sets of
states for $M$ and $Z$, respectively.  To write down a state of $M$
alone, we must do a partial trace over $Z$, yielding the density
operator
\begin{equation}
	\rho = \sum_{n} P_{n} \, \ket{n_{M}} \bra{n_{M}} ,
\end{equation}
that is, a density operator with eigenstates $\ket{n_{M}}$ and
eigenvalues $P_{n}$.  The von Neumann entropy $S(\rho)$ of this
density operator is sometimes cited as an ``information-theoretic''
measure of the degree of entanglement between the quantum systems 
$M$ and $Z$ \cite{everett,thesis}. 

Suppose we now perform an approximate transposition from $M$ to $M'$.
Does this transposition faithfully transpose the overall quantum
state of $M + Z$ into a state of $M' + Z$?  In other words, does
our scheme also faithfully transfer the quantum entanglement of the
system with the rest of the world from $M$ to $M'$?

The answer is yes.  The state $\ket{\psi_{M+Z}}$, which can be
represented by the projection $\pi = \ket{\psi_{M+Z}} \bra{\psi_{M+Z}}$,
is transposed into some final state $w$ of $M' + Z$.  The fidelity $F$
of this process is
\begin{displaymath}
	F = \tr \pi w .
\end{displaymath}
In our transposition scheme, we perfectly transpose
via the channel $C$ those eigenstates of $\rho$ that are in a 
``likely'' subspace $\Lambda$ of $\hilbert{M}$ with $\dim \Lambda =
\dim \hilbert{C} = d$.  We can number our eigenstates $n_{M}$ 
as before so that the first $d$ of them lie in $\Lambda$.
Now we can write our overall state as 
\begin{displaymath}
	\ket{\psi_{M+Z}} = \lambda \ket{\lambda_{M+Z}} + \mu \ket{\mu_{M+Z}}
\end{displaymath}
where $\ket{\lambda_{M+Z}}$ and $\ket{\mu_{M+Z}}$ are 
orthogonal normalized states
\begin{eqnarray*}
\ket{\lambda_{M+Z}} & = & \sum_{n=1}^{d} c_{n} \ket{n_{M},n_{Z}}  \\
\ket{\mu_{M+Z}} & = & \sum_{n=d+1}^{D} c'_{n} \ket{n_{M},n_{Z}}
\end{eqnarray*}
and $|\lambda|^{2} + |\mu|^{2} = 1$.  We can represent this state
by the projection $\pi = \ket{\psi_{M+Z}} \bra{\psi_{M+Z}}$.

In our scheme, the state $\ket{\lambda_{M+Z}}$ is transposed
perfectly and the state $\ket{\mu_{M+Z}}$ is transposed into some
mixed state, yielding a mixed state $w$ of the system $M' + Z$.
This mixed state is
\begin{eqnarray*}
	w & = &	|\lambda|^{2} \, \ket{\lambda_{M'+Z}} \bra{\lambda_{M'+Z}} \\
	  &   & + \, |\mu|^{2} \left( \sum_{n=d+1}^{D} |c'_{n}|^{2}
			\ket{0_{M'},n_{Z}} \bra{0_{M'},n_{Z}} \right) .
\end{eqnarray*}
From this it follows that the 
fidelity $F = \tr \pi w \geq |\lambda|^{4}$, as before.  If 
$|\lambda|^{2} > 1 - \eta$, then $F > 1 - 2 \eta$, which is the
same result we obtained in Lemma 2.

Once we have this result, we can repeat the argument of our coding
theorem for $N$ copies of the entangled system $M + Z$, concluding
that the entanglement between $M$ and $Z$ can be faithfully transposed
from $M$ to $M'$ using channel $C$ with $S(\rho)$ qubits---or more
exactly, that we can faithfully transpose the entanglement of many 
such systems if we have at least $S(\rho)$ qubits per system.
We can therefore interpret $S(\rho)$ as a measure of the physical
resources necessary to faithfully move the quantum entanglement between
$M$ and the rest of the world (the system $Z$) from one
system to another.


\banner{VIII.  Remarks}

	The von Neumann entropy $S(\rho)$ of a signal ensemble 
of pure states can be interpreted as the number of 
qubits per signal necessary to 
transpose it with near-perfect fidelity.  If more than $S(\rho)$
qubits are available per signal, we can in the long run make the 
fidelity $F$ as close to unity as we like; but if fewer than $S(\rho)$
are available per signal, the fidelity $F$ will eventually approach zero.
Furthermore, $S(\rho)$ is also (in the same sense) the number of qubits
needed to transpose a part of a entangled system while
maintaining the fidelity of the overall state near to unity.
Thus, the quantum entropy $S$ is a measure of the physical resources
necessary to represent the information content of a system in a mixed
state, whether the mixed state arises from a stochastic process or 
by the tracing out of quantum entanglement with the external world. 
Quantum entropy is measured in qubits.

	In the proof of this theorem, a great deal of the 
mathematical machinery developed for the classical theorem could
be inherited with only minor changes.  Instead of probability
distributions, we considered sets of eigenvalues of density
operators.  The two fidelity lemmas that we proved allowed us to
connect statements about these eigenvalues to statements about
the fidelity of an approximate transposition scheme over a limited
channel.

	A simpler approach to the process of approximate transposition
and the fidelity lemmas has been suggested by R. Jozsa \cite{Jozsa}.
This avoids the explicit use of the auxiliary systems $E$ and $E'$
by invoking a (non-unitary) measurement process \cite{modoptics}.

	We should note that the argument for the coding theorem
given here is somewhat akin to the work
of Neill Graham in his paper on the many-worlds interpretation 
of quantum mechanics \cite{graham}.  
Graham investigated how the probabilities
of measurement results might arise in the many-worlds interpretation,
given that the final entangled state of the system and measuring 
apparatus always contains all possible outcomes in the superposition.
By considering the final state of a large collection of system-apparatus
pairs, Graham showed that the Hilbert space can be decomposed into two
subspaces:  (1) a ``typical'' subspace, in which the statistical 
frequencies of the measurement results closely match the 
probabilities given by the quantum rules, and (2) an ``atypical'' 
subspace which contributes very little to the final superposition, 
even though its dimension might be very large. 

	Some unresolved questions arise from our work.  
We have considered here only {\em pure} signal states $\pi_{a}$ from our 
quantum signal source $M$.  Suppose instead that the signals are
mixed states $\rho_{a}$, with $\rho = \sum_{a} p(a) \rho_{a}$.  
Then it is not clear how to proceed, for the natural generalization
of the fidelity,
\begin{displaymath}
	F \stackrel{?}{=} \sum_{a} p(a) \, \tr \rho_{a} w_{a} 
\end{displaymath}
may not be close to unity even if $w_{a} = \rho_{a}$ for all signals.

Furthermore, we have proved a ``noiseless'' coding theorem.  Shannon's more
powerful results deal with the information capacity of channels with
noise \cite{shannon}.  
It would be very desirable to develop coding theorems for noisy 
quantum channels.

Nevertheless, the fidelity and coding results presented here may be a
starting point for a new approach to quantum information theory, one
with possible applications to quantum cryptography \cite{qcrypto}
and the theory of quantum computers \cite{qcomp}.
They also provide an information-theoretic 
interpretation of the von Neumann entropy, with potential 
implications for the conceptual foundations of quantum statistical 
mechanics.

\banner{Acknowledgments}

The term ``qubit'' was coined in jest during one of the author's
many intriguing and valuable conversations with W. K. Wootters, and
became the initial impetus for this work.  
The author is also grateful to C. H. Bennett and R. Jozsa
for their helpful suggestions and for numerous
words of encouragement.

\pagebreak

\begin{thebibliography}{99}

\bibitem{shannon}  C. E. Shannon, Bell System Technical Journal {\bf 27},
	379 (1948).
\bibitem{jaynes}  E. Jaynes, Phys. Rev. {\bf 106}, 620 (1957); and E. Jaynes,
	Phys. Rev. {\bf 108}, 171 (1957).
\bibitem{vonneumann}  J. von Neumann, {\em Mathematical Foundations of 
	Quantum Mechanics} (English translation by R. T. Beyer), Princeton
	University Press (1955).
\bibitem{wehrl}  A. Wehrl, {Rev. Mod. Phys.\/} {\bf 50}, 221 (1978).
\bibitem{kholevo}  A. S. Kholevo, {\em Problemy Peredachi Informatsii} {\bf 9},
	3 (1973).  This paper has appeared in English translation in
	{\em Problems of Information Transmission} {\bf 9}, 177 (1973).
\bibitem{thesis}  B. W. Schumacher, {\em Communication, Correlation, and
	Complementarity}, Ph.D. thesis (The University of Texas at Austin,
	1990).
\bibitem{asher}  A. Peres and W. K. Wootters, {\em Phys. Rev. Lett.\/}
	{\bf 66}, 1119 (1991). 
\bibitem{infotheory}  See any textbook on information theory for more about
	entropy and coding; for example, M. Mansuripur, {\em Introduction
	to Information Theory\/}, Prentice-Hall (1987).
\bibitem{zurek}  W. K. Wootters and W. H. Zurek, {\em Nature\/} {\bf 299},
	802 (1982).
\bibitem{teleport}  C. H. Bennett, G. Brassard, C. Cr\`{e}peau, R. Jozsa,
	A. Peres, and W. K. Wootters, {\em Phys. Rev. Lett.\/} {\bf 70}, 
	1895 (1993).
\bibitem{schmidt}  E. Schmidt, {\em Mathematishe Annalen} {\bf 63},
	433 (1906).
\bibitem{everett}  H. Everett, in {\em The Many-Worlds Interpretation of
	Quantum Mechanics,\/}, edited by B. S. DeWitt and N. Graham,
	Princeton (1973), pp. 3--140.
\bibitem{Jozsa}  R. Jozsa, private communication.
\bibitem{modoptics}  R. Jozsa and B. W. Schumacher, {\em J. Mod. Optics\/},
	to appear.
\bibitem{graham}  N. Graham, in {\em The Many-Worlds Interpretation of
	Quantum Mechanics,\/}, edited by B. S. DeWitt and N. Graham,
	Princeton (1973), pp. 229--253.
\bibitem{qcrypto} S. Wiesner, {\em SIGACT News\/} {\bf 15:1}, 78 (1983);
	C.~H.~Bennett and G.~Brassard, {\em Proceedings of IEEE
	International Conference on Computers, Systems, and Signal
	Processing\/}, Bangalore, India, December~1984, pp.~175\,--\,179;
	C. H. Bennett, F. Bessette, G. Brassard, L. Salvail, and
	J. Smolin, {\em J. Cryptography\/} {\bf 5}, 3 (1992).
\bibitem{qcomp}  D. Deutsch {\em Proc. Roy. Soc. Lond.\/} {\bf A 400},
	97 (1985).

\end{thebibliography}

\end{document}