Emmanuel Candès and I have just uploaded to the arXiv our paper “The power of convex relaxation: near-optimal matrix completion“, submitted to IEEE Inf. Theory.  In this paper we study the matrix completion problem, which one can view as a sort of “non-commutative” analogue of the sparse recovery problem studied in the field of compressed sensing, although there are also some other significant differences between the two problems.   The sparse recovery problem seeks to recover a sparse vector $x \in {\Bbb R}^n$ from some linear measurements $Ax = b \in {\Bbb R}^m$, where A is a known $m \times n$ matrix.  For general x, classical linear algebra tells us that if m < n, then the problem here is underdetermined and has multiple solutions; but under the additional assumption that x is sparse (most of the entries are zero), it turns out (under various hypotheses on the measurement matrix A, and in particular if A contains a sufficient amount of “randomness” or “incoherence”) that exact recovery becomes possible in the underdetermined case.  Furthermore, recovery is not only theoretically possible, but is also computationally practical in many cases; in particular, under some assumptions on A, one can recover x by minimising the convex norm $\| x \|_{\ell^1}$ over all solutions to Ax=b.
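As a quick numerical illustration of this $\ell^1$ minimisation (basis pursuit), here is a small Python sketch; the dimensions and variable names are purely illustrative, and the standard trick of splitting $x = p - q$ with $p, q \geq 0$ turns the $\ell^1$ objective into a linear program:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 60, 30, 3                 # ambient dimension, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random ("incoherent") measurement matrix
b = A @ x_true

# Basis pursuit: minimise ||x||_1 subject to Ax = b.  Write x = p - q with
# p, q >= 0, so that ||x||_1 = sum(p) + sum(q) becomes a linear objective.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
print(np.linalg.norm(x_hat - x_true))  # small: exact recovery up to solver tolerance
```

With a Gaussian measurement matrix and this level of undersampling, the linear program recovers the sparse vector exactly (up to numerical tolerance), even though the system $Ax=b$ alone is underdetermined.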

Now we turn to the matrix completion problem.  Instead of an unknown vector $x \in {\Bbb R}^n$, we now have an unknown matrix $M = (m_{ij})_{i \in [n_1], j \in [n_2]} \in {\Bbb R}^{n_1 \times n_2}$ (we use the shorthand $[n] := \{1,\ldots,n\}$ here). We will take a specific type of underdetermined linear measurement of M, namely we pick a random subset $\Omega \subset [n_1] \times [n_2]$ of the matrix array $[n_1] \times [n_2]$ of some cardinality $1 \leq m \leq n_1 n_2$, and form the random sample $P_\Omega(M) := (m_{ij})_{(i,j) \in \Omega} \in {\Bbb R}^{\Omega}$ of M.
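Concretely, the sampling operator $P_\Omega$ just records the entries of M indexed by $\Omega$ and discards the rest; a minimal Python sketch (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, m = 5, 4, 8
M = rng.standard_normal((n1, n2))

# Pick a uniformly random subset Omega of [n1] x [n2] with |Omega| = m,
# by sampling m distinct flattened indices and unflattening them.
flat = rng.choice(n1 * n2, size=m, replace=False)
omega = [(idx // n2, idx % n2) for idx in flat]

# P_Omega(M): the observed entries, indexed by Omega.
sample = {(i, j): M[i, j] for (i, j) in omega}
print(len(sample))  # m = 8 observed entries out of n1 * n2 = 20
```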

Of course, with no further information on M, it is impossible to complete the matrix M from the partial information $P_\Omega(M)$ – we only have $m$ pieces of information and need $n_1 n_2$.  But suppose we also know that M is low-rank, e.g. has rank less than r; this is an analogue of sparsity, but for matrices rather than vectors.  Then, in principle, we have reduced the number of degrees of freedom for M from $n_1 n_2$ to something more like $O( r \max(n_1,n_2) )$, and so (in analogy with compressed sensing) one may now hope to perform matrix completion with a much smaller fraction of samples, and in particular with m close to $r \max(n_1,n_2)$.
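The degrees-of-freedom count can be made precise: the rank-$r$ matrices in ${\Bbb R}^{n_1 \times n_2}$ form a manifold of dimension $r(n_1+n_2-r)$ (count the parameters in a reduced SVD), which is indeed $O(r \max(n_1,n_2))$.  A one-line arithmetic check:

```python
n1, n2, r = 1000, 1000, 10
ambient = n1 * n2                   # degrees of freedom of a general matrix
low_rank_dof = r * (n1 + n2 - r)    # dimension of the rank-r matrix manifold
print(ambient, low_rank_dof)        # 1000000 vs 19900: a huge reduction
```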

This type of problem comes up in several real-world applications, most famously in the Netflix prize.  The Netflix prize problem is to be able to predict a very large ratings matrix M, whose rows are the customers, whose columns are the movies, and the entries are the rating that each customer would hypothetically assign to each movie.  Of course, not every customer has rented every movie from Netflix, and so only a small fraction $P_\Omega(M)$ of this matrix is actually known.  However, if one makes the assumption that most customers’ rating preference is determined by only a small number of characteristics of the movie (e.g. genre, lead actor/actresses, director, year, etc.), then the matrix should be (approximately) low rank, and so the above type of analysis should be useful (though of course it is not going to be the only tool of use in this messy, real-world problem).

Actually, one expects to need to oversample the matrix by a logarithm or two in order to have a good chance of exact recovery, if one is sampling randomly.  This can be seen even in the rank one case r=1, in which $M=uv^*$ is the product of a column vector and a row vector; let’s consider square matrices $n_1=n_2=n$ for simplicity.  Observe that if the sampled coordinates $\Omega$ completely miss one of the rows of the matrix, then the corresponding element of u has gone completely unmeasured, and one cannot hope to complete this row of the matrix.   Thus one needs to sample every row (and also every column) of the $n \times n$ matrix.  The solution to the coupon collector’s problem then tells us that one needs about $O(n \log n)$ samples to achieve this goal.  In fact, the theory of Erdős-Rényi random graphs tells us that the bipartite graph induced by $\Omega$ becomes almost surely connected beyond this threshold, which turns out to be exactly what is needed to perform matrix completion for rank 1 matrices.
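The coupon-collector heuristic is easy to check by simulation; the following Python sketch (sizes and names illustrative) counts how many uniform samples of an $n \times n$ array are needed before every row and every column has been hit at least once, and compares with $n \log n$:

```python
import math
import numpy as np

def samples_to_cover(n, rng):
    """Draw uniform random entries of an n x n array until every row
    and every column has been sampled at least once."""
    rows, cols, count = set(), set(), 0
    while len(rows) < n or len(cols) < n:
        rows.add(int(rng.integers(n)))
        cols.add(int(rng.integers(n)))
        count += 1
    return count

rng = np.random.default_rng(2)
n = 200
trials = [samples_to_cover(n, rng) for _ in range(20)]
avg = sum(trials) / len(trials)
print(avg / (n * math.log(n)))  # ratio is O(1), consistent with ~ n log n samples
```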

On the other hand, one cannot hope to complete the matrix if some of the singular vectors of the matrix are extremely sparse.  For instance, in the Netflix problem, a singularly idiosyncratic customer (or dually, a singularly unclassifiable movie) may give rise to a row or column of M that has no relation to the rest of the matrix, occupying its own separate component of the singular value decomposition of M; such a row or column is then impossible to complete exactly without sampling the entirety of that row or column.  Thus, to get exact matrix completion from a small fraction of entries, one needs some sort of incoherence assumption on the singular vectors, which spreads them out across all coordinates in a roughly even manner, as opposed to being concentrated on just a few values.

In a recent paper, Candès and Recht proposed solving the matrix completion problem by minimising the nuclear norm (or trace norm) $\|M\|_* = \sum_{i=1}^{\min(n_1,n_2)} \sigma_i(M) = \hbox{tr}\big( (M M^*)^{1/2} \big)$

amongst all matrices consistent with the observed data $P_\Omega(M)$.  This nuclear norm is the non-commutative counterpart to the $\ell^1$ norm for vectors, and so this algorithm is analogous to the $\ell^1$ minimisation (or basis pursuit) algorithm which is effective for compressed sensing (though not the only such algorithm for this task).  They showed, roughly speaking, that exact matrix completion (for, say, square matrices $n_1=n_2=n$ for simplicity) is ensured with high probability so long as the singular vectors obey a certain incoherence property (basically, their $\ell^\infty$ norm should be close to the minimal possible value, namely $O(1/\sqrt{n})$), and so long as the number of samples obeys $m \gg n^{1.2} r \log n$.
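As a quick sanity check on the two equivalent expressions for the nuclear norm, the sum of singular values does agree numerically with $\hbox{tr}\big((MM^*)^{1/2}\big)$ (a Python sketch; the matrix is arbitrary):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))

nuclear = np.linalg.svd(M, compute_uv=False).sum()  # sum of singular values
via_trace = np.trace(sqrtm(M @ M.T)).real           # tr((M M^*)^{1/2})
print(abs(nuclear - via_trace))  # agrees up to floating-point error
```

(By contrast, $\big(\hbox{tr}(MM^*)\big)^{1/2}$ would be the Frobenius norm, the $\ell^2$-type analogue, which is a different quantity.)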

This differs from the presumably optimal threshold of $nr \log n$ by a factor of about $n^{0.2}$.

The main result of our paper is to mostly eliminate this gap, at the cost of a stronger hypothesis on the matrix being measured:

Main theorem. (Informal statement)  Suppose the $n_1 \times n_2$ matrix M has rank r and obeys a certain “strong incoherence property”.  Then with high probability, nuclear norm minimisation will recover M from a random sample $P_\Omega(M)$ provided that $m \gg n r \log^{O(1)} n$, where $n := \max(n_1,n_2)$.

A result of a broadly similar nature, but with a rather different recovery algorithm and with a somewhat different range of applicability, was recently established by Keshavan, Oh, and Montanari.  The strong incoherence property is somewhat technical, but is related to the Candès-Recht incoherence property and is satisfied by a number of reasonable random matrix models.  The exponent O(1) here is reasonably civilised (ranging between 2 and 9, depending on the specific model and parameters being used).

The general strategy of proof is broadly similar to that in my first compressed sensing paper with Candès and Romberg, and also to that in the more recent paper of Candès and Recht.  (The later papers, based around the restricted isometry property (RIP), are not directly applicable, as the possibility of concentrated singular values defeats the naive non-commutative analogue of the RIP.  However, if one changes the measurement model by taking random Gaussian linear measurements of M (viewed as an $n_1 n_2$-dimensional vector) rather than randomly sampling the entries, then one can recover an RIP-like property, together with most of the compressed sensing theory; see this paper of Recht, Fazel, and Parrilo.)

To begin the proof, one uses a duality argument (essentially just the Hahn-Banach theorem, combined with a standard computation of the (sub-)gradient of the nuclear norm) to equate the success of the nuclear norm minimisation with that of establishing the existence of a certain “dual certificate” matrix Y with certain technical properties relating to the sampling operator $P_\Omega$, to the orthogonal projections $P_U, P_V$ to the spaces U, V generated by the left- and right- singular vectors $u_1,\ldots,u_r, v_1,\ldots,v_r$ appearing in the singular value decomposition $M = \sum_{i=1}^r \sigma_i u_i v_i^*$ of M, and also on the “sign pattern” $E := \sum_{i=1}^r u_i v_i^*$ of M.  The strong incoherence property alluded to earlier is a condition on the matrix coefficients of $P_U, P_V, E$, basically saying that they behave in magnitude as if the singular vectors were distributed randomly.
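The objects $P_U$, $P_V$ and $E$ are straightforward to set up numerically; the following Python sketch (illustrative names only) builds them from the SVD of a random rank-$r$ matrix and verifies the algebraic identities $P_U P_U = P_U$, $P_U E = E$ and $E^* E = P_V$ that play a role later in the argument:

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r = 8, 6, 3
A = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r matrix

U_full, s, Vt = np.linalg.svd(A)
U, V = U_full[:, :r], Vt[:r, :].T
P_U, P_V = U @ U.T, V @ V.T   # orthogonal projections onto the singular spaces
E = U @ V.T                   # the "sign pattern" sum_i u_i v_i^*

# The identities exploited in the cancellation analysis:
assert np.allclose(P_U @ P_U, P_U)
assert np.allclose(P_U @ E, E)
assert np.allclose(E.T @ E, P_V)
```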

In order to locate this certificate, we perform a least-squares minimisation, which requires inverting and then estimating a certain operator arising from $P_\Omega, P_U, P_V$.  Inversion can be accomplished by using a useful selection lemma of Rudelson, but to obtain good estimates on the resulting certificate, we need to expand in Neumann series and estimate the spectral norm of a certain complicated combination of $P_\Omega, P_U, P_V$.  The first few terms of this series were already controlled by Candès and Recht, using tools from high-dimensional geometry such as the non-commutative Khintchine inequality.  However, these techniques do not seem to be able to fully exploit all the (increasingly non-commutative) cancellations present in these products of matrix operators after the fourth or fifth iterate or so, which is ultimately what causes the loss of $n^{0.2}$ in their result.  To deal with this, we instead rely on the classical moment method, raising the operators involved to high powers and computing traces.  As usual, this reduces us to considering certain sums over paths, which in this case are paths of horizontal, vertical, and “non-rook” moves in the grid $[n_1] \times [n_2]$.  But the paths here end up being rather combinatorially complicated, resembling a “spider” with a compact body but many legs.  Furthermore, there are some subtle cancellations in the coefficients (ultimately arising from identities such as $P_U P_U = P_U$, $P_U E = E$, $E^* E = P_V$, etc.) which need to be exploited in order to obtain near-optimal results, although in the bounded rank case r=O(1) it turns out that a relatively “low-tech” approach suffices (taking absolute values and using fairly crude estimates everywhere).
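The underlying principle of the moment method is that for a symmetric operator A, the quantity $\hbox{tr}(A^{2k})^{1/(2k)}$ always upper-bounds the spectral norm, with an overshoot of at most a factor $n^{1/(2k)}$ that tends to 1 as k grows; so controlling traces of high powers controls the norm.  A toy Python illustration on a random symmetric matrix (not, of course, the actual operator estimated in the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
G = rng.standard_normal((n, n))
A = (G + G.T) / np.sqrt(2 * n)   # symmetric random matrix, spectral norm ~ 2

true_norm = np.linalg.norm(A, 2)
moments = {}
for k in (1, 2, 5, 20):
    # tr(A^{2k})^{1/(2k)} upper-bounds ||A||, and overshoots by at most
    # a factor n^{1/(2k)}, which tends to 1 as k grows.
    moments[k] = np.trace(np.linalg.matrix_power(A, 2 * k)) ** (1 / (2 * k))
print(true_norm, moments[20])
```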

The basic difficulty is that this spider can self-intersect in a large number of ways.  Furthermore, it is not enough to track how often each vertex is traversed by this spider; one must also keep track of the various transitions the spider makes from one row to another, or from one column to another.  In some cases, the matrix coefficients arising from two of these transitions can “collapse” to a single matrix coefficient (due to identities such as $P_U P_U = P_U$), replacing the combinatorial configuration of the spider with a “simpler” configuration.  The strategy is then to exploit these identities to collapse the spider whenever possible, until no further cancellations remain, and it becomes safe to estimate everything by absolute values.  This collapsing procedure is the most delicate portion of the paper, and is highly combinatorial in nature.

In a forthcoming paper of Candès and Plan, the exact recovery results here for noiseless measurement models will be extended to provide approximate recovery results in the presence of noise.