\documentclass[twoside,11pt]{article}
% On Using Extended Statistical Queries to Avoid Membership Queries
% Journal Version.
\usepackage{jmlr2e}
\usepackage{amsmath}
\usepackage{amssymb}
\bibpunct{[}{]}{;}{a}{,}{,}
\bibliographystyle{alpha}
% Heading arguments are {volume}{year}{pages}{submitted}{published}{authors}

\jmlrheading{2}{2002}{359-395}{10/01}{02/02}{Nader Bshouty and
Vitaly Feldman} \ShortHeadings{On Using Extended Statistical
Queries to Avoid Membership Queries}{Nader H.~Bshouty and Vitaly
Feldman} \firstpageno{359}

\newcommand{\fr}[1]{\frac{1}{#1}}
\newcommand{\I}{{\cal I}}
\newcommand{\op}{\triangleright}
\newcommand{\F}{{\mathcal F}}
\newcommand{\A}{{\mathcal A}}
\newcommand{\M}{{\mathcal M}}
\newcommand{\B}{{\mathcal B}}
\newcommand{\C}{{\mathcal C}}
\newcommand{\D}{{\mathcal D}}
\newcommand{\V}{{\mathcal V}}
\newcommand{\E}{{\mathbf E}}
\newcommand{\pr}{\mathbf {Pr}}
\newcommand{\fuf}[1]{\mbox{\textsl{#1}}}
\newcommand{\equ}[1]{
\begin{equation}
#1
\end{equation}}

\newcommand{\equn}[1]{
$$
#1
$$}

\newcommand{\arr}[2]{
\begin{array}{#1}
#2
\end{array}}
\newcommand{\qed}{\hfill $\Box$ \\}

\begin{document}


%\newtheorem{remark}[theorem]{Remark}
%\newtheorem{fact}[theorem]{Fact}
%\newtheorem{property}[theorem]{Property}
%\newtheorem{result}[theorem]{Result}
%\newtheorem{observation}[theorem]{Observation}
%\newtheorem{definition}[theorem]{Definition}

\title{On Using Extended Statistical Queries to Avoid Membership Queries}

\author{\name Nader H.~Bshouty \email bshouty@cs.technion.ac.il \\
%       \addr Department of Computer Science\\
%       Technion , Israel Institute of Technology,\\
%       Haifa, 32000, Israel
%       \AND
       \name Vitaly Feldman \email felvit@tx.technion.ac.il \\
       \addr Department of Computer Science\\
       Technion , Israel Institute of Technology,\\
       Haifa, 32000, Israel}

\editor{Dana Ron}
 \maketitle

\begin{abstract}

  The Kushilevitz-Mansour ({\tt KM}) algorithm is an algorithm that
  finds all the ``large" Fourier coefficients
  of a Boolean function. It is the main tool for learning decision trees
   and DNF expressions in the PAC model with respect to the uniform
   distribution. The algorithm requires access to the membership
   query (MQ) oracle. The access is often unavailable in
   learning applications and thus the {\tt KM} algorithm cannot be used.

  We significantly weaken this requirement by producing an
  analogue of the {\tt KM} algorithm that uses extended
  statistical queries (SQ) (SQs in which the
  expectation is taken with respect to a distribution given
  by a learning algorithm). We restrict a set of distributions that
  a learning algorithm may use for its statistical queries to be a set of product
  distributions with each bit being 1 with probability $\rho, 1/2$
  or $1 -\rho$ for a constant $1/2 > \rho > 0$ (we denote the resulting model
  by SQ--$\D_{\rho}$). Our analogue finds all the ``large"
  Fourier coefficients of degree lower than $c\log{n}$
  (we call it the {\em Bounded Sieve} ({\tt BS})). We use {\tt BS}
   to learn decision trees and by adapting Freund's boosting
   technique we give an algorithm that learns \textsf{DNF} in SQ--$\D_{\rho}$.
   An important property of the model is that its algorithms can
   be simulated by MQs with persistent noise.
  With some modifications {\tt BS}
  can also be simulated by MQs with
  product attribute noise (i.e., for a query $x$ oracle changes
  every bit of $x$ with some constant probability and calculates
  the value of the target function
  at the resulting point) and classification noise.
  This implies learnability of decision trees and weak learnability
  of \textsf{DNF} with this non-trivial noise.

 In the second part of this paper we develop a characterization
 for learnability with these extended statistical queries.
 We show that our
 characterization when applied to SQ--$\D_{\rho}$ is tight in terms
 of learning parity functions. We extend the result given by
 Blum et al.\ by proving that there is
 a class learnable in the PAC model with random
 classification noise and not learnable in SQ--$\D_{\rho}$.
\end{abstract}
\newpage


\section{Introduction and Overview}
\label{sec-intro} The problems of learning decision trees and DNF
expressions are among the most well studied problems of
computational learning theory. In this paper we address learning
of these classes in Valiant's popular PAC model with respect to
the uniform distribution.\footnote{This and all other models of
learning that are mentioned further are fully defined in the
following section.} The first algorithm that learns decision trees
in this setting was given by Kushilevitz and Mansour \cite{km91}.
The main tool that they used is the beautiful technique due to
Goldreich and Levin \cite{gl89}. Given a black box that will
answer membership queries for a Boolean function $f$ over
$\{0,1\}^n$, their technique efficiently locates a subset $A$ of
the inputs of $f$ such that the parity $\chi_A$ of the bits in $A$
is a weak approximator for $f$ with respect to the uniform (if
such $A$ exists). The correlation of the parity function $\chi_A$
is called the Fourier coefficient of $A$.
%and is usually denoted $\hat{f}(A)$.
Using this technique Kushilevitz and Mansour give an algorithm
that finds all the Fourier coefficients of a Boolean function
larger than a given threshold (it is usually referred as the {\tt
KM} algorithm). They prove that this algorithm can be used to
efficiently learn the class of decision trees. Later Jackson
\cite{ja94} again used the Kushilevitz-Mansour algorithm and
Freund's boosting technique to build his famous algorithm for
learning DNF expressions. An important property of Freund's
boosting in this context is that it requires weak learning only
with respect to distributions that are polynomially close to the
uniform.

 As we have mentioned the core of both algorithms
 utilizes \emph{membership query} (MQ) oracle to find weakly
 approximating parity function.  That is, in
order to use these algorithms we need access to values of the
learned function at any given point. Angluin and Kharitonov
\cite{ak91} proved, under standard cryptographic assumptions, that
if DNF expressions are learnable in the distribution-independent
PAC model with MQs then they are also learnable without MQs, in
other words membership queries are not helpful in the
distribution-independent PAC learning model. On the other hand,
there is no known algorithm that learns the above-mentioned
classes in the PAC model with respect to the uniform distribution
without the use of MQs.

The access to membership query oracle is not available in most of
applications and thus the question of whether it is possible to
avoid the use of them is of great theoretical interest and
practical importance. We investigate the way to reduce the
dependence on MQs for the distribution-specific PAC learning that
results from the following approach. Learning in the basic PAC
model (without MQs) represents learning without any control over
the points in which the value of the target function will be
known. On the other hand, membership queries represent the total
control over these points. Both of these situations can be seen as
(totally) different amount of control over the probability that
the next sampled point will be some specific $x$. This observation
suggests that we can represent any intermediate situation by
allowing the learning algorithm to choose the distribution with
respect to which the points will be sampled. The set of
distributions $\D$ that a learning algorithm may choose from
measures the amount of control that the learning algorithm has.
Naturally, such an extension defines learnability with respect to
a given set of distributions $\D$ or learnability in the PAC--$\D$
model with respect to $D$. In this paper we actually discuss
SQ--$\D$---the statistical query (SQ) analogue of the above model.
The SQ model introduced by Kearns \cite{kea93} allows the learning
algorithms to get ``good" estimates of expectation of any
(polynomially computable) function involving the value of $f(x)$
with respect to the target distribution $D$. The major benefit of
this restriction of the PAC model is that every algorithm in this
model can be automatically transformed into classification noise
tolerant algorithm. Analogously, the learning algorithm in
SQ--$\D$ is allowed to ask SQs with respect to any distribution in
$\D$. This restriction will be sufficient for the learning
algorithms that we demonstrate and possesses the property similar
to that of SQ model: under some limitations on $\D$ every
algorithm in SQ--$\D$ can be simulated using MQs with {\em
persistent} classification noise.

 Learning with respect to the uniform is applicable in the case
 when all the input bits (or {\em attributes}) are independent, i.e., the value of each attribute of the next randomly sampled point
 is chosen randomly and independently of other attributes.
 For this case we increase the
 amount of control by allowing the learning algorithm to influence
 the probability with which any attribute of the next sample point will be 0 or 1.
 Particularly, we consider the set $\D_{\rho}$ containing all the product
 distributions with each input bit being 1 with probability
 $\rho, 1/2$ or $1-\rho$ for a constant $0 < \rho < 1/2$.

 Learning algorithm in SQ--$\D_{\rho}$ can ask for a statistical query with respect to
 product distribution with some attributes biased towards 0 or 1.
 Amount of this bias is given by $\rho$ and is limited by restrictions on
 $\rho$. It is easy to see that boundary values of $\rho$ represent
 the regular PAC learning with respect to
 the uniform distribution ($\rho = 1/2$) and the PAC
 model with membership queries ($\rho = 0$).

 We give a weaker analogue of the
 {\tt KM} algorithm for SQ--$\D_{\rho}$ that efficiently finds all the
  target function Fourier coefficients for sets with less than
  $c\log{n}$ attributes and larger than a given threshold
  (we call it the {\em Bounded Sieve}). We show that the Bounded
  Sieve ({\tt BS}) by itself is, in fact, sufficient for learning the class
  of decision trees and weakly learning the class of DNF formulae.
  Then, by employing Freund's boosting algorithm, we prove
  that DNF are strongly learnable in SQ--$\D_{\rho}$. This requires an adaption
  of Freund's technique to SQ--$\D_{\rho}$. The adaptation we use
  is simpler and more efficient than a more general one
  by Aslam and Decatur \cite{ad93} (for the regular SQ model).

  As we have noted our SQ--$\D_{\rho}$ algorithms can be automatically
  transformed into persistent classification noise tolerant
  algorithms.
% Since $\D_{\rho}$ meets the above-mentioned limitation our algorithms
% can be simulated using MQs with  persistent classification noise.
 We then show how to modify {\tt BS} so that it could handle another non-trivial
 noise: product attribute noise in membership queries. This type of noise was
 previously considered as appearing in randomly sampled points, i.e., in the regular PAC model
 \cite{sv88,gs95,dg95,bjt99}. We extend this noise model to
 learning with membership queries. Particularly, for every sample point $x$
 (randomly sampled or asked as a membership query) the oracle flips every bit $i$ of
 $x$ with probability $p_i$ and returns the value of target function at
 the resulting point (unless classification noise is also present).
 This type of noise may reflect the situation when communication with the oracle is done
 using faulty channel or querying very specific point is difficult
 (or even impossible) to attain. Specifically, we show that our implementation of {\tt BS}
 handles any {\em known} attribute noise rates bounded away from $1/2$ by
 a constant. This implies learnability of decision trees and weak learnability of \textsf{DNF}
 in this setting. These results can also be extended to the case when the noise rate
is unknown yet {\em uniform}, that is, equal for all the
attributes.

In the second part of this paper we show some negative results
about the newly introduced model. We start by developing a
characterization of classes weakly learnable in the regular SQ model
and extend it to the SQ--$\D$ model.
 The characterization will enable us to show that
the class of all parity functions having at most $k(n)$ variables
is not weakly learnable in SQ--$\D_{\rho}$ if $k(n)$ is
 $\omega(\log{n})$.  This fact complements the Bounded Sieve which
obviously learns every such class for $k(n) = O(\log{n})$.
 Another interesting application of this characterization
 will help us to show that although the model we have introduced
 is powerful enough to learn DNF expressions, it still cannot
 learn a class of parity functions learnable in the PAC model
 with classification noise (to show this we rely on the result
  by Blum et al.\ \cite{bkw00}). We
also show that the class of all parity functions is not weakly
learnable with respect to any ``non-biased" distribution (i.e.,
not containing points with probability greater than
$\fr{poly(n)}$) even in SQ--$\D$ for $\D$ containing all
``non-biased" distributions. This gives futher evidence that the
class of all parity functions is hard for any statistics based
learning.

\subsection{Relation to Other Works} \label{sec-relation} Shamir
and Shwartzman \cite{shsh95} were first to consider extending the
notion of statistical query as a way to produce (persistent) noise
tolerant algorithms. Their extension is a statistical query
analogue of ``second-order" queries, that is, each PAC example
consists of a pair $(x,y) \in X^2$ with label $\ell = (f(x),f(y))$
and examples are sampled randomly with respect to distributions
over $X^2$ (not necessarily product). They show that the {\tt KM}
algorithm can be implemented using this type of extended
statistical queries and prove that this implementation can be
simulated using membership queries with persistent classification
noise. Using the same approach it was proven \cite{jss97} that
\textsf{DNF} can be learned by noisy membership queries.

The main distinction between this approach and the SQ--$\D$ is
that Jackson, Shamir and Shwartzman investigated strong extensions
of the SQ model which can be simulated by noisy membership queries
and then showed that \textsf{DNF} is learnable in an extension of
this kind.  On the other hand, we show relatively weak extension
of the SQ model, based on regular (``first-order") statistical
queries, in which \textsf{DNF} is learnable. The extension we
describe can as well be simulated by noisy membership queries.
% and is weaker
%than the extension used in \citep{jss97} to prove their DNF learnability result.

 Kearns in his original paper on the SQ model \cite{kea93} showed that the SQ model
is weaker than the regular PAC model. Specifically, he proved that
the class of all parity functions is not learnable in the SQ
model. Subsequently, Blum et al.\ \cite{bfjkmr94} gave a general
characterization of the complexity of statistical query learning
in terms of the number of uncorrelated functions in the concept
class. We give a very similar characterization in terms of number
of functions required to ``approximate" every function in the
concept class. Our characterization is not stronger as a tool for
proving negative results and its main advantage is its very simple
proof as well as easy extendability to learning in
\mbox{SQ--$\D$}.

\subsection{Organization}
The rest of this paper is organized as follows. In the next
section we  provide definitions and several well-known facts that
are used later in the paper. In Section \ref{sec-sqd-model} we
define the SQ--$\D$ model and present all the learnability results
for the model. Discussion of the attribute noise and the
implementation
 of {\tt BS} for PAC learning with membership queries corrupted by product attribute noise is given
 in Section \ref{sec-att-noise}. Finally, the characterization of
 classes weakly learnable in SQ--$\D$ and its applications are
 described in Section \ref{sec-char}.


\section{Definitions and Notation}

Before we start describing our results we provide all the relevant
definitions. This section describes models, tools, and techniques
that we are employing in our discussion.

\subsection{General Notation}
 We consider an input space $X = \{0,1\}^n$. A {\em
concept} is a Boolean function on $X$. For convenience, when
applying the Fourier transform technique, we define Boolean
functions to have output in $\{-1,+1\}$, where 1 stands for
\textsc{true} and $-1$ for \textsc{false}. A {\em concept class}
$\F$ is a set of concepts. Concept class is defined together with
the representation of every concept in it. The representation
scheme used (e.g., DNF or decision tree) defines the mapping
$\fuf{size}(f)$ that measures the size of the smallest
representation of $f$ within the given representation scheme.
Concept class together with a representation scheme is called
representation class. We denote the representation class of DNF
expressions by \textsf{DNF}, and the representation class of
decision trees by \textsf{DT}.

For any vector $y$ we denote by $y_i$ the $i$-th element (or bit)
of $y$ and by $y_{[i,j]}$ we denote $y_iy_{i+1}\dots y_j$. Bits of
a member in our domain are usually called {\em attributes} and are
referred using {\em variables} $x_1,x_2,\ldots,x_n$. We use
$[a]^k$ to denote a vector of $k$ elements each equal to $a$. We
use $U$ to denote the uniform distribution over $X$. For Boolean
functions $f(x)$ and $g(x)$ over the domain $X$ we say that $g(x)$
$\epsilon-$approximates $f(x)$ with respect to distribution $D$
(on $X$)
 if $Pr_D[f(x)=g(x)] \geq 1-\epsilon$. It is easy to see that this is
 equivalent to saying that $\E_D[fg] \geq 1-2\epsilon$.
 For any real-valued function $\phi$ define $L_{\infty}(\phi) =
  \max_{x\in X}|\phi(x)|$.

Denote by $\Delta(\gamma)$ and $\Delta'(\gamma)$ probability
distributions over $\{0,1\}$ and $\{-1,1\}$ respectively, such
that the probability of 1 is $\gamma$ (that is, the probability
distribution of Bernoulli random variable with parameter
$\gamma$).

If $p(n) = O(q(n))$ and $q'(n)$ is $q(n)$ with all the logarithmic
factors removed we write $p(n) = \tilde{O}(q(n))$. Similarly we
define $\tilde{\Omega}$ for lower bounds. This extends to $k$-ary
functions in an obvious way.

\subsection{Learning Models}

Throughout this paper we refer to several models of learning.
Below we describe all the mentioned models and provide the
relevant definitions.

\subsubsection{The PAC Model}
  Most of the models we are discussing are based on Valiant's
  Probably Approximately Correct (PAC) model of learnability.
  The basic version of this model reflects passive learning from random examples.
  Formally, let $f$ be a Boolean function and $D$ be a distribution over $X$.
   An {\em example oracle} for $f$ with respect to
  $D$ --- EX($f,D$), is an oracle that on request draws an instance
  $x$ at random according to the probability distribution $D$ and
  returns the example $\langle x,f(x)\rangle$. A {\em membership
  oracle} for $f$ --- MEM($f$), is an oracle that given any point
  $x$ returns the value $f(x)$. Let $\epsilon$ and $\delta$ be
  positive values (called the {\em accuracy} and the {\em confidence} of
  learning, respectively). We say that the
  representation class $\F$ is ({\em strongly}) PAC-learnable if there is an algorithm
  $\A$ such that for any $\epsilon$,$\delta$, $f \in \F$ of size $s$ (the {\em target function}) and any distribution $D$, with
  probability at least $1-\delta$, algorithm
  $\A(EX(f,D),n,s,\epsilon,\delta)$ produces an $\epsilon$--approximation
  for $f$ with respect to $D$ in time polynomial in  $n,s,1/\epsilon$ and $\log{(1/\delta)}$.
  We generally drop the ``PAC" from  ``PAC-learnable" when the
  model is clear from the context.

  We will consider a number of variations on the basic PAC model. Let
  $\M$ be any model of learning (e.g., PAC). If $\F$ is
  $\M$-learnable by an algorithm $\A$ that requires a membership
  oracle then $\F$ is {\em $\M$-learnable using membership queries}
  (or learnable in $\M+$MQ).
  If $\F$ is $\M$-learnable for $\epsilon = 1/2 -p(n,s)$, where
  $p(\cdot,\cdot)$ is a fixed polynomial, then $\F$
  is weakly $\M$-learnable. Note that in the above definitions
  we have placed no restriction on the example distribution $D$,
  that is they are {\em distribution-independent}. If $\F$ is
  $\M$-learnable only for a specific distribution $D$ (known to
  $\A$) then $\F$ is $\M$-learnable with respect to D (such learning
  is called {\em distribution-specific} and the learning model is denoted by
  $\M_D$).

  \begin{remark}
 \label{bounded}
  When we are dealing with concept classes in which all the concepts have representations of polynomial (in $n$) size the parameter $s$ could be   omitted from the discussion.
  \end{remark}


\subsubsection{Noise Models}
 In the above described models it was always assumed that all the labels
 (i.e., values of the target function)
 that are received from oracles are accurate. We are now
 going to describe the extensions of the above models in which
 labels provided with the examples (random or received from
 membership oracle) are corrupted by random noise.\footnote{
 Noise in points themselves or attribute noise is defined in
 Section \ref{sec-att-noise}.} This extension
 was first considered in the learning literature by Angluin and
 Laird \cite{al88}. Since then, in addition to being the subject
 of a number of theoretical studies \cite{al88,lai88,sl88}, the {\em classification noise} model
 has become a common paradigm for experimental machine learning.

 Below we will give the formal definitions that incorporate the
 notion of random classification noise into the PAC learning model.
  A new parameter $\eta \leq 1/2$ called the {\em noise
rate} is introduced and the regular example oracle EX$(f,D)$ is
replaced with the faulty oracle EX$^\eta(f,D)$. On each call,
EX$^\eta(f,D)$ first draws an input $x$ according to $D$. The
oracle then chooses randomly a value $\zeta \in \{-1,1\}$; $1$
with probability $\eta$ and $-1$ with probability $1-\eta$ and
returns $\zeta f(x)$. When $\eta$ approaches $1/2$ the result of
the corrupted query approaches the result of the random coin flip,
and therefore learning becomes more and more difficult. Thus the
running time of algorithms in this model is allowed to
polynomially depend on $\fr{1-2\eta}$. This model of noise is not
suitable for corrupting labels returned by a membership oracle
MEM$(f)$ since a learning algorithm can, with high probability,
find the correct label at point $x$ by asking the label of $x$
polynomial (in $\fr{1-2\eta}$) number of times and then returning
the label that appeared in the majority of answers. An appropriate
modification of the noise model is the introduction of {\em
persistent classification noise} by Goldman, Kearns and Shapire
\cite{gks94}. In this model, as before, the answer to a query at
each point $x$ is corrupted by random value $\zeta$. But if the
oracle was already queried about the value of $f$ at some specific
point $x$, it returns the same value it has returned in the first
query (i.e, in such a case the noise persists and is not purely
random). There is another subtle but important distinction of this
model. We cannot require learning algorithms in this model to work
for an arbitrarily small confidence parameter $\delta$. Because
with some (though negligible) probability target function can be
persistently totally corrupted (for example negated) making
learning absolutely infeasible. So we must impose a positive lower
bound on the confidence $\delta$ that can be required from the
learning algorithm in the persistent noise model, although this
bound will be negligible.

\subsubsection{The Statistical Query Learning Model}
The {\em statistical query model} introduced by Kearns
\cite{kea93} is a natural restriction of the PAC learning model in
which a learning algorithm cannot see labelled examples of the
target concept, but instead may only estimate probabilities
involving the target concept. Formally, when learning with respect
to distribution $D$ the learning algorithm is given access to
STAT($f,D$) -- a {\em statistical query oracle} for target concept
$f$ with respect to distribution $D$. A query to this oracle is a
pair $(\psi,r)$, where $\psi : \{0,1\}^n \times \{-1,+1\}
\rightarrow \{-1,+1\}$ is a {\em query function} and $ r \in
[0,1]$ is a real number called the {\em tolerance} of the query.
The oracle may respond to the query with any value $v$ satisfying
$$|\E_D[\psi(x,f(x))] - v| \leq r \ .$$ We denote the result returned by
oracle STAT$(f,D)$ to query $(\psi,r)$ by STAT$(f,D)[\psi,r]$.

We say that the concept class $\F$ is {\em (strongly) learnable
from statistical queries} (SQ--learnable)  if there is a learning
algorithm $\A$, and polynomials $p(\cdot,\cdot,\cdot)$,
$q(\cdot,\cdot,\cdot)$ and $r(\cdot,\cdot,\cdot)$ such that for
every $n$, $f \in \F$, distribution $D$ and  $\epsilon$ the
following holds. Given $n$, the size $s$ of $f$, $\epsilon$ and
access to STAT($f,D$), algorithm $\A$ produces an
$\epsilon$--approximation to $f$ with respect to $D$ in time
bounded by $p(n,s,1/\epsilon)$. Furthermore, algorithm $\A$ makes
queries $(\psi,r)$ in which $\psi$ can be evaluated in time
bounded\footnote{In most of the cases we will use simple and
explicit query functions. In such cases we omit the analysis of
complexity of these functions.} by $q(n,s,1/\epsilon)$, and in
which $r$ is lower-bounded by $1/r(n,s,1/\epsilon)$.

 Although the ``usual" randomness of
PAC algorithms is incorporated in the statistical queries some
applications may still need randomization. In this case we, as
usual, have confidence parameter $\delta$, i.e., get the required
approximation with probability at least $1-\delta$, and the
running time of the algorithm is also polynomial in
$\log{(1/\delta)}$.

It is important to note that the statistical query model is indeed
a restriction of the basic PAC model. This follows from the fact
that a query $(\psi, r)$ to STAT($f,D$) where $\psi$ can be
computed in polynomial time and $r$ lower-bounded by inverse of
the polynomial can (with high probability) be estimated (within
$r$) by computing $\psi$ on polynomial number of samples and then
averaging the result (see Section \ref{subsec-eev} for more
details). Moreover, as it was shown by Kearns \cite{kea93}, this
estimation can also be made in the presence of random
classification noise.
 This means that any SQ
algorithm can be automatically transformed into noise tolerant
algorithm. This important property of the model allowed Kearns to
obtain noise-tolerant algorithms for practically every concept
class for which an efficient learning algorithm in the original
noise-free PAC model is known.

\subsection{Estimating Expected Values} \label{subsec-eev} We will
frequently need to estimate the expected value of a random
variable. Although the SQ model itself provides us with such
estimates, sometimes a random variable will be a result of a
statistical query. In such a case we usually cannot use a
statistical query to estimate the expectation of the variable.
Thus we will find the expectation by sampling the variable and
averaging the results. The justification for this technique is
usually referred as {\em Chernoff bounds}. The Chernoff's lemma
which is concerned with 0,1 random variables was generalized by
Hoeffding. Hoeffding's lemma, presented below, allows to compute
the sample size needed to ensure that with high probability the
sample mean closely approximates the true mean of the random
real-valued variable.
\begin{lemma}[Hoeffding]
\label{lem-hoeffding} Let $X_i$, $1 \leq i \leq m$, be independent
random variables all with mean $\mu$ such that for all $i$, $a
\leq X_i \leq b$. Then for any $\lambda > 0$,
\equn{\pr\left[\left|\fr{m}\sum_{i=1}^m X_i - \mu\right| \geq
\lambda\right] \leq 2e^{-2\lambda m/(b-a)^2}\ .}
\end{lemma}

According to the lemma if we sample randomly and independently a
random variable $Y$ with values in $[a,b]$ then to reach the
accuracy of at least $\lambda$ with confidence of at least
$\delta$ it is sufficient to take
\equn{(b-a)^2\ln{(2/\delta)}/(2\lambda^2) =
O(\log{(1/\delta)}(b-a)^2/\lambda^2)} samples of $Y$. That is,
Hoeffding's lemma gives the following corollary.

\begin{corollary}
\label{cor-amean} Let $Y$ represent a random variable that
produces values in the range $[a, b]$. Then there is an algorithm
{\tt AMEAN}$(Y, b -a, \lambda, \delta)$ that for any such $Y$ and
any $\lambda > 0$ produces a value $\mu$ such that $|\E[Y] - \mu|
\leq \lambda$ with probability at least $1 - \delta$. Furthermore,
{\tt AMEAN} runs in $O(\log{(1/\delta)}(b-a)^2/\lambda^2)$ time.
\end{corollary}

\begin{remark}
\label{rem-est-sq} Another implication of the fact that a value of
random variable is a result of a statistical query is that we get
its value within some tolerance $r$. It is easy to see that in
such a case the above execution of procedure {\tt AMEAN} will
return the expectation within $r + \lambda$.
\end{remark}

In general, we say that a constant value $c$ can be {\em
efficiently approximated} if there exists an algorithm that for
every $\tau$ and $\delta$, with probability at least $1-\delta$
produces a value that estimates $c$ within $\tau$ and runs in time
polynomial in $1/\tau$ and $\log{(1/\delta)}$.

The SQ model gives the learner ability to estimate expectations of
Boolean functions involving the value of the target function
$f(x)$. In further algorithms we will need to get expectations of
any real-valued functions involving $f(x)$. For this purpose we
will rely on the following simple lemma.
\begin{lemma}

\label{RV-SQ} There exists a procedure {\tt RV-SQ} that for every
real-valued function $\phi: \{0,1\}^n\times \{-1,1\} \rightarrow
[-b,b]$, tolerance $r$ and distribution $D$,
 {\tt RV-SQ}$(\mbox{STAT}(f,D),\phi,b,r)$ returns a value $v$
satisfying \equn{|\E_{D}\phi(x,f(x)) - v | \leq r} Its time
complexity is $O(\log{(b/r)})$. The required tolerance is bounded
from below by  $\frac{r}{4b\log{(b/r)}}$ and complexity of the
query functions is the complexity of computation of $\phi$ plus
$O(n)$.
\end{lemma}
{\bf Proof:} We find the required expectation by decomposing
$\phi$ into linear
 combination of Boolean functions according to its binary
 representation in the following way:\\ Let
$k$ be equal to $\lceil\log{b}\rceil$ and $l =
\lceil\log{\frac{2}{r}}\rceil$ $$\phi(x,f(x)) = -2^{k} +
2^{k}\phi_{k}(x,f(x)) + 2^{k-1}\phi_{k-1}(x,f(x)) + \ldots +
\phi_0(x,f(x)) +$$ $$\fr{2}\phi_{-1}(x,f(x)) + \ldots +
2^{-l}\phi_{-l}(x,f(x)) + \psi(x,f(x))$$
 Where all the $\phi_i$ are Boolean 0,1-valued
functions representing the bits of $\phi(x,f(x)) + 2^k$
(nonnegative function). Choice of $l$ ensures that $0 \leq
\psi(x,f(x)) \leq r/2$. Since we would like the composition to use
$\pm 1$-valued functions, we can further write
 $$\phi(x,f(x)) = 2^{k-1}(2\phi_{k}(x,f(x)) - 1) +
 2^{k-2}(2\phi_{k-1}(x,f(x))-1)  + \ldots +$$ $$
2^{-l-1}(2\phi_{-l}(x,f(x)) - 1) + (\psi(x,f(x)) - 2^{-l-1})$$

For every $-l \leq j \leq k$, $2\phi_j(x,f(x))-1$ is a $\pm
1$-valued Boolean function depending on $x$ and $f(x)$. Thus to
get the required approximation, for every $-l \leq j \leq k$ we
approximate $\E_{D} [2\phi_j(x,f(x))-1]$ within
$\frac{r}{2^{j+1}(k+l+1)}$ using query
$$\left(2\phi_j(x,f(x))-1,\frac{r}{2^{j+1}(k+l+1)},D\right)$$ Let
$\mu_j$ be the result of the query. The procedure returns the
value $ \sum_{-l \leq j \leq k} 2^{j}\mu_j$. This value is a valid
result since $$ |\E_{D}\phi(x,f(x)) - \sum_{-l \leq j \leq k}
2^{j}\mu_j| \leq \sum_{-l \leq j \leq k}
2^j\frac{r}{2^{j+1}(k+l+1)} + \E_{D} [|\psi(x,f(x)) - 2^{-l-1}|]
\leq $$ $$ \sum_{-l \leq j \leq k} \frac{r}{2(k+l+1)} + r/2 = r.$$
The complexity of this algorithm is $O(\log{(b/r)})$. The required
tolerance is at least $$\frac{r}{2^{k+1}(k+l+1)} \geq
\frac{r}{4b\log{(b/r)}}\ .$$ Complexity of the query functions is
the complexity of computation of $\phi$ plus $O(n)$. \qed

\subsection{The Fourier Transform}
\label{sec-def-fourier}
 Many parts of our analysis rely heavily on the use of the
 well-known Fourier
transform technique. This beautiful technique was introduced to
the learning theory by Linial, Mansour, and Nisan \cite{lmn89}.
Below we give a short description of the technique and its
notation (an extended survey of the technique and the {\tt KM}
algorithm was given by Mansour \cite{mansour}).

Since all the expectations and probabilities we use when applying
this technique will be with respect to the uniform distribution,
we will sometimes omit $U$ in our formulae. For every vector $a
\in \{0,1\}^n$ we define $\chi_a$---the parity function for the
vector by \equn{\chi_{a}(x) = (-1)^{\sum_{i \in \{1,\ldots,n\}}
a_ix_i}} A common alternative is to associate a parity function
with a set $A \subseteq \{1,\ldots,n\}$ of attribute indices that
influence the parity. This definition implies that $\chi_a$ and
$\chi_A$ are equivalent whenever $A = \{i\ |\ a_i = 1\}$. Also
define the inner product by $\langle f,g \rangle = \E[f(x)g(x)]$
and define the norm as usual by $\|f\| = \sqrt{\E_x[f^2(x)]}$. In
this context we define $w(a)$ to be the number of bits $a_i$ equal
to 1 in $a$ and call it the {\em weight} of $a$. Let $\oplus$
denote the bit-wise XOR operation on binary vectors. Parity
functions possess several important and easily verifiable
properties:
\begin{enumerate}
\item $\chi_a(x)\chi_b(x) = \chi_{a\oplus b}(x)$;
\item $\chi_a(x)\chi_a(y) = \chi_a(x\oplus y)$;
\item $\E_x[\chi_a(x)] = 0$ for every $a \neq [0]^n$;
\item $\chi_{[0]^n} \equiv 1$ and thus $\E_x[\chi_{[0]^n}(x)] = 1$.
\end{enumerate}
By these properties, for every $a,b \in \{0,1\}^n$, \equn {\langle
\chi_a,\chi_b \rangle = \E_x[\chi_a(x)\chi_b(x)] =
\E_x[\chi_{a\oplus b}(x)] = \left\{\arr{ll}{0 & a \neq b
 \\ 1 & a = b }\right. }
Thus the set $\{\chi_a\}_{a \in\{0,1\}^n}$ is orthonormal.  Its
size is $2^n$ which is also the dimension of the vector space of
real-valued functions on $\{0,1\}^n$, i.e., the set forms the
basis for the vector space. That is, every function $g : \{0,1\}^n
\rightarrow \mathbb{R}$  can be uniquely expressed as a linear
combination of parity functions: \equn {g(x) = \sum_{a \in
\{0,1\}^n}\hat{g}(a)\chi_a(x) \ .} The coefficients $\hat{g}(a)$
are called {\em Fourier coefficients} and vector of coefficients
$\hat{g}$ the Fourier transform of $g$. The {\em degree} of the
coefficient $\hat{g}(a)$ is the weight of $a$.\footnote{For a
parity function defined using a set of indices the degree of the
corresponding coefficient is the size of the set.} Because of the
orthonormality of the parity functions, $\hat{g}(a) =
\E[g(x)\chi_a(x)]$. Thus $\hat{g}(a)$ represents the correlation
of $g$ with the parity function $\chi_a$.

Parseval's identity states that for every function $g$, $\|g\|^2 =
\E[g^2] = \sum_a \hat{g}^2(a)$. For Boolean $g$ this implies that
$\sum_a \hat{g}^2(a) = 1$.

The Fourier transform technique is usually used in the following
way. Given some access to the target function $f$ we try to find
its Fourier coefficients or at least the ``largest" among them
(here and in further discussion we refer to the absolute value of
the coefficient). These coefficients define a real-valued function
$g$. We then produce hypothesis by taking $h=\fuf{sign}(g)$.
Justification for this method is provided by the following
well-known lemma.
\begin{lemma} \label{fact-fourier-app} Let $f$ be
a Boolean function and $g$ be any real-valued function
  then \equn{\pr[f(x) \not = \fuf{sign}(g)(x)]
  \leq \E[(f(x)-g(x))^2] = \sum_{a\in\{0,1\}^n}(\hat{f}(a)-\hat{g}(a))^2\ .}
\end{lemma}

\subsection{The {\tt KM} Algorithm} \label{sec-km} One of the most
fundamental application of the Fourier analysis is the
Kushilevitz-Mansour algorithm \cite{km91}. This randomized
algorithm uses membership queries to find all the Fourier
coefficients of the target function larger than some threshold
$\theta$ in time polynomial in $n,\log{(1/\delta)}$ and
$\theta^{-1}$. The algorithm is based on an ingenious way to
estimate the sum of squares of all the Fourier coefficients for
vectors starting with some given prefix. Formally for $0 \leq k
\leq n$ and $\alpha \in \{0,1\}^k$ denote \equn{ C_{\alpha} =
\sum_{b \in \{0,1\}^{n-k}} \hat{f}^2(\alpha b)\ .} As it has been
proved by Kushilevitz and Mansour \equn{ C_{\alpha} = \E_x
\left[\E_y[f(yx)\chi_{\alpha}(y)]\right]^2 \ ,} where the
expectation is uniform over $x \in\{0,1\}^{n-k}$, $y\in\{0,1\}^k$.
This formula means that $C_{\alpha}$ can be efficiently
approximated by random sampling (see Section \ref{subsec-eev}).
 If there exists at least one Fourier coefficient for
a vector starting with $\alpha$ and larger (by absolute value)
than $\theta$ then $C_\alpha \geq \theta^2$. Using this simple
observation Kushilevitz and Mansour wrote a recursive procedure
(named {\tt Coef}) that given $\alpha$ finds all the Fourier
coefficient for vectors starting with $\alpha$ and larger than
$\theta$ as follows. For $\alpha$ being a vector of length $n$
{\tt Coef}$(\alpha)$ returns the vector if its Fourier coefficient
is larger than $\theta$. For shorter prefixes the procedure checks
whether $C_\alpha \geq \theta^2$. If so executes itself for
prefixes $\alpha\cdot 0$ and $\alpha\cdot 1$ and returns the union
of the results. Otherwise ($C_\alpha < \theta^2$), returns the
empty set. When executed on an empty prefix this procedure should
return all the Fourier coefficients that are larger than $\theta$.
Effectiveness of this algorithm follows from the fact that for any
given length of prefix $k$ \equn{\sum_{\alpha \in \{0,1\}^k}
C_\alpha = \sum_{a \in \{0,1\}^n} \hat{f}(a)^2 = \|f\|^2 = 1 \ .}
This means that for any given length of prefix {\tt Coef} will be
called at most $1/\theta^2$ times, i.e., the resulting algorithm
is polynomial in $n, \log{(1/\delta)}$ and $1/\theta$.

In this description we have neglected the fact that $C_{\alpha}$
is not known exactly but only estimated. Minor modifications
required to solve this problem can be found in the original
description of the algorithm \cite{km91} or in the similar
analysis of our analogue to this algorithm described in Section
\ref{sec-bs}.
\begin{remark}
\label{rem-km-real} It can be easily seen that the algorithm can
be applied to any real-valued function $\phi(x)$ (and not only to
Boolean). In such a case for any given length of prefix the
procedure {\tt Coef} will be called at most $\|\phi\|^2/\theta^2$
and thus running time of the {\tt KM} algorithm becomes polynomial
in $\|\phi\|$. Moreover, we do not need to know the value of
$\phi(x)$ exactly. An ability to efficiently estimate the value of
$\phi(x)$ with any desired accuracy is sufficient since the value
of $\phi(x)$ is used only for estimating $C_{\alpha}$'s and
Fourier coefficients of $\phi$.
\end{remark}

\section{Learning by Extended Statistical Queries}
\label{sec-sqd-model}

 In this section we introduce a new model of
learning. The model (or actually a collection of models) we define
is at least as strong as the basic PAC model and is weaker (or
equivalent) to the use of membership queries. For all of our
applications a restriction of the new model to statistical queries
is sufficient. Moreover, as we will demonstrate, the SQ version of
the new model has a few important advantages. Thus we almost
immediately start discussing the restricted version.

We begin by defining the SQ--$\D$ model and discussing its
properties. Then we apply these general ideas to learning of
decision trees and DNF expressions with respect to the uniform
distribution. In particular, we implement the Bounded Sieve
algorithm which has the functionality of the {\tt KM} algorithm
with an additional restriction. The Bounded Sieve by itself is
sufficient for learning of \textsf{DT} and weak learning of
\textsf{DNF}. By adapting Freund's boosting technique to our model
we show that DNF expressions are also strongly learnable.


\subsection{Extending the SQ Model}
 We are now going to give a detailed description of the extension
 of the basic PAC model and the SQ model.
 In this new model we allow the learning algorithm to supply a
  distribution with respect to which the next point will be sampled.
 Formally, let $D$ be a distribution over $X$ and $\D$ be a set of distributions over $X$ containing distribution
$D$. The new example oracle EX$(f,\D)$ is defined as follows. A
query to this oracle is a distribution $D' \in \D$. To such a
query the oracle EX$(f,\D)$ responds with example $\langle x,f(x)
\rangle$, where $x$ is sampled randomly and independently
according to distribution $D'$. In other words, access to
EX$(f,\D)$ is equivalent to access to all the oracles of the form
EX$(f,D')$, where $D' \in \D$. Learning with access to the
newly-defined oracle is referred as learning in the PAC--$\D$
model. Learnability (with respect to $D$) in this new model is
defined as in the regular PAC model with the regular example
oracle EX$(f,D)$ replaced by EX$(f,\D)$.

The extension of the PAC learning model obviously possesses the
following properties:
\begin{itemize}
\item PAC--$\D$ is equivalent to the regular PAC model for $\D = \{D\}$;
\item We can simulate the PAC--$\D$ model using membership queries;
\item If we do not restrict $\D$ we might be able to simulate membership queries (e.g., using distributions with all the weight concentrated in one point).
\end{itemize}

Our next step is restricting the PAC--$\D$ model to statistical
queries, or extending Kearns' SQ model in the similar fashion.
That is, instead of the regular STAT$(f,\D)$ oracle  a learning
algorithm in the SQ--$\D$ model is provided with access to all the
oracles STAT$(f,\D')$ where $D' \in D$. More formally, learning
algorithm in the {\em SQ--$\D$} model is supplied with the
STAT$(f,\D)$ oracle, where $f$ is the target concept. A query to
this oracle is a triple $(\psi,r,D')$ where $\psi : \{0,1\}^n
\times \{-1,+1\} \rightarrow \{-1,+1\}$ is a query function, $r
\in [0,1]$ is a tolerance (as in the SQ model) and $D'$ is a
distribution from $\D$. To such a query the oracle responds with
the value $v$ satisfying $|\E_{D'}[\psi(x,f(x))] - v| \leq r$.
Respectively, we say that the concept class $\F$ is learnable in
the SQ--$\D$ model with respect to $D$ if it is learnable by a
polynomial algorithm (as defined in the regular SQ model) with
access to the newly-defined oracle.

An important property of the model that results from the fact that
it is statistical-query-based is that STAT$(f,\D)$ can be
simulated by membership queries in the presence of random
classification noise. The simulation is done by offsetting the
effect of the noise on a query \cite{kea93}. Since we are
simulating using membership queries we are actually interested in
learning with {\em persistent} classification noise. Offsetting
noise in this model can be a more complicated task. However, if
all the sample points in the sample that was used for simulating
statistical queries are different then the persistent noise in the
sample is equivalent to usual random classification noise and
therefore we can offset the effect of noise in the same way as
before. In the following theorem we show that under certain simple
restriction on set $\D$ the probability to get two equal points in
the sample will be negligible and therefore we will be able to use
Kearns' noise offsetting procedure.\footnote{Proof of this theorem
can be easily extracted from \cite{jss97}.} Denote by
$L_{\infty}(\D) = \max_{D' \in \D}\{L_{\infty}(D')\}$.

\begin{theorem}
\label{convert2noise} Let $\F$ be a concept class and $\D$ be a
set of distributions with sampling oracle available for every $D'
\in \D$. If $\F$ is learnable in SQ--$\D$ with respect to $D$ and
$L_{\infty}(\D)<\tau^n$ for a constant $0 < \tau < 1$, then $\F$
is learnable with respect to $D$ using membership queries
corrupted by random persistent classification noise.
\end{theorem}
{\bf Proof:} Let $\A$ be an algorithm that learns $\F$ in
SQ--$\D$. Let us examine the execution of $\A$ for a particular $f
\in \F$.
 We can assume that the learning parameters $s$ and $\epsilon$
 are subexponential\footnote{A more accurate restriction on these parameters can be deduced from equation (\ref{equ-mqnoise}).}
 in $n$ (otherwise a trivial exponential learning algorithm will solve the
problem). The number of queries that $\A$ asks during its
execution is bounded by a fixed polynomial in $n$, $s$, and
$\epsilon$ (denote it by $p_1(n,s,\epsilon^{-1})$ ). Let
$1/p_2(n,s,\epsilon^{-1})$ denote the inverse-polynomial bound on
the required accuracy of every query. Let $\sigma = \frac
{\tau^{n/2}}{2p_1(n,s,\epsilon^{-1})}$. According to Corollary
\ref{cor-amean}, in order to estimate all the $\A$'s queries with
accuracy of $1/p_2(n,s,\epsilon^{-1})$ and with confidence of
$\sigma$ polynomial number of sample points is required (the
polynomial depends on $n, s, \epsilon^{-1},$ and $\fr{1-\eta}$
since the estimation is done in the presence of noise). Denote
this polynomial bound by $p_3(n,s,\epsilon^{-1},\fr{1-\eta})$.
Assuming that the noise is random, all the estimations will
succeed with probability at least $1- \sigma
p_1(n,s,\epsilon^{-1}) = 1 - \tau^{n/2}/2$. The probability to see
a point more than once in the above sample is less than
\equ{\label{equ-mqnoise}
p_3^2(n,s,\epsilon^{-1},\fr{1-\eta})L_{\infty}(\D) \leq
p_3^2(n,s,\epsilon^{-1},\fr{1-\eta}) \tau^n < \tau^{n/2}/2 \ .}

Therefore, with probability at least $1- \tau^{n/2}/2$, the noise
in the generated sample is truly random. This means that the
simulation with succeed with probability at least $1- \tau^{n/2}$.

Now, if we impose a lower bound of $\tau^{n/2}$ on confidence
$\delta$ that can be required from the learning algorithm  (as is
allowed when the noise is persistent), then the above-described
procedure gives the algorithm that learns $\F$ by using membership
queries with persistent classification noise. \qed


\subsection{Learning with Respect to the Uniform Distribution
 Using Product Distributions}
\label{sec-bs}
 In this section we will examine set $\D$ that contains product
distributions such that every input bit is $1$ with probability
$\rho$ , $1/2$ or $1-\rho$, that is, for every input bit $x_i$,
$\pr[x_i = 1] = p_i$ independently of all the other inputs and
$p_i \in \{\rho,1/2,1-\rho\}$ (we call $p=p_1p_2\ldots p_n$ a {\em
probability vector}). We assume that $\rho$ is a constant
satisfying $1/2 > \rho> 0$. For every $p \in
\{\rho,1/2,1-\rho\}^n$ denote by $D_p$ the product distribution
defined by this probability vector. Denote by \equn{\D_{\rho} =
\{D_p\ |\ p \in \{\rho,1/2,1-\rho\}^n\}\ .} This distribution
class will be used to learn with respect to the uniform
distribution which itself is contained in $\D_\rho$ for any
$\rho$.

%\section{Properties of the SQ--$\D_{\rho}$ model}
\label{subsec-bs-def} For a probability vector $p$ and $x \in
\{0,1\}^n$ denote by $p \oplus x$ probability vector for which $(p
\oplus x)_i = p_i$ if $x_i = 0$ and  $ (p \oplus x)_i = 1 - p_i$
if $x_i = 1$. Since $p \in \{\rho,1/2,1-\rho\}^n$ we also have $p
\oplus x \in \{\rho,1/2,1-\rho\}^n$.

According to the discussion in Section \ref{sec-intro}, we call
this $\rho$ the {\em degree of control} that a learning algorithm
in SQ--$\D_{\rho}$ has. Intuitively, smaller values of $\rho$
represent larger degree of control (with 0 giving the equivalent
of membership queries). This is proven in the next lemma.

\begin{lemma}
\label{lem-degree-control} If $1/2 \geq \rho_1 \geq \rho_2 > 0$
then any algorithm in
 SQ--$\D_{\rho_1}$ can be simulated by
 a randomized algorithm in SQ--$\D_{\rho_2}$.
\end{lemma}
{\bf Proof:}
 We show that a result of any statistical query in SQ--$\D_{\rho_1}$ can be  efficiently approximated in SQ--$\D_{\rho_2}$.
Let $(\phi, r, D_p)$ be a query to STAT$(f, \D_{\rho_1})$. For
$\pi \in [0,1]$ we define $q = [\pi]^n$ (i.e., vector with every
element equal to $\pi$). Let $x_1$ and $x_2$ be chosen randomly
according to $\Delta(\pi)$ and $\Delta(\rho)$. It is easy to
verify that $x_1 \oplus y_1$ is distributed according to
$\Delta(\pi+\rho-2\pi\rho)$. When sampling with respect to a
product distribution $D_p$, bit $i$ is sampled randomly and
independently according to distribution $\Delta(p_i)$. Therefore,
\equ{\label{eq-simsqd} \E_{y \sim  D_p}[\phi(y,f(y))] = \E_{x \sim
D_q} \left[\E_{y \sim D_{s \oplus x}}[\phi(y,f(y))]\right]\ , }
where $s$ is a probability vector such that $p_i =  \pi+s_i-2\pi
s_i$ or $s_i = \frac{p_i-\pi}{1-2\pi}$. For $\pi =
\frac{\rho_1-\rho_2}{1-2\rho_2}$ ($\pi \in [0,1]$ since $1/2 \geq
\rho_1 \geq \rho_2 > 0$) we get that \equn{s_i =
\left\{\arr{ll}{\rho_2 & p_i = \rho_1
 \\ 1 - \rho_2 & p_i = 1-\rho_1 \\ 1/2 & p_i = 1/2}\right. }
Thus $D_s \in \D_{\rho_2}$ and for every $x$, $D_{s \oplus x} \in
\D_{\rho_2}$, i.e., $\E_{y \sim D_{s \oplus x}}[\phi(y,f(y))]$ can
be estimated by a query in SQ--$\D_{\rho_2}$. Equation
(\ref{eq-simsqd}) means that the result of query $(\phi, r, D_p)$
can be efficiently approximated by sampling in SQ--$\D_{\rho_2}$.
\qed
\begin{remark}
\label{rem-non-uniform-d} For the simplicity of the exposition we
have chosen the degree of control to be uniform over all the
attributes (and equal to $\rho$). Alternatively, we can consider
classes of distributions for which this degree is different for
each attribute. It can be easily shown (by slightly modifying the
proof of Lemma \ref{lem-degree-control}) that if $\D$ is a class
of distributions giving a learning algorithm degrees of control in
range $[\rho_1,\rho_2]$ then the SQ--$\D$ model is not stronger
than SQ--$\D_{\rho_1}$ and not weaker than SQ--$\D_{\rho_2}$.
\end{remark}

It is important to note that according to Theorem
\ref{convert2noise} learning in SQ--$\D_{\rho}$ implies learning
with persistent noise since $L_{\infty}(\D_{\rho}) = (1-\rho)^n$.


\subsection{An Analogue to the {\tt KM} Algorithm} Our purpose is
to give an algorithm in the SQ--$\D_{\rho}$ model that could find
``large" Fourier coefficients of the target function. When
membership queries are available this task can be performed using
the {\tt KM} algorithm. As we will show in the characterization
section this task cannot be performed in the SQ--$\D_{\rho}$
model. Instead we will give a weaker analogue of the {\tt KM}
algorithm implemented in the SQ--$\D_{\rho}$ model. It will find
all the ``large" Fourier coefficients of degree lower than $\ell$
in time polynomial in $2^{\ell}$ thus allowing only the Fourier
coefficients with degree bounded by $c\log{n}$ to be found
efficiently. Particularly, for the target function $f$, given
parameters $n$,  $\theta$, $\ell$ and $\delta$ it will, with
probability at least $1-\delta$, find a set $S \subseteq
\{0,1\}^n$  with the following properties:
\begin{itemize}
\item for all $a$, if $|\hat{f}(a)| \geq \theta$ and $w(a) \leq \ell$
then $a \in S$;
\item for all $a \in S$, $|\hat{f}(a)| \geq \theta/2$.
\end{itemize}
We say that such a set possesses the {\em large Fourier
coefficient property for function $f$, threshold $\theta$ and
degree $\ell$} (this property can be defined for any real-valued
function).

The {\tt KM} algorithm is based on a subroutine that estimates the
sum of squares of all the coefficients for vectors starting with
given prefix.  In the same fashion our algorithm will be based on
ability to estimate weighted sum of squares of all the
coefficients for vectors starting with a given prefix. In
particular, for $0 \leq k \leq n$ and $\alpha \in \{0,1\}^k$
denote \equn{S_{\alpha}^{\rho}(f) = \sum_{b \in \{0,1\}^{n-k}}
\hat{f}^2(\alpha b)(1-2\rho)^{2w(b)}.}

\begin{lemma}
\label{lem-p1}
 There exists an SQ--$\D_{\rho}$ randomized algorithm {\tt P1}
 that for any target function $f$, prefix vector $\alpha$ of
 length $k$, confidence $\delta$ and accuracy $\tau$, {\tt P1}$(n, k, \alpha,
 \tau, \delta)$ returns, with probability at least
$1-\delta$, a value estimating $S_{\alpha}^{\rho}(f)$ within
$\tau$. It runs in $O(\tau^{-2}\log{(1/\delta)})$ time and
tolerance of its queries is bounded from below by $\tau/4$.
\end{lemma}
{\bf Proof:} Let $p = [\fr{2}]^k[\rho]^{n-k}$ and $\alpha_0 =
\alpha [0]^{n-k}$. The ability to estimate $S_{\alpha}^{\rho}(f)$
is based on the following equation \equn{S_{\alpha}^{\rho}(f) =
\E_{x\sim U}\left[ \E_{y \sim D_{p \oplus x}}
[f(y)\chi_{\alpha_0}(x \oplus y) ] \right]^2 \ .} To prove it
denote \equn{f_{\alpha}^{\rho}(x) \triangleq \E_{y \sim D_{p
\oplus x}} [f(y)\chi_{\alpha_0}(x \oplus y)]\ .} Since by
Parseval's identity for every function $f$, $\E_{x \sim U}[f^2(x)]
= \sum_{a \in \{0,1\}^n} \hat{f}^2(a)$, it is enough to prove that
\equ{ f_{\alpha}^{\rho}(x) = \sum_{b \in \{0,1\}^{n-k}}
\hat{f}(\alpha b)(1-2\rho)^{w(b)}\chi_{\alpha b}(x) \ .
\label{eq-pi} } As follows from the definition of $p \oplus x$
(see Section \ref{subsec-bs-def}), $D_{p \oplus x}(y) = D_p(x
\oplus y)$. Thus equation (\ref{eq-pi}) can be proved as follows
\begin{equation}
%\nonumber
\begin{split}
f_{\alpha}^{\rho}(x) = \E_{y \sim D_{p \oplus
x}}[f(y)\chi_{\alpha_0}(x \oplus y)] &= \sum_{y \in
\{0,1\}^n}f(y)\chi_{\alpha_0}(x \oplus y)D_{p \oplus x}(y)
\\&=\sum_{y \in \{0,1\}^n}f(y)\chi_{\alpha_0}(x \oplus y)D_p(x
\oplus y)\\ &= \sum_{z \in \{0,1\}^n}f(x \oplus
z)\chi_{\alpha_0}(z)D_p(z)\\ &= \E_{z \sim D_p}[f(x \oplus
z)\chi_{\alpha_0}(z)]\\
 &= \E_{z \sim D_p}\left[\sum_{a \in
\{0,1\}^n} \hat{f}(a)\chi_a(x \oplus z)\chi_{\alpha_0}(z)\right]\\
&= \E_{z \sim D_p}\left[\sum_{a \in \{0,1\}^n}
\hat{f}(a)\chi_a(x)\chi_{a \oplus \alpha_0}(z)\right]\\ &= \sum_{a
\in \{0,1\}^n} \hat{f}(a)\chi_a(x)\E_{z \sim D_p}[\chi_{a \oplus
\alpha_0}(z)]. \label{eq-pi-proof}
\end{split}
\end{equation}
But since  $p = [\fr{2}]^k[\rho]^{n-k}$ and $\alpha_0 = \alpha
[0]^{n-k}$,
\begin{multline}
 \E_{z \sim D_p}[\chi_{a \oplus \alpha_0}(z)] =
\prod_{i \in \{1,\ldots,n\}} \left[ 1-p_i + p_i(-1)^{a_i \oplus
\alpha_{0,i}} \right] =
 \prod_{i \in \{1,\ldots,n\},a_i \oplus \alpha_{0,i} = 1} (1-2p_i) \\=
\left( \prod_{i \in \{1,\ldots,k\},a_i \neq \alpha_i} 0\right)
\left( \prod_{i \in \{k+1,\ldots,n\},a_i = 1} (1-2\rho)\right)
 ,
\end{multline}
 that is, if $a = \alpha b$ then  $\E_{z \sim
D_p}[\chi_{a \oplus \alpha_0}(z)] = (1-2\rho)^{w(b)}$, otherwise
$\E_{z \sim D_p}[\chi_{a \oplus \alpha_0}(z)]=0$. This result
substituted to the last term of equation (\ref{eq-pi-proof}) gives
precisely the required identity.

Our goal is to estimate $S_{\alpha}^{\rho}(f) = \E_{x \sim
U}[(f_{\alpha}^{\rho}(x))^2]$. By the definition of
$f_{\alpha}^{\rho}$, for every $z$, we can estimate
$f_{\alpha}^{\rho}(z)$ within $\tau/4$ using statistical query
$(f(x)\chi_{\alpha_0}(z \oplus x), \tau/4, D_{p \oplus \alpha_0})$
and we can assume that the estimate has an absolute value of at
most 1. It is easy to see that the square of this estimate is an
estimate for $(f_{\alpha}^{\rho}(z))^2$ within $\tau/2$. Therefore
$(f_{\alpha}^{\rho}(z))^2$ is a random variable whose value we are
able to estimate within $\tau/2$. Denote it by $Y$. By Corollary
\ref{cor-amean} with Remark \ref{rem-est-sq} we get that the call
to {\tt AMEAN}$(Y, \tau/2,1,\delta)$  will give us the estimate
for $S_{\alpha}^{\rho}(f)$ within $\tau/2+\tau/2 = \tau$. Figure
\ref{fig-p1} summarizes the description of procedure {\tt P1}.
Since computing the value of $Y$ takes $O(1)$ time, the complexity
of the algorithm is $O(\tau^{-2}\log{(1/\delta)})$.\qed
\begin{figure}
{\bf Input:} Access to $STAT(f,\D_{\rho})$;
 $0 \leq k \leq n$; $\alpha \in \{0,1\}^k$; $\ell \leq n$; $\theta >0$; $\tau >0$; $\delta> 0$

{\bf Output:} Value $v$ such that, with probability at least $ 1-
\delta$,  $|S_{\alpha}^{\rho}(f) - v|\leq \tau$.

\begin{enumerate}
\setlength{\itemsep}{-.1cm}

\item $p \leftarrow [\fr{2}]^k[\rho]^{n-k}$
\item $\alpha_0 \leftarrow \alpha [0]^{n-k}$
\item $Y \equiv$ draw $z$ w.r.t. $U$ and compute $\left(\mbox{STAT}(f,D_\rho)[f(x)\chi_{\alpha_0}(z\oplus x), \tau/6, D_{p \oplus \alpha_0}]\right)^2$
\item $v \leftarrow$ {\tt AMEAN}$(Y, b-a =1,\tau/2,\delta)$
\item {\bf return} $v$

\end{enumerate}
\caption{\label{fig-p1} Procedure {\tt P1}}
\end{figure}


With algorithm {\tt P1} we are now ready to describe the Bounded
Sieve. We define $c_0 \triangleq -2\log{(1-2\rho)}$ (it is a
positive constant).
\begin{theorem}
\label{BS}
 There exists an SQ--$\D_{\rho}$ randomized algorithm {\tt BS}$_{\rho}$
 that for any target concept $f$, threshold $\theta > 0$, confidence $\delta >
 0$ and degree bound $0 \leq \ell \leq
n$, {\tt BS}$_{\rho}(n,\theta,\ell,\delta)$ returns, with
probability at least $1-\delta$, a set with the large Fourier
coefficient property for function $f$, threshold $\theta$ and
degree $\ell$. Its running time is polynomial in $n, 2^{\ell},
\log{(1/\delta)}$ and $\theta^{-1}$, particularly
$\tilde{O}(n2^{3c_0\ell}\theta^{-6}\log{(1/\delta)})$. Tolerance
of queries is bounded from below by an inverse of a polynomial in
$2^{\ell}$ and $\theta^{-1}$, specifically
$2^{-c_0\ell}\theta^2/16$.
\end{theorem}
{\bf Proof:} If there is at least one coefficient greater than
$\theta$ with degree lower than $\ell$ for the parity function of
vector starting with $\alpha$ then $S_{\alpha}^{\rho}(f)$ is at
least $(1-2\rho)^{2\ell} \theta^2 = 2^{2\log{(1-2\rho)}\ell}
\theta^2 = 2^{-c_0\ell}\theta^2$. All the coefficients for the
vectors starting with $\alpha$ can be separated into two disjoint
sets---those for the vectors that start with $\alpha 0$ and those
for the vectors that start with $\alpha 1$. With these
observations and the procedure {\tt P1} for estimating
$S_{\alpha}^{\rho}(f)$ we can write a recursive subroutine {\tt
P2} that for every $\alpha$ outputs the set $W_\alpha$ that
satisfies the following conditions:
\begin{itemize}
\item for all $a \in W_{\alpha}$,$\alpha$ is a prefix of $a$;
\item for all $a \in \{0,1\}^n$, if $|\hat{f}(a)| \geq \theta$ and $w(a) \leq \ell$
then $a \in W_{\alpha}$;
\item for all $a \in W_{\alpha}$, $|\hat{f}(a)| \geq \theta/2$.
\end{itemize}
That is, $W_{\alpha}$ has the large Fourier coefficient property
for function $f$, threshold $\theta$, and degree $\ell$; and is
restricted to prefix $\alpha$.

To find all the required coefficients (i.e., to implement the {\tt
BS}$_{\rho}$) we invoke {\tt P2} on the empty prefix. The
implementation of the procedure {\tt P2} is given in Figure
\ref{fig-p2}.

\begin{figure}
{\bf Input:} Access to STAT$(f,\D_{\rho})$;
 $0 \leq k \leq n$; $\alpha \in \{0,1\}^k$; $\ell \leq n$; $\theta >0$; $\delta> 0$

{\bf Output:} The set $W_{\alpha}$ that, with probability at least
$1-\delta$, has the large Fourier coefficient property for
function $f$, threshold $\theta$ and degree $\ell$; and is
restricted to prefix $\alpha$.

\begin{enumerate}
\setlength{\itemsep}{-.1cm}

\item {\bf if} $k = n$ {\bf then}
\item \ \ \ \ $v \leftarrow$ STAT$(f,\D_{\rho})[f\chi_{\alpha},\theta/4,U]$
\item \ \ \ \ {\bf if} $v \geq \frac{3}{4}\theta$ {\bf then} {\bf output}($\alpha$)
\item {\bf else}
\item \ \ \ $b \leftarrow 2^{-c_0\ell}\theta^2$
\item \ \ \ $s \leftarrow$ {\tt P1}$(n,\alpha, b/4, \delta b/(2n))$
\item \ \ \ {\bf if} $s \geq 3b/4$ {\bf then}
\item \ \ \ \ \ \ {\tt P2}$(k+1,\alpha 0,\ell,\theta,\delta)$
\item \ \ \ \ \ \ {\tt P2}$(k+1,\alpha 1,\ell,\theta,\delta)$
\item \ \ {\bf endif}
\item {\bf endif}

\end{enumerate}
\caption{\label{fig-p2} Procedure {\tt P2}}
\end{figure}

 If {\tt P1} succeeds then {\tt P2} is called recursively only when
$$S_{\alpha}^{\rho}(f) \geq (3/4-1/4)2^{-c_0\ell}\theta^2 =
2^{-c_0\ell}\theta^2/2\ .$$ Since $\sum_{\alpha \in \{0,1\}^k}
S_{\alpha}^{\rho}(f) \leq \sum_a \hat f^2(a) = 1$, {\tt P2} will
be invoked at most $2\cdot 2^{c_0\ell}\theta^{-2}$ times for each
length of $\alpha$, and therefore there will be at most
$2n2^{c_0\ell}\theta^{-2}$ calls to {\tt P1}. This means that
total probability of failure of the algorithm is at most
$2n2^{c_0\ell}\theta^{-2} \cdot \delta 2^{-c_0\ell}\theta^2/(2n) =
\delta$. According to Lemma \ref{lem-p1}, each call to {\tt P1} in
{\tt P2} takes \equn{O(2^{2c_0\ell}\theta^{-4}
\log{(2n2^{c_0\ell}\theta^{-2}\delta^{-1})}) =
\tilde{O}(2^{2c_0\ell}\theta^{-4}\log{(1/\delta)}) \ .} This gives
a bound on total running time of $
\tilde{O}(n2^{3c_0\ell}\theta^{-6}\log{(1/\delta)})$. By the same
lemma, tolerance of all the queries is bounded from below by
$2^{-c_0\ell}\theta^2/16$. It is important to note that if the
estimations fail the running time of the algorithm may exceed this
bound. This can be easily prevented by adding a time counter that
terminates the execution of the algorithm whenever it exceeds this
time bound.\qed

In our future applications we will need to find the Fourier
coefficients not only of a Boolean target $f$ but also of any
real-valued function involving $f(x)$. Thus we extend the {\tt BS}
algorithm as follows.
\begin{theorem}
 \label{modifiedBS}
 Let $\phi: \{0,1\}^n\times \{-1,1\} \rightarrow [-\beta,\beta]$ be
 any real-valued function ($\beta >0$).
 There exists an SQ--$\D_{\rho}$ randomized algorithm $\mbox{\tt RV-BS}_{\rho}$ that
 for every target concept $f$, positive threshold $\theta$, confidence $\delta >
 0$ and degree bound $0 \leq \ell \leq n$
\mbox{\tt RV-BS}$_{\rho}(n,\phi,\beta,\theta,\ell,\delta)$
returns, with probability at least $1-\delta$, a set with the
large Fourier coefficient property for function $\phi$, threshold
$\theta$ and degree $\ell$. It runs in
$\tilde{O}(n2^{3c_0\ell}\theta^{-6}
 \beta^2\log{(1/\delta)})$ time and tolerance of queries is
 bounded from below by $\tilde{\Omega}(\beta^{-2}2^{-c_0\ell}\theta^{2})$.
  Complexity of query functions is bounded by $O(\tau) + O(n)$,
  where $\tau$ is the complexity of computation of $\phi$.
\end{theorem}
{\bf Proof:} Let us look back into the proof of Theorem \ref{BS}.
  Its idea is obviously applicable in the new
setting. All we have to check is the details of the
implementation. Let us substitute $\pi(x) \triangleq \phi(x,f(x))$
for $f(x)$ in all the places in the proof. In the second line of
{\tt P2} we need an estimate of $\hat{\phi_f}(\alpha)$. This
cannot be done using one statistical query but instead we can use
the procedure {\tt RV-SQ}$(\mbox{STAT}(f,U),
\phi\chi_{\alpha}(x),\theta/4)$. Now, given that {\tt P1} will
estimate correctly $S_\alpha^\rho(\pi)$, the algorithm will work
correctly. The only thing that will change is the bound on the
number of calls to {\tt P2} since it is based on the fact that:
$\E[f^2] = 1$. Now we have that: $\E[\pi^2] \leq L_{\infty}^2(\pi)
\leq \beta^2 $ and hence {\tt P2} will be invoked at most
$\beta^22n 2^{c_0\ell}\theta^{-2}$ times. Due to the increase in
the number of calls to {\tt P1}  the confidence of each call has
to be increased (it has to be divided by $\beta^2$). Now we should
get a modified version of {\tt P1} that will produce estimations
of $S_\alpha^\rho(\pi)$. If we substitute $\pi$ for $f$ in Lemma
\ref{lem-p1} we see that all the properties will hold but now the
estimation of $\pi^\rho_\alpha(z) = \E_{x \sim D_{p \oplus z}}
[\pi(x)\chi_{\alpha_0}(z \oplus x)]$ (line 3 in {\tt P1} cannot be
done using one query. Moreover, since we want
$(\pi^\rho_\alpha(z))^2$ to be approximated within $\tau/2$,
$\pi_\alpha^\rho(z)$ has to be approximated within
 $\frac{\tau}{4|\pi^\rho_\alpha(z)|}$.
  Since \equn{|\pi^\rho_\alpha(z)| \leq
 \E_{x \sim D_{p \oplus z}} [|\pi(x)\chi_{\alpha_0}(z \oplus
 y)|] \leq L_{\infty}(\phi) \leq \beta} it will suffice to
 approximate $\pi^\rho_\alpha(z)$ within $\tau' =
 \frac{\tau}{4\beta}$. To make this approximation we call
 \equn{\mbox{\tt RV-SQ}(\mbox{STAT}(f,D_{p \oplus z}),\phi(x,f(x))\chi_{\alpha_0}(z \oplus x),
 \beta,\tau')\ .} By Lemma \ref{RV-SQ}, we get
 the required value. The time complexity of the call to {\tt
 RV-SQ} is $O(\log{(\beta/\tau')}) =
 O(\log{(\beta/\tau)})$. The required tolerance is at
 least $\frac{\tau'}{4\beta\log{(\beta/\tau')}} =
 \tilde{\Omega}(\frac{\tau}{\beta^2})$.
 Complexity of the query functions is
 the complexity of computation of $\phi$ plus $O(n)$. These new bounds
 on the complexity of {\tt P1} render the total running time of the algorithm to be
 $\tilde{O}(n2^{3c_0\ell}\theta^{-6}
 \beta^2\log{(1/\delta)})$. The required
 tolerance is  $\tilde{\Omega}(\beta^{-2}2^{-c_0\ell}\theta^{2})$.
 Thus all the bounds satisfy the required conditions. \qed

\subsection{Learning Decision Trees}

Below we give the first application of the {\tt BS}$_\rho$
algorithm. We prove that the representation class of decision
trees is learnable in the SQ--$\D_{\rho}$ model. For this and
further applications we take $\rho = \fr{4}$. This makes $c_0 =
2\log{(1-2\rho)}$ to be $2$. Our idea is to show that in order to
learn \textsf{DT} it is sufficient to consider only ``large"
Fourier coefficients of ``low" degree (we add a restriction on
degree over the original proof by Kushilevitz and Mansour). To
achieve this we rely on several well-known properties of Fourier
transform of decision trees. The properties are described in the
following lemmas (we provide their proofs for completeness).

\begin{lemma}
Let T be a decision tree of depth $d$. All the Fourier
coefficients of $T$ of degree larger than $d$ are 0 and all the
non-zero coefficients of $T$  are larger than $2^{-d}$.
\label{lem-dt-transform}
\end{lemma}
{\bf Proof:} We start by examining the Fourier transform of a
single term. Let $t$ be a term of length $l$ and let
$i_1,\ldots,i_l$ be the indices of variables appearing as literals
in $t$. For every $1 \leq j \leq l$ let $b_j$ be 1 if $x_{i_j}$
appears negated in $t$ and 0 otherwise. Let $t^+ = \frac{t+1}{2}$
(the $\{0,1\}$ version of term $t$). It is easy to verify that
\equ{t^+=\prod_{j=1}^l
\left(\frac{1+(-1)^{b_j}\chi_{\{i_j\}}}{2}\right)
\label{equ-term}} For every two disjoint sets of indices $A$ and
$B$, $\chi_A\chi_B = \chi_{A\cup B}$. Thus by developing the
equation (\ref{equ-term}) we obtain that all the non-zero Fourier
coefficients of $t^+$ are for sets not larger than $l$ and are by
their absolute value equal to $2^{-l}$.

Let $m$ be the number of leaves labelled by 1 in $T$ and
$t_1,\ldots,t_m$ be the terms corresponding to each of these
leaves. Let $T^+,t_1^+,\ldots,t_m^+$ be $\{0,1\}$ versions of the
decision tree and the terms. Every input value satisfies at most
one term of $T$ and thus $T^+ = \sum_{i=1}^m t_i^+$, that is,
every Fourier coefficient of $T^+$ is a sum of corresponding
Fourier coefficients of terms $t_1^+,\ldots,t_m^+$. Length of
every term $t_i$ is at most $d$ and thus all the non-zero Fourier
coefficients of $T^+$ are for sets not larger than $d$ and their
absolute value is greater or equal to $2^{-d}$. By the definition
of $T^+$, $T = 2T^+ - 1$ and therefore the above property of the
Fourier transform of $T^+$ holds for $T$ as well. \qed

The size of decision tree is the number of leaves in it. Let
$\fuf{DT-size}(f)$ denote the size of a minimal DT representation
of $f$.
\begin{lemma}
Let $T$ be a decision tree of size $m$. There exists a function
$h$ such that all the Fourier coefficients of $h$ of degree larger
than $t$ are 0, where $t = \log{(4m/\tau)}$, all non-zero
coefficients of $h$ are larger than $\tau/(4m)$ and $\E[(T-h)^2]
\leq \tau$. \label{lem-dt-cut}
\end{lemma}
{\bf Proof:} Let $h$ be the decision tree that results from
truncating $T$ at depth $t$. At the truncated places we add a leaf
with value $1$. The probability that uniformly chosen assignment
will reach some specific leaf at depth greater or equal to $t$ is
at most $2^{-t}$. Therefore the probability of reaching one of the
leaves we added (instead of the truncated parts) is at most
$m2^{-t} =\tau/4$, therefore $\pr_U[T \neq h] \leq \tau/4$. For
Boolean functions $h$ and $T$ this implies that $$\E_U[(T-h)^2] =
4~ \pr_U[T \neq h] \leq \tau \ .$$ By Lemma
\ref{lem-dt-transform}, we get that the Fourier transform of $h$
has the required properties.

We are now ready to prove that {\tt BS} can be used to learn
\textsf{DT}.

\begin{theorem}
  \label{dt}
  There exists an SQ--$\D_{\fr{4}}$ randomized algorithm
  {\tt DT-learn} such that
  for any target concept $f$, $s=\fuf{DT-size}(f)$, $\epsilon > 0$ and $\delta > 0$,
  {\tt DT-learn}$(n,s,\epsilon,\delta)$ returns, with probability at least
  $1-\delta$, an $\epsilon$--approximation of $f$. The
  algorithm runs in
  $\tilde{O}(ns^{12}\epsilon^{-12}\log{(1/\delta)})$ time. Tolerance of its queries
  is $\Omega(\epsilon^4s^{-4})$.
\end{theorem}
{\bf Proof:}
 Given its input {\tt DT-learn} gets $H =\mbox{\tt BS}_{\fr{4}}(n, \frac{\epsilon}{8s},
\log{\frac{8s}{\epsilon}},\delta)$. Denote $g = \sum_{a \in H}
\tilde{f}(a)$, where $\tilde{f}(a)$ is an estimate of $\hat{f}(a)$
satisfying $|\tilde{f}(a)-\hat{f}(a)| \leq
\sqrt{\frac{\epsilon}{2}}\frac{\epsilon}{16s}$ (such an estimate
can be obtained using query $(f\chi_a,
\sqrt{\frac{\epsilon}{2}}\frac{\epsilon}{2s}, U)$). Then the
algorithm returns $h = \fuf{sign}(g)$. We need to prove that
(given that {\tt BS} succeeds) $h$ calculated in this fashion is
an $\epsilon-$approximation of $f$.
 By Fact \ref{fact-fourier-app},
$\pr[h \neq f] \leq \E[(f-g)^2]$, thus it is sufficient to prove
that  $$\E[(f-g)^2] = \sum_{a \in  \{0,1\}^n}
(\hat{f}(a)-\hat{g}(a))^2 \leq \epsilon \ .$$
 Denote $$V = \{a\ |\ \hat{f}(a) \leq
\frac{\epsilon}{8s} \vee w(a) \geq \frac{8s}{\epsilon}\}\ .$$
Clearly, $H \cup V = \{0,1\}^n$. Thus
$$\pr[h \neq f] \leq \E[(f-g)^2] = \sum_{a \in  \{0,1\}^n}
(\hat{f}(a)-\hat{g}(a))^2 =$$ \equ{ \sum_{a \in H}
(\hat{f}(a)-\tilde{f}(a))^2 + \sum_{a \not\in H } \hat{f}^2(a)
\leq   \sum_{a \in H} (\hat{f}(a)-\tilde{f}(a))^2 + \sum_{a \in V}
\hat{f}^2(a) \label{eq-dt-terms} }

Since every coefficient in $H$ is at least $\frac{\epsilon}{16s}$,
$|H|$ is at most $(\frac{16s}{\epsilon})^2$ and therefore the
defined $g$ satisfies : \equn{\sum_{a \in H}
(\hat{f}(a)-\hat{g}(a))^2 = \sum_{a \in H}
(\hat{f}(a)-\tilde{f}(a))^2 \leq \epsilon/2\ ,} that is, the first
term of equation (\ref{eq-dt-terms}) is at most $\epsilon/2$. As
is proved in Lemma \ref{lem-dt-cut} (for $m = s$ and $\tau =
\epsilon/2$), there exists a function $f'$ such that all the
Fourier coefficients of $f'$ for vectors in $V$ are $0$ and
$\E[(f-f')^2] \leq \epsilon/2$.
 That is,
 $$\epsilon/2 \geq \E[(f-f')^2] = \sum_{a \in \{0,1\}^n
}(\hat{f}(a) - \hat{f'}(a))^2 \geq \sum_{a \in V}\hat{f}^2(a)\ .$$
Altogether, $\pr[h \neq f] \leq \epsilon$. By properties of {\tt
BS}$_{\fr{4}}$ (see Theorem \ref{BS}), the running time of the
algorithm is $\tilde{O}(ns^{12}\epsilon^{-12}\log{(1/\delta)})$
and tolerance of its queries is $\Omega(\epsilon^4s^{-4})$.\qed
\begin{remark}
It can be easily seen that the size parameter is used only for
``bounding" purposes and therefore if the parameter is not
supplied to {\tt DT-learn} we can use the standard
guess-and-double technique instead. This remark is applicable to
the rest of our algorithms as well.
\end{remark}


\subsection{Weak DNF Learning}
Another simple application of the Bounded Sieve algorithm is weak
\textsf{DNF} learning in SQ--$\D_{\fr{4}}$. It is based on the
fact that every DNF expression has a ``large" Fourier coefficient
of ``low" degree \cite{bfjkmr94,ja97}. Below we prove a
generalization of this fact that will also be required for our
future application. Let $\fuf{DNF-size}(f)$ denote the size
(number of terms) of a minimal DNF representation of $f$.

\begin{lemma}
\label{weakparity} Let $f$ be a Boolean function, $s =
\fuf{DNF-size}(f)$ and $D$ be a distribution over $X$. There
exists a parity function $\chi_a$ such that $E_D[f\chi_a] \geq
\fr{2s+1}$ and $w(a) \leq
\log{\left((2s+1)L_{\infty}(2^nD)\right)}$.
\end{lemma}
{\bf Proof:} As it was proved by Jackson \cite[Fact 8]{ja97},
there exists a parity function $\chi_a$ such that $E_D[f\chi_a]
\geq \fr{2s+1}$. Let $A = \{i\ |\ a_i = 1\}$. As it can be seen
from the proof of the fact, $A \subseteq T $, where $T$ is a term
of $f$ (a set of literals and a Boolean function it represents)
and $\pr_D[T=1] \geq \fr{2s+1}$. On the other hand, $$\pr_{D}[T=1]
= \sum_{x,T(x)=1}D(x) < 2^{-|T|}2^nL_{\infty}(D)\ .$$ Thus $$w(a)
= |A| \leq |T| \leq \log{((2s+1)L_{\infty}(2^nD))}\ .$$\qed

\begin{theorem}
  There exists an SQ--$\D_{\fr{4}}$ randomized algorithm
  {\tt UWDNF} such that
  for any target concept $f$, $s=\fuf{DNF-size}(f)$, $\epsilon > 0$ and $\delta \geq 0$,
  {\tt UWDNF}$(n,s,\delta)$ returns, with probability at least
  $1-\delta$, a $(\fr{2}- \fr{4s+4})$--approximation to $f$. The
  algorithm runs in
  $\tilde{O}(ns^12\log{(1/\delta)})$ time. Tolerance of its queries is
  $\Omega(s^{-4})$. The weak hypothesis is a parity function (possibly
  negated).
  \label{uweakdnf}
\end{theorem}

{\bf Proof:} By Lemma \ref{weakparity}, a call to {\tt
BS}$_{\fr{4}}(n, \theta=\fr{2s+1}, \ell = \log{(2s+1)}, \delta)$
returns, with probability at least  $1-\delta$, a set containing
at least one vector. We estimate every coefficient in the returned
set with tolerance $r = \fr{(2s+1)(2s+2)}$  and choose the vector
with the maximal Fourier coefficient (it will be larger than
$\fr{2s+1} - r = \fr{2s+2}$). Parity function for this vector (or
its negation) will give a $(\fr{2}-\fr{4s+4})$--approximation to
the target. By Theorem \ref{BS} the running time of this
invocation will be $\tilde{O}(ns^{12}\log{(1/\delta)})$ and
tolerance of queries is
$\Omega(s^{-4})$.


\subsection{Boosting the Weak \textsf{DNF} Learner} \label{sec-DNF}
 Certainly, the next interesting question is whether we
can strongly learn \textsf{DNF} in SQ--$\D_{\fr{4}}$. We answer
this question positively by following the way used by Jackson in
his proof of \textsf{DNF} learnability. The proof consists of two
components. The first one is weak \textsf{DNF} learning with
respect to any given distribution (although efficient only on
distributions that are polynomially close to uniform). The second
one is Freund's boosting technique that boosts the weak learner to
a strong learner.

First we present the required weak learner.
\begin{theorem}
\label{weakdnf}
 Let $f$ be a Boolean function, $s=\fuf{DNF-size}(f)$ and let $D_f$
  be a computable probability distribution over the sample space.
  The computation of $D_f(x)$ may involve the value of $f(x)$. There exists a randomized
 SQ--$\D_{\fr{4}}$ algorithm
{\tt WDNF}$_{\fr{4}}$ such that for the target function $f$,
confidence $\delta > 0$ and $\beta$ that bounds
$L_{\infty}(2^nD_f)$,  {\tt WDNF}$_{\fr{4}}(n,s,D_f,\beta,\delta)$
finds, with probability at least $1 -\delta$, a Boolean function
$h$ that $(\fr{2}-\fr{4s+4})$--approximates $f$. The algorithm
runs in $\tilde{O}(n\beta^8s^{12}\log{(1/\delta)})$ time. The
tolerance of its queries is lower bounded by
$\tilde{\Omega}(\beta^{-4}s^{-4})$. Complexity of its query
functions is the time complexity of $D_f$ plus $O(n)$. The weak
hypothesis is a parity function(possibly negated).
\end{theorem}
{\bf Proof:} As is proved in Lemma, \ref{weakparity} for every DNF
expression $f$, there exists a parity function $\chi_b$ such that
$|\E_{D_f}[f\chi_b]| \geq \fr{2s+1}$ and $$w(b) \leq
\log{((2s+1)L{_\infty}(2^nD_f))} \leq \log{((2s+1)\beta)}\ .$$
$\E_{D_f}[f\chi_b] = \E_U[2^nD_f f\chi_b]$ thus in order to find a
parity function that weakly approximates $f$ with respect to $D_f$
we can find the ``large" Fourier coefficients of function
$2^nD_ff$. This means that by invoking $$\mbox{{\tt
RV-BS}}_{\fr{4}}(n, \phi = 2^nD_ff, \beta, \theta=\fr{2s+1}, \ell
= \log{((2s+1)\beta)}, \delta)$$ and then proceeding as in {\tt
UWDNF} we  can find the required weak approximator. By Theorem
\ref{modifiedBS}, we get that all the complexities are as stated.
\qed

The next step is adapting Freund's boosting technique to the
SQ--$\D_{\rho}$ model. The main advantage of Freund's booster {\tt
F1} utilized by Jackson is that it requires only weak learning
with respect to polynomially computable distributions that are
polynomially close to the uniform distribution (i.e.,
$L_{\infty}(2^nD) \leq p(n,s,\epsilon)$ for some polynomial $p$).
Freund's boosting algorithm is not the only boosting algorithm
possessing this advantage and is not the most efficient one (see
discussion below). But its presentation is relatively simple and
is sufficient for demonstrating the idea of adaptation.

Below we are going to give a brief account of Freund's boosting
technique \cite{fr90} and in particular his {\tt F1} algorithm for
the PAC learning model with respect to distribution $D$.

As input, {\tt F1} is given positive $\epsilon$, $\delta$, and
$\gamma$, a weak learner WL that produces $(\fr{2} -
\gamma)$--approximate hypotheses for functions in a function class
$\F$, and an example oracle EX$(f,D)$ for some $f\in\F$. The WL is
assumed to take the example oracle for some distribution $D'$ and
confidence parameter $\delta'$ as inputs and produce
$(\fr{2}-\gamma)$--approximate hypothesis with probability at
least $1-\delta'$. Given these inputs, {\tt F1} steps sequentially
through $k$ stages ($k$ is given in Figure \ref{boost}). At each stage $i$, $0
\leq i \leq k-1$ {\tt F1} generates distribution function $D_i$,
runs WL on simulated oracle EX$(f,D_i)$ and (with high
probability) gets a $(\fr{2}-\gamma)$--approximate hypothesis
$w_i$. Simulation of the example oracle is done by filtering. Probability of
acceptance of the next example is estimated before the simulation
(to $E_\alpha$).
If the probability is ``too low"  the algorithm does not produce any
more weak hypotheses.
Finally, {\tt F1} generates its hypothesis using majority
function MAJ of all the $w_i$ it has received.

In Figure \ref{boost} we give the original implementation of
Freund's boosting algorithm {\tt F1}.

\begin{figure}
 {\bf Input:} Example oracle $EX(f,D)$ ; $0 < \gamma \leq \fr{2}$; $(\fr{2} -
\gamma)$--approximate weak learner WL$(EX,\delta)$ which succeeds
with probability at least $1 - \delta$; $\epsilon > 0$; $\delta
> 0$

{\bf Output:} $h$ such that, with probability at least $1-\delta$,
$\pr_D[f=h] \geq 1-\epsilon$

\begin{enumerate}
\setlength{\itemsep}{-.2cm}
\item \ $\hat{\gamma} \equiv \min{(\gamma,0.39)}$
\item \ $k \leftarrow \fr{2}\gamma^{-2}\ln{(4/\epsilon)}$
\item \ $w_0 \leftarrow WL(EX(f,D),\delta/2k)$
\item \ {\bf for} $i \leftarrow 1,\ldots,k-1$ {\bf do}
\item \ \ \ $r_i(x) \equiv | \{ 0\leq j < i|w_j(x) = f(x)\}|$
\item \ \ \ $B(j;n,p) \equiv {n \choose j}p^j(1-p)^{n-j}$
\item \ \ \ $\tilde{\alpha}_r^i \equiv B(\lfloor k/2 \rfloor-r;
k-i-1, 1/2 + \hat{\gamma})$ if $i-k/2 < r \leq k/2$,
$\tilde{\alpha}_r^i \equiv 0$ otherwise
\item \ \ \ $\alpha_r^i \equiv
\tilde{\alpha}_r^i/\max_{r'=0,\ldots,i-1}\{\tilde{\alpha}_{r'}^i\}$
\item \ \ \ $\theta \equiv \epsilon^3/57$
\item \ \ \ $Y \equiv $draw example $\langle x,f(x)\rangle$ from
EX$(f,D)$ and compute $\alpha_{r_i(x)}^i$
\item \ \ \ $E_{\alpha} \leftarrow$ {\tt AMEAN}$(Y,b-a=1,\fr{3}\theta, \frac{\delta}{2k})$
\item \ \ \ {\bf if} $E_{\alpha} \leq \frac{2}{3}\theta$ {\bf
then}
\item \ \ \ \ \ $k \leftarrow i$
\item \ \ \ \ \ {\bf break do}
\item \ \ \ {\bf endif}
\item \ \ \ $D_i(x) \equiv
\frac{D(x)\alpha_{r_i(x)}^i}{\sum_y D(y)\alpha_{r_i(y)}^i}$
\item \ \ \ $w_i \leftarrow WL(EX(f,D_i),\frac{\delta}{2k})$
\item \ {\bf enddo}
\item \ $h(x) \equiv MAJ(w_0(x),w_1(x), ..., w_{k-1}(x)$
\item \ return $h$
\end{enumerate}
\caption{\label{boost} Freund's algorithm for boosting a weak
learner in the PAC model. EX$(D_i,f)$ is a simulated example
oracle.}
\end{figure}
\begin{theorem} \textsf{DNF} is strongly learnable in the SQ--$\D_{\rho}$ model.
\end{theorem}
{\bf Proof:} The first and simple step towards the adaptation of
Freund's algorithm to
 SQ--$\D_{\fr{4}}$ is estimation of $E_\alpha$ using the procedure {\tt RV-SQ} instead
 {\tt AMEAN} (since we do not have access to EX$(f,D)$). This is possible since the value of
 $\alpha_{r_i(x)}^i$ can be
found using $f(x)$ and previous hypotheses. Next, we show how to
provide our weak learner with distribution $D_i$. By the
definition, $D_i(x) \equiv \frac{D(x)\alpha_{r_i(x)}^i}{\sum_y
D(y)\alpha_{r_i(y)}^i}$. It is easy to see that given $f(x)$ the
value of $D(x)\alpha_{r_i(x)}^i$ can be evaluated
straightforwardly. On the other hand, the value of $\sum_y
D(y)\alpha_{r_i(y)}^i$ (independent of $x$), which is the
probability of accepting a sample from $D$, cannot be computed
exactly. Instead, we can estimate this value. In fact, we already
have the estimate: the value $E_{\alpha}$ used for termination
condition (lines 12--14). The condition ensures that $E_{\alpha}
\geq \frac{2}{3}\theta$ (otherwise we terminate) and $E_{\alpha}$
is an estimate of $\sum_y D(y)\alpha_{r_i(y)}^i$ within
$1/3\theta$ (line 11). Thus
$$1/2 E_{\alpha} \leq \sum_y D(y)\alpha_{r_i(y)}^i \leq 3/2
E_{\alpha} \ .$$ This means that if we define $D_i'(x) \equiv
U(x)\alpha_{r_i(x)}^i/E_{\alpha}$ then we get that there is a
constant $c_i \in [1/2,3/2]$ such that for all $x$, $D_i'(x)
=c_iD_i(x)$. Now consider the
 functional impact of supplying this approximate distribution rather
 than true distribution to {\tt WDNF}$_{\fr{4}}$. {\tt WDNF}$_{\fr{4}}$
 uses its given distribution for exactly one purpose: to find the
 ``large" Fourier coefficients of function $2^nD_ff$. Since the
 Fourier transform is a linear operator (i.e., $\widehat{cg}(a) =
 c\hat{g}(a)$) we can be certain that $2^nD_i'f$ has a
 Fourier coefficient with absolute value of at least
 $c_i \fr{2s+1} \geq \fr{4s+2}$. Moreover, since the largest Fourier coefficient of $2^nD_i'f$
 corresponds to the largest Fourier coefficient of $2^nD_if$, the parity
 function corresponding to that coefficient (or its negation)
 $(\fr{2}-\fr{4s+2})$--approximates
  target function with respect to $D_i$.
 Thus by slightly modifying \footnote{Lower bound $\theta$ has to be modified to $\fr{4s+2}$.}  {\tt WDNF}$_{\fr{4}}$
 we can handle this problem. These modifications
 may increase the running time of weak learner only by a small
 constant and thus do not affect our analysis.
\begin{figure}[ht]
{\bf Input:} Access to $STAT(f,\D_{\fr{4}})$;
$s=\fuf{DNF-size}(f)$;
  $\epsilon > 0$; $\delta > 0$\\
{\bf Output:} $h$ such that, with probability at least $1-\delta$,
$\pr_U[f=h] \geq 1-\epsilon$
\begin{enumerate}
\setlength{\itemsep}{-.1cm}
\item \ $\gamma \leftarrow \fr{4s+4}$;
\item \ $k \leftarrow \fr{2}\gamma^{-2}\ln{(4/\epsilon)}$
\item \  $w_0 \leftarrow ${\tt UWDNF}$_{\fr{4}}(n,s,\delta/k)$
\item \ {\bf for} $i \leftarrow 1,\ldots,k-1$ {\bf do}
\item \ \ \ $r_i(x) \equiv | \{ 0\leq j < i|w_j(x) = f(x)\}|$
\item \ \ \ $B(j;n,p) \equiv {n \choose j}p^j(1-p)^{n-j}$
\item \ \ \ $\tilde{\alpha}_r^i \equiv B(\lfloor k/2 \rfloor-r;
k-i-1, 1/2 + \hat{\gamma})$ if $i-k/2 < r \leq k/2$,
$\tilde{\alpha}_r^i \equiv 0$ otherwise
\item \ \ \ $\alpha_r^i \equiv
\tilde{\alpha}_r^i/\max_{r'=0,\ldots,i-1}\{\tilde{\alpha}_{r'}^i\}$
\item \ \ \ $\theta \leftarrow \epsilon^3/57$
\item \ \ \ $E_{\alpha} \leftarrow ${\tt RV-SQ}$(\mbox{STAT}(f,U),\alpha_{r_i(x)}^i,1,\theta/3)$
\item \ \ \ {\bf if} $E_{\alpha} \leq \frac{2}{3}\theta$ {\bf
then}
\item \ \ \ \ \ $k \leftarrow i$
\item \ \ \ \ \ {\bf break do}
\item \ \ \ {\bf endif}
\item \ \ \ $D_i'(x) \equiv
U(x)\alpha_{r_i(x)}^i/E_{\alpha}$
\item \ \ \  $w_i \leftarrow
${\tt WDNF}$_{\fr{4}}(n,s,D_i',3/(2\theta),\delta/k)$
\item \ {\bf enddo}
\item \ $h(x) \equiv MAJ(w_0(x),w_1(x), ..., w_{k-1}(x))$
\item \ {\bf return} $h$
\end{enumerate}
\caption{\label{sqdboost} Learning \textsf{DNF} in the
SQ--$\D_{\fr{4}}$ model}
\end{figure}


In Figure \ref{sqdboost} we give a straightforward implementation
of the discussed adaptation.

 Our last concern is the complexity of this algorithm. The total number of
phases executed will be $O(s^2\log{\epsilon^{-1}})$. Clearly the
``heaviest" part of each phase is the execution of the weak
learner. By Theorem \ref{weakdnf}, the running time of each call
to {\tt WDNF}$_{\fr{4}}$ is
$\tilde{O}(ns^{12}\epsilon^{-24}\log{(1/\delta)})$. Thus the total
running time of the algorithm is
$\tilde{O}(ns^{14}\epsilon^{-24}\log{(1/\delta)})$. The tolerance
of queries is $\tilde{\Omega}(s^{-4}\epsilon^{12})$. The
complexity of query functions is as complexity of evaluation of
$\alpha_{r_i(x)}^i$ plus $O(n)$. All the $O(k^2) = \tilde{O}(s^4)$
possible values of $\alpha_{r_i(x)}^i$ can be evaluated in
advance, that is complexity of query functions will be $O(n)$.
\qed

The adaptation of boosting algorithm that we obtained is not
suitable for boosting weak learners which will fail when supplied
with distributions $D_i'$ as above instead of $D_i$. Except for
this limitation we, in fact, described the general way to boost
weak learners in the SQ--$\D$ model.

Recently, two new boosting algorithms have appeared that use only
distributions that are polynomially close to the uniform. The
first one is by Klivans and Servedio \cite{ks99} and the second
one is by Bshouty and Gavinsky \cite{bg01}. Distributions produced
by these boosting algorithms are optimally ``flat", that is, the
$L_\infty$ norm of distributions they produce is
$\tilde{O}(\epsilon)2^{-n}$. They also require the same order of
phases as {\tt F1}. Both algorithms are more complicated than {\tt
F1}, but nevertheless can be adapted to SQ--$\D$ in a very similar
way. These algorithms were used to improve the running time of
Jackson's original algorithm and can be substituted for {\tt F1}
in our algorithm as well. This would significantly decrease the
dependence on $\epsilon$. In particular, $L_{\infty}(D_i) =
\tilde{O}(\epsilon^{-1})$ and $k =
O(\gamma^{-2}\log{(1/\epsilon)})$ gives the running time of
$\tilde{O}(ns^{14}\epsilon^{-8}\log{(1/\delta)})$ and tolerance
becomes lower-bounded by $\tilde{\Omega}(s^{-4}\epsilon^{4})$.

\section{Learning with Attribute Noise in Membership Queries}
\label{sec-att-noise}

We now turn our attention to another type of noise in membership
queries: the noise that corrupts the attributes of a sample point.
That is, as an answer to a membership query at point $x$ the
learner gets the value of $f(x')$, where $x'$ might differ from
$x$. This type of noise was previously considered as appearing in
the example oracle EX$(f,D)$, that is the learner gets random
labelled samples $(x,f(x'))$ \cite{sv88,gs95,bjt99}. We define the
noise model formally and then focus our attention on the product
attribute noise with known noise rate (which is the most commonly
referred type of attribute noise). We prove that membership
queries corrupted by such a noise are weaker than extended
statistical queries discussed in the previous chapter.
Nevertheless, the Bounded Sieve can be implemented in this model.
This implies that \textsf{DT} is learnable and \textsf{DNF} is
weakly learnable in this ``weak" model. We prove that these
learning results can also be achieved for the uniform product
attribute noise of unknown rate.

\subsection{The Noise Model and Its Properties}
The attribute noise is usually described as returning the value
$f(x \oplus b)$ where $b$ is sampled randomly with respect to some
noise distribution $D'$. The simplest and the most well studied
noise distribution is the product distribution, i.e., independence
of all the attributes is assumed.\footnote{This assumption
certainly agrees with assumptions of learning with respect to the
uniform.} If every attribute is changed with the same probability
the noise is called {\em uniform}. The resulting model represents
the situation in which the learner cannot ask the membership query
in the desired point since the attributes it supplies
 change their state randomly and independently in the process of the query.
Examples of situations of this kind are:

\begin{itemize}
\item sending the point $x$ to the oracle
via a faulty communication channel\footnote{For this case the
attribute noise is uniform and random classification noise has to
be added if the answer is also sent via a faulty channel.}
\item ``material" representing the attributes is unstable.
\end{itemize}

Formally, the noise model is defined in the following way. For a
constant probability vector $p \in (0,1)^n$ and $D_p$ the product
distribution defined by $p$, we define noisy membership oracle
MEM$^p(f)$ as follows. For a query $x$, MEM$^p(f)$ chooses
randomly (and independently of any previous queries) $b$ according
to the distribution $D_p$ and returns $f(x \oplus b)$. We can
assume that for all $i$, $p_i \leq 1/2$ since inverting the $i$-th
attribute of the query is equivalent to changing $p_i$ to $1-p_i$.

We may also add classification noise to the oracle and define
MEM$^{p,\eta}(f)$. This classification noise does not have to be
persistent as the learner cannot get a label of the same point
twice (with any non-negligible probability).


The use of membership queries with product attribute noise can be
related to the SQ--$\D_{\rho}$ model in the following way.
\begin{theorem}
For $p \in [\rho,1/2]^n$ MEM$^p$ can be efficiently simulated in
SQ--$\D_{\rho}$.
\end{theorem}
{\bf Proof:} The result of query at point $x$ to oracle MEM$^p$ is
a Bernoulli random variable (or a coin flip) such that $e \in \{1,
-1\}$ is returned with probability $\pr_{y\sim D_p}[f(x \oplus y)
=e]$. Given these probabilities for every $x$, MEM$^p$ can be
replaced by an appropriate coin flip. Moreover, the exact values
of probabilities are not required, since the ability to
approximate them efficiently will suffice to simulate the MEM$^p$
oracle. But $$\pr_{y\sim D_p}[f(x \oplus y) =e] = \pr_{y\sim
D_{p\oplus x}}[f( y) =e] \ ,$$ i.e., this probability can be
estimated in SQ--$\D$ whenever $D_{p\oplus x} \in \D$. Thus for
$\D = \{D_{p \oplus x}\ |\ x\in \{0,1\}^n\}$ MEM$^p$ can be
simulated in SQ--$\D$. Such $\D$ is included in the class of
probability distributions which for every $i$, gives a learning
algorithm a degree of control $p_i$ over attribute $i$ (see Remark
\ref{rem-non-uniform-d} for details about the definition). For
every $i$, $p_i \in [\rho,1/2]$ and thus by Lemma
\ref{lem-degree-control} and Remark \ref{rem-non-uniform-d},
MEM$^p$ can be simulated in SQ--$D_{\rho}$. \qed


\subsection{Offsetting Attribute Noise in the Bounded Sieve}

Let $\tau > 0$ be a constant and $p \in [0,1/2- \tau]^n$ be a
probability vector. We are now going to show that we can implement
the Bounded Sieve using access to MEM$^p$ oracle (we call the
resulting algorithm {\tt AN-BS} where AN stands for attribute
noise).

\begin{theorem}
There exists a randomized algorithm {\tt AN-BS}
 that for any target concept $f$, threshold $\theta > 0$
 , degree bound $0 \leq \ell \leq n$, confidence $\delta >
 0$ and probability vector
 $p$ (with restrictions as above), {\tt  AN-BS}$(\mbox{MEM}^p,n,\theta,\ell,\delta,p)$
 returns, with
 probability at least $1-\delta$, a set with the large Fourier
 coefficient property for function $f$, threshold $\theta$ and
 degree $\ell$. Its running time is polynomial in $n, \theta^{-1}, 2^{\ell}$
 and $\log{(1/\delta)}$.
\label{th-anbs}

\end{theorem}
{\bf Proof:} A query to the oracle  MEM$^p(f)$ at point $y$
returns  $f(y\oplus b)$ where $b$ is chosen randomly according to
the distribution $D_p$. Thus by Hoeffding's formula we can
efficiently approximate $f^p(y) \triangleq \E_{b \sim D_p}f(y
\oplus b)$. Obviously, for every $x$, $|f^p(x)| \leq 1$ and thus
$\|f^p\| \leq 1$. Now, by using the real-valued version of the
{\tt KM} algorithm (see Remark \ref{rem-km-real} in Section
\ref{sec-km}) we can find a set with the large Fourier
coefficients property for function $f^p$ and threshold $\vartheta$
in time polynomial in $n,\vartheta^{-1}$ and $\log{(1/\delta)}$.
But $$f^p(y) = \E_{b \sim D_p}[f(y \oplus b)] = \sum_{a \in
\{0,1\}^n} \hat{f}(a)\E_{b \sim D_p}[\chi_a(y \oplus b)]= \sum_{a
\in \{0,1\}^n} \left(\hat{f}(a)\E_{b \sim D_p}[\chi_a(b)]\right)
\chi_a(y),$$ i.e., we have that for all $a$, $\hat{f^p}(a) =
\hat{f}(a)\E_{b \sim D_p}\chi_a(b)$. We can easily calculate the
value $c_a \triangleq \E_{b \sim D_p}\chi_a(b) = \prod_{a_i = 1}(1
- 2p_i)$. By the properties of $p$, $|c_a| \geq (2\tau)^{w(a)}$.
Thus if we run the above algorithm for the threshold $\vartheta =
(2\tau)^{\ell}\theta$ we will get the set $S'$ that includes a set
with the large Fourier coefficient property for the function $f$
and threshold $\theta$. Size of $S'$ is bounded by
$(2\tau)^{-2\ell}\theta^{-2}$. In order to estimate the values of
the coefficient $\hat{f}(a)$ with accuracy $\sigma$ we estimate
the coefficient $\hat{f^p}(a)$ with accuracy $\sigma c_a \geq
(2\tau)^{\ell}\sigma$ and return $c_a^{-1}\hat{f^p}(a)$. Thus we
can refine the set $S'$ and return a set with the large Fourier
coefficient property for the function $f$, threshold $\theta$ and
degree $\ell$. All the parts of the described algorithm run in
time polynomial in $n , (2\tau)^{-\ell}\theta^{-1}$ and
$\log{(1/\delta)}$, that is, for a constant $\tau$ the time is
polynomial in $n , \theta^{-1},2^{\ell}$, and $\log{(1/\delta)}$
\qed

\begin{remark}
It is easy to note that knowing the noise rates exactly is not
necessary. That is, the result will hold when the learning
algorithm is supplied with ``good" estimates of the noise rates.
Particularly, estimates within inverse of a polynomial (in
learning parameters) will suffice.
\end{remark}

\begin{remark}
When using MEM$^{p,\eta}(f)$ oracle instead of MEM$^p(f)$ in the
above algorithm we can immediately offset the classification noise
by using the fact: $f^{p,\eta}(y) = (1-2\eta)f^p(y)$, where
 $f^{p,\eta}(y)$ is the expectation of the query in point $y$ to the oracle
 MEM$^{p,\eta}(f)$. This will require increasing the accuracy of
 the estimation by $\fr{1-2\eta}$ and make the running time of the
 algorithm polynomially dependent on $\fr{1-2\eta}$.
\end{remark}

By sole use of {\tt AN-BS} we can efficiently learn \textsf{DT}
and weakly learn \textsf{DNF} as described in Theorems \ref{dt}
and \ref{uweakdnf}.

\subsection{Coping with Attribute Noise of Unknown Rate} As was shown by Goldman and Sloan
\cite{gs95}, coping with attribute noise becomes much more
difficult when the noise rates are unknown.\footnote{In fact, no
PAC learning algorithm exists that can tolerate a noise rate
higher than $2\epsilon$, where $\epsilon$ is the required accuracy
parameter.} The reason for this is that the usual way of handling
the unknown noise rate---choosing the hypotheses with the minimum
disagreement on samples---does not work for this type of noise.
Nevertheless, several simple classes (e.g., $k-$DNF and monomials)
are learnable when the attribute noise is uniform
\cite{gs95,dg95}. The following two theorems show that our
learnability results hold in this setting as well.

\begin{theorem} \textsf{DT} is learnable by membership queries
with uniform attribute noise of fixed unknown rate $\nu < 1/2$.
\label{th-an-dt}
\end{theorem}
{\bf Sketch of proof:} Learning algorithm has access to MEM$^p$
oracle for $p=[\nu]^n$. For the ease of presentation let us assume
that all estimations we do by sampling produce the exact result.
In particular, our {\tt AN-BS} algorithm returns all the vectors
with Fourier coefficients larger than the bound that was given to
it(and only them) and we can estimate exactly coefficients of the
function $f^p$ defined in the proof of Theorem \ref{th-anbs}. For
positive $\nu' <1/2$ we look at the execution of {\tt AN-BS} for
learning \textsf{DT} supplied with noise rate $\nu'$ (equal for
all the attributes) instead of the actual noise rate, i.e.,
$$\mbox{{\tt AN-BS}}(\delta,\mbox{MEM}^p, n, \frac{\epsilon}{8s},
\log{\frac{8s}{\epsilon}},[\nu']^n)$$ We then find all the Fourier
coefficients for returned vectors and offset the effect of noise
by multiplying every coefficient $\hat{f^p}(a)$ by
$(1-2\nu')^{-w(a)}$ (as if $\nu'$ is a correct noise rate). We
denote the function defined by coefficients we have found  by
$f_{\nu'}^p$ (with the assumptions we made this function is
well-defined). If $\nu' = \nu$ then the algorithm we have just
described will find all the required vectors and calculate their
Fourier coefficients correctly. Thus, as follows from Lemma
\ref{lem-dt-cut}, $1-\epsilon < \|f_{\nu}^p\|^2 \leq 1$. It is
easy to note that the larger the supplied noise rate $\nu'$ is,
the more vectors the execution of {\tt AN-BS} returns (since the
threshold gets lower) and the larger is the value by which we
multiply every Fourier coefficient of $f^p$ that was found by {\tt
AN-BS}. That is, for every $a \in \{0,1\}^n$,
$|\hat{f_{\nu'}^p}(a)|$ and $\|f_{\nu'}^p\|^2$ are monotonous in
$\nu'$. Thus by using a binary search we can efficiently find a
value $\nu'$ for which $1-\epsilon \leq \|f_{\nu'}^p\|^2 \leq 1 +
\epsilon\ (*)$ (the search will be efficient since values which
are ``close" to $\nu$ will satisfy this criterion). Denote the
value we have found by $\nu_1$. As we have noted, for every vector
$a$, $|\hat{f_{\nu'}^p}(a)|$ is monotonous in $\nu'$ thus the
criterion $(*)$  that $\nu_1$ satisfies ensures that
\begin{equation}
\label{eq2e} \left| \|f_{\nu}^p\|^2- \|f_{\nu'}^p\|^2 \right|=
\left|\sum_a\left(\hat{f_{\nu}^p}^2(a) -
\hat{f_{\nu'}^p}^2(a)\right) \right| \leq 2\epsilon
\end{equation}
On the other hand, as it was shown in the proof of Theorem
\ref{dt}, $\fuf{sign}(f_\nu^p)$ $\epsilon$--approximates $f$ and
therefore
\begin{equation}
\label{eqe} \sum_a \left(\hat{f_{\nu}^p}(a) - \hat{f}(a)\right)^2
\leq \epsilon
\end{equation}
By combining equations \ref{eq2e} and \ref{eqe} we can easily
obtain that

\begin{equation}
\label{eq3e} \sum_a \left(\hat{f_{\nu_1}^p}(a) -
\hat{f}(a)\right)^2  \leq 3\epsilon
\end{equation}
That is, $\fuf{sign}(f_{\nu_1}^p)$ $3\epsilon$--approximates $f$.
\qed

\begin{remark} Without the ``exactness" assumption we made in this proof we will need to improve the tolerance of all the queries so that the overall error in computation of $f_{\nu'}^p$ be smaller than $\epsilon/8$ (and substitute $\epsilon/4$ for $\epsilon$
in other bounds).
\end{remark}



\begin{theorem}
\textsf{DNF} is weakly learnable by membership queries with
uniform attribute noise of fixed unknown rate $\nu < 1/2$.
\label{th-an-wdnf}
\end{theorem}
{\bf Sketch of proof:} We define $f_{\nu'}^p$ as in the previous
proof. We denote by $\lambda(\nu')$ the absolute value of the
maximal Fourier coefficient of $f_{\nu'}^p$  and by $\alpha(\nu')$
its corresponding vector. It is easy to see that $\lambda(\nu')$
is monotonous in  $\nu'$ (by the same argument as in the previous
proof). It is also given by the properties of DNF expressions that
$\lambda(\nu) \geq \fr{2s+1}$, where $s$ is number of terms in the
target function. Thus we can find the minimal $\nu'$ for which
$\lambda(\nu') \geq \fr{2s+1}$ (as usual, an estimate for this
value is sufficient). We denote this error rate by $\nu_m$. By the
definition of $\nu_m$, $\nu \geq \nu_m$ and therefore
$$\lambda(\nu) = |\hat{f}(\alpha(\nu_m))| \geq \lambda(\nu_m) \geq
\fr{2s+1}\ ,$$ i.e., $\chi_{\alpha(\nu_m)}$ (or its negation)
$(\fr{2}-\fr{4s+2})$--approximates the target.\qed

\begin{remark}
 Both algorithms provided above can also be modified to tolerate the  classification noise of the same (unknown) rate $\nu$.
\end{remark}



\section{Limitations of the SQ-$\D$ Model}
\label{sec-char}

  So far we have been showing, so called, positive results about the models that we have  introduced. The other side of understanding a model is proving that certain classes are not learnable in the model (or so called negative results). Non-trivial results of this kind are usually very hard to achieve. Fortunately for us, the statistical query model is the most tractable (non-trivial) learning model.
In this chapter we extend this ``tractability" to the SQ--$\D$
model.
  This will be done by showing a new and simple way to characterize weakly
  SQ--learnable concept classes. This characterization will then
  be extended to SQ--$\D$ and applied to classes of parity functions.

\subsection{Properties of Statistical Queries}
\label{sec-sq-prop}

  We start by showing a simple property of statistical queries which simplifies the analysis
  of the SQ model. According to the definition of statistical query,
  the query function $\psi$ is a Boolean function of two parameters.
  The first one is denoted by $x$ and the second one is denoted by $f(x)$ according
  to the values substituted when estimating the expectation of the
  query function. We distinguish two types of statistical queries
  (either regular or extended) based on their query function.
  We say that a query is {\em independent of the target}
  if it is based on $\psi$ which is a function of $x$ alone
  (and does not depend on the value of the second parameter --
  $f(x)$) and we say that a query is {\em correlational} if $\phi(x,f(x))
  \equiv g(x)f(x)$ for a Boolean function $g(x)$.

  \begin{lemma}
  \label{lem-sq-simple}
  Any statistical query (regular or
  extended) can be substituted by two statistical queries that are independent of the target
  and two correlational queries.\footnote{This decomposition appears implicitly in \cite{bfjkmr94}.}
  \end{lemma}
  {\bf Proof:} We denote the distribution of the query by $D$. The following equation
   proves the statement:
$$ \E_D[\psi(x,f(x))] = \E_D\left[\psi(x,-1)\frac{1-f(x)}{2} +
\psi(x,1)\frac{1+f(x)}{2}\right] =$$ $$\fr{2}\E_D[\psi(x,1)f(x)] -
\fr{2}\E_D[\psi(x,-1)f(x)] +
\fr{2}\E_D[\psi(x,1)]+\fr{2}\E_D[\psi(x,-1)]\ .$$ \qed

A query that is independent of the target is estimation of the
expectation of random variable which is independent of $f$ and
therefore such a query can be substituted by a call to procedure
{\tt AMEAN} (we assume that the learning algorithm is able
  to sample randomly with respect to the distribution of the
  query). Thus we may assume that learning algorithms in the statistical
query based models use only the correlational statistical queries.

\subsection{The Characterization}

We are now going to present the characterization of classes weakly
learnable in statistical query based models. Let $\F$ be the
concept class and $\A$ be an algorithm that weakly learns $\F$ in
the SQ model (and uses only correlational queries). The
characterization is based on the following observation: for every
(but, probably, one) function in $\F$ there exists a correlational
query for which $\A$ gets a reply which differs ``noticeably" from
0. This is true because there could not exist two different
functions in $\F$ for which the results of all the queries are
``almost" equal. For every $f \in \F$ denote by $g_f$ the function
for which $|\E_D[fg_f]|$ differs ``noticeably" from 0, that is,
$g_f$ (or its negation) weakly approximates $f$. This means that
the set of queries of $\A$ weakly ``approximates" all (but,
probably, one) functions in $\F$. This argument is formalized in
the following way. Let $V$ and $W$ be sets of Boolean functions.
We say that $W$ $\epsilon$--approximates $V$ with respect to
distribution $D$ if for every $f \in V$ there exists $g \in W$
such that $g$ $\epsilon$--approximates $f$ with respect to
distribution $D$. For a concept class $\F$ denote by $\F^s$ the
set of all the functions in $\F$ of size $s$.

\begin{theorem}
 Let $\F$ be a concept class weakly SQ--learnable with respect to
$D$, particularly, there exists an algorithm $\A$ that for every
$f\in \F^s$, uses at most $p(n,s)$ queries of tolerance
lower-bounded by $\fr{r(n,s)}$ to produce $(\fr{2} -
\fr{q(n,s)})$--approximation to $f$. Let $r'(n,s) \triangleq
max\{2r(n,s), q(n,s)\}$. There exists a collection of sets
$\{W_i\}_{i=1}^{\infty}$ such that $|W_s| \leq p(n,s)+1$ and
$W_s$ $(\fr{2} - \fr{r'(n,s)})$--approximates $\F^s$ with respect
to $D$. \label{th-sq-char}
\end{theorem}
{\bf Proof:} According to the above discussion, we may assume that
 $\A$ uses only correlational queries. We build the set $W_s$
 as follows. We simulate algorithm
$\A(n,s)$ and for every query $(fg_i,\rho)$ we add $g_i$ to $W_s$
and return the value $0$ as a result of the query. We continue the
simulation till it violates any of the complexity bounds of $\A$
or $\A$ stops and returns final hypothesis $h$. In the later case
we also add $h$ to $W_s$. First, by the definition of $W_s$,
$|W_s| \leq p(n,s) +1$. Now, assume that $W_s$ does not $(\fr{2} -
\fr{r'(n,s)})$--approximate $\F^s$, i.e., there exists $f \in
\F^s$ such that for every $g \in W_s$, $|\E_D[fg]| <
\frac{2}{r'(n,s)}$. This means that in our simulation for building
$W_s$, zeros that we gave as answers are valid answers to the
queries. Therefore, by the definition of $\A$, it returns
hypothesis $h$ that $(\fr{2} - \fr{q(n,s)})$--approximates $f$,
that is, $h \in W_s$. This contradicts the assumption that
$|\E_D[fh]| < \frac{2}{r'(n,s)} \leq \frac{2}{q(n,s)}$. \qed

In the above proof we assumed that $\A$ is a deterministic
algorithm. On the other hand, the proof does not rely on the
uniformity of the algorithm $\A$ (our only restriction is the
polynomial number of queries). As is well known, any randomized
algorithm can be transformed into non-uniform (deterministic)
algorithm (in the context of languages this
 result is referred as $\mbox{\textsc{BPP}} \subset \mbox{\textsc{P}}/\mbox{poly}$).
 Therefore, Theorem \ref{th-sq-char} is also true for classes weakly learnable
 by randomized algorithms.

This characterization can be easily extended to SQ--$\D$ in the
following way. Every query $(fg,\rho,D')$ to STAT$(f,\D)$ has its
own distribution $D' \in \D$. Thus the fact that results of a
query differ ``noticeably" from 0 means that $f$ is weakly
approximated with respect to some distribution from $\D$. Distinct
queries may use distinct distributions from $\D$ and therefore
along with every function in the approximating collection of sets
we need to record the distribution with respect to which function
in $\F$ is approximated. Formally, we say that a set $W \subseteq
\{-1,1\}^X \times \D$  $\epsilon$--approximates the set $V$ of
Boolean functions with respect to $\D$ if for every $f \in V$
there exists a pair $(g,D') \in W$ such that $g$
$\epsilon$--approximates $f$ with respect to $D'$. Theorem
\ref{th-sq-char} is generalized to SQ--$\D$ as follows.

\begin{theorem}
Let $\F$ be a class learnable in SQ--$\D$. For $p$ and $r'$
defined as in Theorem \ref{th-sq-char} there exists a collection
of sets $\{W_i\}_{i=1}^{\infty}$ such that $W_s \subseteq
\{-1,1\}^X \times \D$, $|W_s| \leq p(n,s) +1$ and $W_s$
$(\fr{2}-\fr{r'(n,s)})$--approximates $\F^s$ with respect to $\D$.
\label{sqdchar}
\end{theorem}


\subsection{Applications}

 Although the characterization we gave for the SQ model is not equivalent to
 characterization by SQ--DIM \cite{bfjkmr94} (see Section
 \ref{sec-relation} for more details) it is applied in a
 very similar way. To demonstrate this we give a new proof for the main result
 obtained using the SQ--DIM characterization.

\begin{theorem} Any concept class containing superpolynomial
number of parity functions is not weakly SQ--learnable with
respect to the uniform distribution. \label{sqapp}
\end{theorem}
{\bf Proof:} Since the size of representation of any parity
function is bounded by a fixed polynomial (e.g., $c_0n$) by
Theorem \ref{th-sq-char} (with Remark \ref{bounded}) we have that
there exist polynomials $r(n)$ and $p(n)$ such that $\F$ has a
$(\fr{2}-\fr{r(n)})$--approximating set $W$ of size bounded by
$p(n)$. For any $g \in W$ approximating a parity function $\chi_a$
means that $|\E[g(x)\chi_a]| = |\hat{g}(a)| \geq 2/r(n)$. But from
Parseval's identity we know that $\sum_{a \in \{0,1\}^n}
\hat{g}^2(a) = 1$. This means that at most $r^2(n)/4$ Fourier
coefficients of $g$ could be greater than $2/r(n)$, i.e., every $g
\in W$ could be correlated with at most $r^2(n)/4$ different
parity functions. On the other hand, $|W| \leq p(n)$, that is, $W$
can approximate at most $p(n)r^2(n)/4$ parity functions. Thus, any
$\F$ containing a superpolynomial number of parity functions is
not weakly SQ--learnable with respect to the uniform
distribution.\qed

Our characterization of SQ--$\D$ learnable classes can be used to
show the limitations of the SQ--$\D_{\rho}$ model. {\tt
BS$_{\rho}$} can be straightforwardly used to learn the class of
all parity function containing at most $O(\log{n})$ variables
(with respect to U). On the other hand, we will show that the
class of all the parity functions containing at most $c(n)\log{n}$
variables for unbounded $c(n)$ cannot be learned in
SQ--$\D_{\rho}$. For this purpose we
 denote by $A^k =\{0,1\}^k[0]^{n-k}$, and let $\mbox{\textsf{PAR}}^k$ denote the
 class of all parity functions for vectors in $A^k$, i.e., all
 parity functions based on the first $k$ variables.
 In fact, we give somewhat stronger result.

\begin{theorem}
  $\mbox{\textsf{PAR}}^k$ is not
  weakly learnable in SQ--$\D_{\rho}$ for $k(n)=\omega(\log{n})$.
  \label{par-k}
\end{theorem}
{\bf Proof:} Assume that the theorem is not true. By Theorem
\ref{sqdchar} (with Remark \ref{bounded}) there exist polynomials
$p(n)$ and $r(n)$ and set of pairs $W = \{(g_1,D_1),(g_2,D_2),
\ldots, (g_m,D_m)\}$ such that $|W|=m \leq p(n)$ and for every
$\chi_a \in \mbox{\textsf{PAR}}^k$ there exists $(g_i,D_i) \in W$,
where $D_i \in \D_{\rho}$, such that $\E_{D_i}[g_i\chi_a] \geq
1/r(n)$. This means that $\sum_{1 \leq i \leq
m}\E_{D_i}^2[g_i\chi_a] \geq 1/r^2(n)$. Thus $$ \sum_{a \in A^k}
\sum_{1 \leq i \leq m}\E_{D_i}^2[g_i\chi_a] \geq |A^k|/r^2(n)$$
This means that there exists $i$ such that
 $$\sum_{a \in
A^k}\E_{D_i}^2[g_i\chi_a] \geq |A^k|/(|W|r^2(n)) \geq
|A^k|/(p(n)r^2(n)).$$ Or, if we denote by $U^k$ a uniform
distribution over $A^k$, then
  we get that
  \equ{\mathcal{I} \triangleq \E_{a \sim U^k} \left[ \E_{D_i}^2[g_i\chi_a]\right] \geq
  1/(p(n)r^2(n)) \label{e1}.}
On the other hand,
\begin{multline}
\mathcal{I} = \E_{a \sim U^k}\left[ \E_{D_i}^2[g_i\chi_a]\right] =
\E_{a \sim U^k}\left[ \E_{x \sim D_i}\left[\E_{y \sim D_i}[
g_i(x)\chi_a(x)g_i(y)\chi_a(y)]\right]\right] \\ = \E_{x \sim
D_i}[\E_{y \sim D_i}[ g_i(x)g_i(y)\E_{a \sim U^k}[\chi_a(x \oplus
y)]]] \ . \label{eq-i1}
\end{multline}
 But
$$\lambda(z) \triangleq \E_{a \sim U^k}[\chi_a(z)] = \E_{a \sim
U^k}[\chi_{z_{[1,k]}}(a_{[1,k]})] = \left\{\arr{ll}{0 & z_{[1,k]}
\neq [0]^k \\ 1 & z_{[1,k]} = [0]^k} \right. $$ Since $\lambda(x
\oplus y)$ is non-negative and $|g_i(x)g_i(y)| = 1$, equation
(\ref{eq-i1}) yields that
\begin{multline}
 \mathcal{I} \leq \E_{x \sim D_i}\left[\E_{y \sim D_i}[\lambda(x \oplus y)]\right] =
 \E_{x \sim D_i}\left[\pr_{y \sim D_i}[\lambda(x \oplus y)=1]\right]  \\ =
  \E_{x \sim D_i}[\pr_{y \sim D_i}[y_{[1,k]} = x_{[1,k]}]]
  \leq \E_{x \sim D_i}[(1-\rho)^k] \leq (1-\rho)^k \ . \label{e2}
\end{multline}
For $k = \omega(\log{n})$, $(1-\rho)^k$ cannot be lower-bounded by
inverse of a polynomial and thus equation (\ref{e1}) contradicts
equation (\ref{e2}).\qed

The last theorem may also be used to prove that the set of concept
classes learnable in the PAC model with classification noise of
constant rate is not contained in the set of concept classes
weakly learnable in SQ--$\D_{\fr{4}}$. This result can be obtained
by combining our result with the following theorem for $k =
\log{n}\log{\log{n}}$ \cite[Theorem 2]{bkw00}.

\begin{theorem}
  $\mbox{\textsf{PAR}}^k$ for noise rate $\eta$ equal to any constant less than
  $\fr{2}$, can be learned with respect to the uniform distribution
  using number of samples and total
  computation time $2^{O(k/\log{k})}$.
\end{theorem}

Another interesting application of our characterization is  to
demonstrate that \textsf{PAR}, the class of all parity functions,
possesses some inherent hardness for any statistics
 based learning. For this purpose we give the following definition.
 We say that a distribution class $\D$ is {\em biased}
if there exist a polynomial $p(n)$ such that $L_{\infty}(\D) \geq
\fr{p(n)}$. We may notice that if a distribution class $\D$ is
biased then a non-uniform learning algorithm may get the value of
the target function at the ``biased" point. Use of attributes of
specific point together with its label contradicts the idea of
statistical learning and, on the other hand, is the main element
of all the known algorithms for learning of parity functions.

\begin{theorem}
 Let $\D$ be a class of distributions. If\ \ $\D$ is not biased then
 for every $D \in \D$, \textsf{PAR} is not weakly learnable in the SQ--$\D$ model
 with respect to $D$.
\label{bias}
\end{theorem}
{\bf Proof:} By assuming that \textsf{PAR} is weakly learnable in
SQ--$\D$ with respect to some distribution $D \in \D$ and then
proceeding as in the proof of the previous theorem we get that
there exists set $W$ and $(g_i,D_i) \in W$ ($D_i \in \D$) such
that $$\E_{a \sim U} \left[ \E_{D_i}^2[g_i\chi_a]\right] \geq
1/(p(n)r^2(n)).$$
 On the other hand,
 \begin{multline}
 \E_{a \sim U}\left[
\E_{D_i}^2[g_i\chi_a]\right] = \E_{a \sim U}\left[ \E_{x \sim
D_i}[\E_{y \sim D_i}[ g_i(x)\chi_a(x)g_i(y)\chi_a(y)]]\right] \\ =
 \E_{x \sim D_i}[\E_{y \sim D_i}[ g_i(x)g_i(y)\E_{a \sim
U}[\chi_a(x \oplus y)]]]
\end{multline}
 By the properties of parity functions, if $x \neq y$ then $\E_{a
\sim U}[\chi_a(x \oplus y)] = \E_{a \sim U}[\chi_{x \oplus y}(a)]
= 0$. Otherwise, $\E_{a \sim U}[\chi_a(0^n)] = 1$. So the
expression we are evaluating is equal to
 $$ \E_{x \sim D_i}[D_i(x)g_i^2(x)] = \E_{x \sim D_i}[D_i(x)]$$
Thus we got that  $\E_{x \sim D_i}[D_i(x)] \geq 1/(p(n)r^2(n))$,
in particular there exists $x$ such that $D_i(x) \geq
1/(p(n)r^2(n))$, that is, $\D$ is biased. This contradiction
finishes the proof. \qed

\section{Conclusions and Open Problems}
In this work we have shown that  \textsf{DNF} is learnable in an
extension of the PAC model (PAC-$\D_{\rho}$) that seems to be
weaker than the use of membership queries. While it seems to be
very difficult to say anything about the exact strength of the
PAC-$\D_{\rho}$ model, we can show that the statistical query
analogue of PAC-$\D_{\rho}$ (SQ--$\D_{\rho}$) is sufficient for
our \textsf{DNF} learning algorithm. This fact combined with
characterization of SQ--$\D_{\rho}$ learnable classes was used to
show that, in terms of learning parity functions, SQ--$\D_{\rho}$
has just enough power to be able to learn parity functions that
can be expressed as polynomial-size DNF expressions. Another
useful property of the SQ--$\D_{\rho}$ model used is that it can
be simulated using membership queries with persistent
classification noise.

We have introduced an extension of attribute noise to membership
queries. It turns out that the use of membership queries with
product attribute noise is interestingly related to the
SQ--$\D_{\rho}$ model. We have shown that decision trees are
learnable and DNF expressions are weakly learnable by membership
queries with product attribute noise.

The introduction of  new model between PAC$_U$ and PAC$_U$+MQ
poses interesting and apparently difficult questions about the
relation between the strengths of these models. Another
interesting question of similar nature is whether there exists a
class learnable in SQ--$\D_{\rho}$ and not learnable by membership
queries with product attribute noise, and particularly (in case
the answer is positive) whether \textsf{DNF} is such a class. As
we have proved, the power of SQ--$D_{\rho}$ does not decrease when
we decrease $\rho$. But it remains unclear to us whether the power
is strictly increasing or the models are equally strong (for a
constant $0 < \rho < 1/2$).

\pagebreak
 \acks{ This research was supported by
the fund for the promotion of research at the Technion. Part of
research was done at the University of Calgary, Calgary, Alberta,
Canada.

We would like to thank Dmitry Gavinsky, Jeff Jackson and Eyal
Kushilevitz for useful discussions about this research, and to
anonymous referees of COLT/EuroCOLT 2001 conference and Journal of
Machine Learning Research for their careful reading, corrections,
and insightful comments on this work.}

\begin{thebibliography}{99}
  \bibitem [AD93]{ad93} Javed Aslam, Scott Decatur. General Bounds on
  Statistical Query Learning and PAC Learning with Noise via
  Hypothesis Boosting. In {\em Proceedings of 34-th Annual Symposium on
  Foundations of Computer Science}, pp. 282--291, 1993.
  \bibitem [AL88]{al88} Dana Angluin, Philip Laird. Learning From
  Noisy Examples. In {\em Machine Learning}, 2(4) pp.343--370,
  1988.
  \bibitem [AK91] {ak91} Dana Angluin, Michael Kharitonov. When
  Won't Membership Queries Help? In {\em Proceedings of the 23-rd
  Annual ACM Symposium on Theory of Computing}, 1991, pp.
  454--454.
  \bibitem[Bs93]{bs93} Nader Bshouty. Exact Learning via the Monotone Theory. In {\em Proceedings of 34-th Annual Symposium on
  Foundations of Computer Science}, pp. 302--311, 1993.
  \bibitem[BFJ$^+$94]{bfjkmr94} Avrim Blum, Merrick Furst, Jeffrey Jackson,
  Michael Kearns, Yishay Mansour, Steven Rudich. Weakly Learning DNF and Characterizing
  Statistical Query Learning Using Fourier Analysis. In {\em Proceedings
  of the 26-th Annual ACM Symposium on the Theory of Computing}, pp.
  253--262, 1994.
  \bibitem[BF01]{bf01} Nader Bshouty, Vitaly Feldman.
  On Using Extended Statistical Queries to Avoid Membership Queries. In {\em Proceedings of the 14-th Annual Conference on COLT/EuroCOLT}, pp. 529--545, 2001.
  \bibitem[BG01]{bg01} Nader Bshouty, Dmitry Gavinsky.
  On Boosting With Optimal Poly-Bounded Distributions. In {\em Proceedings of the 14-th Annual Conference on COLT/EuroCOLT}, pp. 490--506, 2001.
  \bibitem[BJT99]{bjt99} Nader Bshouty, Jeffrey Jackson, Christino Tamon.
  Uniform-Distribution Attribute Noise Learnability. In {\em Proceedings of the 12-th
  Annual Conference on COLT}, pp. 75--80, 1999.
  \bibitem[BKW00]{bkw00} Avrim Blum, Adam Kalai, Hal Wasserman.
  Noise-Tolerant Learning, the Parity Problem and the Statistical
  Query Model. In {\em Proceedings of the 32-th Annual ACM Symposium
  on Theory of Computing}, pp. 435--440, 2000.
  \bibitem[DG95]{dg95} Scott Decatur, Rosario Gennaro. In {\em Proceedings of the Eighth  Annual ACM Workshop on Computational Learning Theory}, pp.353--360, 1995.
  \bibitem[Fr90]{fr90} Yoav Freund. Boosting a Weak Learning
  Algorithm by Majority. In {\em Proceedings of the Third Annual
  Workshop on COLT}, pp. 202--216, 1990.
  \bibitem[FS94]{fs94} Yoav Freund, Robert Schapire. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. In {\em  Proceedings of the Second Annual European Conference on Computational Learning Theory}, 1994.
  \bibitem[GKS94]{gks94} Sally Goldman,  Michael Kearns, Robert
  Shapire. Exact Identification of Circuits Using Fixed
  Points of Amplification Functions, In {\em SIAM Journal on Computing,
  22 (1993)}, pp. 705--726.
  \bibitem[GL89]{gl89} Oded Goldreich, Leonid A. Levin. A Hard-Core Predicate for all One-Way   Functions. In {\em Proceedings of the Twenty First Annual ACM Symposium on Theory of Computing}, pp. 25--32, 1989.
  \bibitem[GS95]{gs95} Sally Goldman, Robert Sloan. Can PAC
  Learning Algorithms Tolerate Random Attribute Noise? In {\em
  Algorithmica}, 14(1) pp. 70--84, 1995.
  \bibitem[Ja94]{ja94} Jeffrey Jackson. An Efficient Membership-Query
  Algorithm for Learning DNF with Respect to the Uniform
  Distribution. In {\em Proceedings of the 35th Annual
  Symposion on Foundations of Computer Science}, pp. 42--53, 1994.
  \bibitem[Ja97]{ja97} Jeffrey Jackson. An Efficient Membership-Query
  Algorithm for Learning DNF with Respect to the Uniform
  Distribution. In {\em Journal of Computer and System Sciences}, 55(3), 1997
  \bibitem[JSS97]{jss97}Jeffrey Jackson, Eli Shamir, Clara Shwartzman. Learning with Queries Corrupted by
  Classification Noise. In {\em Proceedings of Fifth Israel Symposium on Theory of Computing and
  Systems}, pp. 45--53, 1997.
  \bibitem[Kea93]{kea93} Michael Kearns. Efficient
  Noise-Tolerant
  Learning from Statistical Queries. In {\em Proceedings of the
  Forth Annual Workshop on COLT,} pp. 392--401, 1993.
  \bibitem[KM91]{km91} Eyal Kushilevitz, Yishay Mansour. Learning Decision
  Trees Using the Fourier Spectrum. In {\em Proceedings of the
  23-rd Annual Symposium on Theory of Computing,} pages 455--464.
  \bibitem[KS99]{ks99} Adam Klivans, Rocco Servedio. Boosting and Hard-Core Sets. In {\em
  Proceedings of 40-th Annual Symposium on Foundations of Computer Science}, pp. 624--633, 1999.
  \bibitem[KSS92]{kss92} Michael Kearns, Robert Shapire, Linda
  Sellie. Toward Efficient Agnostic Learning. In {\em Proceedings of
  the Fifth Annual Workshop on COLT,} pp. 341--352, 1992.
  \bibitem[Lai88]{lai88} Phillip D. Laird. Learning From Good and Bad Data.
  {\em Kluwer Academic Publishers}, Boston, 1988.

  \bibitem[LMN89]{lmn89} Nathan Linial, Yishay Mansour, Noam Nisan.
  Constant Depth Circuits, Fourier Transform, and Learnability.
  {\em In Proceedings of the 31-st Symposium on the Foundations of
  Computer Science,} pp. 574--579, 1989.
  \bibitem[Man94]{mansour} Yishay Mansour. Learning Boolean Functions
  via the Fourier Transform. In {\em Theoretical Advances in Neural Computation
  and Learning}, (V.P. Roychodhury and K-Y. Siu and A. Orlitsky, ed.),
   391--424, 1994.
  \bibitem[ShSh95]{shsh95} Eli Shamir, Clara Shwartzman. Learning by Extended
  Statistical Queries and Its Relation to PAC Learning. In {\em
  Proceedings of Second European Conference, EuroCOLT '95}, pp.
  357--366, 1995.
  \bibitem[Sch90]{sch90} Robert Shapire. The Strength of Weak Learnability. In {\em Machine Learning}, 5, pp. 197--227, 1990.
  \bibitem[Sl88]{sl88} Robert Sloan. Types of Noise in Data for Concept Learning. In {\em Proceedings of the 1988 Workshop on Computational Learning Theory}, pp. 91--96, MIT, ACM Press, 1988.
  \bibitem[SV88]{sv88} George Shakelford, Dennis Volper. Learning $k-$DNF with
  Noise in the Attributes. In {\em Proceedings of the
  1988 Workshop on COLT,} pp. 97--103, 1988.
  \bibitem[V84]{v84} Leslie Valiant. A Theory of the Learnable. In
  {\em Commmunications of the ACM}, 27(11) pp. 1134--1142, 1984.
 \end{thebibliography}

\end{document}
