11430324 (P.T.A.B. Feb. 27, 2014)

Ex Parte Juffa et al

Patent Trial and Appeal BoardFeb 27, 2014

11430324 (P.T.A.B. Feb. 27, 2014)

UNITED STATES PATENT AND TRADEMARK OFFICE ____________________ BEFORE THE PATENT TRIAL AND APPEAL BOARD ____________________ Ex parte NORBERT JUFFA and JOHN R. NICKOLLS ____________________ Appeal 2011-010162 Application 11/430,324 Technology Center 2100 ____________________ Before ST. JOHN COURTENAY III, THU A. DANG, and LARRY J. HUME, Administrative Patent Judges. DANG, Administrative Patent Judge. DECISION ON APPEAL Appeal 2011-010162 Application 11/430,324 2 I. STATEMENT OF THE CASE Appellants appeal under 35 U.S.C. § 134(a) from a Final Rejection of claims 1, 2, and 4-20 (App. Br. 5). Claim 3 has been cancelled (App. Br. 19). We have jurisdiction under 35 U.S.C. § 6(b). We affirm. A. INVENTION Appellants’ invention is directed to a system and method for reducing the bandwidth required for a matrix multiply operation, where a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix, reducing the required memory bandwidth from 2N to N+1, where N is the number of elements in a column of the product matrix (Abstract). B. ILLUSTRATIVE CLAIM Claim 1 is exemplary: 1. A computer-implemented method of executing a set of operations including a broadcast operand for multiple threads, comprising: determining that a memory operand included in the set of operations is the broadcast operand based on an address specified for the memory operand that falls within a broadcast memory region; reading a first value from a location corresponding to the address within the broadcast memory region; providing the first value to multiple program instruction execution units within a multithreaded processor; reading a set of second values specified by a parallel operand included with the set of operations, wherein each one of the second values corresponds to a different one of the multiple threads and is read from a portion of memory that is outside of the broadcast memory region; Appeal 2011-010162 Application 11/430,324 3 providing one second value of the set of second values to each one of the multiple program instruction execution units within the multithreaded processor; and executing the set of operations for each one of the multiple threads. C. REJECTIONS The prior art relied upon by the Examiner in rejecting the claims on appeal is: Hall US 5,226,171 July 6, 1993 Pechanek US 5,682,544 Oct. 28, 1997 Ford US 2005/0125636 A1 June 9, 2005 Bonebakker US 2007/0143574 A1 June 21, 2007 Sazegari US 7,337,205 B2 Feb. 26, 2008 Claims 1, 4, 6, 7, 10-13, and 15-19 stand rejected under 35 U.S.C. § 103(a) as being unpatentable over Hall in view of Bonebakker. Claims 5 and 9 stand rejected under 35 U.S.C. § 103(a) as being unpatentable over Hall in view of Bonebakker and Ford. Claim 2 stands rejected under 35 U.S.C. § 103(a) as being unpatentable over Hall in view of Bonebakker and Pechanek. Claims 8, 14, and 20 stand rejected under 35 U.S.C. § 103(a) as being unpatentable over Hall in view of Bonebakker and Sazegari. II. ISSUES The dispositive issues before us are whether the Examiner erred in finding the combination of Hall and Bonebakker teaches or would have suggested: “determining that a memory operand included in the set of operations is the broadcast operand based on an address specified for the memory operand that falls within a broadcast memory region” and “reading a set of second values specified by a parallel operand included with the set Appeal 2011-010162 Application 11/430,324 4 of operations, wherein each one of the second values corresponds to a different one of the multiple threads and is read from a portion of memory that is outside of the broadcast memory region,” as recited in claim 1 (emphases added). III. FINDINGS OF FACT The following Findings of Fact (FF) are shown by a preponderance of the evidence. Hall 1. Hall discloses a parallel processing system having a plurality of simultaneously operable arithmetic units that provide matrix-vector products, with each of the arithmetic units to receive a plurality rows of a matrix to calculate the matrix-vector product, wherein a column of a second matrix or a vector is broadcast to each of the respective arithmetic units (Abstract; Figs. 3, 4; col. 1, ll. 60-67; col. 3, ll. 60-68). The system typically performs a matrix-vector product (MVP) defined by the expression y=A*x, where y is a vector with m elements, A is a matrix with m rows and n columns, and x is a vector with n elements (col. 2, ll. 3-6). The vector x is stored in each arithmetic unit, while the elements of matrix A are provided on the memory bus 40 to the arithmetic units 16-30 (Fig. 4; col. 2, ll. 6-9). 2. A portion of the memory address space is divided into segments, one for each of the arithmetic units, and is used for reading and writing data and control information to the respective units (Fig. 7; col. 1, l. 60–col. 2, l. 9). In particular, the memory map includes a sixteen million word address space 58, of which fifteen million words can be used to address the physical memory 56; wherein, the highest million words of the Appeal 2011-010162 Application 11/430,324 5 address space 120 is divided into sixteen 64K word segments 122, one for each of fifteen arithmetic units 16-30 and the last 64K address segment 124 represents a broadcast segment 124, which is recognized by all the arithmetic units such that data is simultaneously sent to all arithmetic units16-30 (Fig. 3, 7; col. 8, ll. 56-68). The broadcast segment 124 broadcasts a stream of matrix elements to all the arithmetic units 16-30 at the same time (col. 9, ll. 40-47). Bonebakker 3. Bonebakker discloses a system that supports vector operations on a multi-threaded microprocessor where all of the threads in the set of threads execute the same shared set of instructions but operate on different data, with each thread tracking its location in the shared set of instructions (¶ [0011]). 4. The multi-threaded, multi-core processor’s includes a bank of processor cores having L1 caches 306 that feed a set of threads in parallel with associated thread contexts T1-TL 308, which are virtually mapped 310 into the virtual vector 312, with each thread context corresponding to a specific range of the vector (Fig. 3; ¶¶ [0034]-[0035]). IV. ANALYSIS Claims 1, 4, 6, 7, 10-13, and 15-19 Appellants contend since “Hall fails to teach that each one of the second values corresponds to a different one of the multiple threads,” it “teaches that different arithmetic execution units store different vectors” (App. Br. 13). Appellants argue since Hall discloses “the number of rows [of the first matrix] exceeds the number of execution units,” where “each Appeal 2011-010162 Application 11/430,324 6 arithmetic unit explicitly stores two or more rows,” these values of the matrix do not “correspond to a different one of the multiple arithmetic execution units (i.e., threads)” (App. Br. 14 (emphasis omitted)). Appellants contend further that “Bonebakker is silent as to the very specific process of each one of the second values corresponding to a different one of the multiple threads” and “[n]one of the cited references teaches or suggest” “determining that a memory operand included in the set of operations is the broadcast operand based on an address specified for the memory operand that falls within a broadcast memory region” (id.). Appellants finally contend “the modification to the Hall system proposed by the Examiner would render the Hall system inoperable for its intended purpose” since Hall requires “that each arithmetic unit receives the same data,” which “is incompatible with the requirement of Bonebakker’s SIMD architecture that each thread runs on unique and different data” (App. Br. 15-16 (emphasis omitted)). However, the Examiner finds “Hall teaches multiple execution units, wherein each execution unit at any one time multiplies a first common broadcast operand with a different, execution-unit-specific second operand” (Ans. 20) and “Bonebakker teaches, in a multiple execution unit processor performing a matrix multiplication, each execution unit corresponds to different respective threads” (Ans. 21). The Examiner notes that Hall’s system “maps the claimed set of second values to the set of single elements loaded by each execution unit from the row vectors of the ‘4 Local Rows’ matrix as shown in Figure 3 of Hall” (Ans. 19). Thus, “each execution unit in parallel loads a single row vector element to perform a multiplication, but since there are multiple execution units each loading a single element in Appeal 2011-010162 Application 11/430,324 7 parallel, there is a set of elements, i.e. a set of second values” (id.). The Examiner notes further, “when the broadcast segment’s memory address is received by all execution units,” “it is determined that a match exists by all units, thus all units are enabled in parallel, and thus it is determined that the received operand is a broadcast operand” (Ans. 24). Hall is directed to a parallel processing system having a plurality of arithmetic units that calculate matrix-vector products by receiving a plurality of rows of a matrix that are multiplied by a vector broadcasts to all arithmetic units (FF 1). The system divides the memory into designated areas for the physical memory, the fifteen arithmetic units, and the broadcast segment, which is recognized by all the arithmetic units such that data is simultaneously sent to all arithmetic units (FF 2). We agree with the Examiner’s finding that Hall’s system determines that an operand is a broadcast operand based upon the specified address (Ans. 24). We also agree with the Examiner’s finding that Hall’s matrix teaches reading a set of values specified by a parallel operand where each set goes to a different one of the arithmetic units (multiple threads) (Ans. 19-20). In addition, Bonebakker is directed to a system that supports vector operations on a multi-threaded microprocessor where all of the threads operate on different data (FF 3). The multi-threaded, multi-core processor’s includes a bank of processor cores that feed a set of threads with associated thread contexts in parallel (FF 4). We agree with the Examiner’s finding that Bonebakker’s multi-threaded microprocessor method includes reading a set of values specified by a parallel operand where each set goes to a different one of the an associated thread contexts (Ans. 21). Appeal 2011-010162 Application 11/430,324 8 We find that the combination of Hall and Bonebakker at least suggests all contested claim limitations of claim 1. Though Appellants also contend “the modification to the Hall system proposed by the Examiner would render the Hall system inoperable for its intended purpose” (App. Br. 15-16), Appellants have not provided persuasive support for this assertion. Since Hall discloses a parallel processing system, we conclude the combination of one known element (Hall’s arithmetic units) with another (Bonebakker’s multi-threaded microprocessor) would have yielded predictable results to one of ordinary skill in the art at the time of the invention. Thus, we find that reading a set of values specified by a parallel operand into a different one of the arithmetic units as taught by Hall in addition to Bonebakker’s reading of different values in parallel into associated thread contexts is no more than a simple arrangement of old elements, with each performing the same function it had been known to perform, yielding no more than one would expect from such an arrangement. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 417 (2007). The skilled artisan would “be able to fit the teachings of multiple patents together like pieces of a puzzle” since the skilled artisan is “a person of ordinary creativity, not an automaton.” Id. at 420-21. Appellants have presented no evidence that combining Hall’s parallel vector operations with the multi-thread architecture of Bonebakker’s processor was “uniquely challenging or difficult for one of ordinary skill in the art” or “represented an unobvious step over the prior art.” Leapfrog Enterprises, Inc. v. Fisher- Price, Inc., 485 F.3d 1157, 1162 (Fed. Cir. 2007) (citing KSR, 550 U.S. at 418-19). Appeal 2011-010162 Application 11/430,324 9 We also agree with the Examiner’s explicit motivation that combining the references would be obvious “because it allows multithreaded processors to exploit the parallelism of vector operations, and also provides for increased programmer productivity and reduced program complexity” (Ans. 6). The Supreme Court has stated “[t]he combination of familiar elements according to known methods is likely to be obvious when it does no more than yield predictable results.” KSR, 550 U.S. at 416. Thus, we find no error in the Examiner’s finding that the combination of Hall’s system including the parallel vector with the reading of values in parallel on multiple execution units (associated thread contexts), as disclosed in Bonebakker, produces reading values in parallel to different ones of multiple threads which would be obvious (Ans. 6; FF 1-4). Accordingly, we find no error in the Examiner’s rejection of claim 1 under 35 U.S.C. § 103(a) over Hall in view of Bonebakker. Further, independent claims 13 and 17 having similar claim language, and claims 4, 6, 7, 10-12, 15, 16, 18, and 19 (depending from claims 1, 13, and 17), which have not been argued separately, fall with claim 1. Claims 2, 5, 8, 9, 14, and 20 Appellants argue claims 2, 5, 8, 9, 14, and 20 are patentable over the cited prior art for the same reasons asserted with respect to claim 1 (App. Br. 17). As noted supra, however, we find that the combined teachings of Hall and Bonebakker at least suggest all the claimed features of claim 1. We therefore affirm the Examiner’s rejection of claim 5 and 9 under 35 U.S.C. § 103 over Hall in view of Bonebakker and Ford, of claim 2 under 35 U.S.C. § 103 over Hall in view of Bonebakker and Pechanek, and of Appeal 2011-010162 Application 11/430,324 10 claims 8, 14, and 20 under 35 U.S.C. § 103 over Hall in view of Bonebakker and Sazegari. V. CONCLUSION AND DECISION The Examiner’s rejection of claims 1, 2, and 4-20 under 35 U.S.C. § 103(a) is affirmed. No time period for taking any subsequent action in connection with this appeal may be extended under 37 C.F.R. § 1.136(a)(1)(iv). AFFIRMED bab