May's Home
May 13
Learning Python
1. Python also supports complex numbers. Imaginary numbers are written with a suffix of 'j' or 'J'. Complex numbers with a nonzero real component are written as '(real+imagj)', or can be created with the 'complex(real, imag)' function. Complex numbers are always represented as two floating point numbers, the real part and the imaginary part. To extract these parts from a complex number z, use z.real and z.imag.
2. Multiple assignment: the variables a and b simultaneously get the new values 0 and 1.
a, b = 0, 1
All expressions on the right-hand side are evaluated first, from left to right, before any of the assignments take place.
a, b = b, a+b
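The two assignments above are the core of a Fibonacci loop. Here is a minimal sketch, assuming Python 3; the function name fib_pairs is made up for illustration. Because the right-hand side is evaluated in full before anything is assigned, no temporary variable is needed, and the same idiom swaps two names.

```python
def fib_pairs(n):
    """Return the first n Fibonacci numbers using tuple assignment."""
    a, b = 0, 1              # both names are bound in one statement
    out = []
    for _ in range(n):
        out.append(a)
        # b and a+b are computed from the OLD values, then assigned together
        a, b = b, a + b
    return out


x, y = 1, 2
x, y = y, x                  # swap without a temporary variable

print(fib_pairs(8))
print(x, y)
```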
3. To iterate over the indices of a sequence, combine range() and len() as follows:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print i, a[i]
...
0 Mary
1 had
2 a
3 little
4 lamb
4. There are three built-in functions that are very useful when used with lists: filter(), map(), and reduce().
'filter(function, sequence)' returns a sequence (of the same type, if possible) consisting of those items from the sequence for which function(item) is true. For example,
>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]
'map(function, sequence)' calls function(item) for each of the sequence's items and returns a list of the return values. For example,
>>> def cube(x): return x*x*x
...
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
'reduce(func, sequence)' returns a single value constructed by calling the binary function func on the first two items of the sequence, then on the result and the next item, and so on. For example, to compute the sum of the numbers 1 through 10:
>>> def add(x,y): return x+y
...
>>> reduce(add, range(1, 11))
55
A third argument can be passed to reduce() as the starting value; it is returned when the sequence is empty:
>>> def sum(seq):
...     def add(x,y): return x+y
...     return reduce(add, seq, 0)
...
>>> sum(range(1, 11))
55
>>> sum([])
0
5. Another useful data type built into Python is the dictionary. Dictionaries are sometimes found in other languages as "associative memories" or "associative arrays". Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys.
>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'sape': 4139, 'guido': 4127, 'jack': 4098}
>>> tel['jack']
4098
>>> del tel['sape']
>>> tel['irv'] = 4127
>>> tel
{'guido': 4127, 'irv': 4127, 'jack': 4098}
>>> tel.keys()
['guido', 'irv', 'jack']
>>> tel.has_key('guido')
True
6. Methods of File Objects
f.read(size)
f.readline()
f.readlines()
f.write(string)
f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file.
f.seek()
f.close()
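The methods listed above can be tried out on a scratch file. This is a minimal sketch, assuming Python 3; the function name demo_file_methods and the file contents are made up for illustration. The file is opened in binary mode so that f.tell() and f.seek() count plain bytes, and the with statement calls f.close() automatically.

```python
import os
import tempfile


def demo_file_methods():
    """Exercise write/read/tell/seek/readline/readlines on a temporary file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"0123456789abcdef\nsecond line\n")
        with open(path, "rb") as f:
            head = f.read(4)              # first four bytes
            position = f.tell()           # byte offset after the read
            f.seek(10)                    # jump to byte 10, counted from the start
            rest_of_line = f.readline()   # from byte 10 up to and including the newline
            remaining = f.readlines()     # every remaining line, as a list
        return head, position, rest_of_line, remaining
    finally:
        os.remove(path)


print(demo_file_methods())
```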
Python provides a standard module called pickle. This is an amazing module that can take almost any Python object (even some forms of Python code!), and convert it to a string representation; this process is called pickling. Reconstructing the object from the string representation is called unpickling.
pickle.dump(x, f)
x = pickle.load(f)

March 12
Microdisplay
Microdisplays are displays that are so small that optical magnification is needed. Most microdisplays use a silicon chip as the substrate material. The chip also houses the addressing electronics (at least an active matrix with integrated drivers), usually implemented in standard CMOS technology. This mature technology generates very reliable and stable circuits (better than TFT technology) and allows very small pixel pitches (down to 10 µm or even somewhat smaller) and high display resolutions.
Microdisplays can be used in projectors or in "near to the eye" (NTE) applications, such as in head-mounted displays and camera view-finders. Several electro-optical effects can be used to generate the image: electroluminescence (EL), OLED, vacuum fluorescence (VF), reflective liquid crystal effects, and tilting or deforming of micro-mirrors (requires micro-machining). The most popular combinations today are Liquid Crystal On Silicon (LCOS), OLED on silicon, and tilted mirrors (DMD or DLP).
(From: http://tfcg.elis.ugent.be/microdis/index.html)

January 8
Memory subsystem of AMD K10 Micro-Architecture
From http://www.xbitlabs.com/articles/cpu/display/amd-k10_8.html

Load/Store Unit
Once the memory request addresses have been calculated in the AGU of the K8 processor, all load and store operations are sent to the LSU (Load/Store Unit). The LSU contains two queues: LS1 and LS2. Load and store operations first enter the LS1 queue, which is 12 entries deep. LS1 issues requests to the L1 cache at a rate of two operations per clock, in program order. On a cache miss, operations are placed into the LS2 queue, which is 32 entries deep. Requests to the L2 cache and RAM are issued from there.
The LSU of the K10 processor has been modified. Now the LS1 queue receives only load operations, while store operations are sent to the LS2 queue. Load operations from LS1 can be executed out of order, taking into account the addresses of store operations in LS2. As we have already mentioned above, K10 processes 128-bit store operations as two 64-bit ones, which is why each takes two positions in the LS2 queue.

L1 Cache
The L1 cache in K8 and K10 processors is split: 64KB each for instructions (L1I) and data (L1D). Each cache is 2-way set associative; the line length is 64 bytes. This low associativity may result in frequent conflicts between lines that map to the same set, which in turn may increase the number of cache misses and hurt performance. The low associativity is partly compensated by the rather large size of the L1 cache. A significant advantage of L1D is its two ports: it can process two read and/or write instructions per clock in any combination.
Unfortunately, the L1 cache of the K10 processor still has the same size and associativity. The only noticeable improvement is the wider read bus. As we have said in the previous chapter, the CPU can now perform two 128-bit reads every clock cycle, which makes it much more efficient when processing SSE data in local memory.
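The conflict problem of a 2-way cache can be made concrete with a small model. This is a rough sketch, not AMD's actual indexing or replacement logic: it assumes the set index is simply the line number modulo the number of sets and that replacement is LRU, neither of which the article states. With 64KB, 2 ways and 64-byte lines there are 512 sets, so addresses 32KB apart land in the same set.

```python
LINE_BYTES = 64
WAYS = 2
CACHE_BYTES = 64 * 1024
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 512 sets


def set_index(addr):
    """Set selected by an address (assumed: plain modulo indexing)."""
    return (addr // LINE_BYTES) % NUM_SETS


def count_misses(addresses):
    """Count misses for an access stream, assuming LRU replacement per set."""
    sets = {}
    misses = 0
    for addr in addresses:
        idx = set_index(addr)
        tag = addr // (LINE_BYTES * NUM_SETS)
        ways = sets.setdefault(idx, [])          # least recently used first
        if tag in ways:
            ways.remove(tag)                     # hit: refresh its age below
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.pop(0)                      # evict the least recently used line
        ways.append(tag)
    return misses


STRIDE = LINE_BYTES * NUM_SETS                   # 32KB: same set, different tag
two_lines = [0, STRIDE] * 4
three_lines = [0, STRIDE, 2 * STRIDE] * 4
print(count_misses(two_lines), count_misses(three_lines))
```

Two lines that share a set fit in the two ways and miss only on first touch, while three lines cycling through the same set miss on every access under this model.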

L2 Cache
Each core of the dual-core and quad-core K8 and K10 processors has its own L2 cache. The L2 cache in K10 remained the same: 512KB per core, 16-way set associative. Per-core L2 caches have their pros and cons compared with the shared L2 cache in Core 2 CPUs. The advantage is the absence of conflicts and competition for the cache when several processor cores are heavily loaded at the same time. The drawback is that less cache is available to each core when only one application is running full throttle.
The L2 cache is exclusive: data stored in the L1 and L2 caches is not duplicated. The L1 and L2 caches exchange data along two unidirectional buses, one for receiving data and one for sending it. In the K8 processor each bus is 64 bits (8 bytes) wide (Pic.5a). This organization moves data between the L1 and L2 caches at a modest 8 bytes per clock. In other words, it takes 8 clock cycles to transfer a 64-byte line, so data delivery to the core is delayed noticeably, especially if two or more lines of the L2 cache are addressed at the same time.
Although it hasn't been confirmed yet, the send and receive buses in the K10 processor will become twice as wide, i.e. 128 bits each (Pic.5b). This should reduce cache access latency significantly when two or more lines are requested at the same time.
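The transfer-time figure follows from dividing the line size by the bus width. The few lines below work it through; the helper name transfer_cycles is made up, and the 128-bit K10 width is the unconfirmed figure quoted in the article, not a verified number.

```python
def transfer_cycles(line_bytes, bus_bits):
    """Clock cycles needed to move one cache line over a bus of the given width."""
    bus_bytes = bus_bits // 8
    return -(-line_bytes // bus_bytes)   # ceiling division


k8_cycles = transfer_cycles(64, 64)      # 64-byte line over a 64-bit bus
k10_cycles = transfer_cycles(64, 128)    # same line over the rumored 128-bit bus
print(k8_cycles, k10_cycles)
```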

L3 Cache
To make up for the relatively small L2 cache, K10 gained a 2MB L3 cache shared between all cores, 32-way set associative. The L3 cache is adaptive and exclusive: it stores all data evicted from the L2 caches of all cores, as well as data shared by several cores. When a core issues a line read request, a special check is performed. If the line is used by only one core, it is removed from L3, freeing room for the line that is evicted from the L2 cache of the requesting core. If the requested line is also used by another core, it remains in the cache; in that case another, older line is removed to accommodate the line evicted from L2.
The L3 cache should help speed up data transfer between the cores. As we have already found out, contemporary Athlon 64 processors exchange data between the cores via the memory bus, so access to shared modified data is much slower. According to AMD's materials, quad-core K10 processors may exchange data via the L3 cache. When a request arrives from one of the cores, the core that holds the modified data copies it to the L3 cache, where the requesting core can read it. Access to modified data in another core's cache should become much faster. When we get a chance, we will certainly check it out.
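The read-request rule can be sketched as a toy model. Everything here is an assumption made for illustration: the class name ToyL3, the per-line "shared" flag, the tiny capacity, and evicting the oldest line first. The article does not describe how the hardware tracks sharing or picks a victim.

```python
from collections import OrderedDict


class ToyL3:
    """Toy exclusive L3: maps line address to a 'shared' flag, oldest line first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def insert_victim(self, addr, shared=False):
        """Accept a line evicted from some core's L2 cache."""
        if addr in self.lines:
            self.lines[addr] = self.lines[addr] or shared
            self.lines.move_to_end(addr)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)       # drop the oldest line to make room
        self.lines[addr] = shared

    def read(self, addr, l2_victim=None):
        """Serve a read; return True on an L3 hit."""
        hit = addr in self.lines
        if hit:
            if self.lines[addr]:
                self.lines.move_to_end(addr)     # shared line stays in L3
            else:
                del self.lines[addr]             # private line moves to the requester
        if l2_victim is not None:
            self.insert_victim(l2_victim)        # requester's L2 evicts a line into L3
        return hit


l3 = ToyL3(capacity=2)
l3.insert_victim("A")                  # used by one core only
l3.insert_victim("B", shared=True)     # used by several cores
hit_a = l3.read("A", l2_victim="C")    # A leaves L3, C takes the freed slot
hit_b = l3.read("B", l2_victim="D")    # B stays, so the older line C is dropped for D
print(hit_a, hit_b, list(l3.lines))
```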
L3 cache latency will evidently be higher than L2 cache latency. However, AMD materials suggest that the latency will vary adaptively depending on the workload: if the workload isn't too heavy, latency will improve, and under heavy workload bandwidth will rise. We still have to check what really stands behind this.

TLB
Besides the caches for instructions and data, processors have one more type of cache: translation lookaside buffers (TLBs). These buffers store the mapping between virtual and physical page addresses obtained from the page tables. The number of TLB entries determines how many memory pages can be touched without additional costly page table walks. This is especially critical for applications that access memory randomly and constantly request data on different pages. The K10 processor has many more translation buffers. For your convenience they are all given in the table below:
As you see from the table, there are many more buffers for translating 2MB pages. Support for large 1GB pages has also appeared, which may be very useful for servers processing large volumes of data. With appropriate OS support, applications using large 2MB and 1GB pages should run considerably faster.

Memory Controller
When the requested data isn't found in any of the caches, the request is issued to the memory controller integrated on the processor die. The on-die location of the memory controller reduces memory latency, but it also ties the processor to a specific memory type, increases the die size, and complicates die selection, which affects production yields. The on-die memory controller was one of the advantages of the K8 processors; however, sometimes it wasn't efficient enough. The memory controller of K10 processors will be improved significantly.
Firstly, it can now transfer data not only along one 128-bit channel, but also along two independent 64-bit channels. As a result, two or more processor cores can work with memory more efficiently at the same time.
Secondly, the scheduling and reordering algorithms in the memory controller have been optimized. The memory controller groups reads and writes so that the memory bus is utilized with maximum efficiency. Read operations have priority over writes. The data to be written is stored in a buffer of still unknown size (it is assumed to accommodate between 16 and 30 64-byte lines). Handling requests in groups avoids constantly switching the memory bus from reading to writing and back, and thus saves resources. This significantly improves performance when read and write requests alternate.
Thirdly, the memory controller can analyze request sequences and perform prefetch.

Prefetch
Prefetch is a definite advantage of K8 processors.
The low-latency integrated memory controller has long let AMD processors demonstrate excellent memory subsystem performance. However, K8 processors failed to prove as efficient with new DDR2 memory, unlike Core 2 with its powerful prefetch mechanism. K8 processors have two prefetch units: one for code and another for data. The data prefetch unit fetches data into the L2 cache based on simple access sequences.
K10 has an improved prefetch mechanism. First of all, K10 processors prefetch data directly into the L1 cache, which hides the L2 cache latency when requesting data. Although this increases the probability of polluting the L1 cache with unnecessary data, especially given the low cache associativity, AMD claims that it is a justified measure that pays off well.
Secondly, AMD implemented an adaptive prefetch mechanism that changes the prefetch distance dynamically, so that data arrives in time and the cache isn't loaded with data that is not needed yet. The prefetch unit became more flexible: it can now train on memory requests at any addresses, not only addresses that fall into adjacent lines. Moreover, the prefetch unit now trains on software prefetch requests.
Thirdly, a separate prefetch unit was added directly into the memory controller. The memory controller analyzes request sequences from the cores and loads the data into the write buffer, utilizing the memory bus in the most optimal way. Saving prefetched lines in the write buffer helps keep the caches clean and reduces data access latency significantly.
As a result, we see that the memory subsystem of K10 processors has undergone some positive improvements. But we have to say that in some respects it still potentially falls behind the memory subsystem of Intel processors.
These are: the absence of speculative loads past stores with unknown addresses, lower L1D cache associativity, a narrower bus between the L1 and L2 caches (in terms of data transfer rate), a smaller L2 cache, and simpler prefetch. Despite all the improvements, Core 2 prefetch is potentially more powerful than K10 prefetch. For example, K10 has no prefetch keyed on instruction addresses that could track individual instructions, and no prefetch from L2 to L1 that could hide L2 latency efficiently enough. These factors can have different effects on various applications, but in most cases they will give Intel processors higher performance.
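
December 10
[Repost] How to hand-solder surface-mount IC packages
Hand-soldering surface-mount components is delicate work. Whether you solder onto a surface-mount prototyping board or onto a fabricated PCB, it demands no less care than repairing a watch, or even performing surgery. I would go so far as to say that whoever does the soldering deserves at least a grade-6 worker rating.
It takes a fair amount of technique, but it can be done by hand; even a novice like me has managed it. Where there is a will, there is a way.
As for tools, a hot-air gun is good to have. It does not help much with the soldering itself, but if you botch a joint you can at least remove the part and try again. For flux, I find a solution of rosin in alcohol works well, because it is tacky and helps hold the IC in place. Tweezers are essential, and so is a magnifying glass, preferably not the green-glass kind. You also need a sponge for wiping the iron tip every so often. A towel for wiping off sweat is optional.
Whether or not the pads are pre-tinned, tin them before soldering, because this layer of solder becomes the main joint surface once the soldering is done. The other joint surface is the tip of each pin. The tinning should be thin, even, and smooth, so that the IC sits flat on it. Lay the iron tip flat, spread the solder along the direction of the pads, then separate any pads that have bridged. Inspect the result when you finish. This matters: if any pads are bridged at this point, everything that follows is wasted effort. Checking after every step is important, because this is delicate work.
Next, straighten the pins. The tweezers I use are the ruling-pen attachment of a drawing compass (the end that is dipped in ink to draw lines); I have not found a better tool. On new and used chips alike, many pins lean to one side. The main goal is an even pitch; pins that will not line up with the pads are very frustrating.
Then place the IC on the pads. You can brush a layer of flux onto the underside of the pins, which helps the undersides take solder, but if the IC is still not aligned after a while, the flux may stick the pins to the pads. With the IC on the pads, align the pins one side at a time. This is where the delicacy really shows: you can hold the IC down and tap it gently to fine-tune its position, and so on. Use whatever method works. Once the pins are aligned, solder the pins at opposite corners to fix the IC in place. Then you can finally take a breather.
Now solder the pins. Brush on flux first, and when it has dried slightly, "brush" molten solder across the pins (along them or across them, either works). Pins do not bridge easily at this stage, but if they do, apply flux again and repeat the pass; this usually draws the bridging solder away. A bridge that forms the second time is harder to remove. To inspect, nudge each pin with a needle and touch up any pin that moves. If all goes well the soldering succeeds in one pass, but it rarely goes that well.
What if solder stays lodged between the pins and will not come out? The correct procedure is: clench your fist and call on the Monkey King. Alternatively, strip some stranded copper wire, dip it in flux, heat it together with the iron tip to melt the bridging solder, and draw the solder out. Solder turns viscous for a moment just before it solidifies, and drawing it out at that moment works best.
Also, get a feel for how the solder behaves while soldering: roughly how long it takes to melt and how long to solidify. Do not overheat the pads, or they will lift off.
How do you remove a surface-mount IC? A hot-air gun is best: heat evenly and pry gently, adding force only once the part starts to give, then repeat from another direction until it comes off. Do not worry about how long it takes; prying the part off by force will strip the pads.
Alternatively, use the wire-pull method: thread a steel wire (an instrument string works well) through the gap behind several pins, heat the pin under tension with the iron, and pull the wire through one pin at a time (you are pulling through the solder underneath, not pulling out the pin or anything else). Do not reheat a pin that has already been freed, or it will solder itself back down.
In short, care is always required.
<a href='http://www.99digital.com/products/smdboard.htm'>www.99digital.com (please credit the source when reposting)</a>

November 23
Failed to load module "ruby.so" (GEMS)
I spent the whole day fighting with ruby.so.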
When I ran load-module ruby in SIMICS,
it showed:
Error loading module 'ruby': Failed to load module 'ruby'
('$home_directory/gems/simics_3_workspace/x86-linux/lib/ruby.so'): "cannot open shared object file: No such file or directory"
At first glance I thought the file was missing or the directory was wrong. However, ruby.so was right where it should be.
Then I used ldd ruby.so to list all the libraries it depends on.
libstdc++.so.6 and libsimics-common.so were not found.
I modified /gems/ruby/module/Makefile
to give the right directories for these two libraries,
and ran make again.
SAD!! The story didn't stop there.
gcc 3.2.3 needs libstdc++.so.5, but gcc 3.4.6 needs libstdc++.so.6.
I used the old gcc 3.2.3 to compile ruby.so, but ruby.so tried to link against libstdc++.so.6.
Somewhere the wrong GCC must have been picked up.
I traced the whole compile procedure and read nearly 10 Makefiles.
Finally, I found that SIMICS has its own Makefiles, and one of them ($GEMS/simics_3_workspace/compiler.mk) specifies which GCC to use. I changed CC=gcc to CC=/usr/bin/gcc (which is the old gcc version 3.2.3).
That resolved the problem.
My conclusions:
If you still want to build GEMS with an old gcc version, there are three places where the gcc and library paths have to be checked:
$GEMS/common/Makefile.common
$GEMS/ruby/module/Makefile
$GEMS/simics_3_workspace/compiler.mk
Reflection:
Finding and solving the problem was a hard process, but the root cause and the conclusion turned out to be very simple.
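The ldd check from this post can be scripted so that unresolved libraries stand out immediately. This is a minimal sketch, assuming Python 3.7+ on Linux with ldd on the PATH; the function names are made up, and the sample text below is an illustrative stand-in for ldd output, not a log captured from the machine in this post.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as '=> not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_shared_object(path):
    """Run ldd on a shared object and list its unresolved dependencies."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=False)
    return missing_libraries(result.stdout)


# Illustrative sample only (hypothetical addresses and paths).
sample = (
    "\tlibstdc++.so.6 => not found\n"
    "\tlibsimics-common.so => not found\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0x00a3c000)\n"
)
print(missing_libraries(sample))
```

Calling check_shared_object on the real ruby.so path would give the same kind of list; a libstdc++ version that does not match the intended compiler is the hint that a different gcc was used for the build.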
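
November 15
Multi-CPU IDs
On an SMP machine with multiple CPUs, the CPU ID numbers are not necessarily contiguous. Getting the IDs wrong caused me a lot of grief. Commonly used related commands: taskset, top, mpstat (prstat on Solaris), sar, ps, iostat.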
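One way to avoid assuming IDs 0..N-1 is to ask the operating system which CPU IDs the process may actually use. This is a minimal sketch, assuming Python 3; os.sched_getaffinity exists on Linux only, and the fallback below itself assumes contiguous IDs, so treat it as a last resort. The helper names are made up.

```python
import os


def usable_cpu_ids():
    """Return the CPU IDs this process may run on, in ascending order."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))   # Linux: the real ID set
    # Fallback for other platforms: assumes IDs are 0..N-1, which may be wrong.
    return list(range(os.cpu_count() or 1))


def has_gaps(cpu_ids):
    """True if the given IDs are not one contiguous run."""
    ids = sorted(cpu_ids)
    return any(b - a != 1 for a, b in zip(ids, ids[1:]))


ids = usable_cpu_ids()
print(ids, has_gaps(ids))
```

Code that pins threads (for example with taskset) can then iterate over the returned list instead of range(number_of_cpus).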

October 31
SUN SPARC
There have been three major revisions of the architecture. The first published revision was the 32-bit SPARC Version 7 (V7) in 1986. SPARC Version 8 (V8), an enhanced SPARC architecture definition, was released in 1990. SPARC V8 was standardized as IEEE 1754-1994, an IEEE standard for a 32-bit microprocessor architecture. SPARC Version 9, the 64-bit SPARC architecture, was released by SPARC International in 1993.
In early 2006, Sun released an extended architecture specification, UltraSPARC Architecture 2005. UltraSPARC Architecture 2005 includes not only the nonprivileged and most of the privileged portions of SPARC V9, but also all the architectural extensions (such as CMT, hyperprivileged, VIS 1, and VIS 2) present in Sun's UltraSPARC processors starting with the UltraSPARC T1 implementation. UltraSPARC Architecture 2005 includes Sun's standard extensions and remains compliant with the full SPARC V9 Level 1 specification. The architecture has provided continuous application binary compatibility from the first SPARC V7 implementation in 1987 into the Sun UltraSPARC Architecture implementations.
The LEON3 is a synthesisable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. The model is highly configurable, and particularly suitable for system-on-a-chip (SOC) designs. The full source code is available under the GNU GPL license, allowing free and unlimited use for research and education. LEON3 is also available under a low-cost commercial license, allowing it to be used in any commercial application at a fraction of the cost of comparable IP cores.

December 23
Can China build an open multiprocessor design platform?
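灵芯
Hennessy and Patterson, authors of the well-known book Computer Architecture: A Quantitative Approach, were interviewed by ACM Queue magazine at the end of 2006 and shared some views on the future of CPU design.
They believe this is an exciting time for CPU architecture researchers, with enormous opportunities. Even a Stanford student might design a new kind of computer that outperforms Intel's. The reason is that traditional architectural techniques are largely exhausted; without truly innovative ideas, computer speed will be hard to improve. Industry is therefore waiting for researchers to come up with new design approaches.
They point out that in 2005 Microsoft showed no interest in parallel computing. For the preceding 20 years CPU speed had doubled every 18 months, so there was no need to worry about parallelism. By 2006 nearly everyone at Microsoft was talking about parallel computing: CPU speed had stopped growing, and people had to consider parallel processing. This shows both that single-CPU development has stalled and that multiprocessor research is urgent. The world has entered the multiprocessor era.
To support research on new processors, and multiprocessors in particular, nearly ten US universities, including UC Berkeley, Stanford, MIT, and Carnegie Mellon, jointly set up the RAMP [2] (Research Accelerator for Multiple Processors) project starting in 2005. The project is an open research platform for multiprocessor design. Its core is a cluster of FPGAs; each FPGA can quickly prototype 10 to 20 processors, and the whole system can quickly realize a multiprocessor of up to one thousand processors. Each processor can be clocked at around 100MHz. That is fairly slow, but still enough to run large software. Because FPGAs can be reprogrammed quickly, the system can be used to experiment with different processor organizations and to test new architectural ideas. It also lets software developers start early on parallel software for multiprocessors: operating systems, compilers, assemblers, debuggers, and applications.
Why target 1000 processors? Because they believe that is how many processors a single chip will be able to hold in the future. The software industry is currently unprepared for parallel computing at that scale: there is no software to match it, let alone programmers who can write parallel programs. The software industry has always waited for hardware to mature before starting development; with RAMP, it can begin developing parallel software and training programmers ahead of time.
RAMP has already attracted several universities planning architecture experiments on it, including the University of Washington's dataflow machine, MIT's transactors project (transaction-level parallelism), Patterson's distributed data center, Stanford's Transactional Memory, and CMU's Reliable Multiprocessor.
RAMP makes low-cost CPU research possible. A new design no longer takes years to realize; it can be sent over the Internet directly to a RAMP system and prototyped within minutes.
RAMP brings a revolutionary change to research on new architectures. So, could China launch a similar project to advance research on next-generation processor design and system software?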
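
Appendix: RAMP research progress and references
RAMP1 was built at Berkeley at the end of 2004 [3]. It is an emulation system made of five Xilinx FPGA chips. Each FPGA is connected to four memory modules totaling 4GB, with a 72-bit data bus. A single FPGA's memory bandwidth peaks at 12.8GBps. Of the five FPGAs, four are used for computation and the one in the middle for control. The whole system has 20GB of memory. Hardware debugging support was implemented as well.
RAMP research in 2006 [4] was divided into three stages. RAMP Red implements Transactional Memory, RAMP Blue builds a cluster from Xilinx MicroBlaze soft cores, and RAMP White implements a shared-memory multiprocessor.
In RAMP Blue [6], each FPGA contains 8 MicroBlaze soft cores and each board carries 4 FPGAs, so 8 boards provide 256 soft cores. The soft cores run at 100MHz. The project originally planned a 1024-processor system [4], which apparently was never realized.
RAMP White [5] proceeds in several phases. The first version uses 64 PowerPC hard cores with one level of cache; the second uses 32-bit soft cores with two levels of cache, the second level shared; the third uses 64-bit soft cores. The core is verified in several ways, including formal verification and simulation. The hardware languages used in RAMP designs are Verilog, Bluespec, and VHDL.
FPGA speed doubles every year and a half. The largest FPGAs can already hold 16 CPUs, so a 1000-CPU system needs only about 60 FPGAs. On the software side, the plan is to support Java first, then C/C++ and other languages. To support configuring different architectures on RAMP, the RAMP Description Language RDL [7] was also developed; it describes the connections and communication between hardware units.
Once completed, a RAMP system is expected to sell for around 100,000 US dollars.
Besides nearly ten top universities in computer architecture, participants in the RAMP project include major companies such as Intel, IBM, Microsoft, HP, Sun, and Xilinx.

References
[1] A Conversation with John Hennessy and David Patterson, ACM Queue, December/January 2006/2007. http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=445
[2] RAMP, http://ramp.eecs.berkeley.edu/index.php
[3] Arvind, Krste Asanovic, Derek Chiou, James C. Hoe, Christoforos Kozyrakis, Shih-Lien Lu, Mark Oskin, David Patterson, Jan Rabaey, John Wawrzynek, RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform.
[4] John Wawrzynek, RAMP Implementation, given on 1/20/2006 at RAMP Mini-Retreat, BWRC. http://ramp.eecs.berkeley.edu/Publications/RAMP%20Implementation.ppt
[5] Krste Asanovic, RAMP White, given on 1/20/2006 at RAMP Mini-Retreat, BWRC. http://ramp.eecs.berkeley.edu/Publications/RAMP-White-20060120.ppt
[6] Dave Patterson, RAMP, given on 8/21/2006 at HotChips 2006. http://ramp.eecs.berkeley.edu/Publications/RAMP%20(Patterson%20HotChips%202006).ppt
[7] Greg Gibeling, Andrew Schultz, Krste Asanovic, The RAMP Architecture & Description Language. http://ramp.eecs.berkeley.edu/Publications/RAMP%20Documentation.pdf

December 15
AMD said to be researching 'reverse multi-threading' tech
By Tony Smith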
Published Tuesday 18th April 2006 10:27 GMT
AMD is working on a way to make a multi-core processor appear to the host operating system as a single-core chip, it has been claimed. If true, the move turns on its head the drive to develop multi-threaded apps the better to take advantage of multiple cores. The technology is aimed at the next architecture after K8, according to a purported company mole cited (http://www.x86-secret.com/?option=newsd&nid=933) by French-language site x86 Secret. It's well known that two CPUs - whether two separate processors or two cores on the same die - don't generate, clock for clock, double the performance of a single CPU. However, by making the CPU once again appear as a single logical processor, AMD is claimed to believe it may be able to double the single-chip performance with a two-core chip or provide quadruple the performance with a quad-core processor. It's the very antithesis of the push for greater levels of parallelism - performing more operations on data simultaneously, in other words - in computer processors. Intel's HyperThreading, for example, was developed to make use of under-utilised processor components to fool the CPU into believing it had two processors at its disposal, not one. Adding more cores just makes these virtual cores real, and retaining the technology allows two cores to appear as four. Of course, better out-of-order execution techniques render HyperThreading - Intel's version of the simultaneous multi-threading (SMT) technique - less important. Put simply, they ensure there are fewer parts of the pipeline going unused at any given time, so there's less performance to be gained by throwing extra threads at the processor. Indeed, Intel's briefings on its next-generation architecture, due to debut in Q3 as the 'Conroe' chip, play down HyperThreading and talk up out-of-order execution. But Conroe, so far as Intel is admitting, still appears as two CPUs to the host OS. So is there anything to be gained by making it appear as just one processor? 
Well, operating systems already do a good job of scheduling hundreds of threads on a single-core CPU let alone a dual- or quad-core part, and AMD may have found that OS is so good at this that it can make up for the apparent reduction in parallelism, particularly in cases where one thread predominates, in a game, for instance. However, by the time the technology ships - if it proves real, and ever becomes more than a lab experiment - the software industry will have had several years focusing on multi-threaded apps, and it may not want to go back.

Intel shows off its quad core
Published: February 10, 2006, 2:30 PM PST
Clovertown, a four-core processor, will start shipping to computer manufacturers late this year and hit the market in early 2007. Clovertown will be made for dual-processor servers, which means that these servers will essentially be eight-processor servers (two processors x four cores each).
Core expansion will be a dominant theme for Intel over the next few years, said Chief Technology Officer Justin Rattner. By the end of the decade, chips with tens of cores will be possible, while in 10 years, it's theoretically possible that chips with hundreds of cores will come out, he added.
Multiplying the number of cores brings distinct advantages. First, it cuts down overall energy consumption for equivalent levels of performance. If the recent Core Duo chips released for notebooks from Intel had only one core, the chips would consume far more power, he said. Integrating processor cores into the same piece of silicon or the same processor package also increases performance by shortening the data pathways. "To go from core to core can be a matter of nanoseconds," Rattner said. "As soon as you move cores together you get an automatic improvement in available bandwidth."
Nonetheless, adding cores requires careful planning. Energy efficiency, data input/output and memory latency (the time it takes data to travel between memory and the processor) will be major issues with each level of core expansion. To get around some of these issues, Intel is conducting research into circuit design and chip architecture as it has in the past. In addition, the company is working with application developers to determine how the architecture of its chips can be optimized.
By working with one server application developer, Intel determined that it needed to make three small changes to the architecture of one of its future server chips. Before the changes, the application only ran well in simulations on chips with up to 16 cores. Beyond that, performance began to decline, Rattner said.
After the changes, performance continued to climb. "We got it to scale well past 32" cores, he said.
Another pending change to chip design, meant to address problems that arise with core multiplication, is Through Silicon Vias, or TSVs. With TSVs, processors and memory chips are stacked up and connected through tiny wires; the top of one chip wires directly into the bottom of another. Currently, chips connect through buses, long data paths that have become as crowded as rush-hour freeways in some computers.
Clovertown and Tigerton are members of a new chip architecture coming from Intel at the end of the year. A notebook chip called Merom and a desktop chip called Conroe coming out around the same time will be based on the same architecture. Intel will give the architecture a name at the Intel Developer Forum taking place in March. "The core growth on the client side will be slower than on the server side," he said. The new chip architecture "is intended for dual and multiple core architectures," he added.
Rattner would not state whether Tigerton and Clovertown contained a single piece of silicon, or two pieces of silicon in a single package. A processor is made of silicon and the package that surrounds it, so either definition could fit. Two pieces of silicon in a single package seems more likely. At around the same time, after all, Intel will release Woodcrest, a dual core server chip based around the same Merom-Conroe-Tigerton-Clovertown architecture. It will contain only two cores and consume 80 watts of power, less than the 165-watt server chips Intel sells now.
Intel has already released one dual core processor that contained two pieces of silicon. While using two pieces of silicon can be cheaper to design and manufacture, some have said dual silicon chips don't provide the same level of performance.