I must confess I've only got a few words and a vague skeleton of what the Professer Lixin Gao said due to the language barrier, during the whole speech Ms Gao spoke English, and the topic is so pedantic which is full of the complicated mathematic formulation.Personally
,我未有涉及过海量数据并行计算以至云计算相关的任何技术,基本上对于海量数据的所有技术都让我觉得非常地cutting-edge。带着学习开拓眼界拓展知识面的良好愿望参加了Ms. Gao的学术讲座。Ms. Gao讲PPT时全是英文啊。我努力地听,努力地听!!!适当地记些笔记。其实还是很喜欢听英语的。
讲座先列举了一些大型互联网公司的数据显示集群计算的规模。比如谷歌在数据中心有一百万台服务器!!!Holy gosh,1,000,000 servers!!!
然后谈到了很多大规模数据的计算对时间的要求非常地严格。于是谈到了目前流行的算法:MapReduce。MapReduce是google提倡的一种编程模型,用于大规模数据集(大于1TB)的并行运算。“Map”和 “Reduce”,和他们的主要思想,都是从函数式编程语言里继承来的,还有从矢量编程语言里借来的特性。他极大地方便了编程人员在不会分布式并行 编程的情况下,将自己的程序运行在分布式系统上。我自己的理解是:map索引整个海量数据,生成key-value pair which is used to facilitate the quick query and computing。而reduce则是用来分析和统筹基于键值对的中间计算结果,使中间结果迅速向最终的正确结果收敛。
Ms. Gao 提出MapReduce虽然是工业界非常popular的解决方案,但是不可以以迭代的方式计算。而Ms. Gao所做的工作即是使MapReduce支持迭代的功能。(A framework that support iterative computing in the cloud.)从事情一般规律的哲学高度看待这个问题,对于规模相当大或者最终结果是相当不明确的问题,似乎都可以用到'迭代'的思想。例如,迭代的软件开发模式则是当今互联网应用的主要开发模式。每一次迭代都使产品向理想中最perfect的状态接近。Ms. Gao用迭代的思想处理海量数据应该也是同样的道理罢!
Ms. Gao的解释说改进后支持迭代的iMapReduce可以使map后的数据支撑reduce,reduce的数据亦转而支撑map,形成一个迭代反复的过程,其中在map阶段加入static data。较之原始的MapReduce,新的算法使得static data的规模显著地减少。Ms. Gao展示了几个图表证明改进后增加了迭代功能的MapReduce(称之为 iMapReduce )的速度优于原始的MapReduce.图表中有展示用iMapReduce计算PageRank的值。PageRank?是不是判断网站权重的那个东东啊?哥的网站http://ykyi.net的PageRank才0啊!!!才0啊~~用什么算法可以算成9啊!
PPT转到计算图的指定两点最短路径的Dijkstra算法。泪流满面,终于有一个我懂的算法了啊!Ms. Gao说先找到两个结点间的最短路径,于是赋于这个最短路径经过的结点最高的权重,其它的路径就赋于相对低的权重。提出了PriorityQueue的数据结构。把权重引入到iMapReduce算法,称之为pMapReduce算法。引入权重后的pMapReduce算法比iMapReduce算法的速度更快。生活中我们做事情也要优先处理最重要的事情,道理是一样的啊!世间万事万物,无论多么复杂,抽象出来的一般规律其实都是相当地一致。易经中说:形而下者谓之器,形而上者谓之道!Oops, The ancient Tao philosophy is full Of Wisdom.
最后阶段,再次展示了一个图表,Ms. Gao完成的建立在Apache的Hadoop上的pMapReduce要明显的优于原始的Hadoop。并且也优先其它多种解决方案,其它的名字我都没听过啊!在提问阶段,尽管温武少老师积极鼓励同学提问但没有同学提出问题,大概大家都不懂吧!朝红阳老师和另一位研究海量数据的老师谈了一点点。他们提到iterative的算法在他们的实际应用中效果并不是太明显,不如memory-based的解决方案。Ms. Gao调出一张图表,解释说某种memory-based的解决方案没有迭代的算法快。
作为众多打酱油的同学中的一员,我最后的心得体会是:学术真是太高深了啊!我还是做工程吧!!!车,我有得拣咩?
/////////////////////////
Topic
Distributed Framework for Iterative Computations on Massive Datasets
Speaker
Prof. Lixin Gao
Abstract
Iterative algorithms are pervasive in many applications such as search engine algorithms,machine learning, and recommendation systems. These applications typically involve a dataset of massive scale. Fast iterative computations of the massive datasets are essential for these applications. This is particular important for on-line query such as keyword based search query. In this talk, we present an overview of MapReduce framework, and propose two frameworks, iMapReduce and pMapReduce, that enable fast iterative computations. By providing the support of iterative computations and prioritized execution, we can ensure faster convergence of the iterative process. Both iMapReduce and pMapReduce preserve the MapReduce distributed computing framework and is particularly efficient for online queries such as top-k queries. We implement iMapReduce and pMapReduce based on Apache Hadoop and evaluate its performance. Our evaluation results show that pMapReduce can reduce the computation time by two orders of magnitude comparing to that achieved with MapReduce. At the end of the talk, I will provide an overview of on-going projects in my research group.
Bio
Lixin Gao is a professor of Electrical and Computer Engineering at the University of Massachusetts at Amherst. She received her Ph.D. degree in computer science from the University of Massachusetts at Amherst in 1996. Her research interests include social networks, and Internet routing, network virtualization and cloud computing. Between May 1999 and January 2000, she was a visiting researcher at AT&T Research Labs and DIMACS. She was an Alfred P. Sloan Fellow between 2003-2005 and received an NSF CAREER Award in 1999. She won the best paper award from IEEE INFOCOM 2010 and her paper in ACM Cloud Computing 2011 was honored with “Paper of Distinction”. She received the Chancellor’s Award for Outstanding Accomplishment in Research and Creative Activity in 2010, and is a fellow of IEEE.
Date and Venue
Dec.16(Friday)13:30pm-14:30pm
Lecture Theater A101, School of Software, Sun Yat-sen University