
ideas's Introduction


A collection of my everyday opinions and ideas 💡. See the issues for details.

ideas's People




hertz-zh cxz007

ideas's Issues

Hadoop MR job submission flow

jobSubmitter.submitJobInternal(job, cluster)

submitJobInternal() does five things:

  • Check the input and output
  • Compute the input splits
    splitSize = computeSplitSize(blockSize, minSize, maxSize);

    computeSplitSize = Max(minSize, Min(maxSize, blockSize))

    mapreduce.input.fileinputformat.split.maxsize    defaults to Long.MAX_VALUE

    mapreduce.input.fileinputformat.split.minsize    defaults to 1

    blockSize defaults to 128 MB
  • Set up the DistributedCache
  • Copy the jar/conf to mapred_system_dir on HDFS
    mapreduce.job.dir equals the job staging dir + job_id
  • Submit to the JobTracker and monitor the status
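
The split-size rule above can be sketched in a few lines of Python. This is only an illustration of the formula in these notes, not Hadoop's code: the names `compute_split_size` and `count_splits` are made up here, and the split count is a plain ceiling division that ignores the extra slack the real FileInputFormat allows on the last split.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, the default for split.maxsize


def compute_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """max(minSize, min(maxSize, blockSize)), as written in the notes."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Simplified split count: ceiling of file_size / split_size (assumption)."""
    return -(-file_size // split_size)


# With the defaults, the split size is simply the block size.
default_split = compute_split_size(128 * MB)
# Lowering maxsize below the block size yields smaller splits (more mappers).
smaller_split = compute_split_size(128 * MB, max_size=64 * MB)
# Raising minsize above the block size yields larger splits (fewer mappers).
larger_split = compute_split_size(128 * MB, min_size=256 * MB)
```

So with the default settings a 1 GB file gives `count_splits(1024 * MB, default_split)` = 8 splits. Changing the number of map tasks means changing minsize or maxsize.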

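The relationship between reduce and filter/map


What map and filter do is a subset of what reduce can do, so why do most FP languages still provide these two functions?

Here is an example of something reduce can do but map cannot (which shows that reduce is more powerful than map):


Clojure code

  (reduce
    (fn [l i]
      (if (= 0 (rem i 2))
        (list (cons i (first l))
              (first (rest l)))
        (list (first l)
              (cons i (first (rest l))))))
    '(() ())
    (range 0 10))


((8 6 4 2 0) (9 7 5 3 1))

map cannot do this in a single pass.


  • reduce feels hard to parallelize. It seems stateful: it has to keep intermediate results.
  • I remember an earlier discussion: any function that is both commutative and associative is suitable for parallel execution.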
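
To make the "subset" claim concrete, here is a small Python sketch using `functools.reduce`. The helper names are made up for illustration. It builds map and filter out of reduce, then repeats the even/odd split from the Clojure example.

```python
from functools import reduce


def map_via_reduce(f, xs):
    # The accumulator is the output list built so far.
    return reduce(lambda acc, x: acc + [f(x)], xs, [])


def filter_via_reduce(pred, xs):
    # Keep x only when the predicate holds; otherwise pass acc through.
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])


def split_even_odd(xs):
    # Same idea as the Clojure code: one pass, two accumulators,
    # each new item is prepended (like cons), so the order is reversed.
    def step(acc, i):
        evens, odds = acc
        if i % 2 == 0:
            return ([i] + evens, odds)
        return (evens, [i] + odds)

    return reduce(step, xs, ([], []))


evens, odds = split_even_odd(range(10))
```

map and filter still earn their place. They say what is intended more directly, and each element is handled independently, which is why they parallelize more easily than a general reduce with a threaded accumulator.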




This is a question I had during an interview and I found the answer with a small hint from the interviewer which was "How do you compare two numbers?" (it really helped).

Here is the explanation:

Let's say I have 100 numbers (you can easily replace that with n, but the example works better if n is even). I split the list into 50 pairs. For each pair I make one comparison, which tells me the smaller and the larger of the two (50 comparisons so far). Then I just have to find the minimum of the 50 minimums (49 comparisons) and the maximum of the 50 maximums (49 comparisons as well), so in total we make 50+49+49=148 comparisons. We're done!

Remark: to find the minimum we proceed as follows (in pseudocode):

    min = myList[0];
    for (int i = 1; i < n; i++)
        if (min > myList[i]) min = myList[i];
    return min;

And we find it in (n-1) comparisons. The code is almost the same for the maximum.
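
Here is a minimal Python sketch of the pairing trick with an explicit comparison counter, so the 148 figure can be checked. The function name and the counting convention are my own. It processes the pairs in a single loop instead of building two lists of 50, which costs the same number of comparisons.

```python
def min_max_pairwise(values):
    """Return (minimum, maximum, number_of_comparisons)."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    comparisons = 0
    if n % 2 == 0:
        # Even length: the first pair seeds both lo and hi with one comparison.
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    else:
        # Odd length: the first element seeds both, no comparison needed.
        lo = hi = values[0]
        start = 1
    for i in range(start, n - 1, 2):
        a, b = values[i], values[i + 1]
        comparisons += 1  # compare the two elements of the pair
        small, large = (a, b) if a < b else (b, a)
        comparisons += 1  # smaller one against the running minimum
        if small < lo:
            lo = small
        comparisons += 1  # larger one against the running maximum
        if large > hi:
            hi = large
    return lo, hi, comparisons


data = [(i * 37) % 101 for i in range(100)]  # 100 distinct numbers
lo, hi, count = min_max_pairwise(data)
```

For n = 100 this is 1 + 49 × 3 = 148 comparisons, against 2 × 99 = 198 for two separate scans.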

Buffer vs. Cache

Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.

Since memory is, unfortunately, a finite, nay, scarce resource, the buffer cache usually cannot be big enough (it can't hold all the data one ever wants to use). When the cache fills up, the data that has been unused for the longest time is discarded and the memory thus freed is used for the new data.

Disk buffering works for writes as well. On the one hand, data that is written is often soon read again (e.g., a source code file is saved to a file, then read by the compiler), so putting data that is written in the cache is a good idea. On the other hand, by only putting the data into the cache, not writing it to disk at once, the program that writes runs quicker. The writes can then be done in the background, without slowing down the other programs.

Most operating systems have buffer caches (although they might be called something else), but not all of them work according to the above principles. Some are write-through: the data is written to disk at once (it is kept in the cache as well, of course). The cache is called write-back if the writes are done at a later time. Write-back is more efficient than write-through, but also a bit more prone to errors: if the machine crashes, or the power is cut at a bad moment, or the floppy is removed from the disk drive before the data in the cache waiting to be written gets written, the changes in the cache are usually lost. This might even mean that the filesystem (if there is one) is not in full working order, perhaps because the unwritten data held important changes to the bookkeeping information.

Because of this, you should never turn off the power without using a proper shutdown procedure or remove a floppy from the disk drive until it has been unmounted (if it was mounted) or after whatever program is using it has signaled that it is finished and the floppy drive light doesn't shine anymore. The sync command flushes the buffer, i.e., forces all unwritten data to be written to disk, and can be used when one wants to be sure that everything is safely written. In traditional UNIX systems, there is a program called update running in the background which does a sync every 30 seconds, so it is usually not necessary to use sync. Linux has an additional daemon, bdflush, which does a more imperfect sync more frequently to avoid the sudden freeze due to heavy disk I/O that sync sometimes causes.
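
The ideas above (LRU eviction, write-through versus write-back, and sync) can be sketched as a toy Python class. This is an illustration only, not how any real kernel implements its buffer cache. The class name, the dict standing in for the disk, and the method names are all assumptions.

```python
from collections import OrderedDict


class BufferCache:
    def __init__(self, disk, capacity, write_back=True):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.disk = disk              # dict: block number -> data ("the disk")
        self.capacity = capacity
        self.write_back = write_back
        self.blocks = OrderedDict()   # cached blocks, least recently used first
        self.dirty = set()            # cached blocks not yet written to disk

    def _touch(self, block, data):
        # Mark the block as most recently used, then evict if over capacity.
        self.blocks[block] = data
        self.blocks.move_to_end(block)
        while len(self.blocks) > self.capacity:
            old, old_data = self.blocks.popitem(last=False)
            if old in self.dirty:     # a dirty block must reach disk first
                self.disk[old] = old_data
                self.dirty.discard(old)

    def read(self, block):
        data = self.blocks[block] if block in self.blocks else self.disk[block]
        self._touch(block, data)
        return data

    def write(self, block, data):
        self._touch(block, data)
        if self.write_back:
            self.dirty.add(block)     # defer the disk write
        else:
            self.disk[block] = data   # write-through: hit the disk at once

    def sync(self):
        # Flush every pending write, like the sync command.
        for block in sorted(self.dirty):
            self.disk[block] = self.blocks[block]
        self.dirty.clear()
```

In write-back mode, anything still in `dirty` is lost if the "machine" dies before `sync()` runs or the block is evicted. That is exactly the risk the text describes.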



A Little Java, A Few Patterns: reading notes

Chap 1. Modern Toys

  • int and boolean are both types
  • What is a type?
    • A type is a name for a collection of values.
    • Sometimes we use it as if it were the collection.

The first bit of advice

When specifying a collection of data, use abstract classes for datatypes and extended classes for variants.

public abstract class NumD {
    public String toString() {
        return "new " + getClass().getName() + "()";
    }
}

public class Zero extends NumD {}

public class OneMoreThan extends NumD {
    private NumD predecessor;

    public OneMoreThan(NumD predecessor) {
        this.predecessor = predecessor;
    }

    public String toString() {
        return "new " + getClass().getName() + "(" + predecessor + ")";
    }
}


An Interview with Mickey Petersen, author of Mastering Emacs


Who are you, and what do you do?

I'm Mickey Petersen. I live in London, UK.

I'm a professional software developer, and I have been programming since I was around 10 years old.  I did not have friends or family who knew much about computing, so I had to learn everything myself, from scratch.

How did you get interested in that in the first place?

I cut my teeth programming C in Turbo C for DOS and moved on to Delphi for Windows some years later, whilst at the same time trying to get a grip on this fairly new-fangled thing called Linux. Back then you had to go through all manner of hoops to even get it: I think I got mine from a CD that a friend of a family member had. It would've been far too large for my meagre 33.6k dialup modem connection to even attempt to download it from the web.

It was a Red Hat distro and I distinctly remember spending an eternity printing out the manual – as I would otherwise not be able to even *install* Linux, as I knew nothing about it – and then a long time figuring out how to install and use it. FVWM95 was the window manager, meant to look like Windows 95, and it was a great experience "running Linux" and using tools that would never work on DOS or Windows at the time. Back then Linux was the 'cool' thing and Windows and DOS.... not so much!

I tried programming C on Linux, and I remember trying Emacs back then. It had this funky green colour scheme; pretty sure it was a Red Hat X Resources thing at the time. But I could be wrong. Nevertheless, my flirtation with Emacs did not last. At the time it was just another tool in a succession of editors I experimented with. I probably settled on a graphical one that shipped with Red Hat as it had, you know, things like region selection and syntax highlighting enabled by default. Emacs could of course *do* both, but they weren't enabled by default back then.

Along the way I experimented with all manner of packages, window managers, and more. They took ages to compile, but back then – as a kid/teen – you had oodles of time, so it didn't really matter. But it laid the foundation for my interest in Linux and much more.

Many years later I would, during my time in uni, pick up Emacs. That time it stuck. I was a member of my university's computer science society, and the Dewey decimal tribunes who held court and sway in that society, were keen to let all and sundry know that Real Hackers Used Vim. Not Emacs; not ed(1), kate, or any other editor; just Vim. I never did buy into groupthink – and certainly not from someone scarcely older than myself – so I went with Emacs, as I'd at least played around with it many years before.

At the time I did not really know what you could, or could not, do with Emacs. I mostly navigated with arrow keys, a handful of key bindings, and the menu bar. I went with XEmacs, as it was generally ahead of GNU Emacs in the early noughties.  As my coursework in uni involved a never-ending succession of LaTeX and various common and obscure programming languages, Emacs was a great choice. It had syntax highlighting for almost any language you could think of, and although I did not know about some of the more obvious features (comint, shells, etc.) I at least had a tool capable of running on all major platforms and with a consistent experience.

XEmacs had its downsides, though. It was falling behind and had its own way of doing things that was not entirely compatible with GNU Emacs. I eventually moved to GNU Emacs when, I think, Emacs 22 came out.

At some point during my time with Emacs back then, a light bulb went on in my mind – something that I know now, having written and taught people Emacs for many years, is a frequent occurrence – that I finally understood enough about Emacs to not feel lost. I could look up commands and keys; install and edit code; and even write some elisp!

I'd begun experimenting with Org mode, so I started a file called (blogs were all the rage back then!)  to capture all the things I knew and I wish others did too. That would then morph into Mastering Emacs.

Since graduating uni, I've been a professional developer. I build bespoke software for clients around the world — with Emacs as my trusty editor, of course!

What resources would you recommend for people that are interested in what you do?

For programming? Gosh, there's too much. Back in the day it was actually really hard to learn programming as you'd need books, the web/internet, or know someone who knows a bit about it. Today, it's infinitely easier to get started — though I think it's equally hard sticking to it, and becoming proficient!

I found your work through Mastering Emacs, a phenomenal site – and book (written in Emacs, of course) – that helped me design my Emacs workflow (more so as a writer than a developer). Emacs can be intimidating for first-time users. Why should they choose it over another text editor?

Thank you! I'm glad you like both. That was exactly why I started the site.

Well, you're a writer using Emacs, and I think that is interesting.

I firmly believe that a significant proportion of Mastering Emacs readers are not professional "techies" (be it system administrators, developers, testers, etc.) but either tech-adjacent or work in fields where they are expected to have some technical proficiency in their field – a dab of Fortran or Python here, a pinch of LaTeX there – and use Emacs primarily as a tool to connect disparate areas of their work that other, non-Emacs users cannot easily mimic. Editing code is easy; there's myriad editors, including Emacs of course, that can do this. But there aren't many tools to track bibliography, your agenda, email, notes, and writing. But Emacs can easily do all of that, and much more.

Some Emacs users learn it because it's a "tax" they have to pay to work in certain academic circles or commercial environments where it's the only one available, or widely used. Something my cohort in University discovered when our lecturers would hand-wave away questions like "What should we use for editing Prolog?" with "Emacs."

So I think that people should learn Emacs if they want greater control – or freedom (also in the FOSS sense) – to mould their environment and tools to their liking. Not everyone does; if you dislike tinkering and tweaking, then Emacs is harder to sell. But to those of us who have had to use an application only to find that its keyboard shortcuts get in the way (or are missing altogether); or that one key that you use that does not work in some modal dialogues; or the frustration when you have to multi-task between umpteen tools – we find comfort in Emacs, because we are imbued with a tool capable of adapting itself to our needs. Emacs is a crucible.

What are some areas where Emacs could improve, either for longtime users or for newcomers?

Hm, this is a good question.

Emacs is written and designed for people who already know Emacs. That's not so great if you don't know Emacs; but it sure is if you do. Emacs opts to replace a low skill-ceiling (and anaemic key bindings and features) with a very high one (exceptionally powerful key bindings, programmability, etc.), because if you persevere then you'll eventually learn enough to benefit from an editor that does not hamstring its users.

But that simile applies to a range of things: no matter how many books, videos or power tools you buy, you won't become a master cabinet maker overnight. It takes skill and practice. It's just that we associate "text editor" with, well... notepad. Emacs is much more than just that.

Emacs is already much friendlier than it used to be. Better defaults; more sensible inclusions that ship with Emacs. Emacs 29 adds tree-sitter and Eglot, two tools of great import to coders, that should further reduce the friction for someone keen to experiment with Emacs without having to spend a weekend learning how to set it up.

The hardest thing for newcomers – and I say this as someone who did not think to do this myself as a newbie – is to read the manual. It's right there on the splash screen, or conveniently located in the help menu. But all too many "experts" recommend you hide the splash screen, and turn off the tool and menu bar. I fell victim to that advice when I was new also. It was terrible advice. Why hide something that helps you learn and explore?

Many suggest changing the key bindings or Emacs's unique vocabulary, but I think it's window dressing, and it won't alter the learning curve much, if at all.

So my suggestion is this: alter the tutorial (C-h t) so it's interactive, prettier, and more detailed. It should neatly segue into other important areas of Emacs. There's a wide range of users: prose writers; note takers; coders; command line hackers; etc. Emacs is more than capable of this interactivity, and yet the tutorial makes no use of it. Emacs should be a bit more firm in its advice to newbies.

What are some of the Emacs specific workflows that help you get your work done (packages, changes from defaults, etc.)?

For me it's the ability to program Emacs when I need to. I had to write some e-mail filters – sieves, as they're known – for an e-mail server. That was tedious as I had to test that they worked; what emails they'd affect (lest I screw up badly and ransack my emails); and then against particular e-mails to make sure the filter works properly for that particular e-mail.

I wrote a handful of lines of code that glues various parts of Emacs together to do this. I press a button and Emacs connects to the remote server with TRAMP and calls the program it needs to call, and then displays the result in an Emacs buffer.

So that's the most important one: adaptation to changing requirements.

I use mostly stock Emacs key bindings, with a handful of changes to make certain things more bearable. M-o instead of C-x o; C-x C-k to kill the current buffer; F1 opens M-x shell; and a handful of other minor things.

For productivity-related stuff I use Helm a lot for specific tasks. I can call up a Mastering Emacs customer using Helm and find their sales details. Great for when people forget their email or need to change something. It's a great completion framework. I also use IDO for files and buffers and Selectrum for general-purpose completion.

Besides Emacs, what tools & gear do you use (hardware, software, or anything else that comes to mind)?

I use a ZSA Moonlander Mark 1. It's one of those fancy 'mechanical' keyboards. It's quite nice. It's programmable and extensible, and it's more comfortable than normal keyboards. I used a MS Ergonomic keyboard for about 20 years and I'd literally wear them out in about 2 years.

I occasionally do some computer gaming. So I tend to overbuy every couple of years so I don't have to care about upgrading much for the next several years. So my primary workstation is an uber-high-specced desktop (that also doubles as a space heater) with a 39" ultrawide monitor. The monitor I love. I used to use dual monitors, but... eh. This is way better. One enormous Emacs frame that I can easily split into multiple windows.

Besides the tools, what habits & routines help you finish your work?

I rarely finish my work. Unless someone's paying me, that is!

I am a habitual starter-of-projects, finisher-of-few. Half-baked, half-inventions is how I generally term the stuff I do. I tend to build something out until I'm satisfied I've sated whatever silly intellectual curiosity I have, and then I drop it like a rock, as it's rarely perfect enough for me to release.

My projects folder is full of these things.

How do you relax or take a break?

I set my own working hours, as I generally work on my terms. For clients my work is a case of agreeing the scope of what needs doing, and then I get on with it. But it's unlikely to follow a 9-5 schedule, per se. So when I want a break, I get up and walk around. Living in London affords me the ability to do all manner of cultural stuff, if that is what I feel like.

I've realised the key to my happiness is small bouts of things that bring me joy: a cup of coffee; a nice walk; it's the little things. I also adore cooking and do it daily with my girlfriend. We both enjoy food and cooking.

Whose work inspires or motivates you, or that you admire?

Hm, you know, it's a good question. I self-motivate, I think, mostly. I know it's common for people to look upon the works of others for inspiration, and I think that is probably true of me as well. But it's more of an ethereal thing for me: it's a range of things – concepts, ideas – that drive me, and less so any particular person. So when I sit down and half-invent something, it's because of that.


Build your own JS engine



create table self_check_new_rules (
    job_name        varchar(100),
    rule_id         int primary key auto_increment,
    rule_name       varchar(200),
    db_type         varchar(100),
    db_name         varchar(100),
    table_name      varchar(100),
    definition      varchar(100),
    where_condition varchar(100),
    group_by        varchar(100),
    compare         varchar(50),
    op              varchar(10),
    range_min       double,
    range_max       double
);

# view
select `r`.`job_name` AS `job_name`, `r`.`rule_id` AS `rule_id`, `r`.`rule_name` AS `rule_name`,
       `m`.`db_type` AS `db_type`, `m`.`db_name` AS `db_name`, `m`.`table_name` AS `table_name`,
       `m`.`definition` AS `definition`, `m`.`where_condition` AS `where_condition`,
       `m`.`group_by` AS `group_by`, `m`.`compare` AS `compare`,
       `r`.`op` AS `op`, `r`.`range_min` AS `range_min`, `r`.`range_max` AS `range_max`
from (`self_check_metric` `m`
      join `self_check_rule` `r` on ((`r`.`left_metric_name` = `m`.`metric_name`)))
where (left(`m`.`metric_name`, 6) = 'yester');

scala tutorials


abstract class Tree
case class Sum(l: Tree, r: Tree) extends Tree
case class Val(n: String) extends Tree
case class Const(v: Int) extends Tree

type Environment = String => Int

// Evaluate an expression tree, looking variables up in env
def eval(t: Tree, env: Environment): Int = t match {
    case Sum(l, r) => eval(l, env) + eval(r, env)
    case Val(n)    => env(n)
    case Const(v)  => v
}

val exp = Sum(Val("x"), Sum(Val("y"), Const(4)))
println("x + (y + 4) = " + eval(exp, {case "x" => 1 case "y" => 2}))


object Timer {
    // Call cb once per second, forever
    def oncePerSecond(cb: () => Unit) = {
        while (true) {
            cb()
            Thread sleep 1000
        }
    }

    def main(arg: Array[String]) = {
        oncePerSecond(() => {
            println("time flies...")
        })
    }
}


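Scheme, ML, Haskell, Java: what types mean to each of them

((if (= 0 n)
     5
     (lambda (x) (+ x 7)))
 6)

The program above is a legal Scheme program, but it has a serious problem: when n is 0 it evaluates to (5 6), and obviously we cannot call 5 as a procedure.

ML and Haskell can infer types for you,
while Java forces you to specify types explicitly.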


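Some dirty but useful scripts


# Replace \t with four spaces in .py files (GNU sed)
find . -name '*.py' -exec sed -i 's/\t/    /g' {} +

# Delete .log files under ~/logs/ that were last modified more than 3 days ago
find ~/logs/ -type f -mtime +3 -name \*.log -delete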

Clojure cheatsheet

$ java -cp ~/bin/clojure-1.8.0.jar:. clojure.main <src>.clj

/**
 *   Copyright (c) Rich Hickey. All rights reserved.
 *   The use and distribution terms for this software are covered by the
 *   Eclipse Public License 1.0 (
 *   which can be found in the file epl-v10.html at the root of this distribution.
 *   By using this software in any fashion, you are agreeing to be bound by
 *   the terms of this license.
 *   You must not remove this notice, or any other, from this software.
 **/

package clojure;

import clojure.lang.Symbol;
import clojure.lang.Var;
import clojure.lang.RT;

public class main{

final static private Symbol CLOJURE_MAIN = Symbol.intern("clojure.main");
final static private Var REQUIRE = RT.var("clojure.core", "require");
final static private Var LEGACY_REPL = RT.var("clojure.main", "legacy-repl");
final static private Var LEGACY_SCRIPT = RT.var("clojure.main", "legacy-script");
final static private Var MAIN = RT.var("clojure.main", "main");

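public static void legacy_repl(String[] args) { /* body omitted in these notes */ }

public static void legacy_script(String[] args) { /* body omitted in these notes */ }

public static void main(String[] args) { /* body omitted in these notes */ }
}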

JVM GC algorithms

GC algorithms in JDK 7

  • Serial Collector
    Function: uses a single thread for both minor and major collections. Simplest.
    Recommended for: single-processor machines
    How to enable: -XX:+UseSerialGC
  • Parallel Collector (Throughput Collector)
    Function: uses multiple threads for minor collections.
    Recommended for: multi-processor machines / enterprise-class applications
    How to enable: -XX:+UseParallelGC. To enable parallel major collection, add -XX:+UseParallelOldGC
  • CMS Collector (Concurrent Mark and Sweep Collector)
    Function: performs most of its GC work concurrently with the application.
    Recommended for: applications that cannot tolerate long GC pauses
    How to enable: -XX:+UseConcMarkSweepGC
  • G1 Collector
    Function: strives to collect from the heap regions that contain the most garbage.
    Recommended for: most enterprise-class applications. Thorough testing required before implementing.
    How to enable: -XX:+UseG1GC

Become a Java GC Expert


JVM heap tuning parameters


Commands in the JDK

  • jstat -gcutil <pid>
  • jmap -heap <pid>



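1. Keep reading continuously

... seems like more. The strategy of pressing on in one sustained effort applies to reading as well.

2. Keep thinking

The ancients said long ago that "learning without thinking leads to confusion" (学而不思则罔). Reading for long stretches without taking time to think about why and how will make you degenerate into a

3. Keep doing exercises


4. Keep taking reading notes at each stage.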





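  1. Work experience
  2. Fit with the company culture
  3. Coding skills
  4. Analytical ability



  1. Practicing only on a computer

  2. Not rehearsing behavioral questions


  3. Not doing mock interviews


  4. Trying to memorize answers

  5. Not talking through your approach out loud

  6. Rushing

  7. Sloppy code

  8. Not testing

  9. Fixing mistakes carelessly

  10. Giving up too easily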


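Let the Bullets Fly for a While: the intent and tipping points of ** tech regulation
The text below is based on a Google translation.

Last month I was away for a few days while researching Didi. But this is an era in which days feel like years, especially in ** tech. Only days after Didi Global's US$4.4 billion IPO on Nasdaq, its apps were pulled from online stores at the request of the Cyberspace Administration of ** (CAC). The reason cited was violations in the collection of personal data.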




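Tech platforms pose a significant challenge to the legitimacy of nation states. They are becoming de facto institutions that not only provide utilities critical to citizens' lives but also set the rules of the game by which society operates. Facebook sets content moderation policy for a third of the world's population. Twitter and others deplatformed the former US president, reducing him to a digital persona non grata. These are powerful private entities, both monopolies and public goods, yet consumer welfare is not a core part of their agenda. Lawmakers have gradually become more aware of this, which is why governments on three continents are re-evaluating the impact and influence of tech giants on their citizens. The techlash is global.

As a developing country with an imperfect institutional framework, ** has some additional problems to solve in tech regulation. If we take the US and European regulatory systems as the benchmark, ** lags in drafting and implementing basic laws. **'s Anti-Monopoly Law was first passed in 2007, roughly a century after the US Sherman Act of 1890 and the Clayton Act and Federal Trade Commission Act of 1914. It is also worth noting that Alibaba was founded in 1999, Tencent in 1998, and Baidu in 2000, all of them before the Anti-Monopoly Law. The law by itself was not enough: the State Administration for Market Regulation (SAMR) was established in April 2018 to cover enforcement comprehensively.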







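To fully understand **, you need to watch a brilliant dark film called Let the Bullets Fly (《让子弹飞》). Since its release in 2010, the story of a bandit turned fake governor in feudal Goose Town has become a staple of memes in ** cyberspace. It is full of sayings about the rules and boundaries of power, money, and legitimacy in **, and it is a cultural touchstone.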







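Allowing growth and innovation while protecting the interests of society and consumers is a delicate balance. Regulators don't always get it right. In the mobile telecom price wars of the early 2000s they were too soft. During the P2P lending boom around 2015 they acted too late. In the sharing-economy era, people's deposits with ofo were lost before regulations were in place. Each time they learn from the previous experience, improve their oversight capabilities, and do better in the next round. Each time the pattern is innovation ahead of regulation, with the tipping point being the deterioration of consumer welfare. The result is that these companies survived but changed: Lufax, one of the best-known P2P lenders, shelved its IPO plans in 2018, also because of regulatory concerns. It successfully rebranded itself as a business lender and listed on Nasdaq in November 2020. Ant Group is restructuring but has been allowed to keep operating.


  • Tech platforms have shifted to value extraction rather than innovation to drive growth: with overall growth in ** internet users slowing and most large markets already digitized, tech giants must focus on increasing spending by existing users in order to grow. COVID-19 cemented the platforms' place in people's lives, but users are feeling the squeeze. People resent price discrimination (Didi and other platforms charging long-time customers higher prices for the same product), the harsh conditions workers face, and the platform tax merchants must pay for traffic and attention to their goods. When the size of the pie becomes fixed, every player switches to extraction mode and consumers suffer.

  • Rebalancing toward a more innovative ecosystem: the technology goals in the 14th Five-Year Plan are ambitious. AI, quantum computing, semiconductors, and genetic research: the technology growth drivers of the future will be deep tech, not e-commerce. Dan Wang observes that **'s future priorities are manufacturing, economic growth, and the real economy. I wouldn't say the existing tech giants stifle innovation, but I'm not sure how much they help. The shadow war between Alibaba and Tencent takes up a great deal of air and capital. Data is partitioned into walled gardens. The number of new startups in ** is falling every year. When big tech companies copy or acquire competitors instead of innovating, the ecosystem suffers.

  • Additional political clout behind the regulators: ** regulators are not monolithic. Narratives around ** tech are often distilled down to individuals, whereas in ** it is really a story about systems and the competing factions within those systems. It is clear that a shift in power has taken place and that there is determination behind the reforms. Was it Jack's speech, or regulators feeling that their Douyin and Taobao links needed to be viewable inside WeChat? Either way, if the commentary around the Didi data investigation is anything to go by, regulators have the backing of the ** public.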


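I am optimistic about the long term but cautious about the short term. The goal of regulation is not to stifle innovation but to redraw the boundaries within which private companies can operate to maximize profit. What use is a dead company to anyone? Especially when it handles something as essential as a modern utility. That said, there is a long backlog of regulatory enforcement. ** tech companies must settle their technical debt and dues, knowing that the CAC and SAMR are watching closely. The bullets have stopped flying; the action has begun.

I actually have more material from my research on this topic (including a breakdown of the remits of SAMR, the CAC, and others, and the potential implications for future overseas listings), and would be happy to share it and host a Q&A thread for premium subscribers in the ** Characteristics Circle community. While you're there, vote for this month's product walkthrough too! The current leader is Kuaishou, but only by a narrow margin.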



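This article takes a deep dive into the ** semiconductor supply chain in two parts. It started as niche research into ** semicap (the equipment needed to make semiconductors) but has grown to cover the broader ecosystem. Part 1 discusses the broader ecosystem and why ** is so urgently seeking independence. Part 2 will look at **'s role in this context, who the different ** players are, and their respective trajectories.

This is an extremely complex industry with a large literature on every area and angle, and this article draws on a variety of experts and sources. I have tried to link to as many of them as I could. If I missed you, message me on Twitter and I'll link to you as soon as possible.

Many thanks to mule, Dylan, Chris, Jon, and Dan Wang.

Chip costs are rising. Fewer and fewer players can compete. The market is becoming less cyclical. A surge in demand from AI and 5G is coming. ** has a large mismatch between local supply and demand.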


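COVID-19 has created a need for redundancy in many global supply chains. However, **'s push for in-house integrated circuit (IC) manufacturing capability is nothing new. It has been mentioned, at least nominally, in every Five-Year Plan (5YP) going back to the 1970s. This time really is different, though. The 14th Five-Year Plan is the first to emphasize complete self-reliance and to propose building a near end-to-end chain locally. It is also the first time ** has been in a strong enough position nationally to fund the attempt, and the first time it has treated it as a matter of national security.



TL;DR: semiconductors are the tiny conducting materials that make up the building blocks of integrated circuits, basically the little chips that power everything that runs on electricity. So far the name of the game has been "cram more components onto integrated circuits". This is Moore's Law: the number of transistors on an IC doubles roughly every two years. As is widely known, Moore's Law is now dead.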
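
As a rough worked example of the doubling rule quoted above, a few lines of Python show how quickly it compounds. This is a sketch only: the function name is made up, the two-year period is the one stated in the text, and real transistor counts never followed the curve exactly.

```python
def transistor_growth_factor(years, doubling_period_years=2.0):
    """Multiplier on transistor count after `years`, assuming steady doubling."""
    return 2.0 ** (years / doubling_period_years)


after_one_period = transistor_growth_factor(2)   # one doubling
after_a_decade = transistor_growth_factor(10)    # five doublings
after_two_decades = transistor_growth_factor(20) # ten doublings
```

Five doublings in a decade is a 32× increase and ten doublings in two decades is 1,024×.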

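The idea that we could double the number of transistors per chip every few years led companies to take that growth for granted. Optimization was expected to happen at the level of chip dimensions, not at the level of code or process. As we will see, physics limits the continued shrinking of chips, so better performance increasingly has to come from process improvements, packaging, and software. All of this multiplies the cost of producing chips (Figure 1).

Figure 1: Production cost increases by process node (SemiEngineering, 2020)

Figure 2: Making chips at the leading edge is getting harder (Capensis Capital 3Q2020 Letter)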

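Historically, the chip industry has been highly cyclical. Manufacturers have had to deal with exponential improvements in technology, manage some of the most complex engineering known to humanity, and bear extremely high R&D costs. In the past, all of this combined into a boom-and-bust roller coaster. Chipmakers had to work hard to time the development of foundries, because new chips (known as process nodes) arrived with long lead times, supply constraints, and demand surges. This is exacerbated by the fact that "leading-edge" chips (the really small ones, say <7nm) are harder to produce than the "lagging-edge" chips of a few years earlier. Because of these constraints, many elements of the supply chain hit their peaks and troughs at different times.

Figure 3: Besi investor presentation, June 2021.

Over the years, the trio of high capital expenditure, cyclical investment, and extreme process complexity has led to consolidation in the industry. Few companies now have the IP, talent, ecosystem, or capital to compete. On the other hand, demand has surged over the past five years. The promises of AI/ML, IoT, 5G, and many other acronym-laden buzzword technologies have started to materialize.

It is into this new industry, where global monopolists control all supply, that ** brings roughly US$310 billion of demand. Total global chip sales in 2020 were US$440 billion. On the surface ** accounts for 70% of global demand, but about half of that is exported from ** to the rest of the world after packaging and assembly. This is why ** is so keen on technological independence. It has a long-standing mismatch between local supply and demand, it is competing for world leadership in technologies that depend on semiconductors, and chips are a recognized chokepoint that the US (which makes most of the equipment used in chip fabrication) holds over it.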
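
The 70% figure follows directly from the two dollar amounts quoted above, which a short Python check makes explicit. The variable names are illustrative, and the 0.5 re-export fraction is just the text's "about half", not a precise statistic.

```python
local_demand_usd = 310e9   # demand figure quoted in the text
global_sales_usd = 440e9   # 2020 global chip sales quoted in the text
reexport_fraction = 0.5    # "about half" is re-exported after packaging/assembly

headline_share = local_demand_usd / global_sales_usd
retained_share = headline_share * (1 - reexport_fraction)
```

The headline share comes out at about 70%. The share that stays in the local market after re-exports is closer to 35% of global sales.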




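Figure 4: Randy Abrams, Credit Suisse (2020)


This article (Part 1) focuses on digging deeper into what ** is trying to gain independence from, and why it wants independence. Part 2 will dig into who the main players in ** are, how they expect to progress, and when they can be expected to compete with the current leading monopolists.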



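  • Electronic design automation (EDA)

  • Chip design (fabless manufacturers)

  • Fabrication plants (integrated device manufacturers and foundries)

  • Equipment ("semicap")

  • Outsourced semiconductor assembly and test (OSAT)



Figure 5: Global logic-chip semiconductor supply chain (Vineyard Holdings)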

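In short, the people who design the chips that go into smartphones, cars, and laptops (the "fabless manufacturers") do so in EDA software. If they can both design and manufacture chips, they are called integrated device manufacturers. Samsung and Intel are the two giants here.


Anyone who can manufacture chips can do so only because they buy ultra-precise equipment from a handful of suppliers. This equipment is known as semicap, short for Semiconductor Capital Equipment. Foundries combine the equipment with specialty materials and process know-how, and out pop the chips.

The chip still needs to be assembled, tested, and packaged. Because this is a different competency from fabrication, foundries outsource this part of the process to specialists. These are the OSATs. They rely on a different set of suppliers for test equipment. Once a chip has been designed, fabricated, tested, packaged, and assembled, it is ready for use.

Because many nodes are literally atomic in size, the companies that handle them have developed expertise in their fields that cannot be replicated. In many cases a few misplaced atoms render the whole product unusable. For example, the mirrors needed for ASML's extreme ultraviolet lithography (EUV) machines are polished to a smoothness of less than one atom's thickness. To put that in perspective, if the mirror were the size of Germany, the tallest "mountain" would be just 1 mm high. This is the industry ** is trying to disrupt.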


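  • **Large nodes (>180nm)** are typically analog. They are usually used to take non-binary inputs and convert them to binary (for example, how the sensors in an EV "see" the road and relay that information to the rest of the system).

  • **Medium nodes (28nm-180nm)** are mostly logic nodes. They are what most people think of when they think of chips: the nodes that handle computation, like CPUs and GPUs.

  • **Small nodes (10nm to 22nm)** are split 80/20 between memory and logic respectively. Most memory chips are NAND (the permanent storage in SSDs) and DRAM (the temporary storage your laptop uses to keep all those Chrome tabs open).

Many buzzword technologies depend on leading-edge nodes. "Leading edge" technically means anything from the 1nm node promised by TSMC to the 10nm node in the iPhone 8. In practice, though, the most advanced node currently produced at scale is the 5nm node in the 5G-capable iPhone 12.

The minimum requirement for deploying 5G at scale is the 5nm node, so its production underpins progress in AI, 5G, and most data-center development. There is also a whole debate about whether computation will happen at the "edge" (where the data originates; not to be confused with "leading edge") or in the cloud. Either way, leading-edge chips will be the picks and shovels for building this infrastructure.


As for medium-sized nodes, these are mainly used for memory chips. There are three dominant memory players globally: Micron in the US, and SK Hynix and Samsung in Korea. Memory products are fully commoditized. They are also some of the worst products from a unit-economics standpoint: companies have little visibility into end users (creating shorter order cycles) and have historically competed fiercely on cost. Because memory chips need both capacitors (small charge stores) and transistors (small gates that switch on or off), they face tighter size limits than transistor-only logic chips. Memory chips are physically almost limited to >10nm. I recommend reading Andrew Rosenblum's Q3 2020 investor letter for a deeper look. The summary is that the only way for memory producers to increase supply is to add new production lines, by buying more equipment from semicap suppliers and training more staff to run the lines.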




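Figure 6: We are running into the scale limits of semiconductor development across the board. This is the death of Moore's Law (mule, 2020)

The whole supply chain outlined above, including fabless designers, foundries, and OSATs, is mainly the supply chain for logic chips. As the infrastructure of computation, these chips are the biggest focus of the ** industry and also the hardest to replicate. Broadly speaking, here is another picture for you.

Figure 5 (again, but different): Global logic-chip semiconductor supply chain (Vineyard Holdings)

Figures 7 and 8 below show regional semiconductor demand and supply respectively. Note how large **'s demand is, and how small its supply is everywhere except OSAT. We'll come back to this later, but a quick note for now: **'s 16% market share in manufacturing is really mostly Semiconductor Manufacturing International Corporation (SMIC). SMIC is rumored to be producing, at best, some 7nm nodes (mainly ASICs for cryptocurrency mining), but at scale it is still at 14nm and above. Companies usually don't skip a process node, typically going from 14nm to 10nm and then to 7nm. After seeing the impact of US sanctions on Huawei, SMIC appears intent on leaping from 14nm to 7nm.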




Figure 7: Semiconductor demand by region, BCG/SIA (2021)


Figure 8: Semiconductor supply by geography and segment, BCG/SIA (2021)



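Cadence Design Systems is one of the leading design-software vendors and, together with Synopsys, holds a quasi-duopoly in the market. Figure 9 below shows its growth in **. The implication of this growth is that demand for design software in ** is growing faster than in the other regions Cadence serves. mule wrote a good piece on Cadence's Q3 2020 earnings call that explains this further.

Figure 9: Cadence financial results (2020-2021)

So far I have avoided going into detail about how chips are actually made. It is somewhat beyond the scope of this article, but here is a 13-minute video from Infineon. If you are interested in more, ASML has a good explanation on its website. On average a fab has around 500+ machines, and there are roughly 1,000+ steps in the chipmaking process. Because semiconductor chips are among the smallest things humans have ever worked with, even a tiny speck of dust can ruin them. Most fabs go to great lengths to avoid this and spend heavily on air handling. The average fab is about 100-1,000 times "cleaner" than most hospital operating rooms.

At a high level, modern chips are "little skyscrapers" (ASML's analogy, not mine). They start as silicon wafers, first coated with photoresist, a light-sensitive polymer that dissolves when exposed. The chip goes through hundreds of cycles of exposure, baking the unexposed photoresist to develop the pattern, and etching away the material not protected by photoresist, before the photoresist is finally removed and the wafer processed.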




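Figure 10: Front-end semiconductor manufacturing process (ASML Annual Report, 2020)

Figure 11 below shows the dominance of certain companies in semicap. Applied Materials (AMAT), Lam Research (LAM), ASML, Tokyo Electron (TEL), and KLA-Tencor are the key players worth noting. Of these five, only Tokyo Electron and ASML are outside the US.

Of those two, ASML's one-of-a-kind 180-ton EUV machine is the key enabler for making anything leading-edge. About 25 of these machines are produced per year, most of them going to TSMC, at a cost of about US$130 million each. So far, partly because of its reliance on US components, partly because of geopolitics, and partly because of its relationships with the rest of the US-led supply chain, ASML has been barred from selling EUV machines to SMIC and other ** foundries.


Figure 11: Gartner and Bernstein analysis (2019)

Most of what I described above is the "front end" of semicap. The "middle end" and "back end" are mostly advanced packaging (wafer-level) and traditional packaging. This is mainly the territory of KLA-Tencor, Teradyne, and FormFactor. It involves equipment for metrology, packaging, assembly, and test. For those interested, I recommend mule's article on advanced packaging and Christopher Seifel's industry overview.

For a (dated) discussion of the relative costs of each stage of the development chain, I recommend gwern's discussion of Moore's Law and this research paper by Brown and Linden. The cost of each advanced fab is estimated at between US$5 billion and US$15 billion, doubling roughly every four years.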





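Because of this, semiconductor manufacturing has become a national priority. Over the past decade, the number of companies registered as semiconductor companies has grown by more than 700% (Figure 12). Both state and private institutions are pouring money into building this capability. This is not just an executive order driven by **. After Washington barred Huawei from using the EDA platforms of Cadence and Synopsys, there is also considerable private concern inside ** companies about whom the US might ban next.

So what would motivate ** to pour US$73 billion into a single industry? Partly the same thing that is motivating TSMC to invest about US$100 billion over three years to expand research and capacity: demand is enormous. In **'s case, however, it is partly also strategic policy.


Figure 12: Number of "semiconductor enterprises" in ** (Financial Times, 2020)


Thematically, the industry is preoccupied with the idea of US-** decoupling. While the US relies heavily on ** for low-end production (such as most of the critical medical equipment and N95 masks used for COVID-19 relief), ** in turn depends on the US for semiconductors. As mentioned, most semicap and fabless companies are based in the US, and most foundries prioritize their relationship with the US, but most of the demand comes from **.

How the two countries navigate trade and these dynamics is beyond the scope of this article. The Asia Society has a series of panel discussions on YouTube in which policymakers and industry leaders discuss the future of the US and **. I recommend the technology episode, which discusses semiconductors. Ray Dalio's The Changing World Order offers a good reading of how power has shifted from one country to another over the centuries. The conclusion of Dalio's book is that power does shift, and that as it shifts some key developmental patterns emerge.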




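Figure 13: The pattern of rise and decline of world powers (Ray Dalio, 2020)


Figure 14: ** is developing along a different arc (Ray Dalio, 2020)


Given how concentrated industry power is in the US today, **'s bid for power needs to be secured further. Looking at Figure 15, it is easy to see why ** views in-house semiconductor capability and a secure supply as intrinsically tied to its economic and national security. This is not unreasonable: in recent years US policy has increasingly targeted the vulnerabilities of **'s supply chain. It is a chicken-and-egg situation. ** wants to internalize because the US wants to check **'s growing strength. The US wants to stop ** from internalizing because doing so makes ** stronger.

Figure 15: Semiconductor companies by revenue (Bloomberg, 2019)

Before COVID-19 took over the headlines, the "tech cold war" was all the rage. Most recently, Huawei and several other ** computing giants were barred from using US equipment to make their in-house processors. Because Lam, KLA, and AMAT are essentially monopolists here, and TSMC (Huawei's largest partner) depends heavily on them for semicap, Huawei has essentially been shut out of the chip industry.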



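"**'s 14th Five-Year Plan includes a technology-centric focus that mentions blockchain and fintech. It would be easy to write these off as buzzwords, but the government's listing of 4IR technologies as strategic priorities signals a strong top-down push for global leadership in many of these areas. ** knows it has a competitive advantage in manufacturing and wants to extend it by developing these use cases. Distributed ledger technology can improve supply chains, cloud computing can connect different points of the value chain (for context, see: Beyond WeChat: Manufacturing), and AI and IoT progressively automate production."

You can find the Chinese-language version of the 14th 5YP here, with English annotations highlighting the semiconductor sections, courtesy of Covington Research. Readers of ** Characteristics may already be familiar with the plan. The 14th plan covers several specific areas of semiconductor manufacturing that will receive special attention:

  • IC design tools (EDA)

  • Semiconductor equipment and materials (semicap)

  • Advanced memory technologies

  • Wide-bandgap semiconductors (such as silicon carbide or gallium nitride), which are potential contenders for the next wave of semiconductor process nodes

  • Advanced manufacturing "clusters"

In Part 2 of this article I will unpack some of the lower-level policies that stem from the 5YP.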


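Many sources give many reasons for the unprecedented growth in semiconductor demand. These charts from Lam Research's investor day capture the industry's _zeitgeist_ well. As for current demand, KPMG's 2021 industry outlook shows some of the drivers. It is pretty much what you would expect: anything with technology in it depends on semiconductors to some degree.

Globally, we are generating more data than ever, training better AI/ML models than ever, and, with the rollout of 5G, using more things than ever to collect, send, and receive data. This is the foundation of the Internet of Things. Advances in autonomous vehicles, deep learning, robotics, industrial automation, data center demand, cloud computing, AR/VR, and cryptocurrency keep accelerating.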




Figure 16: Lam Research investor day (2020; figures not to scale)




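Figure 17: KPMG Global Semiconductor Industry Outlook (2021)

Vladimir Putin remarked in 2017 that "whoever becomes the leader in artificial intelligence will become the ruler of the world." Kai-Fu Lee, author of AI Superpowers, argues: _"AI will change the world more than anything in the history of mankind. More than electricity."_

China currently leads the world in AI publications and patents, accounting for about 28% of all research papers published on the subject worldwide. Research alone does not confer a lasting advantage, but data generation, data management, and computer science talent can.

Between Alibaba, Tencent, and ByteDance, China's computer science ecosystem is the best in the world. Privacy concerns also carry far less weight when the state mandates something. China's vast databases (compiled mainly by its tech giants) are a key reason it is at the forefront of AI development [4].

According to this Seagate report, the global growth rate of generated data is about 26%; China's is about 30%. By 2025, China will generate and capture more data than any other country, driven mainly by its sheer number of connected devices.

Seagate projects that by 2025 every connected person in the world (about 75% of the total population by then) will interact with data more than 4,900 times per day, roughly once every 18 seconds. To quote Westfield Capital Management: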

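“90% of the data available in the world today was generated in the past 2 years, and it is expected to grow to 180 zettabytes (that is, 21 zeros) by 2025. To put a zettabyte in context, storing just one zettabyte would take 1,000 data centers, or about 20% of Manhattan's land area.”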
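
The two figures quoted above are easy to sanity-check with a few lines of arithmetic. The sketch below is only an illustration: the function name is made up for this example, and it assumes the SI definition of a zettabyte (10^21 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60

# SI definition, assumed here: 1 zettabyte = 10**21 bytes ("21 zeros").
ZETTABYTE = 10 ** 21


def seconds_between_interactions(interactions_per_day: int) -> float:
    """Average gap between interactions if they were spread evenly over a day."""
    return SECONDS_PER_DAY / interactions_per_day


# 4,900 interactions per day is one roughly every 18 seconds.
print(round(seconds_between_interactions(4900), 1))  # 17.6

# 180 zettabytes expressed in bytes.
print(f"{180 * ZETTABYTE:.1e} bytes")  # 1.8e+23 bytes
```

So "roughly every 18 seconds" is a rounding of about 17.6 seconds.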

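All of this data is critical to AI research, which is why China's technological independence matters so much to its economy. Kaplan et al. have a white paper on how neural language models (an AI technique) scale. To summarize, performance depends on three factors:


  1. the number of parameters,

  2. the size of the dataset, and

  3. the amount of compute available.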


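The idea, as laid out here, is that once we find a scalable architecture that, like the brain, can be applied fairly uniformly, we can simply train ever-larger neural networks and more sophisticated behavior will emerge naturally. More powerful neural networks are "just" scaled-up weak ones, much as the human brain looks a lot like a scaled-up primate brain.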




Figure 18: GPT-3, the most powerful AI trained to date, is the red dot (~3,624 petaflop/s-days; OpenAI via mule; 2020)






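Figure 19: Equation for future compute consumption (mule, 2020)


To improve the future supply of compute, we have historically relied on geometric shrink and power scaling, but we have hit a wall on both. Chips, at least those built on silicon, will not get much smaller than a hypothetical 1 nm. So either we build chips on something other than silicon (such as GaN or graphene), which means we need an entirely new technology stack, or we make progress through heterogeneous computing (as we have in the progression from CPU to GPU to ASIC).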




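Figure 20: Progress in heterogeneous computing (mule, 2020)

Heterogeneous computing is a fancy way of saying building purpose-specific chips: it works by making chips that are really good at one thing and nothing else. Fuchs and Wentzlaff studied the likely limits of this approach; the TL;DR is that different applications will see different rates of diminishing returns, but the returns shrink over time for all of them. There are limits here too.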



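Currently, many AI/ML projects are constrained more by input/output speed than by logic processing. Solving this means having faster memory. That could come in the form of automated memory caching, but more likely it means growth in NAND and DRAM. Either way, if a company or country wants to develop competitive AI, and China does, then it _must_ control its own supply of compute and memory.

We have covered the "what" and the "why". Next we will look at the "who", the "how", and the "when". This is a nuanced space in which different parties all want different things. That will be covered in Part 2.

That concludes Part 1. We discussed the global semiconductor industry, the high costs, challenges, and constraints of developing a fully independent supply chain, and the various processes that would need to be built. This is what _China_ is trying to disrupt.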



Part 2 of this article will attempt to untangle the dynamics of China's emerging industry. Until then, thank you for reading.


CAP Confusion: Problems with ‘partition tolerance’

by Henry Robinson, April 26, 2010
The ‘CAP’ theorem is a hot topic in the design of distributed data storage systems. However, it is widely misused. In this post I hope to highlight why the common ‘consistency, availability and partition tolerance: pick two’ formulation is inadequate for distributed systems. In fact, the lesson of the theorem is that the choice is almost always between sequential consistency and high availability.
It’s very common to invoke the ‘CAP theorem’ when designing, or talking about designing, distributed data storage systems. The theorem, as commonly stated, gives system designers a choice between three competing guarantees:

  • Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.
  • Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.
  • Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?

This is often summarised as a single sentence: “consistency, availability, partition tolerance. Pick two.” Short, snappy and useful.

At least, that’s the conventional wisdom. Many modern distributed data stores, including those often caught under the ‘NoSQL’ net, pride themselves on offering availability and partition tolerance over strong consistency; the reasoning being that short periods of application misbehavior are less problematic than short periods of unavailability. Indeed, Dr. Michael Stonebraker posted an article on the ACM’s blog bemoaning the preponderance of systems that are choosing the ‘AP’ data point, and arguing that consistency and availability are the two to choose. However, for the vast majority of systems, I contend that the choice is almost always between consistency and availability, and unavoidably so.

Dr. Stonebraker’s central thesis is that, since partitions are rare, we might simply sacrifice ‘partition-tolerance’ in favour of sequential consistency and availability – a model that is well suited to traditional transactional data processing and the maintenance of the good old ACID invariants of most relational databases. I want to illustrate why this is a misinterpretation of the CAP theorem.

We first need to get exactly what is meant by ‘partition tolerance’ straight. Dr. Stonebraker asserts that a system is partition tolerant if processing can continue in both partitions in the case of a network failure.

“If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups.”

This is actually a very strong partition tolerance requirement. Digging into the history of the CAP theorem reveals some divergence from this definition.

Seth Gilbert and Professor Nancy Lynch provided both a formalisation and a proof of the CAP theorem in their 2002 SIGACT paper. We should defer to their definition of partition tolerance – if we are going to invoke CAP as a mathematical truth, we should formalize our foundations, otherwise we are building on very shaky ground. Gilbert and Lynch define partition tolerance as follows:

“The network will be allowed to lose arbitrarily many messages sent from one node to another”

Note that Gilbert and Lynch’s definition isn’t a property of a distributed application, but a property of the network in which it executes. This is often misunderstood: partition tolerance is not something we have a choice about designing into our systems. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or you lose availability (because you detect the error and shut down the system until the error condition is resolved). Partition tolerance means simply developing a coping strategy by choosing which of the other system properties to drop. This is the real lesson of the CAP theorem – if you have a network that may drop messages, then you cannot have both availability and consistency, you must choose one. We should really be writing Possibility of Network Partitions => not(availability and consistency), but that’s not nearly so snappy.
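
To make the trade-off concrete, here is a minimal toy sketch of a two-replica store, not a model of any real system. The class name, the `"CP"`/`"AP"` mode labels and the `demo` helper are all invented for this illustration. During a partition, the consistency-first mode refuses writes, while the availability-first mode accepts them and lets the replicas diverge.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when the consistency-first store refuses a write."""


@dataclass
class TwoReplicaStore:
    # Hypothetical labels: "CP" refuses writes during a partition,
    # "AP" accepts them on the local replica only.
    mode: str
    partitioned: bool = False
    replicas: list = field(default_factory=lambda: [{}, {}])

    def write(self, replica: int, key: str, value: int) -> None:
        if not self.partitioned:
            # Healthy network: every replica sees the write.
            for store in self.replicas:
                store[key] = value
        elif self.mode == "CP":
            # Keep the replicas identical by giving up availability.
            raise Unavailable("other replica unreachable")
        else:
            # Stay available by letting the replicas diverge.
            self.replicas[replica][key] = value

    def read(self, replica: int, key: str):
        return self.replicas[replica].get(key)


def demo(mode: str):
    store = TwoReplicaStore(mode)
    store.write(0, "x", 1)
    store.partitioned = True  # the network starts dropping messages
    try:
        store.write(0, "x", 2)
        store.write(1, "x", 3)
    except Unavailable:
        return "unavailable", store.read(0, "x"), store.read(1, "x")
    return "available", store.read(0, "x"), store.read(1, "x")


print(demo("CP"))  # ('unavailable', 1, 1)
print(demo("AP"))  # ('available', 2, 3)
```

In the `"CP"` run both replicas still agree on `x`, but the writes fail. In the `"AP"` run every write succeeds, but the two sides now hold different values.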

Dr. Stonebraker’s definition of partition tolerance is actually a measure of availability – if a write may go to either partition, will it eventually be responded to? This is a very meaningful question for systems distributed across many geographic locations, but for the LAN case it is less common to have two partitions available for writes. However, it is encompassed by the requirement for availability that we already gave – if your system is available for writes at all times, then it is certainly available for writes during a network partition.

So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition from Gilbert and Lynch: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.

This is why defining P as ‘allowing partitioned groups to remain available’ is misleading – machine failures are partitions, almost tautologously, and by definition cannot be available while they are failed. Yet, Dr. Stonebraker says that he would suggest choosing CA rather than P. This feels rather like we are invited to both have our cake and eat it. Not ‘choosing’ P is analogous to building a network that will never experience multiple correlated failures. This is unreasonable for a distributed system – precisely for all the valid reasons that are laid out in the CACM post about correlated failures, OS bugs and cluster disasters – so what a designer has to do is to decide between maintaining consistency and availability. Dr. Stonebraker tells us to choose consistency, in fact, because availability will unavoidably be impacted by large failure incidents. This is a legitimate design choice, and one that the traditional RDBMS lineage of systems has explored to its fullest, but it implicitly protects us neither from availability problems stemming from smaller failure incidents, nor from the high cost of maintaining sequential consistency.

When the scale of a system increases to many hundreds or thousands of machines, writing in such a way as to allow consistency in the face of potential failures can become very expensive (you have to write to one more machine than failures you are prepared to tolerate at once). This kind of nuance is not captured by the CAP theorem: consistency is often much more expensive in terms of throughput or latency to maintain than availability. Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores.
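
The write-cost arithmetic in the paragraph above can be sketched in a few lines. The function names are hypothetical. The majority-quorum helpers assume a simple "more than half the nodes" rule and are not taken from ZooKeeper's or HDFS's actual code.

```python
def copies_needed(tolerated_failures: int) -> int:
    """Copies a write must reach so that one survives the given number of failures."""
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures must be non-negative")
    return tolerated_failures + 1


def majority_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def failures_tolerated_by_majority(cluster_size: int) -> int:
    """Nodes that can fail while a strict majority is still reachable."""
    return cluster_size - majority_quorum(cluster_size)


# Tolerating 2 simultaneous failures means writing 3 copies, which is why
# a third failure can make the data unavailable.
print(copies_needed(2))  # 3

# A 5-node majority-quorum cluster writes to 3 nodes and survives 2 failures.
print(majority_quorum(5), failures_tolerated_by_majority(5))  # 3 2
```

The cost grows with the number of failures to be tolerated, which is why small clusters make writing to a quorum relatively cheap.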

Further Reading
*For more on the inevitability of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services.

Daniel Abadi has written an excellent critique of the CAP theorem.

James Hamilton also responds to Dr. Stonebraker’s blog entry, agreeing (as I do) with the problems of eventual consistency but taking issue with the notion of infrequent network partitions.


jvm gc, memory stat

$ jstat -gcutil -h4 <pid>  [<interval> <count>]
| Column | Description |
| --- | --- |
| S0 | Survivor space 0 utilization as a percentage of the space's current capacity. |
| S1 | Survivor space 1 utilization as a percentage of the space's current capacity. |
| E | Eden space utilization as a percentage of the space's current capacity. |
| O | Old space utilization as a percentage of the space's current capacity. |
| P | Permanent space utilization as a percentage of the space's current capacity. |
| YGC | Number of young generation GC events. |
| YGCT | Young generation garbage collection time. |
| FGC | Number of full GC events. |
| FGCT | Full garbage collection time. |
| GCT | Total garbage collection time. |
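
As a rough sketch of how this output might be consumed, the snippet below parses `jstat -gcutil` text into one dictionary per sample, keyed by the column names above. The sample text and its numbers are made up for illustration, and `parse_gcutil` is a hypothetical helper, not part of any JDK tool. The header row repeats when `-h` is used, so the parser skips any row that starts with `S0`.

```python
# Illustrative sample only: the numbers are invented, not real measurements.
SAMPLE = """\
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  42.10  63.50  18.20  59.90     12    0.345     2    0.410    0.755
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
 37.80   0.00  12.40  18.90  59.90     13    0.371     2    0.410    0.781
"""


def parse_gcutil(text: str) -> list:
    """Turn jstat -gcutil output into a list of {column: value} dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S0":
            # Header line; it repeats every N samples when -h is used.
            header = fields
            continue
        if header is None:
            raise ValueError("data row seen before any header row")
        rows.append({name: float(value) for name, value in zip(header, fields)})
    return rows


for row in parse_gcutil(SAMPLE):
    print(f"old gen {row['O']}% used, total GC time {row['GCT']}")
```

In practice the text would come from running the command itself, for example through `subprocess.run`, rather than from a hard-coded string.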

Spark RDD: back to the source

Resilient Distributed Datasets

Formally, an RDD is a read-only, partitioned collection of records.



  • iterative algorithms
  • interactive data mining

Fault tolerance

RDDs provide a restricted form of shared memory, based on coarse grained transformations rather than fine-grained updates to shared state.

RDD Abstraction

RDDs can only be created through deterministic operations on either

  1. data in stable storage or
  2. other RDDs.

We call these operations transformations to differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join.

Finally, users can control two other aspects of RDDs: persistence and partitioning.


Representing RDDs

In a nutshell, we propose representing each RDD through a common interface that exposes five pieces of information:

  1. a set of partitions, which are atomic pieces of the dataset;
  2. a set of dependencies on parent RDDs;
  3. a function for computing the dataset based on its parents;
  4. metadata about its partitioning scheme; and
  5. data placement.

For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on. Meanwhile, the result of a map on this RDD has the same partitions, but applies
the map function to the parent’s data when computing its elements.
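
The five-part interface and the HDFS/map example above can be sketched as a toy in plain Python. This is only an illustration of the idea, not Spark's API: the class and method names (`ToyRDD`, `BlockFileRDD`, `MappedRDD`, `preferred_locations`, and so on) are hypothetical, and the "file" is just a list of in-memory blocks.

```python
from typing import Callable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical base class exposing the five pieces of information."""

    def partitions(self) -> List[int]:
        raise NotImplementedError

    def dependencies(self) -> List["ToyRDD"]:
        return []

    def compute(self, partition: int) -> Iterator:
        raise NotImplementedError

    def partitioner(self) -> Optional[str]:
        return None  # no partitioning scheme known by default

    def preferred_locations(self, partition: int) -> List[str]:
        return []


class BlockFileRDD(ToyRDD):
    """Stands in for an HDFS file: one partition per block, with known hosts."""

    def __init__(self, blocks: List[list], hosts: List[List[str]]):
        self._blocks = blocks
        self._hosts = hosts

    def partitions(self) -> List[int]:
        return list(range(len(self._blocks)))

    def compute(self, partition: int) -> Iterator:
        return iter(self._blocks[partition])

    def preferred_locations(self, partition: int) -> List[str]:
        return self._hosts[partition]


class MappedRDD(ToyRDD):
    """Result of a map: same partitions as the parent, computed on demand."""

    def __init__(self, parent: ToyRDD, fn: Callable):
        self._parent = parent
        self._fn = fn

    def partitions(self) -> List[int]:
        return self._parent.partitions()

    def dependencies(self) -> List[ToyRDD]:
        return [self._parent]

    def compute(self, partition: int) -> Iterator:
        # Apply the function to the parent's data for this partition.
        return map(self._fn, self._parent.compute(partition))

    def preferred_locations(self, partition: int) -> List[str]:
        return self._parent.preferred_locations(partition)


lines = BlockFileRDD([["a", "bb"], ["ccc"]], [["host1"], ["host2"]])
lengths = MappedRDD(lines, len)
print([list(lengths.compute(p)) for p in lengths.partitions()])  # [[1, 2], [3]]
```

Because `MappedRDD` stores only its parent and a function, any partition can be recomputed from the parent's data, which is the property the lineage-based recovery described earlier relies on.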