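Concepts

A signal is a software interrupt. Most non-trivial applications need to handle signals. Every signal has a name, and all names begin with the three characters SIG. In the header file <signal.h>, each signal name is defined as a positive integer constant (the signal number).

All signal names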

chris@ubuntu:~/myspace/myblog$ kill -l
 1) SIGHUP	 2) SIGINT	 3) SIGQUIT	 4) SIGILL	 5) SIGTRAP
 6) SIGABRT	 7) SIGBUS	 8) SIGFPE	 9) SIGKILL	10) SIGUSR1
11) SIGSEGV	12) SIGUSR2	13) SIGPIPE	14) SIGALRM	15) SIGTERM
16) SIGSTKFLT	17) SIGCHLD	18) SIGCONT	19) SIGSTOP	20) SIGTSTP
21) SIGTTIN	22) SIGTTOU	23) SIGURG	24) SIGXCPU	25) SIGXFSZ
26) SIGVTALRM	27) SIGPROF	28) SIGWINCH	29) SIGIO	30) SIGPWR
31) SIGSYS	34) SIGRTMIN	35) SIGRTMIN+1	36) SIGRTMIN+2	37) SIGRTMIN+3
38) SIGRTMIN+4	39) SIGRTMIN+5	40) SIGRTMIN+6	41) SIGRTMIN+7	42) SIGRTMIN+8
43) SIGRTMIN+9	44) SIGRTMIN+10	45) SIGRTMIN+11	46) SIGRTMIN+12	47) SIGRTMIN+13
48) SIGRTMIN+14	49) SIGRTMIN+15	50) SIGRTMAX-14	51) SIGRTMAX-13	52) SIGRTMAX-12
53) SIGRTMAX-11	54) SIGRTMAX-10	55) SIGRTMAX-9	56) SIGRTMAX-8	57) SIGRTMAX-7
58) SIGRTMAX-6	59) SIGRTMAX-5	60) SIGRTMAX-4	61) SIGRTMAX-3	62) SIGRTMAX-2
63) SIGRTMAX-1	64) SIGRTMAX
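The number-to-name table above can also be read from a program. The following is a minimal sketch, assuming Python 3 on Linux; the helper names `signal_table` and `catch_once` are illustrative and not part of the original post. It lists the standard signals and shows a handler receiving a signal the process sends to itself.

```python
import os
import signal
import time


def signal_table():
    """Map signal numbers to names, like a compact `kill -l` (real-time signals excluded)."""
    return {int(sig): sig.name for sig in signal.Signals}


def catch_once(signum):
    """Install a temporary handler, send signum to this process, and return what arrived."""
    received = []

    def handler(num, frame):
        # Runs in the main thread when the signal is delivered.
        received.append(num)

    previous = signal.signal(signum, handler)
    try:
        os.kill(os.getpid(), signum)
        # Give the interpreter a moment to run the Python-level handler.
        deadline = time.monotonic() + 1.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signum, previous)
    return received


if __name__ == "__main__":
    for number, name in sorted(signal_table().items()):
        print(number, name)
    print(catch_once(signal.SIGUSR1))
```

SIGKILL and SIGSTOP cannot be caught, so passing them to `catch_once` would fail when the handler is installed.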

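Linux maximum open file limits

Posted on 2018-03-30 | Category: Linux

In the Linux world, everything is a file. The first step toward handling a large number of concurrent connections is therefore to raise the system's file descriptor limits, that is, the limits on the number of open files.

System-wide kernel limit: fs.file-max

This setting is described in man proc, under /proc/sys/fs/file-max.

View the limit

chris@ubuntu:~/myspace/myblog$ sysctl fs.file-max
fs.file-max = 3000000

Change the limit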

chris@ubuntu:~/myspace/myblog$ sudo sysctl -w fs.file-max=3000000
[sudo] password for chris: 
fs.file-max = 3000000

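To make the change permanent:

echo "fs.file-max=3000000" >>/etc/sysctl.conf

View current usage

chris@ubuntu:~$ sysctl fs.file-nr
fs.file-nr = 6816 0 3000000

The first number is the number of file handles currently allocated, the second is the number of handles that were allocated but are no longer in use, and the third is the same as file-max.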
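The three fs.file-nr fields can be read programmatically when monitoring a server. A minimal sketch, assuming Python 3 on Linux; `FileNr`, `parse_file_nr`, and `usage_ratio` are hypothetical helper names introduced here for illustration.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated
    unused: int     # allocated but no longer in use
    maximum: int    # same value as fs.file-max


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated numbers of /proc/sys/fs/file-nr."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected three fields, got %d" % len(fields))
    return FileNr(*(int(field) for field in fields))


def usage_ratio(nr: FileNr) -> float:
    """Fraction of the system-wide limit that is in use."""
    return (nr.allocated - nr.unused) / nr.maximum


if __name__ == "__main__":
    with open("/proc/sys/fs/file-nr") as handle:
        current = parse_file_nr(handle.read())
    print(current, "%.4f" % usage_ratio(current))
```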

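Per-user and per-process limits

View the hard limit

chris@ubuntu:~$ ulimit -Hn
4096

View the soft limit

chris@ubuntu:~$ ulimit -Sn
1024

Use ulimit -Sn to set the soft limit on the maximum number of open file descriptors. The soft limit cannot exceed the hard limit (ulimit -Hn shows the hard limit). ulimit -n with no value shows the soft limit, whereas ulimit -n 204800 sets both the soft and the hard limit. A non-root user can only lower the hard limit, never raise it.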
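A process can inspect and adjust its own limits without going through the shell. The sketch below assumes Python 3 on Linux and uses the standard `resource` module; the function names `nofile_limits` and `set_soft_nofile` are illustrative only. It mirrors the rule above: the soft limit is clamped so it never exceeds the hard limit.

```python
import resource


def nofile_limits():
    """Return (soft, hard) for the open-file limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def set_soft_nofile(target):
    """Set the soft limit to target, clamped to the hard limit; return the value applied."""
    soft, hard = nofile_limits()
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process may not raise the soft limit above the hard limit.
        target = min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("applied:", set_soft_nofile(4096))
    print("after:", nofile_limits())
```

The change applies only to the calling process and its future children, the same scope as running ulimit in a shell.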

To make the change permanent, configure it in /etc/security/limits.conf by adding lines such as the following.

@work       hard    nofile  6000000
@work       soft    nofile  4000000
@work	    soft    core    4000000
@work	    hard    core    4000000
@work       hard    nproc   6000000
@work       soft    nproc   4000000

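These settings take effect only after logging out and back in.
One more caveat when setting the nofile hard limit: it cannot exceed /proc/sys/fs/nr_open. If the hard limit is larger than nr_open, you will be unable to log in normally after logging out. The value of nr_open can be changed with: echo 2000000 > /proc/sys/fs/nr_open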

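Using glibc's MALLOC_CHECK_

Since this is a memory problem, it makes sense to use memory debugging tools to locate it. OB keeps its own cache of memory blocks internally, and that influence has to be removed first: modify the OB memory allocator so that it always calls the C library's malloc and free directly, with no caching. After that, glibc's built-in heap consistency checking can be used.

This feature requires no recompilation; just set the environment variable MALLOC_CHECK_ (note the trailing underscore) at run time. Whenever the program frees memory back to glibc, glibc checks the integrity of its hidden metadata and aborts immediately if it finds an error.

Start the server program with a command line like the following:

export MALLOC_CHECK_=2
./test

MALLOC_CHECK_ has three settings:

MALLOC_CHECK_=0: disable all checks.
MALLOC_CHECK_=1: when an error is detected, print an error message to standard error (stderr).
MALLOC_CHECK_=2: when an error is detected, print no message and abort immediately.

Even so, the resulting core gives us very little information. All we have found is a somewhat more efficient way to reproduce the problem. The core we saw originally may just have been a delayed symptom; the memory had actually been corrupted at an "earlier" point.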

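valgrind

The MALLOC_CHECK_ feature in glibc is too simple. Is there a more advanced tool that not only reports errors but also helps analyze their cause? The well-known valgrind comes to mind. Checking memory problems with valgrind also requires no recompilation; just launch the program under valgrind:

nohup valgrind --error-limit=no --suppressions=suppress bin/mergeserver -z 45447 -r 10.232.36.183:45401 -p45441 >nohup.out &

By default, valgrind stops reporting errors after it has seen 1,000 different errors or more than 10,000,000 errors in total. Adding --error-limit=no disables this behavior. --suppressions names a file of rules that suppress false positives and other errors we do not care about.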

AddressSanitizer

Version requirement: LLVM 3.1 or GCC 4.8

Example code with a bug

#include <stdio.h>
#include <stdlib.h>
int main (int argc,char *argv[])
{
        int i;
        char* p = (char *)malloc(10);
        char* pt = p;
        for (i = 0;i < 10;i++)
        {
                p[i] = 'z';
        }
        free (p);
        //free(pt);
        *p = 1;
        return 0;
}

Compile & run

g++ -fsanitize=address test09.cpp -o test09
./test09

Error report
[img01]

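Core Dump files on Linux: how they are generated and analyzed

Posted on 2018-03-27 | Category: Linux

A core, also called a core dump file, is a mechanism of Unix/Linux operating systems. For online services a core is dreaded: while a process is dumping core the service cannot respond normally and has to be recovered, and the larger the memory footprint of the dumping process, the longer this takes (for example, when a process uses more than 60 GB of memory, writing the complete core file to disk takes about 15 minutes). The traffic lost in the meantime can be substantial.

Everything has two sides. When the OS dumps core it terminates the current process, but it also preserves first-hand data from the scene. The OS is like a camera whose shutter has been pressed, and the core file is the photograph. It contains the memory, CPU registers, and other state of the process at the moment it was terminated, which developers can later use for debugging.

There are many reasons a core may be produced. For example, some older Unix versions did not support attaching GDB directly to a running process the way modern Linux does; one had to send the process a terminating signal first and then read the core file with a tool. On Linux, we can use kill to send a signal to a given process so that it dumps core and exits, or use the gcore command to write a core file of a running process (gcore leaves the process running). On the surface, a core means the process has a bug that a programmer needs to fix. At a deeper level, the process has violated some OS-level protection mechanism, forcing the OS to send it a signal such as SIGSEGV (signal 11). Dereferencing a null pointer or indexing past the end of an array, for instance, violates the OS's memory management by accessing memory outside the process's address space, and the OS raises the alarm by dumping core, much as the immune system signals a virus in the body by causing a fever. (Interestingly, not every out-of-bounds array access produces a core; this depends on the size and boundaries of the virtual pages allocated by the OS's memory management. Even without a core, the program has likely read dirty data, which can corrupt later program behavior and is a very hard kind of bug to track down.)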

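Changing the core file name format
Edit /proc/sys/kernel/core_pattern, which controls the name of the generated core file. By default the file contains a single line, "core". The pattern can be customized with % specifiers; a few of them are:
- %p PID of the dumping process
- %u UID of the dumping process
- %s number of the signal that caused the core
- %t time of the dump, in seconds since 1970-01-01 00:00:00
- %e executable file name of the dumping process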
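To see what file name a given pattern would produce, the specifiers listed above can be expanded by hand. This is an illustrative sketch in Python 3, not the kernel's implementation: `expand_core_pattern` is a hypothetical helper, it supports only the five specifiers listed plus %% for a literal percent sign, and it rejects anything else instead of imitating the kernel's handling.

```python
def expand_core_pattern(pattern, *, pid, uid, signal_number, timestamp, executable):
    """Expand %p, %u, %s, %t, %e and %% in a core_pattern-style template."""
    values = {
        "p": str(pid),
        "u": str(uid),
        "s": str(signal_number),
        "t": str(timestamp),
        "e": executable,
        "%": "%",
    }
    result = []
    index = 0
    while index < len(pattern):
        char = pattern[index]
        if char != "%":
            result.append(char)
            index += 1
            continue
        # A '%' must be followed by one of the supported specifier letters.
        if index + 1 >= len(pattern) or pattern[index + 1] not in values:
            raise ValueError("unsupported specifier at position %d" % index)
        result.append(values[pattern[index + 1]])
        index += 2
    return "".join(result)


if __name__ == "__main__":
    print(expand_core_pattern("core.%e.%p.%t", pid=1234, uid=1000,
                              signal_number=11, timestamp=1522108800,
                              executable="test"))
```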

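The setting cannot be edited directly; change it as follows:
a. Run vim /etc/sysctl.conf and append the line kernel.core_uses_pid = 1 as the last line (this appends the PID to the core file name)
b. Run sysctl -p

Changing the core file size limit

View the core file size limit
ulimit -a
Change the core file size limit
ulimit -c <size> (for example, ulimit -c unlimited)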

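File format
A core file is in ELF format, which can be checked with readelf -h
[img01]

Like bmp, exe, and other file types, an ELF file begins with a header that holds the control structure for the whole file. It is defined as follows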

typedef struct elf32_hdr {  
	unsigned char e_ident[EI_NIDENT];   
	Elf32_Half    e_type;         /* file type */  
	Elf32_Half    e_machine;      /* architecture */  
	Elf32_Word    e_version;  
	Elf32_Addr    e_entry;    	  /* entry point */  
	Elf32_Off 	  e_phoff;        /* PH table offset */  
	Elf32_Off 	  e_shoff;        /* SH table offset */  
	Elf32_Word    e_flags;  
	Elf32_Half    e_ehsize;       /* ELF header size in bytes */  
	Elf32_Half    e_phentsize;    /* PH size */  
	Elf32_Half    e_phnum;        /* PH number */  
	Elf32_Half    e_shentsize;    /* SH size */  
	Elf32_Half    e_shnum;        /* SH number */  
	Elf32_Half    e_shstrndx;     /* SH name string table index */  
} Elf32_Ehdr;
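The layout of this struct can be checked by unpacking the first 52 bytes of a file. The sketch below assumes Python 3 and handles only 32-bit little-endian files, matching the Elf32_Ehdr definition above; cores produced on 64-bit systems use the larger Elf64_Ehdr and are rejected here. `parse_elf32_header` and the constant names are illustrative.

```python
import struct

# e_ident[16], then Half/Half/Word/Addr/Off/Off/Word, then six Half fields.
ELF32_EHDR = struct.Struct("<16sHHIIIIIHHHHHH")
FIELD_NAMES = (
    "e_ident", "e_type", "e_machine", "e_version", "e_entry", "e_phoff",
    "e_shoff", "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
    "e_shentsize", "e_shnum", "e_shstrndx",
)
ET_NAMES = {0: "ET_NONE", 1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def parse_elf32_header(data: bytes) -> dict:
    """Decode an Elf32_Ehdr from the start of data."""
    if len(data) < ELF32_EHDR.size:
        raise ValueError("need at least %d bytes" % ELF32_EHDR.size)
    header = dict(zip(FIELD_NAMES, ELF32_EHDR.unpack_from(data)))
    ident = header["e_ident"]
    if ident[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic")
    if ident[4] != 1 or ident[5] != 1:
        # ident[4] is the class (1 = 32-bit), ident[5] the byte order (1 = little-endian).
        raise ValueError("only 32-bit little-endian files are handled in this sketch")
    header["type_name"] = ET_NAMES.get(header["e_type"], "unknown")
    return header
```

A core file reports e_type 4 (ET_CORE), which is what readelf -h prints as "CORE (Core file)".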

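Source code

The core dump routine is the function do_coredump() in kernel/fs/exec.c. If core dump generation fails, you can add print statements inside do_coredump to find out why. The source of do_coredump is shown below.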

void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
	struct core_state core_state;
	char corename[CORENAME_MAX_SIZE + 1];
	struct mm_struct *mm = current->mm;
	struct linux_binfmt * binfmt;
	const struct cred *old_cred;
	struct cred *cred;
	int retval = 0;
	int flag = 0;
	int ispipe;
	static atomic_t core_dump_count = ATOMIC_INIT(0);
	struct coredump_params cprm = {
		.signr = signr,
		.regs = regs,
		.limit = rlimit(RLIMIT_CORE),
		/*
		 * We must use the same mm->flags while dumping core to avoid
		 * inconsistency of bit flags, since this flag is not protected
		 * by any locks.
		 */
		.mm_flags = mm->flags,
	};
	audit_core_dumps(signr);
	binfmt = mm->binfmt;
	// binfmt->core_dump is assigned at initialization depending on a kernel config macro; it is NULL when the macro is disabled
	if (!binfmt || !binfmt->core_dump)
		goto fail;
	if (!__get_dumpable(cprm.mm_flags))
		goto fail;
	cred = prepare_creds();
	if (!cred)
		goto fail;
	/*
	 *	We cannot trust fsuid as being the "true" uid of the
	 *	process nor do we know its entire history. We only know it
	 *	was tainted so we dump it as root in mode 2.
	 */
	if (__get_dumpable(cprm.mm_flags) == 2) {
		/* Setuid core dump mode */
		flag = O_EXCL;		/* Stop rewrite attacks */
		cred->fsuid = 0;	/* Dump root private */
	}
	retval = coredump_wait(exit_code, &core_state);
	if (retval < 0)
		goto fail_creds;
	old_cred = override_creds(cred);
	/*
	 * Clear any false indication of pending signals that might
	 * be seen by the filesystem code called to write the core file.
	 */
	clear_thread_flag(TIF_SIGPENDING);
	// build the core file name from the value in /proc/sys/kernel/core_pattern
	ispipe = format_corename(corename, signr);
 	if (ispipe) {
		int dump_count;
		char **helper_argv;
		if (cprm.limit == 1) {
			/*
			 * Normally core limits are irrelevant to pipes, since
			 * we're not writing to the file system, but we use
			 * cprm.limit of 1 here as a speacial value. Any
			 * non-1 limit gets set to RLIM_INFINITY below, but
			 * a limit of 0 skips the dump.  This is a consistent
			 * way to catch recursive crashes.  We can still crash
			 * if the core_pattern binary sets RLIM_CORE =  !1
			 * but it runs as root, and can do lots of stupid things
			 * Note that we use task_tgid_vnr here to grab the pid
			 * of the process group leader.  That way we get the
			 * right pid if a thread in a multi-threaded
			 * core_pattern process dies.
			 */
			printk(KERN_WARNING
				"Process %d(%s) has RLIMIT_CORE set to 1\n",
				task_tgid_vnr(current), current->comm);
			printk(KERN_WARNING "Aborting core\n");
			goto fail_unlock;
		}
		cprm.limit = RLIM_INFINITY;
		dump_count = atomic_inc_return(&core_dump_count);
		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
			printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
			       task_tgid_vnr(current), current->comm);
			printk(KERN_WARNING "Skipping core dump\n");
			goto fail_dropcount;
		}
		helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);
		if (!helper_argv) {
			printk(KERN_WARNING "%s failed to allocate memory\n",
			       __func__);
			goto fail_dropcount;
		}
		retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
					NULL, UMH_WAIT_EXEC, umh_pipe_setup,
					NULL, &cprm);
		argv_free(helper_argv);
		if (retval) {
 			printk(KERN_INFO "Core dump to %s pipe failed\n",
			       corename);
			goto close_fail;
 		}
	} else {
		struct inode *inode;
		
		// check the process's soft limit: it must be at least the minimum core dump size, min_coredump (initially PAGE_SIZE)
		if (cprm.limit < binfmt->min_coredump)
			goto fail_unlock;
		cprm.file = filp_open(corename,
				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
				 0600);
		if (IS_ERR(cprm.file))
			goto fail_unlock;
		inode = cprm.file->f_path.dentry->d_inode;
		if (inode->i_nlink > 1)
			goto close_fail;
		if (d_unhashed(cprm.file->f_path.dentry))
			goto close_fail;
		/*
		 * AK: actually i see no reason to not allow this for named
		 * pipes etc, but keep the previous behaviour for now.
		 */
		if (!S_ISREG(inode->i_mode))
			goto close_fail;
		/*
		 * Dont allow local users get cute and trick others to coredump
		 * into their pre-created files.
		 */
		if (inode->i_uid != current_fsuid())
			goto close_fail;
		if (!cprm.file->f_op || !cprm.file->f_op->write)
			goto close_fail;
		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
			goto close_fail;
	}
	
	// call the core_dump function to write registers and other state into the core file
	retval = binfmt->core_dump(&cprm);
	if (retval)
		current->signal->group_exit_code |= 0x80;
	if (ispipe && core_pipe_limit)
		wait_for_dump_helpers(cprm.file);
close_fail:
	if (cprm.file)
		filp_close(cprm.file, NULL);
fail_dropcount:
	if (ispipe)
		atomic_dec(&core_dump_count);
fail_unlock:
	coredump_finish(mm);
	revert_creds(old_cred);
fail_creds:
	put_cred(cred);
fail:
	return;
}

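ZooKeeper deployment

Posted on 2018-03-26 | Category: Distributed Systems

The latest release can be obtained from the official site http://hadoop.apache.org/zookeeper/. Installing ZooKeeper is very simple; below, installation and configuration are described for standalone mode and cluster mode.

Standalone mode

Unpack the release archive zookeeper-3.4.7.tar.gz
tar -xzvf zookeeper-3.4.7.tar.gz
Create the ZooKeeper working directory
cp -r zookeeper-3.4.7 zookeeper

Edit the ZooKeeper configuration file conf/zoo.cfg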

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
dataDir=./data
dataLogDir=./logs
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

Run
$ cd bin
$ ./zkServer.sh start
Check the running status
./zkServer.sh status

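Pseudo-cluster mode

Create two more directories

cp -r zookeeper/ zookeeper1/
cp -r zookeeper/ zookeeper2/

Edit the ZooKeeper configuration file conf/zoo.cfg of each instance; three instances are configured here. Because all three instances run on the same host, each one also needs its own clientPort.

server.1=localhost:2887:3887
server.2=localhost:2888:3888
server.3=localhost:2889:3889

Add a myid file under the data directory of each instance (one command per instance)

echo "1" > myid
echo "2" > myid
echo "3" > myid

Start the three instances one by one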

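MySQL covering indexes

Posted on 2018-03-24 | Category: MySQL

Concept

An index that contains all the data needed to satisfy a query is called a covering index; this is what people mean when they say a query needs no lookup back into the table.

How to tell

Use EXPLAIN and look at the Extra column of the output. For an index-covered query it shows Using index. The MySQL query optimizer decides before executing a query whether an index covers it.

Notes
1. Covering indexes do not work with every index type; the index must store the values of the indexed columns.
2. Hash and full-text indexes do not store the values, so MySQL can only use B-Tree indexes as covering indexes.
3. Different storage engines implement covering indexes differently.
4. Not all storage engines support them.
5. To use a covering index, make sure the SELECT list names only the columns that are needed rather than SELECT *. Indexing every column together makes the index file too large and degrades query performance, so do not do that just to get a covering index.

InnoDB
1. Besides the columns of the index itself, an index-covered query can also use the columns of the default clustered index (the primary key).
2. This follows from InnoDB's index structure: the primary index is stored as a B+ tree, which is what is meant by "the data rows are the index, and the index is the data".
3. In an InnoDB secondary index, a leaf node stores the index value and a reference to the primary key index; the remaining column values then have to be looked up through the primary key. A secondary index therefore stores the primary key value.
4. A covering index can also make use of InnoDB's default clustered index.
5. The InnoDB clustered index stores the primary key ID, the transaction ID, the rollback pointer, and the non-key columns, so a query that uses a non-primary-key index can also obtain the primary key ID from that index.

Covering indexes are a very powerful tool that can greatly improve query performance. Reading only the index instead of the data rows has these advantages
1. Index entries are usually smaller than full rows, so MySQL accesses less data.
2. Indexes are stored sorted by value, so they need less I/O than random access to rows.
3. Most storage engines cache indexes better than data; MyISAM, for example, caches only indexes.
4. Covering indexes are especially useful for InnoDB tables. InnoDB organizes data by clustered index, so if a secondary index contains the data the query needs, there is no further lookup in the clustered index.

The inventory table in sakila has a composite index (store_id, film_id). For queries that access only these two columns, MySQL can use the index alone, as shown below
Table structure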

CREATE TABLE `inventory` (
  `inventory_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `film_id` smallint(5) unsigned NOT NULL,
  `store_id` tinyint(3) unsigned NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`inventory_id`),
  KEY `idx_fk_film_id` (`film_id`),
  KEY `idx_store_id_film_id` (`store_id`,`film_id`),
  CONSTRAINT `fk_inventory_film` FOREIGN KEY (`film_id`) REFERENCES `film` (`film_id`) ON UPDATE CASCADE,
  CONSTRAINT `fk_inventory_store` FOREIGN KEY (`store_id`) REFERENCES `store` (`store_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=4582 DEFAULT CHARSET=utf8;

Query

mysql>  EXPLAIN SELECT store_id, film_id FROM sakila.inventory\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: inventory
         type: index
possible_keys: NULL
          key: idx_store_id_film_id
      key_len: 3
          ref: NULL
         rows: 4581
        Extra: Using index
1 row in set (0.03 sec)

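In most engines, an index covers a query only when the columns the query accesses are part of the index. InnoDB is not limited to this: its secondary indexes store the primary key value in their leaf nodes. The sakila.actor table uses InnoDB and has an index on last_name, so that index can cover queries that access actor_id, as shown below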

mysql> EXPLAIN SELECT actor_id, last_name  FROM sakila.actor WHERE last_name = 'HOPPER'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: actor
         type: ref
possible_keys: idx_actor_last_name
          key: idx_actor_last_name
      key_len: 137
          ref: const
         rows: 2
        Extra: Using where; Using index
1 row in set (0.00 sec)
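The covered versus not-covered distinction can be tried without a MySQL server. The sketch below swaps in SQLite (through Python 3's standard sqlite3 module) purely as a stand-in: it is a different engine with a different planner, and its EXPLAIN QUERY PLAN output says "USING COVERING INDEX" where MySQL's Extra column says "Using index". The table is a simplified copy of inventory, and `query_plan` and `build_demo` are illustrative names.

```python
import sqlite3


def build_demo():
    """Create an in-memory table resembling sakila.inventory with a composite index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE inventory (
            inventory_id INTEGER PRIMARY KEY,
            film_id      INTEGER NOT NULL,
            store_id     INTEGER NOT NULL,
            last_update  TEXT
        );
        CREATE INDEX idx_store_id_film_id ON inventory (store_id, film_id);
        """
    )
    return conn


def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = build_demo()
# Both selected columns live in the index, so the table itself is never read.
covered = query_plan(conn, "SELECT store_id, film_id FROM inventory WHERE store_id = 1")
# last_update is not in the index, so each match needs a lookup in the table.
not_covered = query_plan(conn, "SELECT last_update FROM inventory WHERE store_id = 1")

if __name__ == "__main__":
    print(covered)
    print(not_covered)
```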

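Using indexes for sorting

MySQL has two ways to produce an ordered result set: a filesort, or scanning an index in order

Sorting through an index is very fast, and the same index can serve both the lookup and the sort. An index can be used for sorting when its column order matches the column order in ORDER BY and all columns sort in the same direction (all ascending or all descending). If the query joins several tables, the index is used only when every column in ORDER BY belongs to the first table. All other cases use a filesort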

CREATE TABLE `actor` (
  `actor_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(16) NOT NULL DEFAULT '',
  `password` varchar(16) NOT NULL DEFAULT '',
  PRIMARY KEY (`actor_id`),
  KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
insert into actor(name,password) values ('cat01','1234567'),('cat02','1234567'),('ddddd','1234567'),('aaaaa','1234567');

1. explain select actor_id from actor order by actor_id \G

mysql> explain select actor_id from actor order by actor_id \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: actor
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 4
        Extra: Using index
1 row in set (0.00 sec)

2. explain select actor_id from actor order by password \G

mysql> explain select actor_id from actor order by password \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: actor
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4
        Extra: Using filesort
1 row in set (0.00 sec)

3. explain select actor_id from actor order by name \G

mysql> explain select actor_id from actor order by name \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: actor
         type: index
possible_keys: NULL
          key: name
      key_len: 50
          ref: NULL
         rows: 4
        Extra: Using index
1 row in set (0.00 sec)

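When MySQL cannot use an index for sorting, it sorts the data itself with its own algorithm (quicksort) in memory (the sort buffer). If the data does not fit in memory, it divides the data into chunks on disk, sorts each chunk, and then merges the chunks into an ordered result set (in effect, an external sort)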
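The chunk, sort, and merge idea behind an external sort can be sketched in a few lines. This is a generic illustration in Python 3, not MySQL's filesort implementation: `external_sort` is a hypothetical helper, it handles integers only, and `chunk_size` plays the role of the sort buffer.

```python
import heapq
import tempfile


def external_sort(values, chunk_size):
    """Yield integers in ascending order while holding at most chunk_size of them in memory."""
    runs = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to a temporary file as one sorted run.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines("%d\n" % value for value in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for value in values:
        chunk.append(value)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Each run is already sorted, so a k-way merge produces the final order.
    streams = [(int(line) for line in run) for run in runs]
    try:
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)))
```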

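Redis Sentinel: introduction and deployment

Posted on 2018-03-23 | Category: Redis

1. Introduction to Sentinel

1.1 Problems with master-slave replication

Redis replication synchronizes the master's data to its slaves. A slave then serves two purposes:

If the master goes down, a slave can step in at any time as the master's backup.
Slaves extend the master's read capacity and share its read load.

But problems arise:

When the master goes down and a slave is promoted to master, the master address used by the application has to be changed and all other slaves have to be told to replicate from the new master. The whole process requires manual intervention.
The master's write capacity is limited to a single machine.
The master's storage capacity is limited to a single machine.
Sentinel, discussed next, solves the first problem. For the other two, Redis offers Redis Cluster.

1.2 High availability with Redis Sentinel

Redis Sentinel is a distributed architecture consisting of several Sentinel nodes and Redis data nodes. Each Sentinel node monitors the data nodes and the other Sentinel nodes, and marks a node as down when it finds the node unreachable.

If the node marked down is the master, the Sentinel also "negotiates" with the other Sentinel nodes. When a majority of Sentinel nodes agree that the master is unreachable, they elect one Sentinel node to carry out the automatic failover and notify the Redis application of the change.

The whole process is fully automatic and needs no manual intervention, so it solves Redis's high availability problem well.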
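The agreement step can be pictured as two simple counting rules. The following is a simplified model in Python 3, not Sentinel's actual code: it assumes each Sentinel reports a yes/no opinion, that a configured quorum of "down" opinions marks the master as objectively down, and that a Sentinel needs votes from a majority of all Sentinels to lead the failover. The function names are hypothetical.

```python
def is_objectively_down(down_opinions, quorum):
    """down_opinions maps a Sentinel id to True if that Sentinel cannot reach the master."""
    agreeing = sum(1 for is_down in down_opinions.values() if is_down)
    return agreeing >= quorum


def has_majority(votes_received, total_sentinels):
    """A failover leader needs more than half of all Sentinels to vote for it."""
    return votes_received > total_sentinels // 2


if __name__ == "__main__":
    opinions = {"sentinel-26379": True, "sentinel-26380": True, "sentinel-26381": False}
    print(is_objectively_down(opinions, quorum=2))
    print(has_majority(votes_received=2, total_sentinels=3))
```

With three Sentinels and a quorum of 2, losing one Sentinel still leaves enough votes for both rules, which is why an odd number of at least three Sentinels is a common choice.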

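Next, we deploy a Redis Sentinel setup to understand the overall architecture.

2. Deploying Redis Sentinel

The topology we deploy is shown in the figure:
[img01]

The Redis Sentinel setup consists of 3 Sentinel nodes, 1 master, and 2 slaves.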

role	IP	port
master	127.0.0.1	6379
slave1	127.0.0.1	6380
slave2	127.0.0.1	6381
Sentinel1	127.0.0.1	26379
Sentinel2	127.0.0.1	26380
Sentinel3	127.0.0.1	26381

2.1 Start the master

Configuration:

port 6379
daemonize yes
logfile "6379.log"
dbfilename "dump-6379.rdb"
dir "/var/redis/data/"

Start the master

➜ sudo redis-server redis-6379.conf

Use the PING command to check that it started

➜ redis-cli -h 127.0.0.1 -p 6379 ping
PONG

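2.2 Start the two slaves

Configuration (the two slaves use the same configuration, apart from the port and the file names)

port 6380
daemonize yes
logfile "6380.log"
dbfilename "dump-6380.rdb"
dir "/var/redis/data/"
# replicate from the master
slaveof 127.0.0.1 6379

Start the two slaves

➜ sudo redis-server redis-6380.conf
➜ sudo redis-server redis-6381.conf

Use the PING command to check that they started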

➜   redis-cli -h 127.0.0.1 -p 6380 ping
PONG
➜   redis-cli -h 127.0.0.1 -p 6381 ping
PONG

2.3 Confirm the master-slave relationship

From the master's point of view

➜   redis-cli -h 127.0.0.1 -p 6379 INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=85,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=85,lag=0
......

From a slave's point of view (port 6380)

➜   redis-cli -h 127.0.0.1 -p 6380 INFO replication
# Replication
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:up
......
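When checking the replication state from a script, the key:value lines of INFO replication are easy to turn into a dictionary. A minimal sketch in Python 3, working on the text as printed by redis-cli above; `parse_info` is an illustrative helper, not part of any Redis client library.

```python
def parse_info(text):
    """Turn the key:value lines of an INFO section into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as "# Replication", and filler lines.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


SAMPLE = """# Replication
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:up
......
"""

if __name__ == "__main__":
    replication = parse_info(SAMPLE)
    print(replication["role"], replication["master_link_status"])
```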

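2.4 Deploy the Sentinel nodes

The three Sentinel nodes are deployed the same way (only the port differs). Take 26379 as the example.

Configuration

# Port of this Sentinel node
port 26379
dir /var/redis/data/
logfile "26379.log"
# This Sentinel node monitors the master at 127.0.0.1:6379
# 2 means that at least 2 Sentinel nodes must agree before the master is judged to have failed
# mymaster is the alias of the master
sentinel monitor mymaster 127.0.0.1 6379 2
# Every Sentinel node periodically sends PING to check whether the Redis data nodes and the other Sentinel nodes are reachable; a node that has not replied for more than 30000 milliseconds is judged unreachable
sentinel down-after-milliseconds mymaster 30000
# When the Sentinel nodes agree that the master has failed, the Sentinel leader performs the failover and selects a new master, and the former slaves start replicating from it; this limits the number of slaves that replicate from the new master at the same time to 1
sentinel parallel-syncs mymaster 1
# The failover timeout is 180000 milliseconds
sentinel failover-timeout mymaster 180000

Start (either of two ways)

* redis-sentinel sentinel-26379.conf
* redis-server sentinel-26379.conf --sentinel

Confirm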

➜ redis-cli -h 127.0.0.1 -p 26379 INFO Sentinel

# Sentinel

sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=127.0.0.1:6379,slaves=2,sentinels=1 //sentinels=1: only one Sentinel node has been started so far


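After all three Sentinel nodes are deployed, the overall topology is as shown in the figure:
[img02]
* Once Redis Sentinel is deployed, the following changes occur
 * The Sentinel node automatically discovers the slaves and the other Sentinel nodes.
 * Default settings such as parallel-syncs and failover-timeout are removed from the configuration file.
 * New epoch parameters are added.
Taking the node on port 26379 as an example, after all Sentinel and data nodes are started its configuration file looks like this: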

port 26379
dir "/var/redis/data"
sentinel myid 70a3e215c1a34b4d9925d170d9606e615a8874f2
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
daemonize yes
logfile "26379.log"
# Two slaves were discovered
sentinel known-slave mymaster 127.0.0.1 6381
sentinel known-slave mymaster 127.0.0.1 6380
# Two other Sentinel nodes were discovered
sentinel known-sentinel mymaster 127.0.0.1 26381 e1148ad6caf60302dd6d0dbd693cb3448e209ac2
sentinel known-sentinel mymaster 127.0.0.1 26380 39db5b040b21a52da5334dd2d798244c034b4fc3
sentinel current-epoch 0


2.5 Failover experiment

First, look up the PIDs of the node processes

➜ ps -aux | grep redis
root 18225 0.1 0.0 40208 11212 ? Ssl 22:10 0:05 redis-server 127.0.0.1:6379
root 18234 0.0 0.0 38160 8364 ? Ssl 22:10 0:04 redis-server 127.0.0.1:6380
root 18244 0.0 0.0 38160 8308 ? Ssl 22:10 0:04 redis-server 127.0.0.1:6381
root 20568 0.1 0.0 38160 8460 ? Ssl 23:05 0:02 redis-sentinel *:26379 [sentinel]
root 20655 0.1 0.0 38160 8296 ? Ssl 23:07 0:02 redis-sentinel *:26380 [sentinel]
root 20664 0.1 0.0 38160 8312 ? Ssl 23:07 0:02 redis-sentinel *:26381 [sentinel]

We kill the master on port 6379.

➜ sudo kill -9 18225
➜ ps -aux | grep redis
root 18234 0.0 0.0 38160 8364 ? Ssl 22:10 0:05 redis-server 127.0.0.1:6380
root 18244 0.0 0.0 38160 8308 ? Ssl 22:10 0:05 redis-server 127.0.0.1:6381
root 20568 0.1 0.0 38160 8460 ? Ssl 23:05 0:03 redis-sentinel *:26379 [sentinel]
root 20655 0.1 0.0 38160 8296 ? Ssl 23:07 0:03 redis-sentinel *:26380 [sentinel]
root 20664 0.1 0.0 38160 8312 ? Ssl 23:07 0:03 redis-sentinel *:26381 [sentinel]


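At this point Redis Sentinel decides whether the master is objectively down (Objectively Down, ODOWN for short). Once it confirms that the master is unreachable, it tells the slaves to stop replicating from it.
[img03]
When the master has been down longer than the configured down-after-milliseconds value of 30000 milliseconds, Redis Sentinel performs the failover.
Now let's look at the master information monitored by the Sentinel node: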

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6380" //the master is now the node on port 6380
    7) "runid"
    8) "084850ab4ff6c2f2502b185c8eab5bdd25a26ce2"
    9) "flags"
   10) "master"
…………..

Now look at the slave information monitored by the Sentinel node:

127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "127.0.0.1:6379" //ip:port
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "s_down,slave,disconnected" //the former master on port 6379 is disconnected
…………..
2)  1) "name"
    2) "127.0.0.1:6381"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6381"
    7) "runid"
    8) "24495fe180e4fd64ac47467e0b2652894406e9e4"
    9) "flags"
   10) "slave" //originally a slave, and its role is still slave
…………..


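From the information above, the Redis data node on port 6380 has become the new master, and the old master on port 6379 is disconnected. As shown in the figure:
[img04]
Now let's try restarting the data node on port 6379.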

➜ sudo redis-server redis-6379.conf
➜ ps -aux | grep redis
root 18234 0.1 0.0 40208 11392 ? Ssl 5月22 0:06 redis-server 127.0.0.1:6380
root 18244 0.1 0.0 40208 10356 ? Ssl 5月22 0:07 redis-server 127.0.0.1:6381
root 20568 0.1 0.0 38160 8460 ? Ssl 5月22 0:05 redis-sentinel *:26379 [sentinel]
root 20655 0.1 0.0 38160 8296 ? Ssl 5月22 0:05 redis-sentinel *:26380 [sentinel]
root 20664 0.1 0.0 38160 8312 ? Ssl 5月22 0:05 redis-sentinel *:26381 [sentinel]
menwen 22475 0.0 0.0 14216 5920 pts/2 S+ 5月22 0:00 redis-cli -p 26379
// the data node on port 6379 has been restarted
root 22617 0.0 0.0 38160 8304 ? Ssl 00:00 0:00 redis-server 127.0.0.1:6379

Let's see what happened:

127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "127.0.0.1:6379" //after the restart, the node on port 6379 has become a "live" slave
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "de1b5c28483cf150d9550f8e338886706e952346"
    9) "flags"
   10) "slave"
…………..
2)  1) "name" //the node on port 6381 is unchanged and is still a slave
    2) "127.0.0.1:6381"
…………..

It has been demoted to a slave of the node on port 6380.
[img05]

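From the logical architecture and the failover experiment above, we can see that Redis Sentinel provides the following functions.

Monitoring: Sentinel nodes periodically check whether the Redis data nodes and the other Sentinel nodes are reachable.
Notification: Sentinel nodes notify the application of a failover.
Master failover: a slave is promoted to master and the correct master-slave relationships are maintained afterwards.
Configuration provider: in a Redis Sentinel setup, a client connects to the set of Sentinel nodes during initialization and obtains the master's information from them.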

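3. Sentinel configuration reference

sentinel monitor mymaster 127.0.0.1 6379 2
- This Sentinel node monitors the master at 127.0.0.1:6379
- 2 means that at least 2 Sentinel nodes must agree before the master is judged to have failed
- mymaster is the alias of the master
sentinel down-after-milliseconds mymaster 30000
- Every Sentinel node periodically sends PING to check whether the Redis data nodes and the other Sentinel nodes are reachable; a node that has not replied for more than 30000 milliseconds is judged unreachable
sentinel parallel-syncs mymaster 1
- When the Sentinel nodes agree that the master has failed, the Sentinel leader performs the failover and selects a new master, and the former slaves start replicating from it; this limits the number of slaves that replicate from the new master at the same time to 1.
sentinel failover-timeout mymaster 180000
- The failover timeout is 180000 milliseconds
sentinel auth-pass <master-name> <password>
- If the master monitored by Sentinel has a password configured, add the master's password with sentinel auth-pass so that the Sentinel nodes can still monitor the master.
- Example: sentinel auth-pass mymaster MySUPER--secret-0123passw0rd
sentinel notification-script <master-name> <script-path>
- During a failover, when a warning-level Sentinel event occurs (an important event such as subjectively down or objectively down), the script at the given path is triggered and the event parameters are passed to it.
- Example: sentinel notification-script mymaster /var/redis/notify.sh
sentinel client-reconfig-script <master-name> <script-path>
- After a failover finishes, the script at the given path is triggered and the parameters describing the failover result are passed to it.
- Example: sentinel client-reconfig-script mymaster /var/redis/reconfig.sh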

How should a Redis cluster solution be built

Posted on 2018-03-23 | Category: Redis

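Iterators, generators & decorators

Posted on 2018-03-22 | Category: Python

Iterators

An iterator is a way to access the elements of a collection. An iterator object starts at the first element of the collection and stops once every element has been visited. An iterator can only move forward, never backward. One major advantage is that it does not require all the elements of the iteration to be prepared in advance: an element is computed only when the iteration reaches it, and before or after that moment the element may not exist or may already have been destroyed. This makes iterators especially suitable for traversing very large or even infinite collections

Characteristics:

The caller does not need to care about the iterator's internal structure and simply keeps calling next() to fetch the next item
Values in the collection cannot be accessed at random, only in order from beginning to end
Once part of the way through, it is not possible to go back
Convenient for looping over large data sets, and saves memory

The first way: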

list=["hello","world","china"]
for i in list:
	print(i)

This is also the traversal style we normally use

The second way:

>>> list=["hello","world","china"]
>>> it=iter(list)
>>> while True:
	try:
		m=next(it)
		print(m)
	except StopIteration:
		break
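What iter() and next() rely on is the iterator protocol: an object with `__iter__` and `__next__` methods that raises StopIteration when it is exhausted. A minimal sketch in Python 3; the `Countdown` class is an invented example, not from the original post.

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1 and then stops."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a for loop.
        return self

    def __next__(self):
        if self.current <= 0:
            # Signals the end of iteration; a for loop catches this automatically.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


if __name__ == "__main__":
    for number in Countdown(3):
        print(number)
```

Because the object only moves forward, looping over the same `Countdown` instance a second time produces nothing, which matches the "cannot go back" characteristic listed earlier.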

List comprehensions

>>> L = [x * x for x in range(10)]
>>> L
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> type(L)
<type 'list'>

range(10) also returns a list (in Python 2, as used here)

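Generators

A list comprehension creates a list directly. Memory is limited, however, so the capacity of a list is limited too. Creating a list with one million elements takes a lot of storage, and if we only need the first few elements, the space taken by the vast majority of the rest is wasted.

So if the list elements can be computed by some algorithm, can we compute the following elements one by one as the loop proceeds? That way there is no need to build the complete list, which saves a great deal of space. In Python, this mechanism of computing while looping is called a generator.

A function that returns an iterator when called is a generator function: if the function body contains yield, the function becomes a generator. The main effect of yield is to suspend the function and save its state. After the suspension the calling code continues to run, and later the function can be resumed, picking up at the statement after the last yield.

Usage 1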

>>> g = (x * x for x in range(10))
>>> g
<generator object <genexpr> at 0x1022ef630>

Usage 2

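def fab(n):
	a = 0
	b = 1
	while a <= n:
		yield a
		a, b = a+b, a
for i in fab(5):
	print(i)

Output:

0
1
1
2
3
5

Note that xrange is not a generator and range is not an iterator: in Python 2, range returns a list, while xrange returns a lazy sequence object that computes its values on demand.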
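Because a generator computes values only when asked, it can describe an endless sequence and let the caller decide how much to consume. A minimal sketch in Python 3 using the standard itertools.islice; the `fibonacci` function is illustrative.

```python
from itertools import islice


def fibonacci():
    """Endless generator of Fibonacci numbers; nothing is computed until requested."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take only the first ten values; the generator itself never terminates.
first_ten = list(islice(fibonacci(), 10))

if __name__ == "__main__":
    print(first_ten)
```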

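Decorators

A way of adding functionality dynamically at run time without changing the original function definition is called a decorator (Decorator), similar to the decorator pattern in design patterns.

Code example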

def log(func):
    def inner(*args, **kwargs):
        # print the name of the wrapped function before calling it
        print("__func__ = %s" % func.__name__)
        return func(*args, **kwargs)
    return inner

@log
def now():
    print("2018/03/22")

now()

Output:

__func__ = now
2018/03/22
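One side effect of wrapping is that the decorated function takes on the wrapper's name and docstring. The standard functools.wraps helper copies that metadata back. A minimal sketch in Python 3; `log_calls` and `add` are invented names for illustration.

```python
import functools


def log_calls(func):
    """Decorator that counts calls while keeping func's name and docstring."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@log_calls
def add(a, b):
    """Return a + b."""
    return a + b


if __name__ == "__main__":
    print(add(1, 2), add.calls, add.__name__)
```

Without functools.wraps, `add.__name__` would report "wrapper", which makes logs and tracebacks harder to read.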

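Common commands for troubleshooting online problems

Posted on 2018-03-21 | Category: Linux

Many problems show up only in production or pre-release environments, and code cannot be debugged in production. Troubleshooting there comes down to reading logs, inspecting system state, and dumping threads. This post introduces some commonly used tools and commands for locating online problems.

top

On the Linux command line, use the top command to view the state of each process.

Press the interactive key 1 to view performance data for each CPU
View information about all threads of a process
top -H -p pid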

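strace & pstack

strace is a tool that traces the system calls a program makes and the signals it receives while running, which helps us analyze abnormal behavior of a program or command.

1. A simple example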

//main.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main( )
{
    int fd ;
    int i = 0 ;
    fd = open( "/tmp/foo", O_RDONLY ) ;
    if ( fd < 0 )
        i=5;
    else
        i=2;
    return i;
}

2. strace trace output

The following command traces the program above with strace and writes the result to the file main.strace:

$ strace -o main.strace ./main

Now let's look at the contents of main.strace:

lx@LX:~$ cat main.strace
execve("./main", ["./main"], [/* 43 vars */]) = 0
brk(0)                                  = 0x9ac4000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7739000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=80682, ...}) = 0
mmap2(NULL, 80682, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7725000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220o\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1434180, ...}) = 0
mmap2(NULL, 1444360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x56d000
mprotect(0x6c7000, 4096, PROT_NONE)     = 0
mmap2(0x6c8000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15a) = 0x6c8000
mmap2(0x6cb000, 10760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6cb000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7724000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb77248d0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x6c8000, 8192, PROT_READ)     = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0x4b0000, 4096, PROT_READ)     = 0
munmap(0xb7725000, 80682)               = 0
open("/tmp/foo", O_RDONLY)              = -1 ENOENT (No such file or directory)
exit_group(5)                           = ?
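//The "Line N" labels used below count the lines of this output starting from the execve line; they were added for explanation and are not part of the strace output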

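strace traces the system calls made as the program interacts with the system. Each line above corresponds to one system call, in the format:

system call name( arguments... ) = return value  error code and description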
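Since every line follows this shape, a trace file can be filtered with a small parser, for example to list only the failed calls. The sketch below assumes Python 3 and the plain output format shown above, without the -r or -t timestamp prefixes; `parse_strace_line` is a hypothetical helper, and lines it cannot match (such as an unfinished call) return None.

```python
import re

# name(args) = return-value [error code and description]
STRACE_LINE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+)(?:\s+(?P<error>.*))?$"
)


def parse_strace_line(line):
    """Split one strace output line into name, args, ret, and error (or None)."""
    match = STRACE_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def failed_calls(lines):
    """Return the parsed entries whose return value is -1."""
    parsed = (parse_strace_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["ret"] == "-1"]


if __name__ == "__main__":
    with open("main.strace") as trace:
        for entry in failed_calls(trace):
            print(entry["name"], entry["error"])
```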

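Line 1:  For a program run from the command line, execve (or another call in the exec family) is always the first system call in the strace output. strace first calls fork or clone to create a child process, and the child then calls exec to load the program to run (here ./main)
Line 2:  Calls brk with 0 as the argument; the return value is the current end of the heap (the program break). If the child process calls malloc, space is allocated starting from address 0x9ac4000
Line 3:  Calls access to check whether /etc/ld.so.nohwcap exists
Line 4:  Uses mmap2 to create an anonymous memory mapping and obtain 8192 bytes of memory starting at address 0xb7739000
Line 6:  Calls open to open /etc/ld.so.cache; the returned file descriptor is 3
Line 7:  fstat64 retrieves information about /etc/ld.so.cache
Line 8:  Calls mmap2 to map /etc/ld.so.cache into memory
Line 9:  close closes file descriptor 3, which refers to /etc/ld.so.cache
Line12:  Calls read to read 512 bytes from the libc library file /lib/i386-linux-gnu/libc.so.6, that is, the ELF header
Line15:  Uses mprotect to protect the 4096 bytes starting at 0x6c7000 (PROT_NONE means no access is allowed; PROT_READ means reading is allowed)
Line24:  Calls munmap to unmap /etc/ld.so.cache from memory, matching the mmap2 on Line 8
Line25:  The only system call that appears in the source code: open, used to open /tmp/foo
Line26:  The child process ends with exit code 5 (why 5? Look back at the source of the example program)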

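3. Analyzing the output

Phew! After that many system calls it is easy to lose your bearings. Let's step back and return to strace itself.

The output shows that only one system call, open (Line 25), actually corresponds to the source code. Nearly all the others do process initialization: loading the program to execute, loading the libc library, setting up memory mappings, and so on.

The if statement and other code in the source do not appear in the strace output because they make no system calls. strace only sees the interaction between the program and the system, so it is not suited to debugging or analyzing a program's logic.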

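4. Common options

Trace child processes: -f
Record system call times

strace can also record timing information for each system call made as the program interacts with the system. The relevant options are r, t, tt, ttt, and T, which record time as follows:

-T: time spent in each system call, with microsecond precision

-r: relative timestamp, the time elapsed since the start of the previous system call (the first call, usually execve, shows 0), with microsecond precision

-t: hours:minutes:seconds

-tt: hours:minutes:seconds.microseconds

-ttt: seconds.microseconds since the epoch

The T option is the most commonly used, since it gives the time spent in each system call. The times recorded by the other options include both system call time and the time spent running user-level code, so they are less useful as a reference. Some of the time options can be combined, for example: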

strace -Tr ./main
0.000000 execve("./main", ["main"], [/* 64 vars */]) = 0
0.000931 fcntl64(0, F_GETFD)= 0 <0.000012>
0.000090 fcntl64(1, F_GETFD)= 0 <0.000022>
0.000060 fcntl64(2, F_GETFD)= 0 <0.000012>
0.000054 uname({sys="Linux", node="ion", ...}) = 0 <0.000014>
0.000307 geteuid32()= 7903 <0.000011>
0.000040 getuid32()= 7903 <0.000012>
0.000039 getegid32()= 200 <0.000011>
0.000039 getgid32()= 200 <0.000011>
……

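The leftmost column is the output of the -r option, and the rightmost column is the output of the -T option.

Trace a running process: strace -p PID

5. Example: using strace to diagnose a hung program

Source of the hanging program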

//hang.c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
int main(int argc, char** argv)
{
    getpid(); // this system call serves as a marker in the trace
    if(argc < 2)
    {
        printf("hang (user|system)\n");
        return 1;
    }
    if(!strcmp(argv[1], "user"))
        while(1);
    else if(!strcmp(argv[1], "system"))
        sleep(500);
    return 0;
}

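The program accepts the argument user or system. The code above uses an infinite loop to simulate a hang in user mode and a call to sleep to simulate a hang in kernel mode.

strace trace output

Trace output for the user-mode hang: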

……
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0xb59000, 4096, PROT_READ)     = 0
munmap(0xb77bf000, 80682)               = 0
getpid()                                = 14539

Trace output for the kernel-mode hang:

……
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0xddf000, 4096, PROT_READ)     = 0
munmap(0xb7855000, 80682)               = 0
getpid()                                = 14543
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({500, 0},