For Redis high availability there are three common approaches: master-replica, Sentinel, and Cluster (master-replica plus sharding).
This article covers the Cluster approach, explaining in detail how cluster mode handles sharding and consistency.
Redis Cluster shards data by hash slots. Since Redis is a key-value store, for each key it computes crc16(key) % 16384 (the total slot count) to get the key's slot; the slot then determines which master node the write goes to.
For a cluster that has already been created, running the `cluster nodes` command prints output similar to the following.
The column after each node's own ip:port shows its role. Only master rows end with an a-b range in the last column, while slave rows end with connected; a-b is the range of slots that master serves.
Common consistency protocols include Raft, Paxos, and Gossip; Redis Cluster uses the gossip protocol.
Cluster's sharding design is decentralized by nature. While Sentinel mode uses a Raft-style protocol, cluster mode switches to gossip, a classic protocol better suited to p2p systems: each shard talks to the outside world only to confirm and update cluster membership and to detect failures, which is exactly what gossip is good at, so it fits Cluster's design.
The port used for cluster bus communication is the node's client port + 10000: if a node serves clients on 6379, its cluster bus port is 16379. Inter-node traffic therefore never interferes with normal reads and writes on the node, and read/write errors on a node do not disturb gossip propagation between nodes.
The cluster-related code lives in /src/cluster.c and /src/cluster.h.
```c
typedef struct clusterNode {
    mstime_t ctime;                  /* Node object creation time. */
    char name[CLUSTER_NAMELEN];      /* Node name, hex string, sha1-size */
    int flags;                       /* CLUSTER_NODE_... */
    uint64_t configEpoch;            /* Last configEpoch observed for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    uint16_t *slot_info_pairs;       /* Slots info represented as (start/end) pair (consecutive index). */
    int slot_info_pairs_count;       /* Used number of slots in slot_info_pairs */
    int numslots;                    /* Number of slots handled by this node */
    int numslaves;                   /* Number of slave nodes, if this is a master */
    struct clusterNode **slaves;     /* pointers to slave nodes */
    struct clusterNode *slaveof;     /* pointer to the master node. Note that it
                                        may be NULL even if the node is a slave
                                        if we don't have the master node in our
                                        tables. */
    mstime_t ping_sent;              /* Unix time we sent latest ping */
    mstime_t pong_received;          /* Unix time we received the pong */
    mstime_t data_received;          /* Unix time we received any data */
    mstime_t fail_time;              /* Unix time when FAIL flag was set */
    mstime_t voted_time;             /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time;       /* Unix time we received offset for this node */
    mstime_t orphaned_time;          /* Starting time of orphaned master condition */
    long long repl_offset;           /* Last known repl offset for this node. */
    char ip[NET_IP_STR_LEN];         /* Latest known IP address of this node */
    sds hostname;                    /* The known hostname for this node */
    int port;                        /* Latest known clients port (TLS or plain). */
    int pport;                       /* Latest known clients plaintext port. Only used
                                        if the main clients port is for TLS. */
    int cport;                       /* Latest known cluster port of this node. */
    clusterLink *link;               /* TCP/IP link established toward this node */
    clusterLink *inbound_link;       /* TCP/IP link accepted from this node */
    list *fail_reports;              /* List of nodes signaling this as failing */
} clusterNode;
```
Most of the mstime_t fields are timestamps used for gossip timeouts and voting. In cluster mode a node's unique identifier is char name[CLUSTER_NAMELEN], always a hex string, as shown in the figure,
where the first column is the node name. If you copy a node's configuration without clearing nodes.conf, node-name conflicts occur and the error log keeps repeating the same error.
```c
/* Add a node to the nodes hash table */
void clusterAddNode(clusterNode *node) {
    int retval;

    retval = dictAdd(server.cluster->nodes,
                     sdsnewlen(node->name,CLUSTER_NAMELEN), node);
    serverAssert(retval == DICT_OK);
}
```
As you can see, node information is stored in a dict keyed by node name, with the value pointing at the clusterNode; this is why char name[CLUSTER_NAMELEN] is the node's unique identifier.
```c
/* Set the specified node 'n' as master for this node.
 * If this node is currently a master, it is turned into a slave. */
void clusterSetMaster(clusterNode *n) {
    serverAssert(n != myself);
    serverAssert(myself->numslots == 0);

    if (nodeIsMaster(myself)) {
        myself->flags &= ~(CLUSTER_NODE_MASTER|CLUSTER_NODE_MIGRATE_TO);
        myself->flags |= CLUSTER_NODE_SLAVE;
        clusterCloseAllSlots();
    } else {
        if (myself->slaveof)
            clusterNodeRemoveSlave(myself->slaveof,myself);
    }
    myself->slaveof = n; /* Record n as our master. */
    clusterNodeAddSlave(n,myself); /* Tell the cluster: since gossip spreads by
                                    * word of mouth, updating ourselves is all
                                    * that is needed; the change propagates to
                                    * the other nodes on its own. */
    replicationSetMaster(n->ip, n->port);
    resetManualFailover();
}
```
```c
int clusterNodeAddSlave(clusterNode *master, clusterNode *slave) {
    int j;

    /* If it's already a slave, don't add it again. */
    for (j = 0; j < master->numslaves; j++)
        if (master->slaves[j] == slave) return C_ERR;
    master->slaves = zrealloc(master->slaves,
        sizeof(clusterNode*)*(master->numslaves+1));
    master->slaves[master->numslaves] = slave;
    master->numslaves++;
    master->flags |= CLUSTER_NODE_MIGRATE_TO;
    return C_OK;
}
```
Adding a slave requires specifying its master. Notice that cluster mode places no limit on the number of slaves per master; in practice one or two slaves per master are usually enough.
```c
/* Add the specified slot to the list of slots that node 'n' will
 * serve. Return C_OK if the operation ended with success.
 * If the slot is already assigned to another instance this is considered
 * an error and C_ERR is returned. */
int clusterAddSlot(clusterNode *n, int slot) {
    /* If this slot is already assigned, fail immediately. */
    if (server.cluster->slots[slot]) return C_ERR;
    clusterNodeSetSlotBit(n,slot);
    server.cluster->slots[slot] = n;
    return C_OK;
}

/* Set the slot bit and return the old value. */
int clusterNodeSetSlotBit(clusterNode *n, int slot) {
    /* Testing a single bit tells us very quickly whether the slot is set. */
    int old = bitmapTestBit(n->slots,slot);
    bitmapSetBit(n->slots,slot);
    if (!old) {
        n->numslots++;
        /* When a master gets its first slot, even if it has no slaves,
         * it gets flagged with MIGRATE_TO, that is, the master is a valid
         * target for replicas migration, if and only if at least one of
         * the other masters has slaves right now.
         *
         * Normally masters are valid targets of replica migration if:
         * 1. They used to have slaves (but no longer have).
         * 2. They are slaves failing over a master that used to have slaves.
         *
         * However new masters with slots assigned are considered valid
         * migration targets if the rest of the cluster is not a slave-less.
         *
         * See https://github.com/redis/redis/issues/3043 for more info. */
        if (n->numslots == 1 && clusterMastersHaveSlaves())
            n->flags |= CLUSTER_NODE_MIGRATE_TO;
    }
    return old;
}

/* Set the bit at position 'pos' in a bitmap. */
void bitmapSetBit(unsigned char *bitmap, int pos) {
    /* The max slot is 16383, so the bitmap array is only 2048 bytes long:
     * bit packing shrinks 16384 entries into 2048 bytes and makes lookups
     * cheap. The same bitmap idea is used heavily in Bloom filters and
     * friends to cut both time and space cost. */
    off_t byte = pos/8;
    int bit = pos&7;
    bitmap[byte] |= 1<<bit;
}
```
```c
/* Delete the specified slot marking it as unassigned.
 * Returns C_OK if the slot was assigned, otherwise if the slot was
 * already unassigned C_ERR is returned. */
int clusterDelSlot(int slot) {
    clusterNode *n = server.cluster->slots[slot];

    if (!n) return C_ERR;

    /* Cleanup the channels in master/replica as part of slot deletion. */
    list *nodes_for_slot = clusterGetNodesServingMySlots(n);
    listNode *ln = listSearchKey(nodes_for_slot, myself);
    if (ln != NULL) {
        removeChannelsInSlot(slot);
    }
    listRelease(nodes_for_slot);
    serverAssert(clusterNodeClearSlotBit(n,slot) == 1);
    server.cluster->slots[slot] = NULL;
    return C_OK;
}
```
```c
/* This is executed 10 times every second */
void clusterCron(void) {
    dictIterator *di;
    dictEntry *de;
    int update_state = 0;
    int orphaned_masters; /* How many masters there are without ok slaves. */
    int max_slaves; /* Max number of ok slaves for a single master. */
    int this_slaves; /* Number of ok slaves for our master (if we are slave). */
    mstime_t min_pong = 0, now = mstime();
    clusterNode *min_pong_node = NULL;
    static unsigned long long iteration = 0;
    mstime_t handshake_timeout;

    /* Incremented every 100ms; every 10th pass triggers the node ping/pong
     * exchange, so ping/pong effectively runs once per second. */
    iteration++; /* Number of times this function was called so far. */

    /* Refresh our hostname, so a DNS change does not make the cluster
     * lose the node. */
    clusterUpdateMyselfHostname();

    /* The handshake timeout is the time after which a handshake node that was
     * not turned into a normal node is removed from the nodes. Usually it is
     * just the NODE_TIMEOUT value, but when NODE_TIMEOUT is too small we use
     * the value of 1 second. The timeout is configurable, but the code
     * enforces a 1000ms floor. */
    handshake_timeout = server.cluster_node_timeout;
    if (handshake_timeout < 1000) handshake_timeout = 1000;

    /* Clear so clusterNodeCronHandleReconnect can count the number of nodes in PFAIL. */
    server.cluster->stats_pfail_nodes = 0;
    /* Clear so clusterNodeCronUpdateClusterLinksMemUsage can count the current
     * memory usage of all cluster links. */
    server.stat_cluster_links_memory = 0;

    /* Run through some of the operations we want to do on each cluster node:
     * iterate over every node this node knows about. */
    di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        /* The sequence goes:
         * 1. We try to shrink link buffers if possible.
         * 2. We free the links whose buffers are still oversized after possible shrinking.
         * 3. We update the latest memory usage of cluster links.
         * 4. We immediately attempt reconnecting after freeing links. */
        clusterNodeCronResizeBuffers(node);
        clusterNodeCronFreeLinkOnBufferLimitReached(node);
        clusterNodeCronUpdateClusterLinksMemUsage(node);
        /* The protocol is that function(s) below return non-zero if the node was
         * terminated. Drop ourselves and unreachable nodes, keeping only the
         * reachable ones; clusterNodeCronHandleReconnect contains the real
         * connection setup and ping/pong reachability test. */
        if(clusterNodeCronHandleReconnect(node, handshake_timeout, now)) continue;
    }
    dictReleaseIterator(di);

    /* Ping some random node 1 time every 10 iterations, so that we usually ping
     * one random node every second. */
    if (!(iteration % 10)) {
        int j;

        /* Check a few random nodes and ping the one with the oldest
         * pong_received time. Even though the loop above visited every
         * reachable node, for efficiency we randomly sample at most five
         * nodes for gossip exchange or voting. */
        for (j = 0; j < 5; j++) {
            de = dictGetRandomKey(server.cluster->nodes);
            clusterNode *this = dictGetVal(de);

            /* Don't ping nodes disconnected or with a ping currently active:
             * they are skipped, but still consume one of this round's five
             * gossip sample slots. */
            if (this->link == NULL || this->ping_sent != 0) continue;
            if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
                continue;
            /* Track the node with the oldest received pong. */
            if (min_pong_node == NULL || min_pong > this->pong_received) {
                min_pong_node = this;
                min_pong = this->pong_received;
            }
        }
        if (min_pong_node) {
            serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
            clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
        }
    }

    /* Iterate nodes to check if we need to flag something as failing.
     * This loop is also responsible to:
     * 1) Check if there are orphaned masters (masters without non failing
     *    slaves).
     * 2) Count the max number of non failing slaves for a single master.
     * 3) Count the number of slaves for our master, if we are a slave. */
    orphaned_masters = 0;
    max_slaves = 0;
    this_slaves = 0;
    di = dictGetSafeIterator(server.cluster->nodes);
    /* Based on the exchanges above, walk all nodes and check whether any of
     * them should be marked as down. */
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        now = mstime(); /* Use an updated time at every iteration. */

        if (node->flags &
            (CLUSTER_NODE_MYSELF|CLUSTER_NODE_NOADDR|CLUSTER_NODE_HANDSHAKE))
                continue;

        /* Orphaned master check, useful only if the current instance
         * is a slave that may migrate to another master. */
        if (nodeIsSlave(myself) && nodeIsMaster(node) && !nodeFailed(node)) {
            int okslaves = clusterCountNonFailingSlaves(node);

            /* A master is orphaned if it is serving a non-zero number of
             * slots, have no working slaves, but used to have at least one
             * slave, or failed over a master that used to have slaves. */
            if (okslaves == 0 && node->numslots > 0 &&
                node->flags & CLUSTER_NODE_MIGRATE_TO)
            {
                orphaned_masters++;
            }
            if (okslaves > max_slaves) max_slaves = okslaves;
            if (nodeIsSlave(myself) && myself->slaveof == node)
                this_slaves = okslaves;
        }

        /* If we are not receiving any data for more than half the cluster
         * timeout, reconnect the link: maybe there is a connection
         * issue even if the node is alive. */
        mstime_t ping_delay = now - node->ping_sent;
        mstime_t data_delay = now - node->data_received;
        /* If this link looks unhealthy, free it right away; the next cron
         * cycle will try to reconnect. */
        if (node->link && /* is connected */
            now - node->link->ctime >
            server.cluster_node_timeout && /* was not already reconnected */
            node->ping_sent && /* we already sent a ping */
            /* and we are waiting for the pong more than timeout/2 */
            ping_delay > server.cluster_node_timeout/2 &&
            /* and in such interval we are not seeing any traffic at all. */
            data_delay > server.cluster_node_timeout/2)
        {
            /* Disconnect the link, it will be reconnected automatically. */
            freeClusterLink(node->link);
        }

        /* If we have currently no active ping in this instance, and the
         * received PONG is older than half the cluster timeout, send
         * a new ping now, to ensure all the nodes are pinged without
         * a too big delay. */
        if (node->link &&
            node->ping_sent == 0 &&
            (now - node->pong_received) > server.cluster_node_timeout/2)
        {
            clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
            continue;
        }

        /* If we are a master and one of the slaves requested a manual
         * failover, ping it continuously: keep the link alive until the
         * cluster failover completes. */
        if (server.cluster->mf_end &&
            nodeIsMaster(myself) &&
            server.cluster->mf_slave == node &&
            node->link)
        {
            clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
            continue;
        }

        /* Check only if we have an active ping for this instance. */
        if (node->ping_sent == 0) continue;

        /* Check if this node looks unreachable.
         * Note that if we already received the PONG, then node->ping_sent
         * is zero, so can't reach this code at all, so we don't risk of
         * checking for a PONG delay if we didn't sent the PING.
         *
         * We also consider every incoming data as proof of liveness, since
         * our cluster bus link is also used for data: under heavy data
         * load pong delays are possible. */
        mstime_t node_delay = (ping_delay < data_delay) ? ping_delay :
                                                          data_delay;
        /* Compute the delay; on timeout mark this master as possibly failing.
         * PFAIL means "suspected down": while suspected, the state is gossiped
         * to the other nodes so they can vote and confirm the failure. */
        if (node_delay > server.cluster_node_timeout) {
            /* Timeout reached. Set the node as possibly failing if it is
             * not already in this state. */
            if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
                serverLog(LL_DEBUG,"*** NODE %.40s possibly failing",
                    node->name);
                node->flags |= CLUSTER_NODE_PFAIL;
                update_state = 1;
            }
        }
    }
    dictReleaseIterator(di);

    /* If we are a slave node but the replication is still turned off,
     * enable it if we know the address of our master and it appears to
     * be up: this is where the slave is told it can start replicating. */
    if (nodeIsSlave(myself) &&
        server.masterhost == NULL &&
        myself->slaveof &&
        nodeHasAddr(myself->slaveof))
    {
        replicationSetMaster(myself->slaveof->ip, myself->slaveof->port);
    }

    /* Abort a manual failover if the timeout is reached. */
    manualFailoverCheckTimeout();

    if (nodeIsSlave(myself)) {
        clusterHandleManualFailover();
        if (!(server.cluster_module_flags & CLUSTER_MODULE_FLAG_NO_FAILOVER))
            clusterHandleSlaveFailover();
        /* If there are orphaned slaves, and we are a slave among the masters
         * with the max number of non-failing slaves, consider migrating to
         * the orphaned masters. Note that it does not make sense to try
         * a migration if there is no master with at least *two* working
         * slaves. */
        if (orphaned_masters && max_slaves >= 2 && this_slaves == max_slaves &&
            server.cluster_allow_replica_migration)
            clusterHandleSlaveMigration(max_slaves);
    }

    if (update_state || server.cluster->state == CLUSTER_FAIL)
        clusterUpdateState();
}
```
As shown, cluster mode's combination of sharding and master-slave replication is not hard to follow: once you understand how gossip works, the cluster's master-slave failover falls into place.
There are also some small tools that can be used with cluster mode.
Redis offers several persistence options: without touching disk you can rely on replication alone; on disk there are RDB, AOF, or RDB+AOF.
This article covers RDB persistence in detail.
In Redis, RDB is the snapshot persistence mode. Its character is obvious: at some point in time, it writes out the entire data set.
RDB snapshots are triggered in two ways: manually or automatically.
Manual snapshots use the save and bgsave commands; bg stands for background, making bgsave the asynchronous version of save. Because Redis processes data on a single thread, a plain save blocks the node for the whole backup, which is why bgsave exists. Use bgsave instead of save in both production and test environments, and consider hiding save behind rename-command; in production, save is every bit as dangerous as keys *.
Automatic snapshots are configured with `save <seconds> <changes>` lines in the conf file: if at least <changes> keys changed within <seconds> seconds, bgsave is invoked automatically. Multiple save lines are allowed, e.g. writing both `save 10 20` and `save 900 1`; whenever any one rule is satisfied, Redis calls bgsave to persist.
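For example, a minimal conf fragment with the two rules above (the thresholds are illustrative, not recommendations):

```conf
# Call BGSAVE if at least 1 key changed within 900 seconds,
# or if at least 20 keys changed within 10 seconds.
save 900 1
save 10 20
```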
As the flow above shows, Redis's RDB mechanism is at heart a periodic backup framework: until the next backup cycle arrives, the durability of recent writes is not guaranteed.
```c
struct _rio {
    /* Backend functions.
     * Since this functions do not tolerate short writes or reads the return
     * value is simplified to: zero on error, non zero on complete success. */
    size_t (*read)(struct _rio *, void *buf, size_t len);
    size_t (*write)(struct _rio *, const void *buf, size_t len);
    off_t (*tell)(struct _rio *);
    int (*flush)(struct _rio *);
    /* The update_cksum method if not NULL is used to compute the checksum of
     * all the data that was read or written so far. The method should be
     * designed so that can be called with the current checksum, and the buf
     * and len fields pointing to the new block of data to add to the checksum
     * computation. */
    void (*update_cksum)(struct _rio *, const void *buf, size_t len);

    /* The current checksum and flags (see RIO_FLAG_*) */
    uint64_t cksum, flags;

    /* number of bytes read or written */
    size_t processed_bytes;

    /* maximum single read or write chunk size */
    size_t max_processing_chunk;

    /* Backend-specific vars. */
    union {
        /* In-memory buffer target. */
        struct {
            sds ptr;
            off_t pos;
        } buffer;
        /* Stdio file pointer target. */
        struct {
            FILE *fp;
            off_t buffered; /* Bytes written since last fsync. */
            off_t autosync; /* fsync after 'autosync' bytes written. */
        } file;
        /* Connection object (used to read from socket) */
        struct {
            connection *conn;   /* Connection */
            off_t pos;          /* pos in buf that was returned */
            sds buf;            /* buffered data */
            size_t read_limit;  /* don't allow to buffer/read more than that */
            size_t read_so_far; /* amount of data read from the rio (not buffered) */
        } conn;
        /* FD target (used to write to pipe). */
        struct {
            int fd;       /* File descriptor. */
            off_t pos;
            sds buf;
        } fd;
    } io;
};
```
rio is Redis's wrapper around I/O; think of it as "redis io". It abstracts read, write and friends, and during RDB or AOF the actual operating-system I/O all goes through rio.
```c
/* Save the DB on disk. Return C_ERR on error, C_OK on success. */
int rdbSave(char *filename, rdbSaveInfo *rsi) {
    char tmpfile[256];    /* Temp file name, built with snprintf(). */
    char cwd[MAXPATHLEN]; /* Current working dir path for error messages. */
    FILE *fp = NULL;
    rio rdb;
    int error = 0;

    /* Write to a temp file starting with "temp-"; it is renamed only after
     * the whole save completed. */
    snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
    fp = fopen(tmpfile,"w");
    /* On open failure log the fixed-format "Failed opening the RDB file ..." */
    if (!fp) {
        char *cwdp = getcwd(cwd,MAXPATHLEN);
        serverLog(LL_WARNING,
            "Failed opening the RDB file %s (in server root dir %s) "
            "for saving: %s",
            filename,
            cwdp ? cwdp : "unknown",
            strerror(errno));
        return C_ERR;
    }

    /* Initialize the rio structure used to talk to the OS's I/O layer. */
    rioInitWithFile(&rdb,fp);
    startSaving(RDBFLAGS_NONE);

    /* If enabled, fsync the data to disk in batches to smooth out I/O;
     * enabling this is recommended. */
    if (server.rdb_save_incremental_fsync)
        rioSetAutoSync(&rdb,REDIS_AUTOSYNC_BYTES);

    if (rdbSaveRio(&rdb,&error,RDBFLAGS_NONE,rsi) == C_ERR) {
        errno = error;
        goto werr;
    }

    /* Make sure data will not remain on the OS's output buffers */
    if (fflush(fp)) goto werr;
    if (fsync(fileno(fp))) goto werr;
    if (fclose(fp)) { fp = NULL; goto werr; }
    fp = NULL;

    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok: rename temp-*.rdb to the real
     * .rdb file name. */
    if (rename(tmpfile,filename) == -1) {
        char *cwdp = getcwd(cwd,MAXPATHLEN);
        serverLog(LL_WARNING,
            "Error moving temp DB file %s on the final "
            "destination %s (in server root dir %s): %s",
            tmpfile,
            filename,
            cwdp ? cwdp : "unknown",
            strerror(errno));
        unlink(tmpfile);
        stopSaving(0);
        return C_ERR;
    }

    /* Fixed-format success log: "DB saved on disk". */
    serverLog(LL_NOTICE,"DB saved on disk");
    server.dirty = 0;
    server.lastsave = time(NULL);
    server.lastbgsave_status = C_OK;
    stopSaving(1);
    return C_OK;

werr:
    /* Fixed-format error log: "Write error saving DB on disk". */
    serverLog(LL_WARNING,"Write error saving DB on disk: %s", strerror(errno));
    if (fp) fclose(fp);
    unlink(tmpfile);
    stopSaving(0);
    return C_ERR;
}
```
rdbSave performs the actual disk write; for bgsave it is executed in full, as a sub-step, by rdbSaveBackground.
```c
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;

    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);

    /* Fork a child process; redisFork is only a thin wrapper, the real fork
     * is still done by the operating system. */
    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        int retval;

        /* Child */
        redisSetProcTitle("redis-rdb-bgsave");
        redisSetCpuAffinity(server.bgsave_cpulist);
        /* Run rdbSave in full as the child's job. */
        retval = rdbSave(filename,rsi);
        if (retval == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
        }
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        /* Parent */
        if (childpid == -1) {
            server.lastbgsave_status = C_ERR;
            serverLog(LL_WARNING,"Can't save in background: fork: %s",
                strerror(errno));
            return C_ERR;
        }
        serverLog(LL_NOTICE,"Background saving started by pid %ld",(long) childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_type = RDB_CHILD_TYPE_DISK;
        return C_OK;
    }
    return C_OK; /* unreached */
}
```
It leans on the operating system's fork and calls rdbSave in full inside the child process to persist the data.
The core code:

```c
if (!hasActiveChildProcess() &&
    server.rdb_bgsave_scheduled &&
    (server.unixtime-server.lastbgsave_try > CONFIG_BGSAVE_RETRY_DELAY ||
     server.lastbgsave_status == C_OK))
{
    rdbSaveInfo rsi, *rsiptr;
    rsiptr = rdbPopulateSaveInfo(&rsi);
    if (rdbSaveBackground(server.rdb_filename,rsiptr) == C_OK)
        server.rdb_bgsave_scheduled = 0;
}
```
As you can see, automatic RDB goes through rdbSaveBackground, and before bgsave it checks whether a bgsave is already in progress, preventing a disaster of mass-forked child processes.
rdb-save-incremental-fsync: when yes, Redis fsyncs the data to disk in batches to smooth out I/O, keeping each fsync to at most about 32 MB of data. Recommended on; supported since Redis 5.
stop-writes-on-bgsave-error: whether to stop accepting writes when a bgsave snapshot fails. Recommended: no.
rdbcompression: whether to compress the RDB file.
rdbchecksum: whether to verify the RDB checksum.
Compared with a real database, RDB persistence is fairly coarse, and the fork always raises worries about latency spikes in Redis, though recovery from an RDB file is comparatively pleasant. Compared with AOF, RDB offers weaker data-safety guarantees, and bgsave needs extra memory for copying data. For production, if you care about data safety, do not rely on RDB alone; prefer AOF or AOF+RDB. At minimum, AOF's binlog-like record makes tracing a failure much easier.
RDB files are also hard to read: you need something like `od -A x -t x1c -v dump.rdb` to dump the file as hex before you can inspect it.
Redis is a caching key-value database, and the notion of a "database" is embodied in the db structure. By default Redis has 16 databases, numbered 0 through 15, with db 0 used by default. The databases are isolated from each other, but are still served by the same single thread. In cluster mode, all nodes use only db0.
The underlying data structures and the higher-level containers were collected in earlier articles,
of which the dict data structure matters most here: it is the core support of the db structure.
Although every client command enters through the command function of its higher-level container, each command function starts by looking up the key in the db. Different container types share the same db keyspace. For example,
if a key is first used by the t_string container, later trying to store the same key via t_zset results in an error.
In short, within a Redis instance every stored key-value pair has its key written into the dict of the corresponding db, regardless of type; the dict stores only the key and the address of the value, and type checking is the job of the higher-level container. That is also why every add, delete, or update in a container makes an extra call into the db structure to keep it in sync, and it is one of the reasons Redis's core get/set path cannot be multi-threaded.
```c
/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag
                                   one by one, gradually. */
} redisDb;
```
Here, *dict is the core keyspace and expires handles key expiration. blocking_keys sees little use, since in all of Redis only blpop and friends block on purpose. ready_keys works together with blocking_keys: on the next push, Redis checks whether blocking_keys contains the key and then acts accordingly. watched_keys implements the watch feature, but watch is very costly for Redis performance; avoid it in production.
```c
/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        /* Update the access time for the ageing algorithm.
         * Don't do it if we have a saving child, as this will trigger
         * a copy on write madness. */
        if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                updateLFU(val);
            } else {
                val->lru = LRU_CLOCK();
            }
        }
        return val;
    } else {
        return NULL;
    }
}

/* Lookup a key for read operations, or return NULL if the key is not found
 * in the specified DB.
 *
 * As a side effect of calling this function:
 * 1. A key gets expired if it reached it's TTL.
 * 2. The key last access time is updated.
 * 3. The global keys hits/misses stats are updated (reported in INFO).
 * 4. If keyspace notifications are enabled, a "keymiss" notification is fired.
 *
 * This API should not be used when we write to the key after obtaining
 * the object linked to the key, but only for read only operations.
 *
 * Flags change the behavior of this command:
 *
 *  LOOKUP_NONE (or zero): no special flags are passed.
 *  LOOKUP_NOTOUCH: don't alter the last access time of the key.
 *
 * Note: this function also returns NULL if the key is logically expired
 * but still existing, in case this is a slave, since this API is called only
 * for read operations. Even if the key expiry is master-driven, we can
 * correctly report a key is expired on slaves even if the master is lagging
 * expiring our key via DELs in the replication link. */
robj *lookupKeyReadWithFlags(redisDb *db, robj *key, int flags) {
    robj *val;

    /* Expiration check. */
    if (expireIfNeeded(db,key) == 1) {
        /* If we are in the context of a master, expireIfNeeded() returns 1
         * when the key is no longer valid, so we can return NULL ASAP. */
        if (server.masterhost == NULL)
            goto keymiss;

        /* However if we are in the context of a slave, expireIfNeeded() will
         * not really try to expire the key, it only returns information
         * about the "logical" status of the key: key expiring is up to the
         * master in order to have a consistent view of master's data set.
         *
         * However, if the command caller is not the master, and as additional
         * safety measure, the command invoked is a read-only command, we can
         * safely return NULL here, and provide a more consistent behavior
         * to clients accessing expired values in a read-only fashion, that
         * will say the key as non existing.
         *
         * Notably this covers GETs when slaves are used to scale reads. */
        if (server.current_client &&
            server.current_client != server.master &&
            server.current_client->cmd &&
            server.current_client->cmd->flags & CMD_READONLY)
        {
            goto keymiss;
        }
    }
    val = lookupKey(db,key,flags);
    /* Hit/miss stats; one set of counters shared by all DBs. */
    if (val == NULL)
        goto keymiss;
    server.stat_keyspace_hits++;
    return val;

keymiss:
    if (!(flags & LOOKUP_NONOTIFY)) {
        notifyKeyspaceEvent(NOTIFY_KEY_MISS, "keymiss", key, db->id);
    }
    server.stat_keyspace_misses++;
    return NULL;
}
```
As you can see, looking up a key ultimately is just a find on the dict. Thanks to dict's incremental rehash, the final hash table rarely ends up badly skewed; the downside of incremental rehash is that with a huge number of keys, each rehash step has many more buckets to move. Applications should also wrap their keys defensively to guard against hash-collision attacks.
Adding a key:
```c
/* Add the key to the DB. It's up to the caller to increment the reference
 * counter of the value if needed.
 *
 * The program is aborted if the key already exists. */
void dbAdd(redisDb *db, robj *key, robj *val) {
    sds copy = sdsdup(key->ptr);
    dictEntry *de = dictAddRaw(db->dict, copy, NULL);
    serverAssertWithInfo(NULL, key, de != NULL);
    dictSetVal(db->dict, de, val);
    /* On every add, check whether a blocked client (BLPOP etc.) is waiting
     * on this key. */
    signalKeyAsReady(db, key, val->type);
    /* In cluster mode, record which slot this key belongs to. */
    if (server.cluster_enabled) slotToKeyAddEntry(de);
}
```
```c
/* FLUSHALL [ASYNC]
 *
 * Flushes the whole server data set. */
void flushallCommand(client *c) {
    int flags;
    if (getFlushCommandFlags(c,&flags) == C_ERR) return;
    flushAllDataAndResetRDB(flags);
    addReply(c,shared.ok);
}

/* Flushes the whole server data set. */
void flushAllDataAndResetRDB(int flags) {
    server.dirty += emptyDb(-1,flags,NULL);
    if (server.child_type == CHILD_TYPE_RDB) killRDBChild();
    if (server.saveparamslen > 0) {
        /* Normally rdbSave() will reset dirty, but we don't want this here
         * as otherwise FLUSHALL will not be replicated nor put into the AOF. */
        int saved_dirty = server.dirty;
        rdbSaveInfo rsi, *rsiptr;
        rsiptr = rdbPopulateSaveInfo(&rsi);
        rdbSave(server.rdb_filename,rsiptr);
        server.dirty = saved_dirty;
    }

    /* Without that extra dirty++, when db was already empty, FLUSHALL will
     * not be replicated nor put into the AOF. */
    server.dirty++;
#if defined(USE_JEMALLOC)
    /* jemalloc 5 doesn't release pages back to the OS when there's no traffic.
     * for large databases, flushdb blocks for long anyway, so a bit more won't
     * harm and this way the flush and purge will be synchronous. */
    if (!(flags & EMPTYDB_ASYNC))
        jemalloc_purge();
#endif
}

long long emptyDb(int dbnum, int flags, void(callback)(dict*)) {
    int async = (flags & EMPTYDB_ASYNC);
    RedisModuleFlushInfoV1 fi = {REDISMODULE_FLUSHINFO_VERSION,!async,dbnum};
    long long removed = 0;

    if (dbnum < -1 || dbnum >= server.dbnum) {
        errno = EINVAL;
        return -1;
    }

    /* Fire the flushdb modules event. */
    moduleFireServerEvent(REDISMODULE_EVENT_FLUSHDB,
                          REDISMODULE_SUBEVENT_FLUSHDB_START,
                          &fi);

    /* Make sure the WATCHed keys are affected by the FLUSH* commands.
     * Note that we need to call the function while the keys are still
     * there. */
    signalFlushedDb(dbnum, async);

    /* Empty redis database structure. */
    removed = emptyDbStructure(server.db, dbnum, async, callback);

    /* Flush slots to keys map if enable cluster, we can flush entire
     * slots to keys map whatever dbnum because only support one DB
     * in cluster mode. */
    if (server.cluster_enabled) slotToKeyFlush();

    if (dbnum == -1) flushSlaveKeysWithExpireList();

    /* Also fire the end event. Note that this event will fire almost
     * immediately after the start event if the flush is asynchronous. */
    moduleFireServerEvent(REDISMODULE_EVENT_FLUSHDB,
                          REDISMODULE_SUBEVENT_FLUSHDB_END,
                          &fi);

    return removed;
}

/* Set CLIENT_DIRTY_CAS to all clients of DB when DB is dirty.
 * It may happen in the following situations:
 * FLUSHDB, FLUSHALL, SWAPDB
 *
 * replaced_with: for SWAPDB, the WATCH should be invalidated if
 * the key exists in either of them, and skipped only if it
 * doesn't exist in both. */
void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with) {
    listIter li;
    listNode *ln;
    dictEntry *de;

    if (dictSize(emptied->watched_keys) == 0) return;

    dictIterator *di = dictGetSafeIterator(emptied->watched_keys);
    while((de = dictNext(di)) != NULL) {
        robj *key = dictGetKey(de);
        if (dictFind(emptied->dict, key->ptr) ||
            (replaced_with && dictFind(replaced_with->dict, key->ptr)))
        {
            list *clients = dictGetVal(de);
            if (!clients) continue;
            listRewind(clients,&li);
            while((ln = listNext(&li))) {
                client *c = listNodeValue(ln);
                c->flags |= CLIENT_DIRTY_CAS;
            }
        }
    }
    dictReleaseIterator(di);
}
```
The db structure is not complicated overall; once you know the underlying data structures and the higher-level containers, it is easy to follow. Cluster mode can only use db0, and in everyday use most applications also stick to db0. Databases cannot be renamed, nor can the number of databases be changed at runtime. Judging from the source, supporting dynamic db expansion or db renaming would not be hard; the painful part would be compatibility across the various client drivers. When using Redis, if your setup allows, spreading keys over multiple DBs makes each db's rehash a little faster.
The stream data structure first appeared in Redis 5, backing t_stream, which implements a real message queue in Redis. Stream was created specifically to fix the flaws of pub/sub: pub/sub's original design does not support evolving multi-tenant features, reserves no hooks for persistence, and cannot guarantee message ordering.
Stream is positioned as a low-level data structure, but it is itself built on rax and listpack, both covered in detail in earlier articles; if you are not familiar with them, read those two first.
An important concept in stream is the streamID, which encodes the message's timestamp (streamIDs may be user-supplied, but must be strictly increasing). As far as the underlying structure is concerned, you can think of the streamID as the message's key and the message body as its value.
```c
typedef struct streamID {
    uint64_t ms;        /* Unix time in milliseconds. */
    uint64_t seq;       /* Sequence number. */
} streamID;

typedef struct stream {
    rax *rax;               /* The radix tree holding the stream. */
    uint64_t length;        /* Number of elements inside this stream. */
    streamID last_id;       /* Zero if there are yet no items. */
    rax *cgroups;           /* Consumer groups dictionary: name -> streamCG */
} stream;
```
As you can see, the whole stream hangs off a single rax tree, but the tree stores only streamIDs; the message content for each streamID lives in a listpack that the tree node points to.
Consumer:
```c
/* Consumer group. */
typedef struct streamCG {
    streamID last_id;       /* Last delivered (not acknowledged) ID for this
                               group. Consumers that will just ask for more
                               messages will served with IDs > than this. */
    rax *pel;               /* Pending entries list. This is a radix tree that
                               has every message delivered to consumers (without
                               the NOACK option) that was yet not acknowledged
                               as processed. The key of the radix tree is the
                               ID as a 64 bit big endian number, while the
                               associated value is a streamNACK structure.*/
    rax *consumers;         /* A radix tree representing the consumers by name
                               and their associated representation in the form
                               of streamConsumer structures. */
} streamCG;

/* A specific consumer in a consumer group. */
typedef struct streamConsumer {
    mstime_t seen_time;         /* Last time this consumer was active. */
    sds name;                   /* Consumer name. This is how the consumer
                                   will be identified in the consumer group
                                   protocol. Case sensitive. */
    rax *pel;                   /* Consumer specific pending entries list: all
                                   the pending messages delivered to this
                                   consumer not yet acknowledged. Keys are
                                   big endian message IDs, while values are
                                   the same streamNACK structure referenced
                                   in the "pel" of the consumer group structure
                                   itself, so the value is shared. */
} streamConsumer;

/* Pending (yet not acknowledged) message in a consumer group. */
typedef struct streamNACK {
    mstime_t delivery_time;     /* Last time this message was delivered. */
    uint64_t delivery_count;    /* Number of times this message was delivered.*/
    streamConsumer *consumer;   /* The consumer this message was delivered to
                                   in the last delivery. */
} streamNACK;
```
Looking at the consumer side as a whole, the distinctive piece is the NACK: much like acknowledgments in TCP's handshakes, the pending (not-yet-acknowledged) entries tracked via streamNACK are what guarantee that messages are actually consumed.
With the structures clear, let's walk through the code to see how a stream is created, produced to, and consumed.
```c
/* Create a new stream data structure. */
stream *streamNew(void) {
    stream *s = zmalloc(sizeof(*s));
    s->rax = raxNew();
    s->length = 0;
    s->last_id.ms = 0;
    s->last_id.seq = 0;
    s->cgroups = NULL; /* Created on demand to save memory when not used. */
    return s;
}
```
Creating a new stream is routine: just allocate it and initialize the rax and the last ID.
int streamAppendItem(stream *s, robj **argv, int64_t numfields, streamID *added_id, streamID *use_id) { /* Generate the new entry ID. */ streamID id; //use_id为生产者自定义的id,如果没有自定义则根据当前时间生成streamID if (use_id) id = *use_id; else streamNextID(&s->last_id,&id); /* Check that the new ID is greater than the last entry ID * or return an error. Automatically generated IDs might * overflow (and wrap-around) when incrementing the sequence part. */ //假如id产生乱序,则拒绝新增直接报错 if (streamCompareID(&id,&s->last_id) <= 0) { errno = EDOM; return C_ERR; } /* Avoid overflow when trying to add an element to the stream (listpack * can only host up to 32bit length sttrings, and also a total listpack size * can't be bigger than 32bit length. */ //考虑到性能与存储问题,stream的消息数被redis人为的定义了上限,为32bit size_t totelelen = 0; for (int64_t i = 0; i < numfields*2; i++) { sds ele = argv[i]->ptr; totelelen += sdslen(ele); } if (totelelen > STREAM_LISTPACK_MAX_SIZE) { errno = ERANGE; return C_ERR; } /* Add the new entry. */ raxIterator ri; raxStart(&ri,s->rax); //迭代器走到最右边节点 raxSeek(&ri,"$",NULL,0); size_t lp_bytes = 0; /* Total bytes in the tail listpack. */ unsigned char *lp = NULL; /* Tail listpack pointer. */ //寻找最后一个节点的listpack if (!raxEOF(&ri)) { /* Get a reference to the tail node listpack. */ lp = ri.data; lp_bytes = lpBytes(lp); } raxStop(&ri); /* We have to add the key into the radix tree in lexicographic order, * to do so we consider the ID as a single 128 bit number written in * big endian, so that the most significant bytes are the first ones. */ uint64_t rax_key[2]; /* Key in the radix tree containing the listpack.*/ streamID master_id; /* ID of the master entry in the listpack. */ /* Create a new listpack and radix tree node if needed. Note that when * a new listpack is created, we populate it with a "master entry". This * is just a set of fields that is taken as references in order to compress * the stream entries that we'll add inside the listpack. 
* * Note that while we use the first added entry fields to create * the master entry, the first added entry is NOT represented in the master * entry, which is a stand alone object. But of course, the first entry * will compress well because it's used as reference. * * The master entry is composed like in the following example: * * +-------+---------+------------+---------+--/--+---------+---------+-+ * | count | deleted | num-fields | field_1 | field_2 | ... | field_N |0| * +-------+---------+------------+---------+--/--+---------+---------+-+ * * count and deleted just represent respectively the total number of * entries inside the listpack that are valid, and marked as deleted * (deleted flag in the entry flags set). So the total number of items * actually inside the listpack (both deleted and not) is count+deleted. * * The real entries will be encoded with an ID that is just the * millisecond and sequence difference compared to the key stored at * the radix tree node containing the listpack (delta encoding), and * if the fields of the entry are the same as the master entry fields, the * entry flags will specify this fact and the entry fields and number * of fields will be omitted (see later in the code of this function). * * The "0" entry at the end is the same as the 'lp-count' entry in the * regular stream entries (see below), and marks the fact that there are * no more entries, when we scan the stream from right to left. */ /* First of all, check if we can append to the current macro node or * if we need to switch to the next one. 'lp' will be set to NULL if * the current node is full. */ if (lp != NULL) { size_t node_max_bytes = server.stream_node_max_bytes; if (node_max_bytes == 0 || node_max_bytes > STREAM_LISTPACK_MAX_SIZE) node_max_bytes = STREAM_LISTPACK_MAX_SIZE; if (lp_bytes + totelelen >= node_max_bytes) { lp = NULL; } else if (server.stream_node_max_entries) { unsigned char *lp_ele = lpFirst(lp); /* Count both live entries and deleted ones. 
*/ int64_t count = lpGetInteger(lp_ele) + lpGetInteger(lpNext(lp,lp_ele)); if (count >= server.stream_node_max_entries) { /* Shrink extra pre-allocated memory */ lp = lpShrinkToFit(lp); if (ri.data != lp) raxInsert(s->rax,ri.key,ri.key_len,lp,NULL); lp = NULL; } } } //假如当前的listpack超过了最大节点数量,需要新建listpack int flags = STREAM_ITEM_FLAG_NONE; if (lp == NULL) { master_id = id; streamEncodeID(rax_key,&id); /* Create the listpack having the master entry ID and fields. * Pre-allocate some bytes when creating listpack to avoid realloc on * every XADD. Since listpack.c uses malloc_size, it'll grow in steps, * and won't realloc on every XADD. * When listpack reaches max number of entries, we'll shrink the * allocation to fit the data. */ size_t prealloc = STREAM_LISTPACK_MAX_PRE_ALLOCATE; if (server.stream_node_max_bytes > 0 && server.stream_node_max_bytes < prealloc) { prealloc = server.stream_node_max_bytes; } lp = lpNew(prealloc); lp = lpAppendInteger(lp,1); /* One item, the one we are adding. */ lp = lpAppendInteger(lp,0); /* Zero deleted so far. */ lp = lpAppendInteger(lp,numfields); for (int64_t i = 0; i < numfields; i++) { sds field = argv[i*2]->ptr; lp = lpAppend(lp,(unsigned char*)field,sdslen(field)); } lp = lpAppendInteger(lp,0); /* Master entry zero terminator. */ //将当前的listpack添加到rax中 raxInsert(s->rax,(unsigned char*)&rax_key,sizeof(rax_key),lp,NULL); /* The first entry we insert, has obviously the same fields of the * master entry. */ flags |= STREAM_ITEM_FLAG_SAMEFIELDS; } else { //此分支无需新建listpack,修改rax上的listpack即可 serverAssert(ri.key_len == sizeof(rax_key)); memcpy(rax_key,ri.key,sizeof(rax_key)); /* Read the master ID from the radix tree key. */ streamDecodeID(rax_key,&master_id); unsigned char *lp_ele = lpFirst(lp); /* Update count and skip the deleted fields. */ int64_t count = lpGetInteger(lp_ele); lp = lpReplaceInteger(lp,&lp_ele,count+1); lp_ele = lpNext(lp,lp_ele); /* seek deleted. */ lp_ele = lpNext(lp,lp_ele); /* seek master entry num fields. 
*/ /* Check if the entry we are adding, have the same fields * as the master entry. */ int64_t master_fields_count = lpGetInteger(lp_ele); lp_ele = lpNext(lp,lp_ele); if (numfields == master_fields_count) { int64_t i; for (i = 0; i < master_fields_count; i++) { sds field = argv[i*2]->ptr; int64_t e_len; unsigned char buf[LP_INTBUF_SIZE]; unsigned char *e = lpGet(lp_ele,&e_len,buf); /* Stop if there is a mismatch. */ if (sdslen(field) != (size_t)e_len || memcmp(e,field,e_len) != 0) break; lp_ele = lpNext(lp,lp_ele); } /* All fields are the same! We can compress the field names * setting a single bit in the flags. */ if (i == master_fields_count) flags |= STREAM_ITEM_FLAG_SAMEFIELDS; } } /* Populate the listpack with the new entry. We use the following * encoding: * * +-----+--------+----------+-------+-------+-/-+-------+-------+--------+ * |flags|entry-id|num-fields|field-1|value-1|...|field-N|value-N|lp-count| * +-----+--------+----------+-------+-------+-/-+-------+-------+--------+ * * However if the SAMEFIELD flag is set, we have just to populate * the entry with the values, so it becomes: * * +-----+--------+-------+-/-+-------+--------+ * |flags|entry-id|value-1|...|value-N|lp-count| * +-----+--------+-------+-/-+-------+--------+ * * The entry-id field is actually two separated fields: the ms * and seq difference compared to the master entry. * * The lp-count field is a number that states the number of listpack pieces * that compose the entry, so that it's possible to travel the entry * in reverse order: we can just start from the end of the listpack, read * the entry, and jump back N times to seek the "flags" field to read * the stream full entry. 
*/ //插入flags数据 lp = lpAppendInteger(lp,flags); lp = lpAppendInteger(lp,id.ms - master_id.ms); lp = lpAppendInteger(lp,id.seq - master_id.seq); if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS)) lp = lpAppendInteger(lp,numfields); for (int64_t i = 0; i < numfields; i++) { sds field = argv[i*2]->ptr, value = argv[i*2+1]->ptr; if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS)) lp = lpAppend(lp,(unsigned char*)field,sdslen(field)); lp = lpAppend(lp,(unsigned char*)value,sdslen(value)); } /* Compute and store the lp-count field. */ int64_t lp_count = numfields; lp_count += 3; /* Add the 3 fixed fields flags + ms-diff + seq-diff. */ if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS)) { /* If the item is not compressed, it also has the fields other than * the values, and an additional num-fields field. */ lp_count += numfields+1; } lp = lpAppendInteger(lp,lp_count); /* Insert back into the tree in order to update the listpack pointer. */ if (ri.data != lp) raxInsert(s->rax,(unsigned char*)&rax_key,sizeof(rax_key),lp,NULL); s->length++; s->last_id = id; if (added_id) *added_id = id; return C_OK;}
As the code shows, a listpack is not dedicated to a single message. Every listpack contains a master entry, which stores the field names of the first message inserted when the listpack was created. Because messages within one stream are usually similar, a later message whose fields match the first one's does not store its field names again; it simply reuses the master entry's.
streamCG *streamCreateCG(stream *s, char *name, size_t namelen, streamID *id) {
    if (s->cgroups == NULL) s->cgroups = raxNew();
    if (raxFind(s->cgroups,(unsigned char*)name,namelen) != raxNotFound)
        return NULL;
    streamCG *cg = zmalloc(sizeof(*cg));
    cg->pel = raxNew();
    cg->consumers = raxNew();
    cg->last_id = *id;
    raxInsert(s->cgroups,(unsigned char*)name,namelen,cg,NULL);
    return cg;
}
Creating a consumer group is fairly simple: the group is bound to a stream, and subsequent group reads go through the streamCG entry point straight to the stream's rax structure.
/* Delete the specified item ID from the stream, returning 1 if the item * was deleted 0 otherwise (if it does not exist). */int streamDeleteItem(stream *s, streamID *id) { int deleted = 0; streamIterator si; streamIteratorStart(&si,s,id,id,0); streamID myid; int64_t numfields; if (streamIteratorGetID(&si,&myid,&numfields)) { streamIteratorRemoveEntry(&si,&myid); deleted = 1; } streamIteratorStop(&si); return deleted;}void streamIteratorRemoveEntry(streamIterator *si, streamID *current) { unsigned char *lp = si->lp; int64_t aux; /* We do not really delete the entry here. Instead we mark it as * deleted flagging it, and also incrementing the count of the * deleted entries in the listpack header. * * We start flagging: */ int flags = lpGetInteger(si->lp_flags); flags |= STREAM_ITEM_FLAG_DELETED; // 设置消息的标志位 lp = lpReplaceInteger(lp,&si->lp_flags,flags); /* Change the valid/deleted entries count in the master entry. */ unsigned char *p = lpFirst(lp); aux = lpGetInteger(p); //当此条消息为listpack的最后一组元素,则可以释放这个listpack,否则仅仅记录flag,直到元素全部删除后才真正释放listpack内存 if (aux == 1) { /* If this is the last element in the listpack, we can remove the whole * node. */ lpFree(lp); raxRemove(si->stream->rax,si->ri.key,si->ri.key_len,NULL); } else { //如果listpack还有其余元素,则修改listpack master enty的count信息,将aux减一 /* In the base case we alter the counters of valid/deleted entries. */ lp = lpReplaceInteger(lp,&p,aux-1); p = lpNext(lp,p); /* Seek deleted field. */ aux = lpGetInteger(p); lp = lpReplaceInteger(lp,&p,aux+1); /* Update the listpack with the new pointer. */ //由于listpack有可能存在扩缩容或编码格式变化,因此这里需要判断是否需要更新内存地址 if (si->lp != lp) raxInsert(si->stream->rax,si->ri.key,si->ri.key_len,lp,NULL); } /* Update the number of entries counter. */ si->stream->length--; /* Re-seek the iterator to fix the now messed up state. 
*/ streamID start, end; if (si->rev) { streamDecodeID(si->start_key,&start); end = *current; } else { start = *current; streamDecodeID(si->end_key,&end); } //更新streamIterator streamIteratorStop(si); streamIteratorStart(si,si->stream,&start,&end,si->rev); /* TODO: perform a garbage collection here if the ration between * deleted and valid goes over a certain limit. */}
Structurally, stream is fairly involved: it needs listpack and rax working together. rax's prefix compression gives stream excellent storage efficiency for streamIDs, and the master-entry scheme inside each listpack gives individual messages a high compression ratio as well. For Redis, the main advantage of pitching stream against Kafka or Pulsar is Redis's raw performance; in practice it is advisable to split what used to be one JSON blob per message into groups of field+value pairs.
Stream's obvious weakness is that, constrained by Redis's own design, a single stream lives entirely on one node even in redis-cluster mode; there is nothing like a traditional MQ's multiple partitions or replicas (redundancy relies solely on master-replica sync, not a true replica protocol). Because redis-cluster uses the gossip protocol and master-replica sync has no hard ack mechanism, data can be lost while news of a master failure is still being gossiped around. Compared with dedicated MQs such as Kafka or Pulsar, Redis's message queue is therefore still rather rudimentary.
rax is Redis's own implementation of a radix tree, structurally also known as a prefix tree (or prefix-compressed tree). It was introduced in Redis 5 as one of the underlying data structures of stream.
In memory a rax is a sequential layout, but it is easiest to understand when drawn as a tree.
A rax stores data in two node formats: non-compressed and compressed.
Taking the example from the source comments, suppose the three words foo, footer, and foobar are inserted.
With non-compressed nodes, the tree looks like this:
              (f) ""
                \
                (o) "f"
                  \
                  (o) "fo"
                    \
                  [t   b] "foo"
                  /     \
         "foot" (e)     (a) "foob"
                /         \
      "foote" (r)         (r) "fooba"
              /             \
    "footer" []             [] "foobar"
As you can see, each node holds a single character. footer and foobar share the prefix foo, so building character by character already produces three levels of nodes before the branch point.
With compressed nodes, the tree looks like this:
              ["foo"] ""
                 |
              [t   b] "foo"
              /     \
     "foot" ("er")  ("ar") "foob"
            /          \
  "footer" []          [] "foobar"
Here each node holds a string, normally the longest common prefix at that branch. The common prefix foo of footer and foobar occupies a single node, so there is only one level before the branch point.
Back to the code. First, what the structures look like:
#define RAX_NODE_MAX_SIZE ((1<<29)-1)typedef struct raxNode { uint32_t iskey:1; /* Does this node contain a key? */ uint32_t isnull:1; /* Associated value is NULL (don't store it). */ uint32_t iscompr:1; /* Node is compressed. */ uint32_t size:29; /* Number of children, or compressed string len. */ /* Data layout is as follows: * * If node is not compressed we have 'size' bytes, one for each children * character, and 'size' raxNode pointers, point to each child node. * Note how the character is not stored in the children but in the * edge of the parents: * * [header iscompr=0][abc][a-ptr][b-ptr][c-ptr](value-ptr?) * * if node is compressed (iscompr bit is 1) the node has 1 children. * In that case the 'size' bytes of the string stored immediately at * the start of the data section, represent a sequence of successive * nodes linked one after the other, for which only the last one in * the sequence is actually represented as a node, and pointed to by * the current compressed node. * * [header iscompr=1][xyz][z-ptr](value-ptr?) * * Both compressed and not compressed nodes can represent a key * with associated data in the radix tree at any level (not just terminal * nodes). * * If the node has an associated key (iskey=1) and is not NULL * (isnull=0), then after the raxNode pointers pointing to the * children, an additional value pointer is present (as you can see * in the representation above as "value-ptr" field). */ unsigned char data[];} raxNode;
typedef struct rax {
    raxNode *head;
    uint64_t numele;
    uint64_t numnodes;
} rax;
rax is the entry point to the whole tree: head records the root raxNode, numele is the number of elements (keys), and numnodes is the number of nodes. Because every key terminates at its own node while the tree also contains non-key internal nodes (including the always-present head), numnodes >= numele in most raxes.
rax *raxNew(void) {
    /* rax_malloc is essentially malloc. */
    rax *rax = rax_malloc(sizeof(*rax));
    if (rax == NULL) return NULL;
    rax->numele = 0;
    /* The empty head node also counts toward numnodes. */
    rax->numnodes = 1;
    rax->head = raxNewNode(0,0);
    if (rax->head == NULL) {
        rax_free(rax);
        return NULL;
    } else {
        return rax;
    }
}
int raxInsert(rax *rax, unsigned char *s, size_t len, void *data, void **old) { return raxGenericInsert(rax,s,len,data,old,1);}/* Insert the element 's' of size 'len', setting as auxiliary data * the pointer 'data'. If the element is already present, the associated * data is updated (only if 'overwrite' is set to 1), and 0 is returned, * otherwise the element is inserted and 1 is returned. On out of memory the * function returns 0 as well but sets errno to ENOMEM, otherwise errno will * be set to 0. */int raxGenericInsert(rax *rax, unsigned char *s, size_t len, void *data, void **old, int overwrite) { size_t i; int j = 0; /* Split position. If raxLowWalk() stops in a compressed node, the index 'j' represents the char we stopped within the compressed node, that is, the position where to split the node for insertion. */ raxNode *h, **parentlink; debugf("### Insert %.*s with value %p\n", (int)len, s, data); //找寻最优存放节点位置 //j用来记录分裂的位置,因为c没有go一样的多返回值设计,所以在这里使用指针代替多返回值,通过一个函数拿到i j两个值的结果 i = raxLowWalk(rax,s,len,&h,&parentlink,&j,NULL); /* If i == len we walked following the whole string. If we are not * in the middle of a compressed node, the string is either already * inserted or this middle node is currently not a key, but can represent * our key. We have just to reallocate the node and make space for the * data pointer. */ //假如i==len,那么说明整个遍历了一遍 //如果当前节点为非压缩节点,且没有找到分裂位置,那么就说明这个字符串已经存在或者是个非元素节点 if (i == len && (!h->iscompr || j == 0 /* not in the middle if j is 0 */)) { debugf("### Insert: node representing key exists\n"); /* Make space for the value pointer if needed. */ //如果不是key,需要重新分配节点空间,更新指针位置 if (!h->iskey || (h->isnull && overwrite)) { h = raxReallocForData(h,data); if (h) memcpy(parentlink,&h,sizeof(h)); } if (h == NULL) { errno = ENOMEM; return 0; } /* Update the existing key if there is already one. 
*/ //如果是key,那么说明当前字符串已经存在,需要设置新的value if (h->iskey) { if (old) *old = raxGetData(h); if (overwrite) raxSetData(h,data); errno = 0; return 0; /* Element already exists. */ } /* Otherwise set the node as a key. Note that raxSetData() * will set h->iskey. */ //说明键也不存在,需要set键值对 raxSetData(h,data); rax->numele++; return 1; /* Element inserted. */ } /* If the node we stopped at is a compressed node, we need to * split it before to continue. * * Splitting a compressed node have a few possible cases. * Imagine that the node 'h' we are currently at is a compressed * node containing the string "ANNIBALE" (it means that it represents * nodes A -> N -> N -> I -> B -> A -> L -> E with the only child * pointer of this node pointing at the 'E' node, because remember that * we have characters at the edges of the graph, not inside the nodes * themselves. * * In order to show a real case imagine our node to also point to * another compressed node, that finally points at the node without * children, representing 'O': * * "ANNIBALE" -> "SCO" -> [] * * When inserting we may face the following cases. Note that all the cases * require the insertion of a non compressed node with exactly two * children, except for the last case which just requires splitting a * compressed node. * * 1) Inserting "ANNIENTARE" * * |B| -> "ALE" -> "SCO" -> [] * "ANNI" -> |-| * |E| -> (... continue algo ...) "NTARE" -> [] * * 2) Inserting "ANNIBALI" * * |E| -> "SCO" -> [] * "ANNIBAL" -> |-| * |I| -> (... continue algo ...) [] * * 3) Inserting "AGO" (Like case 1, but set iscompr = 0 into original node) * * |N| -> "NIBALE" -> "SCO" -> [] * |A| -> |-| * |G| -> (... continue algo ...) |O| -> [] * * 4) Inserting "CIAO" * * |A| -> "NNIBALE" -> "SCO" -> [] * |-| * |C| -> (... continue algo ...) "IAO" -> [] * * 5) Inserting "ANNI" * * "ANNI" -> "BALE" -> "SCO" -> [] * * The final algorithm for insertion covering all the above cases is as * follows. 
* * ============================= ALGO 1 ============================= * * For the above cases 1 to 4, that is, all cases where we stopped in * the middle of a compressed node for a character mismatch, do: * * Let $SPLITPOS be the zero-based index at which, in the * compressed node array of characters, we found the mismatching * character. For example if the node contains "ANNIBALE" and we add * "ANNIENTARE" the $SPLITPOS is 4, that is, the index at which the * mismatching character is found. * * 1. Save the current compressed node $NEXT pointer (the pointer to the * child element, that is always present in compressed nodes). * * 2. Create "split node" having as child the non common letter * at the compressed node. The other non common letter (at the key) * will be added later as we continue the normal insertion algorithm * at step "6". * * 3a. IF $SPLITPOS == 0: * Replace the old node with the split node, by copying the auxiliary * data if any. Fix parent's reference. Free old node eventually * (we still need its data for the next steps of the algorithm). * * 3b. IF $SPLITPOS != 0: * Trim the compressed node (reallocating it as well) in order to * contain $splitpos characters. Change child pointer in order to link * to the split node. If new compressed node len is just 1, set * iscompr to 0 (layout is the same). Fix parent's reference. * * 4a. IF the postfix len (the length of the remaining string of the * original compressed node after the split character) is non zero, * create a "postfix node". If the postfix node has just one character * set iscompr to 0, otherwise iscompr to 1. Set the postfix node * child pointer to $NEXT. * * 4b. IF the postfix len is zero, just use $NEXT as postfix pointer. * * 5. Set child[0] of split node to postfix node. * * 6. Set the split node as the current node, set current index at child[1] * and continue insertion algorithm as usually. 
* * ============================= ALGO 2 ============================= * * For case 5, that is, if we stopped in the middle of a compressed * node but no mismatch was found, do: * * Let $SPLITPOS be the zero-based index at which, in the * compressed node array of characters, we stopped iterating because * there were no more keys character to match. So in the example of * the node "ANNIBALE", adding the string "ANNI", the $SPLITPOS is 4. * * 1. Save the current compressed node $NEXT pointer (the pointer to the * child element, that is always present in compressed nodes). * * 2. Create a "postfix node" containing all the characters from $SPLITPOS * to the end. Use $NEXT as the postfix node child pointer. * If the postfix node length is 1, set iscompr to 0. * Set the node as a key with the associated value of the new * inserted key. * * 3. Trim the current node to contain the first $SPLITPOS characters. * As usually if the new node length is just 1, set iscompr to 0. * Take the iskey / associated value as it was in the original node. * Fix the parent's reference. * * 4. Set the postfix node as the only child pointer of the trimmed * node created at step 1. */ /* ------------------------- ALGORITHM 1 --------------------------- */ //如果是压缩节点的话,并且没有整个遍历一遍,说明需要找分裂节点了 if (h->iscompr && i != len) { debugf("ALGO 1: Stopped at compressed node %.*s (%p)\n", h->size, h->data, (void*)h); debugf("Still to insert: %.*s\n", (int)(len-i), s+i); debugf("Splitting at %d: '%c'\n", j, ((char*)h->data)[j]); debugf("Other (key) letter is '%c'\n", s[i]); /* 1: Save next pointer. */ //获取最后一个子节点的位置 raxNode **childfield = raxNodeLastChildPtr(h); //用来做数据保存使用的 raxNode *next; memcpy(&next,childfield,sizeof(next)); debugf("Next is %p\n", (void*)next); debugf("iskey %d\n", h->iskey); if (h->iskey) { debugf("key value is %p\n", raxGetData(h)); } /* Set the length of the additional nodes we will need. 
*/ //trimmedlen用来计算len使用 size_t trimmedlen = j; size_t postfixlen = h->size - j - 1; int split_node_is_key = !trimmedlen && h->iskey && !h->isnull; size_t nodesize; /* 2: Create the split node. Also allocate the other nodes we'll need * ASAP, so that it will be simpler to handle OOM. */ //创建新的raxnode节点 raxNode *splitnode = raxNewNode(1, split_node_is_key); raxNode *trimmed = NULL; raxNode *postfix = NULL; //如果停留在raxnode的中间,那么需要将原节点前面部分字符串转化成新节点的长度 if (trimmedlen) { nodesize = sizeof(raxNode)+trimmedlen+raxPadding(trimmedlen)+ sizeof(raxNode*); if (h->iskey && !h->isnull) nodesize += sizeof(void*); trimmed = rax_malloc(nodesize); } //与trimmedlen相反,将原节点后面部分字符串转化成新节点的长度 if (postfixlen) { nodesize = sizeof(raxNode)+postfixlen+raxPadding(postfixlen)+ sizeof(raxNode*); postfix = rax_malloc(nodesize); } /* OOM? Abort now that the tree is untouched. */ //redis中少见的内存分配异常处理,oom的话需要回滚操作,释放内存 if (splitnode == NULL || (trimmedlen && trimmed == NULL) || (postfixlen && postfix == NULL)) { rax_free(splitnode); rax_free(trimmed); rax_free(postfix); errno = ENOMEM; return 0; } //赋予数据 splitnode->data[0] = h->data[j]; //j==0,代表者不再压缩节点中,需要用分裂节点代替原先的节点位置 if (j == 0) { /* 3a: Replace the old node with the split node. */ if (h->iskey) { void *ndata = raxGetData(h); raxSetData(splitnode,ndata); } memcpy(parentlink,&splitnode,sizeof(splitnode)); } else { //如果在压缩节点中,需要分裂压缩节点 /* 3b: Trim the compressed node. */ trimmed->size = j; //需要将前缀拷贝到新子节点中 memcpy(trimmed->data,h->data,j); trimmed->iscompr = j > 1 ? 1 : 0; trimmed->iskey = h->iskey; trimmed->isnull = h->isnull; if (h->iskey && !h->isnull) { void *ndata = raxGetData(h); raxSetData(trimmed,ndata); } // 获取新子节点的最后一个子节点的指针,并且赋予分裂节点的值,将当前新子节点的值赋给父节点,让父节点指向现在的新子节点 raxNode **cp = raxNodeLastChildPtr(trimmed); memcpy(cp,&splitnode,sizeof(splitnode)); memcpy(parentlink,&trimmed,sizeof(trimmed)); parentlink = cp; /* Set parentlink to splitnode parent. 
*/ rax->numnodes++; } /* 4: Create the postfix node: what remains of the original * compressed node after the split. */ //上边只是创建了新的前缀节点,这里需要创建新的后缀节点 if (postfixlen) { /* 4a: create a postfix node. */ postfix->iskey = 0; postfix->isnull = 0; postfix->size = postfixlen; postfix->iscompr = postfixlen > 1; memcpy(postfix->data,h->data+j+1,postfixlen); raxNode **cp = raxNodeLastChildPtr(postfix); memcpy(cp,&next,sizeof(next)); rax->numnodes++; } else { /* 4b: just use next as postfix node. */ postfix = next; } /* 5: Set splitnode first child as the postfix node. */ //获取分裂节点最后一个子节点 raxNode **splitchild = raxNodeLastChildPtr(splitnode); //postfix指向子节点 memcpy(splitchild,&postfix,sizeof(postfix)); /* 6. Continue insertion: this will cause the splitnode to * get a new child (the non common character at the currently * inserted key). */ //h已经无用,释放 rax_free(h); h = splitnode; } else if (h->iscompr && i == len) { //待查找节点的键在压缩节点中被匹配到,那么仍然需要裁剪压缩节点 /* ------------------------- ALGORITHM 2 --------------------------- */ debugf("ALGO 2: Stopped at compressed node %.*s (%p) j = %d\n", h->size, h->data, (void*)h, j); /* Allocate postfix & trimmed nodes ASAP to fail for OOM gracefully. */ size_t postfixlen = h->size - j; size_t nodesize = sizeof(raxNode)+postfixlen+raxPadding(postfixlen)+ sizeof(raxNode*); if (data != NULL) nodesize += sizeof(void*); raxNode *postfix = rax_malloc(nodesize); nodesize = sizeof(raxNode)+j+raxPadding(j)+sizeof(raxNode*); if (h->iskey && !h->isnull) nodesize += sizeof(void*); raxNode *trimmed = rax_malloc(nodesize); if (postfix == NULL || trimmed == NULL) { rax_free(postfix); rax_free(trimmed); errno = ENOMEM; return 0; } /* 1: Save next pointer. */ //保存原先子节点的入口 raxNode **childfield = raxNodeLastChildPtr(h); raxNode *next; memcpy(&next,childfield,sizeof(next)); /* 2: Create the postfix node. 
*/ //创建新的后缀节点 postfix->size = postfixlen; postfix->iscompr = postfixlen > 1; postfix->iskey = 1; postfix->isnull = 0; memcpy(postfix->data,h->data+j,postfixlen); raxSetData(postfix,data); raxNode **cp = raxNodeLastChildPtr(postfix); memcpy(cp,&next,sizeof(next)); rax->numnodes++; /* 3: Trim the compressed node. */ //裁剪原先的压缩节点 trimmed->size = j; trimmed->iscompr = j > 1; trimmed->iskey = 0; trimmed->isnull = 0; memcpy(trimmed->data,h->data,j); memcpy(parentlink,&trimmed,sizeof(trimmed)); //如果原先压缩节点为key,那么新的需要保持一致 if (h->iskey) { void *aux = raxGetData(h); raxSetData(trimmed,aux); } /* Fix the trimmed node child pointer to point to * the postfix node. */ //做指针重新指定, cp = raxNodeLastChildPtr(trimmed); memcpy(cp,&postfix,sizeof(postfix)); /* Finish! We don't need to continue with the insertion * algorithm for ALGO 2. The key is already inserted. */ rax->numele++; rax_free(h); return 1; /* Key inserted. */ } /* We walked the radix tree as far as we could, but still there are left * chars in our string. We need to insert the missing nodes. */ //上述代码是在rax中找到了匹配,这里处理没有匹配的情况,如果匹配之后仍然生下来部分没有匹配到的,需要单独处理 while(i < len) { raxNode *child; /* If this node is going to have a single child, and there * are other characters, so that that would result in a chain * of single-childed nodes, turn it into a compressed node. 
*/ //假如说当前的节点有一个子节点和其他的字符,那么可以转换成压缩节点,减少一个单一字符子节点 if (h->size == 0 && len-i > 1) { debugf("Inserting compressed node\n"); size_t comprsize = len-i; if (comprsize > RAX_NODE_MAX_SIZE) comprsize = RAX_NODE_MAX_SIZE; raxNode *newh = raxCompressNode(h,s+i,comprsize,&child); if (newh == NULL) goto oom; h = newh; memcpy(parentlink,&h,sizeof(h)); parentlink = raxNodeLastChildPtr(h); i += comprsize; } else { //如果只有一个字符的节点或者是个非空节点,需要添加子节点 debugf("Inserting normal node\n"); raxNode **new_parentlink; raxNode *newh = raxAddChild(h,s[i],&child,&new_parentlink); if (newh == NULL) goto oom; h = newh; memcpy(parentlink,&h,sizeof(h)); parentlink = new_parentlink; i++; } rax->numnodes++; //做迭代 h = child; } //保存新节点的数据 raxNode *newh = raxReallocForData(h,data); if (newh == NULL) goto oom; h = newh; if (!h->iskey) rax->numele++; raxSetData(h,data); memcpy(parentlink,&h,sizeof(h)); return 1; /* Element inserted. */oom: /* This code path handles out of memory after part of the sub-tree was * already modified. Set the node as a key, and then remove it. However we * do that only if the node is a terminal node, otherwise if the OOM * happened reallocating a node in the middle, we don't need to free * anything. */ if (h->size == 0) { h->isnull = 1; h->iskey = 1; rax->numele++; /* Compensate the next remove. */ assert(raxRemove(rax,s,i,NULL) != 0); } errno = ENOMEM; return 0;}
In terms of insertion, the overall complexity stays within O(n^2), but because so many cases are involved (compressed vs. non-compressed layouts, whether to split a node, merging leaves, and so on), the code is riddled with if branches and is unpleasant to read. The core goal never changes: build exactly the structure illustrated above. There is no special trick in insert; just follow the logic.
raxRemove
int raxRemove(rax *rax, unsigned char *s, size_t len, void **old) { raxNode *h; raxStack ts; debugf("### Delete: %.*s\n", (int)len, s); raxStackInit(&ts); int splitpos = 0; size_t i = raxLowWalk(rax,s,len,&h,NULL,&splitpos,&ts); //没有匹配到字符串或者字符串不是一个键,那么无需操作直接返回即可 if (i != len || (h->iscompr && splitpos != 0) || !h->iskey) { raxStackFree(&ts); return 0; } //冗余下原值 if (old) *old = raxGetData(h); h->iskey = 0; rax->numele--; /* If this node has no children, the deletion needs to reclaim the * no longer used nodes. This is an iterative process that needs to * walk the three upward, deleting all the nodes with just one child * that are not keys, until the head of the rax is reached or the first * node with more than one child is found. */ int trycompress = 0; /* Will be set to 1 if we should try to optimize the tree resulting from the deletion. */ //找到的目的节点没有任何子节点,那么只需要删除当前节点且父节点指针初始化即可 if (h->size == 0) { debugf("Key deleted in node without children. Cleanup needed.\n"); raxNode *child = NULL; //如果不是头节点的话,需要向上遍历依次删除 while(h != rax->head) { child = h; debugf("Freeing child %p [%.*s] key:%d\n", (void*)child, (int)child->size, (char*)child->data, child->iskey); //释放当前节点 rax_free(child); rax->numnodes--; h = raxStackPop(&ts); /* If this node has more then one child, or actually holds * a key, stop here. */ //如果父节点为key或者父节点还有其他子节点,那么结束循环直接跳出即可 if (h->iskey || (!h->iscompr && h->size != 1)) break; } if (child) { debugf("Unlinking child %p from parent %p\n", (void*)child, (void*)h); raxNode *new = raxRemoveChild(h,child); //需重新分配地址 if (new != h) { raxNode *parent = raxStackPeek(&ts); raxNode **parentlink; if (parent == NULL) { parentlink = &rax->head; } else { parentlink = raxFindParentLink(parent,h); } memcpy(parentlink,&new,sizeof(new)); } /* If after the removal the node has just a single child * and is not a key, we need to try to compress it. 
*/ //如果移除的节点只有一个子节点并且不是一个键,需要尝试压缩 if (new->size == 1 && new->iskey == 0) { trycompress = 1; h = new; } } } else if (h->size == 1) { /* If the node had just one child, after the removal of the key * further compression with adjacent nodes is potentially possible. */ //如果这个节点只有一个孩子,那么也需要尝试压缩 trycompress = 1; } /* Don't try node compression if our nodes pointers stack is not * complete because of OOM while executing raxLowWalk() */ //假如会出现oom,撤销压缩操作 if (trycompress && ts.oom) trycompress = 0; /* Recompression: if trycompress is true, 'h' points to a radix tree node * that changed in a way that could allow to compress nodes in this * sub-branch. Compressed nodes represent chains of nodes that are not * keys and have a single child, so there are two deletion events that * may alter the tree so that further compression is needed: * * 1) A node with a single child was a key and now no longer is a key. * 2) A node with two children now has just one child. * * We try to navigate upward till there are other nodes that can be * compressed, when we reach the upper node which is not a key and has * a single child, we scan the chain of children to collect the * compressible part of the tree, and replace the current node with the * new one, fixing the child pointer to reference the first non * compressible node. * * Example of case "1". A tree stores the keys "FOO" = 1 and * "FOOBAR" = 2: * * * "FOO" -> "BAR" -> [] (2) * (1) * * After the removal of "FOO" the tree can be compressed as: * * "FOOBAR" -> [] (2) * * * Example of case "2". 
A tree stores the keys "FOOBAR" = 1 and * "FOOTER" = 2: * * |B| -> "AR" -> [] (1) * "FOO" -> |-| * |T| -> "ER" -> [] (2) * * After the removal of "FOOTER" the resulting tree is: * * "FOO" -> |B| -> "AR" -> [] (1) * * That can be compressed into: * * "FOOBAR" -> [] (1) */ //尝试压缩 if (trycompress) { debugf("After removing %.*s:\n", (int)len, s); debugnode("Compression may be needed",h); debugf("Seek start node\n"); /* Try to reach the upper node that is compressible. * At the end of the loop 'h' will point to the first node we * can try to compress and 'parent' to its parent. */ raxNode *parent; //压缩时,需要先找到当前节点的父节点,在循环之后,h指向第一个我们可以压缩的节点并且parent指向父节点 while(1) { parent = raxStackPop(&ts); if (!parent || parent->iskey || (!parent->iscompr && parent->size != 1)) break; h = parent; debugnode("Going up to",h); } raxNode *start = h; /* Compression starting node. */ /* Scan chain of nodes we can compress. */ size_t comprsize = h->size; int nodes = 1; while(h->size != 0) { raxNode **cp = raxNodeLastChildPtr(h); memcpy(&h,cp,sizeof(h)); if (h->iskey || (!h->iscompr && h->size != 1)) break; /* Stop here if going to the next node would result into * a compressed node larger than h->size can hold. */ if (comprsize + h->size > RAX_NODE_MAX_SIZE) break; nodes++; comprsize += h->size; } if (nodes > 1) { /* If we can compress, create the new node and populate it. */ size_t nodesize = sizeof(raxNode)+comprsize+raxPadding(comprsize)+sizeof(raxNode*); raxNode *new = rax_malloc(nodesize); /* An out of memory here just means we cannot optimize this * node, but the tree is left in a consistent state. */ if (new == NULL) { raxStackFree(&ts); return 1; } new->iskey = 0; new->isnull = 0; new->iscompr = 1; new->size = comprsize; rax->numnodes++; /* Scan again, this time to populate the new node content and * to fix the new node child pointer. At the same time we free * all the nodes that we'll no longer use. 
*/ comprsize = 0; h = start; while(h->size != 0) { //将需要合并的节点合并到一个新的节点 memcpy(new->data+comprsize,h->data,h->size); comprsize += h->size; raxNode **cp = raxNodeLastChildPtr(h); raxNode *tofree = h; memcpy(&h,cp,sizeof(h)); //旧的节点在合并后释放 rax_free(tofree); rax->numnodes--; if (h->iskey || (!h->iscompr && h->size != 1)) break; } debugnode("New node",new); /* Now 'h' points to the first node that we still need to use, * so our new node child pointer will point to it. */ //让新节点的子指针指向被删除节点的后面的节点 raxNode **cp = raxNodeLastChildPtr(new); memcpy(cp,&h,sizeof(h)); /* Fix parent link. */ if (parent) { //让被删除节点指向新节点地址 raxNode **parentlink = raxFindParentLink(parent,start); memcpy(parentlink,&new,sizeof(new)); } else { rax->head = new; } debugf("Compressed %d nodes, %d total bytes\n", nodes, (int)comprsize); } } raxStackFree(&ts); return 1;}
As shown, the remove operation is much simpler than insert: it only needs to locate the current level, walk upward to check whether parent nodes should also be deleted, and, after deletion, check whether child nodes can be merged.
Looking at the rax implementation as a whole, the need to cover every case makes the insert and delete operations fairly complex; each operation function is quite long, and overall performance is middling.
Redis's rax implementation is functionally complete, but its structure suggests no plan for aggressive performance optimization later. Since 6 it is widely used, for example to store stream IDs for the stream type and in ACL security policies.
This post is a roundup; for the analysis of all of Redis's upper-layer containers, see the following articles:
Redis source code study: t_string
Redis source code study: t_list
Redis source code study: t_hash
Redis source code study: t_set
Redis source code study: t_zset
Redis source code study: stream
As listed above, each upper-layer container implements the concrete Redis commands such as set, get, and del. The upper-layer containers wrap and combine the low-level data structures, flexibly choosing different combinations (also called encodings) according to the key's properties or the collection's size.
The complete mapping is shown in the figure below.
Among the upper-layer containers, the special case is hyperloglog, which originates from a paper; its difficulty lies in understanding the algorithm rather than in manipulating the underlying data structures.
Among the underlying data structures, the special case is ziplist: because of a serious flaw (cascading updates), listpack was introduced in 5 as its replacement, and in 6 ziplist was removed from the upper-layer container call chain entirely.
t_zset is the container for Redis's sorted set. It is similar to t_set but adds complex operations such as ranking and range queries. Since 6 its underlying structures are listpack and dict+skiplist; the poorly performing ziplist was dropped (in t_hash, ziplist was likewise replaced by the higher-performance listpack), marking ziplist's formal exit from the middle-layer container stage.
The underlying data structures are covered in detail in their own articles; readers unfamiliar with them may want to read those three first.
Since 6, zset uses listpack and dict+skiplist, with the encodings OBJ_ENCODING_LISTPACK and OBJ_ENCODING_SKIPLIST.
# Similarly to hashes and lists, sorted sets are also specially encoded in
# order to save a lot of space. This encoding is only used when the length and
# elements of a sorted set are below the following limits:
zset-max-listpack-entries 128
zset-max-listpack-value 64
listpack encoding: used on the first add when zset-max-listpack-entries is non-zero and the element length does not exceed zset-max-listpack-value; afterwards kept as long as every element is no longer than zset-max-listpack-value and the element count stays within zset-max-listpack-entries.
skiplist encoding: used on the first add when zset-max-listpack-entries is 0 or the element length exceeds zset-max-listpack-value; afterwards the set is converted once an element longer than zset-max-listpack-value is inserted or the element count exceeds zset-max-listpack-entries.
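The two rules above can be condensed into a small decision sketch (hypothetical helper and type names; the real checks live in zaddGenericCommand and zsetAdd):

```c
#include <stddef.h>

/* Hypothetical stand-ins for server.zset_max_listpack_entries /
 * server.zset_max_listpack_value; not the real Redis config struct. */
typedef enum { ENC_LISTPACK, ENC_SKIPLIST } zset_enc;

/* Encoding chosen when the key is created by the first ZADD:
 * mirrors the check in zaddGenericCommand. */
zset_enc zset_initial_encoding(size_t max_entries, size_t max_value,
                               size_t first_ele_len) {
    if (max_entries == 0 || first_ele_len > max_value)
        return ENC_SKIPLIST;
    return ENC_LISTPACK;
}

/* Whether a listpack-encoded zset must convert after an insert:
 * mirrors the check in zsetAdd. Conversion is one-way. */
int zset_needs_convert(size_t max_entries, size_t max_value,
                       size_t cur_entries, size_t new_ele_len) {
    return cur_entries > max_entries || new_ele_len > max_value;
}
```

With the default config (128/64), a first element of 100 bytes creates the set directly as skiplist; a 129th element triggers the one-way conversion.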
t_zset provides generic functions such as zaddGenericCommand, zremrangeGenericCommand, and zmpopGenericCommand that handle the overall add/remove persistence, notifications, and pub/sub; zrangeGenericCommand handles generic iteration, and, as in t_set, zunionInterDiffGenericCommand uniformly handles intersection, union, and difference.
zaddGenericCommand
/* This generic command implements both ZADD and ZINCRBY. */void zaddGenericCommand(client *c, int flags) { static char *nanerr = "resulting score is not a number (NaN)"; robj *key = c->argv[1]; robj *zobj; sds ele; double score = 0, *scores = NULL; int j, elements, ch = 0; int scoreidx = 0; /* The following vars are used in order to track what the command actually * did during the execution, to reply to the client and to trigger the * notification of keyspace change. */ //用来标记不同的操作记录 int added = 0; /* Number of new elements added. */ int updated = 0; /* Number of elements with updated score. */ int processed = 0; /* Number of elements processed, may remain zero with options like XX. */ /* Parse options. At the end 'scoreidx' is set to the argument position * of the score of the first score-element pair. */ //处理命令解析,zset由于有分数的概念,同时支持多个元素的操作,因此arg个数将有可能很多,需要统一进行处理 scoreidx = 2; while(scoreidx < c->argc) { char *opt = c->argv[scoreidx]->ptr; if (!strcasecmp(opt,"nx")) flags |= ZADD_IN_NX; else if (!strcasecmp(opt,"xx")) flags |= ZADD_IN_XX; else if (!strcasecmp(opt,"ch")) ch = 1; /* Return num of elements added or updated. */ else if (!strcasecmp(opt,"incr")) flags |= ZADD_IN_INCR; else if (!strcasecmp(opt,"gt")) flags |= ZADD_IN_GT; else if (!strcasecmp(opt,"lt")) flags |= ZADD_IN_LT; else break; scoreidx++; } /* Turn options into simple to check vars. */ //与上边的arg解析连贯起来,用来解析后转为变量供之后的逻辑调用 int incr = (flags & ZADD_IN_INCR) != 0; int nx = (flags & ZADD_IN_NX) != 0; int xx = (flags & ZADD_IN_XX) != 0; int gt = (flags & ZADD_IN_GT) != 0; int lt = (flags & ZADD_IN_LT) != 0; /* After the options, we expect to have an even number of args, since * we expect any number of score-element pairs. */ //计算命令中元素与分值的总数 elements = c->argc-scoreidx; if (elements % 2 || !elements) { addReplyErrorObject(c,shared.syntaxerr); return; } elements /= 2; /* Now this holds the number of score-element pairs. */ /* Check for incompatible options. 
*/ // NX and XX are mutually exclusive, so check here and return early if (nx && xx) { addReplyError(c, "XX and NX options at the same time are not compatible"); return; } // GT, LT and NX are mutually exclusive, so check and return early if ((gt && nx) || (lt && nx) || (gt && lt)) { addReplyError(c, "GT, LT, and/or NX options at the same time are not compatible"); return; } /* Note that XX is compatible with either GT or LT */ // INCR can only operate on a single element if (incr && elements > 1) { addReplyError(c, "INCR option supports a single increment-element pair"); return; } /* Start parsing all the scores, we need to emit any syntax error * before executing additions to the sorted set, as the command should * either execute fully or nothing at all. */ // allocate a buffer to cache all the scores scores = zmalloc(sizeof(double)*elements); for (j = 0; j < elements; j++) { // on a malformed score, goto the cleanup label for a unified free if (getDoubleFromObjectOrReply(c,c->argv[scoreidx+j*2],&scores[j],NULL) != C_OK) goto cleanup; } /* Lookup the key and create the sorted set if does not exist. */ // check whether the zset key exists; a type mismatch likewise goes to cleanup for a unified free zobj = lookupKeyWrite(c->db,key); if (checkType(c,zobj,OBJ_ZSET)) goto cleanup; if (zobj == NULL) { if (xx) goto reply_to_client; /* No key + XX option: nothing to do. */ // choose the encoding for the first add if (server.zset_max_listpack_entries == 0 || server.zset_max_listpack_value < sdslen(c->argv[scoreidx+1]->ptr)) { zobj = createZsetObject(); } else { zobj = createZsetListpackObject(); } dbAdd(c->db,key,zobj); } for (j = 0; j < elements; j++) { double newscore; score = scores[j]; int retflags = 0; ele = c->argv[scoreidx+1+j*2]->ptr; // zsetAdd is relatively complex: it must handle the score structure properly int retval = zsetAdd(zobj, score, ele, flags, &retflags, &newscore); if (retval == 0) { addReplyError(c,nanerr); goto cleanup; } if (retflags & ZADD_OUT_ADDED) added++; if (retflags & ZADD_OUT_UPDATED) updated++; if (!(retflags & ZADD_OUT_NOP)) processed++; score = newscore; } server.dirty += (added+updated);reply_to_client: if (incr) { /* ZINCRBY or INCR option. 
*/ if (processed) addReplyDouble(c,score); else addReplyNull(c); } else { /* ZADD. */ addReplyLongLong(c,ch ? added+updated : added); }cleanup: zfree(scores); if (added || updated) { // handle the modified-key signal and pub/sub notification in one place signalModifiedKey(c,c->db,key); notifyKeyspaceEvent(NOTIFY_ZSET, incr ? "zincr" : "zadd", key, c->db->id); }}
After this unified wrapping, there are quite a few if checks: NX/XX are mutually exclusive, and GT/LT/NX are mutually exclusive with one another. zset also has many logic functions overall; because of score's special role (reserved for ranking), it cannot directly call the listpack and skiplist wrapper functions, which is why the code complexity in t_zset is relatively high.
zsetAdd underlying logic:
/* Add a new element or update the score of an existing element in a sorted * set, regardless of its encoding. * * The set of flags change the command behavior. * * The input flags are the following: * * ZADD_INCR: Increment the current element score by 'score' instead of updating * the current element score. If the element does not exist, we * assume 0 as previous score. * ZADD_NX: Perform the operation only if the element does not exist. * ZADD_XX: Perform the operation only if the element already exist. * ZADD_GT: Perform the operation on existing elements only if the new score is * greater than the current score. * ZADD_LT: Perform the operation on existing elements only if the new score is * less than the current score. * * When ZADD_INCR is used, the new score of the element is stored in * '*newscore' if 'newscore' is not NULL. * * The returned flags are the following: * * ZADD_NAN: The resulting score is not a number. * ZADD_ADDED: The element was added (not present before the call). * ZADD_UPDATED: The element score was updated. * ZADD_NOP: No operation was performed because of NX or XX. * * Return value: * * The function returns 1 on success, and sets the appropriate flags * ADDED or UPDATED to signal what happened during the operation (note that * none could be set if we re-added an element using the same score it used * to have, or in the case a zero increment is used). * * The function returns 0 on error, currently only when the increment * produces a NAN condition, or when the 'score' value is NAN since the * start. * * The command as a side effect of adding a new element may convert the sorted * set internal encoding from listpack to hashtable+skiplist. * * Memory management of 'ele': * * The function does not take ownership of the 'ele' SDS string, but copies * it if needed. */int zsetAdd(robj *zobj, double score, sds ele, int in_flags, int *out_flags, double *newscore) { /* Turn options into simple to check vars. 
*/ //需要注意 nx与xx的互斥,gt lt nx之间的互斥 int incr = (in_flags & ZADD_IN_INCR) != 0; int nx = (in_flags & ZADD_IN_NX) != 0; int xx = (in_flags & ZADD_IN_XX) != 0; int gt = (in_flags & ZADD_IN_GT) != 0; int lt = (in_flags & ZADD_IN_LT) != 0; *out_flags = 0; /* We'll return our response flags. */ double curscore; /* NaN as input is an error regardless of all the other parameters. */ if (isnan(score)) { *out_flags = ZADD_OUT_NAN; return 0; } /* Update the sorted set according to its encoding. */ //listpack的数据结构 if (zobj->encoding == OBJ_ENCODING_LISTPACK) { unsigned char *eptr; if ((eptr = zzlFind(zobj->ptr,ele,&curscore)) != NULL) { //假如找到了相同元素 /* NX? Return, same element already exists. */ if (nx) { *out_flags |= ZADD_OUT_NOP; return 1; } /* Prepare the score for the increment if needed. */ //incr需要增加分数 if (incr) { score += curscore; if (isnan(score)) { *out_flags |= ZADD_OUT_NAN; return 0; } } /* GT/LT? Only update if score is greater/less than current. */ if ((lt && score >= curscore) || (gt && score <= curscore)) { *out_flags |= ZADD_OUT_NOP; return 1; } if (newscore) *newscore = score; /* Remove and re-insert when score changed. */ //如果score不同的话,那么就先删除然后再添加,达到更新的操作 if (score != curscore) { zobj->ptr = zzlDelete(zobj->ptr,eptr); zobj->ptr = zzlInsert(zobj->ptr,ele,score); *out_flags |= ZADD_OUT_UPDATED; } return 1; } else if (!xx) { /* Optimize: check if the element is too large or the list * becomes too long *before* executing zzlInsert. 
*/ //如果为新的,就需要考虑编码转换的问题,但是需要先insert之后再判断 zobj->ptr = zzlInsert(zobj->ptr,ele,score); if (zzlLength(zobj->ptr) > server.zset_max_listpack_entries || sdslen(ele) > server.zset_max_listpack_value) zsetConvert(zobj,OBJ_ENCODING_SKIPLIST); if (newscore) *newscore = score; *out_flags |= ZADD_OUT_ADDED; return 1; } else { *out_flags |= ZADD_OUT_NOP; return 1; } } else if (zobj->encoding == OBJ_ENCODING_SKIPLIST) { zset *zs = zobj->ptr; zskiplistNode *znode; dictEntry *de; //先从dict中判断是否存在,在dict中的hash查询效率比skiplist要快很多 de = dictFind(zs->dict,ele); if (de != NULL) { /* NX? Return, same element already exists. */ if (nx) { *out_flags |= ZADD_OUT_NOP; return 1; } curscore = *(double*)dictGetVal(de); /* Prepare the score for the increment if needed. */ //分数增加 if (incr) { score += curscore; if (isnan(score)) { *out_flags |= ZADD_OUT_NAN; return 0; } } /* GT/LT? Only update if score is greater/less than current. */ if ((lt && score >= curscore) || (gt && score <= curscore)) { *out_flags |= ZADD_OUT_NOP; return 1; } if (newscore) *newscore = score; /* Remove and re-insert when score changes. */ if (score != curscore) { //更新跳表中的score znode = zslUpdateScore(zs->zsl,curscore,ele,score); /* Note that we did not removed the original element from * the hash table representing the sorted set, so we just * update the score. */ //更新dict中的score dictGetVal(de) = &znode->score; /* Update score ptr. */ *out_flags |= ZADD_OUT_UPDATED; } return 1; } else if (!xx) { ele = sdsdup(ele); //无需编码转换,需要同时add dict与跳表 znode = zslInsert(zs->zsl,score,ele); serverAssert(dictAdd(zs->dict,ele,&znode->score) == DICT_OK); *out_flags |= ZADD_OUT_ADDED; if (newscore) *newscore = score; return 1; } else { *out_flags |= ZADD_OUT_NOP; return 1; } } else { serverPanic("Unknown sorted set encoding"); } return 0; /* Never reached. */}
As can be seen, the listpack path is fairly simple, while dict plus skiplist stores elements redundantly, trading some memory for very high query and ordering performance: a classic space-for-time trade that brings sorting and range queries down to O(log n).
zsetDel:
/* Delete the element 'ele' from the sorted set, returning 1 if the element * existed and was deleted, 0 otherwise (the element was not there). */int zsetDel(robj *zobj, sds ele) { // simply check the encoding, then call the corresponding underlying delete if (zobj->encoding == OBJ_ENCODING_LISTPACK) { unsigned char *eptr; if ((eptr = zzlFind(zobj->ptr,ele,NULL)) != NULL) { zobj->ptr = zzlDelete(zobj->ptr,eptr); return 1; } } else if (zobj->encoding == OBJ_ENCODING_SKIPLIST) { zset *zs = zobj->ptr; if (zsetRemoveFromSkiplist(zs, ele)) { if (htNeedsResize(zs->dict)) dictResize(zs->dict); return 1; } } else { serverPanic("Unknown sorted set encoding"); } return 0; /* No such element found. */}
The key point in t_zset is the combination of dict and skiplist: elements are stored redundantly in both, sacrificing some space efficiency in exchange for very high query and sorting performance.
The dict maps an element to its score, while the skiplist finds elements by score; ranking is also handled by the skiplist. On add, since the dict lookup is faster, the dict is queried first to decide whether the element already exists.
With dict+skiplist combined, Redis can first look up an element's score in the dict, then locate the element in the skiplist by that score, obtaining the rank directly: a neat pairing of two different data structures that reduces the complex ranking operation to O(log n). This is also why Redis only replaced the ziplist layer during its major version upgrade. The combination does mean that on add, the OBJ_ENCODING_SKIPLIST path has more to do, making the overall add function comparatively complex. One needs to understand dict and skiplist individually to see what each contributes under OBJ_ENCODING_SKIPLIST, and hence how zscore, zrevrange, and zrevrank are implemented so efficiently.
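A toy model of the pairing, assuming a sorted array in place of the skiplist and a linear scan in place of the dict's hash lookup (all names are illustrative, not Redis APIs):

```c
#include <string.h>

#define MAXN 16
typedef struct { const char *ele; double score; } zentry;

typedef struct {
    zentry byname[MAXN];  /* "dict" view: element -> score */
    zentry byscore[MAXN]; /* "skiplist" view: kept ordered by score */
    int n;
} toyzset;

/* ZSCORE path: the dict answers "what is this element's score". */
double toy_zscore(const toyzset *z, const char *ele) {
    for (int i = 0; i < z->n; i++)
        if (strcmp(z->byname[i].ele, ele) == 0) return z->byname[i].score;
    return -1;
}

/* ZRANK path: look up the score in the dict, then find its position
 * in the ordered view, which is the rank. */
int toy_zrank(const toyzset *z, const char *ele) {
    double s = toy_zscore(z, ele);
    for (int i = 0; i < z->n; i++)
        if (z->byscore[i].score == s && strcmp(z->byscore[i].ele, ele) == 0)
            return i;
    return -1;
}

/* Insert redundantly into both views, keeping byscore sorted. */
void toy_zadd(toyzset *z, const char *ele, double score) {
    z->byname[z->n].ele = ele; z->byname[z->n].score = score;
    int i = z->n;
    while (i > 0 && z->byscore[i-1].score > score) {
        z->byscore[i] = z->byscore[i-1]; i--;
    }
    z->byscore[i].ele = ele; z->byscore[i].score = score;
    z->n++;
}
```

In real Redis the dict lookup is O(1) and the skiplist locate is O(log n); the linear scans here only stand in for those structures to show the redundancy.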
t_set is Redis's unordered-set container, used to hold distinct members; its underlying structures are dict (hash table) and intset.
dict and intset are each analyzed in detail in their own articles; readers unfamiliar with them may want to read those two first.
#define OBJ_ENCODING_HT 2     /* Encoded as hash table */
#define OBJ_ENCODING_INTSET 6 /* Encoded as intset */
What is special about t_set is that it provides intersection, union, and difference operations between sets; all the GenericCommand helpers in t_set exist to serve these set operations. The relatively independent commands do not use the Generic helpers and handle their own pub/sub notifications.
sadd:
void saddCommand(client *c) { robj *set; int j, added = 0; // look up the key and verify its type set = lookupKeyWrite(c->db,c->argv[1]); if (checkType(c,set,OBJ_SET)) return; // if the set does not exist, prefer intset if the first value is integer-encodable, otherwise use dict if (set == NULL) { set = setTypeCreate(c->argv[2]->ptr); // call the db module's add directly to write the set's key into the db dbAdd(c->db,c->argv[1],set); } for (j = 2; j < c->argc; j++) { // setTypeAdd dispatches to dictAdd or intsetAdd depending on the encoding if (setTypeAdd(set,c->argv[j]->ptr)) added++; } if (added) { // signal that the key was modified signalModifiedKey(c,c->db,c->argv[1]); // does its own pub/sub notification instead of delegating to a generic helper notifyKeyspaceEvent(NOTIFY_SET,"sadd",c->argv[1],c->db->id); } server.dirty += added; // as in t_hash, the command may touch many elements, so a long long reply is needed to hold the count safely addReplyLongLong(c,added);}/* Factory method to return a set that *can* hold "value". When the object has * an integer-encodable value, an intset will be returned. Otherwise a regular * hash table. */robj *setTypeCreate(sds value) { if (isSdsRepresentableAsLongLong(value,NULL) == C_OK) return createIntsetObject(); return createSetObject();}
Walking through sadd shows that t_set's encoding behaves like t_hash's: once the encoding is "upgraded" it never "downgrades". As soon as a value that cannot be represented as a long long is inserted, the whole set switches to dict storage.
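The integer-encodability test that drives setTypeCreate can be sketched with strtoll (a simplification: the real isSdsRepresentableAsLongLong also rejects leading zeros and other non-canonical forms):

```c
#include <errno.h>
#include <stdlib.h>

/* Sketch of the decision setTypeCreate makes: if the first value parses
 * cleanly as a long long, an intset is created, otherwise a dict. */
int value_fits_intset(const char *value) {
    char *end;
    errno = 0;
    strtoll(value, &end, 10);
    /* must be non-empty, consume the whole string, and not overflow */
    return *value != '\0' && *end == '\0' && errno != ERANGE;
}
```

So SADD myset 123 can start out as an intset, while SADD myset hello forces dict encoding from the start.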
scard:
void scardCommand(client *c) { robj *o; if ((o = lookupKeyReadOrReply(c,c->argv[1],shared.czero)) == NULL || checkType(c,o,OBJ_SET)) return; addReplyLongLong(c,setTypeSize(o));}
scard is arguably the simplest function in all of t_set: since dict and intset both track their own entry counts, t_set merely validates the encoding and calls straight through.
Intersection, union and difference in t_set
Each of t_set's intersection/union/difference operations also has a command that stores the computed set, so the corresponding GenericCommand decides from its arguments whether the result should be stored, i.e. the various *store commands.
Intersection generic function:
void sinterGenericCommand(client *c, robj **setkeys, unsigned long setnum, robj *dstkey, int cardinality_only, unsigned long limit) { //申请内存,用来临时保存结果,整个GenericCommandreturn之前都会执行zfree释放此缓存空间 robj **sets = zmalloc(sizeof(robj*)*setnum); setTypeIterator *si; robj *dstset = NULL; sds elesds; int64_t intobj; void *replylen = NULL; unsigned long j, cardinality = 0; int encoding, empty = 0; for (j = 0; j < setnum; j++) { //当需要保存计算结果时,将会传入dstkey robj *setobj = dstkey ? //用来取出对象 lookupKeyWrite(c->db,setkeys[j]) : lookupKeyRead(c->db,setkeys[j]); //假如说对象不存在的话,那么无需做check操作了,将预设的empty自增,进入for循环外的分支,走完整的free操作 if (!setobj) { /* A NULL is considered an empty set */ empty += 1; sets[j] = NULL; continue; } if (checkType(c,setobj,OBJ_SET)) { zfree(sets); return; } sets[j] = setobj; } /* Set intersection with an empty set always results in an empty set. * Return ASAP if there is an empty set. */ //empty>0会出现在上边的for循环中,没有找到对应的对象,那么释放缓存set的内存,发送通知,返回结果统一处理 if (empty > 0) { zfree(sets); if (dstkey) { if (dbDelete(c->db,dstkey)) { signalModifiedKey(c,c->db,dstkey); notifyKeyspaceEvent(NOTIFY_GENERIC,"del",dstkey,c->db->id); server.dirty++; } addReply(c,shared.czero); } else if (cardinality_only) { addReplyLongLong(c,cardinality); } else { addReply(c,shared.emptyset[c->resp]); } return; } /* Sort sets from the smallest to largest, this will improve our * algorithm's performance */ //这里用的是快排,由于之后的交集为不停的for循环,那么将无序预先变为有序,方便for循环从中间直接break,增加效率 qsort(sets,setnum,sizeof(robj*),qsortCompareSetsByCardinality); /* The first thing we should output is the total number of elements... 
* since this is a multi-bulk write, but at this stage we don't know * the intersection set size, so we use a trick, append an empty object * to the output list and save the pointer to later modify it with the * right length */ //在输出列表中附加一个空的对象replylen(adlist),保存它的指针,将之后的长度更新在replylen中 if (dstkey) { /* If we have a target key where to store the resulting set * create this key with an empty set inside */ dstset = createIntsetObject(); } else if (!cardinality_only) { replylen = addReplyDeferredLen(c); } /* Iterate all the elements of the first (smallest) set, and test * the element against all the other sets, if at least one set does * not include the element it is discarded */ //为了减小for循环次数,需要先遍历entry数最少的集合,用它的元素与其余的集合进行对比,如果不是所有的元素都含有此集合中的元素,那么此元素就不为交集 si = setTypeInitIterator(sets[0]); while((encoding = setTypeNext(si,&elesds,&intobj)) != -1) { for (j = 1; j < setnum; j++) { //如果为当前集合则跳过 if (sets[j] == sets[0]) continue; //intset的话 if (encoding == OBJ_ENCODING_INTSET) { /* intset with intset is simple... and fast */ //由于都是int,遍历的效率相对dict会高很多 if (sets[j]->encoding == OBJ_ENCODING_INTSET && !intsetFind((intset*)sets[j]->ptr,intobj)) { break; /* in order to compare an integer with an object we * have to use the generic function, creating an object * for this */ } else if (sets[j]->encoding == OBJ_ENCODING_HT) { //其他集合为非intset类型会稍微麻烦些,这里构造里一个范型对象用来对比整数与对象 elesds = sdsfromlonglong(intobj); if (!setTypeIsMember(sets[j],elesds)) { sdsfree(elesds); break; } sdsfree(elesds); } } else if (encoding == OBJ_ENCODING_HT) { //纯dict的话尝试set,能够set进去那么就说明不为交集 if (!setTypeIsMember(sets[j],elesds)) { break; } } } //当发现此元素在所有集合中,即发现了新的交集元素时 /* Only take action when all sets contain the member */ if (j == setnum) { if (cardinality_only) { cardinality++; /* We stop the searching after reaching the limit. 
*/ //判断是否到了limit的限制,到的话不做其余操作了 if (limit && cardinality >= limit) break; } else if (!dstkey) { //dstkey如果为空,那么说明仅仅是找交集,不用写入 if (encoding == OBJ_ENCODING_HT) addReplyBulkCBuffer(c,elesds,sdslen(elesds)); else addReplyBulkLongLong(c,intobj); cardinality++; } else { if (encoding == OBJ_ENCODING_INTSET) { elesds = sdsfromlonglong(intobj); setTypeAdd(dstset,elesds); sdsfree(elesds); } else { setTypeAdd(dstset,elesds); } } } } setTypeReleaseIterator(si); if (cardinality_only) { addReplyLongLong(c,cardinality); } else if (dstkey) { //需要将结果写入库中 /* Store the resulting set into the target, if the intersection * is not an empty set. */ if (setTypeSize(dstset) > 0) { setKey(c,c->db,dstkey,dstset); addReplyLongLong(c,setTypeSize(dstset)); notifyKeyspaceEvent(NOTIFY_SET,"sinterstore", dstkey,c->db->id); server.dirty++; } else { addReply(c,shared.czero); if (dbDelete(c->db,dstkey)) { //脏键自增 server.dirty++; //通知变化 signalModifiedKey(c,c->db,dstkey); //通知订阅 notifyKeyspaceEvent(NOTIFY_GENERIC,"del",dstkey,c->db->id); } } decrRefCount(dstset); } else { setDeferredSetLen(c,replylen,cardinality); } zfree(sets);}
Intersection is comparatively simple; thanks to the dict membership test and the intset lookup optimization, the overall complexity stays within O(n*m), where n is the size of the smallest set and m the number of sets.
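The smallest-set-first membership loop can be sketched as follows, with plain int arrays standing in for Redis set objects (an assumption for brevity; contains() is a linear stand-in for the O(1) dict/intset lookups):

```c
/* Linear membership test standing in for setTypeIsMember/intsetFind. */
static int contains(const int *set, int len, int v) {
    for (int i = 0; i < len; i++)
        if (set[i] == v) return 1;
    return 0;
}

/* Core of sinterGenericCommand: iterate the smallest set (sets[0] is
 * assumed already smallest, as qsort guarantees in Redis) and keep only
 * elements present in every other set. out must hold at least lens[0]. */
int intersect(const int **sets, const int *lens, int nsets, int *out) {
    int cardinality = 0;
    for (int i = 0; i < lens[0]; i++) {
        int j;
        for (j = 1; j < nsets; j++)
            if (!contains(sets[j], lens[j], sets[0][i])) break;
        if (j == nsets) out[cardinality++] = sets[0][i]; /* in all sets */
    }
    return cardinality;
}
```

Sorting by cardinality first means the outer loop runs over the fewest elements, and the inner loop can break at the first set that misses the element.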
Union and difference:
void sunionDiffGenericCommand(client *c, robj **setkeys, int setnum, robj *dstkey, int op) { robj **sets = zmalloc(sizeof(robj*)*setnum); setTypeIterator *si; robj *dstset = NULL; sds ele; int j, cardinality = 0; int diff_algo = 1; for (j = 0; j < setnum; j++) { robj *setobj = dstkey ? lookupKeyWrite(c->db,setkeys[j]) : lookupKeyRead(c->db,setkeys[j]); if (!setobj) { sets[j] = NULL; continue; } if (checkType(c,setobj,OBJ_SET)) { zfree(sets); return; } sets[j] = setobj; } /* Select what DIFF algorithm to use. * * Algorithm 1 is O(N*M) where N is the size of the element first set * and M the total number of sets. * * Algorithm 2 is O(N) where N is the total number of elements in all * the sets. * * We compute what is the best bet with the current input here. */ //根据输入的集合决定用哪种算法,根据对比min元素数量*集合个数与所有集合中元素个数相比较,决定用最小时间复杂度的算法 if (op == SET_OP_DIFF && sets[0]) { long long algo_one_work = 0, algo_two_work = 0; for (j = 0; j < setnum; j++) { if (sets[j] == NULL) continue; //algo_one_work为计算最小基数*集合个数的时间复杂度 algo_one_work += setTypeSize(sets[0]); //algo_two_work为计算所有集合全部个数的时间复杂度 algo_two_work += setTypeSize(sets[j]); } /* Algorithm 1 has better constant times and performs less operations * if there are elements in common. Give it some advantage. */ //算法1的常数一般较为低,所以一般有限考虑算法1 algo_one_work /= 2; diff_algo = (algo_one_work <= algo_two_work) ? 1 : 2; if (diff_algo == 1 && setnum > 1) { /* With algorithm 1 it is better to order the sets to subtract * by decreasing size, so that we are more likely to find * duplicated elements ASAP. */ //算法1的时间复杂度依赖提前对其余的集合进行排序 qsort(sets+1,setnum-1,sizeof(robj*), qsortCompareSetsByRevCardinality); } } /* We need a temp set object to store our union. 
If the dstkey * is not NULL (that is, we are inside an SUNIONSTORE operation) then * this set object will be the resulting object to set into the target key*/ //临时集合用来保存结果,如果为并集的操作的话,那么这个集合就是最终的结果 dstset = createIntsetObject(); //并集计算 if (op == SET_OP_UNION) { /* Union is trivial, just add every element of every set to the * temporary set. */ //暴力遍历即可 for (j = 0; j < setnum; j++) { if (!sets[j]) continue; /* non existing keys are like empty sets */ si = setTypeInitIterator(sets[j]); while((ele = setTypeNextObject(si)) != NULL) { if (setTypeAdd(dstset,ele)) cardinality++; sdsfree(ele); } setTypeReleaseIterator(si); } //差集相对比较麻烦,需要考虑不同的算法情况 //算法1的差集 } else if (op == SET_OP_DIFF && sets[0] && diff_algo == 1) { /* DIFF Algorithm 1: * * We perform the diff by iterating all the elements of the first set, * and only adding it to the target set if the element does not exist * into all the other sets. * * This way we perform at max N*M operations, where N is the size of * the first set, and M the number of sets. */ //在算法1中,需要将最小基数的集合与其他所有集合元素进行对比,所以这里算法一的复杂度为N*M si = setTypeInitIterator(sets[0]); while((ele = setTypeNextObject(si)) != NULL) { for (j = 1; j < setnum; j++) { if (!sets[j]) continue; /* no key is an empty set. */ if (sets[j] == sets[0]) break; /* same set! */ if (setTypeIsMember(sets[j],ele)) break; } if (j == setnum) { /* There is no other set with this element. Add it. */ setTypeAdd(dstset,ele); cardinality++; } sdsfree(ele); } setTypeReleaseIterator(si); //算法2的差集 } else if (op == SET_OP_DIFF && sets[0] && diff_algo == 2) { /* DIFF Algorithm 2: * * Add all the elements of the first set to the auxiliary set. * Then remove all the elements of all the next sets from it. * * This is O(N) where N is the sum of all the elements in every * set. 
*/ //与算法1不同的是,需要将最小基数的集合也添加到备用结果集中,然后再遍历所有集合,将相同元素删除 for (j = 0; j < setnum; j++) { if (!sets[j]) continue; /* non existing keys are like empty sets */ si = setTypeInitIterator(sets[j]); while((ele = setTypeNextObject(si)) != NULL) { if (j == 0) { if (setTypeAdd(dstset,ele)) cardinality++; } else { if (setTypeRemove(dstset,ele)) cardinality--; } sdsfree(ele); } setTypeReleaseIterator(si); /* Exit if result set is empty as any additional removal * of elements will have no effect. */ if (cardinality == 0) break; } } /* Output the content of the resulting set, if not in STORE mode */ //与交集基本上一致,唯一不通的是根据并集和差集的区别,看dstkey是否需要被删除 if (!dstkey) { addReplySetLen(c,cardinality); si = setTypeInitIterator(dstset); while((ele = setTypeNextObject(si)) != NULL) { addReplyBulkCBuffer(c,ele,sdslen(ele)); sdsfree(ele); } setTypeReleaseIterator(si); server.lazyfree_lazy_server_del ? freeObjAsync(NULL, dstset, -1) : decrRefCount(dstset); } else { /* If we have a target key where to store the resulting set * create this key with the result set inside */ if (setTypeSize(dstset) > 0) { //说明结果集不为空,那么需要添加到数据库中 setKey(c,c->db,dstkey,dstset); addReplyLongLong(c,setTypeSize(dstset)); //通知 notifyKeyspaceEvent(NOTIFY_SET, op == SET_OP_UNION ? "sunionstore" : "sdiffstore", dstkey,c->db->id); server.dirty++; } else { addReply(c,shared.czero); if (dbDelete(c->db,dstkey)) { server.dirty++; signalModifiedKey(c,c->db,dstkey); notifyKeyspaceEvent(NOTIFY_GENERIC,"del",dstkey,c->db->id); } } decrRefCount(dstset); } zfree(sets);}
What is special about union and difference is that the expected cost of each algorithm is computed up front and the cheaper one is chosen. The two diff algorithms differ in how they treat the first set: algorithm 1 checks each of its elements against every other set directly, while algorithm 2 copies it into the result set and then removes every element found in the remaining sets.
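The cost estimate can be sketched directly from the source's loop (hypothetical function name; sizes[0] is the first set, and algorithm 1 is given the 2x constant-factor advantage the source mentions):

```c
/* Sketch of the algorithm choice in sunionDiffGenericCommand:
 * algorithm 1 costs |first set| * number-of-sets,
 * algorithm 2 costs the sum of all set sizes. */
int choose_diff_algo(const long long *sizes, int nsets) {
    long long algo_one_work = 0, algo_two_work = 0;
    for (int j = 0; j < nsets; j++) {
        algo_one_work += sizes[0]; /* first set scanned against each set */
        algo_two_work += sizes[j]; /* every element of every set touched once */
    }
    algo_one_work /= 2; /* give algorithm 1 its constant-factor advantage */
    return (algo_one_work <= algo_two_work) ? 1 : 2;
}
```

A small first set diffed against huge sets favors algorithm 1; a huge first set diffed against small sets favors algorithm 2.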
For simple commands, t_set relies almost entirely on the wrappers already provided by the underlying data structures; its one special feature is unified support for intersection, union, and difference across multiple sets. Union and difference share a single function that chooses the concrete algorithm based on the input sets. Intersection is at worst O(n*M), while union/difference is at worst min(O(n*M), O(N)) (n is the cardinality of the smallest set, M the number of sets, and N the total number of elements across all sets).
t_hash handles key:[{field,value}] pairs and is commonly used to store objects. Before 6 the underlying structures were ziplist and dict; since 6 they are listpack and dict. This article targets the post-6 implementation. listpack and dict are each covered in detail in their own two articles; readers unfamiliar with them may want to read those first.
t_hash has two encodings:
#define OBJ_ENCODING_LISTPACK 11 /* Encoded as a listpack */
#define OBJ_ENCODING_HT 2        /* Encoded as hash table */
Since 6, the choice depends on these parameters in redis.conf:
# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
hash-max-listpack-entries 512
hash-max-listpack-value 64
In plain terms: a hash stays listpack-encoded only while it holds at most hash-max-listpack-entries fields and every field and value is at most hash-max-listpack-value bytes long; once either threshold is crossed, it is converted to a hash table.
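In sketch form, the hash keeps its listpack encoding only while both thresholds hold (hypothetical helper name; this mirrors hashTypeTryConversion plus the length check in hashTypeSet):

```c
#include <stddef.h>

/* 1 while the hash may stay listpack-encoded, 0 once it must become
 * a hash table. The conversion is one-way. */
int hash_keeps_listpack(size_t max_entries, size_t max_value,
                        size_t nfields, size_t longest_field_or_value) {
    return nfields <= max_entries && longest_field_or_value <= max_value;
}
```

With the defaults (512/64), either a 513th field or a 65-byte field or value triggers hashTypeConvert(o, OBJ_ENCODING_HT).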
Unlike the other containers, t_hash, because of its complexity, drops the specially named GenericCommand wrappers; the only one it uses is scanGenericCommand. Its common helpers are functions such as hashTypeSet and hashTypeDelete.
hashTypeSet:
int hashTypeSet(robj *o, sds field, sds value, int flags) { int update = 0; if (o->encoding == OBJ_ENCODING_LISTPACK) { unsigned char *zl, *fptr, *vptr; zl = o->ptr; fptr = lpFirst(zl); //获取紧凑列表的头 if (fptr != NULL) { //查看是否存在field fptr = lpFind(zl, fptr, (unsigned char*)field, sdslen(field), 1); if (fptr != NULL) { /* Grab pointer to the value (fptr points to the field) */ //存在的话,那么取next vptr = lpNext(zl, fptr); serverAssert(vptr != NULL); update = 1; /* Replace value */ //与ziplist不同,listpack中本身实现了replace操作,无需delet后删除 zl = lpReplace(zl, &vptr, (unsigned char*)value, sdslen(value)); } } if (!update) { //非更新操作的话那么直接插入就好,需要按顺序插入键和值 /* Push new field/value pair onto the tail of the listpack */ zl = lpAppend(zl, (unsigned char*)field, sdslen(field)); zl = lpAppend(zl, (unsigned char*)value, sdslen(value)); } o->ptr = zl; //这里是检查转换hash的操作,假如此次的插入导致触发了hash-max-listpack-entries或hash-max-listpack-value上限,那么需要整体转换 /* Check if the listpack needs to be converted to a hash table */ if (hashTypeLength(o) > server.hash_max_listpack_entries) hashTypeConvert(o, OBJ_ENCODING_HT); } else if (o->encoding == OBJ_ENCODING_HT) { //hash结构的话相对简单些,本身有hash映射 dictEntry *de = dictFind(o->ptr,field); if (de) { sdsfree(dictGetVal(de)); if (flags & HASH_SET_TAKE_VALUE) { dictGetVal(de) = value; value = NULL; } else { dictGetVal(de) = sdsdup(value); } update = 1; } else { //不存在的话需要插入操作,dict中的插入会有概率触发渐进式rehash sds f,v; if (flags & HASH_SET_TAKE_FIELD) { f = field; field = NULL; } else { f = sdsdup(field); } if (flags & HASH_SET_TAKE_VALUE) { v = value; value = NULL; } else { v = sdsdup(value); } dictAdd(o->ptr,f,v); } } else { serverPanic("Unknown hash encoding"); } /* Free SDS strings we did not referenced elsewhere if the flags * want this function to be responsible. */ if (flags & HASH_SET_TAKE_FIELD && field) sdsfree(field); if (flags & HASH_SET_TAKE_VALUE && value) sdsfree(value); return update;}
This shows that the conversion from listpack to hash table happens at set time.
hashTypeDelete:
/* Delete an element from a hash. * Return 1 on deleted and 0 on not found. */int hashTypeDelete(robj *o, sds field) { int deleted = 0; if (o->encoding == OBJ_ENCODING_LISTPACK) { // deleting from a listpack only shrinks it, so no encoding conversion can be triggered; just call the listpack routine unsigned char *zl, *fptr; zl = o->ptr; fptr = lpFirst(zl); if (fptr != NULL) { fptr = lpFind(zl, fptr, (unsigned char*)field, sdslen(field), 1); if (fptr != NULL) { /* Delete both of the key and the value. */ // as covered in the listpack article, lpDeleteRangeWithEntry is essentially a clear-out: listpack's delete and replace share the same underlying primitive zl = lpDeleteRangeWithEntry(zl,&fptr,2); o->ptr = zl; deleted = 1; } } } else if (o->encoding == OBJ_ENCODING_HT) { if (dictDelete((dict*)o->ptr, field) == C_OK) { deleted = 1; /* Always check if the dictionary needs a resize after a delete. */ if (htNeedsResize(o->ptr)) dictResize(o->ptr); } } else { serverPanic("Unknown hash encoding"); } return deleted;}
This shows that once t_hash is using the hashtable encoding, it never tries to go back to listpack.
hsetCommand:
void hsetCommand(client *c) { int i, created = 0; robj *o; if ((c->argc % 2) == 1) { addReplyErrorFormat(c,"wrong number of arguments for '%s' command",c->cmd->name); return; } //假如key不存在,需要先创建key if ((o = hashTypeLookupWriteOrCreate(c,c->argv[1])) == NULL) return; //只检查sds的长度,尝试将listpack直接转换为hash结构(如果条件满足的话) hashTypeTryConversion(o,c->argv,2,c->argc-1); for (i = 2; i < c->argc; i += 2) created += !hashTypeSet(o,c->argv[i]->ptr,c->argv[i+1]->ptr,HASH_SET_COPY); /* HMSET (deprecated) and HSET return value is different. */ char *cmdname = c->argv[0]->ptr; if (cmdname[1] == 's' || cmdname[1] == 'S') { /* HSET */ //由于hmset的返回相对较长,需要特殊处理 addReplyLongLong(c, created); } else { /* HMSET */ addReply(c, shared.ok); } // 发送键修改通知 signalModifiedKey(c,c->db,c->argv[1]); //自己做pub/sub的通知,没有托管给通用函数 notifyKeyspaceEvent(NOTIFY_HASH,"hset",c->argv[1],c->db->id); server.dirty += (c->argc - 2)/2;}void hashTypeTryConversion(robj *o, robj **argv, int start, int end) { int i; if (o->encoding != OBJ_ENCODING_LISTPACK) return; for (i = start; i <= end; i++) { //只检查sds的长度,尝试将listpack直接转换为hash结构(如果条件满足的话) if (sdsEncodedObject(argv[i]) && sdslen(argv[i]->ptr) > server.hash_max_listpack_value) { hashTypeConvert(o, OBJ_ENCODING_HT); break; } }}
A general way to tell whether a Command implements its functionality on its own, without the generic helpers, is to check whether it performs the pub/sub notification itself, i.e. calls notifyKeyspaceEvent: a command that calls notifyKeyspaceEvent directly usually contains its core logic itself.
The remaining command functions are similarly straightforward and will not be analyzed in detail here.
Thanks to the complete encapsulation of the underlying data structures, t_hash has relatively little logic of its own when handling concrete commands; mostly it calls straight into the underlying structures.
The biggest change is that t_hash has finally escaped ziplist's cascading-update nightmare by adopting listpack as one of its encodings, so calls respond somewhat faster. But whether listpack or ziplist, t_hash only ever stores small objects in them; there is no overall response optimization for large structures. For small objects, a little extra memory is spent in exchange for faster processing.
When upgrading Redis versions there is no need to specifically replace the hash-max-listpack-entries and hash-max-listpack-value parameters: Redis unifies the old and new names for compatibility. As createSizeTConfig shows, both hash-max-listpack-entries and hash-max-ziplist-entries are parsed into server.hash_max_listpack_entries. It is still worth updating them when convenient: if listpack is optimized again later, the ziplist-named compatibility aliases in t_hash may no longer work.
createSizeTConfig("hash-max-listpack-entries", "hash-max-ziplist-entries", MODIFIABLE_CONFIG, 0, LONG_MAX, server.hash_max_listpack_entries, 512, INTEGER_CONFIG, NULL, NULL),
createSizeTConfig("hash-max-listpack-value", "hash-max-ziplist-value", MODIFIABLE_CONFIG, 0, LONG_MAX, server.hash_max_listpack_value, 64, MEMORY_CONFIG, NULL, NULL),
t_list is the container for list-related commands such as rpush and lpush. It is built on quicklist; readers unfamiliar with it can first read the quicklist article.
Data structure | Encoding | Applicable types
quicklist | QUICKLIST | all
Unlike t_string, t_list defines two structs, listTypeIterator and listTypeEntry.
/* Structure to hold list iteration abstraction. */typedef struct { robj *subject; unsigned char encoding; unsigned char direction; /* Iteration direction */ quicklistIter *iter;} listTypeIterator;/* Structure for an entry while iterating over a list. */typedef struct { listTypeIterator *li; quicklistEntry entry; /* Entry in quicklist */} listTypeEntry;
Here subject is a redisObject, which a later article will analyze. Reading listTypeIterator and listTypeEntry together: listTypeEntry holds a quicklist entry plus a pointer back to its listTypeIterator, while listTypeIterator records the iteration direction and the list's encoding.
t_list likewise defines generic low-level helper functions.
pushGenericCommand:
/* Implements LPUSH/RPUSH/LPUSHX/RPUSHX. * 'xx': push if key exists. */void pushGenericCommand(client *c, int where, int xx) { int j; //查看redis中是否存在这个key,如果存在key但类型不符将直接返回 robj *lobj = lookupKeyWrite(c->db, c->argv[1]); if (checkType(c,lobj,OBJ_LIST)) return; if (!lobj) { //xx默认为0,仅lpush与rpush使用到了 if (xx) { //将值添加到输出缓存区中 addReply(c, shared.czero); return; } //如果对象不存在,那么使用quicklist创建对象,并且初始化 lobj = createQuicklistObject(); quicklistSetOptions(lobj->ptr, server.list_max_ziplist_size, server.list_compress_depth); //将键添加到相对应的db中(常用的都是db0) dbAdd(c->db,c->argv[1],lobj); } for (j = 2; j < c->argc; j++) { // 真正的push操作 listTypePush(lobj,c->argv[j],where); //设置当前为脏(dirty),每次修改一个key后,都会对脏键(dirty)增1 server.dirty++; } // 返回添加的节点数量 addReplyLongLong(c, listTypeLength(lobj)); char *event = (where == LIST_HEAD) ? "lpush" : "rpush"; // 发送键修改通知 signalModifiedKey(c,c->db,c->argv[1]); //供订阅/消费模块使用 notifyKeyspaceEvent(NOTIFY_LIST,event,c->argv[1],c->db->id);}
As this shows, everything in t_list goes through quicklist.
popGenericCommand:
/* Implements the generic list pop operation for LPOP/RPOP.
 * The where argument specifies which end of the list is operated on. An
 * optional count may be provided as the third argument of the client's
 * command. */
void popGenericCommand(client *c, int where) {
    long count = 0;
    robj *value;

    /* Argument-count validation. */
    if (c->argc > 3) {
        addReplyErrorFormat(c,"wrong number of arguments for '%s' command",
                            c->cmd->name);
        return;
    } else if (c->argc == 3) {
        /* Parse the optional count argument. */
        if (getPositiveLongFromObjectOrReply(c,c->argv[2],&count,NULL) != C_OK)
            return;

        if (count == 0) {
            /* Fast exit path. */
            addReplyNullArray(c);
            return;
        }
    }

    /* Check that the key exists and has the right type. */
    robj *o = lookupKeyWriteOrReply(c, c->argv[1], shared.null[c->resp]);
    if (o == NULL || checkType(c, o, OBJ_LIST))
        return;

    if (!count) {
        /* Pop a single element. This is POP's original behavior that replies
         * with a bulk string. */
        value = listTypePop(o,where);
        serverAssert(value != NULL);
        addReplyBulk(c,value);
        decrRefCount(value);
        /* The keyspace notification for subscribers lives here, as the shared
         * post-removal cleanup. */
        listElementsRemoved(c,c->argv[1],where,o,1,NULL);
    } else {
        /* Pop a range of elements. An addition to the original POP command,
         * which replies with a multi-bulk. */
        long llen = listTypeLength(o);
        long rangelen = (count > llen) ? llen : count;
        long rangestart = (where == LIST_HEAD) ? 0 : -rangelen;
        long rangeend = (where == LIST_HEAD) ? rangelen - 1 : -1;
        int reverse = (where == LIST_HEAD) ? 0 : 1;

        addListRangeReply(c,o,rangestart,rangeend,reverse);
        listTypeDelRange(o,rangestart,rangelen);
        /* The keyspace notification for subscribers lives here, as the shared
         * post-removal cleanup. */
        listElementsRemoved(c,c->argv[1],where,o,rangelen,NULL);
    }
}
listElementsRemoved (the shared cleanup after removals):
/* A housekeeping helper for list elements popping tasks.
 *
 * 'deleted' is an optional output argument to get an indication
 * if the key got deleted by this function. */
void listElementsRemoved(client *c, robj *key, int where, robj *o, long count, int *deleted) {
    char *event = (where == LIST_HEAD) ? "lpop" : "rpop";

    notifyKeyspaceEvent(NOTIFY_LIST, event, key, c->db->id);
    if (listTypeLength(o) == 0) {
        if (deleted) *deleted = 1;
        dbDelete(c->db, key);
        notifyKeyspaceEvent(NOTIFY_GENERIC, "del", key, c->db->id);
    } else {
        if (deleted) *deleted = 0;
    }
    /* Signal that the key was modified. */
    signalModifiedKey(c, c->db, key);
    /* Mark dirty: a removal may affect 'count' elements, so add count. */
    server.dirty += count;
}
Another special aspect of lists is that BLPOP and BRPOP block: if the list has no element, the client blocks on the list until the wait times out or a poppable element appears. The concrete implementation of this mechanism is in blockingPopGenericCommand.
blockingPopGenericCommand:
/* Blocking RPOP/LPOP/LMPOP * * 'numkeys' is the number of keys. * 'timeout_idx' parameter position of block timeout. * 'where' LIST_HEAD for LEFT, LIST_TAIL for RIGHT. * 'count' is the number of elements requested to pop, or 0 for plain single pop. * * When count is 0, a reply of a single bulk-string will be used. * When count > 0, an array reply will be used. */void blockingPopGenericCommand(client *c, robj **keys, int numkeys, int where, int timeout_idx, long count) { robj *o; robj *key; mstime_t timeout; int j; //获取timeout if (getTimeoutFromObjectOrReply(c,c->argv[timeout_idx],&timeout,UNIT_SECONDS) != C_OK) return; /* Traverse all input keys, we take action only based on one key. */ for (j = 0; j < numkeys; j++) { key = keys[j]; //从这到紧接着的两个if都是校验,校验不通过则直接报错 o = lookupKeyWrite(c->db, key); /* Non-existing key, move to next key. */ if (o == NULL) continue; if (checkType(c, o, OBJ_LIST)) return; // 当前列表为空,则跳过当前开始对下个key操作 long llen = listTypeLength(o); /* Empty list, move to next key. */ if (llen == 0) continue; //非空的话需要pop if (count != 0) { /* BLMPOP, non empty list, like a normal [LR]POP with count option. * The difference here we pop a range of elements in a nested arrays way. */ listPopRangeAndReplyWithKey(c, o, key, where, count, NULL); /* Replicate it as [LR]POP COUNT. */ robj *count_obj = createStringObjectFromLongLong((count > llen) ? llen : count); rewriteClientCommandVector(c, 3, (where == LIST_HEAD) ? shared.lpop : shared.rpop, key, count_obj); decrRefCount(count_obj); return; } /* Non empty list, this is like a normal [LR]POP. */ robj *value = listTypePop(o,where); serverAssert(value != NULL); addReplyArrayLen(c,2); addReplyBulk(c,key); addReplyBulk(c,value); decrRefCount(value); listElementsRemoved(c,key,where,o,1,NULL); /* Replicate it as an [LR]POP instead of B[LR]POP. */ rewriteClientCommandVector(c,2, (where == LIST_HEAD) ? 
shared.lpop : shared.rpop, key); return; } /* If we are not allowed to block the client, the only thing * we can do is treating it as a timeout (even with timeout 0). */ if (c->flags & CLIENT_DENY_BLOCKING) { addReplyNullArray(c); return; } /* If the keys do not exist we must block */ struct blockPos pos = {where}; //如果说key都不存在的话,那么开始阻塞 blockForKeys(c,BLOCKED_LIST,keys,numkeys,count,timeout,NULL,&pos,NULL);}
Broadly, t_list defines a large number of GenericCommand helpers. The easy-to-guess ones are pushGenericCommand and popGenericCommand; the specially written ones include mpopGenericCommand, lmoveGenericCommand and others. Most of the per-command functions that call them are essentially one-liners, wiring up different 'from' and 'to' arguments to implement the different commands.
The other distinctive part of t_list is the generic blocking helpers, which implement the blocking commands that are rare in Redis, such as BLPOP.
In Redis, the C files under /src/ whose names start with t_ can be understood collectively as upper-layer containers. Unlike sds, dict and the structures introduced earlier, an upper-layer container defines no dedicated data structure of its own; it only handles concrete Redis commands, parsing them and driving the bottom-layer structures (sds, dict, ...) to store or update the data held in memory.
t_string, then, is the container handling string-related commands: the classic SET, GET, MSET, MGET, and so on.
data structure | encoding | applicable values
long           | int      | integers representable as a long
sds            | raw      | strings longer than 44 bytes
sds            | embstr   | strings of at most 44 bytes (OBJ_ENCODING_EMBSTR_SIZE_LIMIT)
Compared with the bottom-layer data structures, reading the upper-layer containers' source is fairly mechanical, a bit like reading business logic; the bottom-layer optimizations are essentially all done inside sds and friends. So only a few common command functions are analyzed here.
setGenericCommand (a generic helper, factored out for use by the SET, SETNX, SETEX, PSETEX and similar commands):
void setGenericCommand(client *c, int flags, robj *key, robj *val, robj *expire, int unit, robj *ok_reply, robj *abort_reply) { long long milliseconds = 0; /* initialized to avoid any harmness warning */ //如果定义了key的过期时间,那么根据&&的规则将会执行getExpireMillisecondsOrReply函数 //getExpireMillisecondsOrReply中调用getLongLongFromObjectOrReply函数拿存当前key具体的过期时间(如果没拿到那么返回0) if (expire && getExpireMillisecondsOrReply(c, expire, flags, unit, &milliseconds) != C_OK) { return; } //下边的各种if都是兼容判断当前set的情况的,对于nx或者ex等都会走不同的分支 if (flags & OBJ_SET_GET) { if (getGenericCommand(c) == C_ERR) return; } //lookupKeyWrite取出key的值 if ((flags & OBJ_SET_NX && lookupKeyWrite(c->db,key) != NULL) || (flags & OBJ_SET_XX && lookupKeyWrite(c->db,key) == NULL)) { if (!(flags & OBJ_SET_GET)) { addReply(c, abort_reply ? abort_reply : shared.null[c->resp]); } return; } genericSetKey(c,c->db,key, val,flags & OBJ_KEEPTTL,1); //设置当前为脏(dirty),每次修改一个key后,都会对脏键(dirty)增1 server.dirty++; notifyKeyspaceEvent(NOTIFY_STRING,"set",key,c->db->id); if (expire) { setExpire(c,c->db,key,milliseconds); /* Propagate as SET Key Value PXAT millisecond-timestamp if there is * EX/PX/EXAT/PXAT flag. */ robj *milliseconds_obj = createStringObjectFromLongLong(milliseconds); rewriteClientCommandVector(c, 5, shared.set, key, val, shared.pxat, milliseconds_obj); decrRefCount(milliseconds_obj); //发送"set"事件的通知,处理发布订阅(pub/sub) notifyKeyspaceEvent(NOTIFY_GENERIC,"expire",key,c->db->id); } if (!(flags & OBJ_SET_GET)) { addReply(c, ok_reply ? ok_reply : shared.ok); } /* Propagate without the GET argument (Isn't needed if we had expire since in that case we completely re-written the command argv) */ if ((flags & OBJ_SET_GET) && !expire) { int argc = 0; int j; robj **argv = zmalloc((c->argc-1)*sizeof(robj*)); for (j=0; j < c->argc; j++) { char *a = c->argv[j]->ptr; /* Skip GET which may be repeated multiple times. 
*/ if (j >= 3 && (a[0] == 'g' || a[0] == 'G') && (a[1] == 'e' || a[1] == 'E') && (a[2] == 't' || a[2] == 'T') && a[3] == '\0') continue; argv[argc++] = c->argv[j]; incrRefCount(c->argv[j]); } replaceClientCommandVector(c, argc, argv); }}
The setKey that actually performs the set operation lives in /src/db.c and will be analyzed separately later. The function underneath it, genericSetKey, is the one that really does the work; as its comment says ("All the new keys in the database should be created via this interface."), it is the unified interface through which every set goes.
If you still remember the sds structure, combining it with what we see here makes Redis's expiration strategy easy to understand. Take a plain SET: when an expire is attached, neither sds nor a long has any field at the bottom layer to hold that value, so the expiration is maintained separately; this is Redis's 'passive expiration'. Whether timed, periodic, or lazy, some trigger works through that separately maintained expire table and only then operates on the sds or long value. This is also one reason a Redis instance can occupy less memory after an AOF/RDB restart than before.
set:
/* SET key value [NX] [XX] [KEEPTTL] [GET] [EX <seconds>] [PX <milliseconds>] * [EXAT <seconds-timestamp>][PXAT <milliseconds-timestamp>] */void setCommand(client *c) { robj *expire = NULL; int unit = UNIT_SECONDS; int flags = OBJ_NO_FLAGS; //这个函数是取args并且正确的将set类型放置到flags里 if (parseExtendedStringArgumentsOrReply(c,&flags,&unit,&expire,COMMAND_SET) != C_OK) { return; } // 判断value是否可以编码成整数,如果能则编码;反之不做处理。 c->argv[2] = tryObjectEncoding(c->argv[2]); //通用set方法 setGenericCommand(c,flags,c->argv[1],c->argv[2],expire,unit,NULL,NULL);}/* * The parseExtendedStringArgumentsOrReply() function performs the common validation for extended * string arguments used in SET and GET command. * * Get specific commands - PERSIST/DEL * Set specific commands - XX/NX/GET * Common commands - EX/EXAT/PX/PXAT/KEEPTTL * * Function takes pointers to client, flags, unit, pointer to pointer of expire obj if needed * to be determined and command_type which can be COMMAND_GET or COMMAND_SET. * * If there are any syntax violations C_ERR is returned else C_OK is returned. * * Input flags are updated upon parsing the arguments. Unit and expire are updated if there are any * EX/EXAT/PX/PXAT arguments. Unit is updated to millisecond if PX/PXAT is set. */int parseExtendedStringArgumentsOrReply(client *c, int *flags, int *unit, robj **expire, int command_type) { int j = command_type == COMMAND_GET ? 2 : 3; //这里是看是否带有ex/px和nx/xx的参数,如果有的话则在flags中标记好类型 for (; j < c->argc; j++) { char *opt = c->argv[j]->ptr; robj *next = (j == c->argc-1) ? 
NULL : c->argv[j+1]; if ((opt[0] == 'n' || opt[0] == 'N') && (opt[1] == 'x' || opt[1] == 'X') && opt[2] == '\0' && !(*flags & OBJ_SET_XX) && (command_type == COMMAND_SET)) { *flags |= OBJ_SET_NX; } else if ((opt[0] == 'x' || opt[0] == 'X') && (opt[1] == 'x' || opt[1] == 'X') && opt[2] == '\0' && !(*flags & OBJ_SET_NX) && (command_type == COMMAND_SET)) { *flags |= OBJ_SET_XX; } else if ((opt[0] == 'g' || opt[0] == 'G') && (opt[1] == 'e' || opt[1] == 'E') && (opt[2] == 't' || opt[2] == 'T') && opt[3] == '\0' && (command_type == COMMAND_SET)) { *flags |= OBJ_SET_GET; } else if (!strcasecmp(opt, "KEEPTTL") && !(*flags & OBJ_PERSIST) && !(*flags & OBJ_EX) && !(*flags & OBJ_EXAT) && !(*flags & OBJ_PX) && !(*flags & OBJ_PXAT) && (command_type == COMMAND_SET)) { *flags |= OBJ_KEEPTTL; } else if (!strcasecmp(opt,"PERSIST") && (command_type == COMMAND_GET) && !(*flags & OBJ_EX) && !(*flags & OBJ_EXAT) && !(*flags & OBJ_PX) && !(*flags & OBJ_PXAT) && !(*flags & OBJ_KEEPTTL)) { *flags |= OBJ_PERSIST; } else if ((opt[0] == 'e' || opt[0] == 'E') && (opt[1] == 'x' || opt[1] == 'X') && opt[2] == '\0' && !(*flags & OBJ_KEEPTTL) && !(*flags & OBJ_PERSIST) && !(*flags & OBJ_EXAT) && !(*flags & OBJ_PX) && !(*flags & OBJ_PXAT) && next) { *flags |= OBJ_EX; *expire = next; j++; } else if ((opt[0] == 'p' || opt[0] == 'P') && (opt[1] == 'x' || opt[1] == 'X') && opt[2] == '\0' && !(*flags & OBJ_KEEPTTL) && !(*flags & OBJ_PERSIST) && !(*flags & OBJ_EX) && !(*flags & OBJ_EXAT) && !(*flags & OBJ_PXAT) && next) { *flags |= OBJ_PX; *unit = UNIT_MILLISECONDS; *expire = next; j++; } else if ((opt[0] == 'e' || opt[0] == 'E') && (opt[1] == 'x' || opt[1] == 'X') && (opt[2] == 'a' || opt[2] == 'A') && (opt[3] == 't' || opt[3] == 'T') && opt[4] == '\0' && !(*flags & OBJ_KEEPTTL) && !(*flags & OBJ_PERSIST) && !(*flags & OBJ_EX) && !(*flags & OBJ_PX) && !(*flags & OBJ_PXAT) && next) { *flags |= OBJ_EXAT; *expire = next; j++; } else if ((opt[0] == 'p' || opt[0] == 'P') && (opt[1] == 'x' || opt[1] == 'X') 
&& (opt[2] == 'a' || opt[2] == 'A') && (opt[3] == 't' || opt[3] == 'T') && opt[4] == '\0' && !(*flags & OBJ_KEEPTTL) && !(*flags & OBJ_PERSIST) && !(*flags & OBJ_EX) && !(*flags & OBJ_EXAT) && !(*flags & OBJ_PX) && next) { *flags |= OBJ_PXAT; *unit = UNIT_MILLISECONDS; *expire = next; j++; } else { //说明参数有问题,需要上报异常 addReplyErrorObject(c,shared.syntaxerr); return C_ERR; } } return C_OK;}
The remaining functions are not all shown here; they are nearly identical to setCommand, essentially business logic.
Expiration strategy
In Redis, key/value pairs and their expires are stored separately, and a value never expires 'by itself'; the timed and periodic passive-expiration paths all start from the separate expires table.
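The separation can be sketched with a toy model (illustrative only, not Redis's actual code: Redis keeps two dicts inside redisDb, not parallel arrays): the value store and the expire table are two structures, and expiry is enforced lazily when a key is read.

```c
#include <assert.h>
#include <string.h>

#define MAX_KEYS 8

/* Toy db: values and expires live side by side but in separate fields. */
typedef struct {
    const char *keys[MAX_KEYS];
    const char *vals[MAX_KEYS];       /* the "dict" holding values */
    long long expire_at_ms[MAX_KEYS]; /* the separate "expires" table; 0 = no TTL */
    int used;
} toy_db;

static void toy_set(toy_db *db, const char *k, const char *v,
                    long long expire_at_ms) {
    db->keys[db->used] = k;
    db->vals[db->used] = v;
    db->expire_at_ms[db->used] = expire_at_ms; /* TTL lives outside the value */
    db->used++;
}

/* Lazy (passive) expiration: the value is only discarded when accessed. */
static const char *toy_get(toy_db *db, const char *k, long long now_ms) {
    for (int i = 0; i < db->used; i++) {
        if (strcmp(db->keys[i], k) != 0) continue;
        if (db->expire_at_ms[i] != 0 && db->expire_at_ms[i] <= now_ms)
            return NULL; /* logically deleted on access */
        return db->vals[i];
    }
    return NULL;
}
```

The point of the sketch: the value itself never carries its TTL, so any expiry path, lazy or periodic, must consult the second table first.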
Publish/subscribe
Taking SET as an example, the machinery that actually supports publish and subscribe is centralized in notifyKeyspaceEvent; each generic helper such as setGenericCommand is responsible for calling this pub/sub interface.
This post is an index of sorts; the analyses of all of Redis's bottom-layer data structures are in the following posts:
ghroth: redis source code study - sds
ghroth: redis source code study - adlist
ghroth: redis source code study - dict
ghroth: redis source code study - skiplist
ghroth: redis source code study - hyperloglog
ghroth: redis source code study - intset
ghroth: redis source code study - ziplist
ghroth: redis source code study - quicklist
ghroth: redis source code study - listpack
Redis's bottom layer is made of data structures such as sds and dict; the different upper-layer containers use different bottom-layer structures as conditions demand, while the containers handle the concrete commands exposed to the user.
As the diagram illustrates, sds and friends are the concrete bottom-layer data structures, while t_string is just a file name in Redis with no data structure of its own; it dispatches to the appropriate bottom-layer structure depending on the client command. For a concrete `set test1 test1value`, Redis receives the SET command and t_string ultimately lands the value in an sds. The bottom-layer structures do the real storing; the upper-layer containers are the bridge between concrete storage and commands. The containers are t_string, t_list, t_hash, t_set, t_zset, etc., each to be analyzed in its own later post.
listpack (the 'compact list') can be understood as a replacement version of ziplist; the specifics of ziplist can be reviewed in the earlier ziplist post.
ziplist has a fatal flaw, cascading updates, which in extreme conditions yield terrible performance and slow the whole Redis instance down. Redis 5 therefore introduced a new structure, listpack, as ziplist's replacement; since Redis 7.0 listpack has served as the underlying structure of t_hash.
Although listpack is billed as an improved ziplist, the overall idea differs little from ziplist's. The listpack layout is as follows:
On its own the diagram looks rather familiar; here is the ziplist layout again for comparison:
Viewed as a whole, listpack simply drops a few fields. The real optimization over ziplist is inside the entry.
/* Each entry in the listpack is either a string or an integer. */
typedef struct {
    /* When string is used, it is provided with the length (slen). */
    unsigned char *sval;
    uint32_t slen;
    /* When integer is used, 'sval' is NULL, and lval holds the value. */
    long long lval;
} listpackEntry;
Improvements in listpackEntry:
Unlike ziplist, the length recorded with each listpack entry is the length of the **current entry, not the length of the previous one.** An entry can hold either a string or an integer.
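That "own length" is written after each entry as a backward-parsable length, so the list can be walked tail-to-head without any entry storing its predecessor's size. Below is a simplified sketch of the scheme used by lpEncodeBacklen in src/listpack.c, covering only the 1- to 3-byte cases (the real function goes up to 5 bytes): 7 useful bits per byte, with the high bit set on the trailing bytes so a backward scan knows where the field starts.

```c
#include <assert.h>
#include <stdint.h>

/* Encode 'l' as a listpack-style backlen into 'buf' (may be NULL to just
 * query the size). Returns the number of bytes used, or 0 for lengths this
 * sketch does not cover. Simplified from lpEncodeBacklen(). */
static unsigned long encode_backlen(unsigned char *buf, uint64_t l) {
    if (l <= 127) {
        if (buf) buf[0] = (unsigned char)l;
        return 1;
    } else if (l < 16384) {
        if (buf) {
            buf[0] = (unsigned char)(l >> 7);
            buf[1] = (unsigned char)((l & 127) | 128);
        }
        return 2;
    } else if (l < 2097152) {
        if (buf) {
            buf[0] = (unsigned char)(l >> 14);
            buf[1] = (unsigned char)(((l >> 7) & 127) | 128);
            buf[2] = (unsigned char)((l & 127) | 128);
        }
        return 3;
    }
    /* 4- and 5-byte cases omitted in this sketch. */
    return 0;
}
```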
unsigned char *lpNew(size_t capacity) {
    unsigned char *lp = lp_malloc(capacity > LP_HDR_SIZE+1 ? capacity : LP_HDR_SIZE+1);
    if (lp == NULL) return NULL;
    lpSetTotalBytes(lp,LP_HDR_SIZE+1);
    lpSetNumElements(lp,0);
    lp[LP_HDR_SIZE] = LP_EOF;
    return lp;
}
Combined with the macros:
#define LP_HDR_SIZE 6 /* 32 bit total len + 16 bit number of elements. */
#define LP_EOF 0xFF
So the header is 6 bytes long, and the heap allocation is LP_HDR_SIZE + 1: 4 bytes record the total length, 2 bytes record the element count, and the extra byte marks the end, which is always 0xFF.
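As a sanity check of that layout, here is a sketch that builds the 7-byte empty listpack by hand (an assumption of this sketch: a little-endian host; real Redis writes the individual bytes explicitly in its macros so it works either way):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LP_HDR_SIZE 6
#define LP_EOF 0xFF

/* Write the 32-bit total-bytes field at offset 0 (little endian assumed). */
static void lp_set_total_bytes(unsigned char *lp, uint32_t v) {
    memcpy(lp, &v, sizeof(v));
}

/* Write the 16-bit element count at offset 4. */
static void lp_set_num_elements(unsigned char *lp, uint16_t v) {
    memcpy(lp + 4, &v, sizeof(v));
}

/* Lay out an empty listpack exactly as lpNew() does: header + terminator. */
static void lp_init_empty(unsigned char *lp) {
    lp_set_total_bytes(lp, LP_HDR_SIZE + 1); /* 7 bytes total */
    lp_set_num_elements(lp, 0);
    lp[LP_HDR_SIZE] = LP_EOF;                /* byte 6 is always 0xFF */
}
```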
Since listpack and ziplist differ little in overall structure, this post only analyzes, at the source level, how listpack avoids cascading updates.
unsigned char *lpInsert(unsigned char *lp, unsigned char *elestr, unsigned char *eleint, uint32_t size, unsigned char *p, int where, unsigned char **newp){ //#define LP_MAX_INT_ENCODING_LEN 9 unsigned char intenc[LP_MAX_INT_ENCODING_LEN]; //#define LP_MAX_BACKLEN_SIZE 5 unsigned char backlen[LP_MAX_BACKLEN_SIZE]; uint64_t enclen; /* The length of the encoded element. */ int delete = (elestr == NULL && eleint == NULL); /* when deletion, it is conceptually replacing the element with a * zero-length element. So whatever we get passed as 'where', set * it to LP_REPLACE. */ //在lp中,删除并非为真正的删除,而是用 zero-length element替换掉需删除的entry,在这里根据delete字段判断,假如不传elestr和eleint,那么就是替换操作。 if (delete) where = LP_REPLACE; /* If we need to insert after the current element, we just jump to the * next element (that could be the EOF one) and handle the case of * inserting before. So the function will actually deal with just two * cases: LP_BEFORE and LP_REPLACE. */ //假如当前操作为LP_AFTER,那么处理一下,将LP_AFTER操作变为LP_BEFORE,在接下来的操作中就无需开分支处理了。 if (where == LP_AFTER) { p = lpSkip(p); where = LP_BEFORE; ASSERT_INTEGRITY(lp, p); } /* Store the offset of the element 'p', so that we can obtain its * address again after a reallocation. */ //记录元素记录p之前的长度。由于lp的设置,在插入或删除后此长度不受影响 unsigned long poff = p-lp; int enctype; //插入str的具体操作 if (elestr) { /* Calling lpEncodeGetType() results into the encoded version of the * element to be stored into 'intenc' in case it is representable as * an integer: in that case, the function returns LP_ENCODING_INT. * Otherwise if LP_ENCODING_STR is returned, we'll have to call * lpEncodeString() to actually write the encoded string on place later. * * Whatever the returned encoding is, 'enclen' is populated with the * length of the encoded element. 
*/ //获取ele的实际encoding enctype = lpEncodeGetType(elestr,size,intenc,&enclen); if (enctype == LP_ENCODING_INT) eleint = intenc; } else if (eleint) { enctype = LP_ENCODING_INT; enclen = size; /* 'size' is the length of the encoded integer element. */ } else { enctype = -1; enclen = 0; } /* We need to also encode the backward-parsable length of the element * and append it to the end: this allows to traverse the listpack from * the end to the start. */ //通过上边提前取得的enclen,解码length的长度,如果为删除操作则无需 unsigned long backlen_size = (!delete) ? lpEncodeBacklen(backlen,enclen) : 0; uint64_t old_listpack_bytes = lpGetTotalBytes(lp); uint32_t replaced_len = 0; //这里是处理删除操作的具体方法 if (where == LP_REPLACE) { replaced_len = lpCurrentEncodedSizeUnsafe(p); replaced_len += lpEncodeBacklen(NULL,replaced_len); ASSERT_INTEGRITY_LEN(lp, p, replaced_len); } //在这里就可以获取到现在的lp大小了 uint64_t new_listpack_bytes = old_listpack_bytes + enclen + backlen_size - replaced_len; //越界判断 if (new_listpack_bytes > UINT32_MAX) return NULL; /* We now need to reallocate in order to make space or shrink the * allocation (in case 'when' value is LP_REPLACE and the new element is * smaller). However we do that before memmoving the memory to * make room for the new element if the final allocation will get * larger, or we do it after if the final allocation will get smaller. */ //poff为之前获取到的p之前的位置,这里的dst则定位到该在哪个地方中进行插入或删除了 unsigned char *dst = lp + poff; /* May be updated after reallocation. */ /* Realloc before: we need more room. */ if (new_listpack_bytes > old_listpack_bytes && new_listpack_bytes > lp_malloc_size(lp)) { if ((lp = lp_realloc(lp,new_listpack_bytes)) == NULL) return NULL; dst = lp + poff; } /* Setup the listpack relocating the elements to make the exact room * we need to store the new one. */ //在之前针对LP_AFTER做的设置后,本函数中只有LP_BEFORE与LP_REPLACE,算是比较好的优化了四类情况 if (where == LP_BEFORE) { memmove(dst+enclen+backlen_size,dst,old_listpack_bytes-poff); } else { /* LP_REPLACE. 
*/ long lendiff = (enclen+backlen_size)-replaced_len; //可以仔细看下传入参数,发现删除就是把当前元素len置空 memmove(dst+replaced_len+lendiff, dst+replaced_len, old_listpack_bytes-poff-replaced_len); } /* Realloc after: we need to free space. */ if (new_listpack_bytes < old_listpack_bytes) { if ((lp = lp_realloc(lp,new_listpack_bytes)) == NULL) return NULL; dst = lp + poff; } /* Store the entry. */ if (newp) { *newp = dst; /* In case of deletion, set 'newp' to NULL if the next element is * the EOF element. */ if (delete && dst[0] == LP_EOF) *newp = NULL; } if (!delete) { if (enctype == LP_ENCODING_INT) { memcpy(dst,eleint,enclen); } else { lpEncodeString(dst,elestr,size); } dst += enclen; memcpy(dst,backlen,backlen_size); dst += backlen_size; } //更新hdr /* Update header. */ if (where != LP_REPLACE || delete) { uint32_t num_elements = lpGetNumElements(lp); if (num_elements != LP_HDR_NUMELE_UNKNOWN) { if (!delete) lpSetNumElements(lp,num_elements+1); else lpSetNumElements(lp,num_elements-1); } } lpSetTotalBytes(lp,new_listpack_bytes);#if 0 /* This code path is normally disabled: what it does is to force listpack * to return *always* a new pointer after performing some modification to * the listpack, even if the previous allocation was enough. This is useful * in order to spot bugs in code using listpacks: by doing so we can find * if the caller forgets to set the new pointer where the listpack reference * is stored, after an update. */ unsigned char *oldlp = lp; lp = lp_malloc(new_listpack_bytes); memcpy(lp,oldlp,new_listpack_bytes); if (newp) { unsigned long offset = (*newp)-oldlp; *newp = lp + offset; } /* Make sure the old allocation contains garbage. */ memset(oldlp,'A',new_listpack_bytes); lp_free(oldlp);#endif return lp;}
As you can see, listpack's insert, delete and update are unified in one stretch of code. Because every entry stores its own length, listpack gives up some of ziplist's extreme memory thrift in exchange for freedom from cascading updates, a large worst-case performance win, though its memory utilization is noticeably worse than ziplist's.
lpInsert accepts both LP_BEFORE and LP_AFTER, and converts LP_AFTER into LP_BEFORE at the very start. Redis implements listpack deletion as replacement with a zero-length element, so delete and update are folded together into LP_REPLACE. After this normalization, insert, delete and update are all implemented inside lpInsert, which only has to distinguish LP_REPLACE from LP_BEFORE.
unsigned char *lpFind(unsigned char *lp, unsigned char *p, unsigned char *s, uint32_t slen, unsigned int skip) { //与ziplist的find类似,因为lp也提供给hash使用,因此lp的find函数也引入了skip字段,帮助lp只遍历key而非value int skipcnt = 0; unsigned char vencoding = 0; unsigned char *value; int64_t ll, vll; uint64_t entry_size = 123456789; /* initialized to avoid warning. */ uint32_t lp_bytes = lpBytes(lp); assert(p); while (p) { if (skipcnt == 0) { value = lpGetWithSize(p, &ll, NULL, &entry_size); if (value) { if (slen == ll && memcmp(value, s, slen) == 0) { return p; } } else { /* Find out if the searched field can be encoded. Note that * we do it only the first time, once done vencoding is set * to non-zero and vll is set to the integer value. */ if (vencoding == 0) { /* If the entry can be encoded as integer we set it to * 1, else set it to UCHAR_MAX, so that we don't retry * again the next time. */ if (slen >= 32 || slen == 0 || !lpStringToInt64((const char*)s, slen, &vll)) { vencoding = UCHAR_MAX; } else { vencoding = 1; } } /* Compare current entry with specified entry, do it only * if vencoding != UCHAR_MAX because if there is no encoding * possible for the field it can't be a valid integer. */ if (vencoding != UCHAR_MAX && ll == vll) { return p; } } /* Reset skip count */ skipcnt = skip; p += entry_size; } else { /* Skip entry */ skipcnt--; /* Move to next entry, avoid use `lpNext` due to `ASSERT_INTEGRITY` in * `lpNext` will call `lpBytes`, will cause performance degradation */ p = lpSkip(p); } assert(p >= lp + LP_HDR_SIZE && p < lp + lp_bytes); if (p[0] == LP_EOF) break; } return NULL;}
As for the find function, there is little to say: a straightforward head-to-tail traversal, with a skip argument (as in ziplist's find) so that hash usage can compare only keys and step over values.
quicklist is a linked-list structure layered on top of ziplist as an optimization: a doubly linked list in which every node points to its own ziplist, keeping ziplist's high memory efficiency while moderately improving its insert/delete efficiency.
This post partly depends on ziplist; readers who need a refresher should revisit the ziplist article first.
quicklist exists because ziplist, for all its memory efficiency, can only be searched by a single linear scan, and inserts/deletes carry a high risk of cascading updates, with a worst case of O(n^2). Such efficiency, compared with sds, dict, zskiplist and the other bottom-layer structures, hardly lives up to the name 'Redis'. Hence quicklist: a structure that partially optimizes ziplist, lowering update complexity while keeping memory efficiency.
As shown below:
each quicklistNode points to its own separate ziplist and holds the addresses of its prev and next quicklistNodes.
Looking at the diagram, quicklist is very simple, and so is its implementation overall, so little introduction is needed.
The quicklist struct
typedef struct quicklist {
    quicklistNode *head;
    quicklistNode *tail;
    unsigned long count;        /* total count of all entries in all ziplists */
    unsigned long len;          /* number of quicklistNodes */
    int fill : QL_FILL_BITS;              /* fill factor for individual nodes */
    unsigned int compress : QL_COMP_BITS; /* depth of end nodes not to compress;
                                             0=off */
    unsigned int bookmark_count: QL_BM_BITS;
    quicklistBookmark bookmarks[];
} quicklist;
head and tail are the head and tail nodes; count is the total number of entries across all ziplists; len is the number of quicklist nodes. fill (QL_FILL_BITS, 14 bits) is the per-node capacity factor. compress (QL_COMP_BITS, 14 bits) is the compression depth, i.e. how many nodes at each end stay uncompressed, related to the LZF compression discussed below. bookmark_count (QL_BM_BITS, 4 bits) is the size of the bookmarks flexible array member, which is used when reallocating the quicklist and occupies no space when unused.
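What fill means in practice can be sketched from the optimization_level table in quicklist.c: a positive fill caps a node by entry count, a negative fill (-1..-5) caps it by ziplist size in bytes, with -2 (the default) meaning 8KB per node. The helper below is a simplified reading of the real checks in _quicklistNodeAllowInsert, not the actual function:

```c
#include <assert.h>
#include <stddef.h>

/* Byte limits for negative fill values, as in quicklist.c. */
static const size_t optimization_level[] = {4096, 8192, 16384, 32768, 65536};

/* Return the per-node byte cap implied by 'fill', or 0 when the node is
 * limited by entry count instead (positive fill). */
static size_t node_byte_limit(int fill) {
    if (fill >= 0) return 0; /* limited by count, not bytes */
    size_t idx = (size_t)(-fill) - 1;
    size_t n = sizeof(optimization_level) / sizeof(optimization_level[0]);
    if (idx >= n) idx = n - 1; /* clamp out-of-range fills to the largest cap */
    return optimization_level[idx];
}
```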
So a quicklist itself only keeps the head and tail quicklistNode addresses, plus auxiliary fields such as count.
The quicklistNode struct
typedef struct quicklistNode {
    struct quicklistNode *prev;
    struct quicklistNode *next;
    unsigned char *zl;
    unsigned int sz;             /* ziplist size in bytes */
    unsigned int count : 16;     /* count of items in ziplist */
    unsigned int encoding : 2;   /* RAW==1 or LZF==2 */
    unsigned int container : 2;  /* NONE==1 or ZIPLIST==2 */
    unsigned int recompress : 1; /* was this node previous compressed? */
    unsigned int attempted_compress : 1; /* node can't compress; too small */
    unsigned int extra : 10;     /* more bits to steal for future usage */
} quicklistNode;
prev and next are the standard doubly-linked-list pointers to the neighboring quicklistNodes. *zl is the ziplist this node uses; sz is that ziplist's size in bytes; count its number of items. encoding says how the ziplist is stored: RAW (1) is a plain ziplist, LZF (2) means the ziplist has been LZF-compressed. recompress records whether this node was previously compressed (and is temporarily decompressed); attempted_compress marks a node that was tried but is too small to be worth compressing.
LZF compression appeared above; only the concept is covered here. LZF mainly does two things: it compresses repeated byte sequences, and it uses hashing to decide whether data is a repeat.
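A toy illustration of those two ideas, detection only: hash short byte sequences to remember where they last occurred, so a repeat can be replaced by a back-reference. Real LZF (src/lzf_c.c in Redis) then emits literal and back-reference opcodes, which this sketch does not attempt.

```c
#include <assert.h>
#include <string.h>

#define HSIZE 4096

/* Return the offset of an earlier occurrence of the 3-byte sequence starting
 * at some later position, i.e. the first compressible repeat, or -1 if the
 * buffer has no repeated trigram. Illustrative only. */
static long find_repeat(const unsigned char *buf, size_t len) {
    long table[HSIZE]; /* trigram hash -> last position seen */
    for (int i = 0; i < HSIZE; i++) table[i] = -1;

    for (size_t pos = 0; pos + 3 <= len; pos++) {
        unsigned h = (buf[pos] * 33u * 33u + buf[pos + 1] * 33u + buf[pos + 2]) % HSIZE;
        long prev = table[h];
        /* Verify the match: hashes can collide. */
        if (prev >= 0 && memcmp(buf + prev, buf + pos, 3) == 0)
            return prev; /* repeated sequence found: compressible */
        table[h] = (long)pos;
    }
    return -1;
}
```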
With the data structures covered, on to the concrete operation functions.
create:
quicklist *quicklistCreate(void) {
    struct quicklist *quicklist;

    quicklist = zmalloc(sizeof(*quicklist));
    quicklist->head = quicklist->tail = NULL;
    quicklist->len = 0;
    quicklist->count = 0;
    quicklist->compress = 0;
    quicklist->fill = -2;
    quicklist->bookmark_count = 0;
    return quicklist;
}
Very simple: allocate the space and initialize the fields.
insert:
/* Wrapper to allow argument-based switching between HEAD/TAIL pop */
void quicklistPush(quicklist *quicklist, void *value, const size_t sz,
                   int where) {
    if (where == QUICKLIST_HEAD) {
        quicklistPushHead(quicklist, value, sz);
    } else if (where == QUICKLIST_TAIL) {
        quicklistPushTail(quicklist, value, sz);
    }
}
The where argument selects head or tail insertion here. Head and tail insertion differ little, so only head insertion is examined.
Head insertion:
/* Add new entry to head node of quicklist.
 *
 * Returns 0 if used existing head.
 * Returns 1 if new head created. */
int quicklistPushHead(quicklist *quicklist, void *value, size_t sz) {
    quicklistNode *orig_head = quicklist->head;

    if (likely(
            /* _quicklistNodeAllowInsert decides whether the head node's
             * ziplist has grown too large to take another insert. */
            _quicklistNodeAllowInsert(quicklist->head, quicklist->fill, sz))) {
        quicklist->head->zl =
            ziplistPush(quicklist->head->zl, value, sz, ZIPLIST_HEAD);
        quicklistNodeUpdateSz(quicklist->head);
    } else {
        /* If _quicklistNodeAllowInsert says the head node's ziplist is too
         * large for another insert, create a new quicklistNode and put the
         * value there instead. */
        quicklistNode *node = quicklistCreateNode();
        node->zl = ziplistPush(ziplistNew(), value, sz, ZIPLIST_HEAD);

        quicklistNodeUpdateSz(node);
        _quicklistInsertNodeBefore(quicklist, quicklist->head, node);
    }
    quicklist->count++;
    quicklist->head->count++;
    return (orig_head != quicklist->head);
}
delete:
/* Delete one element represented by 'entry' * * 'entry' stores enough metadata to delete the proper position in * the correct ziplist in the correct quicklist node. */void quicklistDelEntry(quicklistIter *iter, quicklistEntry *entry) { quicklistNode *prev = entry->node->prev; quicklistNode *next = entry->node->next; int deleted_node = quicklistDelIndex((quicklist *)entry->quicklist, entry->node, &entry->zi); /* after delete, the zi is now invalid for any future usage. */ iter->zi = NULL; /* If current node is deleted, we must update iterator node and offset. */ if (deleted_node) { if (iter->direction == AL_START_HEAD) { iter->current = next; iter->offset = 0; } else if (iter->direction == AL_START_TAIL) { iter->current = prev; iter->offset = -1; } } /* else if (!deleted_node), no changes needed. * we already reset iter->zi above, and the existing iter->offset * doesn't move again because: * - [1, 2, 3] => delete offset 1 => [1, 3]: next element still offset 1 * - [1, 2, 3] => delete offset 0 => [2, 3]: next element still offset 0 * if we deleted the last element at offset N and now * length of this ziplist is N-1, the next call into * quicklistNext() will jump to the next node. */}/* Delete one entry from list given the node for the entry and a pointer * to the entry in the node. * * Note: quicklistDelIndex() *requires* uncompressed nodes because you * already had to get *p from an uncompressed node somewhere. * * Returns 1 if the entire node was deleted, 0 if node still exists. * Also updates in/out param 'p' with the next offset in the ziplist. */REDIS_STATIC int quicklistDelIndex(quicklist *quicklist, quicklistNode *node, unsigned char **p) { int gone = 0; node->zl = ziplistDelete(node->zl, p); node->count--; if (node->count == 0) { gone = 1; __quicklistDelNode(quicklist, node); } else { quicklistNodeUpdateSz(node); } quicklist->count--; /* If we deleted the node, the original node is no longer valid */ return gone ? 
1 : 0;}REDIS_STATIC void __quicklistDelNode(quicklist *quicklist, quicklistNode *node) { /* Update the bookmark if any */ quicklistBookmark *bm = _quicklistBookmarkFindByNode(quicklist, node); if (bm) { bm->node = node->next; /* if the bookmark was to the last node, delete it. */ if (!bm->node) _quicklistBookmarkDelete(quicklist, bm); } if (node->next) node->next->prev = node->prev; if (node->prev) node->prev->next = node->next; if (node == quicklist->tail) { quicklist->tail = node->prev; } if (node == quicklist->head) { quicklist->head = node->next; } /* Update len first, so in __quicklistCompress we know exactly len */ quicklist->len--; quicklist->count -= node->count; /* If we deleted a node within our compress depth, we * now have compressed nodes needing to be decompressed. */ __quicklistCompress(quicklist, NULL); zfree(node->zl); zfree(node);}
Though the deletion code is long, the logic is quite simple: unlink the node from the list, then zfree node->zl and the node itself.
Overall, judging from the name, the 'quick' in quicklist refers to operation complexity relative to ziplist. quicklist itself holds no difficulty; the focus belongs on ziplist. Once ziplist is understood, knowing quicklist's data structures is enough.
ziplist translates as 'compressed list': a specially encoded, two-way-traversable list used to store strings or integers. A ziplist is one contiguous block of memory, which is where the compression comes from. At the implementation level, though, there is no real linked-list structure at all: two adjacent entries are contiguous in memory, and the next/prev operations are computed by moving an address offset. ziplist is extremely memory-efficient, and correspondingly complex.
ziplist is defined unusually compared with Redis's other bottom-layer structures: there is no explicit struct. Creation just returns the ziplist's first address, and all remaining information is derived from the contiguous memory block.
The ziplist layout is illustrated below:
It can be divided by hand into three parts: header, entries, end.
header:
create:
/* Create a new empty ziplist. */
unsigned char *ziplistNew(void) {
    unsigned int bytes = ZIPLIST_HEADER_SIZE+ZIPLIST_END_SIZE;
    unsigned char *zl = zmalloc(bytes);
    ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
    ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
    ZIPLIST_LENGTH(zl) = 0;
    zl[bytes-1] = ZIP_END;
    return zl;
}
create is not complicated: zmalloc a heap block and initialize the header and end, 11 bytes in total.
insert:
/* Insert item at "p". */unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) { size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen, newlen; unsigned int prevlensize, prevlen = 0; size_t offset; int nextdiff = 0; unsigned char encoding = 0; long long value = 123456789; /* initialized to avoid warning. Using a value that is easy to see if for some reason we use it uninitialized. */ zlentry tail; //根据插入的位置得到前一个entry节点的长度。 /* Find out prevlen for the entry that is inserted. */ if (p[0] != ZIP_END) { ZIP_DECODE_PREVLEN(p, prevlensize, prevlen); } else { unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl); if (ptail[0] != ZIP_END) { //获取entryN元素的长度 prevlen = zipRawEntryLengthSafe(zl, curlen, ptail); } } //获取到上一个节点长度后尝试对需要存放的数据进行编码(对于小于32位的字符串转long long) /* See if the entry can be encoded */ if (zipTryEncoding(s,slen,&value,&encoding)) { /* 'encoding' is set to the appropriate integer encoding */ reqlen = zipIntSize(encoding); } else { /* 'encoding' is untouched, however zipStoreEntryEncoding will use the * string length to figure out how to encode it. */ reqlen = slen; } //将会计算前一节点长度与当前数据的字节大小之和,作为新插入节点下一个节点的起始位置 /* We need space for both the length of the previous entry and * the length of the payload. */ reqlen += zipStorePrevEntryLength(NULL,prevlen); reqlen += zipStoreEntryEncoding(NULL,encoding,slen); /* When the insert position is not equal to the tail, we need to * make sure that the next entry can hold this entry's length in * its prevlen field. */ int forcelarge = 0; nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0; if (nextdiff == -4 && reqlen < 4) { nextdiff = 0; forcelarge = 1; } /* Store offset because a realloc may change the address of zl. */ offset = p-zl; newlen = curlen+reqlen+nextdiff; zl = ziplistResize(zl,newlen); //重新计算偏移量,重新申请长度,通过新的地址与偏移量算出插入位置。 p = zl+offset; /* Apply memory move when necessary and update tail offset. 
*/ if (p[0] != ZIP_END) { /* Subtract one because of the ZIP_END bytes */ //p+reqlen为下一个节点的起始位置,将新节点后的数据移动到目标位置 memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff); //假如说forcelarge=1,那么说明新的节点大小需要强制扩展为4个字节,以适应原本的节点表示上一节点的长度大小 /* Encode this entry's raw length in the next entry. */ if (forcelarge) zipStorePrevEntryLengthLarge(p+reqlen,reqlen); else zipStorePrevEntryLength(p+reqlen,reqlen); /* Update offset for tail */ ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen); /* When the tail contains more than one entry, we need to take * "nextdiff" in account as well. Otherwise, a change in the * size of prevlen doesn't have an effect on the *tail* offset. */ assert(zipEntrySafe(zl, newlen, p+reqlen, &tail, 1)); if (p[reqlen+tail.headersize+tail.len] != ZIP_END) { ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff); } } else { /* This element will be the new tail. */ ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl); } /* When nextdiff != 0, the raw length of the next entry has changed, so * we need to cascade the update throughout the ziplist */ if (nextdiff != 0) { offset = p-zl; //连锁反应,如果下一个节点的由于当前节点的插入需要增加的长度超过了254,那么也需要将其下一个节点连锁增加,直到不需要增加长度的节点出现为止。 zl = __ziplistCascadeUpdate(zl,p+reqlen); p = zl+offset; } /* Write the entry */ p += zipStorePrevEntryLength(p,prevlen); p += zipStoreEntryEncoding(p,encoding,slen); if (ZIP_IS_STR(encoding)) { memcpy(p,s,slen); } else { zipSaveInteger(p,value,encoding); } ZIPLIST_INCR_LENGTH(zl,1); return zl;}/n/n/n/n/n
Cascading updates triggered by insert or delete:
unsigned char *__ziplistCascadeUpdate(unsigned char *zl, unsigned char *p) { zlentry cur; size_t prevlen, prevlensize, prevoffset; /* Informat of the last changed entry. */ size_t firstentrylen; /* Used to handle insert at head. */ size_t rawlen, curlen = intrev32ifbe(ZIPLIST_BYTES(zl)); size_t extra = 0, cnt = 0, offset; size_t delta = 4; /* Extra bytes needed to update a entry's prevlen (5-1). */ unsigned char *tail = zl + intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)); /* Empty ziplist */ if (p[0] == ZIP_END) return zl; zipEntry(p, &cur); /* no need for "safe" variant since the input pointer was validated by the function that returned it. */ firstentrylen = prevlen = cur.headersize + cur.len; prevlensize = zipStorePrevEntryLength(NULL, prevlen); prevoffset = p - zl; p += prevlen; /* Iterate ziplist to find out how many extra bytes do we need to update it. */ while (p[0] != ZIP_END) { assert(zipEntrySafe(zl, curlen, p, &cur, 0)); /* Abort when "prevlen" has not changed. */ if (cur.prevrawlen == prevlen) break; /* Abort when entry's "prevlensize" is big enough. */ if (cur.prevrawlensize >= prevlensize) { if (cur.prevrawlensize == prevlensize) { zipStorePrevEntryLength(p, prevlen); } else { /* This would result in shrinking, which we want to avoid. * So, set "prevlen" in the available bytes. */ zipStorePrevEntryLengthLarge(p, prevlen); } break; } /* cur.prevrawlen means cur is the former head entry. */ assert(cur.prevrawlen == 0 || cur.prevrawlen + delta == prevlen); /* Update prev entry's info and advance the cursor. */ rawlen = cur.headersize + cur.len; prevlen = rawlen + delta; prevlensize = zipStorePrevEntryLength(NULL, prevlen); prevoffset = p - zl; p += rawlen; extra += delta; cnt++; } /* Extra bytes is zero all update has been done(or no need to update). */ if (extra == 0) return zl; /* Update tail offset after loop. 
*/ if (tail == zl + prevoffset) { /* When the the last entry we need to update is also the tail, update tail offset * unless this is the only entry that was updated (so the tail offset didn't change). */ if (extra - delta != 0) { ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+extra-delta); } } else { /* Update the tail offset in cases where the last entry we updated is not the tail. */ ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+extra); } /* Now "p" points at the first unchanged byte in original ziplist, * move data after that to new ziplist. */ offset = p - zl; zl = ziplistResize(zl, curlen + extra); p = zl + offset; memmove(p + extra, p, curlen - offset - 1); p += extra; /* Iterate all entries that need to be updated tail to head. */ while (cnt) { zipEntry(zl + prevoffset, &cur); /* no need for "safe" variant since we already iterated on all these entries above. */ rawlen = cur.headersize + cur.len; /* Move entry to tail and reset prevlen. */ memmove(p - (rawlen - cur.prevrawlensize), zl + prevoffset + cur.prevrawlensize, rawlen - cur.prevrawlensize); p -= (rawlen + delta); if (cur.prevrawlen == 0) { /* "cur" is the previous head entry, update its prevlen with firstentrylen. */ zipStorePrevEntryLength(p, firstentrylen); } else { /* An entry's prevlen can only increment 4 bytes. */ zipStorePrevEntryLength(p, cur.prevrawlen+delta); } /* Forward to previous entry. */ prevoffset -= cur.prevrawlen; cnt--; } return zl;}/n/n/n/n/n
The cascading-update code itself is not complicated. The cause is node insertion: suppose a new node p is inserted; its successor p+1 stores the length of p's predecessor p-1. If p-1 was shorter than 254 bytes, p+1 uses a single byte for that prevlength field. But once p is inserted and its length is 254 bytes or more, p+1 must allocate a larger (5-byte) field to hold the prevlength. That growth of p+1 may in turn make the prevlength field stored in p+2 too small, and so on, until some node p+n can record its predecessor's length without growing, or the end of the ziplist is reached. The worst-case time complexity is O(n^2).
Delete:
/* Delete "num" entries, starting at "p". Returns pointer to the ziplist. */unsigned char *__ziplistDelete(unsigned char *zl, unsigned char *p, unsigned int num) { unsigned int i, totlen, deleted = 0; size_t offset; int nextdiff = 0; zlentry first, tail; size_t zlbytes = intrev32ifbe(ZIPLIST_BYTES(zl)); zipEntry(p, &first); /* no need for "safe" variant since the input pointer was validated by the function that returned it. */ for (i = 0; p[0] != ZIP_END && i < num; i++) { p += zipRawEntryLengthSafe(zl, zlbytes, p); deleted++; } assert(p >= first.p); totlen = p-first.p; /* Bytes taken by the element(s) to delete. */ if (totlen > 0) { uint32_t set_tail; if (p[0] != ZIP_END) { /* Storing `prevrawlen` in this entry may increase or decrease the * number of bytes required compare to the current `prevrawlen`. * There always is room to store this, because it was previously * stored by an entry that is now being deleted. */ nextdiff = zipPrevLenByteDiff(p,first.prevrawlen); /* Note that there is always space when p jumps backward: if * the new previous entry is large, one of the deleted elements * had a 5 bytes prevlen header, so there is for sure at least * 5 bytes free and we need just 4. */ p -= nextdiff; assert(p >= first.p && p<zl+zlbytes-1); zipStorePrevEntryLength(p,first.prevrawlen); /* Update offset for tail */ set_tail = intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))-totlen; /* When the tail contains more than one entry, we need to take * "nextdiff" in account as well. Otherwise, a change in the * size of prevlen doesn't have an effect on the *tail* offset. */ assert(zipEntrySafe(zl, zlbytes, p, &tail, 1)); if (p[tail.headersize+tail.len] != ZIP_END) { set_tail = set_tail + nextdiff; } /* Move tail to the front of the ziplist */ /* since we asserted that p >= first.p. we know totlen >= 0, * so we know that p > first.p and this is guaranteed not to reach * beyond the allocation, even if the entries lens are corrupted. 
*/ size_t bytes_to_move = zlbytes-(p-zl)-1; memmove(first.p,p,bytes_to_move); } else { /* The entire tail was deleted. No need to move memory. */ set_tail = (first.p-zl)-first.prevrawlen; } /* Resize the ziplist */ offset = first.p-zl; zlbytes -= totlen - nextdiff; zl = ziplistResize(zl, zlbytes); p = zl+offset; /* Update record count */ ZIPLIST_INCR_LENGTH(zl,-deleted); /* Set the tail offset computed above */ assert(set_tail <= zlbytes - ZIPLIST_END_SIZE); ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(set_tail); /* When nextdiff != 0, the raw length of the next entry has changed, so * we need to cascade the update throughout the ziplist */ if (nextdiff != 0) zl = __ziplistCascadeUpdate(zl,p); } return zl;}
The delete operation mirrors insert; once insert is understood, delete follows. Note that deletion also changes the prevlength stored in the entry that follows the deleted range, so a cascading update may again be needed to repair the ziplist structure.
find:
/* Find pointer to the entry equal to the specified entry. Skip 'skip' entries * between every comparison. Returns NULL when the field could not be found. */unsigned char *ziplistFind(unsigned char *zl, unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip) { int skipcnt = 0; unsigned char vencoding = 0; long long vll = 0; size_t zlbytes = ziplistBlobLen(zl); while (p[0] != ZIP_END) { struct zlentry e; unsigned char *q; assert(zipEntrySafe(zl, zlbytes, p, &e, 1)); q = p + e.prevrawlensize + e.lensize; if (skipcnt == 0) { /* Compare current entry with specified entry */ if (ZIP_IS_STR(e.encoding)) { if (e.len == vlen && memcmp(q, vstr, vlen) == 0) { return p; } } else { /* Find out if the searched field can be encoded. Note that * we do it only the first time, once done vencoding is set * to non-zero and vll is set to the integer value. */ if (vencoding == 0) { if (!zipTryEncoding(vstr, vlen, &vll, &vencoding)) { /* If the entry can't be encoded we set it to * UCHAR_MAX so that we don't retry again the next * time. */ vencoding = UCHAR_MAX; } /* Must be non-zero by now */ assert(vencoding); } /* Compare current entry with specified entry, do it only * if vencoding != UCHAR_MAX because if there is no encoding * possible for the field it can't be a valid integer. */ if (vencoding != UCHAR_MAX) { long long ll = zipLoadInteger(q, e.encoding); if (ll == vll) { return p; } } } /* Reset skip count */ skipcnt = skip; } else { /* Skip entry */ skipcnt--; } /* Move to next entry */ p = q + e.len; } return NULL;}/n/n/n/n/n
find is the most straightforward of the lot: a plain linear scan. Note that it takes an extra skip argument, which exists for the benefit of higher-level containers. If the container is a hash backed by a ziplist, keys and values are stored alternately (key first, then value), so a lookup only needs to compare keys; skip makes the scan hop over each value, improving lookup efficiency.
The intset in Redis stores a set of integers; it suits data with no duplicate numbers and small volume.
Big- and little-endian in C:
**Big-endian:** the low-order bytes of a value are stored at the higher addresses, and the high-order bytes at the lower addresses.
**Little-endian:** the low-order bytes are stored at the lower addresses, and the high-order bytes at the higher addresses.
Redis uses little-endian storage throughout, so /src/endianconv.h and /src/endianconv.c are introduced to detect the machine's native byte order in one place and handle conversions.
Data structure:
/* Note that these encodings are ordered, so:
 * INTSET_ENC_INT16 < INTSET_ENC_INT32 < INTSET_ENC_INT64. */
#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

typedef struct intset {
    uint32_t encoding;
    uint32_t length;
    int8_t contents[];
} intset;
encoding is the element type, one of the macros INTSET_ENC_INT16, INTSET_ENC_INT32, INTSET_ENC_INT64. length is the number of elements in the set, and contents is the array that holds them.
The encoding is determined by the largest value in the set: if a single element needs int64, the whole intset's encoding becomes int64.
Creation:
/* Create an empty intset. */
intset *intsetNew(void) {
    /* Allocate heap memory via zmalloc */
    intset *is = zmalloc(sizeof(intset));
    /* Default to INTSET_ENC_INT16 to save memory */
    is->encoding = intrev32ifbe(INTSET_ENC_INT16);
    is->length = 0;
    return is;
}
At creation everything defaults to INTSET_ENC_INT16 to save memory.
Insertion:
/* Insert an integer in the intset */
intset *intsetAdd(intset *is, int64_t value, uint8_t *success) {
    /* Work out which encoding this value needs */
    uint8_t valenc = _intsetValueEncoding(value);
    uint32_t pos;
    if (success) *success = 1;

    /* Upgrade encoding if necessary. If we need to upgrade, we know that
     * this value should be either appended (if > 0) or prepended (if < 0),
     * because it lies outside the range of existing values. */
    if (valenc > intrev32ifbe(is->encoding)) {
        /* The value's encoding exceeds the intset's: upgrade, then add.
         * This always succeeds, so we don't need to curry *success. */
        return intsetUpgradeAndAdd(is,value);
    } else {
        /* Abort if the value is already present in the set.
         * This call will populate "pos" with the right position to insert
         * the value when it cannot be found. */
        if (intsetSearch(is,value,&pos)) {
            if (success) *success = 0;
            return is;
        }
        /* Grow the array by one slot */
        is = intsetResize(is,intrev32ifbe(is->length)+1);
        /* If the value is not the new maximum, shift the tail back one slot */
        if (pos < intrev32ifbe(is->length)) intsetMoveTail(is,pos,pos+1);
    }

    _intsetSet(is,pos,value);
    is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
    return is;
}

/* Return the required encoding for the provided value. */
static uint8_t _intsetValueEncoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX)
        return INTSET_ENC_INT64;
    else if (v < INT16_MIN || v > INT16_MAX)
        return INTSET_ENC_INT32;
    else
        return INTSET_ENC_INT16;
}
Encoding upgrade plus insert:
/* Upgrades the intset to a larger encoding and inserts the given integer. */
static intset *intsetUpgradeAndAdd(intset *is, int64_t value) {
    uint8_t curenc = intrev32ifbe(is->encoding);
    uint8_t newenc = _intsetValueEncoding(value);
    int length = intrev32ifbe(is->length);
    int prepend = value < 0 ? 1 : 0;

    /* First set new encoding and resize */
    is->encoding = intrev32ifbe(newenc);
    is = intsetResize(is,intrev32ifbe(is->length)+1);

    /* Upgrade back-to-front so we don't overwrite values.
     * Note that the "prepend" variable is used to make sure we have an empty
     * space at either the beginning or the end of the intset. */
    /* Starting from the end, migrate the existing elements */
    while(length--)
        _intsetSet(is,length+prepend,_intsetGetEncoded(is,length,curenc));

    /* Set the value at the beginning or the end. */
    /* A negative value goes at the head of the set */
    if (prepend)
        _intsetSet(is,0,value);
    else
        _intsetSet(is,intrev32ifbe(is->length),value);
    is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
    return is;
}
An insert reallocs on top of the previously allocated memory; if the new value lands in the middle of the intset, every element after it has to shift back by one encoding width.
Deletion:
/* Delete integer from intset */
intset *intsetRemove(intset *is, int64_t value, int *success) {
    uint8_t valenc = _intsetValueEncoding(value);
    uint32_t pos;
    if (success) *success = 0;

    /* If the value's encoding exceeds the intset's, it cannot possibly
     * be present and nothing needs to be done */
    if (valenc <= intrev32ifbe(is->encoding) && intsetSearch(is,value,&pos)) {
        /* Read the length, converting from little-endian if needed */
        uint32_t len = intrev32ifbe(is->length);

        /* We know we can delete */
        if (success) *success = 1;

        /* Overwrite value with tail and update length */
        if (pos < (len-1)) intsetMoveTail(is,pos+1,pos);
        is = intsetResize(is,len-1);
        is->length = intrev32ifbe(len-1);
    }
    return is;
}
As the code shows, deletion first checks the value's encoding; if the encoding is obviously too wide for this set, nothing is done. Deletion never downgrades the encoding: since the set once held an oversized value, another one is fairly likely to appear, so a rash downgrade would cost a re-encode now and, quite probably, another upgrade later. Redis simply spends a little extra memory and avoids the hassle and wasted time.
Lookup:
/* Determine whether a value belongs to this set */
uint8_t intsetFind(intset *is, int64_t value) {
    uint8_t valenc = _intsetValueEncoding(value);
    return valenc <= intrev32ifbe(is->encoding) && intsetSearch(is,value,NULL);
}

/* Search for the position of "value". Return 1 when the value was found and
 * sets "pos" to the position of the value within the intset. Return 0 when
 * the value is not present in the intset and sets "pos" to the position
 * where "value" can be inserted. */
static uint8_t intsetSearch(intset *is, int64_t value, uint32_t *pos) {
    int min = 0, max = intrev32ifbe(is->length)-1, mid = -1;
    int64_t cur = -1;

    /* The value can never be found when the set is empty */
    if (intrev32ifbe(is->length) == 0) {
        if (pos) *pos = 0;
        return 0;
    } else {
        /* Check for the case where we know we cannot find the value,
         * but do know the insert position. */
        if (value > _intsetGet(is,max)) {
            /* value exceeds the current maximum: the insert position
             * is the end of the array */
            if (pos) *pos = intrev32ifbe(is->length);
            return 0;
        } else if (value < _intsetGet(is,0)) {
            /* value is below the current minimum: the insert position is 0 */
            if (pos) *pos = 0;
            return 0;
        }
    }

    /* Binary search */
    while(max >= min) {
        mid = ((unsigned int)min + (unsigned int)max) >> 1;
        cur = _intsetGet(is,mid);
        if (value > cur) {
            min = mid+1;
        } else if (value < cur) {
            max = mid-1;
        } else {
            break;
        }
    }

    if (value == cur) {
        if (pos) *pos = mid;
        return 1;
    } else {
        if (pos) *pos = min;
        return 0;
    }
}

/* Return the value at pos, using the configured encoding. */
static int64_t _intsetGet(intset *is, int pos) {
    return _intsetGetEncoded(is,pos,intrev32ifbe(is->encoding));
}

/* Return the value at pos, given an encoding. */
static int64_t _intsetGetEncoded(intset *is, int pos, uint8_t enc) {
    int64_t v64;
    int32_t v32;
    int16_t v16;

    if (enc == INTSET_ENC_INT64) {
        memcpy(&v64,((int64_t*)is->contents)+pos,sizeof(v64));
        memrev64ifbe(&v64);
        return v64;
    } else if (enc == INTSET_ENC_INT32) {
        memcpy(&v32,((int32_t*)is->contents)+pos,sizeof(v32));
        memrev32ifbe(&v32);
        return v32;
    } else {
        memcpy(&v16,((int16_t*)is->contents)+pos,sizeof(v16));
        memrev16ifbe(&v16);
        return v16;
    }
}
Since an intset holds a sorted array, binary search is used directly.
Overall, intset was designed from the start as a set that stores only integers, with no duplicates allowed, hence the simple, blunt binary search. Performance is middling: lookups sit at the O(log n) level, and every insert or delete may also have to move a block of elements.
Compared with Redis structures such as sds, dict, and hyperloglog, intset is lightly optimized; being simple is arguably its only virtue.
HyperLogLog comes from cardinality estimation: using a probabilistic algorithm, it converts an enormous set into a compact fixed representation, achieving astonishing compression in order to count the number of distinct elements in a large data set.
As said above, being probabilistic, the distinct count HyperLogLog produces is an approximation rather than an exact answer, with an error of roughly 0.81%. With 12 KB of memory it can handle up to 2^64 values, and the time complexity of producing the distinct count reaches an astonishing O(log2(log2(Nmax))).
HyperLogLog is also widely used in big-data middleware such as Flink, Redis, and Kylin.
Some of the proofs below come from the original paper; fetch it yourself if needed (fairly hardcore; proceed with caution unless you are a mathematician or probability enthusiast).
First, compute each element's hash and view it in binary, yielding values like 1xxxx..., 01xxx..., 001xx... and so on. From such bit patterns, the latest position at which the first 1 appears is used to estimate the number of distinct hashes. A first 1 at position 1 occurs with probability 1/2, so HyperLogLog estimates two distinct elements; a first 1 at position 2 occurs with probability 1/4, so the estimate is four, and so on. One pass over all elements yields the maximum first-1 position, and the corresponding estimate is the set's distinct count. The rationale behind this guess is the Bernoulli process.
The Bernoulli process itself is covered in any proper probability text and is not repeated here.
But probability being probability, real usage always has its lucky and unlucky streaks: there is always some chance of flipping heads 10, 20, 30 times in a row. HyperLogLog therefore introduces bucketing.
The bucketing logic: spread all hash values across n buckets, run the estimate within each bucket, take the harmonic mean of all the buckets' results, then multiply by n to obtain the final distinct count.
Analysis of HyperLogLog in Redis
To understand Redis's HyperLogLog, two notions need explaining first: HLL_DENSE and HLL_SPARSE.
HLL_DENSE:
Dense representation: each register is stored in 6 bits. Used when there is a lot of data.
HLL_SPARSE:
Sparse representation, used when data is scarce. It has three opcodes: ZERO, XZERO, and VAL.
ZERO: 6 bits hold a run length, representing 1-64 consecutive registers set to 0.
XZERO: 14 bits hold a run length, representing 1-16384 consecutive registers set to 0.
VAL: 5 bits hold a register value and 2 bits a run length, representing 1-4 consecutive registers set to a value between 1 and 32.
HyperLogLog data structure:
struct hllhdr {
    char magic[4];       /* "HYLL" */
    uint8_t encoding;    /* HLL_DENSE or HLL_SPARSE. */
    uint8_t notused[3];  /* Reserved for future use, must be zero. */
    uint8_t card[8];     /* Cached cardinality, little endian. */
    uint8_t registers[]; /* Data bytes. */
};
magic uses the first four bytes to mark the object as an HLL. encoding records which representation is in use, HLL_DENSE or HLL_SPARSE. notused[3] is a reserved field; since the hll header is byte-aligned throughout, declaring it explicitly makes the code easier to follow. card caches the current cardinality. registers holds the element data, and its shape depends on the encoding: with HLL_DENSE it is a fixed 12 KB array, while with HLL_SPARSE its length varies.
Creating an HLL (HyperLogLog):
/* ========================== HyperLogLog commands ========================== */

/* Create an HLL object. We always create the HLL using sparse encoding.
 * This will be upgraded to the dense representation as needed. */
robj *createHLLObject(void) {
    robj *o;
    struct hllhdr *hdr;
    sds s;
    uint8_t *p;
    /* The final value depends on the macros; analyzed separately below */
    int sparselen = HLL_HDR_SIZE +
                    (((HLL_REGISTERS+(HLL_SPARSE_XZERO_MAX_LEN-1)) /
                     HLL_SPARSE_XZERO_MAX_LEN)*2);
    int aux;

    /* Populate the sparse representation with as many XZERO opcodes as
     * needed to represent all the registers. */
    aux = HLL_REGISTERS;
    s = sdsnewlen(NULL,sparselen);
    p = (uint8_t*)s + HLL_HDR_SIZE;
    while(aux) {
        int xzero = HLL_SPARSE_XZERO_MAX_LEN;
        if (xzero > aux) xzero = aux;
        HLL_SPARSE_XZERO_SET(p,xzero);
        p += 2;
        aux -= xzero;
    }
    serverAssert((p-(uint8_t*)s) == sparselen);

    /* Create the actual object. */
    o = createObject(OBJ_STRING,s);
    hdr = o->ptr;
    memcpy(hdr->magic,"HYLL",4);
    hdr->encoding = HLL_SPARSE;
    return o;
}
Every HLL is created sparse by default, to save memory.
The initial sparselen = HLL_HDR_SIZE + (((HLL_REGISTERS+(HLL_SPARSE_XZERO_MAX_LEN-1)) / HLL_SPARSE_XZERO_MAX_LEN)*2); needs to be read together with the macros:
#define HLL_P 14 /* The greater is P, the smaller the error. */
#define HLL_Q (64-HLL_P) /* The number of bits of the hash value used for
                            determining the number of leading zeros. */
#define HLL_REGISTERS (1<<HLL_P) /* With P=14, 16384 registers. */
#define HLL_P_MASK (HLL_REGISTERS-1) /* Mask to index register. */
So sparselen works out to HLL_HDR_SIZE + ((16384 + 16383) / 16384) * 2 = HLL_HDR_SIZE + 2. In Redis every register starts at 0, so the whole initial representation needs only two opcode bytes, a huge memory saving.
Adding an element:
/* Call hllDenseAdd() or hllSparseAdd() according to the HLL encoding. */
int hllAdd(robj *o, unsigned char *ele, size_t elesize) {
    struct hllhdr *hdr = o->ptr;
    switch(hdr->encoding) {
    case HLL_DENSE: return hllDenseAdd(hdr->registers,ele,elesize);
    case HLL_SPARSE: return hllSparseAdd(o,ele,elesize);
    default: return -1; /* Invalid representation. */
    }
}
The HLL_DENSE add path:
/* "Add" the element in the dense hyperloglog data structure. * Actually nothing is added, but the max 0 pattern counter of the subset * the element belongs to is incremented if needed. * * This is just a wrapper to hllDenseSet(), performing the hashing of the * element in order to retrieve the index and zero-run count. */int hllDenseAdd(uint8_t *registers, unsigned char *ele, size_t elesize) { long index; //index为桶的下标,count最终就是得到后边50个bit里1第一次出现的位置 uint8_t count = hllPatLen(ele,elesize,&index); /* Update the register if this element produced a longer run of zeroes. */ return hllDenseSet(registers,index,count);}/* Given a string element to add to the HyperLogLog, returns the length * of the pattern 000..1 of the element hash. As a side effect 'regp' is * set to the register index this element hashes to. */int hllPatLen(unsigned char *ele, size_t elesize, long *regp) { uint64_t hash, bit, index; int count; /* Count the number of zeroes starting from bit HLL_REGISTERS * (that is a power of two corresponding to the first bit we don't use * as index). The max run can be 64-P+1 = Q+1 bits. * * Note that the final "1" ending the sequence of zeroes must be * included in the count, so if we find "001" the count is 3, and * the smallest count possible is no zeroes at all, just a 1 bit * at the first position, that is a count of 1. * * This may sound like inefficient, but actually in the average case * there are high probabilities to find a 1 after a few iterations. */ hash = MurmurHash64A(ele,elesize,0xadc83b19ULL); index = hash & HLL_P_MASK; /* Register index. */ hash >>= HLL_P; /* Remove bits used to address the register. */ hash |= ((uint64_t)1<<HLL_Q); /* Make sure the loop terminates and count will be <= Q+1. */ bit = 1; count = 1; /* Initialized to 1 since we count the "00000...1" pattern. 
*/ while((hash & bit) == 0) { count++; bit <<= 1; } *regp = (int) index; return count;}/* ================== Dense representation implementation ================== *//* Low level function to set the dense HLL register at 'index' to the * specified value if the current value is smaller than 'count'. * * 'registers' is expected to have room for HLL_REGISTERS plus an * additional byte on the right. This requirement is met by sds strings * automatically since they are implicitly null terminated. * * The function always succeed, however if as a result of the operation * the approximated cardinality changed, 1 is returned. Otherwise 0 * is returned. */int hllDenseSet(uint8_t *registers, long index, uint8_t count) { uint8_t oldcount; HLL_DENSE_GET_REGISTER(oldcount,registers,index); if (count > oldcount) { //更新值,假如说新的1最后出现的位置的值大于原来的,那就将最大的登记 HLL_DENSE_SET_REGISTER(registers,index,count); return 1; } else { return 0; }}
The HLL_SPARSE add path:
/* "Add" the element in the sparse hyperloglog data structure. * Actually nothing is added, but the max 0 pattern counter of the subset * the element belongs to is incremented if needed. * * This function is actually a wrapper for hllSparseSet(), it only performs * the hashing of the element to obtain the index and zeros run length. */int hllSparseAdd(robj *o, unsigned char *ele, size_t elesize) { long index; uint8_t count = hllPatLen(ele,elesize,&index); /* Update the register if this element produced a longer run of zeroes. */ return hllSparseSet(o,index,count);}/* Given a string element to add to the HyperLogLog, returns the length * of the pattern 000..1 of the element hash. As a side effect 'regp' is * set to the register index this element hashes to. */int hllPatLen(unsigned char *ele, size_t elesize, long *regp) { uint64_t hash, bit, index; int count; /* Count the number of zeroes starting from bit HLL_REGISTERS * (that is a power of two corresponding to the first bit we don't use * as index). The max run can be 64-P+1 = Q+1 bits. * * Note that the final "1" ending the sequence of zeroes must be * included in the count, so if we find "001" the count is 3, and * the smallest count possible is no zeroes at all, just a 1 bit * at the first position, that is a count of 1. * * This may sound like inefficient, but actually in the average case * there are high probabilities to find a 1 after a few iterations. */ hash = MurmurHash64A(ele,elesize,0xadc83b19ULL); index = hash & HLL_P_MASK; /* Register index. */ hash >>= HLL_P; /* Remove bits used to address the register. */ hash |= ((uint64_t)1<<HLL_Q); /* Make sure the loop terminates and count will be <= Q+1. */ bit = 1; count = 1; /* Initialized to 1 since we count the "00000...1" pattern. */ while((hash & bit) == 0) { count++; bit <<= 1; } *regp = (int) index; return count;}/* Low level function to set the sparse HLL register at 'index' to the * specified value if the current value is smaller than 'count'. 
* * The object 'o' is the String object holding the HLL. The function requires * a reference to the object in order to be able to enlarge the string if * needed. * * On success, the function returns 1 if the cardinality changed, or 0 * if the register for this element was not updated. * On error (if the representation is invalid) -1 is returned. * * As a side effect the function may promote the HLL representation from * sparse to dense: this happens when a register requires to be set to a value * not representable with the sparse representation, or when the resulting * size would be greater than server.hll_sparse_max_bytes. */int hllSparseSet(robj *o, long index, uint8_t count) { struct hllhdr *hdr; uint8_t oldcount, *sparse, *end, *p, *prev, *next; long first, span; long is_zero = 0, is_xzero = 0, is_val = 0, runlen = 0; /* If the count is too big to be representable by the sparse representation * switch to dense representation. */ if (count > HLL_SPARSE_VAL_MAX_VALUE) goto promote; /* When updating a sparse representation, sometimes we may need to * enlarge the buffer for up to 3 bytes in the worst case (XZERO split * into XZERO-VAL-XZERO). Make sure there is enough space right now * so that the pointers we take during the execution of the function * will be valid all the time. */ o->ptr = sdsMakeRoomFor(o->ptr,3); /* Step 1: we need to locate the opcode we need to modify to check * if a value update is actually needed. */ sparse = p = ((uint8_t*)o->ptr) + HLL_HDR_SIZE; end = p + sdslen(o->ptr) - HLL_HDR_SIZE; first = 0; prev = NULL; /* Points to previous opcode at the end of the loop. */ next = NULL; /* Points to the next opcode at the end of the loop. */ span = 0; while(p < end) { long oplen; /* Set span to the number of registers covered by this opcode. * * This is the most performance critical loop of the sparse * representation. Sorting the conditionals from the most to the * least frequent opcode in many-bytes sparse HLLs is faster. 
*/ oplen = 1; if (HLL_SPARSE_IS_ZERO(p)) { span = HLL_SPARSE_ZERO_LEN(p); } else if (HLL_SPARSE_IS_VAL(p)) { span = HLL_SPARSE_VAL_LEN(p); } else { /* XZERO. */ span = HLL_SPARSE_XZERO_LEN(p); oplen = 2; } /* Break if this opcode covers the register as 'index'. */ if (index <= first+span-1) break; prev = p; p += oplen; first += span; } if (span == 0 || p >= end) return -1; /* Invalid format. */ next = HLL_SPARSE_IS_XZERO(p) ? p+2 : p+1; if (next >= end) next = NULL; /* Cache current opcode type to avoid using the macro again and * again for something that will not change. * Also cache the run-length of the opcode. */ if (HLL_SPARSE_IS_ZERO(p)) { is_zero = 1; runlen = HLL_SPARSE_ZERO_LEN(p); } else if (HLL_SPARSE_IS_XZERO(p)) { is_xzero = 1; runlen = HLL_SPARSE_XZERO_LEN(p); } else { is_val = 1; runlen = HLL_SPARSE_VAL_LEN(p); } /* Step 2: After the loop: * * 'first' stores to the index of the first register covered * by the current opcode, which is pointed by 'p'. * * 'next' ad 'prev' store respectively the next and previous opcode, * or NULL if the opcode at 'p' is respectively the last or first. * * 'span' is set to the number of registers covered by the current * opcode. * * There are different cases in order to update the data structure * in place without generating it from scratch: * * A) If it is a VAL opcode already set to a value >= our 'count' * no update is needed, regardless of the VAL run-length field. * In this case PFADD returns 0 since no changes are performed. * * B) If it is a VAL opcode with len = 1 (representing only our * register) and the value is less than 'count', we just update it * since this is a trivial case. */ if (is_val) { oldcount = HLL_SPARSE_VAL_VALUE(p); /* Case A. */ if (oldcount >= count) return 0; /* Case B. */ if (runlen == 1) { HLL_SPARSE_VAL_SET(p,count,1); goto updated; } } /* C) Another trivial to handle case is a ZERO opcode with a len of 1. * We can just replace it with a VAL opcode with our value and len of 1. 
*/ if (is_zero && runlen == 1) { HLL_SPARSE_VAL_SET(p,count,1); goto updated; } /* D) General case. * * The other cases are more complex: our register requires to be updated * and is either currently represented by a VAL opcode with len > 1, * by a ZERO opcode with len > 1, or by an XZERO opcode. * * In those cases the original opcode must be split into multiple * opcodes. The worst case is an XZERO split in the middle resulting into * XZERO - VAL - XZERO, so the resulting sequence max length is * 5 bytes. * * We perform the split writing the new sequence into the 'new' buffer * with 'newlen' as length. Later the new sequence is inserted in place * of the old one, possibly moving what is on the right a few bytes * if the new sequence is longer than the older one. */ uint8_t seq[5], *n = seq; int last = first+span-1; /* Last register covered by the sequence. */ int len; if (is_zero || is_xzero) { /* Handle splitting of ZERO / XZERO. */ if (index != first) { len = index-first; if (len > HLL_SPARSE_ZERO_MAX_LEN) { HLL_SPARSE_XZERO_SET(n,len); n += 2; } else { HLL_SPARSE_ZERO_SET(n,len); n++; } } HLL_SPARSE_VAL_SET(n,count,1); n++; if (index != last) { len = last-index; if (len > HLL_SPARSE_ZERO_MAX_LEN) { HLL_SPARSE_XZERO_SET(n,len); n += 2; } else { HLL_SPARSE_ZERO_SET(n,len); n++; } } } else { /* Handle splitting of VAL. */ int curval = HLL_SPARSE_VAL_VALUE(p); if (index != first) { len = index-first; HLL_SPARSE_VAL_SET(n,curval,len); n++; } HLL_SPARSE_VAL_SET(n,count,1); n++; if (index != last) { len = last-index; HLL_SPARSE_VAL_SET(n,curval,len); n++; } } /* Step 3: substitute the new sequence with the old one. * * Note that we already allocated space on the sds string * calling sdsMakeRoomFor(). */ int seqlen = n-seq; int oldlen = is_xzero ? 
2 : 1; int deltalen = seqlen-oldlen; if (deltalen > 0 && sdslen(o->ptr)+deltalen > server.hll_sparse_max_bytes) goto promote; if (deltalen && next) memmove(next+deltalen,next,end-next); sdsIncrLen(o->ptr,deltalen); memcpy(p,seq,seqlen); end += deltalen;updated: /* Step 4: Merge adjacent values if possible. * * The representation was updated, however the resulting representation * may not be optimal: adjacent VAL opcodes can sometimes be merged into * a single one. */ p = prev ? prev : sparse; int scanlen = 5; /* Scan up to 5 upcodes starting from prev. */ while (p < end && scanlen--) { if (HLL_SPARSE_IS_XZERO(p)) { p += 2; continue; } else if (HLL_SPARSE_IS_ZERO(p)) { p++; continue; } /* We need two adjacent VAL opcodes to try a merge, having * the same value, and a len that fits the VAL opcode max len. */ if (p+1 < end && HLL_SPARSE_IS_VAL(p+1)) { int v1 = HLL_SPARSE_VAL_VALUE(p); int v2 = HLL_SPARSE_VAL_VALUE(p+1); if (v1 == v2) { int len = HLL_SPARSE_VAL_LEN(p)+HLL_SPARSE_VAL_LEN(p+1); if (len <= HLL_SPARSE_VAL_MAX_LEN) { HLL_SPARSE_VAL_SET(p+1,v1,len); memmove(p,p+1,end-p); sdsIncrLen(o->ptr,-1); end--; /* After a merge we reiterate without incrementing 'p' * in order to try to merge the just merged value with * a value on its right. */ continue; } } } p++; } /* Invalidate the cached cardinality. */ hdr = o->ptr; HLL_INVALIDATE_CACHE(hdr); return 1;promote: /* Promote to dense representation. */ if (hllSparseToDense(o) == C_ERR) return -1; /* Corrupted HLL. */ hdr = o->ptr; /* We need to call hllDenseAdd() to perform the operation after the * conversion. However the result must be 1, since if we need to * convert from sparse to dense a register requires to be updated. * * Note that this in turn means that PFADD will make sure the command * is propagated to slaves / AOF, so if there is a sparse -> dense * conversion, it will be performed in all the slaves as well. 
*/ int dense_retval = hllDenseSet(hdr->registers,index,count); serverAssert(dense_retval == 1); return dense_retval;}
The HLL_SPARSE add path is considerably more complex; only the general approach is described here.
In Redis's HLL, sparse is responsible for storing small volumes of data; once there is too much, the representation switches to dense storage, i.e. HLL_DENSE, as the code shows:
if (count > HLL_SPARSE_VAL_MAX_VALUE) goto promote;
While still within the sparse range, the opcode covering the register is located first. If it is a VAL of length 1 whose stored value is below the new count, it is simply updated in place (if the stored value is already greater than or equal, nothing changes). If the opcode is a VAL with length > 1, a ZERO with length > 1, or an XZERO, it must be split into an appropriate sequence of ZERO, XZERO, and VAL opcodes around the updated register. Finally an optimization pass merges adjacent VAL opcodes where possible.
Getting the count:
#define HLL_REGISTERS (1<<HLL_P) /* With P=14, 16384 registers. *//* Return the approximated cardinality of the set based on the harmonic * mean of the registers values. 'hdr' points to the start of the SDS * representing the String object holding the HLL representation. * * If the sparse representation of the HLL object is not valid, the integer * pointed by 'invalid' is set to non-zero, otherwise it is left untouched. * * hllCount() supports a special internal-only encoding of HLL_RAW, that * is, hdr->registers will point to an uint8_t array of HLL_REGISTERS element. * This is useful in order to speedup PFCOUNT when called against multiple * keys (no need to work with 6-bit integers encoding). */uint64_t hllCount(struct hllhdr *hdr, int *invalid) { double m = HLL_REGISTERS; double E; int j; /* Note that reghisto size could be just HLL_Q+2, because HLL_Q+1 is * the maximum frequency of the "000...1" sequence the hash function is * able to return. However it is slow to check for sanity of the * input: instead we history array at a safe size: overflows will * just write data to wrong, but correctly allocated, places. */ int reghisto[64] = {0}; /* Compute register histogram */ if (hdr->encoding == HLL_DENSE) { hllDenseRegHisto(hdr->registers,reghisto); } else if (hdr->encoding == HLL_SPARSE) { hllSparseRegHisto(hdr->registers, sdslen((sds)hdr)-HLL_HDR_SIZE,invalid,reghisto); } else if (hdr->encoding == HLL_RAW) { hllRawRegHisto(hdr->registers,reghisto); } else { serverPanic("Unknown HyperLogLog encoding in hllCount()"); } /* Estimate cardinality from register histogram. 
See: * "New cardinality estimation algorithms for HyperLogLog sketches" * Otmar Ertl, arXiv:1702.01284 */ //修正的过程,来源于论文 double z = m * hllTau((m-reghisto[HLL_Q+1])/(double)m); for (j = HLL_Q; j >= 1; --j) { z += reghisto[j]; z *= 0.5; } z += m * hllSigma(reghisto[0]/(double)m); E = llroundl(HLL_ALPHA_INF*m*m/z); return (uint64_t) E;}/* Helper function tau as defined in * "New cardinality estimation algorithms for HyperLogLog sketches" * Otmar Ertl, arXiv:1702.01284 */double hllTau(double x) { if (x == 0. || x == 1.) return 0.; double zPrime; double y = 1.0; double z = 1 - x; do { x = sqrt(x); zPrime = z; y *= 0.5; z -= pow(1 - x, 2)*y; } while(zPrime != z); return z / 3;}
To count the distinct elements of a set: hash each element into one of n counters, track in each counter the maximum first-1 position, and finally take the harmonic mean across the n buckets to obtain the probabilistic distinct count.
Additional notes
Where the 12 KB comes from:
For huge data sets Redis essentially always ends up using the dense representation. It first computes a 64-bit hash for each element; the first 14 bits serve as the index and the remaining 50 bits are used for the actual estimate. With 14 index bits there are 2^14 = 16384 buckets. Within the 50 remaining bits, the first 1 must appear at a position no greater than 50, and since 2^6 = 64 > 50, 6 bits per bucket are enough to store it. As for memory: 16384 * 6 / 8 = 12288 bytes, so roughly 12 KB.
It also follows that an HLL cannot store the original element data, and its precision is not guaranteed: the answer is essentially accurate but not necessarily exact. For the same reason, Redis's HLL is commonly used for needs like counting site traffic.
A skip list is an optimized sorted linked list. By building one list into a multi-level structure, lookup improves from O(n) to O(log n). The "skip" refers to making as few comparisons as possible on each level before jumping down to the corresponding position on the next.
A skip list diagram follows:
There are two important concepts in a skip list.
Span:
Take a simple sorted sequence 1,2,3,4,5,6: the span from 1 to 2 is 1, and the span from 1 to 3 is 2.
Level:
A plain sorted linked list is a single line; to spend some space on obtaining coarser-grained indexes up front, layer upon layer of ordering is built, one per granularity.
Putting the two together as in the figure above: the bottom row is the sorted list 1,2,3,4,...,n. Call it level 0. Following a span-2 rule, build a new level: node 1 maintains an extra next pointer to the node after next; call this layer L1. Viewed at L1, the list reads 1,3,5,7,9,...,n. On top of L1, build L2 with what is span 4 at level 0, so L2 reads 1,5,9,...,n, and so on for as many levels as needed.
In the structure shown, the top is L3. To find where 6 sits: check L3 and conclude 6 lies in 1-9, one comparison. At L2, two comparisons place 6 in 5-9. At L1, one comparison places 6 in 5-7. Finally, back on the base list, walk from 5 and find 6 right after it. From L3 down to the base list, that took 5 comparisons.
The larger the list, the more efficient the skip list becomes.
With skip lists themselves explained, back to Redis: what does Redis's skip list actually look like, starting from its data structures?
Analysis of the Redis skip list:
/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;
zskiplist is the skip list itself; zskiplistNode is a node inside it.
First, zskiplistNode.
ele is the string value each node stores; sds behaves much like a plain string — the exact differences are covered in the sds write-up.
score is the ordering key: nodes are sorted by score, and nodes with equal scores are ordered by the lexicographic order of ele.
backward points to the previous node. It is not part of the skip-list index itself; it exists so the base list can be traversed in reverse.
level[] is the skip-list part proper. level is the "L" from the structure description above: level[0], level[1] and level[2] correspond to L1, L2 and L3. Each level has two fields: forward, the address of the next node on that level, and span, the step width. With the earlier figure and description in mind, look again at
the figure: suppose the first node, 1, is a concrete zskiplistNode instance named zskiplistNode1. Then zskiplistNode1's ele is 1, its score can be taken as 1 too, and its backward is 1's previous node, the final node n.
level[] has three levels, written level[0], level[1] and level[2].
In level[0], forward is the address of zskiplistNode3 and span is 2.
In level[1], forward is the address of zskiplistNode5 and span is 4.
In level[2], forward is the address of zskiplistNode9 and span is 8.
Now for zskiplist.
Once zskiplistNode is understood, zskiplist is simple. It keeps just two pointers, the head and the tail of the skip list; length records the number of nodes in the whole list, and level is the highest level the built list has.
With the data structures covered, on to the code for inserting, deleting, updating and looking up.
Building a skip list relies on a random algorithm: each node, when created, is given a random level by this algorithm, and the per-level spans are then computed from the nodes currently in the list.
The random algorithm:
/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
ZSKIPLIST_MAXLEVEL is defined as a macro, #define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^64 elements */, meaning Redis caps a skip list at 32 levels — the higher the level count, the more complex the list. In the source, while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF)) determines the final probability. ZSKIPLIST_P is defined in the macro #define ZSKIPLIST_P 0.25, so zslRandomLevel returns 1 with probability 1-0.25 = 0.75, 2 with probability 0.25 * 0.75, 3 with probability 0.25^2 * 0.75, and so on: the larger the level, the smaller its probability. This is the so-called power law.
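A quick way to convince yourself of that distribution is to sample the same logic. The sketch below reimplements zslRandomLevel standalone; the sampling helper `level_fraction` is ours for illustration, not part of Redis.

```c
#include <assert.h>
#include <stdlib.h>

#define ZSKIPLIST_MAXLEVEL 32
#define ZSKIPLIST_P 0.25

/* Same logic as zslRandomLevel(): each additional level is granted with
 * probability ZSKIPLIST_P, so P(level == k) = (1-p) * p^(k-1). */
static int random_level(void) {
    int level = 1;
    while ((random() & 0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level < ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

/* Fraction of n samples that come out as exactly 'target' levels. */
static double level_fraction(int n, int target) {
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (random_level() == target) hits++;
    return (double)hits / n;
}
```

With P(level == 1) = 0.75, `level_fraction(200000, 1)` should land close to 0.75.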
Skip-list initialization:
/* Create a skiplist node with the specified number of levels.
 * The SDS string 'ele' is referenced by the node after the call. */
zskiplistNode *zslCreateNode(int level, double score, sds ele) {
    zskiplistNode *zn =
        zmalloc(sizeof(*zn)+level*sizeof(struct zskiplistLevel));
    zn->score = score;
    zn->ele = ele;
    return zn;
}

/* Create a new skiplist. */
zskiplist *zslCreate(void) {
    int j;
    zskiplist *zsl;

    /* Allocate the memory. */
    zsl = zmalloc(sizeof(*zsl));
    zsl->level = 1;
    zsl->length = 0;
    /* Create a NULL-valued node as the header; all nodes added later
     * go after it. */
    zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL,0,NULL);
    for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
        zsl->header->level[j].forward = NULL;
        zsl->header->level[j].span = 0;
    }
    zsl->header->backward = NULL;
    zsl->tail = NULL;
    return zsl;
}
Inserting a node:
/* Insert a new node in the skiplist. Assumes the element does not already
 * exist (up to the caller to enforce that). The skiplist takes ownership
 * of the passed SDS string 'ele'. */
zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    /* update and rank are helper arrays used to rewire the skip-list
     * structure. */
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned long rank[ZSKIPLIST_MAXLEVEL];
    int i, level;

    serverAssert(!isnan(score));
    x = zsl->header;
    /* Walk from the highest level down, accumulating the spans crossed
     * on each level into the helper arrays. */
    for (i = zsl->level-1; i >= 0; i--) {
        /* store rank that is crossed to reach the insert position */
        rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                     sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            /* The actual span accumulation. */
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        /* Save the node found for this level in the helper array. */
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated
     * scores, reinserting the same element should never happen since the
     * caller of zslInsert() should test in the hash table if the element is
     * already inside or not. */
    /* Draw the new node's level from the power-law distribution. */
    level = zslRandomLevel();
    /* If the random level is higher than the list's current maximum,
     * the maximum must be raised to the new level. */
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;             /* rank on the new levels starts at 0 */
            update[i] = zsl->header; /* the new levels start at the header */
            update[i]->level[i].span = zsl->length; /* provisionally span the
                                        whole list, head to tail; the real
                                        span is computed below */
        }
        zsl->level = level;          /* update the maximum level */
    }
    /* Initialize the new node and start the insertion proper. */
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        /* On each level, the new node takes over the forward pointer of
         * the node it is inserted after. */
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        /* Recompute the spans from the rank helper array. */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}
As you can see, inserting a node reshapes the skip list and, with small probability, adds a level. Since all of the work is pointer updates on each node's level[] array, performance stays good despite the for and while loops.
Deleting a node:
int zslDelete(zskiplist *zsl, double score, sds ele, zskiplistNode **node) {
    /* Helper array recording, per level, the last node before the target. */
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    int i;

    x = zsl->header;
    /* As in lookup: walk down from the highest level to locate the
     * target's position. */
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                     sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* We may have multiple elements with the same score, what we need
     * is to find the element with both the right score and object. */
    /* Step to the candidate node... */
    x = x->level[0].forward;
    /* ...and verify it really is the node we want to delete. */
    if (x && score == x->score && sdscmp(x->ele,ele) == 0) {
        zslDeleteNode(zsl, x, update);
        if (!node)
            zslFreeNode(x);
        else
            *node = x;
        return 1;
    }
    return 0; /* not found */
}

/* Internal function used by zslDelete, zslDeleteRangeByScore and
 * zslDeleteRangeByRank. */
void zslDeleteNode(zskiplist *zsl, zskiplistNode *x, zskiplistNode **update) {
    int i;
    for (i = 0; i < zsl->level; i++) {
        /* If this level indexes the node being deleted, point the node
         * that used to reach it at its successor, removing it from the
         * chain. */
        if (update[i]->level[i].forward == x) {
            update[i]->level[i].span += x->level[i].span - 1;
            update[i]->level[i].forward = x->level[i].forward;
        } else {
            update[i]->level[i].span -= 1;
        }
    }
    if (x->level[0].forward) {
        x->level[0].forward->backward = x->backward;
    } else {
        zsl->tail = x->backward;
    }
    while(zsl->level > 1 && zsl->header->level[zsl->level-1].forward == NULL)
        zsl->level--;
    zsl->length--;
}

/* Free a node, releasing its sds value via sdsfree. */
void zslFreeNode(zskiplistNode *node) {
    sdsfree(node->ele);
    zfree(node);
}
Lookup:
unsigned long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        /* Advance while the next node's score is smaller, or equal with
         * an ele that compares lexicographically <= the given one. */
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                     sdscmp(x->level[i].forward->ele,ele) <= 0))) {
            /* Accumulate the rank and step forward. */
            rank += x->level[i].span;
            x = x->level[i].forward;
        }

        /* x might be equal to zsl->header, so test if obj is non-NULL */
        if (x->ele && x->score == score && sdscmp(x->ele,ele) == 0) {
            /* A node matching both score and element was found on level i:
             * return the accumulated rank; otherwise 0 is returned below. */
            return rank;
        }
    }
    return 0;
}
This lookup actually returns the rank of an element given its score; the details are in the comments.
Compared with Redis's other data structures, the skip list carries more intricate logic: helper arrays keep span, level and the forward pointers correct and consistent, and the power-law level generator adds a probabilistic element. That is why skip lists are a fixture of any Redis interview or talk. That same complexity, though, limits their use inside Redis: they appear only in zset, behind commands such as ZADD, ZRANGE and ZSCORE.
Redis's dict (dictionary) is a rehash-safe, polymorphic hash table used to store key-value pairs, similar to the map type of other high-level languages, and it performs well.
As a hash-based KV structure built for speed, dict resolves hash collisions by chaining (separate chaining), which can leave some buckets far longer than others. There are many ways to tame long chains — Java's ConcurrentHashMap, for example, rebuilds a chain into a red-black tree once it passes a length threshold. Redis's approach is rehashing: a dict declares **ht_table[2] up front, the second ht_table only comes into play during a rehash, and fields such as rehashidx flag whether a rehash is in progress, which together make the rehash safe.
dict is defined by three structures, analysed one by one below (the separate typedef struct dictht was optimized away in 6).
dictEntry:
typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;     /* Next entry in the same hash bucket. */
    void *metadata[];           /* An arbitrary number of bytes (starting at a
                                 * pointer-aligned address) of size as returned
                                 * by dictType's dictEntryMetadataBytes(). */
} dictEntry;
dictEntry is the concrete node structure of a dict. Note the next pointer to another dictEntry: together with the comment /* Next entry in the same hash bucket. */ it makes the chaining mentioned at the start easy to picture. The hash bucket structure looks like this:
First, a few concepts.
Hashing:
If a record whose key equals K exists in the structure, it must be stored at position f(K); the record can therefore be fetched directly, without comparisons. The mapping f is called the hash function, and a table built this way a hash table. (from Baidu Baike)
In plainer terms: given an array of length k and n numbers, a rule f spreads the n numbers as evenly as possible over the k slots, and to find where a number lives you apply f directly instead of scanning the whole array. There are many possible rules f, chosen to fit the data: the most common is to take each number modulo k; for phone numbers you might take the last four digits, for ID numbers take ***** and reduce it, and so on.
Hash collisions:
Staying with modulo k: when two numbers leave the same remainder modulo k, they collide. A simple example: the six numbers 0 1 2 3 4 5 and an array of length 5. The numbers 0-4 each land in their own slot. When 5 is inserted, slot 0 is already taken by 0. What does a hash bucket do about that? It opens a chain: slot 0 records not only 0 but also a next pointer, and that next is 5, as shown:
That is exactly what next in dictEntry is for.
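A one-line version of the modulo rule makes the collision concrete. `bucket_of` is a made-up name for this sketch, not a Redis function.

```c
#include <assert.h>

/* Toy hash rule f(key) = key % k. Keys 0 and 5 both map to bucket 0 of
 * a 5-slot table — exactly the collision the chain resolves. */
static unsigned bucket_of(unsigned key, unsigned k) {
    return key % k;
}
```

Here `bucket_of(0, 5)` and `bucket_of(5, 5)` both yield 0, so both keys end up chained in the same bucket.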
dictType:
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(dict *d, const void *key);
    void *(*valDup)(dict *d, const void *obj);
    int (*keyCompare)(dict *d, const void *key1, const void *key2);
    void (*keyDestructor)(dict *d, void *key);
    void (*valDestructor)(dict *d, void *obj);
    int (*expandAllowed)(size_t moreMem, double usedRatio);
    /* Allow a dictEntry to carry extra caller-defined metadata. The
     * extra memory is initialized to 0 when a dictEntry is allocated. */
    size_t (*dictEntryMetadataBytes)(dict *d);
} dictType;
Per the comment, this allows a dictEntry to carry extra caller-defined metadata, and it is through these caller-supplied function pointers that dict achieves its polymorphism. The individual functions are analysed a little later.
struct dict {
    dictType *type;

    dictEntry **ht_table[2];
    unsigned long ht_used[2];

    long rehashidx; /* rehashing not in progress if rehashidx == -1 */

    /* Keep small vars at end for optimal (minimal) struct padding */
    int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
    signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
};
This dict is what is ultimately exposed to the outside.
It holds the type and the data in ht_table[2] (two tables, for rehashing). ht_used[2] records how many entries each of the two tables holds. rehashidx is the flag that keeps rehashing safe: rehashing not in progress if rehashidx == -1. pauserehash is a newer addition that lets a rehash be paused for efficiency. ht_size_exp stores each table's size as an exponent of 2 (size = 1 << exp). Powers of two like this are a common pattern; in CentOS networking, for instance, the connection-timeout retry doubles each attempt — 1 s, then 2 s, then 4 s, and so on (which is why machine-level connect timeouts so often surface as 1 s, 3 s, 7 s, 15 s, ...).
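Storing only the exponent pays off when mapping a hash to a bucket: with a power-of-two size, a bit mask replaces the modulo. The sketch below is standalone; the macro names deliberately mirror DICTHT_SIZE / DICTHT_SIZE_MASK from dict.h, and `bucket_index` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

/* With size = 1 << exp, the bucket index is hash & (size - 1), which
 * equals hash % size whenever size is a power of two. exp == -1 marks
 * an unallocated table, as in the reset state of a dict. */
#define HT_SIZE(exp)      ((exp) == -1 ? 0 : (unsigned long)1 << (exp))
#define HT_SIZE_MASK(exp) ((exp) == -1 ? 0 : HT_SIZE(exp) - 1)

static unsigned long bucket_index(uint64_t hash, signed char exp) {
    return hash & HT_SIZE_MASK(exp);
}
```

For example, with exp = 4 (a 16-bucket table), hash 35 maps to bucket 35 % 16 = 3 with a single AND.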
create:
/* Reset a hash table already initialized with ht_init().
 * NOTE: This function should only be called by ht_destroy(). */
static void _dictReset(dict *d, int htidx)
{
    d->ht_table[htidx] = NULL;
    d->ht_size_exp[htidx] = -1;
    d->ht_used[htidx] = 0;
}

/* Create a new hash table */
dict *dictCreate(dictType *type)
{
    dict *d = zmalloc(sizeof(*d));

    _dictInit(d,type);
    return d;
}

/* Initialize the hash table */
int _dictInit(dict *d, dictType *type)
{
    _dictReset(d, 0);
    _dictReset(d, 1);
    d->type = type;
    d->rehashidx = -1;
    d->pauserehash = 0;
    return DICT_OK;
}
Heap allocation still goes through zmalloc from zmalloc.c. Creation requires a dictType; everything else receives defaults, and the two tables, together with their ht_size_exp and ht_used fields, are reset identically.
insert:
/* Add an element to the target hash table */
int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d,key,NULL);

    /* If the key already exists, dictAddRaw returns NULL and we report
     * DICT_ERR. */
    if (!entry) return DICT_ERR;
    dictSetVal(d, entry, val);
    /* The key was free and the entry was added: report DICT_OK. */
    return DICT_OK;
}

dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    int htidx;

    /* If a rehash is in progress, advance it by one step before adding. */
    if (dictIsRehashing(d)) _dictRehashStep(d);

    /* Get the index of the new element, or -1 if
     * the element already exists. */
    /* If the key already exists, return immediately. */
    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)
        return NULL;

    /* Allocate the memory and store the new entry.
     * Insert the element in top, with the assumption that in a database
     * system it is more likely that recently added entries are accessed
     * more frequently. */
    /* Check whether a rehash is running: if so, insert only into
     * ht_table[1]. Lookups then have to search both ht_table[0] and
     * ht_table[1], which costs a little lookup time but saves memory
     * movement — a time/space trade-off. */
    htidx = dictIsRehashing(d) ? 1 : 0;
    size_t metasize = dictMetadataSize(d);
    entry = zmalloc(sizeof(*entry) + metasize);
    /* If there is caller-defined metadata, zero it with the C library's
     * memset. */
    if (metasize > 0) {
        memset(dictMetadata(entry), 0, metasize);
    }
    entry->next = d->ht_table[htidx][index];
    d->ht_table[htidx][index] = entry;
    d->ht_used[htidx]++;

    /* Set the hash entry fields. */
    dictSetKey(d, entry, key);
    return entry;
}
replace:
int dictReplace(dict *d, void *key, void *val)
{
    dictEntry *entry, *existing, auxentry;

    /* Try to add the element. If the key
     * does not exists dictAdd will succeed. */
    /* First try a plain insert: if the key was absent, we are done. */
    entry = dictAddRaw(d,key,&existing);
    if (entry) {
        dictSetVal(d, entry, val);
        return 1;
    }

    /* Set the new value and free the old one. Note that it is important
     * to do that in this order, as the value may just be exactly the same
     * as the previous one. In this context, think to reference counting,
     * you want to increment (set), and then decrement (free), and not the
     * reverse. */
    /* The key exists and its value must be replaced: save a copy of the
     * old entry, set the new value, then free the old value through the
     * saved copy. */
    auxentry = *existing;
    dictSetVal(d, existing, val);
    dictFreeVal(d, &auxentry);
    return 0;
}
delete:
int dictDelete(dict *ht, const void *key) {
    return dictGenericDelete(ht,key,0) ? DICT_OK : DICT_ERR;
}

/* Search and remove an element. This is a helper function for
 * dictDelete() and dictUnlink(), please check the top comment
 * of those functions. */
static dictEntry *dictGenericDelete(dict *d, const void *key, int nofree) {
    uint64_t h, idx;
    dictEntry *he, *prevHe;
    int table;

    /* dict is empty */
    if (dictSize(d) == 0) return NULL;
    /* If a rehash is in progress, advance it by one step first. */
    if (dictIsRehashing(d)) _dictRehashStep(d);
    h = dictHashKey(d, key);

    for (table = 0; table <= 1; table++) {
        /* Map the hash to a bucket index. */
        idx = h & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        he = d->ht_table[table][idx];
        prevHe = NULL;
        while(he) {
            /* If the key is found in this bucket's chain, point its
             * predecessor's next at its successor so the chain entry
             * stays reachable, then remove this entry. */
            if (key==he->key || dictCompareKeys(d, key, he->key)) {
                /* Unlink the element from the list */
                if (prevHe)
                    prevHe->next = he->next;
                else
                    d->ht_table[table][idx] = he->next;
                if (!nofree) {
                    dictFreeUnlinkedEntry(d, he);
                }
                d->ht_used[table]--;
                return he;
            }
            prevHe = he;
            he = he->next;
        }
        /* Only search the second table while rehashing; otherwise stop
         * after the first. */
        if (!dictIsRehashing(d)) break;
    }
    return NULL; /* not found */
}
search:
dictEntry *dictFind(dict *d, const void *key)
{
    dictEntry *he;
    uint64_t h, idx, table;

    if (dictSize(d) == 0) return NULL; /* dict is empty */
    if (dictIsRehashing(d)) _dictRehashStep(d);
    h = dictHashKey(d, key);
    for (table = 0; table <= 1; table++) {
        idx = h & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        he = d->ht_table[table][idx];
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key))
                return he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) return NULL;
    }
    return NULL;
}
The logic here is almost identical to delete, except that delete relinks the chain once the key is located, while find simply returns the entry. The source is therefore shown without extra comments.
rehash:
int dictRehash(dict *d, int n) {
    int empty_visits = n*10; /* Max number of empty buckets to visit. */
    if (!dictIsRehashing(d)) return 0;

    while(n-- && d->ht_used[0] != 0) {
        dictEntry *de, *nextde;

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        assert(DICTHT_SIZE(d->ht_size_exp[0]) > (unsigned long)d->rehashidx);
        while(d->ht_table[0][d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        de = d->ht_table[0][d->rehashidx];
        /* Move all the keys in this bucket from the old to the new hash HT */
        while(de) {
            uint64_t h;

            nextde = de->next;
            /* Get the index in the new hash table */
            h = dictHashKey(d, de->key) & DICTHT_SIZE_MASK(d->ht_size_exp[1]);
            de->next = d->ht_table[1][h];
            d->ht_table[1][h] = de;
            d->ht_used[0]--;
            d->ht_used[1]++;
            de = nextde;
        }
        d->ht_table[0][d->rehashidx] = NULL;
        d->rehashidx++;
    }

    /* Check if we already rehashed the whole table... */
    if (d->ht_used[0] == 0) {
        zfree(d->ht_table[0]);
        /* Copy the new ht onto the old one */
        d->ht_table[0] = d->ht_table[1];
        d->ht_used[0] = d->ht_used[1];
        d->ht_size_exp[0] = d->ht_size_exp[1];
        _dictReset(d, 1);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash... */
    return 1;
}
Because this code interacts heavily with the rest of the code base and needs to be understood as a whole, it is explained in prose here.
Redis rehashes incrementally. Redis cannot bound how large a dict may grow, and with a huge data set, rehashing every key-value pair in one go would block Redis for a long stretch and break the services on top of it. To keep latency under control while staying fast, Redis rehashes only one bucket of the dict at a time. The rehashidx field records which bucket the rehash has reached; once the last bucket is done, the rehash is complete and rehashidx returns to -1. A rehash can therefore stay in progress for a long time, during which adds, deletes, updates and lookups keep arriving. Since a given key may or may not have been moved yet, lookups must check both ht_tables. Inserts go only into ht_table[1], because Redis has already decided the table is skewed; deletes and updates, like lookups, must consult both tables. The rehash is driven forward by each add, delete and lookup. As the explanation above shows, Redis decomposes one big rehash, bucket by bucket, into the stream of ordinary operations, amortizing a single long stall across all of them — a deliberate performance optimization.
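The bucket-at-a-time idea can be modeled in miniature. The sketch below is a toy, not Redis code: `toy_dict`, `toy_rehash_step` and `toy_demo` are invented names, integer keys stand in for real entries, and one call moves exactly one bucket's chain, like dictRehash with n == 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of incremental rehashing with chaining and integer keys. */
typedef struct toy_entry { int key; struct toy_entry *next; } toy_entry;

typedef struct toy_dict {
    toy_entry **ht[2];
    int size[2];        /* bucket counts, powers of two */
    int used[2];
    int rehashidx;      /* -1 when no rehash is in progress */
} toy_dict;

/* Move one non-empty bucket's chain from ht[0] to ht[1].
 * Returns 1 while buckets remain, 0 once the rehash is finished. */
static int toy_rehash_step(toy_dict *d) {
    if (d->rehashidx == -1) return 0;
    while (d->rehashidx < d->size[0] && d->ht[0][d->rehashidx] == NULL)
        d->rehashidx++;
    if (d->rehashidx < d->size[0]) {
        toy_entry *e = d->ht[0][d->rehashidx];
        while (e) {
            toy_entry *next = e->next;
            int idx = e->key & (d->size[1] - 1); /* new table's mask */
            e->next = d->ht[1][idx];
            d->ht[1][idx] = e;
            d->used[0]--; d->used[1]++;
            e = next;
        }
        d->ht[0][d->rehashidx++] = NULL;
    }
    if (d->used[0] == 0) { d->rehashidx = -1; return 0; }
    return 1;
}

/* Put keys 0..3 into a 2-bucket table, rehash into a 4-bucket table
 * step by step, verify the result, and return the number of steps. */
static int toy_demo(void) {
    toy_dict d = {0};
    static toy_entry e[4];
    static toy_entry *t0[2], *t1[4];
    d.ht[0] = t0; d.size[0] = 2;
    d.ht[1] = t1; d.size[1] = 4;
    for (int k = 0; k < 4; k++) {
        e[k].key = k;
        int idx = k & (d.size[0] - 1);
        e[k].next = d.ht[0][idx];
        d.ht[0][idx] = &e[k];
    }
    d.used[0] = 4;
    d.rehashidx = 0;            /* rehash in progress */

    int steps = 0;
    do { steps++; } while (toy_rehash_step(&d));

    /* After the rehash, key k must sit alone in new bucket k & 3. */
    for (int k = 0; k < 4; k++)
        if (!t1[k] || t1[k]->key != k || t1[k]->next) return -1;
    if (d.rehashidx != -1 || d.used[0] != 0 || d.used[1] != 4) return -1;
    return steps;
}
```

With two source buckets, the whole migration completes in two steps, mirroring how Redis spreads the cost over successive operations.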
What triggers a rehash:
/* Note that even when dict_can_resize is set to 0, not all resizes are
 * prevented: a hash table is still allowed to grow if the ratio between
 * the number of elements and the buckets > dict_force_resize_ratio. */
static int dict_can_resize = 1;
static unsigned int dict_force_resize_ratio = 5;
/* This is the initial size of every hash table */
#define DICT_HT_INITIAL_SIZE     4

/* Hash table parameters */
#define HASHTABLE_MIN_FILL        10   /* Minimal hash table fill 10% */
#define HASHTABLE_MAX_LOAD_FACTOR 1.618 /* Maximum hash table load factor. */

int htNeedsResize(dict *dict) {
    long long size, used;

    size = dictSlots(dict);
    used = dictSize(dict);
    return (size > DICT_HT_INITIAL_SIZE &&
            (used*100/size < HASHTABLE_MIN_FILL));
}
Some of the code above comes from the zset code path; taking zset as the example:
the table is shrunk when the hash table's size is greater than 4 (DICT_HT_INITIAL_SIZE) and the dict's fill ratio has dropped below 10%.
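The shrink condition can be checked in isolation. The sketch below is a standalone copy of htNeedsResize's predicate using the same constants; `ht_needs_resize` is our name for it, taking size and used directly instead of a dict.

```c
#include <assert.h>

#define DICT_HT_INITIAL_SIZE 4
#define HASHTABLE_MIN_FILL   10   /* minimal fill, percent */

/* Standalone version of htNeedsResize's predicate: shrink only when the
 * table is bigger than the initial size and less than 10% full. */
static int ht_needs_resize(long long size, long long used) {
    return size > DICT_HT_INITIAL_SIZE &&
           (used * 100 / size < HASHTABLE_MIN_FILL);
}
```

For example, a 64-bucket table with 3 entries (fill ~4%) qualifies for shrinking, while one with 32 entries (fill 50%) does not, and a table at the initial size of 4 is never shrunk.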
Redis's adlist ("A generic doubly linked list implementation") is Redis's own doubly linked list. Its structure is fairly simple, with no special optimizations, so this section is mostly a code walkthrough.
The adlist data structures are as follows
typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

typedef struct listIter {
    listNode *next;
    int direction;
} listIter;

typedef struct list {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;
} list;
listNode is the adlist node: prev points to the previous node and next to the next one.
listIter is Redis's self-implemented iterator: next points at a concrete list node, and direction is the direction of the current iteration, to be read together with the macros below.
/* Directions for iterators */
#define AL_START_HEAD 0
#define AL_START_TAIL 1
AL_START_HEAD and AL_START_TAIL, defined as macros, are the values of listIter's direction field: AL_START_HEAD means the iteration runs forward from the head, AL_START_TAIL means it runs backward from the tail. The relevant part of the iterator source:
if (direction == AL_START_HEAD)
    iter->next = list->head;
else
    iter->next = list->tail;
list is the list structure itself: head is the head node of a list and tail its tail node. Of the three function pointers that follow, dup copies a node's value, free releases it, and match compares it against a key. len is the list length that list maintains itself.
Just like the lists you use every day, adlist implements insertion (at the head), deletion, update, lookup, duplication and joining. Note that Redis allocates new nodes on the heap with zmalloc; see /src/zmalloc.c for details.
Selected adlist source:
Initialization:
list *listCreate(void)
{
    struct list *list;

    if ((list = zmalloc(sizeof(*list))) == NULL)
        return NULL;
    list->head = list->tail = NULL;
    list->len = 0;
    list->dup = NULL;
    list->free = NULL;
    list->match = NULL;
    return list;
}
len is 0; everything else is NULL.
Releasing the list:
/* Free the whole list.
 *
 * This function can't fail. */
void listRelease(list *list)
{
    listEmpty(list);
    zfree(list);
}

listEmpty first walks the list and removes every node:

void listEmpty(list *list)
{
    unsigned long len;
    listNode *current, *next;

    current = list->head;
    len = list->len;
    while(len--) {
        next = current->next;
        if (list->free) list->free(current->value);
        zfree(current);
        current = next;
    }
    list->head = list->tail = NULL;
    list->len = 0;
}

zfree then releases the heap memory; see /src/zmalloc.c for details.
Inserting a node:
list *listInsertNode(list *list, listNode *old_node, void *value, int after) {
    listNode *node;

    /* If allocation fails, insert nothing and return NULL. */
    if ((node = zmalloc(sizeof(*node))) == NULL)
        return NULL;
    node->value = value;
    if (after) {
        node->prev = old_node;
        node->next = old_node->next;
        if (list->tail == old_node) {
            list->tail = node;
        }
    } else {
        node->next = old_node;
        node->prev = old_node->prev;
        if (list->head == old_node) {
            list->head = node;
        }
    }
    if (node->prev != NULL) {
        node->prev->next = node;
    }
    if (node->next != NULL) {
        node->next->prev = node;
    }
    list->len++;
    return list;
}
The rest of the source is not reproduced here — it is all routine (head and tail insertion, next-pointer traversal, and so on). Only the implemented function names are listed; read the source as needed.
list *listCreate(void);
void listRelease(list *list);
void listEmpty(list *list);
list *listAddNodeHead(list *list, void *value);
list *listAddNodeTail(list *list, void *value);
list *listInsertNode(list *list, listNode *old_node, void *value, int after);
void listDelNode(list *list, listNode *node);
listIter *listGetIterator(list *list, int direction);
listNode *listNext(listIter *iter);
void listReleaseIterator(listIter *iter);
list *listDup(list *orig);
listNode *listSearchKey(list *list, void *key);
listNode *listIndex(list *list, long index);
void listRewind(list *list, listIter *li);
void listRewindTail(list *list, listIter *li);
void listRotateTailToHead(list *list);
void listRotateHeadToTail(list *list);
void listJoin(list *l, list *o);
Quite a few of these, listReleaseIterator for example, only free the iterator's memory, and I could not find any callers; they may be older code paths that were retired after 6.
adlist is used widely across Redis, from command history to the pub/sub client lists and beyond.
That wraps up the adlist part of Redis~