前言
map是语言中常用的一个数据结构,其在不同语言中有着不同的实现的,现在我们看下在go中的底层实现。
更多内容分享,欢迎关注公众号:Go开发笔记
map源码
map的底层实现的源码位于runtime/map.go中,其相关方法的调用在编译时确认。
map说明
简单来说,map就是一个哈希表。数据被排列成一个bucket数组。每个bucket最多包含8个key/elem对。哈希的低位用于选择一个bucket。每个bucket包含每个hash的一些高位,以区分单个bucket中的条目。
map的数据存储结构
map实际上就是hash表,其包含2B个bucket,每个bucket最多包含8个键值对。bucket内数据的存储格式为:存储bucket内key对应hash的topHash数组,然后是键值对。这些键值对的存储方式是:8个key然后才是key对应的值。key对的hash的低位表示对应的bucket,高位表示bucket内具体的位置。当前map的bucket数位2B,一个key对应的hash为hash,则其对应bucket的位置为hash&(2^B-1)。
- 每个key对应的位置为:bucket的位置+bmp的大小+keySize*key的位置index。
- 每个key的value对应的位置为:bucket的位置+bmp的大小+keySize*8+elemSize*elem的位置index
bmp中存储了长度为8的数组,用以存储key的hash的高位。因bmp是数组,查找需要依次遍历确认与指定的hash的高位是否一致,如果一致,会取出key进行比较,避免hash冲突。
扩容中,hash的seed并没有发生变化,因此key的hash也不会改变。但扩容可能会导致B变化,由于hash的低位与B才能确定所在的bucket,如果B变化,则对应bucket的位置也会发生变化。因此扩容后,需要复制数据到新的位置中。
其具体结构为hmap。
hmap
// A header for a Go map.
type hmap struct {// Note: the format of the hmap is also encoded in cmd/compile/internal/gc/reflect.go.// Make sure this stays in sync with the compiler's definition.count int // # live cells == size of map. Must be first (used by len() builtin)flags uint8B uint8 // log_2 of # of buckets (can hold up to loadFactor * 2^B items)noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for detailshash0 uint32 // hash seedbuckets unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growingnevacuate uintptr // progress counter for evacuation (buckets less than this have been evacuated)extra *mapextra // optional fields
}// mapextra holds fields that are not present on all maps.
type mapextra struct {// If both key and elem do not contain pointers and are inline, then we mark bucket// type as containing no pointers. This avoids scanning such maps.// However, bmap.overflow is a pointer. In order to keep overflow buckets// alive, we store pointers to all overflow buckets in hmap.extra.overflow and hmap.extra.oldoverflow.// overflow and oldoverflow are only used if key and elem do not contain pointers.// overflow contains overflow buckets for hmap.buckets.// oldoverflow contains overflow buckets for hmap.oldbuckets.// The indirection allows to store a pointer to the slice in hiter.overflow *[]*bmapoldoverflow *[]*bmap// nextOverflow holds a pointer to a free overflow bucket.nextOverflow *bmap
}// A bucket for a Go map.
type bmap struct {// tophash generally contains the top byte of the hash value// for each key in this bucket. If tophash[0] < minTopHash,// tophash[0] is a bucket evacuation state instead.tophash [bucketCnt]uint8// Followed by bucketCnt keys and then bucketCnt elems.// NOTE: packing all the keys together and then all the elems together makes the// code a bit more complicated than alternating key/elem/key/elem/... but it allows// us to eliminate padding which would be needed for, e.g., map[int64]int8.// Followed by an overflow pointer.
}// A hash iteration structure.
// If you modify hiter, also change cmd/compile/internal/gc/reflect.go to indicate
// the layout of this structure.
type hiter struct {key unsafe.Pointer // Must be in first position. Write nil to indicate iteration end (see cmd/internal/gc/range.go).elem unsafe.Pointer // Must be in second position (see cmd/internal/gc/range.go).t *maptypeh *hmapbuckets unsafe.Pointer // bucket ptr at hash_iter initialization timebptr *bmap // current bucketoverflow *[]*bmap // keeps overflow buckets of hmap.buckets aliveoldoverflow *[]*bmap // keeps overflow buckets of hmap.oldbuckets alivestartBucket uintptr // bucket iteration started atoffset uint8 // intra-bucket offset to start from during iteration (should be big enough to hold bucketCnt-1)wrapped bool // already wrapped around from end of bucket array to beginningB uint8i uint8bucket uintptrcheckBucket uintptr
}
hmap中的fields代表的意思依次是:
type hmap struct {count int // map的sizeflags uint8 // 状态标志位B uint8 // log2(buckets) (可以容纳最多loadFactor * 2^B项)noverflow uint16 // overflow buckets的大约数 hash0 uint32 // hash seedbuckets unsafe.Pointer // 2^B Buckets的数组.如果count为0可能为niloldbuckets unsafe.Pointer // 前一个存储桶数组大小的一半,仅在增长时为非零nevacuate uintptr // 排空进度计数器(小于此值的buckets已排空)extra *mapextra // 可选的 fields
}
make
根据初始化map的size及平台,具体的初始化又分为以下几种:
makemap_small
适用于size小于等于8时使用。代码逻辑很简单,就是通过new一个hamp,然后初始化hash0.
// makemap_small implements Go map creation for make(map[k]v) and
// make(map[k]v, hint) when hint is known to be at most bucketCnt
// at compile time and the map needs to be allocated on the heap.
func makemap_small() *hmap {h := new(hmap)h.hash0 = fastrand()return h
}
makemap64
默认使用,当32位使用时,size超出int32位范围时,会重置size为0,然后调用makemap。
func makemap64(t *maptype, hint int64, h *hmap) *hmap {if int64(int(hint)) != hint {hint = 0}return makemap(t, int(hint), h)
}
makemap
32位平台时用以代替makemap64,更快。
func makemap(t *maptype, hint int, h *hmap) *hmap {mem, overflow := math.MulUintptr(uintptr(hint), t.bucket.size)if overflow || mem > maxAlloc {hint = 0}// initialize Hmapif h == nil {h = new(hmap)}h.hash0 = fastrand()// Find the size parameter B which will hold the requested # of elements.// For hint < 0 overLoadFactor returns false since hint < bucketCnt.B := uint8(0)for overLoadFactor(hint, B) {B++}h.B = B// allocate initial hash table// if B == 0, the buckets field is allocated lazily later (in mapassign)// If hint is large zeroing this memory could take a while.if h.B != 0 {var nextOverflow *bmaph.buckets, nextOverflow = makeBucketArray(t, h.B, nil)if nextOverflow != nil {h.extra = new(mapextra)h.extra.nextOverflow = nextOverflow}}return h
}
makemap的处理步骤如下:
- 计算大小为size的map需要占用的内存及是否溢出
- 若溢出则重置大小为0
- 若h为nil,则new一个hmap
- 初始化hashseed
- 根据size及LoadFactor计算B,并赋值到h.B
- 若h.B不为0,构造bucket array,赋值到h.B上,若存在提前分配的overflow buckets,则存入h.Extra
- 返回h
需要注意的是:
当B>=4时,会预先创建部分overflow buckets,用以存储当buckets已满的数据,同时也减少了扩容的频率。
具体源码见makeBucketArray
。
func makeBucketArray(t *maptype, b uint8, dirtyalloc unsafe.Pointer) (buckets unsafe.Pointer, nextOverflow *bmap) {base := bucketShift(b)nbuckets := base// For small b, overflow buckets are unlikely.// Avoid the overhead of the calculation.if b >= 4 {// Add on the estimated number of overflow buckets// required to insert the median number of elements// used with this value of b.nbuckets += bucketShift(b - 4)sz := t.bucket.size * nbucketsup := roundupsize(sz)if up != sz {nbuckets = up / t.bucket.size}}if dirtyalloc == nil {buckets = newarray(t.bucket, int(nbuckets))} else {// dirtyalloc was previously generated by// the above newarray(t.bucket, int(nbuckets))// but may not be empty.buckets = dirtyallocsize := t.bucket.size * nbucketsif t.bucket.ptrdata != 0 {memclrHasPointers(buckets, size)} else {memclrNoHeapPointers(buckets, size)}}if base != nbuckets {// We preallocated some overflow buckets.// To keep the overhead of tracking these overflow buckets to a minimum,// we use the convention that if a preallocated overflow bucket's overflow// pointer is nil, then there are more available by bumping the pointer.// We need a safe non-nil pointer for the last overflow bucket; just use buckets.nextOverflow = (*bmap)(add(buckets, base*uintptr(t.bucketsize)))last := (*bmap)(add(buckets, (nbuckets-1)*uintptr(t.bucketsize)))last.setoverflow(t, (*bmap)(buckets))}return buckets, nextOverflow
}
assign
assign对应的使用为:
m[key] = elem
具体对应的实际是m[key]
部分,返回的是key的地址。
// Like mapaccess, but allocates a slot for the key if it is not present in the map.
func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {if h == nil {panic(plainError("assignment to entry in nil map"))}if raceenabled {callerpc := getcallerpc()pc := funcPC(mapassign)racewritepc(unsafe.Pointer(h), callerpc, pc)raceReadObjectPC(t.key, key, callerpc, pc)}if msanenabled {msanread(key, t.key.size)}if h.flags&hashWriting != 0 {throw("concurrent map writes")}hash := t.hasher(key, uintptr(h.hash0))// Set hashWriting after calling t.hasher, since t.hasher may panic,// in which case we have not actually done a write.h.flags ^= hashWritingif h.buckets == nil {h.buckets = newobject(t.bucket) // newarray(t.bucket, 1)}again:bucket := hash & bucketMask(h.B)if h.growing() {growWork(t, h, bucket)}b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + bucket*uintptr(t.bucketsize)))top := tophash(hash)var inserti *uint8var insertk unsafe.Pointervar elem unsafe.Pointer
bucketloop:for {for i := uintptr(0); i < bucketCnt; i++ {if b.tophash[i] != top {if isEmpty(b.tophash[i]) && inserti == nil {inserti = &b.tophash[i]insertk = add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))}if b.tophash[i] == emptyRest {break bucketloop}continue}k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))if t.indirectkey() {k = *((*unsafe.Pointer)(k))}if !t.key.equal(key, k) {continue}// already have a mapping for key. Update it.if t.needkeyupdate() {typedmemmove(t.key, k, key)}elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))goto done}ovf := b.overflow(t)if ovf == nil {break}b = ovf}// Did not find mapping for key. Allocate new cell & add entry.// If we hit the max load factor or we have too many overflow buckets,// and we're not already in the middle of growing, start growing.if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {hashGrow(t, h)goto again // Growing the table invalidates everything, so try again}if inserti == nil {// all current buckets are full, allocate a new one.newb := h.newoverflow(t, b)inserti = &newb.tophash[0]insertk = add(unsafe.Pointer(newb), dataOffset)elem = add(insertk, bucketCnt*uintptr(t.keysize))}// store new key/elem at insert positionif t.indirectkey() {kmem := newobject(t.key)*(*unsafe.Pointer)(insertk) = kmeminsertk = kmem}if t.indirectelem() {vmem := newobject(t.elem)*(*unsafe.Pointer)(elem) = vmem}typedmemmove(t.key, insertk, key)*inserti = toph.count++done:if h.flags&hashWriting == 0 {throw("concurrent map writes")}h.flags &^= hashWritingif t.indirectelem() {elem = *((*unsafe.Pointer)(elem))}return elem
}
mapassign的处理步骤如下:
- 若h为nil则panic
- 若有其他协程在写入map,则panic相关错误
- 对key进行hash
- 设置flags为Writing
- 若h.buckets为nil,则重新分配
- 进入again处理
- 根据hash获取对应的bucket
- 若h在扩容,则进行扩容工作
- 获取bucket对应的*bmap b及hash的高位top
- 进入bucketloop
- 遍历bucket
- 若b的当前topHash不为top
(1)若当前tophash的状态为empty且inserti为nil,则当前tophash的地址赋值给inserti,并获取key及element的地址
(2)若topHash的状态为emptyReset则跳出bucketloop
(3)继续遍历bucket - 若b的当前topHash不为top,说明已找到匹配的hash。获取key确认与存入的key是否一致(是否hash冲突),不一致则继续遍历bucket。一致,则确实是否更新,需要更新,则更新对应的key。
- 根据key的地址获取element的地址,跳转done
- 获取b的overflow buckets,若为nil,则跳出bucketloop;否则将overflow赋值给b,继续bucketloop
- 如果map没在扩容,新增数据后已超过负载因子或拥有太多的overflow buckets,则进行扩容处理;扩容后进入again
- 如果inseti为nil,说明所有的buckets已经满了,创建新的overflow,存入
- 如果key类型t并非指针类型,则获取其指针
- 如果存储的值类型非指针,获取其指针
- 将key移动至insertK
- 更新inserti指向的值为top
- 进入done
- 如果有其它goroutine在写入,则panic
- 如果存储的类型为值类型,则获取其指向的值地址
- 返回elem
简单总结
简单来说,存入数据需要经过以下几步:
- 计算hash
- 根据hash低位从buckets中确认存入bucket的位置
- 根据hash高位确认bucket内的位置
- 确认key是否存在,若存在则获取数据存入地址
- 否则获取overflow buckets,继续步骤1
- 若需要扩容,则进行扩容,扩容后,继续步骤1
- 如果buckets及overflow buckets均已满,则新建overflow bucket,获取key、elem的地址
- 存入数据
正常存入值的顺序:
- buckets
- overflow buckets
- 扩容后存入buckets/overflow buckets或者创建overflow buckets后存入
assign时的扩容
若map中的数据达到负载因子或者overflow的数量过多时,会发生扩容。
func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {...// If we hit the max load factor or we have too many overflow buckets,// and we're not already in the middle of growing, start growing.if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {hashGrow(t, h)goto again // Growing the table invalidates everything, so try again}...
}
// overLoadFactor reports whether count items placed in 1<<B buckets is over loadFactor.
func overLoadFactor(count int, B uint8) bool {return count > bucketCnt && uintptr(count) > loadFactorNum*(bucketShift(B)/loadFactorDen)
}// tooManyOverflowBuckets reports whether noverflow buckets is too many for a map with 1<<B buckets.
// Note that most of these overflow buckets must be in sparse use;
// if use was dense, then we'd have already triggered regular map growth.
func tooManyOverflowBuckets(noverflow uint16, B uint8) bool {// If the threshold is too low, we do extraneous work.// If the threshold is too high, maps that grow and shrink can hold on to lots of unused memory.// "too many" means (approximately) as many overflow buckets as regular buckets.// See incrnoverflow for more details.if B > 15 {B = 15}// The compiler doesn't see here that B < 16; mask B to generate shorter shift code.return noverflow >= uint16(1)<<(B&15)
}
具体来说需要满足以下两个条件中的一个:
- map中的数据数量count>8 且 count>13*(2^b/2)
- overflow的数量noverflow需要不小于2B,即对于一个map来说,noverflow最大为2B -1,当B足够大时,noverflow最大为2^15 -1
若不满足,且当前的buckets已经满了(buckets、overflow均已经满了)时,则创建新的overflow,用以存储数据。
mapaccess
mapaccess的过程与mapassign的过程类似,区别在于存入/读出。
mapaccess有对应的使用有以下两种形式,
v := m[key] // 1
v,ok := m[key] // 2
第二种相对第一种多了一个bool的参数,表示是否包含此key,此处主要看方式2的源码。
func mapaccess2(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, bool) {if raceenabled && h != nil {callerpc := getcallerpc()pc := funcPC(mapaccess2)racereadpc(unsafe.Pointer(h), callerpc, pc)raceReadObjectPC(t.key, key, callerpc, pc)}if msanenabled && h != nil {msanread(key, t.key.size)}if h == nil || h.count == 0 {if t.hashMightPanic() {t.hasher(key, 0) // see issue 23734}return unsafe.Pointer(&zeroVal[0]), false}if h.flags&hashWriting != 0 {throw("concurrent map read and map write")}hash := t.hasher(key, uintptr(h.hash0))m := bucketMask(h.B)b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + (hash&m)*uintptr(t.bucketsize)))if c := h.oldbuckets; c != nil {if !h.sameSizeGrow() {// There used to be half as many buckets; mask down one more power of two.m >>= 1}oldb := (*bmap)(unsafe.Pointer(uintptr(c) + (hash&m)*uintptr(t.bucketsize)))if !evacuated(oldb) {b = oldb}}top := tophash(hash)
bucketloop:for ; b != nil; b = b.overflow(t) {for i := uintptr(0); i < bucketCnt; i++ {if b.tophash[i] != top {if b.tophash[i] == emptyRest {break bucketloop}continue}k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))if t.indirectkey() {k = *((*unsafe.Pointer)(k))}if t.key.equal(key, k) {e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))if t.indirectelem() {e = *((*unsafe.Pointer)(e))}return e, true}}}return unsafe.Pointer(&zeroVal[0]), false
}
mapaccess2的处理步骤如下:
- 当h为nil或map的大小为0时,说明当前map为空,返回零值及不存在。
- 如果当前map被其他goroutine写入,则panic
- 计算当前key的hash,获取bucket的数量,计算hash对应的buckets的bmp b
- 如果有旧的bucket(扩容中)
(1)如果是同样大小的扩容,则需要将m缩小一半(扩容时增长了一倍)
(2)查找对应位置的bmap为oldb
(3)如果没有oldb没有被清空,则赋值给b - 获取hash的高位top
- 如果b!=nil,进入bucketloop
- 依次每个bucke对应hash的高位
- 如果与top不等,且后续没有bucket,说明key不存在,退出循环
- 如果与top相等,说明key可能在此bucket中,获取对应key,比较是否相等。相等则返回对应的值及存在标识。
简单来说:
- 计算key的hash,查找到对应的b
- 如果在扩容中,重新获取oldb为b
- 获取hash的top,在b(包含overflow)中循环查找k是否存在,存在则返回对应值,否则返回零值
mapdelete
func mapdelete(t *maptype, h *hmap, key unsafe.Pointer) {if raceenabled && h != nil {callerpc := getcallerpc()pc := funcPC(mapdelete)racewritepc(unsafe.Pointer(h), callerpc, pc)raceReadObjectPC(t.key, key, callerpc, pc)}if msanenabled && h != nil {msanread(key, t.key.size)}if h == nil || h.count == 0 {if t.hashMightPanic() {t.hasher(key, 0) // see issue 23734}return}if h.flags&hashWriting != 0 {throw("concurrent map writes")}hash := t.hasher(key, uintptr(h.hash0))// Set hashWriting after calling t.hasher, since t.hasher may panic,// in which case we have not actually done a write (delete).h.flags ^= hashWritingbucket := hash & bucketMask(h.B)if h.growing() {growWork(t, h, bucket)}b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))bOrig := btop := tophash(hash)
search:for ; b != nil; b = b.overflow(t) {for i := uintptr(0); i < bucketCnt; i++ {if b.tophash[i] != top {if b.tophash[i] == emptyRest {break search}continue}k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))k2 := kif t.indirectkey() {k2 = *((*unsafe.Pointer)(k2))}if !t.key.equal(key, k2) {continue}// Only clear key if there are pointers in it.if t.indirectkey() {*(*unsafe.Pointer)(k) = nil} else if t.key.ptrdata != 0 {memclrHasPointers(k, t.key.size)}e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))if t.indirectelem() {*(*unsafe.Pointer)(e) = nil} else if t.elem.ptrdata != 0 {memclrHasPointers(e, t.elem.size)} else {memclrNoHeapPointers(e, t.elem.size)}b.tophash[i] = emptyOne// If the bucket now ends in a bunch of emptyOne states,// change those to emptyRest states.// It would be nice to make this a separate function, but// for loops are not currently inlineable.if i == bucketCnt-1 {if b.overflow(t) != nil && b.overflow(t).tophash[0] != emptyRest {goto notLast}} else {if b.tophash[i+1] != emptyRest {goto notLast}}for {b.tophash[i] = emptyRestif i == 0 {if b == bOrig {break // beginning of initial bucket, we're done.}// Find previous bucket, continue at its last entry.c := bfor b = bOrig; b.overflow(t) != c; b = b.overflow(t) {}i = bucketCnt - 1} else {i--}if b.tophash[i] != emptyOne {break}}notLast:h.count--break search}}if h.flags&hashWriting == 0 {throw("concurrent map writes")}h.flags &^= hashWriting
}
- 对于nil或者大小为0的map的进行delete操作,进行可能会panic的处理,然后返回
- 如果当前map被其他goroutine写入,则panic
- 计算当前key的hash,获取bucket的数量,计算hash对应的bucket
- 如果正在扩容,则继续进行扩容处理
- 获取bucket对应的bmp
- 获取hash的高位top
- 进入search
- 如果b!=nil,依次查找bucke内对应hash的高位
(1)若没有查找到,若没有后续bucket时,退出search,否则继续查找bucket内的下一个数据
(2)若查找到,获取对应的key,比较两者是否一致,若不一致,继续查找
(3)若一致,将key及elem指向nil或清空数据
(3)将当前tophash对应的位置的状态设置为emptyOne
(4)继续判断后续位置的状态,如果bucket现在以一堆emptyOne的状态结束,那么将这些状态更改为emptyReset的状态.
(5) 数据count减一,退出search - 如果当前map被其他goroutine写入,则panic
简单来说,就是在查找后进行k/e的清除处理,然后进行bucket的状态标记,最后count减一。
总结
本文从map的数据结构、make、assign、access、delete的对应源码分析了处理过程。
- map实际上就是hash表,底层存储结构采用的是数组。
- map会在创建时(满足条件)预创建部分overflow buckets,方便数据的存入,减少扩容的频率。
- map数据的读写顺序: buckets->overflow buckets,若需要扩容,则扩容后再读写
- map在size超过负载因子或者overflow buckets数量过多时会扩容
- 并发读写map,会发生panic(当一goroutine正在写入时,若其他goroutine的读写操作会发生panic)
本文大略提到了map扩容的条件,具体的扩容处理在稍后的文章中继续深入。
公众号
鄙人刚刚开通了公众号,专注于分享Go开发相关内容,望大家感兴趣的支持一下,在此特别感谢。