The Storage of Docker Swarm

GlusterFS usage workflow:

  • Following the previous post, the application services are already deployed. However, because Docker Swarm scheduling is non-deterministic (placement can be controlled at deploy time, but when a node fails a task may be rescheduled onto another node, and the data must stay consistent and continuously available), the volume layer is backed by a distributed file system (GlusterFS).
Overview

GlusterFS is an open-source distributed file system made up of storage servers (brick servers), clients, and an optional NFS/Samba storage gateway. It is also the core of Gluster, a scale-out storage solution: it scales horizontally to several petabytes of capacity and thousands of clients.
GlusterFS aggregates physically scattered storage resources over TCP/IP or InfiniBand RDMA (a technology that supports many concurrent connections with high bandwidth, low latency, and good scalability) into a single storage service, managed under one global unified namespace.

Components

Brick (storage unit)
A dedicated partition provided by a host in the trusted storage pool for physical storage; it is the basic storage unit in GlusterFS and the directory a server in the trusted pool exports.
A storage directory is identified by the server plus the absolute path of the directory, written as SERVER:EXPORT, e.g. 192.168.126.10:/data/mydir/.

Volume (logical volume)
A logical volume is a collection of bricks. A volume is the logical device on which data is stored, similar to a logical volume in LVM. Most Gluster management operations are performed on volumes.

FUSE
A kernel module that lets users implement their own file systems without modifying kernel code.

VFS
The interface the kernel exposes to user space for accessing storage.

Glusterd (management daemon)
Runs on every node in the storage cluster.

Workflow

1) A client or application accesses data through the GlusterFS mount point.
2) The Linux kernel receives the request through the VFS API and processes it.
3) VFS hands the request to the FUSE kernel module, which passes it through the /dev/fuse device file to the GlusterFS client; the FUSE file system can be thought of as a proxy.
4) The GlusterFS client processes the data according to its configuration.
5) The data is sent over the network to the remote GlusterFS servers and written to their storage devices.
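A quick way to see this chain on a client once a volume is mounted (a minimal check sketch; the mount point is the one used later in this post):

lsmod | grep fuse                       # the FUSE kernel module is loaded
mount | grep fuse.glusterfs             # the mount shows up with type fuse.glusterfs
ps -ef | grep '[g]lusterfs'             # the GlusterFS client process that services /dev/fuse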

Volume types:

Distributed: comparable to RAID 0; highly scalable but without redundancy. If a disk fails and its gluster service becomes unavailable, the data on that brick is lost, but the gluster volume as a whole stays usable. This is the default type and gives the highest read/write throughput.
Striped: comparable to RAID 0 at the block level, so large files are split across bricks; optimized for large-file reads. If any disk fails the whole gluster volume becomes unavailable. (No longer supported in recent versions.)
Replicated: comparable to RAID 1; fault tolerant, with better read performance and lower write performance. Needs at least 2 server disks and halves the usable space; suited to smaller data sets that must stay highly available. With more than two replicas, configure an arbiter to prevent split-brain.
Distributed Striped: combines the distributed and striped types; needs at least 4 servers, and a failure takes the whole volume down.
Distributed Replicated: comparable to RAID 10; needs at least 4 servers and combines the distributed and replicated types. Highly available; suited to workloads that need both performance and reliability.
Distributed Striped Replicated: a compound of the three basic volume types.
Dispersed: erasure coded; saves space while still protecting against disk or server failure. Better storage efficiency than replication with solid fault tolerance; suited to large data sets that need redundancy.
Distributed Dispersed: distribution on top of dispersed subvolumes (closer to RAID 60 than RAID 10); keeps and improves on the advantages of Distributed Replicated and fits the same scenarios.
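As a worked example of how the volume type drives usable capacity (illustrative numbers only, assuming six 1 TB bricks):
distributed: 6 x 1 TB = 6 TB usable, no redundancy at all
replica 3 (two replica sets of 3): 6 / 3 x 1 TB = 2 TB usable, every file kept in 3 copies
disperse 4+2 (redundancy 2): (6 - 2) x 1 TB = 4 TB usable, any 2 bricks may fail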

Setup steps:
Node             Directory   Mount point
192.168.40.239 /data /data
192.168.50.207 /data /data
192.168.50.208 /data /data
192.168.40.175 /data /data
192.168.40.240 /data /data

In production the cluster can be built quickly with Ansible: https://github.com/gluster/gluster-ansible
Linux kernel parameter tuning (recommended before building the cluster): https://docs.gluster.org/en/latest/Administrator-Guide/Linux-Kernel-Tuning/#commentbengland_1
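A minimal sketch of the kind of sysctl settings the linked tuning guide discusses (the values below are illustrative placeholders, not recommendations; take the real numbers from the guide and your own testing):

cat << EOF > /etc/sysctl.d/90-glusterfs.conf
# flush dirty pages earlier so large sequential writes do not stall
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
# avoid swapping out the brick/server processes
vm.swappiness = 10
EOF
sysctl --system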

# Stop the firewall; after setup, reopen it by port (24007:24008, 49152:49156)
systemctl stop firewalld
# Put SELinux into permissive mode
setenforce 0
# Create the /data directory on every node
mkdir /data
# Optional: ideally each node dedicates a brand-new disk so existing data is not disturbed:
# mkfs.xfs -i size=512 /dev/sdb1   # mkfs.ext4 also works; rule of thumb: ext4 for small disks, xfs for large ones
# mkdir -p /data/brick1
# echo '/dev/sdb1 /data/brick1 xfs defaults 1 2' >> /etc/fstab
# mount -a && mount
# Set up internal hostname resolution via /etc/hosts
echo "192.168.40.239 node1" >> /etc/hosts
echo "192.168.50.207 node2" >> /etc/hosts
echo "192.168.50.208 node3" >> /etc/hosts
echo "192.168.40.175 node4" >> /etc/hosts
echo "192.168.40.240 node5" >> /etc/hosts
echo "192.168.50.177 node6" >> /etc/hosts
# For convenience, pick one machine as the manager and set up passwordless SSH to the others
# ssh-keygen -t rsa -q -P "<key passphrase>" -f ~/.ssh/id_rsa
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.50.207
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.50.208
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.40.175
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.40.240
# Kernel module that must be available beforehand
lsmod | grep -q fuse || modprobe fuse
# Prerequisite packages
yum -y install openssh-server wget fuse fuse-libs openmpi libibverbs
# Install GlusterFS on every node
dnf install centos-release-gluster9 -y
dnf install -y glusterfs glusterfs-api glusterfs-fuse glusterfs-rdma glusterfs-libs glusterfs-server
# Enable at boot and start
systemctl enable glusterfsd.service --now
systemctl enable glusterd.service --now
# Run from the manager node; GlusterFS peers are joined point-to-point
gluster peer probe node2
gluster peer probe node3
gluster peer probe node4
gluster peer probe node5
gluster peer probe node6
# Then, from any one of the other nodes, probe the manager so it is also recorded in the pool
gluster peer probe node1
# List the nodes currently in the gluster pool
gluster pool list
# gluster peer status shows the connection state (output differs slightly on each node, but the peer count matches)
# Prepare the brick directory on every node that will contribute a brick to the volume
# mkdir -p /data/glusterfs/<volume name>/brick<N> (one directory per brick)
mkdir -p /data/glusterfs/online-share/brick1

# The commands below can be run from any single node
## Create a distributed volume
gluster volume create online-share transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1

## Create a replicated volume
# 2 bricks
gluster volume create online-share replica 2 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1
# 3 bricks (one of them an arbiter)
gluster volume create online-share replica 3 arbiter 1 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1
# 4 bricks
gluster volume create online-share replica 4 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1

## Create a distributed replicated volume
# 4 bricks
gluster volume create online-share replica 2 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1
# 6 bricks
gluster volume create online-share replica 2 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1 node5:/data/glusterfs/online-share/brick1 node6:/data/glusterfs/online-share/brick1

## Create a dispersed volume
# Usable size formula: <Usable size> = <Brick size> * (#Bricks - Redundancy), i.e. per-brick capacity * (total bricks - redundancy bricks); the redundancy count is the number of bricks that may fail
# 4 bricks; if redundancy is not given, the optimal value is computed automatically. If redundancy would equal half the brick count, a replicated volume is more efficient. Pitfall: when bricks are to be removed later, keep the layout even (e.g. 4+2).
gluster volume create online-share disperse 4 [redundancy 1-2] transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1

## Create a distributed dispersed volume
# disperse must be specified, and the total brick count must be a multiple of the disperse count (6 bricks here = 2 subvolumes of disperse 3)
gluster volume create online-share disperse 3 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1 node5:/data/glusterfs/online-share/brick1 node6:/data/glusterfs/online-share/brick1

# Start the volume
gluster volume start online-share
# List the volumes gluster knows about
gluster volume list
# Show detailed information for the volume
gluster volume info online-share
# Steps to take the volume offline
# Unmount the volume on every client first
umount mount-point
gluster volume stop online-share
# Change volume settings
gluster volume set online-share config.transport tcp,rdma   # or tcp, or rdma
# Remount with a different transport
mount -t glusterfs -o transport=rdma node1:online-share /mnt/glusterfs

# Scaling out:
# Add a node (adding a brick works much the same way):
gluster peer probe newNode
gluster volume add-brick online-share newNode:/data/glusterfs/online-share/brick1
# Scaling in (a dispersed volume as the example):
gluster volume heal online-share info # make sure no self-heal is still running
# This volume was created with 4 bricks and auto-computed redundancy, so it has to reach a 4+2 layout before bricks can be removed
gluster volume remove-brick online-share node4:/data/glusterfs/online-share/brick1 start # rerun with commit instead of start once data migration finishes
# Optional but recommended after scaling: rebalance the volume
gluster volume rebalance online-share [fix-layout] start [force]

# A dispersed volume's brick count cannot be changed after creation, so capacity changes are done by swapping old bricks for new ones
gluster volume replace-brick online-share node5:/data/glusterfs/online-share/brick1 node6:/data/glusterfs/online-share/brick1 commit force

# Restrict which clients may access the volume
gluster volume set online-share auth.allow "*" # a specific IP or list of IPs can be used instead of *

# Letting clients keep using the volume while a brick is offline improves availability but increases the risk of inconsistent data (not recommended: it can cause split-brain)
# server quorum: bricks are only served while most of the cluster is up; 'none' disables the check
gluster volume set online-share cluster.server-quorum-type none
# client quorum: writes only proceed while most of the replica set is reachable; 'none' disables the check
gluster volume set online-share cluster.quorum-type none

# What was actually used in this setup:
gluster volume create online-share disperse 4 transport tcp node1:/data/glusterfs/online-share/brick1 node2:/data/glusterfs/online-share/brick1 node3:/data/glusterfs/online-share/brick1 node4:/data/glusterfs/online-share/brick1 force
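# A hedged follow-up to the firewall note at the top of this section: once the volume works,
# firewalld can be turned back on with only the GlusterFS ports open (adjust the brick port
# range to the number of bricks actually exported per node)
systemctl start firewalld
firewall-cmd --permanent --add-port=24007-24008/tcp
firewall-cmd --permanent --add-port=49152-49156/tcp
firewall-cmd --reload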

Client usage:

Note: every node in the GlusterFS cluster holds the full cluster information, so connecting to any single node is enough to reach the volume. That node, however, becomes a single point of failure for the initial connection; ways around it:
1. Put DNS round-robin in front, at the cost of DNS caching and propagation delays.
2. Put a load balancer such as HAProxy in front.
3. Configure all the nodes on the client side, e.g.: mount -t glusterfs node1,node2,node3:/online-share /mnt/glusterfs (a mount sketch follows the reference link below)
Reference: https://ruan.dev/blog/2019/03/05/setup-a-3-node-replicated-storage-volume-with-glusterfs
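A minimal sketch of option 3 using the FUSE client's backup-volfile-servers mount option (the same option used in the fstab entry further down); node1/node2/node3 are the hosts defined in /etc/hosts above:

mount -t glusterfs -o backup-volfile-servers=node2:node3 node1:online-share /mnt/glusterfs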

Plain Linux client nodes:
# The GlusterFS client bits need to be installed
yum -y install openssh-server wget fuse fuse-libs openmpi libibverbs
dnf install centos-release-gluster9 -y
dnf install -y glusterfs glusterfs-api glusterfs-fuse glusterfs-rdma glusterfs-libs glusterfs-cli glusterfs-client-xlators
for i in openssh-server wget fuse fuse-libs openmpi libibverbs centos-release-gluster9;do ansible hosts -m yum -a 'name='"${i}"' state=present';done
for i in glusterfs glusterfs-api glusterfs-fuse glusterfs-rdma glusterfs-libs glusterfs-cli glusterfs-client-xlators;do ansible hosts -m yum -a 'name='"${i}"' state=present';done
# Clients also need the node entries in /etc/hosts
echo "192.168.40.239 node1" >> /etc/hosts
echo "192.168.50.207 node2" >> /etc/hosts
echo "192.168.50.208 node3" >> /etc/hosts
echo "192.168.40.175 node4" >> /etc/hosts
echo "192.168.40.240 node5" >> /etc/hosts
echo "192.168.50.177 node6" >> /etc/hosts
ansible hosts -m shell -a 'grep -q "192.168.40.239 node1" /etc/hosts || echo "192.168.40.239 node1" >> /etc/hosts'
ansible hosts -m shell -a 'grep -q "192.168.50.207 node2" /etc/hosts || echo "192.168.50.207 node2" >> /etc/hosts'
ansible hosts -m shell -a 'grep -q "192.168.50.208 node3" /etc/hosts || echo "192.168.50.208 node3" >> /etc/hosts'
ansible hosts -m shell -a 'grep -q "192.168.40.175 node4" /etc/hosts || echo "192.168.40.175 node4" >> /etc/hosts'
ansible hosts -m shell -a 'grep -q "192.168.40.240 node5" /etc/hosts || echo "192.168.40.240 node5" >> /etc/hosts'
ansible hosts -m shell -a 'grep -q "192.168.50.177 node6" /etc/hosts || echo "192.168.50.177 node6" >> /etc/hosts'
# If the volume was mounted before, run systemctl daemon-reload first
mount -t glusterfs node1:online-share /mnt/glusterfs
# Mount at boot via /etc/fstab: node1:online-share /mnt/glusterfs glusterfs defaults,_netdev,direct-io-mode=enable,backup-volfile-servers=192.168.40.239:192.168.50.207:192.168.50.208:192.168.50.177 0 0
ansible hosts -m shell -a '( df -Th | grep -q "/mnt/glusterfs" && cat /etc/fstab | grep -q "/mnt/glusterfs" ) || (echo "node1:online-share /mnt/glusterfs glusterfs defaults,_netdev,direct-io-mode=enable,backup-volfile-servers=node1:node2:node3:node6 0 0" >> /etc/fstab && systemctl daemon-reload && mkdir -p /mnt/glusterfs && mount -a)'
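# Optional sanity check (not in the original steps): write a file from one client and read it from
# the others to confirm every client sees the same volume
echo "hello from $(hostname)" > /mnt/glusterfs/mount-test.txt
ansible hosts -m shell -a 'cat /mnt/glusterfs/mount-test.txt && df -Th /mnt/glusterfs'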

A record of mistakes made along the way:

# The command below is a trap: it wiped /etc/fstab (the redirection truncates the file before awk reads it). A reboot in that state would be very dangerous.
# ansible hosts -m shell -a 'awk "!seen[$0]++" /etc/fstab > /etc/fstab && mount -a'
# If the machine does reboot in that state, recover with:
mount -o remount,rw /
systemctl restart NetworkManager
# That restores networking
# Rebuild /etc/fstab
#!/bin/bash
# Rebuild /etc/fstab from what is currently mounted (findmnt) plus blkid output

findmnt --real > mounted_info.txt
blkid > partition_info.txt

glusterfs_entry="node1:online-share /mnt/glusterfs glusterfs defaults,_netdev,direct-io-mode=enable,backup-volfile-servers=node1:node2:node3:node6 0 0"

sed '1d' mounted_info.txt | while read -r line; do
    location=$(echo "${line}" | awk '{print $1}')
    source=$(echo "${line}" | awk '{print $2}')
    fstype=$(echo "${line}" | awk '{print $3}')
    if echo "${location}" | grep -qE "├|└"; then
        # Child mounts in the findmnt tree carry a "├─"/"└─" prefix; strip it to get the real path
        location=$(echo "${location}" | awk -F "─" '{print $NF}')
        grep -q "${location}" /etc/fstab || {
            if [[ "${location}" == "/boot" ]]; then
                fsid=$(grep "${source}" partition_info.txt | awk -F '"' '{print $2}')
                echo "UUID=${fsid} ${location} ${fstype} defaults 0 0" >> /etc/fstab && systemctl daemon-reload && mount -a
            elif [[ "${location}" == "/mnt/glusterfs" ]]; then
                echo "${glusterfs_entry}" >> /etc/fstab && systemctl daemon-reload && mkdir -p /mnt/glusterfs && mount -a
            else
                :   # leave other child mounts alone
            fi
        }
    else
        # Top-level mounts (e.g. /) are added back by device path if they are missing
        grep -q "${source}" /etc/fstab || { echo "${source} ${location} ${fstype} defaults 0 0" >> /etc/fstab && systemctl daemon-reload && mount -a; }
    fi
done
# Check the result with df -Th
ansible hosts -m shell -a "ls -lh /etc/fstab"
ansible hosts -m copy -a "src=/root/demo.sh dest=/root/ force=yes owner=root group=root mode=644"
ansible hosts -m shell -a "/bin/bash /root/demo.sh"
Docker Swarm:
# In practice this feels much like mounting on a plain Linux host: the volume plugin has to be installed separately and the volume still has to be created by hand on the relevant nodes
# The trajano plugin is no longer maintained but still works; it is not Swarm-integrated, so it must be installed on every node (downloading the plugin may require a proxy)
docker plugin install --alias glusterfs trajano/glusterfs-volume-plugin --grant-all-permissions --disable
docker plugin set glusterfs SERVERS=192.168.40.239,192.168.50.207,192.168.50.208,192.168.40.240
docker plugin enable glusterfs # enable the plugin
# docker plugin inspect glusterfs # show the plugin's details
# The plugin above no longer seems to work, so switch to a different one
docker plugin install --alias glusterfs mikebarkmin/glusterfs SERVERS=192.168.40.239,192.168.50.207,192.168.50.208,192.168.40.240 VOLNAME=online-share
docker volume create -d glusterfs -o servers=192.168.40.239,192.168.50.207,192.168.50.208,192.168.40.240 -o volname=online-share -o subdir=/data --scope multi --sharing all glustervolume
# Then mount and use it directly
services:
  ...:
    ...
    volumes:
      - glustervolume:/data
volumes:
  glustervolume:
    driver: glusterfs
    name: "glustervolume"

Kubernetes

There are several ways to consume it:
1. Manage GlusterFS through Heketi and have Kubernetes call the Heketi API (storage can be provisioned dynamically).
The next two approaches require configuring every node, and some nodes may not allow that:
2. Expose GlusterFS over NFS via NFS-Ganesha and mount it in Kubernetes as NFS.
3. Mount the GlusterFS volume into a local directory on each node and consume it in Kubernetes via hostPath (a sketch follows this list).
4. Container Storage Interface (CSI) volume plugins (more standards-compliant and probably the better choice, but no automatic expansion).
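A minimal sketch of option 3, assuming every Kubernetes node already mounts the volume at /mnt/glusterfs as shown earlier; the pod and file names are hypothetical:

cat << EOF > glusterfs-hostpath-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: glusterfs-hostpath-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: shared
      mountPath: /data
  volumes:
  - name: shared
    hostPath:
      path: /mnt/glusterfs
      type: Directory
EOF
kubectl apply -f glusterfs-hostpath-pod.yaml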

# The latter three approaches require the GlusterFS client packages on every node, otherwise they cannot be used
yum -y install openssh-server wget fuse fuse-libs openmpi libibverbs
dnf install centos-release-gluster9 -y
dnf install -y glusterfs glusterfs-api glusterfs-fuse glusterfs-rdma glusterfs-libs glusterfs-cli glusterfs-client-xlators
# Then follow the manifests on GitHub: https://github.com/rootsongjc/kubernetes-handbook/tree/master/manifests/glusterfs

# GlusterFS can also be managed automatically through a Heketi cluster and its API (in that approach Heketi provisions brand-new GlusterFS nodes itself, so no GlusterFS volumes need to be set up beforehand)

# Every GlusterFS node needs iptables rules that allow access from Kubernetes
iptables -N HEKETI
iptables -A HEKETI -p tcp -m state --state NEW -m tcp --dport 24007 -j ACCEPT
iptables -A HEKETI -p tcp -m state --state NEW -m tcp --dport 24008 -j ACCEPT
iptables -A HEKETI -p tcp -m state --state NEW -m tcp --dport 2222 -j ACCEPT
iptables -A HEKETI -p tcp -m state --state NEW -m multiport --dports 49152:49251 -j ACCEPT
service iptables save

# Kernel modules required by Heketi must be loaded beforehand
# Check with: lsmod | egrep 'dm_snapshot|dm_mirror|dm_thin_pool'
modprobe dm_snapshot
modprobe dm_mirror
modprobe dm_thin_pool
# The equivalent ansible run:
for i in dm_snapshot dm_mirror dm_thin_pool;do ansible nodes -m command -a 'modprobe '"$i"'';done

# sshd must accept ssh-rsa keys, otherwise creating the Heketi cluster fails
echo "PubkeyAcceptedKeyTypes=+ssh-rsa" >> /etc/ssh/sshd_config
echo "HostKeyAlgorithms=+ssh-rsa" >> /etc/ssh/sshd_config
systemctl restart sshd
ansible nodes -m shell -a 'echo "PubkeyAcceptedKeyTypes=+ssh-rsa" >> /etc/ssh/sshd_config;echo "HostKeyAlgorithms=+ssh-rsa" >> /etc/ssh/sshd_config;systemctl restart sshd'

# Set up the Heketi cluster, co-located on the same nodes as GlusterFS
wget https://github.com/heketi/heketi/releases/download/v10.4.0/heketi-v10.4.0-release-10.linux.amd64.tar.gz
tar -zxvf heketi-v10.4.0-release-10.linux.amd64.tar.gz
cp heketi/{heketi,heketi-cli} /usr/bin/

# The heketi service itself does not run as root, so give it a user and an SSH key
useradd -d /var/lib/heketi -s /sbin/nologin heketi
mkdir -p /etc/heketi
ssh-keygen -N '' -t rsa -q -f /etc/heketi/heketi_key
chown -R heketi:heketi /etc/heketi
ssh-copy-id -i /etc/heketi/heketi_key root@node1
ssh-copy-id -i /etc/heketi/heketi_key root@node2
ssh-copy-id -i /etc/heketi/heketi_key root@node3
ssh-copy-id -i /etc/heketi/heketi_key root@node6

# Heketi configuration
cat << EOF > /etc/heketi/heketi.json
{
  "_port_comment": "Heketi Server Port Number",
  "port": "18080",

  "_enable_tls_comment": "Enable TLS in Heketi Server",
  "enable_tls": false,

  "_cert_file_comment": "Path to a valid certificate file",
  "cert_file": "",

  "_key_file_comment": "Path to a valid private key file",
  "key_file": "",

  "_use_auth": "Enable JWT authorization. Please enable for deployment",
  "use_auth": true,

  "_jwt": "Private keys for access",
  "jwt": {
    "_admin": "Admin has access to all APIs",
    "admin": {
      "_key_comment": "Set the admin key in the next line",
      "key": "admin@P@88W0rd"
    },
    "_user": "User only has access to /volumes endpoint",
    "user": {
      "_key_comment": "Set the user key in the next line",
      "key": "user@P@88W0rd"
    }
  },

  "_backup_db_to_kube_secret": "Backup the heketi database to a Kubernetes secret when running in Kubernetes. Default is off.",
  "backup_db_to_kube_secret": false,

  "_profiling": "Enable go/pprof profiling on the /debug/pprof endpoints.",
  "profiling": false,

  "_glusterfs_comment": "GlusterFS Configuration",
  "glusterfs": {
    "_executor_comment": [
      "Execute plugin. Possible choices: mock, ssh",
      "mock: This setting is used for testing and development.",
      "      It will not send commands to any node.",
      "ssh:  This setting will notify Heketi to ssh to the nodes.",
      "      It will need the values in sshexec to be configured.",
      "kubernetes: Communicate with GlusterFS containers over",
      "            Kubernetes exec api."
    ],
    "executor": "ssh",

    "_sshexec_comment": "SSH username and private key file information",
    "sshexec": {
      "keyfile": "/etc/heketi/heketi_key",
      "user": "root",
      "port": "22",
      "fstab": "/etc/fstab"
    },

    "_db_comment": "Database file name",
    "db": "/var/lib/heketi/heketi.db",

    "_refresh_time_monitor_gluster_nodes": "Refresh time in seconds to monitor Gluster nodes",
    "refresh_time_monitor_gluster_nodes": 120,

    "_start_time_monitor_gluster_nodes": "Start time in seconds to monitor Gluster nodes when the heketi comes up",
    "start_time_monitor_gluster_nodes": 10,

    "_loglevel_comment": [
      "Set log level. Choices are:",
      "  none, critical, error, warning, info, debug",
      "Default is warning"
    ],
    "loglevel" : "warning"
  }
}
EOF

# Run heketi as a systemd service
cat << EOF > /usr/lib/systemd/system/heketi.service
[Unit]
Description=Heketi Server

[Service]
Type=simple
WorkingDirectory=/var/lib/heketi
User=heketi
ExecStart=/usr/bin/heketi --config=/etc/heketi/heketi.json
Restart=on-failure
StandardOutput=syslog
StandardError=syslog

[Install]
WantedBy=multi-user.target

EOF
systemctl enable heketi --now
systemctl status heketi -l # check that the heketi service is running
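# Optional check (a sketch, not part of the original steps): heketi answers on /hello, so a quick
# curl confirms the API is reachable before loading the topology
curl http://192.168.40.239:18080/hello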

# Create the Heketi cluster from a topology file
cat << EOF > /etc/heketi/topology.json
{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["node1"],
              "storage": ["192.168.40.239"]
            },
            "zone": 1
          },
          "devices": [
            {
              "name": "/dev/vdb",
              "destroydata": false
            }
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node2"],
              "storage": ["192.168.50.207"]
            },
            "zone": 1
          },
          "devices": [
            {
              "name": "/dev/vdb",
              "destroydata": false
            }
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node3"],
              "storage": ["192.168.50.208"]
            },
            "zone": 1
          },
          "devices": [
            {
              "name": "/dev/vdb",
              "destroydata": false
            }
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node6"],
              "storage": ["192.168.50.177"]
            },
            "zone": 1
          },
          "devices": [
            {
              "name": "/dev/vdb",
              "destroydata": false
            }
          ]
        }
      ]
    }
  ]
}
EOF

# Run on the heketi manager machine
heketi-cli --server http://192.168.40.239:18080 --user admin --secret admin@P@88W0rd topology load --json=/etc/heketi/topology.json
# Optional: add an alias
echo "alias heketi-cli='heketi-cli --server http://192.168.40.239:18080 --user admin --secret admin@P@88W0rd'" >> ~/.bashrc
heketi-cli cluster list # list cluster info and grab the cluster ID for the Kubernetes StorageClass below

# Using it from Kubernetes
# Every node still needs the basic client packages, otherwise the storage cannot be mounted
yum -y install openssh-server wget fuse fuse-libs openmpi libibverbs
dnf install centos-release-gluster9 -y
dnf install -y glusterfs glusterfs-api glusterfs-fuse glusterfs-rdma glusterfs-libs glusterfs-cli glusterfs-client-xlators

heketiSecret=$(echo -n "admin@P@88W0rd" | base64)
cat << EOF > /etc/heketi/heketi-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: heketi-secret
  namespace: kube-system
data:
  key: ${heketiSecret}
type: kubernetes.io/glusterfs
EOF

kubectl apply -f /etc/heketi/heketi-secret.yaml

cat << EOF > /etc/heketi/heketi-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs
  namespace: kube-system
parameters:
  resturl: "http://192.168.40.239:18080"
  clusterid: "9ad37206ce6575b5133179ba7c6e0935"
  restauthenabled: "true"
  restuser: "admin"
  secretName: "heketi-secret"
  secretNamespace: "kube-system"
  volumetype: "replicate:3" # 3-way replicated; "disperse:4:2" = dispersed, 4 data + 2 redundancy; "none" = plain distributed, no redundancy
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete # Retain keeps the volume, Recycle scrubs it, Delete removes it
EOF

kubectl apply -f /etc/heketi/heketi-storageclass.yaml

cat << EOF > /etc/heketi/heketi-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: heketi-pvc
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/glusterfs
spec:
  storageClassName: "glusterfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
EOF

kubectl apply -f /etc/heketi/heketi-pvc.yaml
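# A hedged sketch (not in the original steps): mount the new PVC in a throwaway pod to confirm
# dynamic provisioning works end to end; the pod name is hypothetical
cat << EOF > /etc/heketi/heketi-pvc-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: heketi-pvc-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sh", "-c", "echo ok > /data/ok.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: heketi-pvc
EOF
kubectl apply -f /etc/heketi/heketi-pvc-test.yaml
kubectl get pvc heketi-pvc # should become Bound once Heketi provisions the volume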

Troubleshooting:

1. After the glusterfs volume is mounted successfully, files can still be created, read, appended to and modified, and none of the operations on the directory are blocked.
For details see: https://www.cnblogs.com/wiseo/p/13035886.html

# Take the volume out of use first
ansible hosts -m shell -a "exit $(lsof /mnt/glusterfs)"
ansible hosts -m shell -a 'df -Th | grep -q "/mnt/glusterfs" && umount /mnt/glusterfs'
Check split-brain information:
gluster volume heal <VOLNAME> info

When this command is invoked it spawns a glfsheal process, which reads every entry in the subdirectories under .glusterfs/indices/ on each brick it can connect to;
these entries are the GFIDs of files that need healing;
once a GFID entry has been obtained from a brick, the file is looked up on every brick of the replica set together with its trusted.afr.* extended attributes to decide whether it needs healing, is in split-brain, or is in some other state.

Possible file states:
Is in split-brain: the file or directory must be repaired manually, otherwise it cannot self-heal
Is possibly undergoing heal: the file is locked while it is being checked for whether it needs healing

[root@node-08 glusterfs]# gluster volume heal online-share info 
Brick node1:/data/glusterfs/online-share/brick1
Status: Connected
Number of entries: 0

Brick node2:/data/glusterfs/online-share/brick1
/
Status: Connected
Number of entries: 1

Brick node3:/data/glusterfs/online-share/brick1
/
Status: Connected
Number of entries: 1

Brick node6:/data/glusterfs/online-share/brick1
Status: Connected
Number of entries: 0

# Corrupted files are normally listed with a split-brain tag; entries without the tag here still indicate that some files (in this case the directory) hit split-brain
Repairing split-brain:
# List the files in split-brain; entries that can heal on their own are handled by the self-heal daemon
gluster volume heal <VOLNAME> info split-brain
# Files that are genuinely in split-brain cannot be healed automatically; they need other measures
# The current case is directory split-brain, so that is where to start; if file split-brain shows up later, a section on repairing it will be added
# Oddly enough, simply unmounting the volume was enough for it to recover; the real cause was the two quorum settings above, which led to split-brain during use, so changing those settings is not recommended
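# For completeness, a hedged sketch of the CLI-based policies Gluster documents for resolving
# file split-brain on replicated volumes, should it appear later (<FILE> is a path relative to the
# volume root; pick the policy that matches which copy should win)
gluster volume heal online-share split-brain latest-mtime <FILE>
gluster volume heal online-share split-brain bigger-file <FILE>
gluster volume heal online-share split-brain source-brick node2:/data/glusterfs/online-share/brick1 <FILE>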