1. Choice of Algorithm and Introduction
I previously wrote a Gomoku AI using α-β pruning; this time I wanted to try MCTS (Monte Carlo Tree Search).
1.1 MCTS
- Selection: starting from the root node R, repeatedly pick child nodes until a leaf node L is reached. The rule for picking children, given below in section 1.2, is what steers the game tree toward the most promising lines and is the heart of MCTS.
- Expansion: unless the game ends at L with a win or loss for one side, create one or more child nodes and pick one of them, C.
- Simulation: starting from node C, play the game out with a random policy. This playout, also called a rollout, uses a fast rollout policy to make random moves quickly until the game ends and a win/loss result is obtained.
- Backpropagation: use the result of the random game to update the statistics of every node on the path from C back to R.
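In code, one MCTS iteration simply chains these four phases. The implementation below exposes this as `do_rollout` (called from `get_move` in section 2), but its body is not shown in this post, so the following is only a sketch of how the four helper methods from section 2.2 would typically be combined (the `flag` value passed to `_simulate` is a placeholder):

```python
def do_rollout(self, node):
    "One MCTS iteration: selection -> expansion -> simulation -> backpropagation."
    path = self._select(node)      # walk down to an unexplored or terminal node
    leaf = path[-1]
    self._expand(leaf)             # register the leaf's children in self.children
    # _simulate's flag controls how the raw winner (0 = black, 1 = white) is
    # mapped to a reward for this player; True here is only a placeholder.
    reward = self._simulate(leaf, True)
    self._backpropagate(path, reward)  # update N and Q along the selected path
```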
1.2 The UCT selection rule
Each child node is scored as

$$\bar{x} + C\sqrt{\frac{\ln N_{\text{parent}}}{N}}$$

where $\bar{x}$ is the node's current win-rate estimate, $N$ is the node's visit count ($N_{\text{parent}}$ that of its parent), and $C$ is a constant. The larger $C$ is, the more the search favours breadth (exploration); the smaller it is, the more it favours depth (exploitation). Note that for the original UCT there is a theoretically optimal value of $C$, namely $\sqrt{2}$, which is why 1.414 is passed to MCTS in the code below.
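As a quick numerical illustration of the tradeoff (hypothetical counts, not taken from a real game), a rarely visited child can outscore a better-performing but well-explored sibling:

```python
import math

C = 1.414                      # exploration constant, roughly sqrt(2)
ln_N_parent = math.log(100)    # suppose the parent has been visited 100 times

def uct(wins, visits):
    return wins / visits + C * math.sqrt(ln_N_parent / visits)

print(uct(60, 80))   # 75% win rate, visited 80 times  -> about 1.09
print(uct(3, 5))     # 60% win rate, visited 5 times   -> about 1.96
```

The second child wins the comparison purely on the exploration term, which is exactly how UCT keeps probing under-sampled moves.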
2. Implementation
2.1 Data structure design
Every node is a Board object, and the relationships between nodes are stored in Python dictionaries:
- self.children = dict(): a dictionary whose keys are nodes (each a board object, node = Board()) and whose values are set() collections containing the next-step board objects produced by every legal move from the current node.
- self.Q and self.N: both defaultdict() objects, recording for each node (board object) its accumulated wins (Q) and its visit count (N).
The basic data structure skeleton is as follows:

```python
reversi = {'O': 'X', 'X': 'O'}
```
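For context, these fields live in an MCTS class. Its constructor might look roughly like the sketch below; only `children`, `Q`, `N`, `exploration_weight` and `color` are confirmed by the code in section 2.2, everything else is an assumption, and the imports shown are the ones the later snippets rely on:

```python
import math
import random
from collections import defaultdict
from copy import deepcopy

reversi = {'O': 'X', 'X': 'O'}   # (from above) maps each colour to its opponent

class MCTS:
    def __init__(self, exploration_weight, color):
        self.children = dict()      # Board -> set of successor Boards
        self.Q = defaultdict(int)   # Board -> accumulated reward (wins)
        self.N = defaultdict(int)   # Board -> visit count
        self.exploration_weight = exploration_weight  # the UCT constant C
        self.color = color          # 'X' or 'O', the side this tree plays for
```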
- Interface with main.py: main.py only needs to call the .get_move(board) method to obtain an action, so that is the only function exposed. get_move is shown below:
```python
def get_move(self, board):
    tree = MCTS(1.414, self.color)      # exploration constant C = sqrt(2)
    for _ in range(200):
        tree.do_rollout(board)          # 200 MCTS iterations from the current position
    actionboard = tree.choose(board)    # most-visited successor board
    # Nodes are whole boards, so recover the move by diffing the chosen board
    # against the current one and report it as a coordinate such as 'D3'.
    for idxc in range(8):
        for idxr in range(8):
            if ((actionboard[idxc][idxr] != board[idxc][idxr])
                    and (board[idxc][idxr] == '.')):
                return chr(ord('A') + idxr) + str(idxc + 1)
```
2.2 Algorithm design
- Selection
```python
def _select(self, node):
    "Find an unexplored descendent of `node`"
    path = []
    while True:
        path.append(node)
        if node not in self.children or not self.children[node]:
            # node is either unexplored or terminal
            return path
        unexplored = self.children[node] - self.children.keys()
        if unexplored:
            n = unexplored.pop()
            path.append(n)
            return path
        node = self._uct_select(node)  # descend a layer deeper

def _uct_select(self, node):
    "Select a child of node, balancing exploration & exploitation"
    # All children of node should already be expanded:
    assert all(n in self.children for n in self.children[node])
    log_N_vertex = math.log(self.N[node])

    def uct(n):
        "Upper confidence bound for trees"
        return self.Q[n] / self.N[n] + self.exploration_weight * math.sqrt(
            log_N_vertex / self.N[n]
        )

    return max(self.children[node], key=uct)
```
- Expansion
```python
def _expand(self, node):
    "Update the `children` dict with the children of `node`"
    if node in self.children:
        return  # already expanded
    self.children[node] = self.find_children(node, self.color)
```
- Simulation
```python
def _simulate(self, node, flag: bool):
    "Returns the reward for a random simulation (to completion) of `node`"
    invert_reward = True  # True means it is self.color's turn to move
    while True:
        if self.is_terminal(node):
            reward = node.get_winner()[0]  # 0 = black wins, 1 = white wins
            return reward if flag else 1 - reward
        node = self.find_random_child(
            node, (self.color if invert_reward else reversi[self.color]))
        invert_reward = not invert_reward
```
- Backpropagation
```python
def _backpropagate(self, path, reward):
    "Send the reward back up to the ancestors of the leaf"
    for node in reversed(path):
        self.N[node] += 1
        self.Q[node] += reward
```
- Output
```python
def choose(self, node):
    "Choose the best successor of node. (Choose a move in the game)"
    if node not in self.children:
        # return node.find_random_child()
        return self.find_random_child(node, self.color)

    def score(n):
        if self.N[n] == 0:
            return float("-inf")  # avoid unseen moves
        return self.N[n]

    return max(self.children[node], key=score)
```
- Helper functions
```python
def find_children(self, node, color='O') -> set:
    "All successor boards reachable from `node` with one legal move by `color`"
    ret = set()
    for item in node.get_legal_actions(color):
        temp = deepcopy(node)
        temp._move(item, color)
        ret.add(temp)
    return ret

def find_random_child(self, node, color='O'):
    "Play one random legal move for `color`; if there is none, pass (return node)"
    directions = list(node.get_legal_actions(color))
    if directions:
        newnode = deepcopy(node)
        newnode._move(random.choice(directions), color)
        return newnode
    else:
        return node

def is_terminal(self, node) -> bool:
    "The game is over when the board is full or neither side has a legal move"
    if node.count('.') == 0:
        return True
    l1 = list(node.get_legal_actions('X'))
    l2 = list(node.get_legal_actions('O'))
    if not l1 and not l2:
        return True
    return False
```
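As a small, hypothetical usage check of these helpers (assuming the same Board class used above, on the opening position):

```python
board = Board()                    # fresh Reversi board
mcts = MCTS(1.414, 'X')
children = mcts.find_children(board, 'X')
print(len(children))               # 4: Reversi has four legal opening moves
print(mcts.is_terminal(board))     # False: the game has just started
next_board = mcts.find_random_child(board, 'X')
```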
3. Results and Reflections
- I originally wanted to write an α-β pruning player as well so the two could play against each other, but I ran out of time.
- Judging from games against RandomPlayer, it occasionally loses, but the vast majority of games end in a win.
- Watching an AI I wrote myself beat me is genuinely satisfying!