Blog optimization/SEO is a chore, so I made Python do it

A few days ago I read a post by Hokekiyo (id:imslotter) that had made the hot entries, and it reminded me that I had been meaning to read the O'Reilly book.

www.procrasist.com

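I stumbled at a few points before I got the code from that post running, so while I'm at it I'll walk through every step from installing Python to running the optimization. This covers the Mac. I'll write up Windows if I feel like it.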

Installing Homebrew
Installing Python 3
Installing Git
The code (partly modified)
Run the optimization!
Installing missing modules
Results
Books on Python

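Installing Homebrew

Homebrew is a package manager that simplifies installing software on macOS. In short, it takes the pain out of setting up an environment.

Open Terminal and run the following command. (Copy and paste the text after the $ and press Enter to run it. Do not include the $ itself.)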

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

That's all!
If you want to be sure, run the following command to check that the installation went through cleanly.

$ brew doctor

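Installing Python 3

Next we install Python, but there is one thing to watch out for.
Python comes in a 2.x series and a 3.x series, and the two are not compatible. Put simply, code (a program) that ran on 2.x may not run on 3.x, and vice versa.

The code used here is written for 3.x, so we install 3.x rather than 2.x.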

$ brew install python3

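Be careful: if you type "python" instead of "python3", the 2.x series gets installed.

Check the version to confirm that the installation worked. (Copy and paste only the text after the $, python3 --version, and run it. You should see Python 3.x.x, as on the following line.)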

$ python3 --version
Python 3.6.1
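The same check can also be made from inside a script, which gives a readable message when someone runs it with the wrong interpreter. This is only a minimal sketch: the helper name is_python3 is made up for illustration and is not part of all_in_one.py.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to the 3.x series."""
    return version_info[0] == 3

if not is_python3():
    # Stop early with a readable message instead of a confusing error later on.
    sys.exit("This script needs Python 3; run it with python3.")

# print is a function in 3.x. In 2.x it was a statement (print "hello"),
# and that older form is a SyntaxError under Python 3.
print("Running on Python", sys.version.split()[0])
```

Placing a guard like this at the top of a script is optional, but it makes the 2.x/3.x mismatch obvious right away.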

Installing Git

This step is optional. You only need it if you want to clone (copy) the taikutsu_blog_works repository. To install Git, run the following in Terminal.

$ brew install git

Again, check the version to confirm that the installation worked.

$ git --version
git version 2.13.2

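Clone the GitHub repository onto the Desktop. (Everything after a # is a comment, so the commands run whether or not you copy that part.)

$ cd ~/Desktop # move to the Desktop
$ git clone https://github.com/hokekiyoo/taikutsu_blog_works.git # clone
$ cd taikutsu_blog_works # enter the cloned directory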
$ ls
README.md   all_in_one.py   modules

If the last line lists these three items, you're good!

If you are not using Git, copy the code from the GitHub repository or from Hokekiyo's post and save it as all_in_one.py.

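The code (partly modified)

In my environment the code raised an ERROR or a WARNING in a few places, so I modified parts of it.
If the original code does not run properly for you, use this version as a reference.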


from argparse import ArgumentParser
from urllib import request
from urllib import error
from bs4 import BeautifulSoup
import os
import csv
import json
import datetime
import time
import matplotlib.pyplot as plt

def extract_urls(args):
    page = 1
    is_articles = True
    urls = []
    while is_articles:
        try:
            html = request.urlopen("{}/archive?page={}".format(args.url, page))
        except error.HTTPError as e:
            # The HTTP response carried an error status such as 404, 403, or 401
            print(e.reason)
            break
        except error.URLError as e:
            # The URL we tried to access was invalid
            print(e.reason)
            break
        soup = BeautifulSoup(html, "html.parser")
        articles = soup.find_all("a", class_="entry-title-link")
        for article in articles:
            urls.append(article.get("href"))
        if len(articles) == 0:
            # Stop once there are no more articles
            is_articles = False
        page += 1
    return urls

def make_directories(args):
    directory = args.directory
    if not os.path.exists(directory):
        os.mkdir(directory)
    if args.image:
        if not os.path.exists(directory+"/imgs"):
            os.mkdir(directory+"/imgs")
    if args.graph:
        if not os.path.exists(directory+"/graph"):
            os.mkdir(directory+"/graph")
    if args.hatebu:
        if not os.path.exists(directory+"/hatebu"):
            os.mkdir(directory+"/hatebu")

def articles_to_img(args, url, soup, name):
    """
    Save the images found in each article
    - gif, jpg, jpeg, png
    - saved into a separate folder per article
    - imgs/{last part of the url}/{0-99}.png
    """
    # Create the directory
    article_dir = os.path.join(args.directory+"/imgs", name)
    if not os.path.exists(article_dir):
        os.mkdir(article_dir)
    entry = soup.select(".entry-content")[0]
    imgs = entry.find_all("img")
    count=0


    for img in imgs:
        filename = img.get("src")
        if "ssl-images-amazon" in filename:
            # print("amazon img")
            continue
        # Check the file extension
        if filename[-4:] == ".jpg" or filename[-4:] == ".png" or filename[-4:] == ".gif":
            extension = filename[-4:]
            print("\t IMAGE:",filename)
        elif filename[-5:] == ".jpeg":
            extension = filename[-5:]
            print("\t IMAGE:",filename,extension)
        else:
            continue
        try:
            image_file = request.urlopen(filename)
        except error.HTTPError as e:
            print("\t HTTPERROR:", e.reason)
            continue
        except error.URLError as e:
            print("\t URLERROR:", e.reason)
            continue
        # Retry on ValueError (it seems links that don't start with http can be embedded too?)
        except ValueError:
            http_file = "http:"+filename
            try:
                image_file = request.urlopen(http_file)
            except error.HTTPError as e:
                print("\t HTTPERROR:", e.reason)
                continue
            except error.URLError as e:
                print("\t URLERROR:", e.reason)
                continue
        # Save the image file
        with open(os.path.join(article_dir,str(count)+extension), "wb") as f:
            f.write(image_file.read())
        count+=1

def make_network(G, args, url, urls, soup):
    entry_url = args.url + "/entry/"
    article_name = url.replace(entry_url,"").replace("/","-")
    entry = soup.select(".entry-content")[0]
    links = entry.find_all("a")
    for link in links:
        l = link.get("href")
        if l in urls:
            linked_article_name = l.replace(entry_url,"").replace("/","-")
            print("\t NETWORK: internal link! {} -> {}".format(article_name, linked_article_name))
            j = urls.index(l)
            G.add_edge(article_name, linked_article_name)
        else:
            continue

def url_checker(url, urls):
    # Drop odd links that don't start with http
    flag1 = "http" in url[:5]
    # Hatena keyword links are not needed
    flag2 = "d.hatena.ne.jp/keyword/" not in url
    # Skip Amazon links
    flag3 = "http://www.amazon.co.jp" not in url and "http://amzn.to/" not in url
    return flag1 and flag2 and flag3


def check_invalid_link(args, urls, url, soup, writer):
    import re
    from urllib.parse import quote_plus
    regex = r'[^\x00-\x7F]' # regular expression matching any non-ASCII character
    entry_url = args.url + "/entry/"
    entry = soup.select(".entry-content")[0]
    links = entry.find_all("a")
    for link in links:
        l = link.get("href")
        if l is None:
            continue
        # Percent-encode Japanese characters in the link
        matchedList = re.findall(regex,l)
        for m in matchedList:
            l = l.replace(m, quote_plus(m, encoding="utf-8"))
        check = url_checker(l, urls)
        if check:
            # Check for dead links
            try:
                html = request.urlopen(l)
            except error.HTTPError as e:
                print("\t HTTPError:", l, e.reason)
                if e.reason != "Forbidden":
                    writer.writerow([url,  e.reason, l])
            except error.URLError as e:
                writer.writerow([url, e.reason, l])
                print("\t URLError:", l, e.reason)
            except TimeoutError as e:
                print("\t TimeoutError:",l, e)
            except UnicodeEncodeError as e:
                print("\t UnicodeEncodeError:", l, e.reason)

def get_timestamps(args, url, name):
    """
    Fetch the timestamps of the Hatena Bookmark entries
    """
    plt.figure()
    data = request.urlopen("http://b.hatena.ne.jp/entry/json/{}".format(url)).read().decode("utf-8")
    info = json.loads(data.strip('(').rstrip(')')) # removed the second argument "r"
    timestamps = list()
    if info != None and "bookmarks" in info.keys(): # when public bookmarks exist, extract their details
        bookmarks=info["bookmarks"]
        title = info["title"]
        for bookmark in bookmarks:
            timestamp = datetime.datetime.strptime(bookmark["timestamp"],'%Y/%m/%d %H:%M:%S')
            timestamps.append(timestamp)
        timestamps = list(reversed(timestamps)) # keep the times at which the bookmarks were added
    count = len(timestamps)
    number = range(count)
    if(count!=0):
        first = timestamps[0]
        plt.plot(timestamps,number,"-o",lw=3,color="#444444")
        # 3 bookmarks within 3 hours
        plt.axvspan(first,first+datetime.timedelta(hours=3),alpha=0.1,color="blue")
        plt.plot([first,first+datetime.timedelta(days=2)],[3,3],"--",alpha=0.9,color="blue",label="new entry")
        # 15 bookmarks within 12 hours
        plt.axvspan(first+datetime.timedelta(hours=3),first+datetime.timedelta(hours=12),alpha=0.1,color="green")
        plt.plot([first,first+datetime.timedelta(days=2)],[15,15],"--",alpha=0.9, color="green",label="popular entry")
        # Hot entry
        plt.plot([first,first+datetime.timedelta(days=2)],[15,15],"--",alpha=0.7, color="red",label="hotentry")
        plt.xlim(first,first+datetime.timedelta(days=2))
        plt.title(name)
        plt.xlabel("First Hatebu : {}".format(first))
        plt.legend(loc=4)
        plt.savefig(args.directory+"/hatebu/{}.png".format(name))
        plt.close()

def graph_visualize(G, args):
    import networkx as nx
    import numpy as np
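    # Choose the graph layout. Here we try the spring model
    pos = nx.spring_layout(G)
    # Draw the graph (node labels are controlled by the with_labels option)
    plt.figure()
    nx.draw_networkx(G, pos, with_labels=False, alpha=0.4,font_size=0.0,node_size=10) # changed because the original function was outdated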
    plt.savefig(args.directory+"/graph/graph.png")
    nx.write_gml(G, args.directory+"/graph/graph.gml")
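    # Plot the degree distribution
    plt.figure()
    # dict() keeps this working on networkx 2.x, where degree() returns a view rather than a dict
    degree_sequence = sorted(dict(nx.degree(G)).values(),reverse=True)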
    dmax = max(degree_sequence)
    dmin = min(degree_sequence)
    kukan = range(0,dmax+2)
    hist, kukan=np.histogram(degree_sequence, kukan)
    plt.plot(hist,"o-")
    plt.xlabel('degree')
    plt.ylabel('frequency')
    plt.grid()
    plt.savefig(args.directory + '/graph/degree_hist.png')

def main():
    parser = ArgumentParser()
    parser.add_argument("-u", "--url", type=str, required=True,help="input your url")
    parser.add_argument("-d", "--directory", type=str, required=True,help="output directory")
    parser.add_argument("-i", "--image", action="store_true", default=False, help="extract image file from articles")
    parser.add_argument("-g", "--graph", action="store_true", default=False, help="visualize internal link network")
    parser.add_argument("-l", "--invalid_url", action="store_true", default=False, help="detect invalid links")
    parser.add_argument("-b", "--hatebu", action="store_true", default=False, help="visualize analyzed hatebu graph")
    args = parser.parse_args()

    urls = extract_urls(args)
    # Create the output directories
    make_directories(args)
    # Build the article list
    with open(args.directory+"/articles_list.csv", "w", encoding="shift_jis") as f: # specify Shift_JIS
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(["Article TITLE", "URL","Hatebu COUNT"])
        if args.invalid_url:
            f = open(args.directory+'/invalid_url_list.csv', 'w', encoding="shift_jis") # specify Shift_JIS
            writer_invalid = csv.writer(f, lineterminator='\n')
            writer_invalid.writerow(["Article URL", "ERROR", "LINK"])
        if args.graph:
            import networkx as nx
            G = nx.Graph()
            for i, url in enumerate(urls):
                name = url.replace(args.url+"/entry/","").replace("/","-")
                G.add_node(name)
        for i, url in enumerate(urls):
            name = url.replace(args.url+"/entry/","").replace("/","-")
            print("{}/{}".format(i+1,len(urls)), name)
            # Run each step on the extracted url
            try:
                html = request.urlopen(url)
            except error.HTTPError as e:
                print(e.reason)
                continue # skip this article; html was never fetched
            except error.URLError as e:
                print(e.reason)
                continue # skip this article; html was never fetched
            soup = BeautifulSoup(html, "html.parser")
            # Not needed for WordPress
            data = request.urlopen("http://b.hatena.ne.jp/entry/json/{}".format(url)).read().decode("utf-8")
            info = json.loads(data.strip('(').rstrip(')')) # removed the second argument "r"
            try:
                count = info["count"]
            except TypeError:
                count = 0
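            # Write out the article title, url, and Hatena Bookmark count
            try:
                writer.writerow([soup.title.text, url, count])
            except UnicodeEncodeError as e:
                # Titles containing unusual characters can raise an error here
                print(e.reason)
                print("\tArticleWriteWarning this article's title contains a character that cannot be written :",url)
                continue
            if args.image:
                if "%" in name:
                    name = str(i) # fall back to the index for Japanese (percent-encoded) names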
                articles_to_img(args, url, soup, name)
            if args.graph:
                make_network(G, args, url, urls, soup)
            if args.invalid_url:
                check_invalid_link(args, urls, url, soup, writer_invalid)
            if args.hatebu:
                if "%" in name:
                    name = str(i) # fall back to the index for Japanese (percent-encoded) names
                get_timestamps(args, url, name)
            time.sleep(3)
        if args.invalid_url:
            f.close()
        if args.graph:
            graph_visualize(G, args)

if __name__ == '__main__':
    main()

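The changes are as follows.

* Replaced an outdated function with its newer equivalent (it was raising a WARNING)
* Removed "r" (in 2 places)
* Specified Shift_JIS as the character encoding of the CSV output for people who open it in Excel (the default is UTF-8)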
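One catch with Shift_JIS is that it cannot represent every character, which is why the script has to catch UnicodeEncodeError when writing titles. As a possible alternative (my assumption, not something the original code does), Python's utf-8-sig codec writes a byte order mark at the start of the file, which Excel generally uses to detect UTF-8. The sketch below compares the two; write_rows, the sample row, and the example.com URL are made up for illustration.

```python
import csv
import os
import tempfile

def write_rows(path, rows, encoding):
    """Write rows to a CSV file using the given text encoding."""
    # newline="" is what the csv module documentation recommends for csv.writer.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)

# Hypothetical sample data: a title with Japanese text and an emoji.
rows = [["Article TITLE", "URL", "Hatebu COUNT"],
        ["Pythonで最適化 \U0001F600", "http://example.com/entry/1", 3]]
out_dir = tempfile.mkdtemp()

# Shift_JIS has no encoding for the emoji, so this write raises an error.
try:
    write_rows(os.path.join(out_dir, "sjis.csv"), rows, "shift_jis")
    sjis_ok = True
except UnicodeEncodeError:
    sjis_ok = False

# utf-8-sig can encode every character and starts the file with a BOM.
bom_path = os.path.join(out_dir, "utf8_bom.csv")
write_rows(bom_path, rows, "utf-8-sig")
with open(bom_path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

print("shift_jis succeeded:", sjis_ok)
print("BOM present:", starts_with_bom)
```

Whether this is worth switching to depends on which Excel version your readers use, so treat it as something to try rather than a drop-in fix.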

Run the optimization!

First, let's just try running it.

$ python3 all_in_one.py -u {URL} -d {output-directory} -g -i -b -l

Replace {URL} with the URL of the target blog and {output-directory} with the name of the directory to write the results to.

In my case the command looks like this.

$ python3 all_in_one.py -u http://handsm.hatenablog.com -d result -g -i -b -l

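The following options can be specified at run time (taken from the argparse definitions in the code above).

* -u, --url: URL of the blog (required)
* -d, --directory: output directory (required)
* -i, --image: extract image files from articles
* -g, --graph: visualize the internal link network
* -l, --invalid_url: detect invalid links
* -b, --hatebu: visualize the Hatena Bookmark graph

If you get an error when running it, see the next section.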
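The flags on that command line are defined with argparse in main(). To see how a command line turns into the args object the script uses, here is a small sketch that rebuilds the same parser and feeds it a list of arguments instead of reading the real command line. build_parser is a name I made up for this example, and http://example.hatenablog.com is a placeholder URL.

```python
from argparse import ArgumentParser

def build_parser():
    """Rebuild the option definitions used in all_in_one.py's main()."""
    parser = ArgumentParser()
    parser.add_argument("-u", "--url", type=str, required=True, help="input your url")
    parser.add_argument("-d", "--directory", type=str, required=True, help="output directory")
    parser.add_argument("-i", "--image", action="store_true", default=False, help="extract image file from articles")
    parser.add_argument("-g", "--graph", action="store_true", default=False, help="visualize internal link network")
    parser.add_argument("-l", "--invalid_url", action="store_true", default=False, help="detect invalid links")
    parser.add_argument("-b", "--hatebu", action="store_true", default=False, help="visualize analyzed hatebu graph")
    return parser

# Equivalent to: python3 all_in_one.py -u http://example.hatenablog.com -d result -g -l
args = build_parser().parse_args(["-u", "http://example.hatenablog.com", "-d", "result", "-g", "-l"])

print(args.url, args.directory)
# Flags that were not passed stay False, so those steps are skipped.
print("image:", args.image, "graph:", args.graph,
      "invalid_url:", args.invalid_url, "hatebu:", args.hatebu)
```

In other words, you can leave off any of -g, -i, -b, -l to run only the parts you care about.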

Installing missing modules

In my case I got several errors along the lines of "that module doesn't exist!"

For example, an error like this one.

$ python3 all_in_one.py -u http://handsm.hatenablog.com -d result -g -i -b -l
Traceback (most recent call last):
  File "all_in_one.py", line 4, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'

It is complaining that it cannot find a module named 'bs4'.
You can add a module with the following command.

$ pip3 install {モジュール名}

For the example above, that gives

$ pip3 install bs4
Collecting bs4
Collecting beautifulsoup4 (from bs4)
  Using cached beautifulsoup4-4.6.0-py3-none-any.whl
Installing collected packages: beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.6.0 bs4-0.0.1

and that's it.
After that, run python3 all_in_one.py -u {URL} -d {output-directory} -g -i -b -l once more.

Other modules may turn out to be missing as well, so install each one with pip3 install {module name} as it comes up.
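Instead of finding the missing modules one error at a time, you can ask Python up front which ones it cannot find. The sketch below uses importlib.util.find_spec from the standard library. The REQUIRED list is my reading of the third-party imports in the listing above, and missing_modules is a helper name made up for this example. Note that the name you import is not always the name you pass to pip, though as the log above shows, pip3 install bs4 does work.

```python
from importlib.util import find_spec

# Third-party modules imported by the all_in_one.py listing in this post.
REQUIRED = ["bs4", "matplotlib", "networkx", "numpy"]

def missing_modules(names):
    """Return the names from the list that Python cannot find."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]

# Print one install command per missing module (nothing if all are present).
for name in missing_modules(REQUIRED):
    print("pip3 install", name)
```

Save it under any file name, run it with python3, and then run whichever pip3 commands it prints.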