HtmlのパースをHaskellでさっくりやってみようと思っていた
Parsecの練習。
タグだけパースして他全て捨てているから役に立たないどころではないのだけど。あと世の中には閉じられていないタグが多過ぎるからやはり使えない。閉じないタグに対応するの面倒だな。タグ毎に対応しないとか。
データ型で悩んだ。タグ内のデータをどうやって保持しよう。インラインタグとかあるわけだけど、ぶつ切りにして保持しても役に立ちそうにないし。と言うわけでとりあえずタグ以外捨てたわけだが。これもタグ毎に対応するのかなあ。
module Main where import Text.ParserCombinators.Parsec hiding (token) import Control.Monad data HtmlTag = Notag | Tag TagName [Attr] [HtmlTag] deriving (Show) data Attr = Attr (Key, Val) deriving (Show) type TagName = String type Key = String type Val = String -------------------------------------------------------------------------------- parseDoubleQuotedString = do char '\"' str <- many $ noneOf "\"" char '\"' return str parseSingleQuotedString = do char '\'' str <- many $ noneOf "\'" char '\'' return str parseString = many1 alphaNum parseVal = try (token parseDoubleQuotedString) <|> token parseSingleQuotedString parseKey = token parseString parseTagName = token parseString parseAttr = do key <- parseKey char '=' val <- parseVal return (key, val) parseContent :: Parser [HtmlTag] parseContent = try(many1 $ token' $ parseTag) <|> liftM (\x->[Notag]) (many (noneOf "<>")) -- tag以外は全て捨てている parseOpenTag = do try(char '<') tagName <- parseTagName attrs <- many parseAttr char '>' return $ (tagName, attrs) parseCloseTag tagName = do string $ "</" token $ string tagName char '>' -------------------------------------------------------------------------------- token, token' :: Parser a -> Parser a -- token前後の空白を無視する token p = do spaces v <- p spaces return v -- タグ以外を色々捨てている token' p = do many $ noneOf "<>" v <- p many $ noneOf "<>" return v -------------------------------------------------------------------------------- -- 分割タグ parseDevidedTag :: Parser HtmlTag parseDevidedTag = do -- 開きタグ (tagName, attrs) <- try(parseOpenTag) -- 中身 contents <- parseContent -- 閉じタグ parseCloseTag tagName return $ Tag tagName (map Attr attrs) contents -- 非分割タグ parseNonDevidedTag :: Parser HtmlTag parseNonDevidedTag = do try(char '<') tagName <- parseTagName attrs <- many parseAttr string "/>" return $ Tag tagName (map Attr attrs) [Notag] -- タグ parseTag :: Parser HtmlTag parseTag = try(parseNonDevidedTag) <|> parseDevidedTag -------------------------------------------------------------------------------- main = do text <- readFile "test.html" putStrLn $ case parse parseTag "html" text of Left e -> "Error: " ++ show e Right a -> show $ a
Text.XML.HaXml見てみると、やはりタグ毎に対応しているよう。
Network.Curl.Download使ってみた。
Prelude Network.Curl.Download> Network.Curl.Download.openAsTags "localhost:3000/hi/ruicc" Right [TagOpen "html" [],TagText "\n",TagOpen "title" [],TagText "hello",TagClose "title",TagText "\n",TagOpen "body" [],TagText "\n ",TagOpen "p" [],TagText "hello ruicc\n",TagClose "body",TagText "\n",TagClose "html",TagText "\n"] Prelude Network.Curl.Download> Network.Curl.Download.openAsXML "localhost:3000/hi/ruicc" Right [Elem (Element {elName = QName {qName = "html", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 1}),Elem (Element {elName = QName {qName = "title", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "hello", cdLine = Just 2})], elLine = Just 2}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 2}),Elem (Element {elName = QName {qName = "body", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n ", cdLine = Just 3}),Elem (Element {elName = QName {qName = "p", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "hello ruicc\n", cdLine = Just 4})], elLine = Just 4})], elLine = Just 3}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 5})], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 6})]
なるほどタグ毎ではなくて開きタグ閉じタグ別々に保持しているのか。openAsXMLのほうはきちんとパースしてるぽい。