Convert HTML to valid XML tag

2018-06-27 11:49:33

I need help writing a regex function that converts an HTML string to a valid XML tag name. For example, it takes a string and does the following:

If a letter or underscore occurs in the string, it is kept.

If any other character occurs, it is removed from the output string.

If any other character occurs between words or letters, it is replaced with an underscore.

Ex:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically the regex function should convert the HTML string to a valid XML tag.

Some regex and some standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any run of characters that are not letters or underscores
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
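To check the approach against the inputs from the question, here is a minimal sketch. The name to_tag_name and the extra trim() call are my own assumptions, not part of the answer above; the trim only drops whitespace at the edges so it does not turn into a leading or trailing underscore.

```php
<?php
// Hypothetical helper: the same steps as mystrip(), plus a trim() (an assumption).
function to_tag_name(string $s): string
{
    // Pad angle brackets so neighbouring words stay apart once the tags are stripped.
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

    // Assumption: whitespace at the edges should not become an underscore.
    $s = trim($s);

    // Collapse every run of non-letter, non-underscore characters into one underscore.
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

$samples = array(
    "Date Created",
    "Date<br/>Created",
    "Date\nCreated",
    "Date    1 2 3 Created",
);

foreach ($samples as $input) {
    echo to_tag_name($input), PHP_EOL; // prints Date_Created for each sample
}
```

Note that a string starting with a digit or punctuation would still come out with a leading underscore, which is allowed as the first character of an XML name.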

Try this

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

"
(               # Match the regular expression below and capture its match into backreference number 1
                   # Match either the regular expression below (attempting the next alternative only if this one fails)
      [\d\s]        # Match a single character present in the list below
                         # A single digit 0..9
                         # A whitespace character (spaces, tabs, and line breaks)
   |               # Or match regular expression number 2 below (the entire group fails if this one fails to match)
      <               # Match the character “<” literally
      [^<>]           # Match a single character NOT present in the list “<>”
         +               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      >               # Match the character “>” literally
)
"
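One thing to watch: as explained, the character class matches a single digit or whitespace character, so each one gets its own underscore. A possible tweak (my assumption, not part of the answer) is to wrap the alternation in a repeated non-capturing group so a whole run collapses to one underscore. A quick sketch, with the \d and \s escapes written out:

```php
<?php
$subject = 'Date    1 2 3 Created';

// Pattern as explained above: one underscore per digit or whitespace character.
$perChar = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

// Assumed variation: "+" on a non-capturing group collapses the whole run.
$collapsed = preg_replace('/(?:[\d\s]|<[^<>]+>)+/', '_', $subject);

echo $perChar, PHP_EOL;   // several underscores between Date and Created
echo $collapsed, PHP_EOL; // Date_Created

// A tag between two words is handled by the second alternative.
$tagged = preg_replace('/(?:[\d\s]|<[^<>]+>)+/', '_', 'Date<br/>Created');
echo $tagged, PHP_EOL;    // Date_Created
```

Either way, this pattern only covers digits, whitespace and tags; other punctuation such as "-" is left in the string.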

Should be able to use:

$text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );

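The lookarounds check that there is a letter immediately before and after the match, and any run of non-letter, non-underscore characters between those letters is replaced with a single underscore. Characters at the very start or end of the string are left untouched, because they do not sit between two letters.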
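A small sketch of how this behaves on the inputs from the question. On raw HTML the letters inside a tag name count as letters too, so the tag is split rather than removed; replacing tags with a space first (my own assumption, using a simple tag pattern rather than a real HTML parser) gets around that:

```php
<?php
$pattern = '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/';

// Whitespace and digits between two letters collapse to one underscore.
$plain = preg_replace($pattern, '_', 'Date    1 2 3 Created');
echo $plain, PHP_EOL;   // Date_Created

// On raw HTML, "br" is treated as a word, so the tag is not removed.
$raw = preg_replace($pattern, '_', 'Date<br/>Created');
echo $raw, PHP_EOL;     // Date_br_Created

// Assumed pre-step: swap each tag for a space before applying the pattern.
$noTags = preg_replace('/<[^<>]+>/', ' ', 'Date<br/>Created');
$clean  = preg_replace($pattern, '_', $noTags);
echo $clean, PHP_EOL;   // Date_Created
```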

Link: http://www.djcxy.com/p/76856.html
