Convert HTML to valid XML tag

I need help writing a regex function that converts an HTML string to a valid XML tag name. It takes a string and does the following:

  • If a letter or underscore occurs in the string, it is kept.
  • If any other character occurs, it's removed from the output string.
  • If any other character occurs between words or letters, it's replaced with an underscore.
  • Examples:
    Input: Date Created
    Output: Date_Created
    
    Input: Date<br/>Created
    Output: Date_Created
    
    Input: Date\nCreated
    Output: Date_Created
    
    Input: Date    1 2 3 Created
    Output: Date_Created
    

    Basically, the regex function should convert the HTML string to a valid XML tag name.


    A bit of regex and some standard functions:

    function mystrip($s)
    {
            // add spaces around angle brackets to separate tag-like parts
            // e.g. "<br />" becomes " <br /> "
            // then let strip_tags take care of removing html tags
            $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));
    
            // any sequence of characters that are not letters or underscores
            // gets replaced by a single underscore
            return preg_replace('/[^a-z_]+/i', '_', $s);
    }
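To see how this behaves on the inputs from the question, here is a minimal sketch. The function name `xml_tag_name_sketch` is hypothetical, and the final `trim()` step is an assumption that is not part of the answer above: it only removes underscores left at the ends when the input starts or ends with something other than a letter.

```php
<?php
// Hypothetical helper name; mirrors the answer's approach, plus a trim step.
function xml_tag_name_sketch(string $s): string
{
    // Pad angle brackets so the text on either side of a tag stays separated,
    // then let strip_tags() drop the tags themselves.
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

    // Collapse every run of characters other than letters/underscores into "_".
    $s = preg_replace('/[^a-z_]+/i', '_', $s);

    // Assumption (not in the original answer): drop underscores at both ends.
    return trim($s, '_');
}

$inputs = array("Date Created", "Date<br/>Created", "Date\nCreated", "Date    1 2 3 Created");
foreach ($inputs as $input) {
    echo xml_tag_name_sketch($input), "\n";
}
```

Each of the four sample inputs should come out as Date_Created. Note that the trim step would also strip an underscore that was deliberately placed at the start or end of the input, so leave it out if that matters.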
    

    Try this

    $result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);
    

    Explanation

    "
    (               # Match the regular expression below and capture its match into backreference number 1
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
          [\d\s]        # Match a single character present in the list below
                             # A single digit 0..9
                             # A whitespace character (spaces, tabs, and line breaks)
       |               # Or match regular expression number 2 below (the entire group fails if this one fails to match)
          <               # Match the character “<” literally
          [^<>]           # Match a single character NOT present in the list “<>”
             +               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
          >               # Match the character “>” literally
    )
    "
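One thing the explanation implies is that the character class matches a single character at a time, so every digit or whitespace character gets its own underscore. The sketch below compares the pattern as explained with an assumed variant (the `$perRun` pattern is not from the answer) that wraps the alternation in a non-capturing group with `+` to collapse a whole run into one underscore. Variable names are illustrative only.

```php
<?php
// The pattern from the answer, with the backslashes in \d and \s written out.
$perChar = '/([\d\s]|<[^<>]+>)/';

// Assumed variant (not in the original answer): a non-capturing group with "+"
// so that a whole run of digits, whitespace and tags yields one underscore.
$perRun = '/(?:[\d\s]|<[^<>]+>)+/';

foreach (array("Date<br/>Created", "Date    1 2 3 Created") as $subject) {
    echo preg_replace($perChar, '_', $subject), "\n";
    echo preg_replace($perRun, '_', $subject), "\n";
}
```

Neither pattern touches punctuation such as `-` or `.`, since only digits, whitespace, and tags are matched.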
    

    Should be able to use:

    $text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );
    

    The lookarounds check that there is a letter before and after the match, so any run of non-letter, non-underscore characters between two letters is replaced with a single underscore.
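A quick way to check what the lookaround pattern does and does not cover is to run it over a few inputs. This is only a sketch; the sample strings are taken from the question, plus one with surrounding spaces added for illustration.

```php
<?php
$pattern = '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/';

$samples = array(
    "Date Created",
    "Date    1 2 3 Created",
    "Date<br/>Created",   // the letters inside the tag count as letters too
    " Date Created ",     // no letter before/after the outer spaces
);

foreach ($samples as $text) {
    // var_dump() makes leading and trailing whitespace visible in the output.
    var_dump(preg_replace($pattern, '_', $text));
}
```

Because the pattern only looks at letters, the `br` inside `<br/>` is kept, and characters at the very start or end of the string are left alone. If tags need to go, one option is to run `strip_tags()` first, as the first answer does.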
