<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< title > YBlog - Pragmatic Regular Expression Exclude (2)< / title >
< meta name = "keywords" content = "regexp, regular expression" / >
< link rel = "shortcut icon" type = "image/x-icon" href = "../../../../Scratch/img/favicon.ico" / >
< link rel = "stylesheet" type = "text/css" href = "/css/y.css" / >
< link rel = "stylesheet" type = "text/css" href = "/css/legacy.css" / >
< link rel = "alternate" type = "application/rss+xml" title = "RSS" href = "/rss.xml" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" >
< link rel = "apple-touch-icon" href = "../../../../Scratch/img/about/FlatAvatar@2x.png" / >
<!-- [if lt IE 9]>
< script src = "http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js" > < / script >
<![endif]-->
<!-- IndieAuth -->
< link href = "https://twitter.com/yogsototh" rel = "me" >
< link href = "https://github.com/yogsototh" rel = "me" >
< link href = "mailto:yann.esposito@gmail.com" rel = "me" >
< link rel = "pgpkey" href = "../../../../pubkey.txt" >
< / head >
< body lang = "en" class = "article" >
< div id = "content" >
< div id = "header" >
< div id = "choix" >
< span id = "choixlang" >
< a href = "../../../../Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/" > French< / a >
< / span >
< span class = "tomenu" > < a href = "#navigation" > ↓ Menu ↓< / a > < / span >
< span class = "flush" > < / span >
< / div >
< / div >
< div id = "titre" >
< h1 > Pragmatic Regular Expression Exclude (2)< / h1 >
< / div >
< div class = "flush" > < / div >
< div id = "afterheader" class = "article" >
< div class = "corps" >
< p > In my < a href = "../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp" > previous post< / a > I gave a trick for matching everything except something. In the same spirit, here is a trick for matching the smallest possible string. Say you want to match the shortest strings that start with ‘a’ and end with ‘b’. For example, you want to match:< / p >
< pre >
a.....< strong class = "blue" > a......b< / strong > ..b..a....< strong class = "blue" > a....b< / strong > ...
< / pre >
< p > Here are two common errors and a solution:< / p >
< pre >
/a.*b/
< strong class = "red" > a.....a......b..b..a....a....b< / strong > ...
< / pre >
< p > The first error is to use the < em > evil< / em > < code > .*< / code > , because it matches from the first < code > a< / code > to the last < code > b< / code > .< / p >
< pre >
/a.*?b/
< strong class = "red" > a.....a......b< / strong > ..b..< strong class = "red" > a....a....b< / strong > ...
< / pre >
< p > The next natural step is to change the < em > greediness< / em > . But that is not enough, because you still match from the first < code > a< / code > to the first < code > b< / code > that follows it. A simple observation is that the part between the delimiters shouldn’t contain any < code > a< / code > or < code > b< / code > . This leads to the final, elegant solution.< / p >
< pre >
/a[^ab]*b/
a.....< strong class = "blue" > a......b< / strong > ..b..a....< strong class = "blue" > a....b< / strong > ...
< / pre >
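< p > The three patterns can be compared side by side. The following is a minimal sketch using the < code > re< / code > module of Python, chosen here only for illustration (the post itself uses Perl-style patterns); the function name < code > compare_patterns< / code > is hypothetical.< / p >

```python
import re

def compare_patterns(text):
    """Return the matches found by the three patterns discussed above."""
    return {
        "greedy": re.findall(r"a.*b", text),        # first a to last b
        "lazy": re.findall(r"a.*?b", text),         # each a to the next b
        "exclude": re.findall(r"a[^ab]*b", text),   # no a or b in between
    }

sample = "a.....a......b..b..a....a....b..."
for name, matches in compare_patterns(sample).items():
    print(name, matches)
```

< p > Only the last pattern returns the two short strings highlighted in blue above.< / p >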
< p > So far, that was easy. Now consider the case where you need to match not between the characters < code > a< / code > and < code > b< / code > , but between two strings. For example:< / p >
< div class = "sourceCode" id = "cb1" > < pre class = "sourceCode html" > < code class = "sourceCode html" > < a class = "sourceLine" id = "cb1-1" title = "1" > < span class = "kw" > < li> < / span > ...< span class = "kw" > < /li> < / span > < / a > < / code > < / pre > < / div >
< p > This is a bit more difficult. You need to match:< / p >
< div class = "sourceCode" id = "cb2" > < pre class = "sourceCode html" > < code class = "sourceCode html" > < a class = "sourceLine" id = "cb2-1" title = "1" > < span class = "kw" > < li> < / span > [anything not containing < span class = "kw" > < li> < / span > ]< span class = "kw" > < /li> < / span > < / a > < / code > < / pre > < / div >
< p > The first method would be to use the same reasoning as in my < a href = "../../../../Scratch/en/blog/2010-02-15-All-but-something-regexp" > previous post< / a > . Here is a first try:< / p >
< div class = "sourceCode" id = "cb3" > < pre class = "sourceCode perl" > < code class = "sourceCode perl" > < a class = "sourceLine" id = "cb3-1" title = "1" > < li> ([^< ]|< [^l]|< l[^i]|< li[^> ])< span class = "dt" > *< < / span > < span class = "kw" > /< / span > < span class = "ot" > li> < / span > < / a > < / code > < / pre > < / div >
< p > But what about the following string:< / p >
< div class = "sourceCode" id = "cb4" > < pre class = "sourceCode html" > < code class = "sourceCode html" > < a class = "sourceLine" id = "cb4-1" title = "1" > < span class = "kw" > < li> < / span > ...< span class = "kw" > < li< / span > < span class = "er" > < /li< / span > < span class = "kw" > > < / span > < / a > < / code > < / pre > < / div >
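< p > The first try fails to match that string, even though nothing between its delimiters is a complete < code > < li> < / code > : the trailing < code > < li< / code > gets consumed together with the < code > < < / code > of the closing tag. This is why, if we really want to match correctly< sup > < a href = "#note1" > †< / a > < / sup > , we need to allow a partial start of < code > < li> < / code > just before the closing tag:< / p >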
< div class = "sourceCode" id = "cb5" > < pre class = "sourceCode perl" > < code class = "sourceCode perl" > < a class = "sourceLine" id = "cb5-1" title = "1" > < li> ([^< ]|< [^l]|< l[^i]|< li[^> ])*(|< |< l|< li)< /li> < / a > < / code > < / pre > < / div >
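< p > To see the difference between the first try and the corrected pattern, here is a small sketch using the < code > re< / code > module of Python (an illustration only, not the tooling used in the post). The names < code > BODY< / code > , < code > FIRST_TRY< / code > , < code > FIXED< / code > and < code > first_match< / code > are hypothetical, and the groups are written as non-capturing, which does not change what is matched.< / p >

```python
import re

# Each alternative consumes one ordinary character or a broken-off start of "<li>".
BODY = r"(?:[^<]|<[^l]|<l[^i]|<li[^>])*"

FIRST_TRY = re.compile(r"<li>" + BODY + r"</li>")
# The fix: allow a partial prefix of "<li>" right before the closing tag.
FIXED = re.compile(r"<li>" + BODY + r"(?:|<|<l|<li)</li>")

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    found = pattern.search(text)
    return found.group(0) if found else None

print(first_match(FIRST_TRY, "<li>...<li</li>"))  # the first try misses this string
print(first_match(FIXED, "<li>...<li</li>"))      # the corrected pattern finds it
print(first_match(FIXED, "<li>a<li>b</li>"))      # only the innermost item is returned
```

< p > One caveat that this sketch makes visible: an input such as < code > < li> < < li> < /li> < / code > still matches as a whole, because < code > < [^l]< / code > consumes the two < code > < < / code > characters and the following < code > li> < / code > then passes as ordinary text. Getting every corner case right with a single pattern is hard.< / p >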
< p > Yes, it is a bit complicated. And what if the string I wanted to match were even longer?< / p >
< p > Here is an algorithmic way to handle this easily: reduce the problem to the earlier single-letter case.< / p >
< div class = "sourceCode" id = "cb6" > < pre class = "sourceCode perl" > < code class = "sourceCode perl" > < a class = "sourceLine" id = "cb6-1" title = "1" > < span class = "co" > # replace a single, arbitrarily chosen character< / span > < / a >
< a class = "sourceLine" id = "cb6-2" title = "2" > < span class = "co" > # with a unique ID< / span > < / a >
< a class = "sourceLine" id = "cb6-3" title = "3" > < span class = "co" > # (you should verify the identifier is REALLY unique)< / span > < / a >
< a class = "sourceLine" id = "cb6-4" title = "4" > < span class = "co" > # beware: the unique ID must not contain the< / span > < / a >
< a class = "sourceLine" id = "cb6-5" title = "5" > < span class = "co" > # chosen character< / span > < / a >
< a class = "sourceLine" id = "cb6-6" title = "6" > < span class = "kw" > s/< / span > < span class = "ot" > X< / span > < span class = "kw" > /< / span > < span class = "st" > _was_x_< / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-7" title = "7" > < span class = "kw" > s/< / span > < span class = "ot" > Y< / span > < span class = "kw" > /< / span > < span class = "st" > _was_y_< / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-8" title = "8" > < / a >
< a class = "sourceLine" id = "cb6-9" title = "9" > < span class = "co" > # replace each long string with its single character< / span > < / a >
< a class = "sourceLine" id = "cb6-10" title = "10" > < span class = "kw" > s/< / span > < span class = "ot" > < li> < / span > < span class = "kw" > /< / span > < span class = "st" > X< / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-11" title = "11" > < span class = "kw" > s/< / span > < span class = "ot" > < \/li> < / span > < span class = "kw" > /< / span > < span class = "st" > Y< / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-12" title = "12" > < / a >
< a class = "sourceLine" id = "cb6-13" title = "13" > < span class = "co" > # use the first method< / span > < / a >
< a class = "sourceLine" id = "cb6-14" title = "14" > < span class = "kw" > s/< / span > < span class = "ot" > X< / span > < span class = "ch" > ([^< / span > < span class = "bn" > X< / span > < span class = "ch" > ]*)< / span > < span class = "ot" > Y< / span > < span class = "kw" > //g< / span > < / a >
< a class = "sourceLine" id = "cb6-15" title = "15" > < / a >
< a class = "sourceLine" id = "cb6-16" title = "16" > < span class = "co" > # turn each chosen letter back into its string< / span > < / a >
< a class = "sourceLine" id = "cb6-17" title = "17" > < span class = "kw" > s/< / span > < span class = "ot" > X< / span > < span class = "kw" > /< / span > < span class = "st" > < li> < / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-18" title = "18" > < span class = "kw" > s/< / span > < span class = "ot" > Y< / span > < span class = "kw" > /< / span > < span class = "st" > < \/li> < / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-19" title = "19" > < / a >
< a class = "sourceLine" id = "cb6-20" title = "20" > < span class = "co" > # restore the original occurrences of the chosen characters< / span > < / a >
< a class = "sourceLine" id = "cb6-21" title = "21" > < span class = "kw" > s/< / span > < span class = "ot" > _was_x_< / span > < span class = "kw" > /< / span > < span class = "st" > X< / span > < span class = "kw" > /g< / span > < / a >
< a class = "sourceLine" id = "cb6-22" title = "22" > < span class = "kw" > s/< / span > < span class = "ot" > _was_y_< / span > < span class = "kw" > /< / span > < span class = "st" > Y< / span > < span class = "kw" > /g< / span > < / a > < / code > < / pre > < / div >
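< p > The same sequence of substitutions can be wrapped in a small function. This is a sketch in Python under stated assumptions, not the script used in the post: the function name < code > strip_delimited< / code > is hypothetical, the delimiters are assumed to contain neither < code > X< / code > nor < code > Y< / code > , and the middle step uses < code > X[^XY]*Y< / code > (excluding both stand-in characters, as in the single-letter solution) so that only the smallest spans are removed.< / p >

```python
import re

def strip_delimited(text, start, end):
    """Remove every start...end span that contains no other delimiter."""
    open_ch, close_ch = "X", "Y"               # single-character stand-ins
    open_id, close_id = "_was_x_", "_was_y_"   # IDs protecting literal X and Y

    # The IDs must be REALLY unique, so refuse inputs that already contain them.
    if open_id in text or close_id in text:
        raise ValueError("protective identifier already present in the input")
    # Assumption of this sketch: the delimiters contain neither X nor Y.
    if any(ch in start + end for ch in (open_ch, close_ch)):
        raise ValueError("delimiters must not contain the stand-in characters")

    # 1. Protect literal X and Y.
    text = text.replace(open_ch, open_id).replace(close_ch, close_id)
    # 2. Collapse the long delimiters to single characters.
    text = text.replace(start, open_ch).replace(end, close_ch)
    # 3. Apply the single-character technique.
    text = re.sub(r"X[^XY]*Y", "", text)
    # 4. Expand the remaining stand-ins back into the delimiters.
    text = text.replace(open_ch, start).replace(close_ch, end)
    # 5. Restore the protected literal X and Y.
    return text.replace(open_id, open_ch).replace(close_id, close_ch)

print(strip_delimited("<ul><li>one</li><li>two</li></ul>", "<li>", "</li>"))
```

< p > Because the delimiters are plain arguments, the same function should cover the < code > en:< / code > and < code > ::< / code > case mentioned in the footnote without writing a new pattern.< / p >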
< p > And it works in only 9 lines for any beginning and ending string. This solution may look less < em > I AM THE GREAT REGEXP M45T3R, URAN00B< / em > , but it is more convenient in my humble opinion. Furthermore, using this last solution proves you have mastered regexps, because you know how difficult it is to handle such problems with a single regexp.< / p >
< hr / >
< p > < small > < a name = "note1" > < sup > †< / sup > < / a > I know I used an HTML example, but in my real-life use case I needed to match between < code > en:< / code > and < code > ::< / code > , and sometimes the string could end with < code > e::< / code > .< / small > < / p >
< / div >
< div id = "afterarticle" >
< div id = "navigation" >
< a href = "../../../../" > Home< / a >
< span class = "sep" > ¦< / span >
< a href = "../../../../Scratch/en/blog" > Blog< / a >
< span class = "sep" > ¦< / span >
< a href = "../../../../Scratch/en/softwares" > Softwares< / a >
< span class = "sep" > ¦< / span >
< a href = "../../../../Scratch/en/about" > About< / a >
< / div >
< div id = "totop" > < a href = "#header" > ↑ Top ↑< / a > < / div >
< div id = "bottom" >
< div >
Published on 2010-02-16
< / div >
< div >
< a href = "https://twitter.com/yogsototh" > Follow @yogsototh< / a >
< / div >
< div >
< a rel = "license" href = "http://creativecommons.org/licenses/by/3.0/deed.en_US" > Yann Esposito©< / a >
< / div >
< div >
Done with
< a href = "http://www.vim.org" target = "_blank" rel = "noopener noreferrer nofollow" > < strike > Vim< / strike > < / a >
< a href = "http://spacemacs.org" target = "_blank" rel = "noopener noreferrer nofollow" > spacemacs< / a >
< span class = "pala" > & < / span >
< a href = "http://nanoc.ws" target = "_blank" rel = "noopener noreferrer nofollow" > < strike > nanoc< / strike > < / a >
< a href = "http://jaspervdj.be/hakyll" target = "_blank" rel = "noopener noreferrer nofollow" > Hakyll< / a >
< / div >
< hr / >
< div style = "max-width: 100%" >
< a href = "https://cardanohub.org" >
< img src = "../../../../Scratch/img/ada-logo.png" class = "simple" style = "height: 16px ;
border-radius: 50%;
vertical-align:middle;
display:inline-block;" />
ADA:
< / a >
< code style = "display:inline-block;
word-wrap:break-word;
text-align: left;
vertical-align: top;
max-width: 85%;">
DdzFFzCqrhtAvdkmATx5Fm8NPJViDy85ZBw13p4XcNzVzvQg8e3vWLXq23JQWFxPEXK6Kvhaxxe7oJt4VMYHxpA2vtCFiP8fziohN6Yp
< / code >
< / div >
< / div >
< / div >
< / div >
< / div >
< / body >
< / html >