sed, awk, perl or lex: find strings by prefix+regex, ignoring the rest of the input [closed]

I need to find strings with a certain prefix, followed by a regexp, in a bunch of files, but ignore the rest of the input (including the content of the line before the prefix, and after the end of the matching regexp).

What's the best tool for the job? grep finds complete lines; sed is usually used just for editing and select-and-replace; awk? perl?

I also thought of lex, but am I really after a compiler compiler?!

Edit: the input is several thousand HTML files; the prefix + regular expression would be https://([-.0-9A-Za-z]+\.[A-Za-z]{2,}) (of which I want $1), and the rest of the input should be ignored.

asked Nov 27 '13 at 04:11

Example please. What do you mean by "prefix"? -

"https://" would be the prefix. -

What do you mean by "regexp?" Examples of the strings would help. -

Why do you think grep is not a solution? Sounds like it will work just fine with the right expression, but without more details and input samples we're all just guessing. -

Amended the question. -

1 Answer

If you won't have more than one occurrence of the pattern on a single line, I'd probably use sed:

sed -n -e 's%.*https://\([-.0-9A-Za-z]\{1,\}\.[A-Za-z]\{2,\}\).*%\1%p'

Given the data file:

Nothing here
Before after and after
Before you get to
And double your for fun and happiness in triplicate
and nothing here

The sed script produces one entry per line, showing the last entry when there's more than one on the line.
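
As a rough illustration of that "last entry wins" behaviour, here is a small self-contained sketch. The sample text and host names (one.example.com and so on) and the function name last_host_per_line are made up for the demonstration; only the sed command itself comes from the answer. The greedy leading .* is what pushes the match to the last https:// on each line.

```shell
#!/bin/sh
# Hypothetical sample input; the host names are invented for illustration.
sample='Nothing here
Before https://one.example.com after
Two on a line: https://first.example.org and https://second.example.net done'

# Wrap the sed command from the answer so it reads from standard input.
# -n suppresses default output; the trailing p prints only lines where
# the substitution succeeded, so non-matching lines are dropped.
last_host_per_line() {
    sed -n -e 's%.*https://\([-.0-9A-Za-z]\{1,\}\.[A-Za-z]\{2,\}\).*%\1%p'
}

printf '%s\n' "$sample" | last_host_per_line
# Prints:
#   one.example.com
#   second.example.net
```

Note that first.example.org never appears in the output, which is why this approach only suits input with at most one match per line.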

A Perl script can be used for multiple entries per line:

$ perl -nle 'print $1 while (m%https://([-.0-9A-Za-z]+\.[A-Za-z]{2,})%g);' data

answered Nov 27, 13:05

I have no clue what the html input is; I'd like to be able to find more than one occurrence of my pattern on any given line. - cnst
