[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [leafnode-list] lockfile_exists disaster, analysis of 1.9.19 and



After some close looks at Cornelius' patch, I comment. Larry's mail also
holds, but additionally, on some systems, link fails on local file
systems or the very same reasons it fails on NFS; it may return EINTR
although the actual link has been created.

krasel@xxxxxxxxxxxxxxxxxxxxxxxxxxxx (Cornelius Krasel) writes:

> Here is what I believe to be a fix. Patch your version of 1.9.19 with
> it:

Sorry, this fix still contains a number of flaws.

> +    /* obtain a unique lockfile name, consisting of lockfile + pid */

Unsafe. Use fqdn as well, or just use mkstemp.

> +    /* try to obtain exclusive control over the lockfile */
> +    if (link(lockfilename, lockfile) < 0) {

Can fail here, see above.

> +	/* lockfile already exists. Does the process still exist as well? */
> +	lf = fopen(lockfile, "r+");
> +	if (lf == NULL) {
> +	    syslog(LOG_ERR, "Cannot open old lockfile: %m");
> +	    unlink(lockfilename);
> +	    return 1;	/* lockfile exists */
> +	}
> +	if (fscanf(lf, "%d", &pid)) {
>  	    fclose( lf );
> -	    if ( kill( pid, 0 ) == -1 && errno == ESRCH ) {

Well, this can be avoided. Paranoid will also check pid != 0 before
assuming everything's alright, but for me, relying on proper fscanf
return values is fine.

> +	    if ( kill(pid, 0) == -1 && errno == ESRCH ) {
..
>  	    } else {
>  		printf( "lockfile %s exists, abort ...\n", lockfile);
>  		syslog( LOG_ERR, "lockfile %s exists, abort ...",
>  			lockfile);
> +		unlink(lockfilename);

No error checking here. You may think that it's safe not to check, but
it isn't. If there's a problem with other parts, this unlink will tell
you that your file it was supposed to unlink had already gone. If the
old lockfile_exists function had logged every single unlink problem,
we'd have had evidence of a race condition MUCH earlier.

Don't save a single line at the wrong end.

..

> -    } else if ( errno != ENOENT ) {
> -	syslog( LOG_ERR, "unable to open lockfile %s: %d %m", lockfile, errno );
>      }

This lacks above. When errno == ENOENT /here/, retry the entire
operation.

> -    if ( ( lf = fopen( lockfile, "w" )) != NULL ) {
> +
> +    /* We have obtained the exclusive lock. Write pid into it. */
> +    if ( (lf = fopen( lockfile, "w" )) != NULL ) {

Robustness problem here. I used this approach, first link, then write
PID, and came across empty lock files in adverse conditions (out of
memory, high load), and if the lock file doesn't carry the PID, it
cannot be automatically removed if stale.

>  	fprintf( lf, "%d", (int)getpid() );
> -	fclose( lf );
> +	fclose(lf);
> +	unlink(lockfilename);

Both of these need error checking. Actually, fprintf would also need
error detection, but I believe this can be deferred until the
fclose. Note that the couple of digits that getpid generates at most
will hardly trigger a buffer flush, so actual write errors will only be
visible after fclose() has flushed the buffers. 

As to unlink, see above.

These are all minor problems and "might fail on occasion", but a minor
problem also broke the existing approach and trashed the groupinfo on
occasion. The "fix" must be water and gas proof or it cannot claim to be
one.

> I don't have tcpserver, but maybe somebody else can try out whether this
> version of lockfile_exists() is safe. I put this version into my local
> 2.0b tree right now and will soon find out whether there are any problems
> with it :-/

I believe it isn't, as outlined above. Could you try my version? 

It is indeed a bigger change, but I hope to have common lockfile code
for the upcoming 1.9.20 and the sequel to 2.0b8 in respect to
locking. This will have the advantage that we need to maintain only one
version of this function, not a 1.9.x one and a 2.0bx one, and we'll
have more testers.

tcpserver doesn't require special installation, the default installation
will just write a couple of programs to /usr/local/bin and is basically
1-download, 2-unpack tar.gz, 3-make, 4-as root, make setup check

You may want to give it a try. 

See http://cr.yp.to/ucspi-tcp.html

-- 
Matthias Andree

-- 
leafnode-list@xxxxxxxxxxxxxxxxxxxxxxxxxxxx -- mailing list for leafnode
To unsubscribe, send mail with "unsubscribe" in the subject to the list