---------- Forwarded message ----------
X-bandwidth-by: Hyperreal
Date: Tue, 23 Sep 1997 21:52:56 -0600 (MDT)
From: Marc Slemko
To: Apache - BYOC
Subject: Memory leak in accept(), Solaris 2.5.1 (fwd)
Reply-To: new-httpd@apache.org

Interesting.  If true, another reason to use serialized accepts.

---------- Forwarded message ----------
>Path: scanner.worldgate.com!news.insinc.net!newsfeed.direct.ca!newsfeed.internetmci.com!206.28.166.253!news.probe.net!not-for-mail
>From: mgleFIXTHISason@probe.net (Mike Gleason, remove the FIXTHIS)
>Newsgroups: comp.unix.solaris
>Subject: Memory leak in accept(), Solaris 2.5.1
>Date: Wed, 24 Sep 1997 02:49:02 GMT
>Organization: NCEMRSoft
>Lines: 157
>Message-ID: <34287641.85493713@news.probe.net>
>NNTP-Posting-Host: max2-16.probe.net
>X-Newsreader: Forte Free Agent 1.11/32.235
>Xref: scanner.worldgate.com comp.unix.solaris:117796

I have found a memory leak that affects Solaris 2.5.1 (Intel and Sparc,
with the Sept 17 recommended patch cluster applied) when the accept()
socket library function is used.  Memory is leaked when an alarm causes
accept() to jump out of the function, without free()ing memory
malloc()'d by accept().

Perhaps someone could point me to an official method of reporting this.
I'm not a contract customer and I couldn't find anything on the Sun
webs.

Example code (stripped bare for brevity) is next, followed by output
showing how the leak occurs.  Unfortunately for TCP/IP servers, this
can cause a long-running daemon process to consume all swap space and
grind the machine to a halt.

---snip---
/* The ten header names were lost in the archive; the list below is a
 * reconstruction of what the code needs, not the original text. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>

char mallbuf[1024 * 100];
char *mbptr;
jmp_buf jmp;

/* Replacement malloc(): a bump allocator over mallbuf that logs each call. */
void *
malloc(size_t n)
{
	static int calls = 0;
	char *p;
	char m[64];

	if (calls++ == 0) {
		mbptr = mallbuf;
	}
	p = mbptr;
	/* Round each allocation up to a multiple of 8 bytes. */
	if ((n % 8) == 0)
		mbptr += n;
	else
		mbptr += n + (8 - (n % 8));
	write(1, "used malloc\n", 12);
	return (p);
}	/* malloc */

/* Replacement free(): releases nothing, only logs the call. */
void
free(void *p)
{
	write(1, "used free  \n", 12);
}	/* free */

/* SIGALRM handler: jump out of the blocked accept(). */
static void
hdlr(int num)
{
	longjmp(jmp, 1);
}	/* hdlr */

main(int argc, char **argv)
{
	int s, s2, addrsize;
	struct sockaddr_in addr;
	unsigned short port;

	if (argc < 2)
		exit(2);
	port = (unsigned short) atoi(argv[1]);

	write(1, "socket\n", 7);
	s = socket(AF_INET, SOCK_STREAM, 0);

	(void) memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = INADDR_ANY;
	addr.sin_port = htons(port);

	write(1, "bind  \n", 7);
	bind(s, &addr, sizeof(addr));

	write(1, "listen\n", 7);
	listen(s, 2);

	for (;;) {
		/* Control returns here when the alarm fires inside accept(). */
		if (setjmp(jmp) != 0) {
			alarm(0);
			write (1, "accept timeout\n", 15);
		}
		signal(SIGALRM, hdlr);
		alarm(2);
		(void) memset(&addr, 0, sizeof(addr));
		addrsize = sizeof(addr);
		write(1, "accept\n", 7);
		s2 = accept(s, &addr, &addrsize);
		alarm(0);
		if (s2 >= 0) {
			write (1, "accept okay   \n", 15);
			break;
		} else {
			write (1, "accept error  \n", 15);
		}
	}
	write(1, "done\n", 5);
	exit(0);
}
---snip---

Output follows.  Note how malloc() isn't called until the second time
accept is used.

21 Avalanche ~/src > a.out 5102
socket
used malloc
[...]
used malloc
used free
bind
listen
accept
accept timeout
accept
used malloc
accept timeout
accept
used malloc
accept timeout
accept
used malloc
^C

Workaround:
=========

select() can be used as a timeout mechanism, instead of using alarm().

Diatribe:
======

I've spent three months trying to track down this bug.  I also had to
purchase a separate machine and Solaris license on which to load test
an app on a Solaris platform.
Although normally that wouldn't be necessary, it's still tough for a
small, independent developer to support Sun operating systems.  I'd
happily take a copy of the Sun C compiler for Intel in exchange for all
the resources I had to exhaust on this one so I wouldn't have to use
gcc at least.

~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mike Gleason, NCEMRSoft.
~~~~~~~~~~~~~~~~~~~~~~~~~~~
(Please remove the "FIXTHIS" from my email address before replying.)
ObPlug: Try NcFTPd (http://www.probe.net/~mgleason/ncftpd/) instead of wu-ftpd!


From mikedoug@staff.texas.net Sun May 10 11:05:28 1998
Date: Sun, 10 May 1998 12:37:32 -0500
From: Michael Douglass
To: new-httpd@apache.org
Subject: Re: immortal httpd processes.
Reply-To: new-httpd@apache.org

On Fri, May 01, 1998 at 10:47:50PM -0500, Michael Douglass said:

FYI, for all of you who involved yourselves in attempting to assist me
in my dilemma, let me tell you the fix.

The fix was quite simple actually; the problem an annoying one.  We have
run into this immortal process before, but it was on a hugely threaded
process (typhoon) on another 2.6 box.  The kernel patch 105181-05 (still
in test-patch status, not released to public yet) was given to us for
that box.  It was my understanding that it only affected threaded
programs; well, little did I know that it also seems to affect
non-threaded-but-very-busy programs like apache (well, on a large web
server).

So, if anyone else has a problem with a solaris 2.6 box with immortal
processes (won't show up until server gets really busy) contact sun to
see if you can get T-patch 105181-05.  It will save you a lot of
annoying hassles.

> On Fri, May 01, 1998 at 09:34:57PM -0600, Marc Slemko said:
>
> > Could the connectivity between your NFS client and server be dying?
> >
> > Are your NFS mounts interruptible and/or soft?
> > (note that using either makes NFS unreliable but it is already
> > unreliable)
>
> Actually, I just realized that the admins that set up this box didn't
> put any explicit options in there for the NFS mounts; so looking at
> the defaults in `man mount_nfs` on that box: hard and interruptible.
> Also ver3 and tcp are the obvious defaults; we tend to revert most NFS
> mounts back to UDP since we've had a fair number of issues with some
> TCP mounts that go away once moved to UDP.
>
> > Are you reading content or anything else from NFS disks?
>
> Yes, the same box is where the content is stored.
>
> > What sort of box is serving the NFS?
>
> Netapp
>
> > What does ps -elf show the process to be at when it can't be killed?  I
> > wish Solaris had nice state info to indicate what things are blocked on
> > like FreeBSD does.
>
> I haven't tried a ps -elf.
>
> > Does lsof show anything of interest with the process?
>
> Won't return
>
> > If it can't be killed, it must be a kernel issue.  Apache could trigger
> > it, but no user process should ever be able to do anything to make it
> > unkillable.
>
> Right, I was asking here for two reasons.  One, I know you guys to be
> highly intelligent in these types of issues; and two, I was hoping that
> someone else had experienced and worked around this issue.
>
> Our other problem is that we can't seem to get the machine to generate
> a core dump (of the entire system) to send to sun.  Is there a command
> under solaris to do this?  (i.e. intentionally panic the system)
>
>
> --
> Michael Douglass
> Texas Networking, Inc.
>
>   it's raining...it's pouring...the old man...
>   *** Describe: msmith shuts up now.

--
Michael Douglass
Texas Networking, Inc.

  it's raining...it's pouring...the old man...
  *** Describe: msmith shuts up now.
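
[Archive note] The workaround named in the first message, using select()
as the timeout mechanism instead of alarm(), might look roughly like the
sketch below. It is not Mike Gleason's code: the function name
accept_with_timeout and the choice of ETIMEDOUT as the timeout errno are
assumptions made for illustration, and the header list assumes a current
POSIX system rather than Solaris 2.5.1. Because no signal handler jumps
out of accept(), the library call always returns normally.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper (not from the original post): wait up to
 * timeout_sec seconds for a pending connection on listen_fd.
 * Returns the connected descriptor, or -1 with errno set; errno is
 * ETIMEDOUT when no connection arrived in time.
 */
int
accept_with_timeout(int listen_fd, int timeout_sec)
{
	fd_set readable;
	struct timeval tv;
	int n;

	assert(timeout_sec >= 0);

	for (;;) {
		/* select() may modify both arguments, so reset them each pass. */
		FD_ZERO(&readable);
		FD_SET(listen_fd, &readable);
		tv.tv_sec = timeout_sec;
		tv.tv_usec = 0;

		n = select(listen_fd + 1, &readable, NULL, NULL, &tv);
		if (n < 0 && errno == EINTR)
			continue;	/* unrelated signal: retry with a fresh timeout */
		break;
	}

	if (n < 0)
		return (-1);		/* select() failed; errno already set */
	if (n == 0) {
		errno = ETIMEDOUT;	/* nothing to accept within the timeout */
		return (-1);
	}
	/* A connection is pending, so accept() should return promptly. */
	return (accept(listen_fd, NULL, NULL));
}
```

In the test program from the first message, the alarm()/setjmp() pair
around accept() would be replaced by a call such as
accept_with_timeout(s, 2). One caveat: a pending connection can vanish
between select() and accept(), so a careful server would probably also
put the listening socket in non-blocking mode.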