
Andrew Morton on kernel development


By Jonathan Corbet
June 11, 2008
Andrew Morton is well known in the kernel community for handling a wide variety of tasks: maintaining the -mm tree for patches that may be on their way to the mainline, reviewing lots of patches, giving presentations about working with the community, and, in general, taking care of many important and visible kernel development chores. The way he works is changing, though, so we asked him a few questions by email. He responded at length about the -mm tree and how it is changing with the advent of linux-next, kernel quality, and what folks can do to help make the kernel better.

Years ago, there was a great deal of worry about the possibility of burning out Linus. Life seems to have gotten easier for him since then; now instead, I've heard concerns about burning out Andrew. It seems that you do a lot; how do you keep the pace and how long can we expect you to stay at it?

I do less than I used to. Mainly because I have to - you can't do the same thing at a high level of intensity for over five years and stay sane.

I'm still keeping up with the reviewing and merging but the -mm release periods are now far too long.

There are of course many things which I should do but which I do not.

Over the years my role has fortunately decreased - more maintainers are running their own trees and the introduction of the linux-next tree (operated by Stephen Rothwell) has helped a lot.

The linux-next tree means that 85% of the code which I used to redistribute for external testing is now being redistributed by Stephen. Some time in the next month or two I will dive into my scripts and will find a way to get the sufficiently-stable parts of the -mm tree into linux-next and then I will hopefully be able to stop doing -mm releases altogether.

So. The work level is ramping down, and others are taking things on.

What can we do to help?

I think code review would be the main thing. It's a pretty specialised function to review new code well. The people who specialise in the area which the new code is changing are the best reviewers but unfortunately I will regularly find myself having to review someone else's stuff.

Secondly: it would help if people's patches were less buggy. I still have to fix a stupidly large number of compile warnings and compilation errors and each -mm release requires me to perform probably three or four separate bisection searches to weed out bad patches.

Thirdly: testing, testing, testing.

Fourthly: it's stupid how often I end up being the primary responder on bug reports. I'll typically read the linux-kernel list in 1000-email batches once every few days and each time I will come across multiple bug reports which are one to three days old and which nobody has done anything about! And sometimes I know that the person who is responsible for that part of the kernel has read the report. grr.

Is it your opinion that the quality of the kernel is in decline? Most developers seem to be pretty sanguine about the overall quality problem. Assuming there's a difference of opinion here, where do you think it comes from? How can we resolve it?

I used to think it was in decline, and I think that I might think that it still is. I see so many regressions which we never fix. Obviously we fix bugs as well as add them, but it is very hard to determine what the overall result of this is.

When I'm out and about I will very often hear from people whose machines we broke in ways which I'd never heard about before. I ask them to send a bug report (expecting that nothing will end up being done about it) but they rarely do.

So I don't know where we are and I don't know what to do. All I can do is to encourage testers to report bugs and to be persistent with them, and I continue to stick my thumb in developers' ribs to get something done about them.

I do think that it would be nice to have a bugfix-only kernel release. One which is loudly publicised and during which we encourage everyone to send us their bug reports and we'll spend a couple of months doing nothing else but try to fix them. I haven't pushed this much at all, but it would be interesting to try it once. If it is beneficial, we can do it again some other time.

There have been a number of kernel security problems disclosed recently. Is any particular effort being put into the prevention and repair of security holes? What do you think we should be doing in this area?

People continue to develop new static code checkers and new runtime infrastructure which can find security holes.

But a security hole is just a bug - it is just a particular type of bug, so one way in which we can reduce the incidence rate is to write fewer bugs. See above. More careful coding, more careful review, etc.

Now, is there any special pattern to a security-affecting bug? One which would allow us to focus more resources on preventing that type of bug than we do upon preventing "average" bugs? Well, perhaps. If someone were to sit down and go through the past five years' worth of kernel security bugs and pull together an overall picture of what our commonly-made security-affecting bugs are, then that information could perhaps be used to guide code-reviewers' efforts and code-checking tools.

That being said, I have the impression that most of our "security holes" are bugs in ancient crufty old code, mainly drivers, which nobody runs and which nobody even loads. So most metrics and measurements on kernel security holes are, I believe, misleading and unuseful.

Those security-affecting bugs in the core kernel which affect all kernel users are rare, simply because so much attention and work gets devoted to the core kernel. This is why the recent splice bug was such a surprise and head-slapper.

I have sensed that there is a bit of confusion about the difference between -mm and linux-next. How would you describe the purpose of these two trees? Which one should interested people be testing?

Well, things are in flux at present.

The -mm tree used to consist of the following:

  • 80-odd subsystem maintainer trees (git and quilt), eg: scsi, usb, net.
  • various patches which I picked up which should be in a subsystem maintainer's tree, but which for one of various reasons didn't get merged there. I spend a lot of time acting as backup for leaky maintainers.
  • patches which are mastered in the -mm tree. These are now organised as subsystems too, and I count about 100 such subsystems which are mastered in -mm. eg: fbdev, signals, uml, procfs. And memory management.
  • more speculative things which aren't intended for mainline in the short-term, such as new filesystems (eg reiser4).
  • debugging patches which I never intend to go upstream.

The 80-odd subsystem trees in fact account for 85% of the changes which go into Linux. Pretty much all of the remaining 15% are the only-in-mm patches.

Right now (at 2.6.26-rc4 in "kernel time"), the 80-odd subsystem trees are in linux-next. I now merge linux-next into -mm rather than the 80-odd separate trees.

As mentioned previously, I plan to move more of -mm into linux-next - the 100-odd little subsystem trees.

Once that has happened, there isn't really much left in -mm. Just

  • the patches which subsystem maintainers leaked. I send these to the subsystem maintainers.
  • the speculative not-for-next-release features
  • the not-to-be-merged debugging patches.

Do you have any specific goals for the development of the kernel over the next year or so? What would they be?

Steady as she goes, basically.

I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there! Eventually we should get into more of a maintenance mode where we just fix bugs, tweak performance and add new drivers. Famous last words.

And it's just vaguely possible that we're starting to see that happening now. I do get a sense that there are fewer "big" changes coming in. When I sent my usual 1000-patch stream at Linus for 2.6.26 I actually received an email from him asking (paraphrased) "hey, where's all the scary stuff?"

In the early-May discussions, Linus said a couple of times that he does not think code review helps much. Do you agree with that point of view?

Nope.

How would you describe the real role of code review in the kernel development process?

Well, it finds bugs. It improves the quality of the code. Sometimes it prevents really really bad things from getting into the product. Such as rootholes in the core kernel. I've spotted a decent number of these at review time.

It also increases the number of people who have an understanding of the new code - both the reviewer(s) and those who closely followed the review are now better able to support that code.

Also, I expect that the prospect of receiving a close review will keep the originators on their toes - make them take more care over their work.

There clearly must be quite a bit of communication between you and Linus, but much of it, it seems, is out of the public view. Could you describe how the two of you work together? How are decisions (such as when to release) made?

Actually we hardly ever say anything much. We'll meet face-to-face once or twice a year and "hi how's it going".

We each know how the other works and I hope we find each other predictable and that we have no particular issues with the other's actions. There just doesn't seem to be much to say, really.

Is there anything else you would like to say to LWN's readers?

Sure. Please do contribute to Linux, and a great way of doing that is to test latest mainline or linux-next or -mm and to report on any problems which you encounter.

Nothing special is needed - just install it on as many machines as you dare and use them in your normal day-to-day activities.

If you do hit a bug (and you will) then please be persistent in getting us to fix it. Don't let us release a kernel with your bug in it! Shout at us if that's what it takes. Just don't let us break your machines.

Our testers are our greatest resource - the whole kernel project would grind to a complete halt without them. I profusely thank them at every opportunity I get :)
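
As a concrete sketch of what "just install it" can look like in practice (the tree URL and build steps below reflect common practice and are the editor's illustration, not instructions from Andrew):

    # Fetch Linus's tree; linux-next and -mm are published separately.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    cd linux-2.6
    # Start from the distribution's known-good configuration.
    cp /boot/config-$(uname -r) .config
    make oldconfig                  # answer prompts for options new to this release
    make -j4                        # adjust -j to the number of CPUs
    sudo make modules_install install
    # Reboot into the new kernel and use the machine normally.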

We would like to thank Andrew for taking time to answer our questions.


Andrew Morton on kernel development

Posted Jun 11, 2008 16:16 UTC (Wed) by Hanno (guest, #41730) [Link]

"I do think that it would be nice to have a bugfix-only kernel release."

Yes, please.

Andrew Morton on kernel development

Posted Jun 11, 2008 17:10 UTC (Wed) by MisterIO (guest, #36192) [Link]

It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.

Andrew Morton on kernel development

Posted Jun 11, 2008 17:27 UTC (Wed) by hmh (subscriber, #3838) [Link]

It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.

Which many will do, causing total chaos in the next merge window. That's the reason why it was not done yet, AFAIK.

Now, if we could get a sufficiently big number of kernel regulars (like at least 50% of the ones with more than three patches merged in the last three releases) and all subsystem maintainers (so as to keep the new-feature-craze crowd under control) to pledge to the big bugfix experiment, then it just might work.

Andrew Morton on kernel development

Posted Jun 11, 2008 17:59 UTC (Wed) by proski (subscriber, #104) [Link]

It's not a matter of making developers do something else. It's a priority thing. Most developers work both on new features and on bugfixes. Sometimes bugs are exposed as the code is modified to include new features.

If some kernel is declared stable, it means that only bugfixes are accepted. In other words, the merge window is skipped. To make the point, the previous kernel could be tagged as rc1 for the stable kernel.

I don't know if it's going to work, but it may be worth trying once.

Andrew Morton on kernel development

Posted Jun 11, 2008 17:34 UTC (Wed) by cdamian (subscriber, #1271) [Link]

I preferred the odd/even system we had before 2.6.

I also gave up on reporting kernel bugs. Usually I am the only person with that bug and hardware configuration, and nobody will fix it. This is not specific to the kernel, though. I think I never got any of the bugs which I reported to Fedora, Red Hat or GNOME fixed.

Two other things: is the kernel bugzilla used at all? Are there any tests, like unit tests, to catch regressions in the kernel? Both are pretty standard for any other open source project nowadays.

Andrew Morton on kernel development

Posted Jun 11, 2008 18:52 UTC (Wed) by grundler (guest, #23450) [Link]

> I also gave up on reporting kernel bugs.

I'm sorry to hear that. I know that reporting bugs is a lot of work.

> Usually I am the only person with that bug and hardware 
> configuration and nobody will fix it.

If no one else really has that HW, then there could be lots of reasons:
1) They don't care - many developers don't care about parisc, sparc, 100VG or token-ring networking, or about scaling up or down (embedded vs. large systems), etc.
2) They don't have documentation for the offending HW.
3) No one else was able to reproduce the bug and it's not obvious what is wrong.

> This is not specific to the kernel though. I think I
> never got any of the bugs which I reported to fedora,
> red hat or gnome fixed.

Before someone else suggests these, maybe the way the bugs are reported has something to do with the response rate? There are some good essays/resources out there on how to file useful bug reports. I don't want to suggest yours are not useful, since I've never seen one (or don't know if I have). It's just that when you mention problems across all open source projects, I wonder.

> Two other things: is the kernel bugzilla used at all?
> are there any tests like unit tests to catch regressions for the kernel?
> both are pretty standard for any other open source project nowadays.

Agreed. But to be clear, the kernel is a bit different from most open source projects, since it controls HW and lots of buggy BIOS flavors.

(1) I'm using bugzilla.kernel.org to track tulip driver bugs. Not everyone is doing that. It has helped that akpm has (had?) funding (from Google?) for someone to help clean up and poke maintainers about outstanding bugs. Despite not everyone using it, it's still a better tracking mechanism than sending an email to lkml. Do both: email to get attention, bugzilla to track details. But also send bug reports to topic-specific lists, since it's more likely that people who care about your HW will notice the report.

(2) Not that I'm aware of. The kernel interacts with HW a lot, and it's very difficult to emulate or "mock" that interaction. Not impossible, just hard, and the emulation can almost never capture all the nuances of broken HW (see drivers/net/tg3.c for examples). Secondly, we can very often only test large subsystems, or several subsystems at once: e.g. a file system test almost always ends up stressing the VM and IO subsystems, and networking stresses the DMA and sk_buff allocators. UML and other virtualization of the OS make it possible to test some subsystems without specific HW. However, there are smaller pieces of the kernel which can be isolated and tested: e.g. bit ops (i.e. ffs()), resource allocators, etc. It's just a lot of work to automate the testing of those bits of code. But this is certainly a good area to contribute to if someone wants to learn how kernel code works (or doesn't :)).

For testing subsystems, see autotest.kernel.org and http://ltp.sourceforge.net/. autotest is attempting to find regressions during the development cycle.

Andrew Morton on kernel development

Posted Jun 11, 2008 19:02 UTC (Wed) by nbarriga (guest, #49347) [Link]

It seems that autotest.kernel.org doesn't exist...

Andrew Morton on kernel development

Posted Jun 11, 2008 22:57 UTC (Wed) by erwbgy (subscriber, #4104) [Link]

That should be http://test.kernel.org/ and http://test.kernel.org/autotest for documentation.

Andrew Morton on kernel development

Posted Jun 12, 2008 2:23 UTC (Thu) by grundler (guest, #23450) [Link]

Yes - I meant http://test.kernel.org. Sorry about that.

Andrew Morton on kernel development

Posted Jun 11, 2008 20:01 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think there's a substantial difference between the way he phrased the suggestion here and what I've seen before. People tend to think of a bugfix-only release as one in which the mainline only merges bugfixes. Simply making that the policy would almost certainly lead to no more bugfixes than usual, and twice as many features hitting the following release window.

On the other hand, if the process were driven from the other end, it might work: spend some period collecting a lot of unfixed bugs and saturate developers' development time with them; in the cycle after that, there ought to be a lot of bugfixes and no new features, simply because all that will have matured at the merge window will be bugfixes.

So, if there were a period with a campaign to collect long-standing bugs and regressions against non-recent versions, with the aim of having all of these resolved in a particular future version as the main goal for that release, I think that would be useful.

Andrew Morton on kernel development

Posted Jun 11, 2008 19:39 UTC (Wed) by job (guest, #670) [Link]

I've been bitten by some bugs earlier in the 2.6 series, but I have not had any trouble since around 2.6.18, I believe. It may be luck, or it may be hard work from Andrew and everyone else involved. Thank you, everyone!

Sometimes it is depressing

Posted Jun 11, 2008 21:47 UTC (Wed) by mikov (guest, #33179) [Link]

Sometimes I get depressed when thinking about the kernel, mostly because I feel powerless to affect it in any way - I can't sponsor somebody to work on fixing bugs (that would be the ideal case), and unfortunately in most cases I don't have the expertise to fix bugs myself.

For example, only recently I discovered, to my utter amazement, that USB 2.0 still doesn't work well! I tried to connect a simple USB->serial converter and it started failing in mysterious ways - e.g. it would work 80% of the time, but then there would be a lost byte, etc.

There are workarounds (disabling USB 2.0 in the BIOS, unloading the USB 2.0 modules, using a USB 1.1 hub, etc.), but it is depressing that USB 2.0, which is on practically 100% of all machines, doesn't work. Of course it works nicely under Windows.

I eventually dug out a couple of messages from Greg KH explaining that it has been a known problem for a long time (I don't remember the exact details), but there is simply not enough interest in fixing it.

This is *not* an issue of undocumented hardware!

I can't really complain, since I am not paying for Linux, but it is ... I already said it ... depressing.

Sometimes it is depressing

Posted Jun 11, 2008 22:11 UTC (Wed) by dilinger (subscriber, #2867) [Link]

You don't have to sponsor developers; just send them the misbehaving hardware.  Chances are
good that if it's useful hardware, it'll get fixed.

Sometimes it is depressing

Posted Jun 11, 2008 22:28 UTC (Wed) by mikov (guest, #33179) [Link]

I am afraid it is not that simple.

I am sure that there isn't a single developer without a USB 2.0 PC, so there is no point in sending them anything. USB 2.0 hubs can be bought for about $30 (and PCs have hubs built in anyway); add another $10 for a USB->serial converter. I don't mind spending that if it would improve the kernel.

As I mentioned, this is not a case of undocumented or expensive hardware. The USB 2.0 kernel subsystem is apparently not quite ready and can't handle USB 2.0 hubs. At least that is my understanding - I could be wrong.

Even assuming that it made sense to send hardware, where should I send it?

Sometimes it is depressing

Posted Jun 11, 2008 22:32 UTC (Wed) by dilinger (subscriber, #2867) [Link]

I *highly* doubt this is a USB 2.0 host problem. More likely, it's a problem with the specific USB device that you're using, or a host bug that's only triggered by your USB device. There are plenty of buggy USB devices out there.

I've used plenty of USB 2.0 devices with no problems. I've also used USB serial adapters with no problems at all. However, your specific USB serial adapter is clearly problematic, and that's not something that other people are likely to see unless they have the same hardware that you have.

Sometimes it is depressing

Posted Jun 11, 2008 23:08 UTC (Wed) by mikov (guest, #33179) [Link]

The device is fine. The USB converter uses the Prolific chip, which as far as I can tell is
one of the most common ones and highly recommended for Linux. I have several different
converters using it, including a $350 industrial 8-port one. They all fail (also on machines
with different USB chipsets) as long as USB 2.0 is enabled. The failure is fairly subtle, so
it is not always immediately obvious.

Needless to say, all converters work flawlessly under Windows ...

See this post:
http://lkml.org/lkml/2006/6/12/279

To quote from further down the thread:
"Yeah, it's a timing issue with the EHCI TT code.  It's never been
"correct" and we have had this problem since we first got USB 2.0
support.  You were just lucky in not hitting it before"

BTW, I last tried this with a fairly recent kernel (2.6.22).

Sometimes it is depressing

Posted Jun 11, 2008 23:58 UTC (Wed) by walken (subscriber, #7089) [Link]

Eh, I have that chip too.

I don't know if it's got anything to do with Linux (my understanding is that the chip asks to be polled over USB every millisecond, and only 1000 frames can go over the USB bus per second, so the device won't work if it has to share the USB bus with anything else).

There is an easy workaround: plug this device into a port where it won't have to share the bus with any other device. I.e., if you have two USB ports on your machine, plug the Prolific chip into one of them and everything else into a hub on the other port.

I have no idea whether things are better in Windows; I thought it was an issue with the USB device itself.

BTW, did you try the USB_EHCI_TT_NEWSCHED thing discussed in that thread?
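
For anyone following along, a quick sketch of how to check for that option (the paths are common distro conventions; /proc/config.gz exists only on kernels built with CONFIG_IKCONFIG_PROC):

    # Is the running kernel built with the experimental TT scheduling code?
    grep EHCI_TT_NEWSCHED /boot/config-$(uname -r)
    zgrep EHCI_TT_NEWSCHED /proc/config.gz   # alternative source for the config
    # When building your own kernel, enable the option under
    # Device Drivers -> USB support, then verify:
    grep EHCI_TT_NEWSCHED .config            # expect CONFIG_USB_EHCI_TT_NEWSCHED=y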

Sometimes it is depressing

Posted Jun 12, 2008 0:24 UTC (Thu) by mikov (guest, #33179) [Link]

I am fairly certain the problem is not related to sharing the USB bus. I had four of those converters connected to an ordinary USB hub, working 100% reliably as long as USB 2.0 was disabled.

Plus, you can buy a fairly expensive (hundreds of dollars) multi-port converter which internally is nothing more than a couple of cascaded USB hubs and pl2303 chips. I hope that they wouldn't be selling such devices if the underlying chip was fundamentally broken.

Lastly, it all works peachy in Windows.

I tried USB_EHCI_TT_NEWSCHED (it is included in 2.6.22), but it didn't fix it. Alas, I didn't have the chance to dig too deep (and I am not a USB expert, although I have done kernel programming) - sometimes it took many hours to reproduce the errors, and using USB 1.1 solved my immediate problem.

When I saw Greg KH's explanation that there are problems in the USB 2.0 implementation that have been known for years, I lost hope of improving the situation constructively.

Perhaps I should pick it up again. What is the best forum to report this problem? Apparently not the kernel bugzilla? :-)

Sometimes it is depressing

Posted Jun 12, 2008 3:11 UTC (Thu) by dilinger (subscriber, #2867) [Link]

You'll note the wording Greg KH used: "should be fixed", etc. Mark Lord had to report back that it was still broken. If Greg KH actually had the hardware available to reproduce it, development and fix time would be much quicker.

As for bugs that have been known for years: this is free software. The only people who are going to fix it are those who are either paid to do so, or have an itch to scratch because their hardware is not working correctly. The fact that this is a corner case, and has an easy workaround, makes it pretty clear why it has taken so long to get fixed. I fail to see what's so depressing.

It's hard enough reproducing bugs when you have the hardware, but not having it available makes fixing bugs many times more difficult (and kills much of the motivation to do anything about it).

Sometimes it is depressing

Posted Jun 12, 2008 4:23 UTC (Thu) by mikov (guest, #33179) [Link]

I don't think that this is a corner case at all. It is unacceptable to have random devices fail subtly and quietly when connected to a standard bus, especially when such a fundamental and established interface as USB is concerned.

It is disappointing that the kernel has known bugs of this nature which are not being addressed. The problem is not so much that my particular device doesn't work.

The depressing part is that it _really_ is nobody's fault. The development model is what it is. There is nothing better and there is nothing we can do about it.

Red Hat is not going to pay for fixing this because they don't care about desktops with random hardware. Canonical is not going to fix it because they don't contribute that much to the kernel. Nobody is going to pay for fixing it.

There is nothing to be done. That is depressing.


Sometimes it is depressing

Posted Jun 12, 2008 5:19 UTC (Thu) by dilinger (subscriber, #2867) [Link]

It *is* a corner case. A device is plugged into a USB 1.1-only hub which is plugged into a USB 2.0 port. From the thread, my assumption is that the kernel (ehci) thinks 2.0 is supported because the host supports it, and thus attempts to talk 2.0 to the device. The hub in the middle screws things up. Bypass the USB 1.1 hub, and things work just fine. If that's _not_ what you're doing, then you are seeing a different bug.

Sometimes it is depressing

Posted Jun 12, 2008 14:15 UTC (Thu) by mikov (guest, #33179) [Link]

This is not what is happening. The problem occurs when a USB 1.1 device is plugged into a USB 2.0 hub. AFAICT, this matches the description of the bug referenced in Greg KH's post.

This is a frequent case - there are many USB 1.1 devices, but at the same time all hubs that can be purchased right now are 2.0.

I suspect that most people are not seeing the problem simply because few people actually use hubs. Since the problem is subtle - a couple of lost bytes every couple of hours - most people wouldn't recognize it anyway.

Sometimes it is depressing

Posted Jun 12, 2008 20:09 UTC (Thu) by nhippi (guest, #34640) [Link]

Sometimes it's depressing to see how many posts some people bother to write about their
problems to a random forum, when with the same amount of energy one could have filed a bug in
bugzilla.kernel.org ...

Sometimes it is depressing

Posted Jun 12, 2008 21:22 UTC (Thu) by mikov (guest, #33179) [Link]

It is even more depressing when the Slashdot trolls start posting on LWN.

First of all, this is not some random forum. Secondly, had you bothered to read the messages, you'd have seen that the bug is already known. Lastly, in case you missed it, the subject is not my specific problem, but the philosophical futility of reporting bugs in something free.

Incidentally, it appears that you don't even realize how much effort and time it takes to make a useful bug report. It is ironic that some people find it more acceptable to pollute bugzilla with useless whining complaints, rather than discussing them in a forum.

Sometimes it is depressing

Posted Jun 13, 2008 17:27 UTC (Fri) by dilinger (subscriber, #2867) [Link]

Once again: no.  The original reporter says that when he plugs the pl2303 device directly into
the USB2.0 hub, it works just fine.  It's only when it goes through a USB1.1 dock/hub that it
fails.

So, once again: YOU ARE TALKING ABOUT SOMETHING COMPLETELY DIFFERENT FROM THE LINK YOU POSTED.

Most people aren't seeing the problem because most USB1.1 devices work just fine in USB2.0
hubs.  The problem described in the link you supplied is a corner case (some weird built-in
serial adapter in a hub/dock thingy).  The problem you've described sounds like it's specific
to some portion of your hardware.

I dug through my hardware pile and found a pl2303.  It works just fine in a USB2.0 port.  If
you want to moan about how depressing kernel development is, that's fine; but claiming that
it's hopeless when you refuse to get involved is just silly.

Sometimes it is depressing

Posted Jun 13, 2008 18:09 UTC (Fri) by mikov (guest, #33179) [Link]

Most people aren't seeing the problem because most USB1.1 devices work just fine in USB2.0 hubs. The problem described in the link you supplied is a corner case (some weird built-in serial adapter in a hub/dock thingy). The problem you've described sounds like it's specific to some portion of your hardware.

Sigh. I explained this a couple of times. It is not specific to my hardware. As I already said, I have tested this with several different pl2303 converters, including very expensive ones. I have tested it on different machines with different USB chipsets. I have even tested a couple of different kernel versions. I am not an idiot, you know :-)

The description of the problem is simple and I don't see why I have to keep repeating it over and over. Apparently USB1.1 devices have problems when plugged into USB 2.0 hubs.

I agree that it is not exactly the same thing as described in the linked post, but it looks dangerously similar, and as of 2.6.22 the proposed fix was still marked experimental.

I also agree that it is theoretically possible that only the PL2303 driver has this specific problem. I don't think that is the case though.

I dug through my hardware pile and found a pl2303. It works just fine in a USB2.0 port. If you want to moan about how depressing kernel development is, that's fine; but claiming that it's hopeless when you refuse to get involved is just silly.

See above. If you really want to show me that I don't understand anything, try it with a USB 2.0 hub. Run it for 24 hours, checking whether there is even a single missed byte in either direction. Then tell me that "it works just fine".

Also, I am not refusing to get involved. Did you not see my post asking what a good venue for reporting this problem would be?

I didn't mean to discuss this particular issue in depth. I did not ask for advice on fixing it. I used it just as an example - apparently not a very good one, because my point did not get through.

In Windows, at least theoretically, either the manufacturer or Microsoft is responsible for doing something if there is a problem. In Linux there is generally no responsibility, unless you purchase a support contract which is much more expensive than a copy of Windows. How would you characterize that?

Sometimes it is depressing

Posted Jun 17, 2008 22:21 UTC (Tue) by phiggins (guest, #5605) [Link]

What I find far more depressing is when you've paid a company several million dollars for a
support contract and they still don't fix your bugs (I've seen this happen several times).
Every software project--free or not--has finite developer resources. Some bugs will take a
certain amount of time to fix no matter how many dollars or people are thrown at the issue.
You're making it sound like this problem only exists for Linux when I've seen it far more
often with proprietary software. The problem is fundamental and will not go away, but I've
found Linux to do a better job of handling it than anything else I've seen. It's still not
perfect, but it can't be. The only way you can be guaranteed to get your problem fixed is to
have the ability to fix it yourself. With Linux, you theoretically have that option. With
proprietary software, you don't.

Sometimes it is depressing

Posted Jun 12, 2008 1:28 UTC (Thu) by proski (subscriber, #104) [Link]

Bisecting bugs doesn't require deep knowledge. It requires a fast computer and some time to test kernels for the problem. And you keep the computer after you're done :-)

I realize that the problem you have with USB 2.0 is not bisectable, but many other problems are.
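
For readers who have never tried it, here is a minimal sketch of the bisection workflow being described (the version numbers are only examples):

    # Suppose v2.6.24 worked and v2.6.25 shows the bug.
    cd linux-2.6
    git bisect start
    git bisect bad v2.6.25      # first known-bad release
    git bisect good v2.6.24     # last known-good release
    # git checks out a commit roughly halfway between the two;
    # build and boot it, test for the bug, then report the result:
    git bisect good             # ...or "git bisect bad"
    # Repeat until git names the commit that introduced the bug, then:
    git bisect reset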

Sometimes it is depressing

Posted Jun 12, 2008 12:13 UTC (Thu) by ekj (guest, #1524) [Link]

That is only true if:

a) You've got a simple test that -always- reproduces the bug on one kernel.

b) You're aware of at least one kernel where the bug does NOT happen.

Most of the (suspected!) kernel bugs I've run into in my years of running Linux (since 1.2.13) have fulfilled neither of these two.


My issues with kernel development

Posted Jun 12, 2008 2:21 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

First of all, I wish to thank Andrew for his thoughts and time in responding. Such discussion relating to kernel development is refreshing.

As for reporting bugs, I've two in particular with 2.6.25[.x] that I've been loath to report:

  1. make oldconfig doesn't appear to work anywhere near the way I expected. After running it, I had a .config for a Summit subarch with drivers enabled for all sorts of devices I don't have in my 32-bit x86 PC.
  2. Inserting a CD/DVD into either of my DVD-RW drives and subsequently issuing an appropriate mount(8) command had the shell gripe at me that no media was present. However, waiting 15-20 seconds between closing the drive door and issuing the mount command works fine (see the sketch below). This behavior differs from 2.6.24 (and earlier).

Granted, I was hesitant to report either of these (until now) because I was unsure whether (1) was operator error, and whether (2) was expected new behavior given the patches submitted for 2.6.25. Plus, I didn't want to add yet another message to the (already crowded) LKML. But, I'm curious: would reporting either of these be appropriate?
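
For concreteness, the failing sequence in item 2 looks roughly like this (the device node, mount point, and error wording are illustrative):

    # Immediately after closing the drive door, on 2.6.25:
    mount /dev/dvd /mnt/dvd     # fails, complaining that no media is present
    sleep 20
    mount /dev/dvd /mnt/dvd     # succeeds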

I would certainly love to contribute to the kernel development project--I even subscribed to the LKML (having been inspired by our editor's eagerness to help with the Kill-The-BKL project)--but being a newbie, I could use a little guidance.

Thanks again to Andrew for his candor.

My issues with kernel development

Posted Jun 12, 2008 5:00 UTC (Thu) by grundler (guest, #23450) [Link]

Yes - Andrew definitely deserves the kudos he gets.

(1) isn't really a bug, since "make oldconfig" is "expected behavior". Try "make menuconfig" and see if a menu-driven config tool works better for you. Too often, I find the "Help" text useless, and I'm not a kernel newbie. Updating those texts to be meaningful (e.g. spelling out uncommon acronyms) would help a lot of people.
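
As a quick sketch of the two workflows being compared (file paths follow the usual kbuild conventions):

    # oldconfig: reuse an existing .config, prompting only for new symbols
    cp /boot/config-$(uname -r) .config
    make oldconfig
    # menuconfig: browse and edit the whole configuration interactively
    make menuconfig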

(2) is a regression and sounds like it's bisectable. In fact, a recent bug on linux-scsi sounds similar to this, though it might not be the same:
    http://marc.info/?t=121229388800003&r=1&w=2

So reporting the problem to linux-ide and/or linux-scsi might be a good starting point. You don't have to report problems to LKML, since there are plenty of topic-specific mailing lists that have less traffic. See the linux-2.6.25/MAINTAINERS file for the various mailing lists. If you post to the wrong list, people generally will redirect you to the right one. As Andrew suggests, be persistent.

Lastly, regarding being a newbie: try http://kernelnewbies.org, which is one of many starting points. Help with documentation, code review, or testing is usually something anyone with a computer can do - especially if you are finding bugs, willing to report them, and willing to test out (likely bad) theories about the bug. This interaction will lead to learning lots of new stuff.

Please report it in bugzilla

Posted Jun 12, 2008 6:41 UTC (Thu) by mingo (subscriber, #31122) [Link]

(1) isn't really a bug, since "make oldconfig" is "expected behavior". Try "make menuconfig" and see if a menu-driven config tool works better for you. Too often, I find the "Help" text useless, and I'm not a kernel newbie. Updating those texts to be meaningful (e.g. spelling out uncommon acronyms) would help a lot of people.

It is a serious upstream kernel regression if "make oldconfig" (used on a .config that worked with a previous version of the kernel) suddenly breaks a working setup. Please report it if you get hit by such a bug/regression and it will be fixed.

We'd be shooting ourselves in the foot if we made it harder to test new kernels.

Ditto for the second bug - if mounting CDs worked well before and it suddenly starts producing spurious "no media" mount failures, that's a plain bug/regression.

Please report them on bugzilla.kernel.org.

Also, if you test new kernels, make sure you run the kerneloops client which automatically reports crashes to kerneloops.org.

Please report it in bugzilla

Posted Jun 12, 2008 7:51 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Thank you both for the replies.

I'm beginning to wonder if the two issues I have with 2.6.25[.x] are related in a weird way:

  1. make oldconfig wasn't working right, so I had to run make menuconfig and manually select the kernel options. I may have [un]selected a bunch of options I [don't] use for my hardware, thus...
  2. ...the CD/DVD drives may have been acting flaky due to various IDE/SCSI/SATA drivers [not] in the kernel, or functioning differently based on my userspace configuration (I was running Slackware 12.0 when I tried 2.6.25.4 and noted these anomalies) and the odd config used when building the kernel.

Another reason why I was loath to report the CD/DVD not mounting issue was because I have some unusual IDE/SATA hardware in my system (a Promise PDC20271 ATA controller card, a Silicon Image SATA controller card, one of the two DVD burners is IDE whilst the other is SATA), but Linux has ordinarily given me no grief whatsoever for running this odd configuration (I also have a mix of IDE and SATA hard drives and a software RAID-0, but that's another story). Again, I must stress that this could all be a silly case of operator error (I'm good at finding these kinds of bugs ;-) ), or maybe it is a defect that needs the attention of the kernel developers...

I will admit that I'm somewhat of an informal kernel tester; I've compiled and run recent (-stable) kernels for the past 3 1/2 years now (thus explaining why I like Slackware--it works well with vanilla kernels), and I've only had to report one show-stopper (Oops in 2.6.15 due to NULL dereference in usbhid.c).

Thanks again for your replies; I'll look into reporting the make oldconfig issue.

Please report it in bugzilla

Posted Jun 12, 2008 14:25 UTC (Thu) by iabervon (subscriber, #722) [Link]

It's possible to have user error with "make oldconfig" on the first try (like feeding it the wrong config), but if you can reproduce the problem, it's worth reporting. (And if you can't reproduce it, you'll have a correct config...)

There was someone recently reporting problems with mounting optical media if he waited more than 30 seconds after inserting it. It might be related, or it might be a coincidence, but you might want to look into http://lkml.org/lkml/2008/6/6/170. The thread is kind of inconclusive, but you might be able to help, since you've got a different failure pattern (you need to wait, while other people need to hurry) but also have a problem with timing and optical media insertion that came up between 2.6.24 and 2.6.25. It's got things to try, anyway.

Make oldconfig issue has been reported in Bugzilla

Posted Jun 12, 2008 15:23 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

I opened a bugzilla bug (#10898) on the make oldconfig issue. Apparently this is a regression reported by Linus himself, and a patch is in the works. Make oldconfig worked fine for 2.6.24.x (as I mentioned in the bugzilla description).

As for the mount(8) CD/DVD issue, well, I'll test that later this afternoon... Time to go to work... I'm still not discounting the possibility that a kernel built from a funky config, combined with my strange mix of hardware (see above - yes, that's all one PC!), might have caused this anomaly.

Does -staging obsolete -mm?

Posted Jun 12, 2008 7:32 UTC (Thu) by dberkholz (guest, #23346) [Link]

It seems like there's a pretty large overlap between the new -staging tree and much of what
-mm does. Thus the question in the subject. =)

Andrew Morton on kernel development

Posted Jun 13, 2008 17:22 UTC (Fri) by giraffedata (guest, #1954) [Link]

I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there! Eventually we should get into more of a maintenance mode where we just fix bugs, tweak performance and add new drivers. Famous last words.

Am I reading this out of context, or is Andrew taking the position that everything's already been invented?

Andrew Morton on kernel development

Posted Jun 19, 2008 13:54 UTC (Thu) by Duncan (guest, #6647) [Link]

Maybe a bit of both?

I've seen discussion of this theory before on LWN, along with amazement that it hadn't slowed down yet.

There are a number of dynamics in play here, of which I'll consider only a couple.

The big one is that for many years, Linux was playing catch-up; that is, the state of the art in kernel technology was so far ahead of Linux that it had to move at well more than double-time in order to have any hope of catching up in something like computer-evolution-reasonable time. That Linux was actually doing it surprised a LOT of people, and was a major point behind the SCO suit -- they thought /surely/ IBM or /somebody/ must be "cheating", in order for Linux to be evolving as incredibly fast as it was toward, at that point and for what they were concerned about, a real "enterprise" kernel. Well, we all know where /that/ ended up -- there was little if any cheating going on; it was real "organic" growth, but at a speed nobody could really account for according to previous models, because the Linux model really /is/ different. At the same time, however, it /did/ make us more careful, prompting the introduction of better origins documentation and signed-off-by.

In theory, while various (now) peer kernels may still be more mature than Linux in some areas, that space is largely gone -- we're caught up, or close enough that the speed of change should be slowing toward that of the more mature kernels as we match them and forge into new territory on our own. However, this has been predicted since at least the late 2.4.teen kernels, and it just didn't appear to be happening. In hindsight, we weren't as mature as we thought we were back then (a common observation in life, I might add, as one advances in years =8^S) and we still had more growing to do.

Since the 2.6 series, however, there /have/ been some observable changes toward this end. While the raw volume of change hasn't really slacked off yet, the "scariness" of the changes has been decreasing. The first big change was the switch away from the odd/even cycle. At first, people thought it would be relatively temporary -- a couple of years, possibly, before something "big and disruptive" enough to all systems would really need an alternate development tree in which to coordinate all the changes, forcing the opening of a new official development tree. That hasn't happened. We've managed, due both to somewhat smaller, less-system-wide-disruptive changes, and to an accommodation of more medium-scale changes into the ongoing stable kernel. That this arrangement has continued to work is an indication of relative maturity, both in featureset and in development team and method. The disruptive scale has been reduced both in absolute terms and because we are better able to cope with it in stride than ever before.

That was the first big indication that the kernel was maturing, although raw change continued at, if anything, an increased pace. A second, more recent indication, which may or may not prove out over time, is the lack of "scariness" in now a couple of kernels in a row. If the above change could be said to mark the transition from large to medium-large disruption and the ability to handle it, this new one /may/ be the first indication of the next level, moving from medium-large to simply medium-sized change. It should be noted that while two kernels in a row is somewhat notable, it does not a safe trend make as yet. If we see this continue through the end of the year -- say a couple more kernels in a row, for four, or only three but with only one scary one and then back to "medium" -- then it's probably safe to say there's a marked trend.

However, that's nowhere near suggesting that everything has been invented now, only that we're finally catching up with the state of the art sufficiently, while at the same time enhancing our ability to cope in stride with what might formerly have been disruptive, so that things will normally slow down a bit as it becomes /us/ doing the pioneering, breaking the new ground.

Put in the large > medium-large > medium language above, that's basically saying we might /possibly/ expect one more notch, to medium-small in the ordinary case, before we settle into a continuing sustainable pace as the new pioneers, where progress is much more hard-fought because nobody's been there before. I don't believe it will slow down much beyond that -- and would actually hope it doesn't -- nor do I believe many people are suggesting that it will. Even then, there are likely to be occasional clusters of difficulty and increased change, back into the medium to medium-large zone for a kernel or three, before settling back into the medium-small zone. However, the prediction is that as we increasingly do our own pioneering, the average will drop to no higher than medium, with the outliers being only medium-large; large-to-hugely-disruptive changes will be a thing of the past, as on the forefront things tend to be much more incremental.

That's my view from this observation point. =8^)

Duncan

Andrew Morton on kernel development

Posted Jun 26, 2008 6:13 UTC (Thu) by jturning (guest, #52690) [Link]

Great interview. Thanks Andrew for all the hard work.

A "green list" of hardware?

Posted Oct 22, 2008 1:45 UTC (Wed) by sahilahuja (guest, #54826) [Link]

Is there a "green list" of hardware supported properly by the linux kernel?

Whether the hardware will be supported by linux or not, is a very important criterion for me whenever I buy hardware. Making that decision shouldn't be as hard as it is now.
Right now, there is no such centralized list on the first page of google search hardware supported by linux kernel.

If this is a famous enough list, it should spark a new interest in the HW producer's side to get their hardware "properly" supported by the linux kernel.

(A noteworthy effort exists from linux.com, but I still think it should be easier, centralized, and have more involvement from kernel developers.)

